Abstract
During the COVID-19 pandemic, forecasting COVID-19 trends to support planning and response was a priority for scientists and decision makers alike. In the United States, COVID-19 forecasting was coordinated by a large group of universities, companies, and government entities led by the Centers for Disease Control and Prevention and the US COVID-19 Forecast Hub (https://covid19forecasthub.org). We evaluated approximately 9.7 million forecasts of weekly state-level COVID-19 cases for predictions 1-4 weeks into the future submitted by 24 teams from August 2020 to December 2021. We assessed coverage of central prediction intervals and weighted interval scores (WIS), adjusting for missing forecasts relative to a baseline forecast, and used a Gaussian generalized estimating equation (GEE) model to evaluate differences in skill across epidemic phases that were defined by the effective reproduction number. Overall, we found high variation in skill across individual models, with ensemble-based forecasts outperforming other approaches. Forecast skill relative to the baseline was generally higher for larger jurisdictions (e.g., states compared to counties). Over time, forecasts generally performed worst in periods of rapid changes in reported cases (either in increasing or decreasing epidemic phases) with 95% prediction interval coverage dropping below 50% during the growth phases of the winter 2020, Delta, and Omicron waves. Ideally, case forecasts could serve as a leading indicator of changes in transmission dynamics. However, while most COVID-19 case forecasts outperformed a naïve baseline model, even the most accurate case forecasts were unreliable in key phases. Further research could improve forecasts of leading indicators, like COVID-19 cases, by leveraging additional real-time data, addressing performance across phases, improving the characterization of forecast confidence, and ensuring that forecasts were coherent across spatial scales. 
In the meantime, it is critical for forecast users to appreciate current limitations and use a broad set of indicators to inform pandemic-related decision making.
Author Summary
As SARS-CoV-2 began to spread throughout the world in early 2020, modelers played a critical role in predicting how the epidemic could take shape. Short-term forecasts of epidemic outcomes (for example, infections, cases, hospitalizations, or deaths) provided useful information to support pandemic planning, resource allocation, and intervention. Yet, infectious disease forecasting is still a nascent science, and the reliability of different types of forecasts is unclear. We retrospectively evaluated COVID-19 case forecasts, which were often unreliable. For example, forecasts did not anticipate the speed of increase in cases in early winter 2020. This analysis provides insights on specific problems that could be addressed in future research to improve forecasts and their use. Identifying the strengths and weaknesses of forecasts is critical to improving forecasting for current and future public health responses.
Introduction
Predicting the trajectory of an epidemic to support control and mitigation planning is the primary objective of infectious disease forecasting. To this end, large-scale, collaborative forecasting efforts across multiple disease systems, such as influenza (1–3), dengue (4), West Nile (5), and Ebola viruses (6), have been integrated into routine public health workflows and emergency response (7). Researchers in academia, private institutions, and the United States (US) government built upon these frameworks to incorporate forecasts into the COVID-19 information systems used to inform pandemic response and created the US COVID-19 Forecast Hub. In April 2020, the US Centers for Disease Control and Prevention (CDC) and the COVID-19 Forecast Hub began collecting COVID-19 death forecasts (8). Compared to death reports, case reports are a leading indicator of SARS-CoV-2 infections, as the time from infection to case report is typically shorter than that between infection and death report. Hence, information gleaned from case forecasts is potentially more actionable.
Case forecasts for all US counties (n=3,143), states (n=50), territories (n=5), the District of Columbia (DC), and the nation as a whole were generated and collected beginning in July 2020, with ensemble forecasts of cases first posted on a CDC webpage on August 6, 2020 (8,9). Because of their potential utility, case forecasts were also integrated into US government web pages and situational awareness updates (10). In addition, county-level case forecasts were used to inform vaccine trial site selection (11), and COVID-19 case forecasts have been cited as useful for guiding personal risk-based decisions (12). Because these forecasts influence policies and personal decisions, the accuracy and precision of the forecasts are of the utmost importance. Incorrect forecasts can lead to inappropriate policy implementation and resource allocation, and also to erosion of trust in public health institutions (13).
As part of routine use of the case forecasts in the COVID-19 response, real-time evaluation was conducted. One of the performance metrics included in the evaluation was the 95% prediction interval (PI) coverage, an estimate of the frequency at which the interval captures the eventually observed data. The 95% PI of a reliable forecast should capture eventually reported cases 95% of the time. However, the real-time evaluation indicated that case forecasts were not always reliable, with much lower 95% PI coverage than expected (14). For example, in November 2020 as the 2020-2021 winter wave began, the 95% PI coverage for all states and territories was less than 50% for even the shortest, 1-week ahead forecasts from the ensemble, which was generally the most reliable forecast. Repeated periods of low coverage during subsequent surges led CDC to stop posting COVID-19 case forecasts in December 2021. Though these forecasts showed poor performance, there are opportunities to develop more precise and reliable future predictions.
Evaluation of forecast performance provides an opportunity not only to assess prediction skill for the purposes of improving forecasts, but also to assess the reliability of the forecasts and foster transparency between forecast users and creators. While evaluation is recommended in forecasting research guidelines (i.e., EPIFORGE 2020 (15)), a systematic review of COVID-19 models showed that half of published models did not include probabilistic predictions and that approximately one-fourth of published models did not include performance evaluations (16). We have previously evaluated forecast performance of cumulative (17) and incident (18) COVID-19 deaths submitted to the COVID-19 Forecast Hub. Given that an ensemble of submitted models provided consistently accurate probabilistic forecasts at different scales in both evaluations, here we apply similar methods to assess the prediction skill of the COVID-19 case forecasters, with particular interest in the COVIDhub ensemble model (that is, a model that combines predictions from forecasts submitted to the Forecast Hub). Specifically, we analyze prediction interval coverage and other aspects of nearly 10 million individual forecasts collected by the COVID-19 Forecast Hub for US jurisdictions between July 2020 and December 2021, the full period over which COVID-19 case forecasts were published by the CDC. We analyze relative forecast performance across spatial scales and phases of the pandemic to identify limitations and opportunities for future improvement of case forecasts.
Results
Summary of Included Team Forecasts
A total of 14,960,171 forecasts were submitted by 67 teams throughout the analysis period (see Supporting Information [S] 1 for submission patterns over time). Because forecasts were submitted at multiple geographic scales, we stratified analyses for 1) national forecasts; 2) state (all 50 states), territory (US Virgin Islands and Puerto Rico), and DC forecasts; and 3) county-level forecasts (all 3,143 counties and county equivalents), split into five equal-sized groups based on county population size.
We first evaluated forecasts for inclusion criteria based on numbers of locations, horizons, and time periods forecast with the same model. Briefly, teams were included if they submitted the full range of required quantiles, covered at least 50 states/territories/DC or 75% of counties, and produced forecasts at least four weeks into the future for at least 50% of the time points in the study period. At the national level, 22 sets of team forecasts met these criteria (5,136 forecasts across dates and forecast horizons), 23 sets of team forecasts met the state/territory level criteria (280,132 forecasts across jurisdictions, dates, and forecast horizons), and 15 sets of team forecasts met the county-level criteria (9,415,460 forecasts across counties, dates, and forecast horizons). Overall, 64.8% of all submitted forecasts were included in the analysis (9,700,728 forecasts). Of the included forecasts, 11 sets of team forecasts met the inclusion criteria for analyzing submissions across all geospatial scales (8,125,220 forecasts across locations, dates, and forecast horizons).
Each team included in the analysis submitted forecasts that were generated from unique model structures, data inputs, and assumptions (S1). Two naïve models (the COVIDhub-baseline and CEID-Walk) and four ensemble models (the COVIDhub-4_week_ensemble, the COVIDhub-trained_ensemble, LNQ-ens1, and UVA-Ensemble), which combined multiple forecasts into one, were included in the 26 models evaluated (see S1 Table 1.1). The COVIDhub-baseline model projects the number of reported cases in the most recent week as the median predicted value for the next 4 weeks. CEID-Walk is a random walk model with a simple method for removing outliers. A total of seven models included data on COVID-19 hospitalizations, 12 models incorporated demographic data, and seven models used mobility data. Of the 26 evaluated models, three assumed that social distancing and other behavioral patterns changed during the prediction period.
The evaluation period consisted of 1-4 week ahead forecasts submitted in the 73 weeks from July 28, 2020 through December 21, 2021. Multiple phases of the US epidemic were included: the late summer 2020 increase in several locations, a large late-fall/early-winter surge in 2020/2021, the rise and fall of the Delta variant in the summer and fall of 2021, and the early phase of the Omicron variant’s dominance in winter 2021 (Figure 1A). Performance of the national ensemble forecasts varied over this period (Figure 1B). For some forecasts, the median predictions were close to the cases eventually reported, and most reported numbers fell within the 95% PIs. However, forecasts made at other times, such as January 2021 or December 2021, diverged widely from the reported data. At those times, the forecasts missed substantial decreases and increases, respectively, with reported cases falling within the 95% prediction interval for only 1-week ahead forecasts.
Aggregate performance
We evaluated aggregate forecast performance with two metrics: Weighted Interval Score (WIS), a proper score considering both precision and accuracy, and prediction interval coverage, an indicator of forecast uncertainty. Lower WIS values reflect forecasts with probability mass closer to observed values. We assessed scaled pairwise WIS relative to the baseline model (referred to throughout as relative WIS, or rWIS) for national and state/territory/DC forecasts (Figure 2). A rWIS less than one indicates performance that is better than the baseline model.
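To make the WIS concrete, the following is a minimal sketch of the standard definition of the score for a forecast reported as a median plus symmetric central prediction intervals. It is an illustration only, not the evaluation code used in this analysis; the function names and the example forecast are hypothetical, and the seven quantile levels are those required for case forecasts (see Methods).

```python
def interval_score(lower, upper, observed, alpha):
    """Interval score for a central (1 - alpha) prediction interval."""
    width = upper - lower
    below = (2.0 / alpha) * max(lower - observed, 0.0)  # penalty if observed < lower
    above = (2.0 / alpha) * max(observed - upper, 0.0)  # penalty if observed > upper
    return width + below + above


def weighted_interval_score(quantiles, observed):
    """WIS for a forecast given as {quantile level: predicted value}.

    Assumes the dict holds the median (0.5) and symmetric pairs of levels,
    e.g. 0.025 with 0.975 (an assumption of this sketch).
    """
    lower_levels = sorted(q for q in quantiles if q < 0.5)
    upper_levels = sorted((q for q in quantiles if q > 0.5), reverse=True)
    total = 0.5 * abs(observed - quantiles[0.5])
    # Pair the outermost lower level with the outermost upper level, and so on.
    for lo, hi in zip(lower_levels, upper_levels):
        alpha = 2.0 * lo
        total += (alpha / 2.0) * interval_score(quantiles[lo], quantiles[hi], observed, alpha)
    return total / (len(lower_levels) + 0.5)


# Hypothetical forecast of weekly cases for one location and horizon.
example = {0.025: 80, 0.10: 90, 0.25: 95, 0.50: 100, 0.75: 105, 0.90: 110, 0.975: 120}
wis_on_target = weighted_interval_score(example, observed=100)
wis_missed = weighted_interval_score(example, observed=130)
```

In this sketch the score grows both when intervals are wide and when the observation falls outside them, which is why a lower WIS reflects probability mass closer to the observed value.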
Overall, seven of 22 teams’ forecast models outperformed the COVIDhub-baseline model at the state/territory/DC level (i.e., had rWIS values less than 1.0), and 11 outperformed the baseline model at the national level. Six of these teams outperformed the baseline model at both scales: LNQ-ens1, COVIDhub-4_week_ensemble, USC-SI_kJalpha, LANL-GrowthRate, Microsoft-DeepSTIA, and CU-select.
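One common way to compute a pairwise relative WIS that tolerates missing forecasts is sketched below. It is an assumption-laden illustration and not necessarily the exact procedure used here: each pair of models is compared only on the forecast targets both submitted, the pairwise ratios are combined with a geometric mean, and the result is scaled by the baseline's value. All names and scores are hypothetical, and scores are assumed to be strictly positive.

```python
from math import exp, log


def mean_wis_ratio(scores_a, scores_b):
    """Ratio of mean WIS for two models over the targets both forecast."""
    shared = scores_a.keys() & scores_b.keys()
    if not shared:
        return None  # no overlap, so the pair is not comparable
    mean_a = sum(scores_a[k] for k in shared) / len(shared)
    mean_b = sum(scores_b[k] for k in shared) / len(shared)
    return mean_a / mean_b


def relative_wis(scores, baseline):
    """scores: {model: {target key: WIS}}; returns {model: rWIS vs baseline}."""
    theta = {}
    for model in scores:
        ratios = []
        for other in scores:
            ratio = mean_wis_ratio(scores[model], scores[other])
            if ratio is not None:
                ratios.append(ratio)
        # Geometric mean of the pairwise ratios (includes the self-ratio of 1).
        theta[model] = exp(sum(log(r) for r in ratios) / len(ratios))
    return {model: theta[model] / theta[baseline] for model in scores}


# Hypothetical scores; "model_y" did not submit a forecast for target "t1".
scores = {
    "baseline": {"t1": 10.0, "t2": 20.0},
    "model_x": {"t1": 5.0, "t2": 10.0},
    "model_y": {"t2": 40.0},
}
rwis = relative_wis(scores, baseline="baseline")
```

With these made-up numbers, a value below 1.0 for a model indicates better performance than the baseline on the shared targets, matching the interpretation of rWIS used in the text.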
PI coverage at the 95% level should be close to 95% for well calibrated forecasts. However, it was lower for most sets of team forecasts, with only one (LNQ-ens1) having coverage of at least 90% at all scales, while others were as low as 23%. PI coverage at the 50% and 80% levels was also well below nominal levels for most sets of team forecasts, including the COVIDhub-4_week_ensemble (Figure 3). For the 50% prediction interval, no sets of team forecasts had coverage better than 36% at any scale. Only two sets of team forecasts had better coverage than the COVIDhub-4_week_ensemble for the geographic scales in which they submitted forecasts: LNQ-ens1 (all scales) and JHU_UNC_GAS-StatMechPool (state/territory/DC and large county levels).
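Empirical PI coverage is simply the share of forecasts whose interval contained the eventually reported value. A minimal sketch, with a hypothetical function name and made-up numbers:

```python
def empirical_coverage(intervals, observed):
    """Fraction of observations that fall inside their prediction interval.

    intervals: list of (lower, upper) bounds for one nominal level (e.g., 95%)
    observed:  list of eventually reported values, in the same order
    """
    if len(intervals) != len(observed):
        raise ValueError("intervals and observed must have the same length")
    hits = sum(lower <= y <= upper for (lower, upper), y in zip(intervals, observed))
    return hits / len(observed)


# Hypothetical 95% PIs for four location-weeks and the reported case counts.
pis = [(90, 110), (50, 60), (0, 10), (20, 40)]
reported = [100, 70, 5, 41]
coverage = empirical_coverage(pis, reported)  # two of four intervals cover
```

For a well calibrated forecast, this fraction computed over many 95% PIs should be close to 0.95; values well below the nominal level indicate overconfidence.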
Forecast skill also showed distinct patterns across jurisdictional scales, with rWIS decreasing for larger jurisdiction scales (e.g., national vs. state/territory) or population sizes (e.g., larger counties vs. smaller counties, Figure 4) for most sets of team forecasts. In contrast to this general trend, the pattern was inverted for three sets of team forecasts, one team had no distinct pattern, and the COVIDhub-4_week_ensemble had markedly consistent rWIS across all scales. Consistent with the aggregate findings, both LNQ-ens1 and COVIDhub-4_week_ensemble had rWIS lower than 1.0 at all scales, while LANL-GrowthRate had rWIS greater than 1.0 for smaller counties.
Performance across jurisdictions
There was additional variability in forecast skill between jurisdictions. Only two team forecasts (LNQ-ens1 and COVIDhub-4_week_ensemble) performed as well as or better than the baseline for all included states and territories (Figure 5). Variation was higher between team forecasts than between specific jurisdictions, but the baseline model tended to outperform more models in some jurisdictions (e.g., the baseline was better in Colorado, Kansas, Puerto Rico) than in others (e.g., the baseline was worse in Mississippi, South Carolina, West Virginia).
Performance over time
While rWIS varied between team forecasts and jurisdictions, it varied even more over time (Figure 6). For example, all models had relatively high WIS in December 2020-January 2021 and low WIS in June 2021. Prediction interval coverage also varied between teams and over time, with most team forecasts exhibiting times of low coverage. Across most time points, the baseline model outperformed many team forecasts, including the COVIDhub-4_week_ensemble, though the ensemble more often outperformed the baseline in both metrics at the national, state/territory, and large county scales. Increased WIS and decreased prediction interval coverage generally occurred with increasing case counts, such as in the fall of 2020 and summer of 2021. The worst performance was in the early Omicron wave in the winter of 2021. For the last set of ensemble forecasts posted by CDC in December 2021 (https://www.cdc.gov/coronavirus/2019-ncov/science/forecasting/forecasts-cases.html), the WIS reached the highest level ever for all scales and the reported case numbers were outside the 95% prediction interval for most locations at every forecast horizon.
To further investigate these temporal patterns in performance, we first classified each forecast week as increasing, peak, decreasing, or nadir based on the estimated time-varying reproduction number for that given week and jurisdiction. We then fitted Gaussian generalized estimating equations (GEE) models for each set of team forecasts, using a normalized, log transformed WIS value per forecast time and location as the model outcome. The regression models were adjusted for each prediction horizon and included a natural spline with two degrees of freedom for the time/state reported case counts to adjust for intrinsic increases in WIS due to higher values in reported cases (see S6). In agreement with the aggregated results (Figure 2), we found that the expected WIS at the mean number of case counts across all jurisdictions was lower than the baseline for the better performing models (6 team forecasts and the ensemble) and higher than the baseline for others (8 team forecasts).
Forecast skill also varied across epidemic phases (Figure 7B). Compared to the baseline model across all phases, overall skill for most models was better in nadir and peak phases and worse in increasing and decreasing phases. LNQ-ens1 and the COVIDhub ensemble outperformed the baseline model in all epidemic phases between August 1, 2020 and January 15, 2022, while several other team models outperformed the baseline in some phases.
To examine whether our results were affected by reporting anomalies, we also conducted sensitivity analyses for data revisions, when data were revised at a later date, and for outlier data points, when reported cases were outside of weekly expected ranges (see S2). We first identified weeks in which revised case counts as of April 2, 2022 differed from the case counts initially reported for that week, excluded them from the dataset, and reran the GEE models. With this partial dataset, the results were essentially unchanged. Next, we identified outliers as reported case counts outside of the expected range by at least two of the three following algorithms: a rolling median, a seasonal trend decomposition, and a seasonal trend decomposition without a seasonality term, each applied over a 21-day window. Approximately 3% of weeks (686 of 27,489 total week-location combinations in the analysis period) had at least one day of reported cases identified as an outlier. We then excluded the weeks with outliers and the week following an outlier and reran the GEE models. This sensitivity analysis had comparable results to the models with the full data (see S2 Figure 2.3, Panel A).
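The two-of-three voting rule for outliers can be illustrated with the sketch below. Only a rolling-median detector is implemented; the threshold (a multiple of the median absolute deviation) and all names are assumptions of this example and are not taken from the analysis (see S2 for the actual procedure). The two decomposition-based detectors would supply the other flag lists.

```python
from statistics import median


def rolling_median_flags(daily_cases, window=21, threshold=3.0):
    """Flag days that deviate strongly from the median of a centered window.

    The threshold and the MAD-based scale are assumptions of this sketch.
    """
    half = window // 2
    flags = []
    for i, value in enumerate(daily_cases):
        neighborhood = daily_cases[max(0, i - half): i + half + 1]
        center = median(neighborhood)
        mad = median(abs(v - center) for v in neighborhood)
        scale = max(mad, 1.0)  # avoid a zero scale in flat stretches
        flags.append(abs(value - center) > threshold * scale)
    return flags


def majority_vote(*flag_lists, min_votes=2):
    """A day is an outlier if at least min_votes detectors flag it."""
    return [sum(votes) >= min_votes for votes in zip(*flag_lists)]


# Hypothetical daily series with one reporting spike on day 10.
series = [100] * 10 + [1000] + [100] * 10
flags_a = rolling_median_flags(series)
flags_b = list(flags_a)            # stand-in for a second detector
flags_c = [False] * len(series)    # stand-in for a third detector
outliers = majority_vote(flags_a, flags_b, flags_c)
```

Requiring agreement between detectors reduces the chance that a single method's quirks remove legitimate rapid changes from the evaluation.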
Discussion
We evaluated performance of 9.7 million COVID-19 case forecasts at multiple geospatial scales in the US over approximately a year and a half. Real-time analyses and those presented here revealed important limitations in these forecasts. Forecast prediction intervals were largely over-confident, that is, prediction interval coverage was lower than the nominal value, particularly when case numbers were changing rapidly and forecasts could have been most useful. Few team forecasts outperformed a relatively simple and minimally informative baseline model. Forecast skill degraded for smaller geographic scales where forecasts could potentially be most useful. Forecast skill was also lowest when case counts were changing the most, in phases of increasing or decreasing transmission. These limitations of case forecasts indicate key areas for improvement and important reasons to use case forecasts with caution.
Several technical challenges for forecasts were evident in these analyses. First, cases are a relatively early indicator of transmission, with no clear leading signal in traditional public health surveillance data (e.g., unlike for death forecasts, where case counts themselves can provide information for predicting future deaths). While non-traditional data sources may provide a useful predecessor to changing population case counts, the evidence from previous work is unclear. For example, internet searches, medical claims, and online surveys have been used to modestly improve case forecast accuracy relative to models without those data (19). Estimating case counts using both wastewater and clinical surveillance data has shown mixed results (20–23). Additional integration of temporal dynamics could also be helpful. The case forecasts analyzed here were developed and evaluated based on the date when cases were reported, not when individuals were infected, became ill, sought care, or were tested. Additional detail on those dates could enable models to better capture the current dynamics using nowcasting approaches giving earlier signals of change.
Second, and likely related to the challenge of cases being an early indicator, the models had substantial variation in skill between epidemic phases. In general, forecast skill was worst for the increasing phase followed by the decreasing phase. In many of these periods of low performance (e.g., the 2020-2021 winter, Delta, and Omicron waves), the COVIDhub ensemble predicted possible or probable increases or decreases, but not at the rate that actually occurred. This effect may be even stronger than our results show as they rely on a comparison to the baseline which, by definition, does not predict change. While epidemic phase is unknown in real time, it too can be estimated, and these results and others suggest that accounting for epidemic phase when making predictions could improve the forecast skill of ensemble models (24,25). Additional data, as discussed above, or model components associated with distinct phases could also help improve predictive capabilities. Seasonal changes in transmission biology and human behavior, emergence of variants, and changing mitigation behavior all contribute to transmission dynamics. While some forecasting models incorporate seasonality and variants, integration of human behavior to characterize the link between behavior and transmission has lagged (13,26–28). Ensemble approaches offer another opportunity to mitigate phase-specific differences. Team modeling skill across phases was highly heterogeneous, but two ensemble approaches were better than the baseline in all phases.
Another challenge across most forecasts was overconfidence, a pattern seen with other infectious disease forecasts (4,18). The baseline model predicted a flat trend, yet it outperformed many sets of team forecasts in the increasing and decreasing phases only because its predictions had high uncertainty around that flat trend. The COVIDhub ensemble performance, on the other hand, benefited from combining uncertainty across multiple models, yet, like the constituent models, also exhibited overconfidence. The temporal and phase-specific analyses suggest that model overconfidence is most pronounced during rapid increases and decreases. Previous infectious disease forecasting work has shown that ensembles tend to have wider prediction intervals that are more likely to capture the eventually reported outcome and thus reduce overconfidence compared to their constituent models (4,18). Wider prediction intervals, reflecting increased uncertainty, can mitigate some impacts of overconfidence. However, forecasts would be most useful if they were both reliable and informative, that is, if they could accurately capture the uncertainty while also providing more precise estimates, rather than merely increased uncertainty (29,30).
Finally, while forecasts would be most actionable at local scales, performance was generally worse for smaller than larger jurisdictions. Other infectious disease forecasting systems have found better forecast skill at smaller geographic scales, likely because local transmission dynamics (e.g., a county) are a better predictor of local than aggregate transmission (e.g., a state) (31). We compared WIS across scales by comparison to the baseline model to adjust for missing forecasts and for WIS scaling relative to the magnitude of observed outcomes. After those adjustments, population size had a clear association with forecast skill, likely reflecting the relative role of stochastic dynamics. For better local forecasts, models may need to explicitly account for stochasticity. Forecasts could also be improved by better leveraging spatial information, such as dynamics in neighboring counties or nearest urban centers. Local forecasts remain a key public health need, as local forecasts are more likely to reflect local conditions and motivate local mitigation action.
Overall, these findings, as well as the real-time evaluations, indicated that COVID-19 case forecasts were not reliable as a single indicator for pandemic response of a novel pathogen. Similar to other forecasting studies, we found that the ensemble was among the most reliable forecasts (3,4,18,32), outperformed only by LNQ-ens1 across the metrics evaluated here. Thus, while the overall best forecasts had poor performance at key times, other forecasts were often even worse at these same time points. Weighted (or trained) ensembles offer another potential avenue for improvement (33–35), but the version implemented here did not outperform the simple, median ensemble, likely reflecting limited historical data (36) and variation in team forecast submissions (37,38).
While COVID-19 deaths are a more lagging indicator of infections than case reports, and so may be less useful as an input to public health decision making, forecasts of deaths have generally been more reliable (18). Similarly, COVID-19 hospitalization forecasts in France have also shown high forecast skill (39). Better performing US death and French hospitalization forecasts share one factor in common: models generally used local case reports as an input to inform their forecasts. While public health decision making should not rely on case forecasts alone, they may still be helpful in the context of other important indicators, such as case, hospitalization, and death reports. Nowcasts of reports and real-time estimates of the effective reproductive number can also provide insight on current dynamics (40–43). Together, a suite of indicators is more informative for outbreak response than a leading indicator alone.
The analysis presented here includes important findings about real-time applied forecasting in an emerging pandemic to inform pandemic response rather than to address specific research aims of improving predictions. Several factors limit the strength of our findings and ability to understand underlying mechanisms of predictive performance. Notably, we compared the forecasts to a changing record of reported cases. Throughout the COVID-19 outbreak, cases have been reported with jurisdiction- and time-varying delays and have been revised over time, resulting in varying forecast targets. In addition, the definition of a reported COVID-19 case also changed over time and varied between states. These changes were a result of many factors, including laboratory capacity and implementation of home-based testing, and may have affected forecast skill in other ways. Our sensitivity analyses found no qualitative differences in our main findings when we excluded forecasts for time points with revised data or when we excluded outlier data points. Nevertheless, forecasting teams were greatly impacted by the evolving landscape of COVID-19 case surveillance. More timely and consistent reports likely would improve both the process of making forecasts and forecast skill.
The overall goal of the COVID-19 Forecast Hub was to provide forecasts in near real-time for decision making. While the collaborative efforts of the Hub achieved this goal despite a changing epidemic landscape, the open nature of COVID-19 forecasting also limits understanding of the drivers of forecast performance. Many teams participated at different times, some intermittently, and provided varied and limited descriptions of their forecast methods. While we were able to adjust our evaluation for differences in submissions, we are unable to assess the underlying impact of modeling approaches on performance since we do not have the granular details on forecast methods and how they evolved over time for all team forecasts. For example, LNQ-ens1, which outperformed all other forecasts by most metrics, only submitted forecasts for approximately two thirds of the analysis period and stopped in June 2021 (prior to the Delta wave). The model is described as a combination of three machine learning models, leveraging other embedded models and datasets, with weights that “are chosen by hand each week based on performance in the previous week” (see LNQ-ens1 metadata, https://github.com/reichlab/covid19-forecast-hub/blob/b12f916abc859bf59ea584b64f53afc2982042fd/data-processed/LNQ-ens1/metadata-LNQ-ens1.txt, at (44)). The ensemble approach used in the LNQ-ens1 model building likely contributed to the overall performance. However, several other ensemble models had lower performance than the LNQ-ens1 model; we are unable to assess whether LNQ-ens1 performance gains were due to a particular component model or dataset, the hand weighting procedure, or something else. The brief descriptions submitted to the COVID-19 Forecast Hub, such as for the LNQ-ens1, must include a summary of the methods used and may indicate a variety of unique features such as input data, parameters, model fitting, etc. (44). However, the level of detail provided in these descriptions varies between teams, and we do not have enough information to determine which aspects of individual models were important determinants of forecast performance. To elucidate associations between modeling approaches and forecast skill, additional research is needed. Future work to support improved forecasting will require assessing the impact of specific features (e.g., through ablation analyses) using retrospective, stable data systems and retrospective evaluation of the full forecasting process (e.g., from data wrangling to final forecast production).
Infectious disease forecasting continues to present many challenges and opportunities for improving outbreak response. Forecasts should be leading indicators of future activity and, while the COVID-19 case ensemble forecasts were good leading indicators at many points in time, they were unreliable, especially during periods of rapid change. Case data were integrated in COVID-19 mortality forecasts, which proved to be more reliable, likely in part due to reported cases being leading indicators of reported deaths (18,45). However, because deaths are a lagging indicator, death forecasts are less useful for short-term outbreak responses. Evaluation of the case forecasts provided insight on limitations of early forecasts and research avenues for improving them. These insights and the real-time forecasts provided by this effort were the product of large-scale collaboration between researchers and public health responders to confront the COVID-19 pandemic. Learning from and improving forecasting for COVID-19, other infectious diseases, and future pandemics will benefit from continuing and expanding these collaborative efforts.
Methods
The US COVID-19 Forecast Hub (46) is a consortium of researchers that develop and share forecasts of COVID-19 reported cases, hospitalizations, and deaths with the goal of leveraging information from individual models that predict the near-term burden of COVID-19 in the United States. Teams that submitted models to the US COVID-19 Forecast Hub used a wide variety of methodology and data (S1, Table S1). Beyond serving as a repository for forecasts, submitted data were also aggregated by scientists at the COVID-19 Forecast Hub to generate two models that we included in this analysis: the COVIDhub-4_week_ensemble and the COVIDhub-trained_ensemble. Since the beginning of the COVID-19 Forecast Hub, the quantile predictions from each week’s submitted models were used as input data for the COVIDhub-4_week_ensemble. Ensemble aggregation methods evolved over time; for this analysis period, the ensemble forecast was calculated as the median across forecasts from all models at each quantile level. Additionally, beginning on February 1, 2021, the COVID-19 Forecast Hub also generated a weighted ensemble (COVIDhub-trained_ensemble). Models were selected for weighted ensemble inclusion based on their past performance over various window periods and given a weight prior to aggregation. The methodology evolved over time and details are available in the model’s metadata file on the COVID-19 Forecast Hub GitHub repository (see Data and code availability and reporting guidelines).
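The quantile-wise median aggregation described above can be sketched as follows for a single location, forecast date, and horizon. This is an illustrative sketch rather than the Hub's code; the model names and values are hypothetical, and model eligibility checks are omitted.

```python
from statistics import median


def median_ensemble(model_forecasts):
    """Combine forecasts by taking the median across models at each quantile level.

    model_forecasts: {model name: {quantile level: predicted value}}
    Only quantile levels reported by every model are combined.
    """
    level_sets = [set(forecast) for forecast in model_forecasts.values()]
    shared_levels = set.intersection(*level_sets)
    return {
        level: median(forecast[level] for forecast in model_forecasts.values())
        for level in sorted(shared_levels)
    }


# Hypothetical component forecasts (three quantile levels shown for brevity).
components = {
    "model_a": {0.025: 70, 0.5: 100, 0.975: 140},
    "model_b": {0.025: 90, 0.5: 120, 0.975: 130},
    "model_c": {0.025: 60, 0.5: 95, 0.975: 200},
}
ensemble = median_ensemble(components)
```

Because the median is taken separately at each level, the resulting quantiles remain ordered whenever each component's quantiles are ordered, and a single extreme component has limited influence on the ensemble.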
The COVID-19 Forecast Hub and the death forecasts submitted to the Hub have been described in detail elsewhere (8,17,18). The Hub’s incident COVID-19 case forecasts, which were first solicited in July 2020, have similar submission requirements to the death forecasts. Important differences include an expanded geographical scale (national; state, territory, and DC; and county levels) and a reduced number of required quantiles in the probability distribution (7 quantiles in total: 0.025, 0.10, 0.25, 0.50, 0.75, 0.90, and 0.975). Predictions for weekly incident COVID-19 cases can be submitted for up to 8 weeks in the future, although our analysis only includes predictions made for 1-4 weeks into the future.
We evaluated submitted forecasts between July 28, 2020, and December 21, 2021 (2020 epi week [EW] 31 – 2021 EW 51), which encompasses 73 weeks. Because forecasts were submitted at multiple geographic scales, we conducted separate analyses for 1) national forecasts, 2) state, territory, and DC forecasts, 3) county-level forecasts, and 4) sets of team forecasts for all three geographic scales. When appropriate, we compared forecast performance to that of the COVIDhub-baseline, a naïve model created by the COVID-19 Forecast Hub. The COVIDhub-baseline model, created each week, was designed to be a neutral model that provides a simple reference point of comparison for all models. This baseline model forecasts a predictive median incidence equal to the number of reported cases in the most recent week, with uncertainty based on the empirical distribution of previous differences between the median and observed values (18).
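A naïve baseline of this kind can be sketched as follows. This is an illustration under stated assumptions rather than the COVIDhub-baseline implementation: the function names are hypothetical, past week-to-week differences are mirrored around zero so that the predictive median equals the last observation, quantiles are linearly interpolated, and forecasts are truncated at zero.

```python
def empirical_quantile(sorted_values, q):
    """Linearly interpolated empirical quantile of an already sorted list."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def naive_baseline(weekly_cases, levels):
    """One-week-ahead quantile forecast centred on the most recent week."""
    if len(weekly_cases) < 2:
        raise ValueError("need at least two weeks of data")
    last = weekly_cases[-1]
    diffs = [b - a for a, b in zip(weekly_cases, weekly_cases[1:])]
    # Assumption: mirror the differences so their distribution is symmetric.
    symmetric = sorted(diffs + [-d for d in diffs])
    # Assumption: case forecasts cannot be negative.
    return {q: max(0.0, last + empirical_quantile(symmetric, q)) for q in levels}


# Hypothetical weekly incident case counts for one location.
baseline = naive_baseline([100, 120, 110, 150], levels=(0.025, 0.5, 0.975))
```

In this toy series the median forecast is the last observed count (150), and the interval width is driven entirely by how much the series moved from week to week in the past.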
Inclusion criteria
Teams were included in the evaluation when they submitted forecasts with a complete set of quantiles for each of the 1- through 4-week-ahead target predictions. Additionally, teams must have met the following inclusion criteria:
had predictions for at least 50 locations (states, territories, or DC) for the state, territory, and DC level analyses; and for at least 75% of counties included in each population size quantile per submission week for the county-level analyses;
had submissions for at least 50% of the weeks included in the analysis period per location forecasted.
Teams meeting these inclusion criteria, and their submissions over time, are depicted in S1, Figure S1.
Ground Truth
Forecasts were evaluated against the COVID-19 case reports collated by the Johns Hopkins Center for Systems Science and Engineering (CSSE) (47). To calculate weekly incident reported cases, we subtracted the cumulative count for each Saturday from the cumulative count for the next Saturday, such that each incident weekly count reflects the number of cases reported from Sunday through Saturday in a given week. We aggregated reported counts from smaller geographic units into their larger unit. For example, counts in a given state are the aggregate of the county-level reported counts, and national counts are the sum of all states, territories, and DC.
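The Saturday-to-Saturday differencing can be sketched as below. The function name and the counts are hypothetical; the sketch assumes a mapping from calendar date to cumulative count and skips any week whose preceding Saturday is missing.

```python
from datetime import date, timedelta


def weekly_incident_cases(cumulative_by_date):
    """Return {week-ending Saturday: cases reported Sunday through Saturday}."""
    # date.weekday() returns 5 for Saturday.
    saturdays = sorted(d for d in cumulative_by_date if d.weekday() == 5)
    weekly = {}
    for previous, current in zip(saturdays, saturdays[1:]):
        if current - previous == timedelta(days=7):
            weekly[current] = cumulative_by_date[current] - cumulative_by_date[previous]
    return weekly


# Hypothetical cumulative counts on three consecutive Saturdays.
cumulative = {
    date(2021, 1, 2): 1000,
    date(2021, 1, 9): 1700,
    date(2021, 1, 16): 2100,
}
weekly = weekly_incident_cases(cumulative)
```

A downward revision of the cumulative series between two Saturdays would produce a negative weekly count with this calculation.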
CSSE reports data in real time. Thus, data may be revised if the reporting health system makes public updates to their surveillance data. At times, such revisions may result in negative daily counts or in increases to case counts if the date of cases is shifted from one day to another or the definition of a reportable case is changed. We examined the percent change between the first reported cases in each state, DC, and territory per date relative to the counts in the surveillance file from April 2, 2022. We also assessed the influence of revised data on the final model outcomes (see S2) and the presence of negative case counts in the time series. Less than 1 percent of time points in the analysis period had negative daily case counts in the largest US counties. Negative counts were observed at the state/territory level only twice: in Missouri during the week of April 17, 2021, and in the US Virgin Islands during the week ending October 10, 2020. The state of Florida reported 0 cases on November 27, 2021. We excluded all weeks and locations with negative counts, as well as the week with 0 incidence in Florida, from our primary analyses.
We also examined whether a reported case count was an outlier in the case trend for each state. Anomalies in case data trends have not been uncommon throughout the pandemic, as reporting entities have uploaded large batches of surveillance data on a single day. To assess whether cases were outside of the expected range of reported cases over time, we applied three outlier detection algorithms, each with a 21-day window: a rolling median, a seasonal trend decomposition, and a seasonal trend decomposition without a seasonality term. We then classified a given count as an outlier if it was detected as such by at least two of the three algorithms. Using these data, we ran several sensitivity analyses to assess the likely impact of anomalous data points on model performance. Sensitivity analyses examining the robustness of our findings to reporting anomalies are presented in S2.
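The two-of-three rule can be sketched as below. Only a rolling-median detector is implemented; the flagging threshold (a multiple of the median absolute deviation), the centred window, and the function names are assumptions for illustration, and the two seasonal trend decomposition detectors are represented only by placeholder flag lists.

```python
from statistics import median


def rolling_median_flags(values, window=21, threshold=5.0):
    """Flag points far from the median of a centred rolling window."""
    half = window // 2
    flags = []
    for i, value in enumerate(values):
        neighbours = values[max(0, i - half): i + half + 1]
        centre = median(neighbours)
        mad = median(abs(x - centre) for x in neighbours)
        scale = mad if mad > 0 else 1.0  # avoid dividing by zero on flat series
        flags.append(abs(value - centre) / scale > threshold)
    return flags


def majority_outliers(*detector_flags, min_votes=2):
    """Mark a point as an outlier when at least min_votes detectors flag it."""
    return [sum(votes) >= min_votes for votes in zip(*detector_flags)]


# Hypothetical daily counts with a single batch upload on day 10.
daily = [100] * 10 + [5000] + [100] * 10
flags_a = rolling_median_flags(daily)
flags_b = [False] * len(daily)        # placeholder for a second detector
flags_c = [i == 10 for i in range(len(daily))]  # placeholder for a third detector
outliers = majority_outliers(flags_a, flags_b, flags_c)
```

Requiring agreement between detectors trades some sensitivity for fewer false alarms from any single method.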
Additional information about the CSSE data, and revisions to the dataset, is publicly available on a GitHub repository: https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data.
Forecast locations
Forecasts for incident cases were submitted for the national level, 50 states, 5 territories (American Samoa, Guam, the Northern Mariana Islands, Puerto Rico, and the US Virgin Islands), DC, and 3,142 US counties. We excluded two counties in Alaska because they were not forecasted by most sets of team forecasts (Federal Information Processing Standard codes 02063 and 02066 were excluded). Because fewer teams submitted forecasts for American Samoa, Guam, and the Northern Mariana Islands, we excluded these territories from the analysis. Some teams treated DC as both a county and a jurisdiction, so we excluded DC from the county forecasts. In addition, because county population size and transmission are correlated and case counts and forecast performance are also correlated, we grouped counties into 5 quantiles based on their population sizes, with cut points at 8,908; 18,662; 36,742; and 93,230 people; most analyses used forecasts from the quantile with the largest population size (n=628). We hypothesized that small counties would be more likely to have better forecast accuracy because they had zero or very few reported cases. We thus chose to stratify counties by size to minimize any bias from aggregation. Performance results for most county forecasts are presented in S3.
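Assigning a county to a population quantile from the stated cut points can be sketched as below. The function name is hypothetical, and the treatment of a county whose population falls exactly on a cut point (placed in the lower group here) is an assumption, since the text does not specify it.

```python
from bisect import bisect_left

# Cut points between the five population quantiles, as stated in the text.
CUT_POINTS = (8908, 18662, 36742, 93230)


def population_quantile(population):
    """Return the quantile group, from 1 (smallest) to 5 (largest)."""
    # Assumption: a population equal to a cut point belongs to the lower group.
    return bisect_left(CUT_POINTS, population) + 1


groups = [population_quantile(p) for p in (5000, 8908, 8909, 50000, 1_000_000)]
```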
Defining epidemic phases
For every state and DC, we independently classified each forecast week based on the estimated time-varying reproduction number (Rt) for that given week. State-level Rt estimates were obtained from https://github.com/epiforecasts/epiforecasts.github.io (48). We extracted the Rt estimate for the Wednesday of each week from all available files. Because Rt estimates were updated on a rolling basis in near real time, there were multiple estimates generated for the same date; for each date, we calculated the median of the estimated median Rt and of the upper and lower bounds of the 90% credible interval (August 1, 2020 – January 15, 2022, or 2020 EW 31 – 2022 EW 2, reflecting 77 weeks in total). Each forecast week was then classified into one of the following categories based on the Rt estimates: increasing, peak, decreasing, nadir.
Increasing and decreasing phases reflect weeks in which Rt had a 90% probability of being greater than or less than 1.0, respectively. There were several periods of rapid transmission in certain jurisdictions where Rt briefly crossed above or below the 1.0 threshold but did not remain on an upward or downward trajectory. Thus, we classified weeks between two increasing phases as increasing and weeks between two decreasing phases as decreasing. Weeks between increasing and decreasing phases were classified as peaks, whereas nadirs were defined as periods between decreasing and increasing phases. Periods at the beginning or the end of an analysis period were classified as a continuation of whichever phase preceded or followed them. Graphical depictions of Rt are provided in S4 and show general concordance between Rt and reported cases.
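The classification rules can be sketched as below. The function name and the Rt values are hypothetical, and the sketch assumes that a week counts as increasing when the lower bound of the 90% credible interval exceeds 1.0 and as decreasing when the upper bound is below 1.0, which is one possible reading of the probability criterion.

```python
def classify_phases(lower, upper):
    """Classify weeks from the lower and upper bounds of the Rt credible interval."""
    phases = []
    for lo, hi in zip(lower, upper):
        if lo > 1.0:
            phases.append("increasing")
        elif hi < 1.0:
            phases.append("decreasing")
        else:
            phases.append(None)  # interval spans 1.0; resolved from neighbours below

    n = len(phases)
    i = 0
    while i < n:
        if phases[i] is not None:
            i += 1
            continue
        j = i
        while j < n and phases[j] is None:
            j += 1
        before = phases[i - 1] if i > 0 else None
        after = phases[j] if j < n else None
        if before is None and after is None:
            raise ValueError("no week is clearly increasing or decreasing")
        if before is None:
            fill = after            # start of series: continue the phase that follows
        elif after is None:
            fill = before           # end of series: continue the phase that precedes
        elif before == after:
            fill = before           # gap inside a single phase
        elif before == "increasing":
            fill = "peak"           # increasing, then decreasing
        else:
            fill = "nadir"          # decreasing, then increasing
        phases[i:j] = [fill] * (j - i)
        i = j
    return phases


# Hypothetical weekly credible-interval bounds for one state.
lower = [0.8, 1.1, 1.2, 0.9, 0.7, 0.6, 0.9, 1.05]
upper = [1.2, 1.4, 1.5, 1.1, 0.95, 0.9, 1.1, 1.3]
phases = classify_phases(lower, upper)
```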
Evaluation methodology
We evaluated probabilistic forecast accuracy using two different metrics: empirical prediction interval coverage rates and weighted interval scores (WIS) (49). Coverage was calculated by determining the frequency with which the prediction interval contained the eventually observed outcome for the 50%, 80%, and 95% intervals. WIS reflects a weighted estimate of sharpness (i.e., the range of the predicted interval) and calibration (i.e., precision or error) across the three prediction intervals and the median prediction, with higher WIS indicating lower forecast skill. Importantly, WIS is highly correlated with the magnitude of observed and forecasted values. We used mean absolute WIS to assess forecast accuracy over time and mean relative WIS (rWIS) to assess forecast accuracy over space. Relative WIS was estimated by calculating the geometric mean of WIS across all sets of team forecasts and scaling that value to the WIS of a naïve model, the COVIDhub-baseline. This approach eases interpretation: values greater than 1.0 reflect worse accuracy than the baseline model and values below 1.0 reflect better model performance. Additionally, the pairwise relative comparison helps account for missing forecasts. Both coverage and WIS have been described in detail elsewhere (18,49). Horizon-specific results for national, state/territory/DC, and large counties are presented in S5.
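For the seven required quantiles, WIS and empirical coverage can be computed as in the following sketch, which follows the standard interval-score formulation of WIS (49). The function names and the forecast values are hypothetical; the analysis itself used the scoringutils R package.

```python
# (alpha, lower quantile level, upper quantile level) for the 95%, 80%, and 50% intervals.
INTERVALS = ((0.05, 0.025, 0.975), (0.20, 0.10, 0.90), (0.50, 0.25, 0.75))


def interval_score(lower, upper, observed, alpha):
    """Interval width plus a penalty when the observation falls outside the interval."""
    score = upper - lower
    if observed < lower:
        score += (2.0 / alpha) * (lower - observed)
    elif observed > upper:
        score += (2.0 / alpha) * (observed - upper)
    return score


def weighted_interval_score(quantiles, observed):
    """WIS from a {quantile_level: value} forecast with a median and three intervals."""
    total = 0.5 * abs(observed - quantiles[0.50])
    for alpha, lo_level, hi_level in INTERVALS:
        total += (alpha / 2.0) * interval_score(
            quantiles[lo_level], quantiles[hi_level], observed, alpha
        )
    return total / (len(INTERVALS) + 0.5)


def interval_coverage(forecasts, observations, lo_level, hi_level):
    """Share of forecasts whose interval contains the observed value."""
    hits = [f[lo_level] <= y <= f[hi_level] for f, y in zip(forecasts, observations)]
    return sum(hits) / len(hits)


# Hypothetical forecast of weekly cases for one location and horizon.
forecast = {0.025: 50, 0.10: 70, 0.25: 85, 0.50: 100, 0.75: 115, 0.90: 130, 0.975: 150}
wis_inside = weighted_interval_score(forecast, observed=100)
wis_outside = weighted_interval_score(forecast, observed=160)
coverage_95 = interval_coverage([forecast, forecast], [100, 160], 0.025, 0.975)
```

An observation outside all three intervals is penalized in every term, so the score rises quickly when forecasts miss a rapid change in reported cases.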
To assess the association between WIS and epidemic phase for each team, we fitted separate Gaussian generalized estimating equation (GEE) models per team (equation 1) with an independent working correlation structure at the state level. This structure assumes that observations are not correlated over time in a state (denoted as l in the equations below). Cases and weighted interval scores were log transformed and then standardized (subtracting the mean and dividing by the standard deviation) prior to fitting the model, as this transformation yielded more computationally and numerically stable estimates. We define those resulting variables as stdWIS and stdCases. The expected value for a standardized WIS for time (t) and location (l), with forecasts from a given team’s model, is as follows: Where p[t,l] is an index that reflects the phase of each time (t) and location (l), (h) is the horizon of the forecast in weeks, and ns(·) represents a natural spline with two degrees of freedom. Using a regression model allows us to summarize patterns of overall average performance between teams while accounting for high correlation and variation in the scores. Comparisons of rWIS, in contrast, do not allow for formal inference on the differences in performance between teams. Prior to applying this regression model structure, our model-building approach included exploratory analysis of several structures appropriate for longitudinal analysis. We examined model residuals, influential observations, goodness-of-fit metrics, and the impact of changing the functional form of the variables included in the model.
The inclusion of reported cases in models permitted flexible adjustment for the wide range in cases between and within jurisdictions, which led to a wide range of possible WIS values, as WIS values tend to be higher when counts are higher. Expected WIS values were computed by first obtaining a marginal mean from the GEE model and then undoing the transformations by exponentiating and un-standardizing the marginal mean. This was done separately for each team for all phases and for each team and each phase individually (see S6 for estimated team-specific marginal mean WIS relative to reported case counts). Additionally, we calculated whether the 80% confidence interval (based on Gaussian distributional assumptions) for each team’s expected WIS outcome (on the log-scale and normalized, as described above) was less than the baseline model for all phases (i.e., the marginal mean WIS for the baseline model).
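The transformation and its inverse can be sketched as below. The function names and the WIS values are hypothetical, and the use of the sample standard deviation is an assumption; the sketch covers only the log transformation, standardization, and back-transformation, not the GEE fit itself.

```python
import math
from statistics import mean, stdev


def log_standardize(values):
    """Log transform, then subtract the mean and divide by the standard deviation."""
    logs = [math.log(v) for v in values]  # requires strictly positive values
    mu, sigma = mean(logs), stdev(logs)
    return [(x - mu) / sigma for x in logs], mu, sigma


def back_transform(std_value, mu, sigma):
    """Undo the standardization and the log transform."""
    return math.exp(std_value * sigma + mu)


# Hypothetical WIS values for one team.
wis_values = [10.0, 100.0, 1000.0]
std_wis, mu, sigma = log_standardize(wis_values)
recovered = [back_transform(s, mu, sigma) for s in std_wis]
# A marginal mean of 0 on the standardized scale maps back to the geometric mean.
centre = back_transform(0.0, mu, sigma)
```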
Data and code availability and reporting guidelines
The forecasts from models used in this paper are available from the COVID-19 Forecast Hub GitHub repository (https://github.com/reichlab/covid19-forecast-hub) (8) and the Zoltar forecast archive (https://zoltardata.com/project/44) (50). The code used to generate all figures and tables in the manuscript is available in a public repository (https://github.com/cdcepi/Evaluation-of-case-forecasts-submitted-to-COVID19-Forecast-Hub). All analyses were conducted using the R language for statistical computing (v 4.0.3) (51), and the following packages were used for the main analyses: scoringutils (52), covidhubUtils (53), and geepack (54). Additionally, we included the EPIFORGE 2020 reporting guideline checklist in S7 to indicate the page in this evaluation that corresponds to each specific recommendation (15).
This activity was reviewed by CDC and was conducted consistent with applicable federal law and CDC policy.
CDC disclaimer
The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention.
Supporting information captions
Supporting Information 1: Team submissions, methods, and data
S1 Figure 1.1. Forecasts submitted over time at the national and state-territory-DC level in panel A and at the county scale in panel B. The number of forecasted locations submitted each week nationally or at the state, territory, and DC level is included, while the county-level forecast submissions show the percent of counties per quantile that were submitted each week. Sets of team forecasts meeting the inclusion criteria for this main analysis are labeled with an asterisk (*).
S1 Table 1.1. List of models evaluated, including sources for case, hospitalization, death, demographic, and mobility data when used as inputs for the given model. We evaluated 26 models contributed by 24 teams. The COVIDhub team submitted three models including the baseline model and the ensemble model. A brief description is included for each model, with a reference where available. The last column indicates whether the model made assumptions about how and whether social distancing measures were assumed to change during the period for which forecasts were made.
Supporting Information 2: Revision and outlier sensitivity analyses
S2 Figure 2.1. To assess the influence of data revisions on our evaluation of forecast skill, we compared daily differences in cumulative reported cases during the week they were first reported to reported case counts for the same week in the complete data as of April 2, 2022. In total, 721 weeks had at least one day with a revised case count (17% of all weeks, n=4,241 weeks), and revisions occurred in 43 of 51 jurisdictions. These jurisdiction-specific plots compare cases reported as of the date in the subtitle (in red) to cases reported as of April 2, 2022 (in black).
S2 Figure 2.2. After identifying weeks with revised case counts, we excluded them from the dataset, reran the GEE models, and estimated the marginal mean Weighted Interval Score (WIS). Panel A shows the estimated marginal mean WIS and 95% confidence intervals for mean cases from team-specific GEE models for all 48 jurisdictions from this sensitivity analysis. The 95% confidence intervals for the COVIDhub-baseline model are shown in dashed red vertical lines. Panel B presents each team’s estimated marginal mean WIS per phase, scaled to the COVIDhub-baseline model’s estimated marginal mean WIS for all epidemic phases, using the dataset with revised weeks excluded. Teams with higher estimated marginal mean WIS values (i.e., greater than 1.0) are presented in shades of orange, while teams with lower estimated marginal mean WIS (i.e., less than 1.0) are shown in shades of green. Team forecasts are denoted with an asterisk (*) if the 80% confidence interval of the expected WIS outcome (normalized and on the log scale) was estimated by a model to be lower than the expected WIS of the COVIDhub-baseline model for all phases.
S2 Figure 2.3. Outliers were defined as non-revised reported case counts that were outside of the expected range by at least two of the three algorithms: a rolling median, a seasonal trend decomposition, and a seasonal trend decomposition without a seasonality term. Each method used a 21-day window. Approximately three percent of weeks (686 of 27,489 total weeks in the analysis period) had at least one day of reported cases identified as an outlier.
Supporting Information 3: Incident COVID-19 case forecasts were submitted for all US counties. The plots shown here depict average, scaled pairwise Weighted Interval Score (WIS; see Methods for description), 95% coverage, and submissions (S3 Figure 3.1), average 50%, 80% and 95% coverage for eligible submitted forecasts (S3 Figure 3.2), and average WIS and 95% coverage over time (S3 Figure 3.3). Each figure shows spatially disaggregated results, with increasing population size and quantile numbers. For example, counties with the smallest population are grouped in Quantile 1 and the largest population sizes are grouped in Quantile 5. The following teams are included in these figures: CEID-Walk, LNQ-ens1, Microsoft_DeepSTIA, COVIDhub-4_week_ensemble, COVIDhub-trained_ensemble, COVIDhub-baseline, CU-select, FAIR-NRAR, FRBSF_Wilson-Econometric, IowaStateLW-STEM, JHU_IDD-CovidSP, JHU_CSSE-DECOM, JHUAPL-Bucky, LANL-GrowthRate, UVA-Ensemble.
S3 Figure 3.1. Percent of weeks with complete submissions for all sets of team forecasts, scaled, pairwise relative Weighted Interval Score (rWIS), 95% coverage, and by geographical scale of submitted forecasts. Teams are sorted by increasing rWIS values.
S3 Figure 3.2. Expected and observed coverage rates aggregated over time and horizon for county forecasts. The dashed line represents optimal expected coverage. Team forecasts that outperformed the COVIDhub-4_week_ensemble model at all coverage levels are labeled on the right-hand side of the plots.
S3 Figure 3.3. Mean Weighted Interval Score (WIS) over time, aggregated by geographic units and forecast horizon in A and 95% coverage over time, aggregated by geographic units and forecast horizon in B. The black, dashed vertical line in all panels shows the date that public communication of the case forecasts was paused. The black, dashed horizontal line in panel B shows nominal 95% interval coverage.
Supporting Information 4: Estimated time-varying reproduction number and epidemic phase classifications. For each state, the top panel shows the median Rt and median upper and lower 90% credible interval over time in red. The bottom panel shows reported case counts over time. Both plots have vertical bands representing the epidemic phase of each forecast week: increasing, peak, decreasing, nadir.
Supporting Information 5: Each location-specific forecast submitted to the COVID-19 Forecast Hub included at least 4 weeks of future predictions. Here, we present disaggregated 1- and 4-week-ahead predictions of model performance for each team model that submitted national and state/territory/DC forecasts and was included in the main analyses. Specific plots include the average 50%, 80% and 95% coverage for eligible submitted forecasts (S5 Figure 5.1), average absolute Weighted Interval Score (WIS) and 95% coverage over time (S5 Figure 5.2), and scaled, pairwise rWIS by location (S5 Figure 5.3).
S5 Figure 5.1. Expected and observed coverage rates aggregated for 1- and 4-week-ahead forecasts over time for national forecasts in A, state/territory/DC forecasts in B, and the largest county forecasts in C. The dashed line represents optimal expected coverage. Teams that outperformed the COVIDhub-4_week_ensemble model at all coverage levels are labeled on the right-hand side of the plots.
S5 Figure 5.2. Mean Weighted Interval Score (WIS) over time for 1- and 4-week-ahead forecasts, aggregated by geographic units, and 95% coverage over time for 1- and 4-week-ahead forecasts, aggregated by geographic units. The black, dashed vertical line in all panels shows the date that public communication of the case forecasts was paused. The black, dashed horizontal line in panels D, E, and F shows nominal 95% interval coverage. Teams that submitted national forecasts are presented in A and D, state/territory/DC forecasts are presented in B and E, and teams that submitted large county-level forecasts are presented in C and F.
S5 Figure 5.3. Scaled, pairwise relative Weighted Interval Score (rWIS; see Methods for description) for all teams that submitted national and state/territory/DC forecasts by location for 1 and 4 week ahead horizon. National estimates are displayed first, followed by jurisdictions in alphabetical order. Teams are displayed by decreasing average rWIS across all forecast horizons and locations.
Supporting Information 6: Each team model’s estimated marginal mean Weighted Interval Score (WIS) over the range of reported case counts per epidemic phase. Marginal mean WIS was estimated from GEE model results and reflects values across the 95% confidence interval of mean reported cases. Case counts differ per team model as each team forecasted a different set of locations over a different range of possible dates.
Supporting Information 7: EPIFORGE 2020 guidelines outline 19 recommended reporting items for epidemic forecasting and prediction research (15). These items are included in the checklist below, which also includes the page number where each item is described or presented within this evaluation.