National and subnational short-term forecasting of COVID-19 in Germany and Poland during early 2021
===================================================================================================

* Johannes Bracher
* Daniel Wolffram
* Jannik Deuschel
* Konstantin Görgen
* Jakob L. Ketterer
* Alexander Ullrich
* Sam Abbott
* Maria V. Barbarossa
* Dimitris Bertsimas
* Sangeeta Bhatia
* Marcin Bodych
* Nikos I. Bosse
* Jan Pablo Burgard
* Lauren Castro
* Geoffrey Fairchild
* Jochen Fiedler
* Jan Fuhrmann
* Sebastian Funk
* Anna Gambin
* Krzysztof Gogolewski
* Stefan Heyder
* Thomas Hotz
* Yuri Kheifetz
* Holger Kirsten
* Tyll Krueger
* Elena Krymova
* Neele Leithäuser
* Michael L. Li
* Jan H. Meinke
* Błażej Miasojedow
* Isaac J. Michaud
* Jan Mohring
* Pierre Nouvellet
* Jedrzej M. Nowosielski
* Tomasz Ozanski
* Maciej Radwan
* Franciszek Rakowski
* Markus Scholz
* Saksham Soni
* Ajitesh Srivastava
* Tilmann Gneiting
* Melanie Schienle

## Abstract

**Background** During the COVID-19 pandemic there has been a strong interest in forecasts of the short-term development of epidemiological indicators to inform decision makers. In this study we evaluate probabilistic real-time predictions of confirmed cases and deaths from COVID-19 in Germany and Poland for the period from January through April 2021.

**Methods** We evaluate probabilistic real-time predictions of confirmed cases and deaths from COVID-19 in Germany and Poland. These were issued by 15 different forecasting models, run by independent research teams. Moreover, we study the performance of combined ensemble forecasts. Evaluation of probabilistic forecasts is based on proper scoring rules, along with interval coverage proportions to assess forecast calibration. The presented work is part of a pre-registered evaluation study and covers the period from January through April 2021.
**Results** We find that many, though not all, models outperform a simple baseline model up to four weeks ahead for the considered targets. Ensemble methods (i.e., combinations of different available forecasts) show very good relative performance. The addressed time period is characterized by rather stable non-pharmaceutical interventions in both countries, making short-term predictions more straightforward than in previous periods. However, major trend changes in reported cases, like the rebound in cases due to the rise of the B.1.1.7 (alpha) variant in March 2021, prove challenging to predict.

**Conclusions** Multi-model approaches can help to improve the performance of epidemiological forecasts. However, while death numbers can be predicted with some success based on current case and hospitalization data, predictability of case numbers remains low beyond quite short time horizons. Additional data sources including sequencing and mobility data, which were not extensively used in the present study, may help to improve performance.

**Plain language summary** The goal of this study is to assess the quality of forecasts of weekly case and death numbers of COVID-19 in Germany and Poland during the period of January through April 2021. We focus on real-time forecasts at time horizons of one and two weeks ahead created by fourteen independent teams. Forecasts are systematically evaluated taking uncertainty ranges of predictions into account. We find that combining different forecasts into ensembles can improve the quality of predictions, but especially case numbers proved very challenging to predict beyond quite short time windows. Additional data sources, in particular genetic sequencing data, may help to improve forecasts in the future.

## Introduction

Short-term forecasts of infectious diseases and longer-term scenario projections provide complementary perspectives to inform public health decision making.
Both have received considerable attention during the COVID-19 pandemic and are increasingly embraced by public health agencies. This is illustrated by the US COVID-19 Forecast1, 2 and Scenario Modelling Hubs,3 supported by the US Centers for Disease Control and Prevention, as well as the more recent European COVID-19 Forecast Hub,4 supported by the European Center for Disease Prevention and Control (ECDC). The Forecast Hub concept, building on pre-pandemic collaborative disease forecasting projects like FluSight,5 the DARPA Chikungunya Challenge6 or the Dengue Forecasting Project7 aims to provide a broad picture of existing short-term projections in real time, making the agreement or disagreement between different models visible. Also, it forms the basis for a systematic evaluation of performance. This is a prerequisite for model consolidation and improvement, and a need repeatedly expressed.8 It has been highlighted that such modelling studies should be prospective9 and ideally follow pre-registered protocols10 in order to prevent selective reporting and hindsight bias (i.e., the tendency to overstate the predictability of past events in hindsight). We here report on the second part of a prospective disease forecasting study, pre-registered on 8 October 202011 and including forecasts made between 11 January 2021 and 29 March 2021 (with last observed values running through April; twelve weeks of forecasting). It is based on the German and Polish COVID-19 Forecast Hub ([https://kitmetricslab.github.io/forecasthub/](https://kitmetricslab.github.io/forecasthub/)), which gathers and stores forecasts in real time. This platform was launched in close exchange with the US COVID-19 Forecast Hub in June 2020. In April 2021 it was largely merged into the European COVID-19 Forecast Hub, shortly after the latter had been initiated by ECDC. 
During our study period, fifteen independent modelling teams provided forecasts of cases and deaths by appearance in publicly available national-level data, provided either by national health authorities (Robert Koch Institute, RKI12 or the Polish Ministry of Health, MZ;13 the primary data source) or the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE;14 and15). As specified in our study protocol, we report results on forecasts up to a horizon of four weeks, but focus on forecasts one and two weeks ahead. While we acknowledge the relevance of longer horizons for planning purposes, we argue that factors like changing non-pharmaceutical interventions and emergence of new variants limit meaningful forecasts (as opposed to scenarios) to rather short time horizons, especially for cases. Also, we focus almost exclusively on incident quantities, as their cumulative counterparts have almost completely vanished from any public discussion. The time series of cases and deaths in both countries are displayed in panels (a) and (b) of Figure 1. The study period covered in this paper is marked in dark grey, while the light grey area represents the time span addressed in the first part of our study.16 Our study period contains the transition from the original wild type variant of the virus to the B.1.1.7 variant (later called Alpha). Panel (c) of Figure 1 shows the estimated weekly percentages of all cases which were due to the B.1.1.7 variant in Germany17 and Poland18, 19 in calendar weeks 4–12. Panel (d) shows the proportion of all performed PCR tests which turned out positive. While in Germany the curve follows a U-shape similar to the case incidence curve, the test positivity rate continuously increased in Poland, peaking at 33%. 
Panel (e) shows the Oxford Coronavirus Government Response Tracker (OxCGRT) Stringency Index.20 It can be seen that compared to the first part of our study, the level of non-pharmaceutical interventions was rather stable at a high level during the second period. We note, however, that on 27 March a new set of restrictions was added in Poland (closure of daycare centers, hair salons and sports facilities, among others), which is not reflected very strongly in the stringency index. The start of vaccination rollout in both countries coincides with the start of our study period. However, by its end only roughly one sixth of the population of both countries had received a first dose, and roughly one twentieth had received two doses (with the role of the one-dose Johnson and Johnson vaccine negligible in both countries); see panel (f).

![Figure 1:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2022/08/26/2021.11.05.21265810/F1.medium.gif)

Figure 1: Overview of relevant epidemiological time series. Reported cases (a) and deaths (b) in Germany (black) and Poland (red) according to Robert Koch Institute, the Polish Ministry of Health (MZ; solid lines) and Johns Hopkins CSSE (dashed). Additional panels show (c) the share of cases due to the B.1.1.7 (Alpha) variant, (d) the proportion of all performed PCR tests which turned out positive, (e) the overall level of non-pharmaceutical interventions as measured by the Oxford Coronavirus Government Response Tracker (OxCGRT) Stringency Index, and (f) the population shares having received at least one vaccination dose (dotted) and complete vaccination (solid). The dark grey area indicates the period addressed in the present manuscript, the light grey area the one from Bracher et al.16

We find that averaged over the second evaluation period, most though not all of the compared models were able to outperform a naïve baseline model.
Heterogeneity between forecasts from different models was considerable. Ensemble forecasts combining different available predictions achieved very good performance relative to single-model forecasts. However, most models, including the ensemble, did not anticipate changes in trend well, in particular for cases. Pooling results over both evaluation periods we find that ensemble forecasts for deaths were well-calibrated (i.e., prediction intervals contained the true value roughly as often as intended) even at longer prediction horizons and clearly outperformed baseline and individual models, while for cases this was only the case for one- and to a lesser degree two-week-ahead forecasts.

## Methods

The methods described in the following are largely identical to those in the first part16 of our study, but are presented to ensure self-containedness of the present work.

### Targets and submission system

Teams submitted forecasts for weekly incident and cumulative confirmed cases and deaths from COVID-19 via a dedicated public GitHub repository ([https://github.com/KITmetricslab/covid19-forecast-hub-de](https://github.com/KITmetricslab/covid19-forecast-hub-de)). For certain teams running public dashboards, software scripts were put in place to transfer forecasts to the Forecast Hub repository. Weeks were defined to run from Sunday through Saturday. Each week, teams were asked to submit forecasts using data available up to Monday, with submission possible until Tuesday 3pm Berlin/Warsaw time (the first two daily observations were thus already available at the time of forecasting). Forecasts could either refer to the time series provided by JHU CSSE or those from Robert Koch Institute and the Polish Ministry of Health.
All data streams were aggregated by time of appearance in national data, see also Supplementary Note 4 of Bracher et al.16 Submissions consisted of a point forecast and 23 predictive quantiles (1%, 2.5%, 5%, 10%, …, 95%, 97.5%, 99%) for the incident and cumulative weekly quantities. As in previous work16 we focus on the targets on the incidence scale. These are easier to compare across the different data sources than cumulative numbers which sometimes show systematic shifts.

### Evaluation metrics

As forecasts were reported in the form of 11 nested central prediction intervals (plus the predictive median), a natural choice for evaluation is the interval score.21 For a central prediction interval [*l*, *u*] at the level (1 − *α*), thus reaching from the *α*/2 to the 1 − *α*/2 quantile, it is defined as

$$
\mathrm{IS}_\alpha(F, y) = (u - l) + \frac{2}{\alpha} \, (l - y) \, \chi(y < l) + \frac{2}{\alpha} \, (y - u) \, \chi(y > u),
$$

where *χ* is the indicator function and *y* is the realized value. Here, the first term characterizes the spread of the forecast distribution, the second penalizes overprediction (observations fall below the prediction interval) and the third term penalizes underprediction. To assess the full predictive distribution we use the weighted interval score (WIS;22). The WIS is a weighted average of interval scores at different nominal levels and the absolute error. For *N* nested prediction intervals it is defined as

$$
\mathrm{WIS}(F, y) = \frac{1}{N + 1/2} \left( \frac{1}{2} \, |y - m| + \sum_{k=1}^{N} \frac{\alpha_k}{2} \, \mathrm{IS}_{\alpha_k}(F, y) \right),
$$

where *m* is the predictive median and in our setting *N* = 11. The WIS is a well-known approximation of the continuous ranked probability score (CRPS;21) and generalizes the absolute error to probabilistic forecasts. Its values can be interpreted on the natural scale of the data and measure how far the observed value *y* is from the predictive distribution (lower values are thus better). For deterministic one-point forecasts the WIS reduces to the absolute error.
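
As an illustration of the two scores described above, the following minimal Python sketch computes the interval score and the WIS together with their spread, overprediction and underprediction components. It is not the evaluation code used in the study. The function names and the dictionary layout (level *α* mapped to the interval bounds) are hypothetical, and attributing the absolute error of the median to the over- or underprediction component is an assumed convention.

```python
from typing import Dict, Tuple


def interval_score(lower: float, upper: float, y: float, alpha: float) -> Dict[str, float]:
    """Components of the interval score for the central (1 - alpha) interval."""
    return {
        "spread": upper - lower,
        # observation below the interval: the forecast was too high
        "overprediction": (2.0 / alpha) * max(lower - y, 0.0),
        # observation above the interval: the forecast was too low
        "underprediction": (2.0 / alpha) * max(y - upper, 0.0),
    }


def weighted_interval_score(
    median: float,
    intervals: Dict[float, Tuple[float, float]],
    y: float,
) -> Dict[str, float]:
    """WIS and its components.

    `intervals` maps each level alpha to the (lower, upper) bounds of the
    central (1 - alpha) prediction interval (hypothetical input layout).
    """
    norm = len(intervals) + 0.5
    parts = {"spread": 0.0, "overprediction": 0.0, "underprediction": 0.0}
    for alpha, (lower, upper) in intervals.items():
        score = interval_score(lower, upper, y, alpha)
        for name in parts:
            # each interval score enters with weight alpha / 2
            parts[name] += (alpha / 2.0) * score[name]
    # The absolute error of the median enters with weight 1/2. Assigning it
    # to over- or underprediction is an assumed convention.
    if y < median:
        parts["overprediction"] += 0.5 * (median - y)
    else:
        parts["underprediction"] += 0.5 * (y - median)
    result = {name: value / norm for name, value in parts.items()}
    result["wis"] = sum(result.values())
    return result


# Toy forecast with two intervals (50% and 95%) and an observation above both.
example = weighted_interval_score(
    median=100.0,
    intervals={0.5: (90.0, 110.0), 0.05: (70.0, 130.0)},
    y=140.0,
)
# A point forecast: all intervals collapse onto the median.
point = weighted_interval_score(
    median=100.0,
    intervals={0.5: (100.0, 100.0), 0.05: (100.0, 100.0)},
    y=120.0,
)
```

In the second call all intervals collapse to the point forecast, so the score equals the absolute error, which matches the reduction stated in the text.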
A useful property of the WIS is that it inherits the decomposition of the interval score into forecast spread, overprediction and underprediction, which makes average scores more interpretable. As secondary measures of forecast quality we use the absolute error to assess the central tendency of forecasts and interval coverage rates of 50% and 95% prediction intervals to assess calibration.

As specified in our study protocol, whenever forecasts from a model were missing for a given week, we imputed the score with the worst (largest) value achieved by any other model for the respective week and target. However, almost all teams provided complete sets of forecasts and very few scores needed imputation.

### Submitted models and baselines

During the evaluation period, forecasts from fifteen different models run by fourteen independent teams of researchers were collected. Thirteen of these were already available during the first part of our study, see Table 3 and Supplementary Note 3 of Bracher et al16 for detailed descriptions. Table 1 provides a slightly extended summary of model properties, including the two new models, itwm-dSEIR and Karlen-pypm; a more detailed description of the latter can be found in Supplement S1.

[Table 1:](http://medrxiv.org/content/early/2022/08/26/2021.11.05.21265810/T1) Forecast models contributed by independent external research teams.

During the evaluation period, only the ICM-agentModel explicitly accounted for vaccinations (given the low realized vaccination coverage by the end of the study period this aspect likely had limited impact). Only four models (ICM-agentModel, Karlen-pypm, LeipzigIMISE-SECIR and MOCOS-agent1, all only for certain weeks) explicitly accounted for the presence of multiple variants. In contrast to other related projects,2 none of the models used mobility data or social media data.
To put the results achieved by the submitted models into perspective, the Forecast Hub team generated forecasts from three simple reference models (see also Bracher et al,16 Supplementary Note 2):

* KIT-baseline: This is a simple last-observation-carried-forward model, i.e., it predicts the last observed value indefinitely into the future. Predictive quantiles are obtained by assuming a negative binomial observation model with a dispersion parameter estimated via maximum likelihood from five recent observations.
* KIT-extrapolation baseline: This model extrapolates exponential growth or decrease if the last three observations are monotonically increasing or decreasing, with a weekly growth rate equal to the one observed between the last and second to last week; if the last three observations are not ordered, it predicts a plateau. Predictive quantiles are again obtained using a negative binomial observation model and five recent observations.
* KIT-time series baseline: This is an exponential smoothing time series model with multiplicative errors as used by Petropoulos and Makridakis23 to predict COVID-19 cases and deaths. It is implemented using the R package forecast, version 8.12.24

As a further external comparison we added publicly available death forecasts by the Institute for Health Metrics and Evaluation (IHME, University of Washington;25 available under the CC BY-NC 4.0 license). Here, we always used the most recent prediction available on a given forecast date.

### Forecast ensembles

The Forecast Hub team used the submitted forecasts to generate three different ensemble forecasts:

* KITCOVIDhub-median ensemble: The *α*-quantile of the ensemble forecast is obtained as the median of the *α*-quantiles of the member forecasts.
* KITCOVIDhub-mean ensemble: The *α*-quantile of the ensemble forecast is obtained as the mean of the *α*-quantiles of the member forecasts.
* KITCOVIDhub-inverse wis ensemble: The *α*-quantile of the ensemble forecast is a convex combination of the *α*-quantiles of the member forecasts. The weights are chosen inversely proportional to the mean WIS value obtained by the member models over the last six evaluated forecasts (last three one-week-ahead, last two two-week-ahead, last three-week-ahead). This is done separately for each time series to be predicted. Missing scores are imputed by the worst score achieved by any model for the respective target, meaning that irregularly submitted models will be penalized and receive less weight.

In the study protocol, the median ensemble was defined as our primary ensemble approach11 as it can be assumed to be more robust to occasional misguided forecasts (e.g., due to technical errors). We therefore display this version in all figures and focus our discussion on it.

Note that all forecast aggregations are performed directly at the level of quantiles rather than density functions as in other work.26 This approach is referred to as Vincentization (in reference to Vincent,27 see e.g., Busetti28). A broader discussion of Vincentization approaches and their application to epidemiological forecasts, including numerous other weighting schemes, can be found in recent works by Taylor and Taylor29 and Ray et al.30 Notably, Taylor and Taylor29 used a similar inverse score weighting approach and found it to perform well in a re-analysis of forecasts from the US COVID-19 Forecast Hub. In this context we note that our inverse-WIS ensemble does not involve any estimation or optimization of weights, but simply uses the inverse of an average of past scores as heuristic weights. A more flexible approach with one tuning parameter estimated from the data has been used in Ray et al.30

There were no formal inclusion criteria other than completeness of the submitted set of 23 quantiles.
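
The three quantile-wise aggregation rules described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Forecast Hub implementation: the function names and the data layout (each forecast as a mapping from quantile level to predicted value, members keyed by a model name) are hypothetical, and the mean WIS values passed to the weighting function are assumed to be strictly positive and already imputed where necessary.

```python
from statistics import mean, median
from typing import Dict, List

# Hypothetical layout: quantile level -> predicted value
Forecast = Dict[float, float]


def median_ensemble(members: List[Forecast]) -> Forecast:
    """Quantile-wise median of the member forecasts."""
    levels = sorted(members[0])
    return {q: median([f[q] for f in members]) for q in levels}


def mean_ensemble(members: List[Forecast]) -> Forecast:
    """Quantile-wise mean of the member forecasts."""
    levels = sorted(members[0])
    return {q: mean([f[q] for f in members]) for q in levels}


def inverse_wis_weights(mean_wis: Dict[str, float]) -> Dict[str, float]:
    """Weights inversely proportional to past mean WIS, normalized to sum to one.

    Assumes every mean WIS value is strictly positive.
    """
    inverse = {model: 1.0 / score for model, score in mean_wis.items()}
    total = sum(inverse.values())
    return {model: value / total for model, value in inverse.items()}


def inverse_wis_ensemble(members: Dict[str, Forecast], mean_wis: Dict[str, float]) -> Forecast:
    """Quantile-wise convex combination of members using inverse-WIS weights."""
    weights = inverse_wis_weights(mean_wis)
    levels = sorted(next(iter(members.values())))
    return {
        q: sum(weights[model] * forecast[q] for model, forecast in members.items())
        for q in levels
    }


# Toy example with three hypothetical member models and two quantile levels.
members = {
    "model_a": {0.25: 8.0, 0.5: 10.0},
    "model_b": {0.25: 15.0, 0.5: 20.0},
    "model_c": {0.25: 40.0, 0.5: 60.0},
}
past_scores = {"model_a": 1.0, "model_b": 2.0, "model_c": 4.0}

ens_median = median_ensemble(list(members.values()))
ens_mean = mean_ensemble(list(members.values()))
ens_weighted = inverse_wis_ensemble(members, past_scores)
```

In the toy example the third member is far from the other two. The median ensemble ignores its magnitude entirely, while the mean is pulled towards it, which illustrates the robustness argument given for the median ensemble.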
The Forecast Hub team did, however, occasionally exclude forecasts with highly implausible central tendency or degree of dispersion manually. These exclusions have been documented in the Forecast Hub platform.

## Results

Figures 2 and 3 show the forecasts made by the median ensemble (KITCOVIDhub-median ensemble; our pre-specified main ensemble approach; see Methods); a naïve model always using the last observed value as the expectation for the following weeks (KIT-baseline); and five contributed models with above-average overall performance across locations and targets (i.e., quantities to be predicted). The forecasts are probabilistic, and we display the 50% and 95% prediction intervals (PIs) along with the respective median. Forecasts by the remaining teams are illustrated in Supplementary Figures S1 and S2, and forecasts at horizons of three and four weeks are shown in Supplementary Figures S3–S6. In the following, we discuss the performance of these forecasts, starting with a formal statistical evaluation before directing attention to the behaviour at inflection points.

![Figure 2:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2022/08/26/2021.11.05.21265810/F2.medium.gif)

Figure 2: One-week-ahead forecasts of confirmed cases and deaths from COVID-19 in Germany and Poland. The figure shows forecasts from a baseline model, the median ensemble of all submissions and a subset of submitted models with above-average performance. The black line shows observed data. Colored points represent predictive medians, dark and light bars show 50% and 95% prediction intervals, respectively. Asterisks mark intervals exceeding the upper plot limit. The remaining submitted models are displayed in Supplementary Figure S1.
The right column shows the empirical coverage rates of the different models. The dark and light bars represent the proportion of cases where the 50% and 95% prediction intervals, respectively, contained the observed values. Dotted lines show the desired nominal levels 0.5 and 0.95.

![Figure 3:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2022/08/26/2021.11.05.21265810/F3.medium.gif)

Figure 3: Two-week-ahead forecasts of confirmed cases and deaths from COVID-19 in Germany and Poland. The figure shows forecasts from a baseline model, the median ensemble of all submissions and a subset of submitted models. The remaining submitted models are displayed in Supplementary Figure S2. The black line shows observed data. Colored points represent predictive medians, dark and light bars show 50% and 95% prediction intervals, respectively. The right column shows the empirical coverage rates of the different models. See caption of Figure 2 for a detailed explanation of plot elements.

### Formal evaluation, January–April 2021

Table 2 and Figure 4 summarize the performance of the submitted, baseline and ensemble models over the twelve-week study period. Performance is measured via the average weighted interval score (WIS, see Methods section) and the mean absolute error of the predictive median. For both measures lower values indicate better predictive performance. We here show the average scores on the absolute scale, where they can be interpreted as the average distance between the observed and predicted value (the WIS taking into account forecast uncertainty). A summary table of relative scores standardized by the performance of the naïve KIT-baseline model is available in Supplementary Table S1.
The WIS can moreover be decomposed into components representing underprediction, forecast spread and overprediction (see Methods), which we show in Supplementary Figure S7. Detailed results in tabular form at horizons of three and four weeks ahead can be found in Supplementary Table S2. As specified in the study protocol, we also provide results for cumulative cases and deaths (Supplementary Tables S3 and S4) and based on JHU rather than RKI/MZ data (Supplementary Tables S5 and S6; evaluation against JHU data leads to slightly higher WIS and absolute errors, but quite similar relative performance of models). A graphical display of individual scores can be found in Supplementary Figure S8.

[Table 2:](http://medrxiv.org/content/early/2022/08/26/2021.11.05.21265810/T2) Forecast evaluation for Germany and Poland (incidence scale, based on RKI/MZ data).

[Table 3:](http://medrxiv.org/content/early/2022/08/26/2021.11.05.21265810/T3) Forecast evaluation at the regional level, Germany and Poland (incidence scale, RKI/MZ data).

![Figure 4:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2022/08/26/2021.11.05.21265810/F4.medium.gif)

Figure 4: Formal evaluation results in terms of mean weighted interval scores. Average weighted interval scores (bars) and absolute errors (diamonds) achieved by models across countries, targets and forecast horizons (12 weekly forecasts). The bottom end of the grey area represents the mean WIS of the baseline model KIT-baseline and the grey horizontal line its mean absolute error. Values are shown on a square-root scale to enhance readability. Only models covering all four horizons are shown.
Both for incident cases and deaths, a majority, but not all models outperformed the naïve baseline model KIT-baseline (a model outperforms the baseline for a given target whenever its bar in Figure 4 does not reach into the grey area). As one would expect, the performance of all models considerably deteriorated for longer forecast horizons. The pre-specified median ensemble was consistently among the best-performing methods, outperforming most individual forecasts for all targets. The KITCOVIDhub-inverse wis ensemble, which is an attempt to weigh member models based on recent performance, does not yield any clear benefits over the unweighted median and mean ensembles. As can be seen from Supplementary Figures S9 and S10, the weights fluctuate substantially, implying that the relative performance of different models may be too variable for performance-based weights to pay off. The KIT-extrapolation baseline model shows quite reasonable relative performance for cases in both countries. Given the relatively long stretches of continued upward or downward trends in cases, this simple heuristic was not easy to beat and is rather close to the performance of the ensemble forecasts. For deaths, too, there are rather clear trends over the study period. Nonetheless, the different ensemble forecasts achieve substantial improvements over KIT-extrapolation baseline, meaning that the deviations from the previous trends were predicted with some success. The most striking cases of individual models outperforming the ensemble occurred for longer-range case forecasts in Poland. Here, the two microsimulation models MOCOS-agent1 and ICM-agentModel performed considerably better. These two models were arguably among the ones which were most meticulously tuned to the specific national context. 
It seems that this yielded benefits for longer horizons, while at shorter horizons the ensemble and some considerably simpler models were at least on par (the best performance at the one week horizon being achieved by the compartmental model MIMUW-StochSEIR). There were considerable differences in the forecast uncertainty of the different models. This can be seen from the quite variable forecast interval widths in Figures 2 and 3, and resulted in large differences in the empirical coverage rates of 50% and 95% prediction intervals (Table 2 and right column in the aforementioned figures). The ensemble methods performed quite favourably in terms of coverage, typically with slight undercoverage (i.e., prediction intervals cover the observations less frequently than intended) for cases and slight overcoverage (intervals cover more often than intended) for deaths. The differences in forecast dispersion are also reflected by the components of the weighted interval score shown in Supplementary Figure S7 (see Materials and Methods for an explanation of the decomposition). Some models, most strikingly ITWW-county repro, issued very sharp predictions, leading to very small dispersion components of the weighted interval score (the darkest block in the middle of the stacked bar). In turn, this model received rather large penalties for both over- and underprediction. Other models, like LANL-GrowthRate, epiforecasts-EpiNow2 and ICM-agentModel issued comparatively wide forecasts, leading to WIS values with large dispersion components. While there is no clear rule on what the score decomposition of an ideal forecast should look like, comparisons of the components provide useful indications on how to improve a model (e.g., the ITWW-county repro model might benefit from widening the uncertainty intervals). A subset of models also provided forecasts at the subnational level (states in Germany, voivodeships in Poland). 
Table 3 provides a summary of the respective results at the one and two week horizons (results for three and four weeks can be found in Supplementary Table S7). Despite the rather low number of available models, the ensembles generally achieved improvements over the individual models and, with exceptions for case forecasts in Germany, clearly outperformed the baseline model KIT-baseline. The mean WIS values are lower for the regional forecasts than for the national-level forecasts in Table 2 primarily because the numbers to be predicted are lower at the regional level; the WIS – like the absolute error – scales with the order of magnitude of the predicted quantity and cannot be compared directly across different forecasting tasks. Coverage of the ensemble forecasts was close to the nominal level for deaths and somewhat lower for cases. Note that in this comparison part of the forecasts from the FIAS FZJ-epi1Ger model were created retrospectively (using only the data available up to the forecast date) as the team only started issuing forecasts for all German federal states on 22 February 2021. As specified in the study protocol,11 we also report evaluation results at the national level pooled across the two study periods for those models which covered both. These are summarized in Supplementary Tables S8 and S9. For deaths, ensemble forecasts clearly outperformed individual models, the four-week-ahead horizon in Poland being the only one at which an individual model (epiforecasts-EpiExpert) meaningfully outperformed the pre-specified median ensemble. While most contributed and baseline models were somewhat overconfident, the ensemble showed close to nominal coverage even at the four-week-ahead horizon. For cases, the median ensemble achieved good relative performance (comparable to the best individual models) one and two weeks ahead, but was outperformed by a number of other models at three and four weeks. 
Notably, it failed to beat the naïve last-observation-carried-forward model KIT-baseline. Its coverage of prediction intervals was acceptable one week ahead, but substantially below nominal at higher horizons (e.g., 13/19 and 10/19 four weeks ahead in Germany and Poland, respectively, at the 0.95 level), which reflects the severe difficulties in predicting cases in Fall 2020 as discussed in Bracher et al.16

### Behaviour at inflection points

From a public health perspective, there is often a specific interest in how well models anticipated major inflection points (changes in trend). We therefore discuss these instances separately. However, we note that, as will be detailed in the discussion, post-hoc conditioning of evaluation results on the occurrence of unusual events comes with important conceptual challenges.

#### Shift from wild type to B.1.1.7 variant

The renewed increase in cases in both Germany and Poland (third wave) in late February 2021 was due to the shift from the wild-type variant of the virus to the B.1.1.7 (or Alpha) variant, see Figure 1, panel (c) for estimated shares of the new variant over time. Given earlier observations about the spread of the B.1.1.7 variant in the UK31 and Denmark, there was public discussion about the likelihood of a resurgence, but there was considerable uncertainty about the timing and strength (see e.g., a German newspaper article32 from early February 2021). This was largely due to the limited availability of representative sequencing data. In certain regions of Germany, specifically the city of Cologne33 and the state of Baden-Württemberg,34 large-scale sequencing had been adopted by late January, but results were considered difficult to extrapolate to the whole of Germany.
An updated RKI report35 on virus variants from 10 February 2021 described a “continuous increase in the share of the VOC B.1.1.7”, but cautioned that the data were “subject to biases, e.g., with respect to the selection of samples to sequence” (our translation). Given the limited available data, and the fact that many approaches had not been designed to accommodate multiple variants, only two of the teams submitting forecasts for Germany opted to account for this aspect (a question which was repeatedly discussed during coordination calls). These exceptions were the Karlen-pypm and LeipzigIMISE-SECIR models, which starting from 1 March 2021 explicitly accounted for the presence of two variants. As a result, most models did not anticipate the change in trend well and only reacted implicitly once the change became apparent in the data on 27 February 2021. Figure 5 shows the case forecasts of all submitted models and the median ensemble from 15 February, 22 February and 1 March 2021. We also show the two short time series of shares of the B.1.1.7 variant available from Robert Koch Institute at the respective prediction time points.

![Figure 5:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2022/08/26/2021.11.05.21265810/F5.medium.gif)

Figure 5: Case forecasts in Germany preceding the upward trend change in March 2021. Point forecasts of cases in Germany, as issued on (a) 15 February, (b) 22 February and (c) 1 March 2021. These dates, shown as vertical dashed lines, mark the start of a renewed increase in overall case counts due to the new variant of concern B.1.1.7. Panel (d): Data by RKI on the share of the B.1.1.7 variant as available on the different forecast dates (the next data release by RKI occurred on 3 March). The models Karlen-pypm and LeipzigIMISE-SECIR accounted for the presence of multiple variants from 1 March onwards.
The ITWW-county repro model was the only one to anticipate a change in trend on 15 February (though slower than the observed one), and adapted quickly to the upward trend in the following week. This model extrapolates recently observed growth or decline at the county-level and aggregates these fine-grained forecasts to the state or national level. Therefore it may have been able to catch a signal of renewed growth, as a handful of German states had already experienced a slight increase in cases in the previous week (e.g., Thuringia and Saxony-Anhalt, see panel (b) of Supplementary Figure S11). However, as illustrated in panel (a) of the same Figure, the ITWW model had also predicted turning points earlier during the same phase of decline in cases, and might generally have a tendency to produce such patterns.

Another noteworthy observation in this context is the change in the predictions of the Karlen-pypm model. After the extension of the model to account for the B.1.1.7 variant on 1 March, its forecasts changed from the most optimistic to the most pessimistic among all included models (panels b and c of Figure 5). The other model including variant data, LeipzigIMISE-SECIR, likewise was among the first to adopt an upward trend.

In Poland, availability of sequencing data was very limited during our study period; the GISAID database19 only contained 2271 sequenced samples for Poland by 29 March 2021.18 Nonetheless, the ICM-agentModel and MOCOS-agent1 models explicitly took the presence of a new variant into account to the degree possible. Again, the ITWW-county repro model was the first to predict a change in overall trends (in this case without having predicted turning points already in the preceding weeks; see Supplementary Figure S1).

#### Peak of the third wave (cases)

In Poland, the third wave reached its peak in the week ending on 3 April 2021.
Although it coincided with the Easter weekend, and data quality was thus somewhat unclear, this turnaround was predicted quite well by two Poland-based teams, MOCOS-agent1 and ICM-agentModel. As can be seen from Figure 6, the trajectory of these two models differed substantially from those of most other models, including the ensemble, which predicted a sustained increase. This successful prediction of the turning point was in large part responsible for the good relative performance of MOCOS-agent1 and ICM-agentModel at longer horizons (Table 2). In retrospective discussions, the respective teams noted that the tightening of non-pharmaceutical interventions (NPIs) on 27 March (which they had anticipated) in combination with possible seasonal effects had led them to expect a downward turn.

![Figure 6:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2022/08/26/2021.11.05.21265810/F6.medium.gif)

[Figure 6:](http://medrxiv.org/content/early/2022/08/26/2021.11.05.21265810/F6) Case forecasts in Poland surrounding the peak in April 2021. Point forecasts of cases in Poland from (a) 22 March, (b) 29 March and (c) 5 April 2021, surrounding the peak week. In each panel, the date at which forecasts were created is marked by a dashed vertical line. The models ICM-agentModel and MOCOS-agent1 anticipated the trend change correctly, while the remaining models show more or less pronounced overshoot.

For Germany, the peak of the third wave occurred only after the end of our pre-specified study period, but we note that numerous models showed strong overshoot as they expected the upward trend to continue. The exact mechanisms underlying the turnaround remain poorly understood. A new set of restrictions referred to as the Bundesnotbremse (federal emergency brake) was introduced too late to explain the change on its own.
#### Changes in trend of deaths

In Germany, the study period coincided almost perfectly with a prolonged period of decline in deaths. In Figure 7, panels (a) and (b) show the behaviour of the median ensemble at the beginning and end of this phase. The ensemble had already anticipated a downward turn on 4 January, two weeks before it actually occurred. Following the unexpectedly strong increase in the following week, it went to extending the upward tendency, before switching back to predicting a turnaround. It seems likely that the irregular pattern in late December and early January is partly due to holiday effects in reporting, and forecast models may have been disturbed by this aspect.

![Figure 7:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2022/08/26/2021.11.05.21265810/F7.medium.gif)

[Figure 7:](http://medrxiv.org/content/early/2022/08/26/2021.11.05.21265810/F7) Death forecasts preceding trend changes. Point forecasts of the median ensemble during changing trends in deaths. Panel (a): Downward turn in Germany, January 2021. Panel (b): Upward turn in Germany, March 2021. Panel (c): Upward turn in Poland, February/March 2021. Different colours and point/line shapes represent forecasts made at distinct time points (marked by dashed vertical lines).

At the end of the downward trend in late March, the ensemble again anticipated the turnaround to arrive earlier than it did, and predicted a more prolonged rise than was observed. Nonetheless, in both cases the ensemble to some degree anticipated the qualitative change, and the observed trajectories were well inside the respective 95% prediction intervals (with the exception of the forecast from 4 January; however, this forecast had prospectively been excluded from the analysis as we anticipated reporting irregularities).

In Poland, deaths started to increase in early March after a prolonged period of decay.
As can be seen in panel (c) of Figure 7, the median ensemble had anticipated this change (22 February 2021), but in terms of its point forecast did not initially expect a prolonged upward trend as later observed. Nonetheless, the observed trajectory was contained in the relatively wide 95% prediction intervals (Figures 2 and 3).

## Discussion

We presented results from the second and final part of a pre-registered forecast evaluation study conducted in Germany and Poland (January–April 2021). During the period covered in this paper, ensemble approaches yielded very good performance relative to contributed individual models and baseline models. The majority of contributed models was able to outperform a simple last-observation-carried-forward model for most targets and forecast horizons up to four weeks.

The results in this manuscript differ in important aspects from those for our first evaluation period (October–December 2020), when most models struggled to meaningfully outperform the KIT-baseline model for cases. Fall 2020 was characterized by rapidly changing non-pharmaceutical intervention measures, making it hard for models to anticipate the case trajectory. Pooled across both study periods, we found ensemble forecasts of deaths to yield satisfactory reliability and clear improvements over baseline models. For cases, however, coverage was clearly below nominal from the two-week horizon onward, and in terms of mean weighted interval scores the ensemble failed to outperform the KIT-baseline model three and four weeks ahead. This strengthens our previous conclusion16 that meaningful case forecasts are only feasible at very short horizons. It also agrees with recent results from the US COVID-19 Forecast Hub,36 which led the organizers to temporarily suspend ensemble case forecasts beyond the one-week horizon.
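For readers less familiar with the two evaluation criteria referred to here, the following Python sketch shows how empirical interval coverage and the weighted interval score (WIS) can be computed for quantile-based forecasts. It is an illustration only: the function names and all numbers are hypothetical, and the code is not the evaluation code used in this study (which is written in R). The WIS is taken in its usual form, a weighted average of the absolute error of the median and the interval scores of the central prediction intervals.

```python
def interval_score(lower, upper, y, alpha):
    """Interval score of a central (1 - alpha) prediction interval [lower, upper]."""
    width = upper - lower
    # Penalty if the observation falls below the interval (forecast too high).
    overprediction = (2.0 / alpha) * max(lower - y, 0.0)
    # Penalty if the observation falls above the interval (forecast too low).
    underprediction = (2.0 / alpha) * max(y - upper, 0.0)
    return width + overprediction + underprediction


def weighted_interval_score(median, intervals, y):
    """WIS for one forecast; `intervals` maps alpha -> (lower, upper)."""
    total = 0.5 * abs(y - median)
    for alpha, (lower, upper) in intervals.items():
        total += (alpha / 2.0) * interval_score(lower, upper, y, alpha)
    return total / (len(intervals) + 0.5)


def empirical_coverage(bounds, observations):
    """Share of observations that fall inside their prediction interval."""
    hits = sum(lower <= y <= upper for (lower, upper), y in zip(bounds, observations))
    return hits / len(observations)


# Hypothetical forecast: median 100, 50% interval [90, 110], 95% interval [70, 140].
example_wis = weighted_interval_score(100.0, {0.5: (90.0, 110.0), 0.05: (70.0, 140.0)}, 120.0)

# Hypothetical 95% intervals for three weeks and the values later observed.
example_coverage = empirical_coverage([(0, 10), (0, 10), (5, 6)], [5, 11, 5.5])
```

In this hypothetical example an observed value of 120 yields a WIS of 10.7, and two of the three intervals cover the observation. A coverage statement such as "13/19 at the 0.95 level" corresponds to an empirical coverage of about 0.68, to be compared with the nominal 0.95.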
The differences between our two study periods illustrate that performance relative to simple baseline models is strongly dependent on how good a fit these are for a given period. Cases in Germany plateaued during November and early December 2020, making the last-observation-carried-forward strategy of KIT-baseline difficult to beat. The second evaluation period was characterized by longer stretches of continued upward or downward trends, making it much easier to beat that baseline. In this situation, however, many models did not achieve strong improvements over the extrapolation approach KIT-extrapolation baseline. Ideally one would wish complex forecast models to outperform each of these different baseline models. However, there are many ways of specifying a simple baseline,37 and post-hoc at least one of them will likely be in acceptable agreement with the observed trajectory. While the choice of the most meaningful reference remains subject to debate, we believe that the use of a small set of pre-specified baselines as in the present study is a reasonable approach.

An observation made for both the first and the second part of our study is that predicting changing trends in cases is very challenging; turnarounds in death counts are less difficult to anticipate. This finding is shared by works on real-time forecasts of COVID-19 in the UK38 and the US.39 To interpret these insights we note that, in principle, there are two ways of forecasting epidemiological time series:

1. Applying a mechanistic model to project future spread based on recent trends and other relevant factors like NPIs, population behaviour, spread of different variants or vaccination. Models can then predict trend changes based on classical epidemiological mechanisms (depletion of susceptibles) or observed/anticipated changes in surrounding factors, which depending on the model may be treated as exogenous or endogenous.
2. Establishing a statistical relationship (often with a mechanistic motivation) to a leading indicator, i.e., a data stream which is informative on the trajectory of the quantity of interest, but available earlier. Changes in the trend of the leading indicator can then help anticipate future turning points in the time series of interest.

Death forecasts belong to category 2, with cases and hospitalizations serving as leading indicators. This prediction task has been addressed with considerable success. Case forecasts, on the other hand, typically are based on approach 1, which largely reduces to trend extrapolation, unless models are carefully tuned to changing NPIs (see Table 1). Theoretical arguments on the limited predictability of turning points in such curves have been brought forward,40, 41 and empirical work including ours confirms that this is a very difficult task. While the success of the two microsimulation models MOCOS-agent1 and ICM-agentModel in anticipating the downward turn in cases in Poland remains a rather rare exception, it shows that careful mechanistic modelling combined with in-depth knowledge of national specificities has the potential to anticipate the impact of changing NPIs. As both groups heavily drew from experience from past NPIs in Poland, there is hope that predictions of the effects of NPIs will further improve as experience accumulates.

An alternative strategy to improve case forecasts would be to identify appropriate leading indicators. These could for instance be trajectories in other countries42 or additional data streams on e.g., mobility, insurance claims or web searches. However, the benefits of such data for short-term forecasting thus far have been found to be modest.43 Changes in dominant variants may make changes in overall trends predictable as they arise from the superposition of adverse but stable trends for the different variants.
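The superposition argument can be illustrated with a stylized calculation in which wild-type cases decline by a constant factor each week while cases of a new variant grow by a constant factor. The Python sketch below uses purely hypothetical numbers chosen for illustration; they are not estimates for Germany or Poland, and the function name is ours.

```python
def weekly_totals(n_weeks, wild0, wild_factor, variant0, variant_factor):
    """Return (week, total cases, variant share) under two constant weekly trends."""
    rows = []
    for week in range(n_weeks):
        wild = wild0 * wild_factor ** week            # steadily declining wild type
        variant = variant0 * variant_factor ** week   # steadily growing new variant
        total = wild + variant
        rows.append((week, total, variant / total))
    return rows


# Hypothetical setting: 10,000 wild-type cases shrinking by 15% per week,
# 500 variant cases growing by 50% per week.
rows = weekly_totals(8, 10000.0, 0.85, 500.0, 1.5)

# Week in which the overall case count is lowest, i.e. the turning point.
trough_week, trough_total, trough_share = min(rows, key=lambda row: row[1])
```

With these assumed values, total cases reach their minimum in week 4, when the variant still accounts for only about a third of all cases. Neither of the two trends changes at any point, which is why variant-specific data could in principle reveal such a turning point before it becomes visible in aggregate case counts.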
The availability of sequencing data has improved considerably since our study period, and we consider the extension of models to accommodate multiple strains a key step towards improved prediction of trend changes. Other relevant aspects include seasonal effects, which during our study period remained poorly understood due to limited historical data, and population immunity. As more data on seroprevalence become available, predictability of saturation effects may increase, though this will likely be complicated by the further evolution of the pathogen.

Another difficulty of case forecasts is incomplete case ascertainment, which must be assumed to vary over time.9, 44 As a consequence, data can be difficult to compare across different phases of the pandemic, and modellers often choose to only use a recent subset of the available data to calibrate their models. While data on testing volumes and test positivity rates are available, estimation of the reporting fractions and anticipation of their future development is challenging. Even if models correctly reflect underlying epidemic dynamics, this may thus not translate to accurate forecasts of the observed number of confirmed cases. This is a limitation of the considered forecasts and their evaluation, which inherit the difficulties of the underlying truth data sources. Nonetheless, we argue that a distinguishing feature of forecasts is that they refer to observable quantities, and forecasters should take into account all relevant aspects of the system producing them. Indeed, many of the considered models (e.g., MOCOS-agent1 and FIAS FZJ-Epi1Ger) attempt to reconstruct the underlying infection dynamics, which are then linked to the number of reported cases via time-varying reporting probabilities.

We have extensively discussed the difficulties models encountered at turning points.
In the aftermath of such events, epidemic forecasts typically receive increased attention in the general media (e.g., following the rapid downward turn in cases in Germany in May 2021; ref. 45). While important from a subject-matter perspective, this is not without problems from a formal forecast evaluation standpoint. Major turning points are rare events and as such difficult to forecast. Focusing evaluation on solely these instances will benefit models with a strong tendency to predict change, and adapting scoring rules to emphasize these events in a principled way is not straightforward. This problem is known as the forecaster’s dilemma46 in the literature and likewise occurs in, e.g., economics and meteorology (see illustrations in Table 1 from Lerch et al46). An interesting question for future work is whether turning points are preceded by stronger disagreement between models, which might then serve as an alert; or whether, on the contrary, trend changes are followed by increased disagreement. Especially the latter question has received considerable attention in economic forecasting.47

In this paper we only applied unweighted ensembles and a heuristic, rather inflexible weighting scheme based directly on past average performance. More sophisticated weighting schemes have been explored in refs. 29 and 39 using data from the US COVID-19 Forecast Hub. Their results indicate that when some contributing forecasters have a stable record of good performance, giving these more weight can result in improved performance. In particular, restricting the ensemble to a set of well-performing models may be beneficial, a strategy employed in the so-called relative WIS weighted median ensemble39 used by the US COVID-19 Forecast Hub since November 2021.

The present paper marks the end of the German and Polish COVID-19 Forecast Hub as an independently run platform.
In April 2021, the European Centre for Disease Prevention and Control (ECDC) announced the launch of a European COVID-19 Forecast Hub,4 which has since attracted submissions from more than 30 independent teams. The German and Polish COVID-19 Forecast Hub has been synchronized with this larger effort, meaning that all forecasts submitted to our platform are forwarded to the European repository, while forecasts submitted there are mirrored in our dashboard. In addition, we still collect regional-level forecasts, which are not currently covered in the European Forecast Hub. The adoption of the Forecast Hub concept by ECDC underscores the potential of collaborative forecasting systems with combined ensemble predictions as a key output, along with continuous monitoring of forecast performance. We anticipate that this closer link to public health policy making will enhance the usefulness of this system to decision makers. An important step will be the inclusion of hospitalization forecasts. Due to unclear data access, these had not been tackled in the framework of the German and Polish COVID-19 Forecast Hub, but have been added in the new European version.

## Data availability

The forecast data generated in this study have been deposited in a GitHub repository ([https://github.com/KITmetricslab/covid19-forecast-hub-de](https://github.com/KITmetricslab/covid19-forecast-hub-de)), with a stable Zenodo release available under accession code 5608390 [https://zenodo.org/record/5608390#.YYFxdJso9H4](https://zenodo.org/record/5608390#.YYFxdJso9H4). This repository also contains all truth data used for evaluation.
Forecasts can be visualised interactively at [https://kitmetricslab.github.io/forecasthub/](https://kitmetricslab.github.io/forecasthub/). Should any further data be required to reproduce the results, these can be obtained from the corresponding authors upon reasonable request.

## Code availability

Codes to reproduce figures and tables are available at [https://github.com/KITmetricslab/analyses](https://github.com/KITmetricslab/analyses) de pl2, with a stable version at [https://zenodo.org/record/5639514#.YYF1aZso9H4](https://zenodo.org/record/5639514#.YYF1aZso9H4).61 The results presented in this paper have been generated using the release preprint2 of the repository [https://github.com/KITmetricslab/covid19-forecast-hub-de](https://github.com/KITmetricslab/covid19-forecast-hub-de), see above for the link to the stable Zenodo release. The codes require the R packages colorspace (version 1.8-9),62 plotrix (version 3.8-1),63 xtable (version 1.8-4)64 and zoo (version 1.8-9).65

## Competing interests

The authors declare no competing interests.

## Author contributions

JB, DW, TG and MSe conceived the study with advice from AU. JB, DW, JD, KG and JK put in place and maintained the forecast submission and processing system. AU coordinated the creation of an interactive visualization tool. JB performed the evaluation analyses with inputs from DW, TG, MSe and members of various teams. SA, MVB, DB, SB, MB, NIB, JPB, LC, GF, JFr, JFn, SF, AG, KG, SH, TH, YK, HK, TK, EK, NL, MLL, JHM, BM, IJM, JMe, JMg, PN, JMN, TO, MR, FR, MSz, SS, and AS contributed forecasts (see list of contributors by team). JB, TG and MSe wrote the manuscript. All teams and members of the coordinating team provided feedback on the manuscript and descriptions of the respective models.
## List of contributors by team

CovidAnalytics-DELPHI: Michael Lingzhi Li (Operations Research Center, Massachusetts Institute of Technology, Cambridge, MA, USA), Dimitris Bertsimas, Saksham Soni (both Sloan School of Management, Massachusetts Institute of Technology, Cambridge, USA)

epiforecasts-EpiExpert and epiforecasts-EpiNow2: Sam Abbott, Nikos I. Bosse, Sebastian Funk (all London School of Hygiene and Tropical Medicine, London, UK)

FIAS FZJ-Epi1Ger: Maria Vittoria Barbarossa (Frankfurt Institute for Advanced Studies, Frankfurt, Germany), Jan Fuhrmann (Institute of Applied Mathematics, University of Heidelberg, Heidelberg, Germany), Jan H. Meinke (Jülich Supercomputing Centre, Forschungszentrum Jülich, Jülich, Germany)

SDSC ISG-TrendModel: Antoine Flahault, Elisa Manetti, Kristen Namigai (all Institute of Global Health, Faculty of Medicine, University of Geneva, Geneva, Switzerland), Christine Choirat, Benjamin Bejar Haro, Ekaterina Krymova, Gavin Lee, Guillaume Obozinski, Tao Sun (all Swiss Data Science Center, ETH Zurich and EPFL Lausanne, Switzerland), Dorina Thanou (Center for Intelligent Systems, EPFL, Lausanne, Switzerland)

ICM-agentModel: Filip Dreger, Łukasz Górski, Magdalena Gruziel-Slomka, Artur Kaczorek, Antoni Moszyński, Karol Niedzielewski, Jedrzej Nowosielski, Maciej Radwan, Franciszek Rakowski, Marcin Semeniuk, Jakub Zieliński (all Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, Warsaw, Poland), Rafal Bartczuk (Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, Warsaw and Institute of Psychology, John Paul II Catholic University of Lublin, Lublin, Poland), Jan Kisielewski (Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, Warsaw and Faculty of Physics, University of Bialystok)

Imperial-ensemble2: Sangeeta Bhatia (MRC Centre for Global Infectious Disease Analysis, Abdul Latif Jameel Institute for Disease and Emergency
Analytics (J-IDEA), Imperial College, London, UK), Pierre Nouvellet (School of Life Sciences, University of Sussex, Brighton, UK)

itwm-dSEIR: Michael Burger, Robert Feßler, Jochen Fiedler, Michael Helmling, Karl-Heinz Küfer, Neele Leithäuser, Jan Mohring, Johanna Schneider, Anita Schöbel, Michael Speckert, Raimund Wegener, Jaroslaw Wlazlo (all Fraunhofer Institute for Industrial Mathematics, Kaiserslautern, Germany)

ITWW-county repro: Przemyslaw Biecek (Warsaw University of Technology, Warsaw, Poland), Viktor Bezborodov, Marcin Bodych, Tyll Krueger (all Wroclaw University of Science and Technology, Poland), Jan Pablo Burgard (Economic and Social Statistics Department, University of Trier, Germany), Stefan Heyder, Thomas Hotz (both Institute of Mathematics, Technische Universität Ilmenau, Ilmenau, Germany)

LANL-GrowthRate: Dave A. Osthus, Isaac J. Michaud (both Statistical Sciences Group, Los Alamos National Laboratory, Los Alamos, USA), Lauren Castro, Geoffrey Fairchild (both Information Systems and Modeling, Los Alamos National Laboratory, Los Alamos, USA)

LeipzigIMISE-SECIR: Yuri Kheifetz, Holger Kirsten, Markus Scholz (all Institute for Medical Informatics, Statistics and Epidemiology, University of Leipzig, Leipzig, Germany)

MIMUW-StochSEIR: Anna Gambin, Krzysztof Gogolewski, Błażej Miasojedow, Ewa Szczurek (all Institute of Informatics, University of Warsaw, Warsaw, Poland), Daniel Rabczenko, Magdalena Rosińska (Polish National Institute of Public Health – National Institute of Hygiene)

MOCOS-agent1: Marek Bawiec, Viktor Bezborodov, Marcin Bodych, Radoslaw Idzikowski, Tyll Krueger, Tomasz Ożański, Ewaryst Rafajłowicz, Ewa Skubalska-Rafajłowicz, Wojciech Rafajłowicz (all Wroclaw University of Science and Technology, Poland), Barbara Pabjan (Institute of Sociology, University of Wroclaw, Poland), Przemyslaw Biecek (Warsaw University of Technology), Agata Migalska (Wroclaw University of Science and Technology, Poland and Nokia Solutions and Networks,
Wroclaw, Poland), Ewa Szczurek (University of Warsaw)

USC-SIkJalpha: Ajitesh Srivastava, Frost Tianjian Xu (both University of Southern California, Los Angeles, USA)

## Supplementary Materials for Bracher et al (2021)

### S1 Detailed description of new models

We only provide detailed descriptions of models which were added to our project for the second evaluation period. Descriptions for the other models can be found in Supplementary Note 3 of Bracher et al (2021). A more detailed documentation of the LeipzigIMISE-SECIR and SDSC ISG-TrendModel models, which had not been available at the appearance of Bracher et al (2021), can be found in Kheifetz et al (2021) and Krymova et al (2021), respectively.

#### itwm-dSEIR

Fraunhofer-ITWM’s predictions are based on a cohort model that groups people according to four age groups and according to the status infected, detected and, since 19 April, successfully vaccinated (i.e., this extension was added after the evaluation period). The dynamics of the epidemic are described by integral equations, assuming an infectious period with fixed onset, end and infectivity. The most important parameters are contact rates between age groups, detection rates and times, and death rates and times, which are adjusted to the historical data of the RKI. For forecasts, the simulation is continued with the parameters determined for the last week. In principle, the forecast quality could be improved by anticipating the effects of events such as the end of public holidays on contact and detection rates. However, this is not yet done in the automatic submissions. All calculations use automatic differentiation. This speeds up parameter adjustment and allows for error estimates. The latter are determined by comparing counted and simulated cases and by matching the empirical standard deviations with the standard deviations predicted by the calculated sensitivities.
The model is described in detail in [https://www.itwm.fraunhofer.de/de/presse-publikationen/presseinformationen/2021/2021-06-22](https://www.itwm.fraunhofer.de/de/presse-publikationen/presseinformationen/2021/2021-06-22) Dritte Welle Starker-Effekt-von-Schnelltests-an-Schulen.html.

#### Karlen-pypm

The python Population Modeller (pyPM, Karlen 2020) is a mechanistic modeling framework to describe viral spread via discrete-time difference equations. In a pyPM model, different population objects are connected by a list of directional connector objects. The adjustable parameters of the model are stored in parameter objects. The core of the model consists of a model of the infection cycle involving the susceptible, infected (but not yet contagious) and contagious parts of the population. The contagious population is modelled in more detail by introducing symptomatic, test-positive, hospitalized (normal ward and ICU) and deceased populations. The model takes time series of cases, deaths and intensive care occupancy as data inputs. Forecasts are generated at the regional level (German states) first and subsequently aggregated to the national level. Starting from 1 March 2021, the model was stratified into spread of the wild type of the virus and the B.1.1.7 variant, and integrated genetic sequencing data on their respective importance.

### S2 Additional forecast visualizations

![Figure S1:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2022/08/26/2021.11.05.21265810/F8.medium.gif)

[Figure S1:](http://medrxiv.org/content/early/2022/08/26/2021.11.05.21265810/F8) One-week-ahead forecasts of confirmed cases and deaths from COVID-19 in Germany and Poland. The figure shows forecasts from models not displayed in Figure 2. Colored points represent predictive medians, dark and light bars show 50% and 95% prediction intervals, respectively.
Asterisks mark prediction intervals exceeding the upper plot limit.

![Figure S2:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2022/08/26/2021.11.05.21265810/F9.medium.gif)

[Figure S2:](http://medrxiv.org/content/early/2022/08/26/2021.11.05.21265810/F9) Two-week-ahead forecasts of confirmed cases and deaths from COVID-19 in Germany and Poland, same models as displayed in Figure 3. Colored points represent predictive medians, dark and light bars show 50% and 95% prediction intervals, respectively. Asterisks mark prediction intervals exceeding the upper plot limit.

![Figure S3:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2022/08/26/2021.11.05.21265810/F10.medium.gif)

[Figure S3:](http://medrxiv.org/content/early/2022/08/26/2021.11.05.21265810/F10) Three-week-ahead forecasts of confirmed cases and deaths from COVID-19 in Germany and Poland, same models as displayed in Figure 3. Colored points represent predictive medians, dark and light bars show 50% and 95% prediction intervals, respectively. Asterisks mark prediction intervals exceeding the upper plot limit.

![Figure S4:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2022/08/26/2021.11.05.21265810/F11.medium.gif)

[Figure S4:](http://medrxiv.org/content/early/2022/08/26/2021.11.05.21265810/F11) Four-week-ahead forecasts of confirmed cases and deaths from COVID-19 in Germany and Poland, same models as displayed in Figure 3. Colored points represent predictive medians, dark and light bars show 50% and 95% prediction intervals, respectively. Asterisks mark prediction intervals exceeding the upper plot limit.
![Figure S5:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2022/08/26/2021.11.05.21265810/F12.medium.gif)

[Figure S5:](http://medrxiv.org/content/early/2022/08/26/2021.11.05.21265810/F12) Three-week-ahead forecasts of confirmed cases and deaths from COVID-19 in Germany and Poland, same models as displayed in Figure S1. Colored points represent predictive medians, dark and light bars show 50% and 95% prediction intervals, respectively. Asterisks mark prediction intervals exceeding the upper plot limit.

![Figure S6:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2022/08/26/2021.11.05.21265810/F13.medium.gif)

[Figure S6:](http://medrxiv.org/content/early/2022/08/26/2021.11.05.21265810/F13) Four-week-ahead forecasts of confirmed cases and deaths from COVID-19 in Germany and Poland, same models as displayed in Figure S1. Colored points represent predictive medians, dark and light bars show 50% and 95% prediction intervals, respectively. Asterisks mark prediction intervals exceeding the upper plot limit.

### S3 Decomposition of average WIS values

![Figure S7:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2022/08/26/2021.11.05.21265810/F14.medium.gif)

[Figure S7:](http://medrxiv.org/content/early/2022/08/26/2021.11.05.21265810/F14) Average weighted interval score and absolute error achieved by models across countries, targets and forecast horizons. The grey area represents the performance of the baseline model KIT-baseline. WIS values are decomposed into components for forecast spread, overprediction and underprediction.

### S4 Individual WIS values

![Figure S8:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2022/08/26/2021.11.05.21265810/F15.medium.gif)

[Figure S8:](http://medrxiv.org/content/early/2022/08/26/2021.11.05.21265810/F15) Individual weighted interval scores achieved by models across countries, targets and forecast horizons.
Each dot represents one score achieved over the 12-week evaluation period.

### S5 Weights in inverse-WIS ensembles

![Figure S9:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2022/08/26/2021.11.05.21265810/F16.medium.gif)

[Figure S9:](http://medrxiv.org/content/early/2022/08/26/2021.11.05.21265810/F16) Weights in KITCOVIDhub-inverse wis ensemble for incident cases in Germany and Poland, October 2020–March 2021 (i.e., combined for the study periods of Bracher et al (2021) and the present manuscript).

![Figure S10:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2022/08/26/2021.11.05.21265810/F17.medium.gif)

[Figure S10:](http://medrxiv.org/content/early/2022/08/26/2021.11.05.21265810/F17) Weights in KITCOVIDhub-inverse wis ensemble for incident deaths in Germany and Poland, October 2020–March 2021 (i.e., combined for the study periods of Bracher et al (2021) and the present manuscript).

### S6 Visualization of behaviour at turning points

![Figure S11:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2022/08/26/2021.11.05.21265810/F18.medium.gif)

[Figure S11:](http://medrxiv.org/content/early/2022/08/26/2021.11.05.21265810/F18) **a** Forecasts of cases in Germany by the ITWW-county repro model, 25 January to 22 February 2021. **b** Forecasts for cases in selected German states by the ITWW-county repro model, 22 February 2021.

### S7 Additional summary tables on forecast evaluation

[Table S1:](http://medrxiv.org/content/early/2022/08/26/2021.11.05.21265810/T4) Forecast evaluation for Germany and Poland in terms of relative AE and WIS, 1–4 weeks ahead. The relative values are obtained by dividing the mean absolute error or WIS of a given model by the respective value achieved by the baseline model. Values below 1 indicate better performance than the baseline, values above worse performance.
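As a small worked illustration of the relative values described in the caption of Table S1, the following Python sketch divides the mean score of a model by the mean score of the baseline over the same set of forecasts. The function name and the scores are hypothetical and serve only to make the definition concrete; they are not values from this study.

```python
def relative_score(model_scores, baseline_scores):
    """Mean score of a model divided by the mean score of the baseline."""
    if len(model_scores) != len(baseline_scores):
        raise ValueError("model and baseline must be scored on the same forecasts")
    mean_model = sum(model_scores) / len(model_scores)
    mean_baseline = sum(baseline_scores) / len(baseline_scores)
    return mean_model / mean_baseline


# Hypothetical weekly WIS values for one model and for the baseline.
example_relative_wis = relative_score([80.0, 120.0, 100.0], [100.0, 150.0, 150.0])
```

Here the model has a mean WIS of 100 against a baseline mean of about 133.3, giving a relative WIS of 0.75, i.e., better than the baseline. Note that this is a ratio of means rather than a mean of week-by-week ratios, so weeks with large scores carry more weight.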
View this table: [Table S2:](http://medrxiv.org/content/early/2022/08/26/2021.11.05.21265810/T5) Table S2: Forecast evaluation for Germany and Poland, 3 and 4 weeks ahead (incidence scale, based on RKI/MZ data). *C*0.5 and *C*0.95 denote coverage rates of the 50% and 95% prediction intervals; AE and WIS stand for the mean absolute error and mean weighted interval score. View this table: [Table S3:](http://medrxiv.org/content/early/2022/08/26/2021.11.05.21265810/T6) Table S3: Forecast evaluation for Germany and Poland, 1 and 2 weeks ahead (cumulative scale, based on RKI/MZ data). *C*0.5 and *C*0.95 denote coverage rates of the 50% and 95% prediction intervals; AE and WIS stand for the mean absolute error and mean weighted interval score. View this table: [Table S4:](http://medrxiv.org/content/early/2022/08/26/2021.11.05.21265810/T7) Table S4: Forecast evaluation for Germany and Poland, 3 and 4 weeks ahead (cumulative scale, based on RKI/MZ data). *C*0.5 and *C*0.95 denote coverage rates of the 50% and 95% prediction intervals; AE and WIS stand for the mean absolute error and mean weighted interval score. View this table: [Table S5:](http://medrxiv.org/content/early/2022/08/26/2021.11.05.21265810/T8) Table S5: Forecast evaluation for Germany and Poland, 1 and 2 weeks ahead (incidence scale, based on JHU data). *C*0.5 and *C*0.95 denote coverage rates of the 50% and 95% prediction intervals; AE and WIS stand for the mean absolute error and mean weighted interval score. View this table: [Table S6:](http://medrxiv.org/content/early/2022/08/26/2021.11.05.21265810/T9) Table S6: Forecast evaluation for Germany and Poland, 3 and 4 weeks ahead (incidence scale, based on JHU data). *C*0.5 and *C*0.95 denote coverage rates of the 50% and 95% prediction intervals; AE and WIS stand for the mean absolute error and mean weighted interval score. 
View this table: [Table S7:](http://medrxiv.org/content/early/2022/08/26/2021.11.05.21265810/T10) Table S7: Forecast evaluation at the regional level, Germany and Poland, 3 and 4 weeks ahead (incidence scale, based on RKI/MZ data). Results are averaged over the different regions (states in Germany, voivodeships in Poland). *C*0.5 and *C*0.95 denote coverage rates of the 50% and 95% prediction intervals; AE and WIS stand for the mean absolute error and mean weighted interval score. View this table: [Table S8:](http://medrxiv.org/content/early/2022/08/26/2021.11.05.21265810/T11) Table S8: Forecast evaluation for Germany and Poland, pooled across evaluation periods, 1 and 2 weeks ahead (incidence scale, based on RKI/MZ data). *C*0.5 and *C*0.95 denote coverage rates of the 50% and 95% prediction intervals; AE and WIS stand for the mean absolute error and mean weighted interval score. View this table: [Table S9:](http://medrxiv.org/content/early/2022/08/26/2021.11.05.21265810/T12) Table S9: Forecast evaluation for Germany and Poland, pooled across evaluation periods, 3 and 4 weeks ahead (incidence scale, based on RKI/MZ data). *C*0.5 and *C*0.95 denote coverage rates of the 50% and 95% prediction intervals; AE and WIS stand for the mean absolute error and mean weighted interval score. ## Acknowledgements J. Bracher, M. Schienle and T. Gneiting acknowledge support from the Helmholtz Foundation via the SIM-CARD Information and Data Science Pilot Project. J. Bracher and M. Schienle were moreover supported by the German Ministry of Research and Education via the project RESPINOW. T. Gneiting and D. Wolffram are grateful for support by the Klaus Tschira Foundation. D. 
Wolffram’s contribution was moreover supported by the Helmholtz Association under the joint research school HIDSS4Health – Helmholtz Information and Data Science School for Health as well as the German Federal Ministry of Education and Research (BMBF) and the Baden-Württemberg Ministry of Science as part of the Excellence Strategy of the German Federal and State Governments. N.I. Bosse was supported by the Health Protection Research Unit (grant code NIHR200908). S. Funk and S. Abbott were supported by the Wellcome Trust (210758/Z/18/Z). The itwm-dSEIR forecasting team (J. Fiedler, N. Leithäuser, J. Mohring) was supported by the Ministry of Health and Science of Rhineland-Palatinate and the Fraunhofer Anti-Corona Program. The LANL-GrowthRate forecasting team (L. Castro, G. Fairchild, I.J. Michaud) was supported by the Laboratory Directed Research and Development program of Los Alamos National Laboratory under project number 20200700ER. S. Bhatia acknowledges funding from the Wellcome Trust (219415). Work on the ICM UW epidemiological model (J.M. Nowosielski, M. Radwan, F. Rakowski) was supported by the Polish Minister of Science and Higher Education grant 51/WFSN/2020 given to the University of Warsaw. Development of the IMISE-SECIR model (Y. Kheifetz, H. Kirsten, M. Scholz) was funded in the framework of the project SaxoCOV (Saxonian COVID-19 Research Consortium). SaxoCOV was co-financed with tax funds on the basis of the budget passed by the Saxon state parliament. Model presentation was funded by the NFDI4Health Task Force COVID-19 ([www.nfdi4health.de/task-force-covid-19-2](http://www.nfdi4health.de/task-force-covid-19-2)) within DFG project LO-342/17-1. Furthermore, the modeling work of this group is funded by the Federal Ministry of Education and Research Germany (BMBF) within project PROGNOSIS (FKZ 031L0296A). 
We thank Dean Karlen for contributing forecasts and the Institute for Health Metrics and Evaluation, University of Washington, for making forecasts publicly available under a free license. We are moreover grateful for support and advice from the organizing team of the US COVID-19 Forecast Hub. The content of this manuscript is solely the responsibility of the authors and does not necessarily represent the official views of the institutions they are affiliated with. ## Footnotes * This version has a slightly different structure and addresses a few weaknesses of the previous version (e.g., by including a discussion on case ascertainment rates). The main messages of the paper remain unchanged. The title has been slightly changed and a plain language summary has been included. Three additional authors (Castro, Fairchild, Michaud) have been included who due to clearance questions had been removed from the first version. * Received November 5, 2021. * Revision received August 25, 2022. * Accepted August 26, 2022. * © 2022, Posted by Cold Spring Harbor Laboratory This pre-print is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), CC BY-NC 4.0, as described at [http://creativecommons.org/licenses/by-nc/4.0/](http://creativecommons.org/licenses/by-nc/4.0/) ## References 1. [1].Ray, E. L. et al. Ensemble forecasts of coronavirus disease 2019 (COVID-19) in the U.S. medRxiv (2020). URL [https://www.medrxiv.org/content/early/2020/08/22/2020.08.19.20177493](https://www.medrxiv.org/content/early/2020/08/22/2020.08.19.20177493). 2. [2].Cramer, E. Y. et al. Evaluation of individual and ensemble probabilistic forecasts of COVID-19 mortality in the United States. Proceedings of the National Academy of Sciences 119, e2113561119 (2022). 
[CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1073/pnas.2113561119&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=35394862&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2022%2F08%2F26%2F2021.11.05.21265810.atom) 3. [3].Borchering, R. K. et al. Modeling of future COVID-19 cases, hospitalizations, and deaths, by vaccination rates and nonpharmaceutical intervention scenarios – United States, April–September 2021. Morbidity and Mortality Weekly Report 70, 719–724 (2021). 4. [4].Sherratt, K. et al. Predictive performance of multi-model ensemble forecasts of COVID-19 across European nations. medRxiv (2022). URL [https://www.medrxiv.org/content/early/2022/06/16/2022.06.16.22276024](https://www.medrxiv.org/content/early/2022/06/16/2022.06.16.22276024). 5. [5].McGowan, C. J. et al. Collaborative efforts to forecast seasonal influenza in the United States, 2015–2016. Scientific Reports 9, 683 (2019). 6. [6].Del Valle, S. et al. Summary results of the 2014–2015 DARPA Chikungunya Challenge. BMC Infectious Diseases 18, 245 (2018). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1186/s12879-018-3124-7&link_type=DOI) 7. [7].Johansson, M. A. et al. An open challenge to advance probabilistic forecasting for dengue epidemics. Proceedings of the National Academy of Sciences 116, 24268–24274 (2019). [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6NDoicG5hcyI7czo1OiJyZXNpZCI7czoxMjoiMTE2LzQ4LzI0MjY4IjtzOjQ6ImF0b20iO3M6NTA6Ii9tZWRyeGl2L2Vhcmx5LzIwMjIvMDgvMjYvMjAyMS4xMS4wNS4yMTI2NTgxMC5hdG9tIjt9czo4OiJmcmFnbWVudCI7czowOiIiO30=) 8. [8].Nature Publishing Group. Editorial: Developing infectious disease surveillance systems. Nature Communications 11, 4962 (2020). 9. [9].Arik, S. et al. 
A prospective evaluation of AI-augmented epidemiology to forecast COVID-19 in the USA and Japan. npj Digital Medicine 4, 146 (2021). 10. [10].Dirnagl, U. Politikberatung, bis der Elefant mit dem Rüssel wackelt! Laborjournal 5/2021, 22–24 (2021). 11. [11].Bracher, J., the German and Polish COVID-19 Forecast Hub Team & Participants. Study protocol: Comparison and combination of real-time COVID19 forecasts in Germany and Poland. Deposited 8 October 2020, Registry of the Open Science Foundation, [https://osf.io/k8d39](https://osf.io/k8d39) (2020). 12. [12].Robert Koch Institut. CSV mit den aktuellen Covid-19 Infektionen pro Tag (Zeitreihe). Available online, [https://www.arcgis.com/home/item.html?id=f10774f1c63e40168479a1feb6c7ca74](https://www.arcgis.com/home/item.html?id=f10774f1c63e40168479a1feb6c7ca74), last accessed on 18 August 2022. (2022). 13. [13].Polish Ministry of Health. Dane historyczne dla województw. Available online, [https://www.arcgis.com/home/item.html?id=a8c562ead9c54e13a135b02e0d875ffb](https://www.arcgis.com/home/item.html?id=a8c562ead9c54e13a135b02e0d875ffb), last accessed on 18 August 2022. (2022). 14. [14].Johns Hopkins University Center for Systems Science and Engineering. COVID-19 Data Repository. Available online, [https://github.com/CSSEGISandData/COVID-19](https://github.com/CSSEGISandData/COVID-19), last accessed on 18 August 2022. (2022). 15. [15].Dong, E., Du, H. & Gardner, L. An interactive web-based dashboard to track COVID-19 in real time. The Lancet Infectious Diseases 20, 533–534 (2020). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/S1473-3099(20)30120-1&link_type=DOI) 16. [16].Bracher, J. et al. A pre-registered short-term forecasting study of COVID-19 in Germany and Poland during the second wave. Nature Communications 12, 5173 (2021). 17. 
[17].Robert Koch Institut. Bericht zu Virusvarianten von SARS-CoV-2 in Deutschland, insbesondere zur Variant of Concern (VOC) B.1.1.7, 31 March 2021. Available at [https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/DESH/Bericht_VOC_2021-03-31.pdf](https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/DESH/Bericht_VOC_2021-03-31.pdf) (2021). 18. [18].MI2 Data Lab, Warsaw University of Technology. Monitor of SARS-CoV-2 variants, version 2021-05-05 (2021). Available at [https://monitor.crs19.pl/2021-05-05/poland/?lang=en](https://monitor.crs19.pl/2021-05-05/poland/?lang=en). 19. [19].GISAID Initiative. Enabling rapid and open access to epidemic and pandemic virus data – tracking of variants (2021). Available at [https://www.gisaid.org/hcov19-variants/](https://www.gisaid.org/hcov19-variants/). 20. [20].Hale, T. et al. A global panel database of pandemic policies (Oxford COVID-19 government response tracker). Nature Human Behaviour 5, 529–538 (2021). 21. [21].Gneiting, T. & Raftery, A. E. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102, 359–378 (2007). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1198/016214506000001437&link_type=DOI) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000244361000032&link_type=ISI) 22. [22].Bracher, J., Ray, E. L., Gneiting, T. & Reich, N. G. Evaluating epidemic forecasts in an interval format. PLOS Computational Biology 17, e1008618 (2021). 23. [23].Petropoulos, F. & Makridakis, S. Forecasting the novel coronavirus COVID-19. PLOS ONE 15, e0231236 (2020). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1371/journal.pone.0231236&link_type=DOI) 24. [24].Hyndman, R. et al. forecast: Forecasting functions for time series and linear models. R package version 8.12.0, URL [https://pkg.robjhyndman.com/forecast/](https://pkg.robjhyndman.com/forecast/). (2021). 25. [25].IHME COVID-19 Forecasting Team. 
Modeling COVID-19 scenarios for the United States. Nature Medicine 27, 94–105 (2021). [PubMed](http://medrxiv.org/lookup/external-ref?access_num=33097835&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2022%2F08%2F26%2F2021.11.05.21265810.atom) 26. [26].Reich, N. G. et al. A collaborative multiyear, multimodel assessment of seasonal influenza forecasting in the United States. Proceedings of the National Academy of Sciences 116, 3146–3154 (2019). [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6NDoicG5hcyI7czo1OiJyZXNpZCI7czoxMDoiMTE2LzgvMzE0NiI7czo0OiJhdG9tIjtzOjUwOiIvbWVkcnhpdi9lYXJseS8yMDIyLzA4LzI2LzIwMjEuMTEuMDUuMjEyNjU4MTAuYXRvbSI7fXM6ODoiZnJhZ21lbnQiO3M6MDoiIjt9) 27. [27].Vincent, S. The function of the vibrissae in the behavior of the white rat. Behavioral Monographs 1, 1–82 (1912). 28. [28].Busetti, F. Quantile aggregation of density forecasts. Oxford Bulletin of Economics and Statistics 79, 495–512 (2017). URL [https://onlinelibrary.wiley.com/doi/abs/10.1111/obes.12163](https://onlinelibrary.wiley.com/doi/abs/10.1111/obes.12163). 29. [29].Taylor, J. W. & Taylor, K. S. Combining probabilistic forecasts of COVID-19 mortality in the United States. European Journal of Operational Research (2021). URL [https://www.sciencedirect.com/science/article/pii/S0377221721005609](https://www.sciencedirect.com/science/article/pii/S0377221721005609). 30. [30].Ray, E. L. et al. Comparing trained and untrained probabilistic ensemble forecasts of COVID-19 cases and deaths in the United States. International Journal of Forecasting (2022). URL [https://doi.org/10.1016/j.ijforecast.2022.06.005](https://doi.org/10.1016/j.ijforecast.2022.06.005). 31. [31].Davies, N. et al. Estimated transmissibility and impact of SARS-CoV-2 lineage B.1.1.7 in England. Science 372, eabg3055 (2021). 
[Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6Mzoic2NpIjtzOjU6InJlc2lkIjtzOjE3OiIzNzIvNjUzOC9lYWJnMzA1NSI7czo0OiJhdG9tIjtzOjUwOiIvbWVkcnhpdi9lYXJseS8yMDIyLzA4LzI2LzIwMjEuMTEuMDUuMjEyNjU4MTAuYXRvbSI7fXM6ODoiZnJhZ21lbnQiO3M6MDoiIjt9) 32. [32].Berndt, C., Endt, C. & Müller-Hansen, S. Die unsichtbare Welle. Süddeutsche Zeitung (2021). Published online, 5 February 2021, [https://www.sueddeutsche.de/wissen/coronavirus-mutante-b117-daten-1.5197700](https://www.sueddeutsche.de/wissen/coronavirus-mutante-b117-daten-1.5197700). 33. [33].Fischer-Fels, J. Erste Hochrechnung zur Verbreitung der Coronamutationen. Ärzteblatt (2021). Published online, 3 February 2021, [https://www.aerzteblatt.de/nachrichten/120768/Erste-Hochrechnung-zur-Verbreitung-der-Corona-Mutationen](https://www.aerzteblatt.de/nachrichten/120768/Erste-Hochrechnung-zur-Verbreitung-der-Corona-Mutationen). 34. [34].Landesgesundheitsamt Baden Württemberg. Tagesbericht COVID-19, Montag 8.2.2021 (2021). Available at [https://www.gesundheitsamt-bw.de/fileadmin/LGA/DocumentLibraries/SiteCollectionDocuments/05\_Service/LageberichtCOVID19/COVID\_Lagebericht\_LGA\_210208.pdf](https://www.gesundheitsamt-bw.de/fileadmin/LGA/DocumentLibraries/SiteCollectionDocuments/05\_Service/LageberichtCOVID19/COVID_Lagebericht_LGA_210208.pdf). 35. [35].Robert Koch Institute. Bericht zu Virusvarianten von SARS-CoV-2 in Deutschland, insbesondere zur Variant of Concern (VOC) B.1.1.7, update 10 February 2021. Available at [https://www.rki.de/DE/Content/InfAZ/N/NeuartigesCoronavirus/DESH/BerichtVOC 2021-02-10.pdf](https://www.rki.de/DE/Content/InfAZ/N/NeuartigesCoronavirus/DESH/BerichtVOC2021-02-10.pdf) (2021). 36. [36].Reich, N., Tibshirani, R., Ray, E. & Rosenfeld, R. On the predictability of COVID-19. 
Blog post, International Institute of Forecasters, [https://forecasters.org/blog/2021/09/28/on-the-predictability-of-covid-19/](https://forecasters.org/blog/2021/09/28/on-the-predictability-of-covid-19/) (2021). 37. [37].Keyel, A. C. & Kilpatrick, A. M. Probabilistic evaluation of null models for West Nile Virus in the United States (2021). URL [https://www.biorxiv.org/content/early/2021/07/26/2021.07.26.453866](https://www.biorxiv.org/content/early/2021/07/26/2021.07.26.453866). 38. [38].Funk, S. et al. Short-term forecasts to inform the response to the Covid-19 epidemic in the UK. medRxiv (2020). URL [https://www.medrxiv.org/content/early/2020/11/13/2020.11.11.20220962](https://www.medrxiv.org/content/early/2020/11/13/2020.11.11.20220962). 39. [39].Ray, E. L. et al. Challenges in training ensembles to forecast COVID-19 cases and deaths in the United States. Blog post, International Institute of Forecasters, [https://forecasters.org/blog/2021/04/09/challenges-in-training-ensembles-to-forecast-covid-19-cases-and-deaths-in-the-united-states/](https://forecasters.org/blog/2021/04/09/challenges-in-training-ensembles-to-forecast-covid-19-cases-and-deaths-in-the-united-states/) (2021). 40. [40].Castro, M., Ares, S., Cuesta, J. & Manrubia, S. The turning point and end of an expanding epidemic cannot be precisely forecast. Proceedings of the National Academy of Sciences 117, 26190–26196 (2020). URL [https://doi.org/10.1073/pnas.2007868117](https://doi.org/10.1073/pnas.2007868117). [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6NDoicG5hcyI7czo1OiJyZXNpZCI7czoxMjoiMTE3LzQyLzI2MTkwIjtzOjQ6ImF0b20iO3M6NTA6Ii9tZWRyeGl2L2Vhcmx5LzIwMjIvMDgvMjYvMjAyMS4xMS4wNS4yMTI2NTgxMC5hdG9tIjt9czo4OiJmcmFnbWVudCI7czowOiIiO30=) 41. [41].Wilke, C. O. & Bergstrom, C. T. Predicting an epidemic trajectory is difficult. 
Proceedings of the National Academy of Sciences 117, 28549–28551 (2020). URL [https://doi.org/10.1073/pnas.2020200117](https://doi.org/10.1073/pnas.2020200117). [FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiRlVMTCI7czoxMToiam91cm5hbENvZGUiO3M6NDoicG5hcyI7czo1OiJyZXNpZCI7czoxMjoiMTE3LzQ2LzI4NTQ5IjtzOjQ6ImF0b20iO3M6NTA6Ii9tZWRyeGl2L2Vhcmx5LzIwMjIvMDgvMjYvMjAyMS4xMS4wNS4yMTI2NTgxMC5hdG9tIjt9czo4OiJmcmFnbWVudCI7czowOiIiO30=) 42. [42].Harvey, A. Time series modelling of epidemics: Leading indicators, control groups and policy assessment. National Institute Economic Review 257, 83–100 (2021). 43. [43].McDonald, D. J. et al. Can auxiliary indicators improve COVID-19 forecasting and hotspot prediction? Proceedings of the National Academy of Sciences 118, e2111453118 (2021). URL [https://www.pnas.org/doi/abs/10.1073/pnas.2111453118](https://www.pnas.org/doi/abs/10.1073/pnas.2111453118). [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6NDoicG5hcyI7czo1OiJyZXNpZCI7czoxODoiMTE4LzUxL2UyMTExNDUzMTE4IjtzOjQ6ImF0b20iO3M6NTA6Ii9tZWRyeGl2L2Vhcmx5LzIwMjIvMDgvMjYvMjAyMS4xMS4wNS4yMTI2NTgxMC5hdG9tIjt9czo4OiJmcmFnbWVudCI7czowOiIiO30=) 44. [44].Fuhrmann, J. & Barbarossa, M. The significance of case detection ratios for predictions on the outcome of an epidemic - a message from mathematical modelers. Archives of Public Health 78, article number 63 (2020). 45. [45].Berndt, C., Hametner, M., Kruse, B., Müller-Hansen, S. & Witzenberger, B. Ist die dritte Welle überstanden? Süddeutsche Zeitung (2021). Published online, 4 May 2020, [https://www.sueddeutsche.de/gesundheit/corona-infektionen-trendwende-modellierungen-1.5284545](https://www.sueddeutsche.de/gesundheit/corona-infektionen-trendwende-modellierungen-1.5284545). 46. 
[46].Lerch, S., Thorarinsdottir, T. L., Ravazzolo, F. & Gneiting, T. Forecaster’s dilemma: Extreme events and forecast evaluation. Statistical Science 32, 106–127 (2017). 47. [47].Coibion, O. & Gorodnichenko, Y. What can survey forecasts tell us about information rigidities? Journal of Political Economy 120, 116–159 (2012). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1086/665662&link_type=DOI) 48. [48].Rakowski, F., Gruziel, M., Bieniasz-Krzywiec, L. & Radomski, J. P. Influenza epidemic spread simulation for Poland – a large scale, individual based model study. Physica A: Statistical Mechanics and its Applications 389, 3149–3165 (2010). 49. [49].Adamik, B. et al. Mitigation and herd immunity strategy for COVID-19 is likely to fail. medRxiv (2020). URL [https://doi.org/10.1101/2020.03.25.20043109](https://doi.org/10.1101/2020.03.25.20043109). 50. [50].Li, M. L. et al. Forecasting COVID-19 and analyzing the effect of government interventions. medRxiv (2020). URL [https://www.medrxiv.org/content/early/2020/06/24/2020.06.23.20138693](https://www.medrxiv.org/content/early/2020/06/24/2020.06.23.20138693). 51. [51].Barbarossa, M. V. et al. Modeling the spread of COVID-19 in Germany: Early assessment and possible scenarios. PLOS ONE 15, e0238559 (2020). 52. [52].Karlen, D. Characterizing the spread of CoViD-19. arXiv preprint arxiv:2007.07156 (2020). URL [https://arxiv.org/abs/2007.07156](https://arxiv.org/abs/2007.07156). 53. [53].Kheifetz, Y., Kirsten, H. & Scholz, M. On the parametrization of epidemiologic models – lessons from modelling COVID-19 epidemic. Viruses 14, 1468 (2022). 54. [54].Srivastava, A., Xu, T. & Prasanna, V. K. Fast and accurate forecasting of COVID-19 deaths using the SIkJα model. arXiv preprint arxiv:2007.05180 (2020). URL [https://arxiv.org/abs/2007.05180](https://arxiv.org/abs/2007.05180). 55. [55].Abbott, S. et al. Estimating the time-varying reproduction number of SARS-CoV-2 using national and subnational case counts. 
Wellcome Open Research 5 (2020). URL [https://doi.org/10.12688/wellcomeopenres.15842.3](https://doi.org/10.12688/wellcomeopenres.15842.3). 56. [56].Krymova, E. et al. Trend estimation and short-term forecasting of COVID-19 cases and deaths worldwide. Proceedings of the National Academy of Sciences of the USA 119, e2112656119 (2022). 57. [57].Burgard, J. P., Heyder, S., Hotz, T. & Krueger, T. Regional estimates of reproduction numbers with application to COVID-19. arXiv preprint arxiv:2108.13842 (2021). URL [https://arxiv.org/abs/2108.13842](https://arxiv.org/abs/2108.13842). 58. [58].Castro, L., Fairchild, G., Michaud, I. & Osthus, D. COFFEE: COVID-19 forecasts using fast evaluations and estimation. arXiv preprint arxiv:2110.01546 (2021). URL [https://arxiv.org/abs/2110.01546](https://arxiv.org/abs/2110.01546). 59. [59].Bosse, N. I. et al. Comparing human and model-based forecasts of COVID-19 in Germany and Poland. medRxiv (2021). URL [https://www.medrxiv.org/content/early/2021/12/05/2021.12.01.21266598](https://www.medrxiv.org/content/early/2021/12/05/2021.12.01.21266598). 60. [60].Bhatia, S. et al. Global predictions of short- to medium-term COVID-19 transmission trends: a retrospective assessment. medRxiv (2021). URL [https://www.medrxiv.org/content/early/2021/07/22/2021.07.19.21260746](https://www.medrxiv.org/content/early/2021/07/22/2021.07.19.21260746). 61. [61].Bracher, J., Wolffram, D., & the German and Polish COVID-19 Forecast Hub Team. Codes underlying the analyses in Bracher, Wolffram et al: National and subnational short-term forecasting of COVID-19 in Germany and Poland during early 2021. Available online, [https://zenodo.org/record/5639514#.Yv5fUmFBxH5](https://zenodo.org/record/5639514#.Yv5fUmFBxH5), [https://doi.org/10.5281/zenodo.5639514](https://doi.org/10.5281/zenodo.5639514), last accessed on 18 August 2022. (2022). 62. [62].Zeileis, A. et al. colorspace: A Toolbox for Manipulating and Assessing Colors and Palettes. 
Journal of Statistical Software 96(1), 1–49 (2020). URL [https://doi.org/10.18637/jss.v096.i01](https://doi.org/10.18637/jss.v096.i01). 63. [63].Lemon, J. Plotrix: a package in the red light district of R. R News 6(4), 8–12 (2006). URL [https://cran.r-project.org/doc/Rnews/Rnews_2006-4.pdf](https://cran.r-project.org/doc/Rnews/Rnews_2006-4.pdf). 64. [64].Dahl, D.B. et al. xtable: Export Tables to LaTeX or HTML. R package version 1.8-4, URL [https://cran.r-project.org/web/packages/xtable/](https://cran.r-project.org/web/packages/xtable/). (2019). 65. [65].Zeileis, A. & Grothendieck, G. zoo: S3 Infrastructure for Regular and Irregular Time Series. Journal of Statistical Software 14(6), 1–27 (2005). URL [https://doi.org/10.18637/jss.v014.i06](https://doi.org/10.18637/jss.v014.i06). [1]: /embed/graphic-2.gif [2]: /embed/graphic-3.gif