Harnessing the Wisdom of the Crowd to Forecast Incident and Cumulative COVID-19 Mortality in the United States ============================================================================================================== * Kathryn S. Taylor * James W Taylor ## Abstract **Background** Forecasting models have played a pivotal role in decision making during the COVID-19 pandemic, predicting the numbers of cases, hospitalisations and deaths. However, questions have been raised about the role and reliability of models. The aim of this study was to investigate the potential benefits of combining probabilistic forecasts from multiple models for forecasts of incident and cumulative COVID mortalities. **Methods** We considered 95% interval and point forecasts of weekly incident and cumulative COVID-19 mortality between 16 May 2020 and 8 May 2021 in multiple locations in the United States. We compared the accuracy of simple and more complex combining methods, as well as individual models. **Results** The average of the forecasts from the individual models was consistently more accurate than the average performance of these models, which provides a fundamental motivation for combining. Weighted combining performed well for both incident and cumulative mortalities, and for both interval and point forecasting. Inverse score with tuning was the most accurate method overall. The median combination was a leading method in the last quarter for both mortalities, and it was consistently more accurate than the mean combination for point forecasting of both mortalities. For interval forecasts of cumulative mortality, the mean performed better than the median. The leading individual models were most competitive for point forecasts of incident mortality. **Conclusions** We recommend that harnessing the wisdom of the crowd can improve the contribution of probabilistic forecasting of epidemics to health policy decision making, and report that the relative performance of the different combining methods depends on several factors including the type of data, type of forecast and timing. ## Introduction The coronavirus disease-2019 (COVID-19) pandemic has created major planning and resource allocation challenges, as well as pressures on health services, which have prompted governments to impose extreme restrictions in attempts to control the spread of the virus (1-3). These measures have resulted in multiple problems beyond COVID-19, including increased hospital treatment delays, damage to economies, higher levels of unemployment, declining mental health, and widening of the pre-existing health and educational disparities, which will persist beyond the rollout of vaccines (4, 5). Subsequently, the pandemic has generated intense debate among experts about the best way forward (6). Governments and their advisors have relied upon forecasts from models of the numbers of COVID-19 cases, people hospitalized and deaths to help decide what actions to take (7). However, using modelling to lead health policy during the pandemic has been controversial and criticised on various grounds, including the overreliance on models and questionable model assumptions. Nevertheless, it is recognised that models are potentially valuable tools when used appropriately (2, 8-10). The most useful models are parsimonious, that is, including only sufficient detail to answer a particular question (11), and parameterised with evidence-based data, rather than being based on assumptions. Using models that provide frequent forecasts will incorporate the latest evidence into the model estimates, as well as realign with the latest mitigating measures by governments and the responses of their populations. Furthermore, forecasts will be more nuanced if modelling is carried out at the local, rather than national level (2). Models may be constructed for prediction or scenario analysis. For example, the model from Imperial College London that drove the United Kingdom and United States (U.S.) governments to impose the first lockdowns assumed scenarios where between 50% and 75% of people would comply with the government restrictions (12). Extreme assumptions may be made to provide policy insight into a broad set of possible scenarios (13). When employed for prediction, models forecast the most likely outcome in the current circumstances. Different models are based on different approaches to answering the same question, and conflicting forecasts may arise. Rather than questioning which model is best (13), a forecast combination can be used to harness the wisdom of the crowd (14). Combining produces a collective forecast from multiple models that is typically more accurate than forecasts from individual models. The mean combination (simple average of all the forecasts) is an example of a combined forecast, which is often used and hard to beat (15, 16). Forecast combining has many advantages (17). It synthesises information underlying different prediction methods in a pragmatic way, diversifying the risk inherent in relying on an individual model, and it can offset the statistical bias associated with individual models, potentially cancelling out overestimation and underestimation from individual models. In many applications, from economics and business (18, 19), to weather and climate prediction (20, 21), the advantages of forecast combining are well-established. This has encouraged the more recent application of combining to the prediction of infectious diseases (22-26). When considering forecasting, attention is often placed on point forecasts, but they have inherent uncertainty, and subsequently, there has been increasing calls for probabilistic forecasting (9, 22). An interval forecast is a common form of probabilistic forecast, which conveys the uncertainty in an intuitively appealing way (27). For example, a 95% interval forecast is a range that will, ideally, contain the true value with 95% probability. Previously, we compared the accuracy of probabilistic forecast combining methods applied to predictions of weekly cumulative COVID-19 mortality in U.S. locations over the 40-week period up to 23 January 2021 (26). In this paper, we extend our earlier study by considering both weekly cumulative and incident COVID-19 mortalities, where incident deaths for a week is defined as the number of deaths occurring in that week. The numbers of cumulative deaths and incident deaths are clearly related, but although observed values of one can be derived from the other, the same is not true for probabilistic forecasts beyond one step-ahead. Furthermore, different sets of models may be used to predict incident and cumulative mortality. We consider forecast combining for point forecasts and 95% interval forecasts for the 52-week period up to 15 May 2021, which constitutes a longer and more recent period of data than in our previous work. Additional differences from our earlier study are that we report the results for individual models in our comparison, and assess the impact of reporting delays of death counts on forecast accuracy. ## COVID-19 mortality dataset ### Data sources Forecasts of weekly incident and cumulative COVID-19 mortalities were downloaded from the COVID-19 Forecast Hub (23). The Hub is an ongoing collaboration between the U.S. Centers for Disease Control and Prevention (CDC), with forecasting teams from academia, industry and government-affiliated groups (28). Teams are invited to submit forecasts of incident and cumulative mortalities for 1, 2, 3 and 4-week horizons, in the form of estimates of quantiles corresponding to 23 probability points along the probability distribution of possible values, and also point forecasts. Forecasts of the 2.5% and 97.5% quantiles bound a 95% interval forecast. In this paper, in addition to 95% interval forecasts, we consider point forecasts, which, for each model, we obtain as the 50% quantile (i.e., the median) at the centre of the probability distribution. The numbers of actual incident COVID-19 deaths each week were inferred from the reference data from the Hub on the numbers of actual weekly cumulative COVID-19 deaths. These data are provided by the Centre for Systems Science and Engineering (CSSE) at John Hopkins University (28). ### Analysis dataset The dataset for our analysis included forecasts projected from forecast origins each week between the weeks ending Saturday 16 May 2020 and Saturday 8 May 2021 (52 weeks of data). The COVID-19 Forecast Hub numbers these weeks as Epidemic Weeks 21 to 72, where Week 0 was the week ending 21 December 2019. These forecasts were compared with actual weekly COVID-19 mortality up to the week ending 15 May 2021 (Week 73). We studied forecasts of COVID-19 mortality for the whole of the U.S. and 51 U.S. jurisdictions, which included the 50 states and the District of Columbia. For simplicity, in the rest of the paper, we refer to the jurisdictions as states. For each state, Figs. S1 and S2 in the supplementary information show the numbers of weekly incident and cumulative COVID-19 deaths respectively, from Weeks 22 to 73. These figures show different histories of the pandemic across the states. The Hub carries out various screening tests for inclusion in their forecast combination, which they refer to as their ‘ensemble’ model. We included forecasts that passed the Hub’s screening tests, as well as forecasts that were not submitted in time to be screened. We followed the Hub by excluding, for any given week, a model for which forecasts for all 23 quantiles and for all four forecast horizons were not provided. The Hub also excluded forecasts deemed to be outlying. We did not exclude outliers, primarily because the actual number of COVID-19 deaths in previous weeks may have been updated, and therefore the assessment of outliers in the past, by the Hub, would not be consistent with our retrospective assessments of outliers at Week 73. Delays in reporting COVID-19 deaths is a well-recognised problem. Updates, typically increases, which may involve sharp increases in death counts, will result in forecasting models underestimating, and when updates are backdated, this will lead to forecasting methods being penalised in retrospective analyses of forecast accuracy. Updates that decrease death counts will produce overestimates. We downloaded 13 data files of actual death counts at different points during our 53-week study period (one file each at Weeks 26, 30, 32, 34, 37, 41, 43, 45, 47, 54, 59, 66 and 73). We observed updates to the historical death counts in 14 locations between Weeks 26 and 73 (Alaska, California, Delaware, Indiana, Ohio, Oklahoma, Missouri, New Jersey, New York, Rhode Island, Texas, Washington, Wisconsin and the U.S.). Fig. 1 presents superimposed plots of the numbers of cumulative COVID-19 deaths, using the 13 data files, for six states in which the effects of updates were particularly notable. We evaluate the effect of reporting delays on forecast accuracy when we compare the forecasting methods. ![Fig. 1](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/07/16/2021.07.11.21260318/F1.medium.gif) [Fig. 1](http://medrxiv.org/content/early/2021/07/16/2021.07.11.21260318/F1) Fig. 1 Numbers of reported deaths in states where there were notable effects of reporting delays: (a) Delaware, (b) Indiana, (c) Ohio, (d) Rhode Island, (e) Oklahoma and (f) West Virginia. Updates with backdating (a-d) are shown as deviations (dashed lines) from the most recent plot of cumulative COVID-19 mortality reported at Week 73 (black lines), and updates without backdating are shown as a sharp upward step (e) and a decrease (f) between Weeks 66 and 73. Our analysis included forecasts from 53 individual forecasting models and the Hub’s ensemble model. Details about these models are given in the appendix in the supplementary information. In the early weeks of our dataset, the traditional SEIR (susceptible-exposed-infected-removed) compartmental models were in the majority, but as the weeks passed, other types of models became more common (Fig. 2). These other models involved various statistical techniques and methods including neural networks, agent-based models, time series modelling, ridge regression and curve fitting techniques. ![Fig. 2](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/07/16/2021.07.11.21260318/F2.medium.gif) [Fig. 2](http://medrxiv.org/content/early/2021/07/16/2021.07.11.21260318/F2) Fig. 2 Number and types of models at each forecast origin for each mortality dataset Fig. 3 shows the timeline of forecasts from each model, illustrating the extent of missing data across the 52 locations, including the frequent ‘entry and exit’ of forecasting teams. This figure also shows that there were differences between the sets of models providing forecasts of incident and cumulative mortality. Screening by the Hub removed 14.4% of forecasts of incident deaths and 22.3% of forecasts of cumulative deaths. The large amount of missing data was such that we felt imputation would not be appropriate. ![Fig. 3](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/07/16/2021.07.11.21260318/F3.medium.gif) [Fig. 3](http://medrxiv.org/content/early/2021/07/16/2021.07.11.21260318/F3) Fig. 3 Extent of missing data for forecasts of incident mortality (left) and cumulative mortality (right) from Models 1 to 53 for forecast origins between Epidemic Weeks 21 and 72 ### Comparison of forecast accuracy #### Forecast evaluation methods The accuracy of the 95% interval forecasts was evaluated in terms of calibration and the interval score. Calibration was assessed by the percentage of actual deaths that fell within the bounds of the interval forecasts, with the ideal being 95%. The interval score was calculated by the following expression (29, 30): ![Formula][1] where *l**t* is the interval’s lower bound, *u**t* is its upper bound, *y**t* is the observation in period *t, I* is the indicator function (1 if the condition is true and 0 otherwise), and, for a 95% interval, *α* =5%. The bounds *l**t* and *u**t* are the 2.5% and 97.5% quantiles, respectively. Lower values of the interval score reflect greater interval forecast accuracy, and for this study, the unit of the quantile score is the number of deaths. The score rewards narrow intervals, with observations that fall outside the interval incurring a penalty, the magnitude of which depends on the value of *α* (29). The accuracy of the point forecasts was evaluated using the absolute value of the forecast errors. Averaging each of these two scores across an out-of-sample period provides two measures of forecast accuracy – the mean absolute error (MAE) and the mean interval score (MIS). In most of our reporting of the results, we average across horizons. We do this for conciseness and because we are using a relatively short out-of-sample period, which is a particular problem when evaluating forecasts of extreme quantiles. To show the consistency across horizons, we present results by horizon for interval forecasts. For these MIS results, we carried out statistical tests for the difference in forecast accuracy of different methods and models. We adapted the Diebold-Mariano test (31) in order to test across multiple series, and this was applied to the results of each prediction horizon separately. To summarise results averaged across the four horizons, we were unable to use the Diebold-Mariano test, so we applied the statistical test proposed by Koning et al. (32), which, for each method, compares the rank, averaged across multiple series, with the corresponding average rank of the most accurate method. Statistical testing was based on a 5% significance level. #### Interval forecast combining methods The comparison included several combining methods that do not rely on the availability of records of past accuracy. These methods are useful in the early stages of a pandemic, and in later stages when a new forecasting team starts to submit forecasts, or there is an uneven record of past accuracy among the different models, which is the case in our study. Fig. 4 presents a visual summary of these combining methods. Combining is applied to each interval bound separately. The *mean* combination (a, in Fig.4) and the *median* combinations (b, in Fig.4) are two well established benchmark methods (33-35). The median has the appeal of being robust to outliers. ![Fig. 4](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/07/16/2021.07.11.21260318/F4.medium.gif) [Fig. 4](http://medrxiv.org/content/early/2021/07/16/2021.07.11.21260318/F4) Fig. 4 Illustration of non-score-based combining methods of interval forecasts from five individual models, showing (a) mean combination, (b) median combination, (c) symmetric trimmed mean, (d) asymmetric exterior trimmed mean, (e) asymmetric interior trimmed mean and (f) envelope. Each interval forecast is represented by a different coloured pair of dots. More novel methods of combining involve trimming (excluding) a particular percentage *β* of forecasts of each interval bound (shown by shading in Fig. 4), and then averaging the remaining forecasts of that bound. *Symmetric trimming* (c, in Fig.4) deals with outliers. For each bound, it involves trimming the *N* lowest-valued and *N* highest-valued forecasts, where *N* is the largest integer less than or equal to the product of *β*/2 and the total number of forecasts (36). The median combination is an extreme form of symmetric trimming. We also implemented asymmetric exterior trimming, asymmetric interior trimming, and the envelope methods (33). *Asymmetric exterior trimming* (d, in Fig.4) is suitable for addressing underconfidence, which is reflected by overly wide intervals. It involves removing the *N* lowest-valued lower bound forecasts, as well as the *N* highest-valued upper bound forecasts, where *N* is the largest integer less than or equal to the product of *β* and the number of forecasts. When trimming resulted in a lower bound being above the upper bound, we replaced the two bounds by their average. *Asymmetric interior trimming* (e, in Fig.4) deals with overconfidence, which is reflected by overly narrow intervals. For this, we removed the *N* highest-valued lower bound forecasts and the *N* lowest-valued upper bound forecasts, where *N* is defined as for exterior trimming. The *envelope* method (f, in Fig.4) is an extreme form of interior trimming, whereby the interval is constructed using the lowest-valued lower bound forecast and highest-valued upper bound forecast. For our comparison, we divided the 52 weeks, from which forecasts were made, into an expanding in-sample period, starting with length 13 weeks, and a 39-week out-of-sample period. For each week, location and method, we optimised the trimming percentage *β* by finding the value that minimised the sum of the MIS averaged over all four horizons and all periods up to and including the latest forecast origin. Optimisation occurred each week, using the expanding in-sample period, to produce combined forecasts for the 39 weeks of the out-of-sample period. We included the COVID-19 Hub’s *ensemble* forecast. This was initially the mean combination of the forecasts that they considered to be eligible, but in late July, 12 weeks into our 52-week analysis period, it became the median combination of these forecasts (23). The use of eligibility screening implies that the ensemble is constructed with the benefit of a degree of subjective trimming. The ensemble provided forecasts of cumulative mortality for all of our 52-week period, while for incident mortality, forecasts were available from all except the first three weeks of the 52-week period. In our comparison of methods, we also included three methods that considered the accuracy of the forecasts from the individual models. The first of these three methods simply selected the model with the best historical accuracy (37). We refer to this method as *previous best*. For this method, the interval forecast was obtained from the model for which the MIS was the lowest when computed using the weeks up to and including the current forecast origin (i.e., the in-sample period). We also investigated two weighted average combinations, which have similarities with inverse-variance weighting, which is a common approach used in meta-analysis (38). For one weighted method, the *inverse interval score* method, the weights were inversely proportional to the MIS (26). For the other weighted method, *inverse interval score with tuning*, a tuning parameter, *λ* > 0, is incorporated to control the influence of the score on the combining weights, using the following expression for the weight on forecasting model *i* at forecast origin *t* (26): ![Formula][2] where *MIS**i,t* is the historical MIS computed at forecast origin *t* from model *i*, and *J* is the number of forecasting models included in the combination. If *λ* is close to zero, the combination reduces to the mean combination, whereas a large value for *λ* leads to the selection of the model with best historical accuracy. The parameter *λ* was optimised using the same expanding in-sample periods, as for the trimming combining methods. Due to the extent of missing forecasts (Fig.3), we pragmatically computed *MIS**i,t* using all available past forecasts, rather than limit the computation to periods for which forecasts from all models were available. For the models for which forecasts were not available for at least 5 past periods, we set *MIS**i,t* to be equal to the mean of *MIS**i,t* for all other models. An alternative approach, employed in (26), is to omit from the combination any model for which there is only a very short or non-existent history of accuracy available. The disadvantage of this is that it omits potentially useful forecast information, and this was supported by our empirical forecasting results. #### Point forecast combining methods For the point forecasts, we considered analogous combining methods to those for the interval forecasts, with the exception of the asymmetric interior trimming, asymmetric exterior trimming and envelope combining methods, which are only of use for interval forecast combining. #### Inclusion of individual models in the comparison Alongside the results for the combining methods, we also report the results for any individual model for which forecasts were available for all 52 locations and all 39 out-of-sample periods. There were six models fulfilling this criterion for incident mortality (numbered 1, 14, 20, 21, 22 and 33), and two models for cumulative mortality (numbered 21 and 33). The numbering of models corresponds to that shown in Fig.3. ## Empirical results ### Performance of methods overall We averaged the MIS and MAE across all four horizons and all 52 locations: the 51 states and the whole U.S. As the unit of these two scores was deaths, to avoid scores for some locations dominating, we also present results averaged for the following four categories: high, medium and low mortality states, and the U.S. as a whole. The high, medium and low categories were decided by ordering the 51 states by the number of cumulative deaths at the end of the final week of our dataset, and then dividing the states into three groups of 17 states. Tables 1 and 2 present the MIS and mean ranks for 95% interval forecasts, for each method for the 32-week out-of-sample period for incident and cumulative mortality, respectively. Tables 3 and 4 present the corresponding MAE results for point forecasts. View this table: [Table 1](http://medrxiv.org/content/early/2021/07/16/2021.07.11.21260318/T1) Table 1 For 95% interval forecasts of incident mortality, MIS and mean ranks for the 39-week out-of-sample period. Boxed numbers indicate the best method in each column. View this table: [Table 2](http://medrxiv.org/content/early/2021/07/16/2021.07.11.21260318/T2) Table 2 For 95% interval forecasts of cumulative mortality, MIS and mean ranks for the 39-week out-of-sample period. Boxed numbers indicate the best method in each column. View this table: [Table 3](http://medrxiv.org/content/early/2021/07/16/2021.07.11.21260318/T3) Table 3 For point forecasts of incident mortality, MAE and mean ranks for the 39-week out-of-sample period. Boxed numbers indicate the best method in each column. View this table: [Table 4](http://medrxiv.org/content/early/2021/07/16/2021.07.11.21260318/T4) Table 4 For point forecasts of cumulative mortality, MAE and mean ranks for the 39-week out-of-sample period. Boxed numbers indicate the best method in each column. Considering methods that performed well in terms of either being the leading method or competitive against the leading method, inverse score combining performed best overall, with the method benefitting from the incorporation of tuning (Tables 1 to 4). The poorest results were produced by the envelope method (Tables 1 to 4). The asymmetric interior trimmed mean performed well for interval forecasting for low and medium mortality states, producing the most accurate interval forecasts for the low mortality category for both the incident and cumulative mortality data (Tables 1 and 2). The leading individual model (Model 33) performed well in forecasting the numbers of deaths for all series, the U.S. as a whole, and for high mortality states (Tables 1 to 4). Its performance was stronger in point forecasting than interval forecasting, and it was more accurate in forecasting incident deaths than cumulative deaths (Tables 1 to 4). The final row in each of Tables 1 to 4 does not provide the forecasting results for a method. Instead, it reports the average score of all the individual methods. In each table, we see that this is substantially worse than the performance of the mean combining method (i.e., the average performance is much worse than the performance of the average). This gives fundamental support for forecast combining, because the average performance can be viewed as the statistical expectation for the performance of an individual model, chosen when there is no information regarding their accuracy, which was the case at the start of the pandemic. Comparing the two simple combining methods, for interval forecasts of incident deaths, the median was more accurate than the mean for all series and for high mortality states, whilst the performance was similar for medium and low mortality states (Table 1). For interval forecasts of cumulative mortality, the mean was more accurate than the median across all categories (Table 2). The median was consistently more accurate than the mean for point forecasts of both cumulative and incident mortality (Tables 3 and 4). Tables S1 and S2 in the supplementary information provide a broad summary of the results of Tables 1 to 4 in terms of the frequency with which a method is ranked in the top three, based on the scores and mean ranks, respectively, for the different types of mortality and forecast. These tables reflect the dominance of the inverse score methods, for both mortalities and both forecast types, although the superiority was less for point forecasts. The inverse score with tuning is best overall according to both the scores and mean ranks. Looking at the results for the statistical testing of the mean ranks, we see that the methods that performed poorly were identified as being statistically significantly worse than the best methods. Tables 1 to 4 report results averaged across the four forecast horizons (1 to 4 weeks ahead). We found similar relative performances of the methods when looking at each forecast horizon separately (see Tables S3 and S4 in the supplementary information). ### Changes in performance over time Tables 5 and 6 show the MIS and MAE for the four quarters of our 1-year sample of data. Note that, for the first 13-week period, results were not obtainable for the methods for which a previous in-sample period was needed to estimate method parameters. The relative performances of the methods in Tables 5 and 6 are reasonably consistent with the results in Tables 1 to 4, where we averaged across the 39-week out-of-sample period. In Tables 5 and 6, we note the much higher values of the scores in the third quarter, caused by the much higher levels of mortality during that period. It is interesting to see from Tables 5 and 6 that, for forecasts of both incident and cumulative mortalities, the leading methods in the most recent 13-week period were the median combination and the ensemble, which, after the first quarter, was computed using the median. In the second and third quarters, the mean was more accurate than the median for both incident and cumulative mortalities. View this table: [Table 5](http://medrxiv.org/content/early/2021/07/16/2021.07.11.21260318/T5) Table 5 For incident mortality, scores for each quarter of our 52-week dataset. The unit of the scores is deaths. For each quarter, boxed numbers indicate the best method for each column. View this table: [Table 6](http://medrxiv.org/content/early/2021/07/16/2021.07.11.21260318/T6) Table 6 For cumulative mortality, scores for each quarter of our 52-week dataset. The unit of the scores is deaths. For each quarter, boxed numbers indicate the best method in each column. ### Effect of type of model on performance Tables 7 and 8 compare the accuracy of the interval forecasts from compartmental models and non-compartmental models of incident and cumulative mortalities, respectively. For most combining methods, for the category including all series, the combined forecasts of incident mortality from forecasts of compartmental models were more accurate than the combined forecasts from forecasts of all models (Table 7), while for most combining methods for cumulative mortality, the reverse was true (Table 8). View this table: [Table 7](http://medrxiv.org/content/early/2021/07/16/2021.07.11.21260318/T7) Table 7 For 95% interval forecasts of incident mortality, MIS for all models, compartmental models and non-compartmental models. The unit of the scores is deaths. Boxed numbers indicate the best method in each series. View this table: [Table 8](http://medrxiv.org/content/early/2021/07/16/2021.07.11.21260318/T8) Table 8 For 95% interval forecasts of cumulative mortality, MIS for all models, compartmental models and non-compartmental models. The unit of the scores is deaths. Boxed numbers indicate the best method in each series. ### Performance of methods at the state level Fig. 5 presents the MIS for the overall leading method (combining with inverse score with tuning), the popular simple benchmark (the mean combination) and the leading model (Model 33) for forecasts at the individual state level, of incident and cumulative mortalities. In each plot, the states are ordered on the x-axis by the number of cumulative deaths at the end of the final week of our dataset (from low to high). Lower scores are better. ![Fig. 5](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/07/16/2021.07.11.21260318/F5.medium.gif) [Fig. 5](http://medrxiv.org/content/early/2021/07/16/2021.07.11.21260318/F5) Fig. 5 MIS for 95% interval forecasts of incident and cumulative mortality at (a) high, (b) medium and (c) low mortality U.S. locations for the inverse score with tuning combining (black line), mean combining (dashed line) and Model 33 (dotted line). Fig.6 shows the corresponding plots for the MAE. Model 33 is shown, across the states, as being more competitive with the inverse score with tuning in point forecasting (Fig.6) compared to interval forecasting (Fig. 5). The adverse effects of reporting delays of death counts (highlighted in Fig. 1) on forecast performance is apparent for the states of Ohio, Oklahoma, Delaware and West Virginia, with both combining methods and the individual model underperforming for these states. Figs. 5 and 6 also show the mean combination having similar performance to the leading method for most states in the medium and low mortality categories for both interval and point forecasts of both incident and cumulative mortalities. This is not reflected in Tables 1 to 4 due to adverse effects of reporting delays on performance being greatest for the mean, particularly in the states of Ohio (Figs. 5, 7 and 8) and Delaware (Fig.6). The impact of reporting delays on interval forecasts differed across mortalities. In particular, the inverse score with tuning method was less affected than the mean and Model 33 for forecasts of incident mortality in Ohio (a, in Fig.5, Incident mortality) and forecasts of cumulative mortality in Delaware (c, in Fig.6). Also, the performance of the mean was adversely affected for forecasts of cumulative mortality in Delaware (c, in Fig. 6) but not forecasts of incident mortality in this state (c, in Fig. 5, Incident mortality). The adverse effects of reporting delays on the performance of the other simple benchmark, the median, were lower (not shown). ![Fig. 6](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/07/16/2021.07.11.21260318/F6.medium.gif) [Fig. 6](http://medrxiv.org/content/early/2021/07/16/2021.07.11.21260318/F6) Fig. 6 MAE for point forecasts of incident and cumulative mortality at (a) high, (b) medium and (c) low mortality U.S. locations, for the inverse score with tuning combining (black line), mean combining (dashed line) and Model 33 (dotted line). As a sensitivity analysis, we evaluated the impact on our results when excluding all the six states, shown in Fig.1, that had notable effects of reporting delays on death counts. The revised results corresponding to those presented in Tables 1 and 2 are shown in Tables S5 and S6, respectively, in the supplementary information. There were improvements in the performance for all methods, and only slight changes in the ranking of methods. The relative performance of the simple combining methods (mean and median) improved, in comparison with inverse score with tuning. With regard to our significance testing results, there were no changes in our conclusions for the methods that were significantly worse than the method with the best mean rank. ### Calibration of interval forecasting methods The calibration of the interval forecasting methods for incident and cumulative mortalities is summarised in Table 9. The table shows that there was a general tendency for underestimation in the widths of the 95% intervals. This is quite common in studies of interval forecasting (see, for example, (39)). When models underestimate the interval width, combining with the asymmetric interior trimmed mean is useful, as it leads to wider than average intervals. It is, therefore, not surprising that this combining method performs well in Table 9. The envelope method also performed well, which is understandable, because this method also leads to wider than average intervals (as shown in Fig 4). However, this method is not particularly appealing, as it seems likely to deliver intervals that are too wide, and this can be seen in Table 9. We should also note that very sizeable overestimation will lead to a result of 100% for the calibration in Table 9, which is quite close to the ideal of 95%, while sizeable underestimation can lead to calibration far from 95%. This highlights a limitation of calibration for evaluating interval forecasts, and supports our far more extensive use of the interval score in this paper. In Table 9, we also note the relatively good performance of the inverse score methods, which were the best performing combining methods overall, when judged in terms of the MIS. View this table: [Table 9](http://medrxiv.org/content/early/2021/07/16/2021.07.11.21260318/T9) Table 9 Average calibration of the 95% interval forecasts for incident and cumulative mortality in the 39 week out-of-sample period. Boxed numbers indicate the best method in each column. Ideal is 95%. ## Discussion We have provided an empirical comparison of combining methods for point and interval forecasts of incident and cum0075lative mortality due to COVID-19. Our main findings are that weighted combining performed well for both mortalities, and for both interval and point forecasting. Inverse score with tuning was the most accurate method overall. In the most recent quarter, the median combination and ensemble were leading methods for both mortalities and both interval and point forecasts. The median performed better than the mean in point forecasting of both mortalities. The mean was better than the median in interval forecasts of cumulative mortality, while for interval forecasts of incident deaths, neither was notably better than the other. We were able to include only a few individual models in the comparison, and of these, only two were competitive against the leading combining methods. The leading individual model performed well, particularly for forecasts for the high mortality category, and this model was most accurate in point forecasting incident mortality. For most combining methods, overall, the combined incident forecasts from compartmental models were more accurate than the combined forecasts from all models, whilst the reverse was the case for forecasts of cumulative mortality. The adverse effects on performance of updates in death counts, due to reporting delays, was greatest for the mean combination, with different impacts on the predictions of incident and cumulative mortalities, but, overall, the effects on the comparison were minor. Drawing comparisons with other studies, this research follows our earlier analysis of point and 95% interval forecasts of cumulative COVID-19 mortality from Weeks 18 to 57 (26). Our current results are based on a later and extended period from Weeks 21 to 72. The overall superiority of the inverse score methods and the relative performance of the mean and the median combinations are consistent with the findings from our earlier analysis of forecasting cumulative mortality, although our current analysis shows that, later in the pandemic, the median becomes more dominant than the mean. In contrast with our earlier study, we have considered individual models and the impact of reporting delays on forecast accuracy. We have highlighted some similarities of results of forecasts of incident mortality and cumulative mortality, and several differences, in terms of the relative performance of the mean and median, the performance of individual models, the impacts of model diversity and reporting delays on forecast accuracy. Our conclusion that forecasts from the combining methods are able to outperform forecasts from individual models is consistent with findings in other studies within and outside the field of epidemic prediction (18-22, 40). The strengths of this study include our consideration of two sets of data, cumulative and incident mortalities, extending our earlier analysis, which considered only cumulative deaths. Another study strength relating to the scope of this study is our comparison of a broad range of forecasting methods involving individual models, simple standard benchmark methods and more complex combinations based on trimming or weighting according to historical accuracy. A further study strength relates to our choice of data. The dataset of forecasts, produced for multiple locations from multiple models, presented an opportunity to study the wisdom of the crowd. The Hub provided the necessary conditions for the crowd being ‘wise’ (14) and without distortion, such as by social pressure (41) or restrictions against forecasting teams applying their own judgement (23). These conditions include independent contributors, diversity of opinions, and a trustworthy central convener to collate the information provided (14). Study limitations include the retrospective design, being based on the most recent version of the ‘truth data’ for all the weeks at the time of analysis, instead of the numbers of COVID-19 deaths that were reported at the time the forecasts were submitted. We have shown that there were several states for which there were notable effects of updates in death counts, due to reporting delays, and this adversely affected the accuracy of the forecasts of all the combining methods and models. However, this issue only had a minor effect on the relative performances of the methods, and did not alter our overall conclusions. Our reported findings are limited to U.S. data and the forecasts from the COVID-19 Forecast Hub, and so it is possible that different results may arise when applying the combining methods to forecasts from a different set of models, or using other data, such as forecasts for other locations, or predictions of COVID-19 cases or hospitalisations. These are interesting potential avenues for future research. The forecasts in our dataset were produced weekly for 1 to 4 week ahead horizons, and we acknowledge that conclusions could be different for different time-scales. Our ability to detect statistical differences was limited by the small sample sizes, with only 17 locations in each category, missing data and a relatively short out-of-sample period. This research has important policy implications as forecasts from models have been placed at the forefront of public health decision making during the COVID-19 crisis. The reliance of governments on forecasts from COVID-19 models have brought these models under increased scrutiny. The limitations of the models and the need to appreciate their limits have been discussed extensively (2, 8, 9, 11, 42). It is suggested that relying on modelling alone leads to “missteps and blind spots”, and that the best approach to support public policy decision making would involve a triangulation of insights from modelling with other information, such as analyses of international case studies and previous outbreaks, policy documents, and discussions with frontline staff (42). It is essential that modelling offers the most accurate forecasts. The associated uncertainty should be accurately represented in forecasts from models. The 95% prediction intervals present a range of possible outcomes, which can be used to support situational awareness (43). Given the benefits of combining, the most accurate probabilistic forecasts will most often be based on multiple models, rather than an individual model, as illustrated by the results of our study. Although individual models can sometimes be more accurate than combined methods, relying on forecasts from combined methods provides a more risk-averse approach, as the best model will not be clear until records of historical accuracy are available, and also the best performing model will typically change over time. In particular, our finding that the performance of the average (mean combination) was substantially better than the average performance of the individual models suggests that, at the start of an epidemic, when it is not clear which model has the best performance, the statistical expectation is that the average method will score better than a model chosen randomly, or chosen on the basis of no prior history. This study follows on from our previous analysis to provide further confirmation that the weighted methods provided the greatest accuracy, but the performance of the mean and median combination were often reasonable. An obvious advantage of these benchmark methods is that they are simple, transparent and pragmatic approaches to combining forecasts that do not rely on the need to optimise a parameter or collect data on historical accuracy. This is advantageous early in a pandemic, and also later on, if records of historical accuracy are uneven for the individual forecasting models. To conclude, we recommend that for COVID-19 models to play the most effective role in supporting health policy decision making, probabilistic forecasts should be harnessed from multiple models. In our study of mortality data, the intuitive weighted forecasting methods were the most accurate overall, but that did not rule out considering the more pragmatic simple combining methods. We identified that the relative performance of the different combining methods depends on several factors including the type of data, type of forecast and timing. Future research could focus on seeking further clarification of the relative performance of the different methods, by studying combined forecasts for other types of data, such as COVID-19 infections and hospitalisations, combined forecasts in other locations and for diseases beyond the COVID-19 pandemic. Another possible avenue for further research would be to investigate the impact of incorporating combined forecasts into health policy decision making. ## Data Availability This study is based on publically available data from the COVID-19 Forecast Hub. [https://covid19forecasthub.org/](https://covid19forecasthub.org/) ## Funding No funding was received for conducting this study. ## Competing interests The authors have no relevant financial or non-financial interests to disclose. ## Ethical approval No ethical approval was required for this study. ## Data sharing This study is based on publically available data from the COVID-19 Forecast Hub. [https://covid19forecasthub.org/](https://covid19forecasthub.org/) ## SUPPLEMENTARY INFORMATION ![Fig S1](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/07/16/2021.07.11.21260318/F7.medium.gif) [Fig S1](http://medrxiv.org/content/early/2021/07/16/2021.07.11.21260318/F7) Fig S1 Numbers of weekly incident COVID-19 deaths in U.S. locations between weeks ending 23 May 2020 and 15 May 2021 ![Fig S2](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/07/16/2021.07.11.21260318/F8.medium.gif) [Fig S2](http://medrxiv.org/content/early/2021/07/16/2021.07.11.21260318/F8) Fig S2 Numbers of weekly cumulative COVID-19 deaths in U.S. locations between weeks ending 23 May 2020 and 15 May 2021 View this table: [Table S1](http://medrxiv.org/content/early/2021/07/16/2021.07.11.21260318/T10) Table S1 Frequency of being one of the top three methods in the columns for the scores in Tables 1 to 4 View this table: [Table S2](http://medrxiv.org/content/early/2021/07/16/2021.07.11.21260318/T11) Table S2 Frequency of being one of the top three methods in the columns for the mean ranks in Tables 1 to 4 View this table: [Table S3](http://medrxiv.org/content/early/2021/07/16/2021.07.11.21260318/T12) Table S3 For incident mortality, scores in the 39 week out-of-sample period for each of the four lead times. The unit of the scores is deaths. Lower values are better. For each horizon, boxed numbers indicate the best method for each of the following four categories of series: all, U.S., high, medium and low. View this table: [Table S4](http://medrxiv.org/content/early/2021/07/16/2021.07.11.21260318/T13) Table S4 For cumulative mortality, scores in the 39 week out-of-sample period for each of the four lead times. The unit of the scores is deaths. For each horizon, boxed numbers indicate the best method for each of the following four categories of series: all series, U.S., high, medium and low. View this table: [Table S5](http://medrxiv.org/content/early/2021/07/16/2021.07.11.21260318/T14) Table S5 Sensitivity analysis for incident mortality, MIS and mean ranks for 95% interval forecasts for the 39 week out-of-sample period excluding the six U.S. states with notable effects of reporting delays on COVID-19 death counts. Boxed numbers indicate the best method in each column. View this table: [Table S6](http://medrxiv.org/content/early/2021/07/16/2021.07.11.21260318/T15) Table S6 Sensitivity analysis for cumulative mortality, MIS and mean ranks for 95% interval forecasts for the 39 week out-of-sample period excluding the six U.S. states with notable effects of reporting delays on COVID-19 death counts. Boxed numbers indicate the best method in each column. ## Acknowledgements We are very grateful to the forecasting groups who generously made their forecasts available on the COVID-19 Forecast Hub, Nicholas Reich and his team, for acting as curators for the Hub, and Nia Roberts for clarifying our understanding of the license terms for the forecast data. ## Appendix Forecasting models View this table: [Table16](http://medrxiv.org/content/early/2021/07/16/2021.07.11.21260318/T16) * Received July 11, 2021. * Revision received July 11, 2021. * Accepted July 16, 2021. * © 2021, Posted by Cold Spring Harbor Laboratory This pre-print is available under a Creative Commons License (Attribution 4.0 International), CC BY 4.0, as described at [http://creativecommons.org/licenses/by/4.0/](http://creativecommons.org/licenses/by/4.0/) ## References 1. 1.Phelan AL, Katz R, Gostin LO. The Novel Coronavirus Originating in Wuhan, China: Challenges for Global Health Governance. Jama. 2020;323(8):709–10. [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F07%2F16%2F2021.07.11.21260318.atom) 2. 2.Jewell NP, Lewnard JA, Jewell BL. Predictive Mathematical Models of the COVID-19 Pandemic: Underlying Principles and Value of Projections. Jama. 2020;323(19):1893–4. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1001/jama.2020.6585&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=32297897&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F07%2F16%2F2021.07.11.21260318.atom) 3. 3.Looi M-K. Covid-19: Is a second wave hitting Europe? BMJ. 2020;371:m4113. [FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiRlVMTCI7czoxMToiam91cm5hbENvZGUiO3M6MzoiYm1qIjtzOjU6InJlc2lkIjtzOjE3OiIzNzEvb2N0MjhfNi9tNDExMyI7czo0OiJhdG9tIjtzOjUwOiIvbWVkcnhpdi9lYXJseS8yMDIxLzA3LzE2LzIwMjEuMDcuMTEuMjEyNjAzMTguYXRvbSI7fXM6ODoiZnJhZ21lbnQiO3M6MDoiIjt9) 4. 4.Melnick ER, Ioannidis JPA. Should governments continue lockdown to slow the spread of covid-19? BMJ. 2020;369:m1924. [FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiRlVMTCI7czoxMToiam91cm5hbENvZGUiO3M6MzoiYm1qIjtzOjU6InJlc2lkIjtzOjE3OiIzNjkvanVuMDNfNS9tMTkyNCI7czo0OiJhdG9tIjtzOjUwOiIvbWVkcnhpdi9lYXJseS8yMDIxLzA3LzE2LzIwMjEuMDcuMTEuMjEyNjAzMTguYXRvbSI7fXM6ODoiZnJhZ21lbnQiO3M6MDoiIjt9) 5. 5.Policy brief: Education during COVID-19 and beyond [press release]. 2020. 6. 6.Wise J. Covid-19: Experts divide into two camps of action—shielding versus blanket policies. BMJ. 2020;370:m3702. [FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiRlVMTCI7czoxMToiam91cm5hbENvZGUiO3M6MzoiYm1qIjtzOjU6InJlc2lkIjtzOjE4OiIzNzAvc2VwMjFfMTQvbTM3MDIiO3M6NDoiYXRvbSI7czo1MDoiL21lZHJ4aXYvZWFybHkvMjAyMS8wNy8xNi8yMDIxLjA3LjExLjIxMjYwMzE4LmF0b20iO31zOjg6ImZyYWdtZW50IjtzOjA6IiI7fQ==) 7. 7.Adam D. Special report: The simulations driving the world’s response to COVID-19. Nature. 2020;580(7803):316–8. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/d41586-020-01003-6&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=32242115&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F07%2F16%2F2021.07.11.21260318.atom) 8. 8.Holmdahl I, Buckee C. Wrong but Useful — What Covid-19 Epidemiologic Models Can and Cannot Tell Us. New England Journal of Medicine. 2020;383(4):303–5. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1056/NEJMp2016822&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=32412711&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F07%2F16%2F2021.07.11.21260318.atom) 9. 9.Ioannidis JPA, Cripps S, Tanner MA. Forecasting for COVID-19 has failed. International Journal of Forecasting. 2020. 10. 10.Buckee CO, Johansson MA. Individual model forecasts can be misleading, but together they are useful. Eur J Epidemiol. 2020;35(8):731–2. 11. 11.George EPB. Science and Statistics. Journal of the American Statistical Association. 1976;71(356):791–9. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.2307/2286841&link_type=DOI) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=A1976CP07800001&link_type=ISI) 12. 12.Ferguson N, Laydon D, Nedjati-Gilani G, al e. Report 9: Impact of non-pharmaceutical interventions (NPIs) to reduce COVID-19 mortality and healthcare demand 28 Dec 2020. Available from: [http://hdl.handle.net/10044/1/77482](http://hdl.handle.net/10044/1/77482). 13. 13.Panovska-Griffiths J. Coronavirus: we’ve had ‘Imperial’, ‘Oxford’ and many more models – but none can have all the answers 2020 28 Dec 2020. Available from: [https://theconversation.com/coronavirus-weve-had-imperial-oxford-and-many-more-models-but-none-can-have-all-the-answers-135137](https://theconversation.com/coronavirus-weve-had-imperial-oxford-and-many-more-models-but-none-can-have-all-the-answers-135137). 14. 14.Surowiecki J. The Wisdom of Crowds: Why the Many are Smarter Than the Few and how Collective Wisdom Shapes Politics,Business, Economies, Societies, and Nations: Doubleday & Co; 2004. 15. 15.Claeskens G, Magnus JR, Vasnev AL, Wang W. The forecast combination puzzle: A simple theoretical explanation. International Journal of Forecasting. 2016;32(3):754–62. 16. 16.Smith J, Wallis KF. A Simple Explanation of the Forecast Combination Puzzle*. Oxford Bulletin of Economics and Statistics. 2009;71(3):331–55. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1111/j.1468-0084.2008.00541.x&link_type=DOI) 17. 17.Bates JM, Granger CWJ. The Combination of Forecasts. OR. 1969;20(4):451–68. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.2307/3008764&link_type=DOI) 18. 18.Busetti F. Quantile Aggregation of Density Forecasts. Oxford Bulletin of Economics and Statistics. 2017;79(4):495–512. 19. 19.1. Elliott G, 2. Granger CWJ, 3. Timmermann A Timmermann A. Chapter 4 Forecast Combinations. In: Elliott G, Granger CWJ, Timmermann A, editors. Handbook of Economic Forecasting. 1: Elsevier; 2006. p. 135–96. 20. 20.Krishnamurti TN, Kishtawal CM, LaRow TE, Bachiochi DR, Zhang Z, Williford CE, et al. Improved Weather and Seasonal Climate Forecasts from Multimodel Superensemble. Science. 1999;285(5433):1548–50. [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6Mzoic2NpIjtzOjU6InJlc2lkIjtzOjEzOiIyODUvNTQzMy8xNTQ4IjtzOjQ6ImF0b20iO3M6NTA6Ii9tZWRyeGl2L2Vhcmx5LzIwMjEvMDcvMTYvMjAyMS4wNy4xMS4yMTI2MDMxOC5hdG9tIjt9czo4OiJmcmFnbWVudCI7czowOiIiO30=) 21. 21.Leutbecher M, Palmer TN. Ensemble forecasting. J Comput Phys. 2008;227(7):3515–39. 22. 22.Johansson MA, Apfeldorf KM, Dobson S, Devita J, Buczak AL, Baugher B, et al. An open challenge to advance probabilistic forecasting for dengue epidemics. Proceedings of the National Academy of Sciences. 2019;116(48):24268–74. [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6NDoicG5hcyI7czo1OiJyZXNpZCI7czoxMjoiMTE2LzQ4LzI0MjY4IjtzOjQ6ImF0b20iO3M6NTA6Ii9tZWRyeGl2L2Vhcmx5LzIwMjEvMDcvMTYvMjAyMS4wNy4xMS4yMTI2MDMxOC5hdG9tIjt9czo4OiJmcmFnbWVudCI7czowOiIiO30=) 23. 23.Ray EL, Wattanachit N, Niemi J, Kanji AH, House K, Cramer EY, et al. Ensemble Forecasts of Coronavirus Disease 2019 (COVID-19) in the U.S. medRxiv. 2020:2020.08.19.20177493. 24. 24.Reich NG, McGowan CJ, Yamana TK, Tushar A, Ray EL, Osthus D, et al. Accuracy of real-time multi-model ensemble forecasts for seasonal influenza in the U.S. PLoS Comput Biol. 2019;15(11):e1007486–e. 25. 25.Yamana TK, Kandula S, Shaman J. Individual versus superensemble forecasts of seasonal influenza outbreaks in the United States. PLoS Comput Biol. 2017;13(11):e1005801. 26. 26.Taylor JW, Taylor KS. Combining Probabilistic Forecasts of COVID-19 Mortality in the United States European Journal of Operational Research. European Journal of Operational Research. 2021;[in press]. 27. 27.Bracher J, Ray E, Gneiting T, Reich NG. Evaluating epidemic forecasts in an interval format. arXiv: Applications. 2020. 28. 28.Dong E, Du H, Gardner L. An interactive web-based dashboard to track COVID-19 in real time. Lancet Infect Dis. 2020;20(5):533–4. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/S1473-3099(20)30120-1&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F07%2F16%2F2021.07.11.21260318.atom) 29. 29.Gneiting T, Raftery AE. Strictly Proper Scoring Rules, Prediction, and Estimation. Journal of the American Statistical Association. 2007;102(477):359–78. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1198/016214506000001437&link_type=DOI) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000244361000032&link_type=ISI) 30. 30.Winkler RL, Grushka-Cockayne Y, Jr. Kcl, Jose VRR. Probability Forecasts and Their Combination: A Research Perspective. Decision Analysis. 2019;16(4):239–60. 31. 31.Diebold FX, Mariano RS. Comparing Predictive Accuracy. Journal of Business & Economic Statistics. 2002;20(1):134–44. 32. 32.Koning AJ, Franses PH, Hibon M, Stekler HO. The M3 competition: Statistical tests of the results. International Journal of Forecasting. 2005;21(3):397–409. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.ijforecast.2004.10.003&link_type=DOI) 33. 33.Gaba A, Tsetlin I, Winkler RL. Combining Interval Forecasts. Decision Analysis. 2017;14(1):1–20. 34. 34.Hora SC, Fransen BR, Hawkins N, Susel I. Median Aggregation of Distribution Functions. Decision Analysis. 2013;10(4):279–91. 35. 35.Jose VRR, Grushka-Cockayne Y, Lichtendahl KC. Trimmed Opinion Pools and the Crowd’s Calibration Problem. Management Science. 2014;60(2):463–75. 36. 36.Park S, Budescu D. Aggregating multiple probability intervals to improve calibration. Judgment and decision making. 2015;10:130–43. 37. 37.Capistrán C, Timmermann A. Forecast Combination With Entry and Exit of Experts. Journal of Business & Economic Statistics. 2009;27(4):428–40. 38. 38.Borenstein M, Hedges L, Higgins J, Rothstein H. Introduction to Meta-Analysis: John Wiley & Sons. 39. 39.Grushka-Cockayne Y, Jose VRR. Combining prediction intervals in the M4 competition. International Journal of Forecasting. 2020;36(1):178–85. 40. 40.McGowan CJ, Biggerstaff M, Johansson M, Apfeldorf KM, Ben-Nun M, Brooks L, et al. Collaborative efforts to forecast seasonal influenza in the United States, 2015–2016. Scientific Reports. 2019;9(1):683. 41. 41.Yeh RW. Academic Cardiology and Social Media: Navigating the Wisdom and Madness of the Crowd. Circ Cardiovasc Qual Outcomes. 2018;11(4):e004736. [FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiRlVMTCI7czoxMToiam91cm5hbENvZGUiO3M6ODoiY2lyY2N2b3EiO3M6NToicmVzaWQiO3M6MTI6IjExLzQvZTAwNDczNiI7czo0OiJhdG9tIjtzOjUwOiIvbWVkcnhpdi9lYXJseS8yMDIxLzA3LzE2LzIwMjEuMDcuMTEuMjEyNjAzMTguYXRvbSI7fXM6ODoiZnJhZ21lbnQiO3M6MDoiIjt9) 42. 42.Sridhar D, Majumder MS. Modelling the pandemic. BMJ. 2020;369:m1567. [FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiRlVMTCI7czoxMToiam91cm5hbENvZGUiO3M6MzoiYm1qIjtzOjU6InJlc2lkIjtzOjE3OiIzNjkvYXByMjFfOS9tMTU2NyI7czo0OiJhdG9tIjtzOjUwOiIvbWVkcnhpdi9lYXJseS8yMDIxLzA3LzE2LzIwMjEuMDcuMTEuMjEyNjAzMTguYXRvbSI7fXM6ODoiZnJhZ21lbnQiO3M6MDoiIjt9) 43. 43.Lipsitch M, Santillana M. Enhancing Situational Awareness to Prevent Infectious Disease Outbreaks from Becoming Catastrophic. Curr Top Microbiol Immunol. 2019;424:59–74. [1]: /embed/graphic-4.gif [2]: /embed/graphic-6.gif