Uncertainty quantification for epidemiological forecasts of COVID-19 through combinations of model predictions

by   V. E. Bowman, et al.

A common statistical problem is prediction, or forecasting, in the presence of an ensemble of multiple candidate models. For example, multiple candidate models may be available to predict case numbers in a disease epidemic, resulting from different modelling approaches (e.g. mechanistic or empirical) or differing assumptions about spatial or age mixing. Alternative models capture genuine uncertainty in scientific understanding of disease dynamics, and/or different simplifying assumptions underpinning each model derivation. While the analysis of multi-model ensembles can be computationally challenging, accounting for this 'structural uncertainty' can improve forecast accuracy and reduce the risk of over-estimated confidence. In this paper we look at combining epidemiological forecasts for COVID-19 daily deaths, hospital admissions, and hospital and ICU occupancy, in order to improve the predictive accuracy of the short term forecasts. We combining models via combinations of individual predictive densities with weights chosen via application of predictive scoring, as commonly applied in meteorological and economic forecasting.



There are no comments yet.


page 5

page 6


Evaluating epidemic forecasts in an interval format

For practical reasons, many forecasts of case, hospitalization and death...

A Hierarchical Bayesian Model for Stochastic Spatiotemporal SIR Modeling and Prediction of COVID-19 Cases and Hospitalizations

Most COVID-19 predictive modeling efforts use statistical or mathematica...

Statistical methods used to combine the effective reproduction number, R(t), and other related measures of COVID-19 in the UK

In the COVID-19 pandemic, a range of epidemiological models have been us...

Robust Real-Time Delay Predictions in a Network of High-Frequency Urban Buses

Providing transport users and operators with accurate forecasts on trave...

Financial density forecasts: A comprehensive comparison of risk-neutral and historical schemes

We investigate the forecasting ability of the most commonly used benchma...

On the derivation of the renewal equation from an age-dependent branching process: an epidemic modelling perspective

Renewal processes are a popular approach used in modelling infectious di...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Scoring forecasts

Forecast quality is often assessed via loss functions known as

scoring rules 4, 5, that take a predictive density and outcome as arguments. The scores are typically calculated using out-of-sample observations, e.g. future data. A proper scoring rule ensures that minimum loss is obtained by choosing the data generating process as the predictive density. Common rules include:

  1. log score: ;

  2. continuous ranked probability score (CRPS): , with

    and having finite first moment. enumerate For deterministic predictions, i.e.

    being a point mass density with support on , the CRPS reduces to the mean absolute error , and hence this score can be used to compare probabilistic and deterministic predictions.

    If only quantiles from

    are available, alternative scoring rules include enumerate

  3. quantile score: for quantile forecast at level ;

  4. interval score: for being a central % prediction interval from .

Quantile and interval scores can be averaged across available quantiles/intervals to provide a score for the predictive density, while CRPS is the integral of the quantile score with respect to .

2 Combinations of predictive distributions

Consider a model list , with the

th model having posterior predictive density

for data . The ensemble methods considered here mostly have posterior predictive densities of the form


where weights the contribution of the th model to the overall prediction, with .

Giving equal weighting to each model has proved surprisingly effective in economic forecasting13. Alternatively, given a score function , weights can be chosen as


Such approaches have been applied in economics17 and are the essence of model stacking as described by Yao et al. 22. To avoid overfitting, the would typically not include the data used to train the individual models in the ensemble, e.g. estimate would be constructed using future data or cross-validation. Alternatively, scoring functions can be used to construct normalised weights


with being the estimated score for the th model, and being a monotonic function; with the log-score and , Akaike Information Criterion (AIC) style weights are obtained1.

A number of scenarios can be envisaged for the application of such stacking methods to currently employed epidemiological models:

  1. the only model output available is a pre-determined set of predictive quantiles: (i) quantile or interval scoring can be used to construct weights, with either individual quantiles directly included in a weighted sum (suitably adjusted to achieve nominal coverage of each interval) or a (non)-parametric distribution approximated from each set of quantiles and then included in Eq.1; or (ii) an upper-bound on CRPS scoring can be obtained using the quantiles.

  2. predictive distributions, or samples thereof, are available and hence direct mixing or stacking of predictive distributions is possible. Ideally, leave-one-out predictions or sequential predictions of future data are available to mitigate over-fitting. Alternatively, the rather confusingly named Bayesian model combination 14, 15 could be employed, where the ensemble is expanded to include linear combinations of models, e.g. mean predictions.

  3. if the models themselves are available, stacking can be applied via estimation of predictive distributions (and alternative methods of selecting the weights are possible22). If the models are computationally expensive, preventing the straightforward generation of predictive densities, then a hierarchical Bayesian model can be applied with Gaussian process priors11 assumed for each expensive model. Stacking can then be applied to the resulting posterior predictive distributions, conditioning on model runs and data. This approach would be preferred, as it provides a holistic treatment of the predictive uncertainty.

3 Regression-based forecast combination

The stacking methods described above attempt to calibrate the forecast ensemble by fitting a mixture distribution, with components corresponding to the individual model forecast probability density functions (pdfs). Alternatively, the calibrated forecasts can be generated from a regression model, with covariates corresponding to forecasts from the model ensemble. Two such regression-based methods are currently under evaluation; Ensemble Model Output Statistics

7 (EMOS) and Quantile Regression Averaging 16 (QRA).

EMOS defines a predictive model in the form of a Gaussian distribution:


where are the forecasts from the individual models, and are non-negative coefficients,

is the ensemble variance, and

and are regression coefficients. Tuning of the coefficients is achieved by minimizing the CRPS against training data.

QRA defines a predictive model for a given quantile, , for the combined forecast as:


where is the -th reported quantile for model , and fitting of the coefficients is achieved by minimizing the quantile score of the predictions against training data.

Further details of the implementation of EMOS, QRA and stacking for short term forecasts are provided in Appendix A.

4 Comparing the performance of combined and individual forecasts

The performance of the different combined and individual forecasts has been monitored using the interval score () averaged over the forecast window and the (i.e. the quantile score for the median), and intervals, in addition to the well-established assessment metrics; sharpness, bias and calibration 6. Sharpness () is a measure of prediction uncertainty, and is defined here as the average width of the central predictive interval over the forecast window. Bias () measures how likely a predictive model is to over or under predict, as the proportion of predictions for which the reported median is greater than the data value. Calibration () quantifies the statistical consistency between the predictions and data, via the proportion of predictions for which the data lies inside the

central predictive interval.The bias and calibration scores were linearly transformed as

and , such that a well-calibrated prediction with no bias or uncertainty corresponds to .

4.1 Results

Value types Description
death_inc_line New daily deaths by date of death
hospital_inc New and newly confirmed patients in hospital
hospital_prev Hospital bed occupancy
icu_prev ICU occupancy
Table 1: Value types (model outputs of interest) for which forecasts were scored.
Nations Regions
England London
Scotland East of England
Wales Midlands
Northern Ireland North East and Yorkshire
North West
South East
South West
Table 2: Nations/regions for which forecasts were scored.

For multiple sets of past forecasts, the three metrics were visualized on plots with axes corresponding to against , and with plot point sizes defined by . Calibration and bias were considered the most important criterion, while sharpness was used to break ties when the calibration and bias scores were equal. Where both data and model predictions were available, the metrics were evaluated for the four value types (model outputs of interest) and eleven regions/nations shown in Tables 1 and 2. Two scenarios were considered; firstly where it was required that a full window of training predictions was available for each model to be included in the combination, and secondly, where models with incomplete training prediction windows were also included. The second scenario has been encountered multiple times in practice.

QRA 0.80 49
Stacked: time invariant weights 0.81 50
EMOS 0.9 54
Stacked: time varying weights 0.93 54
Stacked: equal weights 0.91 58
(a) death_inc_line
EMOS 0.67 74
QRA 0.71 85
Stacked: time invariant weights 0.84 109
Stacked: time varying weights 0.87 115
Stacked: equal weights 0.94 119
(b) hospital_inc
Stacked: time invariant weights 1.02 488
Stacked: time varying weights 1.06 514
Stacked: equal weights 1.04 541
QRA 1.09 630
EMOS 1.17 1183
(c) hospital_prev
QRA 0.98 106
EMOS 1.08 126
Stacked: time invariant weights 1.29 190
Stacked: time varying weights 1.34 190
Stacked: equal weights 1.36 192
(d) icu_prev
Table 3: Average distances from origin for the calibration-bias plots, and average interval scores for each value type, averaged over regions/nations for each combination algorithm. The rows are ordered by increasing interval score, so that the best performing models are at the top of the tables.
Average score

Both the sharpness, bias and calibration metrics and interval scores suggest that on average over nations/region, the best performing forecasts (for the first scenario) originated from the QRA method for death_inc_line and icu_prev, the time-invariant weights stacking method for hospital_prev, and EMOS for hospital_inc (see Table 3).

Figure 1: Sharpness, bias and calibration scores for the (left) individual and (right) combined forecasts, for all regions and value types delivered on (top) 11th May and (bottom) 4th May, 2020. Note that multiple points are hidden when they coincide. The shading of the quadrants implies a preference for over-prediction to under-prediction, and for prediction intervals that contain too many data points, rather than too few.

At finer resolution, the performance of both the combination and individual models was found to vary considerably between different regions and value types (illustrated in Fig. 1, which overlays all the metric values for the (left panels) individual and (right panels) combined forecasts for each forecasts delivered on (top) 11th and (bottom) 4th of May 2020 respectively). The best forecasting model was also highly variable across nations/regions and value types (as shown for calibration and bias in Fig. 2), and was often an individual model.

Figure 2: The best performing model or combination method for each region/nation and value type (for forecasts delivered on the 11th, 7th, 4th and 1st of May 2020), evaluated using the absolute distance from the origin on the calibration-bias plots. Ties were broken using the sharpness score.

The variability in performance was particularly stark for QRA and EMOS, which were found in some cases to vastly outperform the individual and stacked forecasts, and in others to substantially under-perform in comparison to both individual models and other combinations algorithms (as shown in Fig. 3 below). In the former case, this occurred when (e.g for occupied ICU beds in Scotland) all the individual model training predictions were highly biased. In these cases, QRA and EMOS were able to correct the bias using non-convex combinations of regression coefficients. Whether this behaviour is desirable depends upon whether the data is believed to be accurate, or itself subject to large systematic biases, such as under-reporting, that is not accounted for within the model predictions of the data. For the latter case of under-performance, this occurred in the presence of large discontinuities between the individual model training predictions and forecasts, corresponding to changes in model structure or parameters. This disruption to the learnt relationships between individual model predictions, and to the data (as captured by the regression models), led to increased forecast bias (an example of this is shown in Fig. 4). Appendix A.4 introduces a simple modification to the QRA algorithm that corrects for this bias. Promising initial results suggest that the modified QRA algorithm substantially outperforms other combination methods for most value types.

Figure 3: Bar plots showing performance of model and combination method for each region/nation and value type (for forecasts delivered on the 11th, 7th, 4th and 1st of May 2020). The height of each bar is calculated as , so that higher bars correspond to better performance.
Figure 4: QRA forecast for hospital_prev in the North West for a forecast window beginning on the 14th May. A Large discontinuity between the current and past forecasts (black line) of the individual model corresponding to the covariate with the largest regression coefficient can lead to increased bias for the QRA algorithm. The median, and QRA prediction intervals are shown in blue, while the data is shown in red.

In comparison, the relative performance of the stacking methods was more stable across different regions/nations and value types (see Fig. 3), which likely reflects the conservative nature of the normalized weights in comparison to the optimized regression coefficients. In terms of sharpness, the stacked forecasts were always outperformed by the tight predictions of some of the individual models, and by the EMOS method.

It is clear from Figure 3 that while some models provided consistently high scoring forecasts for specific regions and value types, the performance of the individual and combined models varied considerably, even over the limited number of delivery dates considered. Further analysis of additional short term forecast data is required to establish the extent to which the combination method or individual model identified as best performing is stable over time.

It is worth noting the delay in the response of the combination methods to changes in individual model performance. For example, the predictions from model two for death_inc_line in the South East of England improved considerably to become the top performing model on the 1st of May. However, the model combination methods only began to weight this model highly within the mixture/regression model on the 11th May. This behaviour arises from the requirement for sufficient data to become available in order to detect the change in performance.

The results from the second scenario showed that under specific conditions, the naive equal weights mixture combination can outperform the data driven methods. This occurs when a new model with well calibrated predictions is introduced, or when the performance of an existing model changes substantially. For example, when initially delivered, a model will be penalized by the data driven methods for lacking predictions for the full training window. In comparison, the naive approach assigned the model equal weight, will result in improved performance if a new model with very low bias is introduced.

5 Conclusions

Scoring rules are a natural way of constructing weights for the combination of predictions from an ensemble of multiple models. Weighting via scores has the advantages over Bayesian model averaging of being directly tailored to approximate a predictive distribution and less sensitive to choice of prior distributions, and overcomes the problem of model averaging converging, as sample size increases, to a predictive density from a single model, even when the ensemble of models does not include the true data generator. Some guidance on which situations each method is appropriate for is given by Höge et al. 9.

Prediction from multi-model ensembles has a long history of being addressed in climate modelling3, 19, 2. Here, it is typical to make assumptions about the underlying distribution to which the models belong, e.g. that they can be approximated by a common regression form or are (co-) exchangeable.

Several methods (stacking, EMOS and QRA) to combine epidemiological forecasts have been investigated, and their performance evaluated using the well-established sharpness, bias and calibration metrics as well as the interval score. When averaged over nations/regions, the best performing forecasts according to both the metrics and interval score, originated from the QRA method for death_inc_line and icu_prev, the time-invariant weights stacking method for hospital_prev, and EMOS for hospital_inc. However, the performance metrics for each model and combination method were found to vary considerably over the different regions and value type combinations. This suggests that the optimal choice of method for combining short term forecasts will be different for each region and value type. Whilst some models were observed to perform consistently well for particular region and value type combinations, the extent to which the best performing models remain stable over time will require further investigation using additional forecasting data.

The rapid evolution of the models (through changes in both parameterization and structure) during the COVID-19 outbreak have led to substantial changes in each models’ predictive performance over time. This represents a significant challenge for combination methods that essentially use a model’s past performance to predict its future performance, and has resulted in cases where the combined forecasts do not represent an improvement to the individual model forecasts. For the stacking approaches, this challenge could be overcome by either (a) additionally providing quantile predictions from the latest version of the models for (but not fit to) data points within a training window, or (b) sampled trajectories from the posterior predictive distribution for the latest model at data points that have been used for parameter estimation. Option (a) would allow direct application of the current algorithms to the latest models but may be complicated by addition of model structure due to e.g. changes in control measures, whilst (b) would enable application of full Bayesian stacking approaches using, for example, leave-one-out cross validation 22. For the regression approaches, a simple modification to the QRA algorithm (detailed in Appendix A.4) appears to be able to correct the prediction bias associated with changes to the model structure or parameters. Promising initial results suggest that the modified QRA algorithm can substantially outperform other combination methods for most value types.

6 Recommendations

  • Forecasts for each region/nation and value type should be selected independently from the available combination algorithms, as sufficient evidence from their most recent calibration, sharpness and bias scores accumulates.

  • The performance of the various combination algorithms should continue to be monitored. The stability of the performance of the different individual models and algorithms should be assessed using additional forecasts as they become available.

  • The current choice of algorithm for each region/nation and value type should be assessed on a regular basis and updated as appropriate.

  • When forecasts from a new model (or from an existing model for a new region/nation or value type), with no corresponding training predictions, the equal-weights mixture algorithm should be included for consideration.

  • The performance of mixture-based combination methods with optimized weights should be investigated.

  • Further research should be conducted into modified regression based combination methods that are able to adjust to changes in the underlying models. The performance of the modified QRA algorithm should be monitored alongside the existing suite of combination algorithms.

  • The EMOS method should be further developed to incorporate information from the full set of reported quantiles.

  • The viability of stacking approaches whose training predictions originate from up-to-date versions of the models should be investigated and implemented when available.

  • When the performance of the forecast combination methods for each region and value type has stabilized, Dstl will make a recommendation on a preferred method for each region and value type, for agreement by SPI-M.


This work has greatly benefited from parallel research led by Sebastian Funk (LSHTM), and was conducted within a wider effort to improve the policy response to COVID-19. The authors would like to thank the SPI-M modelling groups for the data used in this paper. We would also like to thank Tom Finnie at PHE and the SPI-M secretariat for their helpful discussions.

(c) Crown copyright (2020), Dstl. This material is licensed under the terms of the Open Government License except where otherwise stated. To view this license, visit http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3 or write to the Information Policy Team, The National Archives, Kew, London TW9 4DU, or email: psi@nationalarchives.gsi.gov.uk


  • K. P. Burnham and D. R. Anderson (2002) Model selection and multimodel inference. 2nd edition, Springer, New York. Cited by: §2.
  • R. E. Chandler (2013) Exploiting strength, discounting weakness: combining information from multiple climate simulators. Philosophical Transactions of the Royal Society A 371, pp. 20120388. Cited by: §5.
  • T. Gneiting, A. E. Raftery, A. H. Westveld III, and T. Goldman (2005a) Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. Monthly Weather Review 133, pp. 1098–1118. Cited by: §5, Uncertainty quantification for epidemiological forecasts of COVID-19 through combinations of model predictions.
  • T. Gneiting and A. E. Raftery (2007) Strictly proper scoring rules, prediction and estimation. Journal of the American Statistical Association 102, pp. 359–378. Cited by: §1.
  • T. Gneiting and R. Ranjan (2011) Comparing density forecasts using threshold- and quantile-weighted scoring rules. Journal of Business and Economic Statistics 29, pp. 411–422. Cited by: §1.
  • T. Gneiting, F. Balabdaoui, and A. E. Raftery (2007) Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69 (2), pp. 243–268. Cited by: §4.
  • T. Gneiting, A. E. Raftery, A. H. Westveld III, and T. Goldman (2005b) Calibrated probabilistic forecasting using ensemble model output statistics and minimum crps estimation. Monthly Weather Review 133 (5), pp. 1098–1118. Cited by: §A.2, §3.
  • J. A. Hoeting, D. Madigan, A. E. Raftery, and C. T. Volinsky (1999) Bayesian model averaging: a tutorial. Statistical Science, pp. 382–401. Cited by: Uncertainty quantification for epidemiological forecasts of COVID-19 through combinations of model predictions.
  • M. Höge, A. Guthke, and W. Nowak (2019) The hydrologist’s guide to Bayesian model selection, averaging and combination. Journal of Hydrology 572, pp. 96–107. Cited by: §5.
  • J. Kennedy and R. Eberhart (1995) Particle swarm optimization. In

    Proceedings of ICNN’95-International Conference on Neural Networks

    Vol. 4, pp. 1942–1948. Cited by: §A.2.
  • M. C. Kennedy and A. O’Hagan (2001) Bayesian calibration of computer models (with discussion). Journal of the Royal Statistical Society B 63, pp. 425–464. Cited by: item 3.
  • D. Madigan, A. E. Raftery, C. Volinsky, and J. Hoeting (1996) Bayesian model averaging. In Proceedings of the AAAI Workshop on Integrating Multiple Learned Models, Portland, OR, pp. 77–83. Cited by: Uncertainty quantification for epidemiological forecasts of COVID-19 through combinations of model predictions.
  • C. McDonald and L. A. Thorsrud (2011) Evaluating density forecasts: model combination strategies versus the RBNZ. Technical report Technical Report DP2011/03, Reserve Bank of New Zealand. Cited by: §2, Uncertainty quantification for epidemiological forecasts of COVID-19 through combinations of model predictions.
  • T. P. Minka (2000) Bayesian model averaging is not model combination. Available electronically at http://www. stat. cmu. edu/minka/papers/bma. html, pp. 1–2. Cited by: item 2.
  • K. Monteith, J. L. Carroll, K. Seppi, and T. Martinez (2011) Turning Bayesian model averaging into Bayesian model combination. In The 2011 International Joint Conference on Neural Networks, pp. 2657–2663. Cited by: item 2.
  • J. Nowotarski and R. Weron (2015) Computing electricity spot price prediction intervals using quantile regression and forecast averaging. Computational Statistics 30 (3), pp. 791–803. Cited by: §3.
  • A. Opschoor, D. van Dijk, and M. van der Wel (2017) Combining density forecasts using focused scoring rules. Applied Econometrics 32, pp. 1298–1313. Cited by: §2.
  • A. E. Raftery, D. Madigan, and J. A. Hoeting (1997)

    Bayesian model averaging for linear regression models

    Journal of the American Statistical Association 92, pp. 179–191. Cited by: Uncertainty quantification for epidemiological forecasts of COVID-19 through combinations of model predictions.
  • J. Rougier, M. Goldstein, and L. House (2013) Second-order exchangeability analysis for multimodel ensembles. Journal of the American Statistical Association 108, pp. 852–863. Cited by: §5.
  • M. A. Semenov and P. Stratonovitch (2010) Use of multi-model ensembles from global climate models for assessment of climate change impacts. Climate research 41, pp. 1–14. Cited by: Uncertainty quantification for epidemiological forecasts of COVID-19 through combinations of model predictions.
  • K. F. Wallis (2005) Combining density and interval forecasts: a modest proposal. Oxford Bulletin of Economics and Statistics 67, pp. 983–994. Cited by: Uncertainty quantification for epidemiological forecasts of COVID-19 through combinations of model predictions.
  • Y. Yao, A. Vehtari, D. Simpson, and A. Gelman (2018) Using stacking to average Bayesian predictive distributions (with discussion). Bayesian Analysis 13, pp. 917–1007. Cited by: item 3, §2, §5.

Appendix A Implementation details

a.1 Stacking

The time invariant stacking weights () for model was calculated as the normalized sum of the reciprocal quantile scores (normalized for each data point) for a day window of past quantile predictions and data,


where , is taken as the sum of quantile scores for model ’s quantile predictions for data point . Use of the exponential decay term (with ), means that more recent observations have more impact on the stacking weights.

Given the negatively oriented quantile scoring function, constructing the weights as a normalized sum of reciprocals allowed models with a shorter window of predictions to be included with a naturally penalized weight. Where a subset of prediction quantiles were missing for given time point in the training window, they were estimated by fitting skewed-normal distributions to the available quantiles.

A simple implementation of time-varying weights was also implemented by exponential interpolation between the time-invariant stacking weights (one-day ahead) and equal weights (at the end of the forecast window, 14 days ahead).

a.2 Emos

The covariates, in Eq. 4

, were taken as the median predictions from the individual models. The ensemble standard deviation is estimated as the standard deviation of the median predictions. Dstl is investigating how to utilize more of the information provided by the reported quantiles within the EMOS method. Fitting of the regression coefficients was achieved using a Particle Swarm optimization routine

10, with the coefficients constrained to non-negative values (this version of the algorithm is known as EMOS 7), and with the intercept set to zero in order to force the combination to use the model predictions.

a.3 Qra

The covariates for quantile in Eq. 5, were taken as the individual model predictions for quantile , with coefficient values assumed to be fixed across the different quantiles. Fitting of the regression coefficients was achieved using a Particle Swarm optimization routine, with coefficients constrained to be non-negative.

a.4 Sqra

In Section 4.1 it was observed that large discrepancies between the past and present forecasts for an individual model could lead to increased bias in the QRA and EMOS combined forecasts. Overlapping of past and present forecasts allows this discrepancy to be characterized. Dstl is investigating how this information might used to improve combined forecast performance. Initially, a simple modification to the QRA algorithm has been implemented, that translates the covariates at time () in the regression model, such the values of match the training predictions for the start of the forecast window, . For cases where the discrepancy is large (e.g. Fig. 4), the reduction in bias of shifted QRA (SQRA) over QRA is striking (see Fig. 5).

Figure 5: QRA (blue) and SQRA (green) forecasts for hospital_prev in the North West for a forecast window beginning on the 14th May. SQRA corrects for the discontinuity between past and current forecasts for the individual model (black line) that corresponds to the covariate with the largest coefficient. Data is shown in red.
QRA 49.4
Stacked:time varying 57.7
SQRA 58.4
Stacked:equal weights 59.4
Stacked:time invariant 60.6
EMOS 79.0
(a) death_inc_line
SQRA 66.5
Stacked:time invariant 103.0
Stacked:time varying 107.0
Stacked:equal weights 110.0
QRA 116.3
EMOS 118.5
(b) hospital_inc
SQRA 422.8
EMOS 443.6
Stacked:time invariant 467.6
QRA 504.2
Stacked:time varying 508.2
Stacked:equal weights 528.6
(c) hospital_prev
SQRA 53.9
QRA 103.4
Stacked:time invariant 159.7
Stacked:time varying 177.4
Stacked:equal weights 177.4
EMOS 274.2
(d) icu_prev
Table 4: Interval scores for each value type, averaged over regions/nations for each model and combination algorithm for forecasts from the 18th May 2020.
Average score

Table 4 shows the average interval scores for each combination algorithm for forecasts delivered on the 18th May 2020. For hospital_inc, hospital_prev and icu_prev, the SQRA algorithm has considerably lower average interval scores than any other algorithm. The cases where QRA slightly outperforms SQRA on average, appear to correspond to training predictions that are drawn from multiple deliveries over which the evolving models’ forecasts are particularly volatile. Dstl are investigating how to choose the magnitude of the shift sensibly in these situations.