1 Scoring forecasts
Forecast quality is often assessed via loss functions known as
scoring rules^{4, 5}, which take a predictive density and an observed outcome as arguments. Scores are typically calculated using out-of-sample observations, e.g. future data. A proper scoring rule ensures that minimum expected loss is obtained by choosing the data-generating process as the predictive density. Common rules include:
log score: $S(f, y) = -\log f(y)$, for predictive density $f$ and outcome $y$;

continuous ranked probability score (CRPS): $S(F, y) = \int_{-\infty}^{\infty} \left( F(x) - \mathbb{1}\{y \le x\} \right)^2 \, dx$, with $F$ the cumulative distribution function of $f$, assumed to have finite first moment.

For deterministic predictions, i.e. $f$ being a point mass density with support on $\hat{y}$, the CRPS reduces to the absolute error $|y - \hat{y}|$, and hence this score can be used to compare probabilistic and deterministic predictions. If only quantiles from $f$ are available, alternative scoring rules include:
quantile score: $QS_\alpha(q_\alpha, y) = 2\left( \mathbb{1}\{y < q_\alpha\} - \alpha \right)(q_\alpha - y)$, for quantile forecast $q_\alpha$ at level $\alpha \in (0, 1)$;

interval score: $IS_\alpha(l, u; y) = (u - l) + \frac{2}{\alpha}(l - y)\mathbb{1}\{y < l\} + \frac{2}{\alpha}(y - u)\mathbb{1}\{y > u\}$, for $(l, u)$ being a central $(1 - \alpha) \times 100\%$ prediction interval from $f$.
Quantile and interval scores can be averaged across the available quantiles/intervals to provide a score for the predictive density, while the CRPS is the integral of the quantile score with respect to the quantile level $\alpha$.
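In code, the quantile and interval scores can be sketched as follows. This is a minimal illustration (function names are ours) using the standard forms; scaling conventions vary by a factor of two between references.

```python
def quantile_score(q, y, alpha):
    """Quantile (pinball) score for a forecast q at level alpha; lower is better."""
    return 2.0 * (float(y < q) - alpha) * (q - y)

def interval_score(lower, upper, y, alpha):
    """Interval score for a central (1 - alpha) * 100% prediction interval.

    Width of the interval plus a penalty, scaled by 2/alpha, whenever the
    observation y falls outside [lower, upper].
    """
    penalty = 0.0
    if y < lower:
        penalty += (2.0 / alpha) * (lower - y)
    if y > upper:
        penalty += (2.0 / alpha) * (y - upper)
    return (upper - lower) + penalty
```

With these conventions the interval score decomposes into the scaled sum of the quantile scores of the two endpoints: `interval_score(l, u, y, alpha)` equals `(quantile_score(l, y, alpha/2) + quantile_score(u, y, 1 - alpha/2)) / alpha`, mirroring the relationship between the CRPS and the quantile score noted above.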
2 Combinations of predictive distributions
Consider a model list $M_1, \ldots, M_K$, with the
$k$th model having posterior predictive density
$f_k(\tilde{y} \mid y)$ for data $y$. The ensemble methods considered here mostly have posterior predictive densities of the form
$$f(\tilde{y} \mid y) = \sum_{k=1}^{K} w_k f_k(\tilde{y} \mid y), \qquad (1)$$
where $w_k \ge 0$ weights the contribution of the $k$th model to the overall prediction, with $\sum_{k=1}^{K} w_k = 1$.
Giving equal weighting to each model has proved surprisingly effective in economic forecasting^{13}. Alternatively, given a score function $S$, weights can be chosen as
$$\hat{w} = \operatorname*{arg\,min}_{w} \sum_{i=1}^{n} S\!\left( \sum_{k=1}^{K} w_k f_k(\cdot \mid y),\; \tilde{y}_i \right), \quad \text{subject to } w_k \ge 0,\; \sum_{k=1}^{K} w_k = 1, \qquad (2)$$
for out-of-sample observations $\tilde{y}_1, \ldots, \tilde{y}_n$. Such approaches have been applied in economics^{17} and are the essence of model stacking as described by Yao et al.^{22}. To avoid overfitting, the $\tilde{y}_i$ would typically not include the data used to train the individual models in the ensemble; e.g. the estimate would be constructed using future data or cross-validation. Alternatively, scoring functions can be used to construct normalised weights
$$w_k = \frac{g(\hat{S}_k)}{\sum_{j=1}^{K} g(\hat{S}_j)}, \qquad (3)$$
with $\hat{S}_k$ being the estimated score for the $k$th model, and $g$ being a monotonic function; taking $\hat{S}_k$ to be the mean log predictive density and $g(x) = \exp(x)$, Akaike Information Criterion (AIC) style weights are obtained^{1}.
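Eq. 3 amounts to a softmax-style normalization when $g = \exp$; a minimal sketch (function name is ours):

```python
import math

def normalised_weights(scores, g=None):
    """Score-based weights w_k = g(S_k) / sum_j g(S_j), as in Eq. 3.

    By default g = exp applied to scores shifted by their maximum, which
    leaves the weights unchanged but avoids numerical overflow; with mean
    log predictive densities as scores this yields AIC-style weights.
    """
    if g is None:
        m = max(scores)
        gs = [math.exp(s - m) for s in scores]
    else:
        gs = [g(s) for s in scores]
    total = sum(gs)
    return [v / total for v in gs]
```

For example, two models whose scores differ by $\log 3$ receive weights $0.25$ and $0.75$, and identical scores yield equal weights.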
A number of scenarios can be envisaged for the application of such stacking methods to currently employed epidemiological models:

the only model output available is a predetermined set of predictive quantiles: (i) quantile or interval scoring can be used to construct weights, with either the individual quantiles directly included in a weighted sum (suitably adjusted to achieve nominal coverage of each interval), or a (non-)parametric distribution approximated from each set of quantiles and then included in Eq. 1; or (ii) an upper bound on the CRPS can be obtained using the quantiles.

predictive distributions, or samples thereof, are available, and hence direct mixing or stacking of predictive distributions is possible. Ideally, leave-one-out predictions or sequential predictions of future data are available to mitigate overfitting. Alternatively, the rather confusingly named Bayesian model combination^{14, 15} could be employed, where the ensemble is expanded to include linear combinations of models, e.g. mean predictions.

if the models themselves are available, stacking can be applied via estimation of predictive distributions (and alternative methods of selecting the weights are possible^{22}). If the models are computationally expensive, preventing the straightforward generation of predictive densities, then a hierarchical Bayesian model can be applied, with Gaussian process priors^{11} assumed for each expensive model. Stacking can then be applied to the resulting posterior predictive distributions, conditioning on model runs and data. This approach would be preferred, as it provides a holistic treatment of the predictive uncertainty.
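The stacking objective of Eq. 2 can be illustrated for two models with known predictive densities. This toy sketch chooses the mixture weight by maximizing the mean log predictive density (equivalently, minimizing the negatively oriented log score) on held-out data over a simple grid; a real implementation would use constrained optimization over the weight simplex, and all names and data here are illustrative.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a Normal(mu, sigma^2) distribution at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def stack_two_models(dens_a, dens_b, held_out, grid=101):
    """Choose w maximizing the mean log density of w*f_a + (1-w)*f_b
    evaluated at held-out observations (grid search over w in [0, 1])."""
    best_w, best_score = 0.0, -math.inf
    for i in range(grid):
        w = i / (grid - 1)
        score = sum(math.log(w * dens_a(y) + (1 - w) * dens_b(y) + 1e-300)
                    for y in held_out) / len(held_out)
        if score > best_score:
            best_w, best_score = w, score
    return best_w

# Hypothetical example: held-out data near 0 favour the model centred at 0.
f_a = lambda y: normal_pdf(y, 0.0, 1.0)
f_b = lambda y: normal_pdf(y, 5.0, 1.0)
w = stack_two_models(f_a, f_b, held_out=[-0.3, 0.1, 0.4, -0.2])
```

As expected, the weight concentrates on the model whose predictive density covers the held-out observations.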
3 Regression-based forecast combination
The stacking methods described above attempt to calibrate the forecast ensemble by fitting a mixture distribution, with components corresponding to the individual model forecast probability density functions (pdfs). Alternatively, calibrated forecasts can be generated from a regression model, with covariates corresponding to forecasts from the model ensemble. Two such regression-based methods are currently under evaluation: Ensemble Model Output Statistics^{7} (EMOS) and Quantile Regression Averaging^{16} (QRA).

EMOS defines a predictive model in the form of a Gaussian distribution:
$$\tilde{y} \mid f_1, \ldots, f_K \sim N\!\left( a + b_1 f_1 + \cdots + b_K f_K,\; c + d\, s^2 \right), \qquad (4)$$
where $f_1, \ldots, f_K$ are the forecasts from the individual models, $b_1, \ldots, b_K$ are non-negative coefficients,
$s^2$ is the ensemble variance, and
$c$ and $d$ are regression coefficients. Tuning of the coefficients is achieved by minimizing the CRPS against training data.

QRA defines a predictive model for a given quantile level, $\alpha$, of the combined forecast as:
$$q_\alpha = \beta_1 q_{\alpha,1} + \cdots + \beta_K q_{\alpha,K}, \qquad (5)$$
where $q_{\alpha,k}$ is the $\alpha$th reported quantile for model $k$, and fitting of the coefficients $\beta_k$ is achieved by minimizing the quantile score of the predictions against training data.
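The QRA fit of Eq. 5 can be sketched for two models and a single quantile level. This toy version minimizes the mean quantile score by grid search over non-negative coefficients (the implementation described in Appendix A.3 uses particle swarm optimization; the function names and data below are illustrative):

```python
def pinball(q, y, alpha):
    """Quantile (pinball) score; lower is better."""
    return 2.0 * (float(y < q) - alpha) * (q - y)

def fit_qra_two_models(q1, q2, ys, alpha, step=0.05, cmax=2.0):
    """Grid search for non-negative coefficients (b1, b2) minimizing the mean
    quantile score of the combined forecast b1*q1[t] + b2*q2[t] against ys."""
    best = (0.0, 0.0, float("inf"))
    n = int(round(cmax / step)) + 1
    for i in range(n):
        for j in range(n):
            b1, b2 = i * step, j * step
            loss = sum(pinball(b1 * a + b2 * b, y, alpha)
                       for a, b, y in zip(q1, q2, ys)) / len(ys)
            if loss < best[2]:
                best = (b1, b2, loss)
    return best[0], best[1]

# Hypothetical data: model 1's medians track the observations; model 2's do not.
q1 = [10.0, 12.0, 11.0, 13.0]
q2 = [30.0, 31.0, 29.0, 33.0]
ys = [10.0, 12.0, 11.0, 13.0]
b1, b2 = fit_qra_two_models(q1, q2, ys, alpha=0.5)
```

Note that the coefficients are not constrained to sum to one: as discussed in Section 4.1, this allows QRA to correct systematic bias via non-convex combinations.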
Further details of the implementation of EMOS, QRA and stacking for short-term forecasts are provided in Appendix A.
4 Comparing the performance of combined and individual forecasts
The performance of the different combined and individual forecasts has been monitored using the interval score averaged over the forecast window, and the mean absolute error of the median (i.e. the quantile score for the median), in addition to the well-established assessment metrics of sharpness, bias and calibration^{6}. Sharpness is a measure of prediction uncertainty, and is defined here as the average width of the central predictive interval over the forecast window. Bias measures how likely a predictive model is to over- or under-predict, defined as the proportion of predictions for which the reported median is greater than the data value. Calibration quantifies the statistical consistency between the predictions and the data, via the proportion of predictions for which the data lie inside the central predictive interval. The bias and calibration scores were linearly transformed (centred on their ideal values), such that a well-calibrated prediction with no bias corresponds to zero for both transformed scores.

4.1 Results
Table 1: Value types (model outputs of interest).

Value type      | Description
death_inc_line  | New daily deaths by date of death
hospital_inc    | New and newly confirmed patients in hospital
hospital_prev   | Hospital bed occupancy
icu_prev        | ICU occupancy
Table 2: Nations and regions.

Nations          | Regions
England          | London
Scotland         | East of England
Wales            | Midlands
Northern Ireland | North East and Yorkshire
                 | North West
                 | South East
                 | South West
For multiple sets of past forecasts, the three metrics were visualized on plots with axes corresponding to the transformed calibration against the transformed bias, and with plot point sizes defined by sharpness. Calibration and bias were considered the most important criteria, while sharpness was used to break ties when the calibration and bias scores were equal. Where both data and model predictions were available, the metrics were evaluated for the four value types (model outputs of interest) and eleven regions/nations shown in Tables 1 and 2. Two scenarios were considered: firstly, where a full window of training predictions was required for each model to be included in the combination; and secondly, where models with incomplete training prediction windows were also included. The second scenario has been encountered multiple times in practice.
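The sharpness, bias and calibration metrics described in Section 4 can be computed directly from the reported intervals and medians; a minimal sketch (function names are ours):

```python
def sharpness(lowers, uppers):
    """Average width of the central prediction interval over the forecast window."""
    return sum(u - l for l, u in zip(lowers, uppers)) / len(lowers)

def bias(medians, ys):
    """Proportion of predictions whose reported median exceeds the data value."""
    return sum(m > y for m, y in zip(medians, ys)) / len(ys)

def calibration(lowers, uppers, ys):
    """Proportion of observations falling inside the central prediction interval."""
    return sum(l <= y <= u for l, u, y in zip(lowers, uppers, ys)) / len(ys)
```

An unbiased model yields a bias near 0.5, and a well-calibrated model yields a calibration near the nominal coverage of the interval, which motivates the linear centring of both scores before plotting.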




Both the sharpness, bias and calibration metrics and the interval scores suggest that, on average over nations/regions, the best-performing forecasts (for the first scenario) originated from the QRA method for death_inc_line and icu_prev, the time-invariant weights stacking method for hospital_prev, and EMOS for hospital_inc (see Table 3).
At finer resolution, the performance of both the combination and individual models was found to vary considerably between different regions and value types (illustrated in Fig. 1, which overlays all the metric values for the individual (left panels) and combined (right panels) forecasts delivered on the 11th (top) and 4th (bottom) of May 2020). The best forecasting model was also highly variable across nations/regions and value types (as shown for calibration and bias in Fig. 2), and was often an individual model.
The variability in performance was particularly stark for QRA and EMOS, which were found in some cases to vastly outperform the individual and stacked forecasts, and in others to substantially underperform in comparison to both the individual models and the other combination algorithms (as shown in Fig. 3). The former occurred when all the individual model training predictions were highly biased (e.g. for occupied ICU beds in Scotland); in these cases, QRA and EMOS were able to correct the bias using non-convex combinations of regression coefficients. Whether this behaviour is desirable depends upon whether the data are believed to be accurate, or themselves subject to large systematic biases, such as under-reporting, that are not accounted for within the model predictions. The underperformance occurred in the presence of large discontinuities between the individual model training predictions and forecasts, corresponding to changes in model structure or parameters. This disruption to the learnt relationships between the individual model predictions and the data (as captured by the regression models) led to increased forecast bias (an example is shown in Fig. 4). Appendix A.4 introduces a simple modification to the QRA algorithm that corrects for this bias. Promising initial results suggest that the modified QRA algorithm substantially outperforms other combination methods for most value types.
In comparison, the relative performance of the stacking methods was more stable across different regions/nations and value types (see Fig. 3), which likely reflects the conservative nature of the normalized weights in comparison to the optimized regression coefficients. In terms of sharpness, the stacked forecasts were always outperformed by the tight predictions of some of the individual models, and by the EMOS method.
It is clear from Fig. 3 that while some models provided consistently high-scoring forecasts for specific regions and value types, the performance of the individual and combined models varied considerably, even over the limited number of delivery dates considered. Further analysis of additional short-term forecast data is required to establish the extent to which the combination method or individual model identified as best performing is stable over time.
It is worth noting the delay in the response of the combination methods to changes in individual model performance. For example, the predictions from model two for death_inc_line in the South East of England improved considerably to become the top performing model on the 1st of May. However, the model combination methods only began to weight this model highly within the mixture/regression model on the 11th May. This behaviour arises from the requirement for sufficient data to become available in order to detect the change in performance.
The results from the second scenario showed that, under specific conditions, the naive equal-weights mixture combination can outperform the data-driven methods. This occurs when a new model with well-calibrated predictions is introduced, or when the performance of an existing model changes substantially. For example, when first delivered, a model will be penalized by the data-driven methods for lacking predictions covering the full training window. In comparison, the naive approach assigns the model equal weight, which results in improved performance if the new model has very low bias.
5 Conclusions
Scoring rules are a natural way of constructing weights for combining predictions from an ensemble of multiple models. Weighting via scores has the advantages over Bayesian model averaging of being directly tailored to approximating a predictive distribution and of being less sensitive to the choice of prior distributions; it also overcomes the problem of model averaging converging, as the sample size increases, to the predictive density of a single model, even when the ensemble does not include the true data generator. Some guidance on the situations in which each method is appropriate is given by Höge et al.^{9}.
Prediction from multi-model ensembles has a long history in climate modelling^{3, 19, 2}. Here, it is typical to make assumptions about the underlying distribution to which the models belong, e.g. that they can be approximated by a common regression form or are (co-)exchangeable.
Several methods (stacking, EMOS and QRA) for combining epidemiological forecasts have been investigated, and their performance evaluated using the well-established sharpness, bias and calibration metrics, as well as the interval score. When averaged over nations/regions, the best-performing forecasts, according to both the metrics and the interval score, originated from the QRA method for death_inc_line and icu_prev, the time-invariant weights stacking method for hospital_prev, and EMOS for hospital_inc. However, the performance metrics for each model and combination method were found to vary considerably over the different region and value type combinations. This suggests that the optimal choice of method for combining short-term forecasts will be different for each region and value type. Whilst some models were observed to perform consistently well for particular region and value type combinations, the extent to which the best-performing models remain stable over time will require further investigation using additional forecasting data.
The rapid evolution of the models (through changes in both parameterization and structure) during the COVID-19 outbreak has led to substantial changes in each model's predictive performance over time. This represents a significant challenge for combination methods, which essentially use a model's past performance to predict its future performance, and has resulted in cases where the combined forecasts do not represent an improvement on the individual model forecasts. For the stacking approaches, this challenge could be overcome by additionally providing either (a) quantile predictions from the latest version of the models for (but not fit to) data points within a training window, or (b) sampled trajectories from the posterior predictive distribution of the latest model at data points that have been used for parameter estimation. Option (a) would allow direct application of the current algorithms to the latest models, but may be complicated by additions to model structure due to, e.g., changes in control measures, whilst (b) would enable application of full Bayesian stacking approaches using, for example, leave-one-out cross-validation^{22}. For the regression approaches, a simple modification to the QRA algorithm (detailed in Appendix A.4) appears to be able to correct the prediction bias associated with changes to the model structure or parameters. Promising initial results suggest that the modified QRA algorithm can substantially outperform other combination methods for most value types.
6 Recommendations

Forecasts for each region/nation and value type should be selected independently from the available combination algorithms, as sufficient evidence from their most recent calibration, sharpness and bias scores accumulates.

The performance of the various combination algorithms should continue to be monitored. The stability of the performance of the different individual models and algorithms should be assessed using additional forecasts as they become available.

The current choice of algorithm for each region/nation and value type should be assessed on a regular basis and updated as appropriate.

When forecasts are received from a new model (or from an existing model for a new region/nation or value type) with no corresponding training predictions, the equal-weights mixture algorithm should be included for consideration.

The performance of mixture-based combination methods with optimized weights should be investigated.

Further research should be conducted into modified regression based combination methods that are able to adjust to changes in the underlying models. The performance of the modified QRA algorithm should be monitored alongside the existing suite of combination algorithms.

The EMOS method should be further developed to incorporate information from the full set of reported quantiles.

The viability of stacking approaches whose training predictions originate from up-to-date versions of the models should be investigated and implemented when available.

When the performance of the forecast combination methods for each region and value type has stabilized, Dstl will make a recommendation on a preferred method for each region and value type, for agreement by SPIM.
Acknowledgments
This work has greatly benefited from parallel research led by Sebastian Funk (LSHTM), and was conducted within a wider effort to improve the policy response to COVID19. The authors would like to thank the SPIM modelling groups for the data used in this paper. We would also like to thank Tom Finnie at PHE and the SPIM secretariat for their helpful discussions.
(c) Crown copyright (2020), Dstl. This material is licensed under the terms of the Open Government Licence except where otherwise stated. To view this licence, visit http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3 or write to the Information Policy Team, The National Archives, Kew, London TW9 4DU, or email: psi@nationalarchives.gsi.gov.uk
References
Model Selection and Multimodel Inference. 2nd edition, Springer, New York.
Exploiting strength, discounting weakness: combining information from multiple climate simulators. Philosophical Transactions of the Royal Society A 371, pp. 20120388.
Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. Monthly Weather Review 133, pp. 1098–1118.
Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102, pp. 359–378.
Comparing density forecasts using threshold- and quantile-weighted scoring rules. Journal of Business and Economic Statistics 29, pp. 411–422.
Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69 (2), pp. 243–268.
Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. Monthly Weather Review 133 (5), pp. 1098–1118.
Bayesian model averaging: a tutorial. Statistical Science, pp. 382–401.
The hydrologist's guide to Bayesian model selection, averaging and combination. Journal of Hydrology 572, pp. 96–107.
Particle swarm optimization. In Proceedings of ICNN'95 – International Conference on Neural Networks, Vol. 4, pp. 1942–1948.
Bayesian calibration of computer models (with discussion). Journal of the Royal Statistical Society B 63, pp. 425–464.
Bayesian model averaging. In Proceedings of the AAAI Workshop on Integrating Multiple Learned Models, Portland, OR, pp. 77–83.
Evaluating density forecasts: model combination strategies versus the RBNZ. Technical Report DP2011/03, Reserve Bank of New Zealand.
Bayesian model averaging is not model combination. Available electronically at http://www.stat.cmu.edu/minka/papers/bma.html, pp. 1–2.
Turning Bayesian model averaging into Bayesian model combination. In The 2011 International Joint Conference on Neural Networks, pp. 2657–2663.
Computing electricity spot price prediction intervals using quantile regression and forecast averaging. Computational Statistics 30 (3), pp. 791–803.
Combining density forecasts using focused scoring rules. Journal of Applied Econometrics 32, pp. 1298–1313.
Bayesian model averaging for linear regression models. Journal of the American Statistical Association 92, pp. 179–191.
Second-order exchangeability analysis for multimodel ensembles. Journal of the American Statistical Association 108, pp. 852–863.
Use of multi-model ensembles from global climate models for assessment of climate change impacts. Climate Research 41, pp. 1–14.
Combining density and interval forecasts: a modest proposal. Oxford Bulletin of Economics and Statistics 67, pp. 983–994.
Using stacking to average Bayesian predictive distributions (with discussion). Bayesian Analysis 13, pp. 917–1007.
Appendix A Implementation details
A.1 Stacking
The time-invariant stacking weight ($w_k$) for model $k$ was calculated as the normalized sum of the reciprocal quantile scores (normalized for each data point) over a $T$-day window of past quantile predictions and data,
$$w_k = \frac{\sum_{t=1}^{T} e^{-\lambda (T - t)} / \tilde{Q}_{k,t}}{\sum_{j=1}^{K} \sum_{t=1}^{T} e^{-\lambda (T - t)} / \tilde{Q}_{j,t}}, \qquad (6)$$
where $\tilde{Q}_{k,t}$, normalized over models for each data point, is the sum of the quantile scores for model $k$'s quantile predictions for data point $t$. Use of the exponential decay term (with $\lambda > 0$) means that more recent observations have more impact on the stacking weights.
Given the negatively oriented quantile scoring function, constructing the weights as a normalized sum of reciprocals allowed models with a shorter window of predictions to be included with a naturally penalized weight. Where a subset of prediction quantiles was missing for a given time point in the training window, they were estimated by fitting skew-normal distributions to the available quantiles.
A simple form of time-varying weights was also implemented, by exponential interpolation between the time-invariant stacking weights (one day ahead) and equal weights (at the end of the forecast window, 14 days ahead).
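The time-invariant weight construction can be sketched as follows. The decay rate and window length below are illustrative assumptions (their actual values are not reproduced in the text), and the function name is ours.

```python
import math

def stacking_weights(quantile_scores, lam=0.1):
    """Time-invariant stacking weights from per-model, per-day quantile scores.

    quantile_scores[k][t] is the summed quantile score for model k at day t
    (t = 0 .. T-1, most recent last).  Scores are normalized across models at
    each day, then their reciprocals are summed with an exponential decay that
    favours recent days, and finally renormalized to sum to one (Eq. 6).
    lam is an assumed decay rate, not the value used in the report.
    """
    n_models = len(quantile_scores)
    T = len(quantile_scores[0])
    raw = [0.0] * n_models
    for t in range(T):
        day_total = sum(quantile_scores[k][t] for k in range(n_models))
        for k in range(n_models):
            norm_score = quantile_scores[k][t] / day_total
            raw[k] += math.exp(-lam * (T - 1 - t)) / norm_score
    total = sum(raw)
    return [r / total for r in raw]
```

Because the quantile score is negatively oriented, the reciprocal gives better-scoring (lower-loss) models larger weights, while a model contributing fewer days simply accumulates a smaller sum.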
A.2 EMOS
The covariates in Eq. 4 were taken as the median predictions from the individual models. The ensemble standard deviation was estimated as the standard deviation of the median predictions. Dstl is investigating how to utilize more of the information provided by the reported quantiles within the EMOS method. Fitting of the regression coefficients was achieved using a particle swarm optimization routine^{10}, with the coefficients constrained to non-negative values (this version of the algorithm is known as EMOS+^{7}), and with the intercept set to zero in order to force the combination to use the model predictions.

A.3 QRA
The covariates for quantile $\alpha$ in Eq. 5 were taken as the individual model predictions for quantile $\alpha$, with the coefficient values assumed to be fixed across the different quantiles. Fitting of the regression coefficients was achieved using a particle swarm optimization routine, with the coefficients constrained to be non-negative.
A.4 SQRA
In Section 4.1 it was observed that large discrepancies between the past and present forecasts of an individual model could lead to increased bias in the QRA and EMOS combined forecasts. Overlap between past and present forecasts allows this discrepancy to be characterized, and Dstl is investigating how this information might be used to improve combined forecast performance. Initially, a simple modification to the QRA algorithm has been implemented that translates the covariates at each time $t$ in the regression model, such that their values at the start of the forecast window, $t_0$, match the corresponding training predictions. For cases where the discrepancy is large (e.g. Fig. 4), the reduction in bias of shifted QRA (SQRA) relative to QRA is striking (see Fig. 5).
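The SQRA shift can be sketched as a simple translation of each model's forecast-window covariates; this is an assumed form consistent with the description above (function and argument names are ours):

```python
def shift_covariates(forecast_quantiles, training_value_at_t0, forecast_value_at_t0):
    """Translate a model's forecast-window covariates by the discrepancy between
    its training prediction and its forecast at the window start t0 (SQRA).

    Applying the same additive shift to every covariate preserves the learnt
    regression relationship while removing the discontinuity at t0.
    """
    shift = training_value_at_t0 - forecast_value_at_t0
    return [q + shift for q in forecast_quantiles]
```

For example, if a model's training prediction at $t_0$ was 8 but its new forecast there is 10, all of its forecast covariates are shifted down by 2 before the QRA regression is applied.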




Table 4 shows the average interval scores for each combination algorithm for forecasts delivered on the 18th May 2020. For hospital_inc, hospital_prev and icu_prev, the SQRA algorithm has considerably lower average interval scores than any other algorithm. The cases where QRA slightly outperforms SQRA on average appear to correspond to training predictions drawn from multiple deliveries over which the evolving models' forecasts are particularly volatile. Dstl is investigating how to choose the magnitude of the shift sensibly in these situations.