Seasonal influenza outbreaks pose public health challenges and cause a large morbidity and mortality burden worldwide. The United States Centers for Disease Control and Prevention (CDC) estimates there were 35.5 million cases of influenza, 490,600 influenza-related hospitalizations, and 34,200 deaths from influenza during the 2018–2019 influenza season in the US(cdc_burden_19). Influenza forecasting has become integral to public health decision making (lutz_2019). A forecasting model uses data to make projections of the future trajectory of an infectious disease target, such as cases, hospitalizations and deaths, and can provide uncertainty measures of its predictions. Thus, forecasting models are a powerful tool for public health officials to improve seasonal outbreak preparedness and response, which can in turn potentially reduce the burden of seasonal influenza. The CDC’s establishment of the Center for Forecasting and Outbreak Analytics in August of 2021 (cdc_2021_center) highlights a critical need to advance the use of infectious disease forecasting and modeling.
To provide public health officials real-time, prospective information about the future trajectory of seasonal influenza, the CDC, in collaboration with external researchers, started an annual prospective influenza forecasting exercise in the U.S., known as the FluSight challenge, in 2013. This exercise has been conducted with the goal of improving forecast accuracy and the integration of forecasts with real-time public health decision making. Multiple academic and non-academic groups submit weekly forecasts to the FluSight challenge. A submission typically contains probabilistic and point forecasts for seven targets in each of the 10 Health and Human Services (HHS) regions in the U.S. as well as at the national level. In this manuscript, we focus on the probabilistic forecasts, in which a predictive distribution is specified for the outcome of interest. All forecast targets are based on the weighted percentage of outpatient visits for influenza-like illness (wILI) collected through the U.S. Outpatient Influenza-like Illness Surveillance Network (ILINet), weighted by state populations.
Constructing a single ensemble forecast that combines the forecasts from multiple individual models has advantages. A ensemble forecast unifies signals from many models into a single forecast, making it easier for stakeholders to understand. In addition, ensemble forecasts have been shown to consistently achieve a high degree of accuracy and often outperform individual forecasts of infectious disease targets (yamana_2016; yamana2017; defelice2017; viboud_2018; ray2020; chowell2020; cramer_2021; oidtman_2021; ray_2022). A subset of teams participating in the FluSight challenge has produced a collaborative multi-model ensemble, the FluSight Network ensemble, using stacked generalization —in particular, the FluSight Network ensemble is calculated as a linear combination of the individual forecasts.
Despite the success of linear combination methods such as the one used to produce the FluSight Network ensemble, their forecasts lack calibration (hora2004; gneiting2013). gneiting2013 proved that the linear aggregation increases the dispersion of the combined predictive distribution and therefore may result in overdispersed ensemble forecasts even when the individual forecasts are well-calibrated. More generally, a simple linear combination of individual forecasts may produce miscalibrated ensemble forecasts unless their calibration is adjusted for.
Previous work has presented parametric and nonparametric approaches to combining and calibrating ensemble forecasts. The beta-transformed linear pool is a combination formula that calibrates the combined predictive distribution by overlaying the linear pool with a beta transformation (ranjan2010; gneiting2013). bassetti2018 proposed an extension to the beta-transformed linear pool, using a Bayesian nonparametric approach to estimate infinite beta mixture models. This method achieves a theoretically stronger result of probabilistic calibration compared to the beta-transformed linear pool by extending the flexibility of the combination function. kuleshov_21
introduced calibrated risk minimization as a principle that maximizes sharpness subject to calibration by adding calibration loss as a constraint in the loss function.rumack_21 presented a post-processing method called the recalibration ensemble that combines and calibrates forecasts in separate steps and applied this method to recalibrating epidemic forecasts.
In practice, there is merit in selecting parsimonious models and combination methods with computationally efficient estimation (claeskens2016; baran2018; stanescu2016). The optimal degree of flexibility and computational complexity of combination methods often vary for different applications. baran2018 compare the performance of multiple forecast combination methods and assesses the degree of flexibility combination methods needed to yield the best practical results for post-processing applications in forecasting wind speed and precipitation. In influenza probabilistic forecasting, ray2018_ens
study a range of weighing schemes with different levels of complexity in generating ensemble forecasts via the feature-weighted ensemble approach that combines aspects of linear pooling or stacking and gradient boosting. In both of these studies, the methods with an intermediate level of flexibility yielded better predictive performances in their respective applications.
This work aims to add to the growing field of infectious disease probabilistic forecasting by investigating the accuracy and probabilistic calibration of ensemble forecasts produced from combination methods that combine and calibrate simultaneously while not having any knowledge of the underlying model structure of the individual models or the ability to reproduce their forecasts in the U.S. seasonal influenza setting. Using 27 individual models from the FluSight network, we apply the linear pool, beta-transformed linear pool, and the finite beta mixture approach to combine predictive distributions. We also examine whether more parsimonious approaches to the aforementioned methods with fixed, equal individual model weights are sufficient to produce accurate and well-calibrated ensemble forecasts. We modify estimation approaches of the methods with beta transformation to accommodate the binned probability distribution representation used in the FluSight challenge(mcgowan2019).
Section 2 reviews the CDC influenza data, forecast targets, and forecast combination methods. Section 3 describes the application of the combination methods in seasonal influenza forecasting and presents results. Section 4 contains discussions of results in the context of related work, real-time forecasting operations, and data-driven public health decision making.
2.1 Influenza Data
The U.S. Outpatient Influenza-like Illness Surveillance Network (ILINet) publishes the weekly percentage of outpatient doctor’s office visits due to influenza-like illness weighted by state populations (wILI). ILINet is a syndromic surveillance system that includes more than 3,000 providers (cdc_2020). The CDC Influenza Division reports weekly estimates of wILI for the United States and for the 10 Health and Human Services (HHS) regions (Figure 1).
2.2 Forecast targets
Forecasts submitted to the CDC FluSight challenge typically consists of three seasonal targets and four short-term targets. We produce ensemble forecasts of short-term 1-4 week ahead wILI for all locations from the 2016/2017 to 2018/2019 influenza season in this study. We do not include forecasts of seasonal targets, such as the peak week, peak incidence, and seasonal onset, due to the lack of importance of probabilistic forecasts after those events have been observed in a particular season.
2.3 Forecast combination methods
Let and, from individual models. The combination methods described in this section include the linear pool as a baseline method and the beta-transformed linear pool and finite beta mixture combination as the methods that combine and calibrate forecasts.
2.3.1 Linear pool (LP and EW-LP)
The linear pool is a mixture model with a predictive density
where is a nonnegative weight for the individual model and . The equally weighted linear pool (EW-LP) is a special case of the LP with the weights fixed to .
2.3.2 Beta-transformed linear pool (BLP and EW-BLP)
gneiting2013 demonstrate that the LP produces forecasts that lack calibration when the component forecasts are well-calibrated and propose a flexible alternative approach, the beta-transformed linear pool (BLP), which has a predictive CDF defined by
denotes the CDF of the beta distribution with the parameters, is a nonnegative weight for the individual model weight and . To find the predictive PDF of the BLP we can differntiate the above CDF, finding
where is the PDF of the beta distribution. The LP is a special case of the BLP when . The equally weighted component variation of this method is the equally weighted beta-transformed linear pool (EW-BLP), which is a special case of the BLP with fixed weights . Figure 2 demonstrates how the BLP’s beta transformation operates on the LP’s predictive CDF.
2.3.3 Finite beta mixture combination ( and )
bassetti2018 use a Bayesian approach to extend the BLP to finite and infinite beta mixtures for combining and calibrating predictive distributions. baran2018 note the high computational costs of the estimating this approach. Due to the computational burden of this Bayesian approach, we choose to employ a frequentist approach to estimate a finite beta mixture model
where is the number of beta components, is a beta mixture weight for the beta component, denotes the CDF of the beta distribution with the parameters , and comprises the individual model weights specific to each beta component. Differentiating the CDF, the predictive density of the is
The equally weighted variation of the finite beta mixture combination approach () is a special case of the with With , the and the become the BLP and the EW-BLP, respectively.
2.4 Modification of the BLP and methods for combining discrete distributions
The predictive density functions of the BLP and the are given by the equation (3) and (5), respectively. However, the forecasts of 1-4 week ahead wILI, which is a continuous measure of disease incidence, are represented using a binned probability format in submissions to the FluSight challenge. Here we describe a modification to BLP and models to handle this discretized representation of the target variable.
Let denote a predictive CDF of a forecasting model, be the outcome variable, and be a collection of disjoint bins covering the set of possible outcomes for , with for 111The FluSight Challenge uses a slightly different binned probability format where the bin is defined as (mcgowan2019); this detail does not have a practical impact on the set up because the influenza-like-illness measure is continuous.. An individual forecast of consists of an assignment of probabilities to each of the bins:
In order to estimate the parameters of the , we modify the log-likelihood function, which is the log of equation (5), for a single observation that falls in bin to be
where is the probability assigned to bin by the ’s discretized predictive distribution, is the continuous predictive CDF of the , and are the predictive CDFs of a individual model , and is the probability assigned to bin by individual model ’s discretized predictive distribution.
Since the BLP is a special case of the where , the modified log-likelihood function of the BLP is the same as above with a single beta component term in the outer summation.
2.5 Data and code accessibility
All individual forecasts, data, and code used for conducting analyses presented in this manuscript are publicly available (flusight; code_git).
3 Application in seasonal influenza forecasting in the U.S.
We apply the combination methods introduced in Section 2 to prospective forecasts from 27 individual forecasting models (Table S1, Supplementary Material (Wattana22)) available in the FluSight Network repository (flusight) to generate weekly ensemble forecasts of 1-4 week ahead wILI for the United States and the 10 Health and Human Services (HHS) regions from the 2016/2017 to 2018/2019 influenza seasons.
3.1 Forecast evaluation
We follow the FluSight Challenge guidelines (cdc_1920) by using the logarithmic score or log score which is defined as the logarithm of the predictive density or mass function evaluated at the observed data point. The log score is a proper scoring rule that assesses the sharpness and calibration of probabilistic forecasts simultaneously (gneiting2007). In the FluSight challenge where forecasts are represented in a binned probability format, the log score is defined as
where is the observed value of the forecast target and and are the pre-specified lower and upper bounds of bin such that .
We generate forecasts from each combination method for all combinations of week, region, target, and season, and calculate their log scores. Following the CDC scoring convention (mcgowan2019), we truncate log scores to be no lower than . The benefit of this approach is that it enables us to average log scores for a method even when that method receives a log score of (assigning zero probability to an observed value) for any forecasts. However, this modified log score is formally no longer a proper score. Log scores are averaged across all forecast regions and weeks for each target and test season to get summary measures of accuracy for each method to compare their performance.
The calibration of a probabilistic forecast addresses the statistical consistency between the predictive distributions of forecasts and the observations. The concept of calibration allows us to assess whether a model produces reliable forecasts, i.e. whether an event that the model assigns a particular predicted probability really occurs that at that frequency in the long run. The forecast
is probabilistically calibrated if its probability integral transform (PIT) values are uniformly distributed on the unit interval(gneiting_calib_07). The probability integral transform, , is the probability obtained from evaluating the predictive CDF of a model at the observed value ()222Since the target variables are discretized in this application, we adapt Definition 2.5 in gneiting2007 by sampling a uniformly distributed PIT value between and where is the bin containing the observed value..
To assess the probabilistic calibration of the ensemble forecasts (i.e., the uniformity of the PIT values), we use the graphical tool called the probability plot, which plots the empirical CDF of the PIT values. Specifically, we compute the PIT values of all observations in the test seasons and plot their the empirical CDFs by target and season. The empirical CDF curve should follow a 45-degree line bisecting the plot if the forecasts are probabilistically calibrated. In the case where deviations from uniformity are observed, the shape of empirical CDF curve of the PIT values suggests the causes behind the lack of probabilistic calibration (laio2007). For example, PIT values concentrating near 0 and 1 indicates that the observed values fall on the tails of the predictive distribution of the forecasts more frequently than expected, i.e., the probability plot shows the slope steeper than 1 near the PIT values of 0 and 1, so that the predictive distributions were too narrow. To quantitatively measure the deviation of a PIT CDF curve from a standard uniform CDF, we compute the Cramer distance (gneiting2007; rizzo_16), , where is a PIT CDF curve and is a standard uniform CDF. The Cramer distance can be viewed as a summary measure of calibration, however, it lacks the diagnostic property of the probability plot.
3.2 Parameter estimation
In our application, the training data set consists of individual forecasts and influenza data from the 2010/2011 to 2018/2019 season. The test data set includes the 2016/2017 to 2018/2019 influenza seasons. When generating forecasts for a test set season, the training data set consists of all the influenza seasons preceding that season.
The parameters of each combination method are estimated simultaneously by maximizing the average log score, which is positively oriented (i.e., higher scores are better), via maximum likelihood estimation over a training data set. The parameters are chosen to be target-specific to allow for variations among targets. We update the parameter estimates for each test season, so there are 12 sets of parameters to be estimated for four targets and three test seasons. We modify the estimation approach for the BLP, EW-BLP, , and as outlined in the Methods section in order to apply the combination methods to individual forecasts in a binned probability representation.
3.2.1 Choice of for finite beta mixture combination approaches
We use a leave-one-season-out cross validation approach to select the number of beta components, , in the and for each target-test season pair. Specifically, we train the and using
through 5 on each subset of data in the training data with one season left out and use those ensemble fits to generate forecasts for the left out influenza season. Log scores for all combination methods are calculated for all unique forecasts, then averaged across all weeks, regions, and validation seasons to obtain a single mean validation log score for each target and method. In order to take model complexity into account, we calculate mean validation log scores across all locations for each validation season in training seasons, compute a standard error for each target-test season pair, and select the smallestfor and with mean validation log scores within 1 standard error of the best log score in a particular target-test season pair.
Based on the mean validation log scores in Table 1, is selected for the and for all four targets and three seasons. The variation with has the best mean validation log scores in every instance other than the 2-week ahead target in the 2018/2019 season. Overall, using a higher number of beta components in the finite beta mixture approaches does not substantially improve mean out-of-sample log scores in our application. Thus, the finite beta mixture methods with the most parsimonious number of parameters are selected.
3.3.1 Overall Summary
Based on mean out-of-sample log scores across all targets and seasons (Figure 3 panel (b)), the is the most accurate method, followed by the BLP and LP. Across all three test seasons, the outperformed the other five methods for 3 and 4 week ahead horizons, and performed as equally well as the BLP for the 2 week ahead horizon (Figure 3 panel (a)). The BLP is the best performing method for the 1 week ahead horizon. The is also the best performing method for the 2017/2018 and 2018/2019 season based on mean out-of-sample log scores across all four horizons, while the BLP is the best performing method for the 2016/2017 season. These results indicate that the BLP and can consistently improve the accuracy of ensemble forecasts compared to the other commonly used methods included in this study despite season-to-season and target variations.
The 1 week ahead forecasts from the BLP and methods are more probabilistically calibrated than other methods based on the probability plots and their Cramer distances from the standard uniform CDF. However, the forecasts produced from all beta-transformed methods became less calibrated as forecast horizons increased.
3.3.2 Comparison of combination methods’ accuracy
Across all targets and seasons, the has the best mean out-of-sample log score of , though it only marginally outperformed the BLP and LP, which have mean out-of-sample log scores of and , respectively (Figure 3 panel (b)). Across all three test seasons, the BLP has the best mean out-of-sample log scores for the 1 week ahead horizon (Figure 3 panel (a)). The , which is the most flexible method in this study, has the best mean out-of-sample log scores of and for 3 and 4 week ahead horizons, respectively. It also performed as well as the BLP, which also has the best mean out-of-sample log score of for the 2 week ahead horizon. Across all four target horizons, the BLP is the most accurate method for the 2016/2017 season, while the is the most accurate for the 2017/2018 and 2018/2019 season.
The individual observation-level log scores of the ensemble forecasts from the beta-transformed combination methods exhibit higher variation compared to those from the EW-LP and the LP for all targets and test seasons (Figure 4). This is in alignment with the theoretical result that the LP tends to produce overdispersed or wider forecasts, resulting in less extreme log scores. The performance of in the test seasons was less consistent compared to its superior performance across all targets and seasons in the training periods, which indicates of some degree of over-fitting; however, it was always among the top two methods in terms of out-of-sample log score across all targets and seasons (Figure 3). We also see notably higher variation in log scores across all methods for all targets in the 2017/2018 season, which was one of the most severe and longest flu seasons in the recent years (cdc1718). Section 2 in the Supplementary Material (Wattana22) provides detailed results of mean out-of-sample log scores aggregated and disaggregated at different levels.
3.3.3 Comparison of combination methods’ calibration
The empirical CDF curves of PIT values from probabilistically calibrated forecasts should follow the CDF of a standard uniform distribution, that is, a diagonal line between 0 and 1. To quantify the deviation from the CDF of a standard uniform distribution, we computed the Cramer distances between the CDF curves of PIT values and the CDF of a standard uniform distribution. Overall, the BLP and methods are more probabilistically calibrated than other methods for the 1 week ahead horizon, as their empirical CDF curves are less deviated from the reference line (Figure 5) and their Cramer distances are the lowest (Figure 6). However, the forecasts produced from all beta-transformed methods became less calibrated as forecast horizons increased. Across all horizons in the test period, except the 1 week ahead horizon, the empirical CDF curves of the PIT values from the forecasts from the were the most miscalibrated among all beta-transformed methods as indicated by its Cramer distances being the highest.
Based on the probability plots in Figure 5, all combination methods produced forecasts that lack probabilistic calibration in the test period. Recall that according to theory, the LP will produce ensemble forecasts with too wide predictive distributions when individual models are well-calibrated (gneiting2013). The results (Figure 5) in our application during the training period are consistent with this theory. Specifically, the forecasts from the LP and EW-LP tended to be to wide, i.e., more observed values concentrated near the center of predictive distributions than expected for a well-calibrated model as indicated by the slopes of PIT CDF curves being higher than 1 for intermediate PIT values and lower than 1 near 0 and 1 (Figure 5).
The beta transformed combination methods were successful at correcting the overdispersion of the LP and EW-LP. However, their forecasts are miscalibrated in the test period due to under-prediction (Figure 5 panel (b)). The empirical CDF curves of the PIT values from the BLP, EW-BLP, , and were below the reference line across all PIT values from the test period, indicating consistent under-prediction, i.e., the predictive distributions tended to be concentrated below the observed values.
Despite the under-prediction across horizons observed in both training and test periods, the beta-transformed combination methods probabilistic calibration was notably better in the training period, especially for 3 and 4 week ahead horizons (Figure 5 panel (a)). In addition, the calibration of the forecasts produced from the EW-BLP and was similar to that of the forecasts produced from the BLP and in the training period, but more miscalibrated in the test period, as their Cramer distances are notably higher than those of the BLP and in the test period. Similar to the calibration observed in the test period, The LP and EW-LP also produced wide forecasts in the training period, albeit with more miscalibration in the upper tails across all horizons.
The probability plots by target-season pairs (Figure S6, Supplemental Material (Wattana22)) show similar calibration results as in Figure 5 —the empirical CDF curves of the PIT values of forecasts produced from the LP methods in the test seasons also appear miscalibrated in the lower tail. The calibration of all methods by target-season pairs are discussed in more details in Section 3 in Supplemental Material (Wattana22).
3.3.4 Comparison of accuracy and calibration between combination methods with equally weighted and with optimally weighted individual models
The equally weighted variations of the combination methods (EW-LP, EW-BLP, and ), though more parsimonious, had sub-optimal forecast accuracy compared to their counterparts that assigned weights to the individual models in this application. The EW-LP was the worst overall method across all targets and seasons, and all equally weighted variations had worse mean out-of-sample log scores compared to their more complex counterparts (Figure 3). Additionally, the equally weighted ensembles generally had poorer calibration than the corresponding weighted variations in both the training period and the test period (Figures 5, 6).
As demonstrated in the forecasting literature, in many settings ensemble forecasts have consistent superior performance and give decision makers the ability to unify the strengths and diversity of individual models into one forecast. These particular advantages are of great importance in practice for infectious disease forecasting (defelice2017; reich2019; chowell2020; mcandrew2020; ray2020; ray_2022). This work aims to offer insight into forecast accuracy and calibration of parametric combination methods in which calibration and individual model weight estimation happen simultaneously in the application of seasonal influenza forecasting in the U.S.
We applied the linear pool, beta-transformed linear pool, and the finite beta mixture combination method to available forecasts in the FluSight challenge to produce ensemble forecasts for seasonal influenza in the U.S. retrospectively for three test seasons and compared their performance. Our results showed that two of the combination methods included in this study, the BLP and , offered consistently superior forecast accuracy relative to the LP and EW-LP. Either the BLP or the or both delivered better mean log scores across the test seasons compared to the other methods for all targets and across all targets for all seasons. Despite using different methods to create ensemble forecasts, our findings are in agreement with the findings in rumack_21 that combination methods that take into account forecast calibration improve accuracy of seasonal influenza ensemble forecasts in the U.S.
The uses twice as many parameters as the BLP, but only marginally outperformed it in two out of four targets and five out of twelve target-season pairs. Considering the large number of individual models in the FluSight challenge, the BLP may be more easily applicable in practice compared to the as it has half as many individual model weights and beta parameters to estimate. Although the LP under-performed relative to the BLP and for most targets and seasons, the differences in mean log scores were typically small. More parsimonious combination methods with fixed, equally weighted individual model weights, namely the EW-LP, EW-BLP and , appear to not be flexible enough to deliver superior performance compared to the other methods in this study. While this is the case for this application, combination methods using equal weights with or without the beta transformation could be useful in other applications where it might be difficult to estimate individual model weights when available models change over time or training data are limited (ray_2022).
The results on the probabilistic calibration of the ensemble forecasts measured by the uniformity of the PIT values are less straightforward. The results for the LP and EW-LP forecasts indicate that they had too wide predictive distributions across all targets. Despite the beta-transformed combination methods’ success at correcting the overdispersion of the ensemble forecasts from the LP and EW-LP, their forecasts exhibited a pattern of systematic under-prediction. This under-prediction is relatively more pronounced at longer forecast horizons, especially in the 2017/2018 and 2018/2019 influenza seasons. In the 2017/2018 season, which was a large influenza season in the U.S. (cdc1718), the ensemble forecasts from all combination methods under-predict to some extent. Note that an overdispersed forecaster, such as the LP and EW-LP, has the advantage of being more likely to capture an extreme season, though it may not be optimal overall based on proper scores. These more conservative methods may be desirable in applications in which stakeholders want to avoid missing a large season at the expense of having too wide forecasts.
Due to the flexibility of the beta-transformed combination methods and season-to-season variations of influenza, the lack of calibration could stem from overfitting to the training season data as the CDFs of PIT values in training seasons showed better calibration overall. In this study, the parameters of the beta-transformed combination methods were estimated by maximum likelihood, which is not equivalent to a measure of probabilistic calibration. If probabilistic calibration is of critical importance for the application at hand, other approaches such as post-hoc calibration techniques that directly target a calibrated forecast distribution may be appropriate.
Exploring different approaches to select training periods can be useful. For instance, baran2018 use rolling training periods in the application of a similar set of combination methods to wind speed and precipitation forecasting and rumack_21 constructs a training period that takes into account seasonality of epidemic forecasting. In addition, combination methods that require the joint estimation of all parameters, including the parameters in the individual models, may be considered in a setting where the underlying model structure of individual models are known. An example of one of these methods is the mixture EMOS model proposed by baran2016. Since the FluSight challenge provides forecasts only, in this work we selected combination methods that do not require reproduction of forecasts from individual models.
For combining forecasts in outbreak settings, the beta-transformed linear pool (BLP) is a promising alternative to standard linear pooling (LP) methods. Compared with LP, the BLP has only two additional parameters, and , and the simple modification of the log likelihood function makes the BLP applicable to combining forecasts in a binned probability format. The may add value in instances where the BLP is not flexible enough, though we only see marginal improvement in the mean out-of-sample log scores in our application.
As infectious disease forecasting has come to the forefront of the public health effort in formulating well-informed policies in response to outbreaks, it is critical to gain insight on model combination approaches in order to combine individual models’ strengths and to produce accurate ensemble forecasts. This study demonstrates an effort to improve our understanding of how forecast combination methods compare in a setting of an infectious disease with well-established surveillance data pipeline like seasonal influenza. This work has been supported by the National Institutes of General Medical Sciences (R35GM119582) and the US CDC (1U01IP001122). The content is solely the responsibility of the authors and does not necessarily represent the official views of NIGMS, the National Institutes of Health, or CDC. Individual Forecasting Model Information and Additional Results This supplement contains sections 1–3 including individual forecasting model information and additional results of the application.