Every year seasonal influenza costs hospitals time and resources, and causes significant loss of life, especially among patients with cardiac disease, prior respiratory illness, or allergies, as well as children and the elderly [1, 2, 3, 4]. During the season's peak, hospitals admit patients beyond capacity, and in severe cases for greater lengths of stay, postponing lower priority but still important surgeries [5, 6, 7]. The total economic burden of influenza outbreaks (including medical costs, lost earnings, and loss of life) in the US alone is estimated to exceed $85 billion annually.
Accurate forecasts of influenza can support the public health response. Public health officials at the national, regional, and state levels benefit from the advance warning forecasts can provide by preparing hospitals for increased disease burden, shifting vaccines to places in need, and alerting the public. Recognizing the value of accurate forecasting, the US Centers for Disease Control and Prevention (CDC) has organized forecasting challenges around seasonal influenza outbreaks [11, 12, 13]. These challenges focus on generating weekly forecasts of influenza-like illness (ILI) rates at the national and regional levels in the US. ILI is a syndromic classification, defined as a fever plus an additional symptom such as a cough or sore throat. Public health officials regard ILI surveillance data as an important indicator of the trends and severity of a given influenza season.
The CDC influenza forecasting challenges, nicknamed FluSight, have spurred methodological development in the area of influenza forecasting and infectious disease forecasting more broadly. This effort has led to multiple research groups developing models trained on historical ILI data to forecast future incidence across the US. These models include statistical and mechanistic models, among other approaches [15, 16, 17, 18].
Decision-makers have requested a single unified forecast, and several groups have developed multi-model ensemble forecasts of influenza. These models synthesize predictions from a set of different component models and have shown better predictive ability compared to any single model [19, 20, 21].
Multi-model ensembles aggregate diverse component model predictions into a final ‘meta-prediction’ that is often, but not always, more robust than any single component model [22, 23]. This robustness comes from the ensemble’s mixture of component models: freed from having to make a single set of assumptions about the data, the ensemble never depends on just one model and its associated limitations [22, 23]. In contrast to ensemble modeling approaches like bagging [24, 25] and boosting [26, 25], which create a library of similar component models, multi-model ensembles combine distinct component models, possibly designed and implemented in completely different modeling paradigms. Application examples include weather, economics, energy production, education, and infectious disease.
Extending an existing ensemble implementation, we developed a new method for combining component models that relies on recently observed in-season data to adaptively estimate a convex combination of models, sometimes referred to as a “linear pool”. Though past work has studied methods for training ensembles from a fixed data set [31, 32, 33] and for continuously adapting ensembles to streaming data [34, 35, 36], a new challenge in our setting is the reliance on noisy, revision-prone observations like weekly ILI surveillance data. Recent ILI observations, because they come from a real-time public health surveillance system, are subject to revisions, which are occasionally substantial and can negatively impact forecast accuracy [37, 38]. Like past work in ensemble forecasting, this work combines a fixed set of probabilistic forecasting models provided by different forecasting teams and does not presume the capability to refit component model parameters.
To protect the adaptive multi-model ensemble framework from relying too heavily on recent, revision-prone ILI data, we developed a Bayesian model that uses a prior to regularize ensemble weights. Previous results in this field show that equally-weighted ensembles often perform very well [13, 21], and to align with these results we chose a uniform Dirichlet prior that shrinks model weights towards equal. Our prior is also time dependent: unlike a typical model, where the prior becomes less influential with increasing data, our prior is designed to exert a constant influence on the final model weights, so that the weights never depend only on revision-prone data. We show our Bayesian framework can be solved with a variational inference algorithm (deVI-MM), and that an existing Expectation-Maximization approach to ensemble weight estimation is a special case.
We compared ensembles with equal model weights (EW) and those with possibly unequal but static, unchanging weights throughout a season (static), against our adaptive ensemble (adaptive) which updated component model weights every week. Static ensembles were trained on all data before the beginning of each season and were unable to modify their model weights in-season. In contrast, adaptive ensembles were only trained on within-season data but could modify their weights as each additional week of ILI data was reported. Comparing static and equal to adaptive ensembles, we highlight: (i) the adaptive model’s ability to buffer against sparse and revision-prone data by using a prior over ensemble weights and (ii) similar, and sometimes improved, performance when comparing adaptive to static models, where adaptive models require substantially less data to train.
The manuscript is structured as follows: Section 2 describes the CDC influenza training data and component models included in all ensembles, defines performance metrics, and develops the methodology for our adaptive ensemble. Section 3 presents results, investigating the prior’s impact on adaptive ensemble performance and compares the adaptive ensemble to static and equally weighted ensembles. Section 4 relates this work to linear pooling algorithms, and discusses our approach in the broader context of data-driven, evidence-based public health decision making.
2.1.1 Influenza data
Every week throughout the season, for different reporting regions (HHS1, HHS2, …, HHS10) and a national average, the CDC publishes surveillance data on influenza-like illness (ILI). ILI is defined as the percentage of patients presenting with a fever (greater than 100°F) plus a cough or sore throat. Outpatient clinics across the US report ILI percentages to the CDC as part of the ILINet surveillance system. In addition to the newest ILI data, updates to past ILI percentages are also published weekly.
Because of reporting delays, ILI percentages undergo revisions every week after they are first reported, and are finalized only several weeks after the season ends. Revisions make predicting future ILI difficult: component model forecasts need to update week to week both because (i) the newest ILI data is reported and (ii) all previous ILI values may have been revised. Advanced forecasting models try to adjust for this revision process and the uncertainty in reported data, typically outperforming models that do not account for data revisions.
2.1.2 Forecasting data
The FluSight Network (FSN) is a collaborative group of influenza forecasters that uses the historical performance of models to build ensemble forecasts. Teams trained their models and generated forecasts as if each model had been applied in real time across all seasons since 2010/2011. The resulting collection of forecasts serves as a historical record of how each model would have performed in real time.
Probabilistic forecasts for ‘component’ models were submitted to the FSN. Most forecasts were submitted retrospectively, as if they had been generated in real time while contending with ILI data revisions; models from teams KoT and CU were submitted in real time.
For the CDC FluSight challenges, component forecast models, and the ensembles built from them, forecast seven key targets (Figure 1). These targets are the 1, 2, 3, and 4 week-ahead ILI percentages, the onset week (defined as the first week of at least three consecutive weeks at or above the CDC-specified baseline ILI percentage), the season peak week (the epidemic week with the highest ILI percentage), and the season peak percentage (the highest ILI percentage throughout the season). The CDC cannot determine with certitude the start of the influenza season and when the peak occurs until mid-summer, due to continued revisions to data. This means final performance data on forecasts—the data used for building ensembles—for the seasonal targets (onset week, season peak week, and season peak percentage) cannot be compiled until after the season closes. Therefore, only component model forecasts of the week-ahead targets are used for training from the FluSight Network.
The difficulty in estimating ensemble weights comes from ILI data revisions. An adaptive ensemble must compute weights based on component model performance—the probability component model forecasts place on true ILI values. But past ILI data is revised weekly, so the best estimate of each component model's performance changes as past ILI data is revised. An adaptive ensemble must therefore account for ILI revisions, because they change past component model performance from week to week.
Seasons from 2011/2012 up to 2017/2018 will be used to compare equal, static, and adaptive ensembles, holding out the 2010/2011 season as a training season to build a prior for the adaptive ensemble.
2.1.3 Performance metrics
A useful predictive model must be well-calibrated and accurate, consistently placing high probability on the potential ILI percentages that, after all revisions, become the final ILI percentage. We score forecasts with the logscore [40, 38], a proper scoring rule widely accepted in the forecasting community, defined as the log of the predicted probability placed on the future true (or finally revised) ILI value. The CDC has also adopted a modified logscore as its standard scoring metric. Given a model with probability density function f and true ILI value y*, both the logscore and modified logscore can be written

\[ \text{logscore}(f, y^{*}) = \log \int_{y^{*}-\delta}^{\,y^{*}+\delta} f(y)\,dy, \]

where the traditional logscore is associated with a window of δ = 0 (in the binned setting described below, the log probability assigned to the single true bin). The modified logscore (δ = 0.5) rewards predictions that place high probability mass on ILI values within 0.5 percentage points of the truth. Technically, this modified score is no longer proper, but in practice it may make little difference when evaluating models, and it has the advantage of not penalizing models too heavily (placing no probability on the true ILI would otherwise result in a logscore of −∞). Forecasts for the 1, 2, 3, and 4 week-ahead targets are given a window of ±0.5 absolute percentage points, the CDC-accepted standard window. Both the proper (δ = 0) and modified (δ = 0.5) logscores were used to compare ensembles; proper logscores are included in the main text and modified logscores in the supplement (Suppl. F).
In practice, we bin observed ILI and predictive probabilities from 0% to 13% in increments of 0.1, plus a single bin from 13% to 100%. For the proper score, computing the logscore reduces to taking the log of the probability assigned to the single true ILI bin; in the CDC modified case, we sum the probability assigned to the true ILI bin plus the 5 bins to its left and the 5 bins to its right before taking the log. Models that place high probability on the bins containing the observed ILI score closer to 0 (a perfect score), while models that place low probability on observed ILI bins score closer to −∞ (the worst score). In practice, logscores are truncated at −10, following the CDC FluSight scoring convention.
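As a concrete sketch of this scoring rule, the short Python function below computes the binned proper (window = 0) and CDC-modified (window = 5 bins, i.e. ±0.5 percentage points) logscores with the −10 truncation. The function name and the toy uniform forecast are illustrative, not part of the FluSight codebase.

```python
import numpy as np

def binned_log_score(forecast_probs, bin_edges, observed_ili, window=0):
    """Log score for a binned probabilistic forecast.

    forecast_probs: probability assigned to each bin (sums to 1)
    bin_edges: left edges of the ILI bins
    observed_ili: finally revised ILI percentage
    window: number of neighboring bins on each side to include
            (0 = proper log score, 5 = CDC modified score with 0.1-wide bins)
    """
    i = np.searchsorted(bin_edges, observed_ili, side="right") - 1
    lo, hi = max(i - window, 0), min(i + window, len(forecast_probs) - 1)
    score = np.log(forecast_probs[lo:hi + 1].sum())
    return max(score, -10.0)  # truncate, following CDC FluSight convention

# ILI bins: 0.0-13.0 by 0.1, plus a single bin for 13-100
edges = np.append(np.arange(0, 13.0, 0.1), 13.0)
probs = np.full(len(edges), 1 / len(edges))  # toy uniform forecast
```

With the 131 bins above, a uniform forecast earns log(1/131) under the proper score and log(11/131) under the modified score, illustrating how the window softens the penalty.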
2.2 Ensemble Framework
2.2.1 Ensemble model specification
Static and adaptive ensemble algorithms use a mixture model to combine (or average) component model forecasts. We developed two approaches to finding ensemble weights: a maximum likelihood method and a maximum a posteriori Bayesian approach. The Bayesian approach lets us regularize the log likelihood via a prior, discouraging erratic shifts in ensemble weights as we move, week by week, through an influenza season.
The model assumes our data—the ILI percentage y_t at time t—is generated by first sampling a component model k with probability π_k, and second, sampling the ILI percentage from that component model's probability mass function f_k. (In practice, the submitted probabilistic forecasts are probability mass functions defined over discrete bins of possible ILI incidence, often discretized versions of empirical probability density functions.) While a mixture model is likely not the true generating process for ILI data, this framework conveniently expresses the probability distribution over possible ILI values as a convex combination of component models, and is the standard for probabilistic model averaging in the field [21, 20, 39]. We define the probability of observing an ILI percentage y_t at time t as

\[ f(y_t \mid \pi) = \sum_{k=1}^{K} \pi_k f_k(y_t), \]

requiring that the π_k's are non-negative and sum to one.
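The convex combination above can be sketched in a few lines of Python; `linear_pool` and the toy two-model forecasts are hypothetical names and numbers for illustration.

```python
import numpy as np

def linear_pool(weights, component_probs):
    """Ensemble probability over ILI bins as a convex combination.

    weights: length-K vector, non-negative, sums to 1
    component_probs: K x B matrix; row k is model k's binned forecast
    """
    weights = np.asarray(weights)
    assert np.all(weights >= 0) and np.isclose(weights.sum(), 1.0)
    return weights @ np.asarray(component_probs)

# toy example: K=2 models over B=3 ILI bins (hypothetical numbers)
f = [[0.7, 0.2, 0.1],
     [0.1, 0.3, 0.6]]
pooled = linear_pool([0.5, 0.5], f)  # approx. [0.4, 0.25, 0.35]
```

Because the weights are non-negative and sum to one, the pooled forecast is itself a valid probability distribution over the bins.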
2.2.2 Expectation-Maximization method for weight estimation
Though a convenient way to specify the model, the above probability makes computing the weights difficult: we must optimize the loglikelihood over all time points,

\[ \log p(\mathbf{y} \mid \pi) = \sum_{t=1}^{T} \log \sum_{k=1}^{K} \pi_k f_k(y_t), \tag{4} \]

a sum of log-sums, where y = (y_1, …, y_T) is the vector of observed ILI percentages. An alternative and equivalent formulation [42, 43] introduces a set of hidden indicator variables z_t—one for each week of data—that record which component model (from K possible models) was selected to generate data point y_t. The hidden variables simplify the likelihood to

\[ p(\mathbf{y}, Z \mid \pi) = \prod_{t=1}^{T} \prod_{k=1}^{K} \left[ \pi_k f_k(y_t) \right]^{z_{t,k}}, \]

where z_{t,k} equals 1 when model k generated the t-th ILI observation and 0 otherwise, and Z is the T × K matrix of all these hidden variables. The Expectation-Maximization (EM) algorithm [44, 38], by iteratively optimizing a lower bound on (4), the likelihood averaged over our hidden variables, can find the most likely model weights given the data (deEM-MM, Fig. 2.2.5 left); the prefix de (degeneracy) refers to this algorithm's inability to change parameters inside component models, and the suffix MM (mixture model) references the assumed data generating process.
2.2.3 Variational Inference for weight estimation
While the deEM-MM algorithm finds an optimal set of weights, it considers π a fixed vector of parameters. Treating π instead as a random variable with the complete likelihood in (4), we model the posterior probability of π by multiplying the complete loglikelihood with a prior distribution, p(π), over ensemble weights. While many different choices are available for a prior over ensemble weights, the Dirichlet distribution is a natural choice and conjugate to the likelihood. We further decided on a Dirichlet prior with a single parameter (α) shared across all models. The scale of the shared parameter governs the influence of the prior distribution on the posterior: a larger value of α shrinks ensemble weights closer towards an equally weighted ensemble—a reasonable first choice in many ensemble settings—while smaller values allow the model to rely more heavily on the observed data.
Typically, the chosen prior is considered fixed. However, in our setting, where the amount of data is growing and recent observations will be revised as we move through a season, we allow the prior to be time-dependent. This acts to continually weaken the influence of the data on the ensemble weights, allowing for the prior to exert a consistent pull towards equal-weighting across the entire season.
We specify the prior as

\[ \pi \sim \text{Dirichlet}(\alpha_t, \ldots, \alpha_t), \]

where α_t is the parameter (the same across all component models) that defines the Dirichlet distribution at time t. To regularize the likelihood at a constant rate, we set α_t to a constant fraction (c) of the number of observations available for fitting at time t, divided equally amongst all K component models. Defining the number of observations at time t as N_t, we defined our prior parameter as

\[ \alpha_t = \frac{c\,N_t}{K}, \]

for c between 0 and 1. Throughout the season, the prior parameter grows linearly as the ensemble trains on an increasing number of observations.
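A minimal sketch of this time-dependent prior parameter, assuming the naming used here (a fraction c of the N_t available observations, split over K models); the function name and the K = 20 component count are illustrative.

```python
def dirichlet_prior_param(c, n_obs, K):
    """Shared Dirichlet parameter alpha_t = c * N_t / K.

    c:     prior strength as a fraction of the data (e.g. 0.08)
    n_obs: number of ILI observations available at the current week
    K:     number of component models (20 is a hypothetical count)
    """
    return c * n_obs / K

# alpha_t grows linearly with the data, so the prior's relative
# influence stays constant across the season
alphas = [dirichlet_prior_param(0.08, n, 20) for n in (10, 20, 30)]
```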
With the prior defined as above, the posterior at time t equals

\[ p(\pi \mid \mathbf{y}) \propto p(\mathbf{y} \mid \pi)\, p(\pi). \]

Finding an optimal set of mixture weights then reduces to optimizing the following function of π and y,

\[ \log p(\pi \mid \mathbf{y}) = \sum_{t=1}^{T} \log \sum_{k=1}^{K} \pi_k f_k(y_t) + \sum_{k=1}^{K} (\alpha_t - 1) \log \pi_k + \text{const}. \tag{8} \]

We note that the first term on the right-hand side of equation (8) is the same as the frequentist expression for the likelihood shown in equation (4). Intuitively, we expect a stronger (increasing α_t) even prior to encourage weights across components to move towards 1/K, uniformly distributing probability mass to every component model. But the log-likelihood does not readily show how the prior influences the mixture weights.
Our variational inference algorithm (deVI-MM) estimates the posterior probability over the mixture weights by iteratively maximizing a lower bound, similar to the EM algorithm. Both VI and EM can be motivated by introducing a distribution over hidden variables and decomposing the loglikelihood into two parts: an expectation over a complete loglikelihood and an error term incurred by assuming the hidden variables take a specific probabilistic form. Attempting to maximize our loglikelihood, given adjustable weights π and fixed data y, we first factor the complete likelihood

\[ p(\mathbf{y}, Z \mid \pi) = p(Z \mid \mathbf{y}, \pi)\, p(\mathbf{y} \mid \pi). \]

Reordering terms and taking the log,

\[ \log p(\mathbf{y} \mid \pi) = \log \frac{p(\mathbf{y}, Z \mid \pi)}{p(Z \mid \mathbf{y}, \pi)}, \]

the marginal loglikelihood is expressed in terms of the complete loglikelihood (numerator) and the distribution over hidden variables (denominator). Not necessarily knowing the probability distribution over Z, we multiply and divide by an approximate distribution q(Z),

\[ \log p(\mathbf{y} \mid \pi) = \log \left( \frac{p(\mathbf{y}, Z \mid \pi)}{q(Z)} \cdot \frac{q(Z)}{p(Z \mid \mathbf{y}, \pi)} \right), \]

and integrate out Z by taking the expected value with respect to q(Z),

\[ \log p(\mathbf{y} \mid \pi) = \mathbb{E}_{q}\!\left[ \log \frac{p(\mathbf{y}, Z \mid \pi)}{q(Z)} \right] + \text{KL}\!\left( q(Z) \,\big\|\, p(Z \mid \mathbf{y}, \pi) \right), \tag{9} \]

where KL is the Kullback-Leibler divergence, a non-negative function of q. The right-hand side of (9), an expectation plus error term, therefore lower bounds the loglikelihood when we keep only the expectation:

\[ \log p(\mathbf{y} \mid \pi) \ge \mathbb{E}_{q}\!\left[ \log \frac{p(\mathbf{y}, Z \mid \pi)}{q(Z)} \right]. \tag{10} \]

Iteratively optimizing this lower bound can be shown to monotonically improve the loglikelihood, converging on the best possible representation (q and π) of the complete loglikelihood.
From this decomposition, the EM and VI algorithms diverge: the EM algorithm chooses q(Z) = p(Z | y, π), zeroing out the above Kullback-Leibler divergence, but assumes p(Z | y, π) can be computed (see prior work for theoretical development and for an application to infectious disease forecasting). The VI algorithm is more general, allowing any choice of q. The mean-field approximation, the most common choice for q, factors the distribution into discrete pieces [46, 47] and is the approach we take. We chose to factor our distribution over hidden variables into a distribution over Z and a separate distribution over π, q(Z, π) = q(Z) q(π), dividing indicator variables from mixture weights.
This choice yields (see details in Suppl. A) a q(π) that follows a Dirichlet distribution and indicator variables that follow Bernoulli distributions:

\[ q(\pi) = \text{Dirichlet}(a_1, \ldots, a_K), \qquad a_k = \alpha_t + \sum_{s=1}^{T} r_{s,k}, \qquad q(z_{t,k} = 1) = r_{t,k}, \]

where the responsibility r_{t,k}, or the probability that model k generated datapoint y_t, equals

\[ r_{t,k} = \frac{\tilde{\pi}_k f_k(y_t)}{\sum_{j=1}^{K} \tilde{\pi}_j f_j(y_t)}, \]

and \( \tilde{\pi}_k \) equals \( \exp\{\psi(a_k) - \psi(\sum_j a_j)\} \), where ψ is the digamma function. From r_{t,k}, we see the hidden variables compute the probability that ILI value y_t was generated by model k.
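The responsibility update can be sketched as follows; the hand-rolled `digamma` series and the function names are illustrative stand-ins, not the authors' implementation.

```python
import math
import numpy as np

def digamma(x):
    # recurrence plus asymptotic series for the digamma function;
    # accurate enough for this illustration
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    return r + math.log(x) - 1/(2*x) - 1/(12*x**2) + 1/(120*x**4)

def vi_responsibilities(a, lik):
    """One VI E-step for the mixture weights.

    a:   current Dirichlet parameters of q(pi), length K
    lik: T x K matrix, lik[t, k] = f_k(y_t), model k's probability
         on the (currently) observed ILI value at week t
    """
    psi_sum = digamma(sum(a))
    pi_tilde = np.exp([digamma(ak) - psi_sum for ak in a])  # exp(E[log pi_k])
    r = pi_tilde * np.asarray(lik)            # unnormalized, T x K
    return r / r.sum(axis=1, keepdims=True)   # each row sums to 1
```

When the Dirichlet parameters are equal, the π̃'s are equal and the responsibilities depend only on the component likelihoods, matching the intuition that the prior alone favors no model.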
The variational approach here can also be interpreted as a generalization of the EM solution. Considering π a fixed parameter, the responsibilities reduce to

\[ r_{t,k} = \frac{\pi_k f_k(y_t)}{\sum_{j=1}^{K} \pi_j f_j(y_t)}, \]

the same responsibility function used in the EM algorithm. VI and EM will find a unique global maximum of the log-likelihood because the log-likelihood is concave in the weights, making this a convex optimization problem (see analysis in Suppl. C).
2.2.4 How the prior impacts ensemble weights
The prior can be recognized as a regularizer. With the posterior over π Dirichlet distributed, the maximum a posteriori (MAP) estimate for the k-th entry is computed directly by dividing the weight given to model k by the sum over all model weights,

\[ \hat{\pi}_k = \frac{\alpha_t + n_k}{K\alpha_t + N_t}, \qquad n_k = \sum_{s=1}^{T} r_{s,k}, \]

where N_t = Σ_k n_k is the total number of ILI values used for training at time t. Though useful for computation, an alternative characterization expresses the MAP as a convex combination of the prior and empirical percentage. Defining the prior percent as 1/K and the empirical percent as n_k/N_t, the MAP of π_k can be re-expressed as a convex combination of the prior plus empirical percent

\[ \hat{\pi}_k = \lambda \cdot \frac{1}{K} + (1-\lambda) \cdot \frac{n_k}{N_t}, \]

where λ must be chosen so that this convex combination equals the original MAP estimate, namely λ = Kα_t / (Kα_t + N_t). Plugging in our prior α_t = cN_t/K, the weight equals λ = c/(1+c), and so the MAP equals

\[ \hat{\pi}_k = \frac{c}{1+c} \cdot \frac{1}{K} + \frac{1}{1+c} \cdot \frac{n_k}{N_t}. \]

Strengthening the prior, in our case increasing c, shifts the MAP estimate away from the data-estimated empirical percentage and towards the prior percentage. The constant shift, despite increased training data, is a result of our time-dependent prior. With the prior percentage set equal to 1/K across all models in the ensemble, the prior weighting interpolates between an equally-weighted ensemble (ignoring past model performance) and a completely data-driven weighting.
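A small numeric check (Python, hypothetical counts) that the direct MAP form and the convex-combination form above agree, and that the prior pulls weights towards 1/K; the function names are illustrative.

```python
import numpy as np

def map_weights(c, n_k):
    """MAP ensemble weights as a convex combination of prior and data.

    c:   prior strength as a fraction of the data (e.g. 0.08)
    n_k: effective number of observations credited to each model
         (column sums of the responsibilities)
    """
    n_k = np.asarray(n_k, dtype=float)
    K, N = len(n_k), n_k.sum()
    lam = c / (1 + c)                      # prior weight, constant in N
    return lam / K + (1 - lam) * n_k / N

def map_weights_direct(c, n_k):
    """The same answer via the direct form (alpha_t + n_k) / (K*alpha_t + N)."""
    n_k = np.asarray(n_k, dtype=float)
    K, N = len(n_k), n_k.sum()
    alpha = c * N / K
    return (alpha + n_k) / (K * alpha + N)
```

Because λ = c/(1+c) does not involve N, the prior's pull toward equal weights stays constant no matter how much data accrues.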
2.2.5 deEM-MM and deVI-MM Algorithms
The striking algorithmic similarity between EM and VI is shown below in Fig. 2.2.5. Both algorithms rely on adding hidden variables to the loglikelihood and iteratively maximizing a lower bound. The key difference is how the EM and VI algorithms approximate the distribution over hidden variables. Comparing the two in detail, the majority of steps are the same, except for computing the responsibilities and updating the weights. When computing the responsibilities, both algorithms need the probability model k generated data point y_t, for all models 1 to K and all data points 1 to T. Where they differ, the EM algorithm only requires the past ensemble weights, while the VI algorithm requires an expectation involving the ensemble weights, π̃ (see Suppl. B for details on calculating π̃). Another difference is in evaluating model fit: the EM algorithm computes the loglikelihood, the log of (4), while the VI algorithm computes a related quantity called the Evidence Lower Bound (ELBO), defined as

\[ \text{ELBO} = \log p(\mathbf{y} \mid \pi) - \text{KL}\big( q \,\big\|\, p(Z, \pi \mid \mathbf{y}) \big), \]

or the loglikelihood minus the error of our approximation over hidden variables. When updating the ensemble weights, both algorithms sum over each hidden variable's probability that data point t belongs to model k, but the VI algorithm adds an additional term, the prior, to each ensemble weight.
deEM-MM (Expectation-Maximization):
1: input: component forecast probabilities f, ILI data y, tolerance tol
2: output: ensemble weights π
3-5: initialize π = (1/K, …, 1/K); LL[0] = −∞
6: for j = 1:maxIters do
7:   compute f_k(y_t) for all models k and data points t
8:   compute responsibilities r from π and f
9:   w = rowSum(r)
10:  π = w / sum(w)
11:  LL[j] = computeLL(π, y)
12:  if LL[j] − LL[j−1] < tol then
13:    break
14:  end if
15: end for
16: return π

deVI-MM (Variational Inference):
1: input: component forecast probabilities f, ILI data y, tolerance tol, prior α_t
2: output: ensemble weights π
3-5: initialize π̃ = (1/K, …, 1/K); ELBO[0] = −∞
6: for j = 1:maxIters do
7:   compute f_k(y_t) for all models k and data points t
8:   compute responsibilities r from π̃ and f
9:   w = rowSum(r)
10:  a = w + α_t
11:  ELBO[j] = computeELBO(a, y)
12:  if ELBO[j] − ELBO[j−1] < tol then
13:    break
14:  end if
15: end for
16: return π = a / sum(a)
Figure: An Expectation-Maximization (left) and Variational Inference (right) algorithm estimating ensemble weights for a frequentist and a Bayesian adaptive ensemble. Although the two algorithms have different underlying probability models, they follow similar steps. The major differences between the EM and VI algorithms are: how the responsibilities are computed (step 8), the prior over the weights (step 10), and how model fit is evaluated (steps 11 and 12).
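The two loops can be condensed into a runnable Python sketch. This is not the authors' implementation: for simplicity, the posterior mean of q(π) stands in for the digamma-based π̃, and the VI loop monitors weight change rather than the ELBO.

```python
import numpy as np

def de_em_mm(lik, tol=1e-8, max_iters=500):
    """deEM-MM sketch: EM estimate of ensemble weights (no prior).

    lik[t, k] = probability model k placed on the observed ILI at week t.
    """
    T, K = lik.shape
    pi = np.full(K, 1.0 / K)
    ll_old = -np.inf
    for _ in range(max_iters):
        r = pi * lik
        r /= r.sum(axis=1, keepdims=True)      # E-step: responsibilities
        pi = r.sum(axis=0) / T                 # M-step: new weights
        ll = np.log(lik @ pi).sum()            # loglikelihood, eq. (4)
        if ll - ll_old < tol:
            break
        ll_old = ll
    return pi

def de_vi_mm(lik, alpha, tol=1e-8, max_iters=500):
    """deVI-MM sketch: adds a shared Dirichlet prior alpha to the
    weight update; posterior-mean weights replace the digamma-based
    pi-tilde, and convergence is checked on the weights."""
    T, K = lik.shape
    pi = np.full(K, 1.0 / K)
    for _ in range(max_iters):
        r = pi * lik
        r /= r.sum(axis=1, keepdims=True)
        a = alpha + r.sum(axis=0)              # Dirichlet params of q(pi)
        pi_new = a / a.sum()
        if np.abs(pi_new - pi).max() < tol:
            pi = pi_new
            break
        pi = pi_new
    return pi
```

On toy data where one model is consistently sharper, the EM weights collapse onto that model, while the prior in the VI version keeps some weight on the weaker model.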
2.3 Experimental design
Five ensemble models will be analyzed in detail (Table 1). The equally-weighted (EW) ensemble weights all component models the same. The static ensemble trains on all past component model performance data and assigns weights to component models before the next season; weights are kept fixed throughout that season. The adaptive model will be studied with three levels of regularization: close to 0% regularization (under-regularized), ‘optimal’ regularization, and ‘over’ regularization. We consider our adaptive model ‘optimally’ regularized by computing logscore statistics for the 2010/2011 season and choosing the prior with the highest average logscore.
|Equally-weighted (EW)||Assigns a weight of 1/K to every component model|
|Static with no regularization (static)||Weights are trained on all previous seasons and kept fixed throughout the target season|
|Adaptive with no regularization (adaptive, under-regularized)||Weights are trained within season, change week to week, and have close to a 0% prior regularization|
|Adaptive with optimal regularization (adaptive, optimal)||Weights are trained within season, change week to week, and have an optimal prior calculated from the 2010/2011 season|
|Adaptive with over-regularization (adaptive, over-regularized)||Weights are trained within season, change week to week, and have a prior greater than the optimal|
2.3.1 Training and scoring component models for ensembles
Component models were scored on unrevised ILI data. We created a record of the component model scores as if they had been scored in real time every week throughout each season; log scores from all past weeks were updated each week based on the data as reported at that time. This means training data for adaptive ensembles fluctuated every week as component model performance was re-evaluated against revised ILI values. Ensemble log scores were calculated on ILI percentages from EW28, the “final” week used by the CDC FluSight challenges; ILI values at or after EW28 for each season are considered definitive.
Based on previous work, and due to the relatively smaller amount of data available to the adaptive ensemble, we chose to fit a ‘constant weight’ ensemble model for both the static and adaptive approaches. This formulation assigns a single weight to each component model; the weight does not vary by region or target.
2.3.2 Fitted ensemble models
We computed adaptive ensembles for prior values from 1% to 100% in 1% increments for the 2010/2011 season. To determine a prespecified “optimal prior”, we chose the prior corresponding to the highest average logscore. The 2010/2011 season was then removed from all formal comparisons between ensembles; the adaptive ensemble corresponding to the optimal prior was used for formal comparisons.
For each season in 2010/2011 through 2017/2018, we computed the under-regularized, optimally regularized, and over-regularized adaptive ensembles. Static and equally-weighted ensembles were also fit at the beginning of each season, using only data from prior seasons.
2.3.3 Formal Comparisons
The quantity of interest is the difference between the logscores assigned to two ensembles. A random intercept model describes this difference, averaged over epidemic weeks and paired by season, region, and target:

\[ Y_{w,s,r,t} = \beta_0 + b_s + b_r + b_t + \varepsilon_{w,s,r,t}, \]

where Y_{w,s,r,t} is the difference between logscores assigned to two ensembles for epidemic week w, in season s, region r, and for target t. The fixed intercept is denoted β_0, and the season (b_s), region (b_r), and target (b_t) effects are Normally distributed with mean zero and corresponding variances.
We will fit two random effects models: one comparing the adaptive vs. equally-weighted ensemble, and the second comparing the adaptive vs. static ensemble. The conditional mean, 95% confidence interval, and corresponding p-value will be reported. An additional bootstrapped p-value is also reported. Random samples (with replacement) are selected from the set of all season-region-target-epidemic week tuples. For every random sample, a random effects model is fit and conditional means are collected for seasons, regions, and targets. The set of bootstrap estimates, first centered, is compared to the original dataset's conditional mean: we report the empirical probability that a randomly sampled conditional mean exceeds the original one.
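A simplified sketch of this bootstrap procedure (Python); a plain mean stands in for the random-effects conditional mean, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_pvalue(diffs, n_boot=2000):
    """One-sided bootstrap p-value for a mean logscore difference.

    diffs: one logscore difference per (season, region, target, week)
           tuple. Resamples tuples with replacement, centers the
           bootstrap means, and reports the fraction at or above the
           observed mean.
    """
    diffs = np.asarray(diffs, dtype=float)
    observed = diffs.mean()
    boot = np.array([rng.choice(diffs, size=len(diffs)).mean()
                     for _ in range(n_boot)])
    boot -= boot.mean()                 # center the bootstrap distribution
    return (boot >= observed).mean()
```

A consistently positive difference yields a p-value near zero, while a difference centered at zero yields a p-value near one half.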
3.1 Choosing prior to maximize performance
We chose an optimal prior for the adaptive model by fitting an adaptive ensemble to the 2010/2011 season for priors from 0% to 100% in 1% increments, and selecting the prior with the highest average logscore. The average logscore peaks at a prior of 8% (Fig. 2), and we consider the adaptive model with an 8% prior our optimally regularized adaptive ensemble. Beyond a prior of 8%, the logscore sharply decreases. This decrease in performance suggests ensemble weights are over-regularized, and we chose a 20% prior as our over-regularized adaptive ensemble. Finally, a near-0% prior was chosen as our under-regularized adaptive ensemble.
3.2 Prior successfully regularizes ensemble weights
We compared the under-regularized, optimally regularized, and over-regularized adaptive ensembles (c = 0.01, 0.08, and 0.20, respectively) to investigate how the ensemble weights change with the prior. Adding a prior regularizes component model weights (Fig. 3): smaller priors yield higher variability in ensemble weights throughout any given season. For example, in 2017/2018 (Fig. 3), this is especially evident for the under-regularized ensemble, when ensemble weights vacillate between assigning almost all weight to one component model or another. The optimally and over-regularized weights, in comparison, do not show as much variability over time. Component model weights for all three ensembles do track one another, with specific model weights moving up and down in unison, albeit in a more muted fashion for the stronger 8% and 20% priors. The patterns shown in Fig. 3 persist in other seasons as well (see Suppl. D).
3.3 Comparing adaptive vs. equally-weighted and static ensembles
Adaptive ensembles consistently outperform equally-weighted ensembles, and show comparable performance to static ensembles (Fig. 4). The optimally regularized adaptive model has higher logscores than both the under- and over-regularized adaptive models, and than the EW model. The over-regularized adaptive model, unlike the under-regularized one, always outperforms the EW model, indicating it is better to over- than under-regularize, at least to some degree. The static model—trained on multiple seasons of data—performs best. Adaptivity improves over assigning equal weights to component models, but cannot outperform the data-rich static model.
Formal comparisons (see Fig. 5 and regression table in Suppl. E) demonstrate the adaptive ensemble has statistically higher logscores compared to the EW ensemble, and performs similarly to the static ensemble. The adaptive ensemble does perform worse than the static model for particular seasons, regions, and targets. EW and adaptive ensembles perform similarly for the 2012/2013 season. The 2015/2016, 2016/2017, and 2017/2018 seasons do show better (but not significant) performance for the static ensemble, suggesting that the static model's performance comes from training on multiple seasons of data. Differences in performance are less variable when stratified by region (see Fig. 5B). Differences in logscores decrease as forecasting horizons increase; this reflects the difficulty of forecasting further into the future rather than ensemble performance.
In select strata the adaptive ensemble shows a statistically significantly higher logscore compared to the EW ensemble, and a close-to-significant difference compared to the static ensemble. This is not enough to conclude the adaptive ensemble should outperform the static ensemble in most cases; the data suggest, unsurprisingly, that the static ensemble performs relatively better the more data it has to train on.
3.4 The adaptive ensemble’s ability to learn seasonal variation
The adaptive model has the opportunity to outperform the static model by tailoring component model weights within a season. Training weights on component model performance data throughout the season is useful, as shown by the adaptive model outperforming the EW model. But the adaptive ensemble must accrue several weeks of training data before it can perform similarly to, and in some cases better than, the static ensemble (Fig. 6). The timing of this increase in adaptive performance highlights two key points: (i) the adaptive ensemble is learning seasonal variation, and (ii) the static model's better performance early in the season suggests some continuity between component model weights at the end of one season and the beginning of the next.
However, the adaptive models are also able, without any prior data, to learn an optimal weighting within a season. While the adaptive models may be penalized early in a season by not possessing historical knowledge of which models have performed well, they adjust by learning which models are performing well in the current season. This is illustrated clearly in Fig. 3, where the most heavily weighted model at the end of the season (with 46.2% of the weight for the adaptive ensemble) was assigned only 4.82% of the weight at the beginning of the season. This ability to adapt to within-season dynamics appears to give the adaptive model a slight advantage over the static model in the middle and end of the season.
We developed a new algorithm for assigning weights to component models. Our model extends a previous ensemble algorithm by assuming component model weights are random variables rather than fixed quantities, reassigning new component model weights every week. But influenza data is prone to change within a season. For settings with sparse and possibly revisable data, we introduced a time-dependent uniform Dirichlet prior that regularizes ensemble weights. The resulting adaptive ensemble outperforms an equally-weighted ensemble, and performs similarly to a static ensemble that requires multiple years of data to perform well.
We expected that learning from training data, and the amount of data available, would contribute to better predictive performance. The equally-weighted model ignores any training data and, like an adaptive ensemble with no training data, assigns equal weights; it performs worst. The next best model, the adaptive ensemble, trains on component model performance within the same season. Even though the static model has multiple seasons of training data available, and does have the best performance, the adaptive ensemble is not far behind. It is difficult to tell whether this similar performance is a shortcoming of the static or the adaptive ensemble. It could be that component model performance from past seasons (available to the static ensemble) does not generalize to future seasons. Alternatively, the adaptive ensemble may not be using the within-season training data efficiently, leaving behind hidden patterns in the component model data. We did, however, find static weights outperforming the adaptive ensemble early in the season. These results suggest a model could use the static ensemble weights, when available, as a prior for the adaptive ensemble.
Regularizing ensemble weights increased adaptive ensemble performance. Regularization can improve not only the adaptive model's performance but the static model's too (Suppl. 10). Improving performance by regularizing ensemble weights with a time-dependent uniform Dirichlet prior is not unique to an adaptive ensemble; it may be applicable to any ensemble weighting scheme.
Though our empirical work suggests an optimal prior near 8%, we expect the optimal prior to vary by season, region, and target. Different priors will likely be optimal for different biological phenomena too. A more flexible prior could improve adaptive ensemble performance: changing the prior percent throughout the season, allowing the prior to depend on performance data from past seasons, or modeling the prior on regional and target information. Rather than limiting the prior to a regularizing role, we could use it to incorporate outside information from past seasons, or patterns in influenza data the ensemble cannot learn from training data. Future research will explore different methods of modeling prior ensemble weights.
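As a concrete sketch of how a symmetric Dirichlet prior regularizes stacking weights, the fragment below fits mixture weights by a MAP variant of EM. The function name, the `densities` input, and the update rule are illustrative assumptions, not the paper's exact deVI-MM procedure; larger `alpha0` pulls the weights toward equal.

```python
import numpy as np

def fit_weights(densities, alpha0=1.0, n_iter=200):
    """MAP-EM for stacking weights pi maximizing sum_t log(sum_m pi_m f_{t,m}),
    regularized by a symmetric Dirichlet(alpha0) prior.

    densities: (T, M) array where f_{t,m} is component model m's predictive
    density at the observed ILI value for week t (a sketch, not the paper's code).
    """
    T, M = densities.shape
    pi = np.full(M, 1.0 / M)  # start from equal weights
    for _ in range(n_iter):
        # E-step: responsibilities r_{t,m} = pi_m f_{t,m} / sum_k pi_k f_{t,k}
        r = pi * densities
        r /= r.sum(axis=1, keepdims=True)
        # M-step: MAP update; alpha0 = 1 recovers the maximum-likelihood weights
        pi = r.sum(axis=0) + alpha0 - 1.0
        pi /= pi.sum()
    return pi
```

With `alpha0 = 1` this reduces to maximum-likelihood stacking; increasing `alpha0` plays the role the "prior percent" plays in the paper, shrinking the fitted weights toward the equally-weighted ensemble.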
Our work here connects to Gneiting and others' work on linear pooling [48, 49, 50, 51]. Our weights are optimized with respect to the log-likelihood of a generative model, but the log-likelihood is really a different representation of a logscore: the log of the pooled predictive probability implied by the weights.
Unlike Gneiting, we did not recalibrate the combined predictive distributions, and methods like the beta-transformed linear pool or shifted linear pool could improve predictive performance even further.
This paper has several limitations that future work can address: (i) a better choice of prior, (ii) accounting for correlated component models, and (iii) post-processing or ‘recalibrating’ our combined forecast. Our adaptive ensemble examined how a prior can impact ensemble performance, but we only explored a uniform prior. Future work could study informative priors and how to manipulate priors during the course of a season to improve ensemble performance. Our model also assumed component models made independent forecasts; a more advanced ensemble would examine the correlation structure of the data (region, target) and of the component model forecasts. Finally, our ensemble model focused on an optimal set of weights for forecasting, and made no effort to recalibrate the combined forecast.
From a public health perspective, officials have been moving away from stand-alone subjective decision making in favor of incorporating data-driven analyses into decisions. This trend has become particularly apparent in infectious disease outbreak management. Adaptive ensembles, by providing real-time forecasts of infectious disease, support this trend towards evidence-based decision making for public health. Combining their expertise with statistical models like our adaptive ensemble, public health officials can work towards impeding outbreaks and shaping public health policy. Public health officials often find themselves making decisions in the middle of crises, times when fast, effective forecasting is necessary. The adaptive ensemble's flexibility, reliance only on near-term data, and capacity to track unusual disease trends can support accelerated public health decision making.
Adaptive ensembles, reweighting component models week by week throughout a season, can handle sparse data scenarios, admit new component models season by season, and show predictive performance similar to the more data-hungry static ensemble. This added flexibility makes them an important tool for real-time, accurate forecasting.
5 Supplementary material
The data and code used to train the equal, static, and adaptive ensembles can be found at https://github.com/tomcm39/adaptively_stacking_ensembles_for_infleunza_forecasting_with_incomplete_data. Influenza-like illness data, component model forecasts, and ensemble performance metrics are published on Harvard's Dataverse; links to the data are provided on the GitHub page above.
Appendix A Computing q(π) and q(z) for the Degenerate Variational Mixture model (deVI-MM)
The functional forms of q(π) and q(z) are computed here. Readers interested in more details and theoretical background should consult [47, 46, 42, 43]. In particular, [46] gives a brief introduction to variational inference aimed at the applied statistician, while [42, 43] provide more theoretical details. The most detailed material can be found in [47].
We find the q(π) and q(z) that maximize the lower bound by taking advantage of our factored variational distribution and using iterated expectations.
Maximizing the lower bound with respect to q(π), we can take the iterated expectation
The Kullback-Leibler divergence, taking values from 0 to ∞, is minimized when
Maximizing with respect to q(z) follows a similar pattern. The optimal hidden distributions of π and z can then be computed.
We first compute q(π) by expanding the complete log-likelihood, taking the expectation over z, and recognizing this expectation as a specific distribution. For convenience, we do not explicitly write out the dependence on the data.
where B(α) is the normalizing constant for the Dirichlet distribution and the expected value of the indicator variable z_{t,m} is the probability that model m generated the ILI value at time t. Studying the form of q(π), we recognize it is Dirichlet distributed.
The same procedure can also be applied to compute q(z):
and we recognize each indicator is Bernoulli distributed; with the additional constraint that all indicators must sum to one for every time period t, we see q(z_t) is multinomial for every t.
Although burdensome upfront, this approximate procedure drastically reduces computational time compared to more intensive Monte Carlo sampling techniques, and gives a good approximate answer, assuming only that π and z are independent of one another. Also note this mixture model algorithm (both EM and VI), unlike a typical Gaussian mixture model, cannot manipulate the parameters that control the component model distributions. But this inability to access the component model parameters opens up greater opportunities for other forecasters to submit models, requiring every forecast model to supply just 1, 2, 3, and 4 week ahead forecasts.
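The two coordinate updates above can be sketched in a few lines. This is an illustrative reading of the deVI-MM updates, not the paper's exact code: the component densities are held fixed, as the degenerate setup requires, and the symbol names (`alpha`, `r`) are assumptions.

```python
import numpy as np
from scipy.special import digamma

def devi_mm_sketch(densities, alpha0=1.0, n_iter=100):
    """Coordinate ascent for q(pi) = Dirichlet(alpha) and q(z_t) = Multinomial(r_t).

    densities: (T, M) array of fixed component predictive densities f_{t,m}
    evaluated at the observed ILI values (a sketch under stated assumptions).
    """
    T, M = densities.shape
    alpha = np.full(M, alpha0)
    for _ in range(n_iter):
        # E[log pi_m] = psi(alpha_m) - psi(sum_k alpha_k)  (see Appendix B)
        elog_pi = digamma(alpha) - digamma(alpha.sum())
        # Update q(z): responsibilities r_{t,m} proportional to f_{t,m} exp(E[log pi_m])
        log_r = np.log(densities) + elog_pi
        r = np.exp(log_r - log_r.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # Update q(pi): Dirichlet parameters absorb the expected counts
        alpha = alpha0 + r.sum(axis=0)
    return alpha / alpha.sum()  # posterior-mean ensemble weights
```

Because the component densities never change, each iteration only touches the responsibilities and the Dirichlet parameters, which is what makes the weekly reweighting cheap.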
Appendix B Computing E[log π]
Variational inference over hidden variables requires computing the expected value of the log of the π's, a natural consequence of writing the Dirichlet distribution in exponential form. Exponential form rewrites a probability distribution in the form
where A(η) is a normalizing constant. Two facts about exponential form lead to an analytic formula for E[log π]: taking the gradient of A(η), and relating the set of sufficient statistics to this expectation. Starting with the first fact: if A(η) is a normalizing constant, then it must equal
and its gradient must equal
A powerful consequence of exponential form: the gradient of the normalizing constant equals the expected value of the distribution's sufficient statistics. The second fact: working with the log-likelihood of a distribution in exponential form,
taking the gradient and setting it equal to zero,
The expected value of the distribution's sufficient statistics equals the gradient of A(η). If the Dirichlet's sufficient statistics take the form log π_m, then we only need to take the gradient of the normalizing constant to find an analytic expression.
Looking at the Dirichlet distribution, the log-likelihood equals
and the sufficient statistics for π take the form log π_m. Taking the gradient will then provide an analytic expression for the expected value of log π_m,
where ψ is the digamma function. The expected value of log π_m thus equals a difference of digamma functions: E[log π_m] = ψ(α_m) − ψ(Σ_k α_k).
This formula can be used to compute the responsibility function (see the corresponding Algorithm step).
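The digamma identity can be checked numerically. The sketch below compares the analytic formula against a Monte Carlo average over Dirichlet draws; the concentration values are arbitrary.

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)
alpha = np.array([2.0, 5.0, 3.0])  # arbitrary Dirichlet concentrations

# Analytic formula: E[log pi_m] = psi(alpha_m) - psi(sum_k alpha_k)
analytic = digamma(alpha) - digamma(alpha.sum())

# Monte Carlo estimate of E[log pi_m] from Dirichlet samples
samples = rng.dirichlet(alpha, size=200_000)
monte_carlo = np.log(samples).mean(axis=0)

# The two agree up to Monte Carlo error
gap = np.max(np.abs(analytic - monte_carlo))
```

The agreement is what makes the variational update cheap: a pair of digamma evaluations replaces an integral over the simplex.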
Appendix C Convexity analysis
The problem of finding an optimal set of mixture weights can be recast as a constrained optimization problem. Optimizing a convex function over a convex set proves a global optimum exists, and showing strict convexity proves a unique vector attains this optimum.
The original problem searches for weights that maximize a log-likelihood, where the weights are constrained to sum to one,
and can be converted to a Lagrangian with a single constraint
Knowing the solution must lie in the probability simplex, a convex set, we only need to show the function above (32) is convex. Given that (32) is at least twice differentiable, we appeal to a second-order condition for convexity: proving the Hessian is positive semidefinite. After proving the Hessian is positive semidefinite, going further and showing the Hessian is in fact positive definite will prove a unique vector is the global optimum of our objective function (32).
First, we compute the elements of the gradient of (32),
Then each entry of the Hessian is
Every entry of the Hessian is negative, and if we minimized the negative log-likelihood (instead of maximizing the positive log-likelihood) we would see the Hessian is a positive semidefinite matrix. But we can prove more by rewriting
where each element equals
A matrix that can be written as a transposed matrix times the matrix itself is positive semidefinite (and positive definite when that matrix has full column rank), so the Hessian of the negative log-likelihood must also be positive definite. Our convex optimization problem must then have a global optimum, attained by a unique weight vector. The inability to change the forecasting models is a limitation, but it allows us to guarantee a global optimum; when the component model parameters can also be updated, no such guarantee exists.
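The structure of the Hessian can be verified numerically. Under our reading of the derivation, the Hessian of the log-likelihood is minus the cross-product of a scaled density matrix; the sketch below (random densities, equal weights, all names illustrative) checks that its eigenvalues are nonpositive, i.e. the negative log-likelihood has a positive semidefinite Hessian.

```python
import numpy as np

rng = np.random.default_rng(1)
T, M = 30, 4
F = rng.uniform(0.01, 1.0, size=(T, M))  # f_{t,m}: component densities at week t
pi = np.full(M, 1.0 / M)                 # current mixture weights

# Scaled matrix with entries f_{t,j} / sum_m pi_m f_{t,m}
F_tilde = F / (F @ pi)[:, None]

# Hessian of the loglikelihood: H_{jk} = -sum_t F_tilde[t,j] * F_tilde[t,k]
H = -F_tilde.T @ F_tilde

# Eigenvalues of H are nonpositive, so -H is positive semidefinite
eigs = np.linalg.eigvalsh(H)
```

Any cross-product matrix of this form is positive semidefinite by construction, which is what guarantees the optimization has no spurious local optima.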
Appendix D Simplex plots of all seasons
Regularizing the adaptive ensemble suppresses large swings in weight assignment, independent of season. The regularized adaptive ensemble stays closer to equal weighting throughout the season than the less-regularized adaptive ensembles. There is no strong relationship, if one exists at all, between the adaptive and static ensemble weightings. Some seasons (2014/2015 and 2016/2017) show higher variability in weight assignment than others. The optimal weight assignment given data from all previous seasons (Static MLE) does not carry over to the optimal weight assignment for the adaptive ensemble at the end of the current season.
Appendix E Tabulated comparisons of adaptive, static, and equal logscores
The adaptive ensemble consistently outperforms the EW ensemble and performs similarly to the static ensemble. When compared to the EW model, the adaptive ensemble does not reach statistical significance in the 2012/2013 season, but all differences are in the positive direction, favoring the adaptive model. The adaptive and static comparisons are more even. Some seasons, regions, and targets favor the static ensemble, others the adaptive model, but the absolute differences are small. Despite markedly less data, the adaptive ensemble bests the EW ensemble and shows performance similar to the static.
| Season / Region / Target | Adaptive − EW (95% CI) | p | adj. p | Adaptive − Static (95% CI) | p | adj. p |
|---|---|---|---|---|---|---|
| 2011/2012 | 0.13 (0.08, 0.19) | 0.01 | | 0.02 (-0.04, 0.08) | 0.51 | 0.58 |
| 2012/2013 | 0.06 (0.00, 0.12) | 0.04 | 0.21 | 0.02 (-0.04, 0.08) | 0.48 | 0.52 |
| 2013/2014 | 0.10 (0.04, 0.15) | 0.01 | 0.04 | 0.00 (-0.06, 0.06) | 0.96 | 0.97 |
| 2014/2015 | 0.14 (0.09, 0.20) | 0.01 | | 0.03 (-0.03, 0.09) | 0.30 | 0.36 |
| 2015/2016 | 0.13 (0.07, 0.18) | 0.01 | 0.01 | -0.02 (-0.08, 0.04) | 0.54 | 1.00 |
| 2016/2017 | 0.11 (0.06, 0.17) | 0.01 | 0.01 | -0.04 (-0.10, 0.02) | 0.20 | 1.00 |
| 2017/2018 | 0.21 (0.15, 0.26) | 0.01 | 0.00 | -0.01 (-0.06, 0.05) | 0.86 | 1.00 |
| HHS1 | 0.16 (0.11, 0.22) | 0.01 | | 0.01 (-0.05, 0.07) | 0.80 | 0.83 |
| HHS2 | 0.11 (0.06, 0.17) | 0.01 | | -0.01 (-0.07, 0.06) | 0.85 | 1.00 |
| HHS3 | 0.13 (0.08, 0.18) | 0.01 | | 0.04 (-0.03, 0.10) | 0.28 | 0.41 |
| HHS4 | 0.12 (0.07, 0.17) | 0.01 | | -0.01 (-0.08, 0.05) | 0.68 | 1.00 |
| HHS5 | 0.13 (0.07, 0.18) | 0.01 | | -0.05 (-0.11, 0.01) | 0.12 | 1.00 |
| HHS6 | 0.12 (0.07, 0.18) | 0.01 | 0.001 | -0.04 (-0.11, 0.02) | 0.17 | 1.00 |
| HHS7 | 0.13 (0.08, 0.18) | 0.01 | | 0.05 (-0.01, 0.12) | 0.09 | 0.20 |
| HHS8 | 0.14 (0.09, 0.20) | 0.01 | | 0.00 (-0.06, 0.06) | 0.99 | 1.00 |
| HHS9 | 0.08 (0.02, 0.13) | 0.01 | 0.01 | 0.02 (-0.04, 0.08) | 0.55 | 0.60 |
| HHS10 | 0.12 (0.07, 0.17) | 0.01 | | -0.04 (-0.11, 0.02) | 0.19 | 1.00 |
| Nat | 0.13 (0.07, 0.18) | 0.01 | | 0.06 (-0.01, 0.12) | 0.08 | 0.17 |
| 1 week ahead | 0.16 (0.11, 0.22) | 0.01 | | 0.06 (0.00, 0.11) | 0.07 | 0.18 |
| 2 week ahead | 0.13 (0.08, 0.19) | 0.01 | | -0.01 (-0.07, 0.05) | 0.81 | 1.00 |
| 3 week ahead | 0.11 (0.06, 0.17) | 0.01 | | -0.01 (-0.07, 0.05) | 0.67 | 1.00 |
| 4 week ahead | 0.09 (0.03, 0.14) | 0.01 | 0.02 | -0.03 (-0.09, 0.03) | 0.35 | 1.00 |
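For reference, comparisons like those tabulated above can be sketched as a paired mean difference in logscores with a normal-approximation confidence interval. This is an illustrative computation only; the function name and the plain z-interval are assumptions, and the paper's exact inference and multiplicity adjustment may differ.

```python
import numpy as np

def paired_logscore_diff(scores_a, scores_b, z=1.96):
    """Mean difference in logscores between two ensembles, with an
    approximate 95% CI from the paired weekly differences."""
    d = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    mean = d.mean()
    half = z * d.std(ddof=1) / np.sqrt(d.size)
    return mean, (mean - half, mean + half)
```

A positive mean favors the first ensemble; an interval excluding zero corresponds to the significant entries in the table.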
Appendix F Comparing ensembles with modified-logscore
While the logscore is a proper scoring rule, the CDC has implemented a ‘modified’ logscore that integrates a model's predictive density from 0.5 percentage points below to 0.5 percentage points above the true ILI value, corresponding to the interval half-width in equation 1. All formal comparisons made with the traditional logscore were re-computed with the modified logscore, allowing the CDC and other practitioners who use the modified logscore to evaluate the adaptive ensemble results.
Though the quantities change, the qualitative picture remains the same: the adaptive ensemble outperforms the EW ensemble and performs similarly to the static ensemble.
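A minimal sketch of the modified logscore, assuming CDC-style 0.1-percentage-point ILI bins (the bin convention and function name are assumptions): sum the predictive probability over all bins within ±0.5 percentage points of the truth, then take the log.

```python
import numpy as np

def modified_logscore(bin_starts, probs, truth, window=0.5):
    """Log of the total predictive probability assigned to bins whose
    start falls within +/- window percentage points of the observed ILI."""
    bin_starts = np.asarray(bin_starts, dtype=float)
    probs = np.asarray(probs, dtype=float)
    mask = np.abs(bin_starts - truth) <= window + 1e-9  # tolerate fp noise
    return float(np.log(probs[mask].sum()))
```

With a uniform forecast over fifty 0.1-wide bins spanning 0-5% ILI and a truth of 2.0%, the score collects eleven bins of probability rather than one, which is why modified logscores are systematically less harsh than traditional ones.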
Appendix G Regularization improves static ensemble
Regularizing ensemble weights improves both adaptive and static ensemble performance (Fig. 10). The static ensemble achieves peak performance with a smaller prior than the adaptive ensemble. This smaller prior reflects the data the static model is trained on: past seasons of finalized ILI data. The adaptive model finds peak performance at a larger prior, which compensates for the lack of training data and the revision-prone state of the data mid-season.
-  Centers for Disease Control, Prevention, et al. Prevention and control of seasonal influenza with vaccines. recommendations of the advisory committee on immunization practices (acip), 2009. MMWR Early release, 58(Early release):1–54, 2009.
-  Colin A Russell, Terry C Jones, Ian G Barr, Nancy J Cox, Rebecca J Garten, Vicky Gregory, Ian D Gust, Alan W Hampson, Alan J Hay, Aeron C Hurt, et al. The global circulation of seasonal influenza a (h3n2) viruses. Science, 320(5874):340–346, 2008.
-  Scott A Harper, John S Bradley, Janet A Englund, Thomas M File, Stefan Gravenstein, Frederick G Hayden, Allison J McGeer, Kathleen M Neuzil, Andrew T Pavia, Michael L Tapper, et al. Seasonal influenza in adults and children—diagnosis, treatment, chemoprophylaxis, and institutional outbreak management: clinical practice guidelines of the infectious diseases society of america. Clinical infectious diseases, pages 1003–1032, 2009.
-  Rebecca Garten, Lenee Blanton, Anwar Isa Abd Elal, Noreen Alabi, John Barnes, Matthew Biggerstaff, Lynnette Brammer, Alicia P Budd, Erin Burns, Charisse N Cummings, et al. Update: Influenza activity in the united states during the 2017–18 season and composition of the 2018–19 influenza vaccine. Morbidity and Mortality Weekly Report, 67(22):634, 2018.
-  R Krumkamp, M Kretzschmar, JW Rudge, A Ahmad, P Hanvoravongchai, J Westenhoefer, M Stein, W Putthasri, and R Coker. Health service resource needs for pandemic influenza in developing countries: a linked transmission dynamics, interventions and resource demand model. Epidemiology & Infection, 139(1):59–67, 2011.
-  Dena L Schanzer and Brian Schwartz. Impact of seasonal and pandemic influenza on emergency department visits, 2003–2010, ontario, canada. Academic Emergency Medicine, 20(4):388–397, 2013.
-  Noelle-Angelique M Molinari, Ismael R Ortega-Sanchez, Mark L Messonnier, William W Thompson, Pascale M Wortley, Eric Weintraub, and Carolyn B Bridges. The annual impact of seasonal influenza in the us: measuring disease burden and costs. Vaccine, 25(27):5086–5096, 2007.
-  Deoine Reed and Sandra A Kemmerly. Infection control and prevention: a review of hospital-acquired infections and the economic implications. The Ochsner Journal, 9(1):27–31, 2009.
-  Timothy C Germann, Kai Kadau, Ira M Longini, and Catherine A Macken. Mitigation strategies for pandemic influenza in the united states. Proceedings of the National Academy of Sciences, 103(15):5935–5940, 2006.
-  Elaine Vaughan and Timothy Tinker. Effective health risk communication about pandemic influenza for vulnerable populations. American Journal of Public Health, 99(S2):S324–S332, 2009.
-  Matthew Biggerstaff, David Alper, Mark Dredze, Spencer Fox, Isaac Chun-Hai Fung, Kyle S Hickmann, Bryan Lewis, Roni Rosenfeld, Jeffrey Shaman, Ming-Hsiang Tsou, et al. Results from the centers for disease control and prevention’s predict the 2013–2014 influenza season challenge. BMC infectious diseases, 16(1):357, 2016.
-  Matthew Biggerstaff, Michael Johansson, David Alper, Logan C Brooks, Prithwish Chakraborty, David C Farrow, Sangwon Hyun, Sasikiran Kandula, Craig McGowan, Naren Ramakrishnan, et al. Results from the second year of a collaborative effort to forecast influenza seasons in the united states. Epidemics, 24:26–33, 2018.
-  Craig J McGowan, Matthew Biggerstaff, Michael Johansson, Karyn M Apfeldorf, Michal Ben-Nun, Logan Brooks, Matteo Convertino, Madhav Erraguntla, David C Farrow, John Freeze, et al. Collaborative efforts to forecast seasonal influenza in the united states, 2015–2016. Scientific reports, 9(1):683, 2019.
-  Matthew Biggerstaff, Krista Kniss, Daniel B Jernigan, Lynnette Brammer, Joseph Bresee, Shikha Garg, Erin Burns, and Carrie Reed. Systematic assessment of multiple routine and near real-time indicators to classify the severity of influenza seasons and pandemics in the united states, 2003–2004 through 2015–2016. American journal of epidemiology, 187(5):1040–1050, 2017.
-  Jeffrey Shaman and Alicia Karspeck. Forecasting seasonal outbreaks of influenza. Proceedings of the National Academy of Sciences, 109(50):20425–20430, 2012.
-  Logan C Brooks, David C Farrow, Sangwon Hyun, Ryan J Tibshirani, and Roni Rosenfeld. Flexible modeling of epidemics with an empirical bayes framework. PLoS computational biology, 11(8):e1004382, 2015.
-  Evan L Ray, Krzysztof Sakrejda, Stephen A Lauer, Michael A Johansson, and Nicholas G Reich. Infectious disease prediction with kernel conditional density estimation. Statistics in medicine, 36(30):4908–4929, 2017.
-  Dave Osthus, James Gattiker, Reid Priedhorsky, Sara Y Del Valle, et al. Dynamic bayesian influenza forecasting in the united states with hierarchical discrepancy. Bayesian Analysis, 2018.
-  Teresa K Yamana, Sasikiran Kandula, and Jeffrey Shaman. Individual versus superensemble forecasts of seasonal influenza outbreaks in the united states. PLoS computational biology, 13(11):e1005801, 2017.
-  Evan L Ray and Nicholas G Reich. Prediction of infectious disease epidemics via weighted density ensembles. PLoS computational biology, 14(2):e1005910, 2018.
-  Nicholas G Reich, Craig J McGowan, Teresa K Yamana, Abhinav Tushar, Evan L Ray, Dave Osthus, Sasikiran Kandula, Logan C Brooks, Willow Crawford-Crudell, Graham Casey Gibson, et al. A collaborative multi-model ensemble for real-time influenza season forecasting in the us. bioRxiv, page 566604, 2019.
-  Zhi-Hua Zhou. Ensemble methods: foundations and algorithms. Chapman and Hall/CRC, 2012.
-  Martin Sewell. Ensemble learning. RN, 11(02), 2008.
-  Leo Breiman. Bagging predictors. Machine learning, 24(2):123–140, 1996.
-  Thomas G Dietterich. Ensemble methods in machine learning. In International workshop on multiple classifier systems, pages 1–15. Springer, 2000.
-  Robert E Schapire. The boosting approach to machine learning: An overview. In Nonlinear estimation and classification, pages 149–171. Springer, 2003.
-  TN Krishnamurti, CM Kishtawal, Timothy E LaRow, David R Bachiochi, Zhan Zhang, C Eric Williford, Sulochana Gadgil, and Sajani Surendran. Improved weather and seasonal climate forecasts from multimodel superensemble. Science, 285(5433):1548–1550, 1999.
-  Anthony Garratt, James Mitchell, Shaun P Vahey, and Elizabeth C Wakerly. Real-time inflation forecast densities from ensemble phillips curves. The North American Journal of Economics and Finance, 22(1):77–87, 2011.
-  Marco Pierro, Francesco Bucci, Matteo De Felice, Enrico Maggioni, David Moser, Alessandro Perotto, Francesco Spada, and Cristina Cornaro. Multi-model ensemble for day ahead prediction of photovoltaic power generation. Solar energy, 134:132–146, 2016.
-  Olugbenga Wilson Adejo and Thomas Connolly. Predicting student academic performance using multi-model heterogeneous ensemble approach. Journal of Applied Research in Higher Education, 10(1):61–75, 2018.
-  Tilmann Gneiting and Adrian E Raftery. Weather forecasting with ensemble methods. Science, 310(5746):248–249, 2005.
-  Aoife M Foley, Paul G Leahy, Antonino Marvuglia, and Eamon J McKeogh. Current methods and advances in forecasting of wind power generation. Renewable Energy, 37(1):1–8, 2012.
-  Mark J Van der Laan, Eric C Polley, and Alan E Hubbard. Super learner. Statistical applications in genetics and molecular biology, 6(1), 2007.
-  R Pari, M Sandhya, and Sharmila Sankar. A multi-tier stacked ensemble algorithm to reduce the regret of incremental learning for streaming data. IEEE Access, 6:48726–48739, 2018.
-  Alan Fern and Robert Givan. Online ensemble learning: An empirical study. Machine Learning, 53(1-2):71–109, 2003.
-  David Benkeser, Cheng Ju, Sam Lendle, and Mark van der Laan. Online cross-validation-based ensemble learning. Statistics in medicine, 37(2):249–260, 2018.
-  Dave Osthus, Ashlynn R Daughton, and Reid Priedhorsky. Even a good influenza forecasting model can benefit from internet-based nowcasts, but those benefits are limited. PLoS computational biology, 15(2):e1006599, 2019.
-  Nicholas G Reich, Logan C Brooks, Spencer J Fox, Sasikiran Kandula, Craig J McGowan, Evan Moore, Dave Osthus, Evan L Ray, Abhinav Tushar, Teresa K Yamana, et al. A collaborative multiyear, multimodel assessment of seasonal influenza forecasting in the united states. Proceedings of the National Academy of Sciences, 116(8):3146–3154, 2019.
-  Logan C Brooks, David C Farrow, Sangwon Hyun, Ryan J Tibshirani, and Roni Rosenfeld. Nonmechanistic forecasts of seasonal influenza with iterative one-week-ahead distributions. PLoS computational biology, 14(6):e1006134, 2018.
-  Alexander Philip Dawid and Monica Musio. Theory and applications of proper scoring rules. Metron, 72(2):169–183, 2014.
-  Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007.
-  Christopher M Bishop. Pattern recognition and machine learning. springer, 2006.
-  K Murphy. Machine learning: a probabilistic approach. Massachusetts Institute of Technology, pages 1–21, 2012.
-  Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the royal statistical society. Series B (methodological), pages 1–38, 1977.
-  Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.
-  David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
-  Jagdish S Rustagi. Variational methods in statistics. In Variational methods in statistics. Academic Press, 1976.
-  Roopesh Ranjan and Tilmann Gneiting. Combining probability forecasts. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(1):71–91, 2010.
-  Tilmann Gneiting, Roopesh Ranjan, et al. Combining predictive distributions. Electronic Journal of Statistics, 7:1747–1782, 2013.
-  John Geweke and Gianni Amisano. Optimal prediction pools. Journal of Econometrics, 164(1):130–141, 2011.
-  Kenneth F Wallis. Combining forecasts–forty years later. Applied Financial Economics, 21(1-2):33–41, 2011.
-  William Kleiber, Adrian E Raftery, Jeffrey Baars, Tilmann Gneiting, Clifford F Mass, and Eric Grimit. Locally calibrated probabilistic temperature forecasting using geostatistical model averaging and local bayesian model averaging. Monthly Weather Review, 139(8):2630–2649, 2011.
-  Caitlin Rivers, Jean-Paul Chretien, Steven Riley, Julie A Pavlin, Alexandra Woodward, David Brett-Major, Irina Maljkovic Berry, Lindsay Morton, Richard G Jarman, Matthew Biggerstaff, Michael A Johansson, Nicholas G Reich, Diane Meyer, Michael R Snyder, and Simon Pollett. Using “outbreak science” to strengthen the use of models during epidemics. Nature Communications, 10(1):3102, 2019.