Prediction of infectious disease epidemics via weighted density ensembles

by   Evan L. Ray, et al.
University of Massachusetts Amherst

Accurate and reliable predictions of infectious disease dynamics can be valuable to public health organizations that plan interventions to decrease or prevent disease transmission. A great variety of models have been developed for this task, using different model structures, covariates, and targets for prediction. Experience has shown that the performance of these models varies; some tend to do better or worse in different seasons or at different points within a season. Ensemble methods combine multiple models to obtain a single prediction that leverages the strengths of each model. We considered a range of ensemble methods that each form a predictive density for a target of interest as a weighted sum of the predictive densities from component models. In the simplest case, equal weight is assigned to each component model; in the most complex case, the weights vary with the region, prediction target, week of the season when the predictions are made, a measure of component model uncertainty, and recent observations of disease incidence. We applied these methods to predict measures of influenza season timing and severity in the United States, both at the national and regional levels, using three component models. We trained the models on retrospective predictions from 14 seasons (1997/1998 - 2010/2011) and evaluated each model's prospective, out-of-sample performance in the five subsequent influenza seasons. In this test phase, the ensemble methods showed overall performance that was similar to the best of the component models, but offered more consistent performance across seasons than the component models. Ensemble methods offer the potential to deliver more reliable predictions to public health decision makers.



There are no comments yet.


page 19

page 21


Adaptively stacking ensembles for influenza forecasting with incomplete data

Seasonal influenza infects between 10 and 50 million people in the Unite...

Time Series Methods and Ensemble Models to Nowcast Dengue at the State Level in Brazil

Predicting an infectious disease can help reduce its impact by advising ...

Comparing statistical methods to predict leptospirosis incidence using hydro-climatic covariables

Leptospiroris, the infectious disease caused by the spirochete bacteria ...

An Ensemble Classifier for Predicting the Onset of Type II Diabetes

Prediction of disease onset from patient survey and lifestyle data is qu...

Comparison of Combination Methods to Create Calibrated Ensemble Forecasts for Seasonal Influenza in the U.S

The characteristics of influenza seasons varies substantially from year ...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.


Accurate and reliable predictions of infectious disease dynamics can be valuable to public health organizations that plan interventions to decrease or prevent disease transmission. A great variety of models have been developed for this task, using different model structures, covariates, and targets for prediction. Experience has shown that the performance of these models varies; some tend to do better or worse in different seasons or at different points within a season. Ensemble methods combine multiple models to obtain a single prediction that leverages the strengths of each model. We considered a range of ensemble methods that each form a predictive density for a target of interest as a weighted sum of the predictive densities from component models. In the simplest case, equal weight is assigned to each component model; in the most complex case, the weights vary with the region, prediction target, week of the season when the predictions are made, a measure of component model uncertainty, and recent observations of disease incidence. We applied these methods to predict measures of influenza season timing and severity in the United States, both at the national and regional levels, using three component models. We trained the models on retrospective predictions from 14 seasons (1997/1998 - 2010/2011) and evaluated each model’s prospective, out-of-sample performance in the five subsequent influenza seasons. In this test phase, the ensemble methods showed overall performance that was similar to the best of the component models, but offered more consistent performance across seasons than the component models. Ensemble methods offer the potential to deliver more reliable predictions to public health decision makers.


The practice of combining predictions from different models has been used for decades by climatologists and geophysical scientists. These methods have subsequently been adapted and extended by statisticians and computer scientists in diverse areas of scientific inquiry. In recent years, these “ensemble” forecasting approaches frequently have been among the top methods used in prediction challenges across a wide range of applications.

Ensembles are a natural choice for noisy, complex, and interdependent systems that evolve over time. In these settings, no one model is likely to be able to capture and predict the full set of complex relationships that drive future observations from a particular system of interest. Instead “specialist” or “component” models can be relied on to capture distinct features or signals from a system and, when combined, represent a nearly complete range of possible outcomes. In this work, we develop and compare a collection of ensemble methods for combining predictive densities. This enables us to quantify the improvement in predictions achieved by using ensemble methods with varying levels of complexity.

To illustrate these ensemble methods, we present time-series forecasts for infectious disease, specifically for influenza in the United States. The international significance of emerging epidemic threats in recent decades has highlighted the importance of understanding and being able to predict infectious disease dynamics. With the revolution in science driven by the promise of “big” and real-time data, there is an increased focus on and hope for using statistics to inform public health policy and decision-making in ways that could mitigate the impact of future outbreaks. Some of the largest public health agencies in the world, including the US Centers for Disease Control and Prevention (CDC) have openly endorsed using models to inform decision making, saying “with models, decision-makers can look to the future with confidence in their ability to respond to outbreaks and public health emergencies” [1].

Development of the methods presented in this manuscript was motivated by the observation that certain prediction models for infectious disease consistently performed better than other models at certain times of year. We observed in previous research that early in the U.S. influenza season, simple models of historical incidence often outperformed more standard time-series prediction models such as a seasonal auto-regressive integrated moving average (SARIMA) model [2]. However, in the middle of the season, the time-series models showed improved accuracy. We set out to determine whether ensemble methods could use this information about past model performance to improve predictions.

A large number of ensemble methods have been developed for a diverse array of tasks including regression, classification, and density estimation. These methods are broadly similar in that they combine results from multiple component models. However, details differ between ensemble methods. We suggest Polikar

[3] for a review of ensemble methods; many of these are also discussed in detail in Hastie et al.[4].

While there are many different methods for combining models, all ensemble models discussed in this paper use an approach called stacking. In this approach, each of the component models is trained separately in a first stage, and cross-validated measures of performance of those component models are obtained. Then, in a second stage, a stacking model is trained using the cross-validated performance measures to learn how to optimally combine predictive densities from the component models. The specific implementations of stacking that we use obtain the final predictive density as a weighted sum of the component predictive densities, where the weights may depend on covariates. We refer to this approach generally as a ‘’weighted density ensemble” approach to prediction. Several variations on this strategy have been explored in the literature previously [5, 6, 7]. However, other ensemble methods for density estimation have also been developed. For example, Rosset and Segal [8] develop a boosting method in which the component models are estimated sequentially, with results from earlier models affecting estimation of later models.

In structured prediction settings such as time series forecasting, ensemble methods may benefit from taking advantage of the data structure. For example, it may be the case that different models offer a better representation of the data at different points in time. A common idea in these settings is to use model weights that change over time. For instance, model weights may vary as a function of how well each model did in recent predictions [9]

or by using a more formal graphical structure such as a hidden Markov model to track which component model is most likely to have generated new observations as they arise over time

[10, 11]. It is also possible to combine the component models with weights that depend on observed covariates or features [12]. For example, in an ensemble for a user recommendation system, Jahrer et al.[13] allowed model weights to depend on a variety of features including the time that a user submitted a rating.

Using component models that generate predictive densities for outcomes of interest, we have implemented a series of ensembles using different methods for choosing the weights for each model. Specifically, we compare three different approaches. The first approach simply takes an equally weighted average of all models. The second approach estimates constant but not necessarily equal weights for each model. The third approach is a novel method for determining model weights based on features of the system at the time predictions are made. The overarching goal of this study is to create a systematic comparison between ensemble methods to study the benefits of increasing complexity in ensemble weighting schemes.

We are aware of one previous article that has developed ensemble methods for infectious disease prediction. Yamana et al.[14] developed a model stacking framework that is similar to the second approach outlined above using a constant weight for each component model. The present article is differentiated from that work in that we explore and compare a range of more flexible ensemble methods where the weights depend on observed features.

This paper presents a novel ensemble method that determines optimal model combinations based on (a) observed data at the time predictions are made and (b) aspects of the predictive distributions obtained from the component models. We refer to models built using this approach as “feature-weighted” ensembles. This approach fuses aspects of different ensemble methods: it uses model stacking [15] and estimates model weights based on features of the system [12] using gradient tree boosting [16].

Using seasonal influenza outbreaks in the US health regions as a case-study, we developed and applied our ensemble models to predict several attributes of the influenza season at each week during the season. By illustrating the utility of these approaches to ensemble forecasting in a setting with complex population dynamics, this work highlights the importance of continued innovation in ensemble methodology.


This paper presents a comparison of methods for determining weights for weighted density ensembles, applied to forecasting specific features of influenza seasons in the US. First, we present a description of the influenza data we use in our application and the prediction targets. Next, we discuss the three component models utilized by the ensemble framework. We then turn to the ensemble framework itself, describing the different ensemble model specifications used.

Data and prediction targets

We obtained publicly available data on seasonal influenza activity in the United States between 1997 and 2016 from the U.S. Centers for Disease Control and Prevention (CDC) (Fig 1). For each of the 10 Health and Human Services regions in the country in addition to the nation as a whole, the CDC calculates and publishes each week a measure called the weighted influenza-like illness (wILI) index. The wILI for a particular region is calculated as the average proportion of doctor visits with influenza-like illness for each state in the region, weighted by state population. During the CDC-defined influenza season (between Morbidity and Mortality Weekly Report week 40 of one year and 20 of the next year), the CDC publishes updated influenza data on a weekly basis. This includes “current” wILI data from two weeks prior to the reporting date, as well as updates to previously reported numbers as new data becomes available. For this analysis, we use only the final reported wILI measures to train and predict from our models.

The CDC defines the influenza season onset as the first of three successive weeks of the season for which wILI is greater than or equal to a threshold that is specific to the region and season. This threshold is the mean percent of patient visits where the patient had ILI during low incidence weeks for that region in the past three seasons, plus two standard deviations

[17]. The CDC provides historical threshold values for each region going back to the 2007/2008 season [18]. Additionally, we define two other metrics specific to a region-season. The peak incidence is the maximum observed wILI measured in a season. The peak week is the week at which the maximum wILI for the season is observed.

Each predictive distribution was represented by probabilities assigned to bins associated with different possible outcomes. For onset week, the bins are represented by integer values for each possible season week plus a bin for “no onset”. For peak week, the bins are represented by integer values for each possible season week. For peak incidence, the bins capture incidence rounded to a single decimal place, with a single bin to capture all incidence over

. Formally, the incidence bins are as follows: [0, 0.05), [0.05, 0.15), …, [12.95, 13.05), [13.05, ). These bins were used in the 2016-2017 influenza prediction contest run by the CDC [19].

We measure the accuracy of predictive distributions using the log score. The log score is a proper scoring rule [20], calculated in our setting as the natural log of the probability assigned to the bin containing the true observation. Proper scoring rules are preferred for measuring the quality of predictive distributions because the expected score is optimized by the true probabilty distribution. We note that for peak week, in some region-seasons the same peak incidence was achieved in multiple weeks (after rounding to one decimal place). In those cases, we calculated the log score as the log of the sum of the probabilities assigned to those weeks; this is consistent with scoring procedures used in the 2016-2017 flu prediction contest run by the CDC [19].

Component models

We used three component models to generate probabilistic predictions of the three prediction targets. The first model was a seasonal average model that utilized kernel density estimation (KDE) to estimate a predictive distribution for each target. The second model utilized kernel conditional density estimation (KCDE) and copulas to create a joint predictive distribution for incidence in all remaining weeks of the season, conditional on recent observations of incidence


. By calculating appropriate integrals of this joint distribution, we constructed predictive distributions for each of the seasonal targets. The third model used a standard seasonal auto-regressive integrated moving average (SARIMA) implementation. All models were fit independently on data within each region.

Kernel Density Estimation (KDE)

The simplest of the component models uses kernel density estimation [21] to estimate a distribution for each target based on observed values of that target in previous seasons within the region of interest. We used Gaussian kernels and the default KDE settings from the density function in the stats package for R [22] to estimate the bandwidth parameter. For the peak incidence target, we fit to log-transformed observations of historical peak incidence. For the onset week prediction target, we estimated the probability of no onset as the proportion of region-seasons in all regions in the training phase where no week in the season met the criteria for being a season onset.

To create an empirical predictive distribution of size

from a KDE fit based on a data vector

(for example, this might be the vector of peak week values from the training seasons), we first drew samples with replacement from , yielding a new vector . We then drew a single psuedo-random deviate from each of

truncated Gaussian distributions centered at

with the bandwidth estimated by the KDE algorithm. The Gaussians we sampled from were truncated at the lower and upper bounds of possible values for the given prediction target. Finally, we discretized the sampled values to the target-specific bins. These sampled points then make up the empirical predictive distribution from a KDE model. We set the sample size to . In theory, this model assigns non-zero probability to every possible outcome; however, in a few cases the empirical predictive distribution resulting from this Monte Carlo sampling approach assigned probability zero to some of the bins.

It is important to note that the predictions from this model do not change as new data are observed over the course of the season.

Kernel Conditional Density Estimation (KCDE)

We used kernel conditional density estimation and copulas to estimate a joint predictive distribution for flu incidence in each future week of the season, and then calculated predictive distributions for each target from that joint distribution [2]. In our implementation, we first used KCDE to obtain separate predictive densities for flu incidence in each future week of the season. Each of these predictive densities gives a conditional distribution for incidence at one future time point given recent observations of incidence and the current week of the season. KCDE can be viewed as a distribution-based analogue of nearest-neighbors regression. We then used a copula to model dependence among those individual predicitive densities, thereby obtaining a joint predicitive density, or a distribution of incidence trajectories in all future weeks.

To predict seasonal quantities (onset, peak timing, and peak incidence), we simulate trajectories of disease incidence from this joint predictive distribution. For each simulated incidence trajectory, we compute the onset week, peak week, and peak incidence. We then aggregate these values to create predictive distributions for each target. This procedure for obtaining predictive distributions for the targets of interest can be formally justified as an appropriate Monte Carlo integral of the joint predictive distribution for disease incidence in future weeks (see [2] for details).

Seasonal auto-regressive integrated moving average (SARIMA)

We fit seasonal ARIMA models [23] to wILI observations transformed to be on the natural log scale. We manually performed first-order seasonal differencing and used the stepwise procedure from the auto.arima function in the forecast package [24] for R to select the specification of the auto-regressive and moving average terms.

Similar to KCDE, forecasts were obtained by sampling trajectories of wILI values over the rest of the season (using the simulate.Arima function from the forecast package), and predictive distributions of the targets were computed from these sampled trajectories as described above.

Component model training

We used data from 14 seasons (1997/1998 through 2010/2011) to train the models. Data from five seasons (2011/2012 through 2015/2016) were held out when fitting the models and used exclusively in the testing phase. To avoid overfitting our models, we made predictions for the test phase only once [4].

Estimation of the ensemble models (discussed in the next subsection) requires cross-validated measures of performance of each of the component models in order to accurately gauge their relative performance. For each region, we estimated the parameters of each component model 15 times: 14 fits were obtained excluding one training season at a time, and another fit used all of the training data. For each fit obtained leaving one season out, we generated a set of three predictive distributions (one for each of the prediction targets) at each week in the held-out season. We were not able to generate predictions from the SARIMA and KCDE models for some seasons in the training phase because those models used lagged observations from previous seasons that were missing in our data set. The component model fits based on all of the training data were used to generate predictions for the test phase.

Ensemble models

All of the ensemble models we consider in this article work by averaging predictions from the component models to obtain the ensemble prediction. Additionally, these methods are stacked model ensembles because they use leave-one-season-out predictions from the independently estimated component models as inputs to estimate the model weights [15]. We begin our discussion of ensemble methods with a general overview, introducing a common set of notation and giving a broad outline of the ensemble models we will use in this article. We then describe our proposed weighted density ensemble model specifications in more detail.

Overview of ensemble models

A single set of notation can be used to describe all of the ensemble frameworks implemented here. Let denote the predictive density from component model

for the value of the scalar random variable

conditional on observed variables . Observations of disease incidence are reported weekly in our data set, so indexes the week of the season. The variable could for example represent the peak incidence for a given season and region; in our application to predicting seasonal quantities, the same outcome will be realized for all weeks within a given season. In the context of time series predictions, the covariate vector may include time-varying covariates such as the week at which the prediction is made or lagged incidence. The superscript reflects the fact that each component model may use a different set of covariates.

The combined predictive density for a particular target can be written as


In Equation (1) the are the model weights, which are allowed to vary as a function of observed features in . We define to be a vector of all observed quantities that are used by any of the component models or in calculating the model weights. In order to guarantee that

is a probability distribution we require that

for all . Fig 2 illustrates the concept of stacking the predictive densities for each component model.

In the following subsection, we propose a framework for estimating feature-dependent weights for a stacked ensemble model. By feature-dependent we mean that the weights associated with different component models are driven by observed features or covariates. Although we illustrate the method in the context of time-series predictions, the method could be used in any setting where we wish to combine distribution estimates from multiple models. Features could include observed data from the system being predicted (such as recent wILI measurements or the time of year at which predictions are being made), observed data from outside the system (for example, recent weather observations), or features of the predictions themselves (e.g. summaries of the predictive distributions from the component models, such as a measure of spread in the distribution, or the time until a predicted peak). Based on exploration of training phase data and a priori knowledge of the disease system, we chose three features of the system to illustrate the proposed “feature-weighting” methodology: week of season, component model uncertainty (defined as the minimum number of predictive distribution bins required to cover 90% probability), and wILI measurement at the time of prediction. These features were chosen prior to and not changed after implementing test-phase predictions.

We used four distinct methodologies to define weights to use for the stacking models:

  1. Equal Weights (EW): . In this scenario, each model contributes the same weight for each target and for all values of .

  2. Constant model weights via degenerate EM (dEM): , a constant where but the constants are not necessarily the same for each model. These weights are estimated using the degenerate estimation-maximization (dEM) algorithm [25]. A separate set of weights is estimated for each region and prediction target.

  3. Feature-weighted (FW): depends on features including week of the season and model uncertainty for the KCDE and SARIMA models. A separate set of weighting functions is estimated for each region and prediction target.

  4. Feature-weighted with regularization: depends on features, but with regularization discouraging the weights from taking extreme values or from varying too quickly as a function of . A separate set of weighting functions is estimated for each region and prediction target. We fit three variations on this ensemble model, using different sets of features:

    1. (FW-reg-w) week of the season;

    2. (FW-reg-wu) week of the season and model uncertainty for the KCDE and SARIMA models;

    3. (FW-reg-wui) week of the season, model uncertainty for the KCDE and SARIMA models, and incidence (wILI) in the most recent week.

All in all, this leads to 6 ensemble models, summarized in Table 1. The first three of these models (EW, dEM, and FW) can be viewed as variations on FW-reg-wu if we vary the amount and type of regularization imposed on the FW-reg-wu model. Thus, comparisons among these four models will enable us to explore the benefits of allowing the model weights to depend on covariates while imposing an appropriate amount of rigidity on the model weight functions . We will discuss the regularization strategies used in FW-reg-wu further in the next subsection. Meanwhile, comparisons among the FW-reg-w, FW-reg-wu, and FW-reg-wui models will allow us to explore the relative contributions to predictive performance that can be achieved by allowing the model weights to depend on different features.

Component Model Weights Vary with…
Prediction Week of SARIMA KCDE Current
Model Region Target Season Uncertainty Uncertainty wILI
FW-reg-w X X X
FW-reg-wu X X X X X
FW-reg-wui X X X X X X
Table 1: Summary of ensemble methods and what the model weights depend on.

Each of the six ensemble models, along with the three component models, are used to generate predictions in every season-week of each of the five testing seasons, assuming perfect reporting. These predictions are then used to evaluate the prospective predictive performance of each of the ensemble methods. In total, we evaluate 9 models in 11 regions over 5 years and 3 targets of interest.

Feature-weighted stacking framework

In this section we introduce the particular specification of the parameter weight functions that we use for the FW, FW-reg-w, FW-reg-wu, and FW-reg-wui models and discuss estimation.

In order to ensure that the the are non-negative and sum to 1 for all values of , we parameterize them in terms of the softmax transformation of real-valued latent functions


For a pair of models , indicates that model has more weight than model for predictions at the given value of . The functions could be parameterized and estimated using many different techniques, such as a linear specification in the features, splines, or so on. We chose to estimate the functions using gradient tree boosting.

Gradient tree boosting uses a forward stagewise additive modeling algorithm to iteratively and incrementally construct a series of regression trees that, when added together, create a function designed to minimize a given loss function. In our application, the algorithm builds up the

that minimize the negative log-score of the stacked predictions across all times :


where is the cross-validated predictive density from the th model evaluated at the realized outcome .

Specifically, we define a single tree as


where the are a set of disjoint regions that comprise a partition of the space of feature values , and is the indicator function taking the value if and otherwise. The parameters for the tree are the split points partitioning into the regions and the regression constants associated with each region. The function is obtained as the sum of trees:


In each iteration of the boosting process, we estimate new regression trees, one for each component model. These trees are estimated so as to minimize a local approximation to the loss function around the weight functions that were obtained after the previous boosting iteration. Our approach builds on the xgb.train function in the xgboost package for R to perform this estimation [26]. The functionality in that package assumes that the loss function is convex, and optimizes a quadratic approximation to the loss in each boosting iteration. The loss function in Equation (3) is not guaranteed to be convex, so a direct application of this optimization method fails in our setting. We have modified the implementation in the xgboost package to use a gradient descent step in cases where the loss is locally concave.

Gradient tree boosting is appealing as a method for estimating the functions because it offers a great deal of flexibility in how the weights can vary as a function of the features . On the other hand, this flexibility can lead to overfitting the training data. In order to limit the chances of overfitting, we have explored the use of three regularization parameters:

  1. The number of boosting iterations . As increases, more extreme weights (close to 0 or 1) and more rapid changes in the weights as varies are possible.

  2. An penalty on the number of tree leaves, . A large penalty encourages the regression trees to have fewer leaves, so that there is less flexibility for the model weights to vary as a function of .

  3. An penalty on the regression constants . A large penalty encourages these constants to be small, so that the overall model weights change less in each boosting iteration.

We selected values for these regularization parameters using a grid search optimizing leave-one-season-out cross-validated model performance.

Software and code

We used R version 3.2.2 (2015-08-14) for all analyses [22]. All data and code used for this analysis is freely available in an R package online at and may be installed in R directly. Predictions generated in real-time with early development versions of this model during the 2016/2017 influenza season may be viewed at To maximize reproducibility of our work, we have set seeds prior to running code that relies on stochastic simulations using the rstream package [27]. Additionally, the manuscript itself was dynamically generated using RMarkdown.


To evaluate overall model performance, we computed log scores for all predictions made by each model across all regions and test phase seasons. We also examined results for predictions made before the peak week (for predictions of peak timing or peak incidence) or the season onset (for predicitons of onset timing) within each of the test phase seasons.

Feature-weighted ensemble model weights reflect trends in component model log scores

Fig 3 displays variation in leave-one-season-out log scores from the three component models over the course of the training phase seasons, along with the corresponding model weight estimates from the CW and FW-reg-w models. Performance of the SARIMA and KCDE models is similar, with mean log scores from those models starting out near or slightly below the mean performance of KDE, but with performance improving as more data become available. Near the beginning of some seasons, predictions from the SARIMA model are quite a bit worse than predictions from the other two component models. Supplemental Fig 1 illustrates that these patterns are consistent across the other regions. Supplemental Fig 2 shows that performance of the component models also varies with the model’s uncertainty as measured by the number of bins required to cover 90% in the predictive distribution, and Supplemental Fig 3 shows that performance varies with the observed wILI in the week when predictions are made.

The model weights assigned by the feature weighted ensemble models generally track these trends in relative model performance (Figs 3, 4). For all three targets, at the national level the weight assigned to the SARIMA model increases and the weight assigned to KDE decreases as the season progresses. However, the magnitude of shifts in model weights as the weighting features vary is different for the three prediction targets.

Best models have similar aggregate performance

Aggregating across all combinations of region, season, and week of the season in the test phase, most of the models had similar performance (Fig 5). The most important exception to this is the KDE model, which achieved consistently lower log scores than the other methods, including several cases where the Monte Carlo sampling procedure we used to approximate the predictive distribution assigned probability 0 to the true peak incidence bin (there was one such case for the KCDE model). The low performance of the KDE model pulled the log scores for the EW method slightly below the other methods, and resulted in some outlying cases where the unregularized FW-wu method performed poorly for predicting peak incidence. However, aggregated performance of the KCDE, SARIMA, CW, and variations on the FW models was quite similar in most cases. A linear mixed effects model was not able to distinguish any statistically significant differences in mean performance of these methods (Supplemental Figs 4 and 5).

Ensembles show stable performance across seasons for early-season predictions

Although the aggregate performance of these models is quite similar, some differences between the methods begin to emerge when we examine performance in more detail. Predictions made before the season peak (for predictions of peak incidence or peak timing) or before the season onset (for predictions of season onset timing) are the most relevant to decision makers using the predictions as inputs to set public policy. Additionally, it is important that these predictions be of consistent quality in all seasons, whether the seasonal dynamics follow historic seasonal trends or diverge from those common patterns. We evaluated the relative performance of predictions from each model that were generated before the onset or peak week in each test season (Fig 6).

In all test phase seasons and for all prediction targets, the worst performing model is always one of the three component models (KDE, KCDE, or SARIMA). Furthermore, each of the component models outperforms the other two for at least one combination of season and prediction target. Within each season, the component models are often among either the best-performing or the worst-performing models. For example, the SARIMA model ranks first for predictions of peak timing in four out of five seasons and last in the fifth season, but ranks last for predictions of peak incidence in four out of five seasons and first in the remaining season. Similar trends, though slightly less extreme, also hold for KDE and KCDE.

The ensemble methods have more stable performance across the test phase seasons. Only FW-reg-w was among the top half of models in all seasons for predictions of onset timing; only EW and FW-reg-w were among the top half of models in all seasons for predictions of peak timing; and only FW-reg-wu and FW-reg-wui were among the top half of models in all seasons for predictions of peak incidence (Fig 5)

Model consistency can also be measured with the minimum of the log scores achieved across the test phase seasons. Here again the ensemble methods outperform the component models. In particular, worst-case performance of the EW and regularized FW methods is in the top half of all methods for all three prediction targets (Fig 5).

For example, consider predictions of peak timing, which could be used to plan the deployment of medical resources [1]. Among the component models we considered, SARIMA had the highest mean log score across the test phase, and was the best model in four out of five test-phase seasons. However, it had the worst performance of all models we considered in the 2015/2016 season. The EW and FW-reg-w ensembles achieved only slightly lower average log scores than SARIMA overall, were among the best-performing methods in all five test seasons, and assigned an average of about 60 to 70 percent more probability to the eventually realized peak week than SARIMA in predictions made during the season that SARIMA struggled.

Regularization improves feature-weighted ensemble models

The regularization of feature-weighted ensembles improved early-season prediction accuracy for all three metrics. Specifically, a comparison of average log-scores for the FW-wu and FW-reg-wu models show consistent improvement in the model that used regularization to create smoother functions of model weights as a function of season week and model uncertainty. Only once out of 15 combinations of target and test-season (onset timing in the 2015/2016 season) did FW-wu outperform its regularized counterpart (Fig 5).

Fig 1: Plot of influenza data. The full data include observations aggregated to the national level and for 10 smaller regions. Here we plot only the data at the national level and in two of the smaller regions; data for the other regions are qualitatively similar. Missing data are indicated with vertical grey lines. The vertical red dashed lines indicate the cutoff time between the training and testing phases; 5 seasons of data were held out for testing.
Fig 2: Conceptual diagram of how the stacking models operate on probabilistic predictive distributions. The distributions illustrated here have density bins of 1 wILI unit, which differs from those used in the manuscript for illustrative purposes only. Panel A shows the predictive distributions from three component models. Panel B shows scaled versions of the distributions from A, after being multiplied by model weights. In Panel C, the scaled distributions are literally stacked to create the final ensemble predictive distribution.
Fig 3: Example of component model weights from the CW and FW-reg-w models for National predictions. The upper plot within each panel shows mean, minimum, and maximum log scores achieved by each component model for predictions of the given prediction target at the national level in each week of the season, summarizing across all seasons in the training phase when all three component models produced predictions. The lower plot within each panel shows model weights from the CW and FW-reg-w ensemble methods at each week in the season.
Fig 4: Weights assigned to each component model by the FW-reg-wu model for the prediction of season peak incidence at the national level. There are three weighting functions (one for each component model) represented in each row of the figure. The value of the weight is depicted by the color. Each function depends on three features: the week of the season at the time when the predictions are made, KCDE model uncertainty, and SARIMA model uncertainty. Model uncertainty represents the minimum number of predictive distribution bins required to cover 90% probability of the predictive distribution, so the higher this number is the more uncertain the model is.
Fig 5: Log scores across all regions, seasons, and season weeks represented in box plots. Log scores of negative infinity are represented with a cross at -15.
Fig 6: Model performance ranked by mean log score within each of the five test seasons for predictions made before the target (season onset or peak) occurred. Averages are taken across all regions.


In this work we have examined the potential for ensemble methods to improve infectious disease predictions. We explored a nested series of ensemble methods, focusing on methods that computed weighted averages of predictive distributions for seasonal targets of public health interest, such as the peak intensity of the outbreak and the timing of both season onset and peak. The methods we examined ranged from using equal model weights to more complex schemes with weights that varied as functions of multiple covariates. These ensemble methods achieved overall performance that was about as good as the individual component models, with increased stability in model performance across different seasons.

Increased stability in predictive accuracy can provide decision makers with more confidence when using predictions as inputs to set policy. For example, if a single model does well in most seasons but occasionally fails badly, planning decisions may be negatively impacted in those failing years. This may be particularly important in a public health setting where the events that are most important to get right are those relatively rare cases when incidence is much larger than usual or the season timing is earlier or later than usual. This reduction in variability of model performance achieved by ensemble methods is therefore important for ensuring that our predictions are reliable under a variety of conditions.

The different ensemble specifications we considered had similar average performance over the test phase, when all models were making prospective, out-of-sample predictions. However, there were differences among the ensemble models in terms of their consistency across different seasons. For all three prediction targets, worst-case log scores and the worst-case model ranking across the five test seasons were slightly lower for the constant-weight (CW) ensemble than they were for the ensemble with smoothed model weights varying by week (FW-reg-w). There were not appreciable or consistent differences among the feature-weighted models using different feature sets, indicating that including model uncertainty and recent observations of disease incidence did not add much more information about relative model performance than was available from the week of the season in which predictions were generated. The equally weighted ensemble model had lower mean log scores than the CW and FW-reg-w ensembles for all three prediction targets, but had better worst-case performance than the other ensembles for predictions of peak timing. Synthesizing these observations, no ensemble was uniformly better than the others, but the FW-reg-w method had good average and worst-case performance across all test phase seasons and prediction targets.

The feature-weighted ensemble models presented in this article use a novel scheme to estimate feature-dependent model weights that sum to 1 and are therefore suitable for use in combining predictive distributions. This general method could be applied to combine distribution estimates in any context, and is not limited to time-series or infectious disease applications. Furthermore, comparing an implementation of the feature-weighting that smoothed the model weights to one that did not showed consistent improvements in model performance. This result suggests that future work on feature-weighted ensemble implementations should consider regularized estimation.

Infectious disease predictions are only useful to public health officials if they are communicated effectively in real time. Predictions from an early version of the FW-reg-w model were updated weekly during the 2016/2017 influenza season and disseminated through an interactive website at

A central challenge of working with infectious disease data sets is the limited number of years of data available for model estimation and evaluation. We have used approximately one fourth of our data set for model evaluation, which left us with only 14 seasons of training data and 5 seasons of testing data. Additionally, we had fewer than 14 seasons of leave-one-season-out predictions to use in estimating the model weighting functions for the FW ensemble methods because the SARIMA model required unobserved seasonally lagged incidence to make predictions for the first few seasons in the training phase. This small sample size may have negatively impacted our ability to estimate the weighting functions. We also have a small effective sample size for detecting differences in average model performance in the test phase because of the high degree of correlation in model log scores for the same prediction target in different weeks and regions within the same season.

Another limitation of this work is the small selection of component models used. Theoretical results and applications have demonstrated that ensemble methods are most effective when using a diverse set of component models [3]. In our study, the KCDE and SARIMA component models are similar in that they both use seasonal terms and observations of recent incidence to (though we note that these two models tended to perform well in different seasons, as illustrated in Fig 6). Increased component model diversity could yield improved ensemble performance; this could be achieved either through inclusion of different model structures (such as agent-based or mechanistic models) or different covariates (such as spatial effects, weather, or circulating strains of a disease).

Our exploration of feature-weighted ensembles is also limited by the relatively restricted feature sets we used for the weighting functions. We selected a few features based on exploratory analysis of the training phase results, and set all ensemble model formulations before obtaining any predictions for the test phase. It is possible that other weighting features not considered in this work may be more informative than those we have used. Some ideas for weighting covariates to use in future work include the largest incidence so far this season; the onset threshold; alternative summaries of the predictive distributions from the component models such as the probability at the mode or the modal value; the predominant flu strain; or the distribution of incidence in age groups.

This work provides a rigorous and comprehensive evaluation of ensemble methods for averaging probabilistic predictions for features of infectious disease outbreaks. A range of models, both single component models and ensemble models that combined component model predictions, demonstrated the ability to make more accurate predictions than a seasonal average baseline model. Additionally, systematic comparisons of simple and complex prediction models highlight a crucial added value of ensemble modeling, namely increased stability and consistency of model performance across seasons. Continued investigation, application, and innovation is necessary to strengthen our understanding of how to best leverage combinations of models to assist decision makers in fields, such as public health and infectious disease surveillance, that require data-driven rapid response.


This work was supported by Award Number R35GM119582 from the National Institute Of General Medical Sciences and Defense Advanced Projects Research Agency Young Faculty Award Number Dl6AP00144. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute Of General Medical Sciences, the National Institutes of Health, or the Defense Advanced Projects Research Agency. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.


  •  1. Centers for Disease Control and Prevention. Staying Ahead of the Curve: Modeling and Public Health Decision-Making; 2016. Available from:
  •  2. Ray E, Sakrejda K, Lauer SA, Reich NG. The Reich Lab at UMass-Amherst; 2016.
  •  3. Polikar R. Ensemble based systems in decision making. IEEE Circuits and systems magazine. 2006;6(3):21–45.
  •  4. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Springer; 2011.
  •  5. Smyth P, Wolpert D. Linearly combining density estimators via stacking. Machine Learning. 1999;36(1-2):59–83.
  •  6. Rigollet P, Tsybakov AB. Linear and convex aggregation of density estimators. Mathematical Methods of Statistics. 2007;16(3):260–280.
  •  7. Ganti R, Gray A. Cake: Convex adaptive kernel density estimation.

    In: International Conference on Artificial Intelligence and Statistics; 2011. p. 498–506.

  •  8. Rosset S, Segal E. Boosting density estimation. In: NIPS; 2002. p. 641–648.
  •  9. Herbster M, Warmuth MK. Tracking the best expert. Machine Learning. 1998;32(2):151–178.
  •  10. Yamanishi K, Maruyama Y.

    Dynamic model selection with its applications to novelty detection.

    IEEE Transactions on Information Theory. 2007;53(6):2180–2189.
  •  11. Cortes C, Kuznetsov V, Mohri M. Ensemble Methods for Structured Prediction. In: Proceedings of The 31st International Conference on Machine Learning; 2014. p. 1134–1142.
  •  12. Sill J, Takacs G, Mackey L, Lin D. Feature-Weighted Linear Stacking. arXiv. 2009;.
  •  13. Jahrer M, Töscher A, Legenstein R. Combining predictions for accurate recommender systems. In: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM; 2010. p. 693–702.
  •  14. Yamana TK, Kandula S, Shaman J. Superensemble forecasts of dengue outbreaks. Journal of The Royal Society Interface. 2016;13(123):20160410.
  •  15. Wolpert DH. Stacked generalization. Neural Networks. 1992;5(2):241–259.
  •  16. Friedman JH.

    Greedy function approximation: a gradient boosting machine.

    Annals of statistics. 2001; p. 1189–1232.
  •  17. Centers for Disease Control and Prevention. Overview of Influenza Surveillance in the United States; 2016. Available from:
  •  18. Centers for Disease Control and Prevention. Regional baseline values for influenza-like illness; 2016. Available from:{_}Baseline.csv.
  •  19. Centers for Disease Control and Prevention. Epidemic Prediction Initiative; 2016. Available from:
  •  20. Gneiting T, Raftery AE. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association. 2007;102(477):359–378.
  •  21. Silverman BW. Density estimation for statistics and data analysis. vol. 26. CRC press; 1986.
  •  22. R Core Team. R: A Language and Environment for Statistical Computing; 2015. Available from:
  •  23. Box GE, Jenkins GM, Reinsel GC, Ljung GM. Time series analysis: forecasting and control. John Wiley & Sons; 2015.
  •  24. Hyndman RJ, Khandakar Y. Automatic time series forecasting: The forecast package for R. Journal Of Statistical Software. 2008;27(3):C3–C3. doi:10.18637/jss.v027.i03.
  •  25. Lin X, Zhu Y.

    Degenerate Expectation-Maximization Algorithm for Local Dimension Reduction.

    In: Classification, Clustering, and Data Mining Applications. Berlin, Heidelberg: Springer Berlin Heidelberg; 2004. p. 259–268. Available from:{_}25.
  •  26. Chen T, He T, Benesty M. xgboost: Extreme Gradient Boosting; 2016. Available from:
  •  27. Leydold J. rstream: Streams of Random Numbers; 2015. Available from: