Abstract
Accurate and reliable predictions of infectious disease dynamics can be valuable to public health organizations that plan interventions to decrease or prevent disease transmission. A great variety of models have been developed for this task, using different model structures, covariates, and targets for prediction. Experience has shown that the performance of these models varies; some tend to do better or worse in different seasons or at different points within a season. Ensemble methods combine multiple models to obtain a single prediction that leverages the strengths of each model. We considered a range of ensemble methods that each form a predictive density for a target of interest as a weighted sum of the predictive densities from component models. In the simplest case, equal weight is assigned to each component model; in the most complex case, the weights vary with the region, prediction target, week of the season when the predictions are made, a measure of component model uncertainty, and recent observations of disease incidence. We applied these methods to predict measures of influenza season timing and severity in the United States, both at the national and regional levels, using three component models. We trained the models on retrospective predictions from 14 seasons (1997/1998  2010/2011) and evaluated each model’s prospective, outofsample performance in the five subsequent influenza seasons. In this test phase, the ensemble methods showed overall performance that was similar to the best of the component models, but offered more consistent performance across seasons than the component models. Ensemble methods offer the potential to deliver more reliable predictions to public health decision makers.
Introduction
The practice of combining predictions from different models has been used for decades by climatologists and geophysical scientists. These methods have subsequently been adapted and extended by statisticians and computer scientists in diverse areas of scientific inquiry. In recent years, these “ensemble” forecasting approaches frequently have been among the top methods used in prediction challenges across a wide range of applications.
Ensembles are a natural choice for noisy, complex, and interdependent systems that evolve over time. In these settings, no one model is likely to be able to capture and predict the full set of complex relationships that drive future observations from a particular system of interest. Instead “specialist” or “component” models can be relied on to capture distinct features or signals from a system and, when combined, represent a nearly complete range of possible outcomes. In this work, we develop and compare a collection of ensemble methods for combining predictive densities. This enables us to quantify the improvement in predictions achieved by using ensemble methods with varying levels of complexity.
To illustrate these ensemble methods, we present timeseries forecasts for infectious disease, specifically for influenza in the United States. The international significance of emerging epidemic threats in recent decades has highlighted the importance of understanding and being able to predict infectious disease dynamics. With the revolution in science driven by the promise of “big” and realtime data, there is an increased focus on and hope for using statistics to inform public health policy and decisionmaking in ways that could mitigate the impact of future outbreaks. Some of the largest public health agencies in the world, including the US Centers for Disease Control and Prevention (CDC) have openly endorsed using models to inform decision making, saying “with models, decisionmakers can look to the future with confidence in their ability to respond to outbreaks and public health emergencies” [1].
Development of the methods presented in this manuscript was motivated by the observation that certain prediction models for infectious disease consistently performed better than other models at certain times of year. We observed in previous research that early in the U.S. influenza season, simple models of historical incidence often outperformed more standard timeseries prediction models such as a seasonal autoregressive integrated moving average (SARIMA) model [2]. However, in the middle of the season, the timeseries models showed improved accuracy. We set out to determine whether ensemble methods could use this information about past model performance to improve predictions.
A large number of ensemble methods have been developed for a diverse array of tasks including regression, classification, and density estimation. These methods are broadly similar in that they combine results from multiple component models. However, details differ between ensemble methods. We suggest Polikar
[3] for a review of ensemble methods; many of these are also discussed in detail in Hastie et al.[4].While there are many different methods for combining models, all ensemble models discussed in this paper use an approach called stacking. In this approach, each of the component models is trained separately in a first stage, and crossvalidated measures of performance of those component models are obtained. Then, in a second stage, a stacking model is trained using the crossvalidated performance measures to learn how to optimally combine predictive densities from the component models. The specific implementations of stacking that we use obtain the final predictive density as a weighted sum of the component predictive densities, where the weights may depend on covariates. We refer to this approach generally as a ‘’weighted density ensemble” approach to prediction. Several variations on this strategy have been explored in the literature previously [5, 6, 7]. However, other ensemble methods for density estimation have also been developed. For example, Rosset and Segal [8] develop a boosting method in which the component models are estimated sequentially, with results from earlier models affecting estimation of later models.
In structured prediction settings such as time series forecasting, ensemble methods may benefit from taking advantage of the data structure. For example, it may be the case that different models offer a better representation of the data at different points in time. A common idea in these settings is to use model weights that change over time. For instance, model weights may vary as a function of how well each model did in recent predictions [9]
or by using a more formal graphical structure such as a hidden Markov model to track which component model is most likely to have generated new observations as they arise over time
[10, 11]. It is also possible to combine the component models with weights that depend on observed covariates or features [12]. For example, in an ensemble for a user recommendation system, Jahrer et al.[13] allowed model weights to depend on a variety of features including the time that a user submitted a rating.Using component models that generate predictive densities for outcomes of interest, we have implemented a series of ensembles using different methods for choosing the weights for each model. Specifically, we compare three different approaches. The first approach simply takes an equally weighted average of all models. The second approach estimates constant but not necessarily equal weights for each model. The third approach is a novel method for determining model weights based on features of the system at the time predictions are made. The overarching goal of this study is to create a systematic comparison between ensemble methods to study the benefits of increasing complexity in ensemble weighting schemes.
We are aware of one previous article that has developed ensemble methods for infectious disease prediction. Yamana et al.[14] developed a model stacking framework that is similar to the second approach outlined above using a constant weight for each component model. The present article is differentiated from that work in that we explore and compare a range of more flexible ensemble methods where the weights depend on observed features.
This paper presents a novel ensemble method that determines optimal model combinations based on (a) observed data at the time predictions are made and (b) aspects of the predictive distributions obtained from the component models. We refer to models built using this approach as “featureweighted” ensembles. This approach fuses aspects of different ensemble methods: it uses model stacking [15] and estimates model weights based on features of the system [12] using gradient tree boosting [16].
Using seasonal influenza outbreaks in the US health regions as a casestudy, we developed and applied our ensemble models to predict several attributes of the influenza season at each week during the season. By illustrating the utility of these approaches to ensemble forecasting in a setting with complex population dynamics, this work highlights the importance of continued innovation in ensemble methodology.
Methods
This paper presents a comparison of methods for determining weights for weighted density ensembles, applied to forecasting specific features of influenza seasons in the US. First, we present a description of the influenza data we use in our application and the prediction targets. Next, we discuss the three component models utilized by the ensemble framework. We then turn to the ensemble framework itself, describing the different ensemble model specifications used.
Data and prediction targets
We obtained publicly available data on seasonal influenza activity in the United States between 1997 and 2016 from the U.S. Centers for Disease Control and Prevention (CDC) (Fig 1). For each of the 10 Health and Human Services regions in the country in addition to the nation as a whole, the CDC calculates and publishes each week a measure called the weighted influenzalike illness (wILI) index. The wILI for a particular region is calculated as the average proportion of doctor visits with influenzalike illness for each state in the region, weighted by state population. During the CDCdefined influenza season (between Morbidity and Mortality Weekly Report week 40 of one year and 20 of the next year), the CDC publishes updated influenza data on a weekly basis. This includes “current” wILI data from two weeks prior to the reporting date, as well as updates to previously reported numbers as new data becomes available. For this analysis, we use only the final reported wILI measures to train and predict from our models.
The CDC defines the influenza season onset as the first of three successive weeks of the season for which wILI is greater than or equal to a threshold that is specific to the region and season. This threshold is the mean percent of patient visits where the patient had ILI during low incidence weeks for that region in the past three seasons, plus two standard deviations
[17]. The CDC provides historical threshold values for each region going back to the 2007/2008 season [18]. Additionally, we define two other metrics specific to a regionseason. The peak incidence is the maximum observed wILI measured in a season. The peak week is the week at which the maximum wILI for the season is observed.Each predictive distribution was represented by probabilities assigned to bins associated with different possible outcomes. For onset week, the bins are represented by integer values for each possible season week plus a bin for “no onset”. For peak week, the bins are represented by integer values for each possible season week. For peak incidence, the bins capture incidence rounded to a single decimal place, with a single bin to capture all incidence over
. Formally, the incidence bins are as follows: [0, 0.05), [0.05, 0.15), …, [12.95, 13.05), [13.05, ). These bins were used in the 20162017 influenza prediction contest run by the CDC [19].We measure the accuracy of predictive distributions using the log score. The log score is a proper scoring rule [20], calculated in our setting as the natural log of the probability assigned to the bin containing the true observation. Proper scoring rules are preferred for measuring the quality of predictive distributions because the expected score is optimized by the true probabilty distribution. We note that for peak week, in some regionseasons the same peak incidence was achieved in multiple weeks (after rounding to one decimal place). In those cases, we calculated the log score as the log of the sum of the probabilities assigned to those weeks; this is consistent with scoring procedures used in the 20162017 flu prediction contest run by the CDC [19].
Component models
We used three component models to generate probabilistic predictions of the three prediction targets. The first model was a seasonal average model that utilized kernel density estimation (KDE) to estimate a predictive distribution for each target. The second model utilized kernel conditional density estimation (KCDE) and copulas to create a joint predictive distribution for incidence in all remaining weeks of the season, conditional on recent observations of incidence
[2]. By calculating appropriate integrals of this joint distribution, we constructed predictive distributions for each of the seasonal targets. The third model used a standard seasonal autoregressive integrated moving average (SARIMA) implementation. All models were fit independently on data within each region.
Kernel Density Estimation (KDE)
The simplest of the component models uses kernel density estimation [21] to estimate a distribution for each target based on observed values of that target in previous seasons within the region of interest. We used Gaussian kernels and the default KDE settings from the density function in the stats package for R [22] to estimate the bandwidth parameter. For the peak incidence target, we fit to logtransformed observations of historical peak incidence. For the onset week prediction target, we estimated the probability of no onset as the proportion of regionseasons in all regions in the training phase where no week in the season met the criteria for being a season onset.
To create an empirical predictive distribution of size
from a KDE fit based on a data vector
(for example, this might be the vector of peak week values from the training seasons), we first drew samples with replacement from , yielding a new vector . We then drew a single psuedorandom deviate from each oftruncated Gaussian distributions centered at
with the bandwidth estimated by the KDE algorithm. The Gaussians we sampled from were truncated at the lower and upper bounds of possible values for the given prediction target. Finally, we discretized the sampled values to the targetspecific bins. These sampled points then make up the empirical predictive distribution from a KDE model. We set the sample size to . In theory, this model assigns nonzero probability to every possible outcome; however, in a few cases the empirical predictive distribution resulting from this Monte Carlo sampling approach assigned probability zero to some of the bins.It is important to note that the predictions from this model do not change as new data are observed over the course of the season.
Kernel Conditional Density Estimation (KCDE)
We used kernel conditional density estimation and copulas to estimate a joint predictive distribution for flu incidence in each future week of the season, and then calculated predictive distributions for each target from that joint distribution [2]. In our implementation, we first used KCDE to obtain separate predictive densities for flu incidence in each future week of the season. Each of these predictive densities gives a conditional distribution for incidence at one future time point given recent observations of incidence and the current week of the season. KCDE can be viewed as a distributionbased analogue of nearestneighbors regression. We then used a copula to model dependence among those individual predicitive densities, thereby obtaining a joint predicitive density, or a distribution of incidence trajectories in all future weeks.
To predict seasonal quantities (onset, peak timing, and peak incidence), we simulate trajectories of disease incidence from this joint predictive distribution. For each simulated incidence trajectory, we compute the onset week, peak week, and peak incidence. We then aggregate these values to create predictive distributions for each target. This procedure for obtaining predictive distributions for the targets of interest can be formally justified as an appropriate Monte Carlo integral of the joint predictive distribution for disease incidence in future weeks (see [2] for details).
Seasonal autoregressive integrated moving average (SARIMA)
We fit seasonal ARIMA models [23] to wILI observations transformed to be on the natural log scale. We manually performed firstorder seasonal differencing and used the stepwise procedure from the auto.arima function in the forecast package [24] for R to select the specification of the autoregressive and moving average terms.
Similar to KCDE, forecasts were obtained by sampling trajectories of wILI values over the rest of the season (using the simulate.Arima function from the forecast package), and predictive distributions of the targets were computed from these sampled trajectories as described above.
Component model training
We used data from 14 seasons (1997/1998 through 2010/2011) to train the models. Data from five seasons (2011/2012 through 2015/2016) were held out when fitting the models and used exclusively in the testing phase. To avoid overfitting our models, we made predictions for the test phase only once [4].
Estimation of the ensemble models (discussed in the next subsection) requires crossvalidated measures of performance of each of the component models in order to accurately gauge their relative performance. For each region, we estimated the parameters of each component model 15 times: 14 fits were obtained excluding one training season at a time, and another fit used all of the training data. For each fit obtained leaving one season out, we generated a set of three predictive distributions (one for each of the prediction targets) at each week in the heldout season. We were not able to generate predictions from the SARIMA and KCDE models for some seasons in the training phase because those models used lagged observations from previous seasons that were missing in our data set. The component model fits based on all of the training data were used to generate predictions for the test phase.
Ensemble models
All of the ensemble models we consider in this article work by averaging predictions from the component models to obtain the ensemble prediction. Additionally, these methods are stacked model ensembles because they use leaveoneseasonout predictions from the independently estimated component models as inputs to estimate the model weights [15]. We begin our discussion of ensemble methods with a general overview, introducing a common set of notation and giving a broad outline of the ensemble models we will use in this article. We then describe our proposed weighted density ensemble model specifications in more detail.
Overview of ensemble models
A single set of notation can be used to describe all of the ensemble frameworks implemented here. Let denote the predictive density from component model
for the value of the scalar random variable
conditional on observed variables . Observations of disease incidence are reported weekly in our data set, so indexes the week of the season. The variable could for example represent the peak incidence for a given season and region; in our application to predicting seasonal quantities, the same outcome will be realized for all weeks within a given season. In the context of time series predictions, the covariate vector may include timevarying covariates such as the week at which the prediction is made or lagged incidence. The superscript reflects the fact that each component model may use a different set of covariates.The combined predictive density for a particular target can be written as
(1) 
In Equation (1) the are the model weights, which are allowed to vary as a function of observed features in . We define to be a vector of all observed quantities that are used by any of the component models or in calculating the model weights. In order to guarantee that
is a probability distribution we require that
for all . Fig 2 illustrates the concept of stacking the predictive densities for each component model.In the following subsection, we propose a framework for estimating featuredependent weights for a stacked ensemble model. By featuredependent we mean that the weights associated with different component models are driven by observed features or covariates. Although we illustrate the method in the context of timeseries predictions, the method could be used in any setting where we wish to combine distribution estimates from multiple models. Features could include observed data from the system being predicted (such as recent wILI measurements or the time of year at which predictions are being made), observed data from outside the system (for example, recent weather observations), or features of the predictions themselves (e.g. summaries of the predictive distributions from the component models, such as a measure of spread in the distribution, or the time until a predicted peak). Based on exploration of training phase data and a priori knowledge of the disease system, we chose three features of the system to illustrate the proposed “featureweighting” methodology: week of season, component model uncertainty (defined as the minimum number of predictive distribution bins required to cover 90% probability), and wILI measurement at the time of prediction. These features were chosen prior to and not changed after implementing testphase predictions.
We used four distinct methodologies to define weights to use for the stacking models:

Equal Weights (EW): . In this scenario, each model contributes the same weight for each target and for all values of .

Constant model weights via degenerate EM (dEM): , a constant where but the constants are not necessarily the same for each model. These weights are estimated using the degenerate estimationmaximization (dEM) algorithm [25]. A separate set of weights is estimated for each region and prediction target.

Featureweighted (FW): depends on features including week of the season and model uncertainty for the KCDE and SARIMA models. A separate set of weighting functions is estimated for each region and prediction target.

Featureweighted with regularization: depends on features, but with regularization discouraging the weights from taking extreme values or from varying too quickly as a function of . A separate set of weighting functions is estimated for each region and prediction target. We fit three variations on this ensemble model, using different sets of features:

(FWregw) week of the season;

(FWregwu) week of the season and model uncertainty for the KCDE and SARIMA models;

(FWregwui) week of the season, model uncertainty for the KCDE and SARIMA models, and incidence (wILI) in the most recent week.

All in all, this leads to 6 ensemble models, summarized in Table 1. The first three of these models (EW, dEM, and FW) can be viewed as variations on FWregwu if we vary the amount and type of regularization imposed on the FWregwu model. Thus, comparisons among these four models will enable us to explore the benefits of allowing the model weights to depend on covariates while imposing an appropriate amount of rigidity on the model weight functions . We will discuss the regularization strategies used in FWregwu further in the next subsection. Meanwhile, comparisons among the FWregw, FWregwu, and FWregwui models will allow us to explore the relative contributions to predictive performance that can be achieved by allowing the model weights to depend on different features.
Component Model Weights Vary with…  
Prediction  Week of  SARIMA  KCDE  Current  
Model  Region  Target  Season  Uncertainty  Uncertainty  wILI 
EW  
CW  X  X  
FW  X  X  X  X  X  
FWregw  X  X  X  
FWregwu  X  X  X  X  X  
FWregwui  X  X  X  X  X  X 
Each of the six ensemble models, along with the three component models, are used to generate predictions in every seasonweek of each of the five testing seasons, assuming perfect reporting. These predictions are then used to evaluate the prospective predictive performance of each of the ensemble methods. In total, we evaluate 9 models in 11 regions over 5 years and 3 targets of interest.
Featureweighted stacking framework
In this section we introduce the particular specification of the parameter weight functions that we use for the FW, FWregw, FWregwu, and FWregwui models and discuss estimation.
In order to ensure that the the are nonnegative and sum to 1 for all values of , we parameterize them in terms of the softmax transformation of realvalued latent functions
(2) 
For a pair of models , indicates that model has more weight than model for predictions at the given value of . The functions could be parameterized and estimated using many different techniques, such as a linear specification in the features, splines, or so on. We chose to estimate the functions using gradient tree boosting.
Gradient tree boosting uses a forward stagewise additive modeling algorithm to iteratively and incrementally construct a series of regression trees that, when added together, create a function designed to minimize a given loss function. In our application, the algorithm builds up the
that minimize the negative logscore of the stacked predictions across all times :(3) 
where is the crossvalidated predictive density from the th model evaluated at the realized outcome .
Specifically, we define a single tree as
(4) 
where the are a set of disjoint regions that comprise a partition of the space of feature values , and is the indicator function taking the value if and otherwise. The parameters for the tree are the split points partitioning into the regions and the regression constants associated with each region. The function is obtained as the sum of trees:
(5) 
In each iteration of the boosting process, we estimate new regression trees, one for each component model. These trees are estimated so as to minimize a local approximation to the loss function around the weight functions that were obtained after the previous boosting iteration. Our approach builds on the xgb.train function in the xgboost package for R to perform this estimation [26]. The functionality in that package assumes that the loss function is convex, and optimizes a quadratic approximation to the loss in each boosting iteration. The loss function in Equation (3) is not guaranteed to be convex, so a direct application of this optimization method fails in our setting. We have modified the implementation in the xgboost package to use a gradient descent step in cases where the loss is locally concave.
Gradient tree boosting is appealing as a method for estimating the functions because it offers a great deal of flexibility in how the weights can vary as a function of the features . On the other hand, this flexibility can lead to overfitting the training data. In order to limit the chances of overfitting, we have explored the use of three regularization parameters:

The number of boosting iterations . As increases, more extreme weights (close to 0 or 1) and more rapid changes in the weights as varies are possible.

An penalty on the number of tree leaves, . A large penalty encourages the regression trees to have fewer leaves, so that there is less flexibility for the model weights to vary as a function of .

An penalty on the regression constants . A large penalty encourages these constants to be small, so that the overall model weights change less in each boosting iteration.
We selected values for these regularization parameters using a grid search optimizing leaveoneseasonout crossvalidated model performance.
Software and code
We used R version 3.2.2 (20150814) for all analyses [22]. All data and code used for this analysis is freely available in an R package online at https://github.com/reichlab/adaptivelyweightedensemble and may be installed in R directly. Predictions generated in realtime with early development versions of this model during the 2016/2017 influenza season may be viewed at https://reichlab.io/flusight/. To maximize reproducibility of our work, we have set seeds prior to running code that relies on stochastic simulations using the rstream package [27]. Additionally, the manuscript itself was dynamically generated using RMarkdown.
Results
To evaluate overall model performance, we computed log scores for all predictions made by each model across all regions and test phase seasons. We also examined results for predictions made before the peak week (for predictions of peak timing or peak incidence) or the season onset (for predicitons of onset timing) within each of the test phase seasons.
Featureweighted ensemble model weights reflect trends in component model log scores
Fig 3 displays variation in leaveoneseasonout log scores from the three component models over the course of the training phase seasons, along with the corresponding model weight estimates from the CW and FWregw models. Performance of the SARIMA and KCDE models is similar, with mean log scores from those models starting out near or slightly below the mean performance of KDE, but with performance improving as more data become available. Near the beginning of some seasons, predictions from the SARIMA model are quite a bit worse than predictions from the other two component models. Supplemental Fig 1 illustrates that these patterns are consistent across the other regions. Supplemental Fig 2 shows that performance of the component models also varies with the model’s uncertainty as measured by the number of bins required to cover 90% in the predictive distribution, and Supplemental Fig 3 shows that performance varies with the observed wILI in the week when predictions are made.
The model weights assigned by the feature weighted ensemble models generally track these trends in relative model performance (Figs 3, 4). For all three targets, at the national level the weight assigned to the SARIMA model increases and the weight assigned to KDE decreases as the season progresses. However, the magnitude of shifts in model weights as the weighting features vary is different for the three prediction targets.
Best models have similar aggregate performance
Aggregating across all combinations of region, season, and week of the season in the test phase, most of the models had similar performance (Fig 5). The most important exception to this is the KDE model, which achieved consistently lower log scores than the other methods, including several cases where the Monte Carlo sampling procedure we used to approximate the predictive distribution assigned probability 0 to the true peak incidence bin (there was one such case for the KCDE model). The low performance of the KDE model pulled the log scores for the EW method slightly below the other methods, and resulted in some outlying cases where the unregularized FWwu method performed poorly for predicting peak incidence. However, aggregated performance of the KCDE, SARIMA, CW, and variations on the FW models was quite similar in most cases. A linear mixed effects model was not able to distinguish any statistically significant differences in mean performance of these methods (Supplemental Figs 4 and 5).
Ensembles show stable performance across seasons for earlyseason predictions
Although the aggregate performance of these models is quite similar, some differences between the methods begin to emerge when we examine performance in more detail. Predictions made before the season peak (for predictions of peak incidence or peak timing) or before the season onset (for predictions of season onset timing) are the most relevant to decision makers using the predictions as inputs to set public policy. Additionally, it is important that these predictions be of consistent quality in all seasons, whether the seasonal dynamics follow historic seasonal trends or diverge from those common patterns. We evaluated the relative performance of predictions from each model that were generated before the onset or peak week in each test season (Fig 6).
In all test phase seasons and for all prediction targets, the worst performing model is always one of the three component models (KDE, KCDE, or SARIMA). Furthermore, each of the component models outperforms the other two for at least one combination of season and prediction target. Within each season, the component models are often among either the bestperforming or the worstperforming models. For example, the SARIMA model ranks first for predictions of peak timing in four out of five seasons and last in the fifth season, but ranks last for predictions of peak incidence in four out of five seasons and first in the remaining season. Similar trends, though slightly less extreme, also hold for KDE and KCDE.
The ensemble methods have more stable performance across the test phase seasons. Only FWregw was among the top half of models in all seasons for predictions of onset timing; only EW and FWregw were among the top half of models in all seasons for predictions of peak timing; and only FWregwu and FWregwui were among the top half of models in all seasons for predictions of peak incidence (Fig 5)
Model consistency can also be measured with the minimum of the log scores achieved across the test phase seasons. Here again the ensemble methods outperform the component models. In particular, worstcase performance of the EW and regularized FW methods is in the top half of all methods for all three prediction targets (Fig 5).
For example, consider predictions of peak timing, which could be used to plan the deployment of medical resources [1]. Among the component models we considered, SARIMA had the highest mean log score across the test phase, and was the best model in four out of five testphase seasons. However, it had the worst performance of all models we considered in the 2015/2016 season. The EW and FWregw ensembles achieved only slightly lower average log scores than SARIMA overall, were among the bestperforming methods in all five test seasons, and assigned an average of about 60 to 70 percent more probability to the eventually realized peak week than SARIMA in predictions made during the season that SARIMA struggled.
Regularization improves featureweighted ensemble models
The regularization of featureweighted ensembles improved earlyseason prediction accuracy for all three metrics. Specifically, a comparison of average logscores for the FWwu and FWregwu models show consistent improvement in the model that used regularization to create smoother functions of model weights as a function of season week and model uncertainty. Only once out of 15 combinations of target and testseason (onset timing in the 2015/2016 season) did FWwu outperform its regularized counterpart (Fig 5).
Discussion
In this work we have examined the potential for ensemble methods to improve infectious disease predictions. We explored a nested series of ensemble methods, focusing on methods that computed weighted averages of predictive distributions for seasonal targets of public health interest, such as the peak intensity of the outbreak and the timing of both season onset and peak. The methods we examined ranged from using equal model weights to more complex schemes with weights that varied as functions of multiple covariates. These ensemble methods achieved overall performance that was about as good as the individual component models, with increased stability in model performance across different seasons.
Increased stability in predictive accuracy can provide decision makers with more confidence when using predictions as inputs to set policy. For example, if a single model does well in most seasons but occasionally fails badly, planning decisions may be negatively impacted in those failing years. This may be particularly important in a public health setting where the events that are most important to get right are those relatively rare cases when incidence is much larger than usual or the season timing is earlier or later than usual. This reduction in variability of model performance achieved by ensemble methods is therefore important for ensuring that our predictions are reliable under a variety of conditions.
The different ensemble specifications we considered had similar average performance over the test phase, when all models were making prospective, outofsample predictions. However, there were differences among the ensemble models in terms of their consistency across different seasons. For all three prediction targets, worstcase log scores and the worstcase model ranking across the five test seasons were slightly lower for the constantweight (CW) ensemble than they were for the ensemble with smoothed model weights varying by week (FWregw). There were not appreciable or consistent differences among the featureweighted models using different feature sets, indicating that including model uncertainty and recent observations of disease incidence did not add much more information about relative model performance than was available from the week of the season in which predictions were generated. The equally weighted ensemble model had lower mean log scores than the CW and FWregw ensembles for all three prediction targets, but had better worstcase performance than the other ensembles for predictions of peak timing. Synthesizing these observations, no ensemble was uniformly better than the others, but the FWregw method had good average and worstcase performance across all test phase seasons and prediction targets.
The featureweighted ensemble models presented in this article use a novel scheme to estimate featuredependent model weights that sum to 1 and are therefore suitable for use in combining predictive distributions. This general method could be applied to combine distribution estimates in any context, and is not limited to timeseries or infectious disease applications. Furthermore, comparing an implementation of the featureweighting that smoothed the model weights to one that did not showed consistent improvements in model performance. This result suggests that future work on featureweighted ensemble implementations should consider regularized estimation.
Infectious disease predictions are only useful to public health officials if they are communicated effectively in real time. Predictions from an early version of the FWregw model were updated weekly during the 2016/2017 influenza season and disseminated through an interactive website at https://reichlab.io/flusight/.
A central challenge of working with infectious disease data sets is the limited number of years of data available for model estimation and evaluation. We have used approximately one fourth of our data set for model evaluation, which left us with only 14 seasons of training data and 5 seasons of testing data. Additionally, we had fewer than 14 seasons of leaveoneseasonout predictions to use in estimating the model weighting functions for the FW ensemble methods because the SARIMA model required unobserved seasonally lagged incidence to make predictions for the first few seasons in the training phase. This small sample size may have negatively impacted our ability to estimate the weighting functions. We also have a small effective sample size for detecting differences in average model performance in the test phase because of the high degree of correlation in model log scores for the same prediction target in different weeks and regions within the same season.
Another limitation of this work is the small selection of component models used. Theoretical results and applications have demonstrated that ensemble methods are most effective when using a diverse set of component models [3]. In our study, the KCDE and SARIMA component models are similar in that they both use seasonal terms and observations of recent incidence to (though we note that these two models tended to perform well in different seasons, as illustrated in Fig 6). Increased component model diversity could yield improved ensemble performance; this could be achieved either through inclusion of different model structures (such as agentbased or mechanistic models) or different covariates (such as spatial effects, weather, or circulating strains of a disease).
Our exploration of featureweighted ensembles is also limited by the relatively restricted feature sets we used for the weighting functions. We selected a few features based on exploratory analysis of the training phase results, and set all ensemble model formulations before obtaining any predictions for the test phase. It is possible that other weighting features not considered in this work may be more informative than those we have used. Some ideas for weighting covariates to use in future work include the largest incidence so far this season; the onset threshold; alternative summaries of the predictive distributions from the component models such as the probability at the mode or the modal value; the predominant flu strain; or the distribution of incidence in age groups.
This work provides a rigorous and comprehensive evaluation of ensemble methods for averaging probabilistic predictions for features of infectious disease outbreaks. A range of models, both single component models and ensemble models that combined component model predictions, demonstrated the ability to make more accurate predictions than a seasonal average baseline model. Additionally, systematic comparisons of simple and complex prediction models highlight a crucial added value of ensemble modeling, namely increased stability and consistency of model performance across seasons. Continued investigation, application, and innovation is necessary to strengthen our understanding of how to best leverage combinations of models to assist decision makers in fields, such as public health and infectious disease surveillance, that require datadriven rapid response.
Acknowledgments
This work was supported by Award Number R35GM119582 from the National Institute Of General Medical Sciences and Defense Advanced Projects Research Agency Young Faculty Award Number Dl6AP00144. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute Of General Medical Sciences, the National Institutes of Health, or the Defense Advanced Projects Research Agency. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
 1. Centers for Disease Control and Prevention. Staying Ahead of the Curve: Modeling and Public Health DecisionMaking; 2016. Available from: http://www.cdc.gov/cdcgrandrounds/archives/2016/january2016.htm.
 2. Ray E, Sakrejda K, Lauer SA, Reich NG. The Reich Lab at UMassAmherst; 2016.
 3. Polikar R. Ensemble based systems in decision making. IEEE Circuits and systems magazine. 2006;6(3):21–45.
 4. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Springer; 2011.
 5. Smyth P, Wolpert D. Linearly combining density estimators via stacking. Machine Learning. 1999;36(12):59–83.
 6. Rigollet P, Tsybakov AB. Linear and convex aggregation of density estimators. Mathematical Methods of Statistics. 2007;16(3):260–280.

7.
Ganti R, Gray A.
Cake: Convex adaptive kernel density estimation.
In: International Conference on Artificial Intelligence and Statistics; 2011. p. 498–506.
 8. Rosset S, Segal E. Boosting density estimation. In: NIPS; 2002. p. 641–648.
 9. Herbster M, Warmuth MK. Tracking the best expert. Machine Learning. 1998;32(2):151–178.

10.
Yamanishi K, Maruyama Y.
Dynamic model selection with its applications to novelty detection.
IEEE Transactions on Information Theory. 2007;53(6):2180–2189.  11. Cortes C, Kuznetsov V, Mohri M. Ensemble Methods for Structured Prediction. In: Proceedings of The 31st International Conference on Machine Learning; 2014. p. 1134–1142.
 12. Sill J, Takacs G, Mackey L, Lin D. FeatureWeighted Linear Stacking. arXiv. 2009;.
 13. Jahrer M, Töscher A, Legenstein R. Combining predictions for accurate recommender systems. In: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM; 2010. p. 693–702.
 14. Yamana TK, Kandula S, Shaman J. Superensemble forecasts of dengue outbreaks. Journal of The Royal Society Interface. 2016;13(123):20160410.
 15. Wolpert DH. Stacked generalization. Neural Networks. 1992;5(2):241–259.

16.
Friedman JH.
Greedy function approximation: a gradient boosting machine.
Annals of statistics. 2001; p. 1189–1232.  17. Centers for Disease Control and Prevention. Overview of Influenza Surveillance in the United States; 2016. Available from: https://www.cdc.gov/flu/weekly/overview.htm.
 18. Centers for Disease Control and Prevention. Regional baseline values for influenzalike illness; 2016. Available from: https://github.com/cdcepi/FluSightforecasts/blob/master/wILI{_}Baseline.csv.
 19. Centers for Disease Control and Prevention. Epidemic Prediction Initiative; 2016. Available from: https://predict.phiresearchlab.org/post/57f3f440123b0f563ece2576.
 20. Gneiting T, Raftery AE. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association. 2007;102(477):359–378.
 21. Silverman BW. Density estimation for statistics and data analysis. vol. 26. CRC press; 1986.
 22. R Core Team. R: A Language and Environment for Statistical Computing; 2015. Available from: https://www.Rproject.org/.
 23. Box GE, Jenkins GM, Reinsel GC, Ljung GM. Time series analysis: forecasting and control. John Wiley & Sons; 2015.
 24. Hyndman RJ, Khandakar Y. Automatic time series forecasting: The forecast package for R. Journal Of Statistical Software. 2008;27(3):C3–C3. doi:10.18637/jss.v027.i03.

25.
Lin X, Zhu Y.
Degenerate ExpectationMaximization Algorithm for Local Dimension Reduction.
In: Classification, Clustering, and Data Mining Applications. Berlin, Heidelberg: Springer Berlin Heidelberg; 2004. p. 259–268. Available from: http://link.springer.com/10.1007/9783642171031{_}25.  26. Chen T, He T, Benesty M. xgboost: Extreme Gradient Boosting; 2016. Available from: https://CRAN.Rproject.org/package=xgboost.
 27. Leydold J. rstream: Streams of Random Numbers; 2015. Available from: https://CRAN.Rproject.org/package=rstream.
Comments
There are no comments yet.