1 Introduction
Time series forecasting is a growing field of interest, playing an important role in nearly all fields of science and engineering, such as economics, finance, meteorology and telecommunication (Palit and Popovic, 2005). Multi-step ahead forecasting tasks are more difficult than one-step ahead forecasting (Tiao and Tsay, 1994), since they have to deal with various additional complications, like accumulation of errors, reduced accuracy, and increased uncertainty (Weigend and Gershenfeld, 1994; Sorjamaa et al., 2007).
The forecasting domain has been influenced, for a long time, by linear statistical methods such as ARIMA models. However, in the late 1970s and early 1980s, it became increasingly clear that linear models are not adapted to many real applications (Gooijer and Hyndman, 2006). In the same period, several useful nonlinear time series models were proposed, such as the bilinear model (Poskitt and Tremayne, 1986), the threshold autoregressive model (Tong and Lim, 1980; Tong, 1983, 1990) and the autoregressive conditional heteroscedastic (ARCH) model (Engle, 1982) (see (Gooijer and Hyndman, 2006) and (Gooijer and Kumar, 1992) for a review). Nowadays, Monte Carlo simulation or bootstrapping methods are used to compute nonlinear forecasts. Since no assumptions are made about the distribution of the error process, the latter approach is preferred (Clements et al., 2004; Gooijer and Hyndman, 2006). However, the study of nonlinear time series analysis and forecasting is still in its infancy compared to the development of linear time series (Gooijer and Hyndman, 2006).

In the last two decades, machine learning models have drawn attention and have established themselves as serious contenders to classical statistical models in the forecasting community (Ahmed et al., 2010; Palit and Popovic, 2005; Zhang et al., 1998). These models, also called black-box or data-driven models (Mitchell, 1997)
, are examples of nonparametric nonlinear models which use only historical data to learn the stochastic dependency between the past and the future. For instance, Werbos found that Artificial Neural Networks (ANNs) outperform classical statistical methods such as linear regression and Box-Jenkins approaches (Werbos, 1974, 1988). A similar study was conducted by Lapedes and Farber (Lapedes and Farber, 1987), who concluded that ANNs can be successfully used for modeling and forecasting nonlinear time series. Later, other models appeared, such as decision trees, support vector machines and nearest neighbor regression (Hastie et al., 2009; Alpaydin, 2010). Moreover, the empirical accuracy of several machine learning models has been explored in a number of forecasting competitions under different data conditions (e.g. the NN3, NN5, and the annual ESTSP competitions (Crone, a, b; Lendasse, 2007, 2008)), creating interesting scientific debates in the area of data mining and forecasting (Hand, 2008; Price, 2009; Crone, 2009).

In the forecasting community, researchers have paid attention to several aspects of the forecasting procedure, such as model selection (Aha, 1997; Curry and Morgan, 2006; Anders and Korn, 1999; Chapelle and Vapnik, 2000), the effect of deseasonalization (Hylleberg, 1992; Makridakis et al., 1998; Nelson et al., 1999; Zhang and Qi, 2005), forecast combination (Bates, J. M. and Granger, C. W. J., 1969; Clemen, 1989; Timmermann, 2006) and many other critical topics (Gooijer and Hyndman, 2006). However, approaches for generating multi-step ahead forecasts with machine learning models have not received as much attention, as pointed out by Kline: “One issue that has had limited investigation is how to generate multiple-step-ahead forecasts” (Kline, 2004).
To the best of our knowledge, five alternatives (or strategies) have been proposed in the literature to tackle an H-step ahead forecasting task. The Recursive strategy (Weigend and Gershenfeld, 1994; Sorjamaa et al., 2007; Cheng et al., 2006; Tiao and Tsay, 1994; Kline, 2004; Hamzaçebi et al., 2009) iterates H times a one-step ahead forecasting model to obtain the H forecasts. After each estimation of a future series value, the estimate is fed back as an input for the following forecast.
In contrast to the previous strategy, which uses a single model, the Direct strategy (Weigend and Gershenfeld, 1994; Sorjamaa et al., 2007; Cheng et al., 2006; Tiao and Tsay, 1994; Kline, 2004; Hamzaçebi et al., 2009) estimates a set of H forecasting models, each returning a forecast for the h-th value (h = 1, ..., H).
A combination of the two previous strategies, called the DirRec strategy, has been proposed in (Sorjamaa and Lendasse, 2006). The idea behind this strategy is to combine aspects of both the Direct and the Recursive strategies: a different model is used at each step, but the approximations from previous steps are introduced into the input set.
In order to preserve, between the predicted values, the stochastic dependency characterizing the time series, the Multi-Input Multi-Output (MIMO) strategy has been introduced and analyzed in (Bontempi, 2008; Bontempi and Ben Taieb, 2011; Kline, 2004). Unlike the previous strategies, where the models return a scalar value, the MIMO strategy returns a vector of future values in a single step.
The last strategy, called DIRMO (Ben Taieb et al., 2009), aims to preserve the most appealing aspects of both the DIRect and miMO strategies, seeking a trade-off between preserving the stochastic dependency between the forecasted values and keeping the flexibility of the modeling procedure.
In the literature, these five forecasting strategies have been presented separately, sometimes using different terminologies. The first contribution of this paper is to present a thorough unified review as well as a theoretical comparative analysis of the existing strategies for multi-step ahead forecasting.
Although many studies have compared the different multi-step ahead approaches, their collective outcome regarding forecasting performance has been inconclusive, so the modeler is still left with little guidance as to which strategy to use. For example, (Bontempi et al., 1999; Weigend et al., 1992) provide experimental evidence in favor of the Recursive strategy over the Direct strategy. However, results from (Zhang and Hutchinson, 1994; Sorjamaa et al., 2007; Hamzaçebi et al., 2009) support the claim that the Direct strategy is better than the Recursive strategy. The work by (Sorjamaa and Lendasse, 2006) shows that the DirRec strategy gives better performance than the Direct and Recursive strategies. The Direct and Recursive strategies have been theoretically and empirically compared in (Atiya et al., 1999), where the authors obtained theoretical and experimental evidence in favor of the Direct strategy. Concerning the MIMO strategy, Kline (Kline, 2004) and Cheng et al. (Cheng et al., 2006) support the idea that the MIMO strategy provides worse forecasting performance than the Recursive and Direct strategies. However, in (Bontempi, 2008; Bontempi and Ben Taieb, 2011), the comparison between MIMO, Recursive, and Direct was in favor of MIMO. Finally, (Ben Taieb et al., 2009, 2010) show that the DIRMO strategy gives better forecasting results than the Direct and MIMO strategies when the parameter controlling the degree of dependency between forecasts is correctly identified. These previous comparisons have been performed with different datasets in different configurations, using different forecasting methods, such as Multiple Linear Regression, Artificial Neural Networks, Hidden Markov Models and Nearest Neighbors.
The contradictory findings of these studies make it all the more necessary to investigate further the relative performance of these strategies. The second contribution of this paper is an experimental comparison of the different multi-step ahead forecasting strategies on the time series of the NN5 international forecasting competition benchmark. These time series pose some of the realistic problems that one usually encounters in a typical multi-step ahead forecasting task, for example the existence of several time series of possibly related dynamics, outliers, missing values, and multiple overlying seasonalities. This experimental comparison is performed for a variety of different configurations (regarding seasonality, input selection and combination), in order to make the comparison as encompassing as possible. In addition, the methodology used for this experimental comparison is based on the guidelines and recommendations advocated in methodological papers such as (Demšar, 2006; García and Herrera, 2009).

Note that the aim of this paper is not to compare machine learning algorithms for forecasting (a comparison already conducted in (Ahmed et al., 2010)) but rather to show, for a given learning algorithm, how the choice of the forecasting strategy can considerably influence the performance of the multi-step ahead forecasts. In this work, we adopted the Lazy Learning algorithm (Aha, 1997), a particular instance of local learning, which has been successfully applied to many real-world forecasting tasks (Sauer, 1994; Bontempi et al., 1998; McNames, 1998).
Last but not least, the paper also proposes a Lazy Learning entry to the NN5 forecasting competition (Crone, b). The goal is to assess how this model fares compared to the other computational intelligence models that were proposed for the competition (Bontempi and Ben Taieb, 2011), which will give us an idea of the potential of this approach.
The paper is organized as follows. The next section presents a review of the different forecasting strategies. Section 3 describes the Lazy Learning model and the associated algorithms for the different forecasting strategies. Section 4 gives a detailed presentation of the datasets and the methodology applied for the experimental comparison. Section 5 presents the results and discusses them. Finally, Section 6 gives a summary and concludes the work.
2 Strategies for Multi-Step-Ahead Time Series Forecasting
A multi-step ahead (also called long-term) time series forecasting task consists of predicting the next H values y_{N+1}, …, y_{N+H} of a historical time series composed of N observations, where H denotes the forecasting horizon.
This section will first give a presentation of the five forecasting strategies and next, a subsection will be devoted to a comparative analysis of these strategies in terms of number and types of models to learn as well as forecasting properties.
We will use a common notation where f and F denote the functional dependency between past and future observations, d refers to the embedding dimension (Casdagli et al., 1991) of the time series, that is, the number of past values used to predict future values, and w represents the term that includes modeling error, disturbances and/or noise.
2.1 Recursive strategy
The oldest and most intuitive forecasting strategy is the Recursive (also called Iterated or Multi-Stage) strategy (Weigend and Gershenfeld, 1994; Sorjamaa et al., 2007; Cheng et al., 2006; Tiao and Tsay, 1994; Kline, 2004; Hamzaçebi et al., 2009). In this strategy, a single model f is trained to perform a one-step ahead forecast, i.e.

    y_{t+1} = f(y_t, …, y_{t-d+1}) + w,    (1)

with t ∈ {d, …, N-1}.
When forecasting steps ahead, we first forecast the first step by applying the model. Subsequently, we use the value just forecasted as part of the input variables for forecasting the next step (using the same onestep ahead model). We continue in this manner until we have forecasted the entire horizon.
Let the trained one-step ahead model be f̂. Then the H forecasts are given by:

    ŷ_{N+h} = f̂(y_N, …, y_{N-d+1})                              if h = 1,
    ŷ_{N+h} = f̂(ŷ_{N+h-1}, …, ŷ_{N+1}, y_N, …, y_{N-d+h})        if h ∈ {2, …, d},
    ŷ_{N+h} = f̂(ŷ_{N+h-1}, …, ŷ_{N+h-d})                         if h ∈ {d+1, …, H}.    (2)
Depending on the amount of noise present in the time series and on the forecasting horizon, the Recursive strategy may suffer from low performance in multi-step ahead forecasting tasks. This is especially true once the forecasting horizon exceeds the embedding dimension d, as from that point on all the inputs are forecasted values instead of actual observations (Equation 2). The reason for the potential inaccuracy is that the Recursive strategy is sensitive to the accumulation of errors with the forecasting horizon: errors present in intermediate forecasts propagate forward as these forecasts are used to determine subsequent forecasts.
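As an illustrative sketch (not the paper's implementation), the recursive loop can be written in a few lines of Python; the one-step model is passed in as a plain function:

```python
def recursive_forecast(model, history, d, H):
    """Recursive strategy: iterate a one-step-ahead model H times,
    feeding each forecast back as an input of the next one."""
    window = list(history)[-d:]          # last d observations
    forecasts = []
    for _ in range(H):
        y_hat = model(window[-d:])       # one-step-ahead forecast
        forecasts.append(y_hat)
        window.append(y_hat)             # forecast re-used as an input
    return forecasts
```

For instance, with a hypothetical drift model `lambda w: w[-1] + 1`, `recursive_forecast(model, [1, 2, 3], d=2, H=3)` returns `[4, 5, 6]`: from the second step on, the inputs contain forecasted values.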
In spite of these limitations, the Recursive strategy has been successfully used to forecast many real-world time series by using different machine learning models, like recurrent neural networks (Saad et al., 1998) and nearest neighbors (McNames, 1998; Bontempi et al., 1999).

2.2 Direct strategy
The Direct (also called Independent) strategy (Weigend and Gershenfeld, 1994; Sorjamaa et al., 2007; Cheng et al., 2006; Tiao and Tsay, 1994; Kline, 2004; Hamzaçebi et al., 2009) consists of forecasting each horizon independently of the others. In other terms, H models f_h are learned (one for each horizon) from the time series, where
    y_{t+h} = f_h(y_t, …, y_{t-d+1}) + w,    (3)

with t ∈ {d, …, N-H} and h ∈ {1, …, H}.
The forecasts are obtained by using the learned models as follows:
    ŷ_{N+h} = f̂_h(y_N, …, y_{N-d+1}),  h ∈ {1, …, H}.    (4)
This implies that the Direct strategy does not use any approximated values to compute the forecasts (Equation 4) and is therefore immune to the accumulation of errors. However, the H models are learned independently, which induces a conditional independence of the H forecasts. This affects the forecasting accuracy, as it prevents the strategy from considering complex dependencies between the variables (Bontempi, 2008; Bontempi and Ben Taieb, 2011; Kline, 2004). For example, consider a case where the best forecast is a linear or mildly nonlinear trend. The Direct strategy could yield a broken curve because of the “uncooperative” way the forecasts are generated. Moreover, this strategy demands a large computational time, since there are as many models to learn as the length of the horizon.
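A minimal sketch of the Direct forecasting step (function names are illustrative; the H models are assumed to be already trained):

```python
def direct_forecast(models, history, d):
    """Direct strategy: H independently trained models f_1, ..., f_H,
    all fed with the same last d observations; no forecast is fed back."""
    window = list(history)[-d:]
    return [f_h(window) for f_h in models]
```

Because each `f_h` only ever sees actual observations, errors cannot accumulate, but nothing ties the forecast at horizon h to the forecast at horizon h+1.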
2.3 DirRec strategy
The DirRec strategy (Sorjamaa and Lendasse, 2006) combines the architectures and the principles underlying the Direct and the Recursive strategies. DirRec computes the forecasts with a different model for every horizon (like the Direct strategy) and, at each time step, it enlarges the set of inputs by adding variables corresponding to the forecasts of the previous step (like the Recursive strategy). However, note that unlike the two previous strategies, the embedding size is not the same for all the horizons. In other terms, the DirRec strategy learns H models f_h from the time series, where
    y_{t+h} = f_h(y_{t+h-1}, …, y_{t-d+1}) + w,    (5)

with t ∈ {d, …, N-H} and h ∈ {1, …, H}.
To obtain the forecasts, the learned models are used as follows:
    ŷ_{N+h} = f̂_1(y_N, …, y_{N-d+1})                              if h = 1,
    ŷ_{N+h} = f̂_h(ŷ_{N+h-1}, …, ŷ_{N+1}, y_N, …, y_{N-d+1})        if h ∈ {2, …, H}.    (6)
This strategy outperformed the Direct and the Recursive strategies on two real-world time series: the Santa Fe and the Poland Electricity Load datasets (Sorjamaa and Lendasse, 2006). Little research has been done regarding this strategy, however, so there is a need for further evaluation.
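The DirRec forecasting step can be sketched as follows (illustrative names; the per-horizon models are assumed trained on inputs of growing size):

```python
def dirrec_forecast(models, history, d):
    """DirRec strategy: one model per horizon (like Direct) whose input
    set is enlarged with the forecasts of the previous steps (like Recursive)."""
    inputs = list(history)[-d:]          # the model for horizon h sees d + h - 1 values
    forecasts = []
    for f_h in models:
        y_hat = f_h(inputs)
        forecasts.append(y_hat)
        inputs = inputs + [y_hat]        # the input set grows at each step
    return forecasts
```

Note how, unlike in the Recursive sketch, old observations are never dropped from the input set.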
2.4 MIMO strategy
The three previous strategies (Recursive, Direct and DirRec) may be considered as Single-Output strategies (Ben Taieb et al., 2010), since they model the data as a (multiple-input) single-output function (see Equations 2, 4 and 6).
The introduction of the Multi-Input Multi-Output (MIMO) strategy (Bontempi, 2008; Bontempi and Ben Taieb, 2011) (also called Joint strategy (Kline, 2004)) has been motivated by the need to avoid modeling single-output mappings, which neglect the existence of stochastic dependencies between future values and consequently affect the forecast accuracy (Bontempi, 2008; Bontempi and Ben Taieb, 2011).
The MIMO strategy learns one multiple-output model F from the time series, where
    [y_{t+H}, …, y_{t+1}] = F(y_t, …, y_{t-d+1}) + w,    (7)

with t ∈ {d, …, N-H}, where F is a vector-valued function (Micchelli and Pontil, 2005) and w is a noise vector with a covariance that is not necessarily diagonal (Matías, 2005).
The forecasts are returned in one step by the learned multiple-output model F̂, where

    [ŷ_{N+H}, …, ŷ_{N+1}] = F̂(y_N, …, y_{N-d+1}).    (8)
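As a sketch, the vector-valued model F̂ can be instantiated, for example, with a k-nearest-neighbor average over the embedded training pairs (an assumed stand-in for the Lazy Learning model used later in the paper; function names are illustrative):

```python
import numpy as np

def mimo_knn_forecast(series, d, H, k):
    """MIMO strategy: a single multiple-output model maps the last d
    observations to the whole vector of the next H values.  Here the
    model is a k-nearest-neighbour average over embedded training pairs."""
    y = np.asarray(series, dtype=float)
    starts = range(d, len(y) - H + 1)
    X = np.array([y[t - d:t] for t in starts])     # inputs: past windows
    Y = np.array([y[t:t + H] for t in starts])     # outputs: next H values
    dist = np.linalg.norm(X - y[-d:], axis=1)      # distance to the query
    nearest = np.argsort(dist)[:k]
    return Y[nearest].mean(axis=0)                 # all H forecasts at once
```

A single call returns all H forecasts, so the forecasted block is always a combination of actually observed continuation patterns.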
The rationale of the MIMO strategy is to preserve, between the predicted values, the stochastic dependency characterizing the time series. This strategy avoids the conditional independence assumption made by the Direct strategy as well as the accumulation of errors that plagues the Recursive strategy. So far, this strategy has been successfully applied to several real-world multi-step ahead time series forecasting tasks (Bontempi, 2008; Bontempi and Ben Taieb, 2011; Ben Taieb et al., 2009, 2010).

However, the need to preserve the stochastic dependencies by using one model has a drawback, as it constrains all the horizons to be forecasted with the same model structure. This constraint could reduce the flexibility of the forecasting approach (Ben Taieb et al., 2009). This was the motivation for the introduction of a new multiple-output strategy, DIRMO (Ben Taieb et al., 2009, 2010), presented next.
2.5 DIRMO strategy
The DIRMO strategy (Ben Taieb et al., 2009) aims to preserve the most appealing aspects of both the DIRect and miMO strategies. Taking a middle approach, DIRMO forecasts the horizon in blocks, where each block is forecasted in a MIMO fashion. The H-step ahead forecasting task is thus decomposed into n multiple-output forecasting tasks (n = H/s), each with an output of size s (s ∈ {1, …, H}).
When s = 1, the number of forecasting tasks is n = H, which corresponds to the Direct strategy. When s = H, there is a single forecasting task (n = 1), which corresponds to the MIMO strategy. All the intermediate configurations between these two extremes are obtained with other values of the parameter s.

The tuning of the parameter s allows us to improve the flexibility of the MIMO strategy by calibrating the dimensionality of the outputs (no dependency in the case s = 1 and maximal dependency for s = H). This provides a beneficial trade-off between preserving a larger degree of the stochastic dependency between future values and having a greater flexibility of the predictor.
The DIRMO strategy, previously called the MISMO strategy (Ben Taieb et al., 2009) (renamed for clarity reasons), learns n models F_p from the time series, where
    [y_{t+p·s}, …, y_{t+(p-1)·s+1}] = F_p(y_t, …, y_{t-d+1}) + w,    (9)

with t ∈ {d, …, N-H}, p ∈ {1, …, n}, and where F_p is a vector-valued function if s > 1.
The forecasts are returned by the n learned models F̂_p as follows:

    [ŷ_{N+p·s}, …, ŷ_{N+(p-1)·s+1}] = F̂_p(y_N, …, y_{N-d+1}),  p ∈ {1, …, n}.    (10)
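A sketch of the DIRMO forecasting step (illustrative names; the n block models are assumed already trained, each returning s values):

```python
def dirmo_forecast(models, history, d):
    """DIRMO strategy: n = H/s multiple-output models, each returning one
    block of s consecutive forecasts from the same d past observations."""
    window = list(history)[-d:]
    forecasts = []
    for F_p in models:                   # p = 1, ..., n
        forecasts.extend(F_p(window))    # each call yields s forecasts
    return forecasts
```

With n = H one-value blocks this reduces to the Direct strategy; with a single H-value block it reduces to MIMO.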
2.6 Comparative Analysis
To summarize, there are five possible forecasting strategies that perform a multistep ahead forecasting task: Recursive, Direct, DirRec, MIMO and DIRMO strategies. Figure 1 shows the different forecasting strategies with links indicating their relationships.
As we see, the DirRec strategy is a combination of the Direct and the Recursive strategy, while the DIRMO strategy is a combination of the Direct and the MIMO strategy.
Contingent on the selected strategy, a different number and type of models will be required. Before presenting the general comparison of the multi-step ahead forecasting strategies, let us highlight, using an example, the differences between them.

Consider a multi-step ahead forecasting task for the time series [y_1, …, y_N] where the forecasting horizon is H = 4. Table 1 shows, for each strategy, the different input sets and forecasting models involved in the calculation of the four forecasts ŷ_{N+1}, …, ŷ_{N+4}.
(example with embedding dimension d = 2)

Recursive      ŷ_{N+1} = f̂(y_N, y_{N-1})
               ŷ_{N+2} = f̂(ŷ_{N+1}, y_N)
               ŷ_{N+3} = f̂(ŷ_{N+2}, ŷ_{N+1})
               ŷ_{N+4} = f̂(ŷ_{N+3}, ŷ_{N+2})

Direct         ŷ_{N+h} = f̂_h(y_N, y_{N-1}),  h ∈ {1, 2, 3, 4}

DirRec         ŷ_{N+1} = f̂_1(y_N, y_{N-1})
               ŷ_{N+2} = f̂_2(ŷ_{N+1}, y_N, y_{N-1})
               ŷ_{N+3} = f̂_3(ŷ_{N+2}, ŷ_{N+1}, y_N, y_{N-1})
               ŷ_{N+4} = f̂_4(ŷ_{N+3}, ŷ_{N+2}, ŷ_{N+1}, y_N, y_{N-1})

MIMO           [ŷ_{N+4}, ŷ_{N+3}, ŷ_{N+2}, ŷ_{N+1}] = F̂(y_N, y_{N-1})

DIRMO (s = 2)  [ŷ_{N+2}, ŷ_{N+1}] = F̂_1(y_N, y_{N-1})
               [ŷ_{N+4}, ŷ_{N+3}] = F̂_2(y_N, y_{N-1})
Let S and M denote the amount of computational time needed to learn (with a given learning algorithm) a Single-Output (SO) model and a Multiple-Output (MO) model, respectively. For a given H-step ahead forecasting task, Table 2 shows, for each strategy, the number and type of models to learn, the size of the output of each model, as well as the computational time.
Strategy     Number of models   Type of models   Size of output   Computational time

Recursive    1                  SO               1                S
Direct       H                  SO               1                H·S
DirRec       H                  SO               1                H·S'
MIMO         1                  MO               H                M
DIRMO        H/s                MO               s                (H/s)·M

(S' denotes the time needed to learn a SO model whose input set grows at each step, so that S' ≥ S.)
Suppose that S ≤ M ≤ s·S, which is a reasonable assumption because learning a model with a vector-valued output of size s takes more time than learning a model with a single-valued output, but typically less time than learning s separate single-output models. This allows us to rank the forecasting strategies according to their training computational time given in Table 2. Indeed, we have

    S ≤ M ≤ (H/s)·M ≤ H·S ≤ H·S',    (11)

i.e. the Recursive strategy is the cheapest one, followed by MIMO, DIRMO, Direct and DirRec, where we suppose that the parameter s of DIRMO is not equal to 1 or H.
Note, on the one hand, that the time S' needed to learn a SO model of the DirRec strategy satisfies S' ≥ S, because the input size of each SO task increases at each step. On the other hand, if the value of the parameter s has to be selected by some tuning procedure, the DIRMO strategy will take more time and may then be the slowest one.
To conclude this section, Table 3 summarizes the pros and cons of the five forecasting strategies.
Strategy     Pros                                                                     Cons                                                             Computational time needed

Recursive    Suitable for noise-free time series (e.g. chaotic)                       Accumulation of errors                                           S
Direct       No accumulation of errors                                                Conditional independence assumption                              H·S
DirRec       Trade-off between Direct and Recursive                                   Input set grows linearly with the horizon                        H·S'
MIMO         No conditional independence assumption                                   Reduced flexibility: same model structure for all the horizons   M
DIRMO        Trade-off between total dependence and total independence of forecasts   One additional parameter (s) to estimate                         (H/s)·M
3 Lazy Learning for Time Series Forecasting
Each of the forecasting strategies introduced in Section 2 demands the definition of a specific forecasting model or learning algorithm to estimate either the scalar-valued function f (see Equations 1, 3 and 5) or the vector-valued function F (see Equations 7 and 9) which represent the temporal stochastic dependencies. As the goal of the paper is not to compare forecasting models (as in (Ahmed et al., 2010)) but rather multi-step ahead forecasting strategies, the choice of an underlying forecasting model is required to set up the experiments. In this paper, we adopted the Lazy Learning algorithm, a particular instance of local learning models, since it has been shown to be particularly effective in time series forecasting tasks (Bontempi et al., 1998; Bontempi, 1999, 2008; Bontempi and Ben Taieb, 2011; Ben Taieb et al., 2009, 2010).
The next section gives a general comparison of global models with local models. Section 3.2 presents the Lazy Learning algorithm in terms of learning properties. Sections 3.3 and 3.4 describe two Lazy Learning algorithms for two types of learning tasks, namely the Single-Output and Multiple-Output Lazy Learning algorithms. Finally, a discussion on model combination is presented.
3.1 Global vs local modeling for supervised learning
Forecasting the future values of a time series using past observations can be reduced to a supervised learning problem or, more precisely, to a regression problem. Indeed, the time series can be seen as a dataset made of input/output pairs where the first component, called the input, is a past temporal pattern and the second, called the output, is the corresponding future pattern. Being able to predict the unknown output for a given input is then equivalent to forecasting the future values given the last observations of the time series.
Global modeling is the typical approach to the supervised learning problem. Global models are parametric models that describe the relationship between the inputs and the output values as an analytical function over the whole input domain. Examples of global models are linear models (Montgomery et al., 2006), nonlinear statistical regressions (Seber and Wild, 1989) and Neural Networks (Rumelhart et al., 1986).

Another approach is divide-and-conquer modeling, which consists in relaxing the global modeling assumptions by dividing a complex problem into simpler problems, whose solutions can be combined to yield a solution to the original problem (Bontempi, 1999). The divide-and-conquer approach has evolved in two different paradigms: the modular architectures and the local modeling approach (Bontempi, 1999).
Modular techniques replace a global model with different modules covering different parts of the input space. Examples based on this approach are Fuzzy Inference Systems (Takagi and Sugeno, 1985), Radial Basis Function Networks (Moody and Darken, 1989; Poggio and Girosi, 1990), Local Model Networks (Murray-Smith, 1994), Trees (Breiman et al., 1984) and Mixtures of Experts (Jordan and Jacobs, 1994). The modular approach lies on an intermediate scale between the two extremes, the global and the local approach. However, the identification of these models is still performed on the basis of the whole dataset and requires the same procedures used for generic global models.

Local modeling techniques are at the extreme end of divide-and-conquer methods. They are nonparametric models that combine excellent theoretical properties with a simple and flexible learning procedure. Indeed, they do not aim to return a complete description of the input/output mapping but rather to approximate the function in a neighborhood of the point to be predicted (also called the query point). There are different examples of local models, for example nearest neighbor, weighted average, and locally weighted regression (Atkeson et al., 1997b). Each of these models uses data points near the point to be predicted to estimate the unknown output. Nearest neighbor models simply find the closest point and use its output value. Weighted average models combine the closest points by averaging them with weights inversely proportional to their distance to the point to be predicted. Locally weighted regression models fit a model to nearby points with a weighted regression where the weights are a function of the distances to the query point.
The effectiveness of local models is well known in the time series and computational intelligence communities. For example, the method proposed by Sauer (Sauer, 1994) gave good performance and ranked second for the Santa Fe A dataset in the forecasting competition organized by the Santa Fe Institute. Moreover, the two top-ranked entries of the K.U. Leuven competition used local learning methods (Bontempi et al., 1998; McNames, 1998).
In this work, we will restrict ourselves to a particular instance of local modeling algorithms: the Lazy Learning algorithm (Aha, 1997).
3.2 The Lazy Learning algorithm
It is possible to encounter different degrees of “laziness” in local learning algorithms. For instance, a Nearest Neighbor (NN) algorithm which learns the best number of neighbors before the query is requested is hardly a lazy approach since, after the query is presented, it requires only a reduced amount of computation: finding the neighbors and averaging their outputs. On the contrary, a local method which relies on the query to select the number of neighbors or other structural parameters presents a higher degree of “laziness”.

The Lazy Learning (LL) algorithm, extensively discussed in (Birattari et al., 1999; Birattari and Bersini, 1997), is a query-based local modeling technique where the whole learning procedure is deferred until a forecast is required. When the query is requested, the learning procedure starts by selecting the best number of neighbors (or other structural parameters) and the dataset is then searched for the nearest neighbors of the query point. These nearest neighbors are used to estimate a local model, which returns a forecast. The local model is then discarded and the procedure is repeated from scratch for subsequent queries.
The LL algorithm has a number of attractive features (Aha, 1997), namely, the reduced number of assumptions, the online learning capability and the capacity to model nonstationarity. LL assumes no a priori knowledge on the process underlying the data, which is particularly relevant in real datasets. These considerations motivate the adoption of the LL algorithm as a learning model in a multistep ahead forecasting context.
Local modeling techniques require the definition of a set of model parameters, namely the number of neighbors, the kernel function, the parametric family and the distance metric[REF]. In the literature, different methods have been proposed to select the adequate configuration automatically (Atkeson et al., 1997b; Birattari et al., 1999). However, in this paper, we will limit the search to the selection of the number of neighbors (equivalent to bandwidth selection). This is essentially the most critical parameter, as it controls the bias/variance trade-off. Bandwidth selection is usually performed by rule-of-thumb techniques (Fan and Gijbels, 1995), plug-in methods (Ruppert et al., 1995) or cross-validation strategies (Atkeson et al., 1997a). Concerning the other parameters, we use the tricubic kernel (Cleveland et al., 1988) as kernel function, a constant model for the parametric family and the Euclidean distance as metric.

Note that in order to apply local learning to a time series, we need to embed it into a dataset made of input/output pairs where the first component is a temporal pattern of length d and the second component is either the next value (in the case of Single-Output modeling) or the consecutive temporal pattern of future values (in the case of Multiple-Output modeling). In the following sections, D will refer to the embedded time series with its input/output pairs.
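The embedding just described can be sketched as follows (a minimal version; the pair of arrays plays the role of the dataset D):

```python
import numpy as np

def embed(series, d, s=1):
    """Embed a time series into input/output pairs: each input is a
    temporal pattern of length d, each output holds the next s values
    (s = 1 for Single-Output modeling, s > 1 for Multiple-Output modeling)."""
    y = np.asarray(series, dtype=float)
    starts = range(d, len(y) - s + 1)
    X = np.array([y[t - d:t] for t in starts])   # past patterns
    Y = np.array([y[t:t + s] for t in starts])   # corresponding futures
    return X, Y
```

A series of length N yields N - d - s + 1 pairs.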
3.3 SingleOutput Lazy Learning algorithm
In the case of Single-Output learning (i.e. with a scalar output), the Lazy Learning procedure consists of a sequence of steps detailed in Algorithm 1. The algorithm assesses the generalization performance of different local models and compares them in order to select the best one in terms of generalization capability. To do so, the algorithm associates a Leave-One-Out (LOO) error with the estimation obtained with each number k of neighbors (Algorithm 1).
The LOO error can provide a reliable estimate of the generalization capability. However, the disadvantage of such an approach is that it requires repeating the training process k times, which entails a large computational effort. Fortunately, in the case of linear models there exists a powerful statistical procedure to compute the LOO cross-validation measure at a reduced computational cost: the PRESS (Prediction Sum of Squares) statistic (Allen, 1974).
In the case of a constant model, the LOO error for the estimation of the query point is calculated as follows (Bontempi, 1999):
    E_LOO(k) = (1/k) Σ_{j=1}^{k} e_j(k)^2,    (12)

where e_j(k) designates the error obtained by setting aside the j-th neighbor of the query point (j ∈ {1, …, k}). If we denote the outputs of the k closest neighbors of the query point by {y_1, …, y_k} and their average, i.e. the prediction of the constant model, by ŷ = (1/k) Σ_{i=1}^{k} y_i, then e_j(k) is defined as
    e_j(k) = y_j - (1/(k-1)) Σ_{i=1, i≠j}^{k} y_i    (13)
           = y_j - (1/(k-1)) (Σ_{i=1}^{k} y_i - y_j)    (14)
           = y_j + y_j/(k-1) - (k/(k-1)) ŷ    (15)
           = (k/(k-1)) (y_j - ŷ)    (16)
           = (y_j - ŷ) / (1 - 1/k)    (17)
Note that if we use Equation 13 to calculate the LOO error (Equation 12), the training process is repeated k times, since the sum in Equation 13 is computed for each index j. However, by using the PRESS statistic (Equation 17), we avoid this large computational effort, since the sum is replaced by the average ŷ, which has already been computed to fit the local model. This makes the PRESS statistic an efficient method to compute the LOO error.
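The equivalence is easy to check numerically; the sketch below computes the LOO error of the constant model in closed form via the PRESS residuals (illustrative code, not the paper's implementation):

```python
import numpy as np

def loo_error_constant(y):
    """LOO error of the local constant model (the mean of the k neighbour
    outputs), obtained in closed form with the PRESS residuals of
    Equation 17 instead of refitting the model k times."""
    y = np.asarray(y, dtype=float)
    k = len(y)
    e = (y - y.mean()) / (1.0 - 1.0 / k)    # PRESS residuals (Equation 17)
    return float(np.mean(e ** 2))           # LOO error (Equation 12)
```

Setting each neighbor aside in turn and averaging the remaining k - 1 outputs yields exactly the same residuals, at k times the cost.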
3.4 MultipleOutput Lazy Learning algorithm
The adoption of Multiple-Output strategies requires the design of multiple-output (or, equivalently, multi-response) modeling techniques (Matías, 2005; Breiman and Friedman, 1997; Micchelli and Pontil, 2005) where the output is no longer a scalar quantity but a vector of values. As in the Single-Output case, we need criteria to assess and compare local models with different numbers of neighbors. In the following, we present two criteria: the first one is an extension of the LOO error to the Multiple-Output case (Algorithm 2) (Bontempi, 2008; Ben Taieb et al., 2010) and the second one is a criterion proper to Multiple-Output modeling (Algorithm 3) (Ben Taieb et al., 2010; Bontempi and Ben Taieb, 2011). Note that, in the two algorithms, the output is a vector (its size equals H with the MIMO strategy and s with the DIRMO strategy).
Algorithm 2 is an extension of Algorithm 1 to vectorial outputs. We still use the LOO cross-validation measure as a criterion to estimate the generalization capability of the model, but here the LOO error is an aggregation of the errors obtained for each output (Algorithm 2). Note that the same number of neighbors is selected for all the outputs (e.g. with the MIMO strategy), unlike what could happen with different Single-Output tasks (e.g. with the Direct strategy).
The second criterion exploits the fact that the forecasting horizon H is supposed to be large (multi-step ahead forecasting) and hence that we have enough samples to estimate some descriptive statistics. Then, instead of using the Leave-One-Out error, we can use as criterion a measure of stochastic discrepancy between the forecasted values and the training time series. The lower the discrepancy between the descriptors of the forecasts and those of the training time series, the better the quality of the forecasts (Bontempi and Ben Taieb, 2011).

Several measures of discrepancy can be defined, both linear and nonlinear. For example, the autocorrelation can be used as a linear statistic and the maximum likelihood as a nonlinear one. In this work, we will consider a single linear measure using both the autocorrelation and the partial autocorrelation.
The assessment of the quality of the forecast $\hat{y}_q$ for the query point $q$ is calculated as follows

$$D(\hat{y}_q) = \bigl[1 - \rho\bigl(ACF(Y \oplus \hat{y}_q),\, ACF(Y)\bigr)\bigr] + \bigl[1 - \rho\bigl(PACF(Y \oplus \hat{y}_q),\, PACF(Y)\bigr)\bigr] \qquad (18)$$

where the symbol "$\oplus$" represents the concatenation, $Y$ represents the training time series and $\rho$ is the Pearson correlation. This discrepancy measure is composed of two parts: the first uses the autocorrelation (noted $ACF$) and the second the partial autocorrelation (noted $PACF$).
For each part, we calculate the discrepancy (estimated with the correlation $\rho$) between, on the one hand, the autocorrelation (or partial autocorrelation) of the concatenation of the training time series and the forecasted sequence and, on the other hand, the autocorrelation (or partial autocorrelation) of the training time series alone (Bontempi, 2008; Ben Taieb et al., 2009).
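The autocorrelation part of such a criterion can be sketched as follows; the exact functional form used by the authors may differ, so this is only an illustration of the idea (correlating the ACF of the concatenated series with that of the training series):

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelation function at lags 1..max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    var = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / var for k in range(1, max_lag + 1)])

def acf_agreement(train, forecast, max_lag=20):
    """Pearson correlation between the ACF of the training series and the
    ACF of the training series concatenated with the forecast; higher values
    mean the forecast better preserves the linear stochastic structure."""
    a_train = acf(train, max_lag)
    a_concat = acf(np.concatenate([train, forecast]), max_lag)
    return np.corrcoef(a_train, a_concat)[0, 1]
```

A forecast that continues the periodic structure of the training series scores higher than one that introduces a foreign periodicity.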
In Algorithm 3, after evaluating the performance of local models with different numbers of neighbors, the one which minimizes the discrepancy between the forecasted sequence and the training time series is selected. In other words, the goal is to select the number of neighbors which best preserves the stochastic properties of the time series in the forecasted sequence. Finally, the corresponding prediction is returned.
3.5 Model selection or model averaging
Considering Algorithm 1, we can see that we generate, for the query point, a set of predictions, each obtained with a different number of neighbors. For each of these predictions, a testing error has been calculated. Note that the following considerations are also applicable to Algorithms 2 and 3.
The goal of model selection is to use all this information (the set of predictions and testing errors) to compute the final prediction for the query point. There exist two main paradigms: the winner-take-all approach and the combination approach.
In Algorithm 1, we presented the winner-take-all approach (noted WINNER) (Maron and Moore, 1997), which consists of comparing the set of models and selecting the one with the lowest testing error.
Selecting the best model according to the testing error is intuitively the approach which should work the best. However, results in machine learning show that the performance of the final model can be improved by combining models having different structures (Raudys and Zliobaite, 2006; Jacobs et al., 1991; Breiman, 1996; Schapire et al., 1998).
In order to apply model averaging, the model selection step of Algorithm 1 can be replaced by

$$\hat{y}_q = \sum_{k} w_k\, \hat{y}_q^{(k)}, \qquad \sum_{k} w_k = 1, \qquad (19)$$

where $\hat{y}_q^{(k)}$ is the prediction obtained with $k$ neighbors and an average is calculated. The weights $w_k$ take different values depending on the combination approach adopted. If $w_k = 1/K$, where $K$ is the number of combined models, we are in the case of an equally weighted combination and Equation 19 reduces to an arithmetic mean (noted COMB). Otherwise, if the weights are assigned inversely proportional to the testing errors $e_k$, i.e. $w_k = e_k^{-1} / \sum_j e_j^{-1}$, Equation 19 reduces to a weighted mean (noted WCOMB).
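The winner-take-all and the two combination rules can be sketched as follows; normalizing the inverse-error weights so that they sum to one is our assumption:

```python
import numpy as np

def combine_forecasts(preds, errors, mode="wcomb"):
    """Combine candidate forecasts (rows of `preds`) given their testing
    errors, following the three model-selection rules discussed in the text."""
    preds = np.asarray(preds, dtype=float)
    errors = np.asarray(errors, dtype=float)
    if mode == "winner":       # winner-take-all: keep the lowest testing error
        return preds[np.argmin(errors)]
    if mode == "comb":         # equally weighted arithmetic mean
        w = np.full(len(preds), 1.0 / len(preds))
    elif mode == "wcomb":      # weights inversely proportional to the errors
        w = (1.0 / errors) / np.sum(1.0 / errors)
    else:
        raise ValueError(mode)
    return w @ preds
```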
4 Experimental Setup
4.1 Time Series Data
In the last decade, several time series forecasting competitions (e.g. the NN3, NN5 and ESTSP competitions (Crone, a, b; Lendasse, 2007, 2008)) have been organized in order to compare and evaluate the performance of computational intelligence methods. Among them, the NN5 competition (Crone, b) is one of the most interesting ones, since it includes the challenges of a real-world multi-step ahead forecasting task, namely multiple time series, outliers, missing values and multiple overlying seasonalities. Figure 2 shows four time series from the NN5 dataset.
Each of the 111 time series of this competition represents roughly two years of daily cash withdrawal amounts (735 data points) at an ATM in one of various cities in the UK. For each time series, the competition required forecasting the values of the next 56 days, using the given historical data points, as accurately as possible. The performance of the forecasting methods on one time series was assessed by the symmetric mean absolute percentage error (SMAPE) measure (Crone, b), defined as
$$\mathrm{SMAPE} = \frac{1}{H}\sum_{h=1}^{H}\frac{\lvert y_h - \hat{y}_h\rvert}{(y_h + \hat{y}_h)/2} \times 100 \qquad (20)$$

where $y_h$ is the target output, $\hat{y}_h$ is the prediction and $H = 56$ is the forecast horizon. Since this is a relative error measure, the errors can be averaged over all 111 time series to obtain a mean SMAPE, defined as

$$\mathrm{SMAPE}^{*} = \frac{1}{111}\sum_{i=1}^{111}\mathrm{SMAPE}_i \qquad (21)$$

where $\mathrm{SMAPE}_i$ denotes the SMAPE of the $i$-th time series.
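Both error measures can be sketched directly:

```python
import numpy as np

def smape(y, yhat):
    """Symmetric mean absolute percentage error, in percent (Equation 20 style:
    the denominator is the mean of actual and forecast)."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return 100.0 * np.mean(np.abs(y - yhat) / ((np.abs(y) + np.abs(yhat)) / 2))

def mean_smape(series_pairs):
    """Mean SMAPE over a collection of (actual, forecast) pairs, one per
    time series (Equation 21 style)."""
    return float(np.mean([smape(y, yhat) for y, yhat in series_pairs]))
```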
4.2 Methodology
The aim of the experimental study is to compare the accuracy of the five forecasting strategies in the context of the NN5 competition. Since the accuracy of a forecasting technique is known to depend on several design choices (e.g. deseasonalization or input selection), and since we want to focus our analysis on the multi-step ahead forecasting strategies, we consider a number of different configurations in order to increase the statistical power of our comparison. Every configuration is composed of several preprocessing steps, as sketched in Figure 3. Since some of these steps can be performed in alternative ways (two alternatives for deseasonalization, two for input selection and three for model selection), we come up with 2 × 2 × 3 = 12 configurations. The details of each step are given in what follows.
Step 1: Gaps removal
The specificity of the NN5 series requires a preprocessing step called gap removal, where by gap we mean two types of anomalies: (i) zero values, which indicate that no money withdrawal occurred, and (ii) missing observations, for which no value was recorded. A fraction of the data is corrupted by such gaps. In our experiments we adopted the gap removal method proposed in (Wichard, 2010), which replaces each gap with the median of a set of available neighboring observations.
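A sketch of a median-based gap filler in this spirit; the specific neighbor set used here (available values one and two weeks before and after the gap) is our assumption for illustration and may differ from Wichard's exact choice:

```python
import numpy as np

def fill_gaps(y, period=7):
    """Replace gaps (NaN or zero) by the median of the available observations
    one and two seasonal periods before and after the gap (illustrative
    neighbor set; the original method may differ)."""
    y = np.asarray(y, dtype=float).copy()
    is_gap = np.isnan(y) | (y == 0)
    for t in np.flatnonzero(is_gap):
        neighbors = []
        for off in (-2 * period, -period, period, 2 * period):
            s = t + off
            if 0 <= s < len(y) and not is_gap[s]:
                neighbors.append(y[s])
        if neighbors:
            y[t] = np.median(neighbors)
    return y
```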
Step 2: Deseasonalization
The adoption of deseasonalization may have a large impact on the forecasting strategies, because the NN5 time series exhibit a variety of periodic patterns. For that reason we decided to consider tasks both with and without deseasonalization, in order to better isolate the role of the forecasting strategy. We adopt the deseasonalization methodology discussed in (Andrawis et al., 2011) to remove the strong day-of-the-week seasonality as well as the moderate day-of-the-month seasonality. Of course, after we deseasonalize and apply the forecasting model, we restore the seasonality.
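A minimal multiplicative day-of-week deseasonalization (a simplified stand-in for the methodology of Andrawis et al. (2011), which is more elaborate) could look like:

```python
import numpy as np

def deseasonalize_weekday(y, period=7):
    """Divide each observation by its weekday's seasonal index
    (weekday mean relative to the overall mean)."""
    y = np.asarray(y, dtype=float)
    idx = np.array([y[i::period].mean() for i in range(period)]) / y.mean()
    factors = idx[np.arange(len(y)) % period]
    return y / factors, idx

def reseasonalize(y_deseas, idx, start=0):
    """Restore the seasonal pattern removed above (e.g. on the forecasts),
    `start` being the weekday offset of the first value."""
    period = len(idx)
    factors = idx[(start + np.arange(len(y_deseas))) % period]
    return y_deseas * factors
```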
Step 3: Embedding dimension selection
Every forecasting strategy requires setting the size of the embedding dimension (see Equations 1 to 9). Several approaches have been proposed in the literature to select this value (Kantz and Schreiber, 2004). Since this aspect is not a central theme of our paper, we simply applied the state-of-the-art approach reviewed in (Crone and Kourentzes, 2009), which consists of selecting the time-lagged realizations with a significant partial autocorrelation function (PACF). This method selects the value of the embedding dimension and, at the same time, identifies the relevant variables within the window of past observations. We set the maximum lag of the PACF to a large value in order to provide a sufficiently comprehensive pool of features; note, however, that the final dimensionality of the input vectors, averaged over all time series, is much smaller.
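A simplified sketch of this PACF-based specification (the PACF is estimated here by successive least-squares autoregressive fits, and significance is judged against the approximate ±1.96/√n band):

```python
import numpy as np

def pacf(y, max_lag):
    """Partial autocorrelations: the PACF at lag k is the last coefficient
    of a least-squares AR(k) fit."""
    y = np.asarray(y, dtype=float) - np.mean(y)
    out = []
    for k in range(1, max_lag + 1):
        # Columns are lags 1..k of the target y[k:]
        X = np.column_stack([y[k - j - 1 : len(y) - j - 1] for j in range(k)])
        beta, *_ = np.linalg.lstsq(X, y[k:], rcond=None)
        out.append(beta[-1])
    return np.array(out)

def significant_lags(y, max_lag):
    """Lags whose PACF lies outside the approximate 95% significance band."""
    band = 1.96 / np.sqrt(len(y))
    p = pacf(y, max_lag)
    return [k + 1 for k in range(max_lag) if abs(p[k]) > band]
```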
Step 4: Input Selection
We considered the forecasting task both with and without an input variable selection step. A variable selection procedure requires the setting of two elements: the relevance criterion, i.e. a statistic which estimates the quality of the selected variables, and the search procedure, which describes the policy used to explore the input space. We adopted the Delta test (DT) as relevance criterion. The DT was introduced into the time series forecasting domain by Pi and Peterson (1994) and later successfully applied to several forecasting tasks (Ben Taieb et al., 2009, 2010; Liitiäinen and Lendasse, 2007; Guillén et al., 2008; Mateo et al., 2010). This criterion is based on a noise variance estimator: the set of input variables yielding the strongest and most deterministic dependence between inputs and outputs is selected (Mateo and Lendasse, 2008).
Concerning the search procedure, we adopted a ForwardBackward Search (FBS) procedure which is a combination of forward selection (sequentially adding input variables) and backward search (sequentially removing some input variables). This choice was motivated by the flexibility of the FBS procedure which allows a deeper exploration of the input space. Note that the search is initialized by the set of variables defined in the previous step.
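The Delta test and the forward-backward search can be sketched as follows; this is a greedy first-improvement variant written for illustration, and the actual search policy may differ:

```python
import numpy as np

def delta_test(X, y):
    """Nearest-neighbor noise-variance estimate: half the mean squared
    difference between the output of each point and that of its nearest
    neighbor in the selected input space."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argmin(d, axis=1)
    return 0.5 * np.mean((y - y[nn]) ** 2)

def forward_backward_search(X, y, init=()):
    """Greedily add or remove single input variables as long as the
    Delta test improves."""
    selected = set(init)
    best = delta_test(X[:, sorted(selected)], y) if selected else np.inf
    improved = True
    while improved:
        improved = False
        for j in range(X.shape[1]):
            trial = selected ^ {j}   # add j if absent, remove it if present
            if not trial:
                continue
            score = delta_test(X[:, sorted(trial)], y)
            if score < best:
                best, selected, improved = score, trial, True
    return sorted(selected), best
```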
Step 5: Model Selection
Concerning the model selection procedure, three approaches (see Section 3.5) are taken into consideration in our experiments:
 WINNER

: This approach selects the model that gives the best performance on the test set (winner-take-all approach).
 COMB

: This approach combines all alternative models by simple averaging.
 WCOMB

: This approach combines models by weighted averaging where weights are inversely proportional to the test errors.
4.2.1 The Compared forecasting strategies
Table 4 presents the eight forecasting strategies that we tested, showing also their respective acronyms.
REC  The Recursive forecasting strategy.
DIR  The Direct forecasting strategy.
DIRREC  The DirRec forecasting strategy.
MIMO-LOO  A variant of the MIMO forecasting strategy with the LOO selection criterion.
MIMO-ACF-LIN  A variant of the MIMO forecasting strategy with the autocorrelation selection criterion.
DIRMO-SEL  The DIRMO forecasting strategy, which selects the best value of the parameter s.
DIRMO-AVG  A variant of the DIRMO strategy which calculates a simple average of the forecasts obtained with different values of s.
DIRMO-WAVG  The DIRMO-AVG variant with a weighted average, where weights are inversely proportional to the testing errors.
4.2.2 Forecasting performance evaluation
This section describes the procedure (Figure 4) used to assess and compare the eight forecasting strategies. The accuracy of each forecasting strategy is first measured using the SMAPE* measure calculated over the 111 time series and defined in Equation 21. To test whether there are significant overall differences in performance between the strategies, we have to consider the problem of comparing multiple models on multiple data sets. For such a case, Demšar (2006) (see also García and Herrera, 2009), in a detailed comparative study, recommended a two-stage procedure: first, apply Friedman's or Iman and Davenport's test to check whether the compared models have the same mean rank; if this test rejects the null hypothesis, post-hoc pairwise tests are then performed to compare the different models. These tests adjust the critical values upward to ensure that there is at most a 5% chance that one of the pairwise differences is erroneously found significant.
Friedman test
The Friedman test (Friedman, 1937, 1940) is a nonparametric procedure which tests the significance of differences between multiple ranks. It ranks the algorithms for each dataset separately: rank 1 is given to the best performing algorithm, rank 2 to the second best, and so on; average ranks are assigned in case of ties.
After ranking the algorithms for each dataset, the Friedman test compares the average ranks of the algorithms. Let $r_i^j$ be the rank of the $i$-th of $k$ algorithms on the $j$-th of $N$ data sets; the average rank of the $i$-th algorithm is $R_i = \frac{1}{N}\sum_{j=1}^{N} r_i^j$.
The null hypothesis states that all the algorithms are equivalent, so their mean ranks should be equal. Under the null hypothesis, the Friedman statistic

$$\chi_F^2 = \frac{12N}{k(k+1)}\left[\sum_{i=1}^{k} R_i^2 - \frac{k(k+1)^2}{4}\right] \qquad (22)$$

is distributed according to a chi-squared distribution with $k-1$ degrees of freedom, when $N$ and $k$ are large enough (as a rule of thumb, $N > 10$ and $k > 5$) (Demšar, 2006).
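A direct implementation of this statistic, with ties receiving average ranks:

```python
import numpy as np

def average_ranks(errors):
    """Rank the algorithms (columns) on each data set (rows); rank 1 = best,
    tied values share the average of their ranks. Returns the mean rank R_i
    of each algorithm."""
    errors = np.asarray(errors, dtype=float)
    N, k = errors.shape
    ranks = np.empty_like(errors)
    for j in range(N):
        order = errors[j].argsort()
        r = np.empty(k)
        r[order] = np.arange(1, k + 1, dtype=float)
        for v in np.unique(errors[j]):       # average the ranks of ties
            tied = errors[j] == v
            r[tied] = r[tied].mean()
        ranks[j] = r
    return ranks.mean(axis=0)

def friedman_statistic(errors):
    """Friedman chi-squared statistic (Equation 22)."""
    N, k = np.asarray(errors).shape
    R = average_ranks(errors)
    return 12.0 * N / (k * (k + 1)) * (np.sum(R ** 2) - k * (k + 1) ** 2 / 4.0)
```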
Iman and Davenport (1980), showing that Friedman's statistic is undesirably conservative, derived an improved statistic, given by

$$F_F = \frac{(N-1)\,\chi_F^2}{N(k-1) - \chi_F^2} \qquad (23)$$

which is distributed, under the null hypothesis, according to the F-distribution with $k-1$ and $(k-1)(N-1)$ degrees of freedom.

Post-hoc test
When the null hypothesis is rejected, i.e. there is a significant difference between at least two strategies, a post-hoc test is performed to identify significant pairwise differences among the algorithms. The test statistic for comparing the $i$-th and the $j$-th algorithm is

$$z = \frac{R_i - R_j}{\sqrt{\dfrac{k(k+1)}{6N}}} \qquad (24)$$

which is asymptotically normally distributed under the null hypothesis. The corresponding $p$ value is then computed and compared with a given level of significance $\alpha$. However, since multiple comparisons involve a possibly large number of pairwise tests, there is a relatively high chance that some pairwise tests are incorrectly rejected. Several procedures exist to adjust the value of $\alpha$ to compensate for this bias, for instance those of Nemenyi, Holm, Shaffer, and Bergmann and Hommel (Demšar, 2006). Following the suggestion of García and Herrera (2009), we adopted Shaffer's correction: it has the same complexity as Holm's procedure, but with the advantage of using information about logically related hypotheses.
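The pairwise statistic can be sketched together with Holm's step-down correction (used here in place of Shaffer's refinement, which additionally exploits the logical relations among hypotheses):

```python
import math
from itertools import combinations

def pairwise_pvalues(mean_ranks, N):
    """Two-sided p-values of the pairwise z statistics of Equation 24."""
    k = len(mean_ranks)
    se = math.sqrt(k * (k + 1) / (6.0 * N))
    out = {}
    for i, j in combinations(range(k), 2):
        z = (mean_ranks[i] - mean_ranks[j]) / se
        # P(|Z| > z) = 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
        out[(i, j)] = math.erfc(abs(z) / math.sqrt(2.0))
    return out

def holm_reject(pvalues, alpha=0.05):
    """Holm's step-down procedure: compare sorted p-values with
    alpha/m, alpha/(m-1), ... and stop at the first acceptance."""
    items = sorted(pvalues.items(), key=lambda kv: kv[1])
    m = len(items)
    rejected = set()
    for rank, (pair, p) in enumerate(items):
        if p <= alpha / (m - rank):
            rejected.add(pair)
        else:
            break
    return rejected
```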
4.3 Experimental phases
In order to reproduce the context of the NN5 forecasting competition, the experimental setting consists of two phases: the pre-competition phase and the competition phase.
4.3.1 Pre-competition phase
The pre-competition phase is devoted to the comparison of the different forecasting strategies using the available observations of the 111 time series. The goal is to learn the different parameters, estimate the forecasting performance and compare the different strategies.
To estimate the forecasting performance of each strategy, we used a learning scheme with training, validation and test sets. Each time series (containing 735 observations) is partitioned into three mutually exclusive sets (A, B and C), covering contiguous blocks of days as shown in Figure 5: a training set, a validation set and a test set.
The validation set (B in Figure 5) is used to build and tune the models. Specifically, since we use a Lazy Learning approach, we need to select, for each model, the range of the number of neighbors to use in performing the forecasting task.
The test set (C in Figure 5) is used to measure the performance of each forecasting strategy. To make the utmost use of the available data, we adopt a multiple time origin test, as suggested by Tashman (2000), where the time origin denotes the point from which the multi-step ahead forecasts are generated.
Three time origins are used, each defining a forecast interval that extends from the corresponding starting day to the end of the test set.
In other words, we perform the forecast three times from three different starting points, each time forecasting the steps ahead up to the end of the interval. Note that we used the same test period and evaluation criterion (i.e. the SMAPE) as Andrawis et al. (2011); this allows us to compare our results with the several other machine learning models tested in that article.
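The multiple-time-origin evaluation can be sketched as follows; `forecaster` stands for any of the strategies, taking a history and a horizon:

```python
import numpy as np

def multi_origin_smape(series, origins, end, forecaster):
    """Evaluate `forecaster(history, horizon)` from several time origins;
    each forecast extends from its origin to the common end point, and the
    SMAPE values are averaged over the origins."""
    scores = []
    for t0 in origins:
        horizon = end - t0
        forecast = forecaster(series[:t0], horizon)
        actual = series[t0:end]
        scores.append(100.0 * np.mean(
            np.abs(actual - forecast) / ((np.abs(actual) + np.abs(forecast)) / 2)))
    return float(np.mean(scores))
```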
4.3.2 Competition phase
In the competition phase we generate the final forecasts, made up of the 56 future observations, which would have been submitted to the competition. This phase takes advantage of the lessons learned and the design choices made in the pre-competition phase. Here, we combine the training set with the test set (A+B in Figure 6) to retrain the models of the different strategies and then generate the final forecast. The training set (A+B in Figure 6) is now made up of the first 679 data points and the validation set (C in Figure 6) of the next 56 data points, as shown in Figure 6. In other words, all 735 observed values are used to build and tune the models, which then return the 56 forecasted values.
5 Results and discussion
This section presents and discusses the prediction results of the forecasting strategies for the pre-competition and competition phases. For each phase, we report the results obtained with the different configurations introduced in Section 4.2.
The forecasting performance of the strategies is measured using the criteria discussed in Section 4.2.2 and is presented by means of two tables. The first provides the average SMAPE as well as the ranking of each forecasting strategy. Since the null hypothesis stating that all the algorithms are equivalent was rejected (using the Iman-Davenport statistic) for all the configurations, we proceeded with the post-hoc test. The second table presents the results of this post-hoc test, which partitions the set of strategies into several groups that are statistically significantly different in terms of forecasting performance.
Note that the configurations which require input selection do not contain the DIRMO results, since combining the selection of the inputs with the selection of the parameter s would have required an excessive amount of computation time.
5.1 Pre-competition results
The SMAPE and ranking results for the pre-competition phase are presented in Table 5, while the results of the post-hoc test are summarized in Table 6.








The availability of the SMAPE* results, obtained according to the procedure used in (Andrawis et al., 2011), makes it possible to compare our pre-competition results with those of several other learning methods reported in (Andrawis et al., 2011). For the sake of comparison, Table 7 reports the forecasting errors of some of the techniques considered in that study, notably Gaussian Process Regression (GPR), Neural Networks (NN), Multiple Regression (MULTREGR), Simple Moving Average (MOVAVG), Holt's Exponential Smoothing and a combination (Combined) of such techniques.
Model  SMAPE* 

GPRITER  19.90 
GPRDIR  21.22 
GPRLEV  20.19 
NNITER  21.11 
NNLEV  19.83 
MULTREGR1  19.11 
MULTREGR2  18.96 
MULTREGR3  18.94 
MOVAVG  19.55 
Holt’s Exp Sm  23.77 
Combined  18.95 
The comparison shows that the best configuration of Table 5, that is the MIMO-ACF-LIN strategy, is competitive with all these models in terms of SMAPE*.
5.2 Competition results
The SMAPE and ranking results for the competition phase are presented in Table 8, while the results of the post-hoc test are summarized in Table 9.








The pre-competition results presented in the previous section suggest using the MIMO-ACF-LIN strategy with the COMB model selection approach, combined with deseasonalization and the input selection procedure, since this configuration obtains the smallest forecasting error.
Using the MIMO-ACF-LIN strategy with this configuration in the competition phase, we would generate forecasts with a SMAPE* that compares well with the best computational intelligence entries of the competition, as shown in Table 10. Figure 7 shows the forecasts of the MIMO-ACF-LIN strategy against the actual values for four NN5 time series, illustrating the forecasting capability of this strategy.
Model  SMAPE* 

MIMO-ACF-LIN  
Andrawis  20.4 
Vogel  20.5 
D’yakonov  20.6 
Rauch  21.7 
Luna  21.8 
Wichard  22.1 
Gao  22.3 
PumaVillanueva  23.7 
Dang  23.77 
Pasero  25.3 
5.3 Discussion
From the presented results, one can make the following observations. These findings refer mainly to the pre-competition results, but they mostly also apply to the competition-phase results.

The overall best method is MIMO-ACF-LIN, used with input selection, deseasonalization and equally weighted combination (COMB).

The Multiple-Output strategies (MIMO and DIRMO) are invariably the best strategies. They beat the Single-Output strategies, namely DIR, REC and DIRREC. MIMO and DIRMO give comparable performance. For DIRMO, the selection of the parameter s is critical, since it has a great impact on performance. With an improved selection approach for s, this strategy would have great potential.

Both versions of MIMO are comparable. The versions of DIRMO also give close results, with DIRMO-WAVG perhaps a little better than the other two.

Among the Single-Output strategies, the REC strategy almost always has a smaller SMAPE and a better ranking than the DIR strategy. DIRREC is the worst strategy overall, and gives especially low accuracy when no deseasonalization is performed.

Deseasonalization leads to consistently better results (in 38 out of 39 models). This result is consistent with other studies, such as (Zhang and Qi, 2005). A possible reason is that, without deseasonalization, a higher burden is put on the model, which has to forecast the future seasonal pattern in addition to the trend and the other aspects of the series, and this is apparently hard to achieve simultaneously.

Input selection is especially beneficial when deseasonalization is performed. Without deseasonalization, the results are mixed (as to whether input selection improves the results or not). A possible explanation is that, when no deseasonalization is performed, the model needs the whole previous cycle to construct the future seasonal pattern, and performing an input selection deprives it of essential information.

Concerning the model selection aspect, both combination approaches (COMB and WCOMB) are superior to the winner-take-all approach (WINNER). COMB and WCOMB are comparable, and their results do not differ by much. This is consistent with many findings in the forecast combination literature, e.g. (Andrawis et al., 2011; Clemen, 1989; Timmermann, 2006; Andrawis et al., 2010).

The relative performance and ranking of the different strategies are persistent. Most findings based on the pre-competition results hold for the competition-phase results as well. This is also true for the findings concerning deseasonalization, input selection and model selection. This persistence is reassuring, as it gives us some confidence in relying on the test or validation results for selecting the best strategies.

The best strategy based on the pre-competition data, the MIMO-ACF-LIN method, would have topped all computational intelligence entries of the NN5 competition on the true competition holdout data.
6 Conclusion
Forecasting a time series many steps into the future is a very hard problem, because the larger the forecast horizon, the higher the uncertainty. In this paper we presented a comparative review of existing strategies for multi-step ahead forecasting, together with an extensive comparison on the 111 time series of the NN5 forecasting competition. The comparison yielded some interesting lessons that could help researchers channel their experiments into the most promising approaches. The most consistent finding is that Multiple-Output approaches are invariably better than Single-Output approaches. Deseasonalization also had a very considerable positive impact on performance. Finally, the results are clearly quite persistent, so selecting the best strategy based on testing performance is a potent approach. A possible direction for future research could therefore be the development of new, improved Multiple-Output strategies. Tailoring deseasonalization methods specifically for Multiple-Output strategies could also be a promising research point.
Acknowledgments
We would like to thank the authors of (García and Herrera, 2009) for making their methods available at http://sci2s.ugr.es/keel/multipleTest.zip.
References
 Aha [1997] David W. Aha, editor. Lazy learning. Kluwer Academic Publishers, Norwell, MA, USA, 1997. ISBN 0792345843.
 Ahmed et al. [2010] Nesreen K. Ahmed, Amir F. Atiya, Neamat El Gayar, and Hisham ElShishiny. An empirical comparison of machine learning models for time series forecasting. Econometric Reviews (to appear), 29(56), 2010.
 Allen [1974] David M. Allen. The relationship between variable selection and data agumentation and a method for prediction. Technometrics, 16(1):pp. 125–127, 1974. ISSN 00401706. URL http://www.jstor.org/stable/1267500.
 Alpaydin [2010] Ethem Alpaydin. Introduction to Machine Learning, Second Edition. Adaptive Computation and Machine Learning. The MIT Press, February 2010. ISBN 9780262012430. URL http://www.amazon.com/exec/obidos/redirect?tag=citeulike0720&path=ASIN/0262012111.
 Anders and Korn [1999] Ulrich Anders and Olaf Korn. Model selection in neural networks. Neural Netw., 12(2):309–323, 1999. ISSN 08936080. doi: http://dx.doi.org/10.1016/S08936080(98)001178.
 Andrawis et al. [2010] Robert R. Andrawis, Amir F. Atiya, and Hisham ElShishiny. Combination of long term and short term forecasts, with application to tourism demand forecasting. International Journal of Forecasting, In Press, Corrected Proof:–, 2010. ISSN 01692070. doi: DOI:10.1016/j.ijforecast.2010.05.019. URL http://www.sciencedirect.com/science/article/B6V92511BPB71/2/036adbc201cbc86a7156a65d2317bd51.
 Andrawis et al. [2011] Robert R. Andrawis, Amir F. Atiya, and Hisham ElShishiny. Forecast combinations of computational intelligence and linear models for the nn5 time series forecasting competition. International Journal of Forecasting, In Press, Corrected Proof:–, 2011. ISSN 01692070. doi: DOI:10.1016/j.ijforecast.2010.09.005. URL http://www.sciencedirect.com/science/article/B6V9251WV6JD2/2/110d69a3e7fdea1d853ee3152755f99a.
 Atiya et al. [1999] Amir Atiya, Suzan M. Elshoura, Samir I. Shaheen, and Mohamed S. Elsherif. A comparison between neuralnetwork forecasting techniques  case study: River flow forecasting. IEEE Transactions on Neural Networks, 10:402–409, 1999.
 Atkeson et al. [1997a] C. G. Atkeson, A. W. Moore, and S. Schaal. Locally weighted learning. Artificial Intelligence Review, 11(1–5):11–73, 1997a.
 Atkeson et al. [1997b] Christopher G Atkeson, Andrew W Moore, and Stefan Schaal. Locally weighted learning. Artificial Intelligence Review, 11(1):11–73, 1997b. URL http://www.springerlink.com/index/G8280541763Q0223.pdf.
 Bates, J. M. and Granger, C. W. J. [1969] Bates, J. M. and Granger, C. W. J. The combination of forecasts. OR, 20(4):451–468, 1969. ISSN 14732858. URL http://www.jstor.org/stable/3008764.
 Ben Taieb et al. [2009] Souhaib Ben Taieb, Gianluca Bontempi, Antti Sorjamaa, and Amaury Lendasse. Longterm prediction of time series by combining direct and mimo strategies. International Joint Conference on Neural Networks, 2009. URL http://eprints.pascalnetwork.org/archive/00004925/.
 Ben Taieb et al. [2010] Souhaib Ben Taieb, Antti Sorjamaa, and Gianluca Bontempi. Multipleoutput modeling for multistepahead time series forecasting. Neurocomputing, 73(1012):1950 – 1957, 2010. ISSN 09252312. doi: DOI:10.1016/j.neucom.2009.11.030. URL http://www.sciencedirect.com/science/article/B6V104YJ6GCW4/2/8429b80db7773717c9d455b485fb7c4d. Subspace Learning / Selected papers from the European Symposium on Time Series Prediction.
 Birattari and Bersini [1997] B. Birattari and M. Bersini. Lazy learning for local modeling and control design, 1997. URL http://citeseer.ist.psu.edu/bontempi97lazy.html.
 Birattari et al. [1999] M. Birattari, G. Bontempi, and H. Bersini. Lazy learning meets the recursive leastsquares algorithm. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, NIPS 11, pages 375–381, Cambridge, 1999. MIT Press.
 Bontempi [1999] G. Bontempi. Local Learning Techniques for Modeling, Prediction and Control. Ph.d., IRIDIAUniversité Libre de Bruxelles, BELGIUM, 1999.
 Bontempi [2008] G. Bontempi. Long term time series prediction with multiinput multioutput local learning. In Proceedings of the 2nd European Symposium on Time Series Prediction (TSP), ESTSP08, pages 145–154, Helsinki, Finland, February 2008.
 Bontempi et al. [1998] G. Bontempi, M. Birattari, and H. Bersini. Lazy learning for iterated time series prediction. In J. A. K. Suykens and J. Vandewalle, editors, Proceedings of the International Workshop on Advanced BlackBox Techniques for Nonlinear Modeling, pages 62–68. Katholieke Universiteit Leuven, Belgium, 1998.
 Bontempi et al. [1999] G. Bontempi, M. Birattari, and H. Bersini. Local learning for iterated timeseries prediction. In I. Bratko and S. Dzeroski, editors, Machine Learning: Proceedings of the Sixteenth International Conference, pages 32–38, San Francisco, CA, 1999. Morgan Kaufmann Publishers.
 Bontempi and Ben Taieb [2011] Gianluca Bontempi and Souhaib Ben Taieb. Conditionally dependent strategies for multiplestepahead prediction in local learning. International Journal of Forecasting, In Press, Corrected Proof:–, 2011. ISSN 01692070. doi: DOI:10.1016/j.ijforecast.2010.09.004. URL http://www.sciencedirect.com/science/article/B6V9251WV6JD1/2/8433907c4154533c05c47e8b56b79523.
 Breiman [1996] L Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996. ISSN 08856125. doi: 10.1007/BF00058655. URL http://www.springerlink.com/index/10.1007/BF00058655.
 Breiman and Friedman [1997] L. Breiman and J. H. Friedman. Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society, Series B, 59(1):3–54, 1997.
 Breiman et al. [1984] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth International Group, Belmont, CA, 1984.
 Casdagli et al. [1991] M. Casdagli, S. Eubank, J. D. Farmer, and J. Gibson. State space reconstruction in the presence of noise. Physica D, 51:52–98, 1991.
 Chapelle and Vapnik [2000] O Chapelle and V Vapnik. Model selection for support vector machines. In Advances in Neural Information Processing Systems 12. MIT Press, 2000.
 Cheng et al. [2006] Haibin Cheng, PangNing Tan, Jing Gao, and Jerry Scripps. Multistepahead time series prediction. In Wee Keong Ng, Masaru Kitsuregawa, Jianzhong Li, and Kuiyu Chang, editors, PAKDD, volume 3918 of Lecture Notes in Computer Science, pages 765–774. Springer, 2006. ISBN 3540332065.
 Clemen [1989] Robert T. Clemen. Combining forecasts: A review and annotated bibliography. International Journal of Forecasting, 5(4):559–583, 1989. ISSN 01692070. doi: http://dx.doi.org/10.1016/01692070(89)900125.
 Clements et al. [2004] Michael P. Clements, Philip Hans Franses, and Norman R. Swanson. Forecasting economic and financial timeseries with nonlinear models. International Journal of Forecasting, 20(2):169–183, 2004. ISSN 01692070. doi: http://dx.doi.org/10.1016/j.ijforecast.2003.10.004.
 Cleveland et al. [1988] William S. Cleveland, Susan J. Devlin, and Eric Grosse. Regression by local fitting : Methods, properties, and computational algorithms. Journal of Econometrics, 37(1):87 – 114, 1988. ISSN 03044076. doi: DOI:10.1016/03044076(88)900772. URL http://www.sciencedirect.com/science/article/B6VC04582FT01K/2/731cd0c23ef342f8d074fdd4e9c41325.
 Crone [a] Sven Crone. NN3 Forecasting Competition. http://www.neuralforecastingcompetition.com/NN3/index.htm, a. Last update 26/05/2009. Visited on 05/07/2010.
 Crone [b] Sven Crone. NN5 Forecasting Competition. http://www.neuralforecastingcompetition.com/NN5/index.htm, b. Last update 27/05/2009. Visited on 05/07/2010.
 Crone [2009] Sven F. Crone. Mining the past to determine the future: Comments. International Journal of Forecasting, 25(3):456 – 460, 2009. ISSN 01692070. doi: DOI:10.1016/j.ijforecast.2009.05.022. URL http://www.sciencedirect.com/science/article/B6V924WN8H501/2/44b2ded1c6387e7f124db526162db270. Special Section: Time Series Monitoring.
 Crone and Kourentzes [2009] Sven F. Crone and Nikolaos Kourentzes. Inputvariable specification for neural networks: an analysis of forecasting low and high time series frequency. In Proceedings of the 2009 international joint conference on Neural Networks, IJCNN’09, pages 3221–3228, Piscataway, NJ, USA, 2009. IEEE Press. ISBN 9781424435494. URL http://portal.acm.org/citation.cfm?id=1704555.1704739.
 Curry and Morgan [2006] B. Curry and P.H. Morgan. Model selection in neural networks: Some difficulties. European Journal of Operational Research, 170(2):567–577, April 2006. URL http://ideas.repec.org/a/eee/ejores/v170y2006i2p567577.html.
 Demšar [2006] Janez Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30, 2006. ISSN 1532-4435.
 Engle [1982] Robert F. Engle. Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica, 50(4):987–1007, 1982. ISSN 0012-9682. URL http://www.jstor.org/stable/1912773.
 Fan and Gijbels [1995] J. Fan and I. Gijbels. Adaptive order polynomial fitting: bandwidth robustification and bias reduction. J. Comp. Graph. Statist., 4:213–227, 1995.
 Friedman [1937] Milton Friedman. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32(200):675–701, 1937. ISSN 0162-1459. URL http://www.jstor.org/stable/2279372.
 Friedman [1940] Milton Friedman. A comparison of alternative tests of significance for the problem of m rankings. The Annals of Mathematical Statistics, 11(1):86–92, 1940. ISSN 0003-4851. URL http://www.jstor.org/stable/2235971.
 García and Herrera [2009] Salvador García and Francisco Herrera. An extension on "Statistical comparisons of classifiers over multiple data sets" for all pairwise comparisons. Journal of Machine Learning Research, 9:2677–2694, 2009. URL http://www.jmlr.org/papers/volume9/garcia08a/garcia08a.pdf.
 Gooijer and Hyndman [2006] Jan G. De Gooijer and Rob J. Hyndman. 25 years of time series forecasting. International Journal of Forecasting, 22(3):443–473, 2006. ISSN 0169-2070. doi: 10.1016/j.ijforecast.2006.01.001.
 Gooijer and Kumar [1992] Jan G. De Gooijer and Kuldeep Kumar. Some recent developments in nonlinear time series modelling, testing, and forecasting. International Journal of Forecasting, 8(2):135–156, 1992. ISSN 0169-2070. doi: 10.1016/0169-2070(92)90115-P. URL http://www.sciencedirect.com/science/article/B6V92469244K1/2/cb5cbac7df80324a85e47c96f4a1e290.
 Guillén et al. [2008] A. Guillén, D. Sovilj, F. Mateo, I. Rojas, and A. Lendasse. New methodologies based on delta test for variable selection in regression problems. In Workshop on Parallel Architectures and Bioinspired Algorithms, Toronto, Canada, October 25–29, 2008.
 Hamzaçebi et al. [2009] Coskun Hamzaçebi, Diyar Akay, and Fevzi Kutay. Comparison of direct and iterative artificial neural network forecast approaches in multi-periodic time series forecasting. Expert Systems with Applications, 36(2, Part 2):3839–3844, 2009. ISSN 0957-4174. doi: 10.1016/j.eswa.2008.02.042. URL http://www.sciencedirect.com/science/article/B6V034S03RD56/2/8520e0674be6409b24eb1aca953bdb09.
 Hand [2008] D. Hand. Mining the past to determine the future: Problems and possibilities. International Journal of Forecasting, October 2008. ISSN 0169-2070. doi: 10.1016/j.ijforecast.2008.09.004. URL http://dx.doi.org/10.1016/j.ijforecast.2008.09.004.
 Hastie et al. [2009] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The elements of statistical learning: data mining, inference and prediction. Springer, 2 edition, 2009. URL http://www-stat.stanford.edu/~tibs/ElemStatLearn/.
 Hylleberg [1992] Svend Hylleberg. Modelling seasonality. Oxford University Press, Oxford, UK, 2 edition, 1992.
 Iman and Davenport [1980] R L Iman and J M Davenport. Approximations of the critical region of the friedman statistic. Communications In Statistics, pages 571–595, 1980.
 Jacobs et al. [1991] R. A. Jacobs, Michael I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991. ISSN 0899-7667. doi: 10.1162/neco.1991.3.1.79. URL http://www.mitpressjournals.org/doi/abs/10.1162/neco.1991.3.1.79.
 Jordan and Jacobs [1994] M. I. Jordan and R. A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6:181–214, 1994.
 Kantz and Schreiber [2004] Holger Kantz and Thomas Schreiber. Nonlinear time series analysis. Cambridge University Press, New York, NY, USA, 2004.
 Kline [2004] D. M. Kline. Methods for multistep time series forecasting with neural networks. In G. Peter Zhang, editor, Neural Networks in Business Forecasting, pages 226–250. Information Science Publishing, 2004.
 Lapedes and Farber [1987] A. Lapedes and R. Farber. Nonlinear signal processing using neural networks: prediction and system modelling. Technical Report LA-UR-87-2662, Los Alamos National Laboratory, Los Alamos, NM, 1987.
 Lendasse [2007] Amaury Lendasse, editor. ESTSP 2007: Proceedings, 2007. ISBN 9789512286010.
 Lendasse [2008] Amaury Lendasse, editor. ESTSP 2008: Proceedings, 2008. Multiprint Oy / Otamedia. ISBN: 9789512295449.
 Liitiäinen and Lendasse [2007] E. Liitiäinen and A. Lendasse. Variable scaling for time series prediction: Application to the ESTSP'07 and the NN3 forecasting competitions. In IJCNN 2007, International Joint Conference on Neural Networks, Orlando, Florida, USA, pages 2812–2816. Documation LLC, Eau Claire, Wisconsin, USA, August 2007. doi: 10.1109/IJCNN.2007.4371405.
 Makridakis et al. [1998] Spyros Makridakis, Steven C. Wheelwright, and Rob J. Hyndman. Forecasting: Methods and Applications. John Wiley & Sons, 1998.
 Maron and Moore [1997] Oded Maron and Andrew W. Moore. The racing algorithm: Model selection for lazy learners. Artificial Intelligence Review, 11(1):193–225, February 1997. doi: 10.1023/A:1006556606079. URL http://dx.doi.org/10.1023/A:1006556606079.
 Mateo and Sovilj [2010] F. Mateo and D. Sovilj. Approximate k-NN delta test minimization method using genetic algorithms: Application to time series. Neurocomputing, 73(10–12):2017–2029, June 2010. doi: 10.1016/j.neucom.2009.11.032.
 Mateo and Lendasse [2008] F. Mateo and A. Lendasse. A variable selection approach based on the delta test for extreme learning machine models. In M. Verleysen, editor, Proceedings of the European Symposium on Time Series Prediction, pages 57–66. d-side publ. (Evere, Belgium), September 2008.
 Matías [2005] José M. Matías. Multioutput nonparametric regression. In Carlos Bento, Amílcar Cardoso, and Gaël Dias, editors, EPIA, volume 3808 of Lecture Notes in Computer Science, pages 288–292. Springer, 2005. ISBN 3540307370.
 McNames [1998] J. McNames. A nearest trajectory strategy for time series prediction. In Proceedings of the International Workshop on Advanced Black-Box Techniques for Nonlinear Modeling, pages 112–128, Belgium, 1998. K.U. Leuven.
 Micchelli and Pontil [2005] Charles A. Micchelli and Massimiliano A. Pontil. On learning vector-valued functions. Neural Comput., 17(1):177–204, 2005. ISSN 0899-7667. doi: 10.1162/0899766052530802.
 Mitchell [1997] Tom M. Mitchell. Machine Learning. McGraw-Hill, New York, 1997.
 Montgomery et al. [2006] Douglas C. Montgomery, Elizabeth A. Peck, and Geoffrey G. Vining. Introduction to Linear Regression Analysis. Wiley & Sons, Hoboken, 4th edition, July 2006. ISBN 0471754951.
 Moody and Darken [1989] J. Moody and C. J. Darken. Fast learning in networks of locally-tuned processing units. Neural Computation, 1(2):281–294, 1989.
 Murray-Smith [1994] R. Murray-Smith. A local model network approach to nonlinear modelling. PhD thesis, Department of Computer Science, University of Strathclyde, Strathclyde, UK, 1994.
 Nelson et al. [1999] M. Nelson, T. Hill, W. Remus, and M. O'Connor. Time series forecasting using neural networks: should the data be deseasonalized first? Journal of Forecasting, 18(5):359–367, 1999.
 Palit and Popovic [2005] Ajoy K. Palit and Dobrivoje Popovic. Computational Intelligence in Time Series Forecasting: Theory and Engineering Applications (Advances in Industrial Control). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2005. ISBN 1852339489.
 Pi and Peterson [1994] Hong Pi and Carsten Peterson. Finding the embedding dimension and variable dependencies in time series. Neural Comput., 6:509–520, May 1994. ISSN 0899-7667. doi: 10.1162/neco.1994.6.3.509. URL http://portal.acm.org/citation.cfm?id=1362347.1362357.
 Poggio and Girosi [1990] T. Poggio and F. Girosi. Regularization algorithms for learning that are equivalent to multilayer networks. Science, 247:978–982, 1990.
 Poskitt and Tremayne [1986] D. S. Poskitt and A. R. Tremayne. The selection and use of linear and bilinear time series models. International Journal of Forecasting, 2(1):101–114, 1986. ISSN 0169-2070. doi: 10.1016/0169-2070(86)90033-6.
 Price [2009] Simon Price. Mining the past to determine the future: Comments. International Journal of Forecasting, 25(3):452–455, July 2009. URL http://ideas.repec.org/a/eee/intfor/v25y2009i3p452455.html.
 Raudys and Zliobaite [2006] Sarunas Raudys and Indre Zliobaite. The Multi-Agent System for Prediction of Financial Time Series. Artificial Intelligence and Soft Computing – ICAISC 2006, pages 653–662, 2006.
 Rumelhart et al. [1986] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.
 Ruppert et al. [1995] D. Ruppert, S. J. Sheather, and M. P. Wand. An effective bandwidth selector for local least squares regression. Journal of the American Statistical Association, 90(432):1257–1270, 1995. ISSN 0162-1459. URL http://www.jstor.org/stable/2291516.
 Saad et al. [1998] E. Saad, D. Prokhorov, and D. Wunsch. Comparative study of stock trend prediction using time delay, recurrent and probabilistic neural networks. IEEE Transactions on Neural Networks, 9(6):1456–1470, 1998. doi: 10.1109/72.728395. URL http://dx.doi.org/10.1109/72.728395.
 Sauer [1994] T. Sauer. Time series prediction by using delay coordinate embedding. In A. S. Weigend and N. A. Gershenfeld, editors, Time Series Prediction: forecasting the future and understanding the past, pages 175–193. Addison Wesley, Harlow, UK, 1994.
 Schapire et al. [1998] Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. Annals of Statistics, 26(5):1651–1686, 1998. ISSN 0090-5364. doi: 10.1214/aos/1024691352. URL http://projecteuclid.org:80/Dienst/getRecord?id=euclid.aos/1024691352/.
 Seber and Wild [1989] G. A. F. Seber and C. J. Wild. Nonlinear regression. Wiley, New York, 1989.
 Sorjamaa and Lendasse [2006] A. Sorjamaa and A. Lendasse. Time series prediction using DirRec strategy. In M. Verleysen, editor, ESANN'06, European Symposium on Artificial Neural Networks, pages 143–148, Bruges, Belgium, April 26–28, 2006.
 Sorjamaa et al. [2007] A. Sorjamaa, J. Hao, N. Reyhani, Y. Ji, and A. Lendasse. Methodology for long-term prediction of time series. Neurocomputing, 70(16–18):2861–2869, October 2007. doi: 10.1016/j.neucom.2006.06.015.
 Takagi and Sugeno [1985] T. Takagi and M. Sugeno. Fuzzy identification of systems and its applications to modeling and control. IEEE Transactions on Systems, Man, and Cybernetics, 15(1):116–132, 1985.
 Tashman [2000] Leonard J. Tashman. Out-of-sample tests of forecasting accuracy: an analysis and review. International Journal of Forecasting, 16(4):437–450, 2000. ISSN 0169-2070. doi: 10.1016/S0169-2070(00)00065-0.
 Tiao and Tsay [1994] George C. Tiao and Ruey S. Tsay. Some advances in nonlinear and adaptive modelling in time series. Journal of Forecasting, 13(2):109–131, 1994. doi: 10.1002/for.3980130206.
 Timmermann [2006] A. Timmermann. Forecast combinations. In G. Elliott, C. Granger, and A. Timmermann, editors, Handbook of Economic Forecasting, pages 135–196. Elsevier Pub., 2006.
 Tong [1983] H. Tong. Threshold models in Nonlinear Time Series Analysis. Springer Verlag, Berlin, 1983.
 Tong [1990] H. Tong. Nonlinear Time Series: A Dynamical System Approach. Oxford University Press, 1990.
 Tong and Lim [1980] H. Tong and K. S. Lim. Threshold autoregression, limit cycles and cyclical data. Journal of the Royal Statistical Society. Series B (Methodological), 42(3):245–292, 1980. ISSN 0035-9246. URL http://www.jstor.org/stable/2985164.
 Tran et al. [2009] Van Tung Tran, Bo-Suk Yang, and Andy Chit Chiow Tan. Multi-step ahead direct prediction for the machine condition prognosis using regression trees and neuro-fuzzy systems. Expert Syst. Appl., 36(5):9378–9387, 2009. ISSN 0957-4174. doi: 10.1016/j.eswa.2009.01.007.
 Weigend and Gershenfeld [1994] A. S. Weigend and N. A. Gershenfeld, editors. Time series prediction: Forecasting the future and understanding the past, 1994. URL http://adsabs.harvard.edu/cgi-bin/nph-bib_query?bibcode=1994tspf.conf.....W.
 Weigend et al. [1992] A. S. Weigend, B. A. Huberman, and D. E. Rumelhart. Predicting sunspots and exchange rates with connectionist networks. In M. Casdagli and S. Eubank, editors, Nonlinear modeling and forecasting, pages 395–432. Addison-Wesley, 1992.
 Werbos [1974] P. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University, Cambridge, MA, 1974.
 Werbos [1988] Paul J. Werbos. Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1(4):339–356, 1988. ISSN 0893-6080. doi: 10.1016/0893-6080(88)90007-X. URL http://www.sciencedirect.com/science/article/B6T08485RHDS7/2/037e956cda49bd2d2c66085cfccd7de4.
 Wichard [2010] Jörg D. Wichard. Forecasting the NN5 time series with hybrid models. International Journal of Forecasting, in press, 2010. ISSN 0169-2070. doi: 10.1016/j.ijforecast.2010.02.011. URL http://www.sciencedirect.com/science/article/B6V92504CN8T1/2/b65a180735d577e35146238251fed97e.
 Zhang and Qi [2005] G. Peter Zhang and Min Qi. Neural network forecasting for seasonal and trend time series. European Journal of Operational Research, 160(2):501–514, 2005. ISSN 0377-2217. doi: 10.1016/j.ejor.2003.08.037. URL http://www.sciencedirect.com/science/article/B6VCT4B1SMWY9/2/24d67e60c11bd47d4d6d6eeac708caeb. Decision Support Systems in the Internet Age.
 Zhang et al. [1998] Guoqiang Zhang, B. Eddy Patuwo, and Michael Y. Hu. Forecasting with artificial neural networks: The state of the art. International Journal of Forecasting, 14(1):35–62, 1998. ISSN 0169-2070. doi: 10.1016/S0169-2070(97)00044-7.
 Zhang and Hutchinson [1994] X. Zhang and J. Hutchinson. Simple architectures on fast machines: practical issues in nonlinear time series prediction. In A. S. Weigend and N. A. Gershenfeld, editors, Time Series Prediction: forecasting the future and understanding the past, pages 219–241. Addison Wesley, 1994.