The problem of long-term forecasting of time series is of high practical importance. For example, nowadays nearly everybody uses long-term weather forecasts Richardson (2007); Lorenc (1986) (24-hour, 7-day, etc.) provided by local weather forecasting platforms. Road traffic and jam forecasts De Wit et al. (2015); Herrera et al. (2010); Myr (2002) are actively used in many modern navigation systems. Forecasts of energy consumption and costs Gaillard and Goude (2015), web traffic Oliveira et al. (2016) and stock prices Ding et al. (2015); Pai and Lin (2005) are also widely used in practice.
Many classical (e.g. ARIMA Box et al. (2015)) and modern (e.g. Facebook Prophet (https://github.com/facebook/prophet) Taylor and Letham (2018)) time series forecasting approaches produce a model that is capable of predicting arbitrarily many steps ahead. The advantage of such models is that when building the final forecast at each step $t$ for the interval $[t+1, t+D]$ ahead, one may also use the forecasts made earlier at the steps $t-1, t-2, \dots$. Forecasts made at earlier steps use less of the observed data. Nevertheless, they can be more robust to noise, outliers and the novelty of the time interval $[t+1, t+D]$. Thus, the usage of such outdated forecasts may prove useful, especially if the time series is stationary.
In general, we consider the game-theoretic online learning model in which a master (aggregating) algorithm has to combine predictions from a set of experts. The problem setting we investigate can be considered as part of the Decision-Theoretic Online Learning (DTOL) or Prediction with Expert Advice (PEA) framework (see e.g. Littlestone and Warmuth (1994); Freund and Schapire (1997); Vovk (1990, 1998); Cesa-Bianchi and Lugosi (2006); Korotin et al. (2018) among others). In this framework the learner is usually called the aggregating algorithm. The aggregating algorithm combines the predictions from a set of experts in the online mode during time steps $t = 1, 2, \dots, T$.
In practice, the square loss function is widely used for time series prediction. The square loss function is mixable Vovk (1998). For mixable loss functions, Vovk's aggregating algorithm (AA) Vovk (1998, 2001) is the most appropriate, since it has the best theoretical performance among all known algorithms. We use the aggregating algorithm as the base and modify it for long-term forecasting.
The long-term forecasting considered in this paper is a case of forecasting with delayed feedback. As far as we know, the problem of forecasting with delayed feedback was first considered by Weinberger and Ordentlich (2002).
In this paper we consider two scenarios of long-term forecasting. In the first one, at each step $t$ the learner has to combine the point forecasts of the experts issued for the time interval $[t+1, t+D]$ ahead. In the second scenario, at each step the experts issue prediction functions, and the learner has to combine these functions into a single one that will be used for long-term time series prediction.
The first theoretical problem we investigate in this paper is the effective usage of outdated forecasts. Formally, the learner is given $N$ basic forecasting models. At every step $t$, each model produces an infinite sequence of forecasts for the steps $t+1, t+2, \dots$ ahead. The goal of the learner at each step $t$ is to combine the current models' forecasts and the forecasts made earlier into one aggregated long-term forecast for the time interval $[t+1, t+D]$ ahead. We develop an algorithm to efficiently combine these forecasts.
Our main idea is to replicate any expert $i$ by an infinite sequence of auxiliary experts $(i, s)$, where $s = 1, 2, \dots$. Each auxiliary expert $(i, s)$ issues at time moment $s$ an infinite sequence of forecasts for the time moments $s+1, s+2, \dots$. Only a finite number of the auxiliary experts are available at any time moment. The setting presented in this paper is valid also in the case where only one expert ($N = 1$) is given. At any time moment $t$ the AA uses the predictions of each auxiliary expert $(i, s)$ with $s \le t$ for the time interval $[t+1, t+D]$ (made by expert $i$ at time $s$). In our case, the performance of the AA at step $t$ is measured by the regret, which is the difference between the average loss of the aggregating algorithm suffered on the time interval $[t+1, t+D]$ and the average loss of the best auxiliary expert suffered on the same time interval. Note that a recent related work is Adamskiy et al. (2017), where an algorithm with a tight upper bound for predicting vector-valued outcomes was presented.
In the second part of our paper we consider the online supervised learning scenario. The data is represented by pairs $(x_t, y_t)$ of predictor-response variables. Instead of point or interval predictions, the experts and the learner present predictions in the form of functions of the signals $x_t$. Signals appear gradually over time and allow one to calculate forecasts as the values of these functions. For this problem we present a method for smoothing regression using expert advice.
The article is structured as follows. In Section 2 we give some preliminary notions. In Section 3 we present the algorithm for combining the long-term forecasts of the experts. Theorem 1 presents a performance bound for the regret of the corresponding algorithm. In Section 4 we present the method of online smoothing regression.
In this section we recall the main ideas of the theory of prediction with expert advice. Let a pool of $N$ experts be given. Suppose that the elements of a time series $y_1, y_2, \dots$ are revealed online, step by step. Learning proceeds in trials $t = 1, 2, \dots, T$. At each time moment $t$ the experts present their predictions $f_{1,t}, \dots, f_{N,t}$ and the aggregating algorithm presents its own forecast $f_t$. When the corresponding outcome(s) are revealed, all the experts suffer their losses using a loss function $\lambda$: $l_{i,t} = \lambda(y_t, f_{i,t})$, $i = 1, \dots, N$. Let $h_t = \lambda(y_t, f_t)$ be the loss of the aggregating algorithm. The cumulative losses suffered by any expert $i$ and by the AA during $T$ steps are defined as $L_{i,T} = \sum_{t=1}^{T} l_{i,t}$ and $H_T = \sum_{t=1}^{T} h_t$.
The performance of the algorithm w.r.t. an expert $i$ can be measured by the regret $R_{i,T} = H_T - L_{i,T}$.
The goal of the aggregating algorithm is to minimize the regret with respect to each expert. In order to achieve this goal, at each time moment $t$, the aggregating algorithm evaluates the performance of the experts in the form of a vector of experts' weights $\mathbf{w}_t = (w_{1,t}, \dots, w_{N,t})$, where $w_{i,t} \ge 0$ and $\sum_{i=1}^{N} w_{i,t} = 1$ for all $t$. The weight of an expert $i$ is an estimate of the quality of the expert's predictions by step $t$. In the classical setting (see Freund and Schapire (1997), Vovk (1990) among others), the process of updating the expert weights is based on the method of exponential weighting with a learning rate $\eta > 0$:
$$w_{i,t+1} = \frac{w_{i,1} e^{-\eta L_{i,t}}}{\sum_{j=1}^{N} w_{j,1} e^{-\eta L_{j,t}}}, \qquad (1)$$
where $\mathbf{w}_1 = (w_{1,1}, \dots, w_{N,1})$ is some initial weight vector, for example, the uniform one: $w_{i,1} = 1/N$. In the classical setting, we prepare the weights $w_{i,t+1}$ for use at the next step $t+1$ or, in the more general case of predicting the outcome $k$ steps ahead, we use at step $t$ the weights computed from the losses revealed by step $t-k$.
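As a minimal illustration, the exponential weighting rule (1) can be sketched as follows (a log-domain implementation for numerical stability; the function name and uniform default are our own):

```python
import numpy as np

def exp_weights(cum_losses, eta, w_init=None):
    """Exponential weighting: w_i proportional to w_{i,1} * exp(-eta * L_i)."""
    cum_losses = np.asarray(cum_losses, dtype=float)
    n = len(cum_losses)
    w_init = np.full(n, 1.0 / n) if w_init is None else np.asarray(w_init, float)
    logits = np.log(w_init) - eta * cum_losses
    logits -= logits.max()  # subtract the maximum before exponentiating (stability)
    w = np.exp(logits)
    return w / w.sum()
```

With equal initial weights, the expert with the smaller cumulative loss always receives the larger weight, and the weights sum to one.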
We consider learning with a mixable loss function $\lambda(y, f)$. Here $y$ is an element of some set of outcomes $\Omega$, and $f$ is an element of some set of forecasts $\Gamma$. The experts present the forecasts $f_{1,t}, \dots, f_{N,t} \in \Gamma$.
In this case the main tool is the superprediction function
$$g(y) = -\frac{1}{\eta} \ln \sum_{i=1}^{N} q_i e^{-\eta \lambda(y, f_i)},$$
where $\mathbf{q} = (q_1, \dots, q_N)$ is a probability distribution on the set of all experts and $(f_1, \dots, f_N)$ is a vector of the experts' predictions.

The loss function $\lambda$ is $\eta$-mixable if for any probability distribution $\mathbf{q}$ on the set of experts and for any set of experts' predictions $(f_1, \dots, f_N)$ a forecast $f \in \Gamma$ exists such that
$$\lambda(y, f) \le g(y) \qquad (2)$$
for all $y \in \Omega$.
We fix some rule $\mathrm{Subst}$ for computing a forecast $f = \mathrm{Subst}(\mathbf{f}, \mathbf{q})$ satisfying (2) for all $y \in \Omega$; $\mathrm{Subst}$ is called a substitution function.
A loss function $\lambda$ is $\eta$-exponentially concave if for any $y$ the function $e^{-\eta \lambda(y, f)}$ is concave w.r.t. $f$. By definition, any $\eta$-exponentially concave loss function is $\eta$-mixable.
For an $\eta$-exponentially concave loss function we can also use a more straightforward expression for the substitution function: the weighted mean $f = \sum_{i=1}^{N} q_i f_i$. With this choice, the inequality (2) holds for all $y \in \Omega$.
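For instance, the square loss with outcomes in $[0, 1]$ is $\eta$-exponentially concave for $\eta \le 1/2$, so the weighted mean of the forecasts satisfies inequality (2). The following sketch (our own illustration, not taken from the paper) checks this numerically on a grid of outcomes:

```python
import numpy as np

def superprediction(y, preds, q, eta):
    """g(y) = -(1/eta) * ln( sum_i q_i * exp(-eta * (y - f_i)^2) )."""
    return -np.log(np.dot(q, np.exp(-eta * (y - preds) ** 2))) / eta

eta = 0.5                              # square loss on [0, 1]: eta <= 1/2
preds = np.array([0.1, 0.4, 0.9])      # experts' forecasts
q = np.array([0.5, 0.3, 0.2])          # weights (a probability distribution)
f = float(np.dot(q, preds))            # substitution: weighted mean

# Inequality (2): the learner's loss never exceeds the superprediction.
for y in np.linspace(0.0, 1.0, 101):
    assert (y - f) ** 2 <= superprediction(y, preds, q, eta) + 1e-12
```

The assertion is Jensen's inequality applied to the concave function $e^{-\eta (y-f)^2}$.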
3 Algorithm for Combining Long-term Forecasts of Experts.
In this section we consider an extended setting. At each time moment $t$ each expert presents an infinite sequence of forecasts for the time moments $t+1, t+2, \dots$. A sequence of the corresponding confidence levels can also be presented at time moment $t$. Each element $p$ of this sequence is a number between 0 and 1. If $p < 1$, then it means that we use the forecast only partially (e.g. it may become obsolete with time). If $p = 0$, then the corresponding forecast is not taken into account at all. (For example, in applications it is convenient, for some $d$, to set the confidence of all forecasts more than $d$ steps ahead to 0, since predictions too far ahead become obsolete.) Confidence levels can be set by the expert itself or by the learner. (The setting of prediction with experts that report their confidences as a number in the interval $[0, 1]$ was first studied by Blum and Mansour (2007) and further developed by Cesa-Bianchi et al. (2007).)
At each time moment $t$ we observe the sequences of forecasts issued by the experts at the time moments $s \le t$. To aggregate the forecasts of all experts, we convert any "real" expert $i$ into an infinite sequence of auxiliary experts $(i, s)$, where $s = 1, 2, \dots$.
At each time moment $t$ the auxiliary expert $(i, s)$ presents its forecast, which is the segment of the sequence issued by expert $i$ at time $s$ of length $D$ starting at its $(t-s+1)$-th element. More precisely, the forecast of the auxiliary expert $(i, s)$ is a vector
$$f_{i,s,t} = (f_i^s(t+1), \dots, f_i^s(t+D)),$$
where for $s > t$ we set the forecasts equal to 0. We also denote the corresponding segments of confidence levels by
$$p_{i,s,t} = (p_i^s(t+1), \dots, p_i^s(t+D)),$$
where $p_i^s(j) \in [0, 1]$ for $s \le t$ and $p_i^s(j) = 0$ for $s > t$.
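The extraction of an auxiliary expert's segment can be sketched as follows (zero-padding for unavailable entries; the indexing conventions are our own assumption):

```python
import numpy as np

def segment(seq, s, t, horizon):
    """Segment of the sequence issued by an expert at time s that covers
    the moments t+1, ..., t+horizon; seq[k] forecasts moment s+1+k.
    Entries that are unavailable (s > t, or beyond seq) are set to 0."""
    out = np.zeros(horizon)
    if s > t:                     # the auxiliary expert has not started yet
        return out
    avail = np.asarray(seq, float)[t - s : t - s + horizon]
    out[:len(avail)] = avail
    return out
```

For example, the sequence issued at time $s = 1$ forecasts the moments $2, 3, \dots$; at $t = 3$ its segment starts at the third element.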
Using the losses suffered by the auxiliary experts $(i, s)$ (for $s \le t$) on the time interval revealed by step $t$, the aggregating algorithm updates the weights of all the experts by the rule (1). We denote these weights by $w_{i,s,t}$ and use them for computing the aggregated forecast for the time interval $[t+1, t+D]$ ahead.
We use the fixed point method of Chernov and Vovk (2009). Define the virtual forecasts of the auxiliary experts $(i, s)$:
$$\tilde{f}_{i,s,t} = p_{i,s,t} f_{i,s,t} + (1 - p_{i,s,t}) f_t,$$
where $f_t$ is the learner's own aggregated forecast and the operations are performed componentwise. We consider any confidence level $p \in [0, 1]$ as a probability distribution $(p, 1-p)$ on a two-element set.
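For an exponentially concave loss with the weighted-mean substitution, the fixed-point condition on the virtual forecasts has a closed-form solution: the aggregated forecast is the mean of the experts' forecasts under the confidence-discounted weights. A sketch (the function name and array layout are our own):

```python
import numpy as np

def aggregate_with_confidence(preds, weights, conf):
    """Solve f = sum_i w_i * (p_i * f_i + (1 - p_i) * f) componentwise,
    which rearranges to f = sum_i w_i p_i f_i / sum_i w_i p_i."""
    preds = np.asarray(preds, float)          # shape: (experts, horizon)
    w = np.asarray(weights, float)[:, None]   # expert weights, sum to 1
    p = np.asarray(conf, float)               # confidences in [0, 1]
    return (w * p * preds).sum(axis=0) / (w * p).sum(axis=0)
```

Forecasts with zero confidence drop out of the aggregate entirely; with all confidences equal to 1 the rule reduces to the ordinary weighted mean.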
First, we provide a justification of the algorithm presented below. Our goal is to define the forecast $f_t$ such that
$$e^{-\eta \lambda(y, f_t)} \ge \sum_{i,s} w_{i,s,t-1} \left( p_{i,s,t} e^{-\eta \lambda(y, f_{i,s,t})} + (1 - p_{i,s,t}) e^{-\eta \lambda(y, f_t)} \right) \qquad (5)$$
for each outcome $y$. Here the bracketed term is the mathematical expectation of the exponentiated loss with respect to the probability distribution $(p_{i,s,t}, 1 - p_{i,s,t})$ induced by the confidence level. Also, $w_{i,s,t-1}$ is the weight of the auxiliary expert $(i, s)$ accumulated at the end of step $t-1$.

We rewrite inequality (5) in a more detailed form: for any auxiliary expert $(i, s)$, the term corresponding to zero confidence contributes the learner's own exponentiated loss, for all $y$. Therefore, the inequality (5) is equivalent to the inequality
$$e^{-\eta \lambda(y, f_t)} \ge \frac{\sum_{i,s} w_{i,s,t-1} \, p_{i,s,t} \, e^{-\eta \lambda(y, f_{i,s,t})}}{\sum_{i,s} w_{i,s,t-1} \, p_{i,s,t}},$$
which holds for all $y$ by the mixability of the loss function. We use the generalized Hölder inequality and obtain the analogous inequality for the averaged losses on the interval $[t+1, t+D]$, where
$$h_t = \frac{1}{D} \sum_{j=t+1}^{t+D} \lambda(y_j, f_t(j))$$
is the (averaged) loss of the aggregating algorithm suffered on the time interval $[t+1, t+D]$ and
$$\ell_{i,s,t} = \frac{1}{D} \sum_{j=t+1}^{t+D} \lambda(y_j, f_i^s(j))$$
is the (averaged) loss of the auxiliary expert $(i, s)$.
The protocol of the algorithm for aggregating the forecasts of the experts is shown below.
Set the initial weights $w_{i,s,0}$, where $i = 1, \dots, N$, $s = 1, 2, \dots$, and $\sum_{i,s} w_{i,s,0} = 1$.
IF $s > t$ THEN put $f_i^s(j) = 0$ for all $j$ and $p_i^s(j) = 0$.
Observe the outcomes revealed at the time moment $t$ and the predictions of the learner issued at the previous time moments.
Compute the (average) loss of the learner on the newly revealed time segment.
Compute the (average) losses of the auxiliary experts $(i, s)$ for $s \le t$, where for $s > t$ we set the expert's loss equal to the loss of the learner (its confidence is 0).
Update the weights $w_{i,s,t}$ for $i = 1, \dots, N$, $s = 1, 2, \dots$. (These weights can be computed efficiently, since the divisor in (13) can be represented recursively; see (14).)
Prepare the weights $w_{i,s,t}$ for $i = 1, \dots, N$ and $s = 1, 2, \dots$.
Receive the predictions issued by the experts at the time moments $s \le t$ and their confidence levels.
Extract the segments of the forecasts of the auxiliary experts $(i, s)$, where $s \le t$, and the segments of the corresponding confidence levels. (Here $f_i^s(j)$ is the forecast of the real expert $i$ for the time moment $j$ issued at the time moment $s$. We put these quantities to be 0 for $s > t$; also, the corresponding confidence levels are 0 for $s > t$.)
Recall that $h_t$ denotes the algorithm's (average) loss and $\ell_{i,s,t}$ denotes the (average) loss of the auxiliary expert $(i, s)$.
Define the discounted (average) excess loss with respect to an auxiliary expert $(i, s)$ at a time moment $t$ by
$$r_{i,s,t} = p_{i,s,t} (h_t - \ell_{i,s,t}). \qquad (16)$$
By the definition of the virtual forecasts we can represent the discounted excess loss (16) as the excess of the learner's loss over the expected loss of the virtual forecast. We measure the performance of our algorithm by the cumulative discounted (average) excess loss $\sum_{t=1}^{T} r_{i,s,t}$ with respect to any auxiliary expert $(i, s)$.
Theorem 1. For any $T$ and any $\eta$ such that the loss function is $\eta$-mixable, the following upper bound for the cumulative excess loss holds true:
$$\sum_{t=1}^{T} r_{i,s,t} \le \frac{1}{\eta} \ln \frac{1}{w_{i,s,0}}$$
for each auxiliary expert $(i, s)$ such that $s \le T$ and $w_{i,s,0} > 0$.
4 Online Smoothing Regression
In this section we consider the online learning scenario within the supervised setting (that is, the data are pairs $(x_t, y_t)$ of predictor-response variables). A forecaster presents a regression function defined on a set of objects, which are called signals. After a pair $(x_t, y_t)$ is revealed, the forecaster suffers a loss $\lambda(y_t, f(x_t))$, where $\lambda$ is some loss function. We assume that the outcomes are bounded and that the loss function is $\eta$-mixable for some $\eta > 0$.
An example is linear regression, where the signal space is a set of $n$-dimensional vectors, a regression function is a linear function $f(x) = \langle w, x \rangle$, where $w$ is a weight vector, and the loss function is the square loss $\lambda(y, \gamma) = (y - \gamma)^2$.
In the online mode, at any step $t$, to define the forecast for step $t+1$ (a regression function $f_t$), we use the prediction with expert advice approach. A feature of this approach is that we aggregate the regression functions $f^s$ for $s \le t$, each of which depends on the corresponding segment of the sample. At the end of step $t$ we define (initialize) the next regression function $f^{t+1}$ by the sample revealed so far.
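The aggregation of the regression functions described above can be sketched as follows; this is our own simplified illustration (square loss, weighted-mean substitution, one new regression function initialized per step), not the full Algorithm 2:

```python
import numpy as np

def online_smoothing(xs, ys, fit, eta=1.0):
    """At each step t, combine the regression functions f^s (s <= t),
    each fitted on a prefix of the sample, by exponential weighting."""
    experts, cum_loss, out = [], [], []
    for t in range(len(xs)):
        if experts:
            logits = -eta * np.array(cum_loss)
            w = np.exp(logits - logits.max())
            w /= w.sum()
            out.append(float(np.dot(w, [f(xs[t]) for f in experts])))
            for i, f in enumerate(experts):
                cum_loss[i] += (ys[t] - f(xs[t])) ** 2  # square losses
        else:
            out.append(0.0)  # no experts are available yet at the first step
        experts.append(fit(xs[: t + 1], ys[: t + 1]))
        cum_loss.append(0.0)
    return out
```

Here `fit` is any regression procedure that maps a sample prefix to a prediction function; experts fitted on longer prefixes are penalized only through their own past losses.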
Since the forecast can potentially be applied to any future input value $x$, we consider this method as a kind of long-term forecasting.
We briefly describe below the changes made to Algorithm 1. We introduce signals into the protocol from Section 3.
Set initial weights as in Algorithm 1.
Observe the pair $(x_t, y_t)$ and compute the losses suffered by the learner and by the expert regression functions: the loss of $f^s$ is $\lambda(y_t, f^s(x_t))$ if $s \le t$ and equals the learner's loss otherwise.
Update the weights for $s = 1, 2, \dots$. See also the footnote to (13).
Let us analyze the performance of Algorithm 2 as a forecaster several steps ahead.
For any time moment $t$ a sequence of pairs $(x_1, y_1), \dots, (x_t, y_t)$ is revealed. Denote by $H_{s,t}$ the average loss of the learner on the time interval $[s, t]$ and by $L_{s,t}$ the average loss of any auxiliary expert $f^s$ on the same interval.
Theorem 2. The regret bound of Algorithm 2 does not depend on the number of steps ahead: for any $t$ and any $s \le t$,
$$H_{s,t} - L_{s,t} \le \frac{\ln(1/w_{s,0})}{\eta (t - s + 1)}. \qquad (22)$$
Proof. The analysis of the performance of Algorithm 2 in the case of prediction several steps ahead is similar to that of Algorithm 1. Let $s$ and $t$ be given. Using the techniques of Section A, we obtain, for any time moment of the interval $[s, t]$, an analogous one-step bound.
Summing this inequality over the time moments of the interval $[s, t]$ and dividing by $t - s + 1$, we obtain (22).
In particular, Theorem 2 implies that the total loss of Algorithm 2 on any time interval is no more (up to a logarithmic regret term) than the loss of the best regression function constructed in the past.
Online regression with a sliding window. Some time series show a strong dependence on the latest information rather than on all the data. In this case, it is useful to apply regression with a sliding window. In this regard, we consider the application of Algorithm 2 to the case of online regression with a sliding window. Each expert represents some type of dependence between input and output data. If this relationship is relatively regular, the experts based on past data can successfully compete with the experts based on the latest data. Therefore, it may be useful to aggregate the predictions of all the auxiliary experts based on past data.
Let
$$f^s(x) = x^\top (X_s^\top X_s + a I)^{-1} X_s^\top Y_s$$
be the ridge regression function. Here $X_s$ is the matrix whose rows are formed by the signal vectors of the corresponding window ($X^\top$ is the transposed matrix $X$), $Y_s$ is the vector of the corresponding responses, $I$ is a unit matrix, and $a > 0$ is a parameter. For $s = 1$ we set $f^s$ equal to some fixed value.
We use the square loss function and assume that the outcomes are bounded: $|y_t| \le Y$ for all $t$. For each $t$ we define the aggregating regression function (the learner's forecast) by (21) using the regression functions (the expert strategies) $f^s$ for $s \le t$, where each such function is defined using a learning sample (a window) starting at step $s$. (A computationally efficient algorithm for recalculating the matrices during the transition from one window to the next for some special type of online regression with a sliding window was presented by Arce and Salinas (2012). Similar efficient versions of regression using Algorithm 2 can also be developed.)
Experiments. Let us present the results of experiments performed on synthetic data. The initial data was obtained by sampling from a data generative model.
We start from a sequence of $n$-dimensional signals sampled i.i.d. from a multivariate normal distribution. The signals are revealed online and