1 Introduction
The problem of long-term forecasting of time series is of high practical importance. For example, nearly everybody nowadays uses long-term weather forecasts (24-hour, 7-day, etc.) Richardson (2007); Lorenc (1986) provided by local weather forecasting platforms. Road traffic and jam forecasts De Wit et al. (2015); Herrera et al. (2010); Myr (2002) are actively used in many modern navigation systems. Forecasts of energy consumption and costs Gaillard and Goude (2015), web traffic Oliveira et al. (2016) and stock prices Ding et al. (2015); Pai and Lin (2005) are also widely used in practice.
Many classical (e.g. ARIMA Box et al. (2015)) and modern (e.g. Facebook Prophet, https://github.com/facebook/prophet, Taylor and Letham (2018)) time series forecasting approaches produce a model that is capable of predicting arbitrarily many steps ahead. The advantage of such models is that when building the final forecast at step $t$ for the interval ahead, one may use the forecasts made earlier at the steps $t-1, t-2, \dots$. Forecasts made at earlier steps use less of the observed data; nevertheless, they can be more robust to noise, outliers and the novelty of the forecasted time interval. Thus, the usage of such outdated forecasts may prove useful, especially if the time series is stationary.
In general, we consider the game-theoretic online learning model in which a master (aggregating) algorithm has to combine predictions from a set of experts. The problem setting we investigate can be considered as part of the Decision-Theoretic Online Learning (DTOL) or Prediction with Expert Advice (PEA) framework (see e.g. Littlestone and Warmuth (1994); Freund and Schapire (1997); Vovk (1990, 1998); Cesa-Bianchi and Lugosi (2006); Korotin et al. (2018) among others). In this framework the learner is usually called the aggregating algorithm. The aggregating algorithm combines the predictions of a set of experts in the online mode over time steps $t = 1, 2, \dots, T$.
In practice, the square loss function is widely used for time series prediction. The square loss function is mixable Vovk (1998). For mixable loss functions, Vovk's aggregating algorithm (AA) Vovk (1998, 2001) is the most appropriate choice, since it has the best known theoretical performance guarantees. We use the aggregating algorithm as the base and modify it for long-term forecasting. The long-term forecasting considered in this paper is a case of forecasting with delayed feedback. As far as we know, the problem of delayed feedback forecasting was first considered by Weinberger and Ordentlich (2002).
In this paper we consider two scenarios of long-term forecasting. In the first one, at each step the learner has to combine the point forecasts issued by the experts for the time interval ahead. In the second scenario, at each step the experts issue prediction functions, and the learner has to combine these functions into a single one, which will be used for long-term time series prediction.
The first theoretical problem we investigate in this paper is the effective usage of outdated forecasts. Formally, the learner is given $N$ basic forecasting models. At every step, each model produces an infinite sequence of forecasts for the steps ahead. The goal of the learner at each step is to combine the current models' forecasts and the forecasts made earlier into one aggregated long-term forecast for the time interval ahead. We develop an algorithm to efficiently combine these forecasts.
Our main idea is to replicate each expert $n$ into an infinite sequence of auxiliary experts $(n, s)$, where $s = 1, 2, \dots$. Each auxiliary expert issues at its starting time moment $s$ an infinite sequence of forecasts for the subsequent time moments. Only a finite number of the auxiliary experts are available at any time moment. The setting presented in this paper is also valid in the case where only one expert ($N = 1$) is given. At any time moment the AA uses the predictions of each available auxiliary expert for the current forecasted time interval (made by the corresponding real expert at an earlier time moment). In our case, the performance of the AA at a given step is measured by the regret, which is the difference between the average loss of the aggregating algorithm suffered on the forecasted time interval and the average loss of the best auxiliary expert suffered on the same time interval. Note that a recent related work is Adamskiy et al. (2017), where an algorithm with a tight upper bound for predicting vector-valued outcomes was presented.
In the second part of our paper we consider the online supervised learning scenario. The data is represented by pairs of predictor-response variables. Instead of point or interval predictions, the experts and the learner present their predictions in the form of functions of signals. Signals appear gradually over time and allow one to calculate forecasts as the values of these functions. For this problem we present a method for smoothing regression using expert advice.

2 Preliminaries
In this section we recall the main ideas of the theory of prediction with expert advice. Let a pool of $N$ experts be given. Suppose that the elements of a time series are revealed online, step by step. Learning proceeds in trials $t = 1, 2, \dots$. At each time moment the experts present their predictions and the aggregating algorithm presents its own forecast. When the corresponding outcome(s) are revealed, all the experts suffer their losses measured by a loss function, and so does the aggregating algorithm. The cumulative losses suffered by any expert and by the AA during $T$ steps are defined as the sums of their losses over the steps $t = 1, \dots, T$.
The performance of the algorithm w.r.t. an expert can be measured by the regret: the difference between the cumulative loss of the algorithm and the cumulative loss of the expert.
The goal of the aggregating algorithm is to minimize the regret with respect to each expert. In order to achieve this goal, at each time moment $t$ the aggregating algorithm evaluates the performance of the experts in the form of a vector of expert weights $w_t = (w_t^1, \dots, w_t^N)$, where $w_t^n \ge 0$ and $\sum_{n=1}^N w_t^n = 1$. The weight of an expert is an estimate of the quality of the expert's predictions at step $t$. In the classical setting (see Freund and Schapire (1997), Vovk (1990) among others), the process of updating the expert weights is based on the method of exponential weighting with a learning rate $\eta > 0$:

(1) $w_{t+1}^n = \frac{w_t^n e^{-\eta \lambda(y_t, f_t^n)}}{\sum_{j=1}^N w_t^j e^{-\eta \lambda(y_t, f_t^j)}}$,

where $\lambda(y_t, f_t^n)$ is the loss of expert $n$ at step $t$ and $w_1$ is some initial weight vector, for example, the uniform one $w_1^n = 1/N$. In the classical setting, we prepare the weights for use at the next step or, in the more general case of predicting an outcome several steps ahead, we use the weights computed from the outcomes that have already been revealed.
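As a minimal sketch of the exponential weighting update (1) (the function name and the uniform initialization below are our own choices, not part of the paper's implementation):

```python
import numpy as np

def exp_weights_update(weights, losses, eta):
    """One step of the exponential weighting rule (1): each expert's
    weight is multiplied by exp(-eta * loss) and the vector is
    renormalized so that it remains a probability distribution."""
    w = weights * np.exp(-eta * np.asarray(losses))
    return w / w.sum()

# Uniform initial weights over N = 3 experts, then one update
# with hypothetical square losses suffered at step t.
w = exp_weights_update(np.ones(3) / 3, losses=[0.1, 0.5, 0.9], eta=2.0)
```

Experts with smaller losses end up with larger weights, and the weights always sum to one.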
Vovk's aggregating algorithm (AA) (Vovk (1990), Vovk (1998)) is the base algorithm of our study. Let us explain the main ideas of learning with the AA.
We consider learning with a mixable loss function $\lambda(\omega, \gamma)$. Here $\omega$ is an element of some set of outcomes $\Omega$, and $\gamma$ is an element of some set of forecasts $\Gamma$. The experts present the forecasts $\gamma^1, \dots, \gamma^N$.
In this case the main tool is the superprediction function
$$g(\omega) = -\frac{1}{\eta} \ln \sum_{n=1}^N p^n e^{-\eta \lambda(\omega, \gamma^n)},$$
where $p = (p^1, \dots, p^N)$ is a probability distribution on the set of all experts and $(\gamma^1, \dots, \gamma^N)$ is the vector of the experts' predictions. The loss function $\lambda$ is mixable if for any probability distribution $p$ on the set of experts and for any set of expert predictions a forecast $\gamma \in \Gamma$ exists such that

(2) $\lambda(\omega, \gamma) \le g(\omega)$

for all $\omega \in \Omega$.
We fix some rule $\mathrm{Subst}$ for computing a forecast $\gamma = \mathrm{Subst}(p, (\gamma^1, \dots, \gamma^N))$ satisfying (2); $\mathrm{Subst}$ is called a substitution function.
It will be proved in Section A that, using the rules (1) and (2) for defining the weights and the forecasts in the online mode, we obtain the bound $L_T \le L_T^n + \frac{\ln N}{\eta}$ for all $n = 1, \dots, N$, where $L_T$ and $L_T^n$ are the cumulative losses of the algorithm and of expert $n$.
A loss function $\lambda$ is exponentially concave if for any outcome $\omega$ the function $e^{-\eta \lambda(\omega, \gamma)}$ is concave w.r.t. $\gamma$. By definition, any exponentially concave loss function is mixable.
The square loss function $\lambda(y, \gamma) = (y - \gamma)^2$, where the outcomes $y$ and the forecasts $\gamma$ are real numbers from an interval $[A, B]$, is mixable for any $\eta$ such that $0 < \eta \le \frac{2}{(B - A)^2}$; see Vovk (1990, 1998).
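For illustration, here is a hedged sketch of the AA forecast for the square loss on an interval $[A, B]$: it evaluates the superprediction function $g$ at the two endpoints and applies a standard substitution rule, $\gamma = (A + B)/2 - (g(B) - g(A)) / (2(B - A))$, at the critical learning rate $\eta = 2/(B - A)^2$. All function names are our own, and the substitution formula should be checked against Vovk's papers before serious use:

```python
import numpy as np

def superprediction(y, weights, forecasts, eta):
    """g(y) = -(1/eta) * ln sum_n w_n * exp(-eta * (y - gamma_n)^2)."""
    return -np.log(np.sum(weights * np.exp(-eta * (y - forecasts) ** 2))) / eta

def aa_square_loss(weights, forecasts, A=-1.0, B=1.0):
    """AA forecast for the square loss on [A, B]: evaluate the
    superprediction function at the endpoints and substitute."""
    eta = 2.0 / (B - A) ** 2
    gA = superprediction(A, weights, forecasts, eta)
    gB = superprediction(B, weights, forecasts, eta)
    return (A + B) / 2.0 - (gB - gA) / (2.0 * (B - A))
```

A sanity check: when all experts agree, the substitution returns their common forecast, and a symmetric pool of forecasts around the midpoint yields the midpoint.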
3 Algorithm for Combining Long-term Forecasts of Experts
In this section we consider an extended setting. At each time moment each expert presents an infinite sequence of forecasts for the future time moments. A sequence of the corresponding confidence levels can also be presented at that time moment. Each element of this sequence is a number between 0 and 1. A confidence level less than 1 means that we use the corresponding forecast only partially (e.g. it may become obsolete with time). A confidence level equal to 0 means that the corresponding forecast is not taken into account at all. (For example, in applications it is convenient to set the confidence levels of predictions that look too far ahead to zero, since such predictions become obsolete.) Confidence levels can be set by the expert itself or by the learner. (The setting of prediction with experts that report their confidences as a number in the interval $[0, 1]$ was first studied by Blum and Mansour (2007) and further developed by Cesa-Bianchi et al. (2007).)
At each time moment $t$ we observe the forecast sequences issued by the experts at the time moments $s \le t$. To aggregate the forecasts of all the experts, we convert each "real" expert $n$ into an infinite sequence of auxiliary experts $(n, s)$, where $s = 1, 2, \dots$.
At each time moment the auxiliary expert $(n, s)$ presents as its forecast the segment of the sequence issued by the real expert $n$ at time $s$ that covers the current forecasted time interval. More precisely, the forecast of the auxiliary expert is the vector of the corresponding elements of this sequence, where the positions the real expert never forecast are filled with a fixed default value.
We define the corresponding segments of confidence levels analogously, setting the confidence of every such missing position to 0.
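To make the replication concrete, here is a hedged sketch (with our own indexing convention) of how the length-$h$ forecast segment of an auxiliary expert $(n, s)$ at time $t$ could be extracted from the sequence the real expert issued at time $s$:

```python
def auxiliary_forecast(issued, s, t, h, pad=0.0):
    """Forecast of the auxiliary expert (n, s) at time t: the
    length-h segment of the sequence issued at time s covering
    the moments t+1, ..., t+h. Positions the real expert never
    forecast are padded (they would get confidence 0)."""
    # issued[k] is the real expert's forecast, made at time s,
    # for the moment s + k + 1 (k = 0, 1, ...).
    segment = []
    for m in range(t + 1, t + h + 1):
        k = m - s - 1
        segment.append(issued[k] if 0 <= k < len(issued) else pad)
    return segment
```

For example, a sequence issued at $s = 1$ yields, at $t = 2$ with $h = 2$, its second and third elements (the forecasts for the moments 3 and 4).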
Using the losses suffered by the auxiliary experts on the already revealed time interval, the aggregating algorithm updates the weights of all the auxiliary experts by the rule (1). We use these weights for computing the aggregated interval forecast for the time moments ahead.
We use the fixed-point method of Chernov and Vovk (2009) and define the virtual forecasts of the auxiliary experts: each virtual forecast mixes the expert's own forecast, taken with its confidence level, with the learner's forecast, taken with the complementary probability.
We consider any confidence level as a probability distribution on a two-element set.
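Under this interpretation (sketched here with our own naming, in the spirit of the Chernov–Vovk treatment of confidences), an expert with confidence $p$ is followed with probability $p$, and with probability $1 - p$ the learner's own forecast is substituted for it; the expert's effective loss is the corresponding expectation:

```python
def confident_loss(p, expert_loss, learner_loss):
    """Expected loss of an expert reporting confidence p in [0, 1]:
    with probability p its own forecast (and loss) is used, with
    probability 1 - p the learner's forecast is used instead."""
    return p * expert_loss + (1.0 - p) * learner_loss
```

At $p = 0$ the expert is ignored (its effective loss equals the learner's), and at $p = 1$ it is charged its full loss.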
First, we provide a justification of the algorithm presented below. Our goal is to define the forecast such that for
(5) 
for each outcome . Here is the mathematical expectation with respect to the probability distribution . Also, is the weight of the auxiliary expert accumulated at the end of step .
We rewrite inequality (5) in a more detailed form: for any ,
(6)  
(7) 
for all . Therefore, the inequality (5) is equivalent to the inequality
(8) 
where
(9) 
According to the aggregating algorithm rule we can define for such that (8) and its equivalent (5) are valid. Here is the substitution function and
The outcomes will be fully revealed only at the time moment . The inequality (5) holds for and for the forecasts for all . By convexity of the exponent the inequality (5) implies that
(10) 
holds for all . We use the generalized Hölder inequality and obtain
(11) 
For more details on the Hölder inequality see Section A. The inequality (11) can be rewritten as
(12) 
where
is the (averaged) loss of the aggregating algorithm suffered on the time interval and
is the (averaged) mean loss of the expert .
The protocol of the algorithm for aggregating the forecasts of the experts is shown below.
Algorithm 1
Set , where , , .
FOR
IF THEN put for all and .
ELSE

Observe the outcomes and predictions of the learner issued at the time moment .

Compute the loss of the learner on the time segment , where .

Compute the losses of the experts for , where for we set if and if .
ENDIF

Update weights:
(13) for all the auxiliary experts. (These weights can be computed efficiently, since the divisor in (13) can be represented as
(14) )
Prepare the weights: for and .

Receive predictions issued by the experts at the time moments and their confidence levels .

Extract the segments of forecasts of the auxiliary experts and the segments of the corresponding confidences. (Here each element of a segment is a forecast of the real expert for the corresponding time moment, issued at an earlier time moment.)
ENDFOR
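The loop above can be simulated offline in a heavily simplified form. The sketch below is our own simplification, not the paper's Algorithm 1: one real expert, point forecasts for the single moment $t + h$ instead of whole intervals, no confidence levels, and losses assumed revealed immediately; it only illustrates how issued sequences survive as auxiliary experts whose $h$-step-ahead forecasts are combined by exponential weighting (with a plain weighted-average substitution):

```python
import numpy as np

def long_term_aggregation(issue, outcomes, h, eta=0.5):
    """Offline simulation: at every step t the expert issues a fresh
    forecast sequence via issue(t); all sequences issued so far act
    as auxiliary experts whose forecasts for the moment t + h are
    aggregated by exponentially updated weights."""
    T = len(outcomes) - h
    pool, w, preds = [], [], []
    for t in range(T):
        # New auxiliary expert: issue(t)[k] forecasts the moment t + k + 1.
        pool.append((t, issue(t)))
        w.append(1.0)
        wn = np.array(w) / sum(w)
        # Each auxiliary expert's forecast for the moment t + h.
        fs = np.array([seq[t + h - s - 1] for s, seq in pool])
        preds.append(float(wn @ fs))
        # Update weights with the square losses at t + h.
        w = list(np.array(w) * np.exp(-eta * (fs - outcomes[t + h]) ** 2))
    return preds

# Hypothetical data: a constant series and a constant-forecasting expert.
preds = long_term_aggregation(lambda t: [5.0] * 20, [5.0] * 8, h=2)
```

With a perfect expert the aggregated forecast reproduces the series exactly, and imperfect auxiliary experts would be down-weighted exponentially in their realized losses.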
Denote for
We put these quantities to be 0 for . Also, for . Since by definition
we have
Recall that the former quantity is the algorithm's (average) loss and the latter is the (average) loss of the auxiliary expert.
Define the discounted (average) excess loss with respect to an expert at a time moment by
(16) 
By definition of we can represent the discounted excess loss (16) as
We measure the performance of our algorithm by the cumulative discounted (average) excess loss with respect to any expert .
Theorem 1.
For any and , the following upper bound for the cumulative excess loss holds true:
(17) 
4 Online Smoothing Regression
In this section we consider the online learning scenario within the supervised setting (that is, the data are pairs of predictor-response variables). A forecaster presents a regression function defined on a set of objects, which are called signals. After a pair (signal, response) is revealed, the forecaster suffers a loss measured by some loss function. We assume that the responses are bounded and that the loss function is mixable for some learning rate $\eta > 0$.
An example is linear regression, where the signals form a set of $d$-dimensional vectors, a regression function is a linear function of the signal defined by a weight vector, and the loss is the square loss.
In the online mode, at any step $t$, to define the forecast for step $t + 1$ (a regression function), we use the prediction with expert advice approach. A feature of this approach is that we aggregate the regression functions constructed at the previous steps, each of which depends on its own segment of the sample. At the end of step $t$ we define (initialize) the next regression function using the sample observed so far.
Since the forecast can potentially be applied to any future input value , we consider this method as a kind of longterm forecasting.
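A minimal sketch (with our own naming) of the learner's forecast as an aggregated regression function: the expert functions' values at a new signal are combined by the current weights. The plain weighted mean used here is a valid substitution for the exponentially concave square loss at small enough learning rates, though not necessarily the substitution the paper's rules (3) and (4) prescribe:

```python
import numpy as np

def aggregate_regressions(models, weights, x):
    """Learner's forecast at a new signal x: the weighted
    combination of the expert regression functions' values at x,
    with the weight vector renormalized to sum to one."""
    vals = np.array([m(x) for m in models])
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return float(w @ vals)
```

Since each expert is a function, the aggregated forecast is itself a function of the signal and can be evaluated at any future input, which is what makes this a kind of long-term forecasting.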
We briefly describe below the changes made to Algorithm 1. We introduce signals into the protocol from Section 3.
Algorithm 2
Set initial weights as in Algorithm 1.
FOR

Observe the pair and compute the losses suffered by the learner and by the expert regression functions: if and otherwise.
ENDFOR
For the square loss with bounded responses, by (3) the regression function (20) can be defined in closed form:
(21) 
for each signal, or by the rule (4). (The most appropriate choices of the learning rate differ between the rules (3) and (4). The more straightforward definition (4) results in a four times larger regret but is easier to compute.)
Let us analyze the performance of Algorithm 2 as a forecaster on steps ahead.
For any time moment a sequence is revealed. Denote by the average loss of the learner on time interval and by the average loss of any auxiliary expert .
The regret bound of Algorithm 2 does not depend on :
Theorem 2.
For any ,
(22) 
Proof. The analysis of the performance of Algorithm 2 for the case of prediction several steps ahead is similar to that of Algorithm 1. Using the techniques of Section A, we obtain for any expert,
Summing this inequality over the steps and dividing by their number, we obtain (22).
In particular, Theorem 2 implies that the total loss of Algorithm 2 on any time interval is no more than the loss of the best regression algorithm constructed in the past, up to a logarithmic regret term.
Online regression with a sliding window. Some time series show a strong dependence on the latest information instead of all the data. In this case, it is useful to apply regression with a sliding window. In this regard, we consider the application of Algorithm 2 to the case of online regression with a sliding window. Each expert represents some type of dependence between input and output data. If this relationship is relatively regular, experts based on past data can successfully compete with experts based on the latest data. Therefore, it may be useful to aggregate the predictions of all the auxiliary experts based on past data.
Let the expert strategies be ridge regression functions. The weight vector of each such function is computed as $w = (X'X + aI)^{-1} X'Y$, where $X$ is the matrix whose rows are formed by the signal vectors of the corresponding learning sample ($X'$ is the transposed matrix $X$), $Y$ is the vector of the corresponding responses, $I$ is the identity matrix, and $a > 0$ is a regularization parameter. For an empty sample we set the function equal to some fixed value. We use the square loss function and assume that the responses are bounded. For each step we define the aggregating regression function (the learner's forecast) by (21) using the regression functions (the expert strategies), where each such function is defined on its own learning sample (a window) of past observations. (A computationally efficient algorithm for recalculating the matrices during the transition from one window to the next for a special type of online regression with a sliding window was presented by Arce and Salinas (2012). Similar efficient options for regression using Algorithm 2 can also be developed.)
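A hedged sketch of the expert pool for online ridge regression with a sliding window. The ridge formula $w = (X'X + aI)^{-1} X'Y$ matches the definition above; the particular window convention (expert $s$ fits the samples $s, \dots, t-1$) is our own assumption:

```python
import numpy as np

def ridge_fit(X, Y, a=1.0):
    """Ridge regression weights w = (X'X + a I)^{-1} X' Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + a * np.eye(d), X.T @ Y)

def window_experts(X, Y, t, a=1.0):
    """Expert regression functions at step t: expert s is ridge
    regression fitted on the window of samples s, ..., t-1."""
    experts = []
    for s in range(t):
        w = ridge_fit(X[s:t], Y[s:t], a)
        experts.append(lambda x, w=w: float(x @ w))
    return experts
```

These expert functions can then be aggregated by Algorithm 2; for noiseless linear data and a small regularization parameter, every window recovers (approximately) the same linear dependence.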
Experiments. Let us present the results of experiments performed on synthetic data. The initial data was obtained by sampling from a data generative model.
We start from a sequence of $d$-dimensional signals sampled i.i.d. from a multidimensional normal distribution. The signals are revealed online and