# Long-Term Sequential Prediction Using Expert Advice

For the prediction with expert advice setting, we consider methods to construct forecasting algorithms that suffer loss not much greater than that of any expert in the pool. In contrast to the standard approach, we investigate the case of long-term interval forecasting of time series, that is, each expert issues a sequence of forecasts for a time interval ahead and the master algorithm combines these forecasts into one aggregated sequence of forecasts. Two new approaches for aggregating experts' long-term interval predictions are presented. One is based on Vovk's aggregating algorithm and considers sliding experts, the other applies the approach of the Mixing Past Posteriors method to the long-term prediction. Upper bounds on the regret of these algorithms in the adversarial case are obtained. We also present results of numerical experiments on long-term time series prediction.

## 1 Introduction

The problem of long-term forecasting of time series is of high practical importance. For example, nowadays nearly everybody uses long-term weather forecasts (24-hour, 7-day, etc.) Richardson (2007); Lorenc (1986) provided by local weather forecasting platforms. Road traffic and jam forecasts De Wit et al. (2015); Herrera et al. (2010); Myr (2002) are actively used in many modern navigation systems. Forecasts of energy consumption and costs Gaillard and Goude (2015), web traffic Oliveira et al. (2016) and stock prices Ding et al. (2015); Pai and Lin (2005) are also widely used in practice.

Many classical (e.g. ARIMA Box et al. (2015)) and modern (e.g. Facebook Prophet Taylor and Letham (2018)) time series forecasting approaches produce a model that is capable of predicting arbitrarily many steps ahead. The advantage of such models is that when building the final forecast at each step $t$ for the interval of $d$ steps ahead, one may use the forecasts made earlier at the steps $t-1, t-2, \dots$. Forecasts made at earlier steps use less of the observed data. Nevertheless, they can be more robust to noise, outliers and the novelty of the forecasted time interval. Thus, the usage of such outdated forecasts may prove useful, especially if the time series is stationary.

In general, we consider the game-theoretic online learning model in which a master (aggregating) algorithm has to combine predictions from a set of experts. The problem setting we investigate can be considered as part of the Decision-Theoretic Online Learning (DTOL) or Prediction with Expert Advice (PEA) framework (see e.g. Littlestone and Warmuth (1994); Freund and Schapire (1997); Vovk (1990, 1998); Cesa-Bianchi and Lugosi (2006); Korotin et al. (2018) among others). In this framework the learner is usually called the aggregating algorithm. The aggregating algorithm combines the predictions from a set of $N$ experts in the online mode during time steps $t=1,2,\dots,T$.

In practice, the square loss function is widely used for time series prediction. The square loss function is mixable Vovk (1998). For mixable loss functions Vovk's aggregating algorithm (AA) Vovk (1998, 2001) is the most appropriate, since it has the best theoretical performance among all known algorithms. We use the aggregating algorithm as the base and modify it for long-term forecasting.

The long-term forecasting considered in this paper is a case of the forecasting with a delayed feedback. As far as we know, the problem of the delayed feedback forecasting was first considered by Weinberger and Ordentlich (2002).

In this paper we consider two scenarios of long-term forecasting. In the first one, at each step the learner has to combine the point forecasts of the experts issued for the time interval of $d$ steps ahead. In the second scenario, at each step the experts issue prediction functions, and the learner has to combine these functions into a single one that will be used for long-term time series prediction.

The first theoretical problem we investigate in the paper is the effective usage of outdated forecasts. Formally, the learner is given $N$ basic forecasting models. Each model at every step produces an infinite sequence of forecasts for the steps ahead. The goal of the learner at each step is to combine the current models' forecasts and the forecasts made earlier into one aggregated long-term forecast for the time interval of $d$ steps ahead. We develop an algorithm to efficiently combine these forecasts.

Our main idea is to replicate each expert $n$ into an infinite sequence of auxiliary experts $(n,\tau)$, where $\tau=1,2,\dots$. Each auxiliary expert $(n,\tau)$ issues at the time moment $\tau$ an infinite sequence of forecasts for the time moments $\tau+1,\tau+2,\dots$. Only a finite number of the auxiliary experts are available at any time moment. The setting presented in this paper is valid also in the case where only one expert ($N=1$) is given. At any time moment $t$ the AA uses the predictions of each expert $(n,\tau)$ for the time interval $[t+1,t+d]$ (made by the expert $n$ at time $\tau$). In our case, the performance of the AA on the step $t$ is measured by the regret, which is the difference between the average loss of the aggregating algorithm suffered on a time interval and the average loss of the best auxiliary expert suffered on the same time interval. Note that a recent related work is Adamskiy et al. (2017), where an algorithm with a tight upper bound for predicting vector-valued outcomes was presented.

In the second part of our paper we consider the online supervised learning scenario. The data is represented by pairs $(x_t,y_t)$ of predictor-response variables. Instead of point or interval predictions, the experts and the learner present predictions in the form of functions of signals $x\in X$. Signals appear gradually over time and allow one to calculate forecasts as the values of these functions. For this problem we present a method for smoothing regression using expert advice.

The article is structured as follows. In Section 2 we give some preliminary notions. In Section 3 we present the algorithm for combining long-term forecasts of the experts. Theorem 1 presents a performance bound for the regret of the corresponding algorithm.

In Section 4 we apply the PEA approach to a case of online supervised learning and develop an algorithm for online smoothing regression. We also provide experiments conducted on synthetic data and show the effectiveness of the proposed method. In Appendix A some auxiliary results are presented.

## 2 Preliminaries

In this section we recall the main ideas of the prediction with expert advice theory. Let a pool of $N$ experts be given. Suppose that elements of a time series $y_1,y_2,\dots$ are revealed online – step by step. Learning proceeds in trials $t=1,2,\dots,T$. At each time moment $t$ the experts present their predictions $c^i_t$, $i=1,\dots,N$, and the aggregating algorithm presents its own forecast $\gamma_t$. When the corresponding outcome $y_t$ is revealed, all the experts suffer their losses using a loss function $\lambda$: $l^i_t=\lambda(y_t,c^i_t)$, $i=1,\dots,N$. Let $h_t=\lambda(y_t,\gamma_t)$ be the loss of the aggregating algorithm. The cumulative losses suffered by any expert $i$ and by the AA during $T$ steps are defined as

$$L^i_T=\sum_{t=1}^T l^i_t\quad\text{and}\quad H_T=\sum_{t=1}^T h_t.$$

The performance of the algorithm w.r.t. an expert $i$ can be measured by the regret $R^i_T=H_T-L^i_T$.

The goal of the aggregating algorithm is to minimize the regret with respect to each expert. In order to achieve this goal, at each time moment $t$, the aggregating algorithm evaluates the performance of the experts in the form of a vector of experts' weights $w_t=(w_{1,t},\dots,w_{N,t})$, where $w_{i,t}\ge 0$ and $\sum_{i=1}^N w_{i,t}=1$. The weight of an expert $i$ is an estimate of the quality of the expert's predictions at step $t$. In the classical setting (see Freund and Schapire (1997), Vovk (1990) among others), the process of expert weights updating is based on the method of exponential weighting with a learning rate $\eta>0$:

$$w^\mu_{i,t}=\frac{w_{i,t}e^{-\eta l^i_t}}{\sum_{j=1}^N w_{j,t}e^{-\eta l^j_t}}, \tag{1}$$

where $w_t=(w_{1,t},\dots,w_{N,t})$ is some weight vector, for example, $w_{i,1}=1/N$ for all $i$. In the classical setting, we prepare the weights $w_{i,t+1}=w^\mu_{i,t}$ for use at the next step or, in the more general case of predicting the $d$-th outcome ahead, we define $w_{i,t+d}=w^\mu_{i,t}$, where $t=1,2,\dots$.
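As a concrete illustration, the update (1) can be sketched in a few lines of Python (a minimal sketch, not tied to any particular implementation; `eta` stands for the learning rate $\eta$):

```python
import numpy as np

def exp_weight_update(weights, losses, eta):
    """One step of the exponential weighting rule (1): each expert's
    weight is multiplied by exp(-eta * loss) and the weight vector is
    renormalized to sum to one."""
    w = weights * np.exp(-eta * losses)
    return w / w.sum()

# Three experts, uniform prior; expert 0 suffers the smallest loss,
# so its weight grows after the update.
w = np.ones(3) / 3
w = exp_weight_update(w, np.array([0.1, 0.5, 0.9]), eta=2.0)
```

The multiplicative form makes the update order-independent: applying it over several steps is the same as weighting each expert by the exponential of its cumulative loss.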

Vovk's aggregating algorithm (AA) (Vovk (1990), Vovk (1998)) is the base algorithm in our study. Let us explain the main ideas of learning with the AA.

We consider learning with a mixable loss function $\lambda(y,c)$. Here $y$ is an element of some set of outcomes $Y$, and $c$ is an element of some set of forecasts $\Gamma$. The experts present the forecasts $c_1,\dots,c_N\in\Gamma$.

In this case the main tool is a superprediction function

$$g(y)=-\frac{1}{\eta}\ln\sum_{i=1}^N e^{-\eta\lambda(y,c_i)}p_i,$$

where $p=(p_1,\dots,p_N)$ is a probability distribution on the set of all experts and $c=(c_1,\dots,c_N)$ is a vector of the experts' predictions.

The loss function is mixable if for any probability distribution $p$ on the set of experts and for any set of experts' predictions $c$ a value $\gamma\in\Gamma$ exists such that

$$\lambda(y,\gamma)\le g(y) \tag{2}$$

for all $y\in Y$.

We fix some rule $\gamma=\mathrm{Subst}(c,p)$ for computing a forecast satisfying (2). $\mathrm{Subst}$ is called a substitution function.

It will be proved in Appendix A that using the rules (1) and (2) for defining the weights and the forecasts in the online mode we obtain

$$H_T\le\min_{1\le i\le N}L^i_T+\frac{\ln N}{\eta}$$

for all $T$.

A loss function is $\eta$-exponential concave if for any $y$ the function $e^{-\eta\lambda(y,\gamma)}$ is concave w.r.t. $\gamma$. By definition any $\eta$-exponential concave function is $\eta$-mixable.

The square loss function $\lambda(y,\gamma)=(y-\gamma)^2$ is $\eta$-mixable for any $\eta$ such that $0<\eta\le\frac{1}{2B^2}$, where $y$ and $\gamma$ are real numbers and $|y|\le B$, $|\gamma|\le B$ for some $B>0$, see Vovk (1990, 1998).

By Vovk (1998) and Vovk (2001), for the square loss function, the corresponding forecast can be defined as

$$\gamma=\mathrm{Subst}(c,p)=\frac{1}{4B}\bigl(g(-B)-g(B)\bigr)=\frac{1}{4\eta B}\ln\frac{\sum_{i=1}^N p_i e^{-\eta(B-c_i)^2}}{\sum_{i=1}^N p_i e^{-\eta(B+c_i)^2}}. \tag{3}$$

For an $\eta$-exponential concave loss function we can also use a more straightforward expression for the substitution function:

$$\gamma=\mathrm{Subst}(c,p)=\sum_{i=1}^N c_i p_i. \tag{4}$$

The inequality (2) also holds for all $y\in[-B,B]$.

The square loss function is $\eta$-exponential concave for $0<\eta\le\frac{1}{8B^2}$. However, the definition (4) results in four times more regret (see Kivinen and Warmuth (1999) and Appendix A).
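The two substitution rules can be compared directly in code (an illustrative sketch under the stated square-loss assumptions; `subst_aa` implements rule (3) and `subst_mean` implements rule (4)):

```python
import numpy as np

def subst_aa(c, p, eta, B):
    """Vovk's substitution function (3) for the square loss on [-B, B]."""
    num = np.sum(p * np.exp(-eta * (B - c) ** 2))
    den = np.sum(p * np.exp(-eta * (B + c) ** 2))
    return np.log(num / den) / (4 * eta * B)

def subst_mean(c, p):
    """The weighted-mean substitution function (4)."""
    return np.sum(c * p)

B = 1.0
c = np.array([-0.5, 0.2, 0.8])   # experts' forecasts
p = np.array([0.2, 0.5, 0.3])    # their normalized weights
g_aa = subst_aa(c, p, eta=1 / (2 * B ** 2), B=B)
g_mean = subst_mean(c, p)
```

A useful sanity check: when all experts agree on a forecast $c$, both rules return exactly $c$.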

## 3 Algorithm for Combining Long-term Forecasts of Experts

In this section we consider an extended setting. At each time moment $t$ each expert presents an infinite sequence of forecasts for the time moments $t+1,t+2,\dots$. A sequence of the corresponding confidence levels also can be presented at the time moment $t$. Each element of this sequence is a number between 0 and 1. If a confidence level is less than 1, it means that we use the forecast only partially (e.g. it may become obsolete with time). If it equals 0, then the corresponding forecast is not taken into account at all. (For example, in applications it is convenient to set the confidence levels of all sufficiently distant forecasts to 0, since too far predictions become obsolete.) Confidence levels can be set by the expert itself or by the learner. (The setting of prediction with experts that report their confidences as a number in the interval $[0,1]$ was first studied by Blum and Mansour (2007) and further developed by Cesa-Bianchi et al. (2007).)

At each time moment $t$ we observe the sequences of forecasts issued by the experts at the time moments $\tau\le t$. To aggregate the forecasts of all experts, we convert any "real" expert $n$ into an infinite sequence of auxiliary experts $(n,\tau)$, where $\tau=1,2,\dots$.

At each time moment $t\ge\tau$ the expert $(n,\tau)$ presents the forecast which is the segment of length $d$ of the sequence issued by the real expert $n$ at time $\tau$. More precisely, the forecast of the auxiliary expert $(n,\tau)$ is a vector

$$c^{(n,\tau)}_t=(c^{(n,\tau)}_{t,1},\dots,c^{(n,\tau)}_{t,d}),$$

where for $1\le s\le d$ we set $c^{(n,\tau)}_{t,s}$ equal to the forecast of the real expert $n$ for the time moment $t+s$ issued at the time moment $\tau$.

We also denote the corresponding segments of confidence levels by

$$p^{(n,\tau)}_t=(p^{(n,\tau)}_{t,1},\dots,p^{(n,\tau)}_{t,d}),$$

where $p^{(n,\tau)}_{t,s}$ is the confidence level attached to $c^{(n,\tau)}_{t,s}$ for $\tau\le t$, and $p^{(n,\tau)}_{t,s}=0$ for $\tau>t$.

Using the losses suffered by the auxiliary experts $(n,\tau)$ (for $\tau\le t$) on the last fully revealed time interval, the aggregating algorithm updates the weights of all the experts by the rule (1). We denote these weights by $w_{(n,\tau),t+d}$ and use them for computing the aggregated interval forecast for $d$ time moments ahead

$$\gamma_t=(\gamma_{t,1},\dots,\gamma_{t,d}).$$

We use the fixed point method by Chernov and Vovk (2009). Define the virtual forecasts of the experts $(n,\tau)$:

$$\tilde c^{(n,\tau)}_{t,s}=\begin{cases}c^{(n,\tau)}_{t,s} & \text{with probability } p^{(n,\tau)}_{t,s},\\[2pt] \gamma_{t,s} & \text{with probability } 1-p^{(n,\tau)}_{t,s},\end{cases}$$

where $1\le n\le N$, $1\le\tau\le t$ and $1\le s\le d$.

We consider any confidence level $p^{(n,\tau)}_{t,s}$ as a probability distribution on a two-element set.
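The confidence-adjusted aggregation weights derived below (rule (9)) admit a one-line computation. This is a hedged sketch, assuming the pool of auxiliary experts has been flattened into plain arrays and at least one confidence is positive:

```python
import numpy as np

def confidence_weights(p, w):
    """Confidence-adjusted aggregation weights, as in rule (9): each
    auxiliary expert's weight is multiplied by its confidence for the
    given horizon and the products are renormalized.  p and w are
    arrays over the (flattened) pool of auxiliary experts."""
    v = p * w
    return v / v.sum()

# Two auxiliary experts: the second reports zero confidence for this
# horizon, so all aggregation weight moves to the first expert.
p = np.array([0.8, 0.0])
w = np.array([0.5, 0.5])
w_star = confidence_weights(p, w)
```

Experts with zero confidence simply drop out of the mixture for that horizon, which matches the interpretation of confidence 0 given above.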

First, we provide a justification of the algorithm presented below. Our goal is to define the forecasts $\gamma_{t,s}$ such that for $1\le s\le d$

$$e^{-\eta\lambda(y,\gamma_{t,s})}\ge\sum_{n=1}^N\sum_{\tau=1}^\infty E_{p^{(n,\tau)}_{t,s}}\!\left[e^{-\eta\lambda(y,\tilde c^{(n,\tau)}_{t,s})}\right]w_{(n,\tau),t+d} \tag{5}$$

for each outcome $y$. Here $E_{p^{(n,\tau)}_{t,s}}$ is the mathematical expectation with respect to the probability distribution defined by $p^{(n,\tau)}_{t,s}$. Also, $w_{(n,\tau),t+d}$ is the weight of the auxiliary expert $(n,\tau)$ accumulated at the end of step $t$.

We rewrite inequality (5) in a more detailed form: for any $y$,

$$e^{-\eta\lambda(y,\gamma_{t,s})}\ge\sum_{n=1}^N\sum_{\tau=1}^\infty E_{p^{(n,\tau)}_{t,s}}\!\left[e^{-\eta\lambda(y,\tilde c^{(n,\tau)}_{t,s})}\right]w_{(n,\tau),t+d} \tag{6}$$

$$=\sum_{n=1}^N\sum_{\tau=1}^t p^{(n,\tau)}_{t,s}w_{(n,\tau),t+d}\,e^{-\eta\lambda(y,c^{(n,\tau)}_{t,s})}+e^{-\eta\lambda(y,\gamma_{t,s})}\left(1-\sum_{n=1}^N\sum_{\tau=1}^t p^{(n,\tau)}_{t,s}w_{(n,\tau),t+d}\right) \tag{7}$$

for all $1\le s\le d$. Therefore, the inequality (5) is equivalent to the inequality

$$e^{-\eta\lambda(y,\gamma_{t,s})}\ge\sum_{n=1}^N\sum_{\tau=1}^t w^{*,s}_{(n,\tau),t}\,e^{-\eta\lambda(y,c^{(n,\tau)}_{t,s})}, \tag{8}$$

where

$$w^{*,s}_{(n,\tau),t}=\frac{p^{(n,\tau)}_{t,s}w_{(n,\tau),t+d}}{\sum_{n'=1}^N\sum_{\tau'=1}^t p^{(n',\tau')}_{t,s}w_{(n',\tau'),t+d}}. \tag{9}$$

According to the aggregating algorithm rule we can define $\gamma_{t,s}=\mathrm{Subst}(c_{t,s},w^{*,s}_t)$ for $1\le s\le d$ such that (8) and its equivalent (5) are valid. Here $\mathrm{Subst}$ is the substitution function and

$$w^{*,s}_t=(w^{*,s}_{(n,\tau),t}:1\le n\le N,\ 1\le\tau\le t),$$
$$c_{t,s}=(c^{(n,\tau)}_{t,s}:1\le n\le N,\ 1\le\tau\le t).$$

The outcomes $y_{t+1},\dots,y_{t+d}$ will be fully revealed only at the time moment $t+d$. The inequality (5) holds for $y=y_{t+s}$ and for the forecasts $\gamma_{t,s}$ for all $1\le s\le d$. By convexity of the exponent the inequality (5) implies that

$$e^{-\eta\lambda(y_{t+s},\gamma_{t,s})}\ge\sum_{n=1}^N\sum_{\tau=1}^\infty e^{-\eta E_{p^{(n,\tau)}_{t,s}}[\lambda(y_{t+s},\tilde c^{(n,\tau)}_{t,s})]}\,w_{(n,\tau),t+d} \tag{10}$$

holds for all $1\le s\le d$. We use the generalized Hölder inequality and obtain

$$e^{-\eta\frac{1}{d}\sum_{s=1}^d\lambda(y_{t+s},\gamma_{t,s})}\ge\sum_{n=1}^N\sum_{\tau=1}^\infty e^{-\eta\frac{1}{d}\sum_{s=1}^d E_{p^{(n,\tau)}_{t,s}}[\lambda(y_{t+s},\tilde c^{(n,\tau)}_{t,s})]}\,w_{(n,\tau),t+d}. \tag{11}$$

For more details on the Hölder inequality see Appendix A. The inequality (11) can be rewritten as

$$e^{-\eta h_{t+d}}\ge\sum_{n=1}^N\sum_{\tau=1}^\infty e^{-\eta\hat l^{(n,\tau)}_{t+d}}\,w_{(n,\tau),t+d}, \tag{12}$$

where

$$h_{t+d}=\frac{1}{d}\sum_{s=1}^d\lambda(y_{t+s},\gamma_{t,s})$$

is the (averaged) loss of the aggregating algorithm suffered on the time interval $[t+1,t+d]$ and

$$\hat l^{(n,\tau)}_{t+d}=\frac{1}{d}\sum_{s=1}^d E_{p^{(n,\tau)}_{t,s}}\!\left[\lambda(y_{t+s},\tilde c^{(n,\tau)}_{t,s})\right]$$

is the (averaged) mean loss of the auxiliary expert $(n,\tau)$.

The protocol of algorithm for aggregating forecasts of experts is shown below.

Algorithm 1

Set the initial weights $w_{(n,\tau),1}$, where $1\le n\le N$, $\tau=1,2,\dots$, and $\sum_{n,\tau}w_{(n,\tau),1}=1$.

FOR $t=1,2,\dots$

IF $t\le d$ THEN put $\hat l^{(n,\tau)}_t=0$ for all $(n,\tau)$ and $h_t=0$.

ELSE

1. Observe the outcome $y_t$ and recall the predictions $\gamma_{t-d}=(\gamma_{t-d,1},\dots,\gamma_{t-d,d})$ of the learner issued at the time moment $t-d$.

2. Compute the loss $h_t=\frac{1}{d}\sum_{s=1}^d h_{t,s}$ of the learner on the time segment $[t-d+1,t]$, where $h_{t,s}=\lambda(y_{t-d+s},\gamma_{t-d,s})$.

3. Compute the (discounted) losses $\hat l^{(n,\tau)}_t=\frac{1}{d}\sum_{s=1}^d\hat l^{(n,\tau)}_{t,s}$ of the auxiliary experts, where for $1\le s\le d$ we set $\hat l^{(n,\tau)}_{t,s}=p^{(n,\tau)}_{t-d,s}l^{(n,\tau)}_{t,s}+(1-p^{(n,\tau)}_{t-d,s})h_{t,s}$ if $\tau\le t-d$ and $\hat l^{(n,\tau)}_t=h_t$ if $\tau>t-d$.

ENDIF

1. Update the weights:

$$w^\mu_{(n,\tau),t}=\frac{w_{(n,\tau),t}e^{-\eta\hat l^{(n,\tau)}_t}}{\sum_{n'=1}^N\sum_{\tau'=1}^\infty w_{(n',\tau'),t}e^{-\eta\hat l^{(n',\tau')}_t}} \tag{13}$$

for $1\le n\le N$, $\tau=1,2,\dots$. (These weights can be computed efficiently: since all the experts $(n,\tau)$ with $\tau>t-d$ suffer the same loss $h_t$, the infinite sum in the divisor of (13) splits into a finite sum over $\tau\le t-d$ plus a single term collecting the total weight of the remaining experts.)

2. Prepare the weights: $w_{(n,\tau),t+d}=w^\mu_{(n,\tau),t}$ for all $1\le n\le N$ and $\tau=1,2,\dots$.

3. Receive the predictions issued by the experts at the time moments $\tau\le t$ and their confidence levels.

4. Extract the segments of forecasts of the auxiliary experts $c^{(n,\tau)}_t=(c^{(n,\tau)}_{t,1},\dots,c^{(n,\tau)}_{t,d})$, where $c^{(n,\tau)}_{t,s}$ is the forecast of the real expert $n$ for the time moment $t+s$ issued at the time moment $\tau$, and the segments of the corresponding confidences $p^{(n,\tau)}_t=(p^{(n,\tau)}_{t,1},\dots,p^{(n,\tau)}_{t,d})$.

5. Compute the long-term forecast $\gamma_t=(\gamma_{t,1},\dots,\gamma_{t,d})$ of the learner, where

$$\gamma_{t,s}=\mathrm{Subst}(c_{t,s},w^{*,s}_t),\qquad w^{*,s}_t=(w^{*,s}_{(n,\tau),t}:1\le n\le N,\ 1\le\tau\le t),$$

$$w^{*,s}_{(n,\tau),t}=\frac{p^{(n,\tau)}_{t,s}w_{(n,\tau),t+d}}{\sum_{n'=1}^N\sum_{\tau'=1}^t p^{(n',\tau')}_{t,s}w_{(n',\tau'),t+d}}, \tag{15}$$

$$c_{t,s}=(c^{(n,\tau)}_{t,s}:1\le n\le N,\ 1\le\tau\le t)$$

for $1\le s\le d$. (For computing the values of the function $\mathrm{Subst}$, we can use the rules (3) or (4) from Section 2.)

ENDFOR
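To make the protocol concrete, here is a much simplified Python sketch of Algorithm 1: all confidence levels are set to 1 (so the virtual forecasts coincide with the actual ones), the prior over auxiliary experts is uniform, the mean substitution rule (4) is used, and the "experts" are toy flat-continuation forecasters. None of these choices are prescribed by the paper; the sketch only illustrates the replication of each real expert into auxiliary experts $(n,\tau)$ and the delayed evaluation of interval losses:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, eta, N = 25, 3, 0.5, 2

# Synthetic series; toy expert n, at issue time tau, forecasts a flat
# continuation of the value observed at tau plus an expert-specific bias.
y = np.sin(0.3 * np.arange(T + d)) + 0.05 * rng.standard_normal(T + d)
bias = [0.0, 0.3]

def forecast(n, tau, s):
    """Forecast of real expert n, issued at time tau, for moment tau + s."""
    return y[tau] + bias[n]

preds, truth = [], []
for t in range(d, T):
    # Active auxiliary experts (n, tau) with tau <= t; weight each by
    # exp(-eta * cumulative average loss on fully revealed intervals).
    weights, fcasts = [], []
    for n in range(N):
        for tau in range(t + 1):
            L = sum(
                np.mean([(y[u + s] - forecast(n, tau, u + s - tau)) ** 2
                         for s in range(1, d + 1)])
                for u in range(tau, t - d + 1))
            weights.append(np.exp(-eta * L))
            fcasts.append([forecast(n, tau, t + s - tau)
                           for s in range(1, d + 1)])
    w = np.array(weights)
    w /= w.sum()
    gamma = w @ np.array(fcasts)      # mean substitution, rule (4)
    preds.append(gamma)
    truth.append(y[t + 1:t + d + 1])

alg_loss = np.mean([(g - v) ** 2 for g, v in zip(preds, truth)])
```

The delayed-feedback structure appears in the loss computation: the interval issued at time $u$ can only be scored once $u+d$ outcomes are in, so only intervals with $u\le t-d$ contribute to the weights at time $t$.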

Denote for $t>d$ and $1\le s\le d$

$$l^{(n,\tau)}_{t,s}=\lambda(y_{t-d+s},c^{(n,\tau)}_{t-d,s}),\qquad \tilde l^{(n,\tau)}_{t,s}=\lambda(y_{t-d+s},\tilde c^{(n,\tau)}_{t-d,s}),$$

$$\hat l^{(n,\tau)}_{t,s}=E_{p^{(n,\tau)}_{t-d,s}}\!\left[\tilde l^{(n,\tau)}_{t,s}\right],\qquad h_{t,s}=\lambda(y_{t-d+s},\gamma_{t-d,s}).$$

We put these quantities to be 0 for $t\le d$. Also, $\hat l^{(n,\tau)}_{t,s}=h_{t,s}$ for $\tau>t-d$. Since by definition

$$\hat l^{(n,\tau)}_{t,s}=p^{(n,\tau)}_{t-d,s}l^{(n,\tau)}_{t,s}+(1-p^{(n,\tau)}_{t-d,s})h_{t,s},$$

we have

$$h_{t,s}-\hat l^{(n,\tau)}_{t,s}=p^{(n,\tau)}_{t-d,s}\bigl(h_{t,s}-l^{(n,\tau)}_{t,s}\bigr).$$

Recall that $h_t=\frac{1}{d}\sum_{s=1}^d h_{t,s}$ is the algorithm (average) loss and $\hat l^{(n,\tau)}_t=\frac{1}{d}\sum_{s=1}^d\hat l^{(n,\tau)}_{t,s}$ is the (average) loss of the auxiliary expert $(n,\tau)$.

Define the discounted (average) excess loss with respect to an expert $(n,\tau)$ at a time moment $t$ by

$$r^{(n,\tau)}_t=h_t-\hat l^{(n,\tau)}_t. \tag{16}$$

By the definition of $\hat l^{(n,\tau)}_{t,s}$ we can represent the discounted excess loss (16) as

$$r^{(n,\tau)}_t=\frac{1}{d}\sum_{s=1}^d p^{(n,\tau)}_{t-d,s}\bigl(h_{t,s}-l^{(n,\tau)}_{t,s}\bigr)=\frac{1}{d}\sum_{s=1}^d p^{(n,\tau)}_{t-d,s}\bigl(\lambda(y_{t-d+s},\gamma_{t-d,s})-\lambda(y_{t-d+s},c^{(n,\tau)}_{t-d,s})\bigr).$$

We measure the performance of our algorithm by the cumulative discounted (average) excess loss with respect to any expert $(n,\tau)$.

###### Theorem 1.

For any $n$ and $T>d$, the following upper bound for the cumulative excess loss holds true:

$$\sup_{\tau\le T-d}\sum_{t=\tau+d}^T r^{(n,\tau)}_t\le\frac{d}{\eta}\bigl(\ln N+2\ln(T-d+1)\bigr). \tag{17}$$

Proof. Let $T>d$. Let us apply Corollary 1 from Appendix A for the case where the experts are the pairs $(n,\tau)$ for $1\le n\le N$ and $1\le\tau\le T-d$. Also, set

$$l_t=\hat l_t=(\hat l^{(n,\tau)}_t:1\le n\le N,\ 1\le\tau\le T-d)$$

and let $e_{(n,\tau)}$ be the unit vector whose $(n,\tau)$-th coordinate is 1. By (12) the required aggregation inequality holds. Then by (25)

$$\sum_{t=1}^T h_t-\sum_{t=1}^T \hat l^{(n,\tau)}_t\le\frac{d}{\eta}\ln\bigl(N(T-d)(T-d+1)\bigr) \tag{18}$$

for each expert $(n,\tau)$ such that $1\le n\le N$ and $1\le\tau\le T-d$.

Since $h_t=\hat l^{(n,\tau)}_t$ for $t<\tau+d$, using (16), we obtain (17).

## 4 Online Smoothing Regression

In this section we consider the online learning scenario within the supervised setting (that is, the data are pairs $(x_t,y_t)$ of predictor-response variables). A forecaster presents a regression function $F$ defined on a set $X$ of objects, which are called signals. After a pair $(x_t,y_t)$ is revealed, the forecaster suffers a loss $\lambda(y_t,F(x_t))$, where $\lambda$ is some loss function. We assume that the loss function is $\eta$-mixable for some $\eta>0$.

An example is linear regression, where $X$ is a set of finite-dimensional vectors, a regression function is a linear function $F(x)=w\cdot x$ with a weight vector $w$, and $\lambda$ is the square loss.

In the online mode, at any step $t$, to define the forecast for step $t+1$ – a regression function $F_{t+1}$ – we use the prediction with expert advice approach. A feature of this approach is that we aggregate the regression functions $F^1_\tau$ for $\tau\le t+1$, each of which depends on an initial segment of the sample. At the end of step $t$ we define (initialize) the next regression function $F^1_{t+1}$ by the sample $(x_1,y_1),\dots,(x_t,y_t)$.

Since the forecast $F_{t+1}$ can potentially be applied to any future input value $x\in X$, we consider this method as a kind of long-term forecasting.

We briefly describe below the changes made in Algorithm 1. We introduce signals in the protocol from Section 3.

Algorithm 2

Set initial weights as in Algorithm 1.

FOR $t=1,2,\dots$

1. Observe the pair $(x_t,y_t)$ and compute the losses suffered by the learner and by the expert regression functions: $l^\tau_t=\lambda(y_t,F^1_\tau(x_t))$ if $\tau\le t$ and $l^\tau_t=h_t$ otherwise.

2. Update weights:

$$w_{\tau,t+1}=w^\mu_{\tau,t}=\frac{w_{\tau,t}e^{-\eta l^\tau_t}}{\sum_{\tau'=1}^\infty w_{\tau',t}e^{-\eta l^{\tau'}_t}} \tag{19}$$

3. Initialize the next regression function $F^1_{t+1}$ using the sample $(x_1,y_1),\dots,(x_t,y_t)$ and define the forecast of the learner for step $t+1$:

$$F_{t+1}(x)=\mathrm{Subst}(F_t(x),w^*_t)\quad\text{for any }x\in X, \tag{20}$$

where $F_t(x)=(F^1_\tau(x):1\le\tau\le t+1)$ and $w^*_t$ is the vector of the weights $w_{\tau,t+1}$ renormalized over $1\le\tau\le t+1$. (We extend the rules (3) and (4) to functional forecasts in a natural way, see (21) below; cf. the remark on the substitution function in Algorithm 1.)

ENDFOR

For the square loss $\lambda(y,\gamma)=(y-\gamma)^2$, where $y,\gamma\in[-B,B]$, by (3) the regression function (20) can be defined in the closed form:

$$F_{t+1}(x)=\frac{1}{4\eta B}\ln\frac{\sum_{\tau=1}^{t+1}w_{\tau,t+1}e^{-\eta(B-F^1_\tau(x))^2}}{\sum_{\tau=1}^{t+1}w_{\tau,t+1}e^{-\eta(B+F^1_\tau(x))^2}} \tag{21}$$

for each $x\in X$, or by the rule (4). (The most appropriate choices of $\eta$ are $\eta=\frac{1}{2B^2}$ for the rule (3) and $\eta=\frac{1}{8B^2}$ for (4). The more straightforward definition (4) results in four times more regret but is easier to compute.)
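A hedged sketch of how (21) can be applied to combine fitted regression functions. The expert functions here are arbitrary illustrative lambdas, not the ridge experts discussed below, and forecasts are clipped to $[-B,B]$ as the formula assumes:

```python
import numpy as np

def aggregate_regressors(funcs, weights, eta, B):
    """Combine regression functions pointwise by the closed form (21)
    for the square loss, clipping each expert value to [-B, B]."""
    def F(x):
        vals = np.clip([f(x) for f in funcs], -B, B)
        w = np.asarray(weights)
        num = np.sum(w * np.exp(-eta * (B - vals) ** 2))
        den = np.sum(w * np.exp(-eta * (B + vals) ** 2))
        return np.log(num / den) / (4 * eta * B)
    return F

B = 2.0
experts = [lambda x: 0.5 * x, lambda x: 0.3 * x]   # toy expert functions
F = aggregate_regressors(experts, [0.5, 0.5], eta=1 / (2 * B ** 2), B=B)
val = F(1.0)
```

Note that the aggregated object is itself a function: (21) is evaluated lazily at each query point $x$, which is what makes the forecast applicable to any future signal.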

Let us analyze the performance of Algorithm 2 as a forecaster on $s$ steps ahead.

For any time moment $t$ a pair $(x_t,y_t)$ is revealed. Denote by $h_t$ the (average) loss of the learner and by $l^\tau_t$ the (average) loss of any auxiliary expert $\tau$.

The regret bound of Algorithm 2 does not depend on the prediction horizon:

###### Theorem 2.

For any $T$,

$$\sup_{1\le\tau\le T-1}\sum_{t=\tau+1}^T\bigl(h_t-l^\tau_t\bigr)\le\frac{2}{\eta}\ln T. \tag{22}$$

Proof. The analysis of the performance of Algorithm 2 for the case of prediction on $s$ steps ahead is similar to that of Algorithm 1 for $d=1$. Let $s$ and $\tau$ be given. Using the techniques of Appendix A, we obtain for any $\tau$,

$$\sum_{t=\tau+1}^T\bigl(\lambda(y_{t-s+1},F_\tau(x_{t-s+1}))-\lambda(y_{t-s+1},F^1_\tau(x_{t-s+1}))\bigr)\le\frac{1}{\eta}\ln(T(T-1)).$$

Summing this inequality over $s$, dividing by the number of terms and using $\ln(T(T-1))\le 2\ln T$, we obtain (22).

In particular, Theorem 2 implies that the total loss of Algorithm 2 on any time interval is no more (up to a logarithmic regret term) than the loss of the best regression function constructed in the past.

Online regression with a sliding window. Some time series show a strong dependence on the latest information instead of all the data. In this case, it is useful to apply regression with a sliding window. In this regard, we consider the application of Algorithm 2 for the case of online regression with a sliding window. The corresponding expert represents some type of dependence between input and output data. If this relationship is relatively regular the corresponding experts based on past data can successfully compete with experts based on the latest data. Therefore, it may be useful to aggregate the predictions of all the auxiliary experts based on past data.

Let $F^1_\tau$ be the ridge regression function $F^1_\tau(x)=w_\tau\cdot x$, where $w_\tau=(X_\tau^T X_\tau+aI)^{-1}X_\tau^T Y_\tau$. Here $X_\tau$ is the matrix whose rows are formed by the signal vectors of the corresponding learning sample ($X^T$ is the transposed matrix $X$), $I$ is a unit matrix, $a>0$ is a parameter, and $Y_\tau$ is the vector of the corresponding responses. For small $\tau$ we set $F^1_\tau$ equal to some fixed value.

We use the square loss function and assume that the values of the regression functions are bounded by $B$. For each $t$ we define the aggregating regression function (the learner forecast) $F_{t+1}$ by (21) using the regression functions (the expert strategies) $F^1_\tau$ for $\tau\le t+1$, where each such function is defined using a learning sample (a window) of the latest pairs preceding $\tau$. (A computationally efficient algorithm for recalculating the matrices during the transition from one window to the next for some special type of online regression with a sliding window was presented by Arce and Salinas (2012). Similar effective options for regression using Algorithm 2 can also be developed.)
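The sliding-window ridge experts can be sketched as follows. This is an illustrative setup: the window length `h`, the regularization parameter `a`, and the data generation are our assumptions, not taken from the paper; the resulting expert functions are what rule (21) would aggregate:

```python
import numpy as np

def ridge_window(X, Y, a=1.0):
    """Fit ridge regression w = (X^T X + a I)^(-1) X^T Y on one window
    and return the corresponding linear expert function."""
    w = np.linalg.solve(X.T @ X + a * np.eye(X.shape[1]), X.T @ Y)
    return lambda x: float(x @ w)

rng = np.random.default_rng(1)
n, T, h = 3, 30, 10                  # signal dim, sample size, window length
X = rng.standard_normal((T, n))
w_true = np.array([1.0, -0.5, 0.2])
Y = X @ w_true + 0.01 * rng.standard_normal(T)

# One expert per time tau: ridge fit on the window of the h latest
# pairs observed before tau.
experts = [ridge_window(X[tau - h:tau], Y[tau - h:tau])
           for tau in range(h, T)]
pred = experts[-1](X[-1])            # newest expert predicts the newest signal
```

Because each expert is trained only on its own window, older experts remain competitive exactly when the input-output relationship is stable over time, which is the situation the aggregation is designed to exploit.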

Experiments. Let us present the results of experiments which were performed on synthetic data. The initial data was obtained as a result of sampling from a data generative model.

We start from a sequence of $n$-dimensional signals $x_t$ sampled i.i.d. from a multidimensional normal distribution. The signals are revealed online and