Distribution-Free One-Pass Learning

06/08/2017 ∙ by Peng Zhao, et al. ∙ Nanjing University

In many large-scale machine learning applications, data accumulate over time, so an appropriate model should be able to update in an online fashion. Moreover, since the total data volume is unknown when the model is constructed, it is desirable to scan each data item only once, with storage that is independent of the data volume. The underlying distribution may also change while the data accumulate. To handle such tasks, we propose DFOP, a distribution-free one-pass learning approach. The approach works well when distribution change occurs during data accumulation, without requiring prior knowledge about the change, and every data item can be discarded once it has been scanned. A theoretical guarantee shows that, under a mild assumption, the estimate error decreases until convergence with high probability. The performance of DFOP is validated experimentally for both regression and classification.

1 Introduction

With the rapid growth of data collection, the sheer volume of generated data poses a challenge for traditional machine learning approaches. The challenges are multifaceted; in general, the accumulating and evolving properties of data are two of the most troublesome issues.

For the accumulating property, it is impractical to store all the data because memory and computational resources are limited. Hence, an offline approach is no longer suitable for such tasks; prediction models can only be trained and updated incrementally in an online processing paradigm. An online streaming approach is called one-pass when it goes through the training data only once, without storing the entire dataset. The one-pass property is desirable because the raw data are sometimes discarded or no longer accessible after being processed. A one-pass approach guarantees that the learning process is independent of the volume of the data stream, which makes it considerably more demanding and difficult.

Furthermore, the evolving nature of data streams makes it challenging to apply traditional machine learning approaches directly, because it is no longer reasonable to assume that current and future data come from the same distribution. In a non-stationary environment, which is very common in data generation processes, the underlying distribution is likely to change over time. For instance, the click information collected by a recommendation system evolves because customers’ interests may change as they browse product pages. Another example is credit scoring: the criteria for granting credit should change accordingly, since changing economic conditions strongly influence people’s behavior. These phenomena are typical examples of distribution change. In this scenario, the performance of traditional approaches drops dramatically, and they are therefore neither empirically nor theoretically suitable for these tasks.

To address these two issues simultaneously, we propose DFOP, a distribution-free one-pass learning approach for data streams that exhibit distribution change under one-pass constraints. The advantages of our approach are as follows. First, by solving the target recursively, we guarantee that the data stream is gone through only once. Second, based on a forgetting mechanism, the loss of older data is discounted without explicitly modelling the dynamics or assuming prior information about the distribution change. For streaming regression, we present a theoretical guarantee showing that the estimate error of the dynamic concept decreases until convergence with high probability. Meanwhile, experiments on both synthetic and real-world datasets indicate the effectiveness and practicality of DFOP in both regression and classification scenarios.

The rest of this paper is organized as follows. Section 2 briefly reviews related work. Section 3 introduces the model for the static scenario as preliminaries. Section 4 presents DFOP, which handles dynamics in non-stationary regression and classification scenarios, together with its theoretical guarantee, and Section 5 examines its empirical effectiveness. Finally, Section 6 concludes the paper.

2 Related Work

Online and One-Pass Algorithms. With the rapid growth of data volume and velocity, it is no longer practical to adopt offline algorithms for learning from streaming data. Hence, online algorithms, which update the current model with the most recent examples, have become increasingly attractive. In general, they are error driven, updating the current model depending on whether the current example is misclassified journals/csur/GamaZBPB14 . Representative algorithms include Perceptron journal/rosenblatt1958perceptron , Winnow journals/ml/Littlestone87 and many other variations. Besides, the paradigm of “prediction with expert advice” book/Cambridge/cesa2006prediction has also inspired some interesting works, such as AddExp conf/icml/KolterM05 and DWM conf/icdm/KolterM03 ; journals/jmlr/KolterM07 . Most of those approaches need to store all or part of the training data and scan data items multiple times. Recently, one-pass algorithms, which demand that each data item be processed only once, have drawn increasing attention. Concretely, after a data item has been processed and the relevant statistics have been stored, the raw data item is discarded and never accessed again. Obviously, one-pass constraints make algorithm design more difficult. Some efforts have been devoted to this setting conf/nips/WuBSD16 .

Nonstationary Learning. Owing to its effectiveness and simplicity, the sliding window is commonly adopted to handle data streams with distribution change. It uses only a fixed or variable number of recent data items, which are the most informative for the current prediction journals/ml/Littlestone87 ; journals/ida/Klinkenberg04 . Usually, the model is updated through two processes: a learning process, i.e., updating the model based on newly arriving data, and a forgetting process, i.e., discarding data items that move out of the window journals/csur/GamaZBPB14 . However, choosing an appropriate window size is of great importance, and it currently depends largely on heuristics. Some efforts have been made to select the window size adaptively journals/ida/Klinkenberg04 ; conf/sbia/GamaMCR04 ; journals/ida/KunchevaZ09 . The common strategy is to adjust the window size based on the performance or on an estimate of the generalization error. SVM-ada journals/ida/Klinkenberg04 is a theoretically supported approach; however, its computational cost makes it impractical in real-world applications.

Our proposed approach DFOP, short for Distribution-Free One-Pass, is a one-pass algorithm, i.e., it guarantees that each data item is gone through only once. Besides, DFOP is distribution-free: unlike traditional approaches for dealing with distribution change, it does not explicitly model the dynamics, and it assumes no prior information about the distribution change.

3 Preliminaries

In this section, we briefly introduce the streaming regression model in a static scenario.

In a streaming scenario, we denote a labeled dataset as , where is the feature of the -th instance and is a real-valued output. Furthermore, we assume a linear model as follows,

(1)

where is the noise sequence, is what we desire to estimate.

In a static scenario, the sequence is a constant vector denoted as . Then, least squares can be adopted to minimize the residual sum of squares, which has a closed-form solution. However, this fails under an online/one-pass constraint, which demands that each raw item be discarded after it has been processed. Recursive least squares (RLS) and stochastic gradient descent (SGD) are two typical approaches for solving this problem in an online paradigm.

In a non-stationary environment, especially when the underlying distribution changes, traditional approaches are not suitable, since the usual i.i.d. assumption can no longer be expected to hold. In the following sections, we propose to handle this scenario with an exponential forgetting mechanism, without explicitly modelling the evolution of the data stream, and we present theoretical support and empirical demonstration.

In the following, denotes the -norm in space. Meanwhile, for a bounded real-valued sequence , denotes the upper bound of sequence, namely, .

4 Distribution Free One-Pass Learning

Since the sequence changes over time in a dynamic environment, it is no longer reasonable to estimate the current concept (i.e., at time ) via the methods introduced previously. Instead, we introduce a sequence of discount factors to downweight the loss of older data as follows,

(2)

where is a discount factor that smoothly puts less weight on older data. The intuition is easier to see if we simplify all to a constant , in which case the target function is,

(3)

The quantity is called the forgetting factor book/Pearson/haykin2008adaptive . The value of the forgetting factor is, in fact, a trade-off between stability with respect to past conditions and sensitivity to future evolution.

It should be pointed out that the forgetting mechanism based on an exponential forgetting factor can also be regarded, to some extent, as a continuous analogue of the sliding window approach: older data items with sufficiently small weights can be thought of as excluded from the window. More discussion on the relation between window size and forgetting factor is provided in Section 5.4.

4.1 Algorithm

For the optimization problem in (3), taking the derivative of the target function directly yields the optimal solution in closed form,

(4)
  Input: A stream of data with , forgetting factor ;
  Output: Prediction (real value for regression and discrete-value for classification).
  Initialize ;
  for  to  do
     ;
     ;
     ;
     .    // for regression;
      // for classification
  end for
Algorithm 1 Distribution Free One-Pass Learning

However, the above expression is an offline estimate, namely, all the data items ahead of are needed. Instead of repeatedly solving (4), we estimate the underlying concept by adding a correction term to the previous estimate, based on the information in the newly arriving data item. With the forgetting-factor recursive least squares method book/Pearson/haykin2008adaptive , we can solve the target (3) in a one-pass paradigm. To the best of our knowledge, this is the first time that the traditional forgetting-factor RLS has been adopted to deal with such tasks with distribution change under one-pass constraints. We name this approach DFOP (short for Distribution-Free One-Pass); it is summarized in Algorithm 1.
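The recursive update described above can be illustrated with a minimal sketch of textbook forgetting-factor RLS. This is not the authors' exact Algorithm 1: the class name `ForgettingRLS`, the parameter names `mu` and `delta`, the initialization of `P` as the identity divided by `delta`, and the convention that past losses are discounted by `1 - mu` (chosen to be consistent with the small forgetting-factor values discussed in Section 5.4) are all assumptions made for illustration.

```python
import numpy as np


class ForgettingRLS:
    """Forgetting-factor recursive least squares (illustrative sketch)."""

    def __init__(self, dim, mu=0.01, delta=1.0):
        self.discount = 1.0 - mu        # multiplier applied to the loss of past items
        self.w = np.zeros(dim)          # current estimate of the concept
        self.P = np.eye(dim) / delta    # inverse of the discounted Gram matrix

    def predict(self, x, classify=False):
        score = float(x @ self.w)
        if classify:
            return 1 if score >= 0 else -1   # sign output for labels in {-1, +1}
        return score

    def update(self, x, y):
        Px = self.P @ x
        # Sherman-Morrison update of the inverse after discounting old data.
        self.P = (self.P - np.outer(Px, Px) / (self.discount + x @ Px)) / self.discount
        # Correct the previous estimate with the prediction error on (x, y).
        self.w = self.w + (self.P @ x) * (y - x @ self.w)


def run_stream(X, y, mu=0.01, delta=1.0):
    """Process each (x, y) pair exactly once and return the fitted model."""
    model = ForgettingRLS(X.shape[1], mu=mu, delta=delta)
    for x_t, y_t in zip(X, y):
        model.update(x_t, y_t)          # the pair is not needed afterwards
    return model
```

Only `w` and `P` are kept between steps, so memory grows with the square of the feature dimension and not with the length of the stream. Under these assumptions, the recursion reproduces the minimizer of the exponentially discounted squared loss plus a vanishing ridge term that comes from the initial `P`.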

Besides, it should be pointed out that need not be chosen as a constant. We provide a generalized DFOP (G-DFOP for short) for a dynamic discount factor sequence , corresponding to the target in (11), which is also provably a one-pass algorithm. Detailed proofs are provided in Section 1 of the supplementary material.

In the classification scenario, is no longer a real-valued output but a discrete value, and we assume for convenience. A slight modification of the original output step is applied for classification, and its effectiveness is validated empirically in the next section.

Assuming that the feature is -dimensional, we only need to keep in memory while the algorithm runs. In other words, the storage is always , which is independent of the number of training examples. Besides, at the -th time stamp, the update of does not depend on the previous data items, namely, every data item can be discarded once it has been scanned.

4.2 Theoretical Guarantee

In this section, we develop an estimate error bound in a non-stationary regression scenario.

Consider the additive model of drift in sequence ,

(5)

We assume that the additive term is an -valued martingale-difference -dimensional sub-Gaussian vector sequence, with corresponding variance proxy sequence , whose formal definition is given below. The -valued martingale-difference assumption is reasonable; in fact, in many real-world applications, the drifts of concepts are usually independent.

Similar to the analysis in journals/control/guo1993performance , we relax the assumptions to be more realistic for real-world applications and provide a non-deterministic estimate error bound based on vector concentration, showing that the estimate error converges with high probability.

Now we give the formal definitions of sub-Gaussian random variable and random vector.

Definition 1.

(sub-Gaussian random variable) A random variable is said to be sub-Gaussian with variance proxy if

and its moment generating function satisfies

(6)
Definition 2.

(sub-Gaussian random vector) A random vector is called sub-Gaussian with variance proxy if all its coordinates are sub-Gaussian random variables with variance proxy .
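For reference, the condition in Definition 1 is commonly written as below. This is the standard textbook form of the sub-Gaussian definition, given here as an aid to reading and not as a verbatim copy of the paper's equation (6); the symbols $X$, $\sigma^2$ and $s$ are assumed notation.

```latex
\mathbb{E}[X] = 0
\quad\text{and}\quad
\mathbb{E}\bigl[\exp(sX)\bigr] \le \exp\!\left(\frac{\sigma^{2} s^{2}}{2}\right)
\quad \text{for all } s \in \mathbb{R}.
```

A zero-mean Gaussian variable with variance $\sigma^2$ satisfies this condition with equality, which is why $\sigma^2$ is called a variance proxy.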

To exploit concentration property of sub-Gaussian random vector, condition () proposed in Theorem 2.1 of journal/juditsky2008large shall be satisfied. Thus, first, we show that there exists a bounding sequence for a sub-Gaussian random vector sequence .

Lemma 1

For a sub-Gaussian random vector sequence with a variance proxy sequence , there exists a corresponding positive bounding sequence , such that

(7)

Lemma 1 guarantees the “light tail” assumption for sub-Gaussian random vectors. We can then apply the following vector concentration result, which is a corollary of Theorem 2.1 in journal/juditsky2008large .

Theorem 1

(Corollary of Theorem 2.1 in journal/juditsky2008large ) In a Euclidean space , let be an E-valued martingale-difference sub-Gaussian sequence with a corresponding bounding sequence . Let , then for all and :

(8)

Based on Theorem 1, we provide Lemma 2 and Lemma 3 to bound an exponentially discounted sum of sub-Gaussian random vectors and of sub-Gaussian random variables, respectively.

Lemma 2

Let be an -valued martingale-difference -dimension sub-Gaussian random vector sequence, with corresponding bounding sequence , and . Then for , with a probability at least , we have

where and .

Lemma 3

Let be an independent (or -valued martingale-difference) sub-Gaussian random variable sequence, with corresponding bounding sequence (i.e., variance proxy sequence) , and . Then for , with a probability at least , we have

where .

Theorem 2

Assume that the following conditions are satisfied:

  • drift term is an -valued martingale-difference sub-Gaussian random vector sequence, with corresponding bounding sequence ;

  • output noise is an independent (or -valued martingale-difference) sub-Gaussian random variable sequence, with corresponding bounding sequence (i.e., variance proxy sequence) .

Then with a probability at least , we have

where , , and .

Remark. The estimate error bound can be decomposed into three parts: the first is , the second is and the third is . The first term decreases to zero as increases to infinity, the second term is caused by the output noise and cannot be eliminated, and the third term is introduced by the drift of . Ignoring the poly-logarithmic factors in and , an asymptotic analysis gives the estimate error bound as,

where we use the notation to hide constant factors as well as poly-logarithmic factors in and , and decreases exponentially to zero as .

Due to page limits, we present the proofs of Theorem 1 and Theorem 2 (along with Lemmas 1, 2 and 3) in Sections 2 and 3 of the supplementary material, respectively.

5 Experiments

In this section, we examine the empirical performance of the proposed DFOP in both regression and classification scenarios. Then, we analyze the parameter sensitivity in Section 5.4. Due to page limits, only the results for the classification scenario are provided here; the regression results are given in the supplementary material.

Moreover, when dealing with real-world datasets, we cannot know the evolving distribution, specifically the start and end times of a drift and the underlying distribution itself. Consequently, an analysis of algorithm behaviour based on real-world data alone would be incomplete. Hence, both synthetic and real-world datasets are included in the comparison experiments.

5.1 Comparison Methods

We compare the proposed approach with six common methods on both synthetic and real-world datasets. The comparison methods are (a) RLS, the least squares approach solved in a recursive manner; (b) the sliding window approach, in which the classifier is constantly updated with the most recent data samples in the window, with 1NN and SVM as base classifiers, denoted as 1NN-win and SVM-win conf/sdm/SouzaSGB15 ; (c) SVM-fix, a batch implementation of SVM with a fixed window size conf/kdd/SyedLS99a ; (d) SVM-ada, a batch implementation of SVM with an adaptive window size journals/ida/Klinkenberg04 ; (e) DWM, the dynamic weighted majority algorithm, an adaptive ensemble based on the traditional weighted majority algorithm Winnow conf/icdm/KolterM03 ; journals/jmlr/KolterM07 .

It should be emphasized that these comparisons are not entirely fair to DFOP, because DFOP processes each data item only once and needs only one instance to update the model. Not all comparison methods meet these two constraints. Specifically, 1NN-win, SVM-win, SVM-fix and SVM-ada are window-based algorithms and hence are not one-pass. Besides, SVM-fix and SVM-ada are not incremental but are updated in a series of batches. DWM is incremental but not one-pass, because it additionally needs data to update its pool of experts.

5.2 Synthetic Datasets

First, we present the performance comparisons over synthetic datasets.

  • SEA conf/kdd/StreetK01 consists of three attributes , and . The target concept is , and there are 50,000 instances with 4 stages where .

  • hyperplane conf/kdd/Fan04 , is generated uniformly in a 10-dimensional hyperplane, with 90,000 instances in total over 9 different stages.

Besides, another 11 synthetic datasets for binary classification are also adopted. Detailed information is included in the supplementary material.

The performance is measured by holdout accuracy, since the underlying joint distributions of the synthetic datasets are known. Holdout accuracy is calculated over testing data generated from the same distribution as the training data at each time stamp. Performance comparisons of the seven approaches on the SEA and hyperplane datasets are depicted in Figure 1. Since the accuracy curves of SVM-ada, SVM-fix, 1NN-win and SVM-win are so unstable that they would obscure all the other curves, we also present a cleaner figure containing only RLS, DWM and DFOP.

Figure 1: Performance comparison of seven approaches on synthetic datasets in terms of holdout accuracy. The left side presents all seven approaches; the right side plots only RLS, DWM and DFOP for clarity.

As shown in Figure 1, the accuracy of all algorithms falls rapidly when an abrupt drift occurs in the underlying distribution, and then rises as more data arrive. DFOP is significantly better than RLS, which is a special case of DFOP; this validates the effectiveness of the forgetting mechanism. Furthermore, the two best algorithms are clearly DFOP and DWM, both of which converge to a new stage quickly. DFOP performs slightly better than DWM, both in slope and in asymptote. Moreover, DWM needs to dynamically maintain a set of experts and needs previous data to update the expert pool and to decide whether to remove poorly performing experts. In contrast, DFOP achieves desirable performance while scanning each data item only once.

5.3 Real-world Datasets

Dataset       SVM-win        1NN-win        SVM-fix        SVM-ada        DWM            RLS            DFOP
SEA           73.94 ± 0.12   77.27 ± 0.04   86.19 ± 0.06   83.47 ± 0.09   87.04 ± 0.03   84.54 ± 0.47   87.99 ± 0.05
hyperplane    83.74 ± 0.03   70.66 ± 0.03   87.98 ± 0.03   81.94 ± 0.07   88.36 ± 0.25   69.67 ± 1.40   90.14 ± 0.05
1CDT          98.71 ± 0.05   99.96 ± 0.07   99.77 ± 0.06   99.77 ± 0.08   99.90 ± 0.09   98.79 ± 1.65   99.97 ± 0.05
2CDT          94.86 ± 0.06   94.62 ± 0.10   95.19 ± 0.13   95.18 ± 0.15   90.21 ± 0.67   62.24 ± 0.23   96.36 ± 0.09
1CHT          98.75 ± 0.18   99.81 ± 0.22   99.63 ± 0.17   99.63 ± 0.18   99.69 ± 0.26   98.49 ± 1.61   99.84 ± 0.16
2CHT          87.70 ± 0.04   85.69 ± 0.05   89.48 ± 0.12   88.89 ± 0.13   85.92 ± 0.72   62.57 ± 0.23   89.91 ± 0.07
1CSurr        97.99 ± 0.04   98.12 ± 0.11   94.24 ± 1.08   93.56 ± 1.08   96.31 ± 0.50   67.82 ± 0.22   93.24 ± 1.44
UG-2C-2D      94.47 ± 0.13   93.55 ± 0.16   95.41 ± 0.10   94.92 ± 0.12   95.59 ± 0.11   67.02 ± 1.46   95.59 ± 0.10
UG-2C-3D      93.60 ± 0.73   92.83 ± 0.93   95.05 ± 0.64   94.48 ± 0.71   95.14 ± 0.62   61.95 ± 2.60   95.37 ± 0.61
UG-2C-5D      74.82 ± 0.45   88.04 ± 0.42   91.74 ± 0.26   90.37 ± 0.35   92.82 ± 0.23   81.20 ± 2.42   92.51 ± 0.25
MG-2C-2D      90.20 ± 0.07   87.84 ± 0.09   84.98 ± 0.06   84.22 ± 0.06   90.15 ± 0.06   57.18 ± 3.66   85.06 ± 0.06
G-2C-2D       95.54 ± 0.01   99.61 ± 0.00   95.41 ± 0.01   95.26 ± 0.02   95.82 ± 0.02   95.84 ± 0.01   95.83 ± 0.02
Chess         69.67 ± 1.51   79.58 ± 0.54   77.73 ± 1.56   69.18 ± 3.65   73.77 ± 0.66   78.70 ± 0.83   79.15 ± 0.62
Usenet-1      68.92 ± 1.12   65.36 ± 1.55   64.18 ± 2.24   67.68 ± 1.86   64.43 ± 4.53   60.65 ± 0.53   69.20 ± 0.68
Usenet-2      74.44 ± 0.71   71.03 ± 0.60   73.99 ± 0.69   72.64 ± 0.84   73.37 ± 0.93   73.16 ± 0.67   75.60 ± 0.57
Luxembourg    88.57 ± 0.28   77.51 ± 0.44   98.25 ± 0.19   97.43 ± 0.42   92.61 ± 0.40   99.06 ± 0.14   99.09 ± 0.14
Spam          83.91 ± 2.20   93.43 ± 0.82   92.44 ± 0.80   91.01 ± 0.94   91.49 ± 1.09   94.46 ± 0.16   94.77 ± 0.26
Weather       68.54 ± 0.55   72.64 ± 0.25   67.79 ± 0.65   77.26 ± 0.33   70.86 ± 0.42   78.35 ± 0.18   79.23 ± 0.12
Powersupply   73.33 ± 0.25   72.42 ± 0.21   71.17 ± 0.15   69.39 ± 0.17   72.18 ± 0.29   69.67 ± 0.64   80.46 ± 0.04
Electricity   74.20 ± 0.08   85.33 ± 0.09   62.01 ± 0.59   58.69 ± 0.58   78.60 ± 0.41   74.20 ± 0.63   76.94 ± 0.26
DFOP W/T/L    18/1/1         14/4/2         19/1/0         18/2/0         14/3/3         19/1/0         -

Table 1: Performance comparison in terms of mean and standard deviation of accuracy (both in percent). Bold values indicate the best performance. Besides, () indicates that DFOP is significantly better (worse) than the compared method (paired t-tests at 95% significance level). Win/Tie/Loss counts are summarized in the last row.

To validate the effectiveness of DFOP in real-world applications, performance comparisons are presented over 8 real-world datasets. Detailed descriptions are provided in Section 5 of the supplementary material.

For real-world datasets, we can never expect to know the underlying distribution at each time stamp in advance, so holdout accuracy can no longer be adopted as the performance measure. In Table 1, we run all the experiments for 10 trials and report the overall mean and standard deviation of predictive accuracy over the above real-world datasets as well as the 12 synthetic datasets.

Over the 20 datasets in total, the number of instances varies from 533 to at most 200,000. DFOP achieves the best performance among all approaches on 15 of the 20 datasets, and on the other 5 datasets it ranks second or third. This validates the effectiveness of DFOP, especially given the unfair comparison conditions.

Additionally, the robustness conf/kdd/VlachosDGKK02 of the different algorithms is compared. Briefly, for a particular algorithm algo, the robustness is defined as the ratio between its accuracy and the smallest accuracy among all compared algorithms, i.e., . Hence, the sum of over all datasets indicates the robustness of algorithm . The greater the value of the sum, the better the performance. DFOP achieves the best robustness over the 20 datasets, and RLS ranks last, as expected, since it does not consider the evolving distribution at all. Due to page limits, detailed robustness comparison results are given in Section 4.3 of the supplementary material.
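The robustness measure described above can be sketched as follows. The function names are hypothetical, and the example restricts the comparison to three methods for brevity, so the resulting numbers only illustrate the computation and are not the values reported in the supplementary material.

```python
def robustness(accuracies):
    """Ratio of each algorithm's accuracy to the smallest accuracy on one dataset."""
    worst = min(accuracies.values())
    return {name: acc / worst for name, acc in accuracies.items()}


def total_robustness(per_dataset):
    """Sum the per-dataset robustness ratios for every algorithm."""
    totals = {}
    for accuracies in per_dataset:
        for name, ratio in robustness(accuracies).items():
            totals[name] = totals.get(name, 0.0) + ratio
    return totals


# Mean accuracies (in percent) of three methods on SEA and hyperplane, from Table 1.
table = [
    {"DWM": 87.04, "RLS": 84.54, "DFOP": 87.99},
    {"DWM": 88.36, "RLS": 69.67, "DFOP": 90.14},
]
totals = total_robustness(table)
```

An algorithm that is the worst on every dataset scores exactly the number of datasets, and larger totals indicate accuracy that stays well above the worst competitor across datasets.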

5.4 Parameter Study

As stated previously, choosing an appropriate forgetting factor is an important issue, since it reflects a trade-off between stability with respect to past conditions and sensitivity to future evolution. To understand how the forgetting factor affects performance in classification problems, accumulated accuracy (‘AA’ for short) is adopted as a performance measure over the time series and is defined as,

(9)

where is the indicator function, which takes 1 if is true and 0 otherwise, and are the predicted and ground-truth labels, respectively. Figure 2 shows the impact of different forgetting factors over four datasets. We notice that the accumulated accuracy of RLS decreases almost all the time. For a relatively small but non-zero , the performance is satisfactory, without a significant gap between settings. However, when is too large, say 0.5, the performance is even much worse than that of RLS. This is consistent with intuition: when the forgetting factor is so large, older data samples are downweighted exponentially fast, so there are not enough effective training samples to update the model.

Figure 2: Accumulated accuracy with different forgetting factors over four datasets with distribution change.
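As a small illustration of the measure plotted in Figure 2, the sketch below computes an accumulated-accuracy curve. It assumes that AA at time t is the fraction of correct predictions among the first t items, which matches the description of the indicator function above; the function name is hypothetical.

```python
def accumulated_accuracy(predictions, labels):
    """Return AA(t) for t = 1..T: the fraction of correct predictions among the first t items."""
    curve = []
    correct = 0
    for t, (predicted, truth) in enumerate(zip(predictions, labels), start=1):
        correct += int(predicted == truth)   # indicator of a correct prediction
        curve.append(correct / t)
    return curve


curve = accumulated_accuracy([1, -1, 1, 1], [1, 1, 1, -1])
```

Because every point averages over the whole history, a model that fails to adapt after a drift shows up as a steadily declining curve.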

This raises the question of how to choose an appropriate forgetting factor to adapt to the distribution change in the data stream. To answer it, recall the target function in (3). When is close to , which is often the case in practice and is validated in Figure 2, we have

(10)

where we define as the forgetting period. The contribution to the prediction error of data items older than time is discounted with a weight less than compared to the current data. In fact, the forgetting period in the forgetting mechanism is quite similar to the window size in the sliding window technique, and it can be regarded as a soft relaxation of the window size. Consequently, the forgetting factor should be chosen according to the forgetting period , during which the data distribution should be relatively smooth and stable.

We validate this idea on the synthetic datasets reported in Table 2, which lists the theoretically recommended value and the empirically appropriate value of the forgetting factor, together with the ratio between them. The two values are close on all datasets: they differ by a factor of no more than 20, and by no more than 5 on most datasets. This supports our strategy for choosing the forgetting factor.

Dataset       Items between changes   Theoretical value   Empirical value   Ratio
1CDT          400                     2.50E-03            1.00E-02          4
2CDT          400                     2.50E-03            1.00E-02          4
1CHT          400                     2.50E-03            1.00E-02          4
2CHT          400                     2.50E-03            1.00E-02          4
1CSurr        600                     1.67E-03            5.00E-03          3
hyperplane    9,000                   1.11E-04            2.00E-03          18
UG-2C-2D      1,000                   1.00E-03            1.00E-03          1
UG-2C-3D      2,000                   5.00E-04            1.00E-03          2
UG-2C-5D      2,000                   5.00E-04            1.00E-03          2
MG-2C-2D      2,000                   5.00E-04            1.00E-03          2
G-2C-2D       2,000                   5.00E-04            1.00E-03          2
SEA           10,000                  1.00E-04            1.00E-03          10
Table 2: For each dataset, the number of data items between consecutive distribution changes, the theoretically recommended value and the empirically appropriate value of the forgetting factor. The last column gives the ratio of the empirical value to the theoretical value.
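The relation between the forgetting factor and the forgetting period can be checked numerically. The sketch below assumes, in line with the values in Table 2, that the recommended forgetting factor is the reciprocal of the number of items between consecutive distribution changes and that past losses are discounted by one minus the forgetting factor per step; the function and variable names are hypothetical.

```python
import math


def recommended_forgetting_factor(stable_period):
    """Forgetting factor whose forgetting period (its reciprocal) equals the stable period."""
    return 1.0 / stable_period


def weight_at_age(mu, age):
    """Discount applied to the loss of an item that arrived `age` steps ago."""
    return (1.0 - mu) ** age


mu = recommended_forgetting_factor(400)   # 1CDT drifts every 400 items (Table 2)
weight = weight_at_age(mu, 400)           # close to exp(-1) for small mu
ratio = 1.00e-02 / mu                     # empirical value from Table 2 over the recommendation
```

For a small forgetting factor, an item that is one forgetting period old carries a weight of roughly exp(-1) relative to the newest item, which is the sense in which the forgetting period acts as a soft window size.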

Certainly, the drifting properties of real-world datasets are not as clear as those of synthetic datasets. Nevertheless, we can still infer the forgetting period from domain knowledge and choose an appropriate forgetting factor accordingly. For instance, for the weather forecast dataset, although we cannot know the drifting property of the distribution in advance, a relatively stable period can still be estimated.

6 Conclusion

In this paper, we proposed DFOP, an approach based on a forgetting mechanism for handling streaming learning problems with distribution change. The main idea is to downweight older data items by introducing an exponential forgetting factor, without assuming any prior knowledge about the drift. Meanwhile, DFOP meets the one-pass constraint, guaranteeing that the data items are scanned only once without storing the entire dataset. The storage requirement of DFOP is , where is the dimension of the data, independent of the number of training examples. Both theoretical support and empirical demonstrations are presented to validate its effectiveness and practicality.

Besides, how to efficiently reduce the storage and parallelize DFOP to adapt to even larger-scale real-world applications would be interesting future work.

Appendix

1 Generalized DFOP and Proofs

1.1 Recursive Algorithm for Dynamic Discount Factors

When dynamic discount factors are introduced to downweight the contribution of older data items, the target function can be written as,

(11)

In this part, we provide a provably recursive algorithm that directly solves (11), shown in Algorithm 2 and named the generalized DFOP algorithm, G-DFOP for short.

  Input: A stream of data with feature   and  , discounted factor sequence ;
  Output: Prediction label (real value for regression and discrete-value for classification).
  Initialize ,
  for  to  do
     ,
     ,
     .
     .    // for regression;
      // for classification
  end for
Algorithm 2 Generalized DFOP

1.2 Proof of Generalized DFOP

In this part, we will prove the consistency between G-DFOP and target function in (11).

Lemma 4

Let and be matrices of compatible dimensions such that the product and the sum exist. Then we have

(12)
Proof.

Multiplying the right-hand side of (12) by from the right gives
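The identity can also be checked numerically. The sketch below assumes that Lemma 4 takes the common two-matrix form of the matrix inversion lemma, namely that the inverse of I + AB equals I - A (I + BA)^{-1} B; this form is an assumption about equation (12), and the helper name `lemma4_rhs` is hypothetical.

```python
import numpy as np


def lemma4_rhs(A, B):
    """Right-hand side I - A (I + BA)^{-1} B of the assumed identity."""
    n, m = A.shape
    return np.eye(n) - A @ np.linalg.solve(np.eye(m) + B @ A, B)


rng = np.random.default_rng(0)
A = 0.1 * rng.normal(size=(4, 2))   # small entries keep I + AB well conditioned
B = 0.1 * rng.normal(size=(2, 4))
lhs = np.linalg.inv(np.eye(4) + A @ B)
rhs = lemma4_rhs(A, B)
```

When A is a single column and B a single row, the inner inverse is a scalar, which is what allows a rank-one update of an inverse matrix without a fresh matrix inversion at every step.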

For convenience, let , then the closed-form solution of the optimization problem (11) can be calculated as follows

(13)

Now we prove that the solution obtained by G-DFOP is equivalent to the closed-form solution in (13).

Theorem. By the policy in Algorithm 2, we obtain the same solution as in (13).

Proof.

Denote , obviously

Then the solution in (13) can be rewritten into the following form:

Now, we introduce and then apply Lemma 4 to the expression above, which gives

Let , we can obtain the policy described in Algorithm 2.

Remark. Obviously, DFOP in the paper is only a special case obtained by fixing the discount factor sequence as . Note that, to simplify the notation in the estimate error analysis of DFOP, we slightly modified , multiplying it by .

2 Proof of Theorem 1

2.1 Definition of Sub-Gaussian and Sub-Exponential

First we give typical definitions of sub-Gaussian random variable and random vector, meanwhile, definition of sub-Exponential random variable is also provided.

Definition 3.

(sub-Gaussian random variable) A random variable is said to be sub-Gaussian with variance proxy if and its moment generating function satisfies

(14)

In this case we write .

Definition 4.

(sub-Gaussian random vector) A random vector is called sub-Gaussian with variance proxy if all its coordinates are sub-Gaussian random variables with variance proxy .

Definition 5.

(sub-Exponential random variable) A random variable is said to be sub-Exponential with parameter if and its moment generating function satisfies

(15)

In this case we write .

Remark. Note that the definitions above all require a zero-mean constraint, which is not necessary for analyzing the “light tail” property. Hence, a random variable that is not zero-mean but satisfies condition (14) is called generalized sub-Gaussian. The definitions of a generalized sub-Gaussian random vector and a generalized sub-Exponential random variable are similar.

2.2 Proof of Theorem 1

Theorem 1 in the paper presents a vector concentration inequality for sub-Gaussian random vector sequences, which plays an important role in proving Lemma 2 and Lemma 3 in our paper. To prove Theorem 1, we first present Lemma 5, which shows that the norm of a sub-Gaussian random vector is a generalized sub-Gaussian random variable.

Lemma 5

If is a sub-Gaussian random vector with variance proxy , then

(16)
Proof.

The conclusion here is a direct corollary from Theorem 3.1 in journals/buldygin2010inequalities , where , , and .

Then we present the following Lemma 6 to show the equivalence between sub-Gaussian random variable and sub-Exponential variable, which plays an important role in proving Theorem 1.

Lemma 6

Let be a sub-Gaussian random variable, i.e., . Then the random variable is sub-Exponential: .

Proof.

We prove this lemma by definition.

(17)
(18)
(20)

where (17) and (18) hold because of Jensen’s inequality, (19) holds because of in journal/vershynin2010intro , and the last step (20) holds because of the condition in the definition of sub-Exponential, i.e., .

Remark. Note that the proof of Lemma 6 does not use the zero-mean property of the sub-Gaussian random variable . Hence, Lemma 6 can be applied to a generalized sub-Gaussian random variable without requiring the zero-mean condition.

Now we prove Lemma 1 in the paper, stated as follows.

For a sub-Gaussian random vector sequence with a variance proxy sequence , there exists a corresponding positive bounding sequence , such that

(21)
Proof.

Consider any vector in the sequence, say . Because it is a sub-Gaussian random vector, directly applying Lemma 5, we have

which means that, for a sub-Gaussian random vector, its norm is a generalized sub-Gaussian random variable. Here, “generalized” means that it may not meet the zero-mean condition.

Because the proof of Lemma 6 does not use the zero-mean property of the sub-Gaussian random variable, it can also be applied to a generalized sub-Gaussian random variable. Let , then , specifically,

Thus, for , we have

(22)

Obviously, we can choose a sufficiently small positive constant as , such that

And is exactly the bounding sequence we desired. This completes the proof.

Now we prove Theorem 1 in the paper, stated as follows.

Theorem 3

In a Euclidean space , let be an E-valued martingale-difference sub-Gaussian sequence with a corresponding bounding sequence . Let , then for all and :

(23)
Proof.

First of all, for a sub-Gaussian sequence with corresponding bounding sequence , from Lemma 1 in the paper, we have . Besides, a Euclidean space is 1-smooth and 1-regular, which means . The result then follows immediately from Theorem 2.1 in journal/juditsky2008large .

3 Proof of Theorem 2

Firstly, the following generalized summation by parts is essential for the proof of main theorem in our paper.

Lemma 7

Let and be two sequences. And denote , then for we have,

Proof.

The right-hand side can be easily verified by expanding in the left-hand as .

Remark. When , the same derivation gives the famous Abel transformation,
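The Abel transformation mentioned in the remark can be verified on a small example. The sketch below uses the standard summation-by-parts form, in which the sum of a_k b_k over k = 1..n equals A_n b_n minus the sum over k = 1..n-1 of A_k (b_{k+1} - b_k), with A_k denoting the partial sums of a. The sequences and the function name are illustrative assumptions.

```python
from itertools import accumulate


def abel_rhs(a, b):
    """A_n b_n - sum_{k=1}^{n-1} A_k (b_{k+1} - b_k), with A_k = a_1 + ... + a_k."""
    partial = list(accumulate(a))   # partial[k] holds A_{k+1} in zero-based indexing
    n = len(a)
    return partial[-1] * b[-1] - sum(partial[k] * (b[k + 1] - b[k]) for k in range(n - 1))


a = [1, 2, 3, 4]
b = [2, -1, 5, 3]
lhs = sum(x * y for x, y in zip(a, b))
rhs = abel_rhs(a, b)
```

The transformation trades a weighted sum of the terms a_k for a weighted sum of their partial sums, which is convenient when the partial sums are what a concentration inequality controls.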

3.1 Proof of Lemma 2 in paper

Lemma 2 in our paper is stated as follows.

Let be an -valued martingale-difference -dimension sub-Gaussian random vector sequence, with corresponding bounding sequence , and . Then for , with a probability at least , we have

where and .

Proof.

Denote , then with the recursive property of , the left-hand side can be expanded by Lemma 7,

And it could be separated into two parts, i.e.,

For , based on Lemma 3, we have

Taking the union bound over , and letting , the following holds with probability at least ,

Conditioned on all of the above concentration inequalities holding, for , we have

Hence, combining and , we complete the proof. ∎

3.2 Proof of Lemma 3 in paper

Lemma 3 in our paper is stated as follows.

Let be an independent (or -valued martingale-difference) sub-Gaussian random variable sequence, with corresponding bounding sequence (i.e., variance proxy sequence) , and . Then for , with a probability at least , we have