1 Introduction
With the rapid growth of data collection, the volume of generated data poses a challenge for traditional machine learning approaches. The challenges are multifaceted; in general, the accumulating and evolving properties of data streams are two of the most troublesome issues.
For the accumulating property, it is apparently impractical to store all the data due to limited memory and computation resources. Hence, an offline approach is no longer suitable for such tasks; prediction models can only be trained and updated incrementally in an online processing paradigm. It is also worth mentioning that an online streaming approach is called one-pass when it goes through the training data only once without storing the entire dataset. The one-pass property is pursued because the raw data is sometimes discarded or no longer accessible after being processed. A one-pass approach guarantees that the learning process is independent of the volume of the data stream, which makes it considerably more demanding and difficult.
Furthermore, the evolving nature of data streams also makes it challenging to apply traditional machine learning approaches directly, because it is no longer reasonable to assume that current and future data come from the same distribution. In a non-stationary environment, which is very common in data generation processes, the underlying distribution is likely to change dynamically over time. For instance, the click information collected by a recommendation system certainly evolves, because customers' interests may change as they browse product pages. Another example is credit scoring: the criteria for granting credit should adapt accordingly, since changing economic conditions greatly influence people's behaviour. The phenomena above are typical examples of distribution change. Under this scenario, the performance of traditional approaches drops dramatically, making them unsuitable for these tasks both empirically and theoretically.
To address these two issues simultaneously, in this work we propose DFOP, a distribution-free one-pass learning approach for data streams with distribution change under one-pass constraints. Its advantages are as follows. First, by solving the target recursively, we guarantee that the data stream is gone through only once. Second, based on a forgetting mechanism, the loss of older data is discounted without explicitly modelling the dynamics or assuming prior information about the distribution change. For streaming regression, a theoretical guarantee is presented, showing that with high probability the estimation error of the dynamic concept decreases until convergence. Meanwhile, empirical experiments on both synthetic and real-world datasets indicate the effectiveness and practicality of DFOP in both regression and classification scenarios.
The rest of this paper is organized as follows. In Section 2, we briefly review related work. Then, the static-scenario model is introduced in the preliminaries. Next, DFOP is presented to handle dynamics in regression and classification scenarios in a non-stationary environment; both its theoretical guarantee and its empirical effectiveness are examined. Finally, we conclude the paper.
2 Related Work
Online and One-Pass Algorithms. With the rapid growth of data volume and velocity, it is no longer practical to adopt offline algorithms for streaming data learning tasks. Hence, online algorithms, which update the current model with the most recent examples, have become increasingly attractive. In general, they are error driven: the current model is updated depending on whether the current example is misclassified journals/csur/GamaZBPB14 . Representative algorithms include Perceptron journal/rosenblatt1958perceptron , Winnow journals/ml/Littlestone87 and many other variations. Besides, the paradigm of "prediction with expert advice" book/Cambridge/cesa2006prediction has also inspired interesting work, such as AddExp conf/icml/KolterM05 and DWM conf/icdm/KolterM03 ; journals/jmlr/KolterM07 . Most of these approaches store the entire or partial training data and scan data items multiple times. Recently, one-pass algorithms, which demand that each data item be processed only once, have drawn increasing attention. Concretely speaking, after a data item has been processed and the relevant statistics have been stored, the raw data item is discarded and never accessed again. Obviously, one-pass constraints impose a higher degree of difficulty on algorithm design; some efforts have been devoted to them conf/nips/WuBSD16 .

Nonstationary Learning. Owing to its effectiveness and simplicity, a sliding window is usually adopted to handle data streams with distribution change. It uses only a fixed or variable number of recent data items, which are the most informative for the current prediction journals/ml/Littlestone87 ; journals/ida/Klinkenberg04 . The model is usually updated by two processes: a learning process, i.e., updating the model with the newly arriving data, and a forgetting process, i.e., discarding data items that move out of the window journals/csur/GamaZBPB14 . However, choosing an appropriate window size is of great importance, and it still depends mainly on heuristics to a certain extent. Some efforts select the window size adaptively journals/ida/Klinkenberg04 ; conf/sbia/GamaMCR04 ; journals/ida/KunchevaZ09 . The common strategy adjusts the window size based on the performance or an estimate of the generalization error. SVMada journals/ida/Klinkenberg04 presents a theoretically supported approach; however, its computational cost makes it impractical in real-world applications.

Our proposed approach DFOP, short for Distribution-Free One-Pass, is a one-pass algorithm: it guarantees that data items are gone through only once. Besides, DFOP is distribution-free: unlike traditional approaches to distribution change, it does not explicitly model the dynamics, and no prior information about the distribution change is assumed.
3 Preliminaries
In this part, the streaming regression model in a static scenario is briefly introduced.
In a streaming scenario, we denote a labeled dataset as $\{(\mathbf{x}_t, y_t)\}$, where $\mathbf{x}_t \in \mathbb{R}^d$ is the feature of the $t$-th instance and $y_t \in \mathbb{R}$ is a real-valued output. Furthermore, we assume a linear model as follows,

$y_t = \mathbf{w}_t^\top \mathbf{x}_t + \epsilon_t,$   (1)

where $\{\epsilon_t\}$ is the noise sequence and $\{\mathbf{w}_t\}$ is what we desire to estimate.
In a static scenario, the sequence $\{\mathbf{w}_t\}$ is a constant vector, denoted as $\mathbf{w}^\ast$. Then least squares can be adopted to minimize the residual sum of squares, which has a closed-form solution. However, it fails when an online/one-pass constraint is added, demanding that each raw item be discarded after it has been processed. Recursive least squares (RLS) and stochastic gradient descent (SGD) are two typical approaches to solving this problem in an online paradigm.
In a non-stationary environment, especially when the underlying distribution changes, traditional approaches are not suitable, since the typical i.i.d. assumption can no longer be expected to hold. In the next sections, we propose to handle this scenario with an exponential forgetting mechanism, without explicitly modelling the evolution of the data stream; theoretical support and empirical demonstrations are presented.
In the following, $\|\cdot\|$ denotes the $\ell_2$-norm in Euclidean space. Meanwhile, for a bounded real-valued sequence, a bar denotes its upper bound.
4 Distribution Free OnePass Learning
Since the sequence $\{\mathbf{w}_t\}$ is changing over time in a dynamic environment, it is no longer reasonable to estimate the current concept (i.e., at time $t$) via the methods introduced previously. Instead, we introduce a sequence of discount factors to down-weight the loss of older data as follows,

$\hat{\mathbf{w}}_t = \operatorname{arg\,min}_{\mathbf{w}} \sum_{i=1}^{t} \Big( \prod_{j=i+1}^{t} (1 - \mu_j) \Big) \big( y_i - \mathbf{w}^\top \mathbf{x}_i \big)^2,$   (2)

where $\mu_i \in (0, 1)$ is a discount factor that smoothly puts less weight on older data. The intuition is easier to grasp if we simplify all $\mu_i$ to a constant $\mu$; then the target function is,

$\hat{\mathbf{w}}_t = \operatorname{arg\,min}_{\mathbf{w}} \sum_{i=1}^{t} (1 - \mu)^{t-i} \big( y_i - \mathbf{w}^\top \mathbf{x}_i \big)^2.$   (3)

The quantity $\mu$ is named the forgetting factor book/Pearson/haykin2008adaptive . The value of the forgetting factor is, as a matter of fact, a trade-off between stability with respect to past conditions and sensitivity to future evolution.
It should be pointed out that the forgetting mechanism based on an exponential forgetting factor can also be considered, to some extent, a continuous analogue of the sliding window approach: older data items with a small enough weight can be thought of as excluded from the window. More discussion of the relation between window size and forgetting factor is provided in Section 5.4.
4.1 Algorithm
For the optimization problem proposed in (3), taking the derivative of the target function directly yields the optimal solution in closed form,

$\hat{\mathbf{w}}_t = \Big( \sum_{i=1}^{t} (1 - \mu)^{t-i}\, \mathbf{x}_i \mathbf{x}_i^\top \Big)^{-1} \sum_{i=1}^{t} (1 - \mu)^{t-i}\, \mathbf{x}_i y_i.$   (4)

However, the expression above is an offline estimate: all the data items up to time $t$ are needed. Instead of repeatedly solving (4), we estimate the underlying concept by adding a correction term, based on the newly arriving data item, to the previous estimate. With the forgetting-factor recursive least squares method book/Pearson/haykin2008adaptive , we can solve the target (3) in a one-pass paradigm. To the best of our knowledge, this is the first time traditional forgetting-factor RLS has been adopted for tasks with distribution change under one-pass constraints. We name this approach DFOP (short for Distribution-Free One-Pass), summarized in Algorithm 1.
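Algorithm 1 itself is not reproduced here, but the forgetting-factor RLS recursion it builds on is standard book/Pearson/haykin2008adaptive . A minimal sketch, assuming a constant forgetting factor `mu` and our own variable names (details such as initialization may differ from Algorithm 1):

```python
import numpy as np

def dfop_update(w, P, x, y, mu=0.01):
    """One DFOP-style step: forgetting-factor recursive least squares.

    lam = 1 - mu exponentially discounts older data, matching target (3).
    For classification, the prediction would be the sign of w @ x.
    """
    lam = 1.0 - mu
    Px = P @ x
    k = Px / (lam + x @ Px)            # gain vector
    w = w + k * (y - w @ x)            # prediction-error correction
    P = (P - np.outer(k, Px)) / lam    # discounted inverse-covariance update
    return w, P
```

Setting `mu = 0` recovers plain RLS, which is consistent with RLS being a special case of DFOP. Only `w` and `P` are kept in memory, so the storage does not grow with the stream.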
Besides, it should be pointed out that the discount factor is by no means necessarily a constant. We provide a generalized DFOP (short as GDFOP) for a dynamic discount-factor sequence, corresponding to the target in (11), which is also provably a one-pass algorithm. Detailed proofs are provided in Section 1 of the supplementary material.
For the classification scenario, the output is no longer real-valued but discrete, and we assume $y_t \in \{-1, +1\}$ for convenience. A slight modification of the original output step is applied for classification; its effectiveness is empirically validated in the next section.
Assuming that the feature is $d$-dimensional, we only need to keep $O(d^2)$ statistics in memory during processing. In other words, the storage is always $O(d^2)$, independent of the number of training examples. Besides, at the $t$-th time stamp, the update is unrelated to the previous data items, namely every data item can be discarded once it has been scanned.
4.2 Theoretical Guarantee
In this section, we develop an estimation error bound for the non-stationary regression scenario.
Consider the additive model of drift in the sequence $\{\mathbf{w}_t\}$,

$\mathbf{w}_{t+1} = \mathbf{w}_t + \boldsymbol{\delta}_t.$   (5)
We assume that the additive term $\boldsymbol{\delta}_t$ is a $d$-dimensional real-valued martingale-difference sub-Gaussian vector sequence, with a corresponding variance proxy sequence, whose formal definition is given below. The martingale-difference assumption is reasonable: in many real-world applications, the drifts of concepts are usually independent.

Similar to the analysis in journals/control/guo1993performance , we relax the assumptions to be more realistic for real-world applications and provide a non-deterministic estimation error bound based on vector concentration, showing that the estimation error converges with high probability.
Now we give the formal definitions of sub-Gaussian random variables and random vectors.
Definition 1.
(sub-Gaussian random variable) A random variable $X$ is said to be sub-Gaussian with variance proxy $\sigma^2$ if $\mathbb{E}[X] = 0$ and its moment generating function satisfies

$\mathbb{E}\left[\exp(sX)\right] \le \exp\left(\sigma^2 s^2 / 2\right), \quad \forall s \in \mathbb{R}.$   (6)
Definition 2.
(sub-Gaussian random vector) A random vector $X \in \mathbb{R}^d$ is called sub-Gaussian with variance proxy $\sigma^2$ if all its coordinates are sub-Gaussian random variables with variance proxy $\sigma^2$.
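The MGF condition in Definition 1 can be checked directly for simple distributions. For instance, a Rademacher variable (uniform on {-1, +1}, which is not mentioned in the paper and serves only as an illustration) is sub-Gaussian with variance proxy 1, since its MGF equals cosh(s) and cosh(s) <= exp(s^2/2):

```python
import math

def rademacher_mgf(s):
    """E[exp(s X)] for X uniform on {-1, +1}; equals cosh(s)."""
    return 0.5 * (math.exp(s) + math.exp(-s))

def subgaussian_mgf_bound(s, sigma2=1.0):
    """Right-hand side of the sub-Gaussian MGF condition (6)."""
    return math.exp(sigma2 * s * s / 2.0)
```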
To exploit the concentration property of sub-Gaussian random vectors, the condition proposed in Theorem 2.1 of journal/juditsky2008large must be satisfied. Thus, we first show that there exists a bounding sequence for a sub-Gaussian random vector sequence.
Lemma 1
For a sub-Gaussian random vector sequence $\{X_t\}$ with a variance proxy sequence $\{\sigma_t^2\}$, there exists a corresponding positive bounding sequence $\{K_t\}$ such that

$\mathbb{E}\left[\exp\left(\|X_t\|^2 / K_t^2\right)\right] \le \exp(1).$   (7)
Lemma 1 guarantees the "light tail" assumption for sub-Gaussian random vectors. We can then apply the following vector concentration result, which is a corollary of Theorem 2.1 proposed in journal/juditsky2008large .
Theorem 1
(Corollary of Theorem 2.1 in journal/juditsky2008large ) In a Euclidean space $E$, let $\{\xi_i\}$ be an $E$-valued martingale-difference sub-Gaussian sequence with a corresponding bounding sequence $\{K_i\}$. Let $S_N = \sum_{i=1}^{N} \xi_i$; then for all $N$ and $\gamma > 0$:
(8) 
Based on Theorem 1, we provide Lemma 2 and Lemma 3 to bound sums of sub-Gaussian random vectors and random variables with exponentially decreasing weights, respectively.
Lemma 2
Let be an $\mathbb{R}^d$-valued martingale-difference sub-Gaussian random vector sequence, with corresponding bounding sequence , and . Then for , with probability at least , we have
where and .
Lemma 3
Let be an independent (or real-valued martingale-difference) sub-Gaussian random variable sequence, with corresponding bounding sequence (i.e., variance proxy sequence) , and . Then for , with probability at least , we have
where .
Theorem 2
Assume the following conditions are satisfied:

the drift term is an $\mathbb{R}^d$-valued martingale-difference sub-Gaussian random vector sequence, with corresponding bounding sequence ;

the output noise is an independent (or real-valued martingale-difference) sub-Gaussian random variable sequence, with corresponding bounding sequence (i.e., variance proxy sequence) .

Then with probability at least , we have
where , , and .
Remark. The estimation error bound can be decomposed into three parts. Apparently, the first part decreases to zero as increases to infinity; the second is caused by the output noise, which cannot be eliminated; and the third is introduced by the drift of . Ignoring the polylogarithmic factors in and , an asymptotic analysis gives the estimation error bound as,
where we use the notation to hide constant factors as well as polylogarithmic factors in and , and will exponentially decrease to zero as .
5 Experiments
In this section, we examine the empirical performance of the proposed DFOP in both regression and classification scenarios, and then analyze parameter sensitivity in Section 5.4. Due to page limits, only the results on the classification scenario are provided here; the regression results are appended in the supplementary material.
Moreover, when dealing with real-world datasets, we cannot grasp the evolving distribution (specifically, the start and end times of drift, or the underlying distribution itself), so analyzing the behaviour of algorithms on them alone would be quite incomplete. Hence, both synthetic and real-world datasets are included in the comparison experiments.
5.1 Comparison Methods
We compare the proposed approach with six common methods on both synthetic and real-world datasets: (a) RLS, the least squares approach solved in a recursive manner; (b) sliding window approaches, where the classifier is constantly updated with the most recent data samples in the window, using 1NN and SVM as base classifiers, denoted 1NNwin and SVMwin conf/sdm/SouzaSGB15 ; (c) SVMfix, a batch implementation of SVM with a fixed window size conf/kdd/SyedLS99a ; (d) SVMada, a batch implementation of SVM with an adaptive window size journals/ida/Klinkenberg04 ; (e) DWM, the dynamic weighted majority algorithm, an adaptive ensemble based on the traditional weighted majority algorithm Winnow conf/icdm/KolterM03 ; journals/jmlr/KolterM07 .

It is worth emphasizing that the above comparisons are not all entirely fair to DFOP, because DFOP requires each data item to be processed only once and needs only one instance to update the model. Not all comparison methods meet these two constraints: 1NNwin, SVMwin, SVMfix and SVMada are window-based algorithms and hence not one-pass. Besides, SVMfix and SVMada are not incremental but updated in a series of batches. DWM is incremental but not one-pass, because it additionally uses data to update its pool of experts.
5.2 Synthetic Datasets
First, we present the performance comparisons over synthetic datasets.

SEA conf/kdd/StreetK01 consists of three attributes $f_1, f_2, f_3 \in [0, 10]$. The target concept is a threshold on $f_1 + f_2$, and there are 50,000 instances over 4 stages with different thresholds.

hyperplane conf/kdd/Fan04 is generated uniformly on a 10-dimensional hyperplane, with 90,000 instances in total over 9 different stages.
Besides, another 11 synthetic datasets for binary classification are also adopted. Detailed information is included in the supplementary material.
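As a concrete illustration of how such drifting streams are built, here is a SEA-style generator. The attribute range [0, 10] and the stage thresholds follow the usual description of SEA conf/kdd/StreetK01 and are an assumption; the exact instantiation used in the paper may differ:

```python
import numpy as np

def sea_stream(n_per_stage=12500, thresholds=(8.0, 9.0, 7.0, 9.5), seed=0):
    """Yield SEA-style examples: x uniform in [0, 10]^3, label = [x1 + x2 <= theta].

    theta switches between stages (abrupt drift); the third attribute is irrelevant.
    """
    rng = np.random.default_rng(seed)
    for theta in thresholds:
        for _ in range(n_per_stage):
            x = rng.uniform(0.0, 10.0, size=3)
            yield x, int(x[0] + x[1] <= theta)
```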
The performance is measured by holdout accuracy, since the underlying joint distributions of the synthetic datasets are known. Holdout accuracy is calculated at each time stamp over testing data generated from the same distribution as the training data. Performance comparisons of the seven approaches on the SEA and hyperplane datasets are depicted in Figure 1. Since the accuracy curves of SVMada, SVMfix, 1NNwin and SVMwin are so unstable that they would obscure all the other curves, we also present a cleaner figure containing only RLS, DWM and DFOP.

As shown in Figure 1, the accuracy of all algorithms falls rapidly when the underlying distribution undergoes abrupt drift, and then rises again as more data arrive. DFOP is significantly better than RLS, which is a special case of DFOP; this phenomenon validates the effectiveness of the forgetting mechanism. Furthermore, the two best algorithms are clearly DFOP and DWM, both of which converge to the new stage quickly. DFOP shows slightly better performance than DWM, in both slope and asymptote. Moreover, DWM has to dynamically maintain a set of experts and needs previous data to update the expert pool and to decide whether to remove poorly performing experts. By contrast, DFOP demonstrates the desirable performance while scanning each data item only once.
5.3 Realworld Datasets
Dataset | SVMwin | 1NNwin | SVMfix | SVMada | DWM | RLS | DFOP
SEA | 73.94±0.12 | 77.27±0.04 | 86.19±0.06 | 83.47±0.09 | 87.04±0.03 | 84.54±0.47 | 87.99±0.05
hyperplane | 83.74±0.03 | 70.66±0.03 | 87.98±0.03 | 81.94±0.07 | 88.36±0.25 | 69.67±1.40 | 90.14±0.05
1CDT | 98.71±0.05 | 99.96±0.07 | 99.77±0.06 | 99.77±0.08 | 99.90±0.09 | 98.79±1.65 | 99.97±0.05
2CDT | 94.86±0.06 | 94.62±0.10 | 95.19±0.13 | 95.18±0.15 | 90.21±0.67 | 62.24±0.23 | 96.36±0.09
1CHT | 98.75±0.18 | 99.81±0.22 | 99.63±0.17 | 99.63±0.18 | 99.69±0.26 | 98.49±1.61 | 99.84±0.16
2CHT | 87.70±0.04 | 85.69±0.05 | 89.48±0.12 | 88.89±0.13 | 85.92±0.72 | 62.57±0.23 | 89.91±0.07
1CSurr | 97.99±0.04 | 98.12±0.11 | 94.24±1.08 | 93.56±1.08 | 96.31±0.50 | 67.82±0.22 | 93.24±1.44
UG2C2D | 94.47±0.13 | 93.55±0.16 | 95.41±0.10 | 94.92±0.12 | 95.59±0.11 | 67.02±1.46 | 95.59±0.10
UG2C3D | 93.60±0.73 | 92.83±0.93 | 95.05±0.64 | 94.48±0.71 | 95.14±0.62 | 61.95±2.60 | 95.37±0.61
UG2C5D | 74.82±0.45 | 88.04±0.42 | 91.74±0.26 | 90.37±0.35 | 92.82±0.23 | 81.20±2.42 | 92.51±0.25
MG2C2D | 90.20±0.07 | 87.84±0.09 | 84.98±0.06 | 84.22±0.06 | 90.15±0.06 | 57.18±3.66 | 85.06±0.06
G2C2D | 95.54±0.01 | 99.61±0.00 | 95.41±0.01 | 95.26±0.02 | 95.82±0.02 | 95.84±0.01 | 95.83±0.02
Chess | 69.67±1.51 | 79.58±0.54 | 77.73±1.56 | 69.18±3.65 | 73.77±0.66 | 78.70±0.83 | 79.15±0.62
Usenet1 | 68.92±1.12 | 65.36±1.55 | 64.18±2.24 | 67.68±1.86 | 64.43±4.53 | 60.65±0.53 | 69.20±0.68
Usenet2 | 74.44±0.71 | 71.03±0.60 | 73.99±0.69 | 72.64±0.84 | 73.37±0.93 | 73.16±0.67 | 75.60±0.57
Luxembourg | 88.57±0.28 | 77.51±0.44 | 98.25±0.19 | 97.43±0.42 | 92.61±0.40 | 99.06±0.14 | 99.09±0.14
Spam | 83.91±2.20 | 93.43±0.82 | 92.44±0.80 | 91.01±0.94 | 91.49±1.09 | 94.46±0.16 | 94.77±0.26
Weather | 68.54±0.55 | 72.64±0.25 | 67.79±0.65 | 77.26±0.33 | 70.86±0.42 | 78.35±0.18 | 79.23±0.12
Powersupply | 73.33±0.25 | 72.42±0.21 | 71.17±0.15 | 69.39±0.17 | 72.18±0.29 | 69.67±0.64 | 80.46±0.04
Electricity | 74.20±0.08 | 85.33±0.09 | 62.01±0.59 | 58.69±0.58 | 78.60±0.41 | 74.20±0.63 | 76.94±0.26
DFOP W/ T/ L | 18/ 1/ 1 | 14/ 4/ 2 | 19/ 1/ 0 | 18/ 2/ 0 | 14/ 3/ 3 | 19/ 1/ 0 |
Performance comparison in terms of mean and standard deviation of accuracy (both in percent). Bold values indicate the best performance. Besides, () indicates that DFOP is significantly better (worse) than the compared method (paired tests at 95% significance level), and Win/ Tie/ Loss counts are summarized in the last row.

To validate the effectiveness of DFOP in real-world applications, performance comparisons are presented over 8 real-world datasets. Detailed descriptions are provided in Section 5 of the supplementary material.
On real-world datasets, we can never expect to foreknow the underlying distribution at each time stamp, so holdout accuracy cannot be used as the performance measure. In Table 4, we run all experiments for 10 trials and report the overall mean and standard deviation of predictive accuracy over the real-world datasets as well as the other 12 synthetic datasets.
Over the 20 datasets in total, the number of instances varies from 533 to 200,000. DFOP achieves the best accuracy among all approaches on 15 of the 20 datasets, and ranks second or third on the other 5. This validates the effectiveness of DFOP, especially under comparison conditions that favour the non-one-pass methods.
Additionally, the robustness conf/kdd/VlachosDGKK02 of all these algorithms is compared. Briefly speaking, for a particular algorithm, its robustness on a dataset is defined as the ratio between its accuracy and the smallest accuracy among all compared algorithms on that dataset; the sum of this ratio over all datasets indicates the overall robustness of the algorithm, and the greater the sum, the better the performance. DFOP achieves the best robustness over the 20 datasets, while RLS ranks last, as expected, since it does not consider the evolving distribution at all. Due to page limits, detailed robustness comparison results can be found in Section 4.3 of the supplementary material.
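Under our reading of this definition, the robustness computation looks as follows (a small sketch with our own naming):

```python
def robustness_scores(acc):
    """acc maps each algorithm to its accuracies over the same list of datasets.

    r_a(D) = acc_a(D) / min_b acc_b(D); an algorithm's robustness is the sum
    of r_a over all datasets (larger is better, and every ratio is >= 1).
    """
    algos = list(acc)
    n = len(acc[algos[0]])
    mins = [min(acc[a][i] for a in algos) for i in range(n)]
    return {a: sum(acc[a][i] / mins[i] for i in range(n)) for a in algos}
```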
5.4 Parameter Study
As stated previously, how to choose an appropriate forgetting factor is an important issue, since it reflects a trade-off between stability with respect to past conditions and sensitivity to future evolution. To figure out how the forgetting factor affects performance in classification, accumulated accuracy (short as 'AA') over the time series is adopted as the performance measure, defined as,

$\mathrm{AA}(t) = \frac{1}{t} \sum_{i=1}^{t} \mathbb{I}\big[\hat{y}_i = y_i\big],$   (9)

where $\mathbb{I}[\cdot]$ is the indicator function, which takes 1 if its argument is true and 0 otherwise, and $\hat{y}_i$ and $y_i$ are the predicted and ground-truth labels, respectively. Figure 2 shows the impact of different forgetting factors over four datasets. We notice that the accumulated accuracy of RLS decreases almost throughout. For a relatively small but nonzero forgetting factor, the performance is satisfactory without a significant gap. However, when the forgetting factor is too large, say 0.5, the performance is even much worse than RLS. This is consistent with intuition: with such a large forgetting factor, older data samples are exponentially down-weighted so aggressively that there are not enough effective training samples to update the model.
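Accumulated accuracy per (9) is straightforward to compute in one pass over the predictions (a small sketch with our own naming):

```python
def accumulated_accuracy(y_pred, y_true):
    """AA(t) = (1/t) * sum_{i<=t} 1[y_pred_i == y_true_i], returned for every t."""
    correct = 0
    aa = []
    for t, (p, y) in enumerate(zip(y_pred, y_true), start=1):
        correct += int(p == y)
        aa.append(correct / t)
    return aa
```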
Now comes the question: how should one choose an appropriate forgetting factor to adapt to the distribution change in the data stream? To answer it, recall the target function in (3): when $\mu$ is close to $0$, which is often the case in practice and is validated in Figure 2, we have

$(1 - \mu)^{K} \approx e^{-\mu K} = e^{-1} \quad \text{for } K = 1/\mu,$   (10)

where we define $K = 1/\mu$ as the forgetting period. The contribution to the prediction error of data items older than $K$ time steps is discounted by a weight smaller than $e^{-1}$ compared to the current data. As a matter of fact, the forgetting period in the forgetting mechanism is quite similar to the window size in the sliding window technique; it can be regarded as a soft relaxation of the window size. Consequently, the forgetting factor should be chosen according to the forgetting period $K$, i.e., $\mu = 1/K$, where the data distribution should be relatively smooth and stable during each forgetting period.
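The rule of thumb above amounts to setting the forgetting factor to the reciprocal of the stable period; a tiny sketch, using the 1CDT row of Table 2 (period 400) as a sanity check:

```python
import math

def forgetting_factor_for_period(K):
    """Choose mu = 1/K so that items older than K steps weigh less than e^{-1}."""
    return 1.0 / K

def relative_weight(mu, age):
    """Weight of an item `age` steps old relative to the current one: (1 - mu)^age."""
    return (1.0 - mu) ** age
```

For a period of 400, this gives mu = 2.5e-3, matching the theoretically recommended value in Table 2, and an item one full period old keeps roughly e^{-1} of the current weight.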
We validate this idea over the synthetic datasets reported in Table 2, which provides the theoretically recommended value and the empirically appropriate value of the forgetting factor, together with the ratio between them. We can see that the two values are very close on all datasets, differing by no more than a factor of 20, and by no more than a factor of 5 on most. This supports our strategy for choosing the forgetting factor.
Dataset | Period | Theoretical | Empirical | Ratio | Dataset | Period | Theoretical | Empirical | Ratio
1CDT | 400 | 2.50E-03 | 1.00E-02 | 4 | UG2C2D | 1,000 | 1.00E-03 | 1.00E-03 | 1
2CDT | 400 | 2.50E-03 | 1.00E-02 | 4 | UG2C3D | 2,000 | 5.00E-04 | 1.00E-03 | 2
1CHT | 400 | 2.50E-03 | 1.00E-02 | 4 | UG2C5D | 2,000 | 5.00E-04 | 1.00E-03 | 2
2CHT | 400 | 2.50E-03 | 1.00E-02 | 4 | MG2C2D | 2,000 | 5.00E-04 | 1.00E-03 | 2
1CSurr | 600 | 1.67E-03 | 5.00E-03 | 3 | G2C2D | 2,000 | 5.00E-04 | 1.00E-03 | 2
hyperplane | 9,000 | 1.11E-04 | 2.00E-03 | 18 | SEA | 10,000 | 1.00E-04 | 1.00E-03 | 10
Certainly, the drifting properties of real-world datasets are not as clear as those of synthetic datasets. Nevertheless, we can still infer the forgetting period from domain knowledge and choose an appropriate forgetting factor accordingly. For instance, considering the weather forecast dataset, although we cannot foreknow the drift of the distribution, a relatively stable period can still be estimated.
6 Conclusion
In this paper, we proposed DFOP, an approach based on a forgetting mechanism for streaming learning problems with distribution change. The main idea is to down-weight older data items via an exponential forgetting factor, without assuming any prior information about the drift. Meanwhile, DFOP meets the one-pass constraint, guaranteeing that data items are scanned only once without storing the entire dataset. The storage requirement of DFOP is $O(d^2)$, where $d$ is the dimension of the data, independent of the number of training examples. Both theoretical support and empirical demonstrations are presented to validate its effectiveness and practicality.

Besides, how to efficiently reduce the storage and parallelize DFOP to adapt to even larger-scale real-world applications would be an interesting future work.
Appendix
1 Generalized DFOP and Proofs
1.1 Recursive Algorithm for Dynamic Discounted Factors
1.2 Proof of Generalized DFOP
In this part, we prove the consistency between GDFOP and the target function in (11).
Lemma 4
Let and be matrices of compatible dimensions such that the product and the sum exist. Then we have
(12) 
Proof.
For convenience, let ; then the closed-form solution of optimization (11) can be calculated as follows
(13) 
Now we prove that the solution obtained by GDFOP is equivalent to the closed-form solution in (13).
Proof.
Now, we introduce and then apply Lemma 4 to (1.2), this gives
Let , we can obtain the policy described in Algorithm 2.
Remark. Obviously, DFOP in the paper is the special case obtained by fixing the discount factor sequence as . Note that, for simplicity of notation in the estimation error analysis of DFOP, we slightly modified and by multiplying by .
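Identities of the kind stated in Lemma 4 appear in RLS-style derivations typically as the matrix inversion (Sherman-Morrison) lemma; that this is the content of Lemma 4 here is our assumption. A numerical sanity check of the rank-one case:

```python
import numpy as np

def sherman_morrison(Ainv, u, v):
    """(A + u v^T)^{-1} from A^{-1} via a rank-one correction:
    A^{-1} - (A^{-1} u v^T A^{-1}) / (1 + v^T A^{-1} u)."""
    Au = Ainv @ u
    vA = v @ Ainv
    return Ainv - np.outer(Au, vA) / (1.0 + v @ Au)
```

This is what allows the closed-form solution to be updated recursively instead of re-inverting the weighted covariance at every step.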
2 Proof of Theorem 1
2.1 Definitions of Sub-Gaussian and Sub-Exponential
First, we give the typical definitions of sub-Gaussian random variables and random vectors; the definition of sub-Exponential random variables is also provided.
Definition 3.
(sub-Gaussian random variable) A random variable $X$ is said to be sub-Gaussian with variance proxy $\sigma^2$ if $\mathbb{E}[X] = 0$ and its moment generating function satisfies

$\mathbb{E}\left[\exp(sX)\right] \le \exp\left(\sigma^2 s^2 / 2\right), \quad \forall s \in \mathbb{R}.$   (14)

In this case we write $X \in \mathrm{subG}(\sigma^2)$.
Definition 4.
(sub-Gaussian random vector) A random vector $X \in \mathbb{R}^d$ is called sub-Gaussian with variance proxy $\sigma^2$ if all its coordinates are sub-Gaussian random variables with variance proxy $\sigma^2$.
Definition 5.
(sub-Exponential random variable) A random variable $X$ is said to be sub-Exponential with parameter $\lambda$ if $\mathbb{E}[X] = 0$ and its moment generating function satisfies

$\mathbb{E}\left[\exp(sX)\right] \le \exp\left(\lambda^2 s^2 / 2\right), \quad \forall |s| \le 1/\lambda.$   (15)

In this case we write $X \in \mathrm{subE}(\lambda)$.
Remark. Note that the definitions above all require a zero-mean constraint, which is not necessary for analyzing the "light tail" property. Hence, a random variable that is not zero-mean but satisfies condition (14) is called generalized sub-Gaussian. The definitions of generalized sub-Gaussian random vectors and generalized sub-Exponential random variables are similar.
2.2 Proof of Theorem 1
Theorem 1 in the paper presents a vector concentration inequality for sub-Gaussian random vector sequences, which plays an important role in proving Lemma 2 and Lemma 3 in our paper. To prove Theorem 1 in the paper, we first present the following Lemma 5, showing that the norm of a sub-Gaussian random vector is a generalized sub-Gaussian random variable.
Lemma 5
If is a subGaussian random vector with variance proxy , then
(16) 
Proof.
The conclusion here is a direct corollary of Theorem 3.1 in journals/buldygin2010inequalities , where , , and .
Then we present the following Lemma 6, showing the relation between sub-Gaussian and sub-Exponential random variables, which plays an important role in proving Theorem 1.
Lemma 6
Let $X$ be a sub-Gaussian random variable. Then the random variable $X^2$ is sub-Exponential.
Proof.
We prove this lemma by definition.
(17)  
(18)  
(20) 
where (17) and (18) hold because of Jensen's inequality, (19) holds because of the corresponding inequality in journal/vershynin2010intro , and the last step (20) holds because of the condition in the definition of sub-Exponential, i.e., .
Remark. Note that the proof of Lemma 6 does not use the zero-mean property of the sub-Gaussian random variable. Hence, Lemma 6 can be applied to a generalized sub-Gaussian random variable without requiring the zero-mean condition.
Now, let us prove Lemma 1 in the paper, stated as follows.
For a sub-Gaussian random vector sequence $\{X_t\}$ with a variance proxy sequence $\{\sigma_t^2\}$, there exists a corresponding positive bounding sequence $\{K_t\}$ such that

$\mathbb{E}\left[\exp\left(\|X_t\|^2 / K_t^2\right)\right] \le \exp(1).$   (21)
Proof.
Consider any vector in the sequence, say . Because it is a sub-Gaussian random vector, directly applying Lemma 5, we have
which means that the norm of a sub-Gaussian random vector is a generalized sub-Gaussian random variable. Here, "generalized" means it may not meet the zero-mean condition.
Because we did not use the zero-mean property of the sub-Gaussian random variable in the proof of Lemma 6, it can also be applied to a generalized sub-Gaussian random variable. Let , then , specifically,
Thus, for , we have
(22) 
Obviously, we can choose a sufficiently small positive constant as , such that
And is exactly the bounding sequence we desired. This completes the proof.
Now we prove Theorem 1 in the paper, stated as follows.
Theorem 3
In a Euclidean space $E$, let $\{\xi_i\}$ be an $E$-valued martingale-difference sub-Gaussian sequence with a corresponding bounding sequence $\{K_i\}$. Let $S_N = \sum_{i=1}^{N} \xi_i$; then for all $N$ and $\gamma > 0$:
(23) 
Proof.
First of all, for a sub-Gaussian sequence with corresponding bounding sequence , Lemma 1 in the paper gives . Besides, a Euclidean space is 1-smooth and 1-regular, which means . The claim then follows immediately from Theorem 2.1 proposed in journal/juditsky2008large .
3 Proof of Theorem 2
First, the following generalized summation by parts is essential for the proof of the main theorem in our paper.
Lemma 7
Let and be two sequences, and denote . Then for we have,
Proof.
The right-hand side can be easily verified by expanding on the left-hand side as .
Remark. When , the same derivation gives the famous Abel transformation,
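Summation by parts is easy to verify numerically. A small sketch of the Abel-transformation form, with our own naming: the weighted sum of a sequence equals the prefix sums of the weights combined with the increments of the other sequence.

```python
def abel_sum(a, b):
    """Compute sum_i a_i b_i via summation by parts:
    A_n b_n - sum_{i<n} A_i (b_{i+1} - b_i), with A_i the prefix sums of a."""
    prefix, s = [], 0.0
    for x in a:
        s += x
        prefix.append(s)
    n = len(a)
    return prefix[-1] * b[-1] - sum(prefix[i] * (b[i + 1] - b[i]) for i in range(n - 1))
```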
3.1 Proof of Lemma 2 in the Paper
Lemma 2 in our paper is stated as follows.
Let be an $\mathbb{R}^d$-valued martingale-difference sub-Gaussian random vector sequence, with corresponding bounding sequence , and . Then for , with probability at least , we have
where and .
Proof.
Denote ; then, with the recursive property of , the left side can be expanded by Lemma 7,
And it could be separated into two parts, i.e.,
For , based on Lemma 3, we have
taking the union bound over , the following holds with probability at least ,
Conditioned on all of the above concentration inequalities holding, for , we have
Hence, combining and , we complete the proof. ∎
3.2 Proof of Lemma 3 in the Paper
Lemma 3 in our paper is stated as follows.
Let be an independent (or real-valued martingale-difference) sub-Gaussian random variable sequence, with corresponding bounding sequence (i.e., variance proxy sequence) , and . Then for , with probability at least , we have