Distributed Learning with Dependent Samples

02/10/2020, by Shao-Bo Lin, et al.

This paper focuses on the learning rate analysis of distributed kernel ridge regression for strong mixing sequences. Using a recently developed integral operator approach and a classical covariance inequality for Banach-valued strong mixing sequences, we succeed in deriving optimal learning rates for distributed kernel ridge regression. As a byproduct, we also deduce a sufficient condition on the mixing property to guarantee the optimal learning rates for kernel ridge regression. Our results extend the applicable range of distributed learning from i.i.d. samples to non-i.i.d. sequences.


I Introduction

With the development of data mining, data of massive size are collected in numerous application areas, including recommender systems, medical analysis, search engines, financial analysis, online text, sensor network monitoring and social activity mining. For example [34], Twitter has 328 million monthly active users who share 280-character updates with their followers; Google has seen 30 trillion URLs, crawls over 20 billion of those a day, and answers 100 billion search queries a month; the International Data Corporation (IDC) reports that global investment in big data and business analytics (BDA) will grow from $130.1 billion in 2016 to more than $203 billion in 2020. These massive data certainly bring benefits for data analysis in terms of improving prediction capability [39], discovering potential structure of data that cannot be reflected by data of small size [9], and creating new growth opportunities by combining and analyzing industry data [3]. However, they also cause several challenges, called massive data challenges, in data analysis, as follows:

Parallel and distributed computation: Massive data are distributively stored across numerous servers. The cost of data communications between different servers and the risk of loading data onto a remote server create new problems in tackling such distributively stored data. All of this requires developing distributed learning strategies that work without data communications.

Data privacy protection: Applications involving massive data, such as clinical records, personal social networks and financial fraud detection, frequently build machine learning models by training on sensitive data and thus require protecting data privacy. This raises privacy concerns, since adversaries may be able to infer individual information from the massive data. The data privacy issue becomes more urgent when data are owned by different organizations that wish to use them collaboratively.

Dependent samples: Massive data are usually collected over time and behave as temporal data in numerous applications such as medical research and revenue management. Massive temporal data, including time series data and dynamical system data, exhibit strong dependence among samples. Furthermore, it is unreasonable to assume that data of massive size are collected in an independent and identically distributed (i.i.d.) manner. Under this circumstance, the dependence issue poses an urgent problem for the existing distributed learning theory, which must be extended to incorporate non-i.i.d. samples.

I-A Distributed learning

Distributed learning is a natural and preferable approach to conquer the parallel and distributed computation issue of massive data challenges. As shown in Figure 1, each data subset is stored on a local machine, and a specific learning algorithm is implemented on each local machine to yield a local estimator. Then, communications between different local machines are conducted to exchange exclusive information of different data subsets to improve the quality of the local estimators. Finally, all the obtained local estimators are transmitted to a global machine to produce a global estimator. Therefore, there are three ingredients of distributed learning: local processing, communication, and synthesis.

Fig. 1: Training flow of distributed learning

There are roughly two types of learning strategies for producing local estimators. One is parametric regression (or classification), which searches for a linear combination of a set of fixed basis functions [38]. In this case, the coefficients of the basis functions in each local machine must be transmitted to the global machine, and communications between different local machines [13] are necessary to guarantee certain statistical properties of the distributed algorithms. Numerous efficient communication schemes have been proposed to equip distributed learning with parametric regression, and their learning performance has been rigorously verified in [38, 13, 20, 19, 35]. The problem, however, is that communications bring additional risks of disclosing the individual privacy of the distributively stored data subsets.

Fig. 2: Training and testing flows of nonparametric distributed learning

The other is nonparametric distributed learning (NDL), which is devoted to combining the prediction results from local machines without sharing any individual information of the data subsets. In NDL, the query point is sent to each local machine, and each local estimate is a real number rather than a set of coefficients, as Figure 2 shows. Then all local estimates are transmitted to the global machine, without mutual communications, to synthesize the global estimate. Since only real numbers are communicated, NDL succeeds in conquering the computation and privacy issues of massive data challenges.
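
To make the communication pattern concrete, here is a minimal sketch of the NDL prediction flow in Python; the function names, the weighted-average synthesis rule and the toy local estimators are our own illustrations, not notation from the paper.

```python
import numpy as np

def ndl_predict(query_x, local_predictors, subset_sizes):
    """Nonparametric distributed prediction at a single query point.

    Each local machine only returns a real number (its local prediction),
    so no raw data or model coefficients are ever communicated.
    """
    n_total = sum(subset_sizes)
    # Local step: every machine evaluates its own estimator at the query point.
    local_values = [predict(query_x) for predict in local_predictors]
    # Global step: synthesize by a weighted average (weights = subset sizes).
    return sum(n_j / n_total * v for n_j, v in zip(subset_sizes, local_values))

# Toy usage with two hypothetical local estimators.
predictors = [lambda x: np.sin(x), lambda x: np.sin(x) + 0.1]
print(ndl_predict(1.0, predictors, subset_sizes=[100, 50]))
```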

If the data are drawn i.i.d., it was proved in [39, 7, 24] that NDL performs as well as its non-distributed version, which runs the learning algorithm on the whole data set stored on a single large enough machine, provided the number of data subsets is not too large. However, when facing non-i.i.d. data, the dependence plausibly reduces the number of effective samples [37, 25] and thus degrades the prediction performance of learning algorithms. Our purpose is to establish optimal generalization error bounds for NDL with non-i.i.d. data. To this end, we formulate a sufficient condition on the mixing property of the data to guarantee the optimal learning rates of NDL.

I-B Contributions

In this paper, we take the widely used distributed kernel ridge regression (DKRR) [39, 6] as an example to illustrate the feasibility of NDL for tackling strong mixing sequences (or α-mixing sequences). Our results can be easily extended to other kernel-based learning algorithms such as distributed kernel-based gradient descent [24] and distributed kernel-based spectral algorithms [15, 27, 22]. Our novelty can be described in the following three aspects.

Methodology: Bernstein-type concentration inequalities [37, 25] are a classical tool for analyzing the generalization performance of learning algorithms with strong mixing sequences. In this paper, we formulate a novel integral operator framework to analyze the learning performance of kernel ridge regression (KRR) with strong mixing sequences. With the help of the Banach-valued covariance inequality for strong mixing sequences [11], we develop novel operator representations for the generalization error of KRR and DKRR and conduct a refined error analysis. The advantage of the proposed framework is that it improves the generalization error estimates and relaxes the mixing condition from geometric α-mixing to algebraic α-mixing.

KRR theory: With the help of the integral operator approach, we deduce a sufficient condition on the mixing property of the data to guarantee the optimal learning rates for KRR. In particular, we prove that if the samples are algebraically α-mixing and the corresponding regression function [10] is smooth, then KRR can achieve the optimal learning rates established for i.i.d. samples [5]. To the best of our knowledge, this is the first result showing that learning algorithms can achieve optimal learning rates when tackling non-i.i.d. data.

DKRR theory: Due to the dependence among samples, the classical error decomposition for DKRR [6, 15], which divides the generalization error into approximation error, sample error and distributed error, does not hold. Using a delicate analysis of the operator representation, we succeed in deducing a similar error decomposition for DKRR that can handle strong mixing sequences. With this, we deduce optimal learning rates for DKRR, provided the number of local machines is relatively small. Our results show that NDL is a feasible strategy to conquer the distributive storage, privacy and dependence issues of massive data challenges.

The rest of the paper is organized as follows. In the next section, we present distributed kernel ridge regression for tackling distributively stored strong mixing sequences. In Section III, we give the main results of the paper, which include optimal learning rates for KRR and DKRR. In Section IV, we compare our results with related work and present some discussions. In Section V, we introduce the main tools for the analysis and provide a novel error decomposition for distributed learning with strong mixing data. In the last section, we prove our results.

II DKRR with Strong Mixing Sequences

In this section, we introduce strong mixing sequences to quantify the dependence of data and then present DKRR to tackle these dependent data.

II-A Strong mixing sequences

It is impossible to establish satisfactory generalization error bounds for learning algorithms without imposing any restriction on the dependence. An extreme case is that all samples are generated from a single data point, making the number of effective samples equal to 1. Therefore, we introduce several quantities to describe the dependence of samples. The strong mixing condition [28] is a widely used restriction that is much weaker than the so-called mixing conditions considered in [37] and [33].

Let {z_t}_{t≥1} = {(x_t, y_t)}_{t≥1} be a random sequence. We first introduce the stationarity [37, 25] of {z_t}_{t≥1} as follows.

Definition 1

A random sequence {z_t}_{t≥1} is stationary if, for all indices t_1 < t_2 < ... < t_m and all non-negative integers k, the random vectors (z_{t_1}, ..., z_{t_m}) and (z_{t_1+k}, ..., z_{t_m+k}) have the same distribution.

According to Definition 1, stationarity implies that the marginal distribution of z_t is independent of t. Without the stationarity condition, it was shown in [1] that the convergence of the training error to the expected risk can occur arbitrarily slowly. For two σ-fields 𝒜 and ℬ, define the α-coefficient as

α(𝒜, ℬ) := sup_{A∈𝒜, B∈ℬ} |P(A ∩ B) − P(A)P(B)|.   (1)

Denote by σ_a^b the σ-field generated by the random variables z_a, z_{a+1}, ..., z_b (with b = ∞ allowed). The strong mixing condition is defined as follows.

Definition 2

A random sequence {z_t}_{t≥1} is said to satisfy a strong mixing condition (or α-mixing condition) if

α_k := sup_{j≥1} α(σ_1^j, σ_{j+k}^∞) → 0  as k → ∞.   (2)

Here, α_k is called the strong mixing coefficient. If there are constants a, c, γ > 0 such that

α_k ≤ a exp(−c k^γ) for all k ≥ 1,   (3)

then {z_t}_{t≥1} is said to be geometrically α-mixing. If there are constants a > 0 and γ > 0 such that

α_k ≤ a k^{−γ} for all k ≥ 1,   (4)

then {z_t}_{t≥1} is said to be algebraically α-mixing.

It is easy to derive from Definition 2 that i.i.d. random sequences satisfy α_k = 0 for all k. Thus, the strong mixing condition is a reasonable extension of classical i.i.d. sampling. In the following, we provide several examples from time series and dynamical systems that generate strong mixing sequences.
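
To make the two decay regimes in Definition 2 concrete, the short computation below evaluates the bounds in (3) and (4) for a few gaps k; the constants and exponents are chosen purely for illustration and are not values used in the paper.

```python
import numpy as np

k = np.arange(1, 6)

# Geometric alpha-mixing bound (3): alpha_k <= a * exp(-c * k**gamma); here a = 1, c = 1, gamma = 1.
geometric = 1.0 * np.exp(-1.0 * k)

# Algebraic alpha-mixing bound (4): alpha_k <= a * k**(-gamma); here a = 1, gamma = 2.
algebraic = 1.0 * k**(-2.0)

for kk, g, alg in zip(k, geometric, algebraic):
    print(f"k={kk}: geometric bound {g:.4f}, algebraic bound {alg:.4f}")
```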

Example 1 (Nonparametric ARX(p,q) model): Suppose is generated according to

where and are independent with . Then it can be found in [12, p.102] (see also [8, Proposition 1]) that under certain conditions on and and some boundedness assumption of , is strong mixing.
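
As a deliberately simple illustration of such a data-generating mechanism, the snippet below simulates a nonlinear ARX(1,1)-type sequence; the particular nonlinearities and noise level are invented for the example and are not the conditions required in [12] or [8].

```python
import numpy as np

def simulate_arx(n, seed=None):
    """Simulate y_t = f(y_{t-1}) + g(x_t) + eps_t, a toy nonlinear ARX(1,1) sequence."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)          # exogenous input
    eps = 0.1 * rng.standard_normal(n)  # i.i.d. noise
    y = np.zeros(n)
    for t in range(1, n):
        # Bounded nonlinearities keep the chain stable (and plausibly strong mixing).
        y[t] = 0.5 * np.tanh(y[t - 1]) + 0.3 * np.sin(x[t]) + eps[t]
    return x, y

x, y = simulate_arx(1000, seed=0)
print(y[:5])
```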

Example 2 (ARMA process): Suppose that the time series satisfies the ARMA equation

where , are real matrices and is an i.i.d. random vector. Then, it can be found in [26] that under some restrictions on , is strong mixing.

Example 3 (Dynamic Tobit process): Suppose satisfies

where are fixed parameters, is an exogenous regressor and is a disturbance term. It was proved in [16] that under certain conditions on , is strong mixing.

Besides the proposed examples, there are numerous strong mixing sequences, including the AR(p) process, MA(p) process, ARIMA process, nonlinear ARCH process, GARCH process, Harris chains and linear time-invariant dynamical systems. We refer the readers to [12] for more examples of strong mixing sequences.

II-B DKRR with strong mixing sequences

The strong mixing condition describes the relation between past and future random variables and therefore imposes a strict restriction on the order of the samples. This brings additional difficulty in designing distributed learning algorithms, since a suitable ordering of the data subsets is required. Let D_j be the j-th data subset stored on the j-th local machine, where |D_j| denotes the cardinality of D_j. Assume for . Our aim is to design distributed learning algorithms to tackle such distributively stored and dependent data. We are interested in two popular scenarios.

Scenario 1 (Tandem case): Suppose the whole data set D is a strong mixing sequence with α-mixing coefficients {α_k}. When |D| is too large, D must be distributively stored on local machines with a suitable order. Under this circumstance, each D_j is also a strong mixing sequence with α-mixing coefficients {α_k}.

Scenario 2 (Parallel case): Let D_1, ..., D_m be strong mixing sequences with α-mixing coefficients {α_k}. Assume further that the data stored on different local machines are independent. Then it is easily derived from (2) that D = D_1 ∪ ... ∪ D_m (arranged in this order) is also a strong mixing sequence with α-mixing coefficients {α_k}. The independence assumption on different data subsets is mild. Indeed, if the initial samples used to generate D_1, ..., D_m are drawn independently, then Examples 1-3 succeed in generating strong mixing sequences as required.
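
Operationally, the tandem case amounts to splitting one long sequence into contiguous, order-preserving blocks, while the parallel case simply keeps each independently generated chain on its own machine. A sketch under these assumptions (the helper names are ours):

```python
import numpy as np

def tandem_split(sequence, m):
    """Scenario 1: split one long dependent sequence into m contiguous,
    order-preserving blocks; each block is again strong mixing with the
    same mixing coefficients as the full sequence."""
    return np.array_split(sequence, m)

def parallel_collect(chains):
    """Scenario 2: each local machine already holds its own independently
    generated mixing chain; nothing needs to be reshuffled."""
    return list(chains)

blocks = tandem_split(np.arange(10), 3)
print([b.tolist() for b in blocks])   # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```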

Under both scenarios, we are interested in implementing DKRR on D = D_1 ∪ ... ∪ D_m. Let K be a Mercer kernel on a compact metric (input) space X, and let (H_K, ‖·‖_K) be the corresponding reproducing kernel Hilbert space (RKHS). Then, for any f ∈ H_K and x ∈ X, the reproducing property yields

f(x) = ⟨f, K_x⟩_K,   (5)

where K_x := K(x, ·).

Distributed kernel ridge regression (DKRR) is defined [39, 23], with a regularization parameter λ > 0, by

f̄_{D,λ} = Σ_{j=1}^m (|D_j| / |D|) f_{D_j,λ},   (6)

where

f_{D_j,λ} = arg min_{f ∈ H_K} { (1/|D_j|) Σ_{(x,y)∈D_j} (f(x) − y)² + λ‖f‖_K² }.   (7)

Optimal learning rates of DKRR have been established in [39, 23, 6, 27], provided the data in D are i.i.d. sampled and the number of local machines m is not too large. However, the learning performance of f̄_{D,λ} remains open when the i.i.d. assumption is removed.
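
For concreteness, a minimal implementation of the local estimator (7) and the weighted synthesis (6) is sketched below; the Gaussian kernel, the regularization value and the closed-form solve through the kernel matrix are standard choices of ours, not prescriptions of the paper.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def krr_fit(X, y, lam, sigma=1.0):
    """Local KRR (7): (1/n) sum (f(x_i)-y_i)^2 + lam*||f||_K^2 has the
    representer solution alpha with (K + n*lam*I) alpha = y."""
    n = len(y)
    K = gaussian_kernel(X, X, sigma)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
    return lambda Xq: gaussian_kernel(Xq, X, sigma) @ alpha

def dkrr_fit(subsets, lam, sigma=1.0):
    """DKRR (6): weighted average of local KRR estimators, weights |D_j|/|D|."""
    locals_ = [(len(y), krr_fit(X, y, lam, sigma)) for X, y in subsets]
    N = sum(n for n, _ in locals_)
    return lambda Xq: sum(n / N * f(Xq) for n, f in locals_)

# Toy usage: two contiguous blocks of a noisy sine curve (tandem-style split).
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, 200))[:, None]
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
subsets = [(X[:100], y[:100]), (X[100:], y[100:])]
f_bar = dkrr_fit(subsets, lam=1e-3)
print(f_bar(np.array([[3.0]]))[0])  # should be close to sin(3.0)
```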

III Main Results

In this section, we present our main results on the learning performance of KRR and DKRR for tackling strong mixing data.

III-A Setup and assumptions

Our analysis is carried out in a standard learning theory setting [10]. On each local machine, the data subset D_j = {(x_t, y_t)}, with x_t ∈ X and y_t ∈ Y ⊆ ℝ, is a realization of some stochastic process. We assume the common distribution of the process to be an unknown but definite distribution ρ on Z := X × Y, where ρ_X is the marginal distribution on X and ρ(y|x) is the conditional distribution of y given x. Our purpose is to find an estimator f based on D that minimizes the expectation risk

ℰ(f) := ∫_Z (f(x) − y)² dρ.

Noting that the well-known regression function f_ρ(x) := ∫_Y y dρ(y|x) minimizes the expectation risk [10], the learning task is then to find an estimator f_D that minimizes the distance

‖f_D − f_ρ‖_ρ,

where ‖·‖_ρ denotes the norm of the Hilbert space L²_{ρ_X} of square-integrable functions with respect to ρ_X. For this purpose, we introduce four types of assumptions concerning the mixing property of the samples, the boundedness of the outputs, the capacity of H_K and the regularity of f_ρ, respectively.

Assumption 1: Assume that the samples in each D_j are stationary and strong mixing with α-mixing coefficients {α_k}. Furthermore, we assume that there is a suitable arrangement of D_1, ..., D_m such that D is also a stationary strong mixing sequence with α-mixing coefficients {α_k}.

The stationarity and strong mixing assumption is a widely used condition to quantify the dependence of samples. It has been adopted in [25, 36, 33, 14, 17] to derive learning rates for KRR and is weaker than the mixing conditions used in [37, 8, 2]. Due to Scenarios 1 and 2, the stationarity and strong mixing assumption on D is mild in practice.

Assumption 2: There exists a positive constant M such that |y| ≤ M almost surely.

Since we are always concerned with learning problems with finitely many samples, it is easy to derive an upper bound on the outputs. Due to Assumption 2, it follows from the definition of f_ρ that |f_ρ(x)| ≤ M for any x ∈ X. The Mercer kernel K defines an integral operator L_K on H_K (or on L²_{ρ_X}) by

L_K f := ∫_X K(·, x) f(x) dρ_X(x).

Our third assumption concerns the capacity of H_K, measured by the effective dimension [15, 23]

N(λ) := Tr((λI + L_K)^{−1} L_K),  λ > 0,

where Tr(T) denotes the trace of the trace-class operator T.

Assumption 3: There exists some 0 < s ≤ 1 such that

N(λ) ≤ C_0 λ^{−s} for all λ > 0,   (8)

where C_0 is a constant independent of λ.

Condition (8) with s = 1 is always satisfied by taking C_0 = Tr(L_K). For 0 < s < 1, it was shown in [15] that the assumption is more general than the eigenvalue decay assumption in the literature [5, 31, 39]. Assumption 3 has been employed in [15, 23, 6] to derive optimal learning rates for kernel-based algorithms.
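
In practice the effective dimension can be approximated from the eigenvalues of the empirical operator; a rough sketch, where the kernel matrix divided by the sample size stands in for L_K and the Gaussian kernel is our choice, is given below.

```python
import numpy as np

def effective_dimension(K, n, lam):
    """Approximate N(lambda) = Tr((lam*I + L_K)^{-1} L_K) using the
    eigenvalues of the empirical operator, represented by K/n."""
    eigvals = np.linalg.eigvalsh(K / n)
    eigvals = np.clip(eigvals, 0.0, None)   # guard against tiny negative round-off
    return float(np.sum(eigvals / (eigvals + lam)))

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (300, 1))
d2 = (X - X.T) ** 2                          # squared distances for 1-d inputs
K = np.exp(-d2 / 0.1)                        # Gaussian kernel matrix
for lam in [1e-1, 1e-2, 1e-3]:
    print(lam, effective_dimension(K, len(X), lam))
```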

Assumption 4: For some r > 0, assume

f_ρ = L_K^r h_ρ for some h_ρ ∈ L²_{ρ_X},   (9)

where L_K^r denotes the r-th power of L_K.

Condition (9) describes the regularity of f_ρ and has been adopted in a large body of literature [29, 5, 15, 23] to quantify the learning rates of various algorithms. It can be seen from Example 1 that there are numerous regression functions and data-generation mechanisms satisfying Assumptions 1-4.
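
To give condition (9) a computational reading, the snippet below builds a function of the form L_K^r h from the eigen-decomposition of an empirical kernel matrix, which serves as a finite-dimensional stand-in for L_K; the kernel, the value of r and the vector h are all illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 200, 0.75
X = rng.uniform(0, 1, (n, 1))
K = np.exp(-((X - X.T) ** 2) / 0.1) / n      # empirical surrogate of L_K
eigvals, U = np.linalg.eigh(K)
eigvals = np.clip(eigvals, 0.0, None)

h = rng.standard_normal(n)                    # plays the role of h_rho in (9)
f_values = U @ (eigvals**r * (U.T @ h))       # f = L_K^r h, evaluated at the sample points
print(f_values[:3])
```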

III-B Optimal learning rates for KRR

KRR is one of the most popular learning algorithms in learning theory [10, 30]. Its optimal learning rates for i.i.d. samples were established in [5, 31, 6, 23] by using the integral operator approach. Based on Bernstein-type concentration inequalities [37, 25], generalization error bounds for non-i.i.d. sequences have also been derived in [36, 32, 17] under various mixing conditions. However, there is a dilemma in the existing literature: the dependence of samples always degenerates the derived learning performance of KRR, no matter which mixing condition is imposed. An extreme case is that even for geometrically α-mixing sequences satisfying (3) with a very large decay exponent [36], there remains a gap between the rates for i.i.d. samples and non-i.i.d. sequences.

In this section, we borrow the integral operator approach from [23, 15] and a Banach-valued covariance inequality from [11], and succeed in providing a sufficient condition on the α-mixing coefficients to guarantee the optimal learning rates of KRR. The following theorem presents a learning rate analysis of KRR under the norm.

Theorem 1

Let . Under Assumption 1-Assumption 4 with and , if , then

where is a constant independent of .

Based on Theorem 1, we can deduce the following corollary directly.

Corollary 1

Under Assumption 1-Assumption 4 with satisfying (4), and , if and in (4) satisfies

(11)

then

(12)

where is a constant independent of .

It can be found in [23, 6] that, under (11), the derived learning rate (12) for KRR achieves the existing optimal one for i.i.d. samples. It should be noted that (11) is satisfied in numerous applications in time series and dynamical systems, including Examples 1-4. Since (3) implies (11), Corollary 1 holds for any geometrically α-mixing sequence.

In the following theorem, we present an error estimate for KRR under the norm.

Theorem 2

Let . Under Assumption 1-Assumption 4 with , and , if , then

where is a constant independent of .

With the help of Theorem 2, we can deduce the following sufficient condition for optimal learning rates.

Corollary 2

Under Assumption 1-Assumption 4 with , and , if and in (4) satisfies

(14)

then

(15)

where is a constant independent of .

It can be found in [5, 31, 23] that, under the same conditions, this is the optimal learning rate of KRR for i.i.d. samples. Corollary 2 therefore provides a sufficient condition on the mixing property of dependent samples to guarantee the optimal learning rate. The rate in (15) shows that, under (14), strong mixing sequences behave similarly to i.i.d. samples for KRR. Compared with the i.i.d. case, additional restrictions on the regularity of f_ρ and the capacity of H_K are imposed in our results. This is a technical assumption and is much stricter than the widely used condition in [5, 31, 23]. We believe that it can be relaxed, just as Theorem 1 does for the norm; we leave this for future study. A special case of Corollary 2 is that the capacity-independent (i.e., s = 1 in Assumption 3) optimal learning rates hold for KRR with strong mixing sequences.

III-C Optimal learning rates for DKRR

In this part, we study the learning performance of the DKRR estimator defined by (6). It was shown in [39, 23] that DKRR maintains the learning performance of KRR with the whole data, provided the number of local machines is not too large and the samples are drawn i.i.d. Our result shows that, under a stricter condition on the number of local machines, DKRR for algebraically α-mixing sequences can also achieve the optimal learning rate.

Theorem 3

Let . Under Assumption 1-Assumption 4 with , and , if , then

(16)

where is a constant independent of .

The following corollary follows directly from Theorem 3.

Corollary 3

Let . Under Assumption 1-Assumption 4 with , and , if , in (4) satisfies (14), and satisfies

(17)

then

(18)

and

(19)

where is a constant independent of .

Comparing Corollary 3 with [23, Corollary 11], our condition on the number of local machines is much stricter, which reflects the difficulty of dealing with distributively stored non-i.i.d. samples. Based on Theorem 3 and Corollary 3, we rigorously verify that DKRR with algebraically α-mixing sequences can reach the optimal learning rate, provided the number of local machines is relatively small. This extends the existing distributed learning theory from i.i.d. samples to strong mixing sequences and broadens the applicable range of distributed learning.

IV Related Work and Comparisons

In this section, we compare our results with related work and present some discussions on distributed learning for non-i.i.d. samples. Studying the learning performance of KRR for strong mixing sequences is a classical and long-standing research topic. It dates back at least to 1996, when [25] derived a Bernstein-type concentration inequality for α-mixing sequences satisfying (3). This is a nice result, which was utilized in [36, 32, 17] to derive learning rates for KRR. The problem, however, is that the concentration inequality for α-mixing sequences is somewhat worse than that for i.i.d. samples, since the dependence of the data is doomed to reduce the number of effective samples. As a result, all learning rates derived from such Bernstein-type concentration inequalities and their improvements are worse than the optimal learning rates established in [5]. In particular, it can be found in [32, Example 2.5] and [17, Example 4.3] that under an additional marginal assumption

for some and depending only on , learning rates of KRR are of order , where is a positive number depending on the and the dimension of .

Noticing this dilemma, [33] is, to the best of our knowledge, the first work to study the learning rates without using Bernstein-type concentration inequalities. Instead, [33] adopted a Banach-valued covariance inequality developed in [11] and the integral operator approach in [29] to derive learning rates of KRR for strong mixing sequences satisfying (4). However, their results are capacity-independent (i.e., they correspond to s = 1 in Assumption 3), making the learning rates sub-optimal. In this paper, we follow the idea of [33] but use the recently developed integral operator approach in [23, 15]. We improve the learning rates of KRR to the optimal ones under looser restrictions on the mixing conditions. We believe this is the first result on optimal learning rates of KRR for non-i.i.d. samples.

In the era of big data, nonparametric distributed learning (NDL) is a popular strategy to tackle massive and sensitive data. If the samples are drawn i.i.d., optimal learning rates of NDL have been established in [39, 23, 15, 27, 24, 22]. A general conclusion is that under Assumptions 2-4, if and , then DKRR for i.i.d. samples achieves the optimal learning rates. Our Theorem 3 extends these results from i.i.d. samples to algebraically α-mixing sequences, at the price of a stricter restriction on the number of local machines, i.e. from to . It should be highlighted that such a degradation is reasonable, since the dependence of samples destroys the classical error decomposition for distributed learning [15, 6] and requires a novel error analysis technique. Based on our results, we find that NDL, especially DKRR, is a good candidate for overcoming the massive data challenges, since it is efficient for distributively stored, private and dependent data.

V Step-stones for Analysis

Since the analysis technique based on the Bernstein-type concentration inequality [25] cannot provide optimal learning rates, we turn to the integral operator approach, just as [33] did. However, our main tool is a recently developed integral operator technique from [4, 23, 15], which succeeds in deriving tight bounds on several differences between the integral operator and its empirical version. The following covariance inequality for Hilbert-valued strong mixing sequences [11, Lemma 2.2], which was also used in [33], is another tool in our analysis.

Lemma 1

Let and be random variables with values in a separable Hilbert space measurable -field and and having finite -th and

-th moments respectively. If

with or , , then

(20)

where denotes the -th moment as if and , and is defined by (1).

V-A Integral operator approach for strong mixing sequences

Let S_D : H_K → ℝ^{|D|} be the sampling operator [29] defined by

S_D f := (f(x))_{(x,y)∈D}.

Its scaled adjoint S_D^T : ℝ^{|D|} → H_K is given by

S_D^T c := (1/|D|) Σ_{i=1}^{|D|} c_i K_{x_i},  c = (c_1, ..., c_{|D|})^T ∈ ℝ^{|D|}.

Define the empirical integral operator

L_{K,D} := S_D^T S_D,  i.e.,  L_{K,D} f = (1/|D|) Σ_{(x,y)∈D} f(x) K_x for f ∈ H_K.

Then, it can be found in [23] that

f_{D,λ} = (L_{K,D} + λI)^{−1} S_D^T y_D,   (21)

and

f̄_{D,λ} = Σ_{j=1}^m (|D_j| / |D|) (L_{K,D_j} + λI)^{−1} S_{D_j}^T y_{D_j},   (22)

where y_D := (y_1, ..., y_{|D|})^T denotes the vector of outputs in D.
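
Restricted to the span of {K_{x_i}}, the empirical operator acts on coefficient vectors through the kernel matrix, so the representation (21) can be checked numerically against the usual kernel-matrix solution of KRR; the sketch below does this for a Gaussian kernel of our choosing.

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 50, 1e-2
X = rng.uniform(0, 1, (n, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(n)
K = np.exp(-((X - X.T) ** 2) / 0.1)              # Gaussian kernel matrix

# Operator form (21): on span{K_{x_i}}, L_{K,D} acts on coefficient vectors as K/n,
# and S_D^T y_D has coefficient vector y/n, so f_{D,lambda} corresponds to
# solving (K/n + lam*I) c = y/n.
c_operator = np.linalg.solve(K / n + lam * np.eye(n), y / n)

# Classical kernel-matrix form of KRR: (K + n*lam*I) alpha = y.
c_matrix = np.linalg.solve(K + n * lam * np.eye(n), y)

print(np.allclose(c_operator, c_matrix))          # True: the two representations agree
```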

Our first tool derives upper bounds for two operator differences in the following lemma, whose proof is given in the Appendix.

Lemma 2

Let λ > 0. For an α-mixing sequence, we have

(23)

and

(24)

As shown in [15, 24], a certain operator product plays a crucial role in deriving optimal learning rates for KRR and DKRR under i.i.d. sampling. In the following lemma, we adopt the recently developed second-order decomposition for operator differences from [23, 15] to present an upper bound on this product for strong mixing sequences; the proof is also postponed to the Appendix.

Lemma 3

Let λ > 0. For an α-mixing sequence, there holds

With the help of Lemma 1, we provide our final tool, which gives tight bounds for the difference between two functions in the following lemma.

Lemma 4

Let λ > 0 and |y| ≤ M almost surely. For an α-mixing sequence, we have

(25)

and

(26)

V-B Error decompositions for KRR

We present error decompositions for KRR under two different norms. Define the population version of f_{D,λ} to be

f_λ := (L_K + λI)^{−1} L_K f_ρ.   (27)

Then (21) and (27) yield the following two decompositions.

and

(29)

Denote

(30)
(31)
(32)
(33)
(34)

We have the following error decomposition for KRR under the norm.

Proposition 1

Under Assumption 4 with , we have

(35)

Proof. It follows from the triangle inequality that

(36)

But Assumption 4 with implies [6]

(37)

Thus, it suffices to provide an upper bound for the variance term. Due to the basic inequality [4]

(38)

for positive operators and , it follows from the Schwarz inequality, (30), (31), (32) and (29) that

(39)

Since Assumption 4 holds with , it is easy to check

(40)

Plugging (40) into (39) and inserting the obtained estimate together with (37) into (36), we obtain (35) directly. This completes the proof of Proposition 1.

The error decomposition under the norm is more sophisticated for strong mixing sequences. Different from the previous decomposition under the norm based on (V-B) and the method in [29, 23] based on (29), our approach adopts both (V-B) and (29). In fact, our analysis is split into two stages: the first is to bound by using (V-B), and the other is to estimate by utilizing (29).

Proposition 2

If Assumption 4 holds with , then

(41)