Learning Under Distributed Features

05/29/2018 ∙ by Bicheng Ying, et al. ∙ EPFL

This work studies the problem of learning under both large data and large feature space scenarios. The feature information is assumed to be spread across agents in a network, where each agent observes some of the features. Through local cooperation, the agents are supposed to interact with each other to solve the inference problem and converge towards the global minimizer of the empirical risk. We study this problem exclusively in the primal domain, and propose new and effective distributed solutions with guaranteed convergence to the minimizer. This is achieved by combining a dynamic diffusion construction, a pipeline strategy, and variance-reduced techniques. Simulation results illustrate the conclusions.


I Introduction and Problem Formulation

Large-scale optimization problems are common in data-intensive machine learning problems [2, 3, 4, 5]. For applications where both the size of the dataset and the dimension of the feature space are large, it is not uncommon for the dataset to be too large to be stored or even processed effectively at a single location or by a single agent. In this article, we examine the situation where the feature data are split across agents, either due to privacy considerations or because they are already physically collected in a distributed manner by a networked architecture and aggregating the data at a central location entails unreasonable costs. More specifically, the entries (blocks) of the feature vector are assumed to be distributed over a collection of networked agents, as illustrated in Fig. 1. For instance, in sensor network applications [6, 7], multiple sensors are normally employed to monitor an environment; the sensors are distributed over space and can be used to collect different measurements. Likewise, in multi-view learning problems [8, 9], the observed model is represented by multiple feature sets. Another example is the Cournot competition problem in networked markets [10, 11, 12], where individual factories have information about their local markets, which they will not share with each other. Distributed dictionary learning problems [13] also fit into this scenario if the dictionary is viewed as the feature.

Fig. 1: Distributing the features across the networked agents.

In this work, we focus on empirical risk minimization problems, which, by ergodicity arguments, provide a good approximation for average risks [14, 7, 15, 16, 17, 18, 19, 20]. Formally, we consider an empirical risk of the form:

(1)

where the unknown parameter model (or separating hyperplane) is designated by , while denotes the -th feature vector and the corresponding scalar label. Moreover, the notation refers to the loss function, which is assumed to be differentiable and convex over . In most problems of interest, the loss function depends on the inner product rather than on the individual terms . The factor represents the regularization term. We denote the minimizer of in (1) by . Although we are assuming and to be column vectors, and to be a scalar, the analysis can easily be extended to matrix quantities and to vector labels , which we will illustrate later in the simulations.
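For example (an illustration of ours, using generic notation w for the model, h_n for the n-th feature vector, and γ(n) for its label), the regularized logistic-regression loss Q(w; h_n, γ(n)) = ln(1 + exp(−γ(n) h_n^T w)) depends on the feature vector only through the inner product h_n^T w, which is precisely the structure assumed above.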

Since the entries of the feature vector are assumed to be distributed over agents, we partition each , and similarly the weight vector , into sub-vectors denoted by , where :

(2)

Each sub-feature vector and sub-vector are assumed to be located at agent . The dimensions of can vary over , i.e., over agents. In this way, the empirical risk function can be rewritten in the form

(3)

where we are also assuming that the regularization term satisfies an additive factorization of the form

(4)

with regularization applied to each sub-vector . This property holds for many popular regularization choices, such as , , KL-divergence, etc. Observe that in the form (3), the argument of the loss function is now a sum over the inner products . That is, we have a “cost-of-sum” form similar to what was discussed in [13]. Our objective is to optimize (3) over the and to seek the optimal values in a distributed manner.
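To make the partitioning in (2) and the cost-of-sum structure in (3) concrete, the following minimal sketch (our own illustration; names such as sizes and K are not from the paper) splits one feature vector and the weight vector into per-agent blocks and verifies that the sum of the local inner products equals the global inner product entering the loss:

```python
import numpy as np

# Sketch of the feature partition (2): one feature vector h and the model w
# are split into K blocks, one per agent; block sizes may differ across agents.
K = 4                                  # number of agents (illustrative)
sizes = [3, 5, 2, 4]                   # per-agent block dimensions
M = sum(sizes)

rng = np.random.default_rng(0)
h = rng.standard_normal(M)             # one feature vector
w = rng.standard_normal(M)             # global model

bounds = np.cumsum([0] + sizes)
h_blocks = [h[bounds[k]:bounds[k + 1]] for k in range(K)]   # block at agent k
w_blocks = [w[bounds[k]:bounds[k + 1]] for k in range(K)]   # sub-vector at agent k

# Each agent only needs its local inner product; the sum of the local inner
# products equals the global inner product that the loss depends on ("cost-of-sum").
local_scores = [hk @ wk for hk, wk in zip(h_blocks, w_blocks)]
assert np.isclose(sum(local_scores), h @ w)
```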

I-A Related Works

Problems of this type have been pursued before in the literature by using duality arguments, such as those in [15, 13, 22, 23, 24]. One common way to do that is to transform problem (3) into a constrained problem, say, as:

(5)

Introducing the dual variable , we get the Lagrangian function:

(6)

Next, exploiting a duality argument, problem (5) is equivalent to solving the following dual problem:

(7)

where the scalar denotes the dual variable corresponding to the th constraint, and and represent the conjugate functions, i.e., , of and , respectively. Note that the function in (7) has the format of a “sum-of-cost” and each term inside the summation can be computed by an individual agent alone. Therefore, problem (7) can be solved by a number of traditional distributed algorithms [18, 25, 19, 26, 14]. However, the resulting algorithms suffer from some limitations. One limitation is that the term inside of has complexity to compute. Another limitation is that the resulting algorithms rely on the use of conjugate functions, which are not always available in closed form; this will greatly depend on the nature of the loss function . This limitation is worse for nonlinear models, say, for:

(8)

where is some nonlinear function. These difficulties do not arise if we pursue a solution directly in the primal domain. For example, the case of nonlinear models can be handled through the chain rule of differentiation (as in backpropagation) when deriving stochastic gradient algorithms. Furthermore, we are often interested more directly in the primal rather than the dual variable.
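To make the conjugate-function requirement concrete, recall that the conjugate of a scalar function Q(z) is Q*(ν) = sup_z (νz − Q(z)). As an illustration of ours (not an example from the paper), for the quadratic loss Q(z) = (1/2)(z − γ)^2 the conjugate is available in closed form, Q*(ν) = (1/2)ν^2 + γν, whereas for many other losses, and in particular for nonlinear models of the form (8), no closed-form conjugate is available.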

With regard to the large feature space, one may be motivated to consider coordinate descent techniques [27, 28], which pursue optimization one coordinate at a time. However, these techniques still face difficulties in both the primal and dual domains. For instance, in the primal domain [29, 30, 31], they will generally require two time-scales: one scale for updating the gradients and a second, faster scale for running multiple averaging iterations. This limits the rate at which data can be sampled because the inner calculation needs to be completed before the next datum can be processed. The same difficulties mentioned above for the dual methods arise again if coordinate descent solutions are implemented in the dual domain [32, 33, 23]. For these reasons, the approach proposed in this work does not rely directly on coordinate descent implementations.

Other useful approaches to solving problems of the form (3) are the Alternating Direction Method of Multipliers (ADMM) [34, 22, 35] and primal-dual methods [36, 37, 18, 25]. These techniques have good convergence properties but continue to suffer from high computational costs and two-time-scale communications. This can be observed from (6), where the gradient relative to the dual variable involves the term . This term requires extra communication.

I-B Novelty and Contributions

In this work, we propose a stochastic solution method for (3) that operates directly in the primal domain. By avoiding the dual formulation, we arrive at a simple and effective method even for large-scale applications. We exploit the idea of dynamic consensus [38, 39], which has been adopted in distributed optimization algorithms to track the average of gradients; see [40, 30, 41, 42, 43]. In contrast, due to the different problem setting, we are interested in tracking the sum of the scores, . More importantly, we will show that the proposed method converges at a linear rate to the exact minimizer of the empirical risk even under constant step-size learning. We will also exploit variance-reduced techniques [44] and a pipeline strategy [45] to obtain a reduced-complexity algorithm that requires only operations per iteration. The algorithm will not require two (or separate) time-scales and will not necessitate the solution of auxiliary sub-optimization problems, as is common in prior methods in the literature.

Problems similar to (3) were also studied in [46] using primal methods similar to ours, but in a deterministic setting; that approach is not well-suited for big-data applications.

Notation: We use plain letters for deterministic variables and boldface letters for random variables. We also use to denote the expectation with respect to , to denote a column vector formed by stacking , to denote transposition, and for the 2-norm of a matrix or the Euclidean norm of a vector. Throughout the paper, we use the subscript as the index of data, the subscript as the index of iteration/time, and as the index of agent. We also use as a superscript with the same meaning, i.e., the index of iteration/time. The notation .

II Preliminary Naïve Solution

We first propose a simple and naïve solution, which will be used to motivate the more advanced algorithms in later sections.

II-A Networked Agent Model and Consensus Strategy

To begin with, we introduce several preliminary concepts that will be exploited in later sections.

We consider the graph model shown in Fig. 1. In this construction, the communication network of the agents is modeled as a fixed directed graph , where is the set of edges. The edge means that agent can send a message to agent , where we associate the weight as a nonnegative factor that scales the information from agent to agent . We assume the combination matrix is symmetric and doubly-stochastic, i.e.,

(9)

We also assume that for at least one agent and the underlying graph is strongly connected.

Now assume there is a signal at each agent . Then, a well-studied and traditional algorithm for the agents to learn the average value of all the signals is to run the consensus iteration[26, 14, 19, 47]:

(10)

where the notation denotes the set of neighbors of agent . In this way, each agent starts from its own observation vector and continually averages the state values of its neighbors. After sufficiently many iterations, it is well known that

(11)

under some mild conditions on [48, 47, 49, 50, 19].
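As a small illustration of the consensus recursion (10) (our own sketch; the ring topology and the particular weights are illustrative choices), each agent's state converges to the network-wide average:

```python
import numpy as np

# Consensus averaging sketch: a ring of K agents with a symmetric,
# doubly-stochastic combination matrix (each row/column sums to one).
K = 5
A = np.zeros((K, K))
for k in range(K):
    A[k, k] = A[k, (k - 1) % K] = A[k, (k + 1) % K] = 1.0 / 3.0

x = np.random.default_rng(1).standard_normal(K)   # initial signals at the agents
target = x.mean()                                 # the desired network average

for i in range(200):                              # repeated consensus steps (10)
    x = A @ x                                     # each agent mixes its neighbors' states

print(np.max(np.abs(x - target)))                 # ~0: every agent is near the average
```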

II-B Naïve Solution

Now, let us consider the problem of minimizing (3) by means of a stochastic gradient recursion. Let denote the inner product that is available at agent at time and define

(12)

which is the argument of in (3):

(13)

If we denote the average of the local inner products by

(14)

then the variable is a scaled multiple of , namely, . Now, the stochastic-gradient step for solving (13) involves approximating the true gradient vector of by the gradient vector of the loss function evaluated at some randomly selected data pair , where at iteration denotes the index of the sample pair selected uniformly at random from the index set . Doing so, the stochastic gradient recursion takes the form:

(15)

Note that is independent of the iterates . Recalling that and are partitioned into blocks, we can decompose (15) into parallel recursions run at the local agents:

(16)

One main problem with (16) is that it is not a distributed solution, because agents need to calculate , whose value depends on the sub-vectors from all agents and not just on the sub-vector from agent . This difficulty suggests one initial solution method.

Since the desired variable is proportional to the average value , a consensus-type construction can be used to approximate this average. For some total number of iterations , each agent would start from and repeat the following calculations times:

(17)

However, this mode of operation requires the agents to complete consensus updates between two data arrivals and therefore involves a two-time-scale operation: a faster time-scale for the consensus iterations and a slower time-scale for data sampling and gradient computation. One simplification is to set and to have each agent perform only a single combination step to generate the variable:

(18)

where we are expressing the result of this single combination by to indicate that this is the estimate for that is computed at agent at iteration . Observe also that we are scaling the quantity inside the sum by since, as explained before, . We list the resulting algorithm in (19)–(21).

Observe that this implementation requires all agents to use the same random index at iteration . Although this requirement may appear restrictive, it can still be implemented in distributed architectures. For example, each agent can be initialized with the same random seed so that all agents generate the same index at iteration . To agree on the same random seed in a fully distributed manner, one possibility is to run the consensus algorithm on the seeds during a setup phase: each agent generates a random seed number, runs the consensus algorithm until it converges, and rounds the result to the nearest integer. Alternatively, agents can sample the data in a cyclic manner instead of uniformly at random. In the main body of this paper, we assume for simplicity that all agents sample the same index at iteration .
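As a quick illustration of the shared-sampling requirement (our own sketch; the seed value is arbitrary), once the agents agree on a common seed, independently constructed generators produce identical index sequences:

```python
import numpy as np

N = 1000            # dataset size (illustrative)
shared_seed = 42    # agreed upon during a setup phase, e.g., via consensus

# Each agent builds its own generator from the common seed...
agents = [np.random.default_rng(shared_seed) for _ in range(5)]

# ...so at every iteration they all draw the same sample index.
for i in range(3):
    indices = [int(g.integers(N)) for g in agents]
    assert len(set(indices)) == 1
```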

 
Algorithm 1 (Naïve feature-distributed method for agent )
 
Initialization: Set .
Repeat for :

(19)
(20)
(21)

End
 
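For concreteness, the following simulation sketch (our own; it uses a quadratic loss and an ℓ2 regularizer of our choosing, since the displayed equations (19)–(21) are only schematic here) mimics the naïve strategy: each agent computes its local inner product, performs one scaled combination step with its neighbors, and then takes a stochastic gradient step on its own block:

```python
import numpy as np

# Simulation sketch of the naive feature-distributed strategy (Algorithm 1).
# Loss: 0.5*(z - gamma_n)^2 with z = sum_k h_{k,n}^T w_k, plus 0.5*rho*||w_k||^2
# per agent. All names and constants are illustrative choices.
rng = np.random.default_rng(0)
K, N = 4, 200                              # agents, samples
sizes = [3, 5, 2, 4]                       # per-agent feature-block sizes
H = [rng.standard_normal((N, m)) for m in sizes]          # local feature blocks
w_true = [rng.standard_normal(m) for m in sizes]
gamma = sum(Hk @ wk for Hk, wk in zip(H, w_true))         # labels from the full model

# Ring combination matrix (symmetric, doubly stochastic).
A = np.zeros((K, K))
for k in range(K):
    A[k, k] = A[k, (k - 1) % K] = A[k, (k + 1) % K] = 1.0 / 3.0

mu, rho = 0.005, 0.01                      # small step-size, as the text requires
w = [np.zeros(m) for m in sizes]

for i in range(2000):
    n = rng.integers(N)                                   # common sample index
    p = np.array([H[k][n] @ w[k] for k in range(K)])      # local inner products
    y = K * (A @ p)                                       # one combination step, scaled by K,
                                                          # as a crude estimate of sum_k p_k
    for k in range(K):
        # stochastic gradient step on agent k's block only
        w[k] = w[k] - mu * ((y[k] - gamma[n]) * H[k][n] + rho * w[k])
```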

II-C Limitations

Algorithm 1 is easy to implement. However, it suffers from two major drawbacks. First, the variable generated by the combination step (20) is not generally a good approximation for the global variable . This approximation error affects the stability of the algorithm and requires the use of very small step-sizes. A second drawback is that the stochastic-gradient implementation (19)–(21) will converge to a small neighborhood around the exact minimizer rather than to the exact minimizer itself [51, 14]. In the following sections, we will design a more effective solution.

III Correcting the Approximation Error

III-A Dynamic Diffusion Strategy

Motivated by the dynamic average consensus method [38, 39], we will design a stochastic diffusion-based algorithm to correct the error introduced by (20). The motivation for dynamic diffusion is similar to what has been used before in the distributed optimization literature [40, 41, 42], with two main differences. First, in the current setting, we will be interested in tracking several different quantities and not a single quantity. Second, we will need to develop a stochastic (rather than deterministic) version to cope with stochasticity in the inner products and/or gradients, thus resulting in an operation per iteration.

To motivate the dynamic diffusion algorithm, let us step back and assume that each agent in the network is observing some dynamic input signal, , that changes with time . Assume we want to develop a scheme to ensure that each agent is able to track the average of all local signals, i.e., . For that purpose, we consider an optimization problem of the form:

(22)

where the cost function is changing with time . The global minimizer of is the desired average . However, we would like the agents to attain this solution in a distributed fashion. To this end, we can apply the exact diffusion algorithm developed in [18, 52] to solve (22). For this case, the algorithm simplifies to the following recursions:

(23)
(24)
(25)

Each agent has a state variable at time . Step (23) uses the input signal at agent to update its state to the intermediate value . The second step (24) corrects to , and the third step (25) combines the intermediate values in the neighborhood of agent to obtain the updated state . The process continues in this manner at all agents. Based on the results from [52] applied to (22), we can set in (23) and combine the three recursions to get

(26)

with for any . It can be shown that if the signals change slowly, the will track the mean well[38, 39]. Also, using induction and the initial boundary conditions, it is easy to verify that (26) has the unbiasedness property:

(27)

In this paper, we are also interested in a second useful property:

(28)

This means that if the signals converge, then the of all agents will converge to the mean of the limit values. We refer to (26) as the dynamic diffusion method.
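Since the exact form of recursion (26) is not reproduced above, the following sketch (our own stand-in) uses a standard dynamic average-consensus recursion, x_k(i+1) = Σ_l a_lk x_l(i) + s_k(i+1) − s_k(i) with x_k(0) = s_k(0), which exhibits the same two properties: the network sum of the states always equals the network sum of the signals, as in (27), and each state tracks the average of slowly varying signals:

```python
import numpy as np

# Dynamic average-consensus sketch (a stand-in for recursion (26), not the
# paper's exact update): x <- A x + (s_new - s_old), initialized with x = s(0).
K = 5
A = np.zeros((K, K))
for k in range(K):
    A[k, k] = A[k, (k - 1) % K] = A[k, (k + 1) % K] = 1.0 / 3.0

rng = np.random.default_rng(0)
offsets = rng.standard_normal(K)

def signal(i):
    # slowly drifting local signals s_k(i); illustrative choice
    return offsets + 0.001 * i

x = signal(0)
for i in range(1000):
    s_old, s_new = signal(i), signal(i + 1)
    x = A @ x + (s_new - s_old)
    assert np.isclose(x.sum(), s_new.sum())     # unbiasedness, cf. (27)

print(np.max(np.abs(x - signal(1000).mean())))  # each x_k is close to the average
```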

We now apply this intermediate result to the earlier recursion (16) to transform it into a distributed solution. Recall that there we need to evaluate the variable

(29)

Calculating this quantity is similar to solving problem (22), where each corresponds to the inner product . However, there is one key difference: the signal is not deterministic but stochastic, and it varies randomly with the data index . At any particular iteration , we do not know beforehand which random index is going to be chosen. This suggests that, in principle, we should keep track of variables , one for each possible . For large datasets, this is of course too costly to implement. Instead, we propose a more efficient solution in which the data is sparsely sampled. Assume first, for the sake of argument only, that we move ahead and compute the variable for every possible value of . If we do so, we would need to repeat construction (26) a total of times at each node , one for each , as follows:

(30)
(31)
(32)

In this description, we are adding a superscript to each to indicate the iteration index. In this way, each will be able to track the sum . However, since the data size is usually very large, it is too expensive to communicate and update all per iteration. We propose a stochastic algorithm in which only one datum is selected at iteration and only the corresponding entry is updated, while all other entries stay unchanged for :

(33)

where the index in the first equation refers to the most recent iteration at which the same index was chosen. Note that this value depends on and on the history of sampling, and therefore we need to store the inner product value associated with it. To fetch and easily, we introduce two auxiliary variables:

(34)

If we view as one long vector, the update (30)–(32) resembles a coordinate-descent algorithm [28, 30].

An unbiasedness property similar to (27) will continue to hold:

(35)

It is easy to verify the validity of (35) by summing (33) over the agents and invoking the initialization conditions.
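The following sketch (our own stand-in; the combination rule shown follows the generic dynamic-consensus pattern and is not the paper's exact recursion (33)) illustrates the bookkeeping: each agent keeps one tracking variable and one stored score per sample, only the entry of the sampled index is refreshed at each iteration, and the network sum over the agents is preserved for that entry, as in (35):

```python
import numpy as np

# Sparse, per-sample tracking sketch (stand-in for (33)-(34)): a table with one
# tracking variable g[k, n] and one stored score p_stored[k, n] per agent/sample.
K, N = 4, 50
A = np.zeros((K, K))
for k in range(K):
    A[k, k] = A[k, (k - 1) % K] = A[k, (k + 1) % K] = 1.0 / 3.0

rng = np.random.default_rng(0)
scores = rng.standard_normal((K, N))    # current local scores h_{k,n}^T w_k
g = scores.copy()                       # tracking table, initialized to the scores
p_stored = scores.copy()                # score stored the last time each index was used

for i in range(500):
    n = rng.integers(N)                                   # sampled index at iteration i
    scores += 0.01 * rng.standard_normal((K, N))          # all scores drift as the w_k change
    # refresh only column n: combine neighbors' entries plus the local score change
    g[:, n] = A @ (g[:, n] + scores[:, n] - p_stored[:, n])
    p_stored[:, n] = scores[:, n]
    # the sum over agents of the refreshed entry matches the sum of current scores
    assert np.isclose(g[:, n].sum(), scores[:, n].sum())
```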

III-B Variance-Reduction Algorithm

We can enhance the algorithm further by accounting for the gradient noise that is present in (15): this noise is the difference between the gradient of the risk function , which is unavailable, and the gradient of the loss function that is used in (15). It is known, e.g., from [53, 14, 51, 54], that under constant step-size adaptation, and due to this gradient noise, recursion (15) will only approach a small neighborhood around the global minimizer of (3). We can modify the recursion to ensure convergence to the exact minimizer as follows.

There is a family of variance-reduction algorithms, such as SVRG [55], SAGA [44], and AVRG [56], that can approach the exact solution of the empirical risk function with a constant step-size. In this work, we exploit the SAGA construction because the variables can readily be used in that implementation. Let us consider an agent in a non-cooperative scenario, where the vanilla SAGA recursion would be

(36)
(37)

It is proved in [44] that the variance of the gradient noise introduced in SAGA, i.e., the difference between the gradient step in (36) and the full gradient, vanishes in expectation. Therefore, SAGA converges to the exact solution of problem (1). Also, note that the -summation term can be calculated in an online manner since only one term is updated per iteration, i.e.,

(38)

This online calculation results in complexity per iteration.
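As a minimal single-agent illustration of the SAGA recursion (36)–(38) (our own sketch, using a ridge-regression loss of our choosing), note how the stored per-sample gradients and their running sum are refreshed with constant work per iteration:

```python
import numpy as np

# Minimal single-agent SAGA sketch for ridge regression (illustrative loss),
# highlighting the online update of the running gradient sum, cf. (38).
rng = np.random.default_rng(0)
N, M = 200, 10
H = rng.standard_normal((N, M))
gamma = H @ rng.standard_normal(M) + 0.1 * rng.standard_normal(N)
mu, rho = 0.01, 0.01

def grad_n(w, n):
    # gradient of 0.5*(h_n^T w - gamma_n)^2 + 0.5*rho*||w||^2
    return (H[n] @ w - gamma[n]) * H[n] + rho * w

w = np.zeros(M)
table = np.array([grad_n(w, n) for n in range(N)])   # stored per-sample gradients
g_sum = table.sum(axis=0)                            # running sum, kept up to date online

for i in range(5000):
    n = rng.integers(N)
    g_new = grad_n(w, n)
    # SAGA direction: new gradient minus stored one plus the table average
    w = w - mu * (g_new - table[n] + g_sum / N)
    # O(1) refresh of the running sum and of the table entry
    g_sum += g_new - table[n]
    table[n] = g_new
```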

Note that vanilla SAGA needs to store the gradient for every . However, we have already stored , so the additional storage of the gradients is unnecessary.

Hence, the stochastic gradient recursion (16) at each agent will be modified into (41) with two correction terms. The resulting algorithm is summarized in Algorithm 2.

 
Algorithm 2 [Variance-reduced dynamic diffusion () for learning from distributed features]
 
Initialization: Set ; .
Repeat for :

(39)
(40)
(41)
(42)
(43)

End
 

IV Acceleration With Pipeline

Algorithm 2 can be shown to converge to the solution of problem (1) for sufficiently small step-sizes. However, it is observed in numerical experiments that its convergence rate can be slow. One reason is that the variable generated by (40) converges slowly to . To accelerate convergence, it would be necessary to run (40) multiple times before the gradient descent step (41), which, however, would take us back to a two-time-scale algorithm. In this section, we propose a pipeline method that accelerates the convergence of while maintaining the one-time-scale structure.

A pipeline is a set of data processing elements connected in series, where the output of one element is the input to the next element[45]. We assume each agent stores variables at iteration :

(44)

At every iteration, agent runs a one-step average consensus recursion on its state vector (44):

(45)

Then, agent pops the variable from memory and uses it to continue the stochastic gradient descent steps. Note that the variable can be interpreted as the result of applying consensus iterations and, therefore, it is a better approximation for . At iteration , agent will push a new variable into the buffer and update its state to

(46)
Fig. 2: Illustration of the pipeline strategy with buffer length .

Recursion (IV) employs the pipeline strategy. For example, variable is updated at iteration . This new output becomes the second input at iteration and is used to produce the output . Next, that output becomes the third input at iteration and is used to produce the output . Following this procedure, the output is reached at iteration . At that time, we can pop it out and use it in the stochastic gradient update. The pipeline procedure is summarized in the “Pipeline function” listed below, and Fig. 2 illustrates the pipeline strategy.

 

Pipeline function

 
Initialization: for any
Function Pipeline (, )

Push   [, ]  into the queue (47)
(48)
Pop  [, ] out of the queue (49)

Return [, ]
 

The pipeline strategy has two advantages. First, it is able to calculate without an inner loop, which accelerates the algorithm and maintains the one-time-scale structure. Second, in one iteration the two-time-scale solution sends a scalar times, while the pipeline solution sends a -length vector just once. Though the communication load is the same, the pipeline solution is usually faster than the two-time-scale approach in practice, because sending a scalar times requires all agents to synchronize times, which can take longer than a single communication round.

Observe that when is popped out of the pipeline after iterations, the required should also be from iterations ago. This means that we need to store the past values. One solution is to push the auxiliary along with into the pipeline but without doing any processing inside the pipeline. Since this value will be used iterations into the future, this variable should be denoted by .
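The pipeline can be pictured as a fixed-length queue in which every buffered entry receives one more combination step per iteration, so the entry popped at each iteration has accumulated roughly as many combination steps as the queue is long, without any inner loop. The following sketch (our own illustration; the queue length and combination matrix are arbitrary choices) mimics this behavior for network-wide state vectors:

```python
import numpy as np
from collections import deque

# Pipeline sketch: each buffered entry gets one more network combination step
# per iteration; by the time an entry is popped it has passed through roughly
# B combination steps, without requiring an inner loop at any single iteration.
K, B = 5, 4
A = np.zeros((K, K))
for k in range(K):
    A[k, k] = A[k, (k - 1) % K] = A[k, (k + 1) % K] = 1.0 / 3.0

rng = np.random.default_rng(0)
queue = deque(rng.standard_normal((B, K)))     # B in-flight network states (buffer init)

def pipeline_step(new_entry):
    """Push a fresh entry, apply one combination step to every buffered entry,
    and pop the oldest (most-combined) entry."""
    queue.appendleft(new_entry)
    for j in range(len(queue)):
        queue[j] = A @ queue[j]
    return queue.pop()

for i in range(20):
    fresh = rng.standard_normal(K)             # new local quantities to be averaged
    smoothed = pipeline_step(fresh)            # entry that has been combined repeatedly
    # 'smoothed' is much closer to its network average than 'fresh' would be
```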

 
Algorithm 3 [Pipelined variance-reduced dynamic diffusion () learning]
 
Initialization: Set ; .
Repeat for :

(50)
(51)
(52)
(53)
(54)

End
 

V Algorithm Analysis

V-A Delayed Gradient Calculations

First, we have to point out that the pipeline solution is not equivalent to the two-time-scale solution. Observe that the gradient used in the stochastic gradient descent step (52) at iteration is where , which is iterations out-of-date. The cause of the delay is that the variable has to undergo updates inside the pipeline; by the time pops out of the pipeline, the iteration index has advanced to . As a result, the pipeline solution introduces delays into the gradient. Due to this delay, it does not necessarily follow that deeper pipelines lead to better performance; in fact, there is a trade-off between pipeline depth and performance.

Fortunately, problems involving delays in gradient calculations are well studied in the distributed machine learning and signal processing literature [57, 58, 59]. These works show that convergence can still occur, although the stability ranges are affected. While in most of that literature the delayed gradients are caused by unbalanced computation loads or unreliable communication environments rather than by a pipeline structure, the proof in this paper is inspired, to some extent, by these investigations.

V-B Convergence Analysis

To establish the convergence theorem, we require two assumptions.

Assumption 1 (Risk Function)

The loss function is differentiable, and has an -Lipschitz continuous gradient with respect to and a -Lipschitz continuous gradient with respect to , for every , i.e., for any :

(55)
(56)

where . The regularization term is convex and has a -Lipschitz continuous gradient:

(57)

We also assume that the risk is -strongly convex, namely,

(58)

Assumption 2 (Topology)

The underlying topology is strongly connected, and the combination/weighted adjacency matrix is symmetric and doubly stochastic, i.e.,

(59)

where is a vector with all unit entries. We further assume that for at least one agent .

Under Assumption 2, we can show that the matrix is primitive [49, 14] and that the second largest eigenvalue magnitude of , denoted by , satisfies [50]:

(60)
Theorem 1 (Convergence of PVRD)

Algorithm PVRD converges at a linear rate for sufficiently small step-sizes , i.e.,

(61)

for some constant , where:

(62)

Proof: See Appendix A.

This theorem indicates that the convergence rate of the algorithm depends on the network topology through , the depth of the pipeline , and the strong convexity parameter . When is not very large and the first term in (62) is larger than the second one, the convergence rate depends more on the network topology and the depth of the pipeline. However, when is large enough so that the second term is dominant, the algorithm performance depends mainly on the strong convexity parameter , as in the single-agent case. Compared with other distributed learning algorithms [52, 25, 26], the convergence behavior of PVRD is closer to that of single-agent SGD than to that of multi-agent SGD; this is also clear from recursion (80) in the proof.

Remark: The theorem indicates that the algorithm converges exactly, at a linear rate, to the minimizer of the empirical risk (3). In many cases we are instead interested in the minimizer of the average risk . By an ergodicity argument, the two minimizers become closer as the number of data samples increases.

Notice that algorithm VRD is a special case of algorithm PVRD obtained by setting . Thus, the following corollary follows from Theorem 1.

Corollary 1 (Convergence of VRD)

Algorithm VRD converges at a linear rate for sufficiently small step-sizes , i.e.,