The increasing popularity of large-scale and fully decentralized computational architectures, fueled for instance by the advent of the “Internet of Things”, motivates the development of efficient optimization algorithms adapted to this setting. An important application is machine learning in wired and wireless networks of agents (sensors, connected objects, mobile phones,etc.), where the agents seek to minimize a global learning objective which depends of the data collected locally by each agent. In such networks, it is typically impossible to efficiently centralize data or to globally aggregate intermediate results: agents can only communicate with their immediate neighbors (agents within a small distance), often in a completely asynchronous fashion. Standard distributed optimization and machine learning algorithms (implemented for instance using MapReduce/Spark) require a coordinator node and/or to maintain synchrony, and are thus unsuitable for use in decentralized networks.
are tailored to this setting because they only rely on simple peer-to-peer communication: each agent only exchanges information with one neighbor at a time. Various gossip algorithms have been proposed to solve the flagship problem of decentralized optimization, namely to find a parameter vectorwhich minimizes an average of convex functions , where the data is only known to agent . The most popular algorithms are based on (sub)gradient descent (Johansson et al., 2010; Nedić & Ozdaglar, 2009; Ram et al., 2010; Bianchi & Jakubowicz, 2013), ADMM (Wei & Ozdaglar, 2012, 2013; Iutzeler et al., 2013) or dual averaging (Duchi et al., 2012; Yuan et al., 2012; Lee et al., 2015; Tsianos et al., 2015), some of which can also accommodate constraints or regularization on . The main idea underlying these methods is that each agent seeks to minimize its local function by applying local updates (gradient steps) while exchanging information with neighbors to ensure a global convergence to the consensus value.
In this paper, we tackle the problem of minimizing an average of pairwise functions of the agents’ data:
This problem finds numerous applications in statistics and machine learning, Area Under the ROC Curve (AUC) maximization (Zhao et al., 2011), distance/similarity learning (Bellet et al., 2015), ranking (Clémençon et al., 2008), supervised graph inference (Biau & Bleakley, 2006) and multiple kernel learning (Kumar et al., 2012), to name a few. As a motivating example, consider a mobile phone application which locally collects information about its users. The provider could be interested in learning pairwise similarity functions between users in order to group them into clusters or to recommend them content without having to centralize data on a server (which would be costly for the users’ bandwidth) or to synchronize phones.
The main difficulty in Problem (1) comes from the fact that each term of the sum depends on two agents and , making the local update schemes of previous approaches impossible to apply unless data is exchanged between nodes. Although gossip algorithms have recently been introduced to evaluate such pairwise functions for a fixed (Pelckmans & Suykens, 2009; Colin et al., 2015), to the best of our knowledge, efficiently finding the optimal solution in a decentralized way remains an open challenge. Our contributions towards this objective are as follows. We propose new gossip algorithms based on dual averaging (Nesterov, 2009; Xiao, 2010) to efficiently solve Problem (1) and its constrained or regularized variants. Central to our methods is a light data propagation scheme which allows the nodes to compute biasedestimates of the gradients of functions in (1). We then propose a theoretical analysis of our algorithms both in synchronous and asynchronous settings establishing their convergence under an additional hypothesis that the bias term decreases fast enough over the iterations (and we have observed such a fast decrease in all our experiments). Finally, we present some numerical simulations on Area Under the ROC Curve (AUC) maximization and metric learning problems. These experiments illustrate the practical performance of the proposed algorithms and the influence of network topology, and show that in practice the influence of the bias term is negligible as it decreases very fast with the number of iterations.
The paper is organized as follows. Section 2 formally introduces the problem of interest and briefly reviews the dual averaging method, which is at the root of our approach. Section 3 presents the proposed gossip algorithms and their convergence analysis. Section 4 displays our numerical simulations. Finally, concluding remarks are collected in Section 5.
2.1 Definitions and Notation
For any integer , we denote by the set and by the cardinality of any finite set . We denote an undirected graph by , where is the set of vertices and is the set of edges. A node has degree . is connected if for all there exists a path connecting and ; it is bipartite if there exist such that , and . The graph Laplacian of is denoted by , where and are respectively the degree and the adjacency matrices of .
The transpose of a matrix is denoted by . A matrix is termed stochastic whenever and , where , and bi-stochastic whenever both and are stochastic. We denote by
the identity matrix in, by the canonical basis of , by the indicator function of any event and by the usual -norm. For and , we denote by the gradient of at . Finally, given a collection of vectors , we denote by its empirical mean.
2.2 Problem Statement
We represent a network of agents as an undirected graph , where each node corresponds to an agent and if nodes and can exchange information directly (i.e., they are neighbors). For ease of exposition, we assume that each node holds a single data point . Though restrictive in practice, this assumption can easily be relaxed, but it would lead to more technical details to handle the storage size, without changing the overall analysis (see supplementary material for details).
Given , let a differentiable and convex function with respect to the first variable. We assume that for any , there exists such that is -Lipschitz (with respect to the -norm). Let be a non-negative, convex, possibly non-smooth, function such that, for simplicity, . We aim at solving the following optimization problem:
In a typical machine learning scenario, Problem (2) is a (regularized) empirical risk minimization problem and corresponds to the model parameters to be learned. The quantity is a pairwise loss measuring the performance of the model on the data pair , while represents a regularization term penalizing the complexity of . Common examples of regularization terms include indicator functions of a closed convex set to model explicit convex constraints, or norms enforcing specific properties such as sparsity (a canonical example being the -norm).
Many machine learning problems can be cast as Problem (2). For instance, in AUC maximization (Zhao et al., 2011), binary labels are assigned to the data points and we want to learn a (linear) scoring rule which hopefully gives larger scores to positive data points than to negative ones. One may use the logistic loss
and the regularization term can be the square -norm of (or the -norm when a sparse model is desired). Other popular instances of Problem (2) include metric learning (Bellet et al., 2015), ranking (Clémençon et al., 2008), supervised graph inference (Biau & Bleakley, 2006) and multiple kernel learning (Kumar et al., 2012).
For notational convenience, we denote by the partial function for and by . Problem (2) can then be recast as:
Note that the function is -Lipschitz, since all the are -Lipschitz.
Throughout the paper we assume that the function is differentiable, but we expect all our results to hold even when is non-smooth, for instance in -regression problems or when using the hinge loss. In this case, one simply needs to replace gradients by subgradients in our algorithms, and a similar analysis could be performed.
2.3 Centralized Dual Averaging
In this section, we review the stochastic dual averaging optimization algorithm (Nesterov, 2009; Xiao, 2010) to solve Problem (2) in the centralized setting (where all data lie on the same machine). This method is at the root of our gossip algorithms, for reasons that will be made clear in Section 3
. To explain the main idea behind dual averaging, let us first consider the iterations of Stochastic Gradient Descent (SGD), assumingfor simplicity:
where , and is a non-negative non-increasing step size sequence. For SGD to converge to an optimal solution, the step size sequence must satisfy and . As noticed by Nesterov (2009), an undesirable consequence is that new gradient estimates are given smaller weights than old ones. Dual averaging aims at integrating all gradient estimates with the same weight.
Let be a positive and non-increasing step size sequence. The dual averaging algorithm maintains a sequence of iterates , and a sequence of “dual” variables which collects the sum of the unbiased gradient estimates seen up to time . We initialize to . At each step
, we compute an unbiased estimateof . The most common choice is to take where and are drawn uniformly at random from . We then set and generate the next iterate with the following rule:
When it is clear from the context, we will drop the dependence in and simply write .
Note that is related to the proximal operator of a function defined by . Indeed, one can write:
For many functions of practical interest, has a closed form solution. For instance, when , corresponds to a simple scaling, and when it is a soft-thresholding operator. If is the indicator function of a closed convex set , then is the projection operator onto .
The dual averaging method is summarized in Algorithm 1. If then for any :
where , is the averaged iterate and is the expectation over all possible sequences . A precise statement of this result along with a proof can be found in the supplementary material for completeness.
Notice that dual averaging cannot be easily adapted to our decentralized setting. Indeed, a node cannot compute an unbiased estimate of its gradient: this would imply an access to the entire set of data points, which violates the communication and storage constraints. Therefore, data points have to be appropriately propagated during the optimization procedure, as detailed in the following section.
3 Pairwise Gossip Dual Averaging
We now turn to our main goal, namely to develop efficient gossip algorithms for solving Problem (2) in the decentralized setting. The methods we propose rely on dual averaging (see Section 2.3). This choice is guided by the fact that the structure of the updates makes dual averaging much easier to analyze in the distributed setting than sub-gradient descent when the problem is constrained or regularized. This is because dual averaging maintains a simple sum of sub-gradients, while the (non-linear) smoothing operator is applied separately.
Our work builds upon the analysis of Duchi et al. (2012), who proposed a distributed dual averaging algorithm to optimize an average of univariate functions . In their algorithm, each node computes unbiased estimates of its local function that are iteratively averaged over the network. Unfortunately, in our setting, the node cannot compute unbiased estimates of : the latter depends on all data points while each node only holds . To go around this problem, we rely on a gossip data propagation step (Pelckmans & Suykens, 2009; Colin et al., 2015) so that the nodes are able to compute biased estimates of while keeping the communication and memory overhead to a small level for each node.
3.1 Synchronous Setting
In the synchronous setting, we assume that each node has access to a global clock such that every node can update simultaneously at each tick of the clock. Although not very realistic, this setting allows for simpler analysis. We assume that the scaling sequence is the same for every node. At any time, each node has the following quantities in its local memory register: a variable (the gradient accumulator), its original observation , and an auxiliary observation , which is initialized at but will change throughout the algorithm as a result of data propagation.
The algorithm goes as follows. At each iteration, an edge of the graph is drawn uniformly at random. Then, nodes and average their gradient accumulators and , and swap their auxiliary observations and . Finally, every node of the network performs a dual averaging step, using their original observation and their current auxiliary one to estimate the partial gradient. The procedure is detailed in Algorithm 2, and the following proposition adapts the convergence rate of centralized dual averaging under the hypothesis that the contribution of the bias term decreases fast enough over the iterations.
Sketch of proof.
First notice that at a given (outer) iteration , is updated as follows:
where is a biased estimate of . Let be the bias, so that we have .
Let us define . Using convexity of , the gradient’s definition and the fact that the functions and are both -Lipschitz, we obtain: for and ,
Using Lemma 4 (see supplementary material), the terms (5)-(6) can be bounded by . The term (7) requires a specific analysis because the updates are performed using biased estimates. We decompose it as follows:
The rate of convergence in Proposition 1 is divided into three parts: is a data dependent term which corresponds to the rate of convergence of the centralized dual averaging, while and are network dependent terms since , where is the second smallest eigenvalue of the graph Laplacian , also known as the spectral gap of . The convergence rate of our algorithm thus improves when the spectral gap is large, which is typically the case for well-connected graphs (Chung, 1997). Note that corresponds to the network dependence for the distributed dual averaging algorithm of Duchi et al. (2012) while the term comes from the bias of our partial gradient estimates. In practice, vanishes quickly and has a small impact on the rate of convergence, as shown in Section 4.
3.2 Asynchronous Setting
For any variant of gradient descent over a network with a decreasing step size, there is a need for a common time scale to perform the suitable decrease. In the synchronous setting, this time scale information can be shared easily among nodes by assuming the availability of a global clock. This is convenient for theoretical considerations, but is unrealistic in practical (asynchronous) scenarios. In this section, we place ourselves in a fully asynchronous setting where each node has a local clock, ticking at a Poisson rate of , independently from the others. This is equivalent to a global clock ticking at a rate Poisson process which wakes up an edge of the network uniformly at random (see Boyd et al., 2006, for details on clock modeling).
With this in mind, Algorithm 2 needs to be adapted to this setting. First, one cannot perform a full dual averaging update over the network since only two nodes wake up at each iteration. Also, as mentioned earlier, each node needs to maintain an estimate of the current iteration number in order for the scaling factor to be consistent across the network. For , let denote the probability for the node to be picked at any iteration. If the edges are picked uniformly at random, then one has . For simplicity, we focus only on this case, although our analysis holds in a more general setting.
Let us define an activation variable such that for any ,
One can immediately see that. Let us define such that and for , . Since are Bernoulli random variables, is an unbiased estimate of the time .
Using this estimator, we can now adapt Algorithm 2 to the fully asynchronous case, as shown in Algorithm 3. The update step slightly differs from the synchronous case: the partial gradient has a weight instead of so that all partial functions asymptotically count in equal way in every gradient accumulator. In contrast, uniform weights would penalize partial gradients from low degree nodes since the probability of being drawn is proportional to the degree. This weighting scheme is essential to ensure the convergence to the global solution. The model averaging step also needs to be altered: in absence of any global clock, the weight cannot be used and is replaced by , where corresponds to the average number of times that node has been selected so far.
The following result is the analogous of Theorem 1 for the asynchronous setting.
Let be a connected and non bipartite graph. Let be defined as for some constant and . For , let , , , and be generated as described in Algorithm 3. Then, there exists some constant such that, for , and ,
The proof is given in the supplementary material.
In the asynchronous setting, no convergence rate was known even for the distributed dual averaging algorithm of Duchi et al. (2012), which deals with the simpler problem of minimizing univariate functions. The arguments used to derive Theorem 2 can be adapted to derive a convergence rate (without the bias term) for an asynchronous version of their algorithm.
We have focused on the setting where all pairs of observations are involved in the objective. In practice, the objective may depend only on a subset of all pairs. To efficiently apply our algorithm to this case, one should take advantage of the potential structure of the subset of interest: for instance, one could attach some additional concise information to each observation so that a node can easily identify whether a pair contributes to the objective, and if not set the loss to be zero. This is essentially the case in the AUC optimization problem studied in Section 4, where pairs of similarly labeled observations do not contribute to the objective. If the subset of pairs cannot be expressed in such a compact form, then one would need to provide each node with an index list of active pairs, which could be memory-intensive when is large.
4 Numerical Simulations
In this section, we present numerical experiments on two popular machine learning problems involving pairwise functions: Area Under the ROC Curve (AUC) maximization and metric learning. Our results show that our algorithms converge and that the bias term vanishes very quickly with the number of iterations.
To study the influence of the network topology, we perform our simulations on three types of network (see Table 1 for the corresponding spectral gap values):
Complete graph: All nodes are connected to each other. It is the ideal situation in our framework, since any pair of nodes can communicate directly. In this setting, the bias of gradient estimates should be very small, as one has for any and any , . For a network size , the complete graph achieves the highest spectral gap: , see Bollobás (1998, Ch.9) or Chung (1997, Ch.1) for details.
Cycle graph: This is the worst case in terms of connectivity: each node only has two neighbors. This network has a spectral gap of order , and gives a lower bound in terms of convergence rate.
Watts-Strogatz: This random network generation technique (Watts & Strogatz, 1998) relies on two parameters: the average degree of the network and a rewiring probability . In expectation, the higher the rewiring probability, the better the connectivity of the network. Here, we use and to achieve a compromise between the connectivities of the complete graph and the cycle graph.
We first present an application of our algorithms to AUC maximization on a real dataset. Given a set of data points with associated binary labels , the goal is to learn a linear scoring rule parameterized by which maximizes:
It corresponds to the probability that the scoring rule associated with outputs a higher score on a positively labeled sample than on a negatively labeled one. This formulation leads to a non-smooth optimization problem; therefore, one typically minimizes a convex surrogate such as the logistic loss:
We do not apply any regularization (i.e., ), and use the Breast Cancer Wisconsin dataset,111https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Original) which consists of points in dimensions.
We initialize each to and for each network, we run 50 times Algorithms 2 and 3 with .222Even if this scaling sequence does not fulfill the hypothesis of Theorem 2 for the asynchronous setting, the convergence rate is acceptable in practice. Figure 1(a) shows the evolution of the objective function and the associated standard deviation (across nodes) with the number of iterations in the synchronous setting. As expected, the average convergence rate on the complete and the Watts-Strogatz networks is much better than on the poorly connected cycle network. The standard deviation of the node estimates also decreases with the connectivity of the network.
The results for the asynchronous setting are shown in Figure 1(b). As expected, the convergence rate is slower in terms of number of iterations (roughly times) than in the synchronous setting. Note however that much fewer dual averaging steps are performed: for instance, on the Watts-Strogatz network, reaching a loss requires (partial) gradient computations in the synchronous setting and only in the asynchronous setting. Moreover, the standard deviation of the estimates is much lower than in the synchronous setting. This is because communication and local optimization are better balanced in the asynchronous setting (one optimization step for each gradient accumulator averaged) than in the synchronous setting ( optimization steps for gradient accumulators averaged).
The good practical convergence of our algorithm comes from the fact that the bias term vanishes quite fast. Figure 1(c) shows that its average value quickly converges to on all networks. Moreover, its order of magnitude is negligible compared to the objective function. In order to fully estimate the impact of this bias term on the performance, we also compare our algorithm to the ideal but unrealistic situation where each node is given an unbiased estimate of its partial gradient: instead of adding to , a node will add where is picked uniformly at random. As shown in Figure 2, the performance of both methods are very similar on well-connected networks.
We now turn to a metric learning application. We consider the family of Mahalanobis distances parameterized by , where is the cone of positive semi-definite real-valued matrices. Given a set of data points with associated labels , the goal is to find which minimizes the following criterion (Jin et al., 2009):
where , , and if and otherwise. We use a synthetic dataset of points generated as follows: each point is drawn from a mixture of Gaussians in
(each corresponding to a class) with all Gaussian means contained in a 5d subspace and their shared covariance matrix proportional to the identity with a variance factor such that some overlap is observed.
Figure 3(a) shows the evolution of the objective function and its standard deviation for the asynchronous setting. As in the case of AUC maximization, the algorithm converges much faster on the well-connected networks than on the cycle network. Again, we can see in Figure 3(b) that the bias vanishes very quickly with the number of iterations.
We refer to the supplementary material for a metric learning experiment on a real dataset.
In this work, we have introduced new synchronous and asynchronous gossip algorithms to optimize functions depending on pairs of data points distributed over a network. The proposed methods are based on dual averaging and can readily accommodate various popular regularization terms. We provided an analysis showing that they behave similarly to the centralized dual averaging algorithm, with additional terms reflecting the network connectivity and the gradient bias. Finally, we proposed some numerical experiments on AUC maximization and metric learning which illustrate the performance of the proposed algorithms, as well as the influence of network topology. A challenging line of future research consists in designing and analyzing novel adaptive gossip schemes, where the communication scheme is dynamic and depends on the network connectivity properties and on the local information carried by each node.
Appendix A Outline of the Supplementary Material
The supplementary material is organized as follows. In Section B, we recall the standard proof of convergence rate for the (centralized) dual averaging. Then, in Section C, we improve the analysis of the decentralized version of the dual averaging algorithm for simple sums of functions, and provide insights to analyze the case of sum of pairwise functions. Our asynchronous variant is investigated in Section D. Technical details on how to extend our framework to the case with multiple points per node are given in Section E. Finally, additional numerical results are discussed in Section F.
Appendix B Centralized Dual Averaging
b.1 Deterministic Setting
We introduce the dual averaging algorithm for minimizing the sum , in a context where is convex and smooth, , is convex, non-negative and possibly non-smooth, with a proximity operator simple to compute. In the centralized framework, this algorithm reads as follows:
for any , where represents a scale factor similar to a gradient step size use in standard gradient descent algorithms, and is a sequence of gradient of taken at . Moreover we initialize . The function we consider is here of the form , where each is assumed -Lipschitz for simplicity (so is then). We denote . As a reminder, note that the Centralized dual averaging method is explicitly stated in Algorithm 4.
This particular formulation was introduced in (Xiao, 2009, 2010), extending the method introduced by (Nesterov, 2009) in the specific case of indicator functions. In this work, we borrow the notation from (Xiao, 2010).
In order to perform a theoretical analysis of this algorithm, we introduce the following functions. Let us define, for
Remark that with the assumption that , then . We also define the smoothing function that plays a crucial role in the dual algorithm formulation:
Strong convexity in of the objective function, ensures that the solution of the optimization problem is unique. The following lemma links the function and the algorithm update and is a simple application of the results from (Xiao, 2009, Lemma 10):
For any , one has:
and the following statements hold true: for any
and for any ,
With this notation one can write the dual averaging rule as , where , with the convention . Moreover, adapting (Xiao, 2009, Lemma 11) we can state:
For any and any non-increasing sequence , we have
We also need a last technical result that we will use several times in the following:
Let , and let be a non-increasing and non-negative sequence sequence (with the convention ), then for any :
Use the definition of to get the following upper bound
From the last display, the following holds:
Summing the former for yields
Remark that and , so the previous display can be reduced to:
Combining with (15), the lemma holds true. ∎
Bounding the error of the dual averaging is provided in the next theorem, where we remind that :
Let be a non increasing sequence. Let , , and be generated according to Algorithm 4. Assume that the function is -Lipschitz and that , then for any , one has: