## I Introduction

In this work we focus on solving optimization problems of the form

(1) |

where each function is convex over a convex set . This formulation applies widely in machine learning scenarios, where measures the loss of model with respect to data point , and is the average loss over data points. In particular, we are interested in the behavior of online distributed optimization algorithms for this sort of problem as the number of data points tends to infinity. We describe a distributed algorithm which, for strongly convex functions , converges at a rate . To the best of our knowledge this is the first distributed algorithm to achieve this convergence rate for constrained optimization without relying on smoothness assumptions on the objective or non-trivial communication mechanisms between the nodes. The result holds in both the online and the batch optimization settings.

When faced with a non-linear convex optimization problem, gradient-based methods can be applied to find the solution. The behavior of these algorithms is well-understood in the single-processor (centralized) setting. Under the assumption that the objective is -Lipschitz continuous, projected gradient descent-type algorithms converge at a rate [1, 2]. This rate is achieved both in an online setting where the ’s are revealed to the algorithm sequentially and in the batch setting where all are known in advance. If the cost functions are also strongly convex then gradient algorithms can achieve linear rates, , in the batch setting [3] and nearly-linear rates, , in the online setting [4]. Under additional smoothness assumptions, such as Lipschitz continuous gradients, the same rate of convergence can also be achieved by second order methods in the online setting [5, 6], while accelerated methods can achieve a quadratic rate in the batch setting; see [7] and references therein.

The aim of this work is to extend the aforementioned results to the distributed setting where a network of processors jointly optimize a similar objective. Assuming the network is arranged as an expander graph with constant spectral gap, for general convex cost functions that are only -Lipschitz continuous, the rate at which existing algorithms on a network of processors will all reach the optimum value is , i.e., similar to the optimal single processor algorithms up to a logarithmic factor [8, 9]. This is true both in a batch setting and in an online setting, even when the gradients are corrupted by noise. The technique proposed in [10] makes use of mini-batches to obtain asymptotic rates

for online optimization of smooth cost functions that have Lipschitz continuous gradients corrupted by bounded variance noise, and

for smooth strongly convex functions. However, this technique requires that each node exchange messages with every other node at the end of each iteration. Finally, if the objective function is strongly convex and three times differentiable, a distributed version of Nesterov’s accelerated method [11] achieves a rate of for unconstrained problems in the batch setting, but the dependence on is not characterized.

The algorithm presented in this paper achieves a rate for strongly convex functions. Our formulation allows for convex constraints in the problem and assumes the objective function is Lipschitz continuous and strongly convex; no higher-order smoothness assumptions are made. Our algorithm works in both the online and batch settings, and it scales nearly linearly in the number of iterations for network topologies with fast information diffusion. In addition, at each iteration nodes are only required to exchange messages with a subset of other nodes in the network (their neighbors).

The rest of the paper is organized as follows. Section II introduces notation and formalizes the problem. Section III describes the proposed algorithm and states our main results. These results are proven in Section IV, and Section V extends the analysis to the case where gradients are noisy. Section VI presents the results of numerical experiments illustrating the performance of the algorithm, and the paper concludes in Section VII.

## II Online Convex Optimization

Consider the problem of minimizing a convex function over a convex set . Of particular interest is the setting where the algorithm sequentially receives noisy samples of the (sub)gradients of . This setting arises in online loss minimization for machine learning when the data arrives as a stream and the (sub)gradient is evaluated using an individual data point at each step [1]. Suppose the th data point is drawn i.i.d. from an unknown distribution , and let denote the loss of this data point with respect to a particular model . In this setting one would like to find the model that minimizes the expected loss , possibly with the constraint that be restricted to a model space . Clearly, as , the objective , and so if the data stream is finite this motivates minimizing the empirical loss .

An online convex optimization algorithm observes a data stream , and sequentially chooses a sequence of models , after each observation. Upon choosing , the algorithm receives a subgradient . The goal is for the sequence to converge to a minimizer of .

The performance of an online optimization algorithm is measured in terms of the regret:

(2) |

The regret measures the gap between the cost accumulated by the online optimization algorithm over steps and that of a single fixed model chosen to minimize the total cost over all cost terms. If the costs are allowed to be arbitrary convex functions then it can be shown that the best achievable rate for any online optimization algorithm is , and this bound is also achievable [1]. The rate can be significantly improved if the cost functions have more favourable properties.
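As a minimal sketch of how regret is tallied, the snippet below compares the cost accumulated along a sequence of iterates with the cost of one fixed comparator. The function name `cumulative_regret` and the toy one-dimensional quadratic losses are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def cumulative_regret(losses, iterates, comparator):
    """Cost paid along `iterates` minus the cost of the fixed `comparator`.

    `losses` is a list of callables, one per step (hypothetical interface).
    """
    incurred = sum(f(w) for f, w in zip(losses, iterates))
    benchmark = sum(f(comparator) for f in losses)
    return incurred - benchmark

# Toy example (assumed): loss at step t is 0.5 * (w - x_t)^2 in one dimension.
data = [1.0, 2.0, 3.0]
losses = [lambda w, x=x: 0.5 * (w - x) ** 2 for x in data]

# For these quadratics the best fixed model is the mean of the data.
best_fixed = float(np.mean(data))

# An algorithm that never moves from w = 0 accumulates regret at every step.
regret_value = cumulative_regret(losses, [0.0, 0.0, 0.0], best_fixed)
```

Dividing the returned value by the number of steps gives the average regret, which is the quantity whose decay rate is discussed in the text.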

### II-A Assumptions

###### Assumption 1

We assume for the rest of the paper that each cost function is -strongly convex for all ; i.e., there is a such that for all and all

(3) |

If each is -strongly convex, it follows that is also -strongly convex. Moreover, if is strongly convex then it is also strictly convex, and so has a unique minimizer which we denote by .
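For reference, the strong-convexity condition invoked in Assumption 1 is commonly written as below. The symbols $f$, $\sigma$, $\theta$, $u$, $w$ and $\mathcal{W}$ are notation assumed here for illustration, chosen to match the roles described in the text (a modulus, an interpolation weight, and two points of the constraint set).

```latex
% Assumed notation: f is a cost function, \sigma > 0 its strong-convexity modulus.
f\big(\theta u + (1-\theta) w\big)
  \le \theta f(u) + (1-\theta) f(w)
      - \frac{\sigma}{2}\,\theta(1-\theta)\,\|u - w\|^2,
\qquad \theta \in [0,1],\; u, w \in \mathcal{W}.
```

Averaging this inequality over several functions that share the same modulus yields the same inequality for their average, which is why the averaged objective inherits the modulus.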

###### Assumption 2

We also assume that the subgradients of each cost function are bounded by a known constant ; i.e., where ‖·‖ is the (ℓ₂) Euclidean norm.

### II-B Example: Training a Classifier

For a specific example of this setup, consider the problem of training an SVM classifier using a hinge-loss with regularization [4]. In this case, the data stream consists of pairs such that and . The goal is to minimize the misclassification error as measured by the -regularized hinge loss. Formally, we wish to find the that solves

(4) |

which is -strongly convex. (Although the hinge loss itself is not strongly convex, adding a strongly convex regularizer makes the overall cost function strongly convex.) For these types of problems, using a single-processor stochastic gradient descent algorithm, one can achieve [4] or [12] by using different update schemes.

### II-C Distributed Online Convex Optimization

In this paper, we are interested in solving online convex optimization problems with a network of computers. The computers are organized as a network with nodes, and messages are only exchanged between nodes connected with an edge in .

###### Assumption 3

In this work we assume that is connected and undirected.

Each node receives a stream of data , similar to the serial case, and the nodes must collaborate to minimize the network-wide objective

(5) |

where is the cost incurred at processor at time . In the distributed setting, the definition of regret is naturally extended to

(6) |

For general convex cost functions, the distributed algorithm proposed in [8] has been proven to have an average regret that decreases at a rate , similar to the serial case, and this result holds even when the algorithm receives noisy, unbiased, observations of the true subgradients at each step. In the next section, we present a distributed algorithm that achieves a nearly-linear rate of decrease of the average regret (up to a logarithmic factor) when the cost functions are strongly convex.

## III Algorithm

Nodes must collaborate to solve the distributed online convex optimization problem described in the previous section. To that end, the network is endowed with a consensus matrix which respects the structure of , in the sense that if . We assume that is doubly stochastic, although generalizations to the case where is row stochastic or column stochastic (but not both) are also possible [13, 14].
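One common way to obtain a doubly stochastic matrix that respects an undirected graph is the Metropolis weighting rule. The sketch below is an illustration under that assumption; the text does not prescribe a particular construction, and the function name `metropolis_weights` is hypothetical.

```python
import numpy as np

def metropolis_weights(adjacency):
    """Build a symmetric, doubly stochastic matrix from a 0/1 adjacency matrix.

    Assumes the graph is undirected (symmetric adjacency, zero diagonal).
    """
    adjacency = np.asarray(adjacency, dtype=float)
    n = adjacency.shape[0]
    degree = adjacency.sum(axis=1)
    P = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adjacency[i, j] > 0:
                # Weight on an edge depends only on the larger endpoint degree,
                # so P[i, j] == P[j, i].
                P[i, j] = 1.0 / (1.0 + max(degree[i], degree[j]))
        # Remaining mass stays on the diagonal so that each row sums to one.
        P[i, i] = 1.0 - P[i].sum()
    return P

# Path graph on three nodes: 0 - 1 - 2.
path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
P_path = metropolis_weights(path)
```

Because the result is symmetric with unit row sums, its column sums are also one, and entries for non-adjacent pairs remain zero.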

A detailed description of the proposed algorithm, *distributed online gradient descent* (DOGD), is given in Algorithm 1. In the algorithm, each node performs a total of updates. One update involves processing a single data point at each processor. The updates are performed over rounds, and updates are performed in round . The main steps within each round (lines 9–11) involve updating an accumulated gradient variable, , by simultaneously incorporating the information received from neighboring nodes and taking a local gradient-descent like step. The accumulated gradient is projected onto the constraint set to obtain , where

(7) |

denotes the Euclidean projection of onto , and then this projected value is merged into a running average . The step size parameter remains constant within each round, and the step size is reduced by half at the end of each round. The number of updates per round doubles from one round to the next.

Note that the algorithm proposed here differs from the distributed dual averaging algorithm described in [8], where a proximal projection is used rather than the Euclidean projection. Also, in contrast to the distributed subgradient algorithms described in [15], DOGD maintains an accumulated gradient variable in which is updated using as opposed to the primal feasible variables . Finally, key to achieving fast convergence is the exponential decrease of the learning rate after performing an exponentially increasing number of gradient steps together with a proper initialization of the learning rate.
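The round structure described above can be sketched as follows. This is an illustrative reading of the prose, not a transcription of Algorithm 1: the constraint set is assumed to be a Euclidean ball, the callback `subgrad(i, w)` is a hypothetical interface returning a subgradient of node `i`'s current cost at `w`, and restarting each round from the previous round's running average is an assumption of this sketch.

```python
import numpy as np

def project_l2_ball(z, radius):
    """Euclidean projection onto {w : ||w|| <= radius} (assumed constraint set)."""
    norm = np.linalg.norm(z)
    return z if norm <= radius else z * (radius / norm)

def dogd_sketch(P, subgrad, dim, rounds, steps0, step0, radius=1.0):
    """Consensus plus gradient step, projection, and running average, in rounds."""
    n = P.shape[0]
    z = np.zeros((n, dim))   # accumulated-gradient variables, one row per node
    w = np.zeros((n, dim))   # projected (feasible) iterates
    x = np.zeros((n, dim))   # running averages of the projected iterates
    steps, step = steps0, step0
    for _ in range(rounds):
        x = np.zeros((n, dim))
        for t in range(steps):
            g = np.array([subgrad(i, w[i]) for i in range(n)])
            z = P @ z - step * g   # mix with neighbors, then take a local step
            w = np.array([project_l2_ball(z[i], radius) for i in range(n)])
            x += (w - x) / (t + 1)  # incremental mean over this round
        # Assumed restart rule: begin the next round at the running average.
        z, w = x.copy(), x.copy()
        step /= 2.0                 # step size is halved after each round
        steps *= 2                  # number of updates doubles
    return x
```

A small usage example under these assumptions: with three fully connected nodes and local quadratic costs centered at different points, every node's running average should approach the mean of the centers.

```python
centers = np.array([[0.2, 0.0], [0.4, 0.0], [0.6, 0.0]])
P_full = np.full((3, 3), 1.0 / 3.0)
x_final = dogd_sketch(P_full, lambda i, w: w - centers[i],
                      dim=2, rounds=5, steps0=8, step0=0.5)
```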

The next section provides theoretical guarantees on the performance of DOGD.

## IV Convergence Analysis

Our main convergence result, stated below, guarantees that the average regret decreases at a rate which is nearly linear.

###### Theorem 1

Let Assumptions 1–3 hold and suppose that the consensus matrix is doubly stochastic with constant . Let be the minimizer of . Then the sequence produced by nodes running DOGD to minimize obeys

(8) |

where is the number of rounds executed during a total of gradient steps per node, and is the running average maintained locally at each node.

Remark 1: We state the result for the case where is constant. This is the case when is, e.g., a complete graph or an expander graph [16]. For other graph topologies where shrinks with and consensus does not converge fast, the convergence rate dependence on is going to be worse due to a factor in the denominator; see the proof of Theorem 1 below for the precise dependence on the spectral gap .
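To make the dependence on topology in Remark 1 concrete, the following sketch computes the second largest eigenvalue of a symmetric consensus matrix for two graphs. The helper names are hypothetical, symmetry of the matrix is assumed so that `numpy.linalg.eigvalsh` applies, and the uniform one-third weights on the ring are an illustrative choice.

```python
import numpy as np

def second_largest_eigenvalue(P):
    """Second largest eigenvalue of a symmetric matrix P."""
    eigenvalues = np.linalg.eigvalsh(np.asarray(P, dtype=float))  # ascending
    return eigenvalues[-2]

def ring_matrix(n):
    """Ring of n >= 3 nodes; each node weights itself and two neighbors by 1/3."""
    P = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i, i + 1):
            P[i, j % n] = 1.0 / 3.0
    return P

complete = np.full((5, 5), 1.0 / 5.0)           # complete graph, uniform weights
lam_complete = second_largest_eigenvalue(complete)
lam_ring_small = second_largest_eigenvalue(ring_matrix(4))
lam_ring_large = second_largest_eigenvalue(ring_matrix(40))
```

For the complete graph the value is essentially zero regardless of size, whereas on the ring it moves toward one as the number of nodes grows, which is the regime in which the remark warns that the rate degrades.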

Remark 2: The theorem characterizes performance of the online algorithm DOGD, where the data and cost functions are processed sequentially at each node in order to minimize an objective of the form

(9) |

However, as pointed out in [4], if the entire dataset is available in advance, we can use the same scheme to do batch minimization by effectively setting , where is the objective function accounting for the entire dataset available to node . Thus, the same result holds immediately for a batch version of DOGD.

The remainder of this section is devoted to the proof of Theorem 1. Our analysis follows arguments that can be found in [1, 12, 8] and references therein. We first state and prove some intermediate results.

### IV-A Properties of Strongly Convex Functions

Recall the definition of -strong convexity given in Assumption 1. A direct consequence of this definition is that if is -strongly convex then

(10) |

Strong convexity can be combined with the assumptions above to upper bound the difference for an arbitrary point .

###### Lemma 1

Let be the minimizer of . For all , we have .

### IV-B The Lazy Projection Algorithm

The analysis of DOGD below involves showing that the average state, , evolves according to the so-called (single processor) lazy projection algorithm [1], which we discuss next. The lazy projection algorithm is an online convex optimization scheme for the serial problem discussed at the beginning of Section II. A single processor sequentially chooses a new variable and receives a subgradient of . The algorithm chooses by repeating the steps

(11) | ||||

(12) |

By unwrapping the recursive form of (11), we get

(13) |
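The two-step structure of lazy projection, an unprojected accumulation followed by a projection, can be illustrated with the sketch below. It assumes a constant step size, a zero initial state, and a Euclidean ball as the constraint set; the function names and the callback `subgrad(w, t)` are hypothetical.

```python
import numpy as np

def project_l2_ball(z, radius=1.0):
    """Euclidean projection onto {w : ||w|| <= radius} (assumed constraint set)."""
    norm = np.linalg.norm(z)
    return z if norm <= radius else z * (radius / norm)

def lazy_projection(subgrad, dim, steps, step_size, radius=1.0):
    """Accumulate subgradients in z without projecting z itself; play w = proj(z)."""
    z = np.zeros(dim)
    w = project_l2_ball(z, radius)
    iterates = [w]
    for t in range(steps):
        g = subgrad(w, t)            # subgradient evaluated at the played point
        z = z - step_size * g        # the accumulated state may leave the set
        w = project_l2_ball(z, radius)
        iterates.append(w)
    return np.array(iterates), z

# Constant subgradient (assumed toy case): z keeps growing, w saturates.
path, z_final = lazy_projection(lambda w, t: np.array([1.0, 0.0]),
                                dim=2, steps=4, step_size=0.5)
```

In this toy run the accumulated state continues past the boundary of the ball while the played point stays on it, which is the distinction between the lazy scheme and one that projects the state itself at every step.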

The following is a typical result for subgradient descent-style algorithms, and is useful towards eventually characterizing how the regret accumulates. Its proof can be found in the appendix of the extended version of [1].

###### Theorem 2 (Zinkevich [1])

### IV-C Evolution of Network-Average Quantities in DOGD

We turn our attention to Algorithm 1. A standard approach to studying convergence of distributed optimization algorithms, such as DOGD, is to keep track of the discrepancy between every node’s state and an average state sequence defined as

(15) |

Observe that evolves in a simple recursive manner,

(16) | ||||

(17) | ||||

(18) | ||||

(19) | ||||

(20) |

where equation (19) holds since is doubly stochastic. Notice (cf. eqn. (13)) that the states evolve according to the lazy projection algorithm with gradients and learning rate . In the sequel, we will also use an analytic expression for derived by back substituting in its recursive update equation. After some algebraic manipulation, we obtain

(21) |

and since the projection is non-expansive and ,

(22) | ||||

(23) | ||||

(24) | ||||

(25) | ||||

(26) | ||||

(27) |

### IV-D Analysis of One Round of DOGD

Next, we focus on bounding the amount of regret accumulated during the th round of DOGD (lines 5–12 of Algorithm 1) during which the learning rate remains fixed at . Using Assumptions 1, 2, and the triangle inequality we have that

(28) | ||||

(29) | ||||

(30) | ||||

(31) |

For the first summand we have

(32) | ||||

(33) | ||||

(34) |

To bound term we invoke Theorem 2 for the average sequences and .

(35) | ||||

(36) | ||||

(37) | ||||

(38) | ||||

(39) |

Collecting now all the partial results and bounds, so far we have shown that

(40) |

and since the projection operator is non-expansive, we have

(41) | ||||

The first two terms are standard for subgradient algorithms using a constant step size. The last two terms depend on the error between each node’s iterate and the network-wide average , which we bound next.

### IV-E Bounding the Network Error

What remains is to bound the term which describes an error induced by the network since the different nodes do not agree on the direction towards the optimum. By recalling that is doubly stochastic and manipulating the recursive expressions (21) and (20) for and using arguments similar to those in [8, 14], we obtain the bound,

(42) | ||||

(43) |

The norm can be bounded using Lemma 2, which is stated and proven in the Appendix, and using (27) we arrive at

(44) |

where is the second largest eigenvalue of . Using this bound in equation (41), along with the fact that is convex, we conclude that

(45) | ||||

(46) | ||||

(47) |

where .

### IV-F Analysis of DOGD over Multiple Rounds

As our last intermediate step, we must control the learning rate and update of from round-to-round to ensure linear convergence of the error. From strong convexity of we have

(48) |

and thus

(49) |

Now, from Theorem in [1] which is a direct consequence of Theorem 2 for the average sequence viewed as a single processor lazy projection algorithm, we have that after executing gradient steps in round ,

(50) |

and by repeatedly using strong convexity and Theorem 2 we see that

(51) | ||||

(52) | ||||

(53) |

Now, let us fix positive integers and , and suppose we use the following rules to determine the step size and number of updates performed within each round:

(54) | ||||

(55) |

Combining (53) with (49) and invoking Lemma 1, we have

(56) |

To ensure convergence to zero, we need and or . Given these restrictions, let us make the choices

(57) |

To simplify the exposition, let us assume that is an integer. Using the selected values, we obtain
