A Provably Communication-Efficient Asynchronous Distributed Inference Method for Convex and Nonconvex Problems

03/16/2019, by Jineng Ren et al.

This paper proposes and analyzes a communication-efficient distributed optimization framework for general nonconvex nonsmooth signal processing and machine learning problems under an asynchronous protocol. At each iteration, worker machines compute gradients of a known empirical loss function using their own local data, and a master machine solves a related minimization problem to update the current estimate. We prove that for nonconvex nonsmooth problems, the proposed algorithm converges with a sublinear rate over the number of communication rounds, coinciding with the best theoretical rate that can be achieved for this class of problems. Linear convergence is established without any statistical assumptions on the local data for problems characterized by composite loss functions whose smooth parts are strongly convex. Extensive numerical experiments verify that the performance of the proposed approach indeed improves -- sometimes significantly -- over other state-of-the-art algorithms in terms of total communication efficiency.


I Introduction

Due to rapid developments in information and computing technology, modern applications often involve vast amounts of data, rendering local processing (e.g., in a single machine, or on a single processing core) computationally challenging or even prohibitive. To deal with this problem, distributed and parallel implementations are natural methods that can fully leverage multi-core computing and storage technologies. However, one drawback of distributed algorithms is that the communication cost can be very high in terms of raw bytes transmitted, latency, or both, since machines (i.e., computation nodes) need to frequently transmit and receive information between each other. Therefore, algorithms that require less communication are preferred in this setting.

In this paper we study a general communication-efficient distributed algorithm which can be applied to a broad class of nonconvex nonsmooth inference problems. Assume that we have N data samples available. We consider a general problem appearing frequently in signal processing and machine learning applications; we aim to solve

(1)   \min_{w} \; \frac{1}{N} \sum_{i=1}^{N} \ell_i(w) + g(w),

where each \ell_i is a loss function associated with the i-th data sample and is assumed smooth but possibly nonconvex with Lipschitz continuous gradient, and g is a convex (proper and lower semi-continuous) function that is possibly nonsmooth. Problem (1) covers many important machine learning and signal processing problems such as localization with wireless acoustic sensor networks (WASNs) [1], the support vector machine (SVM) [2], the independent component analysis (ICA) reconstruction problem [3], and the sparse principal component analysis (PCA) problem [4].

For our distributed approach, we consider a network of m machines in total having a star topology, where one node designated as the "Master" node (node 1, without loss of generality) is located at the center of the star, and the remaining nodes (with indices 2, …, m) are the "Worker" nodes (see Figure 1). Without loss of generality, assume that the number of data samples N is evenly divisible by m, i.e., N = mn for some integer n, and each machine stores n unique data samples. Then (1) can be reformulated as the following problem:

(2)   \min_{w} \; \frac{1}{m} \sum_{i=1}^{m} \frac{1}{n} \sum_{j=1}^{n} \ell_{i,j}(w) + g(w),

where \ell_{i,j} is the loss function corresponding to the j-th sample of the i-th machine.
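To make the splitting in (2) concrete, here is a minimal Python/NumPy sketch. The helper names (make_partition, local_empirical_loss, objective) are hypothetical, and a squared loss with an l1 regularizer is chosen purely for illustration; any smooth per-sample loss and convex g fit the same pattern.

import numpy as np

def make_partition(X, y, m):
    """Split N samples evenly across m machines (assumes N = m * n)."""
    return list(zip(np.array_split(X, m), np.array_split(y, m)))

def local_empirical_loss(w, X_i, y_i):
    """Smooth per-machine empirical loss f_i(w); a squared loss is used purely for illustration."""
    r = X_i @ w - y_i
    return 0.5 * np.mean(r ** 2)

def objective(w, partition, lam):
    """Objective of (2): (1/m) * sum_i f_i(w) + g(w), with g = lam * ||.||_1 for illustration."""
    f_avg = np.mean([local_empirical_loss(w, X_i, y_i) for X_i, y_i in partition])
    return f_avg + lam * np.sum(np.abs(w))

# usage sketch: 120 samples split across m = 4 machines
rng = np.random.default_rng(0)
X, y = rng.standard_normal((120, 10)), rng.standard_normal(120)
partition = make_partition(X, y, m=4)
print(objective(np.zeros(10), partition, lam=0.1))

In the distributed setting each machine would hold only its own data block; the master never needs the raw samples, only the gradients of the local empirical losses.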

Fig. 1: An m-node network with a star topology

I-A Main Results

We propose an Efficient Distributed Algorithm for Nonconvex Nonsmooth Inference (EDANNI) and show that, for general problems of the form (2), EDANNI converges to the set of stationary points if the algorithm parameters are chosen appropriately according to the maximum network delay. Our results differ significantly from existing works [5, 6, 7], which are all developed for convex problems. Therefore, the analysis and algorithm proposed here are applicable not only to standard convex learning problems but also to important nonconvex problems. To the best of our knowledge, this is the first communication-efficient algorithm exploiting local second-order information that is guaranteed to converge for general nonconvex nonsmooth problems. Moreover, linear convergence is also proved in the strongly convex setting without any statistical assumptions on the data stored in each local machine, which is another improvement over existing works. The synchronization inherent in previous works, including [5, 6, 7, 8, 9, 10], slows down those methods because the master needs to wait for the slowest worker during each iteration; here, we propose an asynchronous approach that can accelerate the corresponding inference tasks significantly (as we will demonstrate in the experimental results).

I-B Related Work

There is a large body of work on distributed optimization for modern data-intensive applications with varied accessibility; see, for example, [5, 6, 7, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]. (Parts of the results presented here appeared in our conference paper [22] without theoretical analysis.) Early works, including [6, 11, 15], mainly considered the convergence of parallelized stochastic gradient descent (SGD) schemes, which stem from ideas in the seminal text by Bertsekas and Tsitsiklis [16]. Niu et al. [12] proposed a lock-free implementation of distributed SGD called Hogwild! and provided its rates of convergence for sparse learning problems; it was followed by many variants such as [24, 25]. For solving large-scale problems, works including [17], [18, 19], and [26] studied distributed optimization based on a parameter-server framework and parameter partitioning. Chang et al. [20] studied asynchronous distributed optimization based on the alternating direction method of multipliers (ADMM). By formulating the optimization problem as a consensus problem, the ADMM can be used to solve it in a fully parallel fashion over networks with a star topology. One drawback of such approaches is that they can be computationally intensive, since each worker machine is required to solve a high-dimensional subproblem. As we will show, these methods also converge more slowly (in terms of communication rounds) than the proposed approach (see Section IV).

A growing interest in distributed algorithms has also appeared in the statistics community [27, 28, 29, 30, 31]. Most of these algorithms depend on how the data are partitioned, so their analyses usually involve statistical assumptions on the correlation between the data held by different local machines. A popular approach in early works is averaging estimators generated locally by different machines [15, 28, 32, 33]. Yang [34], Ma et al. [35], and Jaggi et al. [36] studied distributed optimization based on stochastic dual coordinate descent; however, their communication complexity is no better than that of first-order approaches. Shamir et al. [37] and Zhang and Xiao [38] proposed truly communication-efficient distributed optimization algorithms that leverage local second-order information, though these approaches are only guaranteed to work for convex and smooth objectives. In a similar spirit, Wang et al. [8], Jordan et al. [9], and Ren et al. [10] developed communication-efficient algorithms for sparse learning with regularization. However, each of these works requires an assumption of strong convexity of the loss functions, which may limit their approaches to only a small set of real-world applications. Here we describe an algorithm with a similar flavor but more general applicability, and establish its convergence rate in both the strongly convex and the nonconvex nonsmooth settings. Moreover, unlike [8, 9, 10, 37, 38], where the convergence analyses rely on certain statistical assumptions on the data stored in the machines, our convergence analysis is deterministic and characterizes worst-case convergence conditions.

Notation. For a vector x and p > 0 we write ||x||_p := (Σ_i |x_i|^p)^{1/p}; for p ≥ 1 this is a norm. Usually ||x||_2 is briefly written as ||x||. The set of natural numbers is denoted by ℕ. For an integer m, we write [m] as shorthand for the set {1, …, m}.

II Algorithm

In this section, we describe our approach to computing the minimizer of (2). Recall that we have m machines. Let t denote the iteration number, and let A_t be the index set of the worker machines from which the master receives updated gradient information during iteration t; worker i is said to have "arrived" at iteration t if i ∈ A_t. At iteration t, the master machine solves a subproblem to obtain an updated estimate and communicates it to the worker machines in the subset A_t. After receiving the updated estimate, these worker machines compute the corresponding gradients of their local empirical loss functions. The gradients are then communicated back to the master machine, and the process continues.

Formally, let

f_i(w) := \frac{1}{n} \sum_{j=1}^{n} \ell_{i,j}(w)

be the empirical loss at machine i. Let t̃_i(t) be the latest time (in terms of the iteration count) at which worker i arrived, up to and including iteration t.

In the t-th iteration, the master (machine 1) solves the following subproblem to update w_{t+1}:

(3)

This is communicated to the worker machines that are free, where it is used to compute their local gradients ∇f_i(w_{t+1}). Since machine 1 is assumed to be the master machine, its own gradient is always evaluated at the current iterate, i.e., t̃_1(t) = t.

Now one question is: which partial sets A_t of worker machines, from which the master receives updated gradient information during iteration t, are sufficient to ensure convergence of a distributed approach? First, let τ be the maximum tolerable delay, that is, the maximum number of iterations for which a worker machine can remain inactive. The set A_t should satisfy:

[Bounded delay] For every worker i and every iteration t, it holds that t − t̃_i(t) ≤ τ.

To satisfy Assumption II, A_t should contain at least the indices of the worker machines that have already been inactive for τ iterations. That is, the master needs to wait until those workers finish their current computation and have arrived. Note that, by the definition of t̃_i(t), worker i arrived at iteration t̃_i(t) and has not arrived at any later iteration up to t.

Assumption II requires that every worker arrives at least once within the period [t − τ, t]. In other words, the gradient information used by the master is at most τ iterations old. To guarantee the bounded delay, at every iteration the master needs to wait for the workers that have not been active for τ iterations, if such workers exist. Note that, when τ = 0, one has A_t = {2, …, m} for all t, which reduces to the synchronous case where the master always waits for all the workers at every iteration.
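As a small illustration of this waiting rule (a sketch with hypothetical names, not the authors' implementation), the check below assumes a counter s_i that records how many iterations have passed since worker i last arrived; the same counters reappear in Algorithm 1. The master may proceed only if no worker whose counter has reached τ is still missing from the current round.

def must_wait(staleness, arrived, tau):
    """Return True if the master must keep waiting: some worker whose gradient
    is already tau iterations old has not arrived in the current round."""
    return any(s >= tau and i not in arrived
               for i, s in staleness.items())

# Example: with tau = 2, worker 3 is at the delay limit and has not arrived,
# so the master must wait; once worker 3 arrives, it may proceed.
print(must_wait({2: 1, 3: 2, 4: 0}, arrived={4}, tau=2))     # True
print(must_wait({2: 1, 3: 2, 4: 0}, arrived={3, 4}, tau=2))  # False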

The proposed approach is presented in Algorithm 1, which specifies the steps for the workers and the master, respectively. Algorithm 1 has three prominent differences compared with its synchronous alternatives. First, only the workers in A_t update their gradients and transmit them to the master machine. For the workers not in A_t, the master uses their latest gradient information, namely the gradients computed at iteration t̃_i(t). Second, the counters s_i are introduced to track the delay of each worker since its last update: s_i is set to zero if worker i arrives at the current iteration; otherwise, s_i is increased by one. Therefore, to ensure that Assumption II holds at each iteration, the master should wait whenever there exists at least one worker whose counter s_i has reached τ. Third, after solving subproblem (3), the master transmits the up-to-date variable only to the arrived workers. In general, both the master and the fast workers in the asynchronous approach can update more frequently and spend less time waiting than their synchronous counterparts.

0:  Loss functions f_i, i ∈ [m], parameter ρ,
         initial point w_0. Set t = 0 and
         s_i = 0 for all workers i;
  for t = 0, 1, 2, … do
     Worker machines:
     for i = 2, …, m do
        if w_t is received from the master then
           Calculate the gradient ∇f_i(w_t) and transmit it to the master.
        end
     end for
     Master:
     Receive gradients from worker machines in a set A_t such that s_i < τ for all i ∉ A_t,
     then
        Update s_i ← 0 and t̃_i(t) ← t for i ∈ A_t, and s_i ← s_i + 1 for i ∉ A_t.
        Solve the subproblem (3) with the specified ρ
        to obtain w_{t+1}. Broadcast w_{t+1} to the worker
        machines that are free.
  end for
Algorithm 1 Efficient Distributed Algorithm for Nonconvex-Nonsmooth Inference (EDANNI)
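The listing above is stated for a real master-worker system; for intuition, the following self-contained Python/NumPy sketch simulates the same bookkeeping serially. All names are hypothetical, and the master update shown is the simple first-order proximal-gradient variant discussed in Section III, with g taken to be an l1 penalty, rather than the full subproblem (3).

import numpy as np

def prox_l1(u, thresh):
    """Proximal operator of g = thresh * ||.||_1 (soft thresholding)."""
    return np.sign(u) * np.maximum(np.abs(u) - thresh, 0.0)

def edanni_style_simulation(grads, w0, rho, lam, tau, T, rng):
    """Serial simulation of the asynchronous bookkeeping with bounded delay tau.

    grads[i](w) returns the local gradient of f_i at w; machine 0 plays the role
    of the master (machine 1 in the paper's numbering) and its gradient is always
    current, while workers may supply gradients up to tau iterations old.
    """
    m = len(grads)
    w = w0.copy()
    workers = list(range(1, m))
    latest_grad = {i: grads[i](w) for i in range(m)}   # last received gradients
    staleness = {i: 0 for i in workers}                # s_i counters
    for t in range(T):
        # A_t: a random subset of workers arrives, but any worker whose counter
        # has reached tau is forced to arrive so the bounded delay holds.
        A_t = {i for i in workers if staleness[i] >= tau or rng.random() < 0.5}
        for i in workers:
            if i in A_t:
                latest_grad[i] = grads[i](w)
                staleness[i] = 0
            else:
                staleness[i] += 1
        latest_grad[0] = grads[0](w)                   # master's own gradient
        avg_grad = sum(latest_grad.values()) / m
        w = prox_l1(w - avg_grad / rho, lam / rho)     # first-order master update
    return w

# usage sketch: four machines, each holding a local least-squares loss
rng = np.random.default_rng(0)
A = [rng.standard_normal((30, 10)) for _ in range(4)]
b = [rng.standard_normal(30) for _ in range(4)]
grads = [lambda w, A=A[i], b=b[i]: A.T @ (A @ w - b) / len(b) for i in range(4)]
w_hat = edanni_style_simulation(grads, np.zeros(10), rho=10.0, lam=0.1,
                                tau=2, T=300, rng=rng)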

III Theoretical Analysis

Solving subproblem (3) is inspired by the approaches of Shamir et al. [37], Wang et al. [8], and Jordan et al. [9], and is designed to take advantage of both global first-order information and local higher-order information. Indeed, when g ≡ 0 and f_1 is quadratic, (3) has the following closed-form solution:

which is similar to a Newton updating step. The more general case has a proximal Newton flavor; see, e.g., [39] and the references therein. However, our method differs from those methods in the proximal term as well as in the first-order term. Intuitively, if we use the first-order approximation

(4)

then (3) reduces to

(5)

which is essentially a first-order proximal gradient updating step.
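To complement the asynchronous sketch after Algorithm 1, the fragment below illustrates the Newton-flavored special case in code. It is only a sketch under the assumption that, for g = 0 and a quadratic master loss with Hessian H1, the closed-form update preconditions the aggregated (possibly stale) gradient by the damped local Hessian H1 + rho * I; the exact expression is the one determined by (3), and the names here are hypothetical.

import numpy as np

def newton_like_step(w, H1, agg_grad, rho):
    """Assumed closed-form update for g = 0 and quadratic f_1: precondition the
    aggregated gradient by the master's damped local Hessian (H1 + rho * I)."""
    return w - np.linalg.solve(H1 + rho * np.eye(len(w)), agg_grad)

# usage sketch: quadratic local loss at the master, aggregated gradient from workers
rng = np.random.default_rng(1)
M = rng.standard_normal((10, 10))
H1 = M @ M.T / 10 + np.eye(10)        # a positive definite local Hessian
agg_grad = rng.standard_normal(10)    # (1/m) * sum of the latest worker gradients
w_next = newton_like_step(np.zeros(10), H1, agg_grad, rho=1.0)

In this sketch, a larger rho makes the step more conservative and closer to a small gradient step, while rho = 0 recovers a pure Newton-type step on the master's local quadratic model.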

We consider the convergence of the proposed approach under the asynchronous protocol, in which the master is free to make updates with gradients from only a partial set of worker machines. We start by introducing important conditions that are commonly used in previous work [13, 20, 40].

The function f_i is differentiable and has Lipschitz continuous gradient for all i ∈ [m], i.e., there exists a constant L > 0 such that

\|\nabla f_i(w) - \nabla f_i(w')\| \le L \|w - w'\| \quad \text{for all } w, w'.

The proof of the linear convergence relies on the following strong convexity assumption. For all i ∈ [m], the function f_i is strongly convex with modulus σ > 0, which means that

f_i(w) \ge f_i(w') + \langle \nabla f_i(w'), w - w' \rangle + \frac{\sigma}{2} \|w - w'\|^2

for all w, w'.

For all t, the parameter ρ in (3) is chosen large enough such that:

  • and , for some constant , where represents the convex modulus of the function .

  • There exists a constant such that

Moreover, the following concept is needed in the first part of Theorem III.

We say a function h is coercive if

h(w) \to \infty \quad \text{whenever } \|w\| \to \infty.

Define

(6)   \widetilde{\nabla} F(w) := w - \mathrm{prox}_{g}\big(w - \nabla f(w)\big), \qquad \text{where } f := \frac{1}{m}\sum_{i=1}^{m} f_i \ \text{ and } \ F := f + g,

where prox_g is the proximal operator defined by \mathrm{prox}_{g}(u) := \arg\min_{v} \{ g(v) + \tfrac{1}{2}\|v-u\|^2 \}. Usually ∇̃F(w) is called the proximal gradient of F; w is a stationary point when ∇̃F(w) = 0.
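As a quick numerical illustration of this stationarity measure (a sketch assuming a unit step size and, purely for illustration, g = lam * ||.||_1 with a separable quadratic f; the helper names are hypothetical), the proximal gradient mapping can be evaluated and its norm used as a stopping criterion:

import numpy as np

def prox_l1(u, thresh):
    """Proximal operator of thresh * ||.||_1 (soft thresholding)."""
    return np.sign(u) * np.maximum(np.abs(u) - thresh, 0.0)

def prox_gradient_mapping(w, grad_f, lam):
    """Proximal gradient of F = f + lam*||.||_1 at w (unit step size):
    w is approximately stationary when this vector is approximately zero."""
    return w - prox_l1(w - grad_f(w), lam)

# usage sketch: f(w) = 0.5 * ||w - c||^2, so grad_f(w) = w - c, and the minimizer
# of F is the soft-thresholded c; the mapping vanishes exactly at that point.
c = np.array([2.0, -0.05, 0.0])
gap = prox_gradient_mapping(np.array([1.9, 0.0, 0.0]), lambda w: w - c, lam=0.1)
print(np.linalg.norm(gap))   # 0.0, i.e., the test point is stationary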

Based on these assumptions, we can now present the main theorem. Suppose Assumptions II, III, and III are satisfied. Then we have the following claims for the sequence {w_t} generated by Algorithm 1 (EDANNI).

  • (Boundedness of Sequence). The gap between successive iterates w_{t+1} and w_t converges to 0, i.e., ||w_{t+1} − w_t|| → 0 as t → ∞.

    If F is coercive, then the sequence {w_t} generated by Algorithm 1 is bounded.

  • (Convergence to Stationary Points). Every limit point of the iterates generated by Algorithm 1 is a stationary point of problem (2). Furthermore, ∇̃F(w_t) → 0 as t → ∞.

  • (Sublinear Convergence Rate). Given ε > 0, let us define T(ε) to be the first time for the optimality gap to reach below ε, i.e.,

    T(\epsilon) := \min \{ t : \|\widetilde{\nabla} F(w_t)\|^2 \le \epsilon \}.

    Then there exists a constant C > 0, independent of ε, such that

    \epsilon \le \frac{C}{T(\epsilon)}.

    Therefore, the optimality gap converges to 0 in a sublinear manner.

The theorem suggests that the iterates may or may not be bounded without the coerciveness property of F. However, it guarantees that the optimality measure converges to 0 sublinearly. We remark that [19] also analyzed the convergence of a proximal-gradient-based communication-efficient algorithm for nonconvex problems, but without giving a specific convergence rate. Note that such a sublinear complexity bound is tight when applying first-order methods to nonconvex unconstrained problems (see [41, 42]). Let us define

The gap between and is denoted by , for . The proof of Theorem III relies on the following three lemmas.

Suppose Assumption III and Assumption III (I) are satisfied. Then the following is true for the iterates generated by Algorithm 1 (EDANNI):

(7)

Under the assumptions of Theorem III for any we have

(8)

Suppose Assumption III is satisfied. Then, for the sequence generated by Algorithm 1 (EDANNI), there exist constants and such that

The proofs of these lemmata are given in the Appendix. In the following we prove Theorem III.

Proof of Theorem III.

We begin by establishing the first conclusion of the theorem. Summing inequality (III) in Lemma III over t yields

Now define

by Assumption III we have and , therefore . It holds that

(9)

Note that by Lemma III the LHS of (9) is bounded from below. By letting , it follows that

Moreover, Lemma III shows that is bounded, but due to the coerciveness assumption

(10)

so we know is bounded. Therefore the first conclusion is proved.

We now establish the second conclusion of the theorem. From (3), we know that

where is a proximal operator defined by . This implies that

(11)

Note that here the inequality holds because of the nonexpansiveness of the proximal operator. The last inequality follows from Assumption III.

Let be the set of stationary points of problem (2), and let

denote the distance between and the set . Now we prove

Suppose there exists a subsequence of such that but

(12)

Then it is obvious that . Therefore there exists some , such that

(13)

On the other hand, from (III) and the lower semi-continuity of we have , so by the definition of the distance function we have

(14)

Combining (13) and (14), we must have

This contradicts (12), so the second result is proved.

We finally prove the third conclusion of the theorem. Summing (III) over t yields

(15)

Combining (9) and (III) we have

where .

Let . Then the above inequality implies

Thus it follows that

where , proving Theorem III. ∎

Besides the convergence in the nonconvex setting, in the following theorem we show that the proposed algorithm converges linearly if each f_i is strongly convex. Quite interestingly, compared with the results of [8, 9, 10], here the linear convergence is established without any statistical assumptions on the data stored in each local machine. Suppose Assumptions II, III, and III are satisfied. If ρ is sufficiently large such that

and

for some and , then it holds for the sequence generated by (EDANNI) that

where .

Note that the above conditions can be satisfied when is sufficiently larger than the order of and the exponential of and is larger than the order of . Theorem III asserts that, with the strong convexity of the f_i's, the augmented optimality gap decreases linearly to zero under these conditions. Moreover, Assumption III can be replaced by only requiring that each is convex and that is strongly convex with modulus . To prove Theorem III, we need the following lemma to bound the optimality gap of the function .

Suppose Assumption II, III, and III hold and for some , then it follows that

(16)

The proof of Lemma III is given in the Appendix. We now prove Theorem III.

Proof of Theorem III.

We begin by defining . Then from the proof of Lemma III it holds that

(17)

Note that from (III) of Lemma III we have

(18)

By combining (III) and (III), we have the following bound on the LHS:

(19)

Inequality (III) gives us a relation between and . Let us define and

then by applying (III) recursively we have

where we use the fact that

The inequality holds because the coefficient of in the summation is less than <