Due to rapid developments in information and computing technology, modern applications often involve vast amounts of data, rendering local processing (e.g., on a single machine or a single processing core) computationally challenging or even prohibitive. To deal with this problem, distributed and parallel implementations are natural approaches that can fully leverage multi-core computing and storage technologies. However, one drawback of distributed algorithms is that the communication cost can be high, in terms of raw bytes transmitted, latency, or both, since machines (i.e., computation nodes) need to frequently exchange information. Therefore, algorithms that require less communication are preferred in this setting.
In this paper we study a general communication-efficient distributed algorithm which can be applied to a broad class of nonconvex nonsmooth inference problems. Assume that we have $N$ data samples available. We consider a general problem appearing frequently in signal processing and machine learning applications; we aim to solve
$$\min_{w \in \mathbb{R}^d} \; F(w) := \frac{1}{N}\sum_{i=1}^{N} f_i(w) + g(w), \qquad (1)$$
where each $f_i$ is the loss function associated with the $i$-th data sample and is assumed smooth but possibly nonconvex with Lipschitz continuous gradient, and $g$ is a convex (proper and lower semi-continuous) function that is possibly nonsmooth. Problem (1) covers many important machine learning and signal processing problems, such as localization with wireless acoustic sensor networks (WASNs), the support vector machine (SVM), the independent component analysis (ICA) reconstruction problem, and the sparse principal component analysis (PCA) problem.
For our distributed approach, we consider a network of $m$ machines with a star topology, where one node, designated the “Master” node (node $1$, without loss of generality), is located at the center of the star, and the remaining nodes (with indices $2, \dots, m$) are the “Worker” nodes (see Figure 1). Without loss of generality, assume that the number of data samples $N$ is evenly divisible by $m$, i.e., $N = mn$ for some integer $n$, and each machine stores $n$ unique data samples. Then (1) can be reformulated as the following problem:
$$\min_{w \in \mathbb{R}^d} \; \frac{1}{m}\sum_{j=1}^{m} \frac{1}{n}\sum_{i=1}^{n} f_{j,i}(w) + g(w), \qquad (2)$$
where $f_{j,i}$ is the loss function corresponding to the $i$-th sample of the $j$-th machine.
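As a concrete (hypothetical) instantiation of the partitioned objective (2), the following sketch uses a per-sample logistic loss and an $\ell_1$ regularizer; neither choice is prescribed above, and all names are illustrative:

```python
import numpy as np

# Minimal sketch of evaluating the partitioned objective in (2).
# Logistic loss per sample and g = lam * ||w||_1 are illustrative choices.
rng = np.random.default_rng(0)
m, n, d = 4, 25, 10                     # m machines, n samples each, dimension d
X = rng.standard_normal((m, n, d))      # X[j, i] is sample i stored on machine j
y = rng.choice([-1.0, 1.0], size=(m, n))
lam = 0.1                               # hypothetical l1 regularization weight

def local_loss(w, j):
    """Empirical loss (1/n) * sum_i f_{j,i}(w) on machine j."""
    margins = y[j] * (X[j] @ w)
    return np.mean(np.log1p(np.exp(-margins)))

def objective(w):
    """(1/m) * sum_j local_loss(w, j) + g(w), with g = lam * ||w||_1."""
    return np.mean([local_loss(w, j) for j in range(m)]) + lam * np.abs(w).sum()

w0 = np.zeros(d)
print(objective(w0))   # log(2) ≈ 0.6931, since every margin is 0 and g(0) = 0
```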
I-A Main Results
We propose an Efficient Distributed Algorithm for Nonconvex Nonsmooth Inference (EDANNI), and show that, for general problems of the form (2), EDANNI converges to the set of stationary points if the algorithm parameters are chosen appropriately according to the maximum network delay. Our results differ significantly from existing works [5, 6, 7], which were all developed for convex problems. Therefore, the analysis and algorithm proposed here are applicable not only to standard convex learning problems but also to important nonconvex problems. To the best of our knowledge, this is the first communication-efficient algorithm exploiting local second-order information that is guaranteed to converge for general nonconvex nonsmooth problems. Moreover, linear convergence is also proved in the strongly convex setting with no statistical assumptions on the data stored in each local machine, which is another improvement over existing works. The synchronization inherent in previous works, including [5, 6, 7, 8, 9, 10], slows down those methods because the master needs to wait for the slowest worker during each iteration; here, we propose an asynchronous approach that can accelerate the corresponding inference tasks significantly (as we demonstrate in the experimental results).
I-B Related Work
There is a large body of work on distributed optimization for modern data-intensive applications; see, for example, [5, 6, 7, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]. (Parts of the results presented here appeared in our conference paper without theoretical analysis.) Early works, including [6, 11, 15], mainly considered the convergence of parallelized stochastic gradient descent (SGD) schemes, which stem from ideas in the seminal text by Bertsekas and Tsitsiklis. Niu et al. proposed a lock-free implementation of distributed SGD called Hogwild! and provided its rates of convergence for sparse learning problems; it was followed by many variants, such as [24, 25]. For solving large-scale problems, works including [18, 19] studied distributed optimization based on a parameter-server framework and parameter partitioning. Chang et al. studied asynchronous distributed optimization based on the alternating direction method of multipliers (ADMM). By formulating the optimization problem as a consensus problem, the ADMM can be used to solve the consensus problem in a fully parallel fashion over networks with a star topology. One drawback of such approaches is that they can be computationally intensive, since each worker machine is required to solve a high-dimensional subproblem. As we will show, these methods also converge more slowly (in terms of communication rounds) than the proposed approach (see Section IV).
A growing interest in distributed algorithms has also appeared in the statistics community [27, 28, 29, 30, 31]. Most of these algorithms depend on a partition of the data, so the corresponding analyses usually involve statistical assumptions to handle the correlation between the data on local machines. A popular approach in early works is averaging estimators generated locally by different machines [15, 28, 32, 33]. Yang, Ma et al., and Jaggi et al. studied distributed optimization based on stochastic dual coordinate ascent; however, their communication complexity is not better than that of first-order approaches. Shamir et al. and Zhang and Xiao proposed truly communication-efficient distributed optimization algorithms that leverage local second-order information, though these approaches are only guaranteed to work for convex and smooth objectives. In a similar spirit, Wang et al., Jordan et al., and Ren et al. developed communication-efficient algorithms for sparse learning with $\ell_1$ regularization. However, each of these works requires strong convexity of the loss functions, which may limit their applicability to only a small set of real-world applications. Here we describe an algorithm with a similar flavor but more general applicability, and establish its convergence rate in both the strongly convex and the nonconvex nonsmooth settings. Moreover, unlike [8, 9, 10, 37, 38], where the convergence analyses rely on certain statistical assumptions on the data stored in the machines, our convergence analysis is deterministic and characterizes the worst-case convergence conditions.
Notation. For a vector $x \in \mathbb{R}^d$ and $p > 0$ we write $\|x\|_p = (\sum_{i=1}^{d} |x_i|^p)^{1/p}$; for $p \ge 1$ this is a norm. Usually $\|x\|_2$ is briefly written as $\|x\|$. The set of natural numbers is denoted by $\mathbb{N}$. For an integer $m$, we write $[m]$ as shorthand for the set $\{1, \dots, m\}$.
In this section, we describe our approach to computing the minimizer of (2). Recall that we have $m$ machines. Let $t$ denote the iteration number; then $\mathcal{A}^t$ is defined as the index set of the subset of worker machines from which the master receives updated gradient information during iteration $t$; worker $j$ is said to be “arrived” if $j \in \mathcal{A}^t$. At iteration $t$, the master machine solves a subproblem to obtain an updated estimate, and communicates this to the worker machines in the subset $\mathcal{A}^t$. After receiving the updated estimate, the worker machines compute the corresponding gradients of their local empirical loss functions. These gradients are then communicated back to the master machine, and the process continues.
Let
$$f_j(w) := \frac{1}{n}\sum_{i=1}^{n} f_{j,i}(w)$$
be the empirical loss at each machine $j$. Let $\tau_j(t)$ be the latest time (in terms of the iteration count) at which worker $j$ arrived, up to and including iteration $t$.
In the $t$-th iteration, the master (machine $1$) solves the following subproblem to update $w^{t+1}$:
$$w^{t+1} = \arg\min_{w} \; f_1(w) + \Big\langle \frac{1}{m}\sum_{j=1}^{m} \nabla f_j\big(w^{\tau_j(t)}\big) - \nabla f_1(w^t),\, w \Big\rangle + \frac{\mu}{2}\|w - w^t\|^2 + g(w). \qquad (3)$$
This updated estimate is communicated to the worker machines that are free, where it is used to compute their local gradients $\nabla f_j(w^{t+1})$. Since machine $1$ is assumed to be the master machine, $\tau_1(t)$ is actually $t$.
Now one question is: which partial sets of worker machines (with indices in $\mathcal{A}^t$) from which the master receives updated gradient information during iteration $t$ are sufficient to ensure convergence of a distributed approach? First, let $\tau \ge 0$ be a maximum tolerable delay, that is, the maximum number of iterations for which any worker machine can remain inactive. The set $\mathcal{A}^t$ should satisfy:
[Bounded delay] For all $j \in [m]$ and every iteration $t$, it holds that $t - \tau_j(t) \le \tau$.
To satisfy the bounded-delay assumption, $\mathcal{A}^t$ should contain at least the indices of the worker machines that have been inactive for $\tau$ iterations. That is, the master needs to wait until those workers finish their current computation and have arrived. Note that, by the definition of $\tau_j(t)$, it holds that $j \in \mathcal{A}^{\tau_j(t)}$ and $\tau_j(t) \le t$.
The bounded-delay assumption requires that every worker arrives at least once within any window of $\tau + 1$ consecutive iterations. In other words, the gradient information used by the master is at most $\tau$ iterations old. To guarantee the bounded delay, at every iteration the master needs to wait for the workers that have not been active for $\tau$ iterations, if such workers exist. Note that, when $\tau = 0$, one has $\tau_j(t) = t$ for all $j$, which reduces to the synchronous case where the master always waits for all the workers at every iteration.
The proposed approach is presented in Algorithm 1, which specifies the respective steps for the workers and the master. Algorithm 1 has three prominent differences compared with its synchronous alternatives. First, only the workers in $\mathcal{A}^t$ update their gradients and transmit them to the master machine. For the workers not in $\mathcal{A}^t$, the master uses their latest gradient information received before iteration $t$, i.e., $\nabla f_j(w^{\tau_j(t)})$. Second, the variables $d_j$ are introduced to count the delays of the workers since their last updates: $d_j$ is set to zero if worker $j$ arrives at the current iteration; otherwise, $d_j$ is increased by one. Therefore, to ensure the bounded-delay assumption holds at each iteration, the master should wait if there exists at least one worker whose $d_j \ge \tau$. Third, after solving the subproblem, the master transmits the up-to-date variable only to the arrived workers. In general, both the master and the fast workers in the asynchronous approach can update more frequently and spend less time waiting than their synchronous counterparts.
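The asynchronous bookkeeping described above can be sketched in a single-process simulation. The quadratic local losses, the choice $g = 0$, and the simple averaged-(stale-)gradient master step below are illustrative stand-ins, not the actual subproblem of Algorithm 1; only the arrival-set and delay-counter logic mirrors the text:

```python
import numpy as np

# Single-process simulation of the asynchronous bookkeeping in Algorithm 1.
rng = np.random.default_rng(1)
m, d, tau = 5, 8, 3                         # machines, dimension, max delay
Amats = [rng.standard_normal((d, d)) for _ in range(m)]
Amats = [a.T @ a / d + np.eye(d) for a in Amats]     # SPD local quadratics
b = [rng.standard_normal(d) for _ in range(m)]

w = np.zeros(d)
grads = [Amats[j] @ w - b[j] for j in range(m)]      # latest gradient per worker
delay = [0] * m                                      # d_j: iterations since last arrival
mu = 25.0

for t in range(400):
    # Workers whose counter reached tau MUST arrive (bounded-delay rule);
    # the others arrive at random, mimicking asynchrony.
    arrived = {j for j in range(m) if delay[j] >= tau or rng.random() < 0.5}
    for j in range(m):
        if j in arrived:
            grads[j] = Amats[j] @ w - b[j]           # fresh local gradient
            delay[j] = 0
        else:
            delay[j] += 1                            # stale gradient reused
    w = w - (1.0 / mu) * np.mean(grads, axis=0)      # simplified master update

# w should approach the minimizer of the averaged quadratic objective
w_star = np.linalg.solve(np.mean(Amats, axis=0), np.mean(b, axis=0))
print(np.linalg.norm(w - w_star))
```

Despite gradients being up to $\tau$ iterations stale, the iterate converges because the step size $1/\mu$ is small relative to the delay, in line with the bounded-delay condition.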
III Theoretical Analysis
The master's subproblem is inspired by the approaches of Shamir et al., Wang et al., and Jordan et al., and is designed to take advantage of both global first-order information and local higher-order information. Indeed, when $g = 0$ and $f_1$ is quadratic, the subproblem has the following closed-form solution:
$$w^{t+1} = w^t - \big(\nabla^2 f_1(w^t) + \mu I\big)^{-1} \frac{1}{m}\sum_{j=1}^{m} \nabla f_j\big(w^{\tau_j(t)}\big),$$
which is similar to a Newton updating step. The more general case has a proximal Newton flavor. However, our method differs from those methods in the proximal term as well as the first-order term. Intuitively, if we replace $f_1$ by its first-order approximation
$$f_1(w) \approx f_1(w^t) + \big\langle \nabla f_1(w^t),\, w - w^t \big\rangle,$$
then the subproblem reduces to
$$w^{t+1} = \mathrm{prox}_{g/\mu}\Big(w^t - \frac{1}{\mu}\cdot\frac{1}{m}\sum_{j=1}^{m} \nabla f_j\big(w^{\tau_j(t)}\big)\Big),$$
which is essentially a first-order proximal gradient updating step.
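For concreteness, the proximal gradient form of the update can be written out when $g(w) = \lambda\|w\|_1$, so that the prox map is soft-thresholding; this choice of $g$ is purely illustrative:

```python
import numpy as np

# Proximal-gradient form of the master update, assuming g(w) = lam * ||w||_1.
# avg_grad stands in for (1/m) * sum_j grad f_j(w^{tau_j(t)}), possibly stale.
def soft_threshold(v, thresh):
    """prox of thresh * ||.||_1: shrink each coordinate toward zero."""
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def prox_gradient_step(w, avg_grad, mu, lam):
    """w^{t+1} = prox_{g/mu}(w^t - avg_grad / mu)."""
    return soft_threshold(w - avg_grad / mu, lam / mu)

w = np.array([0.5, -0.2, 0.0])
g = np.array([0.1, 0.1, 0.1])
print(prox_gradient_step(w, g, mu=1.0, lam=0.05))   # → [ 0.35 -0.25 -0.05]
```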
We consider the convergence of the proposed approach under the asynchronous protocol, where the master has the freedom to make updates with gradients from only a partial set of worker machines. We start by introducing important conditions that are used commonly in previous work [13, 20, 40].
The function $f_{j,i}$ is differentiable and has Lipschitz continuous gradient for all $j \in [m]$ and $i \in [n]$, i.e., there exists $L > 0$ such that
$$\|\nabla f_{j,i}(w) - \nabla f_{j,i}(w')\| \le L\,\|w - w'\| \quad \text{for all } w, w'.$$
The proof of the linear convergence relies on the following strong convexity assumption. For all $j \in [m]$, the function $f_j$ is strongly convex with modulus $\sigma > 0$, which means that
$$f_j(w) \ge f_j(w') + \big\langle \nabla f_j(w'),\, w - w' \big\rangle + \frac{\sigma}{2}\|w - w'\|^2$$
for all $w, w'$.
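As a simple example satisfying this assumption (the choice of loss here is illustrative, not taken from the paper), consider a ridge-regularized least-squares loss:

```latex
% f_j(w) = \tfrac{1}{2n}\|X_j w - y_j\|_2^2 + \tfrac{\lambda}{2}\|w\|_2^2
% has Hessian \nabla^2 f_j(w) = \tfrac{1}{n} X_j^\top X_j + \lambda I \succeq \lambda I,
% so f_j is strongly convex with modulus at least \lambda:
f_j(w) \;\ge\; f_j(w') + \langle \nabla f_j(w'),\, w - w' \rangle
        + \frac{\lambda}{2}\,\|w - w'\|_2^2 .
```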
For all $t$, the parameter $\mu$ in the subproblem is chosen large enough that the following conditions hold, for some constant representing the convexity modulus of the corresponding function.
There exists a constant such that
Moreover, the following concept is needed in the first part of the main theorem.
We say a function $h$ is coercive if $h(w) \to +\infty$ whenever $\|w\| \to \infty$. To measure optimality, define
$$\tilde{\nabla} F(w) := w - \mathrm{prox}_{g}\big(w - \nabla f(w)\big),$$
where $\mathrm{prox}_{g}$ is the proximal operator defined by $\mathrm{prox}_{g}(u) := \arg\min_{v}\, g(v) + \frac{1}{2}\|v - u\|^2$. Usually $\tilde{\nabla} F(w)$ is called the proximal gradient of $F$; $w$ is a stationary point when $\tilde{\nabla} F(w) = 0$.
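The prox-gradient optimality measure can be computed directly; in this sketch, $g = \lambda\|\cdot\|_1$ and the quadratic smooth part are again illustrative choices, not those of the paper:

```python
import numpy as np

# Prox-gradient stationarity measure: w is stationary when the residual is 0.
def prox_l1(v, thresh):
    """prox of thresh * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def prox_grad_residual(w, grad_f, lam, gamma=1.0):
    """(1/gamma) * (w - prox_{gamma*g}(w - gamma * grad_f(w))), g = lam*||.||_1."""
    return (w - prox_l1(w - gamma * grad_f(w), gamma * lam)) / gamma

# Quadratic smooth part f(w) = 0.5 * ||w - c||^2, so grad_f(w) = w - c.
c = np.array([2.0, -0.5, 0.01])
grad_f = lambda w: w - c
lam = 0.1
# The minimizer of f + lam*||.||_1 is soft_threshold(c, lam);
# the residual vanishes (up to rounding) at that point.
w_star = prox_l1(c, lam)
print(np.linalg.norm(prox_grad_residual(w_star, grad_f, lam)))
```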
Based on these assumptions, we can now present the main theorem. Suppose the bounded-delay, Lipschitz-gradient, and parameter-selection assumptions are satisfied. Then we have the following claims for the sequence $\{w^t\}$ generated by Algorithm 1 (EDANNI).
(Boundedness of Sequence). The gap between $w^{t+1}$ and $w^t$ converges to $0$, i.e.,
$$\lim_{t \to \infty} \|w^{t+1} - w^t\| = 0.$$
If $F$ is coercive, then the sequence $\{w^t\}$ generated by Algorithm 1 is bounded.
(Sublinear Convergence Rate). Given $\epsilon > 0$, define $T(\epsilon)$ to be the first time the optimality gap falls below $\epsilon$, i.e.,
$$T(\epsilon) := \min\big\{ t \;:\; \|\tilde{\nabla} F(w^t)\|^2 \le \epsilon \big\}.$$
Then there exists a constant $\nu > 0$ such that
$$\epsilon \le \frac{\nu}{T(\epsilon)},$$
where $\nu$ equals a positive constant times a problem-dependent quantity. Therefore, the optimality gap converges to $0$ in a sublinear manner.
The theorem shows that the iterates may or may not be bounded without the coerciveness of $F$; however, it guarantees that the optimality measure converges to $0$ sublinearly. We remark that the convergence of a proximal-gradient-based communication-efficient algorithm for nonconvex problems has also been analyzed previously, but without a specific convergence rate. Note that such a sublinear complexity bound is tight when applying first-order methods to nonconvex unconstrained problems (see [41, 42]). Let us define
Under the assumptions of the main theorem, for any $t$ we have
Suppose the Lipschitz-gradient assumption is satisfied. Then, for $\{w^t\}$ generated by EDANNI, there exist two constants such that
The proofs of these lemmata are given in the Appendix. We now prove the main theorem.
Proof of Theorem III.
by Assumption III we have and , therefore . It holds that
Moreover, the lemma above shows that $\{F(w^t)\}$ is bounded; due to the coerciveness assumption, it follows that $\{w^t\}$ is bounded. Therefore, the first conclusion is proved.
We now establish the second conclusion of the theorem. From the subproblem update, we know that
where $\mathrm{prox}_{g}$ is the proximal operator defined earlier. This implies that
Note that the inequality here holds because of the nonexpansiveness of the proximal operator. The last inequality follows from the Lipschitz continuity of the gradients.
Let $\mathcal{W}^*$ be the set of stationary points of problem (2), and let $d(w, \mathcal{W}^*)$ denote the distance between $w$ and the set $\mathcal{W}^*$. Now we prove
Suppose, to the contrary, that there exists a subsequence $\{w^{t_k}\}$ of $\{w^t\}$ such that but
Then it is obvious that . Therefore there exists some , such that
On the other hand, from the inequality above and the lower semi-continuity of $g$, by the definition of the distance function we have
This contradicts (12), so the second result is proved.
We finally prove the third conclusion of the theorem. Summing the inequality above over $t$ yields
Let . Then the above inequality implies
Thus it follows that
where the constant is as given above; this proves the main theorem. ∎
Besides convergence in the nonconvex setting, in the following theorem we show that the proposed algorithm converges linearly if each $f_j$ is strongly convex. Quite interestingly, compared with the results of [8, 9, 10], here linear convergence is established without any statistical assumption on the data stored in each local machine. Suppose the bounded-delay, Lipschitz-gradient, and strong-convexity assumptions are satisfied. If $\mu$ is sufficiently large such that
for some constants, then it holds for the sequence $\{w^t\}$ generated by EDANNI that
Note that the above conditions can be satisfied when $\mu$ is sufficiently large relative to the other problem parameters and the maximum delay. The theorem asserts that, with the strong convexity of the $f_j$'s, the augmented optimality gap decreases linearly to zero under these conditions. Moreover, the strong-convexity assumption can be relaxed to requiring only that each $f_{j,i}$ is convex and that $f$ is strongly convex with modulus $\sigma$. To prove this theorem, we need the following lemma to bound the optimality gap of the function $F$.
Proof of Theorem III.
We begin by defining an auxiliary quantity. Then, from the proof of the lemma above, it holds that