Distributed Non-Convex First-Order Optimization and Information Processing: Lower Complexity Bounds and Rate Optimal Algorithms

04/08/2018
by   Haoran Sun, et al.
University of Minnesota

We consider a class of distributed non-convex optimization problems that often arise in modern distributed signal and information processing, in which a number of agents connected by a network G collectively optimize a sum of smooth (possibly non-convex) local objective functions. We address the following fundamental question: for a class of unconstrained non-convex problems with Lipschitz continuous gradient, using only local gradient information, what is the fastest rate that distributed algorithms can achieve, and how can those rates be attained? We develop a lower bound analysis that identifies difficult problem instances for any first-order method. We show that in the worst case it takes any first-order algorithm O(D L /ϵ) iterations to achieve a certain ϵ-solution, where D is the network diameter and L is the Lipschitz constant of the gradient. Further, for a general problem class and a number of network classes, we propose optimal primal-dual gradient methods whose rates precisely match the lower bounds (up to a polylog factor). To the best of our knowledge, this is the first time that lower rate bounds and optimal methods have been developed for distributed non-convex problems. Our results provide guidelines for the future design of distributed optimization algorithms, convex and non-convex alike.


1 Introduction

1.1 Problem and motivation

In this work, we consider the following distributed optimization problem over a network:

min_{y ∈ ℝ^S}  f(y) := (1/M) Σ_{i=1}^{M} f_i(y),    (1)

where f_i is a smooth and possibly non-convex function accessible to agent i. There is no central controller, and the agents are connected by a network defined by an undirected and unweighted graph G = {V, E}, with |V| = M vertices and |E| = E edges. Each agent i can only communicate with its immediate neighbors, and it can access one component function f_i (by “access” we mean that it is able to query the function and obtain its values and gradients; this notion will be defined precisely shortly).

A common way to reformulate problem (1) in the distributed setting is given below. Introduce M local variables x_1, …, x_M ∈ ℝ^S, and suppose the graph G is connected; then the following formulation is equivalent to the global consensus problem:

min_{x_1, …, x_M}  (1/M) Σ_{i=1}^{M} f_i(x_i),   s.t.  x_i = x_j,  ∀ (i, j) ∈ E.    (2)

The main benefit of the above formulation is that the objective function is now separable, and the linear constraint encodes the network connectivity pattern.
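The equivalence between (1) and (2) can be made concrete with a minimal numerical sketch (the quadratic f_i, the 4-node path, and all variable names below are our own illustrative choices, not from the paper): over a connected edge set the constraints force every local copy to agree, so a consensual point that solves (1) is feasible and optimal for (2).

```python
import numpy as np

# Hypothetical local objectives f_i(x) = 0.5 * a_i * (x - b_i)^2; the names
# a_i, b_i and the 4-node path are illustrative, not from the paper.
rng = np.random.default_rng(0)
M = 4                                   # number of agents
a = rng.uniform(0.5, 2.0, size=M)
b = rng.uniform(-1.0, 1.0, size=M)

# Global formulation (1): minimize f(y) = (1/M) sum_i f_i(y) over a single y.
def global_obj(y):
    return np.mean(0.5 * a * (y - b) ** 2)

# Consensus formulation (2): one copy x_i per agent, constrained x_i = x_j for
# every edge (i, j).  For a connected graph the constraints force all copies
# to be equal, so the two problems share the same minimizers.
edges = [(0, 1), (1, 2), (2, 3)]        # a path graph on 4 nodes
def consensus_violation(x):
    return sum((x[i] - x[j]) ** 2 for i, j in edges)

y_star = np.sum(a * b) / np.sum(a)      # stationarity of (1): sum_i a_i (y - b_i) = 0
x = np.full(M, y_star)                  # consensual copies are feasible for (2)
assert consensus_violation(x) == 0.0
```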

1.2 Distributed non-convex optimization

Non-convex distributed optimization has gained considerable attention recently, and has found applications in training neural networks [1], distributed information processing and machine learning [2, 3, 4], and distributed signal processing [5].

Problems (1) and (2) have been studied extensively in the literature when the f_i's are all convex; see for example [6, 7, 8, 9, 10, 11]. Primal methods such as the distributed subgradient (DSG) method [6] and the EXTRA method [9], as well as primal-dual based methods such as the distributed augmented Lagrangian method [11] and the Alternating Direction Method of Multipliers (ADMM) [12, 13, 14, 15], have been proposed.

In contrast, only recently have there been works addressing the more challenging problems that do not assume convexity of the f_i's; see the recent developments in [16, 17, 18, 4, 5, 19, 20, 21, 22]. Reference [4] develops a non-convex ADMM based method (with the first known global sublinear convergence rate) for solving the distributed consensus problem (1). However, the network considered therein is a star network, in which the local nodes are all connected to a central controller. Reference [20] proposes a primal-dual based method for unconstrained problems over a connected network, and derives the first global convergence rate for this setting. In [5] and the follow-up works [23, 24], the authors utilize a gradient tracking idea to solve a constrained nonsmooth distributed problem over possibly time-varying networks. The work [25] summarizes recent progress in extending DSG-based methods to non-convex problems. References [26, 27, 22] develop methods for distributed stochastic zeroth- and/or first-order non-convex optimization. It is worth noting that the distributed algorithms proposed in all these works converge to first-order stationary solutions, which include local maxima, local minima and saddle points. Only recently, the authors of [28] developed first-order distributed algorithms that are capable of computing second-order stationary solutions (which under suitable conditions become locally optimal solutions).

1.3 Lower and upper rate bounds analysis

Despite all the recent interests and contributions in this field, one major question remains open:


(Q)   What is the best convergence rate achievable by any distributed algorithm for the non-convex problem (1)?

Question (Q) seeks a “best convergence rate”, that is, a characterization of the smallest number of iterations required by any distributed algorithm to achieve a solution of a certain quality. Clearly, understanding (Q) provides fundamental insights into distributed optimization and information processing. For example, the answer to (Q) can provide meaningful optimal estimates of the total communication effort required to achieve a given level of accuracy. Further, the identified optimal strategies capable of attaining the best convergence rates will also help guide the practical design of distributed information processing algorithms.

Question (Q) is easy to state, but formulating it rigorously is quite involved, and a number of delicate issues have to be clarified. Below we provide a high-level discussion of some of these issues.

(1) Fix Problem and Network Classes. A class of problems and networks of interest should be fixed. Roughly speaking, in this work we will fix the problem class 𝒫 to be the family of smooth unconstrained problems (1), and the network class 𝒩 to be the set of connected and unweighted graphs with a finite number of nodes.

(2) Characterize High-Quality Solutions. For a properly defined error constant ϵ > 0, one needs to define what counts as a high-quality solution in the distributed and non-convex setting. Differently from the centralized case, the following questions have to be addressed: should the solution quality be evaluated based on the iterates averaged among all the agents, or on the individual iterates? Shall we include some consensus measure in the solution characterization? Different solution notions could potentially lead to different lower and upper rate bounds.

(3) Fix Algorithm Classes. A class of algorithms 𝒜 has to be fixed. In classical complexity analysis in (centralized) optimization, it is common to define the class of algorithms by the information structures that they utilize [29]. In the distributed and non-convex setting, it is necessary to specify both the function information that can be used by individual nodes, as well as the communication protocols that are allowed.

(4) Develop Sharp Upper Bounds. It is necessary to develop algorithms within class 𝒜 that possess provable and sharp global convergence rates for the problem/network classes (𝒫, 𝒩). These algorithms provide achievable upper bounds on the global convergence rates.

(5) Identify Lower Bounds. It is important to characterize the best worst-case rates achievable by any algorithm in class 𝒜 for the problem/network classes (𝒫, 𝒩). This task involves identifying instances in (𝒫, 𝒩) that are difficult for algorithm class 𝒜.

(6) Match Lower and Upper Bounds. The key task is to investigate whether the developed algorithms are rate optimal, in the sense that the rate upper bounds derived in (4) match the worst-case lower bounds developed in (5). Roughly speaking, matching two bounds requires that, for the classes of problems and networks (𝒫, 𝒩), the following quantities match between the lower and upper bounds: i) the order of the error constant ϵ; ii) the order of problem parameters such as the Lipschitz constants, and of network parameters such as the spectral gap, diameter, etc.

Convergence rate analysis (a.k.a. iteration complexity analysis) for convex problems dates back to Nesterov, Nemirovsky and Yudin [30, 31], in which lower bounds and optimal first-order algorithms have been developed; also see [32]. In recent years, many accelerated first-order algorithms achieving those lower bounds for different kinds of convex problems have been derived; see e.g., [33, 34, 35], including those developed for distributed convex optimization [36]. In those works, the optimality measure used is the objective gap f(x^k) − f(x*), and the lower bound can be expressed as [32, Theorem 2.2.2]

f(x^k) − f(x*) ≥ 3 L ‖x^0 − x*‖² / (32 (k + 1)²),    (3)

where L is the Lipschitz constant of ∇f; x* (resp. x^0) is the global optimal solution (resp. the initial solution); k is the iteration index. Therefore, to achieve an ϵ-optimal solution with f(x^k) − f(x*) ≤ ϵ, one needs O(√(L/ϵ) ‖x^0 − x*‖) iterations. Recently the above approach has been extended to distributed strongly convex optimization in [37]. In particular, the authors consider problem (1) in which each f_i is strongly convex, and they provide lower and upper rate bounds for a class of algorithms in which the local agents can utilize both the gradient of f_i and that of its Fenchel conjugate. We note that this result is not directly related to the class of “first-order” methods, since beyond the first-order gradient information, the Fenchel conjugate is also needed; computing this quantity requires performing an exact minimization, which itself involves solving a strongly convex optimization problem. Recent related works in this direction also include [38]. Therefore, optimal first-order distributed algorithms for strongly convex problems are still left open, not to mention for general convex and non-convex distributed problems.

Network Instances Problem Classes
Uniform Lipschitz Non-uniform Lipschitz Rate Achieving Algorithm
Complete/Star D-GPDA (proposed)
Random Geometric xFILTER (proposed)
Path/Circle xFILTER (proposed)
Grid xFILTER (proposed)
Centralized Gradient Descent
Table 1: The main results of the paper when specialized to a few popular graphs. The entries show the best rate bounds achieved by the proposed algorithms (either D-GPDA or xFILTER) for a number of specific graphs and problem classes; L_i is the Lipschitz constant of ∇f_i [see (4)]; for the uniform case, L_i = U for all i. For the uniform Lipschitz case, the lower rate bounds derived for each particular graph match the upper rate bounds (we only show the latter in the table). The last row shows the rate achieved by the centralized gradient descent algorithm. The notation Õ denotes big O up to a factor polynomial in logarithms.

When the problem becomes non-convex, the size of the gradient can be used as a measure of solution quality. In particular, let f* := inf_x f(x); then it has been shown that the classical (centralized) gradient descent (GD) method achieves the following rate [32, page 28]:

min_{k ≤ K} ‖∇f(x^k)‖² ≤ 2 L (f(x^0) − f*) / K.

It has been shown in [39] that the above rate is (almost) tight for GD. Recently, [40] has further shown that the above rate is optimal for any first-order method that only utilizes gradient information, when applied to problems with Lipschitz gradient. However, no lower bound analysis has been developed for the distributed non-convex problem (1); there are not even many algorithms that provide achievable upper rate bounds (except for the recent works [4, 20, 41, 42]), not to mention any analysis of the tightness/sharpness of these upper bounds.
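As a quick sanity check of this classical bound, the following sketch runs centralized GD with step size 1/L on a simple non-convex scalar function of our own choosing (not from the paper): f(x) = x² + 3 sin²(x), for which |f''(x)| = |2 + 6 cos(2x)| ≤ 8, so L = 8 works, and f* = 0 is attained at x = 0.

```python
import numpy as np

# A simple smooth non-convex test function (our own choice, not from the paper):
# f(x) = x^2 + 3 sin^2(x), f'(x) = 2x + 3 sin(2x), |f''(x)| = |2 + 6 cos(2x)| <= 8.
def f(x):
    return x ** 2 + 3 * np.sin(x) ** 2

def grad(x):
    return 2 * x + 3 * np.sin(2 * x)

L = 8.0                       # a valid gradient-Lipschitz constant
x0, K = 3.0, 200
x, best_sq = x0, np.inf
for _ in range(K):
    g = grad(x)
    best_sq = min(best_sq, g ** 2)
    x -= g / L                # classical GD step size 1/L

# Classical guarantee: min_{k <= K} ||grad f(x^k)||^2 <= 2 L (f(x^0) - f*) / K,
# with f* = 0 here since f >= 0 and f(0) = 0.
bound = 2 * L * (f(x0) - 0.0) / K
assert best_sq <= bound
```

In this easy instance GD does far better than the worst-case bound; the lower-bound constructions discussed later are designed precisely so that no first-order method can beat it by more than a constant.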

1.4 Contribution of this work

In this work, we address various issues that arise in answering (Q). Our main contributions are given below:

1) We identify a class of non-convex problems 𝒫 and networks 𝒩, a class of distributed first-order algorithms 𝒜, and rigorously define the ϵ-optimality gap that measures the progress of the algorithms;

2) We develop the first lower complexity bound for class 𝒜 to solve class 𝒫: to achieve ϵ-optimality, it is necessary for any algorithm in 𝒜 to perform a certain minimum number of rounds of communication among the nodes, expressed in terms of ϵ, a certain spectral gap ξ(G) of the graph Laplacian matrix, and the averaged Lipschitz constant L̄ of the gradients of the local functions;

3) We design two algorithms belonging to 𝒜, one based on a primal-dual optimization scheme (D-GPDA), the other based on a novel approximate filtering-then-predict-and-tracking (xFILTER) strategy, both of which achieve the ϵ-optimality condition with provable global rates;

4) We show that the xFILTER is an optimal method in 𝒜 for the problem class 𝒫 as well as a number of its refinements, in that it precisely achieves the lower complexity bounds that we derived (up to a polylog factor).

In Table 1, we specialize some key results developed in the paper to a few popular graphs.

Notations. For a given symmetric matrix X, we use λ_max(X), λ_min(X) and λ̃_min(X) to denote its maximum, minimum and minimum nonzero eigenvalues, respectively. We use I_P to denote the identity matrix of size P, and use ⊗ to denote the Kronecker product. We use [M] to denote the set {1, …, M}. For a vector x we use x[i] to denote its i-th element. We use Õ(·) to denote big O up to a factor polynomial in logarithms. We use i ∼ j to denote two connected nodes i and j, i.e., for a graph G = {V, E}, i ∼ j if (i, j) ∈ E with i, j ∈ V.

2 Preliminaries

2.1 The Classes 𝒫, 𝒩, 𝒜

We present the classes of problems, networks and algorithms to be studied, as well as some useful results. We parameterize these classes using a few key parameters so that we can specify their subclasses when needed.

Problem Class. A problem is in class 𝒫 if it satisfies the following conditions.

  • [A1] The objective is a sum of M component functions f_i; see (1).

  • [A2] Each component function f_i has Lipschitz gradient:

    ‖∇f_i(x) − ∇f_i(z)‖ ≤ L_i ‖x − z‖,  ∀ x, z ∈ ℝ^S,    (4)

    where L_i > 0 is the smallest positive number such that the above inequality holds true. Define L_max := max_i L_i, L_min := min_i L_i, and the average L̄ := (1/M) Σ_{i=1}^{M} L_i.

    Define the matrix of Lipschitz constants as:

    Λ := diag(L_1, …, L_M).    (5)

  • [A3] The function f is lower bounded over ℝ^S, i.e.,

    inf_{y ∈ ℝ^S} f(y) > −∞.    (6)

These assumptions are rather mild. For example, an f satisfying [A2]–[A3] is not required to be twice differentiable. Below we provide a few non-convex functions that satisfy Assumptions [A2]–[A3], each of which can serve as a component function f_i. Note that the first four functions are of particular interest in learning neural networks, as they are commonly used as activation functions.

(1) The sigmoid function is given by sigmoid(x) = 1/(1 + e^{−x}). We have |sigmoid′(x)| ≤ 1 and |sigmoid″(x)| ≤ 1, therefore [A2]–[A3] are true with L = 1.

(2) The function tanh(x) = (e^{2x} − 1)/(e^{2x} + 1) satisfies |tanh′(x)| ≤ 1 and |tanh″(x)| ≤ 1. So [A2]–[A3] hold with L = 1.

(3) The function satisfies so [A2-A3] hold with .

(4) The logit function logit(x) = e^x/(1 + e^x) is related to the tanh function as follows:

logit(x) = (1 + tanh(x/2))/2,

then Assumptions [A2]–[A3] are again satisfied.

(5) The function has applications in structured matrix factorization [43]. Clearly it is lower bounded. Its second-order derivative is also bounded.

(6) Other functions like , , are easy to verify. Consider where . This function is interesting because it is not second-order differentiable; nonetheless we can verify that [A2-A3] are satisfied with .
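The Lipschitz claims for the activation functions above are easy to sanity-check numerically. The sketch below (our own code; the grid range and resolution are arbitrary choices) estimates the gradient-Lipschitz constant of sigmoid, tanh and sin by taking divided differences of each first derivative, and confirms that L = 1 is a valid bound for all three.

```python
import numpy as np

# Numerically estimate the gradient-Lipschitz constant of a scalar function,
# i.e. sup_x |f''(x)|, via divided differences of f' on a fine grid.
xs = np.linspace(-10.0, 10.0, 200001)

def lipschitz_of_gradient(fprime):
    # max slope of f' on the grid closely tracks (and never exceeds) sup |f''|
    return np.max(np.abs(np.diff(fprime(xs)) / np.diff(xs)))

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
d_sigmoid = lambda x: sigmoid(x) * (1.0 - sigmoid(x))   # sigmoid'
d_tanh = lambda x: 1.0 - np.tanh(x) ** 2                # tanh'
d_sin = np.cos                                          # sin'

for name, fp in [("sigmoid", d_sigmoid), ("tanh", d_tanh), ("sin", d_sin)]:
    L_hat = lipschitz_of_gradient(fp)
    assert L_hat <= 1.0      # each satisfies [A2] with L = 1
    print(name, round(float(L_hat), 3))
```

The estimates also show that L = 1 is loose for sigmoid (whose second derivative peaks well below 1); [A2] only requires some valid constant.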

Network Class. Let 𝒩_M denote a class of networks represented by undirected and unweighted graphs G = {V, E}, with |V| = M vertices and |E| = E edges, and edge weights all being 1. In this paper the terms ‘network’ and ‘graph’ will be used interchangeably. Also, we use 𝒩_{M,D} to denote a class of networks defined similarly as above, but with M nodes and a diameter of D, defined below [where dist(i, j) indicates the distance between two nodes i and j]:

D := max_{i, j ∈ V} dist(i, j).    (7)

Following the convention in [44], we define a number of graph-related quantities below. First, define the degree d_i of node i as the number of edges incident to it, and define the averaged degree as:

d̄ := (1/M) Σ_{i ∈ V} d_i.    (8)

Define the incidence matrix (IM) A ∈ ℝ^{E×M} as follows: if e ∈ E connects vertices i and j with i > j, then A[e, i] = 1, A[e, j] = −1, and A[e, k] = 0 otherwise; see the definition in [44, Theorem 8.3]. Using these definitions, the graph Laplacian matrix L and the degree matrix D are defined as follows (see [44, Section 1.2]):

L := D^{−1/2} Aᵀ A D^{−1/2},   D := diag(d_1, …, d_M).    (9)

In particular, the elements of the Laplacian are given as:

L[i, j] = 1 if i = j and d_i ≠ 0;  L[i, j] = −1/√(d_i d_j) if i ∼ j;  L[i, j] = 0 otherwise.

We note that the graph Laplacian defined here is sometimes known as the normalized graph Laplacian in the literature, but throughout this paper we follow the convention used in the classical work [44] and simply refer to it as the graph Laplacian. For convenience, we also define a scaled version of the IM:

B := A D^{−1/2}.    (10)

It is known that the IM and the scaled IM satisfy the following (where 1 is the all-one vector):

A 1 = 0, and consequently B D^{1/2} 1 = 0.    (11)

Define the second smallest eigenvalue of L as λ̃_min(L):

λ̃_min(L) := min{λ : λ an eigenvalue of L, λ > 0}.    (12)

Then the spectral gap of the graph can be defined below:

ξ(G) := λ̃_min(L).    (13)
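The incidence matrix, the Laplacian in the convention of [44], and the spectral gap in (13) can all be built numerically. The sketch below (our own minimal code, using a 5-node path as the example graph) constructs the Laplacian from an edge list and computes its smallest nonzero eigenvalue; the full spectrum indeed lies in [0, 2].

```python
import numpy as np

def normalized_laplacian(M, edges):
    """Graph Laplacian in Chung's (normalized) convention, from an edge list."""
    A = np.zeros((len(edges), M))           # incidence matrix: one row per edge
    for e, (i, j) in enumerate(edges):
        A[e, i], A[e, j] = 1.0, -1.0
    deg = np.abs(A).sum(axis=0)             # node degrees
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    return d_inv_sqrt @ (A.T @ A) @ d_inv_sqrt

def smallest_nonzero_eig(L, tol=1e-9):
    ev = np.sort(np.linalg.eigvalsh(L))
    return ev[ev > tol][0]                  # connected graph: single zero eigenvalue

edges = [(i, i + 1) for i in range(4)]      # path graph on 5 nodes
L = normalized_laplacian(5, edges)
ev = np.linalg.eigvalsh(L)
assert ev.min() > -1e-9 and ev.max() < 2 + 1e-9   # spectrum lies in [0, 2]
print(smallest_nonzero_eig(L))              # 1 - cos(pi/4) for the 5-node path
```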

Algorithm Class. Define the neighbor set for node i as

N_i := { j ∈ V : (i, j) ∈ E }.    (14)

We say that a distributed, first-order algorithm is in class 𝒜 if it satisfies the following conditions.

  • [B1] At iteration t, each node can obtain some network-related constants, such as M, D, and the eigenvalues of the graph Laplacian L, etc.

  • [B2] At iteration t, each node first conducts a communication step by linearly combining all of its neighbors' outputs from iteration t − 1, and then updates its local variable by linearly combining all of its current and historical local gradients and variables, i.e.,

    (15)

    The linear combination coefficients can depend on the constants obtained at iteration t.

Clearly 𝒜 is a class of first-order methods, because only historical local gradient information is used in the computation. It is also a class of distributed algorithms, because at each iteration the nodes are only allowed to communicate with their neighbors. Since the linear combination coefficients can be arbitrarily chosen, at each iteration each node has the flexibility to choose the subset of its neighbors to communicate with, as well as how to combine their outputs.
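A minimal sketch of one algorithm satisfying the conditions above is a DSG/DGD-style iteration (the scalar quadratics f_i(x) = ½(x − b_i)², the step size, and the 5-node path below are our own illustrative choices, not the paper's): each iteration first linearly combines neighbors' previous outputs, then takes a purely local gradient step.

```python
import numpy as np

# A DSG/DGD-style method on f_i(x) = 0.5 (x - b_i)^2 over a 5-node path.
# The quadratics, step size, and graph are illustrative choices.
M = 5
b = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])          # local minimizers; mean is 0
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < M] for i in range(M)}

x = np.zeros(M)                                     # one scalar copy per node
alpha = 0.01                                        # constant step size
for _ in range(3000):
    # communication: average over the closed neighborhood {i} plus N_i
    mixed = np.array([(x[i] + sum(x[j] for j in neighbors[i]))
                      / (1 + len(neighbors[i])) for i in range(M)])
    # computation: purely local first-order step using node i's own gradient
    x = mixed - alpha * (mixed - b)

# With a constant step the copies settle near, but not exactly at, the
# consensual minimizer mean(b) = 0: a residual spread of order alpha remains.
print(x)
```

The residual disagreement among the copies, even at the fixed point, is one illustration of why defining and attaining the right notion of an ϵ-solution in the distributed setting is delicate.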

Additionally, in many practical distributed algorithms such as DSG, ADMM or EXTRA, the nodes are required to use a fixed strategy to linearly combine their neighbors' information across iterations. In an effort to model such a requirement, below we consider a slightly restricted algorithm class 𝒜′, where we require the nodes to use a linear combination with all coefficients being one (note that allowing the nodes to use a fixed but arbitrary linear combination is also possible, but the resulting analysis will be more involved). In particular, we say that a distributed, first-order algorithm is in class 𝒜′ if it satisfies [B1] as well as the following condition:

  • At iteration , each node can output a linear combination of the sum of the outputs of its neighboring set, as well as historical local gradients and variables, i.e.,

    (16)

2.2 Solution Quality Measure

Next we provide definitions for the quality of the solution. Note that since we consider using first-order methods to solve non-convex problems, it is expected that in the end some first-order stationary solution with a small gradient size will be computed.

Our first definition is related to a global variable y. We say that y* is a global ϵ-solution if the following holds:

‖∇f(y*)‖² ≤ ϵ.    (17)

This definition is conceptually simple, and it is identical to the centralized criterion in Section 1.3. However, it has the following issues. First, no global variable is formed anywhere in the network, so criterion (17) is difficult to evaluate. Second, there is no characterization of how close the local variables x_i are to each other. To see the second point, consider the following toy example.

Example 1: Consider a network with and and . Suppose that the local variables take the following values: and . Then if we pick , we have

This suggests that at iteration t, there exists a linear combination of the local variables that makes measure (17) precisely zero. However, one can hardly say that the current solution is a good solution for problem (2).

To address the above issue, we provide a second definition, which is directly related to the local variables x_i. At a given iteration t, we say that the local variables form a local ϵ-solution if the following holds:

(18)

Clearly this definition takes into consideration both the consensus error and the size of the local gradients. When applied to Example 1, this measure will be large. We will use the shorthand notation h_t to denote the left-hand side of (18). Note that the Lipschitz constant appearing in (18) is needed to balance the two different measures, which have different units. Also note that the min operation over past iterations is needed to track the best solution obtained before iteration t, because the quantity inside this operation may not be monotonically decreasing.
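Since the scraped text omits the exact expression in (18), the sketch below shows one natural instantiation of such a local measure (the specific weighting is our own assumption, not the paper's formula): the squared norm of the averaged local gradients plus an L̄²-weighted consensus penalty. On Example 1-style iterates, where the averaged gradient cancels but the copies disagree wildly, the measure is correctly large.

```python
import numpy as np

def local_measure(x_local, grads_local, L_bar):
    """One possible local stationarity measure (illustrative weighting, not the
    paper's exact (18)): ||mean_i grad_i||^2 + L_bar^2 * mean_i ||x_i - x_bar||^2."""
    x_bar = x_local.mean(axis=0)
    grad_term = float(np.linalg.norm(grads_local.mean(axis=0)) ** 2)
    consensus_term = L_bar ** 2 * float(
        np.mean(np.linalg.norm(x_local - x_bar, axis=1) ** 2))
    return grad_term + consensus_term

# Two agents whose averaged gradient cancels while their copies disagree wildly
# (in the spirit of Example 1; the numbers are our own):
x = np.array([[100.0], [-100.0]])
g = np.array([[1.0], [-1.0]])
print(local_measure(x, g, L_bar=1.0))   # 10000.0: dominated by the consensus term
```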

In our work we will focus on providing answers to the following specific version of question :


For any given ϵ > 0, what is the minimum number of iterations (as a function of ϵ) needed by any algorithm in class 𝒜 (or class 𝒜′) to solve instances in the classes (𝒫, 𝒩), so as to achieve a local ϵ-solution?

2.3 Some Useful Facts and Definitions

Below we provide a few facts about the above classes.

On Lipschitz constants. Assume that each f_i has Lipschitz continuous gradient with constant L_i as in (4). Then for all x, z ∈ ℝ^S we have:

‖∇f(x) − ∇f(z)‖ ≤ L̄ ‖x − z‖,    (19)

where L̄ := (1/M) Σ_{i=1}^{M} L_i is the average of the local Lipschitz constants. We also have, for each i,

‖∇f_i(x_i) − ∇f_i(z_i)‖² ≤ L_i² ‖x_i − z_i‖²,

which implies, for the stacked variables x := (x_1; …; x_M) and z := (z_1; …; z_M),

Σ_{i=1}^{M} ‖∇f_i(x_i) − ∇f_i(z_i)‖² ≤ ‖(Λ ⊗ I_S)(x − z)‖²,    (20)

where the matrix Λ is defined in (5).

On Quantities for Graph G. This subsection presents a number of properties of a given graph G. Define the following matrices:

(21)

Define the component-wise absolute value |A| of the incidence matrix. Then we have the following:

(22)

where D is the degree matrix defined in (9).

For two diagonal matrices and of appropriate sizes, the generalized Laplacian (GL) matrix is defined as:

(23)

and its elements are given by:

Define a diagonal matrix as below:

(24)

Then when specializing and , the GL matrix becomes:

(25)

Note that if any diagonal element of the matrix is zero, then the inverse denotes the Moore–Penrose matrix pseudoinverse. Similarly, when specializing the two diagonal matrices, the GL matrix becomes:

(26)

These matrices will be used later in our derivations.

Below we list some useful results about the Laplacian matrix [45, 44, 46]. First, all eigenvalues of L lie in the interval [0, 2]. Also, because L is positive semidefinite, we have

(27)

Also we have that [44, Lemma 1.9]:

λ̃_min(L) ≥ 1 / (D · vol(G)),  where vol(G) := Σ_{i ∈ V} d_i.    (28)

The eigenvalues of L for a number of special graphs are given below:

1) Complete Graph: The eigenvalues are 0 and M/(M − 1) (with multiplicity M − 1), so λ̃_min(L) = M/(M − 1);

2) Star Graph: The eigenvalues are 0, 1 (with multiplicity M − 2), and 2, so λ̃_min(L) = 1;

3) Path Graph: The eigenvalues are 1 − cos(πk/(M − 1)) for k = 0, …, M − 1, and λ̃_min(L) = 1 − cos(π/(M − 1)).

4) Cycle Graph: The eigenvalues are 1 − cos(2πk/M) for k = 0, …, M − 1, and λ̃_min(L) = 1 − cos(2π/M).

5) Grid Graph: The grid graph is obtained by placing the nodes on a √M × √M grid, and connecting nodes to their nearest neighbors. We have λ̃_min(L) = Θ(1/M).

6) Random Geometric Graph: Place the nodes uniformly at random in [0, 1]² and connect any two nodes separated by a distance less than a radius r. If the connectivity radius r satisfies [46]

(29)

then with high probability

(30)
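The closed-form eigenvalues listed above are easy to verify numerically. The sketch below (our own code; M = 8 is an arbitrary choice) checks the complete graph and the cycle by building the Laplacian in the normalized convention directly from the adjacency matrix.

```python
import numpy as np

def normalized_laplacian(adj):
    """L = I - D^{-1/2} Adj D^{-1/2}, for an adjacency matrix with no isolated nodes."""
    deg = adj.sum(axis=1)
    d = np.diag(1.0 / np.sqrt(deg))
    return np.eye(len(adj)) - d @ adj @ d

def lam_min_nonzero(L, tol=1e-9):
    ev = np.sort(np.linalg.eigvalsh(L))
    return ev[ev > tol][0]

M = 8
complete = np.ones((M, M)) - np.eye(M)
cycle = np.zeros((M, M))
for i in range(M):
    cycle[i, (i + 1) % M] = cycle[(i + 1) % M, i] = 1.0

print(lam_min_nonzero(normalized_laplacian(complete)))  # M/(M-1) = 8/7
print(lam_min_nonzero(normalized_laplacian(cycle)))     # 1 - cos(2*pi/8)
```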

3 Lower Complexity Bounds

In this section we develop lower complexity bounds for algorithms in class 𝒜 to solve problems in class 𝒫 over networks in 𝒩. We will mainly focus on the case where the gradients of the f_i's have uniform Lipschitz constants, that is, we assume that

L_i = U,  ∀ i ∈ [M],

and we denote the resulting problem class accordingly. At the end of this section, generalization to the non-uniform case will be briefly discussed.

Our proof combines ideas from the classical analysis of Nesterov [29] as well as two recent constructions: [40] (for centralized non-convex problems) and [37] (for strongly convex distributed problems). Our construction differs from the previous works in a number of ways; in particular, the constructed functions are only first-order differentiable, not twice differentiable. Further, we use the local ϵ-solution (18) to measure the quality of the solution, which makes the analysis more involved compared with the global error measures used in [29, 40, 37].

To begin with, we construct the following two non-convex functions

(31)

as well as the corresponding versions that evaluate on a “centralized” variable

(32)

Later we design the construction so that one pair of functions is easy to analyze, while the other pair belongs to the desired function class. Without loss of generality, in the construction we will assume that the gradients are Lipschitz with the uniform constant U.

3.1 Path Graph

First we consider the extreme case in which the M nodes form a path graph and each node has its own local function f_i, as shown in Figure 1.

Figure 1: The path graph used in our construction.
Figure 2: The functional value, and derivatives of .
Figure 3: The functional value, and derivatives of .

For notational simplicity, assume that M is a multiple of 3, that is, M = 3m for some integer m; also assume without loss of generality that m is an odd number.

Let us define the component functions ’s in (31) as follows.

(33)

where we have defined the following functions

(34a)
(34b)

The component functions are given as below

Suppose , then the average function becomes:

Figure 4: The functional value for .

Further, for a given error constant ϵ and a given averaged Lipschitz constant, let us define

(35)

Therefore we also have, if , then

(36)

First we present some properties of the component functions ’s.

Lemma 3.1

The functions and satisfy the following.

  1. For all , , .

  2. The following bounds hold for the functions and their first and second-order derivatives:

  3. The following key property holds:

    (37)
  4. The function is lower bounded as follows:

  5. The first-order derivative of (resp. ) is Lipschitz continuous with constant (resp. , ).

Proof. Property 1) is obviously true.

To prove Property 2), note that the following holds:

(38)

Obviously, is an increasing function over , therefore the lower and upper bounds are ; is increasing on and decreasing on , where , therefore the lower and upper bounds are ; is decreasing on and increasing on [this can be verified by checking the signs of in these intervals]. Therefore the lower and upper bounds are , i.e.,

Further, for all , the following holds:

(39)

Similarly, as above, we can obtain the following bounds:

We refer the readers to Fig. 2 and Fig. 3 for illustrations of these functions.

To show Property 3), note that for all and ,

where the first inequality is true because is strictly increasing and is strictly decreasing for all , and that .

Next we show Property 4). Note that and . Therefore we have and using the construction in (33)

(40)
(41)

where the first inequality follows from the former relation and the second from the latter; we reach the conclusion.

Finally we show Property 5), using the fact that a function is Lipschitz if it is piecewise smooth with bounded derivative. From the construction (33), the first-order partial derivative of each component function can be expressed as below.

Case I) If is even, we have

(42)

Case II) If is odd but not 1, we have

(43)

Case III) If , we have

(44)

Obviously, is a piecewise smooth function for any , and it either equals zero or is separated at the non-differentiable point because of the function .

Further, fix a point and a unit vector where . Define

to be the directional projection of the gradient onto the given direction at the given point. We will show that there exists a uniform constant such that the desired bound holds at all points.

First we can compute as follows: