# On the Sublinear Convergence of Randomly Perturbed Alternating Gradient Descent to Second Order Stationary Solutions

The alternating gradient descent (AGD) is a simple but popular algorithm which has been applied to problems in optimization, machine learning, data mining, and signal processing. The algorithm updates two blocks of variables in an alternating manner, in which a gradient step is taken on one block while the remaining block is kept fixed. When the objective function is nonconvex, it is well known that AGD converges to a first-order stationary solution with a global sublinear rate. In this paper, we show that a variant of AGD-type algorithms will not be trapped by "bad" stationary solutions such as saddle points and local maximum points. In particular, we consider a smooth unconstrained optimization problem, and propose a perturbed AGD (PA-GD) which converges (with high probability) to the set of second-order stationary solutions (SS2) with a global sublinear rate. To the best of our knowledge, this is the first alternating-type algorithm which takes O(polylog(d)/ϵ^{7/3}) iterations to achieve an SS2 with high probability [where polylog(d) is a polynomial of the logarithm of the dimension d of the problem].


## 1 Introduction

In this paper, we consider a smooth and unconstrained nonconvex optimization problem

 min_{x∈ℝ^d} f(x) (1)

where f: ℝ^d → ℝ is twice differentiable.

There are many ways of solving problem (1), such as gradient descent (GD), accelerated gradient descent (AGD), etc. When the problem dimension is large, it is natural to split the variables into multiple blocks and solve smaller subproblems individually. The block coordinate descent (BCD) algorithm, and many of its variants such as block coordinate gradient descent (BCGD) and alternating gradient descent (AGD) Bertsekas [1999]; Li and Liang [2017], are among the most powerful tools for solving large scale convex/nonconvex optimization problems Nesterov [2012]; Beck and Tetruashvili [2013]; Razaviyayn et al. [2013]; Hong et al. [2017]. The BCD-type algorithms partition the optimization variables into multiple small blocks, and optimize the blocks one by one following a certain block selection rule, such as the cyclic rule Tseng [2001] or the Gauss-Southwell rule Tseng and Yun [2009].

In recent years, there have been many applications of BCD-type algorithms in the areas of machine learning and data mining, such as matrix factorization Zhao et al. [2015]; Lu et al. [2017a, b], tensor decomposition, matrix completion/decomposition Xu and Yin [2013]; Jain et al. [2013], and training deep neural networks (DNNs) Zhang and Brand [2017]. Under relatively mild conditions, the convergence of BCD-type algorithms to first-order stationary solutions (SS1) has been broadly investigated for nonconvex and non-differentiable optimization Tseng [2001]; Grippo and Sciandrone [2000]. In particular, it is known that under mild conditions, these algorithms also achieve global sublinear rates Razaviyayn et al. [2014]. However, despite their popularity and significant recent progress in understanding their behavior, it remains unclear whether BCD-type algorithms can converge to the set of second-order stationary solutions (SS2) with a provable global rate, even for the simplest problem with two blocks of variables.

### 1.1 Motivation

Algorithms that can escape from strict saddle points – stationary points at which the Hessian has a negative eigenvalue – have wide applications. Many recent works have analyzed the saddle points in machine learning problems Kawaguchi [2016]. For example, when learning in shallow networks, the stationary points are either global minimum points or strict saddle points. In two-layer porcupine neural networks (PNNs), it has been shown that most local optima of PNN optimizations are also global optimizers Feizi et al. [2017]. Previous work in Ge et al. [2015] has shown that the saddle points in tensor decomposition are indeed strict saddle points. Also, it has been shown theoretically and numerically that all saddle points are strict in dictionary learning and phase retrieval problems Sun et al. [2015, 2017]; Wang et al. [2017b, a]. More recently, Ge et al. [2017] proposed a unified analysis of saddle points for a broad class of low rank matrix factorization problems, and proved that these saddle points are strict.

### 1.2 Related Work

Many recent works have focused on the performance analysis and/or design of algorithms with convergence guarantees to local minimum points/SS2 for nonconvex optimization problems. These include the trust region method Conn et al. [2000], the cubic regularized Newton's method Nesterov and Polyak [2006]; Carmon and Duchi [2016], and mixed first-order and second-order approaches Reddi et al. [2017], etc. However, these algorithms typically require second-order information, and therefore incur high computational complexity when the problem dimension becomes large.

There has been a line of work on stochastic gradient descent algorithms, where properly scaled Gaussian noise is added to the gradient iterates at each step [also known as stochastic gradient Langevin dynamics (SGLD)]. Some theoretical works have pointed out that SGLD not only converges to the local minimum points asymptotically but also may escape from local minima Zhang et al. [2017]; Raginsky et al. [2017]. Unfortunately, these algorithms require a large number of iterations to achieve the optimal point. There are also fruitful results showing that some carefully designed algorithms can escape from strict saddle points efficiently, such as negative-curvature-originated-from-noise (Neon) Xu and Yang [2017], Neon2 Allen-Zhu and Li [2017], Neon Xu et al. [2017], and gradient descent with one-step escaping (GOSE) Yu et al. [2017]. The Neon-type algorithms utilize stochastic first-order updates to find the negative curvature direction, while GOSE needs only one negative curvature descent step, with the calculation of eigenvectors when the iterates are near a saddle point, in order to save computation.

On the other hand, there is also a line of work analyzing deterministic GD-type methods. With random initialization, it has been shown that GD converges only to SS2 for unconstrained smooth problems Lee et al. [2016]. More recently, block coordinate descent, block mirror descent and proximal block coordinate descent have been proven to almost always converge to SS2 with random initializations Lee et al. [2017], but no convergence rate was reported. Unfortunately, a follow-up study indicated that GD requires exponential time to escape from saddle points for certain pathological problems Du et al. [2017]. Occasionally adding noise to the iterates of the algorithm is another way of finding the negative curvature. A perturbed version of GD (PGD) has been proposed with convergence guarantees to SS2 Jin et al. [2017a], which shows a faster provable convergence rate than the ordinary gradient descent algorithm with random initializations. Furthermore, an accelerated version of PGD (PAGD) is proposed in Jin et al. [2017b], which shows the fastest convergence rate among all Hessian-free algorithms.

### 1.3 Scope of This Paper

In this work, we consider a smooth unconstrained optimization problem, and develop a perturbed AGD algorithm (PA-GD) which converges (with high probability) to the set of SS2 with a global sublinear rate. Our work is inspired by Jin et al. [2017a]; Ge et al. [2015], which developed novel perturbed GDs that escape from strict saddle points. Similarly to Jin et al. [2017a], we divide the iterates of GD into three types of points: those whose gradients are large, those that are local minimum points, and those that are strict saddle points. At a given point, when the size of the gradient is large enough, we simply implement the ordinary AGD. When the gradient norm is small, the point may be either a strict saddle or a local minimum, so a perturbation is added to the iterates to help them escape from saddle points.

From the above section, we know that many works have been developed to make use of negative curvature information around saddle points. Unfortunately, these techniques cannot be directly applied to BCD/AGD-type algorithms. The key challenge is that at each iteration only part of the variables are updated, so we have access only to partial second order information at the points of interest. For example, consider the quadratic objective function shown in Figure 1. While fixing one block, the problem is strongly convex with respect to the other block, yet the entire problem is nonconvex. Even if the iterates converge to the per-block minimum points, the resulting stationary point could still be a saddle point of the overall objective function. Therefore, analyzing how AGD-type algorithms exploit the negative curvature is one of the main tasks of this paper.

To the best of our knowledge, there is no work on modifying AGD algorithms to escape from strict saddle points with any convergence rate. The main contributions of this work are as follows.

### 1.4 Contributions of This Work

In this paper, we design and analyze a perturbed AGD algorithm, namely PA-GD, for solving an unconstrained nonconvex problem. Through the perturbation of AGD, the algorithm is guaranteed to converge to the set of SS2 of a nonconvex problem with high probability. By utilizing matrix perturbation theory, the convergence rate of the proposed algorithm is also established, which shows that the algorithm takes O(polylog(d)/ϵ^{7/3}) iterations to achieve an SS2 with high probability. Also, considering the strong relation between GD and the proximal point algorithm, we also study a perturbed alternating proximal point (PA-PP) algorithm with random perturbation. By leveraging the techniques proposed in this paper, we show that PA-PP, which may not need to calculate the gradient at each step, converges as fast as PA-GD, in the order of O(polylog(d)/ϵ^{7/3}). The comparison of the algorithms which only use first order information for escaping from strict saddle points is summarized in Table 1.

The main contributions of the paper are highlighted below:

1. To the best of our knowledge, it is the first time that the convergence analysis shows that some variants of AGD (using first-order information) can converge to SS2 for nonconvex optimization problems.

2. The convergence rate of the perturbed AGD algorithm is analyzed, where the choice of the step size depends only on the maximum Lipschitz constant over the blocks rather than on the Lipschitz constant over all variables. This is one of the major differences between GD and AGD.

3. By further extending the analysis in this paper, we also show that PA-PP can escape from strict saddle points efficiently, at a rate of O(polylog(d)/ϵ^{7/3}).

## 2 Preliminaries

### 2.1 Notation

Notation. Bold upper case letters without subscripts (e.g., A) denote matrices and bold lower case letters without subscripts (e.g., x) represent vectors. Notation x_k denotes the kth block of vector x, and x_{−k} collects the remaining block(s). We use ∇_k f(x) to denote the partial gradient of f with respect to its kth block variable while the remaining one is fixed. Notation B_x(r) denotes a d-dimensional ball centered at x with radius r, and λ_min(A), λ_max(A) denote the smallest and largest eigenvalues of matrix A respectively.

### 2.2 Definitions

The objective function has the following properties.

###### Definition 1.

A differentiable function f is L-smooth with gradient Lipschitz constant L (uniformly Lipschitz continuous), if

 ∥∇f(x) − ∇f(y)∥ ≤ L∥x − y∥, ∀x, y.

The function is called block-wise smooth with gradient Lipschitz constants {L_k}, if

 ∥∇_k f(x_{−k}, x_k) − ∇_k f(x_{−k}, x'_k)∥ ≤ L_k ∥x_k − x'_k∥, ∀x, x'

or with gradient Lipschitz constants {˜L_k}, if

 ∥∇_k f(x_{−k}, x_k) − ∇_k f(x'_{−k}, x_k)∥ ≤ ˜L_k ∥x_{−k} − x'_{−k}∥, ∀x, x'.

Further, let L_max ≜ max_k L_k.

###### Definition 2.

For a differentiable function f, if ∇f(x) = 0, then x is a first-order stationary point. If ∥∇f(x)∥ ≤ ϵ, then x is an ϵ-first-order stationary point.

###### Definition 3.

For a differentiable function f, if x is a SS1 and there exists ε > 0 so that for any y in the ε-neighborhood of x we have f(x) ≤ f(y), then x is a local minimum. A saddle point is a SS1 that is not a local minimum. If λ_min(∇²f(x)) < 0, then x is a strict (non-degenerate) saddle point.

###### Definition 4.

A twice-differentiable function f is ρ-Hessian Lipschitz if

 ∥∇²f(x) − ∇²f(y)∥ ≤ ρ∥x − y∥, ∀x, y. (2)
###### Definition 5.

For a ρ-Hessian Lipschitz function f, x is a second-order stationary point if ∇f(x) = 0 and λ_min(∇²f(x)) ≥ 0. If the following holds

 ∥∇f(x)∥ ≤ ϵ, and λ_min(∇²f(x)) ≥ −γ (3)

where γ > 0, then x is an (ϵ, γ)-SS2.

###### Assumption 1.

Function f is L-smooth, block-wise smooth with gradient Lipschitz constants {L_k} and {˜L_k}, and ρ-Hessian Lipschitz.

## 3 Perturbed Alternating Gradient Descent

### 3.1 Algorithm Description

AGD is a classical algorithm that optimizes the variables of an optimization problem in an alternating manner Bertsekas [1999], meaning that when one block of variables is updated, the remaining block is fixed to be the same as its previous solution. Mathematically, the iterates of AGD are updated by the following rule

 x_k^{(t+1)} = x_k^{(t)} − η∇_k f(h_{−k}^{(t)}, x_k^{(t)}), k = 1, 2 (4)

where the superscript (t) denotes the iteration counter; h_{−1}^{(t)} ≜ x_2^{(t)} and h_{−2}^{(t)} ≜ x_1^{(t+1)}; and η > 0 is the step size. AGD can be considered as a special case of block coordinate gradient descent Nesterov [2012]; Beck and Tetruashvili [2013].
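To make update rule (4) concrete, one round of two-block AGD might be sketched as follows in Python (the function and argument names are ours, not the paper's; block 2 sees the freshly updated block 1, matching the h^(t) convention above):

```python
import numpy as np

def agd_round(x1, x2, grad1, grad2, eta):
    """One round of two-block alternating gradient descent, cf. eq. (4).

    grad1(x1, x2) and grad2(x1, x2) return the partial gradients of f
    with respect to blocks 1 and 2.  Block 1 is updated first; the
    block-2 step then uses the *updated* block 1.
    """
    x1 = x1 - eta * grad1(x1, x2)   # block-1 gradient step, block 2 fixed
    x2 = x2 - eta * grad2(x1, x2)   # block-2 step with the new block 1
    return x1, x2
```

For a convex quadratic such as f(x1, x2) = x1² + x1·x2 + x2², repeated rounds drive both blocks to the minimizer at the origin.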

Our proposed algorithm is based on AGD, but modified in a way similar to the recent work [Jin et al., 2017a], which adds noise in PGD. The details of the implementation of PA-GD are shown in Algorithm 1, where c is a constant chosen by the analysis, Δ_f denotes the difference between the objective value at the initial point and the global optimal solution, and ϵ represents the predefined target error.

In each update of the variables, we implement one step of block gradient descent, and then proceed to the next block. As long as these steps sufficiently decrease the objective value, the algorithm is making progress toward a good solution. Otherwise, a perturbation may be needed to help the iterates escape from saddle points. If the objective value does not decrease sufficiently within a certain number of iterations after a perturbation, the algorithm terminates and returns the iterate before the last perturbation.
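The loop just described can be sketched as follows (a simplified rendition of Algorithm 1; the parameter names g_th, radius, and t_escape are hypothetical, and the real algorithm additionally monitors the objective decrease after each perturbation to decide termination):

```python
import numpy as np

def pa_gd(grads, x1, x2, eta=0.05, g_th=1e-3, radius=0.1,
          t_escape=50, max_iter=2000, seed=0):
    """Simplified PA-GD sketch: alternating gradient steps, plus a random
    perturbation whenever both block gradients are small (possible saddle).
    All parameter names and default values are illustrative."""
    rng = np.random.default_rng(seed)
    last_perturb = -t_escape
    for t in range(max_iter):
        g1, g2 = grads[0](x1, x2), grads[1](x1, x2)
        small = max(np.linalg.norm(g1), np.linalg.norm(g2)) <= g_th
        if small and t - last_perturb >= t_escape:
            # near a stationary point: perturb uniformly in a small ball
            xi = rng.uniform(-radius, radius, size=x1.size + x2.size)
            x1 = x1 + xi[:x1.size]
            x2 = x2 + xi[x1.size:]
            last_perturb = t
            continue
        x1 = x1 - eta * grads[0](x1, x2)   # block-1 step
        x2 = x2 - eta * grads[1](x1, x2)   # block-2 step, updated block 1
    return x1, x2
```

Starting exactly at a strict saddle point, plain AGD is stuck (both block gradients vanish), while the perturbation lets the iterates drift into the negative-curvature direction and escape.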

To illustrate the practical behavior of the algorithm, we provide an example that shows the trajectory of AGD after a small perturbation at a stationary point. In Figure 1, it is clear that the origin is a SS1 and also a strict saddle point, since the Hessian there has one positive and one negative eigenvalue. When one block is fixed, the function is convex with respect to the other block and vice versa; however, the overall objective function is nonconvex. It can be observed that PA-GD escapes from the strict saddle point efficiently.

### 3.2 Convergence Rate Analysis

Despite the fact that PA-GD exploits a different way of updating variables, we will show that it can still escape from strict saddle points with high probability with suitable perturbation. The main theorem is presented as follows.

###### Theorem 1.

Under Assumption 1, there exists an absolute constant c_max such that: for any δ > 0, sufficiently small ϵ > 0, and constant c ≥ c_max, with probability at least 1 − δ, the iterates generated by PA-GD converge to an (ϵ, (L_max ρϵ)^{1/3})-SS2 satisfying

 ∥∇f(x)∥ ≤ ϵ, and λ_min(∇²f(x)) ≥ −(L_max ρϵ)^{1/3}

in the following number of iterations:

 O( (L_max^{5/3} P_1^7 P_2^2 Δ_f)/(ρ^{1/3} ϵ^{7/3}) · log^7( (P_1^6 P_2^2 d L_max^{5/3} Δ_f)/(c^5 ρ^{1/3} ϵ^{7/3} δ) ) ) (5)

where Δ_f ≜ f(x^{(0)}) − f^*, f^* denotes the global minimum value of the objective function, and P_1 and P_2 are constants defined in the analysis.

Remark 1. When a smaller step size is used, the convergence rate of PA-GD becomes

 O( (L_max^{5/3} log²(2d) Δ_f)/(ρ^{1/3} ϵ^{7/3}) · log^7( (P_1^6 P_2^2 d L_max^{5/3} Δ_f)/(c^5 ρ^{1/3} ϵ^{7/3} δ) ) ). (6)

This shows that if a smaller step size is used, the convergence rate of PA-GD is faster (with smaller constants), since the polynomial dependency on P_1 and P_2 outside the logarithm in (5) disappears. This property is consistent with the known result when BCD is used for convex optimization problems, i.e., when a smaller step size is used, the rate can become better; see, e.g., [Sun and Hong, 2015, Theorem 2.1].

## 4 Perturbed Alternating Proximal Point

In many applications, AGD may not be efficient, in the sense that the convergence of the gradient in each block may be very slow. For example, consider a matrix factorization problem in which a given data matrix is approximated by the product of two factor matrices, which serve as the two block variables. For this problem, the alternating least squares algorithm (which exactly minimizes over each block) would be faster than AGD, which only takes gradient steps.

In this section, we consider the classical proximal point algorithm Parikh et al. [2014] in which each block of variables is exactly minimized with respect to certain quadratic surrogate. To be specific, we can replace (4) in Algorithm 1 by

 x_k^{(t+1)} = argmin_{x_k} f(h_{−k}^{(t)}, x_k) + (ν/2)∥x_k − x_k^{(t)}∥², k = 1, 2 (7)

where ν > 0 is the penalty parameter. The iteration can be explicitly written as

 x_k^{(t+1)} = x_k^{(t)} − (1/ν)∇_k f(h_{−k}^{(t)}, x_k^{(t+1)}), k = 1, 2, (8)

which has a similar form to the PA-GD update, but with step size 1/ν and with the gradient evaluated at the new iterate. The resulting algorithm, detailed in the table above, is referred to as perturbed alternating proximal point (PA-PP). It is worth noting that when the subproblem is convex, ν only needs to be a small positive number to make the corresponding subproblem strongly convex. This property is useful in practice.
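A sketch of the per-block proximal step (7): here the strongly convex surrogate is minimized approximately by an inner gradient loop (the paper assumes the exact minimizer; the inner-loop settings and step-size heuristic below are our own assumptions):

```python
import numpy as np

def prox_block_step(grad_k, x_k, h_minus_k, nu, inner_iters=50, lr=None):
    """Approximately solve the proximal subproblem (7) for one block.

    Minimizes z -> f(h_{-k}, z) + (nu/2)||z - x_k||^2 by inner gradient
    descent.  grad_k(h, z) returns the block gradient of f at (h, z).
    """
    if lr is None:
        lr = 1.0 / (nu + 1.0)  # heuristic inner step size (assumption)
    z = x_k.copy()
    for _ in range(inner_iters):
        g = grad_k(h_minus_k, z) + nu * (z - x_k)  # surrogate gradient
        z = z - lr * g
    return z
```

For f(h, z) = h² + z² with ν = 1, the exact proximal point is ν·x_k/(2 + ν), which the inner loop recovers to high accuracy.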

Next, we can also give the convergence rate of PA-PP.

###### Corollary 1.

Under Assumption 1, there exists an absolute constant c_max such that: for any δ > 0, sufficiently small ϵ > 0, and constant c ≥ c_max, with probability at least 1 − δ, the iterates generated by PA-PP converge to an (ϵ, (L_max ρϵ)^{1/3})-SS2 satisfying

 ∥∇f(x)∥ ≤ ϵ, and λ_min(∇²f(x)) ≥ −(L_max ρϵ)^{1/3}

in the following number of iterations:

 O( (L_max^{5/3} P_2 Δ_f)/(ρ^{1/3} ϵ^{7/3}) · log^7( (P_2 d L_max^{5/3} Δ_f)/(c^5 ρ^{1/3} ϵ^{7/3} δ) ) )

where Δ_f ≜ f(x^{(0)}) − f^*, f^* denotes the global minimum value of the objective function, and P_2 is a constant defined in the analysis.

Comparing with Theorem 1, we can see that the P_1 term is removed, so the convergence rate of PA-PP is slightly faster than that of PA-GD.

## 5 Convergence Analysis

In this section, we will present the main proof steps of convergence analysis of PA-GD.

### 5.1 The Main Difficulty of the Proof

GD searches for a descent direction of the objective function in the entire space ℝ^d. Without loss of generality, we assume that the stationary point of interest is x^* = 0. According to the mean value theorem, the GD update can be expressed as

 x^{(t+1)} = x^{(t)} − η∇f(x^{(t)}) = x^{(t)} − η∇f(0) − η(∫₀¹ ∇²f(θx^{(t)}) dθ) x^{(t)}. (9)

It can be observed that the update rule of GD contains the information of the Hessian matrix, i.e., ∫₀¹ ∇²f(θx^{(t)}) dθ. To be more specific, letting H ≜ ∇²f(0), where 0 denotes an ϵ-SS2 satisfying (3), we can rewrite (9) as

 x^{(t+1)} = (I − ηH)x^{(t)} − ηΔ^{(t)}x^{(t)} − η∇f(0) (10)

where Δ^{(t)} ≜ ∫₀¹ (∇²f(θx^{(t)}) − H) dθ.

Based on the ρ-Hessian Lipschitz property, we can show that ∥Δ^{(t)}∥ is upper bounded in terms of the difference of the iterates. By exploiting the negative curvature of the Hessian at the saddle point, we can project the iterate onto the direction along which the eigenvalue of I − ηH is greater than 1. This leads to the fact that the norm of the iterates projected along this direction increases exponentially as the algorithm proceeds around the saddle point, implying that the sequence generated by GD escapes from it. The details of characterizing the convergence rate have been analyzed previously in Jin et al. [2017a].
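The exponential growth along the negative-curvature direction can be seen numerically on a toy quadratic (the matrix and step size below are illustrative choices of ours, not from the paper):

```python
import numpy as np

# A toy strict saddle: H has a negative eigenvalue, so I - eta*H has an
# eigenvalue larger than 1, and iterates near the saddle grow along the
# corresponding eigenvector.
H = np.array([[1.0, 2.0],
              [2.0, 1.0]])              # eigenvalues 3 and -1
eta = 0.1
A = np.eye(2) - eta * H                 # eigenvalues 0.7 and 1.1
v = np.array([1.0, -1.0]) / np.sqrt(2)  # eigenvector of H for eigenvalue -1
x = 1e-3 * v                            # start very close to the saddle at 0
for _ in range(100):
    x = A @ x                           # pure GD recursion x_{t+1} = (I - eta H) x_t
# ||x|| ~ 1e-3 * 1.1**100: exponential escape from the saddle
```

The same mechanism, with (I − ηH) replaced by the block matrices derived below, is what the PA-GD analysis has to recover.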

However, the AGD algorithm only updates part of the variables of vector x at a time, which belong to a subspace of the feasible set. Similarly, from the mean value theorem we can express the AGD update rule, assuming x^* = 0, as follows:

 x^{(t+1)} = x^{(t)} − η∇f(0) − η(∫₀¹ H_l^{(t)} dθ) x^{(t+1)} − η(∫₀¹ H_u^{(t)} dθ) x^{(t)} (11)

where

 H_l^{(t)} ≜ [ 0, 0 ; ∇²_{21}f(θx_1^{(t+1)}, θx_2^{(t)}), 0 ] and H_u^{(t)} ≜ [ ∇²_{11}f(θx_1^{(t)}, θx_2^{(t)}), ∇²_{12}f(θx_1^{(t)}, θx_2^{(t)}) ; 0, ∇²_{22}f(θx_1^{(t+1)}, θx_2^{(t)}) ].

From the above expression, it can be seen clearly that the update rule of AGD does not include a full Hessian matrix at any single point, but only partial ones. Furthermore, the right hand side of (11) contains not only the second order information of the previous point x^{(t)}, but also that of the most recently updated point x^{(t+1)}. These represent the main challenges in understanding the behavior of the sequence generated by the AGD algorithm.

### 5.2 The Main Idea of the Proof

Although the second order information is divided into two parts, we can still characterize the recursion of the iterates around strict saddle points. We split H into two parts:

 H_u = [ ∇²_{11}f(x^*), ∇²_{12}f(x^*) ; 0, ∇²_{22}f(x^*) ], H_l = [ 0, 0 ; ∇²_{21}f(x^*), 0 ], (12)

and obviously we have H = H_u + H_l.

Then, recursion (11) can be written as

 x^{(t+1)} + ηH_l x^{(t+1)} = x^{(t)} − ηH_u x^{(t)} − ηΔ_u^{(t)} x^{(t)} − ηΔ_l^{(t)} x^{(t+1)} (13)

where Δ_u^{(t)} ≜ ∫₀¹ H_u^{(t)} dθ − H_u and Δ_l^{(t)} ≜ ∫₀¹ H_l^{(t)} dθ − H_l. However, it is still unclear from (13) how the iteration evolves around the strict saddle point.

To highlight ideas, let us define

 M ≜ I + ηH_l, T ≜ I − ηH_u. (14)

It can be observed that M is a lower triangular matrix whose diagonal entries are all 1s; therefore it is invertible. After multiplying both sides of (13) by the inverse of M, we obtain

 x^{(t+1)} = M^{−1}T x^{(t)} − ηM^{−1}Δ_u^{(t)} x^{(t)} − ηM^{−1}Δ_l^{(t)} x^{(t+1)}.

Our goal of analyzing the recursion of x^{(t)} becomes finding the maximum eigenvalue of M^{−1}T. With the help of matrix perturbation theory, we can quantify the difference between the eigenvalues of the matrix I − ηH, which contains the negative curvature, and the matrix M^{−1}T, which we are interested in analyzing. To be more precise, we give the following lemma.

###### Lemma 1.

Under Assumption 1, let H denote the Hessian matrix at an ϵ-SS2 with λ_min(H) ≤ −γ, where γ > 0. We have

 λ_max(M^{−1}T) > 1 + ηγ/(1 + L/L_max) (15)

where H_u, H_l and M, T are defined in (12) and (14).

Lemma 1 illustrates that there exists a subspace spanned by the eigenvector of M^{−1}T whose eigenvalue is greater than 1, indicating that the sequence generated by AGD can still potentially escape from the strict saddle point by leveraging such negative curvature information. Next, we give a sketch of the proof of Theorem 1.
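Lemma 1 can be checked numerically on a small example (the 2×2 Hessian below is an illustrative choice of ours: convex in each scalar block, yet with λ_min = −1 overall):

```python
import numpy as np

# Hypothetical Hessian of a strict saddle that is convex in each block:
# positive diagonal, but a negative overall eigenvalue.
H = np.array([[1.0, 2.0],
              [2.0, 1.0]])          # lambda_min(H) = -1
eta = 0.1
Hu = np.triu(H)                     # upper part incl. diagonal, cf. eq. (12)
Hl = H - Hu                         # strictly lower part
M = np.eye(2) + eta * Hl            # cf. eq. (14)
T = np.eye(2) - eta * Hu
lam = np.max(np.linalg.eigvals(np.linalg.inv(M) @ T).real)
# lam > 1: the AGD recursion also has an expanding direction at the saddle
```

Here lam ≈ 1.11, comfortably above the lower bound 1 + ηγ/(1 + L/L_max) of (15) for this instance.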

### 5.3 The Sketch of the Proof

The structure of the proof for quantifying the sufficient decrease of the objective function after the perturbation is borrowed from the proof of PGD Jin et al. [2017a]; however, PA-GD updates the variables block by block, so we provide new proofs to show that PA-GD can still escape from saddle points with the perturbation technique.

First, if the size of the gradient is large enough, Algorithm 1 just implements the ordinary AGD. We give the descent lemma of AGD as follows.

###### Lemma 2.

Under Assumption 1, for the AGD algorithm with step size η ≤ 1/L_max, we have

 f(x^{(t+1)}) ≤ f(x^{(t)}) − ∑_{k=1}^{2} (η/2)∥∇_k f(h_{−k}^{(t)}, x_k^{(t)})∥².
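The descent property can be sanity-checked numerically on a convex quadratic (a toy instance of our choosing; the η/2 constant is the standard block-descent constant for η ≤ 1/L_max):

```python
# Sanity check of the block descent property on f(x1, x2) = x1^2 + x1*x2 + x2^2,
# whose block Lipschitz constants are L_1 = L_2 = 2 (so eta = 0.25 <= 1/L_max).
f = lambda x1, x2: x1**2 + x1 * x2 + x2**2
g1 = lambda x1, x2: 2 * x1 + x2          # gradient w.r.t. block 1
g2 = lambda x1, x2: 2 * x2 + x1          # gradient w.r.t. block 2
eta = 0.25
x1, x2 = 1.0, -2.0
d1 = g1(x1, x2)                          # block-1 gradient before its step
x1_new = x1 - eta * d1
d2 = g2(x1_new, x2)                      # block-2 gradient sees updated block 1
x2_new = x2 - eta * d2
decrease = f(x1, x2) - f(x1_new, x2_new)
bound = (eta / 2) * (d1**2 + d2**2)      # right-hand side of the lemma
# descent lemma: decrease >= bound
```

For this starting point the objective drops by 1.6875 against a guaranteed bound of 1.125.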

Second, if the iterates are near a strict saddle point, we can show that the AGD algorithm after a perturbation can give a sufficient decrease with high probability in terms of the objective value. To be more precise, the statement is given as follows.

###### Lemma 3.

Under Assumption 1, there exists an absolute constant c_max. Let c ≥ c_max, and let the step size, the perturbation radius r, the gradient threshold g_th, and the escape time T be calculated as Algorithm 1 describes. Let ˜x^{(t)} be a strict saddle point which satisfies

 ∥∇f(˜x^{(t)})∥² ≤ 4∑_{k=1}^{2} ∥∇_k f(˜h^{(t)}_{−k}, ˜x^{(t)}_k)∥² ≤ 4g_th² (16)

and λ_min(∇²f(˜x^{(t)})) ≤ −γ, where γ ≜ (L_max ρϵ)^{1/3}.

Let x^{(0)} = ˜x^{(t)} + ξ, where ξ is generated randomly following the uniform distribution over a ball of radius r, and let x^{(T)} be the iterate of PA-GD after T steps. Then, with at least a constant probability, the objective value decreases sufficiently, i.e., f(x^{(T)}) − f(˜x^{(t)}) is below the negative threshold specified in the analysis.

We remark that Lemma 2 is well-known and Lemma 3 is the core technique. In the following, we outline the main idea used in proving the latter. The formal statements of these steps are shown in the appendix; see Lemma 8–Lemma 10 therein.

We emphasize that the main contributions of this paper lie in the analysis of the first two steps, where the special update rule of PA-GD is analyzed so that the negative curvature around the saddle points can be utilized.

#### Step 1

(Lemma 8) Consider a generic sequence generated by PA-GD. As long as its initial point is close to the saddle point ˜x, the distance between the iterates and ˜x can be upper bounded by using the ρ-Hessian Lipschitz continuity property.

#### Step 2

(Lemma 9) Leveraging the negative curvature around the strict saddle point, we know that there exists a direction spanned by the eigenvector of M^{−1}T whose corresponding eigenvalue is the largest (greater than 1). Consider two sequences generated by PA-GD, initialized around the saddle point. When the initial points of these two sequences are separated from each other along this direction by a small distance proportional to the radius r of the perturbation ball defined in Algorithm 1, we can show that if one sequence is still near the saddle point after T steps, then the other sequence gives a sufficient decrease of the objective value within T steps, implying that the iterates can escape from the saddle point within T steps.

#### Step 3

(Lemma 10) Consider x^{(0)} as the point obtained after the perturbation from the saddle point. We can quantify the probability that the AGD sequence gives a sufficient decrease of the objective value within T iterations after the perturbation [Jin et al., 2017a, Lemma 14, 15].

### 5.4 Extension to PA-PP

By leveraging the convergence analysis of PA-GD and the relation between PA-GD and PA-PP shown in (8), we can also write the recursion of the PA-PP iteration as

 x^{(t+1)} + ηH'_l x^{(t+1)} = x^{(t)} − ηH'_u x^{(t)} − ηΔ'^{(t)}_u x^{(t)} − ηΔ'^{(t)}_l x^{(t+1)} (17)

where η ≜ 1/ν, Δ'^{(t)}_u ≜ ∫₀¹ H'^{(t)}_u dθ − H'_u, Δ'^{(t)}_l ≜ ∫₀¹ H'^{(t)}_l dθ − H'_l,

 H'_u = [ 0, ∇²_{12}f(˜x^{(t)}) ; 0, 0 ], H'_l = [ ∇²_{11}f(˜x^{(t)}), 0 ; ∇²_{21}f(˜x^{(t)}), ∇²_{22}f(˜x^{(t)}) ], (18)

and

 H'^{(t)}_l ≜ [ ∇²_{11}f(θx_1^{(t+1)}, θx_2^{(t)}), 0 ; ∇²_{21}f(θx_1^{(t+1)}, θx_2^{(t+1)}), ∇²_{22}f(θx_1^{(t+1)}, θx_2^{(t+1)}) ], H'^{(t)}_u ≜ [ 0, ∇²_{12}f(θx_1^{(t+1)}, θx_2^{(t)}) ; 0, 0 ].

Let

 M' ≜ I + ηH'_l, T' ≜ I − ηH'_u. (19)

We know that T' is an upper triangular matrix whose diagonal entries are all 1s, so it is invertible. Different from the case of PA-GD, we take the inverse of matrix T' on both sides of (17) and obtain

 T'^{−1}M' x^{(t+1)} = x^{(t)} − ηT'^{−1}Δ'^{(t)}_u x^{(t)} − ηT'^{−1}Δ'^{(t)}_l x^{(t+1)}.

Then, we can give the following result that characterizes the recursion of x^{(t)} generated by PA-PP.

###### Corollary 2.

Under Assumption 1, let H denote the Hessian matrix at an ϵ-SS2 with λ_min(H) ≤ −γ, where γ > 0. Let λ⁺_min(·) denote the minimum positive eigenvalue of a matrix. Then we have

 λ⁺_min(T'^{−1}M') ≤ 1 − ηγ/2 (20)

where H'_u, H'_l and M', T' are defined in (18) and (19), and η = 1/ν.

We remark that Corollary 2 is useful since it can be leveraged to show that the norm of the iterates around saddle points can increase exponentially. Then, we can apply similar analysis steps as in the proof of the convergence rate of PA-GD and obtain the results shown in Corollary 1.
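Corollary 2 can likewise be checked numerically on a toy 2×2 saddle Hessian (the same illustrative matrix as in the PA-GD discussion; our choice, not from the paper):

```python
import numpy as np

# Toy saddle Hessian with lambda_min = -1
H = np.array([[1.0, 2.0],
              [2.0, 1.0]])
eta = 0.1                            # plays the role of 1/nu
Hu_p = np.array([[0.0, H[0, 1]],     # strictly upper part, cf. eq. (18)
                 [0.0, 0.0]])
Hl_p = H - Hu_p                      # lower part including the diagonal
Mp = np.eye(2) + eta * Hl_p          # cf. eq. (19)
Tp = np.eye(2) - eta * Hu_p
eigs = np.linalg.eigvals(np.linalg.inv(Tp) @ Mp).real
lam_min_pos = np.min(eigs[eigs > 0])
# lam_min_pos < 1 - eta*gamma/2: since x^{(t+1)} is obtained by inverting
# T'^{-1}M', a small positive eigenvalue corresponds to an expanding
# direction of the PA-PP recursion, consistent with (20)
```

For this instance (γ = 1), lam_min_pos ≈ 0.91 < 1 − ηγ/2 = 0.95.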

## 6 Connection with Existing Works

Remark 2. In Theorem 1 we characterized the convergence rate to an (ϵ, (L_max ρϵ)^{1/3})-SS2. We can also translate this bound into one for achieving an (ϵ, √(ρϵ))-SS2, at the cost of a correspondingly larger iteration count for PA-GD. Compared with the recent work Jin et al. [2017a], the convergence rate of PA-GD/PA-PP is slower than that of GD. The main reason is that, different from GD-type algorithms, PA-GD and PA-PP cannot fully utilize the Hessian information because they never take a full (joint) gradient step. A similar situation occurs for SGD-type algorithms, which also cannot obtain the exact negative curvature around strict saddle points.

From Table 1, it can be seen that the convergence rate of PA-GD/PA-PP is still faster than those of SGD Ge et al. [2015], SGLD Zhang et al. [2017], Neon+SGD Xu and Yang [2017], and Neon2+SGD Allen-Zhu and Li [2017] for achieving an SS2, but slower than the rest. We emphasize that PA-GD and PA-PP represent the first BCD-type algorithms with a convergence rate guarantee for escaping from strict saddle points efficiently. At this point, it is unclear whether our rate is the best achievable, and the question of whether it can be improved is left to future work.

## 7 Numerical Results

In this section, we present a simple example that shows the convergence behavior of PA-GD. Consider a nonconvex objective function, i.e.,

 f(x) ≜ xᵀAx + (1/4)∥x∥₄⁴. (21)

First, we have the following properties of f, which guarantee that the function satisfies the assumptions of the analysis.

###### Lemma 4.

Over any bounded region, the function f defined in (21) is L-smooth and ρ-Hessian Lipschitz, with constants depending on A and the region.

The shape of the objective function (21) in the two dimensional (2D) case is shown in Figure 2(a). It can be observed clearly that there exists a strict saddle point at the origin along with two other local optimal points. We randomly initialize the algorithms around the strict saddle point. The convergence comparison between AGD and PA-GD is shown in Figure 2(b), where it can be observed that PA-GD converges faster than AGD to a local optimal point.
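A 2D instance of (21) with the stated saddle/minima structure can be reproduced with, e.g., an indefinite diagonal A (the paper's exact A is not shown, so this choice is illustrative):

```python
import numpy as np

# Illustrative 2D instance of (21): A indefinite, so the origin is a
# strict saddle (the Hessian at 0 is 2A), with two symmetric local minima.
A = np.array([[1.0,  0.0],
              [0.0, -1.0]])

def f(x):
    return x @ A @ x + 0.25 * np.sum(x**4)

def grad(x):
    return 2.0 * A @ x + x**3        # gradient of (21)

# Hessian at the origin is 2A, whose smallest eigenvalue is -2 < 0,
# and the points (0, +/-sqrt(2)) are stationary: 2*A*x + x^3 = 0 there.
```

Running the PA-GD sketch of Section 3 on this instance from a point near the origin reproduces the escape behavior shown in Figure 2(b).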

## 8 Conclusion

In this paper, perturbed variants of the AGD and alternating proximal point (APP) algorithms are proposed, with the objective of finding second order stationary solutions of nonconvex smooth problems. Leveraging the recently developed idea of random perturbation for first-order methods, the proposed algorithms add a suitable perturbation to the AGD or APP iterates. The main contribution of this work is a new analysis that takes into consideration the block structure of the updates of the perturbed AGD and APP algorithms. By exploiting the negative curvature, it is established that with high probability the algorithms converge to an SS2 within O(polylog(d)/ϵ^{7/3}) iterations.

## 9 Acknowledgment

The authors would like to thank Chi Jin for discussion on the perturbed gradient descent algorithm.

## References

• Allen-Zhu and Li [2017] Zeyuan Allen-Zhu and Yuanzhi Li. Neon2: Finding local minima via first-order oracles. arXiv preprint arXiv:1711.06673, 2017.
• Angelos et al. [1992] James R. Angelos, Carl C. Cowen, and Sivaram K. Narayan. Triangular truncation and finding the norm of a Hadamard multiplier. Linear Algebra and its Applications, 170:117–135, 1992.
• Beck and Tetruashvili [2013] Amir Beck and Luba Tetruashvili. On the convergence of block coordinate descent type methods. SIAM Journal on Optimization, 23(4):2037–2060, 2013.
• Bertsekas [1999] Dimitri P. Bertsekas. Nonlinear Programming, 2nd ed. Athena Scientific, Belmont, MA, 1999.
• Carmon and Duchi [2016] Yair Carmon and John C Duchi. Gradient descent efficiently finds the cubic-regularized non-convex Newton step. arXiv preprint arXiv:1612.00547, 2016.
• Conn et al. [2000] Andrew R. Conn, Nicholas I. M. Gould, and Philippe L. Toint. Trust region methods. SIAM, 2000.
• Du et al. [2017] Simon S. Du, Chi Jin, Jason D. Lee, Michael I. Jordan, Barnabás Póczos, and Aarti Singh. Gradient descent can take exponential time to escape saddle points. In Proceedings of Neural Information Processing Systems (NIPS), 2017.
• Feizi et al. [2017] Soheil Feizi, Hamid Javadi, Jesse Zhang, and David Tse. Porcupine neural networks: (almost) all local optima are global. arXiv:1710.02196 [stat.ML], 2017.
• Ge et al. [2015] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points — online stochastic gradient for tensor decomposition. In Proceedings of Annual Conference on Learning Theory (COLT), pages 797–842, 2015.
• Ge et al. [2017] Rong Ge, Chi Jin, and Yi Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. In Proceedings of International Conference on Machine Learning (ICML), pages 1233–1242, 2017.
• Grippo and Sciandrone [2000] L. Grippo and M. Sciandrone. On the convergence of the block nonlinear Gauss-Seidel method under convex constraints. Operations Research Letters, 26:127–136, 2000.
• Holbrook [1992] John A Holbrook. Spectral variation of normal matrices. Linear Algebra and its Applications, 174:131–144, 1992.
• Hong et al. [2017] Mingyi Hong, Xiangfeng Wang, Meisam Razaviyayn, and Zhi-Quan Luo. Iteration complexity analysis of block coordinate descent methods. Mathematical Programming Series A, 163(1):85–114, May 2017.
• Jain et al. [2013] Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank matrix completion using alternating minimization. In Proceedings of Annual ACM Symposium on Theory of Computing, pages 665–674, 2013.
• Jin et al. [2017a] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. In Proceedings of International Conference on Machine Learning (ICML), pages 1724–1732, 2017a.
• Jin et al. [2017b] Chi Jin, Praneeth Netrapalli, and Michael I Jordan. Accelerated gradient descent escapes saddle points faster than gradient descent. arXiv preprint arXiv:1711.10456, 2017b.
• Kawaguchi [2016] Kenji Kawaguchi. Deep learning without poor local minima. In Proceedings of Neural Information Processing Systems (NIPS), pages 586–594, 2016.
• Lee et al. [2016] Jason D. Lee, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. Gradient descent only converges to minimizers. In Proceedings of Annual Conference on Learning Theory (COLT), pages 1246–1257, 2016.
• Lee et al. [2017] Jason D. Lee, Ioannis Panageas, Georgios Piliouras, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. First-order methods almost always avoid saddle points. arXiv:1710.07406v1 [stat.ML], 2017.
• Li and Liang [2017] Yuanzhi Li and Yingyu Liang. Provable alternating gradient descent for non-negative matrix factorization with strong correlations. In Proceedings of International Conference on Machine Learning (ICML), volume 70, pages 2062–2070, 2017.
• Lu et al. [2017a] Songtao Lu, Mingyi Hong, and Zhengdao Wang. A stochastic nonconvex splitting method for symmetric nonnegative matrix factorization. In Proceedings of International Conference on Artificial Intelligence and Statistics (AISTATS), volume 54, pages 812–821, 2017a.
• Lu et al. [2017b] Songtao Lu, Mingyi Hong, and Zhengdao Wang. A nonconvex splitting method for symmetric nonnegative matrix factorization: Convergence analysis and optimality. IEEE Transactions on Signal Processing, 65(12):3120–3135, June 2017b.
• Nesterov [2012] Yurii Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
• Nesterov and Polyak [2006] Yurii Nesterov and Boris T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
• Parikh et al. [2014] Neal Parikh, Stephen Boyd, et al. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.
• Raginsky et al. [2017] Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via stochastic gradient langevin dynamics: a nonasymptotic analysis. In Proceedings of Annual Conference on Learning Theory (COLT), pages 1674–1703, 2017.
• Razaviyayn et al. [2013] Meisam Razaviyayn, Mingyi Hong, and Zhi-Quan Luo. A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM Journal on Optimization, 23(2):1126–1153, 2013.
• Razaviyayn et al. [2014] Meisam Razaviyayn, Mingyi Hong, Zhi-Quan Luo, and Jong-Shi Pang. Parallel successive convex approximation for nonsmooth nonconvex optimization. In Proceedings of Neural Information Processing Systems (NIPS), 2014.
• Reddi et al. [2017] Sashank J. Reddi, Manzil Zaheer, Suvrit Sra, Barnabás Póczos, Francis Bach, Ruslan Salakhutdinov, and Alexander J Smola. A generic approach for escaping saddle points. arXiv:1709.01434 [cs.LG], 2017.
• Sun et al. [2015] Ju Sun, Qing Qu, and John Wright. When are nonconvex problems not scary? In Proceedings of NIPS Workshop on Non-convex Optimization for Machine Learning: Theory and Practice, 2015.
• Sun et al. [2017] Ju Sun, Qing Qu, and John Wright. A geometric analysis of phase retrieval. arXiv:1602.06664 [cs.IT], 2017.
• Sun and Hong [2015] Ruoyu Sun and Mingyi Hong. Improved iteration complexity bounds of cyclic block coordinate descent for convex problems. In Proceedings of Neural Information Processing Systems (NIPS), pages 1306–1314, 2015.
• Tseng [2001] Paul Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3):475–494, 2001.
• Tseng and Yun [2009] Paul Tseng and Sangwoon Yun. Block-coordinate gradient descent method for linearly constrained nonsmooth separable optimization. Journal of Optimization Theory and Applications, 140(3):513, 2009.
• Wang et al. [2017a] Gang Wang, Georgios B. Giannakis, and Yonina C. Eldar. Solving systems of random quadratic equations via truncated amplitude flow. IEEE Transactions on Information Theory, 2017a.
• Wang et al. [2017b] Gang Wang, Georgios B. Giannakis, Yousef Saad, and Jie Chen. Solving almost all systems of random quadratic equations. In Proceedings of Neural Information Processing Systems (NIPS), 2017b.
• Weyl [1912] Hermann Weyl. Das asymptotische Verteilungsgesetz der Eigenwerte linearer partieller Differentialgleichungen (mit einer Anwendung auf die Theorie der Hohlraumstrahlung). Mathematische Annalen, 71(4):441–479, 1912.
• Xu and Yin [2013] Yangyang Xu and Wotao Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on Imaging Sciences, 6(3):1758–1789, 2013.
• Xu and Yang [2017] Yi Xu and Tianbao Yang. First-order stochastic algorithms for escaping from saddle points in almost linear time. arXiv preprint arXiv:1711.01944, 2017.
• Xu et al. [2017] Yi Xu, Rong Jin, and Tianbao Yang. Neon+: Accelerated gradient methods for extracting negative curvature for non-convex optimization. arXiv preprint arXiv:1712.01033, 2017.
• Yu et al. [2017] Yaodong Yu, Difan Zou, and Quanquan Gu. Saving gradient and negative curvature computations: Finding local minima more efficiently. arXiv preprint arXiv:1712.03950, 2017.
• Zhang et al. [2017] Yuchen Zhang, Percy Liang, and Moses Charikar. A hitting time analysis of stochastic gradient langevin dynamics. In Proceedings of Annual Conference on Learning Theory (COLT), pages 1980–2022, 2017.
• Zhang and Brand [2017] Ziming Zhang and Matthew Brand. On the convergence of block coordinate descent in training DNNs with Tikhonov regularization. In Proceedings of Neural Information Processing Systems (NIPS), 2017.
• Zhao et al. [2015] Tuo Zhao, Zhaoran Wang, and Han Liu. A nonconvex optimization framework for low rank matrix estimation. In Proceedings of Neural Information Processing Systems (NIPS), pages 559–567, 2015.

## Appendix A Preliminary

We provide the proofs of the preliminary lemmas (Lemma 5–Lemma 7) that are used in the proofs of Section B.

First, Lemma 5 and Lemma 6 quantify the difference of the second-order information of the objective function between two points.

###### Lemma 5.

If function $f$ is $\rho$-Hessian Lipschitz, we have

$$\left\| \int_0^1 \nabla^2 f(\theta x)\,\mathrm{d}\theta - \nabla^2 f(y) \right\| \le \rho\left(\|x\| + \|y\|\right), \quad \forall\, x, y. \tag{22}$$
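As a quick numerical sanity check (not part of the paper), inequality (22) can be verified in one dimension for $f(x) = x^3$, which has constant third derivative and is therefore $\rho$-Hessian Lipschitz with $\rho = 6$; the averaged Hessian $\int_0^1 f''(\theta x)\,\mathrm{d}\theta = 3x$ has a closed form. The helper names `hessian_avg` and `lemma5_gap` are our own.

```python
def hessian_avg(x):
    # For f(x) = x**3 we have f''(x) = 6x, so the averaged Hessian
    # int_0^1 f''(theta * x) d(theta) equals 3x in closed form.
    return 3.0 * x

def lemma5_gap(x, y, rho=6.0):
    # Left-hand side minus right-hand side of inequality (22);
    # nonpositive values mean the bound holds at (x, y).
    lhs = abs(hessian_avg(x) - 6.0 * y)
    rhs = rho * (abs(x) + abs(y))
    return lhs - rhs
```

Here the bound follows directly from $|3x - 6y| \le 3|x| + 6|y| \le 6(|x| + |y|)$, so the gap is nonpositive for every pair of points.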
###### Lemma 6.

Under Assumption 1, we have the following block-wise Lipschitz continuity:

 (23)

and

 (24)

Then, Lemma 7 shows that, after one round of updates of the AGD algorithm, the size of the block-wise partial gradients is related to the size of the full gradient as follows.

###### Lemma 7.

If function $f$ is $L$-smooth with Lipschitz constant $L$, then we have

$$\big\|\nabla f(x^{(t)})\big\|^2 \le 4 \sum_{k=1}^{2} \big\|\nabla_k f\big(h^{(t)}_{-k}, x^{(t)}_k\big)\big\|^2 \tag{25}$$

where the sequence $\{x^{(t)}\}$ is generated by the AGD algorithm.
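Inequality (25) can be checked numerically on a two-block quadratic $f(x_1, x_2) = \tfrac{1}{2} x^\top A x$ with scalar blocks. The sketch below assumes a step size of $1/L$ with $L$ the largest eigenvalue of $A$ (the paper's step-size constants may differ), and reads $\nabla_k f(h^{(t)}_{-k}, x^{(t)}_k)$ as the partial gradient of block $k$ evaluated right after block $k$'s update; the helper name `agd_round_check` is our own.

```python
def agd_round_check(A, x1, x2, L):
    """One AGD round on f(x1, x2) = 0.5 * [x1, x2]^T A [x1, x2] with
    scalar blocks and step size 1/L; returns (lhs, rhs) of (25)."""
    a11, a12, a22 = A[0][0], A[0][1], A[1][1]
    grad = lambda u, v: (a11 * u + a12 * v, a12 * u + a22 * v)  # A x
    # Block 1 step with x2 fixed, then block 2 step at the new x1.
    x1_new = x1 - grad(x1, x2)[0] / L
    x2_new = x2 - grad(x1_new, x2)[1] / L
    # Partial gradients at the per-block update points, as in (25).
    p1 = grad(x1_new, x2)[0]
    p2 = grad(x1_new, x2_new)[1]
    g_full = grad(x1_new, x2_new)
    lhs = g_full[0] ** 2 + g_full[1] ** 2
    rhs = 4.0 * (p1 ** 2 + p2 ** 2)
    return lhs, rhs
```

For example, with $A = [[2, 1], [1, 2]]$, $L = 3$, and $x = (1, 1)$, the left-hand side is $5/9$ while the right-hand side is $52/9$, so the bound holds with room to spare.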

### a.1 Proof of Lemma 5

###### Proof.

If function $f$ is $\rho$-Hessian Lipschitz, then we have

$$\begin{aligned} \left\| \int_0^1 \big(\nabla^2 f(\theta x) - \nabla^2 f(y)\big)\,\mathrm{d}\theta \right\| &\le \int_0^1 \big\|\nabla^2 f(\theta x) - \nabla^2 f(y)\big\|\,\mathrm{d}\theta \\ &\overset{(a)}{\le} \rho \int_0^1 \|\theta x - y\|\,\mathrm{d}\theta \overset{(b)}{\le} \rho \int_0^1 \theta \|x\|\,\mathrm{d}\theta + \rho \|y\| \le \rho\left(\|x\| + \|y\|\right) \end{aligned}$$

where (a) is true because of the Hessian Lipschitz property, and in (b) we used the triangle inequality. ∎

### a.2 Proof of Lemma 6

The proof involves two parts:

#### Upper Triangular Matrix:

Consider three different vectors $x$, $y$, and $z$. We have

$$\left\| \begin{bmatrix} \nabla^2_{11} f(x) & \nabla^2_{12} f(x) \\ 0 & \nabla^2_{22} f(y) \end{bmatrix} - \begin{bmatrix} \nabla^2_{11} f(z) & \nabla^2_{12} f(z) \\ 0 & \nabla^2_{22} f(z) \end{bmatrix} \right\| \overset{(a)}{\le}$$