Incremental Methods for Weakly Convex Optimization

07/26/2019 ∙ by Xiao Li, et al. ∙ Princeton University, Johns Hopkins University, The Chinese University of Hong Kong

We consider incremental algorithms for solving weakly convex optimization problems, a wide class of (possibly nondifferentiable) nonconvex optimization problems. We analyze incremental (sub)-gradient descent, the incremental proximal point algorithm, and the incremental prox-linear algorithm. We show that all three incremental algorithms converge at rate $O(k^{-1/4})$ in the weakly convex setting. This extends the convergence theory of incremental methods from convex optimization to the nondifferentiable nonconvex regime. When the weakly convex function satisfies an additional regularity condition called sharpness, we show that all three incremental algorithms, with a geometrically diminishing stepsize and an appropriate initialization, converge linearly to the optimal solution set. We conduct experiments on robust matrix sensing and robust phase retrieval to illustrate the superior convergence properties of the three incremental methods.


1 Introduction

We consider incremental methods for addressing the following finite-sum optimization problem

\[
\min_{x \in \mathbb{R}^n}\ f(x) := \sum_{i=1}^{m} f_i(x), \tag{1.1}
\]

where each component function $f_i$ is weakly convex (see Definition 1), lower semi-continuous, and lower bounded. We restrict ourselves to the case where the set $\mathcal{X}$ of global minimizers of (1.1) is nonempty and closed, and we denote by $f^*$ its minimal function value. The formulation in (1.1) is quite general—training a neural network, phase retrieval, low-rank matrix optimization, etc., fit within this framework. It is worth mentioning that $f$ in (1.1) can be nonconvex and nondifferentiable under the weakly convex setting, covering rich applications in practice; see Section 2.2 for detailed examples.

Incremental methods play an important role in large-scale problems. In this paper, we discuss incremental (sub)-gradient descent111Strictly speaking, when $f_i$ is nondifferentiable, it is more appropriate to call the algorithm the incremental subgradient method, since it is not necessarily a descent algorithm. For convenience, we simply call it incremental subgradient descent, in accordance with the name of incremental gradient descent., the incremental proximal point algorithm, and the incremental prox-linear algorithm. At each step, incremental methods update the iterate using only one component function, selected according to a cyclic order, i.e., sequentially from $f_1$ to $f_m$ and repeating the process cyclically. To be more specific, in the $k$-th iteration, incremental algorithms start with $x_{k,0} = x_k$ and then update $x_{k,i}$ from $x_{k,i-1}$ using $f_i$ for all $i = 1, \dots, m$, giving $x_{k+1} = x_{k,m}$. The following three incremental algorithms considered in this paper differ from each other in the update from $x_{k,i-1}$ to $x_{k,i}$.

Incremental (sub)-gradient descent:

\[
x_{k,i} = x_{k,i-1} - \mu_k\, g_{k,i-1}, \qquad g_{k,i-1} \in \partial f_i(x_{k,i-1}), \tag{1.2}
\]

where $g_{k,i-1}$ is any subgradient belonging to the Fréchet subdifferential $\partial f_i(x_{k,i-1})$ (see (2.1)) and $\mu_k > 0$ is the stepsize.

Incremental proximal point algorithm:

\[
x_{k,i} = \operatorname*{argmin}_{x}\ \Big\{ f_i(x) + \frac{1}{2\mu_k}\,\|x - x_{k,i-1}\|^2 \Big\}. \tag{1.3}
\]

Incremental prox-linear algorithm: We now consider a specific class of weakly convex functions in (1.1) of the following composite form,

\[
f_i(x) = h_i\big(c_i(x)\big), \tag{1.4}
\]

where each $h_i$ is a (possibly nonsmooth) Lipschitz convex function and each $c_i$ is a smooth mapping with Lipschitz continuous Jacobian; see Section 2.2 for a list of concrete examples that fit into this setting. We denote by

\[
f_i(x;\, y) := h_i\big(c_i(y) + \nabla c_i(y)\,(x - y)\big) \tag{1.5}
\]

the local convex relaxation of $f_i$ at $y$.

The incremental prox-linear algorithm updates as

\[
x_{k,i} = \operatorname*{argmin}_{x}\ \Big\{ f_i(x;\, x_{k,i-1}) + \frac{1}{2\mu_k}\,\|x - x_{k,i-1}\|^2 \Big\}. \tag{1.6}
\]

This incremental prox-linear algorithm generalizes its nonincremental counterpart (i.e., (1.6) applied with the full sum of components) for composite optimization problems [1, 2].
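To make the updates concrete, here is a minimal NumPy/SciPy sketch (ours, not from the paper) of one cycle of incremental (sub)-gradient descent (1.2) and of the incremental proximal point algorithm (1.3) on a toy robust regression instance; the component functions, subgradient oracles, and the stepsize constants are illustrative placeholders. The prox-linear update (1.6) has the same structure, with $f_i$ replaced by its convex relaxation (1.5) in the inner subproblem.

```python
import numpy as np
from scipy.optimize import minimize

def incremental_subgradient_cycle(x, subgrads, mu):
    # One cycle of incremental (sub)-gradient descent (1.2):
    # visit the components in the fixed cyclic order i = 1, ..., m.
    for subgrad_i in subgrads:
        x = x - mu * subgrad_i(x)
    return x

def incremental_proximal_cycle(x, fs, mu):
    # One cycle of the incremental proximal point algorithm (1.3):
    # each inner step solves min_z f_i(z) + ||z - x||^2 / (2 mu),
    # here with a generic solver purely for illustration.
    for f_i in fs:
        obj = lambda z, x0=x, f=f_i: f(z) + np.sum((z - x0) ** 2) / (2 * mu)
        x = minimize(obj, x, method="Nelder-Mead").x
    return x

# Toy instance: robust 1-d regression with components f_i(x) = |a_i x - y_i|.
rng = np.random.default_rng(0)
a, y = rng.normal(size=5), rng.normal(size=5)
fs = [lambda z, ai=ai, yi=yi: abs(ai * z[0] - yi) for ai, yi in zip(a, y)]
subgrads = [lambda z, ai=ai, yi=yi: np.array([ai * np.sign(ai * z[0] - yi)])
            for ai, yi in zip(a, y)]

x = np.array([0.0])
for k in range(100):
    # geometrically diminishing stepsize (illustrative constants)
    x = incremental_subgradient_cycle(x, subgrads, mu=0.5 * 0.95 ** k)
```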

Another counterpart to incremental methods for addressing (1.1) is the family of stochastic algorithms, which at each iteration pick one component function independently and uniformly at random to perform the update. Such a sampling scheme plays a key role in the analysis of stochastic algorithms. For example, uniform random sampling yields an unbiased estimate of the full (sub)-gradient in each iteration, which makes stochastic algorithms behave, in expectation, like their full-component counterparts. Compared with stochastic algorithms, incremental methods are easier to implement in the large-scale setting, since the former require independent uniform sampling at each iteration. Furthermore, we will show in the experimental section that incremental algorithms usually outperform their stochastic counterparts, which is also observed in practice [3, 4]. There are two possible explanations for this phenomenon: 1) the convergence guarantees of stochastic algorithms are stated in expectation, which can hide a large variance, while incremental algorithms enjoy deterministic convergence properties; 2) incremental methods visit each component exactly once per cycle, whereas in SGD some component functions may be selected more often than others within an epoch.

Incremental methods are widely utilized in practice. Compared to nonincremental methods such as subgradient descent, the proximal point method, and the prox-linear algorithm, incremental methods are more suitable for large-scale problems due to their low per-iteration computational cost. For instance, incremental (sub)-gradient descent and its random shuffling version222This is an alternative component selection rule; the resulting algorithm is often called "random shuffling" in the neural network training literature. Instead of choosing the components sequentially from $f_1$ to $f_m$, at the start of the $k$-th iteration it draws independently and uniformly one permutation from the set of all possible permutations of $\{1, \dots, m\}$, and then sequentially visits each component function in that order. are broadly employed in practice for large-scale machine learning problems such as training deep neural networks; see, e.g., [3, 4, 5, 6, 7, 8]. In the 1980s-90s, this scheme was also known as the "online backpropagation algorithm" in the artificial intelligence literature [9, 10]. Moreover, in the regression and linear system literature, the well-known and extensively studied Kaczmarz method is a special case of incremental (sub)-gradient descent.
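The difference between the cyclic, random-shuffling, and i.i.d.-sampling selection rules mentioned above is only the index sequence used in each epoch; a small sketch (ours):

```python
import numpy as np

def index_order(m, rule, rng):
    # Component selection for one epoch/cycle over m components.
    if rule == "cyclic":    # incremental methods: 1, ..., m in fixed order
        return np.arange(m)
    if rule == "shuffle":   # random shuffling: a fresh permutation each epoch
        return rng.permutation(m)
    if rule == "iid":       # stochastic methods: independent uniform draws
        return rng.integers(0, m, size=m)
    raise ValueError(rule)

rng = np.random.default_rng(0)
for rule in ("cyclic", "shuffle", "iid"):
    print(rule, index_order(6, rule, rng))
```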

Although incremental algorithms are broadly used, their theoretical properties are far from well understood. The purely deterministic component selection in each iteration makes them challenging to analyze. The main prior results for incremental methods rely on convexity assumptions. To the best of our knowledge, there is no convergence result for incremental (sub)-gradient descent, the incremental proximal point algorithm, or the incremental prox-linear algorithm when $f$ in (1.1) is nonconvex and nondifferentiable; indeed, no explicit convergence rate is currently known even in the smooth nonconvex setting. Thus, it is fundamentally important to ask:

Are the incremental methods, including incremental (sub)-gradient descent, the incremental proximal point algorithm, and the incremental prox-linear algorithm, guaranteed to converge when $f$ in (1.1) is nondifferentiable and nonconvex? If yes, what is the convergence rate?

In this paper, we answer this question positively under the assumption that each $f_i$ is weakly convex. This setting includes a wide class of nonconvex and nondifferentiable concrete examples, some of which are described in Section 2.2.

1.1 Our contribution

We summarize our main contributions as follows:

  • We extend the convergence theory for incremental algorithms (including incremental (sub)-gradient descent (1.2) and the incremental proximal point algorithm (1.3)) from convex optimization to weakly convex optimization. Moreover, we establish the same convergence result for the incremental prox-linear algorithm under the same weakly convex setting333The stochastic prox-linear algorithm was proposed and analyzed very recently in [11, 12]. However, to our knowledge, the incremental prox-linear algorithm has not appeared in the literature, so introducing this algorithm may itself be one of our contributions. This algorithm would be the first choice in many practical situations due to its high efficiency; see our experiment section.. In particular, with a constant stepsize scheme, we show that all three incremental methods converge at rate $O(k^{-1/4})$ to a critical point of the weakly convex optimization problem (1.1) (for the incremental prox-linear method, $f$ in (1.1) should have the specific composite form (1.4)); see Theorem 1 for the hidden constant. In other words, achieving an $\epsilon$-accurate critical point requires at most $O(\epsilon^{-4})$ iterations for all three algorithms.

  • Suppose the weakly convex function $f$ in (1.1) satisfies an additional error bound property called the sharpness condition (see Definition 2). We then design a geometrically diminishing stepsize rule that depends only on intrinsic problem parameters and show that all three incremental algorithms converge locally linearly to the optimal solution set of (1.1) (again, for the incremental prox-linear method, $f$ in (1.1) should have the composite form (1.4)). This is surprising since, in contrast to the incremental aggregated algorithms [13, 14], which achieve linear convergence for smooth strongly convex optimization, the three incremental methods analyzed in this paper use no aggregation technique; they rely only on a diminishing stepsize.

It is worth mentioning that our bounds for incremental (sub)-gradient descent also apply to its random shuffling version (see Footnote 2), which is also very popular in the large-scale learning literature. Since our result is deterministic, one can regard it as a worst-case bound for the random shuffling version of incremental (sub)-gradient descent.

Our argument is simple and transparent. For instance, the key steps in showing the linear convergence of incremental (sub)-gradient descent are to establish the recursion (4.8) and then conduct a tailored analysis of our geometrically diminishing stepsize rule. Establishing the recursion (4.8) is similar to previous work in convex optimization; see, e.g., [15, 9]. Then, with the geometrically diminishing stepsize, the derivation of linear convergence from the recursion (4.8) shares the same spirit as prior analyses of full subgradient descent; see, e.g., [16, 17]. Nonetheless, our conclusions are powerful in the sense that incremental methods can achieve a linear rate of convergence for a class of nondifferentiable nonconvex optimization problems. We believe our proof technique for handling weakly convex functions will be useful for the further analysis of other incremental-type algorithms (such as the aggregated versions [14, 13]) for solving weakly convex optimization problems.

1.2 Related work

Paper | Assumptions | Stepsize | Complexity | Measure
[10] | $f$ strongly convex; each $f_i$ convex, differentiable, with Lipschitz gradient | Diminishing | $O(1/k)$ | $\mathrm{dist}^2(x_k, \mathcal{X})$
[18] | each $f_i$ convex, Lipschitz | Constant | $O(1/\sqrt{k})$ | $f(x_k) - f^*$
[18] | each $f_i$ convex, Lipschitz; $f$ sharp | Diminishing (Polyak) | Linear | $\mathrm{dist}(x_k, \mathcal{X})$
[9] | each $f_i$ convex, Lipschitz | Constant | $O(1/\sqrt{k})$ | $f(x_k) - f^*$
This paper | each $f_i$ weakly convex, Lipschitz | Constant | $O(1/k^{1/4})$ | $\|\nabla f_\lambda(x_k)\|$
This paper | each $f_i$ weakly convex, Lipschitz; $f$ sharp; good initialization | Diminishing, see (4.2) | Linear | $\mathrm{dist}(x_k, \mathcal{X})$

Table 1: Comparison with previous work. The algorithm studied in [10, 18] is incremental (sub)-gradient descent. The algorithm studied in [9] is the incremental proximal point algorithm. This paper's results cover incremental (sub)-gradient descent, the incremental proximal point algorithm, and the incremental prox-linear algorithm. The Polyak stepsize in the fourth line requires the minimum function value $f^*$ of (1.1) and the maximum subgradient norm at the current point. $\nabla f_\lambda$ in the second-to-last line denotes the gradient of the Moreau envelope of $f$; see (3.3) for clarity. We hide numerical constants in the $O(\cdot)$ notation for a clean comparison.

Research on incremental (sub)-gradient descent has a long history, but it mainly focuses on convex optimization. Most of these works can be divided into two categories based on differentiability. The earliest work may date back to [19] for solving linear least squares problems, where each $f_i$ is differentiable and convex. Various subsequent works [20, 21, 22, 23] extended incremental gradient descent to training shallow linear neural networks and other smooth convex problems of the form (1.1), with various stepsize choices and convergence analyses. A more recent work [10] shows that for a strongly convex $f$ with each component $f_i$ convex, incremental gradient descent with a diminishing stepsize converges at rate $O(1/k)$ in terms of the distance between the iterates and the optimal solution set. We note that a popular variant named incremental aggregated gradient descent [14, 13, 24, 25]—which utilizes a single component function at a time as in incremental methods, but keeps a memory of the most recent gradients of all component functions to approximate the full gradient—has been proved to converge linearly for smooth and strongly convex functions.

When $f$ in (1.1) is convex but not differentiable, incremental (sub)-gradient descent was proposed in [26] for solving nondifferentiable finite-sum optimization problems. Later, Nedić and Bertsekas [15] provided asymptotic convergence results for incremental (sub)-gradient descent with either constant or diminishing stepsizes when the function is convex. This result was further improved in [18], where the authors proved that the algorithm converges at rate $O(1/\sqrt{k})$ in terms of the function suboptimality gap with a constant stepsize, and at a linear rate to the optimal solution set with Polyak's dynamic stepsize, which requires knowledge of $f^*$. Similar asymptotic convergence results were also established in [27] for incremental $\epsilon$-subgradient descent. Besides, it is worth mentioning that Solodov [28] analyzed the global convergence of incremental gradient descent for smooth nonconvex problems, but only asymptotically, without any explicit rate. We refer the readers to [9] for a comprehensive review.

The incremental proximal point algorithm was proposed in [9] for large-scale convex optimization and was proved to converge at rate $O(1/\sqrt{k})$ in terms of the function suboptimality gap with a constant stepsize. To our knowledge, the convergence of the incremental proximal point algorithm has not yet been studied for nonconvex problems. As for the incremental prox-linear algorithm, to our knowledge it has not been used or analyzed in the literature, except for its stochastic counterpart, which was proposed and analyzed very recently in [11, 12].

For a clearer and more transparent comparison, we list representative historical results together with ours in Table 1.

2 Preliminaries

Since $f$ can be nondifferentiable, we utilize tools from generalized differentiation. The (Fréchet) subdifferential of a function $f$ at $x$ is defined as

\[
\partial f(x) := \Big\{ g \in \mathbb{R}^n :\ f(y) \ge f(x) + \langle g,\, y - x \rangle + o(\|y - x\|)\ \text{as } y \to x \Big\}, \tag{2.1}
\]

where each $g \in \partial f(x)$ is called a subgradient of $f$ at $x$.

2.1 Function regularity

The following two properties serve as the foundation of our analysis.

To begin, we give the definition of weak convexity, which characterizes the extent of nonconvexity of a function.

Definition 1 (weak convexity; see, e.g., [29]).

We say that $f: \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ is weakly convex with parameter $\tau \ge 0$ if $x \mapsto f(x) + \frac{\tau}{2}\|x\|^2$ is convex.

Note that weak convexity of $f$ with parameter $\tau$ is equivalent to

\[
f(y) \ \ge\ f(x) + \langle g,\, y - x \rangle - \frac{\tau}{2}\,\|y - x\|^2, \qquad \forall\, g \in \partial f(x), \tag{2.2}
\]

for any $x, y \in \mathbb{R}^n$, which can be shown quickly by employing the convex subgradient inequality for the function $f(\cdot) + \frac{\tau}{2}\|\cdot\|^2$. If each $f_i$ in (1.1) is weakly convex, the function $f$ in (1.1) is also weakly convex by definition.

We now introduce sharpness, a regularity condition that characterizes how fast the function grows as $x$ moves away from the set of global minima.

Definition 2 (sharpness; see, e.g., [30]).

We say that a function $f: \mathbb{R}^n \to \mathbb{R}$ is $\alpha$-sharp with $\alpha > 0$ if

\[
f(x) - f^* \ \ge\ \alpha \cdot \mathrm{dist}(x, \mathcal{X}) \tag{2.3}
\]

for all $x \in \mathbb{R}^n$. Here $\mathcal{X}$ denotes the set of global minimizers of $f$, $f^*$ represents the minimal value of $f$, and $\mathrm{dist}(x, \mathcal{X})$ is the distance of $x$ to $\mathcal{X}$, i.e., $\mathrm{dist}(x, \mathcal{X}) = \min_{y \in \mathcal{X}} \|x - y\|$.

Informally speaking, the sharpness condition is tailored to nondifferentiable functions. This regularity condition plays a fundamental role in establishing linear and quadratic convergence rates for (sub)-gradient descent and the prox-linear algorithm, respectively [16, 17, 31]. In Section 4, we will show that under the sharpness condition, the convergence rate of incremental methods improves from sublinear to linear.
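As a one-dimensional illustration of Definition 2 (ours, not from the paper), the function $f(x) = |x^2 - 1|$ has $\mathcal{X} = \{\pm 1\}$ and $f^* = 0$, and it is $1$-sharp:

\[
f(x) = |x - 1|\,|x + 1| \ \ge\ \min\big\{ |x - 1|,\ |x + 1| \big\} \ =\ \mathrm{dist}(x, \mathcal{X}),
\]

since the larger of the two factors equals $|x| + 1 \ge 1$ for every $x \in \mathbb{R}$.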

2.2 Concrete examples

A wide class of nondifferentiable weakly convex functions has the following composite form (cf. (1.4)) [32]:

\[
f(x) = h\big(c(x)\big), \tag{2.4}
\]

where $h$ is a Lipschitz convex function and $c$ is a smooth mapping with Lipschitz continuous Jacobian. We now list several popular nondifferentiable nonconvex optimization problems of the form (2.4).

Robust Matrix Sensing [33]

Low-rank matrices are ubiquitous in computer vision, machine learning, and data science applications. One fundamental computational task is to recover a positive semidefinite (PSD) low-rank matrix $X^\star \in \mathbb{R}^{n \times n}$ with $\mathrm{rank}(X^\star) = r \ll n$ from a small number of linear measurements arbitrarily corrupted with outliers,

\[
y = \mathcal{A}(X^\star) + s, \tag{2.5}
\]

where $\mathcal{A}: \mathbb{R}^{n \times n} \to \mathbb{R}^m$ is a linear measurement operator consisting of a set of sensing matrices $\{A_i\}_{i=1}^m$ (i.e., $[\mathcal{A}(X)]_i = \langle A_i, X \rangle$) and $s \in \mathbb{R}^m$ is a sparse outlier vector. An effective approach to recovering the low-rank matrix $X^\star$ is to use a factored representation of the matrix variable [34] (i.e., $X = UU^\top$ with $U \in \mathbb{R}^{n \times r}$) and to employ an $\ell_1$-loss function to robustify the solution against outliers:

\[
\min_{U \in \mathbb{R}^{n \times r}}\ f(U) := \frac{1}{m} \sum_{i=1}^m \big| \langle A_i,\, UU^\top \rangle - y_i \big|. \tag{2.6}
\]

Direct calculus shows that each component function in (2.6) is weakly convex. Furthermore, it is shown in [33] that, under certain statistical assumptions ($\mathcal{A}$ has i.i.d. Gaussian ensembles and the fraction of outliers in $s$ is below a certain threshold), $\{U : UU^\top = X^\star\}$ is exactly the set of global minimizers of (2.6) and the objective function in (2.6) is sharp with a parameter depending on the fraction of outliers in $s$.
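For instance, a minimal sketch (ours) of one incremental subgradient cycle on (2.6), using the fact that a subgradient of $U \mapsto |\langle A_i, UU^\top\rangle - y_i|$ is $\mathrm{sign}(r)\,(A_i + A_i^\top)\,U$ with residual $r = \langle A_i, UU^\top\rangle - y_i$; the $1/m$ scaling is absorbed into the stepsize:

```python
import numpy as np

def component_value_and_subgrad(U, A_i, y_i):
    # One component of the robust matrix sensing loss (2.6):
    #   f_i(U) = |<A_i, U U^T> - y_i|,
    # with subgradient sign(r) (A_i + A_i^T) U, where r = <A_i, U U^T> - y_i.
    r = np.sum(A_i * (U @ U.T)) - y_i
    return abs(r), np.sign(r) * (A_i + A_i.T) @ U

# Toy instance: n = 5, rank r = 2, m = 20 outlier-free measurements.
rng = np.random.default_rng(0)
n, r, m = 5, 2, 20
U_star = rng.normal(size=(n, r))
A = rng.normal(size=(m, n, n))
y = np.array([np.sum(A[i] * (U_star @ U_star.T)) for i in range(m)])

# One cycle of incremental subgradient descent (1.2) on (2.6).
U, mu = rng.normal(size=(n, r)), 1e-2
for i in range(m):
    _, G = component_value_and_subgrad(U, A[i], y[i])
    U = U - mu * G
```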

Robust Phase Retrieval [35, 31]

Robust phase retrieval aims to recover a signal $x^\star \in \mathbb{R}^n$ from its magnitude-wise measurements arbitrarily corrupted with outliers:

\[
y = |A x^\star|^2 + s, \tag{2.7}
\]

where the operator $|\cdot|^2$ in (2.7) takes the modulus coordinate-wise and then squares it. Here, the matrix $A = [a_1, \dots, a_m]^\top \in \mathbb{R}^{m \times n}$ is a measurement matrix and $s \in \mathbb{R}^m$ is a sparse outlier vector. [31] formulated the following problem for recovering both the sign and magnitude information of $x^\star$:

\[
\min_{x \in \mathbb{R}^n}\ f(x) := \frac{1}{m} \sum_{i=1}^m \big| \langle a_i, x \rangle^2 - y_i \big|. \tag{2.8}
\]

It is straightforward to verify that each component function in (2.8) is weakly convex. When the measurement vectors $a_i$ follow an i.i.d. Gaussian distribution and the fraction of outliers is below a certain threshold, it is proved in [31] that $\{\pm x^\star\}$ is exactly the set of minimizers of (2.8) and the objective function in (2.8) is sharp with a parameter depending on the outlier ratio.
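The component structure of (2.8) matches (1.4) with $h_i = |\cdot|$ and $c_i(x) = \langle a_i, x \rangle^2 - y_i$, so the prox-linear subproblem (1.6) is effectively one-dimensional and admits a closed form. The following sketch (ours; the stepsize constants and initialization radius are illustrative) runs the incremental prox-linear method on a synthetic outlier-free instance:

```python
import numpy as np

def prox_linear_step(x, a_i, y_i, mu):
    # Incremental prox-linear step (1.6) for one component of (2.8).
    # The local convex relaxation (1.5) is |c + <g, x' - x>| with
    #   c = <a_i, x>^2 - y_i,   g = 2 <a_i, x> a_i,
    # and min_{x'} |c + <g, x' - x>| + ||x' - x||^2 / (2 mu) is solved
    # exactly: x' = x - t g with t = clip(c / ||g||^2, -mu, mu).
    c = np.dot(a_i, x) ** 2 - y_i
    g = 2 * np.dot(a_i, x) * a_i
    s = np.dot(g, g)
    t = 0.0 if s == 0 else np.clip(c / s, -mu, mu)
    return x - t * g

rng = np.random.default_rng(0)
n, m = 10, 40
x_star = rng.normal(size=n)
A = rng.normal(size=(m, n))
y = (A @ x_star) ** 2
x = x_star + 0.3 * rng.normal(size=n)        # a good initialization
for k in range(50):
    for i in range(m):                        # one cycle = one pass over components
        x = prox_linear_step(x, A[i], y[i], mu=1e-3 * 0.98 ** k)
# distance to {+x_star, -x_star}; typically shrinks over the cycles
print(min(np.linalg.norm(x - x_star), np.linalg.norm(x + x_star)))
```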

Robust Blind Deconvolution [36]

In the image processing community, blind deconvolution is a central technique for recovering real images from convolved measurements. Mathematically speaking, the task is to recover a ground-truth pair $(w^\star, x^\star)$ from their convolution corrupted by outliers; in the bilinear form analyzed in [36], the measurements read

\[
y_i = \langle u_i, w^\star \rangle \langle v_i, x^\star \rangle + s_i, \qquad i = 1, \dots, m, \tag{2.9}
\]

where $\{u_i\}$ and $\{v_i\}$ are measurement vectors and $s$ contains the outliers. Robust blind deconvolution amounts to addressing

\[
\min_{w,\, x}\ f(w, x) := \frac{1}{m} \sum_{i=1}^m \big| \langle u_i, w \rangle \langle v_i, x \rangle - y_i \big|. \tag{2.10}
\]

Similarly, the above objective function is sharp (with a sharpness parameter closely related to the energy of the ground-truth signals and the outlier ratio) when the $u_i, v_i$ follow a standard i.i.d. Gaussian distribution [36]. Again, it is easy to verify that each component function of the objective in (2.10) is weakly convex.

Robust PCA [37, 38]

The task of robust PCA is to separate a low-rank matrix $X^\star$ (with $\mathrm{rank}(X^\star) = r \ll n$) and a sparse matrix $S^\star$ from their sum $Y = X^\star + S^\star$. Robust PCA has been successfully applied in many real applications, such as background extraction from surveillance video [39] and image specularity removal [40]. Considering the simple case where $X^\star$ is PSD with dimension $n$, and using the same factorization approach as in (2.6), we estimate the low-rank matrix by solving

\[
\min_{U \in \mathbb{R}^{n \times r}}\ f(U) := \big\| UU^\top - Y \big\|_1, \tag{2.11}
\]

where $\|\cdot\|_1$ denotes the entry-wise $\ell_1$-norm. Each component function in the above (e.g., the loss restricted to one column of $Y$) is weakly convex. However, whether the objective function in (2.11) has the sharpness property remains open.

3 Convergence Result for Weakly Convex Optimization

In this section, we study the convergence behavior of incremental (sub)-gradient descent, the incremental proximal point algorithm, and the incremental prox-linear algorithm with a constant stepsize, under the assumption that each component function $f_i$ is weakly convex.

3.1 Assumptions and the Moreau envelope

We make the following assumptions throughout this section. Note that the sharpness condition is not assumed in this section.

Assumption 1.

  • (A1) (bounded subgradients): For any bounded set $\mathcal{C} \subset \mathbb{R}^n$, there exists a constant $L > 0$ such that $\|g\| \le L$ for all $g \in \partial f_i(x)$, all $x \in \mathcal{C}$, and all $i = 1, \dots, m$.

  • (A2) (weak convexity): Each component function $f_i$ in (1.1) is weakly convex with parameter $\tau_i \ge 0$. Set $\tau := \sum_{i=1}^m \tau_i$.

Assumption 1 (A1) is standard in analyzing incremental and stochastic algorithms; see, e.g., [18, 9, 15, 12, 41]. For the concrete applications listed in Section 2.2, Assumption 1 (A1) is satisfied on any bounded subset. This assumption concerns the Lipschitz continuity of the $f_i$. To see this, we refer the readers to [42, Theorem 9.1], which establishes that $f_i$ is Lipschitz continuous with parameter $L$ on $\mathcal{C}$ whenever Assumption 1 (A1) is valid and $f_i$ is finite.

Assumption 1 (A2) is mild and is satisfied by many concrete nondifferentiable nonconvex optimization problems; see Section 2.2.

The incremental prox-linear algorithm (1.6) is designed for solving weakly convex optimization problems (1.1) in which $f$ has the composite form (1.4). Thus, instead of using Assumption 1, we make the following tailored assumptions for the incremental prox-linear method. The parameter notation remains the same as in Assumption 1, which we explain immediately after the statement of Assumption 2.

Assumption 2.

  • (A3) (bounded subgradients): For any bounded set $\mathcal{C} \subset \mathbb{R}^n$, there exists a constant $L > 0$ such that $\|g\| \le L$ for all $g \in \partial_x f_i(x;\, y)$, all $x, y \in \mathcal{C}$, and all $i = 1, \dots, m$.

  • (A4) (quadratic approximation): There exist constants $\tau_i \ge 0$ such that each component function in (1.1) satisfies

\[
\big| f_i(x) - f_i(x;\, y) \big| \ \le\ \frac{\tau_i}{2}\,\|x - y\|^2 \qquad \forall\, x, y \in \mathbb{R}^n.
\]

As before, set $\tau := \sum_{i=1}^m \tau_i$.

A similar assumption was used in [31] for analyzing the nonincremental prox-linear method. Note that Assumption 2 parallels Assumption 1. For the concrete examples listed in Section 2.2, the bounded subgradient parameter in Assumption 2 (A3) coincides with that of Assumption 1 (A1) when the relaxation $f_i(\cdot;\, y)$ is evaluated at $y = x$. More generally, the two constants in Assumption 2 (A3) and Assumption 1 (A1) coincide when $h_i$ in (1.4) is a norm or a max function. Furthermore, Assumption 2 (A4) is slightly stronger than Assumption 1 (A2), since it implies the weak convexity of $f_i$ through a direct calculation utilizing the convexity of $h_i$. It can be verified that the examples listed in Section 2.2 satisfy Assumption 2 (A4) with parameters coinciding with their weak convexity parameters. Thus, to avoid overloading notation, we keep the same notation as in Assumption 1.

For weakly convex optimization, checking the norm of a subgradient is in general not an appropriate measure of optimality due to nondifferentiability. The seminal paper [12] identified a reliable surrogate optimality measure for weakly convex optimization, where the problem can be nondifferentiable and even nonconvex. For completeness, we briefly introduce the related notions of the Moreau envelope and the proximal mapping.

Definition 3.

(see [42, Definition 1.22]) For any $\lambda > 0$, the Moreau envelope of $f$ is defined as

\[
f_\lambda(x) := \min_{y}\ \Big\{ f(y) + \frac{1}{2\lambda}\,\|y - x\|^2 \Big\}. \tag{3.1}
\]

The corresponding proximal mapping is defined as

\[
\mathrm{prox}_{\lambda f}(x) := \operatorname*{argmin}_{y}\ \Big\{ f(y) + \frac{1}{2\lambda}\,\|y - x\|^2 \Big\}. \tag{3.2}
\]

The Moreau envelope approximates $f$ from below (i.e., $f_\lambda(x) \le f(x)$), and the approximation error is controlled by the penalty parameter $\lambda$. More importantly, if the original function $f$ is $\tau$-weakly convex, then the Moreau envelope $f_\lambda$ is smooth for any $\lambda \in (0, 1/\tau)$, with gradient $\nabla f_\lambda(x) = \lambda^{-1}\big(x - \mathrm{prox}_{\lambda f}(x)\big)$ [43] [44, Theorem 3.4], even when $f$ is nondifferentiable.

The following result quantifies how close $\hat{x} := \mathrm{prox}_{\lambda f}(x)$ is to a critical point of $f$ when $\|\nabla f_\lambda(x)\|$ is small [12]:

\[
\|\hat{x} - x\| = \lambda\,\|\nabla f_\lambda(x)\|, \qquad f(\hat{x}) \le f(x), \qquad \mathrm{dist}\big(0,\, \partial f(\hat{x})\big) \le \|\nabla f_\lambda(x)\|, \tag{3.3}
\]

where $\hat{x} = \mathrm{prox}_{\lambda f}(x)$. Intuitively, (3.3) implies that if $\|\nabla f_\lambda(x)\|$ is small, then $x$ is close to $\hat{x}$, which is nearly stationary since $\mathrm{dist}(0, \partial f(\hat{x}))$ is small.
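For intuition, the surrogate measure $\|\nabla f_\lambda(x)\|$ can be evaluated numerically; here is a sketch (ours) for the one-dimensional example $f(x) = |x^2 - 1|$ with $\tau = 2$ and $\lambda = 1/(2\tau)$, computing the proximal point by grid search:

```python
import numpy as np

f = lambda z: np.abs(z**2 - 1)
tau = 2.0
lam = 1.0 / (2 * tau)                  # lambda < 1/tau, so f_lambda is smooth
zs = np.linspace(-3, 3, 200001)        # grid used to evaluate the prox numerically

def moreau_grad(x):
    # prox_{lam f}(x) = argmin_z f(z) + ||z - x||^2 / (2 lam);
    # grad f_lam(x) = (x - prox) / lam, cf. (3.3).
    prox = zs[np.argmin(f(zs) + (zs - x) ** 2 / (2 * lam))]
    return (x - prox) / lam

for x in (0.0, 0.5, 0.9, 1.0, 2.0):
    print(x, abs(moreau_grad(x)))      # vanishes at the stationary points 0, +1, -1
```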

In this section, we utilize the gradient of the Moreau envelope of the weakly convex function $f$ in (1.1) as a surrogate optimality measure to analyze the convergence rate of the three incremental methods. Throughout, we fix $\lambda := \frac{1}{2\tau}$ (so that $\frac{1}{2\lambda} = \tau$), and for any iterate $x_k$ we denote by

\[
f_\lambda(x_k) = \min_{x}\ \big\{ f(x) + \tau\,\|x - x_k\|^2 \big\}, \tag{3.4}
\]
\[
\hat{x}_k := \mathrm{prox}_{\lambda f}(x_k) = \operatorname*{argmin}_{x}\ \big\{ f(x) + \tau\,\|x - x_k\|^2 \big\}. \tag{3.5}
\]

3.2 Sublinear convergence result

Before presenting our first convergence result, we rewrite the proximal updates (1.3) and (1.6) in forms similar to the subgradient update (1.2).

Lemma 1.

For all $k$ and $i = 1, \dots, m$, we have:

  1. for the incremental proximal point method (1.3), there exists a subgradient $g_{k,i} \in \partial f_i(x_{k,i})$ such that

\[
x_{k,i} = x_{k,i-1} - \mu_k\, g_{k,i}; \tag{3.6}
\]

  2. for the incremental prox-linear method (1.6), there exists a subgradient $g_{k,i} \in \partial_x f_i(x_{k,i};\, x_{k,i-1})$ such that

\[
x_{k,i} = x_{k,i-1} - \mu_k\, g_{k,i}, \tag{3.7}
\]

where $f_i(\cdot;\, x_{k,i-1})$ is the local convex relaxation of $f_i$ at $x_{k,i-1}$; see (1.5).

This result can be proved using [42, Exercise 8.8] and the first-order optimality condition of the proximal subproblem. The relations (3.6) and (3.7) are commonly utilized in the analysis of proximal-type algorithms [9, 45]. It is interesting to see that (3.6) and (3.7) are very similar to the update of incremental (sub)-gradient descent. The only difference is that the subgradients in (3.6) and (3.7) are evaluated at the new point $x_{k,i}$, while in incremental (sub)-gradient descent the subgradient is evaluated at $x_{k,i-1}$. This observation indicates that we can follow a unified proof strategy for all three incremental methods with slight modifications. Therefore, we present the results for all three incremental methods together, in Theorems 1 and 2.
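Relation (3.6) is easy to verify numerically for a smooth component, where the subdifferential reduces to the gradient; a quick check (ours) with a quadratic $f_i$:

```python
import numpy as np
from scipy.optimize import minimize

# Component f_i(z) = ||z - b||^2 / 2, with gradient z - b.
b = np.array([1.0, -2.0])
f_i = lambda z: 0.5 * np.sum((z - b) ** 2)
x_prev, mu = np.array([0.0, 0.0]), 0.3

# Proximal step (1.3).
x_new = minimize(lambda z: f_i(z) + np.sum((z - x_prev) ** 2) / (2 * mu), x_prev).x

# (3.6): x_new should equal x_prev - mu * grad f_i(x_new); both print the same.
print(x_new, x_prev - mu * (x_new - b))
```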

In the following theorem, we present the global sublinear convergence rate of the three incremental methods for solving general weakly convex optimization problems444The sublinear convergence result in this section remains valid when (1.1) is further constrained to a closed convex set.. The definition of the weak convexity parameter $\tau$ and the upper bound $L$ on the subgradient norms over a bounded subset can be found in Assumptions 1 and 2.

Theorem 1.

Suppose that Assumption 1 is valid for incremental (sub)-gradient descent (1.2) and the incremental proximal point algorithm (1.3), and that Assumption 2 is valid for the incremental prox-linear method (1.6), respectively. Suppose further that the stepsize is constant, $\mu_k = \mu := \frac{c}{\sqrt{K+1}}$ for some $c > 0$ and all $k \le K$, where the integer $K+1$ is the total number of iterations (cycles). Then, with $\lambda = \frac{1}{2\tau}$, if the sequence $\{x_k\}$ is generated by one of the three incremental methods for solving (1.1), we have

\[
\min_{0 \le k \le K}\ \big\| \nabla f_\lambda(x_k) \big\|^2 \ \le\ \frac{1}{\sqrt{K+1}} \left( \frac{4\,\big( f_\lambda(x_0) - f^* \big)}{c} + 8\,\tau m^2 L^2 c\,(1 + \tau c) \right), \tag{3.8}
\]

where $f_\lambda$ is defined in (3.4).

Proof of Theorem 1.

The proofs for the incremental proximal point and incremental prox-linear methods are very similar to that for incremental (sub)-gradient descent. Thus, we first present the complete proof for incremental (sub)-gradient descent and then point out the necessary modifications for the other two incremental proximal methods.

Part I: Proof for incremental (sub)-gradient descent. Let $\{x_{k,i}\}$ be generated by incremental (sub)-gradient descent. Expanding the update (1.2) around $\hat{x}_k$ defined in (3.5), we have, for each $i = 1, \dots, m$, the following inequality

\[
\|x_{k,i} - \hat{x}_k\|^2 \ \le\ \|x_{k,i-1} - \hat{x}_k\|^2 - 2\mu_k \big\langle g_{k,i-1},\, x_{k,i-1} - \hat{x}_k \big\rangle + \mu_k^2\, \|g_{k,i-1}\|^2. \tag{3.9}
\]

Summing up the inequality (3.9) for $i$ from $1$ to $m$ gives

\[
\|x_{k,m} - \hat{x}_k\|^2 \ \le\ \|x_{k,0} - \hat{x}_k\|^2 - 2\mu_k \sum_{i=1}^m \big\langle g_{k,i-1},\, x_{k,i-1} - \hat{x}_k \big\rangle + m\,\mu_k^2 L^2, \tag{3.10}
\]

where in the last inequality we have used Assumption 1 (A1).

Due to $x_{k,0} = x_k$ and $x_{k,m} = x_{k+1}$, (3.10) reduces to

\[
\|x_{k+1} - \hat{x}_k\|^2 \ \le\ \|x_k - \hat{x}_k\|^2 - 2\mu_k \sum_{i=1}^m \big\langle g_{k,i-1},\, x_{k,i-1} - \hat{x}_k \big\rangle + m\,\mu_k^2 L^2. \tag{3.11}
\]

Since by Assumption 1 (A2) each component function $f_i$ is weakly convex with parameter $\tau_i$, it follows from (2.2) that

\[
\big\langle g_{k,i-1},\, x_{k,i-1} - \hat{x}_k \big\rangle \ \ge\ f_i(x_{k,i-1}) - f_i(\hat{x}_k) - \frac{\tau_i}{2}\, \|x_{k,i-1} - \hat{x}_k\|^2,
\]

which together with (3.11) gives

\[
\|x_{k+1} - \hat{x}_k\|^2 \ \le\ \|x_k - \hat{x}_k\|^2 - 2\mu_k \sum_{i=1}^m \big[ f_i(x_{k,i-1}) - f_i(\hat{x}_k) \big] + \mu_k \sum_{i=1}^m \tau_i\, \|x_{k,i-1} - \hat{x}_k\|^2 + m\,\mu_k^2 L^2. \tag{3.12}
\]

We now provide a lower bound for $\sum_{i=1}^m [f_i(x_{k,i-1}) - f_i(\hat{x}_k)]$ and an upper bound for $\sum_{i=1}^m \tau_i \|x_{k,i-1} - \hat{x}_k\|^2$. For the former, we have

\[
\sum_{i=1}^m \big[ f_i(x_{k,i-1}) - f_i(\hat{x}_k) \big] \ \ge\ \sum_{i=1}^m \big[ f_i(x_k) - f_i(\hat{x}_k) \big] - L \sum_{i=1}^m \|x_{k,i-1} - x_k\| \ =\ f(x_k) - f(\hat{x}_k) - L \sum_{i=1}^m \|x_{k,i-1} - x_k\|, \tag{3.13}
\]

where the inequality utilizes Assumption 1 (A1) and [42, Theorem 9.1], which together imply that each $f_i$ is Lipschitz continuous with parameter $L$ on bounded sets.

According to the update of incremental (sub)-gradient descent, we have $x_{k,i-1} = x_k - \mu_k \sum_{j=1}^{i-1} g_{k,j-1}$, which together with Assumption 1 (A1) yields

\[
\|x_{k,i-1} - x_k\| \ \le\ (i-1)\,\mu_k L \ \le\ m\,\mu_k L. \tag{3.14}
\]

Plugging (3.14) into (3.13) and using $\sum_{i=1}^m (i-1) \le m^2/2$ gives

\[
\sum_{i=1}^m \big[ f_i(x_{k,i-1}) - f_i(\hat{x}_k) \big] \ \ge\ f(x_k) - f(\hat{x}_k) - \frac{m^2}{2}\,\mu_k L^2. \tag{3.15}
\]

Similarly,

\[
\|x_{k,i-1} - \hat{x}_k\|^2 \ \le\ \big( \|x_k - \hat{x}_k\| + \|x_{k,i-1} - x_k\| \big)^2 \ \le\ 2\,\|x_k - \hat{x}_k\|^2 + 2\,\|x_{k,i-1} - x_k\|^2,
\]

where the first inequality is the triangle inequality. Upper bounding $\|x_{k,i-1} - x_k\|$ via (3.14), one can see that

\[
\sum_{i=1}^m \tau_i\, \|x_{k,i-1} - \hat{x}_k\|^2 \ \le\ 2\tau\, \|x_k - \hat{x}_k\|^2 + 2\tau\, m^2 \mu_k^2 L^2. \tag{3.16}
\]

We substitute (3.15) and (3.16) into (3.12) to obtain

\[
\|x_{k+1} - \hat{x}_k\|^2 \ \le\ (1 + 2\tau\mu_k)\, \|x_k - \hat{x}_k\|^2 - 2\mu_k \big( f(x_k) - f(\hat{x}_k) \big) + 2 m^2 \mu_k^2 L^2\, (1 + \tau \mu_k). \tag{3.17}
\]

Note that

\[
f_\lambda(x_{k+1}) \ \le\ f(\hat{x}_k) + \tau\, \|\hat{x}_k - x_{k+1}\|^2,
\]

where the inequality follows directly from the definition (3.1) of the Moreau envelope of $f$ at $x_{k+1}$ (recall $\frac{1}{2\lambda} = \tau$). Moreover, $f(\hat{x}_k) + \tau \|x_k - \hat{x}_k\|^2 = f_\lambda(x_k)$ by the definition of $\hat{x}_k$, and since $\hat{x}_k$ minimizes the $\tau$-strongly convex function $f(\cdot) + \tau\|\cdot - x_k\|^2$ (cf. [42, Proposition 12.19]), we also have $f(x_k) - f(\hat{x}_k) \ge \frac{3\tau}{2}\, \|x_k - \hat{x}_k\|^2$. Hence, multiplying (3.17) by $\tau$, plugging in these facts, and rearranging the terms provides the following recursion

\[
f_\lambda(x_{k+1}) \ \le\ f_\lambda(x_k) - \tau^2 \mu_k\, \|x_k - \hat{x}_k\|^2 + 2\tau m^2 \mu_k^2 L^2\, (1 + \tau\mu_k). \tag{3.18}
\]

Summing (3.18) for $k$ from $0$ to $K$ with $\mu_k = \mu$, recalling from (3.3) that $\|x_k - \hat{x}_k\| = \lambda \|\nabla f_\lambda(x_k)\| = \frac{1}{2\tau}\, \|\nabla f_\lambda(x_k)\|$, and using $f_\lambda(x_{K+1}) \ge f^*$, we obtain

\[
\frac{\mu}{4} \sum_{k=0}^{K} \big\| \nabla f_\lambda(x_k) \big\|^2 \ \le\ f_\lambda(x_0) - f^* + 2\tau m^2 L^2 (K+1)\, \mu^2\, (1 + \tau\mu). \tag{3.19}
\]

Finally, dividing both sides of (3.19) by $\frac{(K+1)\mu}{4}$, lower bounding the average by the minimum, and substituting $\mu = \frac{c}{\sqrt{K+1}}$ (so that $1 + \tau\mu \le 1 + \tau c$) yields (3.8).