Lower Bounds for Smooth Nonconvex Finite-Sum Optimization

01/31/2019 ∙ by Dongruo Zhou, et al.

Smooth finite-sum optimization has been widely studied in both convex and nonconvex settings. However, existing lower bounds for finite-sum optimization are mostly limited to the setting where each component function is (strongly) convex, while the lower bounds for nonconvex finite-sum optimization remain largely open. In this paper, we study the lower bounds for smooth nonconvex finite-sum optimization, where the objective function is the average of n nonconvex component functions. We prove tight lower bounds for the complexity of finding an ϵ-suboptimal point and an ϵ-approximate stationary point in different settings, for a wide regime of the smallest eigenvalue of the Hessian of the objective function (or of each component function). Given our lower bounds, we can show that existing algorithms including KatyushaX (Allen-Zhu, 2018), Natasha (Allen-Zhu, 2017), RapGrad (Lan and Yang, 2018) and StagewiseKatyusha (Chen and Yang, 2018) have achieved optimal Incremental First-order Oracle (IFO) complexity (i.e., number of IFO calls) up to logarithmic factors for nonconvex finite-sum optimization. We also point out potential ways to further improve these complexity results, either by making stronger assumptions or through a different convergence analysis.


1 Introduction

We consider minimizing the following unconstrained finite-sum optimization problem:

\min_{x \in \mathbb{R}^d} F(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x),   (1)

where each f_i is a smooth and nonconvex function. We are interested in the performance of first-order algorithms for solving (1) that access the objective through the Incremental First-order Oracle (IFO) (Agarwal and Bottou, 2015), defined as follows: given a point x and an index i ∈ [n], the IFO returns the pair (f_i(x), ∇f_i(x)).
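
As a concrete illustration of the oracle model, the minimal sketch below (our own, not code from the paper; the class name QuadraticIFO and its interface are assumptions for illustration) implements an IFO for a finite sum of quadratic components and counts oracle calls, which is exactly the quantity the complexity bounds below measure.

```python
import numpy as np

class QuadraticIFO:
    """Incremental First-order Oracle for f_i(x) = 0.5 * x^T A_i x - b_i^T x.

    A call with (index i, point x) returns (f_i(x), grad f_i(x)) and is
    counted as one IFO call, the unit used in the complexity bounds.
    """

    def __init__(self, A_list, b_list):
        self.A = A_list          # list of n symmetric d x d matrices
        self.b = b_list          # list of n vectors of length d
        self.n = len(A_list)
        self.calls = 0           # number of IFO calls made so far

    def query(self, i, x):
        self.calls += 1
        value = 0.5 * x @ self.A[i] @ x - self.b[i] @ x
        grad = self.A[i] @ x - self.b[i]
        return value, grad

    def full_objective(self, x):
        # F(x) = (1/n) sum_i f_i(x); evaluating it costs n IFO calls.
        vals = [self.query(i, x)[0] for i in range(self.n)]
        return sum(vals) / self.n
```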

In this paper, we consider the very general setting where the objective function F is (σ, L)-smooth (Allen-Zhu, 2017b), i.e., there exist constants σ and L > 0 such that for any x, y,

\frac{\sigma}{2}\|x - y\|_2^2 \le F(x) - F(y) - \langle \nabla F(y), x - y \rangle \le \frac{L}{2}\|x - y\|_2^2,   (2)

where σ is the lower smoothness parameter (we allow σ to be nonnegative, which covers the definitions of convex and strongly convex functions) and L is the upper smoothness parameter. Note that the conventional L-smoothness definition is a special case of (2) with σ = −L. The setting described by (2) is quite general: with different choices of σ, (1) and (2) together cover various kinds of smooth finite-sum optimization problems. For example, when σ ≥ 0, F is a convex function, and F is σ-strongly convex if σ > 0. Such a sum-of-nonconvex optimization problem (a convex function that is the average of nonconvex ones) was originally identified in Shalev-Shwartz (2015), and is widely used in various machine learning problems such as principal component analysis (PCA) (Garber et al., 2016; Allen-Zhu and Li, 2016). When σ ≥ 0, our goal is to find an ε-suboptimal solution (Woodworth and Srebro, 2016) to (1), which satisfies

F(x) - \min_{y \in \mathbb{R}^d} F(y) \le \epsilon.   (3)

On the other hand, when σ < 0, F is nonconvex, and it is called |σ|-almost convex (Carmon et al., 2018) (it is also known as |σ|-weakly convex (Chen and Yang, 2018) or |σ|-bounded nonconvex (Allen-Zhu, 2017b)). It is known that finding an ε-suboptimal solution in such a nonconvex setting is NP-hard (Murty and Kabadi, 1987). Thus, our goal is instead to find an ε-approximate stationary point of F in the general nonconvex case, which is defined as follows:

\|\nabla F(x)\|_2 \le \epsilon.   (4)
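
To make the sum-of-nonconvex regime above concrete, the toy construction below (our own illustration, not from the paper) builds component quadratics whose Hessians typically have negative eigenvalues, while their average is exactly the identity, so each f_i is nonconvex yet F is strongly convex.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 5

# Zero-mean symmetric perturbations: each can make a component indefinite,
# but they cancel exactly in the average.
S = [rng.standard_normal((d, d)) for _ in range(n)]
S = [3.0 * (M + M.T) / 2.0 for M in S]
S_mean = sum(S) / n
S = [M - S_mean for M in S]             # now sum_i S_i = 0

A = [np.eye(d) + M for M in S]          # Hessian of f_i(x) = 0.5 x^T A_i x
A_avg = sum(A) / n                      # Hessian of F; equals the identity

print("smallest eigenvalue of each component Hessian:")
print([round(np.linalg.eigvalsh(Ai)[0], 2) for Ai in A])   # typically negative
print("smallest eigenvalue of the average Hessian:",
      np.linalg.eigvalsh(A_avg)[0])     # = 1, so F is 1-strongly convex
```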

There is a vast literature on achieving either (3) or (4) for problem (1), such as SDCA without Duality (Shalev-Shwartz, 2016), Natasha (Allen-Zhu, 2017b), KatyushaX (Allen-Zhu, 2018), RapGrad (Lan and Yang, 2018), StagewiseKatyusha (Chen and Yang, 2018) and RepeatSVRG (Agarwal et al., 2017; Carmon et al., 2018), to mention a few. Specifically, this line of work can be divided into two categories based on the smoothness assumption on the component functions. The first category of work (Shalev-Shwartz, 2016; Allen-Zhu, 2017b, 2018; Agarwal et al., 2017; Carmon et al., 2018) assumes that each individual component function f_i is L-smooth and that F is (σ, L)-smooth. Under such an assumption, when F is convex or strongly convex, SDCA without Duality and KatyushaX can find an ε-suboptimal solution, and when F is almost convex, Natasha and RepeatSVRG can find an ε-approximate stationary point; the corresponding IFO complexities are summarized in Table 1.

The second category of work (Allen-Zhu, 2017b, 2018; Lan and Yang, 2018; Chen and Yang, 2018) assumes that each f_i is (σ, L)-smooth (in fact, Allen-Zhu (2017b, 2018) falls into both categories). With such an assumption, RapGrad and StagewiseKatyusha find an ε-approximate stationary point; the corresponding IFO complexities are summarized in Table 2.

Given the above IFO complexity results, a natural research question is:

Are these upper bounds of IFO complexity already optimal?

We answer this question affirmatively by proving lower bounds on the IFO complexity for a wide regime of the lower smoothness parameter σ, using carefully constructed functions. More specifically, our contributions are summarized as follows:

  1. For the case where F is convex or strongly convex (a.k.a. sum-of-nonconvex optimization), we show that without the convexity assumption on each component function f_i, the lower bound on the IFO complexity of any linear-span randomized first-order algorithm (see Definition 3) for finding an ε-suboptimal solution scales as n^{3/4} in the number of component functions, both when F is strongly convex and when F is convex, where the bounds are stated in terms of the average smoothness parameter of {f_i} (see Definition 3). This is in contrast to the n^{1/2} dependence in the corresponding lower bounds proved by Woodworth and Srebro (2016) for the case where each component function is convex.

  2. For the case where F is almost convex, we show lower bounds on the IFO complexity of any linear-span randomized first-order algorithm for finding an ε-approximate stationary point, under the assumption that {f_i} is average smooth and, separately, under the stronger assumption that each f_i is (σ, L)-smooth. To the best of our knowledge, this is the first lower bound result that precisely characterizes the dependence on the lower smoothness parameter for finding an approximate stationary point.

  3. We show that many existing algorithms including SDCA without Duality (Shalev-Shwartz, 2016), Natasha (Allen-Zhu, 2017b), KatyushaX (Allen-Zhu, 2018), RapGrad (Lan and Yang, 2018), StagewiseKatyusha (Chen and Yang, 2018) and RepeatSVRG (Agarwal et al., 2017; Carmon et al., 2018) have indeed achieved optimal IFO complexity for a large regime of the lower smoothness parameter, with slight modification of their original convergence analyses.

Notation. We define [n] = {1, ..., n}. We write a_n = O(b_n) if a_n ≤ C b_n for a universal constant C, and we use Õ(·) to hide polynomial logarithmic factors. For any vector x, we use x_i to denote the i-th coordinate of x and ‖x‖_2 to denote its 2-norm. For a sequence of vectors, we use a superscript to index its elements, i.e., x^(k) denotes the k-th vector; analogous notation is used for matrix sequences. For any sets A and B, we define the distance between them as dist(A, B) = inf_{a ∈ A, b ∈ B} ‖a − b‖_2. For a collection of vectors, we denote by span{·} the linear space they span. In the rest of this paper, we use equivalent notations interchangeably when there is no confusion.

2 Additional Related Work

In this section, we review additional related work that is not discussed in the introduction section.

Existing lower bounds for nonconvex optimization: To the best of our knowledge, the only existing lower bounds for nonconvex optimization are proved in Carmon et al. (2017a, b); Fang et al. (2018). Carmon et al. (2017a, b) proved lower bounds for both deterministic and randomized algorithms for nonconvex optimization under high-order smoothness assumptions. However, they did not consider the finite-sum structure, which brings additional dependence on the lower smoothness parameter σ and the number of component functions n. Fang et al. (2018) proved a lower bound for nonconvex finite-sum optimization under the conventional smoothness assumption, i.e., σ = −L. Our work extends this line of research and proves matching lower bounds for nonconvex finite-sum optimization (and sum-of-nonconvex optimization) under the refined (σ, L)-smoothness assumption.

Existing upper bounds for first-order convex optimization: There exists a large body of work on establishing upper complexity bounds for finding an ε-suboptimal solution to convex finite-sum optimization problems. It is well known that, by treating F as a whole, gradient descent achieves an IFO complexity of O(nL/ε) for convex functions and O(n(L/μ) log(1/ε)) for μ-strongly convex functions, and accelerated gradient descent (AGD) (Nesterov, 1983) achieves O(n√(L/ε)) for convex functions and O(n√(L/μ) log(1/ε)) for μ-strongly convex functions. Both IFO complexities achieved by AGD are optimal when n = 1 (Nesterov, 1983). By using the variance reduction technique (Roux et al., 2012; Johnson and Zhang, 2013; Xiao and Zhang, 2014; Defazio et al., 2014; Mairal, 2015; Bietti and Mairal, 2017), the IFO complexity can be improved to O((n + L/μ) log(1/ε)) for strongly convex functions. By combining variance reduction with Nesterov's acceleration technique (Nesterov, 1983), the IFO complexity can be further reduced to Õ(n + √(nL/ε)) for convex functions and O((n + √(nL/μ)) log(1/ε)) for μ-strongly convex functions (Allen-Zhu, 2017a), which matches the lower bounds up to logarithmic factors.

Existing lower bounds for first-order convex optimization: For deterministic first-order algorithms, classical lower bounds on the number of IFO calls needed to find an ε-suboptimal solution are known for both convex and μ-strongly convex functions. There is a line of work (Woodworth and Srebro, 2016; Lan and Zhou; Agarwal and Bottou, 2015; Arjevani and Shamir, 2016) establishing lower bounds for first-order algorithms to find an ε-suboptimal solution to convex finite-sum optimization. More specifically, Agarwal and Bottou (2015) proved a lower bound for strongly convex finite-sum optimization problems that is valid for deterministic algorithms. Arjevani and Shamir (2016) provided a dimension-free lower bound for first-order algorithms under the assumption that any new iterate generated by the algorithm lies in the linear span of the gradients and iterates obtained up to the current iteration. Lan and Zhou proved a lower bound for a class of randomized first-order algorithms in which each component function is selected with fixed probabilities.

Woodworth and Srebro (2016) proved a set of lower bounds, scaling as √n in the number of components, for both convex and μ-strongly convex functions. Moreover, their results do not require the assumption that each new iterate lies in the span of all previous iterates and oracle outputs, and are therefore more general.

For more details on the upper bound and lower bound results, please refer to Tables 1 and 2.

             σ > 0 (strongly convex) | σ = 0 (convex)    | σ < 0 (almost convex)
Upper Bounds | (Allen-Zhu, 2018)      | (Allen-Zhu, 2018) | (Allen-Zhu, 2017b; Fang et al., 2018)
Lower Bounds | (Theorem 4.1)          | (Theorem 4.1)     | (Theorem 4.2)
Table 1: IFO complexity comparison under the assumption that {f_i} is average smooth and F is (σ, L)-smooth. When σ > 0 or σ = 0, the goal is to find an ε-suboptimal solution; when σ < 0, the goal is to find an ε-approximate stationary point.
             σ > 0 (strongly convex)      | σ = 0 (convex)               | σ < 0 (almost convex)
Upper Bounds | (Allen-Zhu, 2017a)           | (Allen-Zhu, 2017a)           | (Lan and Yang, 2018; Fang et al., 2018)
Lower Bounds | (Woodworth and Srebro, 2016) | (Woodworth and Srebro, 2016) | (Theorem 4.2)
Table 2: IFO complexity comparison under the assumption that each f_i is (σ, L)-smooth. When σ > 0 or σ = 0, the goal is to find an ε-suboptimal solution; when σ < 0, the goal is to find an ε-approximate stationary point.

3 Preliminaries

We first present the formal definitions of (σ, L)-smoothness and average smoothness, which will be used throughout the proofs. For any differentiable function f, we say f is (σ, L)-smooth for some σ and L > 0 if for any x, y, it holds that

\frac{\sigma}{2}\|x - y\|_2^2 \le f(x) - f(y) - \langle \nabla f(y), x - y \rangle \le \frac{L}{2}\|x - y\|_2^2.

We denote this function class by F(σ, L). In particular, we say f is L-smooth if f ∈ F(−L, L). Note that if f is twice differentiable, then f ∈ F(σ, L) if and only if σ I ⪯ ∇²f(x) ⪯ L I for any x.
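
For twice-differentiable functions, class membership can thus be read off the Hessian spectrum. The helper below is an illustrative sketch of ours (not code from the paper): for a quadratic it returns the tightest (σ, L) pair as the extreme eigenvalues of its Hessian.

```python
import numpy as np

def smoothness_parameters(hessian):
    """Return (sigma, L) such that sigma*I <= H <= L*I for a symmetric H.

    sigma is the smallest and L the largest eigenvalue; sigma < 0 indicates
    a nonconvex (almost convex) function, sigma >= 0 a convex one.
    """
    eigs = np.linalg.eigvalsh(hessian)
    return eigs[0], eigs[-1]

# Example: an indefinite quadratic f(x) = 0.5 x^T H x.
H = np.diag([2.0, 0.5, -0.3])
sigma, L = smoothness_parameters(H)
print(sigma, L)   # -0.3 and 2.0: f is 0.3-almost convex with upper smoothness 2.0
```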

For any differentiable functions f_1, ..., f_n, we say {f_i}_{i=1}^n is L̄-average smooth for some L̄ > 0 if for any x, y,

\sqrt{\frac{1}{n}\sum_{i=1}^{n}\|\nabla f_i(x) - \nabla f_i(y)\|_2^2} \le \bar{L}\,\|x - y\|_2.

We denote this function class by F_avg(L̄). It is worth noting that if the component functions satisfy f_i ∈ F(σ, L) for each i ∈ [n], then {f_i}_{i=1}^n is L-average smooth.
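
As a rough numerical illustration (our own sketch, not from the paper), the snippet below estimates the average-smoothness quotient for random quadratic components and compares it with the largest individual smoothness constant; the former can be noticeably smaller, which is why average smoothness is the weaker assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 10

# Component functions f_i(x) = 0.5 x^T A_i x with symmetric A_i,
# so grad f_i(x) - grad f_i(y) = A_i (x - y).
A = []
for _ in range(n):
    B = rng.standard_normal((d, d))
    A.append((B + B.T) / 2.0)

# Individual smoothness constants: spectral norms of the Hessians.
L_individual = max(np.linalg.norm(Ai, 2) for Ai in A)

# Empirical average-smoothness quotient over random directions:
# sqrt( (1/n) sum_i ||A_i u||^2 ) / ||u||.
quotients = []
for _ in range(1000):
    u = rng.standard_normal(d)
    num = np.sqrt(np.mean([np.linalg.norm(Ai @ u) ** 2 for Ai in A]))
    quotients.append(num / np.linalg.norm(u))

print("max individual smoothness L :", round(L_individual, 3))
print("estimated average smoothness:", round(max(quotients), 3))
```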

In this work, we focus on the linear-span randomized first-order algorithm, which is defined as follows:

Given an initial point x^(0), a linear-span randomized first-order algorithm A is defined as a measurable mapping from functions to an infinite sequence of point and index pairs {(x^(k), i_k)}, possibly depending on a random variable, which satisfies

x^{(k)} \in \mathrm{span}\big\{x^{(0)}, \ldots, x^{(k-1)}, \nabla f_{i_0}(x^{(0)}), \ldots, \nabla f_{i_{k-1}}(x^{(k-1)})\big\} \quad \text{for all } k \ge 1.

It can be easily checked that most first-order primal finite-sum optimization algorithms, such as SAG (Roux et al., 2012), SVRG (Johnson and Zhang, 2013), SAGA (Defazio et al., 2014), Katyusha (Allen-Zhu, 2017a) and KatyushaX (Allen-Zhu, 2018), are linear-span randomized first-order algorithms.
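
As an illustration of the definition (a sketch under our own simplifications, reusing the hypothetical QuadraticIFO interface from above, not pseudocode from the paper), a plain stochastic gradient method on the finite sum is a linear-span randomized first-order algorithm: each new iterate is an explicit linear combination of the previous iterate and one sampled component gradient.

```python
import numpy as np

def sgd_linear_span(ifo, x0, step_size=0.1, num_iters=100, seed=0):
    """Minimal SGD loop on F(x) = (1/n) sum_i f_i(x) via an IFO.

    Each new iterate x_{k+1} = x_k - eta * grad f_{i_k}(x_k) lies in the
    span of the previous iterates and the queried component gradients,
    so the method satisfies the linear-span condition.
    """
    rng = np.random.default_rng(seed)
    x = x0.copy()
    iterates = [x0.copy()]
    for _ in range(num_iters):
        i = rng.integers(ifo.n)          # randomized index choice
        _, grad = ifo.query(i, x)        # one IFO call
        x = x - step_size * grad         # stays in the linear span
        iterates.append(x.copy())
    return x, iterates
```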

In this work, we prove the lower bounds by constructing adversarial functions which are "hard enough" for any linear-span randomized first-order algorithm. To demonstrate the construction of the adversarial functions, we first introduce the following quadratic function, which comes from Nesterov (2013). Let the quadratic function q be defined as:

In our construction, we need the following two important properties of q. For any admissible choice of its parameters, the following properties hold:

  1. q is convex and L-smooth.

  2. Suppose that U is an orthogonal matrix and that the current point x lies in the span of the first k columns of U. Then one IFO call at x to the composite function x ↦ q(U^⊤ x) reveals directions only within the span of the first k + 1 columns of U.

In short, the first property says that q is a convex function with L-smoothness, and the second property says that for any orthogonal matrix U, the composite function x ↦ q(U^⊤ x) enjoys the so-called zero-chain property (Carmon et al., 2017a): if the current point is x, then the information brought by an IFO call at the current point can increase the dimension of the linear space to which the iterates belong by at most one. This property is crucial for the proof of the lower bounds.
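
To see the zero-chain property in action, the snippet below (our own illustration, using the standard tridiagonal worst-case quadratic rather than the exact construction in the paper) starts from the origin and checks that each gradient evaluation activates at most one new coordinate of the chain.

```python
import numpy as np

T = 10  # dimension of the chain

# Standard worst-case quadratic: q(x) = 0.5 * x^T A x - x_1, where A is the
# tridiagonal matrix with 2 on the diagonal and -1 off the diagonal.
A = 2.0 * np.eye(T) - np.eye(T, k=1) - np.eye(T, k=-1)

def grad_q(x):
    return A @ x - np.eye(T)[0]   # gradient A x - e_1

x = np.zeros(T)
support_sizes = []
for _ in range(T):
    g = grad_q(x)
    # A gradient step can only activate the next coordinate in the chain.
    x = x - 0.25 * g
    support_sizes.append(int(np.count_nonzero(np.abs(x) > 1e-12)))

print(support_sizes)   # grows by one per oracle call: [1, 2, 3, ..., 10]
```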

Based on Definition 3, one can define the following three function classes: two from Nesterov (2013) and one from Carmon et al. (2017b). We first introduce a class of strongly convex functions, which is originally defined in Nesterov (2013). (Nesterov, 2013) Define

(5)

For , we have the following properties. [Chapter 2.1.4, Nesterov (2013)] For any , let , it holds that

  1. .

  2. .

  3. For any satisfying , we have

Next we introduce a class of general convex functions, which is also defined in Nesterov (2013). (Nesterov, 2013) Define

(6)

We have the following properties about . [Chapter 2.1.2, Nesterov (2013)] We have

  1. .

  2. Let be the optimal solution set, we have .

  3. For any which satisfies that , we have .

The above two function classes will be used to prove the lower bounds for convex optimization. Finally, we introduce a third function class, which was originally proposed in Carmon et al. (2017b) and which we will use to prove the lower bounds for nonconvex optimization. Define

where is defined as

We have the following properties about . [Lemmas 2, 3, 4, Carmon et al. (2017b)] For any , it holds that

  1. and .

  2. .

  3. For which satisfies that , we have .

4 Main Results

In this section we present our lower bound results. We start with the sum-of-nonconvex optimization setting (where F itself is convex), and then move on to the general nonconvex finite-sum optimization setting.

4.1 F is Convex – Finding an ε-Suboptimal Solution

We first show the lower bounds for problem (1) with σ ≥ 0, where the goal is to find an ε-suboptimal solution, and we begin with the case where F is strongly convex. For any linear-span randomized first-order algorithm A and any admissible choice of the problem parameters, there exist a dimension d and functions f_1, ..., f_n such that {f_i} is average smooth and F is strongly convex, and such that, in order to find a point x with F(x) − min_y F(y) ≤ ε, A needs at least

(7)

IFO calls. Next we show the result when F is convex. For any linear-span randomized first-order algorithm A and any admissible choice of the problem parameters, there exist a dimension d and functions f_1, ..., f_n such that {f_i} is average smooth and F is convex, and such that, in order to find a point x with F(x) − min_y F(y) ≤ ε, A needs at least

(8)

IFO calls. Our lower bounds (7) and (8) are tight, because they are matched, up to logarithmic factors, by SDCA without Duality (Shalev-Shwartz, 2016) in the strongly convex case and by KatyushaX (Allen-Zhu, 2018) in both the strongly convex and convex cases. It is interesting to compare (7) and (8) with the corresponding lower bounds for convex finite-sum optimization in Woodworth and Srebro (2016), which cover strongly convex and convex functions. The dependence on n is n^{3/4} in the sum-of-nonconvex case, as opposed to n^{1/2} in the setting where each component is (strongly) convex. This suggests a fundamental gap. This gap was first observed by Shalev-Shwartz (2016) from the perspective of upper bounds; our lower bound results suggest that it cannot be removed.
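
As a rough back-of-the-envelope comparison (our own illustration, assuming the leading terms take the familiar forms n^{3/4}\sqrt{\kappa} and \sqrt{n\kappa} with condition number \kappa = \bar{L}/\mu):

\[
\frac{n^{3/4}\sqrt{\kappa}}{\sqrt{n\kappa}} \;=\; n^{1/4},
\]

so, for a fixed condition number, the sum-of-nonconvex lower bound exceeds the individually-convex finite-sum lower bound by a factor that grows as n^{1/4} with the number of components.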

4.2 F is Nonconvex – Finding an ε-Approximate Stationary Point

Next we show the lower bounds when F is almost convex; in this case our goal is to find an ε-approximate stationary point. We first present the lower bound under the average smoothness assumption. For any linear-span randomized first-order algorithm A and any admissible choice of the problem parameters, there exist a dimension d and functions f_1, ..., f_n such that {f_i} is average smooth and F is almost convex, and such that, in order to find a point x with ‖∇F(x)‖_2 ≤ ε, A needs at least

(9)

IFO calls. Our lower bound (9) is tight for the following reasons. In one regime of the lower smoothness parameter, (9) is matched, up to a logarithmic factor, by the IFO complexity of RepeatSVRG (Carmon et al., 2018; Agarwal et al., 2017). In the complementary regime, (9) is matched, up to a logarithmic factor, by the IFO complexity of SPIDER (Fang et al., 2018) and SNVRG (Zhou et al., 2018).

Next we show lower bounds under the slightly stronger assumption that each f_i is (σ, L)-smooth. Our result shows that with this stronger assumption, the optimal dependence on n becomes smaller. For any linear-span randomized first-order algorithm A and any admissible choice of the problem parameters, there exist a dimension d and functions f_1, ..., f_n such that each f_i is (σ, L)-smooth, and such that, in order to find a point x with ‖∇F(x)‖_2 ≤ ε, A needs at least

(10)

IFO calls. Our lower bound (10) is tight in one regime of the lower smoothness parameter, where it is matched, up to a logarithmic factor, by the IFO complexity of Natasha (Allen-Zhu, 2017b), RapGrad (Lan and Yang, 2018) and StagewiseKatyusha (Chen and Yang, 2018). Nevertheless, in the complementary regime, (10) does not match the best-known upper bound (Fang et al., 2018), leaving a polynomial gap in one of the parameter dependences. We leave closing this gap as future work.

4.3 Discussion on the Average Smoothness Assumption

Careful readers may have already noticed that in all but the last of our theorems in Section 4, we only assume that {f_i} is average smooth. In other words, the above lower bound results (except the last theorem) hold for component functions that are merely average smooth. Nevertheless, most of the upper bound results achieved by existing finite-sum optimization algorithms (i.e., SDCA without Duality (Shalev-Shwartz, 2016), Natasha (Allen-Zhu, 2017b), KatyushaX (Allen-Zhu, 2018), RapGrad (Lan and Yang, 2018), StagewiseKatyusha (Chen and Yang, 2018) and RepeatSVRG (Agarwal et al., 2017; Carmon et al., 2018)) are proved under the assumption that each f_i is L-smooth, which is stronger than average smoothness; the weaker average smoothness assumption only appears in Zhou et al. (2018) and Fang et al. (2018). Therefore, it is important to verify that these upper bound results still hold under the weaker assumption that {f_i} is average smooth.

To verify this, we need to reconsider the role that the per-component smoothness assumption plays in the convergence analyses of those algorithms. In detail, in the convergence analyses of nonconvex finite-sum optimization algorithms such as SDCA without Duality (Shalev-Shwartz, 2016), Natasha (Allen-Zhu, 2017b) and KatyushaX (Allen-Zhu, 2018), the assumption that each f_i is L-smooth is needed in the following two scenarios. First, it is used to show that F itself is smooth, which can be derived as follows: for any x, y,

\|\nabla F(x) - \nabla F(y)\|_2 \le \frac{1}{n} \sum_{i=1}^{n} \|\nabla f_i(x) - \nabla f_i(y)\|_2 \le \sqrt{\frac{1}{n} \sum_{i=1}^{n} \|\nabla f_i(x) - \nabla f_i(y)\|_2^2} \le \bar{L}\,\|x - y\|_2.   (11)

Second, it is used to upper bound the variance of the semi-stochastic gradient at each iteration, which is an unbiased estimator of the true gradient. More specifically, for an index i drawn uniformly at random, let the semi-stochastic gradient be

v = \nabla f_i(x) - \nabla f_i(\tilde{x}) + \nabla F(\tilde{x}),

where x̃ is the global minimum of F when F is convex, or any snapshot point when F is nonconvex. Then we have

\mathbb{E}_i\,\|v - \nabla F(x)\|_2^2 \le \mathbb{E}_i\,\|\nabla f_i(x) - \nabla f_i(\tilde{x})\|_2^2 \le \bar{L}^2 \|x - \tilde{x}\|_2^2.   (12)

We can see that in both scenarios, the weaker average smoothness assumption is sufficient for (11) and (12) to hold. Thus, we make the following informal statement, which may be regarded as a slight improvement/modification in terms of assumptions over existing algorithms for nonconvex finite-sum optimization: for existing nonconvex finite-sum optimization algorithms including SDCA without Duality (Shalev-Shwartz, 2016), Natasha (Allen-Zhu, 2017b), KatyushaX (Allen-Zhu, 2018), RapGrad (Lan and Yang, 2018), StagewiseKatyusha (Chen and Yang, 2018) and RepeatSVRG (Agarwal et al., 2017; Carmon et al., 2018), we can replace the assumption that each f_i is L-smooth with the assumption that {f_i} is average smooth, without affecting their IFO complexities.
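
As a quick numerical sanity check of (12) (our own sketch, not from the paper), the snippet below draws random quadratic components, forms the semi-stochastic gradients, and verifies that the estimator is unbiased and that its variance is bounded by the squared average smoothness parameter times ‖x − x̃‖².

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 30, 8

# Quadratic components f_i(x) = 0.5 x^T A_i x, so grad f_i(x) = A_i x.
A = []
for _ in range(n):
    B = rng.standard_normal((d, d))
    A.append((B + B.T) / 2.0)

x = rng.standard_normal(d)
x_tilde = rng.standard_normal(d)        # snapshot point

grads_x = np.array([Ai @ x for Ai in A])
grads_tilde = np.array([Ai @ x_tilde for Ai in A])
full_grad_x = grads_x.mean(axis=0)
full_grad_tilde = grads_tilde.mean(axis=0)

# Semi-stochastic gradients v_i = grad f_i(x) - grad f_i(x_tilde) + grad F(x_tilde).
v = grads_x - grads_tilde + full_grad_tilde

# Unbiasedness: the average of v_i over i equals grad F(x).
print("bias norm:", np.linalg.norm(v.mean(axis=0) - full_grad_x))   # ~ 0

# Variance bound (12): E_i ||v_i - grad F(x)||^2 <= Lbar^2 ||x - x_tilde||^2,
# where Lbar^2 is the largest eigenvalue of (1/n) sum_i A_i^2.
variance = np.mean(np.linalg.norm(v - full_grad_x, axis=1) ** 2)
Lbar_sq = np.linalg.eigvalsh(sum(Ai @ Ai for Ai in A) / n)[-1]
print("variance:", variance, "<= bound:", Lbar_sq * np.linalg.norm(x - x_tilde) ** 2)
```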

5 Proof of Main Theorems

In this section, we provide the detailed proofs of the lower bounds presented in Section 4. Due to the space limit, we only provide the proofs of Theorems 4.1 and 4.2, and defer the proofs of the other theorems to the supplementary material.

5.1 Technical Lemmas

Our proofs are based on the following three technical lemmas, whose proofs can be found in the supplementary material.

The first lemma provides an upper bound on the average smoothness parameter of a finite-sum function when each component function is lower and upper smooth: if each component function is (σ, L)-smooth, then {f_i} is average smooth with an explicitly computable parameter. In the proofs we also need to apply scaling transformations to the given functions; the following lemma describes how the problem-dependent quantities change under such a scaling transformation. We also need the following lemmas to guarantee a lower bound for finding an ε-suboptimal solution when F is strongly convex or convex. For any linear-span randomized first-order algorithm A and any admissible choice of the problem parameters, there exist functions f_1, ..., f_n satisfying the stated smoothness and strong convexity assumptions such that, in order to find x with F(x) − min_y F(y) ≤ ε, A needs a prescribed minimum number of IFO calls.

An analogous lemma holds in the convex case: for any linear-span randomized first-order algorithm A and any admissible choice of the problem parameters, there exist functions f_1, ..., f_n satisfying the stated smoothness and convexity assumptions such that, in order to find x with F(x) − min_y F(y) ≤ ε, A needs a prescribed minimum number of IFO calls. We now begin our proof. Without loss of generality, we assume that the initial point is the origin; otherwise we can replace each function f_i(x) with f_i(x + x^(0)).

5.2 Proofs for: F is Convex

Proof of Theorem 4.1.

We choose the component functions f_1, ..., f_n as follows: