Towards Unified Acceleration of High-Order Algorithms under Hölder Continuity and Uniform Convexity

06/03/2019 ∙ by Chaobing Song, et al. ∙ Tsinghua University berkeley college 0

In this paper, through a very intuitive vanilla proximal method perspective, we derive accelerated high-order optimization algorithms for minimizing a convex function that has Hölder continuous derivatives. In this general convex setting, we propose a unified acceleration algorithm with an iteration complexity that matches the lower iteration complexity bound given in [GN19]. If the function is further uniformly convex, we propose a general restart scheme. The iteration complexity of the algorithm matches existing lower bounds in most important cases. For practical implementation, we introduce a new and effective heuristic that significantly simplifies the binary search procedure required by the algorithm, which makes the algorithm in general settings as efficient as the special case in [GN19]. On large-scale classification datasets, our algorithm demonstrates clear and consistent advantages of high-order acceleration methods over first-order ones, in terms of run-time complexity. Our formulation considers the more general composite setting in which the objective function may contain a second possibly non-smooth convex term. Our analysis and proofs are also applicable to the general case in which the high-order smoothness conditions are with respect to non-Euclidean norms.

1 Introduction

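In optimization, one often considers the problem of minimizing a convex function $f$:

$$\min_{x} f(x).$$

A typical assumption is that $f$ has $L$-Lipschitz continuous gradients with respect to the Euclidean norm $\|\cdot\|_2$,

$$\|\nabla f(x) - \nabla f(y)\|_2 \le L \|x - y\|_2 \quad \text{for all } x, y, \qquad (1.1)$$

where $L$ is the Lipschitz constant. For this problem, to find an $\epsilon$-accurate solution $x$ such that $f(x) - f(x^*) \le \epsilon$, where $x^*$ is a minimizer of $f$, the classic gradient descent method

$$x_{k+1} = x_k - \eta \nabla f(x_k)$$

with $\eta = 1/L$ takes $O(\epsilon^{-1})$ iterations. Nevertheless, it is known from [Nes98] that, for convex functions with $L$-Lipschitz continuous gradients, the number of iterations of any first-order algorithm is lower-bounded by

$$\Omega(\epsilon^{-1/2}). \qquad (1.2)$$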

In the seminal work [Nes83], Nesterov has introduced an acceleration technique, the so-called accelerated gradient descent (AGD) algorithm, that achieves this optimal lower bound. This algorithm dramatically improves the convergence rate of smooth convex optimization with negligible per-iteration cost.
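To make the gap between the two rates concrete, the following is a minimal Python sketch, not taken from the paper, that compares plain gradient descent with a standard momentum form of Nesterov's accelerated gradient descent. The quadratic test problem, the step size 1/L, and all function names are illustrative assumptions.

```python
import numpy as np

def gradient_descent(grad, x0, L, iters):
    """Plain gradient descent with the constant step size 1/L."""
    x = x0.copy()
    for _ in range(iters):
        x = x - grad(x) / L
    return x

def accelerated_gradient_descent(grad, x0, L, iters):
    """A common momentum form of Nesterov's accelerated gradient descent."""
    x, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(iters):
        x_next = y - grad(y) / L                      # gradient step at the extrapolated point
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_next + ((t - 1.0) / t_next) * (x_next - x)  # momentum extrapolation
        x, t = x_next, t_next
    return x

# Illustrative ill-conditioned convex quadratic f(x) = 0.5 * x^T Q x,
# whose minimizer is x* = 0 with f(x*) = 0.
eigs = np.logspace(-4, 0, 50)
Q = np.diag(eigs)
f = lambda x: 0.5 * x @ Q @ x
grad = lambda x: Q @ x
L = eigs.max()
x0 = np.ones(50)
iters = 200

gd_gap = f(gradient_descent(grad, x0, L, iters))
agd_gap = f(accelerated_gradient_descent(grad, x0, L, iters))
print(f"GD  gap after {iters} iterations: {gd_gap:.3e}")
print(f"AGD gap after {iters} iterations: {agd_gap:.3e}")
```

On this example both methods use one gradient evaluation per iteration, so the difference in the final gap comes only from the momentum step.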

1.1 High-order Acceleration Methods with Lipschitz Continuity.

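To hope for an iteration complexity better than $O(\epsilon^{-1/2})$, $f$ needs to be smooth in its high-order derivatives. A common assumption is that the $p$-th order derivative of $f$ is $\nu$-Hölder continuous with constant $L$:

$$\|\nabla^p f(x) - \nabla^p f(y)\| \le L \|x - y\|_2^{\nu} \quad \text{for all } x, y, \qquad (1.3)$$

for some integer $p \ge 1$ and $\nu \in [0, 1]$. Notice that for $p = 1$ and $\nu = 1$, this condition becomes the first-order $L$-Lipschitz continuous gradient condition (1.1) above. Here, for $p \ge 2$, the norm of a $p$-th order tensor denotes its operator norm [Nes18b] induced by the vector 2-norm $\|\cdot\|_2$. Sometimes, when $\nu = 1$, the function $f$ is said to have $L$-Lipschitz continuous $p$-th order derivatives:

$$\|\nabla^p f(x) - \nabla^p f(y)\| \le L \|x - y\|_2 \quad \text{for all } x, y. \qquad (1.4)$$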

In general, if we were able to utilize higher-order derivatives with , we expect to obtain algorithms with higher convergence rates. The higher is (and ), the higher the rate could be.
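As a small numerical illustration of the first-order case of Hölder continuity, the sketch below checks the condition for the one-dimensional function f(x) = |x|^(1+ν)/(1+ν), whose gradient sign(x)|x|^ν is ν-Hölder continuous with constant 2^(1-ν). The choice of this test function, the sampling range, and the helper names are assumptions made for illustration; they do not come from the paper.

```python
import numpy as np

def grad_power(x, nu):
    """Gradient of f(x) = |x|^(1 + nu) / (1 + nu) in one dimension."""
    return np.sign(x) * np.abs(x) ** nu

def holder_ratio(nu, n_samples=20000, seed=0):
    """Largest sampled value of |f'(x) - f'(y)| / |x - y|^nu."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-2.0, 2.0, n_samples)
    y = rng.uniform(-2.0, 2.0, n_samples)
    keep = x != y                      # avoid 0/0 on coincident samples
    x, y = x[keep], y[keep]
    num = np.abs(grad_power(x, nu) - grad_power(y, nu))
    den = np.abs(x - y) ** nu
    return float(np.max(num / den))

# nu = 1 is the Lipschitz case; smaller nu means a weaker smoothness condition.
ratios = {nu: holder_ratio(nu) for nu in (0.25, 0.5, 1.0)}
for nu, r in ratios.items():
    print(f"nu = {nu}: sampled ratio {r:.4f}, constant 2^(1-nu) = {2 ** (1 - nu):.4f}")
```

The sampled ratio never exceeds 2^(1-ν), and it approaches that value for pairs with y close to -x.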

If a convex function $f$ has $L$-Lipschitz continuous $p$-th order derivatives (1.4), the recent work [ASS17] has given a lower bound on the complexity: any deterministic algorithm needs at least

$$\Omega\left(\epsilon^{-\frac{2}{3p+1}}\right) \qquad (1.5)$$

iterations to find an $\epsilon$-accurate solution. For the special case $p = 2$, [Nes08] has proposed an “accelerated cubic regularized Newton method” (ACNM) that achieves a complexity of $O(\epsilon^{-1/3})$. From a different approach, [MS13] has proposed an “accelerated Newton proximal extragradient” (A-NPE) method that achieves the optimal complexity $O(\epsilon^{-2/7})$ for $p = 2$, although each iteration requires a nontrivial binary search procedure.

To achieve better complexity results, and encouraged by the fact that third-order methods can often be implemented as efficiently as second-order methods [Nes18b], there is increasing interest in extending ACNM and A-NPE to even higher-order smoothness settings [Nes18b, JWZ18, GKMC18, BJL18]. In particular, following the Nesterov-type ACNM framework, [Nes18b] has proposed an accelerated tensor method with iteration complexity $O(\epsilon^{-1/(p+1)})$ for general order $p$. Meanwhile, by following the alternative A-NPE framework of [MS13], [JWZ18, GKMC18, BJL18] have proposed accelerated methods that achieve the optimal iteration complexity, although, just like A-NPE, all these methods need the nontrivial binary search procedure. Hence the current situation seems to be: methods from the Nesterov acceleration framework [Nes08, Nes18b, GN17, GN19] have the advantage of simpler implementation, while methods from the Monteiro-Svaiter acceleration framework [MS13, JWZ18, GKMC18, BJL18] can in theory achieve the optimal rate $O(\epsilon^{-2/(3p+1)})$. However, it remains somewhat mysterious how we could reconcile the differences between these two approaches.

1.2 Acceleration under Hölder Continuity and Our Results

Besides the Lipschitz continuous setting, the more general Hölder continuous setting (1.3) is also of increased interest, partly for designing universal optimization schemes [Nes15, YDC15, GN17, CGT19]. If has -Hölder continuous gradients, a lower bound for the iteration complexity is known to be [NY83]:

(1.6)

An algorithm that can achieve this lower bound has been proposed in [NN85].

For the more general setting of -Hölder continuous derivatives, during the preparation of this paper, [GN19] has given a lower bound of iteration complexity

(1.7)

By extending Nesterov’s method in [Nes18b], [GN19] has proposed a method that achieves the iteration complexity . To the best of our knowledge, methods that can achieve the lower bound are still unknown.

In this paper, for the minimization of convex functions with -Hölder continuous derivatives, we propose a unified acceleration algorithm (UAA), see Algorithm 2, that achieves the iteration complexity of with , ), which matches the lower bound [GN19]. To be more precise, if a convex function has -Hölder continuous derivatives, our algorithm can find an -accurate solution with

(1.8)

iterations, where is a tunable parameter such that . (Footnote 1: As we will later see, is the order of the uniform convexity of the proxy-function for algorithm design. [GN19] has used a uniformly convex proxy-function of -th order, while [JWZ18, GKMC18, BJL18] have used a uniformly convex proxy-function of second order.) Notice that our result and algorithm unify previously known results as (important) special cases:

  • For the case of -Lipschitz continuous gradients [Nes98] where and , the rate (1.8) of the proposed algorithm achieves the lower bound of (1.2).

  • For the more general setting of -Hölder continuous derivatives: When , the rate (1.8) of the proposed algorithm achieves the lower bound of (1.7) given by [GN19]; when , it recovers the complexity of the method given in [GN19].

Nevertheless, our result and algorithm work for the full range of . Our approach and analysis provide a continuous transition from the Nesterov acceleration framework to the Monteiro-Svaiter acceleration framework.

1.3 Acceleration under Uniform Convexity and Our Results

The above result for optimal complexity (1.8) is given for the general convexity setting with -Hölder continuous derivatives. When has additional nice properties such as uniform convexity, we should expect even better complexity. To be more precise, assume that is -uniformly convex, i.e.:

(1.9)

for , where -uniform convexity is also known as -strong convexity. For , when has -Lipschitz continuous gradients, [Nes98] has provided a lower bound for the iteration complexity

(1.10)

When has -Lipschitz continuous derivatives, [ASS17] has provided a lower bound for the iteration complexity

(1.11)

and it has also proposed a method based on restarting A-NPE [MS13] that achieves a complexity upper-bounded by , quite close to the lower bound.

In this paper, we show that in the uniformly convex setting, the idea of restart for ACNM in [Nes08] is also applicable to our algorithm and can significantly improve the iteration complexity (1.8). Inspired by that work, we introduce a more general restart scheme, see Algorithm 3, that is applicable for accelerating almost all convex optimization algorithms.

We show that for -uniformly convex and -Hölder continuous functions, if , then UAA with the proposed restart scheme applied needs at most

(1.12)

iterations to find an -accurate solution, where is the tunable parameter as before. If , then the resulting algorithm needs at most

(1.13)

iterations. If , then the algorithm needs at most

(1.14)

iterations.

Notice that according to (1.12), when , with the design parameter , we recover the optimal rate of accelerated gradient descent (AGD) in the strongly convex setting [Nes98]. According to (1.13), when , with , our algorithm eliminates the logarithmic factor in the first term of the upper bound given in [ASS17] and achieves the iteration complexity of (1.11), which matches the lower bound given in [ASS17]. According to (1.14), when , with , our algorithm has the iteration complexity , which may be of interest for solving the cubic regularized Newton step [NP06] by gradient descent methods [CD16].
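The restart idea can be illustrated in its simplest first-order form. The sketch below is not the paper's Algorithm 3; it is a generic fixed-schedule restart wrapped around accelerated gradient descent for a σ-strongly convex function with L-Lipschitz gradients. The stage length ceil(sqrt(8L/σ)) is an assumption chosen so that the standard AGD bound, combined with strong convexity, at least halves the objective gap in every stage. The test problem and all names are hypothetical.

```python
import numpy as np

def agd(grad, x0, L, iters):
    """A common momentum form of accelerated gradient descent (step size 1/L)."""
    x, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(iters):
        x_next = y - grad(y) / L
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_next + ((t - 1.0) / t_next) * (x_next - x)
        x, t = x_next, t_next
    return x

def restarted_agd(grad, x0, L, sigma, stages):
    """Run AGD in stages; each stage restarts the momentum from the last output."""
    inner_iters = int(np.ceil(np.sqrt(8.0 * L / sigma)))  # assumed stage length
    x = x0.copy()
    for _ in range(stages):
        x = agd(grad, x, L, inner_iters)
    return x, inner_iters

# Illustrative strongly convex quadratic with eigenvalues in [sigma, L], minimizer x* = 0.
sigma, L = 1e-2, 1.0
Q = np.diag(np.linspace(sigma, L, 20))
f = lambda x: 0.5 * x @ Q @ x
grad = lambda x: Q @ x
x0 = np.ones(20)
stages = 20

x_final, inner_iters = restarted_agd(grad, x0, L, sigma, stages)
initial_gap, final_gap = f(x0), f(x_final)
print(f"{stages} stages of {inner_iters} iterations: gap {initial_gap:.3e} -> {final_gap:.3e}")
```

Because each stage halves the gap, the total number of gradient evaluations grows like sqrt(L/σ) times the logarithm of the target accuracy, which is the linear rate expected in the strongly convex case.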

1.4 Our Approach and Some Implications

In this paper, instead of directly designing an algorithm and then analyzing its iteration complexity, we consider a different paradigm to make our approach and algorithm more intuitive and explainable. The paradigm is inspired by the unified theory for first-order algorithms [DO19] and the continuous-time interpretations of Nesterov’s acceleration [SBC14, KBB15, KBB16, WWJ16]. Our approach to the algorithmic design is based on an idealized but impractical algorithm called vanilla proximal method (VPM), introduced in Section 3. A continuous-time approximation to the VPM and a discrete-time approximation to the VPM will lead us to the final implementable algorithm with desired convergence rates.

The VPM aims to solve a regularized program of the original one with an arbitrary convergence rate depending on parameters of our choice. However, the VPM serves more as an ideal target and is itself computationally infeasible to realize. We show in Section 4 that, to overcome the computational hurdle, one can instead solve a continuous-time convex approximation to the VPM. An accelerated continuous-time dynamics can be derived simply as sufficient conditions to ensure that the solution to the approximate convex program achieves the same convergence rate as the original VPM. Such a point of view unifies the existing continuous-time accelerated dynamics introduced in [SBC14], [KBB15] and [WWJ16] and serves as an arguably better guideline for the design of practical algorithms in the discrete setting.

In practice, to realize the desired accelerated dynamics, we need to know how to implement them in the discrete setting as an iterative algorithm. To this end, we need to consider a discrete-time convex approximation to the VPM. However, as we will see in Section 5, in order for the discrete-time approximation to achieve the same convergence rate as the continuous dynamics, we must solve a fixed-point problem which itself is computationally infeasible (if not impossible) in practice. To circumvent this difficulty, we propose to solve the fixed-point problem approximately by solving a smooth approximation to the VPM, which becomes a tractable problem. Finally, by combining the convex approximation and the smooth approximation to the VPM, we propose an implementable discrete-time accelerated algorithm which achieves the optimal iteration complexity given in (1.8) for the minimization of convex functions with -Hölder continuous derivatives (for ).

Besides attaining the optimal complexity (1.8), our approach and algorithm offer several other benefits. Firstly, our approach and analysis are applicable to the composite setting where can be a sum of a smooth convex function and a non-smooth one (such as the norm). Secondly, to the best of our knowledge, it is the first algorithm that provides iteration complexity results under the non-Euclidean high-order smoothness assumption, which may be of certain theoretical interest. (Footnote 2: The existing complexity results [Nes18b, GN19, JWZ18, GKMC18, BJL18] for high-order methods are only applicable under the generalized Euclidean norm setting given by a general positive definite matrix , where .) Thirdly, our approach seems to unify the conditions and results of the previous two separate approaches to develop high-order acceleration algorithms, represented by the work of [Nes18b] and the work of [MS13], respectively.

Last but not least, in order to achieve the optimal convergence rate that matches the lower bound [ASS17], there is an important difference between first-order and high-order algorithms. In the high-order setting, to obtain the optimal rate, we must employ a binary search procedure to find a suitable coupling coefficient in each iteration, which may substantially slow down the practical performance [Nes18b]. Therefore, in addition to the above theoretical results, we introduce a simple heuristic for finding the coupling coefficient, suggested by our analysis, so that the resulting implementation does not need the binary search procedure required by the optimal acceleration method. Our experiments show that this simple heuristic is extremely effective and can easily ensure the conditions needed to achieve the optimal rate. This leads to a very practical implementation of the optimal acceleration algorithms without extra implementation cost, alleviating concerns raised by [Nes18b].

2 Preliminaries

Before we proceed, we first introduce some notation. Let denote a definition. For , let with . Let denote a norm of vectors and denote the dual norm of . For and , let . With a slight abuse of notation, for a convex function defined on , let denote the gradient at or one point in the subgradient set . For a function , denotes the variable of , denotes the parameter of and denotes the gradient or one point in the subgradient set with respect to .

Similar to the notation in [Nes18b], for , we use to denote the directional derivative of a function at along the directions . Then is a symmetric -linear form and its operator norm with respect to a norm is defined as

(2.15)
Definition 1 (Strictly, Uniformly, or Strongly Convex)

We say a continuous function is convex on , if , one has

(2.16)

is strictly convex on , if the equality sign in (2.16) holds if and only if ;

is -uniformly convex on with respect to a norm , if , one has

(2.17)

where denotes the order of uniform convexity and denotes the constant of uniform convexity;

is -strongly convex on , if is -uniformly convex on .

In Definition 1, uniform convexity can be viewed as an extension of the better known concept of strong convexity. Example 1 gives two cases of uniform convexity.

Example 1 (Uniform Convexity)

is -uniformly convex on [BCL94]; is -uniformly convex on [Nes08].
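The second kind of example, a power of the Euclidean norm, can be checked numerically. The sketch below assumes the standard result of [Nes08] that d(x) = ‖x‖₂^q / q with q ≥ 2 satisfies d(y) ≥ d(x) + ⟨∇d(x), y − x⟩ + (σ/q)‖y − x‖₂^q with σ = 2^(2−q). Both the exact form of the inequality and the constant are stated here as assumptions, since the corresponding formulas are not reproduced above; the function names are hypothetical.

```python
import numpy as np

def d(x, q):
    """Power prox-function d(x) = ||x||_2^q / q."""
    return np.linalg.norm(x) ** q / q

def grad_d(x, q):
    """Gradient of d: ||x||_2^(q-2) * x."""
    return np.linalg.norm(x) ** (q - 2) * x

def min_slack(q, dim=5, trials=2000, seed=0):
    """Smallest sampled value of (Bregman divergence of d) - (sigma/q) * ||y - x||^q."""
    rng = np.random.default_rng(seed)
    sigma = 2.0 ** (2 - q)             # assumed uniform convexity constant
    worst = np.inf
    for _ in range(trials):
        x, y = rng.normal(size=dim), rng.normal(size=dim)
        bregman = d(y, q) - d(x, q) - grad_d(x, q) @ (y - x)
        lower = sigma / q * np.linalg.norm(y - x) ** q
        worst = min(worst, bregman - lower)
    return worst

# q = 2 is the strongly convex case, where the inequality holds with equality.
slacks = {q: min_slack(q) for q in (2.0, 3.0, 4.0)}
for q, s in slacks.items():
    print(f"q = {q}: smallest sampled slack {s:.3e}")
```

A nonnegative slack (up to rounding) on all samples is consistent with uniform convexity of order q; it is a sanity check, not a proof.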

Starting from the work of [Nes15], there has been increasing interest in replacing the Lipschitz continuity assumption by the Hölder continuity assumption [YDC15, Rd17, Nes18a, GN19] and in proposing universal algorithms, in the sense that the convergence of the algorithms can optimally adapt to the Hölder parameter. [Nes15, YDC15] have considered first-order algorithms with Hölder continuous gradients ; [GN17, NGN18] have proposed cubic regularized Newton methods for minimizing functions with Hölder continuous Hessians ; [CGT19, GN19] have considered tensor methods for minimizing functions with -th Hölder continuous derivatives (). In this paper, we extend the definition of Hölder continuous derivatives to hold with respect to any norm , including non-Euclidean norms. Our analysis and results will be applicable to this general setting.

Definition 2 ( Hölder Continuous Derivative)

We say a function on has -Hölder continuous derivatives with respect to , if , one has

(2.18)

where denotes the order of derivative, denotes the Hölder parameter and is the constant of smoothness.

is said to have -Lipschitz continuous derivatives on if has -Hölder continuous derivatives on .

In Definition 2, we unify the definition of Hölder continuous gradients and high-order Hölder continuous derivatives . For , denotes the dual norm of ; for , denotes the operator norm of tensor of -th order , which is defined by (2.15).

In the paper, we mainly consider the problem of optimizing a composite convex function of the form

(2.19)

where is a closed proper convex function and is a simple convex but maybe non-smooth function. We consider the case when has -Hölder continuous derivatives, for all . Then we can define the following two auxiliary functions that approximate :

(2.20)
(2.21)

Then is a lower-bound convex approximation to for any parameter . gives a high-order smooth approximation to for any parameter . Or more formally, we have:

Lemma 1

If and are convex, and has -Hölder continuous derivatives, for all , then we have

(2.22)
(2.23)
(2.24)

Proof. See Section A.1.    

In (2.20) and (2.21), we do not linearize the term which may be nonsmooth. Because of (2.22), in this paper, is viewed as a lower-bound convex approximation to for any parameter . satisfies (2.23) and (2.24), and gives a high-order smooth approximation to for any parameter .
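For intuition, the two approximations can be checked numerically in the simplest first-order Lipschitz case. The sketch below assumes the standard forms: the linearization f(y) + ⟨∇f(y), x − y⟩ as the lower-bound convex model, and the same linearization plus (L/2)‖x − y‖₂² as the smooth upper model. These forms, the logistic-type test function, and all names are assumptions for illustration; the paper's definitions (2.20) and (2.21) cover the general high-order composite case.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(30, 4))           # hypothetical data matrix

def f(x):
    """Smooth convex test function: sum of log(1 + exp(a_i^T x))."""
    return np.sum(np.logaddexp(0.0, A @ x))

def grad_f(x):
    return A.T @ (1.0 / (1.0 + np.exp(-(A @ x))))

# The Hessian is A^T diag(s(1-s)) A with s(1-s) <= 1/4, so L = ||A||_2^2 / 4 works.
L = 0.25 * np.linalg.norm(A, 2) ** 2

def lower_model(x, y):
    """Linearization of f at y: a global lower bound by convexity."""
    return f(y) + grad_f(y) @ (x - y)

def upper_model(x, y):
    """Linearization plus a quadratic term: a global upper bound by L-smoothness."""
    return lower_model(x, y) + 0.5 * L * np.linalg.norm(x - y) ** 2

checks = []
for _ in range(1000):
    x, y = rng.normal(size=4), rng.normal(size=4)
    checks.append(lower_model(x, y) <= f(x) + 1e-9 and f(x) <= upper_model(x, y) + 1e-9)
all_hold = all(checks)
print("sandwich inequality holds on all samples:", all_hold)
```

In the composite setting the non-smooth term would simply be added, unlinearized, to both models.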

Finally, we give two inequalities in Lemma 2 which will be used in our analysis.

Lemma 2

For a sequence with and . Then for and , if ,

(2.25)

we have

(2.26)

Meanwhile, for and , if

(2.27)

then we have

(2.28)

Proof. See Section A.2.    

3 A Vanilla Proximal Method

Let us start our study by considering the composite convex optimization problem in (2.19). In the following discussion, we assume that is a minimizer of on . To design an acceleration algorithm to minimize , we first introduce a so-called vanilla proximal method (VPM), which minimizes an auxiliary function as in Algorithm 1.

Algorithm 1 Vanilla Proximal Method (VPM)
1:  Input: an initial point , a positive scalar function depending on which satisfies if and if .
2:  For any ,
(3.29)

In the auxiliary objective , the proxy term typically should satisfy the following assumption:

Assumption A

For all , with if and only if . Meanwhile, is strictly convex on .

Therefore, in the VPM, for each , (3.29) is a strictly convex program and thus there exists a unique minimizer . By using only the optimality condition of (3.29), we can characterize the “convergence rate” of the VPM as below.

Theorem 1

For any , the solution generated by Algorithm 1 satisfies

(3.30)

Proof. By the definition of in (3.29), one has

(3.31)

Then by the optimality condition of and the nonnegativity of , one has

(3.32)

By the upper bound of in (3.31) and lower bound of in (3.32), after a simple rearrangement, Theorem 1 is proved.

 

According to Theorem 1, the VPM may converge with any convergence rate if is chosen to be large enough. In fact we do not need any extra assumption on in the proof of Theorem 1, except that the optimal solution exists. Although solving the subproblem (3.29) is impractical in general, it provides us a good starting point to design practical algorithms: by making certain assumptions on the objective function and the proxy function , it is possible to achieve or approach the convergence rate of the VPM by solving a tractable approximation to (3.29). The ideal subproblem (3.29) does not depend on any previously visited states or along the optimization path. Nevertheless, when we consider a tractable approximation to (3.29), the approximation can depend on the previous states either in terms of the entire continuous path or a finite number of discrete samples. As we will see, a continuous approximation results in a continuous-time accelerated dynamic system in Section 4, while a discrete-time approximation results in a discrete-time accelerated algorithm in Section 5.
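The behaviour described here can be observed on a toy problem where the proximal subproblem has a closed form. The sketch below assumes that the auxiliary objective has the form A_t f(x) + ½‖x − x_0‖₂² and that the resulting guarantee reads f(x_t) − f(x*) ≤ ½‖x* − x_0‖₂² / A_t. Both are assumptions about the stripped formulas (3.29) and (3.30), and the quadratic test function and the names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 6))
Q = M.T @ M + 0.1 * np.eye(6)          # positive definite, so f is convex
b = rng.normal(size=6)

f = lambda x: 0.5 * x @ Q @ x - b @ x  # hypothetical smooth objective
x_star = np.linalg.solve(Q, b)         # exact minimizer of f
x0 = np.zeros(6)

def vpm_point(A_t):
    """Minimizer of A_t * f(x) + 0.5 * ||x - x0||^2 (assumed VPM subproblem).

    Stationarity gives A_t * (Q x - b) + (x - x0) = 0.
    """
    return np.linalg.solve(A_t * Q + np.eye(6), A_t * b + x0)

results = []
for A_t in (0.1, 1.0, 10.0, 100.0):
    gap = f(vpm_point(A_t)) - f(x_star)
    bound = 0.5 * np.linalg.norm(x_star - x0) ** 2 / A_t
    results.append((A_t, gap, bound))
    print(f"A_t = {A_t:6.1f}: gap {gap:.3e} <= bound {bound:.3e}")
```

For a general f the subproblem has no closed form, which is why the method is only a conceptual target; here the quadratic structure makes each point a single linear solve.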

Remark. The proposed VPM is similar to the proximal point algorithm (PPA) [PB14] which performs

(3.33)

along iterations, where . The difference between VPM and PPA is that VPM is not an iterative algorithm and has a convergence rate only depending on the parameter we choose. If we set , the per-iteration costs of VPM and PPA are comparable.    
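For comparison, a minimal sketch of the proximal point iteration on the same kind of toy quadratic is given below. It assumes the usual update x_{k+1} = argmin_x { f(x) + ‖x − x_k‖₂² / (2λ) } with a constant λ > 0, together with the classical guarantee f(x_k) − f(x*) ≤ ‖x_0 − x*‖₂² / (2λk). The exact form of (3.33) is not reproduced above, so this form, the test function, and the names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(6, 6))
Q = M.T @ M + 0.1 * np.eye(6)
b = rng.normal(size=6)

f = lambda x: 0.5 * x @ Q @ x - b @ x  # hypothetical smooth objective
x_star = np.linalg.solve(Q, b)

def ppa(x0, lam, iters):
    """Proximal point iterations; each step solves one regularized subproblem."""
    x = x0.copy()
    gaps = []
    step_matrix = lam * Q + np.eye(len(x0))
    for _ in range(iters):
        # Stationarity of f(x) + ||x - x_k||^2 / (2 lam): lam (Q x - b) + x - x_k = 0.
        x = np.linalg.solve(step_matrix, lam * b + x)
        gaps.append(f(x) - f(x_star))
    return x, gaps

x0 = np.zeros(6)
lam, iters = 0.5, 50
_, gaps = ppa(x0, lam, iters)
bounds = [np.linalg.norm(x0 - x_star) ** 2 / (2 * lam * (k + 1)) for k in range(iters)]
print(f"gap after {iters} PPA steps: {gaps[-1]:.3e} (bound {bounds[-1]:.3e})")
```

Unlike the VPM, every step here depends on the previous iterate, so the accuracy is controlled by the number of steps rather than by a single scalar parameter.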

Remark. The subproblem (3.33) of PPA is often computationally infeasible in practice. By considering tractable inexact versions of PPA with the concept “-subdifferential”, [MS13] has proposed a unified framework, accelerated hybrid proximal extragradient (A-HPE), for convex optimization. One difference between our framework and A-HPE is that ours extends from the non-iterative VPM and therefore can unify both continuous-time accelerated dynamics and discrete-time accelerated algorithms. Meanwhile, we consider a general proxy function rather than the Euclidean norm square , thus our framework can be generalized to the non-Euclidean setting such as and the -th power of Euclidean norm .    

4 Continuous-time Accelerated Descent Dynamics

The subproblem (3.29) in the VPM is merely conceptual as it is almost as difficult as minimizing the original function. Nevertheless, if is convex, one can always seek more tractable approximations. From an acceleration perspective, the convex approximation in Lemma 1 gives a lower bound for at the current state . The minimizer of would suggest an aggressive direction and step for the next iterate to go to. However, for such iterates not to diverge too far from the landscape of , we also need a good upper bound. A basic idea is that up to time , we have already traversed a path over the landscape of . We could potentially use all the lower-bounds of to construct a good upper bound to guide the next step. The simplest possible form for such an upper bound we could consider is a superposition (or an integral) of these lower bounds:

where is a properly chosen weight function of and is a strictly convex term to bound the function from below in case are not.

Therefore, to guide the descent trajectory, we can consider solving an approximate problem of (3.29) as follows

(4.34)

where and satisfies and is the path of optimization and its relationship with will be determined soon.

In this section, our main goal is to show that the widely studied continuous-time accelerated dynamics arise from a sufficient condition that allows (4.34) to achieve the same convergence rate as the original VPM. First, an upper bound of is given in Lemma 3.

Lemma 3

For all , we have .

Proof. See Section B.1    

Lemma 3 is an extension of the upper bound (3.31) of , which follows trivially from Lemma 1. In other words, Lemma 3 provides a lower bound of .

Lemma 4

For all , we have .

Proof. See Section B.2    

Essentially, Lemma 4 says that the lower bound (3.32) of can be extended to , at least approximately. We would like to make this approximation as close as possible and establish as an upper bound of , at least along certain path by our choice. To this end, based on Lemmas 3 and 4, we have the following theorem.

Theorem 2 (Continuous-Time VPM)

If the continuous-time trajectories and are evolved according to the dynamics:

(4.35)

where , , , and , then for all , one has

(4.36)

Proof. If , then from Lemma 4, one has

Combining Lemma 3, we have (4.36).    

In Theorem 2, (4.35) does not specify any concrete values or forms for and , except the conditions , and ; meanwhile it does not specify any concrete form for . (Footnote 3: Theoretically, should be chosen such that the differential equation has a unique solution.) As a result, by instantiating the dynamical system (4.35), one may obtain all the ODEs previously introduced and studied in the literature [SBC14, KBB15, KBB16, WWJ16]. We show a few examples below:

Example 2

If the component of (2.19) and , then (4.35) is equivalent to

(4.37)

where , , . By setting and respectively, then we recover the ODE in [SBC14] and the ODE under the Euclidean norm setting in [WWJ16].
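To illustrate the kind of dynamics recovered here, the sketch below numerically integrates the ODE of [SBC14], x''(t) + (3/t) x'(t) + ∇f(x(t)) = 0 with x(0) = x_0 and x'(0) = 0, and compares the objective gap with the known guarantee f(x(t)) − f(x*) ≤ 2‖x_0 − x*‖₂² / t². The classical fourth-order Runge-Kutta integrator, the small positive start time used to avoid the singularity at t = 0, and the quadratic test function are assumptions made for illustration.

```python
import numpy as np

def simulate_sbc_ode(grad, x0, t_end, dt=1e-3, t0=1e-2):
    """Integrate x'' + (3/t) x' + grad f(x) = 0 with classical RK4.

    The integration starts at a small t0 > 0 with zero velocity, which
    approximates the initial condition x(0) = x0, x'(0) = 0.
    """
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)
    t = t0

    def rhs(t, x, v):
        return v, -(3.0 / t) * v - grad(x)

    while t < t_end - 1e-12:
        h = min(dt, t_end - t)
        k1x, k1v = rhs(t, x, v)
        k2x, k2v = rhs(t + h / 2, x + h / 2 * k1x, v + h / 2 * k1v)
        k3x, k3v = rhs(t + h / 2, x + h / 2 * k2x, v + h / 2 * k2v)
        k4x, k4v = rhs(t + h, x + h * k3x, v + h * k3v)
        x = x + h / 6 * (k1x + 2 * k2x + 2 * k3x + k4x)
        v = v + h / 6 * (k1v + 2 * k2v + 2 * k3v + k4v)
        t += h
    return x

# Illustrative convex quadratic f(x) = 0.5 * x^T Q x with minimizer x* = 0.
Q = np.diag([1.0, 0.1])
f = lambda x: 0.5 * x @ Q @ x
grad = lambda x: Q @ x
x0 = np.array([1.0, 1.0])
T = 10.0

final_gap = f(simulate_sbc_ode(grad, x0, T))
bound = 2.0 * np.linalg.norm(x0) ** 2 / T ** 2
print(f"f(x(T)) - f* = {final_gap:.3e}, bound 2||x0 - x*||^2 / T^2 = {bound:.3e}")
```

The trajectory oscillates around the minimizer while the objective gap stays below the 1/t² envelope, which is the continuous-time counterpart of the accelerated rate.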

Remark. In Example 2, if is the indicator function of a closed convex set and is chosen as the Bregman divergence of a strictly convex function, then we may recover the formulation of the accelerated mirror descent dynamics [KBB15] and the Euler-Lagrange equation [WWJ16].

Remark. Although we have derived the dynamics (4.35) from a different perspective, it should be noted that the dynamical system (4.35) is an extension and refinement of the ODE derived by the “approximate duality gap technique (ADGT)” [DO19]. The main difference is that instead of giving an upper bound of and a lower bound of , we give an upper bound of and a lower bound of . This modification allows us to set rather than , and thus the initialization expression about can be removed. Such a modification greatly simplifies the subsequent derivation and analysis.

Compared to the VPM, the continuous-time accelerated dynamics must satisfy an extra ODE, which can be viewed as an additional cost associated with the continuous-time approximation. As we see in Theorem 2, the optimization path can have the same property as the VPM and one can obtain an arbitrarily fast convergence rate if is chosen to be large enough. However, if a discrete-time approximation is used to implement and approximate the VPM, it is in general difficult to retain the same rate, which will be discussed carefully in Section 5.

5 Discrete-time Accelerated Descent Algorithm

In order to achieve the same convergence rate of the VPM, the continuous-time approximation needs the extra ODE condition in (4.35), which is reasonable to assume in the continuous setting. In the discrete-time setting, if all other conditions remain unchanged, except that we replace the weighted continuous-time approximation (4.34) by a weighted discrete-time counterpart, one may see that the ODE will be replaced by a condition that requires us to find a solution to a fixed-point problem (which will be clear in Lemma 6). Unfortunately, directly solving this fixed-point problem is computationally infeasible in practice. To remedy this difficulty, we employ a stronger assumption for the proxy-function such that it can introduce an extra term as follows.

Assumption B

For all , with if and only if . Meanwhile is -uniformly convex with respect to a norm , where , , one has

(5.38)
Example 3

By setting or the Bregman divergence of , then satisfies Assumption B with order and constant [BCL94]; by setting , then satisfies Assumption B with order and constant [Nes08].

Meanwhile, we also need that is -uniformly convex.

Assumption C

For the norm , we assume that is -uniformly convex with respect to , where ,

(5.39)

In order to find a good approximate solution of our problem in a computationally efficient way, the smooth component of in (2.19) should satisfy the following.

Assumption D

has -Hölder continuous derivatives, where .

In Assumptions B to D, for practical concerns and technical reasons, in the following discussion, we will assume that and . ( means that in our setting if , then .)

Based on Assumptions B-D, in this section, similar to the continuous-time approximation in Section 4, we consider a weighted discrete-time convex approximation of (3.29): for ,

(5.40)

where we assume that , , satisfies Assumption B, and is defined in Lemma 1. Meanwhile, in (5.40), when , we let and thus . We then motivate the discrete-time algorithm by analyzing the conditions needed to emulate the same rate as the VPM. First, an upper bound of is given in Lemma 5.

Lemma 5

For all , one has .

Proof. See Section C.1.

Then in Lemma 6 below, we show how the lower bound (3.32) of can be extended to the discrete case with some extra terms.

Lemma 6

For , let . Then for all , one has

(5.41)

Proof. See Section C.2.    

In Lemma 6, the extra negative term in is from the uniform convexity of . If is only convex (i.e. ), this negative term does not exist and thus a sufficient condition for is:

(5.42)

By (5.40), is a function of . Therefore finding to satisfy (5.42) reduces to a fixed-point problem (and likewise for ). It is computationally infeasible (if not impossible) to find an exact solution to this problem in general. Nevertheless, if satisfies Assumption B, the term contains a negative term . So there is hope that an approximate solution to the fixed-point problem (5.42) can still make .

To approximately solve the fixed-point problem, for convenient analysis, inspired by [HP87, DO17], we define a pair