FLAG n' FLARE: Fast Linearly-Coupled Adaptive Gradient Methods

05/26/2016 · by Xiang Cheng et al. · UC Berkeley

We consider first order gradient methods for effectively optimizing a composite objective in the form of a sum of smooth and, potentially, non-smooth functions. We present accelerated and adaptive gradient methods, called FLAG and FLARE, which can offer the best of both worlds. They can achieve the optimal convergence rate by attaining the optimal first-order oracle complexity for smooth convex optimization. Additionally, they can adaptively and non-uniformly re-scale the gradient direction to adapt to the limited curvature available and conform to the geometry of the domain. We show theoretically and empirically that, through the compounding effects of acceleration and adaptivity, FLAG and FLARE can be highly effective for many data fitting and machine learning applications.


1 Introduction

We consider the problem of minimizing a convex function $F$ over a convex set $\mathcal{X}$, and we focus on first order methods, which exploit only value and gradient information about $F$. These methods have become very important for many of the large-scale optimization problems that arise in machine learning applications. Two techniques have emerged as powerful tools for large-scale optimization. First, Nesterov's accelerated algorithms [11] and their proximal variant, the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) [2, 3], can exploit smoothness to improve the convergence rate of a simple gradient/ISTA method from $O(1/k)$ accuracy to $O(1/k^2)$ after $k$ iterations. Second, adaptive regularization approaches such as AdaGrad [7] can optimize a gradient method's step-size in different directions, in some sense making the optimization problem better-conditioned. These two techniques have also become popular in non-convex optimization problems, such as parameter estimation in deep neural networks (see, for example, the work of [13] on the benefits of acceleration, and AdaGrad [7, 6], RMSProp [14], ESGD [5], and Adam [9] on adaptive regularization methods in this context).
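To make the second technique concrete, here is a minimal sketch (our own toy illustration, not the paper's algorithm) of how an AdaGrad-style step rescales each coordinate by the history of squared gradients, compared to a plain gradient step:

```python
import numpy as np

def sgd_step(x, g, lr):
    """Plain gradient step: one global step size for every coordinate."""
    return x - lr * g

def adagrad_step(x, g, hist_sq, lr, eps=1e-8):
    """AdaGrad-style step: each coordinate is rescaled by the root of its
    accumulated squared gradients, so flat directions take larger
    effective steps."""
    hist_sq = hist_sq + g ** 2
    x = x - lr * g / (np.sqrt(hist_sq) + eps)
    return x, hist_sq

# toy badly-conditioned quadratic: f(x) = 0.5 * x^T A x
A = np.diag([100.0, 1.0])
x = np.array([1.0, 1.0]); hist_sq = np.zeros(2)
for _ in range(200):
    x, hist_sq = adagrad_step(x, A @ x, hist_sq, lr=0.5)
print(x)  # both coordinates decrease despite the 100:1 curvature ratio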

In this paper, we introduce a new algorithm, called FLAG, that combines the benefits of these two techniques. Like Nesterov's accelerated algorithms, our method has a convergence rate of $O(1/T^2)$ after $T$ iterations. Like AdaGrad, our method adaptively chooses a regularizer, in a way that performs almost as well as the best choice of regularizer in hindsight. In addition, our improvement over FISTA is roughly the square of AdaGrad's improvement over mirror descent (see Section 2.1 for details).

There have been a number of papers in recent years dealing with ways to interpret acceleration. Our algorithm is heavily influenced by [1]. That insightful paper introduced a view of accelerated gradient descent as a linear combination of gradient and mirror descent steps. We exploit this viewpoint in introducing adaptive regularization to our accelerated method. The authors of [1] use a fixed schedule to combine the gradient and mirror descent steps, whereas [4] views acceleration as an ellipsoid-like algorithm and instead determines an appropriate ratio using a line search. This latter idea is also crucial for our algorithm, as we do not know the right ratio in advance.

The optimal stepsize for an accelerated method generally depends on both the smoothness parameter of the objective function as well as the properties of the regularizer. In FLAG, the regularizer is chosen adaptively, and we show that it is also possible to adapt the stepsize in a way that is competitive with the best choice in hindsight. In fact, our method for picking the adaptive stepsizes is closely related to the approach in [8]. There, the authors considered the problem of picking the right stepsize to adapt to an unknown strong-convexity parameter. We use a related approach, but with different proof techniques, to choose the stepsize to adapt to a changing smoothness parameter.

Finally, it should also be noted that there are some interesting papers that study the continuous-time limit of accelerated algorithms, e.g., [12, 10, 15]. Indeed, studying adaptive regularization in the continuous-time setting is an interesting direction for future research.

1.1 Notation and Definitions

In what follows, vectors are considered as column vectors and are denoted by bold lower case letters, e.g., $\mathbf{v}$, and matrices are denoted by regular capital letters, e.g., $A$. We overload the "diag" operator as follows: for a given matrix $A$ and a vector $\mathbf{v}$, $\mathrm{diag}(A)$ and $\mathrm{diag}(\mathbf{v})$ denote the vector made from the diagonal elements of $A$ and a diagonal matrix made from the elements of $\mathbf{v}$, respectively. Vector norms $\|\cdot\|_1$, $\|\cdot\|_2$, and $\|\cdot\|_\infty$ denote the standard $\ell_1$, $\ell_2$, and $\ell_\infty$ norms, respectively. We adopt the Matlab notation for accessing the elements of vectors and matrices, i.e., the $i$-th component of a vector $\mathbf{v}$ is indicated by $\mathbf{v}(i)$, and $A(i,:)$ denotes the entire $i$-th row of the matrix $A$. Finally, $[A, \mathbf{b}]$ signifies that the matrix $A$ is augmented with the column vector $\mathbf{b}$.
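NumPy's diag uses the same overloading, which gives a one-line sanity check of the convention:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
v = np.array([5.0, 6.0])

print(np.diag(A))   # matrix argument -> vector of diagonal entries: [1. 4.]
print(np.diag(v))   # vector argument -> diagonal matrix [[5. 0.] [0. 6.]]
```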

Consider the optimization problem

$$\min_{\mathbf{x} \in \mathcal{X}} F(\mathbf{x}) := f(\mathbf{x}) + h(\mathbf{x}), \tag{1}$$

where $f$ and $h$ are closed proper convex functions and $\mathcal{X}$ is a closed convex set. We further assume that $f$ is differentiable with $L$-Lipschitz gradient, i.e.,

$$\|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\|_2 \le L \|\mathbf{x} - \mathbf{y}\|_2, \quad \forall\, \mathbf{x}, \mathbf{y} \in \mathcal{X}, \tag{2}$$

and we allow $h$ to be (possibly) non-smooth with sub-differential at $\mathbf{x}$ denoted by $\partial h(\mathbf{x})$.

The proximal operator [11] associated with $h$, $\mathcal{X}$, and $L$ is defined as

$$\mathrm{prox}(\mathbf{y}) := \operatorname*{arg\,min}_{\mathbf{x} \in \mathcal{X}} \left\{ h(\mathbf{x}) + \langle \nabla f(\mathbf{y}), \mathbf{x} - \mathbf{y} \rangle + \frac{L}{2}\|\mathbf{x} - \mathbf{y}\|_2^2 \right\} \tag{3a}$$
$$\phantom{\mathrm{prox}(\mathbf{y}) :} = \operatorname*{arg\,min}_{\mathbf{x} \in \mathcal{X}} \left\{ h(\mathbf{x}) + \frac{L}{2}\Big\|\mathbf{x} - \Big(\mathbf{y} - \frac{1}{L}\nabla f(\mathbf{y})\Big)\Big\|_2^2 \right\} \tag{3b}$$
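For intuition, when $h(\mathbf{x}) = \lambda \|\mathbf{x}\|_1$ and $\mathcal{X} = \mathbb{R}^d$, the minimization in (3b) has the familiar soft-thresholding closed form; a minimal sketch of this illustrative special case (not part of the paper):

```python
import numpy as np

def prox_l1(y, grad_f_y, L, lam):
    """Proximal (ISTA) step for h(x) = lam * ||x||_1 on R^d:
    a gradient step followed by coordinate-wise soft-thresholding."""
    w = y - grad_f_y / L                       # forward (gradient) step
    return np.sign(w) * np.maximum(np.abs(w) - lam / L, 0.0)

# example: one step on f(x) = 0.5 * ||x - b||^2 with lam = 0.1, L = 1
b = np.array([1.0, -0.05, 0.3])
y = np.zeros(3)
print(prox_l1(y, y - b, L=1.0, lam=0.1))  # small coordinates are zeroed out
```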

For a symmetric positive definite (SPD) matrix $D$, define $\phi(\mathbf{x}) := \frac{1}{2}\mathbf{x}^T D \mathbf{x}$ and the associated norm $\|\mathbf{x}\|_D := \sqrt{\mathbf{x}^T D \mathbf{x}}$. Note that $\phi$ is $1$-strongly convex with respect to the norm $\|\cdot\|_D$, i.e., for all $\mathbf{x}, \mathbf{y}$,

$$\phi(\mathbf{y}) \ge \phi(\mathbf{x}) + \langle \nabla \phi(\mathbf{x}), \mathbf{y} - \mathbf{x} \rangle + \frac{1}{2}\|\mathbf{y} - \mathbf{x}\|_D^2.$$

The Bregman divergence associated with $\phi$ is defined as

$$B_\phi(\mathbf{x}, \mathbf{y}) := \phi(\mathbf{x}) - \phi(\mathbf{y}) - \langle \nabla \phi(\mathbf{y}), \mathbf{x} - \mathbf{y} \rangle.$$

We will use the fact that, for any $\mathbf{x}, \mathbf{y}$,

$$B_\phi(\mathbf{x}, \mathbf{y}) = \frac{1}{2}\|\mathbf{x} - \mathbf{y}\|_D^2. \tag{4}$$

It is easy to see that the dual of the norm $\|\cdot\|_D$ is given by

$$\|\mathbf{x}\|_D^* = \|\mathbf{x}\|_{D^{-1}} = \sqrt{\mathbf{x}^T D^{-1} \mathbf{x}}. \tag{5}$$
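A quick numerical check of (4) and (5) for a diagonal SPD $D$ (our own sanity test, not part of the paper):

```python
import numpy as np

D = np.diag([2.0, 5.0])                                    # SPD regularizer
phi  = lambda x: 0.5 * x @ D @ x
grad = lambda x: D @ x
norm_D    = lambda x: np.sqrt(x @ D @ x)
norm_Dinv = lambda x: np.sqrt(x @ np.linalg.inv(D) @ x)    # candidate dual norm

x = np.array([1.0, -2.0]); y = np.array([0.5, 3.0])

# Bregman divergence of the quadratic phi equals half the squared D-norm
breg = phi(x) - phi(y) - grad(y) @ (x - y)
assert np.isclose(breg, 0.5 * norm_D(x - y) ** 2)

# dual-norm check: sup over ||z||_D <= 1 of <x, z> is attained at z ~ D^{-1} x
z_star = np.linalg.inv(D) @ x / norm_Dinv(x)
assert np.isclose(x @ z_star, norm_Dinv(x))
```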

2 The Algorithm and Main Result

In this section, we describe our main algorithm, FLAG, and give its convergence properties in Theorem 1. The core of FLAG consists of five essential ingredients:

  1. A proximal gradient step (Step 4)

  2. Construction of the adaptive regularization (Steps 5-10)

  3. Picking the adaptive stepsize (Step 12)

  4. A mirror descent step (Step 13)

  5. Linear combination of the proximal gradient step and the mirror descent step (Step 14)

Algorithm 1 (FLAG) takes an initial point and an iteration count $T$ as input, and initializes the proximal gradient, mirror descent, and coupled iterate sequences (Steps 1-2). Then, for $k = 1$ to $T$ (Step 3), it takes the proximal gradient step (Step 4), computes the gradient mapping and accumulates it into the adaptive diagonal regularizer (Steps 5-10), invokes the BinarySearch subroutine of Algorithm 2 (Step 11), computes the adaptive stepsize (Step 12), takes the mirror descent step (Step 13), and forms the linear combination of the proximal gradient and mirror descent iterates, returning the final iterate after the loop (Step 14).

Algorithm 1 FLAG
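Purely as a reading aid, here is a structural sketch in Python of the linear-coupling pattern that the five ingredients describe. Every concrete choice in it — the fixed coupling weight `tau`, the constant stepsize `eta`, and the AdaGrad-style diagonal `D` — is our own placeholder and not FLAG's actual update (FLAG picks the coupling point by binary search and the stepsize adaptively):

```python
import numpy as np

def flag_like_sketch(prox, x0, L, T, delta=1e-8):
    """Structural sketch of a linearly-coupled adaptive method: a proximal
    gradient step and a mirror descent step are combined linearly, with an
    AdaGrad-style diagonal regularizer built from gradient mappings.
    The coupling weight and stepsize below are placeholders, not FLAG's."""
    x = np.copy(x0); z = np.copy(x0)
    hist_sq = np.zeros_like(x0)
    for k in range(1, T + 1):
        y = prox(x)                          # proximal gradient step (Step 4)
        g = L * (x - y)                      # gradient mapping at x
        hist_sq += g ** 2                    # accumulate for the regularizer
        D = np.sqrt(hist_sq) + delta         # adaptive diagonal (Steps 5-10)
        eta = 1.0 / L                        # placeholder stepsize (Step 12)
        z = z - eta * g / D                  # mirror descent step w.r.t. D (Step 13)
        tau = 2.0 / (k + 2.0)                # placeholder coupling weight
        x = tau * z + (1.0 - tau) * y        # linear combination (Step 14)
    return y

# usage: smooth quadratic f(x) = 0.5 x^T A x, no non-smooth part, X = R^2,
# so the prox step reduces to a plain gradient step
A = np.diag([50.0, 1.0]); L = 50.0
prox = lambda x: x - (A @ x) / L
print(flag_like_sketch(prox, np.array([1.0, 1.0]), L, T=200))
```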

The subroutine BinarySearch is given in Algorithm 2, where Bisection$(h, [a, b], \epsilon)$ is the usual bisection routine for finding the root of a single-variable function $h$ in the interval $[a, b]$ to the accuracy of $\epsilon$. More specifically, for a root $t^*$ such that $h(t^*) = 0$ and a given $\epsilon > 0$, the sub-routine returns an approximation $\hat{t}$ to $t^*$ such that $|\hat{t} - t^*| \le \epsilon$, and this is done with only $O(\log((b - a)/\epsilon))$ function evaluations.

Algorithm 2 (BinarySearch) takes two points $\mathbf{y}$ and $\mathbf{z}$ and an accuracy $\epsilon$ as input (Step 1) and defines a univariate function on the segment between them (Step 2). If that function already has the appropriate sign at $\mathbf{y}$, it returns $\mathbf{y}$ (Step 3); if at $\mathbf{z}$, it returns $\mathbf{z}$ (Step 4); otherwise, it runs Bisection to accuracy $\epsilon$ (Step 5) and returns the resulting point (Step 6).

Algorithm 2 BinarySearch
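For the Bisection primitive itself, a standard implementation needs only $O(\log((b - a)/\epsilon))$ evaluations of $h$; a minimal generic version (not the FLAG-specific univariate function):

```python
def bisection(h, a, b, eps):
    """Standard bisection: assumes h(a) and h(b) have opposite signs; returns
    a point within eps of a root of h using O(log((b - a) / eps)) evaluations."""
    ha = h(a)
    while b - a > eps:
        mid = 0.5 * (a + b)
        hm = h(mid)
        if ha * hm <= 0:            # sign change in [a, mid]
            b = mid
        else:                       # sign change in [mid, b]
            a, ha = mid, hm
    return 0.5 * (a + b)

print(bisection(lambda t: t ** 2 - 2.0, 0.0, 2.0, 1e-8))  # ~1.41421356
```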

Theorem 1 gives the convergence properties of Algorithm 1:

Theorem 1 (Main Result: Convergence Property of FLAG)

Let . For any , after iterations of Algorithm 1, we get

where . Also each iteration takes time.

2.1 Comparison of FLAG with AdaGrad and FISTA

In this section, we briefly describe the advantages of FLAG relative to AdaGrad and FISTA.

Let be the subgradient of at iteration , , , and . Using a similar argument to that used in Lemma 5, we see that . AdaGrad achieves a rate of

(6)

which is in contrast to mirror descent, which achieves

where is the average of all ’s. Thus, one can see that the improvement of AdaGrad compared to mirror descent is .

Now, let be defined as in Algorithm 1. One can verify using Lemma 5 that and ( is analogous to ). Thus, from Theorem 1, we get

(7)

This is in contrast to FISTA, which achieves

As a result, the improvement of FLAG compared to FISTA is . FLAG's improvement over FISTA can thus be up to the square of AdaGrad's improvement over mirror descent. Here, we note that the two quantities, though analogous, are not the same.

Finally, we can directly compare the rates of (6) and (7) to see the speed-up offered by FLAG over AdaGrad. In particular, FLAG enjoys the optimal rate of $O(1/T^2)$, compared to the sub-optimal rate of $O(1/\sqrt{T})$ for AdaGrad. However, we do stress that AdaGrad does not make any smoothness assumptions, and thus works in more general settings than FLAG.
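As a back-of-the-envelope illustration of this gap (our own arithmetic): reaching accuracy $\epsilon$ requires on the order of $\epsilon^{-1/2}$ iterations at a $1/T^2$ rate, versus on the order of $\epsilon^{-2}$ at a $1/\sqrt{T}$ rate.

```python
import math

eps = 1e-4
print(math.ceil(eps ** -0.5))  # ~100 iterations suffice at the 1/T^2 rate
print(math.ceil(eps ** -2))    # ~100,000,000 at the 1/sqrt(T) rate
```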

3 Proofs

Below we give the details for the proof of our main result. The proofs of technical lemmas in Section 3.1 are given in Appendix A.

The proof sketch of the convergence of FLAG is as follows.

  1. FLAG is essentially a combination of mirror descent and proximal gradient descent steps (see Lemmas 1 and 4).

  2. A quantity in Algorithm 1 plays the role of an "effective gradient Lipschitz constant" in each iteration. The convergence rate of our algorithm ultimately depends on it (see Lemmas 7 and 2).

  3. By picking the regularizer adaptively, as in AdaGrad, we achieve a non-trivial upper bound on this quantity (see Lemma 5).

  4. Our algorithm relies on picking a value at each iteration that satisfies an inequality involving a quantity that is not known at the moment of picking (see Corollary 1). We must therefore choose a value that roughly satisfies the inequality for all possible values of that quantity, which we do by binary search (see Lemmas 2 and 3 and Corollary 1).

  5. Finally, we need to pick the right stepsizes for each iteration. Our scheme is very similar to the one used in [1], but generalized to handle a different effective Lipschitz constant in each iteration (see Lemmas 6 and 7 and Corollary 2).

Theorem 2 combines items 3 and 4. Theorem 1 combines Theorem 2 with items 3 and 5 to get our final result.

3.1 Technical Lemmas

We have the following key result (see Lemma 2.3 of [2]) regarding the vector computed in Step 5 of FLAG, which is called the Gradient Mapping of $F$ on $\mathcal{X}$.

Lemma 1 (Gradient Mapping)

For any , we have

where is defined as in (3). In particular, .

The following lemma establishes the Lipschitz continuity of the prox operator.

Lemma 2 (Prox Operator Continuity)

The prox operator is Lipschitz continuous, that is, for any $\mathbf{y}$ and $\mathbf{z}$, we have

Using the prox operator continuity of Lemma 2, we can conclude that there must be an intermediate point at which the desired equality holds. Algorithm 2 finds an approximation to it in $O(\log(1/\epsilon))$ iterations.

Lemma 3 (Binary Search Lemma)

With the quantities defined as in Algorithm 2, one of three cases happens:

  1. and ,

  2. and , or

  3. for some and .

Using the above result, we can prove the following:

Corollary 1

Let , , and be defined as in Algorithm 1 and . Then for all ,

Next, we state a result regarding the mirror descent step. Similar results can be found in most texts on online optimization, e.g., [1].

Lemma 4 (Mirror Descent Inequality)

Let and be the diameter of measured by infinity norm. Then for any , we have

Finally, we state a result similar to that of [7], which captures the benefits of the adaptive regularizer used in FLAG.

Lemma 5 (AdaGrad Inequalities)

Define , where is as in Step 7 of Algorithm 1. We have

  1. , where , and

  2. .
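The second item is, in spirit, the classical AdaGrad potential bound. As a quick numerical illustration, its standard scalar form, $\sum_{t} a_t / \sqrt{\sum_{s \le t} a_s} \le 2\sqrt{\sum_t a_t}$ for non-negative $a_t$ (our rendition; the lemma itself is stated in terms of the paper's quantities), can be checked directly:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(10_000)             # stand-ins for squared gradient entries
prefix = np.cumsum(a)
lhs = np.sum(a / np.sqrt(prefix))
rhs = 2.0 * np.sqrt(prefix[-1])
assert lhs <= rhs                  # holds for any non-negative sequence
print(lhs, rhs)
```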

3.2 Master Theorem

We can now prove the central theorems of this section.

Theorem 2

Let . For any , after iterations of Algorithm 1, we get

Proof (Proof of Theorem 2)

Noting that the relevant vector is the gradient mapping of $F$ on $\mathcal{X}$, the claimed bound follows from a chain of inequalities obtained by applying, in turn, the Mirror Descent Inequality (Lemma 4), Step 11 of Algorithm 1, Corollary 1, and the Gradient Mapping Lemma (Lemma 1). Rearranging terms and re-indexing the summations then gives the desired result.

3.3 Choosing the Stepsize

In this section, we discuss the final piece of our algorithm: choosing the stepsize for the mirror descent step.

Lemma 6

For the choice of in Algorithm 1 and , we have

  1. ,

  2. , and

  3. .

Proof

We prove (i) by induction. For the base case, it is easy to verify that the claim holds trivially. Now suppose (i) holds at the previous iteration. Re-arranging (i) for the current iteration gives a quadratic equation, and it is easy to verify that the choice of stepsize in Algorithm 1 is a solution of this quadratic equation. The rest of the items follow immediately from part (i).
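For intuition, the classical FISTA stepsize recurrence [2] arises from exactly this kind of quadratic; a quick check of that recurrence (our illustration, using FISTA's recurrence rather than FLAG's generalized one):

```python
import math

# FISTA's stepsize recurrence (Beck & Teboulle [2]): t_{k+1} is the positive
# root of t^2 - t - t_k^2 = 0, i.e. t_{k+1} = (1 + sqrt(1 + 4 t_k^2)) / 2.
t = 1.0
for k in range(1, 20):
    t_next = 0.5 * (1.0 + math.sqrt(1.0 + 4.0 * t * t))
    assert abs(t_next ** 2 - t_next - t ** 2) < 1e-9  # solves the quadratic
    assert t_next >= (k + 2) / 2.0                    # grows at least linearly
    t = t_next
```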

Corollary 2

Let . For any , after iterations of Algorithm 1, we get

Proof (Proof of Corollary 2)

Combining Theorem 2 and Lemma 6, and noting that gives the desired result.

Finally, it only remains to lower bound the relevant quantity, which is done in the following lemma.

Lemma 7

For the choice of in Algorithm 1, we have

Remark 1

We note here that we made little effort to minimize constants, and that we used rather sloppy bounds. As a result, the constant appearing above is very conservative and a mere by-product of our proof technique. We have numerically verified that a much smaller constant indeed satisfies the bound above.

And now, the proof of our main result, Theorem 1, follows rather immediately:

Proof (Proof of Theorem 1)

The result follows immediately from Lemma 7 and Corollary 2, together with the bounds from Lemma 5 and Step 8 of Algorithm 1. This gives

The run-time per iteration follows from having to make calls to bisection, each taking time.

References

Appendix A Proofs of Technical Lemmas

Proof (Proof of Lemma 1)

This result is the same as Lemma 2.3 in [2]. We include its proof here for completeness.

For any , any sub-gradient, , of at , i.e., , and by optimality of in (3), we have

and so

Now, from the $L$-Lipschitz continuity of $\nabla f$ as well as the convexity of $f$ and $h$, we get

Proof (Proof of Lemma 2)

By Definition (3), for any , , and , we have

In particular, for and , we get

By monotonicity of sub-gradient, we get

So

and as a result

which gives

and the result follows.

Proof (Proof of Lemma 3)

Items (i) and (ii) are simply Steps 3 and 4 of Algorithm 2, respectively. For item (iii), the claim follows from the bisection guarantee together with Lemma 2.

Proof (Proof of Corollary 1)

Note that, by Step 5 of Algorithm 1, . For the first case, since , the inequality is trivially true. Otherwise, we consider the three cases of Lemma 3: (i) if , the right-hand side is and the left-hand side is ; (ii) if , the left-hand side is and , so the inequality holds trivially; and (iii) in the last case, for some , we have

and