1 Introduction
We consider the problem of minimizing a convex function $f$ over a convex set $\mathcal{C}$, and we focus on first-order methods, which exploit only value and gradient information about $f$. These methods have become very important for many of the large-scale optimization problems that arise in machine learning applications. Two techniques have emerged as powerful tools for large-scale optimization. First, Nesterov's accelerated algorithms [11] and their proximal variant, the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) [2, 3], can exploit smoothness to improve the convergence rate of a simple gradient/ISTA method from an accuracy of $O(1/T)$ to $O(1/T^2)$ after $T$ iterations. Second, adaptive regularization approaches such as AdaGrad [7] can adapt a gradient method's stepsize in different directions, in some sense making the optimization problem better-conditioned. These two techniques have also become popular in nonconvex optimization problems, such as parameter estimation in deep neural networks (see, for example, the work of [13] on the benefits of acceleration, and AdaGrad [7, 6], RMSProp [14], ESGD [5], and Adam [9] on adaptive regularization methods in this context).

In this paper, we introduce a new algorithm, called FLAG, that combines the benefits of these two techniques. Like Nesterov's accelerated algorithms, our method has a convergence rate of $O(1/T^2)$. Like AdaGrad, our method adaptively chooses a regularizer, in a way that performs almost as well as the best choice of regularizer in hindsight. In addition, our improvement over FISTA is roughly the square of AdaGrad's improvement over mirror descent (see Section 2.1 for details).
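To make the adaptive-regularization idea concrete, here is a minimal sketch of AdaGrad's per-coordinate diagonal scaling on a badly conditioned quadratic; the test function, stepsize, and all names are our own illustrative choices, not notation from any cited paper.

```python
import numpy as np

def adagrad_step(x, g, acc, lr=0.1, eps=1e-8):
    """One AdaGrad step: accumulate squared gradients per coordinate,
    then scale each coordinate's step by the inverse root of its
    accumulator, so steep directions automatically get smaller steps."""
    acc = acc + g * g
    return x - lr * g / (np.sqrt(acc) + eps), acc

# f(x) = 0.5 * x^T diag(D) x is poorly conditioned: curvatures 100 and 1
D = np.array([100.0, 1.0])
x, acc = np.array([1.0, 1.0]), np.zeros(2)
f0 = 0.5 * x @ (D * x)
for _ in range(200):
    x, acc = adagrad_step(x, D * x, acc)   # gradient of f is D * x
assert 0.5 * x @ (D * x) < f0              # objective has decreased
```

The per-coordinate accumulator is what "makes the problem better-conditioned": the steep coordinate and the flat coordinate end up taking steps of comparable effective size.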
There have been a number of papers in recent years dealing with ways to interpret acceleration. Our algorithm is heavily influenced by [1], an insightful paper that introduced a view of accelerated gradient descent as a linear combination of gradient and mirror descent steps. We exploit this viewpoint in introducing adaptive regularization to our accelerated method. The authors of [1] use a fixed schedule to combine gradient and mirror descent steps. By contrast, [4] views acceleration as an ellipsoid-like algorithm and instead determines an appropriate ratio using a line search. This latter idea is also crucial for our algorithm, as we do not know the right ratio in advance.
The optimal stepsize for an accelerated method generally depends on both the smoothness parameter of the objective function as well as the properties of the regularizer. In FLAG, the regularizer is chosen adaptively, and we show that it is also possible to adapt the stepsize in a way that is competitive with the best choice in hindsight. In fact, our method for picking the adaptive stepsizes is closely related to the approach in [8]. There, the authors considered the problem of picking the right stepsize to adapt to an unknown strongconvexity parameter. We use a related approach, but with different proof techniques, to choose the stepsize to adapt to a changing smoothness parameter.
Finally, it should also be noted that there are some interesting papers that study the continuoustime limit of acceleration algorithms, e.g., [12, 10, 15]. Indeed, studying adaptive regularization in the continuous time setting is an interesting future direction for research.
1.1 Notation and Definitions
In what follows, vectors are considered as column vectors and are denoted by bold lower case letters, e.g., $\mathbf{v}$, and matrices are denoted by regular capital letters, e.g., $A$. We overload the "diag" operator as follows: for a given matrix $A$ and a vector $\mathbf{v}$, $\mathrm{diag}(A)$ and $\mathrm{diag}(\mathbf{v})$ denote the vector made from the diagonal elements of $A$ and the diagonal matrix made from the elements of $\mathbf{v}$, respectively. The vector norms $\|\cdot\|_1$, $\|\cdot\|_2$, and $\|\cdot\|_\infty$ denote the standard $\ell_1$, $\ell_2$, and $\ell_\infty$ norms, respectively. We adopt the Matlab notation for accessing the elements of vectors and matrices, i.e., the $i$-th component of a vector $\mathbf{v}$ is indicated by $\mathbf{v}(i)$, and $A(i,:)$ denotes the entire $i$-th row of the matrix $A$. Finally, $[A, \mathbf{b}]$ signifies that the matrix $A$ is augmented with the column vector $\mathbf{b}$.

Consider the optimization problem
(1) $\min_{\mathbf{x} \in \mathcal{C}} F(\mathbf{x}) := f(\mathbf{x}) + h(\mathbf{x}),$
where $f$ and $h$ are closed proper convex functions and $\mathcal{C}$ is a closed convex set. We further assume that $f$ is differentiable with Lipschitz gradient, i.e., for some $L > 0$,
(2) $\|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\|_2 \le L \|\mathbf{x} - \mathbf{y}\|_2, \quad \forall \mathbf{x}, \mathbf{y} \in \mathcal{C},$
and we allow $h$ to be (possibly) nonsmooth, with subdifferential at $\mathbf{x}$ denoted by $\partial h(\mathbf{x})$.
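The Lipschitz-gradient condition (2) is easy to sanity-check numerically. For a quadratic $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T Q \mathbf{x}$ the smallest valid constant is $L = \lambda_{\max}(Q)$; the sketch below (our own construction) verifies the inequality at random point pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
Q = A.T @ A                      # PSD Hessian of f(x) = 0.5 * x^T Q x
L = np.linalg.eigvalsh(Q).max()  # smallest valid Lipschitz constant

grad = lambda x: Q @ x           # gradient of the quadratic
for _ in range(100):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    lhs = np.linalg.norm(grad(x) - grad(y))
    assert lhs <= L * np.linalg.norm(x - y) + 1e-9
```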
The proximal operator [11] associated with $h$, $\mathcal{C}$, and an SPD matrix $S$ is defined as

(3a) $\mathrm{prox}_S(\mathbf{y}) := \arg\min_{\mathbf{x} \in \mathcal{C}} \; h(\mathbf{x}) + \langle \nabla f(\mathbf{y}), \mathbf{x} - \mathbf{y} \rangle + \tfrac{1}{2}\|\mathbf{x} - \mathbf{y}\|_S^2$

(3b) $\phantom{\mathrm{prox}_S(\mathbf{y}) :}= \arg\min_{\mathbf{x} \in \mathcal{C}} \; h(\mathbf{x}) + \tfrac{1}{2}\big\|\mathbf{x} - \big(\mathbf{y} - S^{-1}\nabla f(\mathbf{y})\big)\big\|_S^2.$
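As a concrete instance, take $S = \mathrm{diag}(\mathbf{s})$, $h = \lambda\|\cdot\|_1$, and $\mathcal{C} = \mathbb{R}^d$; a scaled prox step of this kind then separates across coordinates into a soft-threshold of the gradient step. A minimal sketch (the inputs and names are our own illustration, not the paper's Algorithm 1):

```python
import numpy as np

def prox_step(y, grad_y, s, lam):
    """Scaled prox with S = diag(s), h = lam * ||.||_1, C = R^d:
    argmin_x  lam*||x||_1 + <grad_y, x - y> + 0.5 * (x-y)^T S (x-y).
    Completing the square gives a coordinatewise soft-threshold
    of the scaled gradient step v = y - grad_y / s."""
    v = y - grad_y / s
    return np.sign(v) * np.maximum(np.abs(v) - lam / s, 0.0)

y = np.array([1.0, -0.5])
grad_y = np.array([0.5, 0.5])    # e.g. the gradient of a smooth f at y
x_plus = prox_step(y, grad_y, s=np.array([2.0, 2.0]), lam=0.5)
assert np.allclose(x_plus, [0.5, -0.5])
```

The diagonal case is exactly the situation an AdaGrad-style regularizer produces, which is why the prox remains cheap to evaluate.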
For a symmetric positive definite (SPD) matrix $S$, define $\|\mathbf{x}\|_S := \sqrt{\langle \mathbf{x}, S\mathbf{x} \rangle}$. Note that $\psi(\mathbf{x}) := \tfrac{1}{2}\|\mathbf{x}\|_S^2$ is strongly convex with respect to the norm $\|\cdot\|_S$, i.e., for all $\mathbf{x}, \mathbf{y}$,
$\psi(\mathbf{y}) \ge \psi(\mathbf{x}) + \langle \nabla \psi(\mathbf{x}), \mathbf{y} - \mathbf{x} \rangle + \tfrac{1}{2}\|\mathbf{y} - \mathbf{x}\|_S^2.$
The Bregman divergence associated with $\psi$ is defined as $B_\psi(\mathbf{x}, \mathbf{y}) := \psi(\mathbf{x}) - \psi(\mathbf{y}) - \langle \nabla \psi(\mathbf{y}), \mathbf{x} - \mathbf{y} \rangle$.
We will use the fact that, for any $\mathbf{x}, \mathbf{y}, \mathbf{z}$,
(4) $\langle \nabla\psi(\mathbf{y}) - \nabla\psi(\mathbf{x}), \mathbf{z} - \mathbf{x} \rangle = B_\psi(\mathbf{z}, \mathbf{x}) + B_\psi(\mathbf{x}, \mathbf{y}) - B_\psi(\mathbf{z}, \mathbf{y}).$
It is easy to see that the dual norm of $\|\cdot\|_S$ is given by
(5) $\|\mathbf{x}\|_S^* = \|\mathbf{x}\|_{S^{-1}} = \sqrt{\langle \mathbf{x}, S^{-1}\mathbf{x} \rangle}.$
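A quick numerical check of (5), with a randomly built SPD matrix of our own choosing: the maximizer of $\langle \mathbf{g}, \mathbf{x}\rangle$ over the unit $\|\cdot\|_S$-ball is $S^{-1}\mathbf{g}$ (normalized), and the value attained is $\|\mathbf{g}\|_{S^{-1}}$.

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((4, 4))
S = B.T @ B + 4 * np.eye(4)              # an SPD matrix
Sinv = np.linalg.inv(S)
norm = lambda x, M: np.sqrt(x @ M @ x)   # ||x||_M for SPD M

g = rng.standard_normal(4)
# candidate maximizer of <g, x> subject to ||x||_S <= 1
x_star = Sinv @ g / norm(g, Sinv)
assert np.isclose(g @ x_star, norm(g, Sinv))   # attains ||g||_{S^{-1}}
# and no random point on the unit S-ball does better (Cauchy-Schwarz)
for _ in range(200):
    x = rng.standard_normal(4)
    x = x / norm(x, S)
    assert g @ x <= norm(g, Sinv) + 1e-9
```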
2 The Algorithm and Main Result
In this section, we describe our main algorithm, FLAG, and give its convergence properties in Theorem 1. The core of FLAG consists of five essential ingredients:
The subroutine BinarySearch is given in Algorithm 2, where $\mathrm{Bisection}(g, a, b, \varepsilon)$ is the usual bisection routine for finding a root of a single-variable function $g$ in the interval $[a, b]$ to an accuracy of $\varepsilon$. More specifically, for a root $z$ such that $g(z) = 0$ and a given $\varepsilon$, the subroutine returns an approximation $\hat{z}$ to $z$ such that $|\hat{z} - z| \le \varepsilon$, and this is done with only $O(\log((b-a)/\varepsilon))$ function evaluations.
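The bisection primitive itself is standard; the following is a minimal self-contained sketch of it (our own implementation, not the paper's Algorithm 2, which wraps such a routine), including the evaluation count claimed above.

```python
import math

def bisection(g, a, b, eps):
    """Find a root of g in [a, b], assuming g(a) and g(b) have opposite
    signs, to within eps; also count midpoint evaluations of g."""
    ga, evals = g(a), 0
    while b - a > 2 * eps:
        mid = (a + b) / 2.0
        gm = g(mid)
        evals += 1
        if ga * gm <= 0:     # sign change in the left half
            b = mid
        else:                # sign change in the right half
            a, ga = mid, gm
    return (a + b) / 2.0, evals

root, n = bisection(lambda z: z * z - 2.0, 0.0, 2.0, 1e-6)
assert abs(root - math.sqrt(2.0)) <= 1e-6
assert n <= math.ceil(math.log2((2.0 - 0.0) / 1e-6))
```

Each iteration halves the bracketing interval, which is exactly where the $O(\log((b-a)/\varepsilon))$ evaluation count comes from.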
Theorem 1 (Main Result: Convergence Property of FLAG)
2.1 Comparison of FLAG with AdaGrad and FISTA
In this section, we briefly describe the advantages of FLAG relative to AdaGrad and FISTA.
Let $\mathbf{g}_t \in \partial F(\mathbf{x}_t)$ be the subgradient at iteration $t$, $G := [\mathbf{g}_1, \ldots, \mathbf{g}_T]$, $\gamma_T := \sum_{i=1}^{d} \|G(i,:)\|_2 / \sqrt{T}$, and let $R_\infty$ denote the $\ell_\infty$ diameter of $\mathcal{C}$. Using a similar argument to that used in Lemma 5, we see that $\gamma_T \le \sqrt{d} \max_t \|\mathbf{g}_t\|_2$. AdaGrad achieves a rate of

(6) $F(\bar{\mathbf{x}}_T) - F(\mathbf{x}^*) \le O\!\left(\gamma_T R_\infty / \sqrt{T}\right),$

which is in contrast to mirror descent, which achieves

$F(\bar{\mathbf{x}}_T) - F(\mathbf{x}^*) \le O\!\left(\sqrt{d}\, \bar{g}\, R_\infty / \sqrt{T}\right),$

where $\bar{g}^2$ is the average of all the $\|\mathbf{g}_t\|_2^2$'s. Thus, one can see that the improvement of AdaGrad compared to mirror descent is the factor $\sqrt{d}\, \bar{g} / \gamma_T \in [1, \sqrt{d}]$.
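The gap between the AdaGrad quantity and its worst-case bound is easy to see numerically when gradients are sparse. In the sketch below (our own construction), we take $\gamma_T = \sum_i \|G(i,:)\|_2/\sqrt{T}$, the sum of per-coordinate norms across iterations, and compare it against $\sqrt{d}$ times the RMS gradient norm; the inequality between them is Cauchy-Schwarz and always holds.

```python
import numpy as np

rng = np.random.default_rng(2)
d, T = 100, 1000
# sparse gradients: in each round only ~5% of coordinates are nonzero
G = rng.standard_normal((d, T)) * (rng.random((d, T)) < 0.05)

gamma_T = np.linalg.norm(G, axis=1).sum() / np.sqrt(T)    # AdaGrad quantity
g_rms = np.sqrt((np.linalg.norm(G, axis=0) ** 2).mean())  # RMS of ||g_t||_2
print(gamma_T, np.sqrt(d) * g_rms)   # gamma_T is noticeably smaller here
assert gamma_T <= np.sqrt(d) * g_rms + 1e-9
```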
Now, let $\hat{\mathbf{g}}_t$ be defined as in Algorithm 1, and define $\hat{\gamma}_T$ from the $\hat{\mathbf{g}}_t$'s in the same way that $\gamma_T$ is defined from the $\mathbf{g}_t$'s. One can verify using Lemma 5 that $\hat{\gamma}_T \le \sqrt{d} \max_t \|\hat{\mathbf{g}}_t\|_2 \le \sqrt{dL}$. Thus, from Theorem 1, we get

(7) $F(\mathbf{y}_T) - F(\mathbf{x}^*) \le O\!\left(\hat{\gamma}_T^2 R_\infty^2 / T^2\right).$

This is in contrast to FISTA, which achieves

$F(\mathbf{y}_T) - F(\mathbf{x}^*) \le O\!\left(L R_2^2 / T^2\right) \le O\!\left(d L R_\infty^2 / T^2\right).$

As a result, the improvement of FLAG compared to FISTA is the factor $dL/\hat{\gamma}_T^2 \ge 1$. FLAG's improvement over FISTA can thus be up to the square of AdaGrad's improvement over mirror descent. Here, we note that $\gamma_T$ and $\hat{\gamma}_T$, though analogous, are not the same.
Finally, we can directly compare the rates in (6) and (7) to see the speedup offered by FLAG over AdaGrad. In particular, FLAG enjoys the optimal rate of $O(1/T^2)$, compared to the suboptimal rate of $O(1/\sqrt{T})$ of AdaGrad. However, we stress that AdaGrad does not make any smoothness assumptions, and thus works in more general settings than FLAG.
3 Proofs
Below we give the details for the proof of our main result. The proofs of technical lemmas in Section 3.1 are given in Appendix A.
The proof sketch of the convergence of FLAG is as follows.

By picking the regularizer adaptively, as in AdaGrad, we achieve a nontrivial upper bound for the resulting mirror descent term (see Lemma 5).

Our algorithm relies on picking, at each iteration, a mixing weight that satisfies an inequality involving the next iterate (see Corollary 1). However, because the next iterate is not known at the moment the weight is picked, we must choose a weight that roughly satisfies the inequality for all possible values. We do this by picking the weight using binary search (see Lemmas 2 and 3 and Corollary 1).
Theorem 2 combines items I, II, and IV. Theorem 1 combines Theorem 2 with items III and V to get our final result.
3.1 Technical Lemmas
We have the following key result (see Lemma 2.3 of [2]) regarding the vector computed in Step 5 of FLAG, which is called the gradient mapping of $F$ on $\mathcal{C}$.
Lemma 1 (Gradient Mapping)
The following lemma establishes the Lipschitz continuity of the prox operator.
Lemma 2 (Prox Operator Continuity)
is Lipschitz continuous; that is, for any pair of its arguments, we have
Using the prox operator continuity of Lemma 2, we can conclude that if the quantity of interest lies below the target at one endpoint and above it at the other, then there must be an intermediate point at which the target value is attained. Algorithm 2 finds an approximation to this point in $O(\log(1/\varepsilon))$ iterations.
Lemma 3 (Binary Search Lemma)
Using the above result, we can prove the following:
Corollary 1
Next, we state a result regarding the mirror descent step. Similar results can be found in most texts on online optimization, e.g. [1].
Lemma 4 (Mirror Descent Inequality)
Let $R_\infty := \max_{\mathbf{x}, \mathbf{y} \in \mathcal{C}} \|\mathbf{x} - \mathbf{y}\|_\infty$ be the diameter of $\mathcal{C}$ measured in the infinity norm. Then for any $\mathbf{x} \in \mathcal{C}$, we have
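A single mirror descent step in this geometry can be sketched concretely; the diagonal regularizer and box constraint set below are our illustrative assumptions, under which the Bregman projection reduces to a scaled gradient step followed by coordinatewise clipping.

```python
import numpy as np

def md_step_box(x, g, s, eta, lo=-1.0, hi=1.0):
    """One mirror descent step with psi(x) = 0.5 * x^T diag(s) x over the
    box C = [lo, hi]^d:  argmin_{z in C} <g, z> + (1/eta) * B_psi(z, x).
    For diagonal S and a box, the minimizer is the scaled gradient step
    x - eta * g / s clipped coordinatewise onto the box."""
    return np.clip(x - eta * g / s, lo, hi)

x = np.array([0.5, -0.2])
g = np.array([2.0, -1.0])
step = md_step_box(x, g, s=np.array([4.0, 1.0]), eta=1.0)
assert np.allclose(step, [0.0, 0.8])
```

For this box, the $\ell_\infty$ diameter in the sense above is simply $hi - lo$, which is the quantity the lemma's bound is stated in terms of.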
Finally, we state a result similar to that of [7], which captures the benefits of the adaptively chosen regularizer used in FLAG.
3.2 Master Theorem
We can now prove the central theorems of this section.
Theorem 2
3.3 Choosing the Stepsize
In this section, we discuss the final piece of our algorithm: choosing the stepsize for the mirror descent step.
Lemma 6
Proof
We prove (i) by induction. For the base case, it is easy to verify that the claim holds trivially. Now suppose (i) holds at iteration $t$. Rearranging (i) for $t+1$ gives

Now, it is easy to verify that the stepsize chosen in Algorithm 1 is a solution of the above quadratic equation. The rest of the items follow immediately from part (i).
Corollary 2
Proof (Proof of Corollary 2)
Finally, it only remains to establish a lower bound, which is done in the following lemma.
Lemma 7
Remark 1
We note here that we made little effort to minimize constants, and that we used rather sloppy bounds. As a result, the constant appearing above is very conservative and a mere by-product of our proof technique. We have numerically verified that a much smaller constant indeed satisfies the bound above.
And now, the proof of our main result, Theorem 1, follows rather immediately:
References
 [1] Zeyuan Allen-Zhu and Lorenzo Orecchia. Linear coupling: An ultimate unification of gradient and mirror descent. arXiv preprint arXiv:1407.1537, 2014.
 [2] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
 [3] Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3–4):231–357, 2015.
 [4] Sébastien Bubeck, Yin Tat Lee, and Mohit Singh. A geometric alternative to Nesterov’s accelerated gradient descent. arXiv preprint arXiv:1506.08187, 2015.
 [5] Yann Dauphin, Harm de Vries, and Yoshua Bengio. Equilibrated adaptive learning rates for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 1504–1512, 2015.
 [6] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1223–1231, 2012.
 [7] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.
 [8] Elad Hazan, Alexander Rakhlin, and Peter L Bartlett. Adaptive online gradient descent. In Advances in Neural Information Processing Systems, pages 65–72, 2007.
 [9] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [10] Walid Krichene, Alexandre Bayen, and Peter L Bartlett. Accelerated mirror descent in continuous and discrete time. In Advances in Neural Information Processing Systems, pages 2827–2835, 2015.
 [11] Neal Parikh and Stephen P Boyd. Proximal algorithms. Foundations and Trends in optimization, 1(3):127–239, 2014.
 [12] Weijie Su, Stephen Boyd, and Emmanuel Candes. A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights. In Advances in Neural Information Processing Systems, pages 2510–2518, 2014.

 [13] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1139–1147, 2013.
 [14] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4, 2012.
 [15] Andre Wibisono, Ashia C Wilson, and Michael I Jordan. A variational perspective on accelerated methods in optimization. arXiv preprint arXiv:1603.04245, 2016.
Appendix A Proofs of Technical Lemmas
Proof (Proof of Lemma 1)
This result is the same as Lemma 2.3 in [2]. We bring its proof here for completeness.
For any $\mathbf{x} \in \mathcal{C}$, any subgradient $\mathbf{p} \in \partial h(\mathbf{x}^+)$ of $h$ at the prox output $\mathbf{x}^+$, and by the optimality of $\mathbf{x}^+$ in (3), we have
and so
Now, from the Lipschitz continuity of $\nabla f$ as well as the convexity of $f$ and $h$, we get
Proof (Proof of Lemma 2)
By the monotonicity of the subdifferential, we get
So
and as a result
which gives
and the result follows.
Proof (Proof of Lemma 3)
Proof (Proof of Corollary 1)
Note that, by Step 5 of Algorithm 1, . For , since , the inequality is trivially true. For , we consider the three cases of Corollary 3: (i) if , the right-hand side is and the left-hand side is ; (ii) if , the left-hand side is and , so the inequality holds trivially; and (iii) in the last case, for some , we have
and