We consider the problem of minimizing a convex function f over a convex set C, and we focus on first-order methods, which exploit only value and gradient information about f. These methods have become very important for many of the large-scale optimization problems that arise in machine learning applications. Two techniques have emerged as powerful tools for large-scale optimization. First, Nesterov’s accelerated algorithms and their proximal variant, the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) [2, 3], can exploit smoothness to improve the convergence rate of a simple gradient/ISTA method from O(1/k) to O(1/k^2) accuracy after k iterations. Second, adaptive regularization approaches such as AdaGrad [7] can adapt a gradient method’s step-size in different directions, in some sense making the optimization problem better-conditioned. These two techniques have also become popular in non-convex optimization problems, such as parameter estimation in deep neural networks (see, for example, the work of Sutskever et al. [13] on the benefits of acceleration, and AdaGrad [7, 6], RMSProp [14], ESGD [5], and Adam [9] on adaptive regularization methods in this context).
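To make the adaptive-regularization idea concrete, the following is a minimal sketch (our own illustration, not any algorithm from this paper) of a diagonal-AdaGrad update applied to a toy ill-conditioned quadratic; the objective, learning rate, and iteration count are arbitrary choices for the demonstration.

```python
import numpy as np

def adagrad_step(x, grad, accum, lr=0.1, eps=1e-8):
    """One diagonal-AdaGrad update: the accumulated squared gradients
    shrink the per-coordinate step size in directions that have seen
    large gradients, better conditioning the effective problem."""
    accum = accum + grad ** 2                      # running sum of squared gradients
    x = x - lr * grad / (np.sqrt(accum) + eps)     # per-coordinate scaled step
    return x, accum

# Toy ill-conditioned quadratic f(x) = 50*x1^2 + 0.5*x2^2.
x = np.array([1.0, 1.0])
accum = np.zeros(2)
for _ in range(500):
    grad = np.array([100.0 * x[0], 1.0 * x[1]])
    x, accum = adagrad_step(x, grad, accum)
```

Note how the steep coordinate automatically receives a much smaller effective step than the flat one, even though a single scalar learning rate is used.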
In this paper, we introduce a new algorithm, called FLAG, that combines the benefits of these two techniques. Like Nesterov’s accelerated algorithms, our method has a convergence rate of O(1/k^2). Like AdaGrad, our method adaptively chooses a regularizer, in a way that performs almost as well as the best choice of regularizer in hindsight. In addition, our improvement over FISTA is roughly the square of AdaGrad’s improvement over mirror descent (see Section 2.1 for details).
There have been a number of papers in recent years dealing with ways to interpret acceleration. Our algorithm is heavily influenced by [1]. That insightful paper introduced a view of accelerated gradient descent as a linear combination of gradient and mirror descent steps, and we exploit this viewpoint in introducing adaptive regularization to our accelerated method. The authors of [1] use a fixed schedule to combine the gradient and mirror descent steps. In contrast, [4] views acceleration as an ellipsoid-like algorithm and instead determines an appropriate ratio using a line search. This latter idea is also crucial for our algorithm, as we do not know the right ratio in advance.
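The linear-coupling view described above can be sketched schematically as follows. This is our own simplification (unconstrained Euclidean case, a fixed coupling ratio tau, and an arbitrary mirror stepsize alpha), not FLAG itself: each iterate is a convex combination of a gradient-step sequence and a mirror-step sequence.

```python
import numpy as np

def linear_coupling(grad, x0, L, T, tau=0.5, alpha=0.01):
    """Schematic linear coupling of gradient and mirror descent:
    x couples the two sequences, y takes a gradient step from x,
    and z takes a mirror (dual-averaging) step from z."""
    x = y = z = np.asarray(x0, dtype=float)
    for _ in range(T):
        x = tau * z + (1.0 - tau) * y   # couple the two sequences
        g = grad(x)
        y = x - g / L                   # gradient step with smoothness constant L
        z = z - alpha * g               # mirror step (Euclidean case)
    return y

# Toy problem f(x) = 0.5*||x||^2, so grad(x) = x and L = 1.
y = linear_coupling(lambda x: x, np.ones(2), L=1.0, T=100)
```

In the full accelerated method the ratio tau varies across iterations; the point of the sketch is only the three coupled sequences.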
The optimal stepsize for an accelerated method generally depends on both the smoothness parameter of the objective function and the properties of the regularizer. In FLAG, the regularizer is chosen adaptively, and we show that it is also possible to adapt the stepsize in a way that is competitive with the best choice in hindsight. In fact, our method for picking the adaptive stepsizes is closely related to the approach in [8]. There, the authors considered the problem of picking the right stepsize to adapt to an unknown strong-convexity parameter. We use a related approach, but with different proof techniques, to choose the stepsize to adapt to a changing smoothness parameter.
Finally, it should also be noted that there are some interesting papers that study the continuous-time limit of accelerated algorithms, e.g., [12, 10, 15]. Indeed, studying adaptive regularization in the continuous-time setting is an interesting direction for future research.
1.1 Notation and Definitions
In what follows, vectors are considered as column vectors and are denoted by bold lower case letters, e.g., x, and matrices are denoted by regular capital letters, e.g., A. We overload the “diag” operator as follows: for a given matrix A and a vector v, diag(A) and diag(v) denote the vector made from the diagonal elements of A and the diagonal matrix made from the elements of v, respectively. The vector norms ‖·‖₁, ‖·‖₂, and ‖·‖∞ denote the standard ℓ₁, ℓ₂, and ℓ∞ norms, respectively. We adopt the Matlab notation for accessing the elements of vectors and matrices, i.e., the i-th component of a vector x is indicated by x(i), and A(i, :) denotes the entire i-th row of the matrix A. Finally, [A, b] signifies that the matrix A is augmented with the column vector b.
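The overloaded “diag” operator mirrors the behavior of, e.g., NumPy’s np.diag, which dispatches on whether its argument is a matrix or a vector:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
v = np.array([5.0, 6.0])

# diag of a matrix -> vector of its diagonal elements
d = np.diag(A)
# diag of a vector -> diagonal matrix built from its elements
D = np.diag(v)
```

Here d is the vector (1, 4) and D is the 2x2 diagonal matrix with 5 and 6 on its diagonal.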
Consider the optimization problem

    min_{x ∈ C} F(x) := f(x) + h(x),    (1)

where f and h are closed proper convex functions and C is a closed convex set. We further assume that f is differentiable with L-Lipschitz gradient, i.e.,

    ‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖,  ∀ x, y ∈ C,    (2)

and we allow h to be (possibly) non-smooth, with sub-differential at x denoted by ∂h(x).
The proximal operator [11] associated with h, C, and a step-size t > 0 is defined as

    prox_t(x) := argmin_{y ∈ C} { (1/(2t)) ‖y − x‖² + h(y) }.    (3)
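For one concrete instance of the proximal operator (our illustration; the choices h(y) = λ‖y‖₁ and C = R^d are assumptions, not the paper’s setting), the prox has the well-known soft-thresholding closed form:

```python
import numpy as np

def prox_l1(x, t, lam=1.0):
    """Proximal operator of h(y) = lam * ||y||_1 with step t:
    argmin_y (1/(2t)) * ||y - x||^2 + lam * ||y||_1,
    which reduces to coordinate-wise soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - t * lam, 0.0)

x = np.array([2.0, -0.3, 0.5])
y = prox_l1(x, t=0.5)   # shrinks each coordinate toward zero by t*lam
```

Coordinates with magnitude below t·λ are set exactly to zero, which is what makes this prox useful for sparsity-inducing objectives.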
For a symmetric positive definite (SPD) matrix S, define ψ(x) := (1/2) ⟨x, Sx⟩. Note that ψ is 1-strongly convex with respect to the norm ‖x‖_S := √⟨x, Sx⟩, i.e., for all x, y,

    ψ(y) ≥ ψ(x) + ⟨∇ψ(x), y − x⟩ + (1/2) ‖y − x‖²_S.
The Bregman divergence associated with ψ is defined as

    B_ψ(x, y) := ψ(x) − ψ(y) − ⟨∇ψ(y), x − y⟩.

We will use the three-point identity: for any x, y, z,

    B_ψ(z, x) = B_ψ(z, y) + B_ψ(y, x) + ⟨∇ψ(y) − ∇ψ(x), z − y⟩.
It is easy to see that the dual of the norm ‖·‖_S is given by ‖·‖_{S⁻¹}.
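Both facts are easy to verify numerically for the quadratic ψ above; the following is a quick sanity check (random SPD matrix and points of our choosing).

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
S = M @ M.T + 3.0 * np.eye(3)          # random SPD matrix
x, y, z = rng.standard_normal((3, 3))  # three random points

psi  = lambda u: 0.5 * u @ S @ u                         # psi(u) = (1/2)<u, Su>
grad = lambda u: S @ u                                   # gradient of psi
breg = lambda u, v: psi(u) - psi(v) - grad(v) @ (u - v)  # Bregman divergence

# three-point identity:
#   B(z, x) = B(z, y) + B(y, x) + <grad(y) - grad(x), z - y>
lhs = breg(z, x)
rhs = breg(z, y) + breg(y, x) + (grad(y) - grad(x)) @ (z - y)

# duality of the norms: sup_x <g, x>/||x||_S is attained at x = S^{-1} g,
# where the ratio equals ||g||_{S^{-1}}
g = rng.standard_normal(3)
xstar = np.linalg.solve(S, g)
ratio = (g @ xstar) / np.sqrt(xstar @ S @ xstar)
dual  = np.sqrt(g @ np.linalg.solve(S, g))
```

Expanding the Bregman terms for ψ(x) = (1/2)⟨x, Sx⟩ shows the identity holds exactly, not just approximately, since ψ is quadratic.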
2 The Algorithm and Main Result
In this section, we describe our main algorithm, FLAG, and give its convergence properties in Theorem 1. The core of FLAG consists of five essential ingredients:
The subroutine BinarySearch is given in Algorithm 2, where BinarySearch(ℓ, u, ε) is the usual bisection routine for finding the root of a single-variable function in the interval [ℓ, u] to accuracy ε. More specifically, for a root z⋆ with f(z⋆) = 0 and given ε > 0, the sub-routine returns an approximation z to z⋆ such that |z − z⋆| ≤ ε, and this is done with only O(log((u − ℓ)/ε)) function evaluations.
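A textbook bisection routine with these guarantees can be sketched as follows (our generic implementation, not the paper’s Algorithm 2):

```python
def binary_search(fn, lo, hi, eps):
    """Bisection for a root of a continuous scalar function on [lo, hi],
    assuming fn(lo) and fn(hi) have opposite signs.  Halving the bracket
    each step locates the root to accuracy eps using
    O(log((hi - lo)/eps)) function evaluations."""
    assert fn(lo) * fn(hi) <= 0.0, "root must be bracketed"
    while hi - lo > eps:
        mid = 0.5 * (lo + hi)
        if fn(lo) * fn(mid) <= 0.0:
            hi = mid          # root lies in the left half
        else:
            lo = mid          # root lies in the right half
    return 0.5 * (lo + hi)

root = binary_search(lambda z: z * z - 2.0, 0.0, 2.0, 1e-8)
```

Each iteration halves the bracket, which is exactly where the logarithmic evaluation count comes from.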
Theorem 1 (Main Result: Convergence Property of FLAG)
Let x⋆ ∈ argmin_{x ∈ C} F(x). For any T ≥ 1, after T iterations of Algorithm 1, we get
where . Also each iteration takes time.
2.1 Comparison of FLAG with AdaGrad and FISTA
In this section, we briefly describe the advantages of FLAG relative to AdaGrad and FISTA.
Let gₜ be the subgradient of F at iteration t. Using an argument similar to that used in Lemma 5, AdaGrad achieves a rate of
which is in contrast to mirror descent, which achieves
where ḡ is the average of all the subgradients gₜ. Comparing the two bounds shows the improvement of AdaGrad over mirror descent.
This is in contrast to FISTA, which achieves
As a result, FLAG’s improvement over FISTA can be up to the square of AdaGrad’s improvement over mirror descent. Here, we note that the corresponding quantities, though analogous, are not the same.
Finally, we can directly compare the rates of (6) and (7) to see the speed-up offered by FLAG over AdaGrad. In particular, FLAG enjoys the optimal rate of O(1/T²), compared to the sub-optimal rate of O(1/√T) for AdaGrad. However, we stress that AdaGrad does not make any smoothness assumptions, and thus works in more general settings than FLAG.
The proof sketch of the convergence of FLAG is as follows.
By picking the regularizer adaptively, as in AdaGrad, we achieve a non-trivial upper bound on the relevant error term (see Lemma 5).
Our algorithm relies on picking a stepsize at each iteration that satisfies a certain inequality (see Corollary 1). However, because the quantities in this inequality are not known at the moment the stepsize is picked, we must choose a stepsize that roughly satisfies the inequality for all possible values. We do this by picking the stepsize using binary search (see Lemmas 2 and 3 and Corollary 1).
3.1 Technical Lemmas
Lemma 1 (Gradient Mapping)
For any , we have
where is defined as in (3). In particular, .
The following lemma establishes the Lipschitz continuity of the prox operator.
Lemma 2 (Prox Operator Continuity)
The prox operator is 1-Lipschitz continuous (nonexpansive); that is, for any x and y, we have
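As a numerical sanity check on nonexpansiveness (using the ℓ1 prox from soft-thresholding as a concrete instance; this particular h is our assumption, not the paper’s setting):

```python
import numpy as np

def prox_l1(x, t):
    # prox of t * ||.||_1: coordinate-wise soft-thresholding
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

rng = np.random.default_rng(1)
for _ in range(1000):
    x, y = rng.standard_normal((2, 5))
    t = rng.uniform(0.0, 2.0)
    # nonexpansiveness: ||prox(x) - prox(y)|| <= ||x - y||
    assert (np.linalg.norm(prox_l1(x, t) - prox_l1(y, t))
            <= np.linalg.norm(x, 2) * 0 + np.linalg.norm(x - y) + 1e-12)
```

The same property holds for the prox of any closed proper convex function, which is what the lemma relies on.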
Lemma 3 (Binary Search Lemma)
Let the quantities be defined as in Algorithm 2. Then one of the following three cases holds:
and , or
for some and .
Using the above result, we can prove the following:
Let , , and be defined as in Algorithm 1 and . Then for all ,
Next, we state a result regarding the mirror descent step. Similar results can be found in most texts on online optimization, e.g., [3].
Lemma 4 (Mirror Descent Inequality)
Let D∞ be the diameter of C measured in the infinity norm. Then for any point in C, we have
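In the unconstrained case, the mirror descent step with the quadratic ψ(x) = (1/2)⟨x, Sx⟩ has a simple closed form, which the following sketch illustrates for a hypothetical diagonal S (the vectors, stepsize, and diagonal here are our own choices):

```python
import numpy as np

def mirror_step(x, g, eta, s_diag):
    """Unconstrained mirror-descent step with psi(x) = (1/2)<x, Sx>,
    S = diag(s_diag):
        argmin_y <eta*g, y> + B_psi(y, x)  =  x - eta * S^{-1} g,
    i.e. a per-coordinate scaled gradient step."""
    return x - eta * g / s_diag

x = np.array([1.0, 1.0])
g = np.array([10.0, 0.1])
# a larger diagonal entry of S acts like a smaller step in that direction
y = mirror_step(x, g, eta=0.1, s_diag=np.array([10.0, 1.0]))
```

Setting the gradient of ⟨ηg, y⟩ + B_ψ(y, x) to zero gives ηg + S(y − x) = 0, hence y = x − ηS⁻¹g, which is how an adaptively chosen S reshapes the mirror step.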
Finally, we state a result that captures the benefits of the adaptive regularizer used in FLAG.
3.2 Master Theorem
We can now prove the central theorems of this section.
Let x⋆ ∈ argmin_{x ∈ C} F(x). For any T ≥ 1, after T iterations of Algorithm 1, we get
3.3 Choosing the Stepsize
In this section, we discuss the final piece of our algorithm: choosing the stepsize for the mirror descent step.
For the stepsize chosen in Algorithm 1, we have
Let x⋆ ∈ argmin_{x ∈ C} F(x). For any T ≥ 1, after T iterations of Algorithm 1, we get
Proof (Proof of Corollary 2)
Finally, it only remains to establish the required lower bound, which is done in the following lemma.
For the stepsize chosen in Algorithm 1, we have
We note here that we made little effort to minimize constants, and that we used rather sloppy bounds along the way. As a result, the constant appearing above is very conservative and a mere by-product of our proof technique. We have numerically verified that a much smaller constant indeed satisfies the bound above.
The proof of our main result, Theorem 1, now follows rather immediately:
-  Zeyuan Allen-Zhu and Lorenzo Orecchia. Linear coupling: An ultimate unification of gradient and mirror descent. arXiv preprint arXiv:1407.1537, 2014.
-  Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183–202, 2009.
-  Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3-4):231–357, 2015.
-  Sébastien Bubeck, Yin Tat Lee, and Mohit Singh. A geometric alternative to Nesterov’s accelerated gradient descent. arXiv preprint arXiv:1506.08187, 2015.
-  Yann Dauphin, Harm de Vries, and Yoshua Bengio. Equilibrated adaptive learning rates for non-convex optimization. In Advances in Neural Information Processing Systems, pages 1504–1512, 2015.
-  Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1223–1231, 2012.
-  John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.
-  Elad Hazan, Alexander Rakhlin, and Peter L Bartlett. Adaptive online gradient descent. In Advances in Neural Information Processing Systems, pages 65–72, 2007.
-  Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  Walid Krichene, Alexandre Bayen, and Peter L Bartlett. Accelerated mirror descent in continuous and discrete time. In Advances in Neural Information Processing Systems, pages 2827–2835, 2015.
-  Neal Parikh and Stephen P Boyd. Proximal algorithms. Foundations and Trends in optimization, 1(3):127–239, 2014.
-  Weijie Su, Stephen Boyd, and Emmanuel Candes. A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights. In Advances in Neural Information Processing Systems, pages 2510–2518, 2014.
-  Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1139–1147, 2013.
-  Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4, 2012.
-  Andre Wibisono, Ashia C Wilson, and Michael I Jordan. A variational perspective on accelerated methods in optimization. arXiv preprint arXiv:1603.04245, 2016.
Appendix A Proofs of Technical Lemmas
Proof (Proof of Lemma 1)
This result is the same as Lemma 2.3 in [2]. We include its proof here for completeness.
For any , any sub-gradient, , of at , i.e., , and by optimality of in (3), we have
Now from the L-Lipschitz continuity of ∇f as well as the convexity of f and h, we get
Proof (Proof of Lemma 2)
By Definition (3), for any , , and , we have
In particular, for and , we get
By monotonicity of sub-gradient, we get
and as a result
and the result follows.
Proof (Proof of Lemma 3)
Proof (Proof of Corollary 1)
Note that by Step 5 of Algorithm 1, the required quantity is well defined. For t = 1, the inequality is trivially true. For t > 1, we consider the three cases of Lemma 3: in the first case, the right-hand side and the left-hand side can be compared directly; in the second case, the inequality holds trivially; and in the last case, we have