1 Introduction
Online Convex Optimization (OCO) is a well-established learning framework with both theoretical and practical appeal (Shalev-Shwartz et al., 2012). It proceeds in a sequence of consecutive rounds: in each round $t$, a learner first chooses a decision $\mathbf{x}_t$ from a convex set $\mathcal{K}$; at the same time, an adversary reveals a loss function $f_t$, and consequently the learner suffers a loss $f_t(\mathbf{x}_t)$. The goal is to minimize the regret, defined as the difference between the cumulative loss of the learner and that of the best fixed decision in hindsight (Hazan et al., 2016):

$\mathrm{Regret}_T = \sum_{t=1}^{T} f_t(\mathbf{x}_t) - \min_{\mathbf{x} \in \mathcal{K}} \sum_{t=1}^{T} f_t(\mathbf{x}).$

The most classic algorithm for OCO is Online Gradient Descent (OGD) (Zinkevich, 2003), which attains an $O(\sqrt{T})$ regret. OGD iteratively performs a descent step along the gradient direction with a predetermined step size, which is oblivious to the characteristics of the data being observed. As a result, its regret bound is data-independent and cannot benefit from the structure of the data. To address this limitation, various adaptive gradient methods, such as Adagrad (Duchi et al., 2011), RMSprop (Tieleman & Hinton, 2012) and Adadelta (Zeiler, 2012), have been proposed to exploit the geometry of historical data. Among them, Adam (Kingma & Ba, 2015), which dynamically adjusts the step size and the update direction by an exponential moving average of the past gradients, has become extremely popular and has been successfully applied to many applications (Xu et al., 2015; Gregor et al., 2015; Kiros et al., 2015; Denkowski & Neubig, 2017; Bahar et al., 2017). Despite its outstanding empirical performance, Reddi et al. (2018) pointed out that Adam suffers from a non-convergence issue, and developed two modified versions, namely AMSgrad and AdamNC. These variants are equipped with data-dependent regret bounds, which are $O(\sqrt{T})$ in the worst case and become tighter when gradients are sparse.
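To make the OCO protocol concrete, projected OGD can be sketched in a few lines (a minimal sketch: the box-shaped decision set and the quadratic loss in the usage example are illustrative assumptions, not the paper's setup):

```python
import numpy as np

def ogd(grads_fn, d, T, D=1.0, G=1.0):
    """Projected Online Gradient Descent with a predetermined ~1/sqrt(t) step size."""
    x = np.zeros(d)
    xs = []
    for t in range(1, T + 1):
        xs.append(x.copy())
        g = grads_fn(t, x)               # adversary reveals f_t; we only see its gradient
        eta = D / (G * np.sqrt(t))       # data-oblivious step size on the order of 1/sqrt(t)
        x = np.clip(x - eta * g, -D, D)  # descent step, then projection onto the box [-D, D]^d
    return xs
```

For instance, against the fixed loss $f_t(x) = (x - 0.8)^2$ the iterates drift toward 0.8 while the step size decays, illustrating how the schedule ignores the observed data.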
While the theoretical behavior of Adam in the convex case is now clear, it remains an open problem whether strong convexity can be exploited to achieve better performance. Such a property arises, for instance, in support vector machines as well as other regularized learning problems, and it is well known that vanilla OGD with an appropriately chosen step size enjoys a much better $O(\log T)$ regret bound for strongly convex functions (Hazan et al., 2007). In this paper, we propose a variant of Adam adapted to strongly convex functions, referred to as SAdam. Our algorithm follows the general framework of Adam, yet keeps a faster-decaying step size controlled by time-variant hyperparameters to exploit strong convexity. Theoretical analysis demonstrates that SAdam achieves a data-dependent logarithmic regret bound for strongly convex functions, which means that it converges faster than AMSgrad and AdamNC in such cases, and it also enjoys a huge gain in the face of sparse gradients.

Furthermore, under a special configuration of hyperparameters, the proposed algorithm reduces to SC-RMSprop (Mukkamala & Hein, 2017), which is a variant of RMSprop for strongly convex functions. We provide an alternative analysis for SC-RMSprop, and establish the first data-dependent logarithmic regret bound for it. Finally, we evaluate the proposed algorithm on strongly convex problems as well as on deep networks, and the empirical results demonstrate the effectiveness of our method.
Notation. Throughout the paper, we use lower-case bold-face letters to denote vectors, lower-case letters to denote scalars, and upper-case letters to denote matrices. We use $\|\cdot\|_2$ to denote the $\ell_2$ norm and $\|\cdot\|_\infty$ the infinity norm. For a positive definite matrix $A$, the weighted norm is defined by $\|\mathbf{x}\|_A^2 = \mathbf{x}^\top A \mathbf{x}$. The weighted projection of $\mathbf{x}$ onto $\mathcal{K}$ is defined by $\Pi_{\mathcal{K}}^{A}(\mathbf{x}) = \operatorname{argmin}_{\mathbf{y} \in \mathcal{K}} \|\mathbf{y} - \mathbf{x}\|_A^2$.
We use $\mathbf{g}_t$ to denote the gradient of $f_t$ at $\mathbf{x}_t$. For a vector sequence $\{\mathbf{g}_t\}$, we denote the $i$-th element of $\mathbf{g}_t$ by $g_{t,i}$. For a diagonal matrix sequence $\{V_t\}$, we use $v_{t,i}$ to denote the $i$-th element in the diagonal of $V_t$. We use $g_{1:T,i}$ to denote the vector obtained by concatenating the $i$-th elements of the gradient sequence, i.e., $g_{1:T,i} = [g_{1,i}, g_{2,i}, \ldots, g_{T,i}]$.
2 Related Work
In this section, we briefly review related work in online convex and strongly convex optimization.
In the literature, most studies are devoted to minimizing regret for convex functions. Under the assumptions that the infinity norm of the gradients and the diameter of the decision set are bounded, OGD with step size on the order of $1/\sqrt{t}$ (referred to as convex OGD) achieves a data-independent $O(d\sqrt{T})$ regret (Zinkevich, 2003), where $d$ is the dimension. To conduct more informative updates, Duchi et al. (2011) introduce the Adagrad algorithm, which adjusts the step size of OGD on a per-dimension basis according to the geometry of the past gradients. In particular, the diagonal version of the algorithm updates decisions as
(1)  $\mathbf{x}_{t+1} = \Pi_{\mathcal{K}}^{V_t^{1/2}}\left(\mathbf{x}_t - \frac{\alpha}{\sqrt{t}}\, V_t^{-1/2} \mathbf{g}_t\right)$
where $\alpha$ is a constant factor, $V_t = \operatorname{diag}(v_{t,1}, \ldots, v_{t,d})$ is a diagonal matrix, and $v_{t,i}$ is the arithmetic average of the squares of the $i$-th elements of the past gradients. Intuitively, while the step size of Adagrad, i.e., $\frac{\alpha}{\sqrt{t}} V_t^{-1/2}$, decreases overall on the order of $1/\sqrt{t}$ as in convex OGD, the additional matrix $V_t^{-1/2}$ automatically increases the step sizes on sparse dimensions in order to seize the infrequent yet valuable information therein. For convex functions, Adagrad enjoys an $O\big(\sum_{i=1}^{d} \|g_{1:T,i}\|_2\big)$ regret, which is $O(d\sqrt{T})$ in the worst case and becomes tighter when gradients are sparse.
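In code, one diagonal-Adagrad update in this averaged form can be sketched as follows (a hedged sketch: the constant `alpha`, the small stabilizer, and the box projection are illustrative choices):

```python
import numpy as np

def adagrad_step(x, g, sq_sum, t, alpha=0.1, D=1.0):
    """One diagonal-Adagrad update in the averaged form of (1).

    sq_sum accumulates per-coordinate squared gradients; v is their arithmetic
    average, so the effective per-coordinate step (alpha/sqrt(t))/sqrt(v) decays
    roughly as 1/sqrt(t) overall but stays larger on sparse coordinates.
    """
    sq_sum = sq_sum + g ** 2
    v = sq_sum / t                                        # arithmetic average of squared gradients
    x = x - (alpha / np.sqrt(t)) * g / (np.sqrt(v) + 1e-8)
    return np.clip(x, -D, D), sq_sum                      # box projection, for illustration
```

Note how a coordinate with a zero gradient is left untouched, while a frequently nonzero coordinate accumulates a large `sq_sum` and is updated more cautiously.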
Although Adagrad works well in sparse cases, its performance has been found to deteriorate when gradients are dense, due to the rapid decay of the step size caused by using all the past gradients in the update (Zeiler, 2012). To tackle this issue, Tieleman & Hinton (2012) propose RMSprop, which replaces the arithmetic average with an Exponential Moving Average (EMA), i.e.,

$V_t = \beta_2 V_{t-1} + (1-\beta_2)\operatorname{diag}(\mathbf{g}_t \odot \mathbf{g}_t),$

where $\beta_2 \in (0,1)$ is a hyperparameter, and $\operatorname{diag}(\cdot)$ denotes the diagonal matrix formed from a vector. In this way, the weights assigned to past gradients decay exponentially, so that the update essentially relies only on the few most recent gradients. Since the invention of RMSprop, many EMA variants of Adagrad have been developed (Zeiler, 2012; Kingma & Ba, 2015; Dozat, 2016). One of the most popular algorithms is Adam (Kingma & Ba, 2015), where the first-order momentum acceleration, shown in (2), is incorporated into RMSprop to boost the performance:
(2)  $\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1-\beta_1)\mathbf{g}_t$

(3)  $V_t = \beta_2 V_{t-1} + (1-\beta_2)\operatorname{diag}(\mathbf{g}_t \odot \mathbf{g}_t)$

(4)  $\mathbf{x}_{t+1} = \Pi_{\mathcal{K}}^{V_t^{1/2}}\left(\mathbf{x}_t - \alpha_t V_t^{-1/2} \mathbf{m}_t\right)$
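The three updates above can be sketched jointly (a minimal sketch using the authors' recommended defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$; the bias-correction terms of the original Adam and the projection step are omitted for brevity):

```python
import numpy as np

def adam_step(x, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: EMA of gradients (momentum) and of squared gradients."""
    m = beta1 * m + (1 - beta1) * g          # first-order momentum, as in (2)
    v = beta2 * v + (1 - beta2) * g ** 2     # second-order momentum, as in (3)
    x = x - alpha * m / (np.sqrt(v) + eps)   # scaled descent step, as in (4)
    return x, m, v
```

Because both moments are EMAs, the effective step size can both shrink and grow over time, which is precisely what the non-convergence example of Reddi et al. (2018) exploits.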
While it has been successfully applied to various practical applications, a recent study by Reddi et al. (2018) shows that Adam can fail to converge to the optimal decision even in some simple one-dimensional convex scenarios, due to the potential rapid fluctuation of the step size. To resolve this issue, they design two modified versions of Adam. The first one is AMSgrad, where an additional element-wise maximization procedure, i.e.,

$\hat{V}_t = \max(\hat{V}_{t-1}, V_t),$

is employed before the update of $\mathbf{x}_{t+1}$ to ensure a stable step size. The other is AdamNC, where the framework of Adam remains unchanged, yet a time-variant $\beta_{2t}$ (i.e., $\beta_{2t} = 1 - 1/t$) is adopted to keep the step size under control. Theoretically, both algorithms achieve data-dependent regret bounds, which are $O(\sqrt{T})$ in the worst case, and enjoy a huge gain when data is sparse.
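The AMSgrad safeguard amounts to one extra line on top of the Adam step (a hedged sketch; bias correction and the projection are again omitted):

```python
import numpy as np

def amsgrad_step(x, g, m, v, v_hat, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam step with the element-wise maximum safeguard of AMSgrad."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    v_hat = np.maximum(v_hat, v)               # keeps 1/step non-decreasing per coordinate
    x = x - alpha * m / (np.sqrt(v_hat) + eps)
    return x, m, v, v_hat
```

Using `v_hat` instead of `v` makes the per-coordinate inverse step size monotonically non-decreasing, which rules out the step-size fluctuation behind Adam's non-convergence example.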
Note that the aforementioned algorithms are mainly analysed in the general convex setting and suffer at least $\Omega(\sqrt{T})$ regret in the worst case. For online strongly convex optimization, the classical OGD with step size proportional to $1/t$ (referred to as strongly convex OGD) achieves a data-independent $O(\log T)$ regret (Hazan et al., 2007). Inspired by this, Mukkamala & Hein (2017) modify the update rule of Adagrad in (1) as follows

$\mathbf{x}_{t+1} = \Pi_{\mathcal{K}}^{V_t}\left(\mathbf{x}_t - \frac{\alpha}{t}\, V_t^{-1} \mathbf{g}_t\right)$
so that the step size decays approximately on the order of $1/t$, similar to that of strongly convex OGD. The new algorithm, named SC-Adagrad, is proved to enjoy a data-dependent regret bound that is $O(d\log T)$ in the worst case. They further extend this idea to RMSprop, and propose an algorithm named SC-RMSprop. However, as pointed out in Section 3, their regret bound for SC-RMSprop is in fact data-independent, and our paper provides the first data-dependent regret bound for this algorithm.
Very recently, several modifications of Adam adapted to nonconvex settings have been developed (Chen et al., 2018a; Basu et al., 2018; Zhang et al., 2018; Shazeer & Stern, 2018). However, to our knowledge, none of these algorithms is particularly designed for strongly convex functions, nor enjoys a logarithmic regret bound.
3 SAdam
In this section, we first describe the proposed algorithm, then state its theoretical guarantees, and finally compare it with the SC-RMSprop algorithm.
3.1 The Algorithm
Before proceeding to our algorithm, following previous studies, we introduce some standard definitions (Boyd & Vandenberghe, 2004) and assumptions (Reddi et al., 2018).
Definition 1.
A function $f : \mathcal{K} \to \mathbb{R}$ is $\lambda$-strongly convex if

(5)  $f(\mathbf{y}) \ge f(\mathbf{x}) + \langle \nabla f(\mathbf{x}), \mathbf{y} - \mathbf{x} \rangle + \frac{\lambda}{2}\|\mathbf{y} - \mathbf{x}\|_2^2, \quad \forall \mathbf{x}, \mathbf{y} \in \mathcal{K}.$
Assumption 1.
The infinity norm of the gradients of all loss functions is bounded, i.e., there exists a constant $G_\infty$ such that $\|\nabla f_t(\mathbf{x})\|_\infty \le G_\infty$ holds for all $\mathbf{x} \in \mathcal{K}$ and $t \in [T]$.
Assumption 2.
The decision set $\mathcal{K}$ is bounded. Specifically, there exists a constant $D_\infty$ such that $\|\mathbf{x} - \mathbf{y}\|_\infty \le D_\infty$ for all $\mathbf{x}, \mathbf{y} \in \mathcal{K}$.
We are now ready to present our algorithm, which follows the general framework of Adam and is summarized in Algorithm 1. In each round $t$, we first observe the gradient $\mathbf{g}_t$ at $\mathbf{x}_t$ (Step 4), then compute the first-order momentum $\mathbf{m}_t$ (Step 5). Here $\beta_{1t}$ is a time-variant non-increasing hyperparameter, commonly set to a constant or a decaying sequence (Kingma & Ba, 2015; Reddi et al., 2018). Next, we calculate the second-order momentum $\mathbf{v}_t$ by an EMA of the squares of the past gradients (Step 6). This procedure is controlled by $\beta_{2t}$, whose value will be discussed later. After that, we add a vanishing factor $\delta_t$ to the diagonal of $\operatorname{diag}(\mathbf{v}_t)$ and get $\hat{V}_t$ (Step 7), which is a standard technique for avoiding overly large steps caused by small gradients in the early iterations. Finally, we update the decision by $\mathbf{m}_t$ and $\hat{V}_t$, and project the result onto the decision set (Step 8).
While SAdam is inspired by Adam, there are two key differences: one is the update rule of $\mathbf{x}_{t+1}$ in Step 8, and the other is the configuration of $\beta_{2t}$ in Step 6. Intuitively, both modifications stem from strongly convex OGD, and jointly result in a faster-decaying yet controlled step size, which helps exploit strong convexity while preserving the practical benefits of Adam. Specifically, in the first modification, we remove the square root operation in (4) of Adam, and update $\mathbf{x}_{t+1}$ at Step 8 as follows
(6)  $\mathbf{x}_{t+1} = \Pi_{\mathcal{K}}^{\hat{V}_t}\left(\mathbf{x}_t - \alpha_t \hat{V}_t^{-1} \mathbf{m}_t\right)$
In this way, the step size used to update the $i$-th element of $\mathbf{x}_t$ is $\alpha_t / \hat{v}_{t,i}$, which decays in general on the order of $1/t$, and can still be automatically tuned on a per-feature basis via the EMA of the historical gradients.
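One SAdam round can be sketched as follows (hedged: the default constants, the schedule `beta2_fn(t) = 1 - 0.9/t`, and the additive factor `delta / t` are illustrative assumptions for this sketch, not the paper's exact settings; the projection is simplified to a box):

```python
import numpy as np

def sadam_step(x, g, m, v, t, alpha=0.01, beta1=0.9,
               beta2_fn=lambda t: 1 - 0.9 / t, delta=1e-2, D=1.0):
    """One SAdam round: Adam's framework, but no square root and a ~1/t step."""
    m = beta1 * m + (1 - beta1) * g          # first-order momentum (Step 5)
    b2 = beta2_fn(t)                         # time-variant beta_{2t} (Step 6)
    v = b2 * v + (1 - b2) * g ** 2
    v_hat = v + delta / t                    # vanishing factor on the diagonal (Step 7)
    x = x - (alpha / t) * m / v_hat          # no square root: step decays ~ 1/t (Step 8)
    return np.clip(x, -D, D), m, v           # simplified box projection
```

Compared with `adam_step`, the only structural changes are the missing `sqrt` and the faster-decaying base step size, which is exactly the modification the text describes.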
The second modification concerns $\beta_{2t}$, which determines the value of $\mathbf{v}_t$ and thus also controls the decay rate of the step size. To help understand the motivation behind our algorithm, we first revisit Adam, where $\beta_{2t}$ is simply set to a constant, which, however, can cause rapid fluctuation of the step size and further leads to the non-convergence issue. To ensure convergence, Reddi et al. (2018) propose that $\beta_{2t}$ should satisfy the following two conditions:
Condition 1.
For all $t \in [T]$ and $i \in [d]$, $\frac{\sqrt{v_{t,i}}}{\alpha_t} - \frac{\sqrt{v_{t-1,i}}}{\alpha_{t-1}} \ge 0.$
Condition 2.
For some $\zeta > 0$ and all $t \in [T]$ and $i \in [d]$,
The first condition implies that the difference between the inverse step sizes in two consecutive rounds is non-negative. It is inherently motivated by convex OGD (i.e., OGD with step size $\alpha_t = \alpha/\sqrt{t}$, where $\alpha$ is a constant factor), where

$\frac{1}{\alpha_t} - \frac{1}{\alpha_{t-1}} = \frac{\sqrt{t} - \sqrt{t-1}}{\alpha} \ge 0$

is a key condition used in the analysis. We first modify Condition 1 by mimicking the behavior of strongly convex OGD, since we are devoted to minimizing regret for strongly convex functions. In strongly convex OGD (Hazan et al., 2007), the step size at each round is set as $\alpha_t = \frac{1}{\lambda t}$ for $\lambda$-strongly convex functions. Under this configuration, we have
(7)  $\frac{1}{\alpha_t} - \frac{1}{\alpha_{t-1}} = \lambda$
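For concreteness, assuming the step size $\alpha_t = 1/(\lambda t)$ of strongly convex OGD (Hazan et al., 2007), the difference of inverse step sizes telescopes to exactly the strong-convexity modulus:

```latex
\frac{1}{\alpha_t} - \frac{1}{\alpha_{t-1}}
  = \lambda t - \lambda (t-1)
  = \lambda .
```

That is, the inverse step size grows by a constant $\lambda$ per round, in contrast to the $O(1/\sqrt{t})$ increments of convex OGD.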
Motivated by this, we propose the following condition for our SAdam, which is an analog to (7).
Condition 3.
There exists a constant $\lambda' > 0$ such that for any $t \in [T]$ and $i \in [d]$, we have
(8) 
Note that the extra term on the right-hand side of (8) is necessary because SAdam involves the first-order momentum in its update.
Finally, since the step size of SAdam scales with $\hat{V}_t^{-1}$ rather than $V_t^{-1/2}$ as in Adam, we modify Condition 2 accordingly as follows:
Condition 4.
For some $\zeta > 0$ and all $t \in [T]$ and $i \in [d]$,
(9) 
3.2 Theoretical Guarantees
In the following, we give a general regret bound for SAdam when Conditions 3 and 4 are satisfied.
Theorem 1.
Remark 1.
The above theorem implies that our algorithm enjoys a data-dependent regret bound, which is $O(d\log T)$ in the worst case, and automatically becomes tighter whenever the gradients are small or sparse. The superiority of data-dependent bounds has been witnessed by a long list of literature, such as Duchi et al. (2011); Mukkamala & Hein (2017); Reddi et al. (2018). In the following, we give some concrete examples:
- Consider a one-dimensional sparse setting where a nonzero gradient appears with a probability that decays with $t$. Then the bound of Theorem 1 reduces to a constant factor.
- Consider a high-dimensional sparse setting where, in each dimension of the gradient, a nonzero element appears with a small constant probability. Then the bound is much tighter than the worst-case $O(d\log T)$.
Next, we provide an instantiation of $\beta_{2t}$ such that Conditions 3 and 4 hold, and derive the following corollary.
Corollary 2.
Furthermore, as a special case, by setting the first-order momentum parameter to zero and choosing $\beta_{2t}$ appropriately, our algorithm reduces to SC-RMSprop (Mukkamala & Hein, 2017), which is a variant of RMSprop for strongly convex functions. Although Mukkamala & Hein (2017) have provided theoretical guarantees for this algorithm, we note that their regret bound is in fact data-independent. Specifically, the regret bound provided by Mukkamala & Hein (2017) takes the following form:
(11) 
Focusing on the denominator of the last term in (11), one can show that it grows at least linearly with $t$, which implies that their regret bound is of order $d\log T$ regardless of the gradients, and is thus data-independent. In contrast, based on Corollary 2, we present a new regret bound for SC-RMSprop in the following, which is logarithmic in $T$, tightens automatically when gradients are sparse, and is thus data-dependent.
Corollary 3.
Finally, we note that Mukkamala & Hein (2017) also consider a more general version of SC-RMSprop which uses a time-variant non-increasing damping factor for each dimension. In Appendix D, we incorporate this technique into our SAdam and provide the corresponding theoretical guarantees.
4 Experiments
In this section, we present empirical results on optimizing strongly convex functions and training deep networks.
Algorithms. In both experiments, we compare the following algorithms:
- SC-Adagrad (Mukkamala & Hein, 2017).
- SC-RMSprop (Mukkamala & Hein, 2017).
- AdamNC (Reddi et al., 2018), with a time-variant $\beta_{2t}$ and a step size on the order of $1/\sqrt{t}$ for convex problems, and a time-invariant step size for nonconvex problems.
- Online Gradient Descent (OGD), with a step size proportional to $1/t$ for strongly convex problems and a time-invariant step size for nonconvex problems.
- Our proposed SAdam.
For Adam, AdamNC and AMSgrad, we choose $\beta_1$ and $\beta_2$ according to the recommendations in their papers. For SC-Adagrad and SC-RMSprop, following Mukkamala & Hein (2017), we choose a time-variant damping factor for each dimension, using the values suggested in their paper for convex and nonconvex problems, respectively. For our SAdam, since the removal of the square root operation together with very small gradients may cause overly large step sizes in the early iterations, we use a rather large $\delta$ to avoid this problem. To conduct a fair comparison, for each algorithm, we choose the base step size from a fixed candidate set and report the best results.
Datasets. In both experiments, we examine the performance of the aforementioned algorithms on three widely used datasets: MNIST (60,000 training samples, 10,000 test samples), CIFAR-10 (50,000 training samples, 10,000 test samples), and CIFAR-100 (50,000 training samples, 10,000 test samples). We refer the reader to LeCun (1998) and Krizhevsky (2009) for more details on these datasets.
4.1 Optimizing Strongly Convex Functions
In the first experiment, we consider the problem of mini-batch regularized softmax regression, which falls into the online strongly convex optimization framework. Let $C$ be the number of classes and $m$ be the batch size. In each round $t$, a mini-batch of $m$ training samples arrives, where each label belongs to $\{1, \ldots, C\}$. Then, the algorithm predicts $C$ parameter vectors $\mathbf{w}_1, \ldots, \mathbf{w}_C$, and suffers a loss which takes the following form:
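A minimal sketch of this loss, assuming the standard $\ell_2$-regularized mini-batch softmax cross-entropy form (variable names and the exact constants are illustrative, not the paper's displayed formula):

```python
import numpy as np

def softmax_regression_loss(W, X, y, lam=0.01):
    """Mini-batch l2-regularized softmax loss: cross-entropy + (lam/2)*||W||^2.

    W: (d, C) parameter matrix, X: (m, d) mini-batch, y: (m,) integer labels.
    The quadratic regularizer makes the loss lam-strongly convex in W.
    """
    logits = X @ W                                      # (m, C)
    logits = logits - logits.max(axis=1, keepdims=True) # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(y)), y].mean()       # average cross-entropy
    return nll + 0.5 * lam * np.sum(W ** 2)
```

With `W = 0` every class is equally likely, so the loss equals `log(C)`, a convenient sanity check.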
The regularization parameter and the batch size are fixed to the same values across all experiments. The regret (in log scale) vs. the proportion of the dataset processed is shown in Figure 1. It can be seen that our SAdam outperforms the other methods on all the considered datasets. Besides, we observe that data-dependent strongly convex methods such as SC-Adagrad, SC-RMSprop and SAdam perform better than algorithms designed for general convex functions such as Adam, AMSgrad and AdamNC. Finally, OGD has the overall highest regret on all three datasets.
4.2 Training Deep Networks
Figure 2: Training loss vs. number of epochs for the 4-layer CNN.
Following Mukkamala & Hein (2017), we also conduct experiments on a 4-layer CNN, which consists of two convolutional layers (each with 32 filters of size 3×3), one max-pooling layer (with a 2×2 window and 0.25 dropout), and one fully connected layer (with 128 hidden units and 0.5 dropout). We employ the ReLU activation for the convolutional layers and the softmax activation for the fully connected layer. The loss function is the cross-entropy. The training loss vs. epoch is shown in Figure 2, and the test accuracy vs. epoch is presented in Figure 3. As can be seen, our SAdam achieves the lowest training loss on all three datasets. Moreover, this performance gain also translates into good test accuracy. The experimental results show that although the proposed SAdam is designed for strongly convex functions, it can deliver superior practical performance even in highly nonconvex cases such as deep learning tasks.
5 Conclusion and Future Work
In this paper, we provide a variant of Adam adapted to strongly convex functions. The proposed algorithm, namely SAdam, follows the general framework of Adam, while keeping a step size that decays in general on the order of $1/t$ and is controlled by data-dependent hyperparameters in order to exploit strong convexity. Our theoretical analysis shows that SAdam achieves a data-dependent logarithmic regret bound for strongly convex functions, which means that it converges much faster than Adam, AdamNC, and AMSgrad in such cases, and can also enjoy a huge gain in the face of sparse gradients. In addition, we provide the first data-dependent logarithmic regret bound for the SC-RMSprop algorithm. Finally, we test the proposed algorithm on optimizing strongly convex functions as well as on training deep networks, and the empirical results demonstrate the effectiveness of our method.
Since SAdam enjoys a data-dependent regret bound for online strongly convex optimization, it can be readily translated into a data-dependent convergence rate for stochastic strongly convex optimization (SSCO) by the online-to-batch conversion (Kakade & Tewari, 2009). However, this rate is not optimal for SSCO, and it is still an open problem how to achieve an optimal data-dependent convergence rate for SSCO. Recent developments on adaptive gradient methods (Chen et al., 2018b) have shown that Adagrad combined with a multi-stage scheme (Hazan & Kale, 2014) can achieve such a rate, but it is highly nontrivial to extend this technique to SAdam, and we leave it as future work.
References
Bahar et al. (2017) Bahar, P., Alkhouli, T., Peter, J.-T., Brix, C. J.-S., and Ney, H. Empirical investigation of optimization algorithms in neural machine translation. The Prague Bulletin of Mathematical Linguistics, 108(1):13–25, 2017.
 Basu et al. (2018) Basu, A., De, S., Mukherjee, A., and Ullah, E. Convergence guarantees for RMSprop and Adam in nonconvex optimization and their comparison to Nesterov acceleration on autoencoders. arXiv preprint arXiv:1807.06766, 2018.
 Boyd & Vandenberghe (2004) Boyd, S. and Vandenberghe, L. Convex optimization. Cambridge university press, 2004.
 Chen et al. (2018a) Chen, X., Liu, S., Sun, R., and Hong, M. On the convergence of a class of Adam-type algorithms for nonconvex optimization. arXiv preprint arXiv:1808.02941, 2018a.
 Chen et al. (2018b) Chen, Z., Xu, Y., Chen, E., and Yang, T. Sadagrad: Strongly adaptive stochastic gradient methods. In Proceedings of 35th International Conference on Machine Learning, pp. 912–920, 2018b.
 Denkowski & Neubig (2017) Denkowski, M. and Neubig, G. Stronger baselines for trustable results in neural machine translation. arXiv preprint arXiv:1706.09733, 2017.
Dozat (2016) Dozat, T. Incorporating Nesterov momentum into Adam. In Proceedings of 4th International Conference on Learning Representations, Workshop Track, 2016.
 Duchi et al. (2011) Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
Gregor et al. (2015) Gregor, K., Danihelka, I., Graves, A., Rezende, D. J., and Wierstra, D. DRAW: a recurrent neural network for image generation. In Proceedings of the 32nd International Conference on Machine Learning, pp. 1462–1471, 2015.
 Hazan & Kale (2014) Hazan, E. and Kale, S. Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research, 15:2489–2512, 2014.
 Hazan et al. (2007) Hazan, E., Agarwal, A., and Kale, S. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69:169–192, 2007.
 Hazan et al. (2016) Hazan, E. et al. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(34):157–325, 2016.
 Kakade & Tewari (2009) Kakade, S. M. and Tewari, A. On the generalization ability of online strongly convex programming algorithms. In Advances in Neural Information Processing Systems 21, pp. 801–808, 2009.
 Kingma & Ba (2015) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Proceedings of 3rd International Conference on Learning Representations, 2015.
 Kiros et al. (2015) Kiros, R., Zhu, Y., Salakhutdinov, R. R., Zemel, R., Urtasun, R., Torralba, A., and Fidler, S. Skipthought vectors. In Advances in Neural Information Processing Systems 27, pp. 3294–3302, 2015.
 Krizhevsky (2009) Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
LeCun (1998) LeCun, Y. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
 McMahan & Streeter (2010) McMahan, H. B. and Streeter, M. Adaptive bound optimization for online convex optimization. In Proceedings of the 23rd Annual Conference on Learning Theory, pp. 224–256, 2010.
 Mukkamala & Hein (2017) Mukkamala, M. C. and Hein, M. Variants of RMSprop and Adagrad with logarithmic regret bounds. In Proceedings of the 34th International Conference on Machine Learning, pp. 2545–2553, 2017.
 Reddi et al. (2018) Reddi, S. J., Kale, S., and Kumar, S. On the convergence of adam and beyond. In Proceedings of 6th International Conference on Learning Representations, 2018.
 ShalevShwartz et al. (2012) ShalevShwartz, S. et al. Online learning and online convex optimization. Foundations and Trends® in Machine Learning, 4(2):107–194, 2012.
 Shazeer & Stern (2018) Shazeer, N. and Stern, M. Adafactor: Adaptive learning rates with sublinear memory cost. arXiv preprint arXiv:1804.04235, 2018.
 Tieleman & Hinton (2012) Tieleman, T. and Hinton, G. Lecture 6.5 - RMSprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, pp. 26–31, 2012.
 Xu et al. (2015) Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., and Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of 32nd International Conference on Machine Learning, pp. 2048–2057, 2015.
 Zeiler (2012) Zeiler, M. D. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
 Zhang et al. (2018) Zhang, J., Cui, L., and Gouza, F. B. Gadam: Geneticevolutionary adam for deep neural network optimization. arXiv preprint arXiv:1805.07500, 2018.
 Zinkevich (2003) Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, pp. 928–936, 2003.
Appendix A Proof of Theorem 1
From Definition 1, we can upper bound regret as
(13)  $\sum_{t=1}^{T} \big(f_t(\mathbf{x}_t) - f_t(\mathbf{x}^*)\big) \le \sum_{t=1}^{T} \langle \mathbf{g}_t, \mathbf{x}_t - \mathbf{x}^* \rangle - \frac{\lambda}{2} \sum_{t=1}^{T} \|\mathbf{x}_t - \mathbf{x}^*\|_2^2$
where $\mathbf{x}^* = \operatorname{argmin}_{\mathbf{x} \in \mathcal{K}} \sum_{t=1}^{T} f_t(\mathbf{x})$ is the best decision in hindsight. On the other hand, by the update rule of $\mathbf{x}_{t+1}$ in Algorithm 1, we have
(14) 
where the inequality is due to the following lemma, which implies that the weighted projection operation is non-expansive.
Lemma 1.
(McMahan & Streeter, 2010) Let $A$ be a positive definite matrix and $\mathcal{K}$ be a convex set. Then, for any $\mathbf{u} \in \mathbb{R}^d$ and any $\mathbf{x}^* \in \mathcal{K}$,

(15)  $\left\|\Pi_{\mathcal{K}}^{A}(\mathbf{u}) - \mathbf{x}^*\right\|_A \le \left\|\mathbf{u} - \mathbf{x}^*\right\|_A.$
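Lemma 1 can be sanity-checked numerically. For a diagonal positive definite $A$ and a box-shaped $\mathcal{K}$, the weighted projection decomposes coordinate-wise into plain clipping, so the non-expansiveness is easy to verify (an illustrative sketch, not part of the proof):

```python
import numpy as np

def weighted_proj_box(u, lo, hi):
    """argmin_{x in [lo,hi]^d} (x-u)^T A (x-u) for diagonal A > 0 is clipping,
    since the objective separates across coordinates."""
    return np.clip(u, lo, hi)

def wnorm_sq(z, A):
    return float(z @ A @ z)

rng = np.random.default_rng(0)
A = np.diag(rng.uniform(0.5, 2.0, size=5))   # positive definite diagonal matrix
u = rng.normal(size=5) * 3.0                 # point possibly outside K = [-1, 1]^5
x_star = rng.uniform(-1.0, 1.0, size=5)      # any point already inside K

# ||Proj(u) - x*||_A^2 <= ||u - x*||_A^2, as Lemma 1 asserts
assert wnorm_sq(weighted_proj_box(u, -1, 1) - x_star, A) <= wnorm_sq(u - x_star, A) + 1e-12
```

The inequality holds coordinate-wise because clipping can only move each coordinate of `u` closer to any point inside the box, and a diagonal $A$ preserves coordinate-wise comparisons.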
Rearranging (14), we have
(16) 
where the last inequality is due to $\beta_{1t} \le \beta_1$. We proceed to bound the second term of the inequality above. When $t \ge 2$, by Young's inequality and the definition of $\mathbf{m}_t$, we get
(17) 
When $t = 1$, this term becomes 0 since $\mathbf{m}_0 = \mathbf{0}$ in Algorithm 1. Plugging (16) and (17) into (13), we obtain the following inequality, whose right-hand side we divide into three parts and upper bound one by one.
To bound the first term, we have
(18) 
where the inequality is derived from the definition of $\mathbf{m}_t$. For the first term in the last equality of (18), we have
(19) 
where the second equality is because $\mathbf{m}_0 = \mathbf{0}$, and the second inequality is due to Assumption 1.
For the second term of (18), we have
(20) 
where the first inequality is due to Condition 3, and the second inequality follows from Assumption 2. Combining (18), (19) and (20), we get
(21) 
To bound the second term, we introduce the following lemma.
Lemma 2.
The following inequality holds
(22) 