Online Convex Optimization (OCO) is a well-established learning framework with both theoretical and practical appeal (Shalev-Shwartz et al., 2012). It proceeds in a sequence of consecutive rounds: in each round $t$, a learner first chooses a decision $\mathbf{x}_t$ from a convex set $\mathcal{D} \subseteq \mathbb{R}^d$; at the same time, an adversary reveals a loss function $f_t(\cdot): \mathcal{D} \mapsto \mathbb{R}$, and consequently the learner suffers a loss $f_t(\mathbf{x}_t)$. The goal is to minimize regret, defined as the difference between the cumulative loss of the learner and that of the best decision in hindsight (Hazan et al., 2016):
$$R(T) = \sum_{t=1}^{T} f_t(\mathbf{x}_t) - \min_{\mathbf{x} \in \mathcal{D}} \sum_{t=1}^{T} f_t(\mathbf{x}).$$
The most classic algorithm for OCO is Online Gradient Descent (OGD) (Zinkevich, 2003), which attains an $O(\sqrt{T})$ regret. OGD iteratively performs a descent step along the negative gradient direction with a predetermined step size, which is oblivious to the characteristics of the data being observed. As a result, its regret bound is data-independent and cannot benefit from the structure of the data. To address this limitation, various adaptive gradient methods, such as Adagrad (Duchi et al., 2011), RMSprop (Tieleman & Hinton, 2012) and Adadelta (Zeiler, 2012), have been proposed to exploit the geometry of historical data. Among them, Adam (Kingma & Ba, 2015), which dynamically adjusts the step size and the update direction by exponential averages of the past gradients, has been extensively popular and successfully applied in many applications (Xu et al., 2015; Gregor et al., 2015; Kiros et al., 2015; Denkowski & Neubig, 2017; Bahar et al., 2017). Despite its outstanding performance, Reddi et al. (2018) pointed out that Adam suffers from a non-convergence issue, and developed two modified versions, namely AMSgrad and AdamNC. These variants are equipped with data-dependent regret bounds, which are $O(\sqrt{T})$ in the worst case and become tighter when gradients are sparse.
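To make the regret definition and the OGD baseline concrete, the following is a minimal sketch, not the authors' code. It assumes a toy one-dimensional problem with losses $f_t(x) = \frac{1}{2}(x - z_t)^2$ on the interval $[-1, 1]$ and a step size $\alpha/\sqrt{t}$; the names `project`, `ogd_regret`, and `targets` are hypothetical.

```python
import math

def project(x, lo, hi):
    """Euclidean projection of a scalar onto the interval [lo, hi]."""
    return max(lo, min(hi, x))

def ogd_regret(targets, lo=-1.0, hi=1.0, alpha=1.0):
    """Run OGD on f_t(x) = 0.5 * (x - z_t)^2 and return the regret."""
    x = 0.0
    learner_loss = 0.0
    for t, z in enumerate(targets, start=1):
        learner_loss += 0.5 * (x - z) ** 2        # suffer f_t(x_t)
        grad = x - z                              # gradient of f_t at x_t
        x = project(x - alpha / math.sqrt(t) * grad, lo, hi)
    # Best fixed decision in hindsight: the mean of the targets, clipped
    # to the decision set (valid for this 1-D quadratic toy problem).
    best = project(sum(targets) / len(targets), lo, hi)
    best_loss = sum(0.5 * (best - z) ** 2 for z in targets)
    return learner_loss - best_loss

# Toy loss sequence (an assumption for illustration only).
targets = [0.5 if t % 2 == 0 else -0.25 for t in range(200)]
print(ogd_regret(targets))
```

The step size here depends only on the round index, which is exactly the data-oblivious behavior that the adaptive methods discussed above are meant to improve on.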
While the theoretical behavior of Adam in the convex case has become clear, it remains an open problem whether strong convexity can be exploited to achieve better performance. Such a property arises, for instance, in support vector machines as well as other regularized learning problems, and it is well known that the vanilla OGD with an appropriately chosen step size enjoys a much better $O(\log T)$ regret bound for strongly convex functions (Hazan et al., 2007). In this paper, we propose a variant of Adam adapted to strongly convex functions, referred to as SAdam. Our algorithm follows the general framework of Adam, yet keeps a faster decaying step size controlled by time-variant hyperparameters to exploit strong convexity. Theoretical analysis demonstrates that SAdam achieves a data-dependent logarithmic regret bound for strongly convex functions, which means that it converges faster than AMSgrad and AdamNC in such cases, and also enjoys a large gain in the face of sparse gradients.
Furthermore, under a special configuration of hyperparameters, the proposed algorithm reduces to SC-RMSprop (Mukkamala & Hein, 2017), which is a variant of the RMSprop algorithm for strongly convex functions. We provide an alternative proof for SC-RMSprop, and establish the first data-dependent logarithmic regret bound for it. Finally, we evaluate the proposed algorithm on strongly convex problems as well as deep networks, and the empirical results demonstrate the effectiveness of our method.
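Notation. Throughout the paper, we use lower case bold face letters to denote vectors, lower case letters to denote scalars, and upper case letters to denote matrices. We use $\|\cdot\|$ to denote the $\ell_2$-norm and $\|\cdot\|_\infty$ the infinite norm. For a positive definite matrix $V \in \mathbb{R}^{d \times d}$, the weighted $\ell_2$-norm is defined by $\|\mathbf{x}\|_V^2 = \mathbf{x}^\top V \mathbf{x}$. The $V$-weighted projection $\Pi_{\mathcal{D}}^{V}(\mathbf{x})$ of $\mathbf{x}$ onto $\mathcal{D}$ is defined by
$$\Pi_{\mathcal{D}}^{V}(\mathbf{x}) = \operatorname*{argmin}_{\mathbf{y} \in \mathcal{D}} \|\mathbf{y} - \mathbf{x}\|_V^2.$$
We use $\mathbf{g}_t$ to denote the gradient of $f_t(\cdot)$ at $\mathbf{x}_t$. For a vector sequence $\{\mathbf{v}_t\}_{t=1}^{T}$, we denote the $i$-th element of $\mathbf{v}_t$ by $v_{t,i}$. For a diagonal matrix sequence $\{M_t\}_{t=1}^{T}$, we use $m_{t,i}$ to denote the $i$-th element in the diagonal of $M_t$. We use $\mathbf{g}_{1:t,i} = [g_{1,i}, \ldots, g_{t,i}]$ to denote the vector obtained by concatenating the $i$-th elements of the gradient sequence $\{\mathbf{g}_j\}_{j=1}^{t}$.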
2 Related Work
In this section, we briefly review related work in online convex and strongly convex optimization.
In the literature, most studies are devoted to the minimization of regret for convex functions. Under the assumptions that the infinite norm of gradients and the diameter of the decision set are bounded, OGD with step size on the order of $O(1/\sqrt{t})$ (referred to as convex OGD) achieves a data-independent $O(d\sqrt{T})$ regret (Zinkevich, 2003), where $d$ is the dimension. To conduct more informative updates, Duchi et al. (2011) introduce the Adagrad algorithm, which adjusts the step size of OGD on a per-dimension basis according to the geometry of the past gradients. In particular, the diagonal version of the algorithm updates decisions as
$$\mathbf{x}_{t+1} = \mathbf{x}_t - \frac{\alpha}{\sqrt{t}} V_t^{-1/2} \mathbf{g}_t \tag{1}$$
where $\alpha > 0$ is a constant factor, $V_t$ is a $d \times d$ diagonal matrix, and $v_{t,i} = \frac{1}{t}\sum_{j=1}^{t} g_{j,i}^2$ is the arithmetic average of the squares of the $i$-th elements of the past gradients. Intuitively, while the step size of Adagrad, i.e., $\frac{\alpha}{\sqrt{t}} V_t^{-1/2}$, decreases generally on the order of $O(1/\sqrt{t})$ as that in convex OGD, the additional matrix $V_t^{-1/2}$ will automatically increase step sizes for sparse dimensions in order to seize the infrequent yet valuable information therein. For convex functions, Adagrad enjoys an $O(\sum_{i=1}^{d} \|\mathbf{g}_{1:T,i}\|)$ regret, which is $O(d\sqrt{T})$ in the worst case and becomes tighter when gradients are sparse.
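The per-dimension scaling of diagonal Adagrad can be sketched as follows. This is an illustrative sketch rather than a reference implementation: the function name `adagrad_step`, the default `alpha`, and the small constant `eps` (added only to avoid division by zero) are assumptions, and the projection onto the decision set is omitted.

```python
import math

def adagrad_step(x, grad, sum_sq, t, alpha=0.1, eps=1e-8):
    """One diagonal Adagrad update.

    sum_sq holds the running sum of squared gradients per coordinate,
    so sum_sq[i] / t is the arithmetic average of past squared gradients.
    """
    new_x, new_sum_sq = [], []
    for xi, gi, si in zip(x, grad, sum_sq):
        si = si + gi * gi
        v = si / t                                     # arithmetic average
        step = alpha / (math.sqrt(t) * (math.sqrt(v) + eps))
        new_x.append(xi - step * gi)
        new_sum_sq.append(si)
    return new_x, new_sum_sq

# Coordinate 0 sees a gradient every round, coordinate 1 only in round 3.
x, s = [0.0, 0.0], [0.0, 0.0]
for t, g in enumerate([[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]], start=1):
    x, s = adagrad_step(x, g, s, t)
print(x)
```

In the third round both coordinates receive the same gradient value, yet the rarely active coordinate takes the larger step, which is the effect described above for sparse dimensions.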
Although Adagrad works well in sparse cases, its performance has been found to deteriorate when gradients are dense, due to the rapid decay of the step size caused by using all the past gradients in the update (Zeiler, 2012). To tackle this issue, Tieleman & Hinton (2012) propose RMSprop, which replaces the arithmetic average procedure with an Exponential Moving Average (EMA), i.e.,
$$V_t = \beta V_{t-1} + (1 - \beta)\, \mathrm{diag}(\mathbf{g}_t \mathbf{g}_t^\top)$$
where $\beta \in [0, 1]$ is a hyperparameter, and $\mathrm{diag}(\cdot)$ denotes extracting the diagonal matrix. In this way, the weights assigned to past gradients decay exponentially so that the update essentially relies only on the few most recent gradients. Since the invention of RMSprop, many EMA variants of Adagrad have been developed (Zeiler, 2012; Kingma & Ba, 2015; Dozat, 2016). One of the most popular algorithms is Adam (Kingma & Ba, 2015), where the first-order momentum acceleration, shown in (2), is incorporated into RMSprop to boost the performance:
$$\hat{\mathbf{g}}_t = \beta_1 \hat{\mathbf{g}}_{t-1} + (1 - \beta_1)\, \mathbf{g}_t \tag{2}$$
$$V_t = \beta_2 V_{t-1} + (1 - \beta_2)\, \mathrm{diag}(\mathbf{g}_t \mathbf{g}_t^\top) \tag{3}$$
$$\mathbf{x}_{t+1} = \mathbf{x}_t - \frac{\alpha}{\sqrt{t}} V_t^{-1/2} \hat{\mathbf{g}}_t \tag{4}$$
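As a rough illustration of how the two exponential moving averages interact, here is a simplified Adam step. It is a sketch under stated assumptions: it uses the $\alpha/\sqrt{t}$ schedule common in regret analyses, omits the bias correction of the original Adam, and adds a small `eps` for numerical safety; `adam_step` and its default values are hypothetical.

```python
import math

def adam_step(x, grad, m, v, t, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One simplified Adam update (no bias correction, no projection)."""
    new_x, new_m, new_v = [], [], []
    for xi, gi, mi, vi in zip(x, grad, m, v):
        mi = beta1 * mi + (1.0 - beta1) * gi        # first-order momentum (EMA of gradients)
        vi = beta2 * vi + (1.0 - beta2) * gi * gi   # second-order momentum (EMA of squares)
        new_x.append(xi - alpha / math.sqrt(t) * mi / (math.sqrt(vi) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_x, new_m, new_v

x, m, v = adam_step([1.0], [2.0], [0.0], [0.0], t=1)
print(x, m, v)
```

Setting `beta1=0.0` removes the momentum term and leaves an RMSprop-style update with the same second-order EMA.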
While it has been successfully applied to various practical applications, a recent study by Reddi et al. (2018) shows that Adam could fail to converge to the optimal decision even in some simple one-dimensional convex scenarios due to the potential rapid fluctuation of the step size. To resolve this issue, they design two modified versions of Adam. The first one is AMSgrad, where an additional element-wise maximization procedure, i.e.,
$$\hat{V}_t = \max\{\hat{V}_{t-1}, V_t\}$$
is employed before the update of $\mathbf{x}_t$ to ensure a stable step size. The other is AdamNC, where the framework of Adam remains unchanged, yet a time-variant $\beta_2$ (i.e., $\beta_{2t}$) is adopted to keep the step size under control. Theoretically, both algorithms achieve data-dependent regret bounds. In the worst case, these bounds are on the order of $\sqrt{T}$ up to logarithmic factors, and they enjoy a large gain when the data is sparse.
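The element-wise maximization used by AMSgrad can be sketched in a few lines. The helper name `amsgrad_second_moment` and the default `beta2` are assumptions for illustration; only the second-order statistics are shown, not the full update.

```python
def amsgrad_second_moment(v_prev, v_hat_prev, grad, beta2=0.999):
    """Update the EMA of squared gradients and its running element-wise maximum."""
    v = [beta2 * vp + (1.0 - beta2) * g * g for vp, g in zip(v_prev, grad)]
    v_hat = [max(hp, vi) for hp, vi in zip(v_hat_prev, v)]
    return v, v_hat

# A large gradient followed by a zero gradient: the plain EMA shrinks,
# while the maximized statistic (and hence the step size) stays put.
v1, v_hat1 = amsgrad_second_moment([0.0], [0.0], [10.0])
v2, v_hat2 = amsgrad_second_moment(v1, v_hat1, [0.0])
print(v1, v2, v_hat1, v_hat2)
```

Because the maximized statistic never decreases, the effective per-coordinate step size cannot suddenly grow, which is the stabilizing effect described above.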
Note that the aforementioned algorithms are mainly analysed in general convex settings and suffer regret on the order of at least $d\sqrt{T}$ in the worst case. For online strongly convex optimization, the classical OGD with step size proportional to $1/t$ (referred to as strongly convex OGD) achieves a data-independent $O(\log T)$ regret (Hazan et al., 2007). Inspired by this, Mukkamala & Hein (2017) modify the update rule of Adagrad in (1) as follows
$$\mathbf{x}_{t+1} = \mathbf{x}_t - \frac{\alpha}{t} V_t^{-1} \mathbf{g}_t$$
so that the step size decays approximately on the order of $O(1/t)$, which is similar to that in strongly convex OGD. The new algorithm, named SC-Adagrad, is proved to enjoy a data-dependent regret bound of $O(\sum_{i=1}^{d} \log(\|\mathbf{g}_{1:T,i}\|^2))$, which is $O(d \log T)$ in the worst case. They further extend this idea to RMSprop, and propose an algorithm named SC-RMSprop. However, as pointed out in Section 3, their regret bound for SC-RMSprop is in fact data-independent, and our paper provides the first data-dependent regret bound for this algorithm.
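The difference from Adagrad is easiest to see in code: with $V_t$ the arithmetic average of squared gradients, $\frac{\alpha}{t} V_t^{-1}$ reduces to $\alpha$ divided by the running sum of squares, with no square root. The sketch below is illustrative only; `sc_adagrad_step` is a hypothetical name, and the constant damping term `delta` is an assumption standing in for the dimension-wise, time-variant damping used by Mukkamala & Hein (2017).

```python
def sc_adagrad_step(x, grad, sum_sq, alpha=0.1, delta=0.1):
    """One SC-Adagrad-style update without projection."""
    new_x, new_sum_sq = [], []
    for xi, gi, si in zip(x, grad, sum_sq):
        si = si + gi * gi
        # (alpha / t) / (si / t) = alpha / si; delta keeps the denominator positive.
        new_x.append(xi - alpha * gi / (si + delta))
        new_sum_sq.append(si)
    return new_x, new_sum_sq

x, s = sc_adagrad_step([1.0], [2.0], [0.0])
print(x, s)
```

When gradients stay bounded away from zero, the running sum grows linearly in $t$, so the step size decays on the order of $1/t$ as in strongly convex OGD.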
Very recently, several modifications of Adam adapted to non-convex settings have been developed (Chen et al., 2018a; Basu et al., 2018; Zhang et al., 2018; Shazeer & Stern, 2018). However, to our knowledge, none of these algorithms is particularly designed for strongly convex functions, nor enjoys a logarithmic regret bound.
3 SAdam
In this section, we first describe the proposed algorithm, then state its theoretical guarantees, and finally compare it with the SC-RMSprop algorithm.
3.1 The Algorithm
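Definition 1. A function $f(\cdot): \mathcal{D} \mapsto \mathbb{R}$ is $\lambda$-strongly convex if, for all $\mathbf{x}_1, \mathbf{x}_2 \in \mathcal{D}$,
$$f(\mathbf{x}_1) \ge f(\mathbf{x}_2) + \nabla f(\mathbf{x}_2)^\top (\mathbf{x}_1 - \mathbf{x}_2) + \frac{\lambda}{2} \|\mathbf{x}_1 - \mathbf{x}_2\|^2.$$
Assumption 1. The infinite norm of the gradients of all loss functions is bounded by $G_\infty$, i.e., there exists a constant $G_\infty > 0$ such that $\max_{\mathbf{x} \in \mathcal{D}} \|\nabla f_t(\mathbf{x})\|_\infty \le G_\infty$ holds for all $t \in [T]$.
Assumption 2. The decision set $\mathcal{D}$ is bounded. Specifically, there exists a constant $D_\infty > 0$ such that $\max_{\mathbf{x}_1, \mathbf{x}_2 \in \mathcal{D}} \|\mathbf{x}_1 - \mathbf{x}_2\|_\infty \le D_\infty$.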
We are now ready to present our algorithm, which follows the general framework of Adam and is summarized in Algorithm 1. In each round $t$, we first observe the gradient at $\mathbf{x}_t$ (Step 4), then compute the first-order momentum $\hat{\mathbf{g}}_t$ (Step 5). Here $\beta_{1t}$ is a time-variant non-increasing hyperparameter, which is commonly set as $\beta_{1t} = \beta_1 \nu^{t-1}$, where $\beta_1 \in [0, 1)$ and $\nu \in (0, 1)$ (Kingma & Ba, 2015; Reddi et al., 2018). Next, we calculate the second-order momentum $V_t$ by an EMA of the squares of past gradients (Step 6). This procedure is controlled by $\beta_{2t}$, whose value will be discussed later. After that, we add a vanishing factor to the diagonal of $V_t$ and get $\hat{V}_t$ (Step 7), which is a standard technique for avoiding overly large steps caused by small gradients in the early iterations. Finally, we update the decision by $\hat{\mathbf{g}}_t$ and $\hat{V}_t$, which is then projected onto the decision set (Step 8).
While SAdam is inspired by Adam, there exist two key differences: one is the update rule of $\mathbf{x}_t$ in Step 8, the other is the configuration of $\beta_{2t}$ in Step 6. Intuitively, both modifications stem from strongly convex OGD, and jointly result in a faster decaying yet well-controlled step size, which helps utilize the strong convexity while preserving the practical benefits of Adam. Specifically, in the first modification, we remove the square root operation in (4) of Adam, and update $\mathbf{x}_t$ at Step 8 as follows
$$\mathbf{x}_{t+1} = \Pi_{\mathcal{D}}^{\hat{V}_t}\left(\mathbf{x}_t - \frac{\alpha}{t} \hat{V}_t^{-1} \hat{\mathbf{g}}_t\right).$$
In this way, the step size used to update the $i$-th element of $\mathbf{x}_t$ is $\frac{\alpha}{t} \hat{v}_{t,i}^{-1}$, which decays in general on the order of $O(1/t)$, and can still be automatically tuned on a per-feature basis via the EMA of the historical gradients.
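The steps described above can be put together in a short sketch. This is not the authors' Algorithm 1 verbatim; it is a minimal illustration under several labeled assumptions: the decision set is a box $[lo, hi]^d$ (for which the weighted projection with a diagonal matrix is coordinate-wise clipping), the first-order weight is $\beta_{1t} = \beta_1 \nu^{t-1}$, the second-order weight is taken as $\beta_{2t} = 1 - \gamma/t$, and the vanishing factor is taken as $\delta/t$. The names `sadam_step`, `nu`, `gamma`, and all default values are hypothetical.

```python
def sadam_step(x, grad, m, v, t, alpha=0.1, beta1=0.9, nu=0.999,
               gamma=0.9, delta=1e-2, lo=-1.0, hi=1.0):
    """One SAdam-style update on a box-shaped decision set (illustrative)."""
    beta1_t = beta1 * nu ** (t - 1)      # assumed non-increasing first-order weight
    beta2_t = 1.0 - gamma / t            # assumed time-variant second-order weight
    new_x, new_m, new_v = [], [], []
    for xi, gi, mi, vi in zip(x, grad, m, v):
        mi = beta1_t * mi + (1.0 - beta1_t) * gi        # first-order momentum
        vi = beta2_t * vi + (1.0 - beta2_t) * gi * gi   # second-order momentum
        v_hat = vi + delta / t                          # assumed vanishing factor
        xi = xi - (alpha / t) * mi / v_hat              # no square root on v_hat
        new_x.append(max(lo, min(hi, xi)))              # box projection by clipping
        new_m.append(mi)
        new_v.append(vi)
    return new_x, new_m, new_v

# Toy 1-strongly convex loss f(x) = 0.5 * (x - 0.3)^2 (an assumption).
x, m, v = [1.0], [0.0], [0.0]
for t in range(1, 2001):
    x, m, v = sadam_step(x, [x[0] - 0.3], m, v, t)
print(x)
```

Note that the denominator `t * v_hat` behaves like an accumulated sum of squared gradients, so the step size shrinks on the order of $1/t$ when gradients are persistent and stays comparatively large for coordinates that are rarely active.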
The second modification is made to $\beta_{2t}$, which determines the value of $V_t$ and thus also controls the decaying rate of the step size. To help understand the motivation behind our algorithm, we first revisit Adam, where $\beta_{2t}$ is simply set to be constant, which, however, could cause rapid fluctuation of the step size, and further leads to the non-convergence issue. To ensure convergence, Reddi et al. (2018) propose that $\beta_{2t}$ should satisfy the following two conditions:
For some and , ,
The first condition implies that the difference between the inverses of step sizes in two consecutive rounds is positive. It is inherently motivated by convex OGD (i.e., OGD with step size $\alpha/\sqrt{t}$, where $\alpha > 0$ is a constant factor), where
$$\frac{\sqrt{t}}{\alpha} - \frac{\sqrt{t-1}}{\alpha} \ge 0, \quad \forall t \in [T]$$
is a key condition used in the analysis. We first modify Condition 1 by mimicking the behavior of strongly convex OGD, as we are devoted to minimizing regret for strongly convex functions. In strongly convex OGD (Hazan et al., 2007), the step size at each round $t$ is set as $\alpha/t$ with $\alpha \ge 1/\lambda$ for $\lambda$-strongly convex functions. Under this configuration, we have
$$\frac{t}{\alpha} - \frac{t-1}{\alpha} \le \lambda, \quad \forall t \in [T]. \tag{7}$$
Motivated by this, we propose the following condition for our SAdam, which is an analog to (7).
There exists a constant such that for any , we have and ,
Note that the extra in the right-hand side of (8) is necessary because SAdam involves the first-order momentum in its update.
Finally, since the step size of SAdam scales with rather than in Adam, we modify Condition 2 accordingly as follows:
For some , and ,
3.2 Theoretical Guarantees
In the following, we give a general regret bound when the two conditions are satisfied.
The above theorem implies that our algorithm enjoys an regret bound, which is in the worst case, and automatically becomes tighter whenever the gradients are small or sparse such that for some . The superiority of data-dependent bounds has been demonstrated in a long line of work, such as Duchi et al. (2011); Mukkamala & Hein (2017); Reddi et al. (2018). In the following, we give some concrete examples:
Consider a one-dimensional sparse setting where a non-zero gradient appears with probability and is a constant. Then , which is a constant factor.
Consider a high-dimensional sparse setting where in each dimension of gradient non-zero element appears with probability with being a constant. Then, , which is much tighter than .
Furthermore, as a special case, by setting and , our algorithm reduces to SC-RMSprop (Mukkamala & Hein, 2017), which is a variant of RMSprop for strongly convex functions. Although Mukkamala & Hein (2017) have provided theoretical guarantees for this algorithm, we note that their regret bound is in fact data-independent. Specifically, the regret bound provided by Mukkamala & Hein (2017) takes the following form:
Focusing on the denominator of the last term in (11), we have
which implies that their regret bound is of order , and thus data-independent. In contrast, based on Corollary 2, we present a new regret bound for SC-RMSprop in the following, which is , and thus data-dependent.
4 Experiments
In this section, we present empirical results on optimizing strongly convex functions and training deep networks.
Algorithms. In both experiments, we compare the following algorithms:
SC-Adagrad (Mukkamala & Hein, 2017), with step size .
SC-RMSprop (Mukkamala & Hein, 2017), with step size and .
AdamNC (Reddi et al., 2018), with , , and for convex problems and a time-invariant for non-convex problems.
Online Gradient Descent (OGD), with step size for strongly convex problems and a time-invariant for non-convex problems.
Our proposed SAdam, with , .
For Adam, AdamNC and AMSgrad, we choose according to the recommendations in their papers. For SC-Adagrad and SC-RMSprop, following Mukkamala & Hein (2017), we choose a time-variant for each dimension , with , for convex problems and , for non-convex problems. For our SAdam, since removing the square root, combined with very small gradients, may cause overly large step sizes in the early iterations, we use a rather large to avoid this problem. To conduct a fair comparison, for each algorithm, we choose from the set and report the best results.
Datasets. In both experiments, we examine the performances of the aforementioned algorithms on three widely used datasets: MNIST (60000 training samples, 10000 test samples), CIFAR10 (50000 training samples, 10000 test samples), and CIFAR100 (50000 training samples, 10000 test samples). We refer to LeCun (1998) and Krizhevsky (2009) for more details of the three datasets.
4.1 Optimizing Strongly Convex Functions
In the first experiment, we consider the problem of mini-batch $\ell_2$-regularized softmax regression, which belongs to the online strongly convex optimization framework. Let be the number of classes and be the batch size. In each round , a mini-batch of training samples first arrives, where . Then, the algorithm predicts parameter vectors , and suffers a loss which takes the following form:
The value of and are set to be for all experiments. The regret (in log scale) vs. dataset proportion is shown in Figure 1. It can be seen that our SAdam outperforms the other methods across all the considered datasets. Besides, we observe that data-dependent strongly convex methods such as SC-Adagrad, SC-RMSprop and SAdam perform better than algorithms for general convex functions such as Adam, AMSgrad and AdamNC. Finally, OGD has the overall highest regret on all three datasets.
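For readers who want to experiment with this setting, a mini-batch $\ell_2$-regularized softmax regression loss can be sketched as below. The exact form used in the experiments is not reproduced here; the sketch assumes the standard average cross-entropy plus squared $\ell_2$ penalties on weights and biases, and the names `softmax_regression_loss`, `lam1`, `lam2` as well as their default values are hypothetical.

```python
import math

def softmax_regression_loss(W, b, batch, lam1=1e-2, lam2=1e-2):
    """Mini-batch l2-regularized softmax loss (assumed standard form).

    W: list of K weight vectors, b: list of K biases,
    batch: list of (features, label) pairs with labels in 0..K-1.
    """
    data_term = 0.0
    for x, y in batch:
        scores = [sum(wi * xi for wi, xi in zip(w, x)) + bk
                  for w, bk in zip(W, b)]
        top = max(scores)                       # shift for a stable log-sum-exp
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        data_term += log_norm - scores[y]       # negative log-likelihood of label y
    data_term /= len(batch)
    reg = (lam1 * sum(wi * wi for w in W for wi in w)
           + lam2 * sum(bk * bk for bk in b))
    return data_term + reg

W = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
b = [0.0, 0.0, 0.0]
batch = [([1.0, 2.0], 0), ([0.5, -1.0], 2)]
print(softmax_regression_loss(W, b, batch))
```

The squared penalties are what make each round's loss strongly convex in the parameters, which is why this problem fits the online strongly convex framework.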
4.2 Training Deep Networks
Figure 2: Training loss vs. number of epochs for the 4-layer CNN.
… 3), one max-pooling layer (with a 2 × 2 window and 0.25 dropout), and one fully connected layer (with 128 hidden units and 0.5 dropout). We employ the ReLU function as the activation function for convolutional layers and the softmax function as the activation function for the fully connected layer. The loss function is the cross-entropy. The training loss vs. epoch is shown in Figure 2, and the test accuracy vs. epoch is presented in Figure 3. As can be seen, our SAdam achieves the lowest training loss on the three datasets. Moreover, this performance gain also translates into good performance on test accuracy. The experimental results show that although our proposed SAdam is designed for strongly convex functions, it could lead to superior practical performance even in some highly non-convex cases such as deep learning tasks.
5 Conclusion and Future Work
In this paper, we provide a variant of Adam adapted to strongly convex functions. The proposed algorithm, namely SAdam, follows the general framework of Adam, while keeping a step size decaying in general on the order of $O(1/t)$ and controlled by data-dependent hyperparameters in order to exploit strong convexity. Our theoretical analysis shows that SAdam achieves a data-dependent logarithmic regret bound for strongly convex functions, which means that it converges much faster than Adam, AdamNC, and AMSgrad in such cases, and can also enjoy a large gain in the face of sparse gradients. In addition, we also provide the first data-dependent logarithmic regret bound for the SC-RMSprop algorithm. Finally, we test the proposed algorithm on optimizing strongly convex functions as well as training deep networks, and the empirical results demonstrate the effectiveness of our method.
Since SAdam enjoys a data-dependent logarithmic regret for online strongly convex optimization, it can be easily translated into a data-dependent $O(\log T / T)$ convergence rate for stochastic strongly convex optimization (SSCO) by using the online-to-batch conversion (Kakade & Tewari, 2009). However, this rate is not optimal for SSCO, and it is still an open problem how to achieve a data-dependent $O(1/T)$ convergence rate for SSCO. Recent development on adaptive gradient methods (Chen et al., 2018b) has proved that Adagrad combined with a multi-stage scheme (Hazan & Kale, 2014) can achieve this rate, but it is highly non-trivial to extend this technique to SAdam, and we leave it as future work.
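In its simplest form, the online-to-batch conversion outputs the average of the decisions produced by the online learner. The helper below is a minimal sketch of that averaging step only; the name `average_iterate` is hypothetical, and it is assumed that the online algorithm has already produced a list of equal-length decision vectors.

```python
def average_iterate(iterates):
    """Return the coordinate-wise mean of a list of equal-length decision vectors."""
    n = len(iterates)
    return [sum(coords) / n for coords in zip(*iterates)]

# Two decisions produced by some online learner (toy values).
print(average_iterate([[0.0, 2.0], [2.0, 4.0]]))
```

For convex losses drawn i.i.d., Jensen's inequality bounds the expected excess risk of this average by the expected regret divided by the number of rounds, which is how a logarithmic regret turns into a rate of order $\log T / T$.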
References
- Bahar et al. (2017) Bahar, P., Alkhouli, T., Peter, J.-T., Brix, C. J.-S., and Ney, H. Empirical investigation of optimization algorithms in neural machine translation. The Prague Bulletin of Mathematical Linguistics, 108(1):13–25, 2017.
- Basu et al. (2018) Basu, A., De, S., Mukherjee, A., and Ullah, E. Convergence guarantees for rmsprop and adam in non-convex optimization and their comparison to nesterov acceleration on autoencoders. arXiv preprint arXiv:1807.06766, 2018.
- Boyd & Vandenberghe (2004) Boyd, S. and Vandenberghe, L. Convex optimization. Cambridge university press, 2004.
- Chen et al. (2018a) Chen, X., Liu, S., Sun, R., and Hong, M. On the convergence of a class of adam-type algorithms for non-convex optimization. arXiv preprint arXiv:1808.02941, 2018a.
- Chen et al. (2018b) Chen, Z., Xu, Y., Chen, E., and Yang, T. Sadagrad: Strongly adaptive stochastic gradient methods. In Proceedings of 35th International Conference on Machine Learning, pp. 912–920, 2018b.
- Denkowski & Neubig (2017) Denkowski, M. and Neubig, G. Stronger baselines for trustable results in neural machine translation. arXiv preprint arXiv:1706.09733, 2017.
- Dozat (2016) Dozat, T. Incorporating nesterov momentum into adam. In Proceedings of 4th International Conference on Learning Representations, Workshop Track, 2016.
- Duchi et al. (2011) Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
- Gregor et al. (2015) Gregor, K., Danihelka, I., Graves, A., Rezende, D. J., and Wierstra, D. Draw: a recurrent neural network for image generation. In Proceedings of the 32nd International Conference on Machine Learning, pp. 1462–1471, 2015.
- Hazan & Kale (2014) Hazan, E. and Kale, S. Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research, 15:2489–2512, 2014.
- Hazan et al. (2007) Hazan, E., Agarwal, A., and Kale, S. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69:169–192, 2007.
- Hazan et al. (2016) Hazan, E. et al. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.
- Kakade & Tewari (2009) Kakade, S. M. and Tewari, A. On the generalization ability of online strongly convex programming algorithms. In Advances in Neural Information Processing Systems 21, pp. 801–808, 2009.
- Kingma & Ba (2015) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Proceedings of 3rd International Conference on Learning Representations, 2015.
- Kiros et al. (2015) Kiros, R., Zhu, Y., Salakhutdinov, R. R., Zemel, R., Urtasun, R., Torralba, A., and Fidler, S. Skip-thought vectors. In Advances in Neural Information Processing Systems 27, pp. 3294–3302, 2015.
- Krizhevsky (2009) Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
- LeCun (1998) LeCun, Y. The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
- McMahan & Streeter (2010) McMahan, H. B. and Streeter, M. Adaptive bound optimization for online convex optimization. In Proceedings of the 23rd Annual Conference on Learning Theory, pp. 224–256, 2010.
- Mukkamala & Hein (2017) Mukkamala, M. C. and Hein, M. Variants of rmsprop and adagrad with logarithmic regret bounds. In Proceedings of the 34th International Conference on Machine Learning, pp. 2545–2553, 2017.
- Reddi et al. (2018) Reddi, S. J., Kale, S., and Kumar, S. On the convergence of adam and beyond. In Proceedings of 6th International Conference on Learning Representations, 2018.
- Shalev-Shwartz et al. (2012) Shalev-Shwartz, S. et al. Online learning and online convex optimization. Foundations and Trends® in Machine Learning, 4(2):107–194, 2012.
- Shazeer & Stern (2018) Shazeer, N. and Stern, M. Adafactor: Adaptive learning rates with sublinear memory cost. arXiv preprint arXiv:1804.04235, 2018.
- Tieleman & Hinton (2012) Tieleman, T. and Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, pp. 26–31, 2012.
- Xu et al. (2015) Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., and Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of 32nd International Conference on Machine Learning, pp. 2048–2057, 2015.
- Zeiler (2012) Zeiler, M. D. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
- Zhang et al. (2018) Zhang, J., Cui, L., and Gouza, F. B. Gadam: Genetic-evolutionary adam for deep neural network optimization. arXiv preprint arXiv:1805.07500, 2018.
- Zinkevich (2003) Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, pp. 928–936, 2003.
Appendix A Proof of Theorem 1
From Definition 1, we can upper bound regret as
where is the best decision in hindsight. On the other hand, by the update rule of in Algorithm 1, we have
where , and the inequality is due to the following lemma, which implies that the weighted projection procedure is non-expansive.
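Lemma 1 (McMahan & Streeter, 2010). Let $A \in \mathbb{R}^{d \times d}$ be a positive definite matrix and $\mathcal{D} \subseteq \mathbb{R}^d$ be a convex set. Then we have, $\forall \mathbf{x}, \mathbf{y} \in \mathbb{R}^d$,
$$\|\Pi_{\mathcal{D}}^{A}(\mathbf{x}) - \Pi_{\mathcal{D}}^{A}(\mathbf{y})\|_A \le \|\mathbf{x} - \mathbf{y}\|_A.$$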
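The non-expansiveness of the weighted projection can be checked numerically in a simple special case. The sketch below assumes a diagonal positive definite matrix and a box-shaped decision set, for which the weighted projection reduces to coordinate-wise clipping; the names `weighted_norm`, `project_box`, and the sampled ranges are hypothetical, and the check is an illustration, not a proof.

```python
import random

def weighted_norm(z, a):
    """Norm of z weighted by the diagonal matrix diag(a)."""
    return sum(ai * zi * zi for ai, zi in zip(a, z)) ** 0.5

def project_box(x, lo, hi):
    """Projection onto the box [lo, hi]^d; for diagonal weights this
    coincides with the weighted projection."""
    return [max(lo, min(hi, xi)) for xi in x]

random.seed(0)
a = [0.5, 2.0, 10.0]                       # diagonal of the weight matrix
worst_gap = float("inf")
for _ in range(1000):
    x = [random.uniform(-3.0, 3.0) for _ in a]
    y = [random.uniform(-3.0, 3.0) for _ in a]
    px, py = project_box(x, -1.0, 1.0), project_box(y, -1.0, 1.0)
    lhs = weighted_norm([p - q for p, q in zip(px, py)], a)
    rhs = weighted_norm([p - q for p, q in zip(x, y)], a)
    worst_gap = min(worst_gap, rhs - lhs)
print(worst_gap >= 0.0)
```

The smallest observed gap between the two sides stays non-negative, in line with the lemma.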
Rearranging (14), we have
where the last inequality is due to . We proceed to bound the second term of the inequality above. When , by Young’s inequality and the definition of , we get
When , this term becomes 0 since in Algorithm 1. Plugging (16) and (17) into (13), we get the following inequality, of which we divide the right-hand side into three parts and upper bound each of them one by one.
To bound , we have
where the inequality is derived from . For the first term in the last equality of (18), we have
where the second equality is because , and the second inequality is due to .
For the second term of (18), we have
To bound , we introduce the following lemma.
The following inequality holds
By Lemma 2, we have
Finally, we turn to upper bound :