SAdam: A Variant of Adam for Strongly Convex Functions

05/08/2019 ∙ by Guanghui Wang, et al. ∙ Nanjing University

The Adam algorithm has become extremely popular for large-scale machine learning. Under the convexity condition, it has been proved to enjoy a data-dependent $O(\sqrt{T})$ regret bound, where $T$ is the time horizon. However, whether strong convexity can be utilized to further improve the performance remains an open problem. In this paper, we give an affirmative answer by developing a variant of Adam (referred to as SAdam) which achieves a data-dependent $O(\log T)$ regret bound for strongly convex functions. The essential idea is to maintain a faster-decaying yet well-controlled step size for exploiting strong convexity. In addition, under a special configuration of hyperparameters, our SAdam reduces to SC-RMSprop, a recently proposed variant of RMSprop for strongly convex functions, for which we provide the first data-dependent logarithmic regret bound. Empirical results on optimizing strongly convex functions and training deep networks demonstrate the effectiveness of our method.


1 Introduction

Online Convex Optimization (OCO) is a well-established learning framework with both theoretical and practical appeal (Shalev-Shwartz et al., 2012). It proceeds in a sequence of consecutive rounds: in each round $t$, a learner first chooses a decision $\mathbf{x}_t$ from a convex set $\mathcal{D} \subseteq \mathbb{R}^d$; at the same time, an adversary reveals a loss function $f_t(\cdot): \mathbb{R}^d \mapsto \mathbb{R}$, and consequently the learner suffers a loss $f_t(\mathbf{x}_t)$. The goal is to minimize regret, defined as the difference between the cumulative loss of the learner and that of the best decision in hindsight (Hazan et al., 2016):

$$R(T) := \sum_{t=1}^{T} f_t(\mathbf{x}_t) - \min_{\mathbf{x} \in \mathcal{D}} \sum_{t=1}^{T} f_t(\mathbf{x}).$$

The most classic algorithm for OCO is Online Gradient Descent (OGD) (Zinkevich, 2003), which attains an $O(\sqrt{T})$ regret. OGD iteratively performs a descent step along the negative gradient direction with a predetermined step size, which is oblivious to the characteristics of the data being observed. As a result, its regret bound is data-independent and cannot benefit from the structure of the data. To address this limitation, various adaptive gradient methods, such as Adagrad (Duchi et al., 2011), RMSprop (Tieleman & Hinton, 2012) and Adadelta (Zeiler, 2012), have been proposed to exploit the geometry of historical data. Among them, Adam (Kingma & Ba, 2015), which dynamically adjusts the step size and the update direction using exponential averages of the past gradients, has become extremely popular and has been successfully applied in many applications (Xu et al., 2015; Gregor et al., 2015; Kiros et al., 2015; Denkowski & Neubig, 2017; Bahar et al., 2017). Despite its outstanding performance, Reddi et al. (2018) pointed out that Adam suffers from a non-convergence issue, and developed two modified versions, namely AMSgrad and AdamNC. These variants are equipped with data-dependent regret bounds, which are $O(\sqrt{T})$ in the worst case and become tighter when gradients are sparse.

While the theoretical behavior of Adam in the convex case has become clear, it remains an open problem whether strong convexity can be exploited to achieve better performance. This property arises, for instance, in support vector machines as well as other regularized learning problems, and it is well known that vanilla OGD with an appropriately chosen step size enjoys a much better $O(\log T)$ regret bound for strongly convex functions (Hazan et al., 2007). In this paper, we propose a variant of Adam adapted to strongly convex functions, referred to as SAdam. Our algorithm follows the general framework of Adam, yet keeps a faster-decaying step size controlled by time-variant hyperparameters to exploit strong convexity. Theoretical analysis demonstrates that SAdam achieves a data-dependent $O(\log T)$ regret bound for strongly convex functions, which means that it converges faster than AMSgrad and AdamNC in such cases, and also enjoys a huge gain in the face of sparse gradients.
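To make the contrast between the two OGD step-size schedules concrete, the following minimal sketch runs projected OGD on a toy one-dimensional problem and measures regret against the best fixed decision in hindsight. The losses $f_t(x) = \frac{1}{2}(x - a_t)^2$, the interval $[-1, 1]$, and the data stream are illustrative assumptions, not the setup used in this paper.

```python
import math

def project(x, lo=-1.0, hi=1.0):
    """Euclidean projection of a scalar onto the interval [lo, hi]."""
    return max(lo, min(hi, x))

def ogd_regret(targets, step_size):
    """Run projected OGD on f_t(x) = 0.5 * (x - a_t)^2 and return the regret."""
    x = 0.0
    learner_loss = 0.0
    for t, a in enumerate(targets, start=1):
        learner_loss += 0.5 * (x - a) ** 2       # suffer f_t(x_t)
        grad = x - a                             # gradient of f_t at x_t
        x = project(x - step_size(t) * grad)     # descent step, then projection
    # For these quadratic losses the best fixed decision is the mean target,
    # which lies in [-1, 1] because every target does.
    best = sum(targets) / len(targets)
    best_loss = sum(0.5 * (best - a) ** 2 for a in targets)
    return learner_loss - best_loss

# Hypothetical data stream with values in [-1, 1].
targets = [math.sin(0.37 * t) for t in range(1, 2001)]
regret_convex = ogd_regret(targets, lambda t: 1.0 / math.sqrt(t))  # convex OGD
regret_strong = ogd_regret(targets, lambda t: 1.0 / t)             # strongly convex OGD
print(regret_convex, regret_strong)
```

Each toy loss is 1-strongly convex with gradients bounded by 2 on the interval, so the standard analyses bound the first regret by a multiple of $\sqrt{T}$ and the second by a multiple of $1 + \log T$.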

Furthermore, under a special configuration of hyperparameters, the proposed algorithm reduces to SC-RMSprop (Mukkamala & Hein, 2017), which is a variant of the RMSprop algorithm for strongly convex functions. We provide an alternative proof for SC-RMSprop, and establish the first data-dependent logarithmic regret bound. Finally, we evaluate the proposed algorithm on strongly convex problems as well as deep networks, and the empirical results demonstrate the effectiveness of our method.

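Notation. Throughout the paper, we use lower case bold face letters to denote vectors, lower case letters to denote scalars, and upper case letters to denote matrices. We use $\|\cdot\|$ to denote the $\ell_2$-norm and $\|\cdot\|_\infty$ the infinite norm. For a positive definite matrix $V \in \mathbb{R}^{d \times d}$, the weighted $\ell_2$-norm is defined by $\|\mathbf{x}\|_V^2 = \mathbf{x}^\top V \mathbf{x}$. The $V$-weighted projection $\Pi_{\mathcal{D}}^{V}(\mathbf{x})$ of $\mathbf{x}$ onto $\mathcal{D}$ is defined by

$$\Pi_{\mathcal{D}}^{V}(\mathbf{x}) = \operatorname*{argmin}_{\mathbf{y} \in \mathcal{D}} \|\mathbf{y} - \mathbf{x}\|_V^2.$$

We use $\mathbf{g}_t$ to denote the gradient of $f_t(\cdot)$ at $\mathbf{x}_t$. For a vector sequence $\{\mathbf{v}_t\}_{t=1}^{T}$, we denote the $i$-th element of $\mathbf{v}_t$ by $v_{t,i}$. For a diagonal matrix sequence $\{M_t\}_{t=1}^{T}$, we use $m_{t,i}$ to denote the $i$-th element in the diagonal of $M_t$. We use $\mathbf{g}_{1:t,i} = [g_{1,i}, \dots, g_{t,i}]$ to denote the vector obtained by concatenating the $i$-th element of the gradient sequence $\{\mathbf{g}_t\}_{t=1}^{T}$.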

2 Related Work

In this section, we briefly review related work in online convex and strongly convex optimization.

In the literature, most studies are devoted to the minimization of regret for convex functions. Under the assumptions that the infinite norm of the gradients and the diameter of the decision set are bounded, OGD with step size on the order of $O(1/\sqrt{t})$ (referred to as convex OGD) achieves a data-independent $O(d\sqrt{T})$ regret (Zinkevich, 2003), where $d$ is the dimension. To conduct more informative updates, Duchi et al. (2011) introduce the Adagrad algorithm, which adjusts the step size of OGD on a per-dimension basis according to the geometry of the past gradients. In particular, the diagonal version of the algorithm updates decisions as

$$\mathbf{x}_{t+1} = \mathbf{x}_t - \frac{\alpha}{\sqrt{t}} V_t^{-1/2} \mathbf{g}_t, \tag{1}$$

where $\alpha > 0$ is a constant factor, $V_t$ is a diagonal matrix, and $\forall i \in [d]$, $v_{t,i} = \frac{1}{t}\sum_{j=1}^{t} g_{j,i}^2$ is the arithmetic average of the square of the $i$-th elements of the past gradients. Intuitively, while the step size of Adagrad, i.e., $\frac{\alpha}{\sqrt{t}} V_t^{-1/2}$, decreases generally on the order of $O(1/\sqrt{t})$ as in convex OGD, the additional matrix $V_t^{-1/2}$ automatically increases step sizes for sparse dimensions in order to seize the infrequent yet valuable information therein. For convex functions, Adagrad enjoys an $O\big(\sum_{i=1}^{d} \|\mathbf{g}_{1:T,i}\|\big)$ regret, which is $O(d\sqrt{T})$ in the worst case and becomes tighter when gradients are sparse.
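As a rough illustration of the per-dimension scaling in diagonal Adagrad, the sketch below performs one update of the form "step $\alpha/\sqrt{t}$ divided by the square root of the average squared gradient". The function name, the default `alpha`, and the handling of never-updated coordinates are assumptions made for this example only.

```python
import math

def adagrad_step(x, grad, sum_sq, t, alpha=0.1):
    """One diagonal Adagrad update; returns the new point and updated sums."""
    new_x, new_sum = [], []
    for xi, gi, si in zip(x, grad, sum_sq):
        si = si + gi * gi                  # running sum of squared gradients
        v = si / t                         # arithmetic average of squared gradients
        # (alpha / sqrt(t)) * g / sqrt(v); a coordinate that has never seen a
        # non-zero gradient is left untouched (assumption for this sketch).
        step = (alpha / math.sqrt(t)) * gi / math.sqrt(v) if v > 0 else 0.0
        new_x.append(xi - step)
        new_sum.append(si)
    return new_x, new_sum

# First round: a large and a small gradient coordinate move by the same amount.
x, sums = adagrad_step([0.0, 0.0], [2.0, 0.5], [0.0, 0.0], t=1)
print(x, sums)
```

On the first round both coordinates move by exactly `alpha`, regardless of the gradient magnitude, which is the normalization effect described above.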

Although Adagrad works well in sparse cases, its performance has been found to deteriorate when gradients are dense, due to the rapid decay of the step size caused by using all the past gradients in the update (Zeiler, 2012). To tackle this issue, Tieleman & Hinton (2012) propose RMSprop, which replaces the arithmetic average with an Exponential Moving Average (EMA), i.e.,

$$V_t = \beta V_{t-1} + (1-\beta)\,\mathrm{diag}(\mathbf{g}_t \mathbf{g}_t^\top),$$

where $\beta \in [0,1)$ is a hyperparameter, and $\mathrm{diag}(\cdot)$ denotes extracting the diagonal matrix. In this way, the weights assigned to past gradients decay exponentially, so that the update essentially relies only on the few most recent gradients. Since the invention of RMSprop, many EMA variants of Adagrad have been developed (Zeiler, 2012; Kingma & Ba, 2015; Dozat, 2016). One of the most popular algorithms is Adam (Kingma & Ba, 2015), where the first-order momentum acceleration, shown in (2), is incorporated into RMSprop to boost the performance:

$$\hat{\mathbf{m}}_t = \beta_1 \hat{\mathbf{m}}_{t-1} + (1-\beta_1)\mathbf{g}_t \tag{2}$$
$$V_t = \beta_2 V_{t-1} + (1-\beta_2)\,\mathrm{diag}(\mathbf{g}_t \mathbf{g}_t^\top) \tag{3}$$
$$\mathbf{x}_{t+1} = \mathbf{x}_t - \frac{\alpha}{\sqrt{t}} V_t^{-1/2} \hat{\mathbf{m}}_t \tag{4}$$
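The three Adam updates (first-order momentum, EMA of squared gradients, and the scaled step) can be sketched coordinate-wise as follows. This is a simplified illustration without bias correction; the small `eps` guard against division by zero and all default values are assumptions of this example.

```python
import math

def adam_step(x, m, v, grad, t, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-12):
    """One simplified Adam update on lists of floats (no bias correction)."""
    new_x, new_m, new_v = [], [], []
    for xi, mi, vi, gi in zip(x, m, v, grad):
        mi = beta1 * mi + (1.0 - beta1) * gi         # first-order momentum
        vi = beta2 * vi + (1.0 - beta2) * gi * gi    # EMA of squared gradients
        # Step of size alpha / sqrt(t), scaled by the inverse square root of v.
        xi = xi - (alpha / math.sqrt(t)) * mi / (math.sqrt(vi) + eps)
        new_x.append(xi)
        new_m.append(mi)
        new_v.append(vi)
    return new_x, new_m, new_v

x, m, v = adam_step([0.0], [0.0], [0.0], [1.0], t=1)
print(x, m, v)
```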

While it has been successfully applied to various practical applications, a recent study by Reddi et al. (2018) shows that Adam could fail to converge to the optimal decision even in some simple one-dimensional convex scenarios, due to the potential rapid fluctuation of the step size. To resolve this issue, they design two modified versions of Adam. The first one is AMSgrad, where an additional element-wise maximization procedure, i.e.,

$$\hat{V}_t = \max\{\hat{V}_{t-1}, V_t\},$$

is employed before the update of $\mathbf{x}_t$ to ensure a stable step size. The other is AdamNC, where the framework of Adam remains unchanged, yet a time-variant $\beta_2$ (i.e., $\{\beta_{2t}\}_{t=1}^{T}$) is adopted to keep the step size under control. Theoretically, both algorithms achieve data-dependent regret bounds, which are on the order of $d\sqrt{T}$ (up to logarithmic factors) in the worst case, and enjoy a huge gain when data is sparse.
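The element-wise maximization used by AMSgrad can be illustrated with a short sketch: the EMA of squared gradients may shrink after a large gradient is followed by small ones, while the running maximum never does. The function name and the default `beta2` are assumptions for this example.

```python
def amsgrad_second_moment(v_hat_prev, v_prev, grad, beta2=0.999):
    """Return (v_hat, v): the running element-wise maximum and the plain EMA."""
    v = [beta2 * vp + (1.0 - beta2) * g * g for vp, g in zip(v_prev, grad)]
    v_hat = [max(h, vi) for h, vi in zip(v_hat_prev, v)]
    return v_hat, v

# A large gradient followed by a zero gradient.
v_hat1, v1 = amsgrad_second_moment([0.0], [0.0], [10.0])
v_hat2, v2 = amsgrad_second_moment(v_hat1, v1, [0.0])
print(v1, v2, v_hat2)
```

Here the EMA decays in the second round, but the maximized quantity stays at its earlier value, so the effective step size cannot suddenly grow.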

Note that the aforementioned algorithms are mainly analysed in general convex settings and suffer at least $O(d\sqrt{T})$ regret in the worst case. For online strongly convex optimization, the classical OGD with step size proportional to $O(1/t)$ (referred to as strongly convex OGD) achieves a data-independent $O(d\log T)$ regret (Hazan et al., 2007). Inspired by this, Mukkamala & Hein (2017) modify the update rule of Adagrad in (1) as follows

$$\mathbf{x}_{t+1} = \mathbf{x}_t - \frac{\alpha}{t} V_t^{-1} \mathbf{g}_t,$$

so that the step size decays approximately on the order of $O(1/t)$, which is similar to that in strongly convex OGD. The new algorithm, named SC-Adagrad, is proved to enjoy a data-dependent regret bound of $O\big(\sum_{i=1}^{d}\log(\|\mathbf{g}_{1:T,i}\|^2)\big)$, which is $O(d\log T)$ in the worst case. They further extend this idea to RMSprop, and propose an algorithm named SC-RMSprop. However, as pointed out in Section 3, their regret bound for SC-RMSprop is in fact data-independent, and our paper provides the first data-dependent regret bound for this algorithm.
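A minimal sketch of an SC-Adagrad-style step, assuming the diagonal form in which the step $\alpha/t$ is divided by the average squared gradient (equivalently, $\alpha$ divided by the running sum of squared gradients). The optional `delta` damping term and the function name are assumptions added for numerical safety in this example.

```python
def sc_adagrad_step(x, grad, sum_sq, alpha=1.0, delta=0.0):
    """One SC-Adagrad-style update with no square root in the denominator."""
    new_x, new_sum = [], []
    for xi, gi, si in zip(x, grad, sum_sq):
        si = si + gi * gi
        denom = si + delta            # equals t * v_{t,i}, plus optional damping
        step = alpha * gi / denom if denom > 0 else 0.0
        new_x.append(xi - step)
        new_sum.append(si)
    return new_x, new_sum

x, sums = sc_adagrad_step([0.0], [2.0], [0.0])
print(x, sums)
```

Compared with the Adagrad step, the missing square root makes the effective step size shrink on the order of $1/t$ rather than $1/\sqrt{t}$ when gradients are dense.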

Very recently, several modifications of Adam adapted to non-convex settings have been developed (Chen et al., 2018a; Basu et al., 2018; Zhang et al., 2018; Shazeer & Stern, 2018). However, to our knowledge, none of these algorithms is particularly designed for strongly convex functions, nor enjoys a logarithmic regret bound.

3 SAdam

In this section, we first describe the proposed algorithm, then state its theoretical guarantees, and finally compare it with SC-RMSprop algorithm.

3.1 The Algorithm

Before proceeding to our algorithm, following previous studies, we introduce some standard definitions (Boyd & Vandenberghe, 2004) and assumptions (Reddi et al., 2018).

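Definition 1.

A function $f(\cdot): \mathcal{D} \mapsto \mathbb{R}$ is $\lambda$-strongly convex if, $\forall \mathbf{x}_1, \mathbf{x}_2 \in \mathcal{D}$,

$$f(\mathbf{x}_1) \ge f(\mathbf{x}_2) + \nabla f(\mathbf{x}_2)^\top (\mathbf{x}_1 - \mathbf{x}_2) + \frac{\lambda}{2}\|\mathbf{x}_1 - \mathbf{x}_2\|^2. \tag{5}$$

Assumption 1.

The infinite norm of the gradients of all loss functions is bounded, i.e., there exists a constant $G_\infty$ such that $\max_{\mathbf{x} \in \mathcal{D}} \|\nabla f_t(\mathbf{x})\|_\infty \le G_\infty$ holds for all $t \in [T]$.

Assumption 2.

The decision set $\mathcal{D}$ is bounded. Specifically, there exists a constant $D_\infty$ such that $\max_{\mathbf{x}_1, \mathbf{x}_2 \in \mathcal{D}} \|\mathbf{x}_1 - \mathbf{x}_2\|_\infty \le D_\infty$.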
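As a quick sanity check of the strong convexity definition, the sketch below verifies the defining inequality numerically for a one-dimensional regularized logistic-type function on a grid. The function, the modulus `LAM`, and the grid are illustrative assumptions and are not taken from the paper.

```python
import math

LAM = 0.5  # hypothetical strong-convexity modulus

def f(x):
    """Convex logistic term plus a (LAM / 2) * x^2 regularizer."""
    return 0.5 * LAM * x * x + math.log1p(math.exp(x))

def grad_f(x):
    return LAM * x + 1.0 / (1.0 + math.exp(-x))

def satisfies_definition(x1, x2, tol=1e-12):
    """Check f(x1) >= f(x2) + f'(x2) * (x1 - x2) + (LAM / 2) * (x1 - x2)^2."""
    lower = f(x2) + grad_f(x2) * (x1 - x2) + 0.5 * LAM * (x1 - x2) ** 2
    return f(x1) >= lower - tol

grid = [i / 4.0 for i in range(-20, 21)]   # points in [-5, 5]
all_hold = all(satisfies_definition(a, b) for a in grid for b in grid)
print(all_hold)
```

Adding an $\ell_2$ regularizer to a convex loss is the typical way such strongly convex objectives arise in regularized learning problems.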

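Algorithm 1 SAdam
1:  Input: $\{\beta_{1t}\}_{t=1}^{T}$, $\{\beta_{2t}\}_{t=1}^{T}$, $\delta$
2:  Initialize: $\hat{\mathbf{m}}_0 = \mathbf{0}$, $V_0 = \mathbf{0}_{d \times d}$.
3:  for $t = 1, \dots, T$ do
4:     $\mathbf{g}_t = \nabla f_t(\mathbf{x}_t)$
5:     $\hat{\mathbf{m}}_t = \beta_{1t}\hat{\mathbf{m}}_{t-1} + (1-\beta_{1t})\mathbf{g}_t$
6:     $V_t = \beta_{2t} V_{t-1} + (1-\beta_{2t})\,\mathrm{diag}(\mathbf{g}_t \mathbf{g}_t^\top)$
7:     $\hat{V}_t = V_t + \frac{\delta}{t} I_d$
8:     $\mathbf{x}_{t+1} = \Pi_{\mathcal{D}}^{\hat{V}_t}\big(\mathbf{x}_t - \frac{\alpha}{t}\hat{V}_t^{-1}\hat{\mathbf{m}}_t\big)$
9:  end for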

We are now ready to present our algorithm, which follows the general framework of Adam and is summarized in Algorithm 1. In each round $t$, we first observe the gradient at $\mathbf{x}_t$ (Step 4), then compute the first-order momentum $\hat{\mathbf{m}}_t$ (Step 5). Here $\beta_{1t}$ is a time-variant non-increasing hyperparameter, which is commonly set as $\beta_{1t} = \beta_1\nu^{t-1}$, where $\beta_1, \nu \in [0,1)$ (Kingma & Ba, 2015; Reddi et al., 2018). Next, we calculate the second-order momentum $V_t$ by an EMA of the square of past gradients (Step 6). This procedure is controlled by $\beta_{2t}$, whose value will be discussed later. After that, we add a vanishing factor $\frac{\delta}{t}$ to the diagonal of $V_t$ and get $\hat{V}_t$ (Step 7), which is a standard technique for avoiding overly large steps caused by small gradients in the early iterations. Finally, we update the decision using $\hat{\mathbf{m}}_t$ and $\hat{V}_t$, and the result is then projected onto the decision set (Step 8).
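The steps described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes a diagonal second-moment matrix, a box-shaped decision set $[-r, r]^d$ (for which the weighted projection under a diagonal matrix reduces to coordinate-wise clipping), a starting point at the origin, and caller-supplied schedules `beta1t` and `beta2t`.

```python
def sadam(grad_fn, dim, T, alpha, delta, beta1t, beta2t, radius=1.0):
    """Sketch of SAdam with diagonal V on the box [-radius, radius]^dim."""
    x = [0.0] * dim          # starting point (assumption of this sketch)
    m = [0.0] * dim          # first-order momentum
    v = [0.0] * dim          # diagonal of the second-order momentum
    iterates = []
    for t in range(1, T + 1):
        iterates.append(list(x))
        g = grad_fn(t, x)                                  # Step 4: observe gradient
        b1, b2 = beta1t(t), beta2t(t)
        for i in range(dim):
            m[i] = b1 * m[i] + (1.0 - b1) * g[i]           # Step 5
            v[i] = b2 * v[i] + (1.0 - b2) * g[i] * g[i]    # Step 6
            v_hat = v[i] + delta / t                       # Step 7: vanishing factor
            x_new = x[i] - (alpha / t) * m[i] / v_hat      # Step 8: no square root
            x[i] = max(-radius, min(radius, x_new))        # projection onto the box
    return x, iterates

# Toy usage: losses f_t(x) = 0.5 * (x - 0.3)^2 with hypothetical schedules.
final_x, iterates = sadam(
    grad_fn=lambda t, x: [x[0] - 0.3],
    dim=1, T=200, alpha=1.0, delta=0.1,
    beta1t=lambda t: 0.9 * 0.99 ** (t - 1),
    beta2t=lambda t: 1.0 - 0.9 / t,
)
print(final_x)
```

The function requires `delta > 0` so that the denominator in Step 8 is strictly positive even before any non-zero gradient has been observed.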

While SAdam is inspired by Adam, there exist two key differences: one is the update rule of $\mathbf{x}_t$ in Step 8, the other is the configuration of $\beta_{2t}$ in Step 6. Intuitively, both modifications stem from strongly convex OGD, and jointly result in a faster-decaying yet well-controlled step size, which helps utilize strong convexity while preserving the practical benefits of Adam. Specifically, in the first modification, we remove the square root operation in (4) of Adam, and update $\mathbf{x}_t$ at Step 8 as follows

$$\mathbf{x}_{t+1} = \Pi_{\mathcal{D}}^{\hat{V}_t}\Big(\mathbf{x}_t - \frac{\alpha}{t}\hat{V}_t^{-1}\hat{\mathbf{m}}_t\Big). \tag{6}$$

In this way, the step size used to update the $i$-th element of $\mathbf{x}_t$ is $\frac{\alpha}{t}\hat{v}_{t,i}^{-1}$, which decays in general on the order of $O(1/t)$, and can still be automatically tuned on a per-feature basis via the EMA of the historical gradients.

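The second modification is made to $\{\beta_{2t}\}_{t=1}^{T}$, which determines the value of $V_t$ and thus also controls the decay rate of the step size. To help understand the motivation behind our algorithm, we first revisit Adam, where $\beta_{2t}$ is simply set to be a constant, which, however, could cause rapid fluctuation of the step size, and further lead to the non-convergence issue. To ensure convergence, Reddi et al. (2018) propose that $\{\beta_{2t}\}_{t=1}^{T}$ should satisfy the following two conditions:

Condition 1.

$\forall t \in [T]$ and $i \in [d]$,

$$\frac{\sqrt{t}\, v_{t,i}^{1/2}}{\alpha} - \frac{\sqrt{t-1}\, v_{t-1,i}^{1/2}}{\alpha} \ge 0.$$

Condition 2.

For some $\zeta > 0$ and $\forall t \in [T]$, $i \in [d]$,

$$\sqrt{t \sum_{j=1}^{t} \prod_{k=1}^{t-j} \beta_{2(t-k+1)} (1-\beta_{2j})\, g_{j,i}^2} \ge \frac{1}{\zeta}\sqrt{\sum_{j=1}^{t} g_{j,i}^2}.$$

The first condition implies that the difference between the inverses of step sizes in two consecutive rounds is non-negative. It is inherently motivated by convex OGD (i.e., OGD with step size $\alpha_t = \frac{\alpha}{\sqrt{t}}$, where $\alpha > 0$ is a constant factor), where

$$\frac{\sqrt{t}}{\alpha} - \frac{\sqrt{t-1}}{\alpha} \ge 0, \quad \forall t \in [T]$$

is a key condition used in the analysis. We first modify Condition 1 by mimicking the behavior of strongly convex OGD, as we are devoted to minimizing regret for strongly convex functions. In strongly convex OGD (Hazan et al., 2007), the step size at each round $t$ is set as $\alpha_t = \frac{\alpha}{t}$ with $\alpha \ge \frac{1}{\lambda}$ for $\lambda$-strongly convex functions. Under this configuration, we have

$$\frac{t}{\alpha} - \frac{t-1}{\alpha} \le \lambda, \quad \forall t \in [T]. \tag{7}$$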

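Motivated by this, we propose the following condition for our SAdam, which is an analog to (7).

Condition 3.

There exists a constant $C > 0$ such that for any $\alpha \ge \frac{C}{\lambda}$, we have, $\forall t \in [T]$ and $i \in [d]$,

$$\frac{t\, v_{t,i}}{\alpha} - \frac{(t-1)\, v_{t-1,i}}{\alpha} \le \lambda(1-\beta_1). \tag{8}$$

Note that the extra $(1-\beta_1)$ on the right-hand side of (8) is necessary because SAdam involves the first-order momentum in its update.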

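Finally, since the step size of SAdam scales with $1/t$ rather than $1/\sqrt{t}$ as in Adam, we modify Condition 2 accordingly as follows:

Condition 4.

For some $\zeta > 0$, $\forall t \in [T]$ and $i \in [d]$,

$$t \sum_{j=1}^{t} \prod_{k=1}^{t-j} \beta_{2(t-k+1)} (1-\beta_{2j})\, g_{j,i}^2 \ge \frac{1}{\zeta}\sum_{j=1}^{t} g_{j,i}^2. \tag{9}$$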
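To build intuition for how the choice of $\beta_{2t}$ controls the second-order momentum, the sketch below runs the EMA recursion of Step 6 for one coordinate with the hypothetical schedule $\beta_{2t} = 1 - 1/t$. Under that schedule the EMA is exactly the arithmetic mean of the squared gradients, so $t \cdot v_t$ equals the cumulative sum, and a lower bound of this type would hold with constant 1. The schedule and the sample data are assumptions chosen for illustration.

```python
def ema_second_moment(sq_grads, beta2t):
    """Return the sequence v_1, ..., v_T produced by the EMA recursion."""
    v, history = 0.0, []
    for t, g2 in enumerate(sq_grads, start=1):
        b2 = beta2t(t)
        v = b2 * v + (1.0 - b2) * g2
        history.append(v)
    return history

# Hypothetical sparse sequence of squared gradients for a single coordinate.
sq_grads = [0.0, 4.0, 0.0, 0.0, 1.0, 9.0, 0.0, 0.25]
hist = ema_second_moment(sq_grads, lambda t: 1.0 - 1.0 / t)

scaled = [t * v for t, v in enumerate(hist, start=1)]               # t * v_t
cumulative = [sum(sq_grads[:t]) for t in range(1, len(sq_grads) + 1)]
print(scaled)
print(cumulative)
```

With this schedule the increment $t\,v_t - (t-1)\,v_{t-1}$ is just the newest squared gradient, which is bounded whenever the gradients are, so the growth of the inverse step size stays under control.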

3.2 Theoretical Guarantees

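In the following, we give a general regret bound that holds when Conditions 3 and 4 are satisfied.

Theorem 1.

Suppose Assumptions 1 and 2 hold, and all loss functions are $\lambda$-strongly convex. Let $\delta > 0$, $\beta_{1t} = \beta_1\nu^{t-1}$, where $\beta_1, \nu \in [0,1)$, and $\{\beta_{2t}\}_{t=1}^{T} \in [0,1]^T$ be a parameter sequence such that Conditions 3 and 4 are satisfied. Let $\alpha \ge \frac{C}{\lambda}$. The regret of SAdam satisfies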

(10)
Remark 1.

The above theorem implies that our algorithm enjoys an $O\big(\sum_{i=1}^{d}\log(\|\mathbf{g}_{1:T,i}\|^2)\big)$ regret bound, which is $O(d\log T)$ in the worst case, and automatically becomes tighter whenever the gradients are small or sparse such that $\|\mathbf{g}_{1:T,i}\|^2 \ll G_\infty^2 T$ for some $i \in [d]$. The superiority of data-dependent bounds has been witnessed by a long list of literature, such as Duchi et al. (2011); Mukkamala & Hein (2017); Reddi et al. (2018). In the following, we give some concrete examples:

  • Consider a one-dimensional sparse setting where non-zero gradient appears with probability

    and is a constant. Then , which is a constant factor.

  • Consider a high-dimensional sparse setting where in each dimension of gradient non-zero element appears with probability with being a constant. Then, , which is much tighter than .

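Next, we provide an instantiation of $\{\beta_{2t}\}_{t=1}^{T}$ such that Conditions 3 and 4 hold, and derive the following corollary.

Corollary 2.

Suppose Assumptions 1 and 2 hold, and all loss functions are $\lambda$-strongly convex. Let $\delta > 0$, $\beta_{1t} = \beta_1\nu^{t-1}$, where $\beta_1, \nu \in [0,1)$, and $1 - \frac{1}{t} \le \beta_{2t} \le 1 - \frac{\gamma}{t}$, where $\gamma \in (0,1]$. Then we have:
1. For any $\alpha \ge \frac{(2-\gamma)G_\infty^2}{\lambda(1-\beta_1)}$, $t \in [T]$ and $i \in [d]$,

$$\frac{t\, v_{t,i}}{\alpha} - \frac{(t-1)\, v_{t-1,i}}{\alpha} \le \lambda(1-\beta_1).$$

2. For all $t \in [T]$ and $i \in [d]$,

$$t \sum_{j=1}^{t} \prod_{k=1}^{t-j} \beta_{2(t-k+1)} (1-\beta_{2j})\, g_{j,i}^2 \ge \gamma\sum_{j=1}^{t} g_{j,i}^2.$$

Moreover, let $\alpha \ge \frac{(2-\gamma)G_\infty^2}{\lambda(1-\beta_1)}$, and the regret of SAdam satisfies: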

Furthermore, as a special case, by setting $\beta_{1t} = 0$ and $1 - \frac{1}{t} \le \beta_{2t} \le 1 - \frac{\gamma}{t}$, our algorithm reduces to SC-RMSprop (Mukkamala & Hein, 2017), which is a variant of RMSprop for strongly convex functions. Although Mukkamala & Hein (2017) have provided theoretical guarantees for this algorithm, we note that their regret bound is in fact data-independent. Specifically, the regret bound provided by Mukkamala & Hein (2017) takes the following form:

(11)

Focusing on the denominator of the last term in (11), we have

thus

which implies that their regret bound is of order $O(d\log T)$, and thus data-independent. In contrast, based on Corollary 2, we present a new regret bound for SC-RMSprop in the following, which is $O\big(\sum_{i=1}^{d}\log(\|\mathbf{g}_{1:T,i}\|^2)\big)$, and thus data-dependent.

Corollary 3.

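Suppose Assumptions 1 and 2 hold, and all loss functions are $\lambda$-strongly convex. Let $\beta_{1t} = 0$, $\delta > 0$, and $1 - \frac{1}{t} \le \beta_{2t} \le 1 - \frac{\gamma}{t}$, where $\gamma \in (0,1]$. Let $\alpha \ge \frac{(2-\gamma)G_\infty^2}{\lambda}$. Then SAdam reduces to SC-RMSprop, and its regret satisfies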

(12)

Finally, we note that Mukkamala & Hein (2017) also consider a more general version of SC-RMSprop which uses a time-variant non-increasing $\delta_{t,i}$ for each dimension $i$. In Appendix D we introduce this $\delta_t$-variant technique to our SAdam, and provide corresponding theoretical guarantees.

4 Experiments

Figure 1: Regret vs. data proportion for $\ell_2$-regularized softmax regression. Panels: (a) MNIST, (b) CIFAR10, (c) CIFAR100.

In this section, we present empirical results on optimizing strongly convex functions and training deep networks.

Algorithms. In both experiments, we compare the following algorithms:

  • SC-Adagrad (Mukkamala & Hein, 2017), with step size .

  • SC-RMSprop (Mukkamala & Hein, 2017), with step size and .

  • Adam (Kingma & Ba, 2015) and AMSgrad (Reddi et al., 2018), both with , , for convex problems and time-invariant for non-convex problems.

  • AdamNC (Reddi et al., 2018), with , , and for convex problems and a time-invariant for non-convex problems.

  • Online Gradient Descent (OGD), with step size for strongly convex problems and a time-invariant for non-convex problems.

  • Our proposed SAdam, with , .

For Adam, AdamNC and AMSgrad, we choose $\delta$ according to the recommendations in their papers. For SC-Adagrad and SC-RMSprop, following Mukkamala & Hein (2017), we choose a time-variant $\delta_{t,i}$ for each dimension $i$, with , for convex problems and , for non-convex problems. For our SAdam, because removing the square root, combined with very small gradients, may cause overly large step sizes in the early iterations, we use a rather large $\delta$ to avoid this problem. To conduct a fair comparison, for each algorithm, we choose $\alpha$ from the set and report the best results.

Datasets. In both experiments, we examine the performances of the aforementioned algorithms on three widely used datasets: MNIST (60000 training samples, 10000 test samples), CIFAR10 (50000 training samples, 10000 test samples), and CIFAR100 (50000 training samples, 10000 test samples). We refer to LeCun (1998) and Krizhevsky (2009) for more details of the three datasets.

4.1 Optimizing Strongly Convex Functions

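In the first experiment, we consider the problem of mini-batch $\ell_2$-regularized softmax regression, which belongs to the online strongly convex optimization framework. Let $K$ be the number of classes and $m$ be the batch size. In each round $t$, a mini-batch of training samples $\{(\mathbf{x}_i, y_i)\}_{i=1}^{m}$ first arrives, where $y_i \in [K]$, $\forall i \in [m]$. Then, the algorithm predicts parameter vectors $\{\mathbf{w}_k, b_k\}_{k=1}^{K}$, and suffers a loss which takes the following form:

$$J(\mathbf{w}, b) = -\frac{1}{m}\sum_{i=1}^{m} \log\left(\frac{e^{\mathbf{w}_{y_i}^\top \mathbf{x}_i + b_{y_i}}}{\sum_{j=1}^{K} e^{\mathbf{w}_j^\top \mathbf{x}_i + b_j}}\right) + \lambda_1 \sum_{k=1}^{K} \|\mathbf{w}_k\|^2 + \lambda_2 \sum_{k=1}^{K} b_k^2.$$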

The value of and are set to be for all experiments. The regret (in log scale) vs. dataset proportion is shown in Figure 1. It can be seen that our SAdam outperforms the other methods across all the considered datasets. Besides, we observe that data-dependent strongly convex methods such as SC-Adagrad, SC-RMSprop and SAdam perform better than algorithms for general convex functions such as Adam, AMSgrad and AdamNC. Finally, OGD has the overall highest regret on all three datasets.
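For readers who want to see the per-round objective of this experiment in code, here is a minimal sketch of a mini-batch softmax regression loss with squared $\ell_2$ penalties on the weights and biases. The function name, the argument layout, and the regularization weights `lam_w` and `lam_b` are placeholders chosen for this example, not the values used in the experiments.

```python
import math

def softmax_regression_loss(W, b, batch, lam_w, lam_b):
    """Average cross-entropy over a mini-batch plus squared l2 penalties.

    W is a list of K weight vectors, b a list of K biases, and batch a list
    of (features, label) pairs with labels in 0, ..., K - 1.
    """
    total = 0.0
    for features, label in batch:
        scores = [sum(w_j * x_j for w_j, x_j in zip(w, features)) + b_k
                  for w, b_k in zip(W, b)]
        top = max(scores)                                   # numerical stability
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_norm - scores[label]                   # negative log-likelihood
    data_term = total / len(batch)
    penalty = (lam_w * sum(w_j * w_j for w in W for w_j in w)
               + lam_b * sum(b_k * b_k for b_k in b))
    return data_term + penalty

# With all-zero parameters every class is equally likely, so the loss is log K.
W0 = [[0.0, 0.0] for _ in range(3)]
b0 = [0.0, 0.0, 0.0]
batch = [([1.0, 2.0], 0), ([0.5, -1.0], 2)]
loss0 = softmax_regression_loss(W0, b0, batch, lam_w=0.01, lam_b=0.01)
print(loss0)
```

The squared penalties are what make each round's loss strongly convex in the parameters, which is why this problem fits the online strongly convex setting.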

4.2 Training Deep Networks

Figure 2: Training loss vs. number of epochs for 4-layer CNN. Panels: (a) MNIST, (b) CIFAR10, (c) CIFAR100.

Figure 3: Test accuracy vs. number of epochs for 4-layer CNN. Panels: (a) MNIST, (b) CIFAR10, (c) CIFAR100.

Following Mukkamala & Hein (2017), we also conduct experiments on a 4-layer CNN, which consists of two convolutional layers (each with 32 filters of size 3 × 3), one max-pooling layer (with a 2 × 2 window and 0.25 dropout), and one fully connected layer (with 128 hidden units and 0.5 dropout). We employ the ReLU function as the activation function for the convolutional layers and the softmax function as the activation function for the fully connected layer. The loss function is the cross-entropy. The training loss vs. epoch is shown in Figure 2, and the test accuracy vs. epoch is presented in Figure 3. As can be seen, our SAdam achieves the lowest training loss on the three datasets. Moreover, this performance gain also translates into good performance on test accuracy. The experimental results show that although our proposed SAdam is designed for strongly convex functions, it could lead to superior practical performance even in some highly non-convex cases such as deep learning tasks.
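To give a sense of the size of this network, the sketch below computes how many features reach the fully connected layer for MNIST-sized (28 × 28) and CIFAR-sized (32 × 32) inputs. The text does not state the padding or stride, so this example assumes unpadded convolutions with stride 1 and non-overlapping pooling; the helper names are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def cnn_feature_count(height, width, filters=32):
    """Features entering the fully connected layer of the 4-layer CNN."""
    h, w = height, width
    for _ in range(2):                 # two 3x3 convolutional layers
        h, w = conv_out(h, 3), conv_out(w, 3)
    h, w = h // 2, w // 2              # one 2x2 max-pooling layer
    return h * w * filters

mnist_features = cnn_feature_count(28, 28)   # 12 * 12 * 32 = 4608
cifar_features = cnn_feature_count(32, 32)   # 14 * 14 * 32 = 6272
print(mnist_features, cifar_features)
```

Under these assumptions, most of the trainable parameters sit in the connection between the pooled feature map and the 128 hidden units.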

5 Conclusion and Future Work

In this paper, we provide a variant of Adam adapted to strongly convex functions. The proposed algorithm, namely SAdam, follows the general framework of Adam, while keeping a step size that decays in general on the order of $O(1/t)$ and is controlled by data-dependent hyperparameters in order to exploit strong convexity. Our theoretical analysis shows that SAdam achieves a data-dependent $O(\log T)$ regret bound for strongly convex functions, which means that it converges much faster than Adam, AdamNC, and AMSgrad in such cases, and can also enjoy a huge gain in the face of sparse gradients. In addition, we provide the first data-dependent logarithmic regret bound for the SC-RMSprop algorithm. Finally, we test the proposed algorithm on optimizing strongly convex functions as well as training deep networks, and the empirical results demonstrate the effectiveness of our method.

Since SAdam enjoys a data-dependent $O(\log T)$ regret for online strongly convex optimization, it can be easily translated into a data-dependent $O(\log T / T)$ convergence rate for stochastic strongly convex optimization (SSCO) by using the online-to-batch conversion (Kakade & Tewari, 2009). However, this rate is not optimal for SSCO, and it is still an open problem how to achieve a data-dependent $O(1/T)$ convergence rate for SSCO. Recent development on adaptive gradient methods (Chen et al., 2018b) has proved that Adagrad combined with a multi-stage scheme (Hazan & Kale, 2014) can achieve this rate, but it is highly non-trivial to extend this technique to SAdam, and we leave it as future work.
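In its simplest form, an online-to-batch conversion outputs the average of the decisions played by the online algorithm; for convex losses drawn i.i.d., the expected excess risk of that average is at most the expected regret divided by $T$. The sketch below shows only this averaging step, with hypothetical names, and is not the specific conversion analysed by Kakade & Tewari (2009).

```python
def average_iterate(iterates):
    """Return the coordinate-wise average of a non-empty list of iterates."""
    T = len(iterates)
    dim = len(iterates[0])
    return [sum(x[i] for x in iterates) / T for i in range(dim)]

# Hypothetical sequence of online decisions x_1, ..., x_T.
decisions = [[0.0, 2.0], [2.0, 4.0], [1.0, 0.0]]
x_bar = average_iterate(decisions)
print(x_bar)
```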

References

Appendix A Proof of Theorem 1

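From Definition 1, we can upper bound the regret as

$$R(T) = \sum_{t=1}^{T} f_t(\mathbf{x}_t) - \sum_{t=1}^{T} f_t(\mathbf{x}_*) \le \sum_{t=1}^{T} \mathbf{g}_t^\top(\mathbf{x}_t - \mathbf{x}_*) - \frac{\lambda}{2}\sum_{t=1}^{T}\|\mathbf{x}_t - \mathbf{x}_*\|^2, \tag{13}$$

where $\mathbf{x}_* := \operatorname*{argmin}_{\mathbf{x} \in \mathcal{D}} \sum_{t=1}^{T} f_t(\mathbf{x})$ is the best decision in hindsight. On the other hand, by the update rule of $\mathbf{x}_{t+1}$ in Algorithm 1, we have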

(14)

where , and the inequality is due to the following lemma, which implies that the weighted projection procedure is non-expansive.

Lemma 1.

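(McMahan & Streeter, 2010) Let $A \in \mathbb{R}^{d \times d}$ be a positive definite matrix and $\mathcal{D} \subseteq \mathbb{R}^d$ be a convex set. Then we have, $\forall \mathbf{x}, \mathbf{y} \in \mathbb{R}^d$,

$$\|\Pi_{\mathcal{D}}^{A}(\mathbf{x}) - \Pi_{\mathcal{D}}^{A}(\mathbf{y})\|_A \le \|\mathbf{x} - \mathbf{y}\|_A. \tag{15}$$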
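The non-expansiveness of the weighted projection can be checked numerically in the simplest setting. The sketch below assumes a diagonal positive definite matrix and a box-shaped decision set, for which the weighted projection is coordinate-wise clipping; the function names and test points are illustrative only.

```python
import math

def weighted_norm(u, a_diag):
    """Norm of u weighted by a diagonal positive definite matrix."""
    return math.sqrt(sum(a * ui * ui for a, ui in zip(a_diag, u)))

def project_box(x, radius=1.0):
    """Weighted projection onto [-radius, radius]^d for a diagonal weight
    matrix: the objective separates per coordinate, so it reduces to clipping."""
    return [max(-radius, min(radius, xi)) for xi in x]

def is_nonexpansive(x, y, a_diag, tol=1e-12):
    """Check that projecting does not increase the weighted distance."""
    px, py = project_box(x), project_box(y)
    lhs = weighted_norm([p - q for p, q in zip(px, py)], a_diag)
    rhs = weighted_norm([p - q for p, q in zip(x, y)], a_diag)
    return lhs <= rhs + tol

print(is_nonexpansive([3.0, -0.2], [0.5, -4.0], [2.0, 0.1]))
```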

Rearranging (14), we have

(16)

where the last inequality is due to . We proceed to bound the second term of the inequality above. When , by Young’s inequality and the definition of , we get

(17)

When $t = 1$, this term becomes 0 since $\hat{\mathbf{m}}_0 = \mathbf{0}$ in Algorithm 1. Plugging (16) and (17) into (13), we get the following inequality, of which we divide the right-hand side into three parts and upper bound each of them one by one.

To bound , we have

(18)

where the inequality is derived from . For the first term in the last equality of (18), we have

(19)

where the second equality is because , and the second inequality is due to .
For the second term of (18), we have

(20)

where the first inequality is due to Condition 3, and the second inequality follows from Assumption 2. Combining (18), (19) and (20), we get

(21)

To bound , we introduce the following lemma.

Lemma 2.

The following inequality holds

(22)

By Lemma 2, we have

(23)

Finally, we turn to upper bound :