TAdam: A Robust Stochastic Gradient Optimizer

02/29/2020, by Wendyam Eric Lionel Ilboudo, et al.

Machine learning algorithms aim to find patterns in observations, which may include noise, especially in the robotics domain. To perform well even under such noise, we expect them to detect outliers and discard them when needed. We therefore propose a new stochastic gradient optimization method whose robustness is built directly into the algorithm, using the robust student-t distribution as its core idea. Adam, the popular optimization method, is modified with our method, and the resulting optimizer, called TAdam, is shown to effectively outperform Adam in terms of robustness against noise on diverse tasks, ranging from regression and classification to reinforcement learning problems. The implementation of our algorithm can be found at https://github.com/Mahoumaru/TAdam.git


I Introduction

The field of machine learning is undoubtedly dominated by first-order optimization methods based on the gradient descent algorithm [1] and, particularly, its stochastic variant, the stochastic gradient descent (SGD) method [2]. The popularity of the SGD algorithm comes from its simplicity, its computational efficiency with respect to second-order methods, its applicability to online training, and its convergence rate that is independent of the size of the training set. In addition, SGD has high affinity with deep learning [3], where network parameters are updated by backpropagation of their gradients, and is intensively used to train large deep neural networks.

Despite such established popularity, a specific trait of SGD is its inherent noise, which comes from sampling the training points. Even though this stochasticity makes the algorithm more likely to find a global minimum, the fluctuations also slow down the learning process and, furthermore, render the algorithm sensitive to outliers: bad estimates of the gradients are likely to produce a bad estimate of the minimum.

Many of the new optimizers proposed to improve the SGD algorithm and tackle complex training scenarios, where gradient descent methods behave poorly, also share the same weakness to aberrant values. Adam (adaptive moment estimates) [4], one of the most widely used and practical optimizers for training deep learning models, is no exception. This is mainly due to the insufficient number of samples implicitly involved in its first moment evaluation.

The weakness to noisy data is particularly important in robot learning, where incomplete, ambiguous and noisy sensor data are inevitable. Furthermore, in order to generate large-scale robot datasets for scaling up robot learning, the ability to use automatically labeled data [5] is important. Robust learning methods are therefore needed to deal with potentially noisy labels, and they can improve the performance of low-cost robots that suffer from inaccurate position control and calibration and noisy execution, without the need for a noise modeling network [6].

Hence, the aim of the present research is to propose a robust version of Adam through the use of robust estimates of the momentum, which is assumed to be the first-order probabilistic moment of the gradients. The key idea for such robust estimates is the use of the student-t distribution, a model suitable for estimation from a small number of samples [7].

II Background and previous works

II-A Background

Stochastic Gradient Descent

Let $x_t$ be a random sample drawn from the data set at iteration $t$, $f_t(\theta) = f(x_t; \theta)$ the objective function evaluated on data $x_t$ with the parameters $\theta$, $g_t = \nabla_\theta f_t(\theta_t)$ its gradient, and $\alpha$ the learning rate. The SGD algorithm [2] updates $\theta_t$ to $\theta_{t+1}$ through the following update rule:

$$\theta_{t+1} = \theta_t - \alpha g_t \quad (1)$$

This algorithm yields at least a local minimum of $f$ w.r.t. $\theta$.
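As a minimal sketch (ours, not from the paper), the update rule (1) amounts to a one-line step; the least-squares objective below is a hypothetical example:

```python
import numpy as np

def sgd_step(theta, grad, lr=0.05):
    """One SGD update: theta_{t+1} = theta_t - alpha * g_t (equation (1))."""
    return theta - lr * grad

# Hypothetical usage: fit y = 3*x1 - x2 from one random sample per iteration.
rng = np.random.default_rng(0)
theta = np.zeros(2)
for t in range(1000):
    x = rng.normal(size=2)
    y = 3.0 * x[0] - 1.0 * x[1] + 0.01 * rng.normal()  # noisy observation
    grad = 2.0 * (theta @ x - y) * x                   # gradient of the squared error
    theta = sgd_step(theta, grad)
```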

Improving SGD

Since its proposition, many ideas have been developed to improve the convergence properties of the SGD algorithm, a feature heavily connected to the fluctuations of the gradients during learning. The research aiming to accelerate the convergence rate has done so through several approaches, improving, for instance: i) the update method of the parameters [8, 9, 10, 11]; ii) the adjustment of the learning rate [12, 13, 14, 15]; and iii) the robustness to aberrant values from heavy-tailed data [16, 17, 18, 19]. Those approaches have culminated in some effective state-of-the-art first-order optimization methods, going from the momentum idea to adaptive learning rates and variance reduction schemes. Below, we review some of the works related to robustness.

II-B Previous works

As stated before, SGD is inherently noisy and susceptible to producing bad minimum estimates when facing aberrant gradient estimates. A lot of work has therefore been done to propose more robust methods for efficient machine learning under noise or on data with heavy tails.

In this review, we ignore the general statistical methods for robust mean estimates [20], such as the median-based estimations [21, 22, 23], due to their practical limitations. Three main approaches are distinguished: a) methods based on direct robust estimates of the loss (or risk) function [24]; b) methods based on robust estimates of the gradients [25, 19], among which falls our algorithm; and c) methods with small learning rates for wrong gradient estimates [18].

Robust risk estimation

Those methods usually require the use of all the available data in order to produce, for each parameter, a robust estimate of the loss function to be minimized. An inconvenient trait specific to this approach is the implicit definition of the robust estimate, which may introduce computational roadblocks. As briefly explained by Holland et al. [19], since the estimates need not be convex even when the loss function is, the resulting non-linear optimization can be both unstable and costly in high dimensions.

Robust gradient descent

This approach usually relies on replacing the empirical mean (first moment) gradient estimate with a more robust alternative, and the methods simply differ in how they achieve this objective. Chen et al. [25] proposed the use of the geometric median to aggregate multiple candidate gradient means. Using the same strategy, Prasad et al. [26] proposed a class of gradient estimators based on the idea that the gradient of a population loss can be regarded as the mean of a multivariate distribution, reducing the problem of gradient estimation to a multivariate mean estimation problem. Closest to our approach, Holland et al. [19] proposed to carefully reduce the effect of aberrant values instead of discarding them, since discarding can also result in unfortunate losses of valuable data.

Adaptive learning rate

This approach reduces the effect of wrong gradient estimates by reducing the learning rate. One such method has been proposed by Haimin et al. [18] and shares our objective of producing a robust version of the Adam optimization algorithm. The method employed by Haimin et al. uses an exponential moving average (EMA) of the ratio between the current loss value and the past one to scale the learning rate. However, this strategy first lets the outliers modify the estimated gradient mean, and only then uses the impact of the deviated mean on the loss function to reduce their effect on subsequent updates.

The lack of robustness, one of the problems of the EMA scheme, has also been dealt with in [16] and [17]. In those methods, the exponential decay parameter of the EMA is increased whenever a value that falls beyond some boundary is encountered. The common drawback of this strategy is that all outlier gradients are treated equally and discretely, without consideration of how far they are from the normal values, and the boundary beyond which a data point is considered an outlier must be set manually before training.

Our contribution

To the best of our knowledge, our approach, named TAdam, is the first to employ estimates of the first moment of the student-t distribution to replace the estimates of the Gaussian first moment used by Adam through the EMA scheme. The main advantage of this approach is that it relies on the natural robustness of the student-t distribution and its ability to deal with outliers, and it gracefully reduces to Adam for non-heavy-tailed data. Also, even though we use our method in this letter to modify the popular optimizer Adam, we encourage the reader to keep in mind that it can be integrated into other stochastic gradient descent methods that rely on EMAs, like RMSProp [14], VSGD-fd [16], Adasecant [17] or Adabound [15].

III Proposal

III-A Adaptive moment estimation: Adam

Before describing our proposal, let us introduce Adam [4], the baseline of TAdam. Adam is a popular method that combines the advantages of SGD with momentum along with those of adaptive learning rate methods [14, 13]. Its update rule is implemented as follows:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \quad (2)$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \quad (3)$$
$$\theta_t = \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}, \quad \hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \quad (4)$$

where $m_t$ is the first-order moment (i.e., the mean of the gradients) and $v_t$ is the second-order moment utilized to adjust the learning rates at time step $t$. $\beta_1$ and $\beta_2$ are the exponential decay rates (by default $\beta_1 = 0.9$ and $\beta_2 = 0.999$, respectively). $\alpha$ is the global learning rate and $\epsilon$ is a small value added to avoid division by zero (typical value of $10^{-8}$).
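For concreteness, here is a minimal NumPy sketch of equations (2)–(4) (our own rendering, with the default hyperparameters named above):

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update following equations (2)-(4)."""
    m = b1 * m + (1.0 - b1) * g      # (2): EMA of the gradients (first moment)
    v = b2 * v + (1.0 - b2) * g**2   # (3): EMA of squared gradients (second moment)
    m_hat = m / (1.0 - b1**t)        # bias corrections
    v_hat = v / (1.0 - b2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # (4)
    return theta, m, v
```

Note that a single outlier gradient enters equation (2) with weight $1 - \beta_1 = 0.1$, which is enough to pull $m_t$ away from the true mean for many subsequent steps.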

Although the use of EMAs in equations (2) and (3) makes the gradients smooth and reduces the fluctuations inherent to SGD, they are also sensitive to outliers. In particular, with a relatively small decay rate like $\beta_1 = 0.9$, the momentum is very likely to be pulled away by outliers and easily deviates from the true average. This fluctuation makes learning unstable (see Fig. 1), and therefore, more robust learning techniques are needed.

Fig. 1: Sensitivity to outliers in Adam: a regression task with noise drawn from a heavy-tailed student-t distribution was conducted with Adam; the predicted curve had a large variance and its accuracy clearly deteriorated due to the noise.

III-B Overview

Our proposition relies on the fact that the EMA, like equations (2) and (3), can be regarded as an incremental update law of the mean of a normal distribution with a fixed number of samples. The sensitivity of Adam to aberrant gradient values is therefore just a feature inherited from the normal distribution, which is itself also sensitive to outliers.

In order for Adam to be robust, the gradients must be assumed to come from a robust probability distribution. We therefore propose to replace the normal distribution moment estimates by those of the student-t distribution, which is well known to be a robust probability distribution [7, 27, 28], as shown in Fig. 2, and a generalized form of the normal distribution. In the next section, we describe how the EMA is replaced using the student-t distribution; the features of our implementation are analyzed afterwards. A pseudo code of TAdam is summarized in Algorithm 1.

Fig. 2: Robustness to outliers: the normal distribution (in green) was pulled toward the outliers; in contrast, the student-t distribution (in red) accommodated their existence and hardly moved.
1: $\alpha$: learning rate
2: $\beta_1, \beta_2 \in [0,1)$: exponential decay rates
3: $\epsilon$: small term added to the denominator
4: $\nu$: degrees of freedom
5: $f(\theta)$: objective function with parameters $\theta$
6: $\theta_0$: initial parameters
7: $m_0 \leftarrow 0$, $v_0 \leftarrow 0$, $W_0 \leftarrow \beta_1 / (1 - \beta_1)$
8: $t \leftarrow 0$
9: while $\theta_t$ not converged do
10:     $t \leftarrow t + 1$
11:     $g_t \leftarrow \nabla_\theta f_t(\theta_{t-1})$
12:     $D_t \leftarrow \sum_{j=1}^{d} (g_{t,j} - m_{t-1,j})^2 / v_{t-1,j}$
13:     $w_t \leftarrow (\nu + d) / (\nu + D_t)$
14:     $m_t \leftarrow (W_{t-1} m_{t-1} + w_t g_t) / (W_{t-1} + w_t)$
15:     $W_t \leftarrow \frac{2\beta_1 - 1}{\beta_1} W_{t-1} + w_t$
16:     $v_t \leftarrow \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$
17:     $\hat{m}_t \leftarrow m_t / (1 - \beta_1^t)$; $\hat{v}_t \leftarrow v_t / (1 - \beta_2^t)$; $\theta_t \leftarrow \theta_{t-1} - \alpha \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$
18: end while
19: return $\theta_t$
Algorithm 1 TAdam, our proposed algorithm for stochastic optimization: it is an extended version of Adam obtained via the student-t mean estimation; in a typical setting, $\beta_1$ is smaller than $\beta_2$; a good default value for the degrees of freedom is $\nu = d$, the dimension of the gradient.
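The official PyTorch implementation is available in the repository linked above; the following is an unofficial NumPy sketch of one iteration of Algorithm 1. Guarding the first step (where $v_0 = 0$) with the small term $\epsilon$ is our assumption:

```python
import numpy as np

def tadam_step(theta, g, m, v, W, t, lr=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, nu=None):
    """One TAdam update (lines 10-17 of Algorithm 1); our unofficial sketch."""
    d = g.size
    nu = d if nu is None else nu                 # default: nu = d
    D = np.sum((g - m)**2 / (v + eps))           # squared Mahalanobis distance
    w = (nu + d) / (nu + D)                      # adaptive weight, equation (9)
    m = (W * m + w * g) / (W + w)                # robust first moment, equation (8)
    W = (2.0 * b1 - 1.0) / b1 * W + w            # decay update, equation (10)
    v = b2 * v + (1.0 - b2) * g**2               # Adam's second moment, equation (3)
    m_hat = m / (1.0 - b1**t)                    # bias corrections as in Adam
    v_hat = v / (1.0 - b2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v, W

# Initialization: m = 0, v = 0, t = 0, and W = b1 / (1 - b1), so that the very
# first effective decay rate W / (W + w) starts around b1.
```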

III-C Formulation

To replace the EMA by the student-t distribution, a new hyperparameter, the degrees of freedom $\nu$ of the student-t distribution, is introduced to control the robustness.

We can derive the incremental update law of the first moment of the student-t distribution using a maximum log-likelihood estimator. Given $n$ $d$-dimensional i.i.d. random samples $x_{1:n}$ from a multivariate student-t distribution with the parameters $\mu$, $\Sigma$ and $\nu$, its log-likelihood function is expressed as:

$$\mathcal{L} = \sum_{i=1}^{n} \left[ \ln \Gamma\!\left(\tfrac{\nu+d}{2}\right) - \ln \Gamma\!\left(\tfrac{\nu}{2}\right) - \tfrac{d}{2}\ln(\nu\pi) - \tfrac{1}{2}\ln|\Sigma| - \tfrac{\nu+d}{2}\ln\!\left(1 + \tfrac{D_i}{\nu}\right) \right] \quad (5)$$

where $D_i = (x_i - \mu)^\top \Sigma^{-1} (x_i - \mu)$. Taking the gradient with respect to $\mu$ and setting it equal to $0$ gives us:

$$\nabla_\mu \mathcal{L} = \sum_{i=1}^{n} \frac{\nu + d}{\nu + D_i}\, \Sigma^{-1} (x_i - \mu) = 0 \quad (6)$$

If we solve this equation for $\mu$, we get the expression of the first moment estimate given $n$ samples:

$$\mu_n = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} = \mu_{n-1} + \frac{w_n}{W_n}\,(x_n - \mu_{n-1}) \quad (7)$$

where $w_i = \frac{\nu + d}{\nu + D_i}$ and $W_n = \sum_{i=1}^{n} w_i$.
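For intuition, equation (7) can be solved by fixed-point iteration; a one-dimensional sketch (ours) shows that the student-t location estimate barely moves under a gross outlier, unlike the sample mean:

```python
import numpy as np

def t_location(x, sigma2=1.0, nu=5.0, iters=30):
    """Fixed-point iteration of equation (7) for a 1-D student-t location (d = 1)."""
    mu = np.median(x)                        # robust initial guess
    for _ in range(iters):
        D = (x - mu)**2 / sigma2             # per-sample Mahalanobis distances
        w = (nu + 1.0) / (nu + D)            # weights w_i
        mu = np.sum(w * x) / np.sum(w)       # weighted mean, equation (7)
    return mu

rng = np.random.default_rng(1)
samples = np.append(rng.normal(0.0, 1.0, 100), 50.0)   # one gross outlier
print(t_location(samples), np.mean(samples))           # ~0.0 vs ~0.5
```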

By assuming a diagonal covariance matrix and fixing the number of samples (i.e., decaying $W$), we can derive the equation (8) used below in TAdam. Due to the high value of $\beta_2$ (i.e., about 1000 samples) w.r.t. $\beta_1$ (i.e., about 10 samples), only the first-order moment in equation (2) is replaced by the following rule:

$$m_t = \frac{W_{t-1}}{W_{t-1} + w_t}\, m_{t-1} + \frac{w_t}{W_{t-1} + w_t}\, g_t \quad (8)$$

where

$$w_t = \frac{\nu + d}{\nu + \sum_{j=1}^{d} \frac{(g_{t,j} - m_{t-1,j})^2}{v_{t-1,j}}} \quad (9)$$
$$W_t = \frac{2\beta_1 - 1}{\beta_1}\, W_{t-1} + w_t \quad (10)$$

$v_{t-1}$ is the unmodified Adam second moment estimate coming from equation (3), and $d$ is the dimension of the gradient (i.e., the number of parameters in subsets like layers of deep learning). Here, the summation in the denominator of $w_t$ is denoted from now on by $D_t$, since it corresponds to the squared Mahalanobis distance between the gradient of each parameter, $g_{t,j}$, and the corresponding previous estimate of the mean, $m_{t-1,j}$, w.r.t. the variance, which is assumed to be the same as Adam's second moment estimate, $v_{t-1,j}$. Note that, ultimately, the gradients converge to zero, and therefore, the second moment would be consistent with the variance of the gradients.

The power of this update rule is twofold: outlier detection and robustness control. Their details are explained below.

III-D Outlier detection

Outlier detection is performed through $w_t$, the adaptive weight of the mean introduced in equation (8), parameterized by the degrees of freedom $\nu$. Again, we can notice that $w_t$ depends on the Mahalanobis distance $D_t$. Hence, outlying gradient values are down-weighted, since their Mahalanobis distances are larger than those of normal values, and their contribution to the momentum update is therefore automatically dampened. On the contrary, normal gradients are up-weighted, ultimately by the factor $(\nu + d)/\nu$ in the case of a zero Mahalanobis distance, while the effective decay rate remains consistent thanks to the design of $W_t$ in equation (10). In short, TAdam automatically and continuously reduces only the adverse effects of the outlier gradients.
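The continuous down-weighting can be illustrated with a few hypothetical values of the squared Mahalanobis distance (our own illustrative numbers, using the default $\nu = d$):

```python
nu, d = 10, 10                        # degrees of freedom = gradient dimension
for D in [0.0, float(d), 10.0 * d, 100.0 * d]:
    w = (nu + d) / (nu + D)           # equation (9)
    print(f"D = {D:6.1f} -> w_t = {w:.3f}")
# D =    0.0 -> w_t = 2.000   (on-mean gradient, up-weighted by (nu+d)/nu)
# D =   10.0 -> w_t = 1.000   (typical gradient)
# D =  100.0 -> w_t = 0.182
# D = 1000.0 -> w_t = 0.020   (far outlier, contribution dampened)
```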

III-E Robustness control

The student-t distribution has a controllable robustness, with the nice property of approaching the normal distribution as the degrees of freedom grow larger. The same feature is retained in TAdam, as can be seen in equation (9). Namely, when $\nu \to \infty$, we have:

$$\lim_{\nu \to \infty} w_t = \lim_{\nu \to \infty} \frac{\nu + d}{\nu + D_t} = 1 \quad (11)$$

In this case, TAdam loses its robustness to outliers and behaves like Adam.

To make TAdam an extended version of Adam, the decay rule in equation (10) is designed to fulfill a consistency requirement. Specifically, if the weight is constant, $w_t = w$ (as in the limit of equation (11)), the decay rate derived from $W_{t-1}$ and $w_t$ in equation (8) must be consistent with $\beta_1$ at any time:

$$\frac{W_{t-1}}{W_{t-1} + w_t} = \beta_1 \quad (12)$$

To satisfy such a constant $\beta_1$, the decay rate $k$ in equation (10) can be derived as follows if the decay rule is given as $W_t = k W_{t-1} + w_t$. Equation (12) requires the steady state $W_{t-1} = W_t = \frac{\beta_1}{1 - \beta_1} w$, and hence:

$$k = \frac{W_t - w_t}{W_{t-1}} = \frac{\frac{\beta_1}{1 - \beta_1} w - w}{\frac{\beta_1}{1 - \beta_1} w} = \frac{2\beta_1 - 1}{\beta_1} \quad (13)$$

By the above derivation, TAdam defined by equations (8)–(10) is proved to be an extended version of Adam defined by equation (2) (and equations (3)–(4)).
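The consistency claim is easy to check numerically (our sketch): with a constant weight $w_t = 1$, as in the $\nu \to \infty$ limit of equation (11), the effective decay rate of equation (8) stays exactly at $\beta_1$ under the update rule (10):

```python
b1, w = 0.9, 1.0
W = b1 / (1.0 - b1)                    # fixed point W = beta1 * w / (1 - beta1)
for t in range(100):
    decay = W / (W + w)                # effective decay rate in equation (8)
    assert abs(decay - b1) < 1e-12     # always equals beta_1 = 0.9
    W = (2.0 * b1 - 1.0) / b1 * W + w  # equation (10)
```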

III-F The Regret Bound and TAdam's Convergence

The convergence of the TAdam algorithm is assured by the two following theorems, whose proofs can be found in the appendix:

Theorem 1.

Let $\{\theta_t\}$, $\{m_t\}$ and $\{v_t\}$ be the sequences obtained from the TAdam algorithm. Suppose that the feasible set $\mathcal{F}$ has a bounded diameter $D_\infty$, and that $\|g_t\|_\infty \le G_\infty$ for all $t$ and all $\theta \in \mathcal{F}$. Then, for the $\theta_t$ generated using TAdam (with the AMSGrad [29] scheme), we have the following upper bound on the regret:

(14)
Theorem 2.

Let’s assume that the gradients ultimately follow an asymptotic Normal distribution

, according to the central limit theorem; then the Mahalanobis distance appearing in TAdam follows a Chi-Squared distribution

, and the expected value of the adaptive decay parameter is constrained, for , by the following relation:

(15)

We can see that the difference between the upper bounds of TAdam and Adam lies in the expected value of the adaptive exponential decay parameter $\beta_t = W_{t-1}/(W_{t-1} + w_t)$. Theorem 2 tells us that, if the gradients are normally distributed, this value is bounded above by $\beta_1$, so that we can recover the same upper bound for TAdam as for Adam. However, if we know the exact expected value, a more precise upper bound on the regret can be obtained.

IV Experiments

To assess the robustness of TAdam against noisy data, we conducted three types of experiments spanning the main machine learning frameworks, i.e., supervised learning (regression and classification) and reinforcement learning. We compare TAdam mainly with Adam, but also with another robust gradient descent algorithm, RoAdam [18].

Fig. 3: Results of the regression task: (a), (b) loss w.r.t. the noise probability $p$; in all the noise settings, TAdam outperformed Adam. (c), (d) prediction curves after learning; although Adam suffered a large variance against the large noise and a bad prediction accuracy, TAdam relatively succeeded in approximating the ground-truth function.
Fig. 4: Training and test accuracy (noise-free and noise-included) and loss (noise-free) for ResNet-34 on CIFAR-100: (a) noise-free training accuracy; (b) noise-free training loss per epoch; (c) noisy-data training accuracy; (d) noise-free test accuracy; (e) noise-free test loss per epoch; (f) noisy-data test accuracy.
Fig. 5: Training curves for the PPO agent on six Pybullet environments: (a) Ant-v0; (b) Hopper-v0; (c) HalfCheetah-v0; (d) InvertedDoublePendulum-v0; (e) Walker2D-v0; (f) Reacher-v0.

IV-A Robust Supervised Learning

It has been shown [30] that training standard supervised learning algorithms with noisy data results in poor performance and accuracy of the learned models. In real robotic tasks, it is often unrealistic to assume that the true state is completely observable and noise-free, and perfect supervision signals are difficult to obtain. In the following experiments, TAdam proves useful in increasing the accuracy of the models, even when facing noisy inputs.

IV-A1 Robust Regression

Experimental settings

The regression setting on which we compared TAdam, Adam and RoAdam is as follows.

A ground-truth function $y = f(x)$ is defined, and we set a fully-connected neural network to approximate it from scattered observations $(x, y)$ sampled from the true function with noise. The observations have a probability $p$ of being contaminated by some noise $\epsilon$, so that:

$$y = f(x) + b\,\epsilon \quad (16)$$
$$\epsilon \sim t(\nu, 0, \sigma) \quad (17)$$
$$b \sim \mathrm{Bernoulli}(p) \quad (18)$$

where $t(\nu, 0, \sigma)$ designates a student-t distribution with degrees of freedom $\nu$, location $0$, and scale $\sigma$, and $\mathrm{Bernoulli}(p)$ is a Bernoulli distribution with the probability $p$ as its parameter.

The model, on the other hand, is a fully-connected neural network. The ReLU activation function [31] is used for all the hidden layers, while the loss function for the network is the mean squared error (MSE).
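A sketch of the observation model in equations (16)–(18); the ground-truth function, noise scale, and degrees of freedom below are placeholders of our own, since the exact experimental values are not recoverable from this version of the text:

```python
import numpy as np

rng = np.random.default_rng(42)

def f(x):
    return np.sin(3.0 * x)                    # placeholder ground-truth function

def noisy_observations(n, p, nu=2.0, scale=1.0):
    """y = f(x) + b * eps with eps ~ student-t(nu, 0, scale), b ~ Bernoulli(p)."""
    x = rng.uniform(-2.0, 2.0, n)
    eps = scale * rng.standard_t(nu, n)       # heavy-tailed noise, equation (17)
    b = rng.binomial(1, p, n)                 # contamination indicator, equation (18)
    return x, f(x) + b * eps                  # equation (16)

x_train, y_train = noisy_observations(1000, p=0.1)
```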

Experimental results

The results of the loss against the noise probability $p$ on the regression task are depicted in Fig. 3(a) and 3(b). Note that 50 trials were conducted for each $p$. As can be seen, TAdam clearly outperformed Adam in all the cases, and it proves more robust than RoAdam. In addition, as the noise probability in the observations increases, TAdam manages to resist its effect.

To visualize the learning results, the predicted curves after learning are also illustrated in Fig. 3(c) and 3(d). The learning variances of Adam were obviously larger than those of TAdam, and TAdam relatively succeeded in following the ground-truth function from the observations even with large noise.

IV-A2 Robust Classification

Experimental settings

Here, we use the same experimental settings described in [15] and compare Adam, AMSGrad and their "T" versions (TAdam and TAMSGrad), along with RoAdam, on an image classification task on the standard CIFAR-100 dataset.

The architecture of the convolutional network involved in the described experiments is ResNet-34 [32]. A fixed budget of 200 epochs is used throughout the training, and the learning rates are divided by 10 after 150 epochs.

The optimizers are launched with the following hyperparameter values: {learning rate: 0.001}, {betas: (0.99, 0.999)}, and both T algorithms use the default degrees of freedom, i.e., {degrees of freedom = dimension of the gradients}. The third beta value of RoAdam is also set to {0.999}.
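In PyTorch-style code, the setup reads roughly as below; the `TAdam` import refers to the linked repository, and its exact signature is our assumption:

```python
import torch
import torchvision
from tadam import TAdam  # hypothetical import from https://github.com/Mahoumaru/TAdam

model = torchvision.models.resnet34(num_classes=100)
optimizer = TAdam(model.parameters(), lr=0.001, betas=(0.99, 0.999))
# Default degrees of freedom: nu = number of parameters in each gradient group.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=150, gamma=0.1)
# Fixed budget of 200 epochs; the learning rate is divided by 10 after epoch 150.
```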

Experimental results

We first launched a simulation without noise, directly using the unmodified dataset. The results for that simulation are found in Fig. 4(a) and 4(d). We can see that TAdam and TAMSGrad achieve faster convergence during the training phase compared to the standard versions, and also show a higher level of generalization during the test phase. The corresponding loss curves, Fig. 4(b) and 4(e), show that TAMSGrad reaches the lowest point during the training phase, while also keeping a low loss value on the test data. This result points to the fact that TAMSGrad builds on the combined improvement of the first moment (TAdam) and the second moment (AMSGrad) in order to provide a more stable algorithm that can outperform the others.

Next, we applied, with a probability of 25%, a color-jittering effect on the training dataset and replaced 20% of the original training data points with fake ones, in order to test the ability of the optimizers to extract the most useful information from corrupted datasets. The results can be seen in Fig. 4(c) and 4(f), and they highlight the benefits of TAdam over Adam. Indeed, even though the value of $\beta_1$ is larger (0.99 instead of the default 0.9), Adam remains sensitive to outliers, while TAdam can ignore them.

IV-B Robust Reinforcement Learning

Whether it comes from sensors, from bad estimates during learning, or from different feedback given by different human instructors (e.g., non-technical users in real-world robotics situations), noisiness is inseparable from robotics reinforcement learning (RL). In order to test the robustness properties of TAdam in RL tasks, we conducted simulations on six different Pybullet gym environments [33]. The results are summarized in Fig. 5.

Experimental settings

Fig. 5 summarizes four trials with four different random seeds on each environment. The algorithm employed is proximal policy optimization (PPO) [34], from the Berkeley artificial intelligence research implementation, rlpyt [35], with the settings listed in Table I.

Value loss coef.      1
Entropy loss coef.    0
GAE parameter         0.95
Num. epochs           10
Ratio clipping        0.2
Horizon T             2048
Minibatch size        64
TAdam d.o.f.          default ($\nu = d$)
TABLE I: Settings for the RL experiments
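For reference, Table I transcribed as a plain configuration dictionary (the key names are ours, not rlpyt's):

```python
ppo_settings = {
    "value_loss_coeff": 1.0,
    "entropy_loss_coeff": 0.0,
    "gae_lambda": 0.95,        # GAE parameter
    "num_epochs": 10,
    "ratio_clip": 0.2,
    "horizon_T": 2048,
    "minibatch_size": 64,
    "tadam_dof": "default",    # nu = d, the gradient dimension
}
```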

No gradient norm clipping was used throughout the simulations, since the property under test is the robustness of the optimizers to aberrant gradient values and their ability to produce good policies. Gradient norm clipping introduces a manually defined heuristic threshold, which depends on the task and on various conditions, and moreover, it rescales all gradients whose norm exceeds this threshold. Such a trick would therefore introduce some undesirable bias in the results.

The simulations involved two different learning rates: a widely used and fine-tuned value for Adam in RL, and Adam's larger default value, $\alpha = 10^{-3}$.

Experimental results

Searching for the optimal learning rate is commonly known to be a tedious and serious problem in SGD-based algorithms, and high learning rates (particularly Adam's default step value $\alpha = 10^{-3}$) are usually not used in reinforcement learning, due to the amount of noise coming from the early bootstrapping stage, but also to prevent the agent from settling on an early deterministic policy.

As displayed by the results in Fig. 5, a high learning rate causes Adam to suffer from both of these problems and makes it unable to converge to a good policy. On the other hand, TAdam proves robust enough to sustain different learning rates, and it learns the tasks with both given hyperparameter values. Thanks to its careful updates of the agent, TAdam can still reach a sub-optimal policy that may even be better than the one reached with smaller learning rates (Fig. 5). This feature not only allows for the use of higher learning rates in order to accelerate the learning process, but also reduces the difficulties related to the tuning of the learning rate, since the default learning rate can be used directly.

Also, as stated in the experimental settings, no gradient norm clipping was used during the simulations. Without this trick, we can see that Adam fails altogether on the inverted double pendulum task, while TAdam naturally and automatically ignores or reduces the effect of large gradients, keeping the momentum from overshooting during learning and making the gradient norm clipping stratagem unnecessary.

V Conclusion and Future Work

In this letter, we proposed and described TAdam, a new stochastic gradient optimizer, which makes the Adam algorithm much more robust and provides a way to produce stable and efficient machine learning applications. TAdam is based on the robust mean estimate rule of the Student-t distribution as an alternative to the standard EMA. We verified that TAdam outperformed Adam in terms of robustness on supervised learning (regression and classification) tasks, and reinforcement learning tasks.

In this work, TAdam uses a fixed number of degrees of freedom, equal to the dimension of the gradients, and therefore has a fixed robustness. A straightforward improvement is to design a mechanism that automatically updates the parameter $\nu$ during the learning process, according to the presence or absence of outliers.

Appendix

V-A Proof of Theorem 1

First, we start by noticing that the basic bound on the regret from the convergence proof by Reddi et al. [29] also holds for TAdam, i.e.:

(19)

However, to further refine this upper bound, we need to redefine the Lemma 2 used in the proof of Reddi et al., since the decay rate of TAdam is no longer upper-bounded by $\beta_1$ at every time step $t$. For this purpose, we use the expected value of the adaptive decay parameter, $\mathbb{E}[\beta_t]$, instead of $\beta_1$, to define the upper bound and, following the same process as Reddi et al., derive an expression similar to their Lemma 2 in the case of TAdam:

(20)

Based on this new lemma, the remaining steps are completely identical to the proof of Reddi et al., and the final regret bound of TAdam is given by:

(21)

V-B Proof of Theorem 2

Assuming that the gradients ultimately follow an asymptotic normal distribution, the squared Mahalanobis distance of a normally distributed vector follows a chi-squared distribution with $d$ degrees of freedom. Applying this to the Mahalanobis distance $D_t$ in TAdam, we have:

(22)

Now, we know that the expected value of a chi-squared distribution with $d$ degrees of freedom is $d$, and the expected value of the inverse-chi-squared distribution with the same degrees of freedom is $1/(d-2)$. We can therefore define:

This inequality comes from Jensen's inequality and from the fact that the two functions involved are respectively convex and concave. The expected value of the weights $w_t$ in TAdam can therefore be expressed as:

(23)

We can then infer the mean of the weighted sum $W_t$:

(24)

where we have taken advantage of the monotonic decrease of the sequence towards its limit. We then move on to express the upper bound for $\mathbb{E}[\beta_t]$, where $\beta_t = W_{t-1}/(W_{t-1} + w_t)$. For this purpose, we make use of the Hartley and Ross unbiased estimator for the mean of the ratio between two random variables [36, 37], which, based on the fact that the covariance between the numerator and the denominator is positive, gives:

(25)

The last inequality is drawn from the relations depicted by equation (24).

References