pytorch-lr-dropout
"Learning Rate Dropout" in PyTorch
The performance of a deep neural network is highly dependent on its training, and finding better local optimal solutions is the goal of many optimization algorithms. However, existing optimization algorithms show a preference for descent paths that converge slowly and do not seek to avoid bad local optima. In this work, we propose Learning Rate Dropout (LRD), a simple gradient descent technique for training related to coordinate descent. LRD empirically aids the optimizer to actively explore in the parameter space by randomly setting some learning rates to zero; at each iteration, only parameters whose learning rate is not 0 are updated. As the learning rate of different parameters is dropped, the optimizer will sample a new loss descent path for the current update. The uncertainty of the descent path helps the model avoid saddle points and bad local minima. Experiments show that LRD is surprisingly effective in accelerating training while preventing overfitting.
Deep neural networks are trained by optimizing high-dimensional non-convex loss functions. The success of training hinges on how well we can minimize these loss functions, both in terms of the quality of the convergence points and the time it takes to find them. These loss functions are usually optimized using gradient-descent-based algorithms, which search for a descending path of the loss function in the parameter space according to a predefined paradigm. Existing optimization algorithms can be roughly categorized into adaptive and non-adaptive methods. Specifically, adaptive algorithms (Adam [15], AMSGrad [27], RMSprop [31], RAdam [19], etc.) tend to find paths that descend the loss quickly, but usually converge to bad local minima. In contrast, non-adaptive algorithms (e.g. SGD-momentum) converge better, but train slowly. Helping the optimizer maintain fast training speed and good convergence is an open problem.
Implicit regularization techniques (e.g. dropout [30], weight decay [16], noisy label [37]) are widely used to help training. The most popular one is dropout, which effectively prevents feature co-adaptation (a sign of overfitting) by randomly dropping hidden units (i.e. their activation is zeroed). Dropout can be interpreted as a way of regularizing training by adding noise to the hidden units. Other methods achieve similar effects by injecting noise into gradients [24], labels [37], and activation functions [10]. The principle of these methods is to force the optimizer to randomly change the loss descent path in the high-dimensional parameter space by injecting interference into the training. The uncertainty of the loss descent path gives the model more opportunities to escape local minima and find a better result. However, many researchers have found that these regularization methods improve generalization at the cost of training time [30, 34]. This is because random noise in training may create an erroneous loss descent path (i.e. the current parameter update may increase the loss), resulting in slow convergence. What is needed is a technique that finds good local optima and converges quickly.
A possible solution is to inject uncertainty into the loss descent path while ensuring that the resulting path is correct, meaning that it decreases the loss. Inspired by the coordinate descent algorithm [36, 25, 23], we observe that the decline of the loss does not depend on the simultaneous update of all weight parameters. According to coordinate descent, even if only one parameter is updated, the loss is reduced. In particular, the loss descent path is determined by all updated parameters. This means that we can sample a new descent path by randomly pausing the update of certain parameters. Based on these observations, we propose a new regularization technique, learning rate dropout (LRD), to help the optimizer randomly sample a correct loss descent path at each iteration.
Learning rate dropout provides a way to inject randomness into the loss descent path while maintaining a descent direction. The key difference from standard dropout is that it randomly drops the learning rate of the model parameters instead of the hidden units. During training, the optimizer dynamically calculates the update and allocates a learning rate for each parameter. Learning rate dropout works by randomly determining which parameters are not updated (i.e. the learning rate is set to zero) in the current training iteration. In the simplest case, the learning rate for each parameter is retained with a fixed probability $p$, independently of the other parameters, or is dropped with probability $1 - p$. A dropped learning rate is temporarily set to zero, which does not affect the update of other parameters or subsequent updates. A neural network with $n$ parameters has $2^n$ feasible descent paths. Since each path is correct in that it decreases the objective function, our learning rate dropout does not hinder training, unlike previous methods [30, 37]. At the same time, convergence issues such as saddle points and bad local optima are avoided more easily using the proposed method. Furthermore, LRD can be trivially integrated into existing gradient-based optimization algorithms, such as Adam.
Learning rate dropout can be interpreted as a regularization technique for the loss descent path that adds noise to the learning rate. The idea of adding noise [1, 13, 9, 32] to the training of neural networks has drawn much attention. Adilova, et al. [1] demonstrate that noise injection substantially improves model quality for non-linear neural networks. An [2] explored the effects of noise injection to the inputs, outputs, and weights of multilayer feedforward neural networks. Blundell, et al. [4] regularize the weights by minimizing a variational free energy. Neelakantan, et al. [24] also find that adding noise to gradients can help avoid overfitting and result in lower training loss. Xie, et al. [37] impose regularization within the loss layer by randomly setting the labels to be incorrect. Similarly, the standard dropout [30] is a way of regularizing a neural network by adding noise to its hidden units. Wan, et al. [33] further proposed DropConnect, a generalization of Dropout that sets a randomly selected subset of weights to zero, rather than hidden units. On the other hand, adding noise to activation functions can prevent early saturation. Gulcehre, et al. [10] found that noisy activation functions (e.g., sigmoid and tanh) are easier to optimize, and that replacing the non-linearities by their noisy counterparts usually leads to better results. Chen, et al. [5] propose a noisy softmax to mitigate the early saturation issue by injecting annealed noise into the softmax input. All of these methods can effectively prevent overfitting. However, they result in slower convergence.
Deep neural networks are optimized using gradient-descent-based algorithms. Stochastic gradient descent (SGD) [28] is a widely used approach, which performs well in many research fields. However, it has been empirically observed that SGD has slow convergence since it scales the gradient uniformly in all directions. To address this issue, variants of SGD that adaptively rescale or average the gradient direction have achieved some success. Examples include RMSprop [31], Adadelta [38] and Adam [15]. In particular, Adam is the most popular adaptive optimization algorithm due to its rapid training speed. However, many publications [35, 27] indicate that Adam has poor convergence. Recently, AMSGrad [27] and AdaBound [22], two variants of Adam, were proposed to solve the convergence issues of Adam by bounding the learning rates. Furthermore, RAdam [19] also aims to solve the convergence issue of Adam by rectifying the variance of the adaptive learning rate. While these adaptive methods often display faster progress in training, they have also been observed to fail to converge well in many cases.
We start our discussion on learning rate dropout by integrating it into an online optimization problem [40]. We provide a generic framework of optimization methods with learning rate dropout in Algorithm 1 (all multiplications are element-wise). Consider a neural network with weight parameters $\theta$. We assume . In the online setup, at each time step $t$, the optimization algorithm modifies the current parameters $\theta_t$ using the loss function $f_t$ over the data at time $t$, and the gradient $g_t = \nabla f_t(\theta_t)$. Finally, the optimizer calculates the update for $\theta_t$ using $g_t$ and possibly other terms including earlier gradients.
Algorithm 1 encapsulates many popular adaptive and non-adaptive methods through the definition of the gradient accumulation terms $m_t$ and $v_t$. In particular, the adaptive methods are distinguished by the choice of $v_t$, which is absent in non-adaptive methods. Most methods contain a similar momentum component
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \tag{1}$$
where $m_0 = 0$ and $\beta_1$ is the momentum parameter. The momentum accumulates the exponential moving average of previous gradients to correct the current update.
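As a minimal sketch of the momentum component described above, the following assumes the common exponential-moving-average form $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$ with $m_0 = 0$. The function name `ema_momentum` and the scalar-gradient setting are illustrative assumptions, not part of the paper.

```python
def ema_momentum(grads, beta1=0.9):
    """Return m_1..m_T for a sequence of scalar gradients, starting from m_0 = 0.

    Assumed form: m_t = beta1 * m_{t-1} + (1 - beta1) * g_t.
    """
    m = 0.0
    history = []
    for g in grads:
        # Blend the previous average with the newest gradient.
        m = beta1 * m + (1.0 - beta1) * g
        history.append(m)
    return history


if __name__ == "__main__":
    # With a constant gradient, the average moves toward that gradient.
    print(ema_momentum([1.0, 1.0, 1.0], beta1=0.5))
```

With a constant gradient the accumulated value approaches the gradient itself, which is why the term smooths, rather than rescales, the update direction.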
During training, the optimizer assigns a learning rate $\alpha$, which is typically a constant, to each parameter. When learning rate dropout is applied, in each iteration the learning rate of each parameter is kept with probability $p$, independently of the other parameters, or set to zero otherwise. At each time step $t$, a random binary mask matrix $D_t$ is sampled to encode the learning rate information, with each element $d_{ij} \sim \mathrm{Bernoulli}(p)$. Then a learning rate matrix $A_t$ at time step $t$ is obtained by
$$A_t = \alpha D_t. \tag{2}$$
For a parameter, a learning rate of $0$ means not updating its value. For a model with $n$ parameters, applying learning rate dropout is equivalent to randomly sampling one of $2^n$ possible parameter subsets (uniformly when $p = 0.5$) for which to perform a gradient update in the current iteration (see Figure 3). This is closely connected to coordinate descent, in which each set of a partition of parameters is fully optimized in a cycle, before returning to the first set for the next iteration. Therefore, each update of LRD will still cause a decrease in the loss function according to gradient descent theory [29, 36, 25], while better avoiding saddle points. On the other hand, LRD does not interrupt the gradient calculation of any parameter. If there is a gradient accumulation term (e.g. momentum) in the optimizer, the gradient of each parameter is stored for subsequent updates, regardless of whether the learning rate is dropped. Furthermore, our method does not slow training like previous methods [30, 37, 33].
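The masking step described above can be sketched for SGD with momentum as follows. This is an illustrative pure-Python sketch under stated assumptions, not the authors' implementation: the function name `lrd_sgdm_step`, the list-of-floats parameter layout, and the momentum form are all hypothetical choices. The point it shows is that the momentum is accumulated for every parameter, while only parameters whose mask entry is 1 are moved.

```python
import random


def lrd_sgdm_step(params, grads, momenta, lr=0.1, beta1=0.9, p=0.5, rng=random):
    """One SGD-momentum step with learning rate dropout (illustrative sketch).

    Each parameter keeps its learning rate with probability p; otherwise the
    learning rate is zero for this step only.
    """
    new_params, new_momenta = [], []
    for theta, g, m in zip(params, grads, momenta):
        # The gradient is always accumulated, even if the update is dropped.
        m = beta1 * m + (1.0 - beta1) * g
        # Bernoulli(p) mask entry: rng.random() lies in [0, 1).
        keep = 1.0 if rng.random() < p else 0.0
        step_size = lr * keep  # entry of the masked learning rate matrix
        new_params.append(theta - step_size * m)
        new_momenta.append(m)
    return new_params, new_momenta


if __name__ == "__main__":
    rng = random.Random(0)
    params, momenta = [1.0, -2.0, 0.5], [0.0, 0.0, 0.0]
    grads = [0.3, -0.1, 0.2]
    params, momenta = lrd_sgdm_step(params, grads, momenta, rng=rng)
    print(params, momenta)
```

Setting `p=1.0` recovers the plain optimizer, and `p=0.0` freezes all parameters while still updating the momentum buffers.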
The learning rate dropout can accelerate training and improve generalization. By adding stochasticity to loss descent path, this technique helps the model to traverse quickly through the “transient” plateau (e.g. saddle points or local minima) and gives the model more chances to find a better minimum. To verify the effectiveness of learning rate dropout, we show a toy example to visualize the loss descent path during optimization. Consider the following nonconvex function:
(3) | ||||
where , . We use the popular optimizer Adam to search for the minima of the function in the two-dimensional parameter space. For this function, the point is the optimal solution. In Figure 4, we see that the convergence of Adam is sensitive to the initial point: different initializations lead to different convergence results. Moreover, Adam is easily trapped by the minimum near the initial point, even if it is a bad local minimum. In contrast, learning rate dropout makes Adam more active. Even if the optimizer reaches a local minimum, learning rate dropout still encourages it to search for other possible paths instead of doing nothing. This example illustrates that learning rate dropout can effectively help the optimizer escape from suboptimal points and find a better result.
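To make the combination of Adam and learning rate dropout concrete, here is a minimal pure-Python sketch. It uses the standard Adam update with bias correction and applies a Bernoulli($p$) mask to the per-parameter step. The toy function of Equation (3) is not reproduced here; the quadratic in the usage example is a hypothetical stand-in, and `adam_lrd` and its arguments are illustrative names rather than the authors' code.

```python
import math
import random


def adam_lrd(grad_fn, theta, steps=200, lr=0.01, beta1=0.9, beta2=0.999,
             eps=1e-8, p=0.5, seed=0):
    """Run Adam with learning rate dropout on a list of floats (sketch)."""
    rng = random.Random(seed)
    theta = list(theta)
    m = [0.0] * len(theta)  # first-moment estimates
    v = [0.0] * len(theta)  # second-moment estimates
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        for i in range(len(theta)):
            # Moment estimates are updated for every parameter.
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] * g[i]
            m_hat = m[i] / (1.0 - beta1 ** t)
            v_hat = v[i] / (1.0 - beta2 ** t)
            # Learning rate dropout: skip the update with probability 1 - p.
            if rng.random() < p:
                theta[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


if __name__ == "__main__":
    # Hypothetical stand-in objective f(x, y) = x^2 + y^2 (not Equation (3)).
    quadratic_grad = lambda th: [2.0 * th[0], 2.0 * th[1]]
    print(adam_lrd(quadratic_grad, [1.0, -1.0], steps=2000))
```

On a convex stand-in like this the sketch only shows that the masked updates still reach the minimum; the escape behaviour discussed above requires a nonconvex objective such as the one in the paper.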
To empirically evaluate the proposed method, we perform different popular machine learning tasks, including image classification, image segmentation and object detection. Using different models and optimization algorithms, we demonstrate that learning rate dropout is a general technique for improving neural network training not specific to any particular application domain.
Table 1: Test accuracy (%) of each optimization algorithm without / with learning rate dropout (LRD).
Dataset | Model | SGDM | RMSprop | Adam | AMSGrad | RAdam |
---|---|---|---|---|---|---|
MNIST | FCNet | 97.82 / 98.21 | 98.06 / 98.40 | 97.88 / 98.15 | 97.88 / 98.51 | 97.97 / 98.21 |
CIFAR-10 | ResNet-34 | 95.30 / 95.54 | 92.71 / 93.68 | 93.05 / 93.77 | 93.31 / 93.73 | 94.24 / 94.61 |
CIFAR-100 | DenseNet-121 | 79.09 / 79.42 | 70.21 / 74.02 | 72.55 / 74.34 | 73.91 / 75.23 | 71.81 / 73.04 |
We first apply learning rate dropout to the multiclass classification problem of the MNIST, CIFAR-10 and CIFAR-100 datasets.
The MNIST digits dataset [17] contains 60,000 training and 10,000 test images of size 28 × 28. The task is to classify the images into 10 digit classes. Following previous publications [15, 3], we train a simple fully connected neural network with two hidden layers (FCNet) on MNIST. Each hidden layer has 1000 rectified linear units (ReLU). Training uses mini-batches of 128 images for 100 epochs through the training set. A decay scheme is not used.
The CIFAR-10 and CIFAR-100 datasets consist of 60,000 RGB images of size 32 × 32, drawn from 10 and 100 categories, respectively. 50,000 images are used for training and the rest for testing. In both datasets, training and testing images are uniformly distributed over all the categories. To show the broad applicability of the proposed method, we use ResNet-34 [12] for CIFAR-10 and DenseNet-121 [14] for CIFAR-100. Both models are trained for 200 epochs with a mini-batch size of 128. We reduce the learning rates by a factor of 10 at the 100th and 150th epochs. The weight decay rate is set to .
For a reliable evaluation, the classifiers are trained multiple times using several optimization algorithms including SGD-momentum (SGDM) [26], RMSprop [31], Adam [15], AMSGrad [27] and RAdam [19]. We apply learning rate dropout to each optimization algorithm to show how this technique can help with training. Unless otherwise specified, the dropout rate (the probability of performing an individual parameter update in any given iteration) is set to . For each optimization algorithm, the initial learning rate has the greatest impact on the ultimate solution. Therefore, we only tune the initial learning rate while retaining the default settings for other hyper-parameters. Specifically, the initial learning rates for SGDM, RMSprop, Adam, AMSGrad, and RAdam are 0.1, 0.001, 0.001, 0.001, and 0.03, respectively.
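The CIFAR training schedule described above (a tenfold reduction at epochs 100 and 150, starting from the per-optimizer initial learning rates) can be sketched as a small helper. The function name `step_lr` and the convention that epochs are counted from 0, with the drop taking effect at epoch index 100 and 150, are assumptions made for illustration.

```python
# Initial learning rates reported in the text for each optimizer.
INITIAL_LR = {
    "SGDM": 0.1,
    "RMSprop": 0.001,
    "Adam": 0.001,
    "AMSGrad": 0.001,
    "RAdam": 0.03,
}


def step_lr(initial_lr, epoch, milestones=(100, 150), factor=0.1):
    """Learning rate at a given epoch (0-based, an assumption) under step decay."""
    lr = initial_lr
    for milestone in milestones:
        if epoch >= milestone:
            lr *= factor  # reduce by 10x once each milestone is reached
    return lr


if __name__ == "__main__":
    for epoch in (0, 99, 100, 150, 199):
        print(epoch, step_lr(INITIAL_LR["SGDM"], epoch))
```

Under learning rate dropout, this scheduled value plays the role of the base learning rate that the random mask either keeps or zeroes for each parameter.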
We first show the learning curves for MNIST in Figure 5. All methods converge quickly and generalize well. Even so, compared with their prototypes, the optimizers that use learning rate dropout display better training speed and test accuracy. We further report the results for the CIFAR datasets in Figure 6 and Figure 7. The test accuracy obtained by the different methods is summarized in Table 1. The same models with and without learning rate dropout converge very differently. The optimization algorithms without learning rate dropout either train slowly or converge to poor results. In contrast, learning rate dropout benefits all optimization algorithms by improving training speed and generalization. In particular, we observe that by applying learning rate dropout to the non-adaptive method SGDM, we can achieve convergence speed comparable to the adaptive methods. On CIFAR-100, learning rate dropout even helps RMSprop obtain an accuracy improvement of close to 4% (from 70.21 to 74.02). Furthermore, learning rate dropout incurs negligible computational cost and requires no parameter tuning aside from the dropout rate $p$.
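The per-optimizer gains behind these statements can be read directly off Table 1. The short sketch below recomputes them for the CIFAR-100 row; the numbers are copied from the table, while the helper name `accuracy_gains` is an illustrative assumption.

```python
# CIFAR-100 / DenseNet-121 test accuracy (%) from Table 1: (without LRD, with LRD).
CIFAR100 = {
    "SGDM": (79.09, 79.42),
    "RMSprop": (70.21, 74.02),
    "Adam": (72.55, 74.34),
    "AMSGrad": (73.91, 75.23),
    "RAdam": (71.81, 73.04),
}


def accuracy_gains(results):
    """Absolute accuracy gain (percentage points) from adding LRD."""
    return {name: round(with_lrd - base, 2)
            for name, (base, with_lrd) in results.items()}


gains = accuracy_gains(CIFAR100)
best = max(gains, key=gains.get)

if __name__ == "__main__":
    print(gains)
    print("largest gain:", best)
```

The largest gain belongs to RMSprop and the smallest to SGDM, which already has the highest baseline accuracy on this dataset.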
We also consider the semantic segmentation task, which assigns each pixel in an image a semantic label [7, 21]. We evaluate our method on the PASCAL VOC2012 semantic segmentation dataset [8], which consists of 20 object categories and one background category. Following the conventional setting in [6, 18], the dataset is augmented by extra annotated VOC images provided in [11], which results in 10,582, 1,449 and 1,456 images for training, validation and testing, respectively. We use the state-of-the-art segmentation model PSPNet [39] to conduct the experiments. We use the Adam solver with the initial learning rate . Other hyperparameters such as batch size and weight decay follow the setting in [39]. The segmentation performance is measured by the mean of class-wise intersection over union (Mean IoU) and pixel-wise accuracy (Pixel Accuracy). Results for this experiment are reported in Figure 8. With our proposed learning rate dropout, the model reaches 0.688/0.921 in terms of Mean IoU and Pixel Accuracy, exceeding the 0.637/0.905 obtained with vanilla Adam. In addition, Figure 8 shows once again that learning rate dropout leads to faster convergence.
Object detection is a challenging and important problem in the computer vision community. We next apply learning rate dropout to an object detection task. We use the VOC2012 and VOC2007 datasets for training, and use VOC2007 as our validation and testing data. We train a one-stage detection network, SSD [20], using Adam with a learning rate of 0.001. Other hyperparameter settings are the same as in [20]. To show the effect of learning rate dropout on training, we use the training loss and validation loss as the evaluation metrics. We run the model for 100 epochs and show the results in Figure 9. After applying learning rate dropout, the training loss and validation loss drop faster and converge to better results. This means that the detector achieves higher detection accuracy.
Table 2: Test accuracy (%) of ResNet-34 on CIFAR-10 for different dropout rates $p$.
$p$ | Adam | SGDM |
---|---|---|
(No LRD) | 93.05 | 95.30 |
 | 93.32 | 94.32 |
 | 93.93 | 95.39 |
 | 93.77 | 95.54 |
 | 93.50 | 95.42 |
 | 93.35 | 95.26 |
As mentioned, the hyperparameter $p$, which controls the expected size of the coordinate update set, is the only parameter that needs to be tuned. In this section, we explore the effect of various values of $p$. We conduct experiments with ResNet-34 on the CIFAR-10 dataset, where $p$ is chosen from a broad range . We use Adam and SGDM to train the model; other hyperparameters are the same as in the previous experiments. The results are shown in Figure 10. We can see that for any $p$, learning rate dropout can speed up training, but the smaller the $p$, the faster the convergence. In addition, we report the test accuracy in Table 2. As can be seen, almost all values of $p$ lead to an improvement in test accuracy. In practical applications, in order to balance training speed and generalization, we recommend setting $p$ in the range of to .
We further compare our learning rate dropout with three popular regularization methods: standard dropout [30], noisy label [37] and noisy gradient [24]. Standard dropout regularizes the network on hidden units, and we set the probability that each hidden unit is retained to . The noisy label method disturbs each training sample with the probability (i.e., a label is correct with a probability ). For each changed sample, the label is drawn uniformly at random from the labels other than the correct one. The noisy gradient method adds Gaussian noise to the gradient at every time step. The variance of the Gaussian noise gradually decays to 0 according to the parameter setting in [24]. We apply these regularization techniques to ResNet-34 trained on CIFAR-10. This model is trained multiple times using SGDM and Adam. For clarity, we use the terms “SD”, “NL”, “NG” to denote training with standard dropout, noisy label and noisy gradient, respectively.
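The noisy label baseline described above (with some probability, replace a label by one drawn uniformly from the other classes) can be sketched as follows. The function name `noisy_label` and the parameter name `flip_prob` are hypothetical, and the flip probability used in the paper's experiments is not given here.

```python
import random


def noisy_label(label, num_classes, flip_prob, rng=random):
    """Return the label, replaced with probability flip_prob by a wrong one.

    The replacement is drawn uniformly from all classes except the correct one.
    """
    if rng.random() < flip_prob:
        wrong_labels = [c for c in range(num_classes) if c != label]
        return rng.choice(wrong_labels)
    return label


if __name__ == "__main__":
    rng = random.Random(0)
    labels = [3, 7, 1, 0, 9]
    # flip_prob=0.2 is an arbitrary illustrative value.
    print([noisy_label(y, num_classes=10, flip_prob=0.2, rng=rng) for y in labels])
```

Unlike learning rate dropout, this perturbation changes the training target itself, so an individual update can move the parameters in a direction that increases the true loss.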
We report the learning curves in Figure 11. As can be seen, our learning rate dropout speeds up training while the other regularization methods hinder convergence. This is because the other regularization methods inject random noise into the training, which may lead to a wrong loss descent path. In contrast, our learning rate dropout always samples a correct loss descent path, so it does not worsen the training. We also show the test accuracy in Table 3. The results show that standard dropout and our learning rate dropout can effectively improve generalization, while the effects of noisy label and noisy gradient are disappointing. On the other hand, we find that our learning rate dropout and other regularization methods can be complementary. We use labels of the form “LRD and SD” to represent the simultaneous use of learning rate dropout and another regularization method. From Figure 11 and Table 3, we see that learning rate dropout can work with other regularization methods to further improve the performance of the model. This shows that learning rate dropout is not limited to being a substitute for standard dropout or other methods.
Table 3: Test accuracy (%) of ResNet-34 on CIFAR-10 with different regularization methods.
Regularization | Adam | SGDM |
---|---|---|
No regularization | 93.05 | 95.30 |
Standard dropout (SD) | 94.05 | 95.33 |
Noise gradient (NG) | 93.70 | 95.00 |
Noise label (NL) | 92.59 | 94.61 |
Learning rate dropout (LRD) | 93.77 | 95.54 |
LRD and SD | 94.73 | 95.58 |
LRD and NG | 93.92 | 95.26 |
Learning rate dropout is also equivalent to applying dropout to the update of the parameters at each time step $t$. The idea is to inject uncertainty into the loss descent path, while ensuring that the loss descends in the correct direction at each iteration. One may wonder if there are other ways to realize this idea. For example, using dropout on the gradient $g_t$ is also a possible solution. Dropout on $g_t$ can also interfere with the loss descent path and does not lead to a wrong descent direction. Here, we compare our learning rate dropout with dropout on $g_t$ (DG). DG follows the setting of LRD: the gradient of a parameter is retained with probability $p$, or is set to 0 with probability $1 - p$. In this experiment, . We show the learning curves in Figure 12. As can be seen, DG has no positive effect on training. There may be several causes. First, the gradient accumulation terms (e.g., momentum) can greatly suppress the impact of dropping gradients on the loss descent path. In addition, dropping the gradient may slow down training due to the lack of gradient information. In contrast, our LRD only temporarily stops updating some parameters, and all gradient information is stored by the gradient accumulation terms. Therefore, in learning rate dropout training, there is no loss of gradient information.
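The difference between dropping the gradient (DG) and dropping the learning rate (LRD) is easiest to see on a single scalar parameter with momentum. The sketch below is illustrative only: the function names `dg_step` and `lrd_step` and the momentum form are assumptions. When the mask entry is 0, DG still moves the parameter using the old momentum and discards the new gradient, whereas LRD freezes the parameter and keeps the gradient in the momentum buffer.

```python
def dg_step(theta, g, m, keep, lr=0.1, beta1=0.9):
    """Dropout on the gradient: a dropped gradient is replaced by 0."""
    g_used = g if keep else 0.0
    m = beta1 * m + (1.0 - beta1) * g_used  # new gradient information is lost
    return theta - lr * m, m                # parameter still moves via old momentum


def lrd_step(theta, g, m, keep, lr=0.1, beta1=0.9):
    """Learning rate dropout: a dropped learning rate is replaced by 0."""
    m = beta1 * m + (1.0 - beta1) * g       # gradient is always accumulated
    step_size = lr if keep else 0.0
    return theta - step_size * m, m         # parameter is frozen when dropped


if __name__ == "__main__":
    theta, g, m = 1.0, 2.0, 1.0
    print("DG :", dg_step(theta, g, m, keep=False, beta1=0.5))
    print("LRD:", lrd_step(theta, g, m, keep=False, beta1=0.5))
```

This matches the two explanations given above: momentum masks the effect of a dropped gradient on the path, and the dropped gradient never reaches the accumulation term.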
We have shown that LRD can speed up training and improve generalization with almost any dropout rate $p$. The standard dropout (SD) also has a tunable dropout rate. In this section, we explore the effect of varying it. We apply SD to ResNet-34 trained on the CIFAR-10 dataset, where the rate is chosen from {0.9, 0.8, 0.7, 0.6}. We show the results in Figure 13 and Table 4. As can be seen, the performance of standard dropout is sensitive to the choice of dropout rate. The smaller the rate, the worse the convergence. This indicates that the rate should be chosen carefully when applying SD to convolutional neural networks.
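Table 4: Test accuracy (%) of ResNet-34 on CIFAR-10 for different standard dropout rates.
Optimizer | 1 (No SD) | 0.9 | 0.8 | 0.7 | 0.6 |
---|---|---|---|---|---|
Adam | 93.05 | 94.05 | 94.08 | 93.34 | 92.89 |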
We presented learning rate dropout, a new technique for regularizing neural network training with gradient descent. This technique encourages the optimizer to actively explore the parameter space by randomly dropping some learning rates. The uncertainty of the learning rate helps models quickly escape poor local optima and gives models more opportunities to search for better results. Experiments show the substantial ability of learning rate dropout to accelerate training and enhance generalization. In addition, this technique is found to be effective in a wide variety of application domains including image classification, image segmentation and object detection. This shows that learning rate dropout is a general technique with great potential in practical applications.
The effects of adding noise during backpropagation training on a generalization performance. Neural Computation 8 (3), pp. 643–674.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8522–8531.
Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research 11 (Dec), pp. 3371–3408.