Learning Rate Dropout

Huangxing Lin, et al.
Xiamen University; Columbia University

The performance of a deep neural network is highly dependent on its training, and finding better local optima is the goal of many optimization algorithms. However, existing optimization algorithms tend to follow descent paths that converge slowly and make no attempt to avoid bad local optima. In this work, we propose Learning Rate Dropout (LRD), a simple gradient descent technique related to coordinate descent. LRD helps the optimizer actively explore the parameter space by randomly setting some learning rates to zero; at each iteration, only parameters whose learning rate is not zero are updated. Because different learning rates are dropped at each iteration, the optimizer samples a new loss-descent path for every update. The uncertainty of the descent path helps the model avoid saddle points and bad local minima. Experiments show that LRD is surprisingly effective at accelerating training while preventing overfitting.




1 Introduction

Deep neural networks are trained by optimizing high-dimensional non-convex loss functions. The success of training hinges on how well we can minimize these loss functions, both in terms of the quality of the convergence points and the time it takes to find them. These loss functions are usually optimized with gradient-descent-based algorithms, which search for a descending path of the loss function in the parameter space according to a predefined paradigm. Existing optimization algorithms can be roughly categorized into adaptive and non-adaptive methods. Adaptive algorithms (Adam [15], Amsgrad [27], RMSprop [31], RAdam [19], etc.) tend to find paths that descend the loss quickly, but usually converge to bad local minima. In contrast, non-adaptive algorithms (e.g. SGD-momentum) converge better, but train slowly. Helping the optimizer maintain both fast training and good convergence is an open problem.

Figure 1: (a) Gradient updates are trapped at a saddle point. (b) With learning rate dropout, the optimizer escapes from saddle points more quickly. Red arrow: the update of each parameter in the current iteration. Red dot: the initial state. Yellow dots: the subsequent states. "×": a randomly dropped learning rate.
(a) Back propagation
(b) Applying dropout
(c) Applying learning rate dropout
Figure 2: (a) BP algorithm for a neural network. Black lines represent the gradient updates to each weight parameter. (b) An example of applying standard dropout: the dropped units appear in neither forward nor back propagation during training. (c) The red line indicates that the learning rate is dropped, so the corresponding weight parameter is not updated. Note that a dropped learning rate affects neither forward propagation nor gradient back propagation. At each iteration, different learning rates are dropped.

Implicit regularization techniques (e.g. dropout [30], weight decay [16], noisy labels [37]) are widely used to aid training. The most popular is dropout, which effectively prevents feature co-adaptation (a sign of overfitting) by randomly dropping hidden units (i.e. zeroing their activations). Dropout can be interpreted as regularizing training by adding noise to the hidden units. Other methods achieve similar effects by injecting noise into gradients [24], labels [37], and activation functions [10]. The principle of these methods is to force the optimizer to randomly change the loss-descent path in the high-dimensional parameter space by injecting interference into training. The uncertainty of the descent path gives the model more opportunities to escape local minima and find a better result. However, many researchers have found that these regularization methods improve generalization at the cost of training time [30, 34]: random noise may create an erroneous descent path (i.e. the current parameter update may increase the loss), resulting in slow convergence. What is needed is a technique that both finds good local optima and converges quickly.

A possible solution is to inject uncertainty into the loss-descent path while ensuring that the resulting path remains correct, i.e. that it decreases the loss. Inspired by the coordinate descent algorithm [36, 25, 23], we note a fundamental fact: decreasing the loss does not depend on simultaneously updating all weight parameters. As in coordinate descent, even if only one parameter is updated, the loss can still be reduced. Since the descent path is determined by the set of updated parameters, we can sample a new descent path by randomly pausing the updates of some parameters. Based on these observations, we propose a new regularization technique, learning rate dropout (LRD), which helps the optimizer randomly sample a correct loss-descent path at each iteration.

Learning rate dropout provides a way to inject randomness into the loss-descent path while maintaining a descent direction. The key difference from standard dropout is that it randomly drops the learning rates of the model parameters instead of the hidden units. During training, the optimizer dynamically calculates the update and allocates a learning rate for each parameter. Learning rate dropout works by randomly determining which parameters are not updated (i.e. their learning rate is set to zero) in the current training iteration. In the simplest case, the learning rate of each parameter is retained with a fixed probability p, independently of the other parameters, or dropped with probability 1 − p. A dropped learning rate is temporarily set to zero, affecting neither the updates of other parameters nor subsequent updates of the same parameter. A neural network with N parameters thus has 2^N feasible descent paths. Since each path is correct in the sense that it decreases the objective function, learning rate dropout does not hinder training, unlike previous methods [30, 37]. At the same time, convergence issues such as saddle points and bad local optima are avoided more easily. Furthermore, LRD can be trivially integrated into existing gradient-based optimization algorithms, such as Adam.
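The path count above can be checked concretely. A minimal sketch (the function name `descent_paths` is our own, for illustration): each binary learning-rate mask over N parameters selects one subset of coordinates to update, so a model with N parameters has 2^N possible paths per iteration.

```python
# Enumerate the feasible descent paths for a small model: every binary
# mask over n_params parameters picks one subset of coordinates to
# update, giving 2**n_params possible paths per iteration.
from itertools import product

def descent_paths(n_params):
    """All binary learning-rate masks for a model with n_params parameters."""
    return list(product([0, 1], repeat=n_params))

masks = descent_paths(3)
# A 3-parameter model (as in Figure 3) has 8 possible descent paths.
assert len(masks) == 2 ** 3
```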

Require: α: learning rate; φ_t, ψ_t: functions to calculate the momentum and the adaptive rate; θ_0: initial parameters; f(θ): stochastic objective function; p: dropout rate; Amsgrad: False or True.
Ensure: θ_T: resulting parameters.
1:  for t = 1 to T do
2:      g_t = ∇f_t(θ_{t−1})  (calculate gradients w.r.t. the stochastic objective at timestep t)
3:      m_t = φ_t(g_1, …, g_t)  (accumulation of past and current gradients)
4:      if Amsgrad is True then
5:          v_t = ψ_t(g_1, …, g_t)  (accumulation of squared gradients)
6:          v̂_t = max(v̂_{t−1}, v_t)
7:      else
8:          v̂_t = ψ_t(g_1, …, g_t)
9:      end if
10:     randomly sample a learning rate dropout mask M_t with each element M_{t,i} ~ Bernoulli(p)
11:     A_t = α M_t  (randomly drop learning rates at timestep t)
12:     Δθ_t = A_t ⊙ m_t / √v̂_t  (calculate the update for each parameter)
13:     θ_t = θ_{t−1} − Δθ_t
14: end for
Algorithm 1: Generic framework of optimization with learning rate dropout. ⊙ indicates element-wise multiplication.
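As a concrete (and hedged) instantiation of Algorithm 1, the sketch below applies learning rate dropout to a plain Adam update over a list of scalar parameters. The function and argument names (`lrd_adam_step`, `keep_prob`) are illustrative, not taken from the paper's released code.

```python
# One Adam step with learning rate dropout: each coordinate's learning
# rate is independently kept with probability keep_prob or set to zero.
# Momentum (m) and second-moment (v) buffers accumulate for every
# coordinate, even those whose learning rate is dropped this step.
import math
import random

def lrd_adam_step(theta, grad, m, v, t, lr=0.001,
                  beta1=0.9, beta2=0.999, eps=1e-8, keep_prob=0.5,
                  rng=random):
    new_theta, new_m, new_v = [], [], []
    for th, g, mi, vi in zip(theta, grad, m, v):
        mi = beta1 * mi + (1 - beta1) * g        # momentum accumulates...
        vi = beta2 * vi + (1 - beta2) * g * g    # ...even for dropped coords
        m_hat = mi / (1 - beta1 ** t)            # bias correction
        v_hat = vi / (1 - beta2 ** t)
        # Learning rate dropout: Bernoulli mask on the per-coordinate step.
        a = lr if rng.random() < keep_prob else 0.0
        th = th - a * m_hat / (math.sqrt(v_hat) + eps)
        new_theta.append(th)
        new_m.append(mi)
        new_v.append(vi)
    return new_theta, new_m, new_v
```

Because the mask multiplies only the step size (line 11 of Algorithm 1), the accumulators `m` and `v` are identical to those of plain Adam; only which coordinates move in a given iteration changes.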

2 Related work

Noise injection:

Learning rate dropout can be interpreted as a regularization of the loss-descent path that adds noise to the learning rate. The idea of adding noise [1, 13, 9, 32] to the training of neural networks has drawn much attention. Adilova, et al. [1] demonstrate that noise injection substantially improves model quality for non-linear neural networks. An [2] explored the effects of noise injection into the inputs, outputs, and weights of multilayer feedforward neural networks. Blundell, et al. [4] regularize the weights by minimizing a variational free energy. Neelakantan, et al. [24] find that adding noise to gradients can help avoid overfitting and result in lower training loss. Xie, et al. [37] impose regularization within the loss layer by randomly setting labels to be incorrect. Similarly, standard dropout [30] regularizes a neural network by adding noise to its hidden units. Wan, et al. [33] further proposed DropConnect, a generalization of Dropout that sets a randomly selected subset of weights, rather than hidden units, to zero. On the other hand, adding noise to activation functions can prevent early saturation: Gulcehre, et al. [10] found that noisy activation functions (e.g., sigmoid and tanh) are easier to optimize, and that replacing the non-linearities by their noisy counterparts usually leads to better results. Chen, et al. [5] propose a noisy softmax that mitigates early saturation by injecting annealed noise into the softmax input. All of these methods can effectively prevent overfitting; however, they result in slower convergence.

Optimization algorithms:

Deep neural networks are optimized with gradient-descent-based algorithms. Stochastic gradient descent (SGD) [28] is a widely used approach that performs well in many research fields. However, SGD has been empirically observed to converge slowly, since it scales the gradient uniformly in all directions. To address this, variants of SGD that adaptively rescale or average the gradient direction have achieved some success; examples include RMSprop [31], Adadelta [38] and Adam [15]. Adam in particular is the most popular adaptive optimization algorithm due to its rapid training speed, but many publications [35, 27] indicate that it converges poorly. Recently, Amsgrad [27] and Adabound [22], two variants of Adam, were proposed to solve Adam's convergence issues by bounding the learning rates, and RAdam [19] addresses the same issue by rectifying the variance of the adaptive learning rate. While these adaptive methods often display faster progress in training, they have also been observed to fail to converge well in many cases.

3 Method description

3.1 Online optimization

We start our discussion of learning rate dropout by integrating it into an online optimization problem [40]. We provide a generic framework of optimization methods with learning rate dropout in Algorithm 1 (all multiplications are element-wise). Consider a neural network with weight parameters θ ∈ R^N. In the online setup, at each time step t, the optimization algorithm updates the current parameters θ_{t−1} using the loss function f_t over the data at time t and its gradient g_t = ∇f_t(θ_{t−1}). The optimizer then calculates the update Δθ_t using g_t and possibly other terms, including earlier gradients.

Algorithm 1 encapsulates many popular adaptive and non-adaptive methods through the definitions of the gradient accumulation terms m_t and v̂_t. In particular, the adaptive methods are distinguished by the choice of v̂_t, which is absent in non-adaptive methods. Most methods contain a similar momentum component

m_t = β m_{t−1} + (1 − β) g_t,

where m_0 = 0 and β ∈ [0, 1) is the momentum parameter. The momentum accumulates an exponential moving average of previous gradients to correct the current update.

3.2 Learning rate dropout

During training, the optimizer assigns a learning rate α, typically a constant, to each parameter. With learning rate dropout, in each iteration the learning rate of each parameter is kept with probability p, independently of the other parameters, or set to zero otherwise. At each time step t, a random binary mask matrix M_t is sampled to encode this information, with each element M_{t,i} ~ Bernoulli(p). The learning rate matrix A_t at time step t is then obtained by

A_t = α M_t.
LRD and coordinate descent:

For a parameter, a learning rate of 0 means its value is not updated. For a model, applying learning rate dropout is therefore equivalent to uniformly sampling one of the 2^N possible parameter subsets for which to perform a gradient update in the current iteration (see Figure 3). This is closely connected to coordinate descent, in which each set of a partition of the parameters is fully optimized in a cycle before returning to the first set for the next iteration. Each LRD update therefore still decreases the loss function, by gradient descent theory [29, 36, 25], while better avoiding saddle points. Moreover, LRD does not interrupt the gradient computation of any parameter: if the optimizer has a gradient accumulation term (e.g. momentum), the gradients of every parameter are stored for subsequent updates, regardless of whether its learning rate is dropped. Consequently, our method does not slow training as previous methods do [30, 37, 33].
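The point about gradient accumulation can be checked directly. In this hypothetical two-step trace (β, learning rate, and gradients are made-up values), a coordinate's learning rate is dropped at t = 1, yet its momentum buffer still absorbs that step's gradient, so the kept update at t = 2 uses the full gradient history:

```python
# Two-step trace for one coordinate with momentum
# m_t = beta * m_{t-1} + (1 - beta) * g_t.
beta = 0.9
m, theta = 0.0, 1.0
lr, g1, g2 = 0.1, 1.0, 1.0

# t = 1: learning rate dropped (mask = 0) -> the parameter is frozen,
# but the momentum buffer still accumulates g1.
m = beta * m + (1 - beta) * g1
theta -= 0.0 * m            # masked step: no parameter change

# t = 2: learning rate kept (mask = 1) -> the update uses both g1 and g2.
m = beta * m + (1 - beta) * g2   # m = 0.19, reflecting the full history
theta -= lr * m
```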

Figure 3: Applying learning rate dropout to training. This model contains 3 parameters, so there are 8 loss descent paths to choose from in each iteration. The blue dot is the model state. Each solid line is the update of one parameter. The dashed line is the resulting update for the model. "×" represents a dropped learning rate.

Toy example:

Learning rate dropout can accelerate training and improve generalization. By adding stochasticity to the loss-descent path, it helps the model traverse transient plateaus (e.g. saddle points or shallow local minima) quickly and gives the model more chances to find a better minimum. To verify its effectiveness, we use a toy example that visualizes the loss-descent path during optimization: we minimize a nonconvex function of two parameters with the popular optimizer Adam. In Figure 4, we see that the convergence of Adam is sensitive to the initial point: different initializations lead to different convergence results, and Adam is easily trapped by the minimum nearest the initial point, even if it is a bad local minimum. In contrast, learning rate dropout makes Adam more active: even when the optimizer reaches a local minimum, learning rate dropout still encourages it to search for other possible paths instead of doing nothing. This example illustrates that learning rate dropout can effectively help the optimizer escape suboptimal points and find a better result.
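The paper's exact toy function is not reproduced in this copy, so as a stand-in the sketch below runs plain gradient steps on a simple two-dimensional nonconvex surface, with and without a Bernoulli learning-rate mask. The function `f` and all constants are our own illustrative choices; the point is that both runs decrease the loss, while the masked run follows a randomly sampled descent path through the same landscape.

```python
# Gradient descent on a stand-in nonconvex surface, with and without a
# per-coordinate Bernoulli learning-rate mask (learning rate dropout).
import random

def f(x, y):                     # stand-in nonconvex objective
    return (x * x - 1.0) ** 2 + 0.5 * y * y

def grad_f(x, y):
    return (4.0 * x * (x * x - 1.0), y)

def run(steps=500, lr=0.05, keep_prob=None, seed=0):
    """keep_prob=None disables masking (plain gradient descent)."""
    rng = random.Random(seed)
    x, y = 0.3, 1.5              # initial point
    for _ in range(steps):
        gx, gy = grad_f(x, y)
        # With LRD, each coordinate's learning rate is independently kept
        # with probability keep_prob, otherwise zeroed for this step.
        ax = lr if keep_prob is None or rng.random() < keep_prob else 0.0
        ay = lr if keep_prob is None or rng.random() < keep_prob else 0.0
        x, y = x - ax * gx, y - ay * gy
    return f(x, y)

plain = run()                    # plain descent path
masked = run(keep_prob=0.5)      # randomly sampled descent path
```

On this smooth surface both runs reach a low loss; whether the extra randomness actually escapes a bad basin depends on the landscape, which is the behavior Figure 4 visualizes for the paper's function.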

(a) Adam
(b) Adam with learning rate dropout
Figure 4: Visualization of the loss descent paths. Learning rate dropout helps Adam escape from the local minimum; the marked point is the optimal solution.
Figure 5: The learning curves for 2-layers FCNet on MNIST. Top: Training loss. Middle: Training accuracy. Bottom: Test accuracy.
Figure 6: The learning curves for ResNet-34 on CIFAR-10. Top: Training loss. Middle: Training accuracy. Bottom: Test accuracy.

4 Experiments

To empirically evaluate the proposed method, we perform several popular machine learning tasks, including image classification, image segmentation and object detection. Using different models and optimization algorithms, we demonstrate that learning rate dropout is a general technique for improving neural network training, not specific to any particular application domain.

Figure 7: The learning curves for DenseNet-121 on CIFAR-100. Top: Training loss. Middle: Training accuracy. Bottom: Test accuracy.
Dataset    Model         SGDM           RMSprop        Adam           AMSGrad        RAdam
MNIST      FCNet         97.82 / 98.21  98.06 / 98.40  97.88 / 98.15  97.88 / 98.51  97.97 / 98.21
CIFAR-10   ResNet-34     95.30 / 95.54  92.71 / 93.68  93.05 / 93.77  93.31 / 93.73  94.24 / 94.61
CIFAR-100  DenseNet-121  79.09 / 79.42  70.21 / 74.02  72.55 / 74.34  73.91 / 75.23  71.81 / 73.04
Table 1: Test accuracy (%) on image classification, without / with learning rate dropout.

4.1 Image classification

We first apply learning rate dropout to the multiclass classification problem of the MNIST, CIFAR-10 and CIFAR-100 datasets.


The MNIST digits dataset [17] contains 60,000 training and 10,000 test images of size 28×28. The task is to classify the images into 10 digit classes. Following previous publications [15, 3], we train a simple fully connected neural network (FCNet) with two hidden layers on MNIST, each hidden layer consisting of 1,000 rectified linear units (ReLU). Training uses mini-batches of 128 images for 100 epochs through the training set; no learning rate decay scheme is used.


The CIFAR-10 and CIFAR-100 datasets consist of 60,000 RGB images of size 32×32, drawn from 10 and 100 categories, respectively; 50,000 images are used for training and the rest for testing. In both datasets, the training and testing images are uniformly distributed over the categories. To show the broad applicability of the proposed method, we use ResNet-34 [12] for CIFAR-10 and DenseNet-121 [14] for CIFAR-100. Both models are trained for 200 epochs with a mini-batch size of 128, the learning rate is reduced by a factor of 10 at the 100th and 150th epochs, and weight decay is applied.

For a reliable evaluation, the classifiers are trained multiple times using several optimization algorithms: SGD-momentum (SGDM) [26], RMSprop [31], Adam [15], AMSGrad [27] and RAdam [19]. We apply learning rate dropout to each optimization algorithm to show how the technique helps training. Unless otherwise specified, the dropout rate p (the probability that an individual parameter is updated in a given iteration) is set to a fixed value for all experiments. For each optimization algorithm, the initial learning rate has the greatest impact on the final solution, so we tune only the initial learning rate and retain the default settings for the other hyper-parameters. Specifically, the initial learning rates for SGDM, RMSprop, Adam, AMSGrad, and RAdam are 0.1, 0.001, 0.001, 0.001, and 0.03, respectively.

We first show the learning curves for MNIST in Figure 5. All methods show fast convergence and good generalization, yet the optimizers that use learning rate dropout still outperform their prototypes in both training speed and test accuracy. We further report results for the CIFAR datasets in Figure 6 and Figure 7, and summarize the test accuracy of the different methods in Table 1. The same models with and without learning rate dropout converge very differently: without it, these optimization algorithms either train slowly or converge to poor results, whereas learning rate dropout benefits all of them in both training speed and generalization. In particular, applying learning rate dropout to the non-adaptive method SGDM yields convergence speed comparable to the adaptive methods. On CIFAR-100, learning rate dropout even helps RMSprop obtain an accuracy improvement of close to 4%. Furthermore, learning rate dropout incurs negligible computational cost and requires no parameter tuning aside from the dropout rate p.

4.2 Image segmentation

We also consider the semantic segmentation task, which assigns each pixel in an image a semantic label [7, 21]. We evaluate our method on the PASCAL VOC2012 semantic segmentation dataset [8], which consists of 20 object categories and one background category. Following the conventional setting in [6, 18], the dataset is augmented with the extra annotated VOC images provided in [11], resulting in 10,582, 1,449 and 1,456 images for training, validation and testing, respectively. We use the state-of-the-art segmentation model PSPNet [39] with the Adam solver; the initial learning rate and other hyperparameters such as batch size and weight decay follow the setting in [39]. Segmentation performance is measured by the mean of class-wise intersection over union (Mean IoU) and pixel-wise accuracy (Pixel Accuracy). Results are reported in Figure 8. With learning rate dropout, the model yields 0.688/0.921 in terms of Mean IoU and Pixel Accuracy, exceeding vanilla Adam's 0.637/0.905. Figure 8 also shows once again that learning rate dropout leads to faster convergence.

Figure 8: Results for PSPNet on VOC2012 semantic segmentation dataset. Left: mean IOU. Right: Pixel Accuracy.
Figure 9: Results for object detection. Left: Training loss. Right: Validation loss.

4.3 Object detection

Object detection is a challenging and important problem in the computer vision community. We next apply learning rate dropout to an object detection task. We use the VOC2012 and VOC2007 datasets for training, and VOC2007 as our validation and testing data. We train a one-stage detection network, SSD [20], using Adam with a learning rate of 0.001; other hyperparameter settings are the same as in [20]. To show the effect of learning rate dropout on training, we use the training loss and validation loss as evaluation metrics. We run the model for 100 epochs and show the results in Figure 9. With learning rate dropout, the training loss and validation loss drop faster and converge to better values, meaning the detector achieves higher detection accuracy.

(a) Training accuracy
(b) Test accuracy
Figure 10: Results obtained using different dropout rates p. Top: Adam. Bottom: SGDM.
p           Adam   SGDM
1 (No LRD)  93.05  95.30
            93.32  94.32
            93.93  95.39
            93.77  95.54
            93.50  95.42
            93.35  95.26
Table 2: Test accuracy (%) on CIFAR-10 using different dropout rates p.

4.4 Effect of dropout rate

As mentioned, the dropout rate p, which controls the expected size of the coordinate update set, is the only hyperparameter that needs to be tuned. In this section we explore its effect. We conduct experiments with ResNet-34 on the CIFAR-10 dataset, choosing p from a broad range, and train the model with Adam and SGDM; all other hyperparameters are the same as in the previous experiments. The results are shown in Figure 10: for every tested p, learning rate dropout speeds up training, and the smaller the p, the faster the convergence. In addition, we report the test accuracy in Table 2; almost all values of p lead to an improvement in test accuracy. In practice, to balance training speed and generalization, we recommend setting p in a moderate range rather than at either extreme.

(a) Training accuracy
(b) Test accuracy
Figure 11: Results on CIFAR-10 using different regularization strategies. Top: Adam. Bottom: SGDM.

4.5 Comparison with other regularizations

We further compare learning rate dropout with three popular regularization methods: standard dropout [30], noisy labels [37] and noisy gradients [24]. Standard dropout regularizes the network on its hidden units, each of which is retained with a fixed probability. The noisy-label method perturbs the label of each training sample with a fixed probability; for each changed sample, the label is drawn uniformly at random from the labels other than the correct one. The noisy-gradient method adds Gaussian noise to the gradient at every time step, with the variance of the noise gradually decaying to 0 following the parameter setting in [24]. We apply these regularization techniques to ResNet-34 trained on CIFAR-10, training the model multiple times with SGDM and Adam. For clarity, we use "SD", "NL" and "NG" to denote training with standard dropout, noisy labels and noisy gradients, respectively.

We report the learning curves in Figure 11. As can be seen, learning rate dropout speeds up training while the other regularization methods hinder convergence. This is because the other methods inject random noise into training, which may lead to a wrong loss-descent path; in contrast, learning rate dropout always samples a correct descent path, so it does not impair training. We also show the test accuracy in Table 3. The results show that standard dropout and learning rate dropout effectively improve generalization, while the effects of noisy labels and noisy gradients are disappointing. On the other hand, we find that learning rate dropout is complementary to the other regularization methods: denoting their simultaneous use by "LRD and SD" (etc.), Figure 11 and Table 3 show that learning rate dropout can work with other regularization methods to further improve the model. Learning rate dropout is thus not limited to being a substitute for standard dropout or other methods.

Regularization               Adam   SGDM
No regularization            93.05  95.30
Standard dropout (SD)        94.05  95.33
Noisy gradient (NG)          93.70  95.00
Noisy label (NL)             92.59  94.61
Learning rate dropout (LRD)  93.77  95.54
LRD and SD                   94.73  95.58
LRD and NG                   93.92  95.26
Table 3: Test accuracy (%) on CIFAR-10 using different regularization strategies.
(a) Training accuracy
(b) Test accuracy
Figure 12: LRD vs. DG on CIFAR-10 (ResNet-34 is used). Top: Adam. Bottom: SGDM.

4.6 Dropout on gradients

Learning rate dropout is equivalent to applying dropout to the parameter updates at each timestep t. The idea is to inject uncertainty into the loss-descent path while ensuring that the loss descends in a correct direction at each iteration. One may wonder whether there are other ways to realize this idea; for example, applying dropout to the gradient is also a possible solution, since it too perturbs the loss-descent path without producing a wrong descent direction. Here we compare learning rate dropout with dropout on gradients (DG). DG follows the setting of LRD: the gradient of each parameter is retained with probability p, or set to 0 with probability 1 − p. We show the learning curves in Figure 12. As can be seen, DG has no positive effect on training. The causes may be multiple. First, the gradient accumulation terms (e.g., momentum) can greatly suppress the impact of dropping gradients on the loss-descent path. In addition, dropping a gradient discards gradient information, which may slow training. In contrast, our LRD only temporarily stops updating some parameters, and all gradient information is stored by the gradient accumulation terms; in learning rate dropout training there is no loss of gradient information.
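The two masking choices can be contrasted in a short hypothetical trace (β and the gradient values are made up): DG zeroes the gradient before it reaches the momentum accumulator, while LRD zeroes only the step size after accumulation.

```python
# DG vs. LRD for one coordinate over three steps with momentum
# m_t = beta * m_{t-1} + (1 - beta) * g_t; the coordinate is masked at t = 1.
beta = 0.9
grads = [1.0, 1.0, 1.0]     # the same gradient arrives at t = 1, 2, 3
mask = [0, 1, 1]            # dropped at t = 1, kept afterwards

m_dg = m_lrd = 0.0
for g, kept in zip(grads, mask):
    m_dg = beta * m_dg + (1 - beta) * (g if kept else 0.0)  # DG: g discarded
    m_lrd = beta * m_lrd + (1 - beta) * g                   # LRD: g retained

# m_lrd has absorbed all three gradients; m_dg is permanently missing
# the contribution of g_1.
```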

5 Dropout rate of standard dropout

We have shown that LRD speeds up training and improves generalization for almost any dropout rate p. Standard dropout (SD) also has a tunable dropout rate. In this section, we explore its effect by applying SD to ResNet-34 trained on CIFAR-10, with the retention probability chosen from {1, 0.9, 0.8, 0.7, 0.6}. We show the results in Figure 13 and Table 4. As can be seen, the performance of standard dropout is sensitive to the choice of dropout rate: the smaller the retention probability, the worse the convergence. This indicates that the dropout rate should be chosen carefully when applying SD to convolutional neural networks.

(a) Training accuracy
(b) Test accuracy
Figure 13: Results obtained using different standard-dropout rates.
Retention probability  1 (No SD)  0.9    0.8    0.7    0.6
Adam                   93.05      94.05  94.08  93.34  92.89
Table 4: Test accuracy (%) on CIFAR-10 using different standard-dropout rates.

6 Conclusion

We presented learning rate dropout, a new technique for regularizing neural network training with gradient descent. This technique encourages the optimizer to actively explore the parameter space by randomly dropping some learning rates. The uncertainty of the learning rate helps models quickly escape poor local optima and gives them more opportunities to search for better results. Experiments show the substantial ability of learning rate dropout to accelerate training and enhance generalization. In addition, the technique is effective in a wide variety of application domains, including image classification, image segmentation and object detection. This shows that learning rate dropout is a general technique with great potential in practical applications.


  • [1] L. Adilova, N. Paul, and P. Schlicht (2018) Introducing noise in decentralized training of neural networks. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 37–48. Cited by: §2.
  • [2] G. An (1996) The effects of adding noise during backpropagation training on a generalization performance. Neural computation 8 (3), pp. 643–674. Cited by: §2.
  • [3] W. An, H. Wang, Q. Sun, J. Xu, Q. Dai, and L. Zhang (2018) A pid controller approach for stochastic optimization of deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8522–8531. Cited by: §4.1.
  • [4] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra (2015) Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424. Cited by: §2.
  • [5] B. Chen, W. Deng, and J. Du (2017) Noisy softmax: improving the generalization ability of dcnn via postponing the early softmax saturation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5372–5381. Cited by: §2.
  • [6] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2014) Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv preprint arXiv:1412.7062. Cited by: §4.2.
  • [7] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), pp. 801–818. Cited by: §4.2.
  • [8] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. International journal of computer vision 88 (2), pp. 303–338. Cited by: §4.2.
  • [9] A. Graves (2011) Practical variational inference for neural networks. In Advances in neural information processing systems, pp. 2348–2356. Cited by: §2.
  • [10] C. Gulcehre, M. Moczulski, M. Denil, and Y. Bengio (2016) Noisy activation functions. In International conference on machine learning, pp. 3059–3068. Cited by: §1, §2.
  • [11] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik (2011) Semantic contours from inverse detectors. In 2011 International Conference on Computer Vision, pp. 991–998. Cited by: §4.2.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4.1.
  • [13] G. E. Hinton and S. T. Roweis (2003) Stochastic neighbor embedding. In Advances in neural information processing systems, pp. 857–864. Cited by: §2.
  • [14] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §4.1.
  • [15] D. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. Computer Science. Cited by: §1, §2, §4.1, §4.1.
  • [16] A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §1.
  • [17] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §4.1.
  • [18] G. Lin, C. Shen, A. Van Den Hengel, and I. Reid (2016) Efficient piecewise training of deep structured models for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3194–3203. Cited by: §4.2.
  • [19] L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han (2019) On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265. Cited by: §1, §2, §4.1.
  • [20] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) SSD: single shot multibox detector. In European conference on computer vision, pp. 21–37. Cited by: §4.3.
  • [21] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440. Cited by: §4.2.
  • [22] L. Luo, Y. Xiong, Y. Liu, and X. Sun (2019) Adaptive gradient methods with dynamic bound of learning rate. In ICLR, Cited by: §2.
  • [23] I. Necoara (2013) Random coordinate descent algorithms for multi-agent convex optimization over networks. IEEE Transactions on Automatic Control 58 (8), pp. 2001–2012. Cited by: §1.
  • [24] A. Neelakantan, L. Vilnis, Q. V. Le, I. Sutskever, L. Kaiser, K. Kurach, and J. Martens (2015) Adding gradient noise improves learning for very deep networks. arXiv preprint arXiv:1511.06807. Cited by: §1, §2, §4.5.
  • [25] Y. Nesterov (2012) Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization 22 (2), pp. 341–362. Cited by: §1, §3.2.
  • [26] N. Qian (1999) On the momentum term in gradient descent learning algorithms. Neural networks 12 (1), pp. 145–151. Cited by: §4.1.
  • [27] S. J. Reddi, S. Kale, and S. Kumar (2018) On the convergence of adam and beyond. In ICLR, Cited by: §1, §2, §4.1.
  • [28] H. Robbins and S. Monro (1951) A stochastic approximation method. The annals of mathematical statistics, pp. 400–407. Cited by: §2.
  • [29] D. E. Rumelhart, G. E. Hinton, and R. J. Williams (1985) Learning internal representations by error propagation. Technical report California Univ San Diego La Jolla Inst for Cognitive Science. Cited by: §3.2.
  • [30] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §1, §1, §2, §3.2, §4.5.
  • [31] T. Tieleman and G. Hinton (2012) Lecture 6.5 - RMSProp: divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning 4 (2), pp. 26–31. Cited by: §1, §2, §4.1.
  • [32] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P. Manzagol (2010) Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research 11 (Dec), pp. 3371–3408. Cited by: §2.
  • [33] L. Wan, M. Zeiler, S. Zhang, Y. LeCun, and R. Fergus (2013) Regularization of neural networks using dropconnect. In International conference on machine learning, pp. 1058–1066. Cited by: §2, §3.2.
  • [34] S. Wang and C. Manning (2013) Fast dropout training. In International conference on machine learning, pp. 118–126. Cited by: §1.
  • [35] A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht (2017) The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, pp. 4148–4158. Cited by: §2.
  • [36] S. J. Wright (2015) Coordinate descent algorithms. Mathematical Programming 151 (1), pp. 3–34. Cited by: §1, §3.2.
  • [37] L. Xie, J. Wang, Z. Wei, M. Wang, and Q. Tian (2016) Disturblabel: regularizing cnn on the loss layer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4753–4762. Cited by: §1, §1, §2, §3.2, §4.5.
  • [38] M. D. Zeiler (2012) ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701. Cited by: §2.
  • [39] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2881–2890. Cited by: §4.2.
  • [40] M. Zinkevich (2003) Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 928–936. Cited by: §3.1.