Gradient-Coherent Strong Regularization for Deep Neural Networks

11/20/2018 ∙ by Dae Hoon Park, et al. ∙ Huawei Technologies Co., Ltd.

Deep neural networks are often prone to over-fitting with their numerous parameters, so regularization plays an important role in generalization. L1 and L2 regularizers are common regularization tools in machine learning because of their simplicity and effectiveness. However, we observe that imposing strong L1 or L2 regularization on deep neural networks trained with stochastic gradient descent easily fails, which limits the generalization ability of the underlying neural networks. To understand this phenomenon, we first investigate how and why learning fails when strong regularization is imposed on deep neural networks. We then propose a novel method, gradient-coherent strong regularization, which imposes regularization only when the gradients remain coherent in the presence of strong regularization. Experiments are performed with multiple deep architectures on three benchmark data sets for image recognition. Experimental results show that our proposed approach indeed endures strong regularization and significantly improves both accuracy and compression, which could not be achieved otherwise.


1 Introduction

Regularization has been very common in machine learning to prevent over-fitting and to obtain sparse solutions. Deep neural networks (DNNs), which have shown huge success in many areas such as computer vision [11, 16, 7] and speech recognition [8], often contain a large number of trainable parameters in multiple layers with non-linear activation functions in order to gain enough expressive power. DNNs with so many parameters are often prone to over-fitting, so the need for regularization has been emphasized. While regularization techniques such as dropout [17] and pruning [6] have been proposed to address the problem, the traditional regularization techniques based on L1 or L2 norms combine well with them and further improve performance significantly. For example, our empirical results on image recognition show that strong L2 regularization applied to DNNs that already have dropout layers can reduce the error rate by up to 24% on a benchmark data set. However, we observe that DNNs easily fail to learn when strong L1 or L2 regularization is imposed with stochastic gradient descent.

Strong regularization on DNNs is often desired for two main reasons. First, strong regularization can result in better generalization, especially when a model is greatly over-fitted, which is often the case for DNNs. Strong regularization yields a simple solution, which is less prone to over-fitting and, by the principle of Occam's razor, preferred over complex ones. Second, strong regularization by sparse regularizers such as the L1 regularizer compresses a solution into a sparse one while keeping or even improving the generalization performance. As DNNs typically consist of numerous parameters, such sparse solutions stored in sparse matrices can reduce storage overhead or reside in a memory with less energy consumption. For example, a 9x-compressed competitive DNN solution in sparse matrices achieved a storage overhead of only about 16% of the non-compressed one, allowing it to reside in on-chip SRAM instead of off-chip DRAM, which consumes more than 100x the energy [6].

Unfortunately, imposing strong L1 or L2 regularization on DNNs is difficult for stochastic gradient descent. Indeed, difficulties related to non-convexity and the use of stochastic gradient descent are often overlooked in the literature. We observe and analyze how and why learning fails in DNNs with strong L1 or L2 regularization. We hypothesize that the dominance of the gradients from the regularization term, caused by strong regularization, iteratively diminishes the magnitudes of both weights and gradients. To prevent this failure in learning, we propose a novel approach, "gradient-coherent strong regularization," which imposes strong regularization only when the gradients from regularization do not obstruct learning. That is, if the sum of the regularization gradients and the loss gradients is not coherent with the loss gradients, our approach does not impose regularization. Experiments were performed with standard DNNs and data sets, and the results indicate that our proposed approach indeed achieves strong regularization, resulting in both better generalization and more compression. Our main contributions in this work are summarized as follows:

  • We provide a novel analysis of how and why learning fails with strong L1/L2 regularization in DNNs. To the best of our knowledge, there is no existing work that theoretically or empirically analyzes this phenomenon.

  • We propose a novel approach, gradient-coherent strong regularization, that selectively imposes strong L1/L2 regularization to avoid failure in learning.

  • We perform experiments with multiple deep architectures on three benchmark data sets for image recognition. Our proposed approach does not fail under strong regularization and significantly improves accuracy. It also compresses DNNs by up to 9.9x without losing accuracy.

2 Problem Analysis

2.1 DNNs and Strong L1/L2 Regularization

Let us denote a generic DNN by $\hat{y} = f(x; w)$, where $x$ is an input vector, $w$ is a flattened vector of all parameters in the network $f$, and $\hat{y}$ is an output vector obtained by feed-forwarding through the multiple layers of $f$. The network is trained by finding an optimal set of parameters $w^*$ with the following objective function:

$w^* = \arg\min_{w} \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\big(f(x_i; w), y_i\big) + \lambda R(w)$   (1)

where $\{(x_i, y_i)\}_{i=1}^{N}$ is the training data, and $\mathcal{L}$ is the loss function, which is usually the cross-entropy loss for classification tasks. Here, the regularization term $R(w)$ is added to penalize the complexity of the solution, and $\lambda$, which we refer to as the regularization strength, is set to zero for non-regularized models. A higher value of $\lambda$ thus means stronger regularization.

With a gradient descent method, each model parameter $w$ at time $t$, denoted $w_t$, is updated with the following formula:

$w_{t+1} = w_t - \eta \left( \frac{\partial \mathcal{L}}{\partial w_t} + \lambda \frac{\partial R(w_t)}{\partial w_t} \right)$   (2)

where $\eta$ is a learning rate. We refer to $\frac{\partial \mathcal{L}}{\partial w}$ and $\lambda \frac{\partial R(w)}{\partial w}$ as $g_{\mathcal{L}}$ and $g_R$ throughout the paper. The regularization function $R(w) = \frac{1}{2}\lVert w \rVert_2^2$ is the L2 regularizer, which is the most commonly used regularizer and is also called weight decay in the deep learning literature. As shown, the L2 regularizer reduces the magnitude of a parameter proportionally to that magnitude. It thus imposes a greater penalty on parameters with greater magnitudes and a smaller penalty on parameters with smaller magnitudes, yielding a simple and effective solution that is less prone to over-fitting. On the other hand, the L1 regularizer (also known as Lasso [18]), $R(w) = \lVert w \rVert_1$, is often employed to induce sparsity in the solution (i.e., to make a portion of $w$ exactly zero) by imposing the same magnitude of penalty ($\lambda$) on all parameters. For both L1 and L2 regularizers, strong regularization thus means a great penalty on the magnitudes of the parameters.
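
To make the update in equation (2) concrete, the sketch below applies one SGD step with an explicit L1 or L2 regularization gradient in PyTorch. It is a minimal illustration under our own naming (model, loss, lr, lam, reg), not the authors' implementation.

    import torch

    def sgd_step_with_regularization(model, loss, lr=0.05, lam=1e-4, reg="l2"):
        # One step of equation (2): w <- w - lr * (g_L + g_R), where g_L is the
        # data-loss gradient and g_R is the gradient of lam * R(w).
        loss.backward()                        # fills p.grad with g_L for every parameter
        with torch.no_grad():
            for p in model.parameters():
                if p.grad is None:
                    continue
                if reg == "l2":
                    g_reg = lam * p            # gradient of (lam / 2) * ||w||_2^2
                else:
                    g_reg = lam * p.sign()     # (sub)gradient of lam * ||w||_1
                p -= lr * (p.grad + g_reg)     # equation (2)
                p.grad.zero_()

Here loss is the data loss only; the regularization penalty never needs to be added to the loss value itself, since only its gradient enters the update.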

2.2 Imposing Strong Regularization Makes Learning Fail.

Figure 1: Validation accuracies on CIFAR-100 with (a) L1 regularization and (b) L2 regularization. Note the sharp accuracy drop.

Figure 2: Statistics for different $\lambda$ by VGG-16 on CIFAR-100: (a) training loss (L2 reg.), (b) average $|g_{\mathcal{L}}|$ (L2 reg.), (c) average $|g_{\mathcal{L}}|$ (close-up), (d) average $|w|$ (L2 reg.). Best shown in color.

Strong regularization is especially useful for deep learning because DNNs often contain a large number of parameters while the training data are relatively limited in practice. However, we observe a phenomenon in which learning suddenly fails when strong regularization is imposed with stochastic gradient descent, the most commonly used solver for deep learning. An example of the phenomenon is depicted in Figure 1. The architectures VGG-16 [16] and AlexNet [11] were employed for the dataset CIFAR-100 [10] (details of the experiment setting are described in the experiments section). As shown in this example, the accuracy increases as we enforce stronger regularization with greater $\lambda$. However, it suddenly drops to 1.0% once slightly stronger regularization is enforced, which means that the model fails to learn. This observation raises three questions: (i) How and why does learning fail with strong regularization in deep neural networks? (ii) How can we avoid the failure? (iii) The performance improves as the regularization strength increases, until learning fails; will it improve even further if learning does not fail? We study these questions throughout this paper.

Learning fails when going beyond a tolerance level of regularization strength.

In order to understand this phenomenon in depth, we show training loss and gradients in Figure 2. The training loss excluding the regularization penalty in Figure 2(a) shows that the model learns faster with stronger regularization up to a certain value of $\lambda$, but the training loss does not decrease at all when even stronger regularization is imposed. This means there exists a tolerance level of regularization strength that decides the success or failure of the entire learning process. The gradients $g_{\mathcal{L}}$ on the parameters are shown in Figure 2(b) to see why such loss does not result in learning. Compared to less strong regularization, the average $|g_{\mathcal{L}}|$ under stronger regularization is much smaller and seems stagnant. A close-up view with the gradients in logarithmic scale for the first 20 epochs is depicted in Figure 2(c). Within a couple of epochs, the models with less strong L1 and L2 regularization start to obtain gradients that are two orders of magnitude greater than their initial gradients. On the other hand, the models with stronger regularization fail to obtain such large gradients, and the magnitudes of their gradients instead decay exponentially. The failure in learning can be explained by the following mechanism:

  • If the regularization is too strong (large $\lambda$), $g_R$ also becomes large.

    • If $g_R$ dominates $g_{\mathcal{L}}$ in magnitude, we obtain smaller $|w|$ through the weight update in equation (2). For small $|w|$, $g_{\mathcal{L}}$ is far suppressed, while $g_R$ is suppressed only linearly with $|w|$ (L2) or stays constant (L1). The iterative weight updates can therefore make $g_R$ dominate over $g_{\mathcal{L}}$ even more, yielding failure in learning.

    • Otherwise, $|w|$ may not decrease.

We explain why a small $|w|$ caused by strong regularization implies a far suppressed $g_{\mathcal{L}}$ in the next paragraph.

Why does $g_{\mathcal{L}}$ decrease so fast with strong regularization?

It is not difficult to see why $g_{\mathcal{L}}$ decreases so fast when the regularization is strong. In deep neural networks, the gradients are dictated by back-propagation. It is well known that the gradients at the $l$-th layer are given by

$\frac{\partial \mathcal{L}}{\partial w^{(l)}} = \delta^{(l)} \big( a^{(l-1)} \big)^{\top}$   (3)

where $a^{(l-1)}$ is the output of the neurons at the $(l-1)$-th layer and $\delta^{(l)}$ is the $l$-th-layer residual, which follows the recursive relation

$\delta^{(l)} = \Big( \big( w^{(l+1)} \big)^{\top} \delta^{(l+1)} \Big) \odot \sigma'\big( z^{(l)} \big)$   (4)

where $\odot$ and $\sigma'$ denote element-wise multiplication and the derivative of the activation function, respectively. Using the recursive relation (with $L$ denoting the last layer), we obtain

$\delta^{(l)} = \Bigg[ \prod_{k=l}^{L-1} \operatorname{diag}\big( \sigma'(z^{(k)}) \big) \big( w^{(k+1)} \big)^{\top} \Bigg] \delta^{(L)}$   (5)

If the regularization is too strong, the weights are significantly suppressed by the penalty in (2), which is also observed in Figure 2(d). From (5), since the gradients are proportional to the product of the weights at the later layers (whose magnitudes are typically much less than 1, particularly in the beginning of training [4]), they are suppressed even more.

In fact, the suppression is more severe than what we have deduced above. The activation-dependent factors in (3) and (5) can lead to further suppression of the gradients when the weights are very small, for the following reasons. First of all, we use ReLU as the activation function, which can be written as

$\sigma(z) = z \odot \Theta(z)$   (6)

where $\Theta$ is the Heaviside step function. Using this, we can write (ignoring bias terms)

$a^{(l)} = \sigma\big( z^{(l)} \big) = \big( w^{(l)} a^{(l-1)} \big) \odot \Theta\big( z^{(l)} \big)$   (7)

Applying (7) recursively, we can see that $a^{(l)}$ is proportional to the product of the weights at the previous layers. Again, when the weights are suppressed by strong regularization, $a^{(l)}$ is suppressed correspondingly. Putting everything together, we can conclude that in the presence of strong regularization, the gradients are far more suppressed than the weights.

Strictly speaking, the derivations above are valid only for fully-connected layers. For convolutional layers, the derivations are more complicated but similar. Our conclusions above would still be valid.
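
A toy numerical check of this argument, not taken from the paper: the script below builds a small ReLU MLP (depth and width are arbitrary choices of ours), scales all parameters down as strong regularization would, and measures the average first-layer gradient magnitude, which should shrink much faster than the weights themselves.

    import torch
    import torch.nn as nn

    def mean_first_layer_grad(scale, depth=6, width=128, seed=0):
        # Average |dL/dw| at the first layer of a ReLU MLP whose parameters
        # are all multiplied by `scale`, mimicking weights shrunk by strong
        # regularization (cf. equations (3)-(5)).
        torch.manual_seed(seed)
        layers = []
        for _ in range(depth):
            layers += [nn.Linear(width, width), nn.ReLU()]
        net = nn.Sequential(*layers, nn.Linear(width, 10))
        with torch.no_grad():
            for p in net.parameters():
                p.mul_(scale)
        x = torch.randn(64, width)
        y = torch.randint(0, 10, (64,))
        nn.CrossEntropyLoss()(net(x), y).backward()
        return net[0].weight.grad.abs().mean().item()

    for s in (1.0, 0.5, 0.25):
        print(f"weight scale {s}: mean first-layer |g_L| = {mean_first_layer_grad(s):.2e}")
    # Halving the weights shrinks the first-layer gradients by far more than half,
    # consistent with the gradients being proportional to a product of later-layer weights.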

Discussion

Please note that this phenomenon is different from vanishing gradients caused by weight initialization or saturating activation functions such as sigmoid and hyperbolic tangent, which typically make learning slow and are significantly relieved by ReLU, a non-saturating activation function. In contrast, strong regularization does not even let learning start, and the symptom worsens as training proceeds, which makes learning completely fail. In addition, ReLU is adopted for both baselines and our approaches in our experiments.

In order to claim that the sudden failure also follows from vanishing gradients in deep networks, we would at least need to know that the gradients "suddenly" vanish as the regularization strength increases. To the best of our knowledge, there is no such analysis, be it theoretical or empirical. In addition, using equation (6) and applying equation (7) recursively, we show that vanishing weights lead to another level of suppression on the gradients, which is novel.

2.3 Gradient-Coherent Strong Regularization

In order to prevent the failure in learning, we propose gradient-coherent strong regularization, which selectively imposes strong regularization only when the gradients from regularization ($g_R$) do not obstruct learning too much. By comparing $g_{\mathcal{L}} + g_R$ with $g_{\mathcal{L}}$, we can find out how the gradients from regularization affect the overall learning, according to equation (2). That is, we measure the quality of the regularization gradients $g_R$ with respect to $g_{\mathcal{L}}$, where we assume that there will be no learning if their quality is sufficiently low. Thus, we define the regularization strength at step $t$ as

$\lambda_t = \begin{cases} \lambda & \text{if } q_t \geq \tau \\ 0 & \text{otherwise} \end{cases}$   (8)

where $q_t$ is the quality of $g_R$ with respect to $g_{\mathcal{L}}$ at step $t$, and $\tau$ is a hyper-parameter. Only when the quality is high enough is the regularization imposed.

Meanwhile, it is rather difficult to measure the actual interference caused by regularization for each weight. Even when there is a great magnitude change in a gradient after adding the gradient from strong regularization, if the resulting gradient keeps the same sign, the change may still be useful for learning and may even accelerate learning. On the other hand, it is obvious that a resulting gradient with the opposite sign is harmful. Hence, we propose a gradient sign coherence rate, which approximates the coherence between $g_{\mathcal{L}}$ and $g_{\mathcal{L}} + g_R$, to measure the quality of $g_R$. It is defined as

$q_t = \frac{1}{M} \sum_{j=1}^{M} \mathbb{1}\Big[ \operatorname{sign}\big( g_{\mathcal{L}}^{(j)} \big) = \operatorname{sign}\big( g_{\mathcal{L}}^{(j)} + g_R^{(j)} \big) \Big]$   (9)

where $M$ is the number of parameters. The rate $q_t$ lies in $[0, 1]$, where 1 means complete coherence; the sign coherence rate between two random vectors is thus expected to be 0.5. (In practice, when we compute $q_t$, the parameters with $g_{\mathcal{L}} = 0$ are excluded because we want to measure coherence when there is learning, i.e., $g_{\mathcal{L}} \neq 0$.) What the proposed approach does is measure how much the enforcement of regularization changes the direction of $g_{\mathcal{L}}$, and it enforces regularization only if the direction is not changed too much. The computation of $q_t$ requires only a couple of vector operations, which can be done efficiently on GPUs; see the sketch below.
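
A minimal sketch of equations (8) and (9) in PyTorch, under our reading of them; the helper names and the default threshold value are illustrative assumptions, not the authors' code.

    import torch

    def sign_coherence_rate(g_loss, g_reg):
        # Equation (9): fraction of parameters whose gradient keeps its sign
        # after adding the regularization gradient. Parameters with a zero
        # loss gradient are excluded, as described above.
        mask = g_loss != 0
        same = torch.sign(g_loss[mask]) == torch.sign(g_loss[mask] + g_reg[mask])
        return same.float().mean().item()

    def coherent_reg_strength(g_loss, g_reg, lam, tau=0.6):
        # Equation (8): impose the full strength `lam` only when the coherence
        # rate q_t reaches the threshold `tau`; otherwise skip regularization.
        return lam if sign_coherence_rate(g_loss, g_reg) >= tau else 0.0

Here g_loss is the flattened data-loss gradient (for instance torch.cat([p.grad.view(-1) for p in model.parameters()]) after calling loss.backward()), and g_reg is the corresponding flattened regularization gradient, e.g., lam * w for the L2 regularizer.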

Figure 3: Gradient sign coherence rate (left) and its close-up view (right), by VGG-16 on CIFAR-100.

An example of the gradient sign coherence rate under strong regularization is depicted in Figure 3, where $g_R$ is used only to compute the rate and is not added for learning. Indeed, $q_t$ is about 0.5 in the first couple of epochs, which means that the gradients would be greatly affected by strong regularization, and it quickly increases over the next 20 epochs. Then, as the model converges, it decreases a bit, which is reasonable.

Proximal gradient algorithm for L1 regularizer

Meanwhile, since the L1 norm is not differentiable at zero, we employ the proximal gradient algorithm [13], which enables us to obtain proper sparsity (i.e., guaranteed convergence) for non-smooth regularizers. We use the following update formulae:

$w_{t+\frac{1}{2}} = w_t - \eta\, g_{\mathcal{L}}, \qquad w_{t+1} = \operatorname{sign}\big( w_{t+\frac{1}{2}} \big) \max\big( \big| w_{t+\frac{1}{2}} \big| - \eta \lambda_t,\; 0 \big)$   (10)
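
The soft-thresholding step in (10) can be written compactly in PyTorch; the sketch below is our illustration (the function name and in-place update style are assumptions), applied after the gradients of the data loss have been computed.

    import torch

    def proximal_l1_step(param, lr, lam_t):
        # Equation (10): a plain gradient step on the data loss, followed by
        # soft-thresholding with threshold lr * lam_t, which sets small
        # weights exactly to zero and yields true sparsity.
        with torch.no_grad():
            w = param - lr * param.grad
            param.copy_(torch.sign(w) * torch.clamp(w.abs() - lr * lam_t, min=0.0))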

Discussion

Normalization techniques such as batch normalization [9] and weight normalization [14] could be possible approaches to prevent $g_{\mathcal{L}}$ from diminishing quickly. However, it has been shown that L2 regularization has no regularizing effect when combined with normalization and only influences the effective learning rate [19]. In other words, the normalization techniques do not actually simplify the solution, since the decrease in parameter magnitude is canceled by the normalization. This does not meet our goal, which is to heavily simplify solutions in order to reduce over-fitting and compress networks.

3 Experiments

Dataset     Classes   Training Images per Class   Test Images per Class
CIFAR-10    10        5000                        1000
CIFAR-100   100       500                         100
SVHN        10        7325.7 (avg.)               2603.2 (avg.)
Table 1: Dataset statistics.

We first evaluate the effectiveness of our proposed method with the popular architectures AlexNet and VGG-16 on the public datasets CIFAR-10 and CIFAR-100 [10]. Then, we employ variations of VGG on another public dataset, SVHN [12], in order to see the effect of the number of hidden layers on the tolerance level for strong regularization. Please note that we do not employ architectures that contain normalization techniques such as batch normalization [9], for the reason described in the previous section. All the datasets contain images of 32x32 resolution with 3 color channels. The dataset statistics are described in Table 1. AlexNet and VGG-16 for CIFAR-10 and CIFAR-100 contain 2.6 and 15.2 million parameters, respectively. VGG-11 and VGG-19 contain 9.8 and 20.6 million parameters, respectively.

L1/L2 regularization is applied to all network parameters except bias terms. We use the PyTorch framework (http://pytorch.org/) for all experiments, and we use its official computer vision library (https://github.com/pytorch/vision) for the implementations of the networks. In order to accommodate the datasets, we made some modifications to the networks. The kernel size of AlexNet's max-pooling layers is modified from 3 to 2, and the first convolution layer's padding size is modified from 2 to 5. All of its fully connected layers are modified to have 256 neurons. For VGG, we modified the fully connected layers to have 512 neurons. The output layers of both networks have 10 neurons for CIFAR-10 and SVHN, and 100 neurons for CIFAR-100. The networks are learned by stochastic gradient descent with momentum of 0.9. The batch size is set to 128, and the initial learning rate is set to 0.05 and decayed by a factor of 2 every 30 epochs. We use dropout layers (with drop probability 0.5) and pre-process the training data (horizontal random flipping and random cropping are applied to the original images in each batch, but not to SVHN as they may harm performance) in order to report the extra performance boost on top of common regularization techniques.

AlexNet and VGG-16 are evaluated with different regularization methods (L1 and L2) and different datasets (CIFAR-10 and CIFAR-100), yielding 8 experiment sets. Then, VGG-11, VGG-16, and VGG-19 are evaluated with L1 and L2 regularization on SVHN, yielding 6 experiment sets. For each experiment set, we set the baseline method as the one with best-tuned L1 or L2 regularization but without our gradient-coherent regularization. For each of L1 and L2 regularization, we try more than 10 different values of $\lambda$, and for each $\lambda$ we report the average accuracy of three independent runs together with the 95% confidence interval. We perform a statistical significance test (t-test) for the improvement over the baseline method. We also report the sparsity of each trained model, which is the proportion of zero-valued parameters among all parameters. Please note that we report the sparsity actually produced by the models, not sparsity obtained by pruning parameters with a threshold after training.
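
Computing this notion of sparsity is straightforward; a small helper of our own (not from the paper) could look as follows.

    import torch

    def model_sparsity(model):
        # Fraction of parameters that are exactly zero after training
        # (no post-hoc threshold pruning).
        zeros, total = 0, 0
        for p in model.parameters():
            zeros += (p == 0).sum().item()
            total += p.numel()
        return zeros / total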

In this specific task, we observe in Figures 2(c) and 3 that the average $|g_{\mathcal{L}}|$ is elevated by two orders of magnitude and the gradient sign coherence rate quickly drifts away from 0.5 (a random coherence) in the first couple of epochs. At the same time, a fixed amount of regularization needs to be imposed, without skipping arbitrarily many training steps, to reach a certain level of smoothness or sparsity in the solution. Therefore, we also employ the following regularization schedule:

$\lambda_t = \begin{cases} 0 & \text{if } \operatorname{epoch}(t) < T_{\lambda} \\ \lambda & \text{otherwise} \end{cases}$   (11)

where $\operatorname{epoch}(t)$ is the epoch number at time step $t$, and $T_{\lambda}$ is a hyper-parameter that is determined by $q_t$. The formula means that we do not impose any regularization until epoch $T_{\lambda}$, and then impose strong regularization until training ends. We call this approach ours and employ it for the majority of the experiments, but we also experiment with the original approach in equation (8), which we call ours_orig. Considering the dynamics of stochastic gradient descent, we can set the starting point as the time step where $q_t$ becomes a little greater than 0.5. We set $T_{\lambda}$ to the point where $q_t$ reaches 0.7 in general, and we did not find significantly different results for other choices of $T_{\lambda}$.
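
In code, this schedule is a one-liner; the sketch below is our rendering of equation (11), with start_epoch standing in for the hyper-parameter chosen from the coherence rate.

    def scheduled_reg_strength(epoch, lam, start_epoch):
        # Equation (11): no regularization before `start_epoch` (chosen where
        # the sign coherence rate q_t has drifted well above 0.5), then full
        # strength `lam` until training ends.
        return 0.0 if epoch < start_epoch else lam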

3.1 Results on CIFAR-10 and CIFAR-100

Figure 4: Accuracy obtained by VGG-16 ((a) CIFAR-100, L2; (b) CIFAR-100, L1; (c) CIFAR-10, L2; (d) CIFAR-10, L1) and AlexNet ((e) CIFAR-100, L2; (f) CIFAR-100, L1; (g) CIFAR-10, L2; (h) CIFAR-10, L1). A green dotted horizontal line is the accuracy obtained by a model without L1/L2 regularization (but with dropout). The error bars indicate 95% confidence intervals.

Figure 5: Accuracy versus sparsity when L1 regularization is employed: (a) VGG-16, CIFAR-100; (b) VGG-16, CIFAR-10; (c) AlexNet, CIFAR-100; (d) AlexNet, CIFAR-10. The error bars indicate 95% confidence intervals.

The experimental results on CIFAR-10 and CIFAR-100 are depicted in Figures 4 and 5. As we investigated in the previous section, the baseline method suddenly fails beyond a certain tolerance level of $\lambda$. However, our proposed method does not fail for higher values of $\lambda$, and it indeed achieves higher accuracy in general. Another interesting observation is that, unlike with VGG-16, we obtain more improvement with AlexNet under L1 regularization. Meanwhile, tuning the regularization parameter can be difficult as the curves have somewhat sharp peaks, but our proposed method eases the problem to some extent by preventing the sharp drop, i.e., the sudden failure. Our L1 regularizer obtains better sparsity for a similar level of accuracy (Figure 5), which means that strong regularization is promising for the compression of DNNs. Overall, the improvement is more prominent on CIFAR-100 than on CIFAR-10, and we think this is because over-fitting is more likely on CIFAR-100, as there are fewer images per class in CIFAR-100 than in CIFAR-10.

Interestingly, our proposed method often obtains higher accuracy even when the baseline does not fail on CIFAR-10, and this is especially prominent when $\lambda$ is a little less than the tolerance level (better shown in Figure 4(b), 4(c), 4(d)). One possible explanation is that avoiding strong regularization in the early stage of training can help the model to explore the parameter space more freely, and the better exploration results in finding superior local optima.

                     CIFAR-100      CIFAR-10
VGG-16
  No L1/L2           62.08±0.81     90.80±0.23
  L2 baseline        69.16±0.46     92.42±0.16
  L2 ours            71.01±0.33     92.60±0.16
  Rel. improvement   +2.67%         +0.19%
  L1 baseline        66.94±0.24     91.29±0.16
  L1 ours            67.55±0.12     91.55±0.10
  Rel. improvement   +0.91%         +0.28%
AlexNet
  No L1/L2           43.09±0.25     75.05±0.20
  L2 baseline        46.91±0.15     78.66±0.17
  L2 ours            47.64±0.33     78.65±0.29
  Rel. improvement   +1.56%         -0.01%
  L1 baseline        45.70±0.10     76.87±0.24
  L1 ours            47.48±0.29     77.63±0.34
  Rel. improvement   +3.89%         +0.99%
Table 2: Accuracy (%) with 95% confidence intervals. Note that when no L1/L2 regularization is imposed, dropout is still employed. Statistically significant improvements are bold-faced.

The exact accuracies obtained are shown in Table 2. Our proposed model improves the baselines by up to 3.89%, in all cases except AlexNet with L2 regularization on CIFAR-10, and most (6 out of 7) of the improvements are statistically significant. Also, L1/L2 regularization is indeed useful even when dropout is employed: our model improves the baseline that has dropout but no L1 or L2 regularization by up to 14.4% in accuracy.

Figure 6: Accuracy on CIFAR-100 by VGG-16 with ours and ours_orig. The L2 regularizer (left) and L1 regularizer (right) are employed. $\tau$ is set to 0.6 for both.

Results by ours_orig

We also perform experiments with ours_orig in equation (8) and compare the results with ours; the results are shown in Figure 6. Although ours does not suffer from a sudden failure in learning under strong regularization, it performs poorly for very strong regularization. This is because the gradients from regularization are so large that the overall gradients are severely corrupted. However, ours_orig skips regularization if the quality of the gradients from regularization is not good enough, so it can still perform well for very strong regularization. As a result, it can be easier to set $\lambda$ with ours_orig. The results are similar for the other data sets and architectures and are thus omitted.

3.2 SVHN Results: Does the Number of Layers Affect the Failure by Strong Regularization?

Figure 7: Accuracy obtained by variations of VGG ((a) VGG-11, (b) VGG-16, (c) VGG-19) with L2 regularization on SVHN. A green dotted horizontal line is the accuracy obtained by a model without L2 regularization (but with dropout). The error bars indicate 95% confidence intervals.

The analysis in Section 2.2 implies that the number of hidden layers affects the tolerance level for strong regularization. That is, if there are more hidden layers in the neural network architecture, learning will fail more easily under strong regularization. In order to substantiate this hypothesis empirically, we employ variations of the VGG architecture, i.e., VGG-11, VGG-16, and VGG-19, which consist of 11, 16, and 19 hidden layers, respectively. Experiments are performed on the SVHN dataset.

The results with L2 regularization are depicted in Figure 7 (the results with L1 regularization show similar patterns and are omitted due to space limits). For all VGG variations, the peak of our method's accuracy is formed around the same value of $\lambda$. As more hidden layers are added to the network, the tolerance level at which the baseline suddenly fails decreases further and further. This means that deeper architectures are indeed more likely to fail under strong regularization, as hypothesized by our analysis.

Because the method without L1/L2 regularization already performs well on this dataset and there are relatively many training images per class, the improvements from L1/L2 regularization are not large. Our method still outperforms the baseline in all experiments (6 out of 6), but fewer of the improvements are statistically significant (2 out of 6) compared to the CIFAR-10 and CIFAR-100 experiments.

3.3 Network Compression by Strong Regularization

                        Sparsity   Accuracy      Compression rate
AlexNet on CIFAR-100
  L1 baseline           0.219      45.70±0.10
  L1 ours (sparse)      0.814      45.77±0.32    4.2x
AlexNet on CIFAR-10
  L1 baseline           0.766      76.87±0.24
  L1 ours (sparse)      0.877      76.90±0.22    1.9x
VGG-16 on CIFAR-100
  L1 baseline           0.269      66.94±0.24
  L1 ours (sparse)      0.697      67.06±0.62    2.4x
VGG-16 on CIFAR-10
  L1 baseline           0.808      91.29±0.16
  L1 ours (sparse)      0.926      91.38±0.05    2.6x
VGG-11 on SVHN
  L1 baseline           0.519      94.68±0.08
  L1 ours (sparse)      0.845      94.71±0.01    3.1x
VGG-16 on SVHN
  L1 baseline           0.450      95.34±0.11
  L1 ours (sparse)      0.795      95.38±0.11    2.7x
VGG-19 on SVHN
  L1 baseline           0.122      95.37±0.11
  L1 ours (sparse)      0.911      95.41±0.07    9.9x
Table 3: Compression rate, i.e., how much the number of non-zero-valued parameters is reduced, obtained by our sparse models.

L1 regularization naturally compresses neural networks by setting a portion of the parameters to zero, while it can even improve generalization through the simplified solutions. In order to see how much sparsity our method can obtain while keeping the baseline's best accuracy, we choose a sparse model whose accuracy is equal to or higher than that of the L1 baseline. Our proposed model's sparsity and compression rate over the baseline are shown in Table 3. Our model, in general, needs only about 10-30% of all parameters to perform as well as the baseline, which needs about 20-90% of all parameters. Our approach always (7 out of 7) obtains higher sparsity than the baselines, with a compression rate of up to 9.9x, meaning that our approach is promising for compressing neural networks.

3.4 Empirical Validation of Our Hypothesis

Figure 8: Results by VGG-16 with L2 regularization on CIFAR-100: (a) average $|g_{\mathcal{L}}|$ (close-up), (b) average $|w|$. The results are from the baseline unless labeled as "ours".

We hypothesized that (i) if we skip strong regularization when the gradients are not coherent enough, the model will not fail to learn, and (ii) if the model does not suffer from continuous suppression of $g_{\mathcal{L}}$, then $|w|$ may not decrease. Figure 8(a) shows that our proposed model obtains a great elevation instead of an exponential decay in $|g_{\mathcal{L}}|$, unlike the baseline; this means that it indeed does not fail to learn. In Figure 8(b), although the same strong regularization is enforced after a couple of epochs, the magnitude of the weights in our model stops decreasing around epoch 20, while that of the baseline (a green dotted line) keeps decreasing towards zero. As there is no continuous suppression of $g_{\mathcal{L}}$ for our proposed model, the magnitudes of the parameters indeed do not decrease after a certain point. Comparing our approach under strong regularization to the baseline under its best (weaker) regularization, we can also see that our approach with strong regularization indeed further simplifies the solution.

4 Related Work

The related work is partially covered in the introduction section, and we discuss other related work here. It has been shown that L2 regularization is important for training DNNs [11, 3]. Although newer regularization methods for DNNs such as dropout have emerged, L2 regularization has been shown to reduce the test error effectively when combined with dropout [17]. Meanwhile, L1 regularization has also been used often in order to obtain sparse solutions. To reduce computation and power consumption, L1 regularization and its variations, such as group sparsity regularization, have been promising for deep neural networks [20, 15, 21]. However, for both L1 and L2 regularization, the phenomenon that learning fails under strong regularization in DNNs has not been emphasized previously. [2] showed that tuning hyper-parameters such as the L2 regularization strength can be done effectively through random search instead of grid search, but they did not study the phenomenon caused by strong regularization. [22] visualized activations to understand deep neural networks and showed that strong L2 regularization makes learning fail. However, it was still not shown how and why learning fails and how strong regularization can be achieved. [1] applies a group sparsity regularizer only once per epoch to compress DNNs, but the purpose is not to avoid failure in learning but to determine the number of necessary neurons in each layer. To the best of our knowledge, there is no existing work that studies vanishing gradients and failure in learning caused by strong regularization.

5 Discussion

Our proposed method can be especially useful when strong regularization is desired. For example, deep learning projects that cannot afford a huge labeled dataset can benefit from our method. On the other hand, strong regularization may not be necessary in cases where a large labeled dataset is available or the networks do not contain many parameters.

Our work can be further extended in several ways. Since our approach can achieve strong regularization, it will be interesting to see how the strongly regularized model performs in terms of both accuracy and sparsity when combined with pruning-based approaches [6, 5]. We applied our approach only to L1 and L2 regularizers; however, applying it to other regularizers such as group sparsity regularizers is promising, as they are often employed to compress DNNs. Lastly, our proposed coherence rate is general, so one can apply it to other joint optimization problems where unfavorable gradients dominate the overall gradients. All these directions are left for future work.

References

  • [1] J. M. Alvarez and M. Salzmann. Learning the number of neurons in deep networks. In Advances in Neural Information Processing Systems, pages 2270–2278, 2016.
  • [2] J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.
  • [3] L. Deng, G. Hinton, and B. Kingsbury. New types of deep neural network learning for speech recognition and related applications: An overview. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 8599–8603. IEEE, 2013.
  • [4] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256, 2010.
  • [5] S. Han, J. Pool, S. Narang, H. Mao, S. Tang, E. Elsen, B. Catanzaro, J. Tran, and W. J. Dally. Dsd: Regularizing deep neural networks with dense-sparse-dense training flow. arXiv preprint arXiv:1607.04381, 2016.
  • [6] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
  • [7] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [8] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
  • [9] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
  • [10] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
  • [11] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [12] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5, 2011.
  • [13] N. Parikh, S. Boyd, et al. Proximal algorithms. Foundations and Trends® in Optimization, 1(3):127–239, 2014.
  • [14] T. Salimans and D. P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pages 901–909, 2016.
  • [15] S. Scardapane, D. Comminiello, A. Hussain, and A. Uncini. Group sparse regularization for deep neural networks. Neurocomputing, 241:81–89, 2017.
  • [16] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [17] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of machine learning research, 15(1):1929–1958, 2014.
  • [18] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
  • [19] T. van Laarhoven. L2 regularization versus batch and weight normalization. arXiv preprint arXiv:1706.05350, 2017.
  • [20] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2074–2082, 2016.
  • [21] J. Yoon and S. J. Hwang. Combined group and exclusive sparsity for deep neural networks. In International Conference on Machine Learning, pages 3958–3966, 2017.
  • [22] J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, and H. Lipson. Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579, 2015.