Recent years have witnessed the success of deep neural networks in dealing with a variety of practical problems. Dropout has played an essential role in many successful deep neural networks by inducing regularization in the model training. In this paper, we present a new regularized training approach: Shakeout. Instead of randomly discarding units as Dropout does at the training stage, Shakeout randomly chooses to enhance or reverse each unit's contribution to the next layer. This minor modification of Dropout has a notable statistical trait: the regularizer induced by Shakeout adaptively combines L_0, L_1 and L_2 regularization terms. Our classification experiments with representative deep architectures on the image datasets MNIST, CIFAR-10 and ImageNet show that Shakeout deals with over-fitting effectively and outperforms Dropout. We empirically demonstrate that Shakeout leads to sparser weights under both unsupervised and supervised settings. Shakeout also leads to a grouping effect among the input units of a layer. Considering how well the weights reflect the importance of connections, Shakeout is superior to Dropout, which is valuable for deep model compression. Moreover, we demonstrate that Shakeout can effectively reduce the instability of the training process of a deep architecture.
A particularly interesting advance in training techniques is the invention of Dropout. At the operational level, Dropout adjusts the network evaluation step (feed-forward) at the training stage, where a portion of units are randomly discarded. The effect of this simple trick is impressive. Dropout enhances the generalization performance of neural networks considerably, and is behind many record-holders of widely recognized benchmarks [14, 2, 15]. This success has attracted much research attention, and Dropout has found applications in a wider range of problems [16, 17, 18]. Theoretical research from the viewpoint of statistical learning has pointed out the connection between Dropout and model regularization, which is the de facto recipe for reducing over-fitting of complex models in practical machine learning. For example, Wager et al. showed that for a generalized linear model (GLM), Dropout implicitly imposes an adaptive L_2 regularizer of the network weights through an estimation of the inverse diagonal Fisher information matrix.
Sparsity is of vital importance in deep learning. By removing unimportant weights, deep neural networks perform prediction faster; in addition, one expects better generalization performance and a reduced number of examples needed in the training stage. Recently, much evidence has shown that the accuracy of a trained deep neural network is not severely affected by removing a majority of its connections, and many researchers focus on the deep model compression task [20, 21, 22, 23, 24, 25]. One effective way of compression is to iteratively train a neural network, prune the connections and fine-tune the weights [21, 22]. However, if we can cut the connections naturally by imposing sparsity-inducing penalties during the training of a deep neural network, the work-flow is greatly simplified.
In this paper, we propose a new regularized deep neural network training approach, Shakeout, which is easy to implement: randomly choose to enhance or reverse each unit's contribution to the next layer at the training stage. Note that Dropout can be considered a special "flat" case of our approach: randomly keeping (the enhance factor is 1/(1 − τ), where τ is the dropping rate) or discarding (the reverse factor is 0) each unit's contribution to the next layer. Shakeout enriches the regularization effect. In theory, we prove that it adaptively combines L_0, L_1 and L_2 regularization terms. The L_0 and L_1 regularization terms are known as sparsity-inducing penalties. The combination of sparsity-inducing penalties and the L_2 penalty on the model parameters has been shown to be effective in statistical learning: the Elastic Net has the desirable property of producing sparse models while maintaining the grouping effect of the weights of the model. Because of the random "shaking" process and the regularization characteristic of pushing network weights towards zero, our new approach is named "Shakeout".
As discussed above, much sparser weights are expected when using Shakeout rather than Dropout, because of the combination of L_0 and L_1 regularization terms induced in the training stage. We apply Shakeout to a one-hidden-layer autoencoder and obtain much sparser weights than those resulting from Dropout. To show the regularization effect on classification tasks, we conduct experiments on image datasets including MNIST, CIFAR-10 and ImageNet with representative deep neural network architectures. In our experiments we find that the deep neural networks trained with Shakeout always outperform those trained with Dropout, especially when the data are scarce. Besides leading to much sparser weights, Shakeout also empirically groups the input units of a layer. Due to the induced L_0 and L_1 regularization terms, Shakeout can produce weights that reflect the importance of the connections between units, which is meaningful for model compression. Moreover, we demonstrate that Shakeout can effectively reduce the instability of the training process of a deep architecture.
This journal paper extends our previous work  theoretically and experimentally. The main extensions are listed as follows: 1) we derive the analytical formula for the regularizer induced by Shakeout in the context of GLM and prove several important properties; 2) we conduct experiments using Wide Residual Network  on CIFAR-10 to show Shakeout outperforms Dropout and standard back-propagation in promoting the generalization performance of a much deeper architecture; 3) we conduct experiments using AlexNet 
on ImageNet dataset with Shakeout and Dropout. Shakeout obtains comparable classification performance to Dropout, but with superior regularization effect; 4) we illustrate that Shakeout can effectively reduce the instability of the training process of the deep architecture. Moreover, we provide a much clearer and detailed description of Shakeout, derive the forward-backward update rule for deep convolutional neural networks with Shakeout, and give several recommendations to help the practitioners make full use of Shakeout.
In the rest of the paper, we review related work in Section 2. Section 3 presents Shakeout in detail, along with a theoretical analysis of the regularization effect it induces. In Section 4, we first demonstrate the regularization effect of Shakeout on an autoencoder model. The classification experiments on MNIST, CIFAR-10 and ImageNet illustrate that Shakeout outperforms Dropout in terms of the generalization performance, the regularization effect on the weights, and the stabilization effect on the training process of the deep architecture. Finally, we give some recommendations for practitioners to make full use of Shakeout.
Deep neural networks have shown their success in a wide variety of applications. One of the key factors contributing to this success is the creation of powerful training techniques. The representative power of a network becomes stronger as the architecture gets deeper. However, millions of parameters make deep neural networks easily over-fit. Regularization [28, 16] is an effective way to obtain a model that generalizes well. There exist many approaches to regularize the training of deep neural networks, such as weight decay, early stopping, etc. Shakeout belongs to this family of regularized training techniques.
Among these regularization techniques, our work is closely related to Dropout. Many subsequent works were devised to improve the performance of Dropout [31, 32, 33]. The underlying reason why Dropout improves performance has also attracted the interest of many researchers. Evidence has shown that Dropout may work because of its good approximation to model averaging and its regularization of the network weights [34, 35, 36]. Srivastava and Warde-Farley showed through experiments that the weight-scaling approximation is an accurate alternative to the geometric mean over all possible sub-networks. Gal et al. claimed that training a deep neural network with Dropout is equivalent to performing variational inference in a deep Gaussian Process. Dropout can also be regarded as a way of adding noise into the neural network. By marginalizing the noise, Srivastava proved for linear regression that the deterministic version of Dropout is equivalent to adding an adaptive L_2 regularization on the weights. Furthermore, Wager extended this conclusion to generalized linear models (GLMs) using a quadratic approximation to the induced regularizer. The inductive bias of Dropout was studied by Helmbold et al. to further illustrate the properties of the regularizer induced by Dropout. In terms of implicitly inducing a regularizer on the network weights, Shakeout can be viewed as a generalization of Dropout. It enriches the regularization effect of Dropout: besides the L_2 regularization term, it also induces the L_0 and L_1 regularization terms, which may lead to sparse weights of the model.
Due to the implicitly induced L_0 and L_1 regularization terms, Shakeout is also related to sparsity-inducing approaches. Olshausen et al. introduced the concept of sparsity in computational neuroscience and proposed the sparse coding method for the visual system. In machine learning, a sparsity constraint enables a model to capture the implicit statistical structure of the data, performs feature selection and regularization, compresses the data at a low loss of accuracy, and helps us to better understand our models and explain the obtained results. Sparsity is one of the key factors underlying many successful deep neural network architectures [40, 41, 42, 2] and training algorithms. A convolutional neural network is much sparser than a fully-connected one, which results from the concept of the local receptive field. Sparsity has also been a design principle and motivation for the Inception-series models [41, 42, 2]. Besides working as a heuristic principle for designing a deep architecture, sparsity often works as a penalty that regularizes the training process of a deep neural network. There exist two kinds of sparsity penalties in deep neural networks, which lead to activity sparsity and connectivity sparsity, respectively. The difference between Shakeout and these sparsity-inducing approaches is that for Shakeout, the sparsity is induced through simple stochastic operations rather than manually designed architectures or explicit norm-based penalties. This implicit way enables Shakeout to be easily optimized by stochastic gradient descent (SGD), the representative approach for the optimization of a deep neural network.
Shakeout applies to the weights in a linear module. The linear module, i.e. the weighted sum
u = Σ_j w_j x_j,    (1)
is arguably the most widely adopted component in data models. For example, the variables x_1, x_2, …, x_d can be input attributes of a model, e.g. the extracted features for a GLM, or the intermediate outputs of earlier processing steps, e.g. the activations of the hidden units in a multilayer artificial neural network. Shakeout randomly modifies the computation in Eq. (1). Specifically, Shakeout can be realized by randomly modifying the weights:
Step 1: Draw r ~ Bernoulli(1 − τ), where τ ∈ [0, 1).
Step 2: Adjust the weight w_j according to r: (A) if r = 0, set w_j to −c s_j; (B) if r = 1, set w_j to w_j/(1 − τ) + (c τ/(1 − τ)) s_j,
where s_j takes ±1 depending on the sign of w_j, or takes 0 if w_j = 0. As shown above, Shakeout chooses (randomly, by drawing r) between two fundamentally different ways to modify the weights. Modification (A) sets the weights to a constant magnitude c, regardless of their original values, with signs opposite to the original ones. Modification (B) updates the weights by a factor 1/(1 − τ) and a bias c τ/(1 − τ) depending on the signs. Note that both (A) and (B) preserve zero values of the weights, i.e. if w_j = 0 then the modified weight is 0
with probability 1. Let w̃_j denote the modified weight; Shakeout leaves w̃_j unbiased, i.e. E_r[w̃_j] = w_j. The hyper-parameters τ and c configure the property of Shakeout.
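The two-step rule above can be sketched in a few lines of Python. This is a minimal illustration in our own notation, not the released implementation; the variance formula in the comment is our derivation, consistent with the regularizer analyzed in this section.

```python
import numpy as np

rng = np.random.default_rng(0)

def shakeout_weight(w, tau, c, r):
    """Apply the Shakeout modification to a single weight w given the draw r.

    r = 0 -> modification (A): constant magnitude c, sign opposite to w.
    r = 1 -> modification (B): scale by 1/(1 - tau) plus a bias
             c*tau/(1 - tau) along the sign of w.
    Zero weights stay zero in both cases because sgn(0) = 0.
    """
    s = np.sign(w)
    if r == 0:
        return -c * s
    return w / (1.0 - tau) + c * tau / (1.0 - tau) * s

tau, c, w = 0.3, 0.5, 1.2
draws = rng.binomial(1, 1.0 - tau, size=50_000)
samples = np.array([shakeout_weight(w, tau, c, r) for r in draws])

print(samples.mean())  # close to w: the modified weight is unbiased
print(samples.var())   # close to tau/(1-tau) * (|w| + c)**2 (our derivation)
print(shakeout_weight(0.0, tau, c, 0), shakeout_weight(0.0, tau, c, 1))  # zeros preserved
```

Setting c = 0 recovers Dropout with inverted scaling: the weight is either zeroed or divided by the keep rate.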
Shakeout is naturally connected to the widely adopted operation of Dropout [13, 34]. We will show that Shakeout has a regularization effect on model training similar to, but beyond, that induced by Dropout. From an operational point of view, Fig. 1 compares Shakeout and Dropout. Note that Shakeout includes Dropout as a special case when the hyper-parameter c in Shakeout is set to zero.
When applied at the training stage, Shakeout alters the objective, i.e. the quantity to be minimized, by adjusting the weights. In particular, we will show that Shakeout (with expectation over the random switch) induces a regularization term that effectively penalizes the magnitudes of the weights and leads to sparse weights. Shakeout is an approach designed for helping model training; once the model is trained and deployed, one should remove the disturbance to let the model work at its full capacity, i.e. we adopt the resulting network without any modification of the weights at the test stage.
Shakeout randomly modifies the weights in a linear module, and thus can be regarded as injecting noise into each variable x_j, i.e. x_j is randomly scaled: x̃_j = ξ_j x_j. Since ξ_j is determined by the random switch r, the modification of x_j is actually determined by r. Shakeout randomly chooses to enhance (i.e. when r = 1, the magnitude of the contribution is amplified) or reverse (i.e. when r = 0, the contribution takes the sign opposite to the original one) each original variable x_j's contribution to the output at the training stage (see Fig. 1). However, the expectation of x̃_j over the noise remains unbiased, i.e. E[x̃_j] = x_j.
Here x̃ is the feature vector randomly modified by the noise induced by r. The regularization term is determined by the characteristics of the noise. For example, Wager et al. showed that Dropout, which corresponds to injecting blackout noise into the features, helps introduce an adaptive L_2 penalty on the weights. In this section we illustrate how Shakeout helps regularize the model parameters using the example of GLMs.
Formally, a GLM is a probabilistic model of predicting a target y given features x, in terms of the weighted sum in Eq. (1):
p(y | x; w) = h(y) exp(y u − A(u)),  with u = wᵀx.
With different A and h functions, a GLM can be specialized to various useful models or modules, such as the logistic regression model or a layer in a feed-forward neural network. Roughly speaking, however, the essence of a GLM is similar to that of a standard linear model, which aims to find weights w so that wᵀx aligns with y (the functions h and A are independent of w and y, respectively). The loss function of a GLM with respect to (x, y) is defined as the negative log-likelihood
ℓ(w) = −log p(y | x; w) = A(wᵀx) − y wᵀx − log h(y).
Let the loss with Shakeout be
ℓ̃(w) = A(wᵀx̃) − y wᵀx̃ − log h(y),
where x̃_j = ξ_j x_j, and x̃ represents the features randomly modified by the Shakeout noise ξ.
Taking expectation over the noise, the loss with Shakeout becomes
E[ℓ̃(w)] = ℓ(w) + π(w),
where
π(w) = E[A(wᵀx̃)] − A(wᵀx)    (7)
is named the Shakeout regularizer (the linear term vanishes because E[x̃] = x). If A is k-th order derivable, we write A^{(k)} for the k-th order derivative to keep the notation simple. Let η = wᵀx and Δ = wᵀx̃ − η; then, since E[Δ] = 0, a Taylor expansion of A around η gives π(w) = Σ_{k ≥ 2} (A^{(k)}(η)/k!) E[Δ^k].
We illustrate several properties of Shakeout regularizer based on Eq. (7). The proof of the following propositions can be found in the appendices.
Proposition 2. If A is convex, then π(w) ≥ 0.
Proposition 3. Suppose τ ∈ (0, 1) and c ≥ 0. If A is convex, π(w) monotonically increases with c; under analogous conditions (see the appendices), π(w) monotonically increases with τ.
Proposition 3 implies that the hyper-parameters τ and c relate to the strength of the regularization effect. This is reasonable because a higher τ or c means the noise injected into the features has a larger variance.
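The non-negativity of the regularizer can be checked numerically. The sketch below is our own illustration (not the authors' code): for a low-dimensional logistic model it computes π(w) = E[A(w̃ᵀx)] − A(wᵀx) exactly, enumerating all 2^d configurations under the assumption that each feature carries an independent Bernoulli(1 − τ) switch.

```python
import numpy as np
from itertools import product

def modified_weight(w, tau, c, r):
    # Vectorized Shakeout rule: r = 0 reverses, r = 1 enhances; sgn(0) = 0.
    s = np.sign(w)
    return np.where(r == 0, -c * s, w / (1 - tau) + c * tau / (1 - tau) * s)

def shakeout_regularizer(w, x, tau, c, A):
    """pi(w) = E[A(w_tilde . x)] - A(w . x), computed exactly by enumerating
    the 2^d configurations of the per-feature Bernoulli(1 - tau) switches."""
    pi = -A(float(np.dot(w, x)))
    for bits in product([0, 1], repeat=len(w)):
        r = np.array(bits, dtype=float)
        p = float(np.prod(np.where(r == 1, 1 - tau, tau)))
        pi += p * A(float(np.dot(modified_weight(w, tau, c, r), x)))
    return pi

A_logistic = lambda eta: float(np.log1p(np.exp(eta)))  # convex log-partition

w = np.array([0.8, -1.5])
x = np.array([1.0, 0.4])
vals = [shakeout_regularizer(w, x, 0.25, c, A_logistic) for c in (0.0, 0.5, 1.0)]
print(vals)  # all non-negative, as convexity of A guarantees
```

With τ = 0 only the all-ones switch configuration has non-zero probability and the regularizer vanishes, matching the noise-free case.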
Proposition 4. Under suitable conditions on A (see the appendices for the precise statement): i) starting from w = 0, where π(0) = 0, the regularizer π(w) increases with the magnitude of w; ii) π(w) is bounded above by a constant determined by τ and c.
Proposition 4 implies that, under certain conditions, starting from a zero weight vector, the Shakeout regularizer penalizes the magnitude of w, and its regularization effect is bounded by a constant value. For example, for logistic regression the regularizer is bounded, as illustrated in Fig. 2. This bounded property has been proved to be useful: the capped norm is more robust to outliers than the traditional L_1 or L_2 norm.
Based on Eq. (7), the specific formulas for representative GLM models can be derived:
i) Linear regression: A(η) = η²/2; then
π(w) = (τ / (2(1 − τ))) ‖x ⊙ (|w| + c 1[w ≠ 0])‖²,    (8)
where ⊙ denotes the element-wise product and the indicator 1[·] applies element-wise. The summand x_j² (|w_j| + c 1[w_j ≠ 0])² can be decomposed into the summation of three components
x_j² w_j² + 2c x_j² |w_j| + c² x_j² 1[w_j ≠ 0],
where 1[w_j ≠ 0] is an indicator function which equals 1 if w_j ≠ 0 and 0 otherwise. This decomposition implies that the Shakeout regularizer penalizes a combination of the L_0-norm, L_1-norm and L_2-norm of the weights after scaling them with the squares of the corresponding features. The L_0 and L_1 regularization terms can lead to sparse weights.
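Under our reading of this decomposition, the linear-regression case can be verified on a toy problem. The sketch below (illustrative code, not the authors'; it again assumes one independent switch per feature) compares the exact expectation, enumerated over all switch configurations, with the closed form:

```python
import numpy as np
from itertools import product

def pi_exact(w, x, tau, c):
    """Enumerate E[A(w_tilde . x)] - A(w . x) for A(eta) = eta**2 / 2."""
    A = lambda eta: 0.5 * eta * eta
    s = np.sign(w)
    total = -A(float(np.dot(w, x)))
    for bits in product([0, 1], repeat=len(w)):
        r = np.array(bits, dtype=float)
        p = float(np.prod(np.where(r == 1, 1 - tau, tau)))
        wt = np.where(r == 0, -c * s, w / (1 - tau) + c * tau / (1 - tau) * s)
        total += p * A(float(np.dot(wt, x)))
    return total

def pi_closed_form(w, x, tau, c):
    """tau/(2(1-tau)) * sum_j x_j^2 * (w_j^2 + 2c|w_j| + c^2 [w_j != 0])."""
    nz = (w != 0).astype(float)
    return tau / (2 * (1 - tau)) * float(
        np.sum(x ** 2 * (w ** 2 + 2 * c * np.abs(w) + c ** 2 * nz)))

w = np.array([0.7, -1.1, 0.0])   # note the exactly-zero weight
x = np.array([1.0, 0.5, 2.0])
print(pi_exact(w, x, 0.3, 0.4), pi_closed_form(w, x, 0.3, 0.4))  # agree
```

Because A is quadratic here, the Taylor expansion terminates at k = 2 and the two computations agree to floating-point precision; the zero weight contributes nothing, illustrating that zeros are never penalized.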
ii) Logistic regression: A(η) = log(1 + exp(η)); the corresponding regularizer, Eq. (9), follows by substituting this A into Eq. (7).
Fig. 3 illustrates the contours of the Shakeout regularizer based on Eq. (9) in the 2-D weight space. On the whole, the contours of the Shakeout regularizer indicate that the regularizer combines L_0, L_1 and L_2 regularization terms. As c goes to zero, the contour around the origin becomes less sharp, which implies that the hyper-parameter c relates to the strength of the L_0 and L_1 components. When c = 0, Shakeout degenerates to Dropout, whose contours imply that the Dropout regularizer consists of the L_2 regularization term.
The difference between the Shakeout and Dropout regularizers is also illustrated in Fig. 2. We set the hyper-parameters of Shakeout and of Dropout such that the bounds of the regularization effects of the two regularizers are the same. In this one-dimensional circumstance, the main difference is that at w = 0 (see the enlarged snapshot for comparison), the Shakeout regularizer is sharp and discontinuous while the Dropout regularizer is smooth. Thus, compared to Dropout, Shakeout may lead to much sparser weights of the model.
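The sharpness at zero is easiest to see in one dimension, where the expectation has only two outcomes. The snippet below (our own illustration; τ = 0.5, c = 1 and x = 1 are assumed values, not taken from the paper's figures) evaluates the exact one-feature logistic regularizer:

```python
import numpy as np

def pi_1d(w, x, tau, c):
    """Exact Shakeout regularizer for one logistic-regression feature:
    only two switch outcomes, r = 0 (prob tau) and r = 1 (prob 1 - tau)."""
    A = lambda eta: np.log1p(np.exp(eta))
    s = np.sign(w)
    w0 = -c * s                                   # reverse
    w1 = w / (1 - tau) + c * tau / (1 - tau) * s  # enhance
    return tau * A(w0 * x) + (1 - tau) * A(w1 * x) - A(w * x)

tau, c, x = 0.5, 1.0, 1.0
print(pi_1d(0.0, x, tau, c))    # 0: a zero weight is never penalized
print(pi_1d(1e-9, x, tau, c))   # jumps to about 0.12: discontinuous at 0
print(pi_1d(1e-9, x, tau, 0.0)) # ~0: with c = 0 (Dropout) it is smooth at 0
```

The jump from 0 to a strictly positive value for an arbitrarily small non-zero weight is exactly the L_0-like behaviour that pushes small weights to zero.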
The regularizer of Eq. (8) consists of a combination of L_0, L_1 and L_2 regularization terms. It tends to penalize a weight whose corresponding feature has a large magnitude; meanwhile, weights whose corresponding features are always zero are penalized less. The multiplicative factor in Eq. (9) is proportional to the variance of the prediction given w and x. Penalizing it encourages the weights to move towards making the model more "confident" about its prediction, i.e. more discriminative.
Generally speaking, the Shakeout regularizer adaptively combines L_0, L_1 and L_2 regularization terms, which matches what we have observed in Fig. 3. It prefers penalizing weights with large magnitudes and encourages the weights to move towards making the model more discriminative. Moreover, weights whose corresponding features are always zero are penalized less. The L_0 and L_1 components can induce sparse weights.
Last but not least, we want to emphasize that when τ = 0, the noise is eliminated and the model becomes a standard GLM. Moreover, Dropout can be viewed as the special case of Shakeout with c = 0, and a higher value of τ means a stronger regularization effect imposed on the weights. Generally, when τ is fixed (τ ≠ 0), a higher value of c means a stronger effect of the L_0 and L_1 components and leads to much sparser weights of the model. We will verify this property in the experiment section later.
It has been illustrated that Shakeout regularizes the weights in linear modules. The linear module is the basic component of multilayer neural networks: linear operations connect the outputs of two successive layers. Thus Shakeout is readily applicable to the training of multilayer neural networks.
Considering the forward computation from layer l to layer l + 1 of a fully-connected layer, the Shakeout forward computation is as follows:
u_j^{l+1} = Σ_i [ (r_i / (1 − τ)) (w_ij + c τ s_ij) − (1 − r_i) c s_ij ] y_i^l + b_j,  r_i ~ Bernoulli(1 − τ),
where i denotes the index of an output unit of layer l, and j denotes the index of an output unit of layer l + 1. The output unit i of layer l is represented by y_i^l. The weight of the connection between unit i and unit j is represented as w_ij. The bias for the j-th unit is denoted by b_j. The s_ij is the sign of the corresponding weight w_ij. After the Shakeout operation, the linear combination u_j^{l+1} is sent to the activation function f to obtain the corresponding output y_j^{l+1} = f(u_j^{l+1}). Note that all the weights that connect to the same input unit i are controlled by the same random variable r_i.
During back-propagation, we should compute the gradients with respect to each unit to propagate the error. In Shakeout, this gradient takes the form
∂u_j^{l+1} / ∂y_i^l = (r_i / (1 − τ)) (w_ij + c τ s_ij) − (1 − r_i) c s_ij.
And the weights are updated following
∂u_j^{l+1} / ∂w_ij = [ (r_i / (1 − τ)) (1 + c τ s′_ij) − (1 − r_i) c s′_ij ] y_i^l,
where s′_ij represents the derivative of the sgn function. Because the sgn function is not continuous at zero, and thus its derivative is not defined there, we approximate this derivative with 0, so that ∂u_j^{l+1} / ∂w_ij = (r_i / (1 − τ)) y_i^l. Empirically we find that this approximation works well.
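The forward-backward pair can be sketched as follows (a numpy illustration in our own notation, not the paper's Caffe implementation; the function names are ours). With the mask held fixed and the weights kept away from zero, sgn is locally constant, so the analytic gradient under the sgn′ ≈ 0 approximation can be validated by finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)

def shakeout_fc_forward(x, W, b, r, tau, c):
    """Shakeout forward pass of a fully-connected layer.
    x: (d_in,), W: (d_in, d_out), r: (d_in,) draws from Bernoulli(1 - tau).
    All weights leaving input unit i share the same switch r[i]."""
    S = np.sign(W)
    W_mod = (r / (1 - tau))[:, None] * (W + c * tau * S) - (1 - r)[:, None] * (c * S)
    return x @ W_mod + b

def shakeout_fc_backward(x, W, r, tau, c, grad_u):
    """Gradients of the pre-activations u w.r.t. x and W.
    The derivative of sgn is approximated by 0, so w_ij enters
    d u_j / d w_ij only through the factor r_i / (1 - tau)."""
    S = np.sign(W)
    W_mod = (r / (1 - tau))[:, None] * (W + c * tau * S) - (1 - r)[:, None] * (c * S)
    grad_x = W_mod @ grad_u
    grad_W = np.outer(x * r / (1 - tau), grad_u)
    return grad_x, grad_W

d_in, d_out, tau, c = 4, 3, 0.25, 0.3
x = rng.normal(size=d_in)
W = rng.uniform(0.2, 1.5, size=(d_in, d_out))   # keep weights away from 0
b = np.zeros(d_out)
r = rng.binomial(1, 1 - tau, size=d_in).astype(float)
g = rng.normal(size=d_out)                      # upstream gradient dL/du

gx, gW = shakeout_fc_backward(x, W, r, tau, c, g)

# Finite-difference check of dL/dW with the mask r held fixed.
eps, num_gW = 1e-6, np.zeros_like(W)
for i in range(d_in):
    for j in range(d_out):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        num_gW[i, j] = (shakeout_fc_forward(x, Wp, b, r, tau, c)
                        - shakeout_fc_forward(x, Wm, b, r, tau, c)) @ g / (2 * eps)
print(np.max(np.abs(num_gW - gW)))  # tiny: away from w = 0 the rule is exact
```

Away from w = 0 the sign term is locally constant, so approximating its derivative by 0 introduces no error there; the approximation only matters exactly at zero.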
Note that the forward-backward computations with Shakeout can be easily extended to the convolutional layer. For a convolutional layer, the Shakeout feed-forward process can be formalized as
U_j = Σ_i [ (W_ij + c τ S_ij) * (R_i ⊙ X_i) / (1 − τ) − c S_ij * ((1 − R_i) ⊙ X_i) ] + b_j,
where X_i represents the i-th input feature map. R_i is the i-th random mask, whose entries are drawn from Bernoulli(1 − τ) and which has the same spatial structure (i.e. the same height and width) as the corresponding feature map X_i. W_ij denotes the kernel connecting X_i and U_j, and S_ij is set as sgn(W_ij). The symbol * denotes the convolution operation, and the symbol ⊙ means element-wise product.
Correspondingly, during the back-propagation process, the gradient with respect to a unit of the layer on which Shakeout is applied takes the form
∂U_j(q) / ∂X_i(p) = (R_i(p) / (1 − τ)) (W_ij(k) + c τ S_ij(k)) − (1 − R_i(p)) c S_ij(k),
where q means the position of a unit in the output feature map of the layer, k represents the position of a weight in the corresponding kernel, and p is the position of the input unit they connect.
The weights are updated following the chain rule, with the derivative of sgn again approximated by 0, so that ∂U_j / ∂W_ij(k) only involves the scaled masked inputs (R_i ⊙ X_i) / (1 − τ).
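One way to realize the per-location weight perturbation with ordinary convolutions is sketched below (our numpy illustration; the layout and names are assumptions, not the released implementation). Each masked-input term carries one of the two weight modifications, and averaging over many masks should recover the plain convolution, since the perturbation is unbiased:

```python
import numpy as np

rng = np.random.default_rng(2)

def conv2d_valid(X, K):
    """Plain 'valid' 2-D cross-correlation, written with shifted slices."""
    kh, kw = K.shape
    H, W = X.shape[0] - kh + 1, X.shape[1] - kw + 1
    out = np.zeros((H, W))
    for a in range(kh):
        for b in range(kw):
            out += K[a, b] * X[a:a + H, b:b + W]
    return out

def shakeout_conv_forward(X, K, R, tau, c):
    """One input/output map pair.  R is an element-wise Bernoulli(1 - tau)
    mask with the same height/width as X: every input location carries its
    own switch, so the effective kernel varies across locations."""
    S = np.sign(K)
    return (conv2d_valid(R * X, K + c * tau * S) / (1 - tau)
            - conv2d_valid((1 - R) * X, c * S))

X = rng.uniform(0, 1, size=(8, 8))
K = rng.normal(size=(3, 3))
tau, c, n = 0.25, 0.3, 8000

acc = np.zeros((6, 6))
for _ in range(n):
    R = rng.binomial(1, 1 - tau, size=X.shape).astype(float)
    acc += shakeout_conv_forward(X, K, R, tau, c)
print(np.max(np.abs(acc / n - conv2d_valid(X, K))))  # small: unbiased in expectation
```

In expectation, E[R] = 1 − τ, so the first term yields conv(X, K + cτS) and the second subtracts cτ conv(X, S), leaving exactly conv(X, K).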
In this section, we report empirical evaluations of Shakeout in training deep neural networks on representative datasets. The experiments are performed on three image datasets: the hand-written digit dataset MNIST, the CIFAR-10 image dataset, and the ImageNet-2012 dataset. MNIST consists of 60,000+10,000 (training+test) 28×28 images of hand-written digits. CIFAR-10 contains 50,000+10,000 (training+test) 32×32 images of 10 object classes. ImageNet-2012 consists of 1,281,167+50,000+150,000 (training+validation+test) variable-resolution images of 1000 object classes. We first demonstrate that Shakeout leads to sparse models, as our theoretical analysis implies, under the unsupervised setting. Then we show that for the classification task, the sparse models have desirable generalization performance. Further, we illustrate the regularization effect of Shakeout on the weights in the classification task. Moreover, the effect of Shakeout on stabilizing the training processes of the deep architectures is demonstrated. Finally, we give some practical recommendations for Shakeout. All the experiments are implemented based on modifications of the Caffe library. Our code is released on GitHub: https://github.com/kgl-prml/shakeout-for-caffe.
Since Shakeout implicitly imposes L_0 and L_1 penalties on the weights, we expect the weights of neural networks learned by Shakeout to contain more zeros than those learned by standard back-propagation (BP) or Dropout. In this experiment, we employ an autoencoder model for the MNIST hand-written data, train the model using standard BP, Dropout and Shakeout, respectively, and compare the degree of sparsity of the weights of the learned encoders. For the purpose of demonstration, we employ a simple autoencoder with one hidden layer of 256 units; Dropout and Shakeout are applied on the input pixels.
To verify the regularization effect, we compare the weights of the four autoencoders trained under different settings, which correspond to standard BP, Dropout and Shakeout. All the training methods aim to produce hidden units that capture good visual features of the handwritten digits. The statistical traits of the resulting weights are shown in Fig. 4. Moreover, Fig. 5 shows the features captured by each hidden unit of the autoencoders.
As shown in Fig. 4, the probability density of the weights around zero obtained by standard BP training is quite small compared to that obtained by either Dropout or Shakeout. This indicates the strong regularization effect induced by Dropout and Shakeout. Furthermore, the sparsity level of the weights obtained by training with Shakeout is much higher than that obtained with Dropout. For the same τ, increasing c makes the weights much sparser, which is consistent with the characteristics of the L_0 and L_1 penalties induced by Shakeout. Intuitively, due to the induced regularization, the distribution of the weights obtained with Dropout resembles a Gaussian, while the one obtained with Shakeout is more like a Laplacian because of the additionally induced L_1 regularization. Fig. 5 shows that the features captured by the hidden units via standard BP training are not directly interpretable, corresponding to insignificant variants in the training data. Both Dropout and Shakeout suppress irrelevant weights through their regularization effects, and Shakeout produces much sparser and more global features thanks to the combination of L_0, L_1 and L_2 regularization terms.
The autoencoder trained by Dropout or Shakeout can be viewed as a denoising autoencoder, where Dropout or Shakeout injects a special kind of noise into the inputs. Under this unsupervised setting, the denoising criterion (i.e. minimizing the error between the images reconstructed from the noisy inputs and the real noise-free images) guides the learning of useful high-level feature representations [11, 12]. To verify that Shakeout helps learn better feature representations, we adopt the hidden-layer activations as features to train SVM classifiers; the classification accuracies on the test set for standard BP, Dropout and Shakeout are 95.34%, 96.41% and 96.48%, respectively. We can see that Shakeout leads to much sparser weights without defeating the main objective.
Gaussian Dropout has a similar effect on model training as standard Dropout: it multiplies the activation of each unit by a Gaussian variable with mean 1 and variance σ². The relationship between σ² and the dropping rate τ is σ² = τ/(1 − τ). The distribution of the weights trained by Gaussian Dropout is illustrated in Fig. 4. From Fig. 4, we find no notable statistical difference between the two kinds of Dropout implementations, which both exhibit a similar regularization effect on the weights. The classification performances of SVM classifiers on the test set, based on the hidden-layer activations as extracted features, are also quite similar for the two implementations. Given these observations, we conduct the following classification experiments using standard Dropout as the representative implementation of Dropout for comparison.
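The stated correspondence between the two noise models can be sanity-checked directly; the snippet below (an illustration, not from the paper) draws both multipliers and compares their first two moments:

```python
import numpy as np

rng = np.random.default_rng(3)
tau = 0.4                    # dropping rate of standard Dropout
sigma2 = tau / (1 - tau)     # matching Gaussian Dropout variance

# Standard Dropout keeps a unit with prob 1 - tau and rescales by 1/(1 - tau):
m = rng.binomial(1, 1 - tau, size=1_000_000) / (1 - tau)
# Gaussian Dropout multiplier with the same mean and variance:
g = rng.normal(1.0, np.sqrt(sigma2), size=1_000_000)

print(m.mean(), m.var())  # both close to 1 and tau/(1-tau)
print(g.mean(), g.var())  # both close to 1 and tau/(1-tau)
```

Both multipliers are unbiased with identical variance, which is why the two implementations behave so similarly in Fig. 4.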
Sparse models often indicate lower complexity and better generalization performance [53, 26, 39, 54]. To verify the effect of the L_0 and L_1 regularization terms induced by Shakeout on the model performance, we apply Shakeout, along with Dropout and standard BP, to training representative deep neural networks for classification tasks. In all of our classification experiments, the hyper-parameters τ and c of Shakeout, and the hyper-parameter τ of Dropout, are determined by validation.
We train two different neural networks: a shallow fully-connected one and a deep convolutional one. For the fully-connected neural network, a large hidden layer of 4096 units is adopted. The non-linear activation adopted is the rectified linear unit (ReLU). The deep convolutional neural network employed is based on modifications of LeNet, and contains two convolutional layers and two fully-connected layers. The detailed architecture of this convolutional neural network is described in Tab. I.
Dropout and Shakeout are applied on the hidden units of the fully-connected layer. The table compares the errors of the networks trained by standard back-propagation, Dropout and Shakeout. The mean and standard deviation of the classification errors are obtained over 5 runs of the experiment and are shown in percentages. We can see from the results that when the training data are not sufficient, all the models perform worse due to over-fitting. However, the models trained by Dropout and Shakeout consistently perform better than the one trained by standard BP. Moreover, when the training data are scarce, Shakeout leads to superior model performance compared to Dropout. Fig. 6 shows the results in a more intuitive way.
We use the simple convolutional network feature extractor described in cuda-convnet (layers-80sec.cfg). We apply Dropout and Shakeout on the first fully-connected layer. We call this architecture "AlexFastNet" for convenience of description. In this experiment, 10,000 colour images are separated from the training dataset for validation and no data augmentation is utilized. The per-pixel mean computed over the training set is subtracted from each image. We first train for 100 epochs with an initial learning rate of 0.001 and then for another 50 epochs with a learning rate of 0.0001. The mean and standard deviation of the classification errors are obtained over 5 runs of the experiment and are shown in percentages.
As shown in Tab. IV, the performance of the models trained by Dropout and Shakeout is consistently superior to that of the model trained by standard BP. Furthermore, the model trained by Shakeout also outperforms the one trained by Dropout when the training data are scarce. Fig. 7 shows the results in a more intuitive way.
To test the performance of Shakeout on a much deeper architecture, we also conduct experiments based on the Wide Residual Network (WRN). The configuration adopted is WRN-16-4, i.e. the WRN has 16 layers in total and the number of feature maps of the convolutional layer in each residual block is 4 times that of the corresponding original one. Because its complexity is much higher than that of "AlexFastNet", the experiments are performed on relatively larger training sets with sizes of 15,000, 40,000 and 50,000. Dropout and Shakeout are applied on the second convolutional layer of each residual block, following the established protocol. All the training starts from the same initial weights. Batch Normalization is applied in the same way to promote the optimization. No data augmentation or data pre-processing is adopted. All the hyper-parameters other than τ and c are kept the same. The results are listed in Tab. V. For the training of CIFAR-10 with 50,000 training samples, we adopt the same hyper-parameters as those chosen for the training set of size 40,000. From Tab. V, we arrive at the same conclusion as in the previous experiments, i.e. the performances of the models trained by Dropout and Shakeout are consistently superior to that of the model trained by standard BP. Moreover, Shakeout outperforms Dropout when the data are scarce.
Shakeout regularizes the training process of deep neural networks in a different way from Dropout. For a GLM, we have proved that the regularizer induced by Shakeout adaptively combines L_0, L_1 and L_2 regularization terms. In Section 4.1, we have demonstrated that for a one-hidden-layer autoencoder, it leads to much sparser weights of the model. In this section, we illustrate the regularization effect of Shakeout on the weights in the classification task and compare it to that of Dropout.
The results shown in this section are mainly based on experiments conducted on the ImageNet-2012 dataset using the representative deep architecture AlexNet. For AlexNet, we apply Dropout or Shakeout on layers FC7 and FC8, the last two fully-connected layers. We train the model from scratch and obtain comparable classification performances on the validation set for Shakeout (top-1 error: 42.88%; top-5 error: 19.85%) and Dropout (top-1 error: 42.99%; top-5 error: 19.60%). The model is trained with the same hyper-parameter settings provided by Shelhamer in Caffe, apart from the hyper-parameters τ and c for Shakeout. The initial weights for training by Dropout and Shakeout are kept the same.
Fig. 8 illustrates the distributions of the weight magnitudes resulting from Shakeout and Dropout. It can be seen that the weights learned by Shakeout are much sparser than those learned by Dropout, due to the implicitly induced L_0 and L_1 components.
The regularizer induced by Shakeout contains not only the L_0 and L_1 regularization terms but also the L_2 regularization term, a combination that is expected to discard a group of weights simultaneously. In Fig. 9, we use the maximum magnitude of the weights connected to an input unit of a layer to represent the importance of that unit for the subsequent output units. From Fig. 9, it can be seen that for Shakeout, the units can be approximately separated into two groups according to the maximum magnitudes of the connected weights, and the group around zero can be discarded, whereas for Dropout, the units are concentrated. This implies that, compared to Dropout, which may encourage a "distributed code" for the features captured by the units of a layer, Shakeout tends to discard the useless features (or units) and reward the important ones. This experimental result further verifies the regularization properties of Shakeout and Dropout.
As is well known, L_0 and L_1 regularization terms are related to feature selection [56, 57]. For a deep architecture, we therefore expect the weights obtained by Shakeout to be suitable for reflecting the importance of the connections between units. We perform the following experiment to verify this effect. After a model is trained, for the layer on which Dropout or Shakeout is applied, we sort the weights by magnitude in increasing order. We then prune the first fraction η of the sorted weights and evaluate the performance of the pruned model again, with the pruning ratio η going from 0 to 1. We calculate the relative accuracy loss (written Δacc for simplicity) at each pruning ratio as Δacc(η) = (acc(0) − acc(η)) / acc(0), where acc(η) denotes the accuracy of the model with a fraction η of the weights pruned.
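The pruning-and-evaluation procedure can be sketched as follows. Here `prune_smallest` and `relative_accuracy_loss` are our own helper names, and the actual validation run on the pruned network is left abstract; the loss is taken as (acc(0) − acc(η)) / acc(0):

```python
import numpy as np

def prune_smallest(W, ratio):
    """Zero out the fraction `ratio` of weights with the smallest
    magnitudes. (Ties at the threshold magnitude are pruned together.)"""
    flat = np.abs(W).ravel()
    k = int(ratio * flat.size)
    if k == 0:
        return W.copy()
    threshold = np.sort(flat)[k - 1]
    return np.where(np.abs(W) <= threshold, 0.0, W)

def relative_accuracy_loss(acc_full, acc_pruned):
    """(acc(0) - acc(eta)) / acc(0): the quantity plotted against
    the pruning ratio in Fig. 10."""
    return (acc_full - acc_pruned) / acc_full
```

Sweeping the ratio from 0 to 1, pruning the layer's weights, re-evaluating on the validation set, and plotting the relative accuracy loss reproduces the kind of curve shown in Fig. 10.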
Fig. 10 shows the relative-accuracy-loss curves for Dropout and Shakeout based on the AlexNet model on the ImageNet-2012 dataset, with both models trained under their optimal hyper-parameter settings. The relative accuracy loss for Dropout is clearly more severe than that for Shakeout, with the margin between the two largest at a high weight-pruning ratio (see Fig. 10). This result shows that, with regard to how well the trained weights reflect the importance of connections, Shakeout is much better than Dropout, which it owes to the implicitly induced L_0 and L_1 regularization effects.
This property is useful for the popular model-compression task in deep learning, which aims to cut connections or remove units of a deep neural network to the maximum extent without an obvious loss of accuracy. The above experiments illustrate that Shakeout can play a considerable role in selecting important connections, which is valuable for improving the performance of compression methods. This is a potential subject for future research.
In both research and production, it is always desirable to have a level of certainty about how a model’s fitness to the data improves over optimization iterations, namely, to have a stable training process. In this section, we show that Shakeout helps reduce fluctuation in the improvement of model fitness during training.
The first experiment is on the family of Generative Adversarial Networks (GANs), which are known to be unstable in the training stage [59, 60, 61]. The purpose of the following tests is to demonstrate Shakeout's capability of stabilizing the training process of neural networks in a general sense. A GAN plays a min-max game between the generator G and the discriminator D over the expected log-likelihood of real data x and imaginary data G(z), min_G max_D E_{x∼p_data}[log D(x)] + E_{z∼p_z}[log(1 − D(G(z)))], where z represents the random input to the generator.
The architecture that we adopt is DCGAN. The numbers of feature maps of the deconvolutional layers in the generator are 1024, 64 and 1 respectively, with corresponding spatial sizes 7×7, 14×14 and 28×28. We train DCGANs on the MNIST dataset using standard BP, Dropout and Shakeout, following the same experimental protocol except that Dropout or Shakeout is applied on all layers of the discriminator. The values of the training objective during training are illustrated in Fig. 11. The objective oscillates greatly during training by standard BP, while for Dropout and Shakeout the training processes are much steadier; compared with Dropout, the training process with Shakeout has fewer spikes and is smoother. Fig. 12 shows the minimum and maximum values of the objective within fixed-length intervals moving from the start to the end of training by standard BP, Dropout and Shakeout. The gaps between the minimum and maximum values for Dropout and Shakeout are much smaller than for standard BP, and the gap for Shakeout is the smallest, which implies that the training process with Shakeout is the most stable.
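The interval statistic compared in Fig. 12 can be computed with a simple sliding window; a sketch, assuming (since the text does not spell it out) that the window slides one iteration at a time over the logged objective values:

```python
import numpy as np

def window_gaps(values, window=50):
    """Max minus min of the monitored training objective within each
    fixed-length window sliding over the iterations; smaller gaps mean
    a steadier training curve."""
    values = np.asarray(values, dtype=float)
    n = values.size - window + 1
    return np.array([values[i:i + window].max() - values[i:i + window].min()
                     for i in range(n)])

# A noisy training curve has larger window gaps than a smooth one.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 500)
smooth = np.exp(-3 * t)                              # idealized steady descent
noisy = smooth + 0.5 * rng.standard_normal(500)      # oscillating descent
print(window_gaps(noisy).mean() > window_gaps(smooth).mean())  # True
```

Plotting the per-window minima and maxima of the three training runs side by side gives exactly the kind of band comparison shown in Fig. 12.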
The second experiment performs the classification task with the Wide Residual Network architecture. In the classification task, generalization performance is the main focus, and we therefore compare the validation errors during the training processes of Dropout and Shakeout. Fig. 13 shows the validation error as a function of the training epoch for Dropout and Shakeout on CIFAR-10 with 40000 training examples. The architecture adopted is WRN-16-4, and the experimental settings are the same as those described in Section 4.2.2. Since generalization performance is the concern, the learning-rate schedule adopted is the one optimized through validation so that the models obtain their best generalization performance. Under this schedule, we find that the validation error temporarily increases when the learning rate is lowered at the early stage of training, a phenomenon that has been repeatedly observed. Nevertheless, Fig. 13 shows that the extent of the error increase is less severe for Shakeout than for Dropout, and Shakeout recovers much faster. At the final stage, both validation errors decrease steadily, and Shakeout obtains generalization performance comparable or even superior to Dropout. In short, Shakeout significantly stabilizes the entire training process while retaining superior generalization performance.
Selection of Hyper-parameters The most practical and popular way to select hyper-parameters is to partition the training data into a training set and a validation set and to evaluate the classification performance of different hyper-parameter settings on the latter. Because training a deep neural network is time-consuming, cross-validation is rarely adopted. Many hyper-parameter selection methods exist in the deep learning domain, such as grid search, random search, Bayesian optimization methods and gradient-based hyper-parameter optimization. For applying Shakeout to a deep neural network, we need to decide two hyper-parameters, τ and c. From the regularization perspective, we need to choose the most suitable strength of the regularization effect to obtain an optimal trade-off between model bias and variance. We have pointed out that, in a unified framework, Dropout is the special case of Shakeout in which the hyper-parameter c is set to zero. Empirically, we find that the optimal τ for Shakeout is not higher than that for Dropout. After determining the optimal τ, choosing the order of magnitude of the hyper-parameter c according to the number of training samples N is an effective choice. To obtain a model with much sparser weights but with generalization performance superior or comparable to Dropout, a relatively lower τ and larger c for Shakeout always works.
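The validation-split selection described above amounts to a grid search over (τ, c). In the sketch below, `train_and_eval` is a placeholder for the expensive run that trains the network on the training split and returns validation accuracy; the accuracy surface in the toy stand-in is fabricated purely to make the sketch runnable:

```python
import itertools
import numpy as np

def select_shakeout_hyperparams(train_and_eval, taus, cs):
    """Pick (tau, c) by validation accuracy. `train_and_eval(tau, c)`
    is assumed to train the network on the training split and return
    accuracy on a held-out validation split (cross-validation is
    usually too expensive for deep networks)."""
    return max(itertools.product(taus, cs),
               key=lambda tc: train_and_eval(*tc))

# Toy stand-in for the expensive training run: a fabricated accuracy
# surface peaking at tau = 0.5, c = 0.01 (purely illustrative).
def fake_train_and_eval(tau, c):
    return 1.0 - (tau - 0.5) ** 2 - 0.01 * (np.log10(c) + 2) ** 2

best = select_shakeout_hyperparams(fake_train_and_eval,
                                   taus=[0.3, 0.5, 0.7],
                                   cs=[1e-3, 1e-2, 1e-1])
print(best)  # (0.5, 0.01)
```

In practice the grid for c is logarithmic, and since the optimal τ for Shakeout is empirically no higher than Dropout's drop rate, the Dropout setting gives a useful upper bound for the τ grid.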
Shakeout combined with Batch Normalization Batch Normalization is a widely adopted technique for easing the optimization of deep neural networks. In practice, combining Shakeout with Batch Normalization to train a deep architecture is a good choice. For example, we observe that training the WRN-16-4 model on CIFAR-10 is slow to converge without Batch Normalization. Moreover, the generalization performance on the test set for Shakeout combined with Batch Normalization consistently outperforms that of standard BP with Batch Normalization by quite a large margin, as illustrated in Tab. V. These results highlight the important role of Shakeout in reducing the over-fitting of a deep neural network.
We have proposed Shakeout, a new regularized training approach for deep neural networks. The regularizer induced by Shakeout is proved to adaptively combine L_0, L_1 and L_2 regularization terms. Empirically, we find that
1) Compared to Dropout, Shakeout can afford much larger models; in other words, when data is scarce, Shakeout outperforms Dropout by a large margin.
2) Shakeout can obtain much sparser weights than Dropout with superior or comparable generalization performance. For Dropout to reach the same level of sparsity, the model may bear a significant loss of accuracy.
3) Some deep architectures, such as GANs, are inherently prone to unstable training; Shakeout can reduce this instability effectively.
In the future, we will put emphasis on the inductive bias of Shakeout and attempt to apply it to the model-compression task.
This research is supported by Australian Research Council Projects (No. FT-130101457, DP-140102164 and LP-150100671).
European Conference on Computer Vision. Springer, 2016, pp. 630–645.
C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning,” in Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, 2017, pp. 4278–4284.
P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proceedings of the 25th International Conference on Machine Learning. ACM, 2008, pp. 1096–1103.
N. Chen, J. Zhu, J. Chen, and B. Zhang, “Dropout training for support vector machines,” in Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.
——, “Sparse feature learning for deep belief networks,” in Advances in Neural Information Processing Systems, 2008, pp. 1185–1192.
M. Thom and G. Palm, “Sparse activity and sparse connectivity in supervised learning,” Journal of Machine Learning Research, vol. 14, no. Apr, pp. 1091–1143, 2013.
D. Maclaurin, D. Duvenaud, and R. P. Adams, “Gradient-based hyperparameter optimization through reversible learning,” in Proceedings of the 32nd International Conference on Machine Learning, 2015.