Shakeout: A New Approach to Regularized Deep Neural Network Training

Guoliang Kang, et al. — University of Technology Sydney; The University of Sydney

Recent years have witnessed the success of deep neural networks in dealing with a variety of practical problems. Dropout has played an essential role in many successful deep neural networks by inducing regularization in model training. In this paper, we present a new regularized training approach: Shakeout. Instead of randomly discarding units as Dropout does at the training stage, Shakeout randomly chooses to enhance or reverse each unit's contribution to the next layer. This minor modification of Dropout has an appealing statistical trait: the regularizer induced by Shakeout adaptively combines L_0, L_1 and L_2 regularization terms. Our classification experiments with representative deep architectures on the image datasets MNIST, CIFAR-10 and ImageNet show that Shakeout deals with over-fitting effectively and outperforms Dropout. We empirically demonstrate that Shakeout leads to sparser weights under both unsupervised and supervised settings. Shakeout also leads to a grouping effect of the input units in a layer. Considering how well the weights reflect the importance of connections, Shakeout is superior to Dropout, which is valuable for deep model compression. Moreover, we demonstrate that Shakeout can effectively reduce the instability of the training process of deep architectures.


1 Introduction

Deep neural networks have recently achieved impressive success in a number of machine learning and pattern recognition tasks and have been under intensive research [1, 2, 3, 4, 5, 6, 7]. Hierarchical neural networks have been known for decades, and a number of essential factors have contributed to their recent rise, such as the availability of big data and powerful computational resources. However, arguably the most important contributor to the success of deep neural networks is the discovery of efficient training approaches [8, 9, 10, 11, 12].

A particularly interesting advance in the training techniques is the invention of Dropout [13]. At the operational level, Dropout adjusts the network evaluation step (feed-forward) at the training stage, where a portion of units are randomly discarded. The effect of this simple trick is impressive: Dropout enhances the generalization performance of neural networks considerably and is behind many record-holders of widely recognized benchmarks [14, 2, 15]. This success has attracted much research attention, and Dropout has found applications in a wider range of problems [16, 17, 18]. Theoretical research from the viewpoint of statistical learning has pointed out the connections between Dropout and model regularization, which is the de facto recipe for reducing over-fitting of complex models in practical machine learning. For example, Wager et al. [16] showed that for a generalized linear model (GLM), Dropout implicitly imposes an adaptive L_2 regularizer on the network weights through an estimation of the inverse diagonal Fisher information matrix.

Sparsity is of vital importance in deep learning. By removing unimportant weights, deep neural networks can perform prediction faster; in addition, better generalization performance is expected, with fewer examples needed at the training stage [19]. Recently, much evidence has shown that the accuracy of a trained deep neural network is not severely affected by removing a majority of its connections, and many researchers have focused on the deep model compression task [20, 21, 22, 23, 24, 25]. One effective way of compression is to iteratively train a neural network, prune the connections and fine-tune the weights [21, 22]. However, if we can cut the connections naturally by imposing sparsity-inducing penalties during the training of a deep neural network, the work-flow is greatly simplified.

In this paper, we propose a new regularized deep neural network training approach, Shakeout, which is easy to implement: randomly choose to enhance or reverse each unit's contribution to the next layer at the training stage. Note that Dropout can be considered as a special "flat" case of our approach: randomly keeping (the enhance factor reduces to the usual 1/(1−τ) rescaling, with τ the drop rate) or discarding (the reverse factor reduces to 0) each unit's contribution to the next layer. Shakeout enriches the regularization effect. In theory, we prove that it adaptively combines L_0, L_1 and L_2 regularization terms. The L_0 and L_1 regularization terms are known as sparsity-inducing penalties, and the combination of a sparsity-inducing penalty with an L_2 penalty on the model parameters has been shown to be effective in statistical learning: the Elastic Net [26] has the desirable properties of producing sparse models while maintaining the grouping effect of the model weights. Because of the random "shaking" process and the regularization characteristic of pushing network weights towards zero, our new approach is named "Shakeout".

As discussed above, much sparser weights are expected from Shakeout than from Dropout, because of the L_0 and L_1 regularization terms induced in the training stage. We apply Shakeout to a one-hidden-layer autoencoder and obtain much sparser weights than those resulting from Dropout. To show the regularization effect on classification tasks, we conduct experiments on the image datasets MNIST, CIFAR-10 and ImageNet with representative deep neural network architectures. In our experiments we find that deep neural networks trained with Shakeout consistently outperform those trained with Dropout, especially when the data is scarce. Besides the fact that Shakeout leads to much sparser weights, we also empirically find that it leads to a grouping effect of the input units of a layer. Due to the induced L_0 and L_1 regularization terms, Shakeout results in weights that reflect the importance of the connections between units, which is meaningful for model compression. Moreover, we demonstrate that Shakeout can effectively reduce the instability of the training process of deep architectures.

This journal paper extends our previous work [27] theoretically and experimentally. The main extensions are as follows: 1) we derive the analytical formula for the regularizer induced by Shakeout in the context of GLMs and prove several of its important properties; 2) we conduct experiments using the Wide Residual Network [15] on CIFAR-10 to show that Shakeout outperforms Dropout and standard back-propagation in promoting the generalization performance of a much deeper architecture; 3) we conduct experiments using AlexNet [14] on the ImageNet dataset with Shakeout and Dropout, where Shakeout obtains classification performance comparable to Dropout but with a superior regularization effect; 4) we illustrate that Shakeout can effectively reduce the instability of the training process of deep architectures. Moreover, we provide a much clearer and more detailed description of Shakeout, derive the forward-backward update rules for deep convolutional neural networks with Shakeout, and give several recommendations to help practitioners make full use of Shakeout.

In the rest of the paper, we review the related work in Section 2. Section 3 presents Shakeout in detail, along with a theoretical analysis of the regularization effect it induces. In Section 4, we first demonstrate the regularization effect of Shakeout on an autoencoder model. Classification experiments on MNIST, CIFAR-10 and ImageNet then illustrate that Shakeout outperforms Dropout in terms of generalization performance, the regularization effect on the weights, and the stabilization of the training process of deep architectures. Finally, we give some recommendations to help practitioners make full use of Shakeout.

2 Related Work

Deep neural networks have shown their success in a wide variety of applications. One of the key factors contributing to this success is the creation of powerful training techniques. The representative power of a network becomes stronger as the architecture gets deeper [9]; however, millions of parameters make deep neural networks prone to over-fitting. Regularization [28, 16] is an effective way to obtain a model that generalizes well. There exist many approaches to regularize the training of deep neural networks, such as weight decay [29] and early stopping [30]. Shakeout belongs to this family of regularized training techniques.

Among these regularization techniques, our work is most closely related to Dropout [13]. Many subsequent works have been devised to improve the performance of Dropout [31, 32, 33]. The underlying reason why Dropout improves performance has also attracted the interest of many researchers. Evidence has shown that Dropout may work because of its good approximation to model averaging and its regularization of the network weights [34, 35, 36]. Srivastava et al. [34] and Warde-Farley et al. [35] showed through experiments that the weight-scaling approximation is an accurate alternative to the geometric mean over all possible sub-networks. Gal et al. [37] claimed that training a deep neural network with Dropout is equivalent to performing variational inference in a deep Gaussian process. Dropout can also be regarded as a way of adding noise into the neural network. By marginalizing the noise, Srivastava et al. [34] proved for linear regression that the deterministic version of Dropout is equivalent to adding an adaptive L_2 regularization on the weights. Furthermore, Wager et al. [16] extended this conclusion to generalized linear models (GLMs) using a quadratic approximation to the induced regularizer. The inductive bias of Dropout was studied by Helmbold et al. [38] to further illustrate the properties of the regularizer induced by Dropout. In terms of implicitly inducing a regularizer on the network weights, Shakeout can be viewed as a generalization of Dropout: it enriches the regularization effect of Dropout, i.e. besides the L_2 regularization term, it also induces L_0 and L_1 regularization terms, which may lead to sparse model weights.

Due to the implicitly induced L_0 and L_1 regularization terms, Shakeout is also related to sparsity-inducing approaches. Olshausen et al. [39] introduced the concept of sparsity in computational neuroscience and proposed the sparse coding method for the visual system. In machine learning, a sparsity constraint enables a model to capture the implicit statistical structure of the data, performs feature selection and regularization, compresses the data at a low loss of accuracy, and helps us to better understand our models and explain the obtained results. Sparsity is one of the key factors underlying many successful deep neural network architectures [40, 41, 42, 2] and training algorithms [43, 44]. A convolutional neural network is much sparser than a fully-connected one, which results from the concept of the local receptive field [40], and sparsity has been a design principle and motivation for the Inception-series models [41, 42, 2]. Besides serving as a heuristic principle for designing deep architectures, sparsity often acts as a penalty that regularizes the training process of a deep neural network. Two kinds of sparsity penalties are used in deep neural networks, leading to activity sparsity [43, 44] and connectivity sparsity [45], respectively. The difference between Shakeout and these sparsity-inducing approaches is that for Shakeout, sparsity is induced through simple stochastic operations rather than manually designed architectures or explicit norm-based penalties. This implicit way enables Shakeout to be easily optimized by stochastic gradient descent (SGD), the representative approach for the optimization of a deep neural network.

3 Shakeout

Shakeout applies to the weights of a linear module. The linear module, i.e. the weighted sum

z = Σ_j w_j x_j,    (1)

is arguably the most widely adopted component of data models. For example, the variables x_1, x_2, …, x_n can be the input attributes of a model, e.g. the extracted features for a GLM, or the intermediate outputs of earlier processing steps, e.g. the activations of the hidden units in a multilayer artificial neural network. Shakeout randomly modifies the computation in Eq. (1). Specifically, Shakeout can be realized by randomly modifying the weights:

Step 1: Draw the random switch r_j from a Bernoulli distribution, where P(r_j = 0) = τ and τ ∈ [0, 1).

Step 2: Adjust the weight according to r_j:

(A) if r_j = 0 (with probability τ):     w̃_j = −c s(w_j);
(B) if r_j = 1 (with probability 1−τ):   w̃_j = w_j/(1−τ) + c τ s(w_j)/(1−τ),

where s(w_j) takes ±1 depending on the sign of w_j, or takes 0 if w_j = 0. As shown above, Shakeout chooses (randomly, by drawing r_j) between two fundamentally different ways of modifying the weights. Modification (A) sets the weights to a constant magnitude c, regardless of their original values except for the signs (which become opposite to the original ones). Modification (B) scales the weights by a factor 1/(1−τ) and adds a sign-dependent bias c τ s(w_j)/(1−τ). Note that both (A) and (B) preserve zero values of the weights, i.e. if w_j = 0 then w̃_j = 0 with probability 1. Taking the expectation over r_j shows that Shakeout leaves the weight unbiased, i.e. E_r[w̃_j] = w_j. The hyper-parameters τ and c configure the properties of Shakeout.
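To make the operation concrete, the following is a minimal NumPy sketch of the weight modification described above, assuming the enhance/reverse rule takes exactly the form given in Step 2 (an illustrative reading of the operation, not the authors' Caffe implementation). It also checks empirically that the modified weights are unbiased.

```python
import numpy as np

def shakeout_weights(w, tau=0.5, c=0.1, rng=None):
    """Randomly enhance or reverse each weight (illustrative sketch).

    Assumed rule, following Step 2 above:
      r = 0 (prob. tau):     w_tilde = -c * sign(w)                        (reverse)
      r = 1 (prob. 1 - tau): w_tilde = (w + c * tau * sign(w)) / (1 - tau) (enhance)
    """
    rng = rng or np.random.default_rng(0)
    r = rng.random(w.shape) >= tau        # r = 1 with probability 1 - tau
    s = np.sign(w)                        # s = 0 where w = 0, so zero weights stay zero
    return np.where(r, (w + c * tau * s) / (1.0 - tau), -c * s)

# Empirical check of unbiasedness: the average of many draws recovers w.
w = np.array([-0.8, -0.1, 0.0, 0.3, 1.2])
rng = np.random.default_rng(0)
draws = np.stack([shakeout_weights(w, rng=rng) for _ in range(20000)])
print(draws.mean(axis=0))                 # approximately equal to w
```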

Shakeout is naturally connected to the widely adopted Dropout operation [13, 34]. We will show that Shakeout has a regularization effect on model training similar to, but beyond, what is induced by Dropout. From an operational point of view, Fig. 1 compares Shakeout and Dropout. Note that Shakeout includes Dropout as a special case when the hyper-parameter c in Shakeout is set to zero.

Fig. 1: Comparison between the Shakeout and Dropout operations. This figure shows how Shakeout and Dropout are applied to the weights of a linear module. In the original linear module, the output is the summation of the inputs weighted by w_j, while for Dropout and Shakeout the weights are first randomly modified. In detail, a random switch r_j controls how each w_j is modified; the manipulation of w_j is illustrated within the amplifier icons (the red curves, best seen in colour). Each modified weight is obtained by scaling w_j and adding an offset determined by the sign of w_j, and the magnitudes of these two coefficients are determined by the Shakeout hyper-parameters τ and c. Dropout can be viewed as a special case of Shakeout when c = 0, because the sign-dependent offset is zero in that circumstance.

When applied at the training stage, Shakeout alters the objective, i.e. the quantity to be minimized, by adjusting the weights. In particular, we will show that Shakeout (in expectation over the random switch) induces a regularization term that effectively penalizes the magnitudes of the weights and leads to sparse weights. Shakeout is an approach designed to help model training; once the model is trained and deployed, the disturbance should be removed so that the model works at its full capacity, i.e. we adopt the resulting network without any modification of the weights at the test stage.

3.1 Regularization Effect of Shakeout

Shakeout randomly modifies the weights in a linear module, and thus can be regarded as injecting noise into each variable x_j, i.e. x_j's contribution is randomly rescaled before being summed. The modification is determined by the random switch r_j: Shakeout randomly chooses to enhance (when r_j = 1) or reverse (when r_j = 0) each original variable x_j's contribution to the output at the training stage (see Fig. 1). However, the expectation of this contribution over the noise remains unbiased, i.e. it equals w_j x_j.

It is well known that injecting artificial noise into the input features regularizes the training objective [16, 46, 47]: in expectation over the noise, the loss computed on the randomly modified feature vector x̃ equals the loss on the clean features plus a regularization term. The regularization term is determined by the characteristics of the noise. For example, Wager et al. [16] showed that Dropout, which corresponds to injecting blackout noise into the features, introduces an adaptive L_2 penalty on the weights. In this section we illustrate how Shakeout helps regularize the model parameters using the example of GLMs.

Formally, a GLM is a probabilistic model of predicting the target y given features x, in terms of the weighted sum in Eq. (1):

p(y | x; w) = h(y) exp( y w^T x − A(w^T x) ).    (2)

With different h and A functions, the GLM can be specialized to various useful models or modules, such as the logistic regression model or a layer in a feed-forward neural network. Roughly speaking, however, the essence of a GLM is similar to that of a standard linear model, which aims to find weights w so that w^T x aligns with y (the functions h and A are independent of w and y, respectively). The loss function of a GLM with respect to (x, y) is defined as

l(w; x, y) = −log p(y | x; w)    (3)
           = A(w^T x) − y w^T x.    (4)

The loss (3) is the negative logarithm of the probability (2), where we keep only the terms relevant to w.

Let the loss with Shakeout be

l(w; x̃, y) = A(w^T x̃) − y w^T x̃,    (5)

where x̃ represents the feature vector randomly modified according to the Shakeout noise r. Taking the expectation over r, the loss with Shakeout becomes

E_r[ l(w; x̃, y) ] = l(w; x, y) + π(w),

where

π(w) = E_r[ A(w^T x̃) ] − A(w^T x)    (6)

is named the Shakeout regularizer. Note that if A is n-times differentiable, we write A^(n) for its n-th order derivative to keep the notation simple.

Theorem 1

Let , and , then Shakeout regularizer is

(7)
Proof:

Note that , then for Eq. (6)

Because arbitrary two random variables

, are independent unless and , , then

Then

Further, let , , becomes

The theorem is proved.

We illustrate several properties of the Shakeout regularizer based on Eq. (7). The proofs of the following propositions can be found in the appendices.

Proposition 1

Proposition 2

If A is convex, then π(w) ≥ 0.

Proposition 3

Suppose c ≥ 0 and τ ∈ [0, 1). If A is convex, π(w) monotonically increases with τ, and it also monotonically increases with c.

Proposition 3 implies that the hyper-parameters τ and c relate to the strength of the regularization effect. This is reasonable because a higher τ or c means that the noise injected into the features has a larger variance.

Proposition 4

Suppose i) , , and ii) .

Then

i) if ,

ii) if ,

Proposition 4 implies that, under certain conditions, starting from a zero weight vector, the Shakeout regularizer penalizes the magnitude of w and its regularization effect is bounded by a constant value; the case of logistic regression is illustrated in Fig. 2. This bounded property has been shown to be useful: the capped norm [48] is more robust to outliers than the traditional L_1 or L_2 norm.

(a) Shakeout regularizer
(b) Dropout regularizer
Fig. 2: Regularization effect as a function of a single weight when the other weights are fixed to zero, for the logistic regression model. The corresponding feature is fixed at 1.

Based on Eq. (7), specific formulas can be derived for representative GLMs:

i) Linear regression: A(t) = t^2/2, and the Shakeout regularizer equals half the variance of the perturbed prediction, π(w) = (1/2) Σ_j x_j^2 Var(w̃_j), which can be decomposed into the summation of three components

π(w) = τ/(2(1−τ)) [ Σ_j x_j^2 w_j^2 + 2c Σ_j x_j^2 |w_j| + c^2 Σ_j x_j^2 1(w_j ≠ 0) ],    (8)

where 1(·) is an indicator function which satisfies 1(w_j ≠ 0) = 1 if w_j ≠ 0 and 0 otherwise. This decomposition implies that the Shakeout regularizer penalizes a combination of the L_2-norm, L_1-norm and L_0-norm of the weights, after scaling them with the squares of the corresponding features. The L_0 and L_1 regularization terms can lead to sparse weights.
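As a quick sanity check of this decomposition, the sketch below estimates the Shakeout regularizer for linear regression by Monte Carlo, i.e. the gap E_r[A(w^T x̃)] − A(w^T x) with A(t) = t^2/2, and compares it with the closed form of Eq. (8). The sketch assumes the modification rule reconstructed in Section 3; the numbers are only illustrative.

```python
import numpy as np

tau, c = 0.5, 0.1
rng = np.random.default_rng(0)
x = rng.normal(size=5)
w = np.array([-0.8, -0.1, 0.0, 0.3, 1.2])
s = np.sign(w)

# Monte-Carlo estimate of pi(w) = E_r[A(w_tilde . x)] - A(w . x), with A(t) = t^2 / 2
r = rng.random((200000, w.size)) >= tau                      # one row of switches per draw
w_tilde = np.where(r, (w + c * tau * s) / (1 - tau), -c * s)
pi_mc = np.mean(0.5 * (w_tilde @ x) ** 2) - 0.5 * (w @ x) ** 2

# Closed form of Eq. (8): x_j^2-scaled L2, L1 and L0 components
pi_cf = tau / (2 * (1 - tau)) * ((x**2 * w**2).sum()
                                 + 2 * c * (x**2 * np.abs(w)).sum()
                                 + c**2 * (x**2 * (w != 0)).sum())
print(pi_mc, pi_cf)   # the two estimates should agree closely
```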

ii) Logistic regression: A(t) = log(1 + e^t), and the corresponding regularizer is obtained by substituting this A into Eq. (7):

(9)

Fig. 3 illustrates the contours of the Shakeout regularizer based on Eq. (9) in the 2D weight space. On the whole, the contours indicate that the regularizer combines L_0, L_1 and L_2 regularization terms. As c goes to zero, the contours around zero become less sharp, which implies that the hyper-parameter c relates to the strength of the L_0 and L_1 components. When c = 0, Shakeout degenerates to Dropout, whose contours imply that the Dropout regularizer consists of an L_2 regularization term.

The difference between the Shakeout and Dropout regularizers is also illustrated in Fig. 2. The hyper-parameters of Shakeout and of Dropout are set so that the bounds of the regularization effects of the two regularizers are the same. In this one-dimensional case, the main difference is that at w = 0 (see the enlarged snapshot for comparison) the Shakeout regularizer is sharp and discontinuous, while the Dropout regularizer is smooth. Thus, compared to Dropout, Shakeout may lead to much sparser model weights.

To simplify the analysis and confirm the intuition observed in Fig. 3 about the properties of the Shakeout regularizer, we approximate the Shakeout regularizer of Eq. (7) quadratically by

π̂(w) = A''(w^T x) · τ/(2(1−τ)) [ Σ_j x_j^2 w_j^2 + 2c Σ_j x_j^2 |w_j| + c^2 Σ_j x_j^2 1(w_j ≠ 0) ].    (10)

The second factor, already shown in Eq. (8), consists of the combination of L_0, L_1 and L_2 regularization terms. It tends to penalize a weight whose corresponding feature has a large magnitude, while weights whose corresponding features are always zero are penalized less. The term A''(w^T x) is proportional to the variance of the prediction given x and w. Penalizing A''(w^T x) encourages the weights to move towards making the model more "confident" about its prediction, i.e. more discriminative.

Generally speaking, the Shakeout regularizer adaptively combines L_0, L_1 and L_2 regularization terms, which matches what we have observed in Fig. 3. It prefers penalizing weights that have large magnitudes and encourages the weights to move towards making the model more discriminative. Moreover, weights whose corresponding features are always zero are penalized less. The L_0 and L_1 components can induce sparse weights.

Last but not least, we want to emphasize that when τ = 0, the noise is eliminated and the model becomes a standard GLM. Moreover, Dropout can be viewed as the special case of Shakeout with c = 0, and a higher value of τ means a stronger regularization effect imposed on the weights. Generally, when τ is fixed (τ ≠ 0), a higher value of c means a stronger effect of the L_0 and L_1 components and leads to much sparser model weights. We verify this property in the experiment section.

Fig. 3: The contour plots of the regularization effect induced by Shakeout in the 2D weight space for a fixed input x. Note that Dropout is a special case of Shakeout with c = 0.

3.2 Shakeout in Multilayer Neural Networks

It has been illustrated that Shakeout regularizes the weights in linear modules. The linear module is the basic component of multilayer neural networks: linear operations connect the outputs of two successive layers. Thus Shakeout is readily applicable to the training of multilayer neural networks.

Considering the forward computation from layer l to layer l+1, for a fully-connected layer the Shakeout forward computation is as follows:

w̃_ji = r_i ( w_ji/(1−τ) + c τ s_ji/(1−τ) ) − (1 − r_i) c s_ji,    (11)

y_j^{l+1} = f( Σ_i w̃_ji y_i^l + b_j ),    (12)

where i denotes the index of an output unit of layer l, and j denotes the index of an output unit of layer l+1. The output of a unit is represented by y, the weight of the connection between unit i and unit j is represented by w_ji, the bias for the j-th unit is denoted by b_j, and s_ji is the sign of the corresponding weight w_ji. After the Shakeout operation, the linear combination z_j^{l+1} = Σ_i w̃_ji y_i^l + b_j is sent to the activation function f to obtain the corresponding output y_j^{l+1}. Note that all the weights that connect to the same input unit i are controlled by the same random variable r_i, drawn as in Step 1 of Section 3.
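The sketch below implements the forward pass of Eqs. (11)-(12) in NumPy, with one Bernoulli switch r_i shared by all weights leaving input unit i. Since the equations above are a reconstruction, this is an illustrative sketch rather than the authors' Caffe code; ReLU is used as an example activation.

```python
import numpy as np

def shakeout_fc_forward(y_in, W, b, tau, c, rng):
    """Fully-connected forward pass with Shakeout (sketch).

    y_in : (n_in,) activations of layer l
    W    : (n_out, n_in) weights;  b : (n_out,) biases
    One switch r_i per *input* unit, shared by all weights w_ji leaving it.
    """
    r = (rng.random(y_in.shape[0]) >= tau).astype(W.dtype)     # shape (n_in,)
    S = np.sign(W)
    W_tilde = r * (W / (1 - tau) + c * tau / (1 - tau) * S) - (1 - r) * c * S
    z = W_tilde @ y_in + b
    return np.maximum(z, 0.0), W_tilde, r                      # example ReLU activation

rng = np.random.default_rng(0)
y_in = rng.random(8)
W, b = rng.normal(size=(4, 8)), np.zeros(4)
y_out, W_tilde, r = shakeout_fc_forward(y_in, W, b, tau=0.5, c=0.1, rng=rng)
print(y_out)
```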

During back-propagation, we compute the gradients with respect to each unit to propagate the error. In Shakeout, the gradient with respect to an input unit takes the form

∂E/∂y_i^l = Σ_j (∂E/∂z_j^{l+1}) w̃_ji,    (13)

and the weights are updated following

∂E/∂w_ji = (∂E/∂z_j^{l+1}) y_i^l [ r_i/(1−τ) + ( r_i c τ/(1−τ) − (1 − r_i) c ) s'_ji ],    (14)

where s'_ji represents the derivative of the sgn function at w_ji. Because the sgn function is not continuous at zero, its derivative is not defined there; we approximate this derivative by zero. Empirically, we find that this approximation works well.
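A matching backward sketch under the same reconstructed rule, with the derivative of the sgn function approximated by zero as described above (the helper mirrors the hypothetical shakeout_fc_forward of the previous sketch):

```python
import numpy as np

def shakeout_fc_backward(dE_dz, y_in, W_tilde, r, tau):
    """Backward pass matching shakeout_fc_forward (sketch).

    dE_dz : (n_out,) gradient w.r.t. the pre-activation z of layer l+1.
    With d sgn(w)/dw approximated by 0, the factor multiplying each weight
    gradient reduces to r_i / (1 - tau).
    """
    dE_dy_in = W_tilde.T @ dE_dz                        # Eq. (13): uses the modified weights
    dE_dW = np.outer(dE_dz, y_in) * (r / (1 - tau))     # Eq. (14) with the zero approximation
    dE_db = dE_dz
    return dE_dy_in, dE_dW, dE_db

rng = np.random.default_rng(0)
dE_dz, y_in = rng.normal(size=4), rng.random(8)
W_tilde, r = rng.normal(size=(4, 8)), (rng.random(8) >= 0.5).astype(float)
print(shakeout_fc_backward(dE_dz, y_in, W_tilde, r, tau=0.5)[1].shape)   # (4, 8)
```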

Note that the forward-backward computations with Shakeout can be easily extended to the convolutional layer. For a convolutional layer, the Shakeout feed-forward process can be formalized as

Z_p^{l+1} = Σ_q [ (R_q ⊙ Y_q^l) * ( W_pq/(1−τ) + c τ S_pq/(1−τ) ) − ((1 − R_q) ⊙ Y_q^l) * ( c S_pq ) ] + b_p,    (15)

Y_p^{l+1} = f( Z_p^{l+1} ),    (16)

where Y_q^l represents the q-th feature map of layer l, and R_q is the q-th random mask, which has the same spatial structure (i.e. the same height and width) as the corresponding feature map Y_q^l. W_pq denotes the kernel connecting Y_q^l and Y_p^{l+1}, and S_pq is set as sgn(W_pq). The symbol * denotes the convolution operation, and the symbol ⊙ denotes the element-wise product.
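For the convolutional case, the sketch below applies the same idea with a per-location random mask on each input feature map, again under the reconstructed rule; SciPy's correlate2d stands in for the framework's convolution, and ReLU is an example non-linearity.

```python
import numpy as np
from scipy.signal import correlate2d

def shakeout_conv_forward(Y, K, b, tau, c, rng):
    """Convolutional forward pass with Shakeout (sketch, 'valid' correlation).

    Y : (Q, H, W) input feature maps;  K : (P, Q, kh, kw) kernels;  b : (P,)
    Each input map gets its own mask R_q with the same height/width, so the
    enhance/reverse choice is made per input location.
    """
    Q, H, W = Y.shape
    P, _, kh, kw = K.shape
    R = (rng.random((Q, H, W)) >= tau).astype(Y.dtype)
    Z = np.zeros((P, H - kh + 1, W - kw + 1)) + b[:, None, None]
    for p in range(P):
        for q in range(Q):
            S = np.sign(K[p, q])
            K_enh = K[p, q] / (1 - tau) + c * tau / (1 - tau) * S   # kernel seen by kept locations
            K_rev = -c * S                                          # kernel seen by reversed locations
            Z[p] += correlate2d(R[q] * Y[q], K_enh, mode="valid")
            Z[p] += correlate2d((1 - R[q]) * Y[q], K_rev, mode="valid")
    return np.maximum(Z, 0.0)

rng = np.random.default_rng(0)
Y, K = rng.random((3, 8, 8)), rng.normal(size=(2, 3, 3, 3))
print(shakeout_conv_forward(Y, K, np.zeros(2), tau=0.5, c=0.1, rng=rng).shape)   # (2, 6, 6)
```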

Correspondingly, during the back-propagation process, the gradient with respect to a unit of the layer on which Shakeout is applied takes the form

(17)

where one pair of indices denotes the position of a unit in the output feature map of a layer, and the other denotes the position of a weight within the corresponding kernel.

The weights are updated following

(18)

4 Experiments

In this section, we report empirical evaluations of Shakeout in training deep neural networks on representative datasets. The experiments are performed on three kinds of image datasets: the hand-written digit dataset MNIST [40], the CIFAR-10 image dataset [49] and the ImageNet-2012 dataset [50]. MNIST consists of 60,000+10,000 (training+test) 28×28 images of hand-written digits. CIFAR-10 contains 50,000+10,000 (training+test) 32×32 images of 10 object classes. ImageNet-2012 consists of 1,281,167+50,000+150,000 (training+validation+test) variable-resolution images of 1000 object classes. We first demonstrate that Shakeout leads to sparse models, as our theoretical analysis implies, under the unsupervised setting. Then we show that for the classification task, the sparse models have desirable generalization performance. Further, we illustrate the regularization effect of Shakeout on the weights in the classification task. Moreover, the effect of Shakeout on stabilizing the training processes of deep architectures is demonstrated. Finally, we give some practical recommendations for Shakeout. All the experiments are implemented based on modifications of the Caffe library [51]. Our code is released on GitHub: https://github.com/kgl-prml/shakeout-for-caffe.

4.1 Shakeout and Weight Sparsity

Since Shakeout implicitly imposes L_0 and L_1 penalties on the weights, we expect the weights of neural networks learned with Shakeout to contain more zeros than those learned by standard back-propagation (BP) [52] or Dropout [13]. In this experiment, we employ an autoencoder model for the MNIST hand-written data, train the model using standard BP, Dropout and Shakeout, respectively, and compare the degree of sparsity of the weights of the learned encoders. For the purpose of demonstration, we employ a simple autoencoder with one hidden layer of 256 units; Dropout and Shakeout are applied on the input pixels.

To verify the regularization effect, we compare the weights of four autoencoders trained under different settings, corresponding to standard BP, Dropout, and Shakeout with two hyper-parameter settings. All the training methods aim to produce hidden units that capture good visual features of the handwritten digits. The statistical traits of the resulting weights are shown in Fig. 4, and Fig. 5 shows the features captured by each hidden unit of the autoencoders.

As shown in Fig. 4, the probability density of weights around zero obtained by standard BP training is quite small compared to that obtained by either Dropout or Shakeout, which indicates the strong regularization effect induced by Dropout and Shakeout. Furthermore, the sparsity level of the weights obtained by training with Shakeout is much higher than that obtained by training with Dropout. With the same τ, increasing c makes the weights much sparser, which is consistent with the characteristics of the L_0 and L_1 penalties induced by Shakeout. Intuitively, due to the induced L_2 regularization, the distribution of the weights trained with Dropout resembles a Gaussian, while the one trained with Shakeout is more like a Laplacian because of the additionally induced L_1 regularization. Fig. 5 shows that the features captured by the hidden units via standard BP training are not directly interpretable, corresponding to insignificant variations in the training data. Both Dropout and Shakeout suppress irrelevant weights through their regularization effects, and Shakeout produces much sparser and more global features thanks to the combination of L_0, L_1 and L_2 regularization terms.

The autoencoder trained by Dropout or Shakeout can be viewed as a denoising autoencoder, where Dropout or Shakeout injects a special kind of noise into the inputs. Under this unsupervised setting, the denoising criterion (i.e. minimizing the error between the images reconstructed from the noisy inputs and the real noise-free images) guides the learning of useful high-level feature representations [11, 12]. To verify that Shakeout helps learn better feature representations, we adopt the hidden-layer activations as features to train SVM classifiers; the classification accuracies on the test set for standard BP, Dropout and Shakeout are 95.34%, 96.41% and 96.48%, respectively. We can see that Shakeout leads to much sparser weights without defeating the main objective.

Gaussian Dropout has a similar effect on model training to standard Dropout [34]: it multiplies the activation of each unit by a Gaussian variable with mean 1 and variance σ^2. The relationship between σ and the drop rate τ is σ = sqrt(τ/(1−τ)). The distribution of the weights trained by Gaussian Dropout is illustrated in Fig. 4, and we find no notable statistical difference between the two kinds of Dropout implementations, which both exhibit an L_2-type regularization effect on the weights. The classification performances of SVM classifiers on the test set, based on the hidden-layer activations as extracted features, are also quite similar for the two Dropout implementations. Given these observations, we conduct the following classification experiments using standard Dropout as the representative implementation of Dropout for comparison.
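For reference, here is a small sketch of the two Dropout variants compared in this subsection, assuming the variance matching σ^2 = τ/(1−τ) stated above; both keep the activations unbiased in expectation.

```python
import numpy as np

def bernoulli_dropout(h, tau, rng):
    """Standard (inverted) Dropout: drop with probability tau, rescale the survivors."""
    return h * ((rng.random(h.shape) >= tau) / (1.0 - tau))

def gaussian_dropout(h, tau, rng):
    """Multiplicative Gaussian noise with mean 1 and variance tau / (1 - tau)."""
    sigma = np.sqrt(tau / (1.0 - tau))
    return h * rng.normal(1.0, sigma, size=h.shape)

rng = np.random.default_rng(0)
h = rng.random(100000)
for noisy in (bernoulli_dropout(h, 0.5, rng), gaussian_dropout(h, 0.5, rng)):
    print(noisy.mean(), noisy.var())    # similar means and comparable noise levels
```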

Fig. 4: Distributions of the weights of the autoencoder models learned by different training approaches. Each curve shows the frequencies with which the weights of an autoencoder take particular values, i.e. the empirical population densities of the weights. The five curves correspond to five autoencoders learned by standard back-propagation, Dropout, Gaussian Dropout, and Shakeout under two hyper-parameter settings. The sparsity of the weights obtained via Shakeout can be seen by comparing the curves.
(a) standard BP
(b) Dropout
(c) Shakeout
Fig. 5: Features captured by the hidden units of the autoencoder models learned by different training methods. The features captured by a hidden unit are represented by a group of weights that connect the image pixels with this corresponding hidden unit. One image patch in a sub-graph corresponds to the features captured by one hidden unit.

4.2 Classification Experiments

Sparse models often indicate lower complexity and better generalization performance [53, 26, 39, 54]. To verify the effect of the L_0 and L_1 regularization terms induced by Shakeout on model performance, we apply Shakeout, along with Dropout and standard BP, to train representative deep neural networks for classification tasks. In all of our classification experiments, the hyper-parameters τ and c of Shakeout and the hyper-parameter τ of Dropout are determined by validation.

4.2.1 MNIST

We train two different neural networks: a shallow fully-connected one and a deep convolutional one. The fully-connected neural network has a single large hidden layer of 4096 units, with the rectified linear unit (ReLU) as the non-linear activation. The deep convolutional neural network is based on modifications of LeNet [40] and contains two convolutional layers and two fully-connected layers. The detailed architecture of this convolutional neural network is described in Tab. I.

Layer           1      2      3      4
Type            conv.  conv.  FC     FC
Channels        20     50     500    10
Filter size                   -      -
Conv. stride    1      1      -      -
Pooling type    max    max    -      -
Pooling size                  -      -
Pooling stride  2      2      -      -
Non-linear      ReLU   ReLU   ReLU   Softmax
TABLE I: The architecture of the convolutional neural network adopted for the MNIST classification experiment
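As a reading aid for Tab. I, a PyTorch-style sketch of this LeNet-like network is given below. The 5×5 convolution filters and 2×2 pooling windows are assumptions (those table cells are not specified above); the channel counts, strides and non-linearities follow the table.

```python
import torch.nn as nn

# Sketch of the Tab. I network for 28x28 MNIST inputs.
# ASSUMPTIONS: 5x5 conv filters and 2x2 pooling windows (not given in Tab. I);
# channels, strides and non-linearities follow the table.
lenet_like = nn.Sequential(
    nn.Conv2d(1, 20, kernel_size=5, stride=1),    # layer 1: conv, 20 channels
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(20, 50, kernel_size=5, stride=1),   # layer 2: conv, 50 channels
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Flatten(),
    nn.Linear(50 * 4 * 4, 500),                   # layer 3: FC, 500 units
    nn.ReLU(),
    nn.Linear(500, 10),                           # layer 4: FC, 10 classes
    nn.Softmax(dim=1),                            # softmax output as listed in Tab. I
)
```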

We separate 10,000 training samples from the original training set for validation. The results are shown in Tab. II and Tab. III. Dropout and Shakeout are applied on the hidden units of the fully-connected layer. The tables compare the errors of the networks trained by standard back-propagation, Dropout and Shakeout; the mean and standard deviation of the classification errors (in percent) are obtained from 5 runs of the experiment. We can see from the results that when the training data is not sufficient, all the models perform worse due to over-fitting. However, the models trained by Dropout and Shakeout consistently perform better than the one trained by standard BP. Moreover, when the training data is scarce, Shakeout leads to superior model performance compared to Dropout. Fig. 6 shows the results in a more intuitive way.

Size     std-BP        Dropout       Shakeout
500      13.66±0.66    11.76±0.09    10.81±0.32
1000     8.49±0.23     8.05±0.05     7.19±0.15
3000     5.54±0.09     4.87±0.06     4.60±0.07
8000     3.57±0.14     2.95±0.05     2.96±0.09
20000    2.28±0.09     1.82±0.07     1.92±0.06
50000    1.55±0.03     1.36±0.03     1.35±0.07
TABLE II: Classification errors (%) on MNIST using training sets of different sizes: fully-connected neural network
(a) Fully-connected neural network
(b) Convolutional neural network
Fig. 6: Classification of two kinds of neural networks on MNIST using training sets of different sizes. The curves show the performances of the models trained by standard BP, and those by Dropout and Shakeout applied on the hidden units of the fully-connected layer.

4.2.2 CIFAR-10

We use the simple convolutional network feature extractor described in cuda-convnet (layers-80sec.cfg) [55] and apply Dropout and Shakeout on the first fully-connected layer. We call this architecture "AlexFastNet" for convenience of description. In this experiment, 10,000 colour images are separated from the training set for validation and no data augmentation is utilized. The per-pixel mean computed over the training set is subtracted from each image. We first train for 100 epochs with an initial learning rate of 0.001 and then for another 50 epochs with a learning rate of 0.0001. The mean and standard deviation of the classification errors (in percent) are obtained from 5 runs of the experiment.

Size     std-BP       Dropout      Shakeout
500      9.76±0.26    6.16±0.23    4.83±0.11
1000     6.73±0.12    4.01±0.16    3.43±0.06
3000     2.93±0.10    2.06±0.06    1.86±0.13
8000     1.70±0.03    1.23±0.13    1.31±0.06
20000    0.97±0.01    0.83±0.06    0.77±0.001
50000    0.78±0.05    0.62±0.04    0.58±0.10
TABLE III: Classification errors (%) on MNIST using training sets of different sizes: convolutional neural network
Size     std-BP        Dropout       Shakeout
300      68.26±0.57    65.34±0.75    63.71±0.28
700      59.78±0.24    56.04±0.22    54.66±0.22
2000     50.73±0.29    46.24±0.49    44.39±0.41
5500     41.41±0.52    36.01±0.13    34.54±0.31
15000    32.53±0.25    27.28±0.26    26.53±0.17
40000    24.48±0.23    20.50±0.32    20.56±0.12
TABLE IV: Classification errors (%) on CIFAR-10 using training sets of different sizes: AlexFastNet

As shown in Tab. IV, the models trained by Dropout and Shakeout are consistently superior to the one trained by standard BP. Furthermore, the model trained by Shakeout also outperforms the one trained by Dropout when the training data is scarce. Fig. 7 shows the results in a more intuitive way.

Fig. 7: Classification on CIFAR-10 using training sets of different sizes. The curves show the performances of the models trained by standard BP, and those by Dropout and Shakeout applied on the hidden units of the fully-connected layer.

To test the performance of Shakeout on a much deeper architecture, we also conduct experiments with the Wide Residual Network (WRN) [15]. The configuration adopted is WRN-16-4, i.e. the WRN has 16 layers in total and the number of feature maps in each residual block's convolutional layers is 4 times that of the corresponding original architecture [1]. Because its complexity is much higher than that of "AlexFastNet", the experiments are performed on relatively larger training sets of sizes 15000, 40000 and 50000. Dropout and Shakeout are applied on the second convolutional layer of each residual block, following the protocol in [15]. All training runs start from the same initial weights. Batch Normalization is applied in the same way as in [15] to aid optimization. No data augmentation or data pre-processing is adopted. All the hyper-parameters other than τ and c are set as in [15]. The results are listed in Tab. V. For the training on CIFAR-10 with 50000 training samples, we adopt the same hyper-parameters as those chosen for the training set of size 40000. From Tab. V, we arrive at the same conclusion as in the previous experiments: the models trained by Dropout and Shakeout consistently outperform the one trained by standard BP, and Shakeout outperforms Dropout when the data is scarce.

Size std-BP Dropout Shakeout
15000 20.95 15.05 14.68
40000 15.37 9.32 9.01
50000 14.39 8.03 7.97
TABLE V: Classification on CIFAR-10 using training sets of different sizes: WRN-16-4
(a) AlexNet FC7 layer
(b) AlexNet FC8 layer
Fig. 8: Comparison of the distributions of the magnitude of weights trained by Dropout and Shakeout. The experiments are conducted using AlexNet on ImageNet-2012 dataset. Shakeout or Dropout is applied on the last two fully-connected layers, i.e. FC7 layer and FC8 layer.
(a) AlexNet FC7 layer
(b) AlexNet FC8 layer
Fig. 9: Distributions of the maximum magnitude of the weights connected to the same input unit of a layer. The maximum magnitude of the weights connected to one input unit can be regarded as a metric of the importance of that unit. The experiments are conducted using AlexNet on ImageNet-2012 dataset. For Shakeout, the units can be approximately separated into two groups and the one around zero is less important than the other, whereas for Dropout, the units are more concentrated.

4.2.3 Regularization Effect on the Weights

Shakeout regularizes the training process of deep neural networks in a different way from Dropout. For a GLM, we have proved that the regularizer induced by Shakeout adaptively combines L_0, L_1 and L_2 regularization terms. In Section 4.1, we demonstrated that for a one-hidden-layer autoencoder, Shakeout leads to much sparser model weights. In this section, we illustrate the regularization effect of Shakeout on the weights in the classification task and compare it to that of Dropout.

The results shown in this section are mainly based on experiments conducted on the ImageNet-2012 dataset using the representative deep architecture AlexNet [14]. For AlexNet, we apply Dropout or Shakeout on layers FC7 and FC8, the last two fully-connected layers. We train the model from scratch and obtain comparable classification performance on the validation set for Shakeout (top-1 error: 42.88%; top-5 error: 19.85%) and Dropout (top-1 error: 42.99%; top-5 error: 19.60%). The model is trained with the same hyper-parameter settings provided by Shelhamer in Caffe [51], apart from the hyper-parameters τ and c for Shakeout. The initial weights for training with Dropout and with Shakeout are kept the same.

Fig. 8 illustrates the distributions of the magnitudes of the weights resulting from Shakeout and Dropout. It can be seen that the weights learned by Shakeout are much sparser than those learned by Dropout, due to the implicitly induced L_0 and L_1 components.

The regularizer induced by Shakeout contains not only L_0 and L_1 but also L_2 regularization terms, a combination that is expected to discard groups of weights simultaneously. In Fig. 9, we use the maximum magnitude of the weights connected to one input unit of a layer to represent the importance of that unit for the subsequent output units. From Fig. 9, it can be seen that for Shakeout the units can be approximately separated into two groups according to the maximum magnitudes of the connected weights, and the group around zero can be discarded, whereas for Dropout the units are concentrated. This implies that, compared to Dropout, which may encourage a "distributed code" for the features captured by the units of a layer, Shakeout tends to discard the useless features (or units) and reward the important ones. This experimental result further verifies the regularization properties of Shakeout and Dropout.

As is well known, L_0 and L_1 regularization terms are related to feature selection [56, 57]. For a deep architecture, we therefore expect the weights obtained with Shakeout to be well suited to reflecting the importance of the connections between units. We perform the following experiment to verify this effect. After a model is trained, for each layer on which Dropout or Shakeout is applied, we sort the weights by increasing magnitude. We then prune the smallest-magnitude weights up to a given pruning ratio and evaluate the performance of the pruned model again. The pruning ratio goes from 0 to 1, and at each pruning ratio we calculate the relative accuracy loss, i.e. the accuracy drop of the pruned model relative to the accuracy of the unpruned model.

Fig. 10 shows the resulting curves for Dropout and Shakeout based on the AlexNet model on the ImageNet-2012 dataset. The models trained by Dropout and Shakeout use the optimal hyper-parameter settings. Apparently, the relative accuracy loss for Dropout is more severe than that for Shakeout, with a substantial margin at the pruning ratio where the two curves differ most. This result shows that, considering how well the trained weights reflect the importance of connections, Shakeout is much better than Dropout, which benefits from the implicitly induced L_0 and L_1 regularization effect.
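A sketch of this pruning protocol is given below: zero out the smallest-magnitude fraction of the weights and report the relative accuracy loss. The evaluate_accuracy callable is a hypothetical stand-in for running the (pruned) network on the validation set.

```python
import numpy as np

def prune_by_magnitude(W, ratio):
    """Zero out the `ratio` fraction of weights with the smallest magnitudes."""
    flat = np.abs(W).ravel()
    k = int(ratio * flat.size)
    if k == 0:
        return W.copy()
    threshold = np.partition(flat, k - 1)[k - 1]
    pruned = W.copy()
    pruned[np.abs(W) <= threshold] = 0.0
    return pruned

def relative_accuracy_loss(W, evaluate_accuracy, ratios):
    """Relative accuracy loss at each pruning ratio; evaluate_accuracy is hypothetical."""
    base = evaluate_accuracy(W)
    return [(base - evaluate_accuracy(prune_by_magnitude(W, r))) / base for r in ratios]
```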

(a) standard BP
(b) Dropout
(c) Shakeout
Fig. 11: The value of the monitored GAN objective as a function of the training iteration for DCGAN. DCGANs are trained using standard BP, Dropout and Shakeout for comparison. Dropout or Shakeout is applied on the discriminator of the GAN.

This kind of property is useful for the popular compression task in deep learning, which aims to cut connections or remove units of a deep neural network to the maximum extent without an obvious loss of accuracy. The above experiments illustrate that Shakeout can play a considerable role in selecting important connections, which is meaningful for promoting the performance of a compression task. This is a potential subject for future research.

Fig. 10: Relative accuracy loss as a function of the weight pruning ratio for Dropout and Shakeout, based on the AlexNet architecture on ImageNet-2012. The relative accuracy loss for Dropout is much more severe than that for Shakeout.

4.3 Stabilization Effect on the Training Process

In both research and production, it is always desirable to have a level of certainty about how a model’s fitness to the data improves over optimization iterations, namely, to have a stable training process. In this section, we show that Shakeout helps reduce fluctuation in the improvement of model fitness during training.

The first experiment is on the family of Generative Adversarial Networks (GANs) [58], which are known to be unstable in the training stage [59, 60, 61]. The purpose of the following tests is to demonstrate Shakeout's capability of stabilizing the training process of neural networks in a general sense. A GAN plays a min-max game between the generator G and the discriminator D over the expected log-likelihood of real data x and imaginary data G(z), where z represents the random input:

min_G max_D V(D, G) = E_{x∼p_data}[ log D(x) ] + E_{z∼p_z}[ log(1 − D(G(z))) ].    (19)
The architecture that we adopt is DCGAN [59]. The numbers of feature maps of the deconvolutional layers in the generator are 1024, 64 and 1, respectively, with corresponding spatial sizes 7×7, 14×14 and 28×28. We train DCGANs on the MNIST dataset using standard BP, Dropout and Shakeout. We follow the same experimental protocol described in [59], except that Dropout or Shakeout is adopted on all layers of the discriminator. The values of the monitored objective during training are illustrated in Fig. 11. It can be seen that during training by standard BP the objective oscillates greatly, while for Dropout and Shakeout the training processes are much steadier. Compared with Dropout, the training process with Shakeout has fewer spikes and is smoother. Fig. 12 shows the minimum and maximum values of the objective within fixed-length intervals moving from the start to the end of training by standard BP, Dropout and Shakeout. The gaps between the minimum and maximum values for Dropout and Shakeout are much smaller than that for standard BP, and the gap for Shakeout is the smallest, which implies that the stability of the training process with Shakeout is the best.
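For reference, a short sketch of the quantity one might monitor here, assuming (as the figure captions suggest) it is the negated discriminator value of Eq. (19), which equals log 4 ≈ 1.386 when the generated distribution matches the real one:

```python
import numpy as np

def gan_monitor(d_real, d_fake, eps=1e-8):
    """Negated value of Eq. (19) computed from discriminator outputs in (0, 1).

    d_real: D(x) on real samples; d_fake: D(G(z)) on generated samples.
    At the optimum (p_g = p_data, D = 1/2 everywhere) this equals log 4.
    """
    return -(np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps)))

print(gan_monitor(np.full(128, 0.5), np.full(128, 0.5)), np.log(4.0))   # both about 1.386
```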

Fig. 12: The minimum and maximum values of the monitored GAN objective within fixed-length intervals moving from the start to the end of training by standard BP, Dropout and Shakeout. The optimal value log(4) is obtained when the generated (imaginary) data distribution matches the real data distribution.

The second experiment is based on the Wide Residual Network architecture and performs the classification task. In the classification task, generalization performance is the main focus; thus, we compare the validation errors during the training processes with Dropout and Shakeout. Fig. 13 shows the validation error as a function of the training epoch for Dropout and Shakeout on CIFAR-10 with 40000 training examples. The architecture adopted is WRN-16-4, and the experimental settings are the same as those described in Section 4.2.2. Considering the generalization performance, the learning rate schedule adopted is the one optimized through validation to make the models obtain the best generalization performance. Under this schedule, we find that the validation error temporarily increases when the learning rate is lowered at an early stage of training, a phenomenon repeatedly observed in [15]. Nevertheless, it can be seen from Fig. 13 that the extent of the error increase is less severe for Shakeout than for Dropout; moreover, Shakeout recovers much faster than Dropout does. At the final stage, both validation errors steadily decrease, and Shakeout obtains generalization performance comparable or even superior to that of Dropout. In a word, Shakeout significantly stabilizes the entire training process while achieving superior generalization performance.

Fig. 13: Validation error as a function of the training epoch for Dropout and Shakeout on CIFAR-10 with a training set size of 40000. The architecture adopted is WRN-16-4. "DPO" and "SKO" represent "Dropout" and "Shakeout" respectively, and the following two numbers denote the hyper-parameters τ and c respectively. The learning rate decays at epochs 60, 120 and 160. After the first decay of the learning rate, the validation error increases greatly before decreasing steadily (see the enlarged snapshot for training epochs 60 to 80). It can be seen that the extent of the error increase is less severe for Shakeout than for Dropout, and Shakeout recovers much faster than Dropout does. At the final stage, both validation errors decrease steadily (see the enlarged snapshot for training epochs 160 to 200). Shakeout obtains generalization performance comparable or even superior to that of Dropout.

4.4 Practical Recommendations

Selection of Hyper-parameters. The most practical and popular way to perform hyper-parameter selection is to partition the training data into a training set and a validation set and to evaluate the classification performance of different hyper-parameters on the latter. Due to the expensive training time of a deep neural network, cross-validation is rarely adopted. There exist many hyper-parameter selection methods in deep learning, such as grid search, random search [62], Bayesian optimization [63], gradient-based hyper-parameter optimization [64], etc. For applying Shakeout to a deep neural network, we need to decide two hyper-parameters, τ and c. From the regularization perspective, we need to decide the most suitable strength of the regularization effect to obtain an optimal trade-off between model bias and variance. We have pointed out that, in a unified framework, Dropout is a special case of Shakeout with the hyper-parameter c set to zero. Empirically, we find that the optimal τ for Shakeout is not higher than that for Dropout. After determining the optimal τ, keeping the order of magnitude of the hyper-parameter c tied to the size of the training set is an effective choice. If one wants to obtain a model with much sparser weights, yet with generalization performance superior or comparable to Dropout, a relatively lower τ and a larger c for Shakeout always work.
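A minimal validation-based search over (τ, c) is sketched below; train_and_validate is a hypothetical callable that trains the model with the given Shakeout hyper-parameters and returns its validation error, and the candidate grids are only placeholders.

```python
import itertools

def select_shakeout_hyperparams(train_and_validate,
                                taus=(0.3, 0.5, 0.7),
                                cs=(1e-4, 1e-3, 1e-2)):
    """Grid search over Shakeout hyper-parameters on a held-out validation set.

    train_and_validate(tau, c) is a hypothetical stand-in that trains with
    Shakeout(tau, c) and returns the validation error. Setting c = 0 recovers
    Dropout, and tau = 0 recovers standard BP, so both baselines fit the grid.
    """
    results = {(tau, c): train_and_validate(tau, c)
               for tau, c in itertools.product(taus, cs)}
    best = min(results, key=results.get)
    return best, results
```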

Shakeout combined with Batch Normalization. Batch Normalization [65] is a widely adopted technique for improving the optimization of deep neural network training. In practice, combining Shakeout with Batch Normalization to train a deep architecture is a good choice. For example, we observe that the training of the WRN-16-4 model on CIFAR-10 is slow to converge without Batch Normalization. Moreover, the generalization performance on the test set for Shakeout combined with Batch Normalization consistently outperforms that of standard BP with Batch Normalization by quite a large margin, as illustrated in Tab. V. These results imply the important role of Shakeout in reducing the over-fitting of a deep neural network.

5 Conclusion

We have proposed Shakeout, a new regularized training approach for deep neural networks. The regularizer induced by Shakeout is proved to adaptively combine L_0, L_1 and L_2 regularization terms. Empirically, we find that:

1) Compared to Dropout, Shakeout can afford much larger models; in other words, when the data is scarce, Shakeout outperforms Dropout by a large margin.

2) Shakeout can obtain much sparser weights than Dropout with comparable or superior generalization performance, whereas for Dropout, obtaining the same level of sparsity as Shakeout may incur a significant loss of accuracy.

3) Some deep architectures, such as GANs, are by nature unstable to train; Shakeout can reduce this instability effectively.

In the future, we plan to put more emphasis on the inductive bias of Shakeout and to apply Shakeout to the model compression task.

Acknowledgments

This research is supported by Australian Research Council Projects (No. FT-130101457, DP-140102164 and LP-150100671).

References

  • [1] K. He, X. Zhang, S. Ren, and J. Sun, "Identity mappings in deep residual networks," in European Conference on Computer Vision. Springer, 2016, pp. 630–645.
  • [2] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, "Inception-v4, inception-resnet and the impact of residual connections on learning," in Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 2017, pp. 4278–4284.
  • [3] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Region-based convolutional networks for accurate object detection and segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 1, pp. 142–158, Jan 2016.
  • [4] Y. Sun, X. Wang, and X. Tang, “Hybrid deep learning for face verification,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 10, pp. 1997–2009, Oct 2016.
  • [5] Y. Zheng, Y. J. Zhang, and H. Larochelle, “A deep and autoregressive approach for topic modeling of multimodal data,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 6, pp. 1056–1069, June 2016.
  • [6] Y. Wei, W. Xia, M. Lin, J. Huang, B. Ni, J. Dong, Y. Zhao, and S. Yan, “Hcp: A flexible cnn framework for multi-label image classification,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 9, pp. 1901–1907, 2016.
  • [7] X. Jin, C. Xu, J. Feng, Y. Wei, J. Xiong, and S. Yan, “Deep learning with s-shaped rectified linear activation units,” arXiv preprint arXiv:1512.07030, 2015.
  • [8] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural computation, vol. 18, no. 7, pp. 1527–1554, 2006.
  • [9] Y. Bengio, “Learning deep architectures for AI,” Foundations and trends® in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.
  • [10] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, Aug 2013.
  • [11] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, "Extracting and composing robust features with denoising autoencoders," in Proceedings of the 25th International Conference on Machine Learning. ACM, 2008, pp. 1096–1103.
  • [12] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” The Journal of Machine Learning Research, vol. 11, pp. 3371–3408, 2010.
  • [13] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” arXiv preprint arXiv:1207.0580, 2012.
  • [14] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [15] S. Zagoruyko and N. Komodakis, “Wide residual networks,” in Proceedings of the British Machine Vision Conference 2016, BMVC 2016, York, UK, September 19-22, 2016, 2016.
  • [16] S. Wager, S. Wang, and P. S. Liang, “Dropout training as adaptive regularization,” in Advances in Neural Information Processing Systems, 2013, pp. 351–359.
  • [17] N. Chen, J. Zhu, J. Chen, and B. Zhang, "Dropout training for support vector machines," in Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.
  • [18] L. Van Der Maaten, M. Chen, S. Tyree, and K. Q. Weinberger, “Learning with marginalized corrupted features.” in ICML (1), 2013, pp. 410–418.
  • [19] Y. LeCun, J. S. Denker, S. A. Solla, R. E. Howard, and L. D. Jackel, "Optimal brain damage," in Advances in Neural Information Processing Systems, vol. 2, 1989, pp. 598–605.
  • [20] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen, “Compressing neural networks with the hashing trick,” in Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, 2015, pp. 2285–2294.
  • [21] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in Advances in Neural Information Processing Systems, 2015, pp. 1135–1143.
  • [22] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding,” CoRR, abs/1510.00149, vol. 2, 2015.
  • [23] M. Denil, B. Shakibi, L. Dinh, N. de Freitas et al., “Predicting parameters in deep learning,” in Advances in Neural Information Processing Systems, 2013, pp. 2148–2156.
  • [24] J. Ba and R. Caruana, “Do deep nets really need to be deep?” in Advances in neural information processing systems, 2014, pp. 2654–2662.
  • [25] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
  • [26] H. Zou and T. Hastie, “Regularization and variable selection via the elastic net,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 67, no. 2, pp. 301–320, 2005.
  • [27] G. Kang, J. Li, and D. Tao, “Shakeout: A new regularized deep neural network training scheme,” in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
  • [28] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio, “Why does unsupervised pre-training help deep learning?” The Journal of Machine Learning Research, vol. 11, pp. 625–660, 2010.
  • [29] J. Moody, S. Hanson, A. Krogh, and J. A. Hertz, “A simple weight decay can improve generalization,” Advances in neural information processing systems, vol. 4, pp. 950–957, 1995.
  • [30] L. Prechelt, “Automatic early stopping using cross validation: quantifying the criteria,” Neural Networks, vol. 11, no. 4, pp. 761–767, 1998.
  • [31] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus, “Regularization of neural networks using dropconnect,” in Proceedings of the 30th International Conference on Machine Learning (ICML-13), 2013, pp. 1058–1066.
  • [32] J. Ba and B. Frey, “Adaptive dropout for training deep neural networks,” in Advances in Neural Information Processing Systems, 2013, pp. 3084–3092.
  • [33] Z. Li, B. Gong, and T. Yang, “Improved dropout for shallow and deep learning,” in Advances In Neural Information Processing Systems, 2016, pp. 2523–2531.
  • [34] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
  • [35] D. Warde-Farley, I. J. Goodfellow, A. Courville, and Y. Bengio, “An empirical analysis of dropout in piecewise linear networks,” arXiv preprint arXiv:1312.6197, 2013.
  • [36] P. Baldi and P. J. Sadowski, “Understanding dropout,” in Advances in Neural Information Processing Systems, 2013, pp. 2814–2822.
  • [37] Y. Gal and Z. Ghahramani, “Dropout as a bayesian approximation: Representing model uncertainty in deep learning,” in Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, 2016, pp. 1050–1059.
  • [38] D. P. Helmbold and P. M. Long, “On the inductive bias of dropout,” Journal of Machine Learning Research, vol. 16, pp. 3403–3454, 2015.
  • [39] B. A. Olshausen and D. J. Field, “Sparse coding with an overcomplete basis set: A strategy employed by v1?” Vision research, vol. 37, no. 23, pp. 3311–3325, 1997.
  • [40] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [41] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
  • [42] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
  • [43] Y.-l. Boureau, Y. L. Cun et al., "Sparse feature learning for deep belief networks," in Advances in Neural Information Processing Systems, 2008, pp. 1185–1192.
  • [44] I. J. Goodfellow, A. Courville, and Y. Bengio, “Spike-and-slab sparse coding for unsupervised feature discovery,” arXiv preprint arXiv:1201.3382, 2012.
  • [45] M. Thom and G. Palm, "Sparse activity and sparse connectivity in supervised learning," Journal of Machine Learning Research, vol. 14, no. Apr, pp. 1091–1143, 2013.
  • [46] S. Rifai, X. Glorot, Y. Bengio, and P. Vincent, “Adding noise to the input of a model trained with a regularized objective,” arXiv preprint arXiv:1104.3250, 2011.
  • [47] C. M. Bishop, “Training with noise is equivalent to tikhonov regularization,” Neural computation, vol. 7, no. 1, pp. 108–116, 1995.
  • [48] W. Jiang, F. Nie, and H. Huang, “Robust dictionary learning with capped l1-norm,” in Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, 2015, pp. 3590–3596.
  • [49] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images. Technical report, University of Toronto,” 2009.
  • [50] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
  • [51] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proceedings of the ACM International Conference on Multimedia.   ACM, 2014, pp. 675–678.
  • [52] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, pp. 533–536, 1986.
  • [53] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society. Series B (Methodological), pp. 267–288, 1996.
  • [54] L. Yuan, J. Liu, and J. Ye, “Efficient methods for overlapping group lasso,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 9, pp. 2104–2116, 2013.
  • [55] A. Krizhevsky, “cuda-convnet,” 2012. [Online]. Available: https://code.google.com/p/cuda-convnet/
  • [56] I. Guyon and A. Elisseeff, “An introduction to variable and feature selection,” Journal of machine learning research, vol. 3, no. Mar, pp. 1157–1182, 2003.
  • [57] K. Wang, R. He, L. Wang, W. Wang, and T. Tan, “Joint feature selection and subspace learning for cross-modal retrieval,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 10, pp. 2010–2023, Oct 2016.
  • [58] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
  • [59] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.
  • [60] M. Arjovsky and L. Bottou, “Towards principled methods for training generative adversarial networks,” in NIPS 2016 Workshop on Adversarial Training. In review for ICLR, vol. 2016, 2017.
  • [61] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein gan,” arXiv preprint arXiv:1701.07875, 2017.
  • [62] J. Bergstra and Y. Bengio, “Random search for hyper-parameter optimization,” Journal of Machine Learning Research, vol. 13, no. Feb, pp. 281–305, 2012.
  • [63] J. Snoek, H. Larochelle, and R. P. Adams, “Practical bayesian optimization of machine learning algorithms,” in Advances in neural information processing systems, 2012, pp. 2951–2959.
  • [64] D. Maclaurin, D. Duvenaud, and R. P. Adams, "Gradient-based hyperparameter optimization through reversible learning," in Proceedings of the 32nd International Conference on Machine Learning, 2015.
  • [65] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, 2015, pp. 448–456.