1 Introduction
Deep neural networks are flexible function approximators that have been very successful in a broad range of tasks. They can easily scale to millions of parameters while allowing for tractable optimization with minibatch stochastic gradient descent (SGD), graphics processing units (GPUs) and parallel computation. Nevertheless, they do have drawbacks. Firstly, it has been shown in recent works (Han et al., 2015; Ullrich et al., 2017; Molchanov et al., 2017) that they are greatly overparametrized, as they can be pruned significantly without any loss in accuracy; this implies that they spend unnecessary computation and resources. Secondly, they can easily overfit and even memorize random patterns in the data (Zhang et al., 2016) if not properly regularized. This overfitting can lead to poor generalization in practice.
A way to address both of these issues is by employing model compression and sparsification techniques. By sparsifying the model, we can avoid unnecessary computation and resources, since irrelevant degrees of freedom are pruned away and do not need to be computed. Furthermore, we reduce its complexity, thus penalizing memorization and alleviating overfitting.
A conceptually attractive approach is the $L_0$ norm regularization of (blocks of) parameters; this explicitly penalizes parameters for being different from zero with no further restrictions. However, the combinatorial nature of this problem makes the optimization intractable for large models.
In this paper we propose a general framework for surrogate $L_0$ regularized objectives. It is realized by smoothing the expected $L_0$ regularized objective with continuous distributions in a way that can maintain the exact zeros in the parameters while still allowing for efficient gradient based optimization. This is achieved by transforming continuous random variables (r.v.s) with a hard nonlinearity, the hard-sigmoid. We further propose and employ a novel distribution obtained by this procedure: the hard concrete. It is obtained by “stretching” a binary concrete random variable (Maddison et al., 2016; Jang et al., 2016) and then passing its samples through a hard-sigmoid. We demonstrate the effectiveness of this simple procedure in various experiments.

2 Minimizing the $L_0$ norm of parametric models
One way to sparsify parametric models, such as deep neural networks, with the least assumptions about the parameters is the following: let $\mathcal{D}$ be a dataset consisting of $N$ i.i.d. input-output pairs $\{(\mathbf{x}_1, \mathbf{y}_1), \dots, (\mathbf{x}_N, \mathbf{y}_N)\}$ and consider a regularized empirical risk minimization procedure with an $L_0$ regularization on the parameters $\boldsymbol{\theta}$ of a hypothesis (e.g. a neural network) $h(\cdot; \boldsymbol{\theta})$ (Footnote 1: This assumption is just for ease of explanation; our proposed framework can be applied to any objective function involving parameters.):

$$\mathcal{R}(\boldsymbol{\theta}) = \frac{1}{N}\Big(\sum_{i=1}^{N} \mathcal{L}\big(h(\mathbf{x}_i; \boldsymbol{\theta}), \mathbf{y}_i\big)\Big) + \lambda \|\boldsymbol{\theta}\|_0, \qquad \|\boldsymbol{\theta}\|_0 = \sum_{j=1}^{|\theta|} \mathbb{I}[\theta_j \neq 0], \qquad \boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} \mathcal{R}(\boldsymbol{\theta}), \tag{1}$$

where $|\theta|$ is the dimensionality of the parameters, $\lambda$ is a weighting factor for the regularization and $\mathcal{L}(\cdot)$ corresponds to a loss function, e.g. cross-entropy loss for classification or mean-squared error for regression. The $L_0$ norm penalizes the number of nonzero entries of the parameter vector and thus encourages sparsity in the final estimates $\boldsymbol{\theta}^*$. The Akaike Information Criterion (AIC) (Akaike, 1998) and the Bayesian Information Criterion (BIC) (Schwarz et al., 1978), well-known model selection criteria, correspond to specific choices of $\lambda$. Notice that the $L_0$ norm induces no shrinkage on the actual values of the parameters $\boldsymbol{\theta}$; this is in contrast to e.g. $L_1$ regularization and the Lasso (Tibshirani, 1996), where the sparsity is due to shrinking the actual values of $\boldsymbol{\theta}$. We provide a visualization of this effect in Figure 1.

Unfortunately, optimization under this penalty is computationally intractable due to the non-differentiability and combinatorial nature of the $2^{|\theta|}$ possible states of the parameter vector $\boldsymbol{\theta}$. How can we relax the discrete nature of the $L_0$ penalty such that we allow for efficient continuous optimization of Eq. 1, while allowing for exact zeros in the parameters? This section will present the necessary details of our approach.
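To make the combinatorial nature of the penalty concrete, the following minimal sketch minimizes an $L_0$ regularized objective exactly by enumerating every support of the parameter vector. It assumes a linear model with a squared-error loss; the function names `l0_norm` and `best_subset`, the synthetic data and the value of `lam` are all hypothetical choices made for illustration only.

```python
import itertools
import numpy as np

def l0_norm(theta):
    """Number of nonzero entries of theta."""
    return int(np.count_nonzero(theta))

def best_subset(X, y, lam):
    """Minimize mean squared error + lam * ||theta||_0 by enumerating all supports."""
    n, d = X.shape
    best_risk, best_theta = np.inf, np.zeros(d)
    for mask in itertools.product([False, True], repeat=d):  # 2**d candidates
        idx = np.flatnonzero(mask)
        theta = np.zeros(d)
        if idx.size > 0:
            # Unpenalized least squares on the selected columns: no shrinkage.
            theta[idx] = np.linalg.lstsq(X[:, idx], y, rcond=None)[0]
        risk = np.mean((y - X @ theta) ** 2) + lam * l0_norm(theta)
        if risk < best_risk:
            best_risk, best_theta = risk, theta
    return best_theta, best_risk

# Synthetic regression problem in which only two of six parameters matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
true_theta = np.array([2.0, 0.0, 0.0, -1.5, 0.0, 0.0])
y = X @ true_theta + 0.1 * rng.normal(size=200)

theta_hat, risk = best_subset(X, y, lam=0.05)
print(np.flatnonzero(theta_hat), l0_norm(theta_hat))
```

The loop visits $2^d$ supports, which is already 64 least-squares fits for six parameters and is hopeless for models with millions of parameters; the surviving coefficients are plain least-squares estimates, i.e. they are selected but not shrunk.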
2.1 A general recipe for efficiently minimizing $L_0$ norms
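Consider the $L_0$ norm under a simple reparametrization of $\boldsymbol{\theta}$:

$$\theta_j = \tilde{\theta}_j z_j, \qquad z_j \in \{0, 1\}, \qquad \tilde{\theta}_j \neq 0, \qquad \|\boldsymbol{\theta}\|_0 = \sum_{j=1}^{|\theta|} z_j, \tag{2}$$

where the $z_j$ correspond to binary “gates” that denote whether a parameter is present and the $L_0$ norm corresponds to the number of gates being “on”. By letting $q(z_j | \pi_j) = \mathrm{Bern}(\pi_j)$ be a Bernoulli distribution over each gate $z_j$ we can reformulate the minimization of Eq. 1 as penalizing the number of parameters being used, on average, as follows:

$$\mathcal{R}(\tilde{\boldsymbol{\theta}}, \boldsymbol{\pi}) = \mathbb{E}_{q(\mathbf{z}|\boldsymbol{\pi})}\Big[\frac{1}{N}\Big(\sum_{i=1}^{N} \mathcal{L}\big(h(\mathbf{x}_i; \tilde{\boldsymbol{\theta}} \odot \mathbf{z}), \mathbf{y}_i\big)\Big)\Big] + \lambda \sum_{j=1}^{|\theta|} \pi_j, \qquad \tilde{\boldsymbol{\theta}}^*, \boldsymbol{\pi}^* = \arg\min_{\tilde{\boldsymbol{\theta}}, \boldsymbol{\pi}} \mathcal{R}(\tilde{\boldsymbol{\theta}}, \boldsymbol{\pi}), \tag{3}$$

where $\odot$ corresponds to the elementwise product. The objective described in Eq. 3 is in fact a special case of a variational bound over the parameters involving spike and slab (Mitchell & Beauchamp, 1988) priors and approximate posteriors; we refer interested readers to appendix A.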
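The regularization term of Eq. 3 relies on the fact that the expected number of active Bernoulli gates is the sum of their success probabilities. A small Monte Carlo check of this identity is sketched below; the parameter values `theta_tilde` and probabilities `pi` are arbitrary illustrative numbers, not values from our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

theta_tilde = np.array([0.7, -1.2, 0.3, 2.0])  # nonzero "slab" values (illustrative)
pi = np.array([0.9, 0.5, 0.1, 0.0])            # gate probabilities (illustrative)

# Draw many gate vectors z ~ Bern(pi) and form theta = theta_tilde * z.
z = rng.random((100_000, pi.size)) < pi
theta = theta_tilde * z

mc_expected_l0 = np.count_nonzero(theta, axis=1).mean()  # Monte Carlo estimate
exact_expected_l0 = pi.sum()                             # closed form used in Eq. 3
print(mc_expected_l0, exact_expected_l0)
```

The penalty itself is therefore trivially differentiable in the gate probabilities; the difficulty lies entirely in the expectation of the loss over the discrete gates.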
Now the second term of the r.h.s. of Eq. 3 is straightforward to minimize; however, the first term is problematic for $\boldsymbol{\pi}$ due to the discrete nature of $\mathbf{z}$, which does not allow for efficient gradient based optimization. While in principle a gradient estimator such as REINFORCE (Williams, 1992) could be employed, it suffers from high variance, and control variates (Mnih & Gregor, 2014; Mnih & Rezende, 2016; Tucker et al., 2017), which require auxiliary models or multiple evaluations of the network, have to be employed. Two simpler alternatives would be to use either the straight-through estimator (Bengio et al., 2013), as done by Srinivas et al. (2017), or the concrete distribution, as done e.g. by Gal et al. (2017). Unfortunately, both of these approaches have drawbacks: the first one provides biased gradients due to ignoring the Heaviside function in the likelihood during the gradient evaluation, whereas the second one does not allow for the gates (and hence parameters) to be exactly zero during optimization, thus precluding the benefits of conditional computation (Bengio et al., 2013).

Fortunately, there is a simple alternative way to smooth the objective such that we allow for efficient gradient based optimization of the expected $L_0$ norm along with zeros in the parameters $\boldsymbol{\theta}$. Let $\mathbf{s}$ be a continuous random variable with a distribution $q(\mathbf{s})$ that has parameters $\boldsymbol{\phi}$. We can now let the gates $\mathbf{z}$ be given by a hard-sigmoid rectification of $\mathbf{s}$ (Footnote 2: We chose to employ a hard-sigmoid instead of a rectifier, $g(\cdot) = \max(0, \cdot)$, so as to have the variable better mimic a binary gate, rather than a scale variable.), as follows:

$$\mathbf{s} \sim q(\mathbf{s} | \boldsymbol{\phi}) \tag{4}$$

$$\mathbf{z} = \min(\mathbf{1}, \max(\mathbf{0}, \mathbf{s})). \tag{5}$$
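Eqs. 4 and 5 can be sketched for one particular, purely illustrative choice of smoothing distribution: a Gaussian with hypothetical parameters `mu` and `sigma`. The sketch shows that the rectified samples contain exact zeros and ones, and that the fraction of nonzero gates agrees with the probability of the underlying variable being positive.

```python
import math
import numpy as np

def hard_sigmoid(s):
    """Eq. 5: clip the continuous variable to the [0, 1] interval."""
    return np.clip(s, 0.0, 1.0)

mu, sigma = 0.3, 0.5  # hypothetical parameters phi of a Gaussian q(s | phi)
rng = np.random.default_rng(0)

s = mu + sigma * rng.standard_normal(200_000)  # Eq. 4, via reparametrization
z = hard_sigmoid(s)

frac_nonzero = np.mean(z != 0.0)
# Probability that s is positive, from the Gaussian CDF evaluated at zero.
cdf_at_zero = 0.5 * (1.0 + math.erf((0.0 - mu) / (sigma * math.sqrt(2.0))))
prob_positive = 1.0 - cdf_at_zero
print(frac_nonzero, prob_positive)
```

Any distribution with a tractable CDF and a reparametrizable sampler could be substituted for the Gaussian here.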
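This would then allow the gate to be exactly zero and, due to the underlying continuous random variable $\mathbf{s}$, we can still compute the probability of the gate being nonzero (active). This is easily obtained by the cumulative distribution function (CDF) $Q(\cdot)$ of $\mathbf{s}$:

$$q(\mathbf{z} \neq 0 | \boldsymbol{\phi}) = 1 - Q(\mathbf{s} \leq 0 | \boldsymbol{\phi}), \tag{6}$$

i.e. it is the probability of the variable $\mathbf{s}$ being positive. We can thus smooth the binary Bernoulli gates $\mathbf{z}$ appearing in Eq. 3 by employing continuous distributions in the aforementioned way:

$$\mathcal{R}(\tilde{\boldsymbol{\theta}}, \boldsymbol{\phi}) = \mathbb{E}_{q(\mathbf{s}|\boldsymbol{\phi})}\Big[\frac{1}{N}\Big(\sum_{i=1}^{N} \mathcal{L}\big(h(\mathbf{x}_i; \tilde{\boldsymbol{\theta}} \odot g(\mathbf{s})), \mathbf{y}_i\big)\Big)\Big] + \lambda \sum_{j=1}^{|\theta|} \big(1 - Q(s_j \leq 0 | \phi_j)\big), \qquad g(\cdot) = \min(1, \max(0, \cdot)). \tag{7}$$

Notice that this is a close surrogate to the original objective function in Eq. 3, as we similarly have a cost that explicitly penalizes the probability of a gate being different from zero. Now for continuous distributions $q(\mathbf{s})$ that allow for the reparameterization trick (Kingma & Welling, 2014; Rezende et al., 2014) we can express the objective in Eq. 7 as an expectation over a parameter free noise distribution $p(\boldsymbol{\epsilon})$ and a deterministic and differentiable transformation $f(\cdot)$ of the parameters $\boldsymbol{\phi}$ and $\boldsymbol{\epsilon}$:

$$\mathcal{R}(\tilde{\boldsymbol{\theta}}, \boldsymbol{\phi}) = \mathbb{E}_{p(\boldsymbol{\epsilon})}\Big[\frac{1}{N}\Big(\sum_{i=1}^{N} \mathcal{L}\big(h(\mathbf{x}_i; \tilde{\boldsymbol{\theta}} \odot g(f(\boldsymbol{\phi}, \boldsymbol{\epsilon}))), \mathbf{y}_i\big)\Big)\Big] + \lambda \sum_{j=1}^{|\theta|} \big(1 - Q(s_j \leq 0 | \phi_j)\big), \tag{8}$$

which allows us to make the following Monte Carlo approximation to the (generally) intractable expectation over the noise distribution $p(\boldsymbol{\epsilon})$:

$$\hat{\mathcal{R}}(\tilde{\boldsymbol{\theta}}, \boldsymbol{\phi}) = \frac{1}{L} \sum_{l=1}^{L} \Big(\frac{1}{N}\Big(\sum_{i=1}^{N} \mathcal{L}\big(h(\mathbf{x}_i; \tilde{\boldsymbol{\theta}} \odot \mathbf{z}^{(l)}), \mathbf{y}_i\big)\Big)\Big) + \lambda \sum_{j=1}^{|\theta|} \big(1 - Q(s_j \leq 0 | \phi_j)\big) = \mathcal{L}_E(\tilde{\boldsymbol{\theta}}, \boldsymbol{\phi}) + \lambda \mathcal{L}_C(\boldsymbol{\phi}), \qquad \mathbf{z}^{(l)} = g(f(\boldsymbol{\phi}, \boldsymbol{\epsilon}^{(l)})), \quad \boldsymbol{\epsilon}^{(l)} \sim p(\boldsymbol{\epsilon}). \tag{9}$$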
$\mathcal{L}_E$ corresponds to the error loss that measures how well the model is fitting the current dataset, whereas $\mathcal{L}_C$ refers to the complexity loss that measures the flexibility of the model. Crucially, the total cost in Eq. 9 is now differentiable w.r.t. $\boldsymbol{\phi}$, thus enabling efficient stochastic gradient based optimization, while still allowing for exact zeros at the parameters. One price we pay is that now the gradient of the log-likelihood w.r.t. the parameters $\boldsymbol{\phi}$ of $q(\mathbf{s} | \boldsymbol{\phi})$ is sparse due to the rectifications; nevertheless this should not pose an issue considering the prevalence of rectified linear units in neural networks. Furthermore, due to the stochasticity at $\mathbf{s}$ the hard-sigmoid gate $\mathbf{z}$ is smoothed to a soft version on average, thus allowing for gradient based optimization to succeed, even when the mean of $\mathbf{s}$ is negative or larger than one. An example visualization can be seen in Figure 1(b). It should be noted that a similar argument was also shown by Bengio et al. (2013), where with logistic noise a rectifier nonlinearity was smoothed to a softplus, $\log(1 + \exp(x))$, on average.

2.2 The hard concrete distribution
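The framework described in Section 2.1 gives us the freedom to choose an appropriate smoothing distribution $q(\mathbf{s})$. A choice that seems to work well in practice is the following: assume that we have a binary concrete (Maddison et al., 2016; Jang et al., 2016) random variable $s$ distributed in the $(0, 1)$ interval with probability density $q_s(s | \phi)$ and cumulative distribution function $Q_s(s | \phi)$. The parameters of the distribution are $\phi = (\log \alpha, \beta)$, where $\log \alpha$ is the location and $\beta$ is the temperature. We can “stretch” this distribution to the $(\gamma, \zeta)$ interval, with $\gamma < 0$ and $\zeta > 1$, and then apply a hard-sigmoid on its random samples:

$$u \sim \mathcal{U}(0, 1), \qquad s = \mathrm{Sigmoid}\big((\log u - \log(1 - u) + \log \alpha) / \beta\big), \qquad \bar{s} = s(\zeta - \gamma) + \gamma, \tag{10}$$

$$z = \min(1, \max(0, \bar{s})). \tag{11}$$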
This would then induce a distribution where the probability mass of $q_{\bar{s}}(\bar{s} | \phi)$ on the negative values, $Q_{\bar{s}}(0 | \phi)$, is “folded” to a delta peak at zero, the probability mass on values larger than one, $1 - Q_{\bar{s}}(1 | \phi)$, is “folded” to a delta peak at one, and the original distribution $q_{\bar{s}}(\bar{s} | \phi)$ is truncated to the $(0, 1)$ range. We provide more information and the density of the resulting distribution in the appendix.
Notice that a similar behavior would have been obtained even if we passed samples from any other distribution over the real line through a hard-sigmoid. The only requirement of the approach is that we can evaluate the CDF of $\bar{s}$ at 0 and 1. The main reason for picking the binary concrete is its close ties with Bernoulli r.v.s. It was originally proposed by Maddison et al. (2016) and Jang et al. (2016) as a smooth approximation to Bernoulli r.v.s, a fact that allows for gradient based optimization of its parameters through the reparametrization trick. The temperature $\beta$ controls the degree of approximation: with $\beta = 0$ we recover the original Bernoulli r.v. (but lose the differentiable properties), whereas with $0 < \beta < 1$ we obtain a probability density that concentrates its mass near the endpoints (e.g. as shown in Figure 1(a)). As a result, the hard concrete also inherits the same theoretical properties w.r.t. the Bernoulli distribution. Furthermore, it can serve as a better approximation of the discrete nature, since it includes $\{0, 1\}$ in its support, while still allowing for (sub)gradient optimization of its parameters due to the continuous probability mass that connects those two values. We can also view this distribution as a “rounded” version of the original binary concrete, where values larger than $\frac{1 - \gamma}{\zeta - \gamma}$ are rounded to one whereas values smaller than $\frac{-\gamma}{\zeta - \gamma}$ are rounded to zero. We provide an example visualization of the hard concrete distribution in Figure 1(a).
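The sampling procedure of Eqs. 10 and 11 can be sketched in a few lines. The helper name `sample_hard_concrete`, the clipping constant `eps` that keeps the uniform noise away from 0 and 1, and the constants `GAMMA`, `ZETA`, `BETA` are assumptions of this sketch rather than a definitive implementation.

```python
import numpy as np

# Illustrative stretch limits and temperature (assumed values for this sketch).
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hard_concrete(log_alpha, size, rng, eps=1e-6):
    """Draw hard concrete gates for a given location parameter log_alpha."""
    u = rng.uniform(eps, 1.0 - eps, size=size)
    # Binary concrete sample in (0, 1).
    s = sigmoid((np.log(u) - np.log1p(-u) + log_alpha) / BETA)
    # Stretch to (GAMMA, ZETA), then rectify with the hard-sigmoid.
    s_bar = s * (ZETA - GAMMA) + GAMMA
    return np.clip(s_bar, 0.0, 1.0)

rng = np.random.default_rng(0)
z = sample_hard_concrete(log_alpha=0.0, size=100_000, rng=rng)

frac_zero = np.mean(z == 0.0)  # mass folded onto the delta peak at zero
frac_one = np.mean(z == 1.0)   # mass folded onto the delta peak at one
print(frac_zero, frac_one)
```

With a location of zero the two delta peaks carry equal mass, and the remaining samples lie strictly inside the unit interval, which is what makes the gate both exactly sparse and differentiable in its parameters.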
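The complexity loss $\mathcal{L}_C$ of the objective in Eq. 9 under the hard concrete r.v. is conveniently expressed as follows:

$$\mathcal{L}_C = \sum_{j=1}^{|\theta|} \big(1 - Q_{\bar{s}_j}(0 | \phi_j)\big) = \sum_{j=1}^{|\theta|} \mathrm{Sigmoid}\Big(\log \alpha_j - \beta \log \frac{-\gamma}{\zeta}\Big). \tag{12}$$

At test time we use the following estimator for the final parameters $\boldsymbol{\theta}^*$ under a hard concrete gate:

$$\hat{\mathbf{z}} = \min\big(\mathbf{1}, \max(\mathbf{0}, \mathrm{Sigmoid}(\log \boldsymbol{\alpha})(\zeta - \gamma) + \gamma)\big), \qquad \boldsymbol{\theta}^* = \tilde{\boldsymbol{\theta}}^* \odot \hat{\mathbf{z}}. \tag{13}$$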
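A minimal sketch of Eq. 12 and Eq. 13 follows. The function names `complexity_loss` and `final_gates`, the constants and the example locations are hypothetical and chosen only for illustration.

```python
import numpy as np

# Illustrative stretch limits and temperature (assumed values for this sketch).
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def complexity_loss(log_alpha):
    """Eq. 12: sum over gates of the probability of being nonzero."""
    return np.sum(sigmoid(log_alpha - BETA * np.log(-GAMMA / ZETA)))

def final_gates(log_alpha):
    """Eq. 13: deterministic gate used at test time."""
    return np.clip(sigmoid(log_alpha) * (ZETA - GAMMA) + GAMMA, 0.0, 1.0)

log_alpha = np.array([-6.0, -1.0, 0.0, 1.0, 6.0])    # example gate locations
theta_tilde = np.array([0.5, -2.0, 1.0, 3.0, -0.7])  # example parameters

z_hat = final_gates(log_alpha)
theta_final = theta_tilde * z_hat
print(z_hat, theta_final, complexity_loss(log_alpha))
```

Gates with a strongly negative location are mapped to exactly zero, so the corresponding parameters can be removed from the model, while gates with a strongly positive location saturate at one and leave their parameters unscaled.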
2.3 Combining the $L_0$ norm with other norms
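While the $L_0$ norm leads to sparse estimates without imposing any shrinkage on $\boldsymbol{\theta}$, it might still be desirable to impose some form of prior assumptions on the values of $\boldsymbol{\theta}$ with alternative norms, e.g. impose smoothness with the $L_2$ norm (i.e. weight decay). In the following we will show how this combination is feasible for the $L_2$ norm. The expected $L_2$ norm under the Bernoulli gating mechanism can be conveniently expressed as:

$$\mathbb{E}_{q(\mathbf{z}|\boldsymbol{\pi})}\big[\|\boldsymbol{\theta}\|_2^2\big] = \sum_{j=1}^{|\theta|} \mathbb{E}_{q(z_j|\pi_j)}\big[z_j^2 \tilde{\theta}_j^2\big] = \sum_{j=1}^{|\theta|} \pi_j \tilde{\theta}_j^2, \tag{14}$$

where $\pi_j$ corresponds to the success probability of the Bernoulli gate $z_j$. To maintain a similar expression with our smoothing mechanism, and avoid extra shrinkage for the gates $z_j$, we can take into account that the standard $L_2$ norm penalty is proportional to the negative log density of a zero mean Gaussian prior with a standard deviation of $\sigma = 1$. We will then assume that the $\sigma$ for each $\theta$ is governed by $z$ in a way that when $z = 0$ we have that $\sigma = 1$ and when $z > 0$ we have that $\sigma = z$. As a result, we can obtain the following expression for the $L_2$ penalty (where $\hat{\theta} = \frac{\theta}{\sigma}$):

$$\mathbb{E}_{q(\mathbf{z}|\boldsymbol{\phi})}\big[\|\hat{\boldsymbol{\theta}}\|_2^2\big] = \sum_{j=1}^{|\theta|} \Big(Q_{\bar{s}_j}(0 | \phi_j) \frac{0}{1} + \big(1 - Q_{\bar{s}_j}(0 | \phi_j)\big) \mathbb{E}_{q(z_j | \phi_j, \bar{s}_j > 0)}\Big[\frac{\tilde{\theta}_j^2 z_j^2}{z_j^2}\Big]\Big) = \sum_{j=1}^{|\theta|} \big(1 - Q_{\bar{s}_j}(0 | \phi_j)\big) \tilde{\theta}_j^2. \tag{15}$$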
2.4 Group sparsity under an $L_0$ norm
For reasons of computational efficiency it is usually desirable to perform group sparsity instead of parameter sparsity, as this can allow for practical computation savings. For example, in neural networks speedups can be obtained by employing a dropout-like (Srivastava et al., 2014) procedure with neuron sparsity in fully connected layers or feature map sparsity for convolutional layers (Wen et al., 2016; Louizos et al., 2017; Neklyudov et al., 2017). This is straightforward to do with hard concrete gates; simply share the gate between all of the members of the group. The expected $L_0$ and, according to Section 2.3, $L_2$ penalties in this scenario can be rewritten as:

$$\mathbb{E}_{q(\mathbf{z}|\boldsymbol{\phi})}\big[\|\boldsymbol{\theta}\|_0\big] = \sum_{g=1}^{|G|} |g| \big(1 - Q(s_g \leq 0 | \phi_g)\big) \tag{16}$$

$$\mathbb{E}_{q(\mathbf{z}|\boldsymbol{\phi})}\big[\|\hat{\boldsymbol{\theta}}\|_2^2\big] = \sum_{g=1}^{|G|} \Big(\big(1 - Q(s_g \leq 0 | \phi_g)\big) \sum_{j=1}^{|g|} \tilde{\theta}_j^2\Big), \tag{17}$$

where $|G|$ corresponds to the number of groups and $|g|$ corresponds to the number of parameters of group $g$. For all of our subsequent experiments we employed neuron sparsity, where we introduced a gate per input neuron for fully connected layers and a gate per output feature map for convolutional layers. Notice that in the interpretation we adopt the gate is shared across all locations of the feature map for convolutional layers, akin to spatial dropout (Tompson et al., 2015). This can lead to practical computation savings while training, a benefit which is not possible with the commonly used independent dropout masks per spatial location (e.g. as in Zagoruyko & Komodakis (2016)).
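For a fully connected layer with one gate per input neuron, the group penalties of Eq. 16 and Eq. 17 reduce to a few array operations. The sketch below assumes a weight matrix of shape `(n_in, n_out)` whose rows form the groups, and hard concrete gates with illustrative constants; `prob_active` and `group_penalties` are hypothetical helper names.

```python
import numpy as np

# Illustrative stretch limits and temperature (assumed values for this sketch).
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def prob_active(log_alpha):
    """Probability of each hard concrete gate being nonzero."""
    return sigmoid(log_alpha - BETA * np.log(-GAMMA / ZETA))

def group_penalties(weight, log_alpha):
    """Expected L0 (Eq. 16) and L2 (Eq. 17) penalties with one gate per row."""
    p = prob_active(log_alpha)           # shape (n_in,)
    group_size = weight.shape[1]         # |g|: weights leaving one input neuron
    expected_l0 = np.sum(group_size * p)
    expected_l2 = np.sum(p * np.sum(weight ** 2, axis=1))
    return expected_l0, expected_l2

weight = np.ones((3, 4))                 # 3 input neurons, 4 output neurons
log_alpha = np.array([50.0, 50.0, -50.0])  # two gates surely on, one surely off
expected_l0, expected_l2 = group_penalties(weight, log_alpha)
print(expected_l0, expected_l2)
```

Because a closed gate removes an entire row of the weight matrix, the layer can be evaluated with a smaller dense matrix, which is where the practical computation savings come from.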
3 Related work
Compression and sparsification of neural networks has recently gained much traction in the deep learning community. The most common and straightforward technique is parameter / neuron pruning (LeCun et al., 1990) according to some criterion. Whereas weight pruning (Han et al., 2015; Ullrich et al., 2017; Molchanov et al., 2017) is in general inefficient for saving computation time, neuron pruning (Wen et al., 2016; Louizos et al., 2017; Neklyudov et al., 2017) can lead to computation savings. Unfortunately, all of the aforementioned methods require training the original dense network, thus precluding the benefits we can obtain by having exact sparsity on the computation during training. This is in contrast to our approach, where sparsification happens during training, thus theoretically allowing conditional computation to speed up training (Bengio et al., 2013, 2015).

Emulating binary r.v.s with rectifications of continuous r.v.s is not a new concept and has been previously done with Gaussian distributions in the context of generative modelling (Hinton & Ghahramani, 1997; Harva & Kabán, 2007; Salimans, 2016) and with logistic distributions by Bengio et al. (2013) in the context of conditional computation. These distributions can similarly represent the value of exact zero, while still maintaining the tractability of continuous optimization. Nevertheless, they are suboptimal when we require approximations to binary r.v.s (as is the case for the $L_0$ penalty); we cannot represent the bimodal behavior of a Bernoulli r.v. due to the fact that the underlying distribution is unimodal. Another technique that allows for gradient based optimization of discrete r.v.s is the family of smoothing transformations proposed by Rolfe (2016). There the core idea is that if a model has binary latent variables, then we can smooth them with continuous noise in a way that allows for reparametrization gradients. There are two main differences with the hard concrete distribution we employ here: firstly, the double rectification of the hard concrete r.v.s allows us to represent the values of exact zero and one (instead of just zero) and, secondly, due to the underlying concrete distribution the random samples from the hard concrete will better emulate binary r.v.s.

4 Experiments
We validate the effectiveness of our method on two tasks. The first corresponds to the toy classification task of MNIST using a simple multilayer perceptron (MLP) with two hidden layers of size 300 and 100 (LeCun et al., 1998), and a simple convolutional network, the LeNet-5-Caffe (Footnote 4: https://github.com/BVLC/caffe/tree/master/examples/mnist). The second corresponds to the more modern task of CIFAR 10 and CIFAR 100 classification using Wide Residual Networks (Zagoruyko & Komodakis, 2016). For all of our experiments we set $\gamma = -0.1$, $\zeta = 1.1$ and, following the recommendations from Maddison et al. (2016), set $\beta = 2/3$ for the concrete distributions. We initialized the locations $\log \alpha$ by sampling from a normal distribution with a standard deviation of $0.01$ and a mean that yields $\frac{\alpha}{\alpha + 1}$ to be approximately equal to the original dropout rate employed at each of the networks. We used a single sample of the gate $\mathbf{z}$ for each minibatch of datapoints during the optimization, even though this can lead to larger variance in the gradients (Kingma et al., 2015). In this way we show that we can obtain the speedups in training with practical implementations, without actually hurting the overall performance of the network. We have made the code publicly available at https://github.com/AMLabAmsterdam/L0_regularization.

4.1 MNIST classification and sparsification
For these experiments we did no further regularization besides the $L_0$ norm, and optimization was done with Adam (Kingma & Ba, 2014) using the default hyperparameters and temporal averaging. We can see in Table 1 that our approach is competitive with other methods that tackle neural network compression. However, it is worth noting that all of these approaches prune the network post-training using thresholds while requiring training the full network. We can further see that our approach minimizes the number of parameters more at layers where the gates affect a larger part of the cost; for the MLP this corresponds to the input layer whereas for the LeNet5 this corresponds to the first fully connected layer. In contrast, the methods with sparsity inducing priors (Louizos et al., 2017; Neklyudov et al., 2017) sparsify parameters irrespective of that extra cost (since they are only encouraged by the prior to move parameters to zero) and as a result they achieve similar sparsity on all of the layers. Nonetheless, it should be mentioned that we can in principle increase the sparsification on specific layers simply by specifying a separate $\lambda$ for each layer, e.g. by increasing the $\lambda$ for gates that affect fewer parameters. We provide such results in the “$\lambda$ sep.” rows.
| Network & size | Method | Pruned architecture | Error (%) |
|---|---|---|---|
| MLP 784-300-100 | Sparse VD (Molchanov et al., 2017) | 512-114-72 | 1.8 |
| | BC-GNJ (Louizos et al., 2017) | 278-98-13 | 1.8 |
| | BC-GHS (Louizos et al., 2017) | 311-86-14 | 1.8 |
| | $L_{0_{hc}}$, $\lambda = 0.1/N$ | 219-214-100 | 1.4 |
| | $L_{0_{hc}}$, $\lambda$ sep. | 266-88-33 | 1.8 |
| LeNet-5-Caffe 20-50-800-500 | Sparse VD (Molchanov et al., 2017) | 14-19-242-131 | 1.0 |
| | GL (Wen et al., 2016) | 3-12-192-500 | 1.0 |
| | GD (Srinivas & Babu, 2016) | 7-13-208-16 | 1.1 |
| | SBP (Neklyudov et al., 2017) | 3-18-284-283 | 0.9 |
| | BC-GNJ (Louizos et al., 2017) | 8-13-88-13 | 1.0 |
| | BC-GHS (Louizos et al., 2017) | 5-10-76-16 | 1.0 |
| | $L_{0_{hc}}$, $\lambda = 0.1/N$ | 20-25-45-462 | 0.9 |
| | $L_{0_{hc}}$, $\lambda$ sep. | 9-18-65-25 | 1.0 |

Table 1: Pruned architecture (number of neurons left per layer) along with the error in the test set after 200 epochs. $N$ denotes the number of training datapoints.

To get a better idea about the potential speedup we can obtain in training, we plot in Figure 3 the expected floating point operations (FLOPs), under the probability of the gate being active, as a function of the training iterations. We also included the theoretical speedup we can obtain by using dropout (Srivastava et al., 2014) networks. As we can observe, our $L_0$ minimization procedure that is targeted towards neuron sparsity can potentially yield significant computational benefits compared to the original or dropout architectures, with minimal or no loss in performance. We further observe that there is a significant difference in the FLOP count for the LeNet model between the $\lambda = 0.1/N$ and $\lambda$ sep. settings. This is because we employed larger values for $\lambda$ ($10/N$ and $0.5/N$) for the convolutional layers (which contribute the most to the computation) in the $\lambda$ sep. setting. As a result, this setting is preferable when we are concerned with speedup, rather than network compression (which is affected only by the number of parameters).
4.2 CIFAR classification
For WideResNets we apply $L_0$ regularization on the weights of the hidden layer of the residual blocks, i.e. where dropout is usually employed. We also employed an $L_2$ regularization term as described in Section 2.3 with the weight decay coefficient used in Zagoruyko & Komodakis (2016). For the layers with the hard concrete gates we divided the weight decay coefficient by 0.7 to ensure that a priori we assume the same length-scale as the 0.3 dropout equivalent network. For optimization we employed the procedure described in Zagoruyko & Komodakis (2016) with a minibatch of 128 datapoints, which was split between two GPUs, and used a single sample for the gates for each GPU.
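| Network | CIFAR-10 | CIFAR-100 |
|---|---|---|
| original-ResNet-110 (He et al., 2016a) | 6.43 | 25.16 |
| pre-act-ResNet-110 (He et al., 2016b) | 6.37 | - |
| WRN-28-10 (Zagoruyko & Komodakis, 2016) | 4.00 | 21.18 |
| WRN-28-10-dropout (Zagoruyko & Komodakis, 2016) | 3.89 | 18.85 |
| WRN-28-10-$L_{0_{hc}}$, $\lambda = 0.001/N$ | 3.83 | 18.75 |
| WRN-28-10-$L_{0_{hc}}$, $\lambda = 0.002/N$ | 3.93 | 19.04 |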
As we can observe in Table 2, with a $\lambda$ of $0.001/N$ the $L_0$ regularized wide residual network improves upon the accuracy of the dropout equivalent network on both CIFAR 10 and CIFAR 100. Furthermore, it simultaneously allows for potential training time speedup due to gradually decreasing the number of FLOPs, as we can see in Figures 3(a) and 3(b). This sparsity is also obtained without any “lag” in convergence speed, as in Figure 3(c) we observe a behaviour that is similar to the dropout network. Finally, we observe that by further increasing $\lambda$ we obtain a model that has a slight error increase but can allow for a larger speedup.
5 Discussion
We have described a general recipe that allows for optimizing the $L_0$ norm of parametric models in a principled and effective manner. The method is based on smoothing the combinatorial problem with continuous distributions followed by a hard-sigmoid. To this end, we also proposed a novel distribution which we coin the hard concrete; it is a “stretched” binary concrete distribution, the samples of which are transformed by a hard-sigmoid. This in turn better mimics the binary nature of Bernoulli distributions while still allowing for efficient gradient based optimization. In experiments we have shown that the proposed $L_0$ minimization process leads to neural network sparsification that is competitive with current approaches while theoretically allowing for speedup in training. We have further shown that this process can provide a good inductive bias and regularizer, as on the CIFAR experiments with wide residual networks we improved upon dropout.
As for future work, better harnessing the power of conditional computation for efficiently training very large neural networks with learned sparsity patterns is a potential research direction. It would also be interesting to adopt a full Bayesian treatment over the parameters $\boldsymbol{\theta}$, such as the one employed by Molchanov et al. (2017) and Louizos et al. (2017). This would then allow for further speedup and compression due to the ability of automatically learning the bit precision of each weight. Finally, it would be interesting to explore the behavior of hard concrete r.v.s in binary latent variable models, since they can be used as a drop-in replacement that allows us to maintain both the discrete nature as well as the efficient reparametrization gradient optimization.
Acknowledgements
We would like to thank Taco Cohen, Thomas Kipf, Patrick Forré, and Rianne van den Berg for feedback on an early draft of this paper.
References
Akaike (1998) Hirotugu Akaike. Information theory and an extension of the maximum likelihood principle. In Selected Papers of Hirotugu Akaike, pp. 199–213. Springer, 1998.
Beal (2003) Matthew James Beal. Variational algorithms for approximate Bayesian inference. 2003.
Bengio et al. (2015) Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, and Doina Precup. Conditional computation in neural networks for faster models. arXiv preprint arXiv:1511.06297, 2015.
Bengio et al. (2013) Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
Cover & Thomas (2012) Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.
Gal et al. (2017) Yarin Gal, Jiri Hron, and Alex Kendall. Concrete dropout. arXiv preprint arXiv:1705.07832, 2017.
Han et al. (2015) Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.
Harva & Kabán (2007) Markus Harva and Ata Kabán. Variational learning for rectified factor analysis. Signal Processing, 87(3):509–527, 2007.
He et al. (2016a) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016a.
He et al. (2016b) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645. Springer, 2016b.
Hershey & Olsen (2007) John R Hershey and Peder A Olsen. Approximating the Kullback Leibler divergence between Gaussian mixture models. In Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, volume 4, pp. IV–317. IEEE, 2007.
Hinton & Ghahramani (1997) Geoffrey E Hinton and Zoubin Ghahramani. Generative models for discovering sparse distributed representations. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 352(1358):1177–1190, 1997.
Jang et al. (2016) Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
Kingma & Ba (2014) Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Kingma & Welling (2014) Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. International Conference on Learning Representations (ICLR), 2014.
Kingma et al. (2015) Diederik P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pp. 2575–2583, 2015.
LeCun et al. (1990) Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in Neural Information Processing Systems 2, NIPS 1989, volume 2, pp. 598–605. Morgan-Kaufmann Publishers, 1990.
LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
Louizos et al. (2017) Christos Louizos, Karen Ullrich, and Max Welling. Bayesian compression for deep learning. arXiv preprint arXiv:1705.08665, 2017.
Maddison et al. (2016) Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
Mitchell & Beauchamp (1988) Toby J Mitchell and John J Beauchamp. Bayesian variable selection in linear regression. Journal of the American Statistical Association, 83(404):1023–1032, 1988.
Mnih & Gregor (2014) Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030, 2014.
Mnih & Rezende (2016) Andriy Mnih and Danilo Rezende. Variational inference for Monte Carlo objectives. In International Conference on Machine Learning, pp. 2188–2196, 2016.
Molchanov et al. (2017) Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. arXiv preprint arXiv:1701.05369, 2017.
Neklyudov et al. (2017) Kirill Neklyudov, Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Structured Bayesian pruning via log-normal multiplicative noise. arXiv preprint arXiv:1705.07283, 2017.
Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pp. 1278–1286, 2014.
Rolfe (2016) Jason Tyler Rolfe. Discrete variational autoencoders. arXiv preprint arXiv:1609.02200, 2016.
Salimans (2016) Tim Salimans. A structured variational auto-encoder for learning deep hierarchies of sparse features. arXiv preprint arXiv:1602.08734, 2016.
Schwarz et al. (1978) Gideon Schwarz et al. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.
Srinivas & Babu (2016) Suraj Srinivas and R Venkatesh Babu. Generalized dropout. arXiv preprint arXiv:1611.06791, 2016.
Srinivas et al. (2017) Suraj Srinivas, Akshayvarun Subramanya, and R Venkatesh Babu. Training sparse neural networks. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pp. 455–462. IEEE, 2017.
Srivastava et al. (2014) Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
Tibshirani (1996) Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pp. 267–288, 1996.
Tompson et al. (2015) Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, and Christoph Bregler. Efficient object localization using convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 648–656, 2015.
Tucker et al. (2017) George Tucker, Andriy Mnih, Chris J Maddison, and Jascha Sohl-Dickstein. REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models. arXiv preprint arXiv:1703.07370, 2017.
Ullrich et al. (2017) Karen Ullrich, Edward Meeds, and Max Welling. Soft weight-sharing for neural network compression. ICLR, 2017.
Wen et al. (2016) Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pp. 2074–2082, 2016.
Williams (1992) Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
Zagoruyko & Komodakis (2016) Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
Zhang et al. (2016) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
Appendix
Appendix A Relation to variational inference
The objective function described in Eq. 3 is in fact a special case of a variational lower bound over the parameters of the network under a spike and slab (Mitchell & Beauchamp, 1988) prior. The spike and slab distribution is the gold standard in sparsity as far as Bayesian inference is concerned and it is defined as a mixture of a delta spike at zero and a continuous distribution over the real line (e.g. a standard normal):

$$p(z) = \mathrm{Bernoulli}(\pi), \qquad p(\theta | z = 0) = \delta(\theta), \qquad p(\theta | z = 1) = \mathcal{N}(\theta | 0, 1). \tag{18}$$
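Since the true posterior distribution over the parameters under this prior is intractable, we will use variational inference (Beal, 2003). Let $q(\boldsymbol{\theta}, \mathbf{z})$ be a spike and slab approximate posterior over the parameters $\boldsymbol{\theta}$ and gate variables $\mathbf{z}$, where we assume that it factorizes over the dimensionality of the parameters $\boldsymbol{\theta}$. It turns out that we can write the following variational free energy under the spike and slab prior and approximate posterior over a parameter vector $\boldsymbol{\theta}$:

$$\mathcal{F} = -\mathbb{E}_{q(\mathbf{z}) q(\boldsymbol{\theta}|\mathbf{z})}\big[\log p(\mathcal{D} | \boldsymbol{\theta})\big] + \sum_{j=1}^{|\theta|} KL\big(q(z_j) \| p(z_j)\big) + \sum_{j=1}^{|\theta|} \Big(q(z_j = 1) KL\big(q(\theta_j | z_j = 1) \| p(\theta_j | z_j = 1)\big)\Big) + \sum_{j=1}^{|\theta|} \Big(q(z_j = 0) KL\big(q(\theta_j | z_j = 0) \| p(\theta_j | z_j = 0)\big)\Big) \tag{19}$$

$$= -\mathbb{E}_{q(\mathbf{z}) q(\boldsymbol{\theta}|\mathbf{z})}\big[\log p(\mathcal{D} | \boldsymbol{\theta})\big] + \sum_{j=1}^{|\theta|} KL\big(q(z_j) \| p(z_j)\big) + \sum_{j=1}^{|\theta|} \Big(q(z_j = 1) KL\big(q(\theta_j | z_j = 1) \| p(\theta_j | z_j = 1)\big)\Big), \tag{20}$$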
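where the last step is due to $KL\big(q(\theta_j | z_j = 0) \| p(\theta_j | z_j = 0)\big) = 0$ (Footnote 5: We can see that this is indeed the case by taking the limit $\sigma \to 0$ of the KL-divergence of two Gaussians that have the same mean and variance.). The term that involves $KL\big(q(z_j) \| p(z_j)\big)$ corresponds to the KL-divergence from the Bernoulli prior $p(z_j)$ to the Bernoulli approximate posterior $q(z_j)$, and $KL\big(q(\theta_j | z_j = 1) \| p(\theta_j | z_j = 1)\big)$ can be interpreted as the “code cost”, or else the amount of information the parameter $\theta_j$ contains about the data $\mathcal{D}$, measured by the KL-divergence from the prior $p(\theta_j | z_j = 1)$.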
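Now consider making the assumption that we are optimizing, rather than integrating, over $\boldsymbol{\theta}$ and further assuming that $KL\big(q(\theta_j | z_j = 1) \| p(\theta_j | z_j = 1)\big) = \lambda$. We can justify this assumption from an empirical Bayesian procedure: there is a hypothetical prior for each parameter $p(\theta_j | z_j = 1)$ that adapts to $q(\theta_j | z_j = 1)$ in a way that results in needing, approximately, $\lambda$ nats to transform $p(\theta_j | z_j = 1)$ to that particular $q(\theta_j | z_j = 1)$. Those $\lambda$ nats are thus the amount of information that $q(\theta_j | z_j = 1)$ can encode about the data had we used that $p(\theta_j | z_j = 1)$ as the prior. Notice that under this view we can consider $\lambda$ as the amount of flexibility of that hypothetical prior; with $\lambda = 0$ we have a prior that is flexible enough to represent exactly $q(\theta_j | z_j = 1)$, thus resulting in no code cost and possible overfitting. Under this assumption the variational free energy can be rewritten as:

$$\mathcal{F} = -\mathbb{E}_{q(\mathbf{z})}\big[\log p(\mathcal{D} | \tilde{\boldsymbol{\theta}} \odot \mathbf{z})\big] + \sum_{j=1}^{|\theta|} KL\big(q(z_j) \| p(z_j)\big) + \lambda \sum_{j=1}^{|\theta|} q(z_j = 1) \tag{21}$$

$$\geq -\mathbb{E}_{q(\mathbf{z})}\big[\log p(\mathcal{D} | \tilde{\boldsymbol{\theta}} \odot \mathbf{z})\big] + \lambda \sum_{j=1}^{|\theta|} \pi_j, \tag{22}$$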
where $\tilde{\boldsymbol{\theta}}$ corresponds to the optimized $\boldsymbol{\theta}$ and the last step is due to the positivity of the KL-divergence. Now by taking the negative log-probability of the data to be equal to the loss $\mathcal{L}(\cdot)$ of Eq. 1 we see that Eq. 22 is the same as Eq. 3. Note that in case we are interested in the uncertainty of the gates $\mathbf{z}$, we should optimize Eq. 21, rather than Eq. 22, as this will properly penalize the entropy of $q(\mathbf{z})$. Furthermore, Eq. 21 also allows for the incorporation of prior information about the behavior of the gates (e.g. gates being active 10% of the time, on average). We have thus shown that the expected $L_0$ minimization procedure is in fact a close surrogate to a variational bound involving a spike and slab distribution over the parameters and a fixed coding cost for the parameters when the gates are active.
Appendix B The hard concrete distribution
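As mentioned in the main text, the hard concrete is a straightforward modification of the binary concrete (Maddison et al., 2016; Jang et al., 2016); let $q_s(s | \phi)$ be the probability density function (pdf) and $Q_s(s | \phi)$ the cumulative distribution function (CDF) of a binary concrete random variable $s$:

$$q_s(s | \phi) = \frac{\beta \alpha s^{-\beta - 1} (1 - s)^{-\beta - 1}}{\big(\alpha s^{-\beta} + (1 - s)^{-\beta}\big)^2}, \tag{23}$$

$$Q_s(s | \phi) = \mathrm{Sigmoid}\big((\log s - \log(1 - s))\beta - \log \alpha\big). \tag{24}$$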
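Now by stretching this distribution to the $(\gamma, \zeta)$ interval, with $\gamma < 0$ and $\zeta > 1$, we obtain $\bar{s} = s(\zeta - \gamma) + \gamma$ with the following pdf and CDF:

$$q_{\bar{s}}(\bar{s} | \phi) = \frac{1}{|\zeta - \gamma|} q_s\Big(\frac{\bar{s} - \gamma}{\zeta - \gamma} \Big| \phi\Big), \qquad Q_{\bar{s}}(\bar{s} | \phi) = Q_s\Big(\frac{\bar{s} - \gamma}{\zeta - \gamma} \Big| \phi\Big), \tag{25}$$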
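and by further rectifying $\bar{s}$ with the hard-sigmoid, $z = \min(1, \max(0, \bar{s}))$, we obtain the following distribution over $z$:

$$q(z | \phi) = Q_{\bar{s}}(0 | \phi) \delta(z) + \big(1 - Q_{\bar{s}}(1 | \phi)\big) \delta(z - 1) + \big(Q_{\bar{s}}(1 | \phi) - Q_{\bar{s}}(0 | \phi)\big) q_{\bar{s}}\big(z | \bar{s} \in (0, 1), \phi\big), \tag{26}$$

which is composed of a delta peak at zero with probability $Q_{\bar{s}}(0 | \phi)$, a delta peak at one with probability $1 - Q_{\bar{s}}(1 | \phi)$, and a truncated version of $q_{\bar{s}}(\bar{s} | \phi)$ in the $(0, 1)$ range.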
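The three mixture weights of the hard concrete distribution follow directly from the CDF of the stretched binary concrete. The sketch below computes them from the binary concrete CDF; the helper names and the default values of `beta`, `gamma` and `zeta` are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def binary_concrete_cdf(s, log_alpha, beta):
    """CDF of a binary concrete variable on (0, 1)."""
    return sigmoid(beta * (np.log(s) - np.log1p(-s)) - log_alpha)

def stretched_cdf(s_bar, log_alpha, beta, gamma, zeta):
    """CDF of the stretched variable s_bar = s * (zeta - gamma) + gamma."""
    return binary_concrete_cdf((s_bar - gamma) / (zeta - gamma), log_alpha, beta)

def hard_concrete_masses(log_alpha, beta=2.0 / 3.0, gamma=-0.1, zeta=1.1):
    """Weights of the delta at zero, the delta at one, and the interior part."""
    p_zero = stretched_cdf(0.0, log_alpha, beta, gamma, zeta)
    p_one = 1.0 - stretched_cdf(1.0, log_alpha, beta, gamma, zeta)
    p_interior = 1.0 - p_zero - p_one
    return p_zero, p_one, p_interior

p_zero, p_one, p_interior = hard_concrete_masses(log_alpha=0.0)
print(p_zero, p_one, p_interior)
```

Increasing the location shifts mass from the peak at zero to the peak at one, while a wider stretch interval or a lower temperature increases the total mass on the two peaks.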
Appendix C Negative KL-divergence for hard concrete distributions
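In case Eq. 21 is to be optimized with a hard concrete $q(z)$, we have to compute the KL-divergence from a prior $p(z)$ to $q(z)$. It is necessary for the prior $p(z)$ to have the same support as $q(z)$ in order for the KL-divergence to be valid; as a result we can let the prior $p(z)$ similarly be a hard-sigmoid transformation of an arbitrary continuous distribution $p(\bar{s})$ with CDF $P_{\bar{s}}(\bar{s})$:

$$p(z) = P_{\bar{s}}(0) \delta(z) + \big(1 - P_{\bar{s}}(1)\big) \delta(z - 1) + \big(P_{\bar{s}}(1) - P_{\bar{s}}(0)\big) p_{\bar{s}}\big(z | \bar{s} \in (0, 1)\big). \tag{27}$$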
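Since both $q(z)$ and $p(z)$ are mixtures with the same number of components, we can use the chain rule of relative entropy (Cover & Thomas, 2012; Hershey & Olsen, 2007) in order to compute the KL-divergence:

$$-KL\big(q(z) \| p(z)\big) = Q_{\bar{s}}(0) \log \frac{P_{\bar{s}}(0)}{Q_{\bar{s}}(0)} + \big(1 - Q_{\bar{s}}(1)\big) \log \frac{1 - P_{\bar{s}}(1)}{1 - Q_{\bar{s}}(1)} + \big(Q_{\bar{s}}(1) - Q_{\bar{s}}(0)\big) \log \frac{P_{\bar{s}}(1) - P_{\bar{s}}(0)}{Q_{\bar{s}}(1) - Q_{\bar{s}}(0)} - \big(Q_{\bar{s}}(1) - Q_{\bar{s}}(0)\big) \int_0^1 q_{\bar{s}}\big(\bar{s} | \bar{s} \in (0, 1)\big) \log \frac{q_{\bar{s}}\big(\bar{s} | \bar{s} \in (0, 1)\big)}{p_{\bar{s}}\big(\bar{s} | \bar{s} \in (0, 1)\big)} \, d\bar{s}, \tag{28}$$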
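where $\bar{s}$ corresponds to the pre-rectified variable. Notice that in case the integral under the truncated distribution $q_{\bar{s}}\big(\bar{s} | \bar{s} \in (0, 1)\big)$ is not available in closed form, we can still obtain a Monte Carlo estimate by sampling the truncated distribution, on e.g. an $(a, b)$ interval, via the inverse transform method:

$$u \sim \mathcal{U}(0, 1), \qquad \bar{s} = Q_{\bar{s}}^{-1}\big(Q_{\bar{s}}(a) + u (Q_{\bar{s}}(b) - Q_{\bar{s}}(a))\big), \tag{29}$$

where $Q_{\bar{s}}^{-1}(\cdot)$ corresponds to the quantile function and $Q_{\bar{s}}(\cdot)$ to the CDF of the random variable $\bar{s}$. Furthermore, it should be mentioned that $KL\big(q(\bar{s}) \| p(\bar{s})\big) \neq KL\big(q(z) \| p(z)\big)$, since the rectifications are not invertible transformations.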
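The inverse transform step can be sketched for the stretched binary concrete, whose quantile function is available in closed form by inverting its CDF. The function names and the parameter values below are hypothetical and serve only as an illustration under these assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p) - np.log1p(-p)

def stretched_cdf(s_bar, log_alpha, beta, gamma, zeta):
    """CDF of the stretched binary concrete variable."""
    s = (s_bar - gamma) / (zeta - gamma)
    return sigmoid(beta * logit(s) - log_alpha)

def stretched_quantile(p, log_alpha, beta, gamma, zeta):
    """Inverse of stretched_cdf, obtained by solving the CDF for s_bar."""
    s = sigmoid((logit(p) + log_alpha) / beta)
    return s * (zeta - gamma) + gamma

def sample_truncated(a, b, size, rng, log_alpha, beta, gamma, zeta):
    """Sample the stretched distribution truncated to the (a, b) interval."""
    u = rng.uniform(size=size)
    lo = stretched_cdf(a, log_alpha, beta, gamma, zeta)
    hi = stretched_cdf(b, log_alpha, beta, gamma, zeta)
    return stretched_quantile(lo + u * (hi - lo), log_alpha, beta, gamma, zeta)

rng = np.random.default_rng(0)
params = dict(log_alpha=0.5, beta=2.0 / 3.0, gamma=-0.1, zeta=1.1)  # illustrative
samples = sample_truncated(0.0, 1.0, 10_000, rng, **params)
print(samples.min(), samples.max())
```

Averaging the log-density ratio over such samples gives the Monte Carlo estimate of the truncated term, provided the prior density can be evaluated on the same interval.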