# Learning Sparse Neural Networks through L_0 Regularization

We propose a practical method for L_0 norm regularization for neural networks: pruning the network during training by encouraging weights to become exactly zero. Such regularization is interesting since (1) it can greatly speed up training and inference, and (2) it can improve generalization. AIC and BIC, well-known model selection criteria, are special cases of L_0 regularization. However, since the L_0 norm of weights is non-differentiable, we cannot incorporate it directly as a regularization term in the objective function. We propose a solution through the inclusion of a collection of non-negative stochastic gates, which collectively determine which weights to set to zero. We show that, somewhat surprisingly, for certain distributions over the gates, the expected L_0 norm of the resulting gated weights is differentiable with respect to the distribution parameters. We further propose the hard concrete distribution for the gates, which is obtained by "stretching" a binary concrete distribution and then transforming its samples with a hard-sigmoid. The parameters of the distribution over the gates can then be jointly optimized with the original network parameters. As a result our method allows for straightforward and efficient learning of model structures with stochastic gradient descent and allows for conditional computation in a principled way. We perform various experiments to demonstrate the effectiveness of the resulting approach and regularizer.

## 1 Introduction

Deep neural networks are flexible function approximators that have been very successful in a broad range of tasks. They can easily scale to millions of parameters while allowing for tractable optimization with mini-batch stochastic gradient descent (SGD), graphics processing units (GPUs) and parallel computation. Nevertheless, they do have drawbacks. Firstly, recent works (Han et al., 2015; Ullrich et al., 2017; Molchanov et al., 2017) have shown that they are greatly overparametrized, as they can be pruned significantly without any loss in accuracy; a large part of their computation and resources is therefore unnecessary. Secondly, if not properly regularized, they can easily overfit and even memorize random patterns in the data (Zhang et al., 2016). This overfitting can lead to poor generalization in practice.

A way to address both of these issues is by employing model compression and sparsification techniques. By sparsifying the model, we can avoid unnecessary computation and resources, since irrelevant degrees of freedom are pruned away and do not need to be computed. Furthermore, we reduce its complexity, thus penalizing memorization and alleviating overfitting.

A conceptually attractive approach is the $L_0$ norm regularization of (blocks of) parameters; this explicitly penalizes parameters for being different from zero with no further restrictions. However, the combinatorial nature of this problem makes for an intractable optimization for large models.

In this paper we propose a general framework for surrogate $L_0$ regularized objectives. It is realized by smoothing the expected $L_0$ regularized objective with continuous distributions in a way that can maintain the exact zeros in the parameters while still allowing for efficient gradient based optimization. This is achieved by transforming continuous random variables (r.v.s) with a hard nonlinearity, the hard-sigmoid. We further propose and employ a novel distribution obtained by this procedure; the hard concrete. It is obtained by “stretching” a binary concrete random variable (Maddison et al., 2016; Jang et al., 2016) and then passing its samples through a hard-sigmoid. We demonstrate the effectiveness of this simple procedure in various experiments.

## 2 Minimizing the L0 norm of parametric models

One way to sparsify parametric models, such as deep neural networks, with the least assumptions about the parameters is the following; let $\mathcal{D}$ be a dataset consisting of $N$ i.i.d. input output pairs $\{(x_1, y_1), \dots, (x_N, y_N)\}$ and consider a regularized empirical risk minimization procedure with an $L_0$ regularization on the parameters $\theta$ of a hypothesis (e.g. a neural network) $h(\cdot;\theta)$ (this assumption is just for ease of explanation; our proposed framework can be applied to any objective function involving parameters):

$$R(\theta) = \frac{1}{N}\left(\sum_{i=1}^{N}\mathcal{L}\big(h(x_i;\theta), y_i\big)\right) + \lambda\|\theta\|_0, \qquad \|\theta\|_0 = \sum_{j=1}^{|\theta|}\mathbb{I}[\theta_j \neq 0], \tag{1}$$

$$\theta^* = \arg\min_{\theta}\{R(\theta)\},$$

where $|\theta|$ is the dimensionality of the parameters, $\lambda$ is a weighting factor for the regularization and $\mathcal{L}(\cdot)$ corresponds to a loss function, e.g. cross-entropy loss for classification or mean-squared error for regression. The $L_0$ norm penalizes the number of non-zero entries of the parameter vector and thus encourages sparsity in the final estimates $\theta^*$. The Akaike Information Criterion (AIC) (Akaike, 1998) and the Bayesian Information Criterion (BIC) (Schwarz et al., 1978), well-known model selection criteria, correspond to specific choices of $\lambda$. Notice that the $L_0$ norm induces no shrinkage on the actual values of the parameters $\theta$; this is in contrast to e.g. $L_1$ regularization and the Lasso (Tibshirani, 1996), where the sparsity is due to shrinking the actual values of $\theta$. We provide a visualization of this effect in Figure 1.
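The difference between counting and shrinking can be made concrete with a small numerical comparison. The following is a minimal sketch; the function names and the example vector are illustrative assumptions and do not come from the paper.

```python
def l0_norm(theta):
    """Number of non-zero entries of the parameter vector."""
    return sum(1 for t in theta if t != 0.0)


def l1_norm(theta):
    """Sum of absolute values of the parameter vector."""
    return sum(abs(t) for t in theta)


theta = [0.0, 2.5, 0.0, -0.3]          # hypothetical parameters
scaled = [10.0 * t for t in theta]     # same sparsity pattern, larger magnitudes

# The L0 penalty depends only on which entries are non-zero ...
l0_before, l0_after = l0_norm(theta), l0_norm(scaled)
# ... whereas the L1 penalty grows with the magnitudes and so rewards shrinkage.
l1_before, l1_after = l1_norm(theta), l1_norm(scaled)
```

Rescaling the non-zero entries leaves the $L_0$ penalty unchanged while the $L_1$ penalty grows tenfold, which is why only the latter pushes the surviving weights towards zero.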

Unfortunately, optimization under this penalty is computationally intractable due to the non-differentiability and combinatorial nature of the $2^{|\theta|}$ possible states of the parameter vector $\theta$. How can we relax the discrete nature of the $L_0$ penalty such that we allow for efficient continuous optimization of Eq. 1, while allowing for exact zeros in the parameters? This section will present the necessary details of our approach.

### 2.1 A general recipe for efficiently minimizing L0 norms

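Consider the $L_0$ norm under a simple re-parametrization of $\theta$:

$$\theta_j = \tilde{\theta}_j z_j, \qquad z_j \in \{0, 1\}, \qquad \tilde{\theta}_j \neq 0, \qquad \|\theta\|_0 = \sum_{j=1}^{|\theta|} z_j, \tag{2}$$

where the $z_j$ correspond to binary “gates” that denote whether a parameter is present and the $L_0$ norm corresponds to the number of gates being “on”. By letting $q(z_j|\pi_j) = \mathrm{Bern}(\pi_j)$ be a Bernoulli distribution over each gate $z_j$ we can reformulate the minimization of Eq. 1 as penalizing the number of parameters being used, on average, as follows:

$$R(\tilde{\theta}, \pi) = \mathbb{E}_{q(z|\pi)}\left[\frac{1}{N}\left(\sum_{i=1}^{N}\mathcal{L}\big(h(x_i;\tilde{\theta}\odot z), y_i\big)\right)\right] + \lambda\sum_{j=1}^{|\theta|}\pi_j, \tag{3}$$

$$\tilde{\theta}^*, \pi^* = \arg\min_{\tilde{\theta},\pi}\{R(\tilde{\theta},\pi)\},$$

where $\odot$ corresponds to the elementwise product. The objective described in Eq. 3 is in fact a special case of a variational bound over the parameters involving spike and slab (Mitchell & Beauchamp, 1988) priors and approximate posteriors; we refer interested readers to Appendix A.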
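As a sanity check of the penalty in Eq. 3, the expected number of active Bernoulli gates can be compared against a Monte Carlo estimate. This is a minimal sketch under assumed values; the weights, gate probabilities and helper names are hypothetical.

```python
import random


def sample_gated_params(theta_tilde, pi, rng):
    """Draw z_j ~ Bernoulli(pi_j) and return theta_j = theta_tilde_j * z_j."""
    gates = [1 if rng.random() < p else 0 for p in pi]
    return [t * z for t, z in zip(theta_tilde, gates)]


def l0_norm(theta):
    """Number of non-zero entries."""
    return sum(1 for t in theta if t != 0.0)


rng = random.Random(0)
theta_tilde = [0.7, -1.2, 0.4, 2.0]     # hypothetical non-zero weights
pi = [0.9, 0.5, 0.1, 0.0]               # hypothetical gate probabilities

num_samples = 20000
mc_estimate = sum(
    l0_norm(sample_gated_params(theta_tilde, pi, rng)) for _ in range(num_samples)
) / num_samples
closed_form = sum(pi)                   # penalty term of Eq. 3, without lambda
```

The sampled average of $\|\theta\|_0$ agrees with $\sum_j \pi_j$ up to Monte Carlo noise, so the penalty itself needs no sampling; only the data term does.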

Now the second term of the r.h.s. of Eq. 3 is straightforward to minimize; however, the first term is problematic for $\pi$ due to the discrete nature of $z$, which does not allow for efficient gradient based optimization. While in principle a gradient estimator such as REINFORCE (Williams, 1992) could be employed, it suffers from high variance, and control variates (Mnih & Gregor, 2014; Mnih & Rezende, 2016; Tucker et al., 2017), which require auxiliary models or multiple evaluations of the network, have to be employed. Two simpler alternatives would be to use either the straight-through estimator (Bengio et al., 2013), as done by Srinivas et al. (2017), or the concrete distribution, as e.g. in Gal et al. (2017). Unfortunately both of these approaches have drawbacks; the first one provides biased gradients due to ignoring the Heaviside function in the likelihood during the gradient evaluation, whereas the second one does not allow for the gates (and hence parameters) to be exactly zero during optimization, thus precluding the benefits of conditional computation (Bengio et al., 2013).

Fortunately, there is a simple alternative way to smooth the objective such that we allow for efficient gradient based optimization of the expected $L_0$ norm along with exact zeros in the parameters $\theta$. Let $s$ be a continuous random variable with a distribution $q(s)$ that has parameters $\phi$. We can now let the gates $z$ be given by a hard-sigmoid rectification of $s$, as follows (we chose to employ a hard-sigmoid instead of a rectifier, $g(\cdot) = \max(0, \cdot)$, so as to have the variable $z$ better mimic a binary gate rather than a scale variable):

$$s \sim q(s|\phi) \tag{4}$$

$$z = \min(1, \max(0, s)). \tag{5}$$

This would then allow the gate to be exactly zero and, due to the underlying continuous random variable $s$, we can still compute the probability of the gate being non-zero (active). This is easily obtained by the cumulative distribution function (CDF) $Q(\cdot)$ of $s$:

$$q(z \neq 0|\phi) = 1 - Q(s \leq 0|\phi), \tag{6}$$

i.e. it is the probability of the $s$ variable being positive. We can thus smooth the binary Bernoulli gates $z$ appearing in Eq. 3 by employing continuous distributions in the aforementioned way:

$$R(\tilde{\theta}, \phi) = \mathbb{E}_{q(s|\phi)}\left[\frac{1}{N}\left(\sum_{i=1}^{N}\mathcal{L}\big(h(x_i;\tilde{\theta}\odot g(s)), y_i\big)\right)\right] + \lambda\sum_{j=1}^{|\theta|}\big(1 - Q(s_j \leq 0|\phi_j)\big), \tag{7}$$

$$\tilde{\theta}^*, \phi^* = \arg\min_{\tilde{\theta},\phi}\{R(\tilde{\theta},\phi)\}, \qquad g(\cdot) = \min(1, \max(0, \cdot)).$$

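Notice that this is a close surrogate to the original objective function in Eq. 3, as we similarly have a cost that explicitly penalizes the probability of a gate being different from zero. Now for continuous distributions $q(s|\phi)$ that allow for the reparameterization trick (Kingma & Welling, 2014; Rezende et al., 2014) we can express the objective in Eq. 7 as an expectation over a parameter free noise distribution $p(\epsilon)$ and a deterministic and differentiable transformation $f(\cdot)$ of the parameters $\phi$ and $\epsilon$:

$$R(\tilde{\theta}, \phi) = \mathbb{E}_{p(\epsilon)}\left[\frac{1}{N}\left(\sum_{i=1}^{N}\mathcal{L}\big(h(x_i;\tilde{\theta}\odot g(f(\phi,\epsilon))), y_i\big)\right)\right] + \lambda\sum_{j=1}^{|\theta|}\big(1 - Q(s_j \leq 0|\phi_j)\big), \tag{8}$$

which allows us to make the following Monte Carlo approximation to the (generally) intractable expectation over the noise distribution $p(\epsilon)$:

$$\begin{aligned}
\hat{R}(\tilde{\theta}, \phi) &= \frac{1}{L}\sum_{l=1}^{L}\left(\frac{1}{N}\left(\sum_{i=1}^{N}\mathcal{L}\big(h(x_i;\tilde{\theta}\odot z^{(l)}), y_i\big)\right)\right) + \lambda\sum_{j=1}^{|\theta|}\big(1 - Q(s_j \leq 0|\phi_j)\big) \\
&= \mathcal{L}_E(\tilde{\theta}, \phi) + \lambda\mathcal{L}_C(\phi), \qquad \text{where } z^{(l)} = g(f(\phi, \epsilon^{(l)})) \text{ and } \epsilon^{(l)} \sim p(\epsilon).
\end{aligned} \tag{9}$$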

$\mathcal{L}_E$ corresponds to the error loss that measures how well the model is fitting the current dataset whereas $\mathcal{L}_C$ refers to the complexity loss that measures the flexibility of the model. Crucially, the total cost in Eq. 9 is now differentiable w.r.t. $\phi$, thus enabling efficient stochastic gradient based optimization, while still allowing for exact zeros at the parameters. One price we pay is that now the gradient of the log-likelihood w.r.t. the parameters $\phi$ of $q(s|\phi)$ is sparse due to the rectifications; nevertheless this should not pose an issue considering the prevalence of rectified linear units in neural networks. Furthermore, due to the stochasticity at $s$ the hard-sigmoid gate $z$ is smoothed to a soft version on average, thus allowing for gradient based optimization to succeed, even when the mean of $s$ is negative or larger than one. An example visualization can be seen in Figure 1(b). It should be noted that a similar argument was also shown at Bengio et al. (2013), where with logistic noise a rectifier nonlinearity was smoothed to a softplus, $\log(1 + \exp(x))$, on average.
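The smoothing argument can be checked numerically. The sketch below assumes, purely for illustration, Gaussian noise with $s = \mu + \sigma\epsilon$ (the framework admits any reparameterizable distribution), and all names and values are hypothetical. It estimates the expected gate with a fixed set of noise samples and compares its finite-difference slope in $\mu$ with that of the noise-free hard-sigmoid.

```python
import math
import random


def hard_sigmoid(s):
    return min(1.0, max(0.0, s))


def expected_gate(mu, sigma, noise):
    """Monte Carlo estimate of E[g(mu + sigma * eps)] for fixed noise samples."""
    return sum(hard_sigmoid(mu + sigma * e) for e in noise) / len(noise)


def prob_active(mu, sigma):
    """1 - Q(s <= 0) for s ~ N(mu, sigma^2), using the Gaussian CDF."""
    return 1.0 - 0.5 * (1.0 + math.erf((0.0 - mu) / (sigma * math.sqrt(2.0))))


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(50000)]   # eps ~ p(eps), reused below

mu, sigma, delta = -0.5, 1.0, 1e-2
# Finite-difference slope of the averaged gate with respect to the mean.
smoothed_slope = (expected_gate(mu + delta, sigma, noise)
                  - expected_gate(mu - delta, sigma, noise)) / (2 * delta)
# Slope of the noise-free hard-sigmoid at the same negative mean.
hard_slope = (hard_sigmoid(mu + delta) - hard_sigmoid(mu - delta)) / (2 * delta)
p_on = prob_active(mu, sigma)
```

With a negative mean the deterministic gate has zero slope, whereas the averaged gate still responds to changes in $\mu$, which is what lets gradient based optimization move a gate out of the flat region.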

### 2.2 The hard concrete distribution

The framework described in Section 2.1 gives us the freedom to choose an appropriate smoothing distribution $q(s)$. A choice that seems to work well in practice is the following; assume that we have a binary concrete (Maddison et al., 2016; Jang et al., 2016) random variable $s$ distributed in the $(0, 1)$ interval with probability density $q_s(s|\phi)$ and cumulative density $Q_s(s|\phi)$. The parameters of the distribution are $\phi = (\log\alpha, \beta)$, where $\log\alpha$ is the location and $\beta$ is the temperature. We can “stretch” this distribution to the $(\gamma, \zeta)$ interval, with $\gamma < 0$ and $\zeta > 1$, and then apply a hard-sigmoid on its random samples:

$$u \sim \mathcal{U}(0, 1), \qquad s = \mathrm{Sigmoid}\big((\log u - \log(1 - u) + \log\alpha)/\beta\big), \qquad \bar{s} = s(\zeta - \gamma) + \gamma, \tag{10}$$

$$z = \min(1, \max(0, \bar{s})). \tag{11}$$

This would then induce a distribution where the probability mass of $q_{\bar{s}}(\bar{s}|\phi)$ on the negative values, $Q_{\bar{s}}(0|\phi)$, is “folded” to a delta peak at zero, the probability mass on values larger than one, $1 - Q_{\bar{s}}(1|\phi)$, is “folded” to a delta peak at one and the original distribution $q_{\bar{s}}(\bar{s}|\phi)$ is truncated to the $(0, 1)$ range. We provide more information and the density of the resulting distribution in the appendix.
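The sampling procedure of Eq. 10 and Eq. 11 can be sketched in a few lines. This is a minimal illustration rather than a reference implementation; the function names and the particular values of $\log\alpha$, $\beta$, $\gamma$ and $\zeta$ below are assumptions chosen only to satisfy $\gamma < 0$ and $\zeta > 1$.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_hard_concrete(log_alpha, beta, gamma, zeta, u):
    """Map uniform noise u in (0, 1) to a hard concrete gate z in [0, 1]."""
    s = sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / beta)  # Eq. 10
    s_bar = s * (zeta - gamma) + gamma                                 # stretch
    return min(1.0, max(0.0, s_bar))                                   # Eq. 11


rng = random.Random(0)
# Illustrative settings: gamma < 0 < 1 < zeta, temperature below one.
log_alpha, beta, gamma, zeta = 0.0, 2.0 / 3.0, -0.1, 1.1

gates = []
for _ in range(10000):
    u = min(max(rng.random(), 1e-6), 1.0 - 1e-6)   # keep log() finite
    gates.append(sample_hard_concrete(log_alpha, beta, gamma, zeta, u))

frac_zero = sum(z == 0.0 for z in gates) / len(gates)
frac_one = sum(z == 1.0 for z in gates) / len(gates)
```

A sizeable fraction of the samples lands exactly on 0 or exactly on 1, which is the “folding” of probability mass onto the two delta peaks; the remaining samples fall strictly inside $(0, 1)$.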

Notice that a similar behavior would have been obtained even if we passed samples from any other distribution over the real line through a hard-sigmoid. The only requirement of the approach is that we can evaluate the CDF of $\bar{s}$ at 0 and 1. The main reason for picking the binary concrete is its close ties with Bernoulli r.v.s. It was originally proposed at Maddison et al. (2016); Jang et al. (2016) as a smooth approximation to Bernoulli r.v.s, a fact that allows for gradient based optimization of its parameters through the reparametrization trick. The temperature $\beta$ controls the degree of approximation, as with $\beta = 0$ we can recover the original Bernoulli r.v. (but lose the differentiable properties) whereas with $0 < \beta < 1$ we obtain a probability density that concentrates its mass near the endpoints (e.g. as shown in Figure 1(a)). As a result, the hard concrete also inherits the same theoretical properties w.r.t. the Bernoulli distribution. Furthermore, it can serve as a better approximation of the discrete nature, since it includes $\{0, 1\}$ in its support, while still allowing for (sub)gradient optimization of its parameters due to the continuous probability mass that connects those two values. We can also view this distribution as a “rounded” version of the original binary concrete, where values larger than $\frac{1 - \gamma}{\zeta - \gamma}$ are rounded to one whereas values smaller than $\frac{-\gamma}{\zeta - \gamma}$ are rounded to zero. We provide an example visualization of the hard concrete distribution in Figure 1(a).

The complexity loss of the objective in Eq. 9 under the hard concrete r.v. is conveniently expressed as follows:

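$$\mathcal{L}_C = \sum_{j=1}^{|\theta|}\big(1 - Q_{\bar{s}_j}(0|\phi_j)\big) = \sum_{j=1}^{|\theta|}\mathrm{Sigmoid}\left(\log\alpha_j - \beta\log\frac{-\gamma}{\zeta}\right). \tag{12}$$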
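The closed form of Eq. 12 can be verified against sampling. The sketch below is illustrative only: the helper names and the parameter values are assumptions, and the Monte Carlo check simply counts how often a stretched binary concrete sample is positive.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def prob_gate_active(log_alpha, beta, gamma, zeta):
    """Closed form of 1 - Q(s_bar <= 0), i.e. one summand of Eq. 12."""
    return sigmoid(log_alpha - beta * math.log(-gamma / zeta))


def complexity_loss(log_alphas, beta, gamma, zeta):
    """Expected number of active gates, L_C in Eq. 12."""
    return sum(prob_gate_active(a, beta, gamma, zeta) for a in log_alphas)


def sample_s_bar(log_alpha, beta, gamma, zeta, u):
    """Stretched binary concrete sample before the hard-sigmoid."""
    s = sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / beta)
    return s * (zeta - gamma) + gamma


beta, gamma, zeta = 2.0 / 3.0, -0.1, 1.1      # illustrative settings
log_alphas = [-3.0, 0.0, 2.0]                 # hypothetical gate locations
loss = complexity_loss(log_alphas, beta, gamma, zeta)

# Monte Carlo check of the closed form for a single gate with log_alpha = 0.
rng = random.Random(0)
n = 20000
hits = 0
for _ in range(n):
    u = min(max(rng.random(), 1e-6), 1.0 - 1e-6)
    hits += sample_s_bar(0.0, beta, gamma, zeta, u) > 0.0
mc_prob = hits / n
```

Because the penalty is an explicit function of $\log\alpha_j$, its gradient is available without sampling; only the error loss requires drawing gates.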

At test time we use the following estimator for the final parameters under a hard concrete gate:

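$$\hat{z} = \min\big(1, \max(0, \mathrm{Sigmoid}(\log\alpha)(\zeta - \gamma) + \gamma)\big), \qquad \theta^* = \tilde{\theta}^* \odot \hat{z}. \tag{13}$$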
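A short sketch of the test-time estimator in Eq. 13 follows. The function names, the default stretch limits and the example weights are assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def deterministic_gate(log_alpha, gamma, zeta):
    """Noise-free gate of Eq. 13."""
    return min(1.0, max(0.0, sigmoid(log_alpha) * (zeta - gamma) + gamma))


def prune(theta_tilde, log_alphas, gamma=-0.1, zeta=1.1):
    """Final parameters theta* = theta_tilde * z_hat, elementwise."""
    return [t * deterministic_gate(a, gamma, zeta)
            for t, a in zip(theta_tilde, log_alphas)]


theta_tilde = [0.8, -1.5, 0.3]          # hypothetical trained weights
log_alphas = [-4.0, 0.0, 4.0]           # hypothetical trained gate locations
theta_star = prune(theta_tilde, log_alphas)
```

With these values the first weight is removed exactly, the last one is kept unchanged, and the middle one is scaled by a gate of one half, showing that the estimator yields exact zeros without any thresholding step.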

### 2.3 Combining the L0 norm with other norms

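While the $L_0$ norm leads to sparse estimates without imposing any shrinkage on $\theta$ it might still be desirable to impose some form of prior assumptions on the values of $\theta$ with alternative norms, e.g. impose smoothness with the $L_2$ norm (i.e. weight decay). In the following we will show how this combination is feasible for the $L_2$ norm. The expected $L_2$ norm under the Bernoulli gating mechanism can be conveniently expressed as:

$$\mathbb{E}_{q(z|\pi)}\big[\|\theta\|_2^2\big] = \sum_{j=1}^{|\theta|}\mathbb{E}_{q(z_j|\pi_j)}\big[z_j^2\tilde{\theta}_j^2\big] = \sum_{j=1}^{|\theta|}\pi_j\tilde{\theta}_j^2, \tag{14}$$

where $\pi_j$ corresponds to the success probability of the Bernoulli gate $z_j$. To maintain a similar expression with our smoothing mechanism, and avoid extra shrinkage for the gates $z_j$, we can take into account that the standard $L_2$ norm penalty is proportional to the negative log density of a zero mean Gaussian prior with a standard deviation of $\sigma = 1$. We will then assume that the $\sigma$ for each $\theta$ is governed by $z$ in a way that when $z = 0$ we have that $\sigma = 1$ and when $z > 0$ we have that $\sigma = z$. As a result, we can obtain the following expression for the $L_2$ penalty (where $\hat{\theta} = \frac{\theta}{\sigma}$):

$$\begin{aligned}
\mathbb{E}_{q(z|\phi)}\big[\|\hat{\theta}\|_2^2\big] &= \sum_{j=1}^{|\theta|}\left(Q_{\bar{s}_j}(0|\phi_j)\frac{0}{1} + \big(1 - Q_{\bar{s}_j}(0|\phi_j)\big)\mathbb{E}_{q(z_j|\phi_j, \bar{s}_j > 0)}\left[\frac{\tilde{\theta}_j^2 z_j^2}{z_j^2}\right]\right) \\
&= \sum_{j=1}^{|\theta|}\big(1 - Q_{\bar{s}_j}(0|\phi_j)\big)\tilde{\theta}_j^2.
\end{aligned} \tag{15}$$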

### 2.4 Group sparsity under an L0 norm

For reasons of computational efficiency it is usually desirable to perform group sparsity instead of parameter sparsity, as this can allow for practical computation savings. For example, in neural networks speedups can be obtained by employing a dropout (Srivastava et al., 2014) like procedure with neuron sparsity in fully connected layers or feature map sparsity for convolutional layers (Wen et al., 2016; Louizos et al., 2017; Neklyudov et al., 2017). This is straightforward to do with hard concrete gates; simply share the gate between all of the members of the group. The expected $L_0$ and, according to Section 2.3, $L_2$ penalties in this scenario can be rewritten as:

$$\mathbb{E}_{q(z|\phi)}\big[\|\theta\|_0\big] = \sum_{g=1}^{|G|}|g|\big(1 - Q(s_g \leq 0|\phi_g)\big) \tag{16}$$

$$\mathbb{E}_{q(z|\phi)}\big[\|\hat{\theta}\|_2^2\big] = \sum_{g=1}^{|G|}\left(\big(1 - Q(s_g \leq 0|\phi_g)\big)\sum_{j=1}^{|g|}\tilde{\theta}_j^2\right), \tag{17}$$

where $|G|$ corresponds to the number of groups and $|g|$ corresponds to the number of parameters of group $g$. For all of our subsequent experiments we employed neuron sparsity, where we introduced a gate per input neuron for fully connected layers and a gate per output feature map for convolutional layers. Notice that in the interpretation we adopt the gate is shared across all locations of the feature map for convolutional layers, akin to spatial dropout (Tompson et al., 2015). This can lead to practical computation savings while training, a benefit which is not possible with the commonly used independent dropout masks per spatial location (e.g. as at Zagoruyko & Komodakis (2016)).
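The group penalties of Eq. 16 and Eq. 17 amount to weighting each group's size and squared norm by the probability that its shared gate is active. The following minimal sketch assumes hard concrete gates with illustrative default settings; the grouping, weights and names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def prob_active(log_alpha, beta=2.0 / 3.0, gamma=-0.1, zeta=1.1):
    """1 - Q(s_g <= 0) for a hard concrete gate (illustrative defaults)."""
    return sigmoid(log_alpha - beta * math.log(-gamma / zeta))


def group_penalties(groups, log_alphas):
    """Expected L0 (Eq. 16) and L2 (Eq. 17) penalties with one gate per group.

    groups: list of lists, the weights theta_tilde belonging to each group.
    log_alphas: one gate location per group.
    """
    expected_l0 = 0.0
    expected_l2 = 0.0
    for weights, log_alpha in zip(groups, log_alphas):
        p = prob_active(log_alpha)
        expected_l0 += len(weights) * p
        expected_l2 += p * sum(w * w for w in weights)
    return expected_l0, expected_l2


# Hypothetical layer: two input neurons, each owning three outgoing weights.
groups = [[0.5, -0.5, 1.0], [2.0, 0.0, -1.0]]
log_alphas = [10.0, -10.0]              # first gate almost surely on, second off
expected_l0, expected_l2 = group_penalties(groups, log_alphas)
```

Since the second gate is almost surely off, its three weights contribute almost nothing to either penalty, regardless of their magnitudes.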

## 3 Related work

Compression and sparsification of neural networks has recently gained much traction in the deep learning community. The most common and straightforward technique is parameter / neuron pruning (LeCun et al., 1990) according to some criterion. Whereas weight pruning (Han et al., 2015; Ullrich et al., 2017; Molchanov et al., 2017) is in general inefficient for saving computation time, neuron pruning (Wen et al., 2016; Louizos et al., 2017; Neklyudov et al., 2017) can lead to computation savings. Unfortunately, all of the aforementioned methods require training the original dense network thus precluding the benefits we can obtain by having exact sparsity on the computation during training. This is in contrast to our approach where sparsification happens during training, thus theoretically allowing conditional computation to speed-up training (Bengio et al., 2013, 2015).

Emulating binary r.v.s with rectifications of continuous r.v.s is not a new concept and has been previously done with Gaussian distributions in the context of generative modelling (Hinton & Ghahramani, 1997; Harva & Kabán, 2007; Salimans, 2016) and with logistic distributions at Bengio et al. (2013) in the context of conditional computation. These distributions can similarly represent the value of exact zero, while still maintaining the tractability of continuous optimization. Nevertheless, they are sub-optimal when we require approximations to binary r.v.s (as is the case for the $L_0$ penalty); we cannot represent the bimodal behavior of a Bernoulli r.v. due to the fact that the underlying distribution is unimodal. Another technique that allows for gradient based optimization of discrete r.v.s are the smoothing transformations proposed by Rolfe (2016). There the core idea is that if a model has binary latent variables, then we can smooth them with continuous noise in a way that allows for reparametrization gradients. There are two main differences with the hard concrete distribution we employ here; firstly, the double rectification of the hard concrete r.v.s allows us to represent the values of exact zero and one (instead of just zero) and, secondly, due to the underlying concrete distribution the random samples from the hard concrete will better emulate binary r.v.s.

## 4 Experiments

We validate the effectiveness of our method on two tasks. The first corresponds to the toy classification task of MNIST using a simple multilayer perceptron (MLP) with two hidden layers of size 300 and 100 (LeCun et al., 1998), and a simple convolutional network, the LeNet-5-Caffe. The second corresponds to the more modern task of CIFAR 10 and CIFAR 100 classification using Wide Residual Networks (Zagoruyko & Komodakis, 2016). For all of our experiments we set $\gamma = -0.1$, $\zeta = 1.1$ and, following the recommendations from Maddison et al. (2016), set $\beta = 2/3$ for the concrete distributions. We initialized the locations $\log\alpha$ by sampling from a normal distribution with a standard deviation of $0.01$ and a mean that yields $\frac{\alpha}{\alpha + 1}$ to be approximately equal to the original dropout rate employed at each of the networks. We used a single sample of the gate $z$ for each minibatch of datapoints during the optimization, even though this can lead to larger variance in the gradients (Kingma et al., 2015). In this way we show that we can obtain the speedups in training with practical implementations, without actually hurting the overall performance of the network. We have made the code publicly available at https://github.com/AMLab-Amsterdam/L0_regularization.

### 4.1 MNIST classification and sparsification

For these experiments we did no further regularization besides the $L_0$ norm, and optimization was done with Adam (Kingma & Ba, 2014) using the default hyper-parameters and temporal averaging. We can see in Table 1 that our approach is competitive with other methods that tackle neural network compression. However, it is worth noting that all of these approaches prune the network post-training using thresholds while requiring training the full network. We can further see that our approach minimizes the number of parameters more at layers where the gates affect a larger part of the cost; for the MLP this corresponds to the input layer whereas for the LeNet5 this corresponds to the first fully connected layer. In contrast, the methods with sparsity inducing priors (Louizos et al., 2017; Neklyudov et al., 2017) sparsify parameters irrespective of that extra cost (since they are only encouraged by the prior to move parameters to zero) and as a result they achieve similar sparsity on all of the layers. Nonetheless, it should be mentioned that we can in principle increase the sparsification on specific layers simply by specifying a separate $\lambda$ for each layer, e.g. by increasing the $\lambda$ for gates that affect fewer parameters. We provide such results at the “$\lambda$ sep.” rows.

To get a better idea about the potential speedup we can obtain in training we plot in Figure 3 the expected, under the probability of the gate being active, floating point operations (FLOPs) as a function of the training iterations. We also included the theoretical speedup we can obtain by using dropout (Srivastava et al., 2014) networks. As we can observe, our $L_0$ minimization procedure that is targeted towards neuron sparsity can potentially yield significant computational benefits compared to the original or dropout architectures, with minimal or no loss in performance. We further observe that there is a significant difference in the FLOP count for the LeNet model between the $\lambda = 0.1/N$ and $\lambda$ sep. settings. This is because we employed larger values for $\lambda$ ($10/N$ and $0.5/N$) for the convolutional layers (which contribute the most to the computation) in the $\lambda$ sep. setting. As a result, this setting is preferable when we are concerned with speedup, rather than network compression (which is affected only by the number of parameters).
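To illustrate what an expected FLOP count under the gate probabilities looks like, the sketch below handles a single fully connected layer with one gate per input neuron. The accounting (two FLOPs per multiply-add, biases ignored), the function name and the gate probabilities are assumptions of this sketch and are not taken from the paper; the layer size of 784 inputs and 300 outputs corresponds to the first layer of an MLP on 28×28 MNIST images.

```python
def expected_dense_flops(gate_probs, n_out):
    """Expected multiply-add FLOPs of a dense layer with gated input neurons.

    gate_probs: probability of each input gate being active.
    n_out: number of output units. Counts 2 FLOPs per multiply-add and
    ignores biases; this accounting is an assumption of the sketch.
    """
    expected_active_inputs = sum(gate_probs)
    return 2.0 * expected_active_inputs * n_out


dense_flops = expected_dense_flops([1.0] * 784, n_out=300)       # no pruning
# Hypothetical gate probabilities part-way through training.
sparse_flops = expected_dense_flops([0.25] * 784, n_out=300)
speedup = dense_flops / sparse_flops
```

Under this accounting the expected cost scales linearly with the expected number of active input neurons, so tracking the gate probabilities over training iterations directly gives a curve of potential savings.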

### 4.2 CIFAR classification

For WideResNets we apply $L_0$ regularization on the weights of the hidden layer of the residual blocks, i.e. where dropout is usually employed. We also employed an $L_2$ regularization term as described in Section 2.3 with the weight decay coefficient used in Zagoruyko & Komodakis (2016). For the layers with the hard concrete gates we divided the weight decay coefficient by 0.7 to ensure that a-priori we assume the same length-scale as the 0.3 dropout equivalent network. For optimization we employed the procedure described in Zagoruyko & Komodakis (2016) with a minibatch of 128 datapoints, which was split between two GPUs, and used a single sample for the gates for each GPU.

As we can observe in Table 2, with a $\lambda$ of $0.001/N$ the $L_0$ regularized wide residual network improves upon the accuracy of the dropout equivalent network on both CIFAR 10 and CIFAR 100. Furthermore, it simultaneously allows for potential training time speedup due to gradually decreasing the number of FLOPs, as we can see in Figures 3(a) and 3(b). This sparsity is also obtained without any “lag” in convergence speed, as in Figure 3(c) we observe a behaviour that is similar to the dropout network. Finally, we observe that by further increasing $\lambda$ we obtain a model that has a slight error increase but can allow for a larger speedup.

## 5 Discussion

We have described a general recipe that allows for optimizing the $L_0$ norm of parametric models in a principled and effective manner. The method is based on smoothing the combinatorial problem with continuous distributions followed by a hard-sigmoid. To this end, we also proposed a novel distribution which we coin as the hard concrete; it is a “stretched” binary concrete distribution, the samples of which are transformed by a hard-sigmoid. This in turn better mimics the binary nature of Bernoulli distributions while still allowing for efficient gradient based optimization. In experiments we have shown that the proposed $L_0$ minimization process leads to neural network sparsification that is competitive with current approaches while theoretically allowing for speedup in training. We have further shown that this process can provide a good inductive bias and regularizer, as on the CIFAR experiments with wide residual networks we improved upon dropout.

As for future work, better harnessing the power of conditional computation for efficiently training very large neural networks with learned sparsity patterns is a potential research direction. It would also be interesting to adopt a full Bayesian treatment over the parameters $\theta$, such as the one employed at Molchanov et al. (2017); Louizos et al. (2017). This would then allow for further speedup and compression due to the ability of automatically learning the bit precision of each weight. Finally, it would be interesting to explore the behavior of hard concrete r.v.s in binary latent variable models, since they can be used as a drop-in replacement that allows us to maintain both the discrete nature as well as the efficient reparametrization gradient optimization.

## Acknowledgements

We would like to thank Taco Cohen, Thomas Kipf, Patrick Forré, and Rianne van den Berg for feedback on an early draft of this paper.

## Appendix A Relation to variational inference

The objective function described in Eq. 3 is in fact a special case of a variational lower bound over the parameters of the network under a spike and slab (Mitchell & Beauchamp, 1988) prior. The spike and slab distribution is the gold standard in sparsity as far as Bayesian inference is concerned and it is defined as a mixture of a delta spike at zero and a continuous distribution over the real line (e.g. a standard normal):

$$p(z) = \mathrm{Bernoulli}(\pi), \qquad p(\theta|z = 0) = \delta(\theta), \qquad p(\theta|z = 1) = \mathcal{N}(\theta|0, 1). \tag{18}$$

Since the true posterior distribution over the parameters under this prior is intractable, we will use variational inference (Beal, 2003). Let $q(\theta, z)$ be a spike and slab approximate posterior over the parameters $\theta$ and gate variables $z$, where we assume that it factorizes over the dimensionality of the parameters $\theta$. It turns out that we can write the following variational free energy under the spike and slab prior and approximate posterior over a parameter vector $\theta$:

$$\begin{aligned}
\mathcal{F} &= -\mathbb{E}_{q(z)q(\theta|z)}\big[\log p(\mathcal{D}|\theta)\big] + \sum_{j=1}^{|\theta|}KL\big(q(z_j)\,\|\,p(z_j)\big) \\
&\quad + \sum_{j=1}^{|\theta|}\Big(q(z_j = 1)KL\big(q(\theta_j|z_j = 1)\,\|\,p(\theta_j|z_j = 1)\big) + q(z_j = 0)KL\big(q(\theta_j|z_j = 0)\,\|\,p(\theta_j|z_j = 0)\big)\Big)
\end{aligned} \tag{19}$$

$$\begin{aligned}
\phantom{\mathcal{F}} &= -\mathbb{E}_{q(z)q(\theta|z)}\big[\log p(\mathcal{D}|\theta)\big] + \sum_{j=1}^{|\theta|}KL\big(q(z_j)\,\|\,p(z_j)\big) \\
&\quad + \sum_{j=1}^{|\theta|}q(z_j = 1)KL\big(q(\theta_j|z_j = 1)\,\|\,p(\theta_j|z_j = 1)\big),
\end{aligned} \tag{20}$$

where the last step is due to $KL\big(q(\theta_j|z_j = 0)\,\|\,p(\theta_j|z_j = 0)\big) = 0$ (we can see that this is indeed the case by taking the limit of $\sigma \to 0$ of the KL divergence of two Gaussians that have the same mean and variance). The term that involves $KL\big(q(z_j)\,\|\,p(z_j)\big)$ corresponds to the KL-divergence from the Bernoulli prior $p(z_j)$ to the Bernoulli approximate posterior $q(z_j)$, and $KL\big(q(\theta_j|z_j = 1)\,\|\,p(\theta_j|z_j = 1)\big)$ can be interpreted as the “code cost”, or else the amount of information the parameter $\theta_j$ contains about the data $\mathcal{D}$, measured by the KL-divergence from the prior $p(\theta_j|z_j = 1)$.

Now consider making the assumption that we are optimizing, rather than integrating, over $\theta$ and further assuming that $KL\big(q(\theta_j|z_j = 1)\,\|\,p(\theta_j|z_j = 1)\big) = \lambda$. We can justify this assumption from an empirical Bayesian procedure: there is a hypothetical prior for each parameter $p(\theta_j|z_j = 1)$ that adapts to $q(\theta_j|z_j = 1)$ in a way that results into needing, approximately, $\lambda$ nats to transform $p(\theta_j|z_j = 1)$ to that particular $q(\theta_j|z_j = 1)$. Those $\lambda$ nats are thus the amount of information the $q(\theta_j|z_j = 1)$ can encode about the data had we used that $p(\theta_j|z_j = 1)$ as the prior. Notice that under this view we can consider $\lambda$ as the amount of flexibility of that hypothetical prior; with $\lambda = 0$ we have a prior that is flexible enough to represent exactly $q(\theta_j|z_j = 1)$, thus resulting into no code cost and possible overfitting. Under this assumption the variational free energy can be re-written as:

$$\mathcal{F} = -\mathbb{E}_{q(z)}\big[\log p(\mathcal{D}|\tilde{\theta}\odot z)\big] + \sum_{j=1}^{|\theta|}KL\big(q(z_j)\,\|\,p(z_j)\big) + \lambda\sum_{j=1}^{|\theta|}q(z_j = 1) \tag{21}$$

$$\phantom{\mathcal{F}} \geq -\mathbb{E}_{q(z)}\big[\log p(\mathcal{D}|\tilde{\theta}\odot z)\big] + \lambda\sum_{j=1}^{|\theta|}\pi_j, \tag{22}$$

where $\tilde{\theta}$ corresponds to the optimized $\theta$ and the last step is due to the positivity of the KL-divergence. Now by taking the negative log-probability of the data to be equal to the loss $\mathcal{L}(\cdot)$ of Eq. 1 we see that Eq. 22 is the same as Eq. 3. Note that in case we are interested in the uncertainty of the gates $z$, we should optimize Eq. 21, rather than Eq. 22, as this will properly penalize the entropy of $q(z)$. Furthermore, Eq. 21 also allows for the incorporation of prior information about the behavior of the gates (e.g. gates being active 10% of the time, on average). We have thus shown that the expected $L_0$ minimization procedure is in fact a close surrogate to a variational bound involving a spike and slab distribution over the parameters and a fixed coding cost for the parameters when the gates are active.
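The step from Eq. 21 to Eq. 22 only drops the Bernoulli KL terms, which are non-negative. The sketch below compares the regularization parts of the two equations (the expected log-likelihood term is identical in both and is therefore omitted); the function names, gate probabilities, prior probability and $\lambda$ are hypothetical.

```python
import math


def bernoulli_kl(q, p):
    """KL(Bern(q) || Bern(p)) in nats, with the convention 0 * log 0 = 0."""
    kl = 0.0
    if q > 0.0:
        kl += q * math.log(q / p)
    if q < 1.0:
        kl += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return kl


def regularizer_eq21(pis, prior_pi, lam):
    """Gate KL terms plus lambda times the expected number of active gates."""
    return sum(bernoulli_kl(pi, prior_pi) for pi in pis) + lam * sum(pis)


def regularizer_eq22(pis, lam):
    """The same quantity with the non-negative KL terms dropped."""
    return lam * sum(pis)


pis = [0.9, 0.5, 0.05]                  # hypothetical posterior gate probabilities
prior_pi, lam = 0.1, 0.3                # hypothetical prior and code cost
gap = regularizer_eq21(pis, prior_pi, lam) - regularizer_eq22(pis, lam)
```

The gap is exactly the sum of the gate KL terms; it vanishes only when every posterior gate probability matches the prior, which is the sense in which Eq. 21 additionally encodes prior information about gate behavior.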

## Appendix B The hard concrete distribution

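As mentioned in the main text, the hard concrete is a straightforward modification of the binary concrete (Maddison et al., 2016; Jang et al., 2016); let $q_s(s|\phi)$ be the probability density function (pdf) and $Q_s(s|\phi)$ the cumulative distribution function (CDF) of a binary concrete random variable $s$:

$$q_s(s|\phi) = \frac{\beta\alpha s^{-\beta-1}(1 - s)^{-\beta-1}}{\big(\alpha s^{-\beta} + (1 - s)^{-\beta}\big)^2}, \tag{23}$$

$$Q_s(s|\phi) = \mathrm{Sigmoid}\big((\log s - \log(1 - s))\beta - \log\alpha\big). \tag{24}$$

Now by stretching this distribution to the $(\gamma, \zeta)$ interval, with $\gamma < 0$ and $\zeta > 1$, we obtain $\bar{s} = s(\zeta - \gamma) + \gamma$ with the following pdf and CDF:

$$q_{\bar{s}}(\bar{s}|\phi) = \frac{1}{|\zeta - \gamma|}\,q_s\!\left(\frac{\bar{s} - \gamma}{\zeta - \gamma}\,\Big|\,\phi\right), \qquad Q_{\bar{s}}(\bar{s}|\phi) = Q_s\!\left(\frac{\bar{s} - \gamma}{\zeta - \gamma}\,\Big|\,\phi\right), \tag{25}$$

and by further rectifying $\bar{s}$ with the hard-sigmoid, $z = \min(1, \max(0, \bar{s}))$, we obtain the following distribution over $z$:

$$q(z|\phi) = Q_{\bar{s}}(0|\phi)\,\delta(z) + \big(1 - Q_{\bar{s}}(1|\phi)\big)\,\delta(z - 1) + \big(Q_{\bar{s}}(1|\phi) - Q_{\bar{s}}(0|\phi)\big)\,q_{\bar{s}}\big(z|\bar{s} \in (0, 1), \phi\big), \tag{26}$$

which is composed of a delta peak at zero with probability $Q_{\bar{s}}(0|\phi)$, a delta peak at one with probability $1 - Q_{\bar{s}}(1|\phi)$, and a truncated version of $q_{\bar{s}}(\bar{s}|\phi)$ in the $(0, 1)$ range.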
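The three mixture weights of Eq. 26 follow directly from the CDFs in Eq. 24 and Eq. 25. A minimal sketch, with hypothetical function names and illustrative parameter values:

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def concrete_cdf(s, log_alpha, beta):
    """CDF of the binary concrete, Eq. 24, for s in (0, 1)."""
    return sigmoid((math.log(s) - math.log(1.0 - s)) * beta - log_alpha)


def stretched_cdf(s_bar, log_alpha, beta, gamma, zeta):
    """CDF of the stretched variable, Eq. 25, for s_bar in (gamma, zeta)."""
    return concrete_cdf((s_bar - gamma) / (zeta - gamma), log_alpha, beta)


def mixture_weights(log_alpha, beta, gamma, zeta):
    """Weights of the three components of q(z) in Eq. 26."""
    q0 = stretched_cdf(0.0, log_alpha, beta, gamma, zeta)
    q1 = stretched_cdf(1.0, log_alpha, beta, gamma, zeta)
    return q0, 1.0 - q1, q1 - q0       # P(z = 0), P(z = 1), P(0 < z < 1)


p_zero, p_one, p_between = mixture_weights(
    log_alpha=1.0, beta=2.0 / 3.0, gamma=-0.1, zeta=1.1)
```

The weights sum to one by construction, and a positive location $\log\alpha$ shifts mass from the peak at zero towards the peak at one.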

## Appendix C Negative KL-divergence for hard concrete distributions

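In case that Eq. 21 is to be optimized with a hard concrete $q(z)$, then we have to compute the KL-divergence from a prior $p(z)$ to $q(z)$. It is necessary for the prior $p(z)$ to have the same support as $q(z)$ in order for the KL-divergence to be valid; as a result we can let the prior $p(z)$ similarly be a hard-sigmoid transformation of an arbitrary continuous distribution $p(\bar{s})$ with CDF $P_{\bar{s}}(\bar{s})$:

$$p(z) = P_{\bar{s}}(0)\,\delta(z) + \big(1 - P_{\bar{s}}(1)\big)\,\delta(z - 1) + \big(P_{\bar{s}}(1) - P_{\bar{s}}(0)\big)\,p_{\bar{s}}\big(z|\bar{s} \in (0, 1)\big). \tag{27}$$

Since both $q(z)$ and $p(z)$ are mixtures with the same number of components we can use the chain rule of relative entropy (Cover & Thomas, 2012; Hershey & Olsen, 2007) in order to compute the KL-divergence:

$$\begin{aligned}
KL\big(q(z)\,\|\,p(z)\big) &= Q_{\bar{s}}(0)\log\frac{Q_{\bar{s}}(0)}{P_{\bar{s}}(0)} + \big(1 - Q_{\bar{s}}(1)\big)\log\frac{1 - Q_{\bar{s}}(1)}{1 - P_{\bar{s}}(1)} \\
&\quad + \big(Q_{\bar{s}}(1) - Q_{\bar{s}}(0)\big)\,\mathbb{E}_{q_{\bar{s}}(z|\bar{s} \in (0, 1))}\big[\log q_{\bar{s}}(z) - \log p_{\bar{s}}(z)\big],
\end{aligned} \tag{28}$$

where $\bar{s}$ corresponds to the pre-rectified variable. Notice that in case the integral under the truncated distribution $q_{\bar{s}}(z|\bar{s} \in (0, 1))$ is not available in closed form we can still obtain a Monte Carlo estimate by sampling the truncated distribution, on e.g. a $(\gamma, \zeta)$ interval, via the inverse transform method:

$$u \sim \mathcal{U}(0, 1), \qquad z = Q_{\bar{s}}^{-1}\big(Q_{\bar{s}}(\gamma) + u\,(Q_{\bar{s}}(\zeta) - Q_{\bar{s}}(\gamma))\big), \tag{29}$$

where $Q_{\bar{s}}^{-1}(\cdot)$ corresponds to the quantile function and $Q_{\bar{s}}(\cdot)$ to the CDF of the random variable $\bar{s}$. Furthermore, it should be mentioned that $KL\big(q(\bar{s})\,\|\,p(\bar{s})\big) \neq KL\big(q(z)\,\|\,p(z)\big)$, since the rectifications are not invertible transformations.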
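The inverse transform step of Eq. 29 can be sketched for the stretched binary concrete, whose quantile function follows from solving Eq. 24 for $s$. The sketch below truncates to a generic interval `(lo, hi)` and uses $(0, 1)$ in the example; the function names and parameter values are assumptions made for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    return math.log(p) - math.log(1.0 - p)


def stretched_cdf(s_bar, log_alpha, beta, gamma, zeta):
    """CDF of the stretched binary concrete (Eq. 24 composed with Eq. 25)."""
    s = (s_bar - gamma) / (zeta - gamma)
    return sigmoid(logit(s) * beta - log_alpha)


def stretched_quantile(p, log_alpha, beta, gamma, zeta):
    """Inverse of stretched_cdf, obtained by solving Eq. 24 for s."""
    s = sigmoid((logit(p) + log_alpha) / beta)
    return s * (zeta - gamma) + gamma


def sample_truncated(lo, hi, log_alpha, beta, gamma, zeta, u):
    """Inverse transform sample of s_bar restricted to the interval (lo, hi)."""
    c_lo = stretched_cdf(lo, log_alpha, beta, gamma, zeta)
    c_hi = stretched_cdf(hi, log_alpha, beta, gamma, zeta)
    return stretched_quantile(c_lo + u * (c_hi - c_lo), log_alpha, beta, gamma, zeta)


rng = random.Random(0)
params = dict(log_alpha=0.5, beta=2.0 / 3.0, gamma=-0.1, zeta=1.1)  # illustrative
samples = [sample_truncated(0.0, 1.0, u=rng.random(), **params)
           for _ in range(1000)]
```

Averaging $\log q_{\bar{s}}(z) - \log p_{\bar{s}}(z)$ over such samples would give a Monte Carlo estimate of the expectation in Eq. 28 when no closed form is available.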