1 Introduction
Modern deep neural networks (DNNs) have achieved great success in computer vision and related areas. Meanwhile, their model size and computation cost are growing rapidly, which makes them difficult to deploy on diverse hardware platforms. Recent works study how to develop a network with flexible size at test time [kim2018nestednet, yu2018slimmable, yu2019universally, cai2019once, ijcai2020288, xu2020one], to reduce the cost of designing [tan2019efficientnet], training [kingma2014adam], compressing [han2015deep] and deploying [ren2019admm] a DNN on various platforms. As these networks are often composed of a nested set of smaller subnetworks, we refer to them as nested nets in this paper. As many problems are safety-critical, such as object recognition [gal2016dropout, guo2017calibration], medical image segmentation [kohl2018probabilistic, kendall2015bayesian] and crowd counting [oh2019crowd, neurips20counting], the adopted DNNs are required to provide well-calibrated uncertainty in addition to high prediction performance, as erroneous predictions could result in disastrous consequences. However, the measure of uncertainty was not considered in previous designs of nested nets, which leads to over- or under-confident predictions.
One basis for creating nested nets is to order the network components (e.g., convolution channels) such that less important components can be removed first when creating the subnetwork. A unit for neural networks, nested dropout, was developed to order the latent feature representation for encoder-decoder models [rippel2014learning, bekasov2020ordering]. Specifically, a discrete distribution is assigned over the indices of the representations, and the nested dropout operation samples an index and then drops the representations with larger indices. Recent studies show that nested dropout is able to order the network components during training such that nested nets can be obtained [ijcai2020288, finn2014learning]. The ordering layout is applicable to different granularity levels of network components: single weights, groups of weights, convolutional channels, residual blocks, network layers, and even quantization bits. We refer to the partitions of the network components as nodes in this paper. However, the probability that an index is sampled is specified by hand as a hyperparameter, and does not change during training. Thus, the importance of a node is predetermined by hand rather than learned from data.
To enhance predictive uncertainty and to allow the dropout rate to be learned, in this paper, we propose a fully Bayesian treatment for constructing nested nets. We propose a new nested dropout, based on a chain of interdependent Bernoulli variables. The chain simulates the Bernoulli trials and can be understood as a special case of a two-state Markov chain, which intuitively generates order information. To save the time cost of sampling during training, we propose a variational ordering unit that approximates the chain, and an approximate posterior based on a novel Downhill distribution built on Gumbel Softmax [jang2016categorical, maddison2016concrete]. This allows efficient sampling of the multivariate ordered mask, and provides useful gradients to update the importance of the nodes.
Based on the proposed ordering units, a Bayesian nested neural network is constructed, where the independent distributions of nodes are interconnected with the ordering units. A mixture model prior is placed over each node, while the model selection is determined by the ordering units (Fig. 1). A variational inference problem is formulated and resolved, and we propose several methods to simplify the sampling and calculation of the regularization term. Experiments show that our model outperforms the deterministic nested models in any subnetwork, in terms of classification accuracy, calibration and out-of-domain detection. It also outperforms other uncertainty calibration methods on uncertainty-critical vision tasks, e.g., probabilistic UNet on medical segmentation with noisy labels.
In summary, the contributions of this paper are:

We propose a variational nested dropout unit with a novel pair of prior and posterior distributions.

We propose a novel Bayesian nested neural network that can generate large sets of uncertainty-calibrated subnetworks. The formulation can be viewed as a generalization of ordered $L_0$ regularization over the subnetworks.

To our knowledge, this is the first work that considers uncertainty calibration and learns the importance of network components in nested neural networks.
2 Variational Nested Dropout
We first review nested dropout, and then propose our Bayesian ordering unit and variational approximation.
2.1 A review of nested dropout
Previous works that order the representations [rippel2014learning] use either Geometric or Categorical distributions to sample the last index of the kept units, then drop the neurons with indices greater than it. Specifically, a distribution is assigned over the representation indices. The nested/ordered dropout operation proceeds as follows:
Tail sampling: A tail index is sampled that represents the last element to be kept.

Ordered dropping: The elements with indices greater than the tail index are dropped.
We also refer to this operation as an ordering unit as the representations are sorted in order.
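The two steps above can be sketched as follows. This is a minimal illustration rather than the implementation used in the cited works: the function name `nested_dropout_mask`, the success probability `p`, and the truncation of the Geometric sample to the number of units are all assumptions made for the example.

```python
import numpy as np

def nested_dropout_mask(num_units, p, rng):
    """Sample a tail index and build the corresponding ordered mask.

    Assumption: the tail index follows a Geometric(p) distribution that is
    truncated to num_units, so at least one unit is always kept.
    """
    # Tail sampling: rng.geometric returns the trial number of the first success (>= 1).
    tail = min(int(rng.geometric(p)), num_units)
    # Ordered dropping: keep units 1..tail, drop everything after the tail.
    mask = np.zeros(num_units)
    mask[:tail] = 1.0
    return tail, mask

rng = np.random.default_rng(0)
tail, mask = nested_dropout_mask(num_units=8, p=0.3, rng=rng)
features = np.arange(1.0, 9.0)
ordered_features = features * mask  # units after the tail index are zeroed
```

Because the mask is selected by a hard index, nothing in this sketch provides a gradient with respect to `p`, which is the limitation discussed in the remainder of this section.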
In [rippel2014learning], which focuses on learning ordered representations, this operation is proved to exactly recover PCA with a one-layer neural network. Cui et al. [ijcai2020288] show that this operation, when applied to groups of neural network weights or quantization bits, generates nested subnetworks that are optimal for different computation resources. They further prove that increasing from a smaller subnetwork to a larger one maximizes the incremental information gain. A large network only needs to be trained once with ordered dropout, yielding a set of networks with varying sizes for deployment. However, the above methods treat the ordered dropout rate as a hyperparameter, and hand-tuning the dropout rate is tedious and may lead to suboptimal performance of the subnetworks, as compared to learning this hyperparameter from the data. As illustrated in Fig. 2, the previous works use hand-specified parameters for nested dropout, which freezes the importance of the network components over different layers during training.
A common practice for regular Bernoulli dropout is to treat the dropout rate as a variational parameter in Bayesian neural networks [Gal2016Uncertainty]. To find the optimal dropout rate, grid search was first adopted [gal2016dropout], whose complexity grows exponentially with the number of dropout units. To alleviate the cost of searching, a continuous relaxation of the discrete dropout was proposed, by which the dropout rate can be optimized directly [gal2017concrete], improving accuracy and uncertainty while keeping a low training time. However, for nested dropout, two aspects are unclear: 1) how to take a full Bayesian treatment with nested dropout units; 2) how the relaxation can be done for these units, or how the gradients can be backpropagated to the parameters of the nested dropout distribution.
2.2 Bayesian Ordering Unit
The conventional nested dropout uses a Geometric distribution to sample the tail index. By definition, the Geometric distribution models the probability that a given trial is the first “success” in a sequence of independent Bernoulli trials. In the context of slimming neural networks, a “failure” of a Bernoulli trial indicates that the corresponding node is kept, while a “success” marks the tail index, where this node is kept and all subsequent nodes are dropped. Thus, the parameter of the Geometric distribution is the conditional probability of a node being the tail index, given that the previous node is kept.
Sampling from the Geometric distribution only generates the tail index of the nodes to be kept. A hard selection operation of ordered dropping is required to drop the following nodes. The ordered dropping can be implemented using a set of ordered mask vectors, where each mask consists of ones followed by zeros. Given the sampled tail index, the appropriate mask is selected and applied to the nodes (e.g., by multiplying the weights). However, as the masking is a non-differentiable transformation and does not provide a well-defined probability distribution, the nested dropout parameters cannot be learned using this formulation.
To find a more natural prior for the nodes, we propose to use a chain of Bernoulli variables to directly model the distribution of the ordered masks . Let the set of binary variables represent the random ordered mask. Specifically, we model the conditional distributions with Bernoulli variables,
(1)
where is the conditional probability of keeping the node given the previous node is kept, and (the first node is always kept). Note that we also allow different probabilities for each . The marginal distribution of is
(2) 
A property of this chain is that if a zero occurs at some position, all remaining elements with larger indices also become zero. That is, sampling from this chain generates an ordered mask, which can be directly multiplied on the nodes to realize ordered dropping. Another benefit is that applying a continuous relaxation [gal2017concrete] of the Bernoulli variables in the chain allows its parameters to be optimized.
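As an illustration of how the chain produces an ordered mask, the following sketch samples the variables one at a time. The names `sample_bernoulli_chain` and `keep_probs` are hypothetical, and the first keep probability is assumed to be 1 so that the first node is always kept.

```python
import numpy as np

def sample_bernoulli_chain(keep_probs, rng):
    """Sequentially sample an ordered mask from a chain of Bernoulli variables.

    keep_probs[i] is the probability of keeping node i given that node i-1
    is kept. Assumption: keep_probs[0] == 1.0 (first node always kept).
    """
    mask = np.zeros(len(keep_probs))
    for i, prob in enumerate(keep_probs):
        # rng.random() lies in [0, 1), so prob == 1.0 always keeps the node.
        if rng.random() >= prob:
            break  # first zero: every later node stays zero
        mask[i] = 1.0
    return mask

rng = np.random.default_rng(0)
keep_probs = np.array([1.0, 0.9, 0.8, 0.5, 0.3])
mask = sample_bernoulli_chain(keep_probs, rng)
```

The explicit loop makes the sequential cost visible: each draw depends on the outcome of the previous one, so the samples cannot be generated in a single parallel step.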
However, sampling from the chain requires stepping through each element sequentially, which has complexity linear in the number of nodes, and is thus not scalable in modern DNNs where the number of nodes is large. Thus we apply the variational inference framework, treating the chain as the prior of the ordered mask in our Bayesian treatment. One challenge is to find a tractable variational distribution that approximates the true posterior and is easy to compute. Another challenge is to define a variational distribution that allows efficient reparameterization, so that the gradient with respect to its parameters can be estimated with low variance.
2.3 Variational Ordering Unit
We propose a novel Downhill distribution based on Gumbel Softmax distribution [jang2016categorical, maddison2016concrete] that generates the ordered mask .
Definition 1
Downhill Random Variables (r.v.). Let the temperature parameter
. An r.v. has a Downhill distribution , if its density is:(3)  
where are the probabilities for each dimension.
Two important properties of Downhill distributions are:

Property 1. If , then , where is a dimensional vector of ones, and . . is a standard uniform variable. (Footnote: For Gumbel-softmax sampling, we first draw from , then calculate . The samples of can be obtained by first drawing then computing .)

Property 2. When , sampling from the Downhill distribution reduces to discrete sampling, where the sample space is the set of ordered mask vectors . The approximation of the Downhill distribution to the Bernoulli chain can be calculated in closed form.
Property 1 shows the sampling process of the Downhill distribution. We visualize the Downhill samples in Fig. 3. As each multivariate sample has the shape of a long descent from left to right, we name it the Downhill distribution. The temperature variable controls the sharpness of the downhill or the smoothness of the step at the tail index. When the temperature is large, the slope is gentle, in which case no nodes are dropped, but the less important nodes are multiplied by a factor less than 1. When the temperature approaches zero, the shape of the sample becomes a cliff, which is similar to the prior on ordered masks, where the less important nodes are dropped (i.e., multiplied by 0). Property 1 further implies the gradient can be estimated with low variance, for a cost function . Because the samples of are replaced by a differentiable function , then , where represents the whole transformation process in Prop. 1.
Recall that our objective is to approximate the chain of Bernoulli variables with . Property 2 shows why the proposed distribution is consistent with the chain of Bernoullis in essence, and provides an easy way to derive the evidence lower bound for variational inference. The proof for the two properties is in Appx. 1.1. This simple transformation of Gumbel softmax samples allows fast sampling of an ordered unit. Compared with , the complexity decreases from to , as the sequential sampling of the Bernoulli chain is no longer necessary.
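The sampling process can be sketched as below. Since the formula in Property 1 is not legible here, the sketch assumes that the ordered mask is obtained as one minus the exclusive cumulative sum of a Gumbel-softmax sample, which is consistent with the one-hot argument in the proof of Property 2. The names `sample_downhill`, `log_probs` and `tau` are illustrative.

```python
import numpy as np

def sample_downhill(log_probs, tau, rng):
    """Draw one relaxed ordered mask.

    Assumed transformation: z_i = 1 - sum_{j < i} c_j, where c is a
    Gumbel-softmax sample with temperature tau.
    """
    # Standard Gumbel noise from uniform samples.
    u = rng.uniform(low=1e-12, high=1.0, size=log_probs.shape)
    gumbel = -np.log(-np.log(u))
    # Gumbel-softmax sample on the simplex (numerically stabilized softmax).
    logits = (log_probs + gumbel) / tau
    logits = logits - logits.max()
    c = np.exp(logits)
    c = c / c.sum()
    # Exclusive cumulative sum: the first entry of z is always 1.
    return 1.0 - (np.cumsum(c) - c)

rng = np.random.default_rng(0)
log_probs = np.log(np.array([0.1, 0.2, 0.3, 0.4]))
z_smooth = sample_downhill(log_probs, tau=5.0, rng=rng)   # gentle slope
z_sharp = sample_downhill(log_probs, tau=0.05, rng=rng)   # close to a cliff
```

All entries of a sample are produced by vectorized operations, so no sequential pass over the nodes is needed, and every step is differentiable with respect to `log_probs`.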
3 Bayesian Nested Neural Network
In this section, we present the Bayesian nested neural network based on the fundamental units proposed in Sec. 2.
3.1 Bayesian Inference and SGVB
Consider a dataset constructed from pairs of instances . Our objective is to estimate the parameters of a neural network that predicts given input and parameters . In Bayesian learning, a prior is placed over the parameters . After data is observed, the prior distribution is transformed into a posterior distribution .
For neural networks, computing the posterior distribution using the Bayes rule requires computing intractable integrals over . Thus, approximation techniques are required. One family of techniques is variational inference, with which the posterior is approximated by a parametric distribution , where are the variational parameters. is approximated by minimizing the Kullback-Leibler (KL) divergence with the true posterior, , which is equivalent to maximizing the evidence lower bound (ELBO):
(4) 
where the expected data log-likelihood is
(5) 
The integration is not tractable for neural networks. An efficient method for gradient-based optimization of the variational bound is stochastic gradient variational Bayes (SGVB) [kingma2013auto, kingma2015variational]. SGVB parameterizes the random parameters as where is a differentiable function and is a noise variable with fixed parameters. With this parameterization, an unbiased differentiable minibatch-based Monte Carlo estimator of the expected data log-likelihood is obtained:
(6) 
where is a minibatch of data with random instances , and .
3.2 Bayesian Nested Neural Network
In our model, the parameter consists of two parts: weight matrix and ordering units . The ordering units order the network weights and generate submodels that minimize the residual loss of a larger submodel [rippel2014learning, ijcai2020288]. We define the corresponding variational parameters , where and are the variational parameters for the weights and ordering units respectively. We then have the following optimization objective,
(7)  
(8) 
where and are the random noise, and and are the differentiable functions that transform the noises to the probabilistic weights and ordered masks.
Next, we focus on an example of a fully-connected (FC) layer. Assume the FC layer in neural network takes in activations as the input, and outputs , where the weight matrix , and are the input and output size, and is the batch size. The elements are indexed as , and respectively. We omit the bias for simplicity, and our formulation can easily be extended to include the bias term. We have the ordering unit with each element applied on the column of , by which the columns of are given different levels of importance. Note that is flexible, and can be applied row-wise or element-wise as well.
The prior for assumes each weight is independent, , where and . We choose to place a mixture of two univariate variables as the prior over each element of the weight matrix . For example, if we use the univariate normal distribution, then each is a Gaussian mixture, where the 2 components are:
(9)
(10)
where and are the means and standard deviations for the two components. We fix and to be a small value, resulting in a spike at zero for the component when . The variable follows the chain of Bernoulli distributions proposed in (32). Using (2), the marginal distribution of is then
To calculate the expected data log-likelihood, our Downhill distribution allows efficient sampling and differentiable transformation for the ordering units (Sec. 2.3). The reparameterization of weight distributions has been widely studied [kingma2015variational, louizos2017multiplicative, kingma2013auto] to provide gradient estimates with low variance. Our framework is compatible with these techniques, which will be discussed in Sec. 3.4. An overview of sampling is shown in Fig. 1.
3.3 Posterior Approximation
Next, we introduce the computation of the KL divergence. We assume the posterior takes the same form as the prior, while takes the distribution . We consider the case that for simplicity, while can be adjusted in the training process as annealing. For this layer, the KL divergence in (7) is
(11)  
Term of (11) is
where is the set of ordered masks. The number of components in the space is reduced from to , because there are only possible ordered masks. By definition, the probabilities are
(12) 
where we define . The derivation of is included in the Appx. 1.2.
Define as the th column of , and where . The term of (11) is
(13) 
Note that the term inside the integration over is the KL divergence between the univariate conditional density in the prior and the posterior, with or . Define as the KL of for component . The term can then be reorganized as
(14) 
In total there are terms, which causes a large computation cost in every epoch. Consider the matrices and , which are easily computed by applying the KL function element-wise. The term is then expressed as
(15)
where is a vector of 1s, is a matrix of 1s and is a lower triangular matrix whose lower-triangular entries are all 1. The calculation in (15) can then be easily parallelized with modern computation libraries.
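The role of the lower triangular matrix can be illustrated with a small sketch. It does not reproduce (15) exactly; it assumes that, for the ordered mask with j leading ones, the weight KL equals the sum of the slab-component KL over the first j columns plus the spike-component KL over the remaining columns, weighted by the posterior probability of that mask. The names `expected_kl_ordered`, `kl_slab`, `kl_spike` and `q_masks` are hypothetical.

```python
import numpy as np

def expected_kl_ordered(kl_slab, kl_spike, q_masks):
    """Expected weight KL over all ordered masks, computed without a loop.

    kl_slab, kl_spike: (rows, D) element-wise KL values for the kept and
    dropped components. q_masks[j] is the probability of the mask that
    keeps columns 0..j (assumed layout for this example).
    """
    d = kl_slab.shape[1]
    slab_cols = kl_slab.sum(axis=0)      # per-column KL if the column is kept
    spike_cols = kl_spike.sum(axis=0)    # per-column KL if the column is dropped
    lower = np.tril(np.ones((d, d)))     # row j selects the kept columns 0..j
    per_mask = lower @ slab_cols + (1.0 - lower) @ spike_cols
    return float(q_masks @ per_mask)

def expected_kl_loop(kl_slab, kl_spike, q_masks):
    """Reference implementation that enumerates the ordered masks one by one."""
    total = 0.0
    for j in range(kl_slab.shape[1]):
        total += q_masks[j] * (kl_slab[:, : j + 1].sum() + kl_spike[:, j + 1 :].sum())
    return float(total)

rng = np.random.default_rng(0)
kl_slab = rng.random((3, 4))
kl_spike = rng.random((3, 4))
q_masks = np.array([0.1, 0.2, 0.3, 0.4])
fast = expected_kl_ordered(kl_slab, kl_spike, q_masks)
slow = expected_kl_loop(kl_slab, kl_spike, q_masks)
```

Both functions return the same value; the matrix form replaces the enumeration over masks with two matrix-vector products, which is what makes the term cheap to evaluate on parallel hardware.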
Ordered $L_0$ Regularization. We show that, given spike-and-slab priors, our KL term in (11) has a nice interpretation as a generalization of an ordered $L_0$ regularization over the subnetworks. The corresponding reduced objective for deterministic networks is
(16) 
Note that larger subnetworks have greater penalization. The proof and more interpretations are in the Appx. 1.3.
3.4 Implementation
For efficient sampling of the weight distributions, we put multiplicative Gaussian noise on the weight , similar to [kingma2015variational, molchanov2017variational, louizos2017bayesian]. We take for as an example.
(17)  
(18) 
We also assume a loguniform prior [kingma2015variational, molchanov2017variational, louizos2017bayesian]. (10) becomes . With this prior, the negative KL term in (13) does not depend on the variational parameter [kingma2015variational], when the parameter is fixed,
(19) 
where is a constant.
As the second term in (19) cannot be computed analytically and should be estimated by sampling, Kingma et al. [kingma2015variational] propose to sample first and design a function to approximate it, but their approximation of does not encourage as the optimization is difficult. An corresponds to a small variance, which is not flexible. Molchanov et al. [molchanov2017variational] use a different parameterization that pushes , which means this can be discarded, as illustrated in Fig. 4. In our model, we want the order or sparsity of weights to be explicitly controlled by the ordering unit , otherwise the network would collapse to a single model rather than generate a nested set of submodels. Thus, we propose another approximation to (19),
(20)  
where , , and . We obtained these parameters by sampling from to estimate (19) as the groundtruth and fit these parameters for epochs. For fitting the curves, the input range is limited to . As shown in Fig. 4, our parameterization allows and maximizing does not push to infinity (c.f. [kingma2015variational] and [molchanov2017variational]), providing more flexible choices for the weight variance.
As the prior of the zerocomponent is assumed a spike at zero with a small constant variance, we let be the same spike as (9) to save computation. Also, to speed up the sampling process in Fig. 1, we directly multiply the sampled mask with the output features of the layer. This saves the cost for sampling from and simplifies (15) to . Using the notation in Sec. 3.2, the output of a fully connected layer is
(21)  
(22) 
The sampling process is similar to that of [kingma2015variational, molchanov2017variational, louizos2017bayesian].
The Bayesian nested neural network can be easily extended to convolutional layers with the ordering applied to filter channels (see Appx. 2.1 for details).
4 Related work
In this section, we review deep nets with $L_0$ regularization and nested nets, while the comparisons with Bayesian neural networks are elaborated in Sec. 3.4.
$L_0$ regularization. The Bernoulli-Gaussian linear model with independent Bernoulli variables is shown to be equivalent to $L_0$ regularization [murphy2012machine]. Recent works [louizos2018learning, yang2019deephoyer] investigate the $L_0$ norm for regularizing deep neural networks. [louizos2018learning] presents a general formulation of an $L_0$ regularized learning objective for a single deterministic neural network,
(23) 
where the variable is a binary gate with parameter for each network node , and is the loss. It was shown that $L_0$ regularization over the weights is a special case of an ELBO over parameters with spike-and-slab priors. These works present a uniform $L_0$ regularization, as the coefficient is a constant over the weights. Interestingly, our ELBO (7) can be viewed as a generalization of a new training objective of deterministic networks, which includes a weighted penalization over the choices of subnetworks, interpretable as an ordered $L_0$ regularization (16).
Nested neural networks. Nested nets have been explored in recent years for their portability in neural network (NN) deployment on different platforms. [kim2018nestednet] proposes a network-in-network structure for a nested net, which consists of internal networks from the core level to the full level. [yu2018slimmable, yu2019universally] propose slimmable NN, which trains a network that samples multiple subnetworks of different channel numbers (widths) simultaneously, where the weights are shared among different widths. The network needs to switch between different batch normalization parameters that correspond to different widths. To alleviate the interference in optimizing channels in slimmable NN, [cai2019once] proposes a once-for-all network that is elastic in kernel size, network depth and width, by shrinking the network progressively during training. [ijcai2020288] proposes using nested dropout to train a fully nested neural network, which generates more subnetworks in nodes, including weights, channels, paths, and layers. However, none of the previous works consider learned importance over the nodes and the predictive uncertainty. Our work provides well-calibrated uncertainty and learned importance, with a full Bayesian treatment of nested nets.
5 Experiments
We next present experiments using our Bayesian nested neural network. The experiments include two main tasks: image classification and semantic segmentation.
5.1 Image Classification
Datasets and setup.
The image classification experiments are conducted on Cifar10 and Tiny ImageNet (see Appx. 3.4 for results on Cifar100). The tested NN models are VGG11 with batch normalization layers [simonyan2014very], the ResNeXt-Cifar model from [xie2017aggregated], and MobileNetV2 [sandler2018mobilenetv2].
To train the proposed Bayesian nested neural network (denoted as ), we use the cross-entropy loss for the expected log-likelihood in (8). The computation of the KL term follows Sec. 3.4. For ordering the nodes, in every layer, we assign each dimension of the prior (Bernoulli chain) and posterior (Downhill variable) of the ordering unit to a group of weights. Thus, the layer width is controlled by the ordering unit. We set the number of groups to 32 for VGG11 and ResNeXt-Cifar, and to 16 for MobileNetV2. We compare our with the fully nested neural network () [ijcai2020288], since it can be seen as an extension of slimmable NN [yu2018slimmable, yu2019universally] to fine-grained nodes. We also compare with the Bayesian NN with variational Gaussian dropout [kingma2015variational], where we train a set of independent Bayesian NNs (IBNN) for different fixed widths. Conceptually, the performance of IBNN, which trains separate subnetworks, is the ideal target for , which uses nested subnetworks.
At test time, we generate fixed width masks for and as in [ijcai2020288]. For fairness, we do not perform a local search for the optimal width as in , but directly truncate the widths. We rescale the node output by the probability that a node is kept (see Appx. 2.2). The batch normalization statistics are then recollected for 1 epoch with the training data (using fewer data is also feasible as shown in Appx. 3.2). The number of samples used in testing and IBNN is 6. The detailed hyperparameter settings for training and testing are in Appx. 3.1.
Evaluation metrics. For the evaluation, we test accuracy, uncertainty calibration, and out-of-domain (OOD) detection. Calibration performance is measured with the expected calibration error [guo2017calibration] (ECE), which is the expected difference between the average confidence and accuracy. OOD performance is measured with the area under the precision-recall curve (AUPR) [boyd2013area, hendrycks2016baseline, lakshminarayanan2017simple] (see Appx. 3.3 for AUROC curves). If we take the OOD class as positive, precision is the fraction of detected OOD data that are true OOD, while recall is the fraction of true OOD data that are successfully detected. Note that a better model will have higher accuracy and OOD AUPR, and lower calibration ECE. As the sampling and collection of batch-norm statistics are stochastic, we repeat each trial 3 times and report the average results.
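For concreteness, the calibration metric can be sketched as follows. The binning scheme is an assumption of this example (15 equal-width confidence bins, as is common following [guo2017calibration]); the settings used in our experiments are given in the appendix, and the function name `expected_calibration_error` is illustrative.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Weighted average of |accuracy - confidence| over confidence bins.

    confidences: predicted probability of the chosen class, in (0, 1].
    correct: 1 if the prediction is right, 0 otherwise.
    Assumption: equal-width bins of the form (lo, hi].
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the fraction of samples in the bin
    return float(ece)

# Ten predictions made with confidence 0.9, of which only half are correct.
overconfident = expected_calibration_error([0.9] * 10, [1, 0] * 5)
```

In this toy case all predictions fall into one bin whose accuracy is 0.5 and whose mean confidence is 0.9, so the error is approximately 0.4.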
Results. The results are presented in Figs. 5 and 6. First, looking at performance versus width, exhibits the well-behaved property of subnetworks, where the performance increases (accuracy and AUPR increase, ECE decreases) or is stable as the width increases. This demonstrates that the variational ordering unit successfully orders the information in each layer.
Despite learning nested subnetworks, in general, has similar performance to IBNN (which separately learns subnetworks) for all models and datasets, with the following exceptions. For MobileNetV2 on both datasets, outperforms IBNN in all metrics, as IBNN fails to perform well in prediction and uncertainty (outperformed by too). For VGG11 on both datasets, IBNN tends to have lower ECE with smaller widths, showing its advantage in providing uncertainty for small and simple models. However, IBNN has larger ECE when the model size is large, e.g., has lower ECE than IBNN with the ResNeXt model. Finally, outperforms IBNN by a large margin for ResNeXt on Tiny ImageNet, which we attribute to its ability to prune the complex architecture via learning ordered structures (Sec. 2.3) and the ordered $L_0$ regularization effect (Sec. 3.3), which are absent in IBNN.
Comparing the two nested models, outperforms in all metrics, which shows the advantage of learning the nested dropout rate for each node.
5.2 Lung Abnormalities Segmentation
Dataset and Setup. The semantic segmentation experiments are conducted on the LIDC-IDRI [clark2013cancer] dataset, which contains 1,018 CT scans from 1,010 lung patients with manual lesion segmentation from four experts. This dataset is uncertainty-critical as it contains typical ambiguities in labels that appear in medical applications. We follow [kohl2018probabilistic] to process the data, resulting in 12,870 images in total. We adopt the generalized energy distance (GED) [bellemare2017cramer, salimans2018improving, szekely2013energy, kohl2018probabilistic] as the evaluation metric, with as the distance function. GED measures the distance between the output distributions rather than single deterministic predictions. For , it measures the probabilistic distances between the induced distribution from the model posterior given a fixed width, and the noisy labels from four experts. We use a standard UNet [ronneberger2015u] for and the number of groups is 32. We compare with Probabilistic UNet (PUNet) [kohl2018probabilistic], a deep ensemble of UNet (EUNet), and Dropout UNet (DUNet) [kendall2015bayesian]. Their results are the average results from [kohl2018probabilistic] with the full UNet.
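A sample-based estimate of the squared GED can be sketched as below. The sketch assumes the distance function is one minus the intersection over union, the choice made in [kohl2018probabilistic], and that two empty masks have distance zero; the function names are illustrative and this is not the evaluation code used for the reported numbers.

```python
import numpy as np

def iou_distance(a, b):
    """Assumed distance: 1 - IoU between two binary masks (0 if both are empty)."""
    a = np.asarray(a).astype(bool)
    b = np.asarray(b).astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return 1.0 - np.logical_and(a, b).sum() / union

def mean_pairwise(xs, ys):
    """Average distance over all pairs drawn from the two sets of masks."""
    return float(np.mean([iou_distance(x, y) for x in xs for y in ys]))

def generalized_energy_distance_sq(samples, labels):
    """Squared GED: 2 E[d(S, Y)] - E[d(S, S')] - E[d(Y, Y')]."""
    return (2.0 * mean_pairwise(samples, labels)
            - mean_pairwise(samples, samples)
            - mean_pairwise(labels, labels))

upper = np.array([[1, 1], [0, 0]])
lower = np.array([[0, 0], [1, 1]])
same = generalized_energy_distance_sq([upper], [upper])       # identical sets
disjoint = generalized_energy_distance_sq([upper], [lower])   # no overlap at all
```

Here `samples` would hold segmentations drawn from the model posterior at a fixed width, and `labels` the annotations of the four experts for the same image.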
Results. The results are presented in Fig. 7. We observe that outperforms the existing methods in most of the cases, with the difference more obvious when there are fewer posterior samples. The performance of stabilizes after width of 32.29%. This indicates learns a compact and effective structure compared with other methods, in terms of capturing ambiguities in the labels.
When there are more posterior samples (8 and 16), probabilistic UNet has better performance than the with the smallest width ( channels are preserved). This means that with more posterior samples, the probabilistic UNet can depict the latent structure better, but it uses a full-width model. Increasing the width to 32.29%, then achieves better performance.
6 Conclusion
In this paper, we propose a Bayesian nested neural network, which is based on a novel variational ordering unit that explicitly models the weight importance via the Downhill random variable. From our model, the weight importance can be learned from data, rather than hand-tuned as with previous methods. Experiments show that this framework can improve both accuracy and calibrated predictive uncertainty. Future work will study the variational ordering unit in language modeling, sequential data, or generative models where the order is important, e.g., [rai2015large]. The Downhill random variable is a well-suited hidden variable for such applications.
References
7 Appendix  Derivation and Proofs
7.1 Derivation of Properties
Property 1.
If , then , where is a dimensional vector of ones, and . . is a standard uniform variable. (Footnote: For Gumbel-softmax sampling, we first draw from , then calculate . The samples of can be obtained by first drawing then computing .)
We show that the sampling process in Property 1 produces the Downhill random variable. We assume follows a Gumbel softmax distribution [gumbel1948statistical, maddison2014sampling], which has the following form.
(25)  
We apply the transformation to the variable .
To obtain the distribution of , we apply the change of variables formula on .
(26)  
(27) 
From the definition of , we can obtain . The Jacobian
(28) 
Thus, .
(29)  
Property 2.
When , sampling from the Downhill distribution reduces to discrete sampling, where the sample space is the set of ordered mask vectors . The approximation of the Downhill distribution to the Bernoulli chain can be calculated in closed form.
As shown in [maddison2016concrete, jang2016categorical], when , the Gumbel softmax transformation corresponds to an argmax operation that generates a one-hot variable:
(30) 
where the relative order is preserved.
Consider a sample , with the th entry being one and the remaining entries being 0. The defined transformation generates . Thus, . It is easy to see that the transformation is a surjective function . is exactly the set of ordered masks defined in Sec. 2.2.
Thus, we can calculate the approximation of Downhill variable to the Bernoulli chain,
(31) 
where (See Appx 1.2). The KL divergence in (31) is minimized to 0 when .
7.2 Probability of ordered masks
Recall the formulation of Bernoulli chain:
(32)  
It is observed, there is a chance only when , and if . Thus,
(33) 
where is the index of the first zero. We define as , which means all nodes are kept.
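As a numerical check of this result, the probabilities of all ordered masks can be computed directly from the chain parameters. The sketch assumes that the mask with j leading ones has probability equal to the product of the first j keep probabilities times the probability of stopping at position j+1, with the keep probability beyond the last node defined as zero; `ordered_mask_probs` and `keep_probs` are hypothetical names.

```python
import numpy as np

def ordered_mask_probs(keep_probs):
    """Probability of each ordered mask under the Bernoulli chain.

    Entry j is the probability of the mask with j+1 leading ones.
    Assumptions: keep_probs[0] == 1.0, and the keep probability after the
    last node is 0, so the last entry is the all-ones mask.
    """
    keep_probs = np.asarray(keep_probs, dtype=float)
    reach = np.cumprod(keep_probs)                  # P(first j+1 nodes are kept)
    stop = 1.0 - np.append(keep_probs[1:], 0.0)     # P(next node is dropped | kept so far)
    return reach * stop

probs = ordered_mask_probs([1.0, 0.9, 0.5, 0.2])
total = probs.sum()
```

For these parameters the four masks have probabilities of roughly 0.1, 0.45, 0.36 and 0.09. The terms telescope, so the probabilities sum to one, which confirms that the ordered masks exhaust the support of the chain.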
7.3 $L_0$ regularization
We consider the case when the prior over each weight is a spike-and-slab distribution, i.e., and , using the notation in Sec. 3.3. The posterior also takes this form. The derivations of the KL term in (11)-(15) remain unchanged, as they make no assumption on the weight prior other than the mean-field assumption. With and (15), the objective (7) can be reorganized as
(34) 
as . We assume as in [louizos2018learning]. It means transforming to requires nats. Thus, . The last term is then simplified to
(35) 
Then,
(36)  
(37) 
where the step from line 2 to line 3 holds because the KL divergence is non-negative. Let . Then, maximizing the evidence lower bound gives the same objective as in (16). This objective assigns greater penalization to the larger subnetworks with more redundant nodes. Compared with (23) [louizos2018learning], which has a constant coefficient over the probabilities, our reduced formulation provides an ordered $L_0$ regularization instead of a uniform $L_0$ regularization.
Note that (36) ignores the weight uncertainty compared with (7). (37) further ignores the uncertainty over the ordered mask, reduced to a deterministic formulation for a nested neural network with learned weight importance. The network used in this paper is (7) with weight uncertainty considered, where the detailed discussion for weight distributions.
8 Appendix  Implementation
8.1 Extension to Convolutional Layer
We consider a convolutional layer that takes in a single tensor as input, where is the index of the batch, , and are the dimensions of the feature map. The layer has filters aggregated as and outputs a matrix . In the paper, we consider the ordered masks applied over the output channels, and each filter corresponds to a dimension in . As shown in [kingma2015variational, molchanov2017variational], the local reparameterization trick can be applied, due to the linearity of the convolutional layer.
(38)
where is the th dimension of the sampled ordered mask .
To calculate the KL term (11), the only modification is to let the first summation be over the height, width and input channels in (13).