Bayesian Nested Neural Networks for Uncertainty Calibration and Adaptive Compression

01/27/2021 ∙ by Yufei Cui, et al. ∙ 0

Nested networks or slimmable networks are neural networks whose architectures can be adjusted instantly at test time, e.g., based on computational constraints. Recent studies have focused on a "nested dropout" layer, which is able to order the nodes of a layer by importance during training, thus generating a nested set of sub-networks that are optimal for different configurations of resources. However, the dropout rate is fixed as a hyper-parameter over different layers during the whole training process. Therefore, when nodes are removed, the performance decays in a human-specified trajectory rather than in a trajectory learned from data. Another drawback is that the generated sub-networks are deterministic networks without well-calibrated uncertainty. To address these two problems, we develop a Bayesian approach to nested neural networks. We propose a variational ordering unit that draws samples for nested dropout at a low cost, from a proposed Downhill distribution, which provides useful gradients to the parameters of nested dropout. Based on this approach, we design a Bayesian nested neural network that learns the order knowledge of the node distributions. In experiments, we show that the proposed approach outperforms the nested network in terms of accuracy, calibration, and out-of-domain detection in classification tasks. It also outperforms the related approach on uncertainty-critical tasks in computer vision.




1 Introduction

Modern deep neural networks (DNNs) have achieved great success in computer vision and related areas. At the same time, they are growing rapidly in model size and computation cost, which makes them difficult to deploy on diverse hardware platforms. Recent works study how to develop a network with flexible size at test time [kim2018nestednet, yu2018slimmable, yu2019universally, cai2019once, ijcai2020-288, xu2020one], to reduce the cost of designing [tan2019efficientnet], training [kingma2014adam], compressing [han2015deep] and deploying [ren2019admm] a DNN on various platforms. As these networks are often composed of a nested set of smaller sub-networks, we refer to them as nested nets in this paper. As many problems are safety-critical, such as object recognition [gal2016dropout, guo2017calibration], medical-image segmentation [kohl2018probabilistic, kendall2015bayesian] and crowd counting [oh2019crowd, neurips20counting], the adopted DNNs are required to provide well-calibrated uncertainty in addition to high prediction performance, as erroneous predictions could result in disastrous consequences. However, the measure of uncertainty was not considered in previous designs of nested nets, which leads to over- or under-confident predictions.

One basis for creating nested nets is to order the network components (e.g., convolution channels) such that less important components can be removed first when creating the sub-network. A unit for neural networks, nested dropout, was developed to order the latent feature representation for encoder-decoder models [rippel2014learning, bekasov2020ordering]. Specifically, a discrete distribution is assigned over the indices of the representations, and the operation of nested dropout samples an index and then drops the representations with larger indices. Recent studies show that nested dropout is able to order the network components during training such that nested nets can be obtained [ijcai2020-288, finn2014learning]. The ordering layout is applicable to different granularity levels of network components: single weights, groups of weights, convolutional channels, residual blocks, network layers, and even quantization bits. We refer to the partitions of the network components as nodes in this paper. However, the probability that an index is sampled is specified by hand as a hyperparameter, and does not change during training. Thus, the importance of a node is pre-determined by hand rather than learned from data.

To enhance predictive uncertainty and to allow the dropout rate to be learned, in this paper, we propose a fully Bayesian treatment for constructing nested nets. We propose a new nested dropout, based on a chain of interdependent Bernoulli variables. The chain simulates the Bernoulli trials and can be understood as a special case of a two-state Markov chain, which intuitively generates order information. To save the time cost for sampling during training, we propose a variational ordering unit that approximates the chain, and an approximate posterior based on a novel Downhill distribution built on Gumbel Softmax [jang2016categorical, maddison2016concrete]. This allows efficient sampling of the multivariate ordered mask, and provides useful gradients to update the importance of the nodes.

Based on the proposed ordering units, a Bayesian nested neural network is constructed, where the independent distributions of nodes are interconnected with the ordering units. A mixture model prior is placed over each node, while the model selection is determined by the ordering units (Fig. 1). A variational inference problem is formulated and resolved, and we propose several methods to simplify the sampling and calculation of the regularization term. Experiments show that our model outperforms the deterministic nested models in any sub-network, in terms of classification accuracy, calibration and out-of-domain detection. It also outperforms other uncertainty calibration methods on uncertainty-critical vision tasks, e.g., probabilistic U-Net on medical segmentation with noisy labels.

In summary, the contributions of this paper are:

  • We propose a variational nested dropout unit with a novel pair of prior and posterior distributions.

  • We propose a novel Bayesian nested neural network that can generate large sets of uncertainty-calibrated sub-networks. The formulation can be viewed as a generalization of ordered regularization over the sub-networks.

  • To our knowledge, this is the first work that considers uncertainty calibration and learns the importance of network components in nested neural networks.

Figure 1: Sampling process in a layer for calculating the data log-likelihood (Eq. 8). A fully connected layer takes as an input and outputs . The variational ordering unit generates ordered mask . Nodes ’s with the same color share an element . The gradient through stochastic nodes can be estimated efficiently, to update the importance


2 Variational Nested Dropout

We first review nested dropout, and then propose our Bayesian ordering unit and variational approximation.

2.1 A review of nested dropout

Previous works [rippel2014learning] that order the representations use either Geometric or Categorical distributions to sample the last index of the kept units, and then drop the neurons with indices greater than it. Specifically, a distribution is assigned over the representation indices. The nested/ordered dropout operation proceeds as follows:

  1. Tail sampling: A tail index is sampled that represents the last element to be kept.

  2. Ordered dropping: The elements with indices greater than the tail index are dropped.

We also refer to this operation as an ordering unit as the representations are sorted in order.
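The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in [rippel2014learning] or [ijcai2020-288]: the function names, the success probability `p_tail`, and the truncation of the Geometric sample to the layer width are assumptions made for this example.

```python
import numpy as np

def sample_tail_index(rng, num_units, p_tail):
    """Sample a 1-based tail index from Geometric(p_tail), truncated to num_units."""
    idx = rng.geometric(p_tail)          # support is {1, 2, ...}
    return min(int(idx), num_units)      # assumed truncation to the layer width

def nested_dropout(features, tail_index):
    """Keep the first `tail_index` units along the last axis and zero the rest."""
    mask = (np.arange(features.shape[-1]) < tail_index).astype(features.dtype)
    return features * mask, mask

rng = np.random.default_rng(0)
features = np.ones((2, 8))               # toy batch of 2 examples, 8 units
tail = sample_tail_index(rng, num_units=8, p_tail=0.3)
dropped, mask = nested_dropout(features, tail)
```

Because the mask is always a run of ones followed by zeros, units with small indices are trained more often than units with large indices, which is what induces the ordering.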

In [rippel2014learning], which focuses on learning ordered representations, this operation is proved to exactly recover PCA with a one-layer neural network. Cui et al. [ijcai2020-288] show that this operation, when applied to groups of neural network weights or quantization bits, generates nested sub-networks that are optimal for different computation resources. They further prove that increasing from a smaller sub-network to a larger one maximizes the incremental information gain. A large network only needs to be trained once with ordered dropout, yielding a set of networks with varying sizes for deployment. However, the above methods treat the ordered dropout rate as a hyper-parameter, and hand-tuning the dropout rate is tedious and may lead to suboptimal performance of the sub-networks, as compared to learning this hyper-parameter from the data. As illustrated in Fig. 2, the previous works use hand-specified parameters for nested dropout, which freezes the importance of the network components over different layers during training.

Figure 2: The probability of the tail index being sampled in different nested dropout realizations. Rippel et al. [rippel2014learning] and Cui et al. [ijcai2020-288] adopt Geometric and Categorical distributions, which are static over different layers and the learning process. The proposed variational nested dropout (VND) learns the importance of nodes from data. The two examples are from two different layers in a Bayesian nested neural network.

A common practice for regular Bernoulli dropout is to treat the dropout rate as a variational parameter in Bayesian neural networks [Gal2016Uncertainty]. To find the optimal dropout rate, grid search was first adopted [gal2016dropout], whose complexity grows exponentially with the number of dropout units. To alleviate the cost of searching, a continuous relaxation of the discrete dropout was proposed, by which the dropout rate can be optimized directly [gal2017concrete], improving accuracy and uncertainty while keeping a low training time. However, for nested dropout, two aspects are unclear: 1) how to take a full Bayesian treatment with nested dropout units; 2) how the relaxation can be done for these units, or how the gradients can be back-propagated to the parameters of the tail-index distribution.

2.2 Bayesian Ordering Unit

The conventional nested dropout uses a Geometric distribution to sample the tail index. By definition, the Geometric distribution models the probability that the $i$-th trial is the first “success” in a sequence of independent Bernoulli trials. In the context of slimming neural networks, a “failure” of a Bernoulli trial indicates that node $i$ is kept, while a “success” indicates the tail index, where this node is kept and all subsequent nodes are dropped. Thus, the success probability of each trial is the conditional probability of a node being a tail index, given the previous node is kept.

Sampling from the Geometric distribution only generates the tail index of the nodes to be kept. A hard selection operation of ordered dropping is required to drop the following nodes. The ordered dropping can be implemented using a set of ordered mask vectors $\mathcal{V} = \{\mathbf{v}_1, \dots, \mathbf{v}_K\}$, where $\mathbf{v}_j$ consists of $j$ ones followed by $K-j$ zeros, $j = 1, \dots, K$. Given the sampled tail index $I$, the appropriate mask $\mathbf{v}_I$ is selected and applied to the nodes (e.g., multiplying the weights). However, as the masking is a non-differentiable transformation and does not provide a well-defined probability distribution, the nested dropout parameters cannot be learned using this formulation.

To find a more natural prior for the nodes, we propose to use a chain of Bernoulli variables to directly model the distribution of the ordered masks. Let the set of binary variables $\mathbf{z} = \{z_1, \dots, z_K\}$ represent the random ordered mask. Specifically, we model the conditional distributions with Bernoulli variables,

$$p(z_i = 1 \mid z_{i-1} = 1) = \pi_i, \qquad p(z_i = 1 \mid z_{i-1} = 0) = 0, \tag{1}$$

where $\pi_i$ is the conditional probability of keeping the node given the previous node is kept, and $p(z_1 = 1) = 1$ (the first node is always kept). Note that we also allow different probabilities $\pi_i$ for each $z_i$. The marginal distribution of $z_i$ is

$$p(z_i = 1) = \prod_{k=2}^{i} \pi_k. \tag{2}$$

A property of this chain is that if $z_i = 0$ occurs at the $i$-th position, the remaining elements with indices greater than $i$ become $0$. That is, sampling from this chain generates an ordered mask, which can be directly multiplied on the nodes to realize ordered dropping. Another benefit is that applying a continuous relaxation [gal2017concrete] of the Bernoulli variables in the chain allows its parameters to be optimized.
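A short NumPy sketch can make the chain concrete. It samples the mask element by element and compares the empirical keep frequencies with the cumulative product of the conditional keep probabilities. The function name and the example probabilities are hypothetical, and the first entry is set to 1 so that the first node is always kept.

```python
import numpy as np

def sample_bernoulli_chain(rng, keep_probs):
    """Sequentially sample an ordered mask.

    keep_probs[i] is the probability of keeping node i given node i-1 is kept.
    """
    mask = np.zeros(len(keep_probs))
    for i, p in enumerate(keep_probs):
        if rng.random() >= p:
            break                 # first zero: every later element stays zero
        mask[i] = 1.0
    return mask

keep_probs = np.array([1.0, 0.9, 0.8, 0.5, 0.5])   # illustrative values only
rng = np.random.default_rng(0)
samples = np.stack([sample_bernoulli_chain(rng, keep_probs) for _ in range(20000)])

empirical_marginal = samples.mean(axis=0)      # Monte Carlo estimate of P(z_i = 1)
analytic_marginal = np.cumprod(keep_probs)     # product of conditional keep probabilities
```

The explicit loop is the point of the example: each draw depends on the previous one, so the cost of one sample grows linearly with the number of nodes.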

However, the sampling of $\mathbf{z}$ requires stepping through each element $z_i$, which has complexity $O(K)$, and is thus not scalable in modern DNNs where $K$ is large. Thus we apply the variational inference framework, while treating the Bernoulli chain as the prior of the ordered mask in our Bayesian treatment. One challenge is to find a tractable variational distribution that approximates the true posterior and is easy to compute. Another challenge is to define a variational distribution that allows efficient re-parameterization, so that the gradient of its parameters can be estimated with low variance.

2.3 Variational Ordering Unit

We propose a novel Downhill distribution based on the Gumbel Softmax distribution [jang2016categorical, maddison2016concrete] that generates the ordered mask $\mathbf{z}$.

Definition 1

Downhill Random Variables (r.v.). Let $\tau > 0$ be the temperature parameter. An r.v. $\mathbf{z}$ has a Downhill distribution, $\mathbf{z} \sim \mathrm{Downhill}(\boldsymbol{\beta}, \tau)$, if its density is:


where $\boldsymbol{\beta} = \{\beta_1, \dots, \beta_K\}$ are the probabilities for each dimension.

Two important properties of Downhill distributions are:

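  • Property 1. If $\mathbf{c} \sim \mathrm{Gumbel\text{-}softmax}(\tau, \boldsymbol{\beta}, \boldsymbol{\epsilon})$, then $\mathbf{z} = \mathbf{1} - \mathrm{cumsum}'(\mathbf{c}) \sim \mathrm{Downhill}(\boldsymbol{\beta}, \tau)$, where $\mathbf{1}$ is a $K$-dimensional vector of ones, and $\mathrm{cumsum}'$ is a shifted cumulative sum, $\mathrm{cumsum}'_i(\mathbf{c}) = \sum_{j=1}^{i-1} c_j$. $\boldsymbol{\epsilon}$ is a standard uniform variable. (For Gumbel-softmax sampling, we first draw $g_1, \dots, g_K$ from $\mathrm{Gumbel}(0, 1)$, then calculate $c_i = \mathrm{softmax}((\log \beta_i + g_i)/\tau)$. The samples of $\mathrm{Gumbel}(0, 1)$ can be obtained by first drawing $\epsilon \sim \mathrm{Uniform}(0, 1)$ then computing $g = -\log(-\log \epsilon)$.)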

  • Property 2. When $\tau \to 0$, sampling from the Downhill distribution reduces to discrete sampling, where the sample space is the set of ordered mask vectors $\mathcal{V}$. The approximation of the Downhill distribution to the Bernoulli chain can be calculated in closed-form.

Property 1 shows the sampling process of the Downhill distribution. We visualize the Downhill samples in Fig. 3. As each multivariate sample has the shape of a long descent from left to right, we name it the Downhill distribution. The temperature variable $\tau$ controls the sharpness of the downhill or the smoothness of the step at the tail index. When $\tau$ is large, the slope is gentle, in which case no nodes are dropped, but the less important nodes are multiplied by a factor less than 1. When $\tau \to 0$, the shape of the sample becomes a cliff, which is similar to the prior on ordered masks, where the less important nodes are dropped (i.e., multiplied by 0). Property 1 further implies that the gradient of an expected cost with respect to the Downhill parameters can be estimated with low variance: because the samples are produced by a differentiable function of the parameters and a parameter-free noise variable (the whole transformation process in Property 1), the gradient can be taken through that function, as in the standard re-parameterization trick.

Recall that our objective is to approximate the chain of Bernoulli variables with a tractable variational distribution. Property 2 shows why the proposed distribution is consistent with the chain of Bernoullis in essence, and provides an easy way to derive the evidence lower bound for variational inference. The proofs of the two properties are in Appx. 1.1. This simple transformation of Gumbel softmax samples allows fast sampling of an ordered unit. Compared with sampling from the Bernoulli chain, the complexity decreases from $O(K)$ sequential steps to $O(1)$, as the sequential sampling of the Bernoulli chain is no longer necessary.
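The sampling route described by Property 1 can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes that the Downhill sample is obtained as one minus a shifted cumulative sum of a Gumbel-softmax sample (so that the first node is always kept), and the function name `sample_downhill` and the example probabilities are hypothetical.

```python
import numpy as np

def sample_downhill(rng, probs, temperature):
    """Draw one relaxed ordered mask from Gumbel-softmax noise (assumed transform)."""
    uniform = rng.uniform(low=1e-12, high=1.0, size=probs.shape)
    gumbel = -np.log(-np.log(uniform))             # Gumbel(0, 1) noise
    logits = (np.log(probs) + gumbel) / temperature
    logits -= logits.max()                         # numerical stability
    c = np.exp(logits) / np.exp(logits).sum()      # Gumbel-softmax sample on the simplex
    # Assumption: shift the cumulative sum by one position so that z[0] == 1.
    shifted_cumsum = np.concatenate(([0.0], np.cumsum(c)[:-1]))
    return 1.0 - shifted_cumsum

rng = np.random.default_rng(0)
probs = np.array([0.4, 0.3, 0.15, 0.1, 0.05])      # illustrative tail probabilities
smooth_mask = sample_downhill(rng, probs, temperature=1.0)    # gentle slope
sharp_mask = sample_downhill(rng, probs, temperature=1e-4)    # close to a 0/1 cliff
```

All elements are produced by vectorized operations on a single noise vector, so no element-by-element loop is needed, and the output is a differentiable function of `probs` for any positive temperature.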

Figure 3: The multivariate Downhill samples under different temperatures $\tau$. When $\tau$ is small, a clear cliff is observed as the dimension increases, which is beneficial for differentiating important or unimportant nodes. As $\tau$ increases, the shape becomes a slope where the gaps between important/unimportant nodes are smoother, which is beneficial for training.

3 Bayesian Nested Neural Network

In this section, we present the Bayesian nested neural network based on the fundamental units proposed in Sec. 2.

3.1 Bayesian Inference and SGVB

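Consider a dataset $\mathcal{D}$ constructed from $N$ pairs of instances $\{(x_n, y_n)\}_{n=1}^{N}$. Our objective is to estimate the parameters $\theta$ of a neural network $p(y \mid x, \theta)$ that predicts $y$ given input $x$ and parameters $\theta$. In Bayesian learning, a prior $p(\theta)$ is placed over the parameters $\theta$. After data $\mathcal{D}$ is observed, the prior distribution is transformed into a posterior distribution $p(\theta \mid \mathcal{D})$.

For neural networks, computing the posterior distribution using the Bayes rule requires computing intractable integrals over $\theta$. Thus, approximation techniques are required. One family of techniques is variational inference, with which the posterior $p(\theta \mid \mathcal{D})$ is approximated by a parametric distribution $q_\phi(\theta)$, where $\phi$ are the variational parameters. $q_\phi(\theta)$ is approximated by minimizing the Kullback-Leibler (KL) divergence with the true posterior, $D_{\mathrm{KL}}(q_\phi(\theta) \,\|\, p(\theta \mid \mathcal{D}))$, which is equivalent to maximizing the evidence lower bound (ELBO):

$$\mathcal{L}(\phi) = L_{\mathcal{D}}(\phi) - D_{\mathrm{KL}}(q_\phi(\theta) \,\|\, p(\theta)),$$

where the expected data log-likelihood is

$$L_{\mathcal{D}}(\phi) = \sum_{n=1}^{N} \mathbb{E}_{q_\phi(\theta)}\left[\log p(y_n \mid x_n, \theta)\right].$$

The integration is not tractable for neural networks. An efficient method for gradient-based optimization of the variational bound is stochastic gradient variational Bayes (SGVB) [kingma2013auto, kingma2015variational]. SGVB parameterizes the random parameters $\theta \sim q_\phi(\theta)$ as $\theta = f(\phi, \epsilon)$, where $f$ is a differentiable function and $\epsilon \sim p(\epsilon)$ is a noise variable with fixed parameters. With this parameterization, an unbiased differentiable minibatch-based Monte Carlo estimator of the expected data log-likelihood is obtained:

$$L_{\mathcal{D}}(\phi) \simeq L_{\mathcal{D}}^{\mathrm{SGVB}}(\phi) = \frac{N}{M} \sum_{m=1}^{M} \log p(y_m \mid x_m, \theta = f(\phi, \epsilon)),$$

where $\{(x_m, y_m)\}_{m=1}^{M}$ is a minibatch of data with $M$ random instances $(x_m, y_m) \sim \mathcal{D}$, and $\epsilon \sim p(\epsilon)$.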
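As a concrete illustration of the minibatch estimator, the sketch below uses a toy Bayesian linear regression with a factorized Gaussian variational distribution. The model, the Gaussian likelihood with fixed `noise_std`, and all names are assumptions chosen to keep the example small; the paper applies the same idea to neural networks.

```python
import numpy as np

def sgvb_log_likelihood(rng, x_batch, y_batch, mean, log_std, num_data, noise_std=1.0):
    """One-sample minibatch estimate of the expected data log-likelihood."""
    eps = rng.standard_normal(mean.shape)          # parameter-free noise
    theta = mean + np.exp(log_std) * eps           # re-parameterized weights
    residual = y_batch - x_batch @ theta
    log_lik = (-0.5 * (residual / noise_std) ** 2
               - np.log(noise_std) - 0.5 * np.log(2.0 * np.pi))
    # Rescale the minibatch sum so that it estimates the sum over all data.
    return num_data / len(x_batch) * log_lik.sum()

rng = np.random.default_rng(0)
x = rng.standard_normal((100, 3))
y = x @ np.array([1.0, -2.0, 0.5])
batch = rng.choice(100, size=10, replace=False)    # random minibatch indices
estimate = sgvb_log_likelihood(rng, x[batch], y[batch],
                               mean=np.zeros(3), log_std=np.zeros(3), num_data=100)
```

Since `theta` is a differentiable function of `mean` and `log_std`, an automatic differentiation framework could back-propagate through this estimate directly.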

3.2 Bayesian Nested Neural Network

In our model, the parameter consists of two parts: weight matrix and ordering units . The ordering units order the network weights and generate sub-models that minimize the residual loss of a larger sub-model [rippel2014learning, ijcai2020-288]. We define the corresponding variational parameters , where and are the variational parameters for the weights and ordering units respectively. We then have the following optimization objective,


where and are the random noise, and and are the differentiable functions that transform the noises to the probabilistic weights and ordered masks.

Next, we focus on an example of a fully-connected (FC) layer. Assume the FC layer in neural network takes in activations as the input, and outputs , where the weight matrix , and are the input and output size, and is the batch size. The elements are indexed as , and respectively. We omit the bias for simplicity, and our formulation can easily be extended to include the bias term. We have the ordering unit with each element applied on the column of , by which the columns of are given different levels of importance. Note that is flexible, and can be applied to row-wise or element-wise as well.

The prior for the weight matrix assumes each weight is independent. We choose to place a mixture of two univariate variables as the prior over each element of the weight matrix. For example, if we use the univariate normal distribution, then each weight prior is a Gaussian mixture, where the 2 components are:


where the parameters are the means and standard deviations of the two components. We fix the mean of the zero-component to 0 and its standard deviation to a small value, resulting in a spike at zero for the component when the corresponding mask element is 0. The mask variable follows the chain of Bernoulli distributions proposed in Sec. 2.2. Using (2), the marginal distribution of each weight is then

To calculate the expected data log-likelihood, our Downhill distribution allows efficient sampling and differentiable transformation for the ordering units (Sec. 2.3). The reparameterization of weight distributions has been widely studied [kingma2015variational, louizos2017multiplicative, kingma2013auto] to provide gradient estimate with low variance. Our framework is compatible with these techniques, which will be discussed in Sec. 3.4. An overview of sampling is shown in Fig. 1.

3.3 Posterior Approximation

Next, we introduce the computation of the KL divergence. We assume the posterior over the weights takes the same form as the prior, while the posterior over the ordering unit takes the Downhill distribution. We consider the case $\tau \to 0$ for simplicity, while $\tau$ can be annealed during the training process. For this layer, the KL divergence in (7) is


Term of (11) is

where $\mathcal{V}$ is the set of ordered masks. The number of components in the space is reduced from $2^K$ to $K$, because there are only $K$ possible ordered masks. By definition, the probabilities are


where we define . The derivation of is included in the Appx. 1.2.

Define as the -th column of , and where . The term of (11) is


Note that the term inside the integration over is the KL divergence between the univariate conditional density in the prior and the posterior, with or . Define as the KL of for component . The term can then be re-organized as


The number of terms in this sum causes a large computation cost in every epoch. Consider the matrices of element-wise KL values for the two components, which are easily computed by applying the KL function element-wise. The term is then expressed as


where is a vector of 1s, is a matrix of 1s and is a lower triangular matrix with each element being 1. The calculation in (15) can then be easily parallelized with modern computation libraries.
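The idea of replacing the nested sums by a product with a lower triangular matrix can be sketched as below. The exact expression of the paper is not reproduced here. The sketch assumes that the quantity of interest is the expected KL over ordered masks, where mask `t` uses the "kept" component for columns up to `t` and the "dropped" component for the remaining columns. All names (`kl_keep`, `kl_drop`, `mask_probs`) are hypothetical.

```python
import numpy as np

def expected_kl_naive(kl_keep, kl_drop, mask_probs):
    """Reference version: loop over every ordered mask and every column."""
    num_cols = kl_keep.shape[1]
    total = 0.0
    for t in range(num_cols):                  # mask t keeps columns 0..t
        for j in range(num_cols):
            column = kl_keep[:, j] if j <= t else kl_drop[:, j]
            total += mask_probs[t] * column.sum()
    return total

def expected_kl_vectorized(kl_keep, kl_drop, mask_probs):
    """Same quantity computed with a lower triangular matrix of ones."""
    num_cols = kl_keep.shape[1]
    lower = np.tril(np.ones((num_cols, num_cols)))     # lower[t, j] = 1 if j <= t
    per_mask = (lower @ kl_keep.sum(axis=0)
                + (1.0 - lower) @ kl_drop.sum(axis=0))
    return mask_probs @ per_mask

rng = np.random.default_rng(0)
kl_keep = rng.random((4, 6))                   # element-wise KL, "kept" component
kl_drop = rng.random((4, 6))                   # element-wise KL, "dropped" component
mask_probs = rng.dirichlet(np.ones(6))         # probabilities of the 6 ordered masks
value = expected_kl_vectorized(kl_keep, kl_drop, mask_probs)
```

The vectorized form needs only matrix products, which is what makes it easy to run on parallel hardware.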

Ordered $L_0$-Regularization. We show that, given the spike-and-slab priors, our KL term in (11) has a nice interpretation as a generalization of an ordered $L_0$ regularization over the sub-networks. The corresponding reduced objective for deterministic networks is


Note that larger sub-networks have greater penalization. The proof and more interpretations are in the Appx. 1.3.

3.4 Implementation

For efficient sampling of the weight distributions, we put multiplicative Gaussian noise on the weight , similar to [kingma2015variational, molchanov2017variational, louizos2017bayesian]. We take for as an example.


We also assume a log-uniform prior [kingma2015variational, molchanov2017variational, louizos2017bayesian]. (10) becomes . With this prior, the negative KL term in (13) does not depend on the variational parameter  [kingma2015variational], when the parameter is fixed,


where is a constant.

As the second term in (19) cannot be computed analytically and should be estimated by sampling, Kingma et al. [kingma2015variational] propose to sample first and design a function to approximate it, but their approximation does not encourage $\alpha > 1$ as the optimization is difficult. An $\alpha \le 1$ corresponds to a small variance, which is not flexible. Molchanov et al. [molchanov2017variational] use a different parameterization that pushes $\alpha \to \infty$, which means this weight can be discarded, as illustrated in Fig. 4. In our model, we want the order or sparsity of weights to be explicitly controlled by the ordering unit, otherwise the network would collapse to a single model rather than generate a nested set of sub-models. Thus, we propose another approximation to (19),


where , , and . We obtained these parameters by sampling from to estimate (19) as the ground-truth and fit these parameters for epochs. For fitting the curves, the input range is limited to . As shown in Fig. 4, our parameterization allows and maximizing does not push to infinity (c.f. [kingma2015variational] and [molchanov2017variational]), providing more flexible choices for the weight variance.

Figure 4: Approximation to (19). Our approximation allows $\alpha > 1$ (cf. [kingma2015variational]) and does not push $\alpha \to \infty$ to generate a collapsed model (cf. [molchanov2017variational]).

As the prior of the zero-component is assumed a spike at zero with a small constant variance, we let be the same spike as (9) to save computation. Also, to speed up the sampling process in Fig. 1, we directly multiply the sampled mask with the output features of the layer. This saves the cost for sampling from and simplifies (15) to . Using the notation in Sec. 3.2, the output of a fully connected layer is


The sampling process is similar to that of [kingma2015variational, molchanov2017variational, louizos2017bayesian].
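A forward pass of this kind can be sketched with the local re-parameterization trick of [kingma2015variational]. The sketch assumes each weight is Gaussian with mean `theta` and variance `alpha * theta**2` (multiplicative Gaussian noise), and that the sampled ordered mask multiplies the output columns, as described above. The function name and shapes are hypothetical.

```python
import numpy as np

def bayesian_fc_forward(rng, inputs, theta, log_alpha, mask):
    """Sample layer outputs directly, then apply the ordered mask to the outputs."""
    mean = inputs @ theta                                        # output mean
    variance = (inputs ** 2) @ (np.exp(log_alpha) * theta ** 2)  # output variance
    noise = rng.standard_normal(mean.shape)
    outputs = mean + np.sqrt(variance) * noise
    return outputs * mask          # mask has one entry per output column

rng = np.random.default_rng(0)
inputs = rng.standard_normal((4, 3))           # batch of 4, input size 3
theta = rng.standard_normal((3, 5))            # weight means, output size 5
log_alpha = np.full((3, 5), -2.0)              # illustrative noise level
mask = np.array([1.0, 1.0, 1.0, 0.0, 0.0])     # ordered mask keeping 3 columns
outputs = bayesian_fc_forward(rng, inputs, theta, log_alpha, mask)
```

Sampling the outputs instead of the full weight matrix needs one noise value per output activation rather than one per weight.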

The Bayesian nested neural network can be easily extended to convolutional layers with the ordering applied to filter channels (see Appx. 2.1 for details).

4 Related work

In this section, we review deep nets with $L_0$ regularization and nested nets; the comparisons with Bayesian neural networks are elaborated in Sec. 3.4.

$L_0$ regularization. The Bernoulli-Gaussian linear model with independent Bernoulli variables is shown to be equivalent to $L_0$ regularization [murphy2012machine]. Recent works [louizos2018learning, yang2019deephoyer] investigate the $L_0$ norm for regularizing deep neural networks. [louizos2018learning] presents a general formulation of an $L_0$-regularized learning objective for a single deterministic neural network,


where the variable is a binary gate with parameter for each network node , and is the loss. It was shown that $L_0$ regularization over the weights is a special case of an ELBO over parameters with spike-and-slab priors. These works present a uniform $L_0$ regularization, as the coefficient is constant over the weights. Interestingly, our ELBO (7) can be viewed as a generalization of a new training objective for deterministic networks, which includes a weighted penalization over the choices of sub-networks, interpretable as an ordered $L_0$ regularization (16).

Nested neural networks. Nested nets have been explored in recent years, for their portability in neural network (NN) deployment on different platforms. [kim2018nestednet] proposes a network-in-network structure for a nested net, which consists of internal networks from the core level to the full level. [yu2018slimmable, yu2019universally] propose slimmable NN that trains a network that samples multiple sub-networks of different channel numbers (widths) simultaneously, where the weights are shared among different widths. The network needs to switch between different batch normalization parameters that correspond to different widths. To alleviate the interference in optimizing channels in slimmable NN, [cai2019once] proposes a once-for-all network that is elastic in kernel size, network depth and width, by shrinking the network progressively during training. [ijcai2020-288] proposes using nested dropout to train a fully nested neural network, which generates more sub-networks in nodes, including weights, channels, paths, and layers. However, none of the previous works consider learned importance over the nodes and the predictive uncertainty. Our work provides well-calibrated uncertainty and learned importance, with a full Bayesian treatment of nested nets.

5 Experiments

We next present experiments using our Bayesian nested neural network. The experiments include two main tasks: image classification and semantic segmentation.

5.1 Image Classification

Datasets and setup. The image classification experiments are conducted on Cifar10 and Tiny ImageNet (see Appx. 3.4 for results on Cifar100). The tested NN models are VGG11 with batch normalization layers, the ResNeXt-Cifar model from [xie2017aggregated], and MobileNetV2 [sandler2018mobilenetv2].

Figure 5: Results on Cifar10 for (a) VGG11, (b) MobileNetV2, and (c) ResNeXt-Cifar. Each curve plots performance versus the network width. The solid line indicates the mean and the shaded area indicates two standard deviations.

Figure 6: Results on Tiny ImageNet for (a) VGG11, (b) MobileNetV2, and (c) ResNeXt-Cifar.

To train the proposed Bayesian nested neural network, we use the cross-entropy loss for the expected log-likelihood in (8). The computation of the KL term follows Sec. 3.4. For ordering the nodes, in every layer, we assign each dimension of the prior (Bernoulli chain) and posterior (Downhill variable) of the ordering unit to a group of weights. Thus, the layer width is controlled by the ordering unit. We set the number of groups to 32 for VGG11 and ResNeXt-Cifar, and to 16 for MobileNetV2. We compare our model with the fully nested neural network [ijcai2020-288], since it can be seen as an extension of slimmable NN [yu2018slimmable, yu2019universally] to fine-grained nodes. We also compare with the Bayesian NN with variational Gaussian dropout [kingma2015variational], where we train a set of independent Bayesian NNs (IBNN) for different fixed widths. Conceptually, the performance of IBNN, which trains separate sub-networks, is the ideal target for our model, which uses nested sub-networks.

At test time, we generate fixed-width masks for the proposed model and the fully nested network as in [ijcai2020-288]. For fairness, we do not perform a local search for the optimal width as in [ijcai2020-288], but directly truncate the widths. We re-scale the node output by the probability that a node is kept (see Appx. 2.2). The batch normalization statistics are then re-collected for 1 epoch with the training data (using fewer data is also feasible, as shown in Appx. 3.2). The number of samples used in testing the proposed model and IBNN is 6. The detailed hyper-parameter settings for training and testing are in Appx. 3.1.

Figure 7: Evaluation on semantic segmentation using generalized energy distance (GED) with different numbers of posterior samples. Each box-plot shows the GED over all data for one network width (%) of the proposed model. The black horizontal line in the box plot represents the mean. The bold horizontal lines represent the averaged results for comparison methods, which use the full-width network.

Evaluation metrics. For the evaluation, we test accuracy, uncertainty calibration, and out-of-domain (OOD) detection. Calibration performance is measured with the expected calibration error [guo2017calibration] (ECE), which is the expected difference between the average confidence and accuracy. OOD performance is measured with the area under the precision-recall curve (AUPR) [boyd2013area, hendrycks2016baseline, lakshminarayanan2017simple] (see Appx. 3.3 for AUROC curves). If we take the OOD class as positive, precision is the fraction of detected OOD data that are true OOD, while recall is the fraction of true OOD data that are successfully detected. Note that a better model will have higher accuracy and OOD AUPR, and lower calibration ECE. As the sampling and collection of batch-norm statistics are stochastic, we repeat each trial 3 times and report the average results.
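For reference, the calibration metric can be computed as in the sketch below, which follows the common binned definition of ECE from [guo2017calibration]. The choice of 15 equal-width bins and the function name are assumptions; the binning used in the experiments is not stated in this section.

```python
import numpy as np

def expected_calibration_error(confidences, correct, num_bins=15):
    """Weighted average gap between confidence and accuracy over confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    ece = 0.0
    for low, high in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > low) & (confidences <= high)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap        # weight by the fraction of samples in the bin
    return ece

# Toy usage: maximum softmax probability and whether the prediction was right.
confidences = np.array([0.95, 0.9, 0.8, 0.6, 0.55])
correct = np.array([1, 1, 0, 1, 0])
ece = expected_calibration_error(confidences, correct)
```

A value of 0 means that, within every bin, the average confidence matches the empirical accuracy.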

Results. The results are presented in Figs. 5 and 6. First, looking at performance versus width, the proposed model exhibits the well-behaved property of sub-networks, where the performance improves (accuracy and AUPR increase, ECE decreases) or is stable as the width increases. This demonstrates that the variational ordering unit successfully orders the information in each layer.

Despite learning nested sub-networks, in general, the proposed model has similar performance to IBNN (which separately learns sub-networks) for all models and datasets, with the following exceptions. For MobileNetV2 on both datasets, the proposed model outperforms IBNN in all metrics, as IBNN fails to perform well in prediction and uncertainty (it is outperformed by the fully nested network too). For VGG11 on both datasets, IBNN tends to have lower ECE with smaller widths, showing its advantage in providing uncertainty for small and simple models. However, IBNN has larger ECE when the model size is large, e.g., the proposed model has lower ECE than IBNN with the ResNeXt model. Finally, the proposed model outperforms IBNN by a large margin for ResNeXt on Tiny ImageNet, which we attribute to its ability to prune the complex architecture via learning ordered structures (Sec. 2.3) and the ordered $L_0$ regularization effect (Sec. 3.3), which are absent in IBNN.

Comparing the two nested models, the proposed model outperforms the fully nested network in all metrics, which shows the advantage of learning the nested dropout rate for each node.

5.2 Lung Abnormalities Segmentation

Dataset and Setup. The semantic segmentation experiments are conducted on the LIDC-IDRI [clark2013cancer] dataset, which contains 1,018 CT scans from 1,010 lung patients with manual lesion segmentation from four experts. This dataset is uncertainty-critical as it contains typical ambiguities in labels that appear in medical applications. We follow [kohl2018probabilistic] to process the data, resulting in 12,870 images in total. We adopt the generalized energy distance (GED) [bellemare2017cramer, salimans2018improving, szekely2013energy, kohl2018probabilistic] as the evaluation metric, with $1 - \mathrm{IoU}$ as the distance function. GED measures the distance between the output distributions rather than single deterministic predictions. For the proposed model, it measures the probabilistic distance between the distribution induced by the model posterior given a fixed width and the noisy labels from four experts. We use a standard U-Net [ronneberger2015u] as the backbone of the proposed model, and the number of groups is 32. We compare with Probabilistic U-Net (P-UNet) [kohl2018probabilistic], a deep ensemble of U-Net (E-UNet), and Dropout U-Net (D-UNet) [kendall2015bayesian]. Their results are the average results from [kohl2018probabilistic] with the full U-Net.
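The metric can be sketched for a single image as follows. The sketch assumes the squared GED form and the distance of one minus intersection-over-union used in [kohl2018probabilistic]; the function names and the convention that two empty masks have distance 0 are choices made for this example.

```python
import numpy as np

def iou_distance(a, b):
    """1 - IoU between two binary masks; two empty masks are given distance 0."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return 1.0 - np.logical_and(a, b).sum() / union

def generalized_energy_distance(samples, labels):
    """Squared GED between model samples and expert labels for one image."""
    def mean_distance(xs, ys):
        return np.mean([iou_distance(x, y) for x in xs for y in ys])
    return (2.0 * mean_distance(samples, labels)
            - mean_distance(samples, samples)
            - mean_distance(labels, labels))

# Toy usage with 2x2 masks: two posterior samples against two expert labels.
label_a = np.array([[1, 1], [0, 0]])
label_b = np.array([[1, 0], [0, 0]])
ged = generalized_energy_distance([label_a, label_a], [label_a, label_b])
```

The two subtracted terms reward diversity: a model that always outputs the same mask is penalized when the experts disagree with each other.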

Results. The results are presented in Fig. 7. We observe that the proposed model outperforms the existing methods in most cases, with the difference more obvious when there are fewer posterior samples. The performance of the proposed model stabilizes after a width of 32.29%. This indicates that it learns a compact and effective structure compared with other methods, in terms of capturing ambiguities in the labels.

When there are more posterior samples (8 and 16), probabilistic U-Net has better performance than the proposed model with the smallest width ( channels are preserved). This means that with more posterior samples, the probabilistic U-Net can depict the latent structure better, but it uses a full-width model. After increasing the width to 32.29%, the proposed model achieves better performance.

6 Conclusion

In this paper, we propose a Bayesian nested neural network, which is based on a novel variational ordering unit that explicitly models the weight importance via the Downhill random variable. From our model, the weight importance can be learned from data, rather than hand-tuned as with previous methods. Experiments show that this framework can improve both accuracy and calibrated predictive uncertainty. Future work will study the variational ordering unit in language modeling, sequential data, or generative models where the order is important, e.g., [rai2015large]. The Downhill random variable is a well-suited hidden variable for such applications.


7 Appendix - Derivation and Proofs

7.1 Derivation of Properties

Property 1.

If , then , where is a -dimensional vector of ones, and . . is a standard uniform variable.

Footnote 2: For Gumbel-softmax sampling, we first draw from , then calculate . The samples of can be obtained by first drawing then computing .

We show that the sampling process in Property 1 produces the Downhill random variable. We assume follows a Gumbel softmax distribution [gumbel1948statistical, maddison2014sampling], which has the following form.


We apply the transformation to the variable .

To obtain the distribution of , we apply the change of variables formula on .


From the definition of , we can obtain . The Jacobian


Thus, .
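The sampling route described in Property 1 can be illustrated with a short sketch: draw uniform noise, turn it into Gumbel noise, apply a tempered softmax, and then map the resulting point on the simplex to a non-increasing mask. The Gumbel-softmax part is the standard construction. The final map is an assumption made for illustration: it subtracts an exclusive cumulative sum from a vector of ones, which may differ from the exact transformation used in the paper. The names `gumbel_softmax_sample` and `ordered_mask` are hypothetical.

```python
import math
import random


def gumbel_softmax_sample(log_pi, tau, rng):
    """Draw one relaxed one-hot sample from unnormalized log-probabilities."""
    eps = 1e-12
    logits = []
    for lp in log_pi:
        # Clamp the uniform draw away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), eps), 1.0 - eps)
        g = -math.log(-math.log(u))  # standard Gumbel noise
        logits.append((lp + g) / tau)
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ordered_mask(c):
    """Map a simplex point c to a non-increasing mask in [0, 1].

    Assumed convention: v_i = 1 - sum_{j < i} c_j (exclusive cumulative sum).
    """
    v, running = [], 0.0
    for ci in c:
        v.append(1.0 - running)
        running += ci
    return v


# Usage: a lower temperature gives samples closer to a hard ordered mask.
rng = random.Random(0)
log_pi = [math.log(p) for p in (0.5, 0.3, 0.2)]
soft_mask = ordered_mask(gumbel_softmax_sample(log_pi, tau=1.0, rng=rng))
hard_mask = ordered_mask(gumbel_softmax_sample(log_pi, tau=0.01, rng=rng))
```

Under this convention a one-hot input yields a binary mask of leading ones followed by zeros, so every sample respects the node ordering by construction.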


Property 2.

When , sampling from the Downhill distribution reduces to discrete sampling, where the sample space is the set of ordered mask vectors . The approximation of the Downhill distribution to the Bernoulli chain can be calculated in closed-form.

As shown in [maddison2016concrete, jang2016categorical], when , the Gumbel softmax transformation corresponds to an argmax operation that generates a one-hot variable:


where the relative order is preserved.

Consider a sample , whose -th entry is one and whose remaining entries are zero. The defined transformation generates . Thus, . It is easy to see that the transformation is a surjective function that . is exactly the set of ordered masks defined in Sec. 2.2.

Thus, we can calculate the approximation of Downhill variable to the Bernoulli chain,


where (see Appx 1.2). The KL divergence in (31) is minimized to 0 when .

7.2 Probability of ordered masks

Recall the formulation of Bernoulli chain:


It is observed, there is a chance only when , and if . Thus,


where is the index of the first zero. We define as , which means that all nodes are retained.
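The probabilities of the ordered masks can be enumerated with a few lines of code. The sketch below assumes the chain is parameterized by per-node keep probabilities, where a node can be kept only if its predecessor was kept; under that assumption, the mask whose first zero sits at position i has probability equal to the product of the earlier keep probabilities times the drop probability at i. The function name and the parameter `pi` are illustrative.

```python
def ordered_mask_probs(pi):
    """Probabilities of all ordered masks under a Bernoulli chain.

    pi[k] is the assumed probability of keeping node k given that node k-1
    was kept. Entry i of the result (0-indexed) is the probability that the
    first zero is at node i; the final entry is the probability that all
    nodes are retained.
    """
    probs, keep_so_far = [], 1.0
    for p in pi:
        probs.append(keep_so_far * (1.0 - p))  # first zero lands here
        keep_so_far *= p                       # otherwise the chain continues
    probs.append(keep_so_far)                  # no zero: every node retained
    return probs


# Usage: three nodes give four possible ordered masks.
probs = ordered_mask_probs([0.9, 0.5, 0.2])
```

Because the events "first zero at node i" and "all nodes retained" are disjoint and exhaustive, the returned values sum to one, which gives a quick sanity check on any chosen keep probabilities.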

7.3 $\ell_0$ regularization

We consider the case where the prior over each weight is a spike-and-slab distribution, i.e., and , using the notation in Sec. 3.3. The posterior also takes this form. The derivations of the KL term in (11-15) remain unchanged, as they make no assumption on the weight prior other than the mean-field assumption. With and (15), the objective (7) can be re-organized as


as . We assume as in [louizos2018learning]. It means transforming to requires nats. Thus, . The last term is then simplified to




where the step from line 2 to line 3 holds because the KL divergence is non-negative. Let . Then, maximizing the evidence lower bound yields the same objective as (16). This objective assigns a greater penalty to larger sub-networks with more redundant nodes. Compared with (23) [louizos2018learning], which uses a constant coefficient over the probabilities, our reduced formulation provides an ordered $\ell_0$ regularization instead of a uniform $\ell_0$ regularization.

Note that (36) ignores the weight uncertainty compared with (7). (37) further ignores the uncertainty over the ordered mask, reducing to a deterministic formulation for a nested neural network with learned weight importance. The network used in this paper is (7) with weight uncertainty considered, where the detailed discussion for weight distributions.

8 Appendix - Implementation

8.1 Extension to Convolutional Layer

We consider a convolutional layer that takes in a single tensor

as input, where is the index of the batch, , and are the dimensions of the feature map. The layer has filters aggregated as and outputs a matrix . In the paper, we consider the ordered masks applied over the output channels, and each filter corresponds to a dimension in . As shown in [kingma2015variational, molchanov2017variational], the local reparameterization trick can be applied, due to the linearity of the convolutional layer.


where is the -th dimension of the sampled ordered mask .

To calculate the KL term (11), the only modification is to let the first summation be over the height, width and input channels in (13).