Predictive Uncertainty through Quantization

10/12/2018 ∙ by Bastiaan S. Veeling, et al. ∙ 0

High-risk domains require reliable confidence estimates from predictive models. Deep latent variable models provide these, but suffer from the rigid variational distributions used for tractable inference, which err on the side of overconfidence. We propose Stochastic Quantized Activation Distributions (SQUAD), which imposes a flexible yet tractable distribution over discretized latent variables. The proposed method is scalable, self-normalizing and sample efficient. We demonstrate that the model fully utilizes the flexible distribution, learns interesting non-linearities, and provides predictive uncertainty of competitive quality.






1 Introduction

Figure 1: DLVMs have layers of stochastic latent variables.

In high-risk domains, prediction errors come at high costs. Luckily such domains often provide a failsafe: self-driving cars perform an emergency stop, doctors run another diagnostic test, and industrial processes are temporarily halted. For deep learning models, this can be achieved by rejecting datapoints with a confidence score below a predetermined threshold. This way, a low error rate can be guaranteed at the cost of rejecting some predictions. However, estimating high quality confidence scores from neural networks, which create well-ordered rankings of correct and incorrect predictions, remains an active area of research.

Deep Latent Variable Models (DLVMs, fig. 1) approach this by postulating latent variables $z$ whose uncertainty influences the confidence in the target prediction. Recently, efficient inference algorithms have been proposed in the form of variational inference, where an inference neural network is optimized to predict parameters of a variational distribution that approximates an otherwise intractable distribution (Kingma & Welling (2013); Rezende et al. (2014); Alemi et al. (2016); Achille & Soatto (2016)).

Figure 2: A tractable variational distribution is optimized to approximate the true distribution.

Variational inference relies on a tractable class of distributions that can be optimized to closely resemble the true distribution (fig. 2), and it’s hypothesized that more flexible classes lead to more faithful approximations and thus better performance (Jordan et al. (1999)). To explore this hypothesis, we propose a novel tractable class of highly flexible variational distributions. Considering that neural networks with low-precision activations exhibit good performance (Holi & Hwang (1993); Hubara et al. (2016)), we make the modeling assumption that latent variables can be expressed under a strong quantization scheme, without loss of predictive fidelity. If this assumption holds, it becomes tractable to model a scalar latent variable with a flexible multinomial distribution over the quantization bins (fig. 3).

Figure 3: SQUAD quantizes the domain of a scalar latent variable to model a flexible and tractable variational distribution.

By re-positioning the variational distribution from a potentially limited description of moments, as found in commonly applied conjugate distributions, to a direct expression of probabilities per value, a variety of benefits arise. As the output domain is constrained, the method becomes self-normalizing, relieving the model from hard-to-parallelize batch normalization techniques (Ioffe & Szegedy (2015)). More interesting priors can be explored and the model is able to learn unique activation functions per neuron.

More concretely, the contributions of this work are as follows:

  • We propose a novel variational inference method by leveraging multinomial distributions on quantized latent variables.

  • We show that the emerging predicted distributions are multimodal, motivating the need for flexible distributions in variational inference.

  • We demonstrate that the proposed method applied to the information bottleneck objective computes competitive uncertainty over the predictions and that this manifests in better performance under strong risk guarantees.

2 Background

In this work, we explore deep neural networks for regression and classification. We have data-points consisting of inputs $x$ and targets $y$ in a dataset $\mathcal{D}$ and postulate latent variables $z$ that represent the data. We focus on the Information Bottleneck (IB) perspective: first proposed by Tishby et al. (2000), the information bottleneck objective is optimized to maximize the mutual information between $z$ and $y$, whilst minimizing the mutual information between $z$ and $x$. The objective can be efficiently optimized using a variational inference scheme as shown concurrently by both Alemi et al. (2016) and Achille & Soatto (2016). Under the Markov assumption $y \leftrightarrow x \leftrightarrow z$, they derive the following lower bound:

$$\mathcal{L}_{\mathrm{IB}} = \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}_{p(z|x_n)}\left[\log q(y_n|z)\right] - \beta \, \mathrm{KL}\left[p(z|x_n) \,\|\, r(z)\right] \quad (1)$$


where the expectation is commonly estimated using a single Monte Carlo sample and $r(z)$ is a variational approximation to the marginal distribution of $z$. In practice $r(z)$ is fixed to a simple distribution such as a spherical Gaussian. Alemi et al. (2016) and Achille & Soatto (2016) continue to show that the Variational Auto Encoder (VAE) Evidence Lower Bound (ELBO) proposed in Kingma & Welling (2013); Rezende et al. (2014) is a special case of the IB bound when $y = x$ and $\beta = 1$:


where $n$ represents the identity of data-point $x_n$. Interestingly, the VAE perspective considers the bound to optimize a variational distribution $q(z|x)$, whilst the IB perspective prescribes that $q(z|x)$ in the ELBO is not a variational posterior but the true encoder $p(z|x)$, and instead $p(y|z)$ and $p(z)$ are the distributions approximated with variational counterparts.

From yet another perspective, equation 1 can be interpreted as a domain-translating beta-VAE (Higgins et al. (2016)), where an input image is encoded into a latent space and decoded into the target domain. The Lagrange multiplier then controls the trade-off between rate and distortion, as argued by Alemi et al. (2017).

In this work, we follow the IB interpretation of the bound in equation 1 and leave the evaluation of our proposed variational inference scheme in other models such as the VAE for further work.

3 Method

Figure 4: The left diagram visualizes the computational graph of SQUAD at training time, providing a detailed view on how an individual latent variable is sampled. The right diagram visualizes how the proposed matrix-factorization variant improves the parameter efficiency of the model.

At the heart of our proposal lies the assumption that neural networks can be effective under strong activation quantization schemes. We start by presenting the derivation of the model in the context of a single latent-layer information bottleneck, following the single data-point loss in equation 1 and dropping the subscript for clarity, with figure 4 for visual reference:


To impose a flexible, multi-modal distribution over $z$, we first make a mean-field assumption, so that the distribution factorizes over the individual scalar latent variables. We then quantize the domain of each of the scalar latent variables such that only a small set of potential values remains, see fig. 3.

To optimize the parameters with Stochastic Gradient Descent (SGD), we need to derive a fully differentiable sampling scheme that allows us to sample values of the latent variables. To formulate this, we re-parametrize the expectation over the latent variables in equation 3 using a set of index variables, which index the value vector, allowing us to use a softmax function to represent the distribution over the indices:


These index values are then used in conjunction with the value vector as input for the predictive distribution over the targets, which is modelled with a small network (abusing notation to indicate element-wise indexing):

To enable sampling from the discrete index variables, we use the Gumbel-Max trick (Gumbel (1954)), re-parameterizing the expectation with uniform noise:


As the argmax is not differentiable, we approximate this expectation using the Gumbel-Softmax trick (Maddison et al. (2016); Jang et al. (2016)), which generates samples that smoothly deform into one-hot samples as the softmax temperature approaches $0$. Using the inner product of the approximate one-hot samples and the value vector, we create samples of the latent variables:


In practice, we anneal the softmax temperature during the training process, as proposed by Yang et al. (2017), to reduce gradient variance initially, at the risk of introducing bias.
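The sampling scheme described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the five bin values and the logits are all hypothetical, and plain Python lists stand in for tensors. It draws Gumbel noise from uniform noise, applies a temperature-scaled softmax, and takes the inner product with the bin values to obtain a (relaxed) sample of one scalar latent variable.

```python
import math
import random

def gumbel_softmax_sample(logits, temperature, rng):
    """Draw a relaxed one-hot sample over the quantization bins."""
    # Gumbel(0, 1) noise from uniform noise: g = -log(-log(u)).
    gumbels = [-math.log(-math.log(rng.uniform(1e-12, 1.0 - 1e-12)))
               for _ in logits]
    scores = [(l + g) / temperature for l, g in zip(logits, gumbels)]
    # Numerically stable softmax over the perturbed logits.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample_latent(logits, bin_values, temperature, rng):
    """Inner product of the relaxed one-hot sample with the bin values."""
    weights = gumbel_softmax_sample(logits, temperature, rng)
    return sum(w * v for w, v in zip(weights, bin_values))

rng = random.Random(0)
bin_values = [-2.0, -1.0, 0.0, 1.0, 2.0]  # hypothetical quantization bins
logits = [0.1, 2.0, -1.0, 1.5, 0.0]       # hypothetical network outputs

# High temperature: a smooth blend of bin values (low-variance gradients).
soft = sample_latent(logits, bin_values, temperature=1.0, rng=rng)
# Low temperature: the sample moves towards a single bin value.
hard = sample_latent(logits, bin_values, temperature=0.01, rng=rng)
```

At a high temperature the sample is a blend of several bin values, which is where the bias mentioned above comes from; as the temperature is lowered the weights concentrate on one bin.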

To conclude our derivation, we use a fixed SQUAD distribution to model the variational marginal as shown in figure 5. We can then derive the KL term analytically following the definition for discrete distributions. Using the fact that the KL divergence is additive for independent variables, we get our final loss:


For the remainder of this work, we will refer to the latent variables as in lieu of , for clarity.
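Because every latent variable is a discrete distribution over bins, the KL term is a finite sum per variable, and the terms add up across independent variables. The sketch below illustrates only that computation; the function names, the uniform prior and the two predicted distributions are hypothetical.

```python
import math

def categorical_kl(p, r):
    """KL[p || r] for two discrete distributions over the same bins."""
    # Bins with zero predicted probability contribute nothing to the sum.
    return sum(pi * math.log(pi / ri) for pi, ri in zip(p, r) if pi > 0.0)

def total_kl(predicted, prior):
    """Sum of per-variable KL terms, valid for independent latent variables."""
    return sum(categorical_kl(p, prior) for p in predicted)

prior = [0.2] * 5                   # hypothetical fixed uniform prior
predicted = [
    [0.2, 0.2, 0.2, 0.2, 0.2],      # matches the prior: contributes zero
    [0.5, 0.0, 0.0, 0.0, 0.5],      # bimodal prediction
]
kl = total_kl(predicted, prior)
```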

At test time, we can approximate the predictive function for a new data-point by taking $T$ samples from the latent variables and averaging the resulting predictions:


We improve the flexibility of the proposed model by creating a hierarchical set of latent variables. The joint distribution of L layers of latents is then:


With . This is straightforwardly implemented with a simple ancestral sampling scheme.
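The test-time averaging over sampled latent variables can be illustrated with a toy model. Everything in this sketch is an assumption made for illustration: the one-variable "encoder" and two-class "decoder" are hypothetical stand-ins for trained networks, and only the structure (sample the quantized latent, decode, average over several samples) reflects the procedure described in the text.

```python
import math
import random

def softmax(logits):
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def stochastic_forward(x, bin_values, rng):
    """Toy stand-in for one stochastic pass: encode, sample a bin, decode."""
    probs = softmax([-(x - v) ** 2 for v in bin_values])  # hypothetical encoder
    z = rng.choices(bin_values, weights=probs, k=1)[0]    # sample a bin value
    return softmax([z, -z])                               # hypothetical decoder

def mc_predict(x, bin_values, num_samples, rng):
    """Average the class probabilities over several sampled forward passes."""
    draws = [stochastic_forward(x, bin_values, rng) for _ in range(num_samples)]
    num_classes = len(draws[0])
    return [sum(d[k] for d in draws) / num_samples for k in range(num_classes)]

bins = [-2.0, -1.0, 0.0, 1.0, 2.0]
prediction = mc_predict(2.0, bins, num_samples=4, rng=random.Random(0))
```

For several layers, the same loop would sample each layer in turn, conditioning on the sample from the layer before it.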

Interestingly, the strong quantization proposed in our method can itself be considered an additional information bottleneck, as it exactly upper-bounds the number of bits per latent variable. Such bottlenecks are theorized to have a beneficial effect on generalization (Tishby et al. (2000); Achille & Soatto (2016); Alemi et al. (2017, 2016)), and we can directly control this bottleneck by varying the number of quantization bins.
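As a worked illustration of this bound, write $C$ for the number of quantization bins and $z$ for a single quantized latent variable (notation assumed here). The entropy of a variable with $C$ possible values is largest under the uniform distribution, and the mutual information with the input cannot exceed that entropy:

```latex
I(z; x) \;\le\; H(z) \;\le\; \log_2 C \ \text{bits},
\qquad \text{e.g. } C = 15 \;\Rightarrow\; \log_2 15 \approx 3.91 \ \text{bits}.
```

The value $C = 15$ is the bin count of the optimal configuration reported in the appendix; doubling the number of bins raises the ceiling by exactly one bit per latent variable.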

The computational complexity of the method, as well as the number of model parameters, scales linearly in the number of quantization bins $C$. It is thus suitable for large-scale inference. We would like to stress that the proposed method differs from work that leverages the Gumbel-Softmax trick to model categorical latent variables: our proposal models continuous scalar latent variables by quantizing their domain and modeling belief with a multinomial distribution. Categorical latent variable models would incur a much larger polynomial complexity penalty.

Matrix-factorization variant

To improve the tractability of using a large number of quantization bins, we propose a variant of SQUAD that uses a matrix factorization scheme to improve the parameter efficiency. Formally, equation 4 becomes:

with full layer weights and respectively of shape and , where denotes the number of neurons, the number of factorization inputs, number of quantization bins and the input dimensionality. To improve the parameter efficiency, we can learn per layer as well, resulting in shape , which is found to be beneficial for large C by the hyper-parameter search presented in section 5. We depict this alternative model on the right side of figure 4 and will refer to it as SQUAD-factorized. We leave further extensions such as Network-in-Network (Lin et al. (2013)) for future work.

Figure 5: In the IB bound, the marginal is approximated with a fixed distribution. Using our proposed SQUAD distribution we can impose a variety of interesting forms for this distribution via the spacing and weighting of the quantization bins. For the bin values, we compare linearly spaced bins (left) versus bins with equal probability mass under a normal distribution. Furthermore, we explore the effect of allowing the bin values to be optimized with SGD on a per-neuron or per-layer basis, to allow the model to optimize the quantization scheme with the highest fidelity. For the prior probabilities, we explore a uniform prior (left) and probability mass of the bins under a normal distribution (middle).
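The bin spacings and prior weightings compared in this caption can be sketched as follows. The exact constructions are not given in the text, so two choices below are assumptions: an "equal probability mass" bin is placed at the median of its slice of a standard normal, and the Gaussian prior mass of a bin is measured between the midpoints to its neighbours. The 15 bins over (-3.5, 3.5) match the optimal configuration reported in the appendix.

```python
from statistics import NormalDist

def linear_bins(num_bins, low, high):
    """Bin values spaced evenly between low and high (inclusive)."""
    step = (high - low) / (num_bins - 1)
    return [low + i * step for i in range(num_bins)]

def equal_mass_gaussian_bins(num_bins):
    """One value per equal-probability slice of N(0, 1), at the slice median."""
    normal = NormalDist(0.0, 1.0)
    return [normal.inv_cdf((i + 0.5) / num_bins) for i in range(num_bins)]

def gaussian_bin_probs(bin_values):
    """Prior mass per bin: N(0, 1) mass between midpoints of neighbouring bins."""
    normal = NormalDist(0.0, 1.0)
    edges = [(a + b) / 2 for a, b in zip(bin_values, bin_values[1:])]
    cdf = [0.0] + [normal.cdf(e) for e in edges] + [1.0]
    return [hi - lo for lo, hi in zip(cdf, cdf[1:])]

linear = linear_bins(15, -3.5, 3.5)
equal_mass = equal_mass_gaussian_bins(15)
uniform_prior = [1.0 / len(linear)] * len(linear)
gaussian_prior = gaussian_bin_probs(linear)
```

Linear spacing with a Gaussian prior puts most prior mass on the central bins, whereas equal-mass spacing clusters the bin values themselves around zero.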

4 Related Work

Outside the realm of DLVMs, other methods have been explored for predictive uncertainty. Lakshminarayanan et al. (2017) propose deep ensembles: straightforward averaging of predictions from a small set of separately adversarially trained DNNs. Although highly scalable, this method requires retraining a model up to 10 times, which can be prohibitively expensive for large datasets.

Gal & Ghahramani (2015b) propose the use of dropout (Srivastava et al. (2014)) at test time and present a Bayesian neural network interpretation of this method. A follow-up work by Gal et al. (2017) explores the use of Gumbel-Softmax to smoothly deform the dropout noise to allow optimization of the dropout rate during training. A downside of MC-dropout is the limited flexibility of the fixed bi-modal delta-peak distribution imposed on the weights, which requires a large number of samples for good estimates of uncertainty. van den Oord et al. (2017) propose the use of vector quantization in variational inference, quantizing a multi-dimensional embedding, rather than individual latent variables, and explore this in the context of auto-encoders.

In the space of learning non-linearities, Su et al. (2017) explore a flexible non-linearity that can assume the form of most canonical activations. More flexible distributions have been explored for distributional reinforcement learning by Dabney et al. (2017) using quantile regression, which can be seen as a special case of SQUAD where the bin values are learned but have fixed uniform probability. Categorical distributions on scalar variables have been used to model more flexible Bayesian neural network posteriors by Shayer et al. (2017). The use of a mixture of Diracs distribution to approximate a variety of distributions was proposed by Schrempf et al. (2006).

5 Results

Figure 6: Risk/coverage curve (with log-axes for discernibility) of 2-layer models on notMNIST. Lines closer to the lower-right are better. As the selective classifier lowers the confidence threshold, coverage increases at the cost of greater classification risk. Area around curves represents 90% confidence bounds computed using 10 initializations/splits.

Quantifying the quality of uncertainty estimates of models remains an open problem. Various methods have been explored in previous works, such as relative entropy (Louizos & Welling (2017); Gal & Ghahramani (2015a)), probability calibration, and proper scoring rules (Lakshminarayanan et al. (2017)). Although interesting in their own right, these metrics do not directly measure a good ranking of predictions, nor indicate applicability in high-risk domains. Proper scoring rules are the exception, but a model with good ranking ability does not necessarily exhibit good performance on proper scoring rules: any score that provides relative ordering suffices and does not have to reflect true calibrated probabilities. In fact, well-ranked confidence scores can be re-calibrated (Niculescu-Mizil & Caruana (2005)) after training to improve performance on proper scoring rules and calibration metrics.

In order to evaluate the applicability of the model in high-risk fields such as medicine, we want to quantify how models perform under a desired risk requirement. We propose to use the selection with guaranteed risk (SGR) method (see footnote 1) introduced by Geifman & El-Yaniv (2017) to measure this. In summary, the SGR method defines a selective classifier using a trained neural network and can guarantee a user-selected desired risk with high probability (e.g. 99%), by selectively rejecting data points with a predicted confidence score below an optimal threshold.

Footnote 1: We deviate slightly from Geifman & El-Yaniv (2017) in that we use Softmax Response (SR), the probability taken from the softmax output for the most likely class, as the confidence score for all methods. Geifman & El-Yaniv (2017) instead proposed to use the variance of the probabilities for MCdropout, but our experiments showed that SR paints MCdropout in a more favorable light.
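To make the risk/coverage trade-off concrete, the sketch below computes the empirical selective risk and coverage for a given confidence threshold, using Softmax Response as the confidence score. It is an illustration only: it does not implement SGR itself, which additionally chooses the threshold so that the risk bound holds with high probability, and the predictions and labels are hypothetical.

```python
def softmax_response(probabilities):
    """Confidence score: the probability of the most likely class."""
    return max(probabilities)

def risk_and_coverage(confidences, correct, threshold):
    """Empirical selective risk and coverage when rejecting below threshold."""
    accepted = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]
    coverage = len(accepted) / len(confidences)
    # Risk is the error rate among accepted points (0 if nothing is accepted).
    risk = 1.0 - sum(accepted) / len(accepted) if accepted else 0.0
    return risk, coverage

# Hypothetical class probabilities and whether each prediction was correct.
predictions = [[0.9, 0.05, 0.05], [0.4, 0.35, 0.25],
               [0.1, 0.8, 0.1], [0.5, 0.3, 0.2]]
correct = [True, False, True, True]

scores = [softmax_response(p) for p in predictions]
risk, coverage = risk_and_coverage(scores, correct, threshold=0.5)
```

Raising the threshold rejects the low-confidence misclassification here, trading a quarter of the coverage for zero risk on the accepted points.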

To limit the influence of hyper-parameters on the comparison, we use the automated optimization method TPE (Bergstra et al. (2011)) for both baselines and our models. The hyper-parameters are optimized for coverage at 2% risk on fashionMNIST, and subsequently evaluated on notMNIST. Larger models are evaluated on SVHN.

We compare (see footnote 2) our model against plain MLPs, MCDropout using Maxout activations (see footnote 3) (Goodfellow et al. (2013); Chang & Chen (2015)) and an information bottleneck model using mean-field Gaussian distributions. We evaluate the complementary deep ensembles technique (Lakshminarayanan et al. (2017)) for all methods.

Footnote 2: All models are optimized with ADAM (Kingma & Ba (2014)), weight initialization as proposed by He et al. (2015), weight decay and an adaptive learning rate decay scheme (10x reduction after 10 epochs of no validation accuracy improvement), and use early stopping after 20 epochs of no improvement.

Footnote 3: We found this baseline to perform stronger in comparison to conventional ReLU MCdropout models, under equal number of latent variables.

5.1 Main results

We start our analysis by comparing the predictive uncertainty of 2-layer models with 32 latent variables per layer. In figure 6 we visualize the risk/coverage trade-off achieved using the predicted uncertainty as a selective threshold, and present coverage results in table 1. Overall, we find that SQUAD performs significantly better than plain MLPs and deep Gaussian IB models, and we tentatively attribute this to the increased flexibility of the multinomial distribution. Compared to a Maxout MCdropout model with a similar number of weights, SQUAD appears to have a slight, though not significant, advantage, despite the strong quantization scheme, especially at low risk. Deep ensembles improve results for all methods, which fits the hypothesis that ensembles integrate over a form of weight uncertainty. When evaluated on a new dataset without retuning hyperparameters, SQUAD shows strong performance, as shown in table 2.


Fashion MNIST | cov@risk .5% | cov@risk 1% | cov@risk 2% | NLL | Acc.
Plain MLP | 29.1 | 45.9 | 60.4 | 0.408 | 87.7
Maxout MCDropout | 41.9 | 56.5 | 69.9 | 0.299 | 89.5
DLGM | 0.0 | 33.5 | 47.0 | 0.446 | 84.3
SQUAD | 42.9 | 58.3 | 69.5 | 0.293 | 89.5
Deep Ensemble | cov@risk .5% | cov@risk 1% | cov@risk 2% | NLL | Acc.
Plain MLP Ensemble | 40.6 | 58.3 | 70.2 | 0.296 | 89.3
Max. MCD. Ensemble | 48.2 | 59.1 | 72.2 | 0.271 | 90.2
DLGM Ensemble | 0.0 | 34.3 | 47.8 | 0.435 | 84.7
SQUAD Ensemble | 47.5 | 61.6 | 73.1 | 0.273 | 90.1

Table 1: SGR coverage results on Fashion MNIST. We present coverage percentage of the test dataset for three pre-determined risk-guarantees (one per column), where higher coverage is better, as well as negative log-likelihood and overall accuracy. The results indicate that SQUAD provides competitive uncertainty, especially at low-risk guarantees. Bayesian approximations via deep ensembles improve coverage across the board, and for SQUAD in particular. When SGR can’t guarantee the required risk level at high probability, 0% coverage is reported.

notMNIST | cov@risk .5% | cov@risk 1% | cov@risk 2% | NLL | Acc.
Plain MLP | 77.4 | 85.5 | 90.3 | 0.228 | 93.3
Maxout MCDropout | 85.7 | 90.6 | 94.2 | 0.165 | 95.3
SQUAD | 87.1 | 91.1 | 94.5 | 0.161 | 95.4
Plain MLP Ensemble | 85.9 | 90.6 | 93.5 | 0.175 | 94.9
Max. MCD. Ensemble | 88.5 | 92.8 | 95.7 | 0.148 | 96.0
SQUAD Ensemble | 90.7 | 93.5 | 96.1 | 0.137 | 96.2

Table 2: SQUAD exhibits strong performance on a held-out dataset without hyperparameter tuning.

MLP K=256 (SVHN) | cov@risk .5% | cov@risk 1% | cov@risk 2% | NLL | Acc.
Plain MLP | 0.0 | 0.0 | 36.3 | 0.758 | 83.1
Maxout MCDropout | 0.0 | 50.7 | 65.0 | 0.480 | 86.4
SQUAD-factorized | 18.4 | 53.9 | 66.7 | 0.454 | 86.7
SQUAD | 1.7 | 42.8 | 59.3 | 0.534 | 84.6
Max. MCDropout T=4 | 0.0 | 38.5 | 57.6 | 0.562 | 84.9
SQUAD-factorized T=4 | 10.7 | 49.9 | 64.5 | 0.480 | 86.2
SQUAD T=4 | 0.0 | 38.0 | 55.6 | 0.569 | 83.7

Table 3: Results on SVHN indicate that the quantization scheme imposed by SQUAD models might hinder performance, but that this is effectively compensated by the SQUAD-factorized variant using a larger number of bins. Even with T=4 MC samples at test time, SQUAD performs well.

5.2 Natural Images

To explore larger models trained on natural image datasets, we lightly tune hyper-parameters on 256-latent 2-layer models over 100 TPE evaluations. As SVHN contains natural images in color, we anticipate a need for a higher amount of information per variable. We thus explore the effect of the matrix-factorized variant.

As shown in table 3, SQUAD-factorized outperforms the non-factorized variant. Considering the computational cost at the optimum of a 4-neuron factorization with $C$=37 quantization bins, the model has 3.4 million weights. In comparison, the optimum for the presented MCdropout results has $C$=11, using 9.0 million weights. On an NVIDIA Titan XP, the dropout baseline takes 13s per epoch on average, while SQUAD-factorized takes just 9s.

To evaluate the sample efficiency of the methods, we compare results at $T=4$ samples. We find that SQUAD’s results suffer less from under-sampling than MCdropout. We tentatively attribute the sample efficiency to the flexible approximating posterior on the activations, which is in stark contrast to the rigid approximating distribution that MCdropout imposes on the weights. In conclusion, SQUAD comes out favorably in a resource-constrained environment.

5.3 Analysis of latent variable distributions

In order to evaluate if the proposed variational distribution does not simply collapse into single mode predictions, we want to find out what type of distributions the model predicts over the latent variables. We visualize the forms of predicted distributions in figure 7a. Although this showcases only a small subset of potential multi-modal behavior that emerges, this demonstrates that the model indeed utilizes the distribution to its full potential. To provide an intuition on how these predicted distributions emerge, we present figure 9 in the appendix.

Figure 7: By analyzing the emerging predicted distributions of individual neurons in a converged SQUAD model, we find that the flexible variational distribution is used to its full advantage. Figure (A) visualizes a subset of interesting stereotypical distributions we hope to find in the model. Figure (B) summarizes distributions predicted by the model similar to stereotypes, discovered by looking at predicted distributions with low KL. Figure (C) shows how often distributions similar to stereotypes arise, as measured by the KL distance (lower KL is closer to stereotypes).

Figure 8: By using a 1-dimensional matrix factorization for a SQUAD-factorized distribution, we can visualize the type of (stochastic) activation functions learned by the method. After training the model as usual on fashion-MNIST, we take a random neuron from the first layer. We visualize how the predicted distribution of the output changes as a function of the 1-dimensional input. The left y-axis indicates the probability per value, shown using the green line. The right y-axis indicates the value; the most likely value is shown in blue, and the gray dots represent samples from the neuron. The red line depicts the expected output of the neuron. The shape of the expected output is akin to a peaky sigmoid activation, and similar shapes are found in the other neurons of the network as well. This provides food for thought on the design of activation functions for conventional neural networks.

In figure 8 we visualize one of the activation functions that the method learns for a 1-dimensional input SQUAD-factorized model. The learned activation functions resemble “peaked” sigmoid activations, which can be interpreted as a combination of an RBF kernel and sigmoid. This provides food for thought on how non-linearities for conventional neural networks can be designed, and the effect of using such a non-linearity can be studied in further work.

6 Discussion

In this work, we have proposed a new flexible class of variational distributions. To measure the effectiveness for real world classification, we applied the class to a deep variational information bottleneck model. By placing a quantization-based distribution on the activations, we can compute uncertainty estimates over the outputs. We proposed an evaluation scheme motivated by the need in real-world domains to guarantee a minimal risk. The results presented indicate that SQUAD provides an improvement over plain neural networks and Gaussian information bottleneck models. In comparison to a MCDropout model, which approximates a Bayesian neural network, we get competitive performance. Moreover, qualitatively we find that the flexible distribution is used to its full advantage and that the method is sample efficient. The method learns interesting non-linearities, is tractable and scalable, and as the output domain is constrained, no batch normalization techniques are required.

Various directions for future work arise. The improvement of ensemble methods over individual models indicates that there remains room for improvement for capturing the full uncertainty of the output, and thus a fully Bayesian approach to SQUAD which would include weight uncertainty, shows promise. The flexible class allows us to define a wide variety of interesting priors, which provides opportunity to study interesting priors that are hard to define as a continuous density. Likewise, more effective initialization of parameters for the proposed method requires further attention. Orthogonally, the proposed class can be applied to other variational objectives as well, such as the variational auto-encoder. Finally, the discretized nature of the variables allows for the analytical computation of other divergences such as mutual information and the Jensen-Shannon divergence, the effectiveness of which remains to be studied.


We thank Bart Bakker, Maximilian Ilse, Dimitrios Mavroeidis, Jakub Tomczak, Daniel Worrall and anonymous reviewers for their insightful comments and discussions. This research was supported by Philips Research, the SURFSara Lisa cluster and the NVIDIA GPU Grant. We thank contributors to TensorFlow (Abadi et al. (2016)), Keras (Chollet et al. (2015)) and Sacred (Gref et al. (2016)).


  • Abadi et al. (2016) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mane, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viegas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-Scale machine learning on heterogeneous distributed systems. March 2016.
  • Achille & Soatto (2016) Alessandro Achille and Stefano Soatto. Information dropout: Learning optimal representations through noisy computation. November 2016.
  • Alemi et al. (2016) Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. December 2016.
  • Alemi et al. (2017) Alexander A Alemi, Ben Poole, Ian Fischer, Joshua V Dillon, Rif A Saurous, and Kevin Murphy. An Information-Theoretic analysis of deep Latent-Variable models. November 2017.
  • Bergstra et al. (2011) James S Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for Hyper-Parameter optimization. In J Shawe-Taylor, R S Zemel, P L Bartlett, F Pereira, and K Q Weinberger (eds.), Advances in Neural Information Processing Systems 24, pp. 2546–2554. Curran Associates, Inc., 2011.
  • Chang & Chen (2015) Jia-Ren Chang and Yong-Sheng Chen. Batch-normalized maxout network in network. November 2015.
  • Chollet et al. (2015) François Chollet et al. Keras., 2015.
  • Dabney et al. (2017) Will Dabney, Mark Rowland, Marc G Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. October 2017.
  • Gal & Ghahramani (2015a) Yarin Gal and Zoubin Ghahramani. Bayesian convolutional neural networks with bernoulli approximate variational inference. June 2015a.
  • Gal & Ghahramani (2015b) Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. June 2015b.
  • Gal et al. (2017) Yarin Gal, Jiri Hron, and Alex Kendall. Concrete dropout. May 2017.
  • Geifman & El-Yaniv (2017) Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. In I Guyon, U V Luxburg, S Bengio, H Wallach, R Fergus, S Vishwanathan, and R Garnett (eds.), Advances in Neural Information Processing Systems 30, pp. 4885–4894. Curran Associates, Inc., 2017.
  • Goodfellow et al. (2013) Ian J Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. February 2013.
  • Gref et al. (2016) Klaus Gref et al. Sacred., 2016.
  • Gumbel (1954) E J Gumbel. Statistical theory of extreme values and some practical, applications, national bureau of standards, washington (1954). MR0061342 (15: 811b), 1954.
  • He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034, 2015.
  • Higgins et al. (2016) Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. November 2016.
  • Holi & Hwang (1993) J L Holi and J N Hwang. Finite precision error analysis of neural network hardware implementations. IEEE Trans. Comput., 42(3):281–290, March 1993.
  • Hubara et al. (2016) Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. September 2016.
  • Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456., June 2015.
  • Jang et al. (2016) Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. November 2016.
  • Jordan et al. (1999) Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. Mach. Learn., 37(2):183–233, November 1999.
  • Kingma & Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. December 2014.
  • Kingma & Welling (2013) Diederik P Kingma and Max Welling. Auto-Encoding variational bayes. December 2013.
  • Lakshminarayanan et al. (2017) Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In I Guyon, U V Luxburg, S Bengio, H Wallach, R Fergus, S Vishwanathan, and R Garnett (eds.), Advances in Neural Information Processing Systems 30, pp. 6405–6416. Curran Associates, Inc., 2017.
  • Lin et al. (2013) Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. December 2013.
  • Louizos & Welling (2017) Christos Louizos and Max Welling. Multiplicative normalizing flows for variational bayesian neural networks. March 2017.
  • Maddison et al. (2016) Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. November 2016.
  • Niculescu-Mizil & Caruana (2005) Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, ICML ’05, pp. 625–632, New York, NY, USA, 2005. ACM.
  • Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. January 2014.
  • Schrempf et al. (2006) O C Schrempf, D Brunn, and U D Hanebeck. Dirac mixture density approximation based on minimization of the weighted cramer-von mises distance. In 2006 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems, pp. 512–517, 2006.
  • Shayer et al. (2017) Oran Shayer, Dan Levi, and Ethan Fetaya. Learning discrete weights using the local reparameterization trick. October 2017.
  • Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15:1929–1958, 2014.
  • Su et al. (2017) Qinliang Su, Xuejun Liao, and Lawrence Carin. A probabilistic framework for nonlinearities in stochastic neural networks. In I Guyon, U V Luxburg, S Bengio, H Wallach, R Fergus, S Vishwanathan, and R Garnett (eds.), Advances in Neural Information Processing Systems 30, pp. 4486–4495. Curran Associates, Inc., 2017.
  • Tishby et al. (2000) Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. April 2000.
  • van den Oord et al. (2017) Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. November 2017.
  • Yang et al. (2017) Dong Yang, Daguang Xu, S Kevin Zhou, Bogdan Georgescu, Mingqing Chen, Sasa Grbic, Dimitris Metaxas, and Dorin Comaniciu. Automatic liver segmentation using an adversarial Image-to-Image network. July 2017.

7 Appendix

Figure 9: This figure serves to provide intuition on how a variety of distributions come about in our model. We show the set of weights used to predict the probability for the bins of a randomly selected latent variable from the first layer in a converged 2-layer SQUAD model (reshaped to 28x28 squares for comparison with the data). We then present 5 data-points for which the neuron predicts a stereotypical distribution, as visualized in the last bar-plot.

7.1 Effect of hyper-parameters on coverage:

The optimal configuration of hyper-parameters and bin priors have been determined using 700 evaluations selected using TPE. The space of parameters explored is as follows, presented in the hyperopt API for transparency:

# Shared
C: quniform(2, 10, 1) * 2 + 1,
dropout rate: uniform(0.01, .95),
lr: loguniform(log(0.0001), log(0.01)),
batch_size: qloguniform(log(32), log(512), 1),
# SQUAD & Gaussian
kl_multiplier: loguniform(log(1e-6), log(0.01)),
init_scale: loguniform(log(1e-3), log(20)),
use_bin_probs: choice(['uni', 'gaus']),
use_bins: choice(['equal_prob_gaus',
learn_bin_values: choice([
   'per_neuron', 'per_layer', 'fixed']),

In figure 10 we visualize the pairwise effect of these hyper-parameters on the coverage. The optimal configuration found for the main SQUAD model is: batch size: 244, KL multiplier: 0.0027, learn bin values: per layer, bin prior probabilities: uniform, bin values: linearly spread over (-3.5, 3.5), lr: 0.0008, C: 15, initialization scale: 3.214.
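As a small worked example of the search space above, the expression for C can be unrolled by hand. This sketch assumes hyperopt's documented behaviour for quniform(low, high, q), namely a uniform draw rounded to a multiple of q, so that quniform(2, 10, 1) takes the integer values 2 through 10; the variable names are illustrative only.

```python
# Values that quniform(2, 10, 1) can take (assumed: integers 2..10).
quniform_values = range(2, 11)

# The search space maps each value q to C = q * 2 + 1 quantization bins.
bin_counts = [q * 2 + 1 for q in quniform_values]

# Every candidate is odd, so linearly spaced bins over a symmetric
# interval such as (-3.5, 3.5) always include a bin at exactly zero.
all_odd = all(c % 2 == 1 for c in bin_counts)
```

The searched bin counts therefore run over the odd numbers from 5 to 21, which contains the reported optimum of 15.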

Figure 10: This figure visualizes the pairwise relationship between hyper-parameters of SQUAD and the effect on coverage. The top-60 configurations are highlighted. Green values are good, red values are bad. We have filtered on the optimal settings for bin values and prior to reduce clutter.