1 Introduction
Stochastic neural networks (SNN) are a central tool in many subfields of machine learning, including (1) Bayesian deep learning
(MacKay, 1992; Blundell et al., 2015; HernándezLobato and Adams, 2015; Gal and Ghahramani, 2016), (2) exploration in reinforcement learning
(Ian et al., 2013; Osband et al., 2016; Riquelme et al., 2018), and (3) statistical learning theory such as PACBayesian learning
(McAllester, 1999; Langford and Seeger, 2001; Dziugaite and Roy, 2017). Perturbations of the network parameters induce a distribution over the model, and this intrinsic uncertainty is the subject of great interest to machine learning practitioners and theoreticians alike. For example, deep Bayesian models are often used to adequately measure uncertainty, and determine whether the model itself is inherently familiar with the unseen data. This is especially important in the context of autonomous vehicles, where decisions must be made to meet specific safety standards (McAllister et al., 2017). Conversely, the lack of confidence can be leveraged to efficiently guide exploration in reinforcement learning, via randomizing the approximate value function (Azizzadenesheli et al., 2018; Touati et al., 2018) or maximizing intrinsic rewards (Houthooft et al., 2016).Furthermore, a considerable proportion of statistical learning theory is devoted to understanding what implies generalization, or what constitutes an appropriate measure of complexity (Bartlett et al., 2017; Arora et al., 2018; Neyshabur et al., 2017). PACBayesian learning theory (McAllester, 1999) specifically explores the generalization property of a randomized prediction rule, and has been recently studied in the context of stochastic neural networks (Dziugaite and Roy, 2017)
. In this particular study, the working hypothesis is that good generalization can be guaranteed on the premise that stochastic gradient descent
(Robbins and Monro, 1951) finds a solution that obtains certain structural property (such as flatness).From a more practical perspective, considerable effort has been devoted to modelling uncertainty through the injection of independent noise to the network parameters (Graves, 2011; Blundell et al., 2015; Kingma et al., 2015). Noise independence largely restricts the expressivity of the noise distribution and thus the resulting uncertainty measures are illcalibrated (Minka et al., 2005; Turner and Sahani, 2011). Attempts have been made to correlate parameters of a neural network, including Louizos and Welling (2017); Krueger et al. (2017); Pawlowski et al. (2017), for example, by adapting expressive nonlinear invertible transformations developed in the variational inference literature (Rezende and Mohamed, 2015; Kingma et al., 2016; Huang et al., 2018), or via implicit methods (Goodfellow et al., 2014). However, these methods are limited due to their inability to scale well. Louizos and Welling (2017), for instance, resort to a specific multiplicative noise sampled from a lower dimensional space and have to use an auxiliary method to bound the entropy. Krueger et al. (2017), on the other hand, give up on injecting noise on the entire set of parameters and model the distribution of the scale and shift parameter of the preactivations.
In attempts to address some of the challenges articulated above and efficiently model the joint distribution of a network’s parameters, we propose the Kronecker Flow, an invertible transformationbased method inspired by the Kronecker product. Specifically, we notice that the Kronecker product can be thought of as lefttransforming a matrix via a linear map, and then righttransforming it using another linear map. Our contributions are as follows.

We extend this idea to more general invertible mappings to induce nonlinear dependencies, and apply this trick to parameterizing deep stochastic neural networks.

We apply our method to predictive tasks and show that our methods work better on larger architectures compared to existing methods.

We are the first to apply flowbased methods to tighten the PACbased bound.

Our methods prove to be competitive over other methods in approximate Thompson sampling in contextual bandit problems.
2 Background
Stochastic neural networks with parameter perturbation normally follow the stochastic process: , , where is the parameter of the neural network
, which outputs the prediction probability vector for classification or the predicted values for regression. We let
be the training set of size ^{1}^{1}1We use the notation to compactly describe the set of integers ., be the differential entropy , be the coefficient controlling the amount of noise injected into the model and the degree of regularization,be the loss function and
be the empirical risk.2.1 Variational inference
Bayesian inference updates our prior belief over the model parameters according to the Bayes rule as more information in the form of the likelihood of the training set is available. Inference can be cast as an optimization problem, where one seeks to maximize the variational lower bound (also known as the evidence lower bound, or the ELBO) on the log marginal likelihood:
(1) 
where is the variational approximate posterior and can be decomposed into due to conditional independence assumption. The optimal is the true posterior, i.e. . In our case, we use to parameterize a neural network. Prediction can be carried out via the predictive posterior
for drawn i.i.d. from , where we use the variational distribution to approximate and a Monte Carlo estimate to approximate the integral.
The prior distribution can be used to encode some form of inductive bias, such as one that is in favor of parameter values closer to some chosen a priori. We choose the prior to be an isotropic Gaussian, centered at the random initialization , i.e., . The entropy term ensures the variational posterior does not collapse to a point estimate. Both of them can be thought of as some form of regularizer, so we attach a coefficient
in front of them as a hyperparameter
^{2}^{2}2Like the parameter in Zhang et al. (2017).2.2 PACBayes generalization bound
Another use case of stochastic neural networks is to understand generalization, via minimizing the PACBayes bound, which usually takes the following form: Assume the data is distributed according to some data distribution . Then for any and a fixed prior over models , with probability at least over the choice of , for all , , where is some sort of “distance” function that is usually convex (Germain et al., 2009), is the empirical risk (with a bounded loss such as the zeroone loss), is the test loss, and is a measure of complexity that scales proportionally with the Kullback–Leibler (KL) divergence.
For instance, Dziugaite and Roy (2017) minimize the following bound originally due to McAllester (1999) and then tightened by Langford and Seeger (2001):
Theorem 1.
Let be the zeroone loss. For any and data distribution , and any distribution on the space of , with probability at least over the choice of a training set , for all distributions on the space of ,
(2) 
where the KL on the LHS is a function of the losses, similar to that of the Bernoulli KL.
We refer to the above bound as the McAllester bound. The KL divergence on the RHS of the bound, also known as the information gain, tells us to what extent the posterior is dependent on the training data. The sharper and more confident is, and the farther away it is from the prior
, the larger the KL will be, which in turn is reflected by the larger bound on the generalization gap. This is consistent with traditional notion of biasvariance tradeoff.
Alternatively, we consider the following bound due to Catoni (2007):
Theorem 2.
With the same setup as Theorem 1, and with a fixed , the following bound holds with probability over :
(3) 
We refer to this bound as the Catoni bound. We notice the linear relationship (which is also noticed by Germain et al. (2016)) between the empirical risk and the KL divergence. This allows us to make use of the linearity of expectation to perform change of variable (see the next section). We also note that the optimal in Equation 3 is always larger than , so the PACBayes bound is actually more conservative than Bayesian inference in this sense.
2.3 Normalizing flows
Minimization of Equation 1 and 3 requires (i) computing the gradient with respect to the parameter of the (PAC)Bayesian posterior , and (ii) computing the entropy of . One approach to do this is via change of variable under an invertible mapping. Let
be a random variable in
, and be a bijection parameterized by . Let and be its density. Then we can rewrite the loss function as^{3}^{3}3Since the weighting coefficient can be absorbed into the loss function , we neglect it for simplicity now.where we apply the change of variable (see Appendix A
for the detailed derivation). The logdeterminant (logdet) term ensures that we obtain a valid probability density function after
is applied, which can be a sequence of invertible mappings itself, hence referred to as the normalizing flow (Rezende and Mohamed, 2015). This way, the random variable and the parameters are decoupled, so that we can differentiate the integrand to have an unbiased estimate of the gradient (fixing some
). We let be the standard normal.3 Kronecker Flows
We consider maximizing the ELBO and minimizing the Catoni bound via normalizing flowbased SNNs. Conventionally, meanfield approximation using factorized distributions (such as multivariate Gaussian with diagonal covariance) has been well explored in the variational inference (VI) literature (Blundell et al., 2015). We are interested in better capturing the structure in the parameter space as restricted VI methods are known to exhibit overconfidence (Minka et al., 2005; Turner and Sahani, 2011). However, the parameters of a neural network are usually very high dimensional (on the order of millions), requiring a novel way to parameterize the joint distribution over the parameters.
In its general form, neural networks can be represented by a collection of tensors i.e.
. While our method below can easily be generalized to highdimensional tensors (such as for convolutional kernels), to simplify notation, we describe the matrix form.3.1 Linear Kronecker Flow
The matrixvariate normal (
) distribution generalizes the multivariate normal distribution to matrixvalued random variables, such as weight matrices of a neural network
(Louizos and Welling, 2016). Matrix normal is a multivariate normal distribution whose covariance matrix is a Kronecker product (), which allows us to model the correlation among the parameters to some degree.More concretely, assume is an random Gaussian matrix, and , and are realvalued matrices. Then has a matrix normal distribution, as
where vec is the vectorization of a matrix that concatenates all the columns. This allows us to represent the covariance matrix in a more compact manner ( parameters versus parameters for Kronecker product).
Limitation of the Kronecker product.
The Kronecker product covariance matrix is not a strict generalization of diagonal covariance matrix. To observe this, let , (this is the case of Louizos and Welling (2016)), and , where , , and . Then is also a diagonal matrix of size . Equating to solve for and will result in nonlinear equations with variables, which can be overdetermined for . For example, let , and for some . Then the nonlinear system below does not have a solution:
To see this, dividing by and dividing by yield and , respectively, which doesn’t have a solution if . This is because the Kronecker product is essentially parameter sharing, which can heavily restrict the matrix it can represent.
To remedy the above limitation, we can further decouple the reparameterization of the parameter matrix into two parts: (1) one that models the marginal variance and (2) one that models correlations. Assume is a positivevalued matrix, and let . Then
is a Gaussian distribution with the following property, which is useful in calculating the KL divergence:
Property 1.
Let be given as above, with and . Then

[label=(P0)]

, and


See Appendix B for the derivation and interpretation of the property. Naive implemetations of this can be inefficient and numerically unstable, as the entropy term involves computing the logdeterminant of and
, requiring the standard automatic differentiation libraries to resort to singular value decomposition when the matrix is nearsingular. Thus, we choose to parameterize
and as lower triangular matrices^{4}^{4}4This can be achieved via masking. with ones on the diagonal, leaving the uncertainty to be modeled by . This means .. Xaxis indicates the shape of the random matrix/tensor, sorted according to the dimensionality. The shaded area is the error bar with
standard deviation away from the mean performance, averaged across 25 trials.
Simulation.
To validate the limited expressiveness of kronecker product, we randomly initialize to be the density of a multivariate Gaussian with mean zero, and covariance being the square of a random standard Gaussian matrix. We choose such that it can be decomposed into a product of integers, and parameterize using independent Gaussian (dubbed Diag), the Kronecker product with diagonal and (KDiag), and the Kronecker product with elementwise scaling (KLinear). We minimize ; see Figure 1 for the results. We also conduct the same experiment with 3D tensors (instead of matrices). We see that KDiag consistently underperforms when compared to Diag, which indicates parameter sharing does restrict the family of distributions it can represent, and KLinear is consistently better.
3.2 Nonlinear Kronecker Flow
In this section, we generalize the Kronecker product to more general nonlinear mappings. In Appendix C, we make a connection to nondecreasing triangle maps (Villani, 2008)
that are general enough to model any probability distributions.
First, notice that leftmultiplying by amounts to introducing linear correlation among the rows of , applied to each of the columns. Likewise, rightmultiplying by amounts to correlating column entries of each row of . Inspired by this, we consider applying an invertible mapping to each row of the random weight matrix, and another invertible mapping to each column of the matrix. We call this the Kronecker Flow ^{5}^{5}5To differentiate this from KLinear from the previous section, we refer to using nonlinear as KNonlinear..
Specifically, let and be invertible mappings. We define the matrixmatrix function as , with the following batchoperations (for and ):
It is easy to verify that is invertible. Due to the partial dependency of and , the Jacobians of the vectorized forms (after proper permutation) are blockdiagonal, so we have
In practice, we use the volume preserving version of RealNVP (Dinh et al., 2016) and inverse autoregressive flow (IAF) (Kingma et al., 2016) to parameterize and for our experiments. The KLinear from the previous section can be thought of as using a linear map as and .
4 Concentration of empirical KL with normalizing flows
In their study, Dziugaite and Roy (2017) use independent Gaussian for to minimize the McAllester bound, so they can compute the KL between Gaussians analytically. This is no longer feasible when we use more flexible families for , such as normalizing flows. Moreover, a Monte Carlo estimate might result in underestimating the bound after inverting the KL between Bernoullis on the LHS of Equation 2 (which is a concave function; see Appendix A of Reeb et al. (2018) for an illustration). This necessitates a high probability bound on the concentration of the empirical estimate.
In Section 2.3, we have established can be written in the following form
where both and are standard Gaussian (the mean and variance can be absorbed into the invertible mapping if this is not the case).
The first term in the KL can be computed analytically. The second term usually can be almost surely bounded (e.g. using Block neural autoregressive flows) so that we can use Hoeffdingtype concentration or it can simply be made zero using e.g. volume preserving flows. The challenge now lies in the third term, which has a quadratic form , neglecting the normalizing constant.
Now assume is a Lipschitz ^{6}^{6}6
The following flows are all Lipschitz (with proper activation functions): volume preserving version of
Dinh et al. (2016); Kingma et al. (2016), Berg et al. (2018), Behrmann et al. (2018), De Cao et al. (2019), etc.. Let . Then is Lipschitz:This is key in deriving a tail bound on , as Lipschitz functions of canonical Gaussian random variables are subGaussians, meaning they have a tail that decays faster than a Gaussian random variable. The following theorem provides a concentration bound for the empirical average of similar to that of a Chisquare random variable, as (square of a subGaussian) is subexponential.
Theorem 3.
Let be defined as above with a Lipschitz constant . Let . Then the following concentration bound holds
where .
Note that in practice the empirical KL that we use is inversely scaled by the size of the training set (see Equation 2), so the Lipschitz constant can be made small in practice to dominate the dimensionality.
5 Experiments
We evaluate our proposed method in the context of two prediction tasks (Section 5.1), PACBayes bound minimization (Section 5.2) and contextual bandit (Section 5.3). For the two prediction tasks, we use the MNIST handwritten digit dataset (Lecun et al., 1998) and CIFAR10 (Krizhevsky, 2009). See Appendix E for a detailed description.
Dataset  L2  FFG  MNFG  Diag  KDiag  KLinear  KNonlinear 

MNIST  0.6  0.9  0.7  0.92  0.67  0.70  0.60 
CIFAR5  24  22  16  19.0    16.8  17.4 
Setup  SGD  KFAC  BBB  NoisyKFAC  Diag  KDiag  KLinear  KNonlinear 

R  18.21  17.61  17.18  14.48  17.71  16.71  14.65  14.74 
D  11.65  11.11  11.69  10.65  10.69  13.65  11.35  9.88 
5.1 Classification
In this section, we evaluate the generalization performance of our method applied to Bayesian neural networks. We consider two architectures: LeNet5 (Lecun et al., 1998) and a modified version VGG16 (Simonyan and Zisserman, 2014) proposed by Zhang et al. (2017).
We first compare to the multiplicative normalizing flow (MNFG) proposed by Louizos and Welling (2017), applying our method to LeNet5 (see Table 1). Our Diag matches the performance of their FFG (fully factorized Gaussian). KDiag outperforms Diag in this case, perhaps due to the smaller number of parameters which makes it easier to optimize. KNonlinear yields the best generalization error in this case. On the CIFAR5 experiment (we take the first 5 classes of CIFAR10), our methods are on par with MNFG.
Second, we compare with the noisy KFAC proposed by Zhang et al. (2017), applying our methods to the larger architecture VGG16 (see Table 2). Noisy KFAC applies an approximate natural gradient method. Despite this advantage, our methods (KLinear and KNonlinear) have simiar prediction accuracy in the regular setup. We also include the results of data augmentation with horizontal flip and random crop where KNonlinear outperforms all the other methods.
5.2 PAC Bayes bound minimization
Bound  Pinsker Bound  Catoni Bound  Catoni Bound  
Flow  Diag  KLinear  Diag  KLinear  KNonlinear  D  KD  KL  KN  
1  2  1  2  1  2  1  2  1  2  LeNet5  
6.62  6.00  6.09  5.90  8.04  7.66  8.10  8.33  5.96  5.90  2.12  2.95  2.00  2.01  
6.66  6.12  5.98  5.96  7.78  7.70  7.98  8.26  5.83  5.76  2.31  2.87  1.91  2.14  
bound  23.77  25.94  21.69  25.33  24.11  26.41  22.88  26.43  20.41  22.53  10.83  12.96  10.09  10.03 
KL  5968  7829  5292  7554  5001  6555  4334  5996  4725  5921  3177  3477  2913  2873 
For the PACBayes bound estimation, we minimize Equation 3. We follow the recipe of Dziugaite and Roy (2017). We upper bound the zeroone loss by crossentropy divided by (where is the number of classes) to make the upper bound tight. We set the prior to be , where is the initial value of the parameters, and apply a union bound to tune the prior variance . We also tune the coefficient as a parameter during training ^{7}^{7}7We are allowed to do so since we treat Equation 3 as an optimization objective, rather than report it as a bound. We report the McAllester bound, which holds for any , even if it depends on . , and report the McAllester bound for comparison (since it is the tightest). For more details, see Dziugaite and Roy (2017) for reference.
We test with a multilayer perceptron with 1 or 2 hidden layers with 600 neurons and LeNet5, evaluated on the MNIST dataset (see Table
3). For further clarification, we follow the steps of Dziugaite and Roy (2017) by minimizing the McAllester bound, using Pinsker’s inequality to bound the inverse of the Bernoulli KL (which we call the Pinsker bound). Since this bound has a square root in the complexity term, we can only use the Gaussian family with an analytic form of the KL. The result we have is slightly looser than Dziugaite and Roy (2017) since we have a 10class problem and they deal with a binary version of MNIST. We see that the bound can indeed be improved by capturing the correlation among the parameters. We then compare to minimizing the Catoni bound, which is slightly looser since the linear relationship between the empirical risk and the KL term penalizes the latter more when the KL is larger. However, by modelling the nonlinear dependencies, KNonlinear clearly outperforms the other methods (even compared to the ones minimizing the Pinsker bound). This indicates there exists a considerable amount of structure in the parameter space that may explain the gap between the test error and the generalization bound.We also notice that, despite the linear relationship, the Catoni bound focuses more on the complexity term than the ELBO. For example, the empirical risks of LeNet5 in Table 3 are much larger compared to the test loss of Table 1. The reasons are two: (1) the optimal in Equation 3 is larger than 1 (depending on the relative value of the KL), and (2) to properly upper bound the zeroone loss, we scale down the crossentropy loss by during optimization. These two factors make it a more conservative training algorithm than Bayesian inference. This explains the hyperparameter tuning that is usually done in practice; e.g. when we decrease the coefficient, we allow the model to focus more on minimizing the empirical risk, at the cost of a weaker generalization guarantee.
5.3 Contextual bandit
Bandit  SGD  fBNN  Diag  KDiag  KLinear  KNonlinear 

Mushroom  4.06 1.23  3.91 1.55  2.16 0.51  2.41 1.28  1.85 0.27  3.47 0.83 
Statlog  1.29 0.35  0.73 0.03  1.01 0.06  0.84 0.22  0.81 0.01  0.79 0.08 
Covertype  30.01 0.37  32.03 0.70  28.42 0.52  29.19 0.28  28.13 0.22  28.06 0.26 
Financial  6.08 0.83  7.27 1.90  7.43 0.99  5.88 0.44  5.88 0.61  5.78 0.49 
Jester  56.24 3.36  59.70 4.31  59.34 3.92  57.17 3.15  57.66 3.67  57.96 4.48 
Adult  79.31 0.83  84.45 1.43  76.32 0.16  77.28 0.03  75.94 0.21  77.30 0.46 
. Values reported are the mean over 3 independent trials with standard error of the mean, normalized with respect to the performance of the uniform policy.
Uncertainty modeling lies at the heart of the explorationexploitation dilemma in sequential decisionmaking. In order to maximize its collected cumulative rewards, an agent should trade off exploring different actions and gaining more knowledge about the reward estimate vs. exploiting the current estimate and allocating resources to the actions that are likely rewarding. Thompson sampling (TS) (Thompson, 1933) is one the popular approaches that deals with the latter tradeoff by maintaining posterior distribution over reward models and randomizing actions on the basis of their probability of being optimal.
In this section, we investigate the effectiveness of our proposed method for performing an approximate Thompson sampling in the particular setting of contextual bandits. In the latter setting, at each time , the agent sees a dimensional context , selects one of the available actions, , and earns a reward generated by the environment. The agent aims to minimize its cumulative regret defined as where is the highest expected reward given the context and the expectation is over the randomness of both environment and the agent’s choice of actions.
We compare different methods on a range of realworld bandit problems introduced by Riquelme et al. (2018). We train the models every 50 time steps for 200 iterations using a batchsize of 512. We ran each experiment with 3 different random seeds and we report the means and standard deviations of cumulative regret normalized with the respect to the uniform baseline in the table 4. We include the functional variational Bayesian neural networks (fBNN), recently introduced by Sun et al. (2019) as a baseline, and we use their open sourced implementation of fBNN in the bandit setting. From table 4, we see that across the 6 bandit problems, our proposed method (KLinear and KNonlinear) provides competitive and consistent results. They outperform other baselines in 4 problems out of 6.
6 Conclusion
In this work, we present the Kronecker Flow, a flowbased method to induce complex distribution inspired by the Kronecker product. Our methods scale to larger architectures such as VGG16 since it takes advantage of the shape of the parameters. We demonstrate our methods work better than vanilla Kronecker product with diagonal matrices on multiple setups, including classification and approximate Thompson sampling in contexual bandit, and outperform existing methods in the Bayesian neural network literature. We are also the first to apply flowbased methods to obtain a tighter numerical generalization bound.
References
 Arora et al. (2018) Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. arXiv preprint arXiv:1802.05296, 2018.
 Azizzadenesheli et al. (2018) Kamyar Azizzadenesheli, Emma Brunskill, and Animashree Anandkumar. Efficient exploration through bayesian deep qnetworks. CoRR, abs/1802.04412, 2018. URL http://arxiv.org/abs/1802.04412.
 Bartlett et al. (2017) Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrallynormalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6240–6249, 2017.
 Behrmann et al. (2018) Jens Behrmann, David Duvenaud, and JörnHenrik Jacobsen. Invertible residual networks. arXiv preprint arXiv:1811.00995, 2018.
 Berg et al. (2018) Rianne van den Berg, Leonard Hasenclever, Jakub M Tomczak, and Max Welling. Sylvester normalizing flows for variational inference. arXiv preprint arXiv:1803.05649, 2018.
 Blundell et al. (2015) Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. In Proceedings of The 32nd International Conference on Machine Learning, pages 1613–1622, 2015.
 Bogachev et al. (2005) V I Bogachev, A V Kolesnikov, and K V Medvedev. Triangular transformations of measures. Sbornik: Mathematics, 196(3):309–335, apr 2005. doi: 10.1070/sm2005v196n03abeh000882. URL https://doi.org/10.1070%2Fsm2005v196n03abeh000882.
 Boucheron et al. (2013) Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013.
 Catoni (2007) Olivier Catoni. Pacbayesian supervised classification: The thermodynamics of statistical learning. Lecture NotesMonograph Series, 56:i–163, 2007. ISSN 07492170. URL http://www.jstor.org/stable/20461499.
 De Cao et al. (2019) Nicola De Cao, Ivan Titov, and Wilker Aziz. Block neural autoregressive flow. arXiv preprint arXiv:1904.04676, 2019.
 Dinh et al. (2016) Laurent Dinh, Jascha SohlDickstein, and Samy Bengio. Density estimation using real nvp. arXiv preprint arXiv:1605.08803, 2016.

Dziugaite and Roy (2017)
Gintare Karolina Dziugaite and Daniel M. Roy.
Computing nonvacuous generalization bounds for deep (stochastic)
neural networks with many more parameters than training data.
In
Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence
, 2017.  Gal and Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059, 2016.

Germain et al. (2009)
Pascal Germain, Alexandre Lacasse, François Laviolette, and Mario
Marchand.
Pacbayesian learning of linear classifiers.
In Proceedings of the 26th Annual International Conference on Machine Learning, pages 353–360. ACM, 2009.  Germain et al. (2016) Pascal Germain, Francis Bach, Alexandre Lacoste, and Simon LacosteJulien. Pacbayesian theory meets bayesian inference. In Advances in Neural Information Processing Systems, pages 1884–1892, 2016.
 Goodfellow et al. (2014) Ian Goodfellow, Jean PougetAbadie, Mehdi Mirza, Bing Xu, David WardeFarley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, 2014.
 Graves (2011) Alex Graves. Practical variational inference for neural networks. In J. ShaweTaylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2348–2356. Curran Associates, Inc., 2011. URL http://papers.nips.cc/paper/4329practicalvariationalinferenceforneuralnetworks.pdf.

HernándezLobato and Adams (2015)
José Miguel HernándezLobato and Ryan Adams.
Probabilistic backpropagation for scalable learning of bayesian neural networks.
In International Conference on Machine Learning, pages 1861–1869, 2015.  Houthooft et al. (2016) Rein Houthooft, Xi Chen, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. Vime: Variational information maximizing exploration. In Advances in Neural Information Processing Systems 29. 2016. URL http://papers.nips.cc/paper/6591vimevariationalinformationmaximizingexploration.pdf.
 Huang et al. (2018) ChinWei Huang, David Krueger, Alexandre Lacoste, and Aaron Courville. Neural autoregressive flows. In International Conference on Machine Learning, 2018.

Hyvärinen and Pajunen (1999)
Aapo Hyvärinen and Petteri Pajunen.
Nonlinear independent component analysis: Existence and uniqueness results.
Neural Networks, 12(3), 1999.  Ian et al. (2013) Osband Ian, Van Roy Benjamin, and Russo Daniel. (more) efficient reinforcement learning via posterior sampling. In Proceedings of the 26th International Conference on Neural Information Processing Systems, USA, 2013. Curran Associates Inc.
 Jaini et al. (2019) Priyank Jaini, Kira A Selby, and Yaoliang Yu. Sumofsquares polynomial flow. In International Conference on Machine Learning, 2019.
 Kingma et al. (2016) Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, 2016.
 Kingma et al. (2015) Durk P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2575–2583. Curran Associates, Inc., 2015. URL http://papers.nips.cc/paper/5666variationaldropoutandthelocalreparameterizationtrick.pdf.
 Krizhevsky (2009) Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
 Krueger et al. (2017) David Krueger, ChinWei Huang, Riashat Islam, Ryan Turner, Alexandre Lacoste, and Aaron Courville. Bayesian hypernetworks. arXiv preprint arXiv:1710.04759, 2017.
 Langford and Seeger (2001) John Langford and Matthias Seeger. Bounds for averaging classifiers. 2001.
 Lecun et al. (1998) Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11), 1998. ISSN 00189219. doi: 10.1109/5.726791.
 Louizos and Welling (2016) Christos Louizos and Max Welling. Structured and efficient variational deep learning with matrix gaussian posteriors. In International Conference on Machine Learning, pages 1708–1716, 2016.
 Louizos and Welling (2017) Christos Louizos and Max Welling. Multiplicative normalizing flows for variational bayesian neural networks. In International Conference on Machine Learning, 2017.
 MacKay (1992) David JC MacKay. A practical bayesian framework for backpropagation networks. Neural computation, 4(3):448–472, 1992.
 McAllester (1999) David A McAllester. Pacbayesian model averaging. In COLT, volume 99, pages 164–170. Citeseer, 1999.
 McAllister et al. (2017) Rowan McAllister, Yarin Gal, Alex Kendall, Mark Van Der Wilk, Amar Shah, Roberto Cipolla, and Adrian Weller. Concrete problems for autonomous vehicle safety: Advantages of bayesian deep learning. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI’17, pages 4745–4753. AAAI Press, 2017. ISBN 9780999241103. URL http://dl.acm.org/citation.cfm?id=3171837.3171951.
 Minka et al. (2005) Tom Minka et al. Divergence measures and message passing. Technical report, Technical report, Microsoft Research, 2005.
 Müller et al. (2018) Thomas Müller, Brian McWilliams, Fabrice Rousselle, Markus Gross, and Jan Novák. Neural importance sampling. arXiv preprint arXiv:1808.03856, 2018.
 Neyshabur et al. (2017) Behnam Neyshabur, Srinadh Bhojanapalli, David Mcallester, and Nati Srebro. Exploring generalization in deep learning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5947–5956. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7176exploringgeneralizationindeeplearning.pdf.
 Osband et al. (2016) Ian Osband, Benjamin Van Roy, and Zheng Wen. Generalization and exploration via randomized value functions. In Proceedings of the 33rd International Conference on International Conference on Machine Learning  Volume 48, ICML’16, pages 2377–2386. JMLR.org, 2016. URL http://dl.acm.org/citation.cfm?id=3045390.3045641.
 Pawlowski et al. (2017) Nick Pawlowski, Andrew Brock, Matthew C. H. Lee, Martin Rajchl, and Ben Glocker. Implicit Weight Uncertainty in Neural Networks. arXiv eprints, 2017.
 Reeb et al. (2018) David Reeb, Andreas Doerr, Sebastian Gerwinn, and Barbara Rakitsch. Learning gaussian processes by minimizing pacbayesian generalization bounds. In Advances in Neural Information Processing Systems, pages 3337–3347, 2018.
 Rezende and Mohamed (2015) Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, 2015.
 Riquelme et al. (2018) Carlos Riquelme, George Tucker, and Jasper Roland Snoek. Deep bayesian bandits showdown. 2018. URL https://openreview.net/pdf?id=SyYe6kCW.
 Robbins and Monro (1951) Herbert Robbins and Sutton Monro. A stochastic approximation method. Ann. Math. Statist., 22(3):400–407, 09 1951. doi: 10.1214/aoms/1177729586. URL https://doi.org/10.1214/aoms/1177729586.
 Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for largescale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 Sun et al. (2019) Shengyang Sun, Guodong Zhang, Jiaxin Shi, and Roger Grosse. Functional variational bayesian neural networks. arXiv preprint arXiv:1903.05779, 2019.
 Thompson (1933) William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 1933.
 Touati et al. (2018) Ahmed Touati, Harsh Satija, Joshua Romoff, Joelle Pineau, and Pascal Vincent. Randomized value functions via multiplicative normalizing flows. CoRR, abs/1806.02315, 2018. URL http://arxiv.org/abs/1806.02315.
 Turner and Sahani (2011) Richard E Turner and Maneesh Sahani. Two problems with variational expectation maximisation for timeseries models. Bayesian Time series models, 1(3.1):3–1, 2011.
 Van den Oord et al. (2016) Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with pixelcnn decoders. In Advances in neural information processing systems, pages 4790–4798, 2016.
 Villani (2008) Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
 Zhang et al. (2017) Guodong Zhang, Shengyang Sun, David Duvenaud, and Roger Grosse. Noisy natural gradient as variational inference. arXiv preprint arXiv:1712.02390, 2017.
Appendix A Law of the unconscious statistician
Let be our probability space. Let be a random variable following the (Lebesgue) density and being its pushforward measure, and write with being its density and being its pushforward measure, and . Then
Appendix B Derivation and interpretation of Property 1
We first derive Property 1 algebraically, and give an interpretation that can be genralized to higher dimensional tensor operation. Recall that we have the following givens:

Assume is a random Gaussian matrix.

Assume and .

.

.
If we rescale elementwise by before inducing the columnwise and rowwise correlation, we have: (superscript is Hadamard power)
If is a matrix of ones, the RHS equals , which is the covariance of the matrix normal.
Generally, might not be a matrix of ones. But we can still compute the determinant and trace of the covariance matrix (useful in computing KL):
Interpretation of the determinant and trace.
The determinant measures the change in volume due to the linear map. Since each operation (elementwise multiplication with , leftmultiplication with , and rightmultiplication with ) is an invertible map, the determinant of the composition is a product of determinants. After elementwise multiplication with (hence )^{8}^{8}8The power 2 comes from the fact that we are looking at the determinant of the covariance. Direct computation of the likelihood involves , which is equivalent to the logdeterminant of the invertible map., we apply the same linear map to the columns of ; in vector form, this corresponds to leftmultiplication with a block diagonal of ’s, hence . The same reasoning explains .
The trace of the covariance can be written as , i.e. the sum of marginal variances. Each of the is a linear combination of the entries of , which have unit variance and are uncorrelated, so by the additive property of variance of sum of uncorrelated random variables and the quadratic scaling property of variance, .
Appendix C Connection to triangular maps.
Much of the recent work on normalizing flows has been dedicated to inverse autoregressive transformations (Kingma et al., 2016; Huang et al., 2018; Müller et al., 2018; De Cao et al., 2019; Jaini et al., 2019), as they are general enough to induce any density function (Hyvärinen and Pajunen, 1999; Bogachev et al., 2005; Villani, 2008). When such transformations are used for and , the overall transformation is also a triangle map, since depends on for and . Such a function has some “blind spots” similar to the ones discovered by Van den Oord et al. (2016). One avenue for improvement is to design a transformation that increase the connectivity. Another avenue for improvement is to condition each (rowwise or columnwise) transformation on a learnable embedding of the row/column, such that each row/column is transformed by a slightly different function than another ^{9}^{9}9We try this idea in the preliminary stage of the project, but find it harder to optimize. This is potentially due to the extra parameters that have to be learned..
Appendix D Tail bound of empirical KL
We begin with some preliminaries and lemmas in Section D.1, and prove the main result in Section D.2.
d.1 Basic tail bounds and Bernstein inequality
The tools developed in this section is to translate the coefficients (such as variance) of subGaussian random variables and subexponential random variables. We start with the definition of subGaussians:
Definition 1.
We write if is a random variable satisfying
We write as the Gamma function: . Note that for positive integers ,
. The following lemma gives an upperbound on the moments of a subGaussian.
Lemma 4.
For , for any integer , .
Proof.
The following definition is the main tool for translating the coefficients.
Definition 2.
Let be a random variable. For integer , define the Orlicz norm as
i.e, the smallest constant for which the superexponential moment of is bounded by 2. The Orlicz norm is infinity if there’s no finite for which exists.
It is easy to verify that the Orlicz norm is indeed a norm. We call the subGaussian norm, and the subexponential norm. Note that .
The following lemma upper bounds the subGaussian norm by its variance.
Lemma 5.
If , .
Proof.
By power series expansion of the exponential function,
where we used Lemma 4 for the inequality. The RHS converges and is equal to if . Thus, . ∎
The following lemma gives an upper bound on the moments of subexponential random variables.
Lemma 6.
If for some , , then .
Proof.
By Markov’s inequality,
For , since is nonnegative,
where we let and . ∎
Finally, we derive a concentration bound for subexponential random variables.
Theorem 7.
(Bernstein’s inequality for subexponential random variables) Let be independent realvalued random variables satisfying for some , with mean , and let . Then, for any , the following concentration bound holds:
d.2 Proof of Theorem 3
Since is Lipschitz, according to Theorem 5.5 and 5.6 of Boucheron et al. (2013), . And we have that
due to triangle inequality of the norm. Now since is Lipschitz, its expectation can be bounded by
Since is standardnormally distributed, follows the chi distribution with degrees of freedom, which has an expectation that can be upperbounded using Gautschi’s inequality (using Wendel’s version of the upper bound):
Combining the above and using Lemma 5, we have
Setting to be the RHS and applying Theorem 7 yield the desired result.
Appendix E Experimental Details
For the predictive tasks (Section 5.1
), we use a cosine annealing schedule for the learning rate, scaling down to 0.01 of the initial learning rate, and pretrain a deterministic network for 10 epochs using the Adam optimizer with a learning rate of 0.001, to initialize the mean of the Gaussian
, and train for 200 epochs.LeNet5 MNIST.
We use a linear annealing schedule of the coefficient (from 0 back to 1) for 50,000 iterations. We use the Adam optimizer with a learning rate of 0.0005. The result we get for KLinear uses polyak averaging with exponential decay coefficient 0.995. We use the volume preserving version of RealNVP for the KNonlinear. We use the standard Gaussian prior for .
LeNet5 CIFAR5
We use the same architecture as Louizos and Welling (2017) (192 convolutional kernels and 1,000 hidden units for the fully connected layers). We use a linear annealing schedule of the coefficient (from 0 back to 1) for 20,000 iterations for Diag, and no annealing for KLinear and KNonlinear. We use the Adam optimizer with a learning rate of 0.0003, 0.0003, 0.0005 for Diag, KLinear and KNonlinear, respectively. We use the volume preserving version of RealNVP for KNonlinear. We use the standard Gaussian prior for .
Vgg16 Cifar10
We use the modified version of VGG16 proposed by Zhang et al. (2017). We use a learning rate of 0.0005 for all experiments but KNonlinear in the regular setup (where we use 0.001). We use the isotropic Gaussian prior with variance being 0.1, and set to be [0.5, 0.1, 0.1, 0.5] in the regular setup and [0.5, 0.1, 0.1, 0.1] in the data augmented setup for Diag, KDiag, KLinear, and KNonlinear, respectively. We use the volume preserving version of RealNVP for the KNonlinear.
PACBayes MLP
We follow the same steps as Dziugaite and Roy (2017), except we did not discretize the prior variance after tuning. In practice this does not affect the bound much. We also did not initialize the mean of in our setup using SGD for our experiments. We train the stochastic network for 300 epochs, with a learning rate of 0.002. The bound holds with probability at least 0.965 over the choice of prior and the training set. The and coefficients in Dziugaite and Roy (2017) are set as 100 and 0.1. We use the volume preserving version of IAF for the KNonlinear.
PACBayes LeNet5
The same setup as PACBayes MLP, except with polyak averaging with coefficient 0.995. We use the volume preserving version of IAF for the KNonlinear.
Bandit Benchmark
All the models share the same architechture: one hidden layer with 50 units. We use the volume preserving version of RealNVP for KNonLinear. We train models every 50 time steps for 200 training iterations using a batchsize of 512.
Bandit problem  number of actions  number of contexts 

Mushroom  2  50000 
Statlog  7  43500 
Covertype  7  50000 
Financial  8  3713 
Jester  8  19181 
Adult  14  45222 