# Stochastic Neural Network with Kronecker Flow

Recent advances in variational inference enable the modelling of highly structured joint distributions, but are limited in their capacity to scale to the high-dimensional setting of stochastic neural networks. This limitation motivates a need for scalable parameterizations of the noise generation process, in a manner that adequately captures the dependencies among the various parameters. In this work, we address this need and present the Kronecker Flow, a generalization of the Kronecker product to invertible mappings designed for stochastic neural networks. We apply our method to variational Bayesian neural networks on predictive tasks, PAC-Bayes generalization bound estimation, and approximate Thompson sampling in contextual bandits. In all setups, our methods prove to be competitive with existing methods and better than the baselines.

• 17 publications
• 17 publications
• 47 publications
• 23 publications
• 27 publications
• 143 publications
10/13/2017

### Bayesian Hypernetworks

We propose Bayesian hypernetworks: a framework for approximate Bayesian ...
03/14/2019

### Functional Variational Bayesian Neural Networks

Variational Bayesian neural networks (BNNs) perform variational inferenc...
10/09/2018

### Fixing Variational Bayes: Deterministic Variational Inference for Bayesian Neural Networks

Bayesian neural networks (BNNs) hold great promise as a flexible and pri...
06/02/2022

### Excess risk analysis for epistemic uncertainty with application to variational inference

We analyze the epistemic uncertainty (EU) of supervised learning in Baye...
02/27/2021

### Variational Laplace for Bayesian neural networks

We develop variational Laplace for Bayesian neural networks (BNNs) which...
06/12/2019

### Neural Variational Inference For Estimating Uncertainty in Knowledge Graph Embeddings

Recent advances in Neural Variational Inference allowed for a renaissanc...
02/26/2019

### Function Space Particle Optimization for Bayesian Neural Networks

While Bayesian neural networks (BNNs) have drawn increasing attention, t...

## 1 Introduction

Stochastic neural networks (SNN) are a central tool in many subfields of machine learning, including (1) Bayesian deep learning

(MacKay, 1992; Blundell et al., 2015; Hernández-Lobato and Adams, 2015; Gal and Ghahramani, 2016)

, (2) exploration in reinforcement learning

(Ian et al., 2013; Osband et al., 2016; Riquelme et al., 2018)

, and (3) statistical learning theory such as PAC-Bayesian learning

(McAllester, 1999; Langford and Seeger, 2001; Dziugaite and Roy, 2017). Perturbations of the network parameters induce a distribution over the model, and this intrinsic uncertainty is the subject of great interest to machine learning practitioners and theoreticians alike. For example, deep Bayesian models are often used to adequately measure uncertainty, and determine whether the model itself is inherently familiar with the unseen data. This is especially important in the context of autonomous vehicles, where decisions must be made to meet specific safety standards (McAllister et al., 2017). Conversely, the lack of confidence can be leveraged to efficiently guide exploration in reinforcement learning, via randomizing the approximate value function (Azizzadenesheli et al., 2018; Touati et al., 2018) or maximizing intrinsic rewards (Houthooft et al., 2016).

Furthermore, a considerable proportion of statistical learning theory is devoted to understanding what implies generalization, or what constitutes an appropriate measure of complexity (Bartlett et al., 2017; Arora et al., 2018; Neyshabur et al., 2017). PAC-Bayesian learning theory (McAllester, 1999) specifically explores the generalization property of a randomized prediction rule, and has been recently studied in the context of stochastic neural networks (Dziugaite and Roy, 2017)

. In this particular study, the working hypothesis is that good generalization can be guaranteed on the premise that stochastic gradient descent

(Robbins and Monro, 1951) finds a solution that obtains certain structural property (such as flatness).

From a more practical perspective, considerable effort has been devoted to modelling uncertainty through the injection of independent noise to the network parameters (Graves, 2011; Blundell et al., 2015; Kingma et al., 2015). Noise independence largely restricts the expressivity of the noise distribution and thus the resulting uncertainty measures are ill-calibrated (Minka et al., 2005; Turner and Sahani, 2011). Attempts have been made to correlate parameters of a neural network, including  Louizos and Welling (2017); Krueger et al. (2017); Pawlowski et al. (2017), for example, by adapting expressive non-linear invertible transformations developed in the variational inference literature (Rezende and Mohamed, 2015; Kingma et al., 2016; Huang et al., 2018), or via implicit methods (Goodfellow et al., 2014). However, these methods are limited due to their inability to scale well. Louizos and Welling (2017), for instance, resort to a specific multiplicative noise sampled from a lower dimensional space and have to use an auxiliary method to bound the entropy. Krueger et al. (2017), on the other hand, give up on injecting noise on the entire set of parameters and model the distribution of the scale and shift parameter of the pre-activations.

In attempts to address some of the challenges articulated above and efficiently model the joint distribution of a network’s parameters, we propose the Kronecker Flow, an invertible transformation-based method inspired by the Kronecker product. Specifically, we notice that the Kronecker product can be thought of as left-transforming a matrix via a linear map, and then right-transforming it using another linear map. Our contributions are as follows.

1. We extend this idea to more general invertible mappings to induce non-linear dependencies, and apply this trick to parameterizing deep stochastic neural networks.

2. We apply our method to predictive tasks and show that our methods work better on larger architectures compared to existing methods.

3. We are the first to apply flow-based methods to tighten the PAC-based bound.

4. Our methods prove to be competitive over other methods in approximate Thompson sampling in contextual bandit problems.

## 2 Background

Stochastic neural networks with parameter perturbation normally follow the stochastic process: , , where is the parameter of the neural network

, which outputs the prediction probability vector for classification or the predicted values for regression. We let

be the training set of size 111We use the notation to compactly describe the set of integers ., be the differential entropy , be the coefficient controlling the amount of noise injected into the model and the degree of regularization,

be the loss function and

be the empirical risk.

### 2.1 Variational inference

Bayesian inference updates our prior belief over the model parameters according to the Bayes rule as more information in the form of the likelihood of the training set is available. Inference can be cast as an optimization problem, where one seeks to maximize the variational lower bound (also known as the evidence lower bound, or the ELBO) on the log marginal likelihood:

 logp(D)≥Eqϕ[logp(D|Θ)+logp(Θ)]+H(qϕ(Θ)), (1)

where is the variational approximate posterior and can be decomposed into due to conditional independence assumption. The optimal is the true posterior, i.e. . In our case, we use to parameterize a neural network. Prediction can be carried out via the predictive posterior

 p(y|x,D)=EΘ∼p(Θ|D)[p(y|x,Θ)]≈EΘ∼qϕ(Θ)[p(y|x,Θ)]≈1KK∑k=1p(y|x,Θk)

for drawn i.i.d. from , where we use the variational distribution to approximate and a Monte Carlo estimate to approximate the integral.

The prior distribution can be used to encode some form of inductive bias, such as one that is in favor of parameter values closer to some chosen a priori. We choose the prior to be an isotropic Gaussian, centered at the random initialization , i.e., . The entropy term ensures the variational posterior does not collapse to a point estimate. Both of them can be thought of as some form of regularizer, so we attach a coefficient

in front of them as a hyperparameter

222Like the parameter in Zhang et al. (2017).

### 2.2 PAC-Bayes generalization bound

Another use case of stochastic neural networks is to understand generalization, via minimizing the PAC-Bayes bound, which usually takes the following form: Assume the data is distributed according to some data distribution . Then for any and a fixed prior over models , with probability at least over the choice of , for all , , where is some sort of “distance” function that is usually convex (Germain et al., 2009), is the empirical risk (with a bounded loss such as the zero-one loss), is the test loss, and is a measure of complexity that scales proportionally with the Kullback–Leibler (KL) divergence.

For instance, Dziugaite and Roy (2017) minimize the following bound originally due to McAllester (1999) and then tightened by Langford and Seeger (2001):

###### Theorem 1.

Let be the zero-one loss. For any and data distribution , and any distribution on the space of , with probability at least over the choice of a training set , for all distributions on the space of ,

 DKL(^L[q]||L[q])≤DKL(q||p)+logmδm−1, (2)

where the KL on the LHS is a function of the losses, similar to that of the Bernoulli KL.

We refer to the above bound as the McAllester bound. The KL divergence on the RHS of the bound, also known as the information gain, tells us to what extent the posterior is dependent on the training data. The sharper and more confident is, and the farther away it is from the prior

, the larger the KL will be, which in turn is reflected by the larger bound on the generalization gap. This is consistent with traditional notion of bias-variance trade-off.

Alternatively, we consider the following bound due to Catoni (2007):

###### Theorem 2.

With the same setup as Theorem 1, and with a fixed , the following bound holds with probability over :

 L[q] ≤11−12β(^L[q]+βm(DKL(q||p)+ln1δ)). (3)

We refer to this bound as the Catoni bound. We notice the linear relationship (which is also noticed by Germain et al. (2016)) between the empirical risk and the KL divergence. This allows us to make use of the linearity of expectation to perform change of variable (see the next section). We also note that the optimal in Equation 3 is always larger than , so the PAC-Bayes bound is actually more conservative than Bayesian inference in this sense.

### 2.3 Normalizing flows

Minimization of Equation 1 and 3 requires (i) computing the gradient with respect to the parameter of the (PAC-)Bayesian posterior , and (ii) computing the entropy of . One approach to do this is via change of variable under an invertible mapping. Let

be a random variable in

, and be a bijection parameterized by . Let and be its density. Then we can rewrite the loss function as333Since the weighting coefficient can be absorbed into the loss function , we neglect it for simplicity now.

 EΘ[^RD(Θ)+logqϕ(Θ)]=Eϵ[^RD(gϕ(ϵ))+logq0(ϵ)−log∣∣ ∣∣det∂gϕ(ϵ)∂ϵ∣∣ ∣∣],

where we apply the change of variable (see Appendix A

for the detailed derivation). The log-determinant (logdet) term ensures that we obtain a valid probability density function after

is applied, which can be a sequence of invertible mappings itself, hence referred to as the normalizing flow (Rezende and Mohamed, 2015)

. This way, the random variable and the parameters are decoupled, so that we can differentiate the integrand to have an unbiased estimate of the gradient (fixing some

). We let be the standard normal.

## 3 Kronecker Flows

We consider maximizing the ELBO and minimizing the Catoni bound via normalizing flow-based SNNs. Conventionally, mean-field approximation using factorized distributions (such as multivariate Gaussian with diagonal covariance) has been well explored in the variational inference (VI) literature (Blundell et al., 2015). We are interested in better capturing the structure in the parameter space as restricted VI methods are known to exhibit overconfidence (Minka et al., 2005; Turner and Sahani, 2011). However, the parameters of a neural network are usually very high dimensional (on the order of millions), requiring a novel way to parameterize the joint distribution over the parameters.

In its general form, neural networks can be represented by a collection of tensors i.e.

. While our method below can easily be generalized to high-dimensional tensors (such as for convolutional kernels), to simplify notation, we describe the matrix form.

### 3.1 Linear Kronecker Flow

The matrix-variate normal (

) distribution generalizes the multivariate normal distribution to matrix-valued random variables, such as weight matrices of a neural network

(Louizos and Welling, 2016). Matrix normal is a multivariate normal distribution whose covariance matrix is a Kronecker product (), which allows us to model the correlation among the parameters to some degree.

More concretely, assume is an random Gaussian matrix, and , and are real-valued matrices. Then has a matrix normal distribution, as

 vec(M+AEB)∼N(vec(M),B⊤B⊗AA⊤),

where vec is the vectorization of a matrix that concatenates all the columns. This allows us to represent the covariance matrix in a more compact manner ( parameters versus parameters for Kronecker product).

#### Limitation of the Kronecker product.

The Kronecker product covariance matrix is not a strict generalization of diagonal covariance matrix. To observe this, let , (this is the case of Louizos and Welling (2016)), and , where , , and . Then is also a diagonal matrix of size . Equating to solve for and will result in nonlinear equations with variables, which can be over-determined for . For example, let , and for some . Then the nonlinear system below does not have a solution:

 U⊗V=S⟺u1v1(a)=1u1v2(b)=ϵu1v3(c)=ϵu2v1(d)=1u2v2(e)=1u2v3(f)=1

To see this, dividing by and dividing by yield and , respectively, which doesn’t have a solution if . This is because the Kronecker product is essentially parameter sharing, which can heavily restrict the matrix it can represent.

To remedy the above limitation, we can further decouple the reparameterization of the parameter matrix into two parts: (1) one that models the marginal variance and (2) one that models correlations. Assume is a positive-valued matrix, and let . Then

is a Gaussian distribution with the following property, which is useful in calculating the KL divergence:

###### Property 1.

Let be given as above, with and . Then

1. [label=(P0)]

2. , and

See Appendix B for the derivation and interpretation of the property. Naive implemetations of this can be inefficient and numerically unstable, as the entropy term involves computing the log-determinant of and

, requiring the standard automatic differentiation libraries to resort to singular value decomposition when the matrix is near-singular. Thus, we choose to parameterize

and as lower triangular matrices444This can be achieved via masking. with ones on the diagonal, leaving the uncertainty to be modeled by . This means .

#### Simulation.

To validate the limited expressiveness of kronecker product, we randomly initialize to be the density of a multivariate Gaussian with mean zero, and covariance being the square of a random standard Gaussian matrix. We choose such that it can be decomposed into a product of integers, and parameterize using independent Gaussian (dubbed Diag), the Kronecker product with diagonal and (K-Diag), and the Kronecker product with elementwise scaling (K-Linear). We minimize ; see Figure 1 for the results. We also conduct the same experiment with 3D tensors (instead of matrices). We see that K-Diag consistently underperforms when compared to Diag, which indicates parameter sharing does restrict the family of distributions it can represent, and K-Linear is consistently better.

### 3.2 Nonlinear Kronecker Flow

In this section, we generalize the Kronecker product to more general non-linear mappings. In Appendix C, we make a connection to non-decreasing triangle maps (Villani, 2008)

that are general enough to model any probability distributions.

First, notice that left-multiplying by amounts to introducing linear correlation among the rows of , applied to each of the columns. Likewise, right-multiplying by amounts to correlating column entries of each row of . Inspired by this, we consider applying an invertible mapping to each row of the random weight matrix, and another invertible mapping to each column of the matrix. We call this the Kronecker Flow 555To differentiate this from K-Linear from the previous section, we refer to using non-linear as K-Nonlinear..

Specifically, let and be invertible mappings. We define the matrix-matrix function as , with the following batch-operations (for and ):

 GA(E⊤)j::=gA(E:j)GB(E)i::=gB(Ei:)

It is easy to verify that is invertible. Due to the partial dependency of and , the Jacobians of the vectorized forms (after proper permutation) are block-diagonal, so we have

 det∂vec(G(E))∂% vec(E)=∏j∈[p]det∂gA(E:j)∂E:j⋅∏i∈[n]det∂gB(GA(E⊤):i)∂GA(E⊤):i.

In practice, we use the volume preserving version of RealNVP (Dinh et al., 2016) and inverse autoregressive flow (IAF) (Kingma et al., 2016) to parameterize and for our experiments. The K-Linear from the previous section can be thought of as using a linear map as and .

## 4 Concentration of empirical KL with normalizing flows

In their study, Dziugaite and Roy (2017) use independent Gaussian for to minimize the McAllester bound, so they can compute the KL between Gaussians analytically. This is no longer feasible when we use more flexible families for , such as normalizing flows. Moreover, a Monte Carlo estimate might result in underestimating the bound after inverting the KL between Bernoullis on the LHS of Equation 2 (which is a concave function; see Appendix A of Reeb et al. (2018) for an illustration). This necessitates a high probability bound on the concentration of the empirical estimate.

In Section 2.3, we have established can be written in the following form

where both and are standard Gaussian (the mean and variance can be absorbed into the invertible mapping if this is not the case).

The first term in the KL can be computed analytically. The second term usually can be almost surely bounded (e.g. using Block neural autoregressive flows) so that we can use Hoeffding-type concentration or it can simply be made zero using e.g. volume preserving flows. The challenge now lies in the third term, which has a quadratic form , neglecting the normalizing constant.

Now assume is a -Lipschitz 666

The following flows are all Lipschitz (with proper activation functions): volume preserving version of

Dinh et al. (2016); Kingma et al. (2016), Berg et al. (2018), Behrmann et al. (2018), De Cao et al. (2019), etc.. Let . Then is -Lipschitz:

This is key in deriving a tail bound on , as Lipschitz functions of canonical Gaussian random variables are sub-Gaussians, meaning they have a tail that decays faster than a Gaussian random variable. The following theorem provides a concentration bound for the empirical average of similar to that of a Chi-square random variable, as (square of a sub-Gaussian) is sub-exponential.

###### Theorem 3.

Let be defined as above with a Lipschitz constant . Let . Then the following concentration bound holds

 P(¯g2−E[g2]>ϵ)≤exp(−Kϵ22(4C2+Cϵ)),

where .

Note that in practice the empirical KL that we use is inversely scaled by the size of the training set (see Equation 2), so the Lipschitz constant can be made small in practice to dominate the dimensionality.

## 5 Experiments

We evaluate our proposed method in the context of two prediction tasks (Section 5.1), PAC-Bayes bound minimization (Section 5.2) and contextual bandit (Section 5.3). For the two prediction tasks, we use the MNIST handwritten digit dataset (Lecun et al., 1998) and CIFAR-10 (Krizhevsky, 2009). See Appendix E for a detailed description.

### 5.1 Classification

In this section, we evaluate the generalization performance of our method applied to Bayesian neural networks. We consider two architectures: LeNet-5 (Lecun et al., 1998) and a modified version VGG-16 (Simonyan and Zisserman, 2014) proposed by Zhang et al. (2017).

We first compare to the multiplicative normalizing flow (MNFG) proposed by Louizos and Welling (2017), applying our method to LeNet-5 (see Table 1). Our Diag matches the performance of their FFG (fully factorized Gaussian). K-Diag outperforms Diag in this case, perhaps due to the smaller number of parameters which makes it easier to optimize. K-Nonlinear yields the best generalization error in this case. On the CIFAR-5 experiment (we take the first 5 classes of CIFAR-10), our methods are on par with MNFG.

Second, we compare with the noisy K-FAC proposed by Zhang et al. (2017), applying our methods to the larger architecture VGG-16 (see Table 2). Noisy K-FAC applies an approximate natural gradient method. Despite this advantage, our methods (K-Linear and K-Nonlinear) have simiar prediction accuracy in the regular setup. We also include the results of data augmentation with horizontal flip and random crop where K-Nonlinear outperforms all the other methods.

### 5.2 PAC Bayes bound minimization

For the PAC-Bayes bound estimation, we minimize Equation 3. We follow the recipe of Dziugaite and Roy (2017). We upper bound the zero-one loss by cross-entropy divided by (where is the number of classes) to make the upper bound tight. We set the prior to be , where is the initial value of the parameters, and apply a union bound to tune the prior variance . We also tune the coefficient as a parameter during training 777We are allowed to do so since we treat Equation 3 as an optimization objective, rather than report it as a bound. We report the McAllester bound, which holds for any , even if it depends on . , and report the McAllester bound for comparison (since it is the tightest). For more details, see Dziugaite and Roy (2017) for reference.

We test with a multi-layer perceptron with 1 or 2 hidden layers with 600 neurons and LeNet-5, evaluated on the MNIST dataset (see Table

3). For further clarification, we follow the steps of Dziugaite and Roy (2017) by minimizing the McAllester bound, using Pinsker’s inequality to bound the inverse of the Bernoulli KL (which we call the Pinsker bound). Since this bound has a square root in the complexity term, we can only use the Gaussian family with an analytic form of the KL. The result we have is slightly looser than Dziugaite and Roy (2017) since we have a 10-class problem and they deal with a binary version of MNIST. We see that the bound can indeed be improved by capturing the correlation among the parameters. We then compare to minimizing the Catoni bound, which is slightly looser since the linear relationship between the empirical risk and the KL term penalizes the latter more when the KL is larger. However, by modelling the non-linear dependencies, K-Nonlinear clearly outperforms the other methods (even compared to the ones minimizing the Pinsker bound). This indicates there exists a considerable amount of structure in the parameter space that may explain the gap between the test error and the generalization bound.

We also notice that, despite the linear relationship, the Catoni bound focuses more on the complexity term than the ELBO. For example, the empirical risks of LeNet-5 in Table 3 are much larger compared to the test loss of Table 1. The reasons are two: (1) the optimal in Equation 3 is larger than 1 (depending on the relative value of the KL), and (2) to properly upper bound the zero-one loss, we scale down the cross-entropy loss by during optimization. These two factors make it a more conservative training algorithm than Bayesian inference. This explains the hyperparameter tuning that is usually done in practice; e.g. when we decrease the coefficient, we allow the model to focus more on minimizing the empirical risk, at the cost of a weaker generalization guarantee.

### 5.3 Contextual bandit

Uncertainty modeling lies at the heart of the exploration-exploitation dilemma in sequential decision-making. In order to maximize its collected cumulative rewards, an agent should trade off exploring different actions and gaining more knowledge about the reward estimate vs. exploiting the current estimate and allocating resources to the actions that are likely rewarding. Thompson sampling (TS) (Thompson, 1933) is one the popular approaches that deals with the latter trade-off by maintaining posterior distribution over reward models and randomizing actions on the basis of their probability of being optimal.

In this section, we investigate the effectiveness of our proposed method for performing an approximate Thompson sampling in the particular setting of contextual bandits. In the latter setting, at each time , the agent sees a -dimensional context , selects one of the available actions, , and earns a reward generated by the environment. The agent aims to minimize its cumulative regret defined as where is the highest expected reward given the context and the expectation is over the randomness of both environment and the agent’s choice of actions.

We compare different methods on a range of real-world bandit problems introduced by Riquelme et al. (2018). We train the models every 50 time steps for 200 iterations using a batch-size of 512. We ran each experiment with 3 different random seeds and we report the means and standard deviations of cumulative regret normalized with the respect to the uniform baseline in the table 4. We include the functional variational Bayesian neural networks (fBNN), recently introduced by Sun et al. (2019) as a baseline, and we use their open sourced implementation of fBNN in the bandit setting. From table 4, we see that across the 6 bandit problems, our proposed method (K-Linear and K-Nonlinear) provides competitive and consistent results. They outperform other baselines in 4 problems out of 6.

## 6 Conclusion

In this work, we present the Kronecker Flow, a flow-based method to induce complex distribution inspired by the Kronecker product. Our methods scale to larger architectures such as VGG-16 since it takes advantage of the shape of the parameters. We demonstrate our methods work better than vanilla Kronecker product with diagonal matrices on multiple setups, including classification and approximate Thompson sampling in contexual bandit, and outperform existing methods in the Bayesian neural network literature. We are also the first to apply flow-based methods to obtain a tighter numerical generalization bound.

## References

• Arora et al. (2018) Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. arXiv preprint arXiv:1802.05296, 2018.
• Azizzadenesheli et al. (2018) Kamyar Azizzadenesheli, Emma Brunskill, and Animashree Anandkumar. Efficient exploration through bayesian deep q-networks. CoRR, abs/1802.04412, 2018. URL http://arxiv.org/abs/1802.04412.
• Bartlett et al. (2017) Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6240–6249, 2017.
• Behrmann et al. (2018) Jens Behrmann, David Duvenaud, and Jörn-Henrik Jacobsen. Invertible residual networks. arXiv preprint arXiv:1811.00995, 2018.
• Berg et al. (2018) Rianne van den Berg, Leonard Hasenclever, Jakub M Tomczak, and Max Welling. Sylvester normalizing flows for variational inference. arXiv preprint arXiv:1803.05649, 2018.
• Blundell et al. (2015) Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. In Proceedings of The 32nd International Conference on Machine Learning, pages 1613–1622, 2015.
• Bogachev et al. (2005) V I Bogachev, A V Kolesnikov, and K V Medvedev. Triangular transformations of measures. Sbornik: Mathematics, 196(3):309–335, apr 2005.
• Boucheron et al. (2013) Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013.
• Catoni (2007) Olivier Catoni. Pac-bayesian supervised classification: The thermodynamics of statistical learning. Lecture Notes-Monograph Series, 56:i–163, 2007. ISSN 07492170.
• De Cao et al. (2019) Nicola De Cao, Ivan Titov, and Wilker Aziz. Block neural autoregressive flow. arXiv preprint arXiv:1904.04676, 2019.
• Dinh et al. (2016) Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. arXiv preprint arXiv:1605.08803, 2016.
• Dziugaite and Roy (2017) Gintare Karolina Dziugaite and Daniel M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In

Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence

, 2017.
• Gal and Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059, 2016.
• Germain et al. (2009) Pascal Germain, Alexandre Lacasse, François Laviolette, and Mario Marchand.

Pac-bayesian learning of linear classifiers.

In Proceedings of the 26th Annual International Conference on Machine Learning, pages 353–360. ACM, 2009.
• Germain et al. (2016) Pascal Germain, Francis Bach, Alexandre Lacoste, and Simon Lacoste-Julien. Pac-bayesian theory meets bayesian inference. In Advances in Neural Information Processing Systems, pages 1884–1892, 2016.
• Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, 2014.
• Graves (2011) Alex Graves. Practical variational inference for neural networks. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2348–2356. Curran Associates, Inc., 2011.

Probabilistic backpropagation for scalable learning of bayesian neural networks.

In International Conference on Machine Learning, pages 1861–1869, 2015.
• Houthooft et al. (2016) Rein Houthooft, Xi Chen, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. Vime: Variational information maximizing exploration. In Advances in Neural Information Processing Systems 29. 2016.
• Huang et al. (2018) Chin-Wei Huang, David Krueger, Alexandre Lacoste, and Aaron Courville. Neural autoregressive flows. In International Conference on Machine Learning, 2018.
• Hyvärinen and Pajunen (1999) Aapo Hyvärinen and Petteri Pajunen.

Nonlinear independent component analysis: Existence and uniqueness results.

Neural Networks, 12(3), 1999.
• Ian et al. (2013) Osband Ian, Van Roy Benjamin, and Russo Daniel. (more) efficient reinforcement learning via posterior sampling. In Proceedings of the 26th International Conference on Neural Information Processing Systems, USA, 2013. Curran Associates Inc.
• Jaini et al. (2019) Priyank Jaini, Kira A Selby, and Yaoliang Yu. Sum-of-squares polynomial flow. In International Conference on Machine Learning, 2019.
• Kingma et al. (2016) Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, 2016.
• Kingma et al. (2015) Durk P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2575–2583. Curran Associates, Inc., 2015.
• Krizhevsky (2009) Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
• Krueger et al. (2017) David Krueger, Chin-Wei Huang, Riashat Islam, Ryan Turner, Alexandre Lacoste, and Aaron Courville. Bayesian hypernetworks. arXiv preprint arXiv:1710.04759, 2017.
• Langford and Seeger (2001) John Langford and Matthias Seeger. Bounds for averaging classifiers. 2001.
• Lecun et al. (1998) Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 1998. ISSN 0018-9219. doi: 10.1109/5.726791.
• Louizos and Welling (2016) Christos Louizos and Max Welling. Structured and efficient variational deep learning with matrix gaussian posteriors. In International Conference on Machine Learning, pages 1708–1716, 2016.
• Louizos and Welling (2017) Christos Louizos and Max Welling. Multiplicative normalizing flows for variational bayesian neural networks. In International Conference on Machine Learning, 2017.
• MacKay (1992) David JC MacKay. A practical bayesian framework for backpropagation networks. Neural computation, 4(3):448–472, 1992.
• McAllester (1999) David A McAllester. Pac-bayesian model averaging. In COLT, volume 99, pages 164–170. Citeseer, 1999.
• McAllister et al. (2017) Rowan McAllister, Yarin Gal, Alex Kendall, Mark Van Der Wilk, Amar Shah, Roberto Cipolla, and Adrian Weller. Concrete problems for autonomous vehicle safety: Advantages of bayesian deep learning. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI’17, pages 4745–4753. AAAI Press, 2017. ISBN 978-0-9992411-0-3.
• Minka et al. (2005) Tom Minka et al. Divergence measures and message passing. Technical report, Technical report, Microsoft Research, 2005.
• Müller et al. (2018) Thomas Müller, Brian McWilliams, Fabrice Rousselle, Markus Gross, and Jan Novák. Neural importance sampling. arXiv preprint arXiv:1808.03856, 2018.
• Neyshabur et al. (2017) Behnam Neyshabur, Srinadh Bhojanapalli, David Mcallester, and Nati Srebro. Exploring generalization in deep learning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5947–5956. Curran Associates, Inc., 2017.
• Osband et al. (2016) Ian Osband, Benjamin Van Roy, and Zheng Wen. Generalization and exploration via randomized value functions. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pages 2377–2386. JMLR.org, 2016.
• Pawlowski et al. (2017) Nick Pawlowski, Andrew Brock, Matthew C. H. Lee, Martin Rajchl, and Ben Glocker. Implicit Weight Uncertainty in Neural Networks. arXiv e-prints, 2017.
• Reeb et al. (2018) David Reeb, Andreas Doerr, Sebastian Gerwinn, and Barbara Rakitsch. Learning gaussian processes by minimizing pac-bayesian generalization bounds. In Advances in Neural Information Processing Systems, pages 3337–3347, 2018.
• Rezende and Mohamed (2015) Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, 2015.
• Riquelme et al. (2018) Carlos Riquelme, George Tucker, and Jasper Roland Snoek. Deep bayesian bandits showdown. 2018.
• Robbins and Monro (1951) Herbert Robbins and Sutton Monro. A stochastic approximation method. Ann. Math. Statist., 22(3):400–407, 09 1951.
• Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
• Sun et al. (2019) Shengyang Sun, Guodong Zhang, Jiaxin Shi, and Roger Grosse. Functional variational bayesian neural networks. arXiv preprint arXiv:1903.05779, 2019.
• Thompson (1933) William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 1933.
• Touati et al. (2018) Ahmed Touati, Harsh Satija, Joshua Romoff, Joelle Pineau, and Pascal Vincent. Randomized value functions via multiplicative normalizing flows. CoRR, abs/1806.02315, 2018. URL http://arxiv.org/abs/1806.02315.
• Turner and Sahani (2011) Richard E Turner and Maneesh Sahani. Two problems with variational expectation maximisation for time-series models. Bayesian Time series models, 1(3.1):3–1, 2011.
• Van den Oord et al. (2016) Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with pixelcnn decoders. In Advances in neural information processing systems, pages 4790–4798, 2016.
• Villani (2008) Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
• Zhang et al. (2017) Guodong Zhang, Shengyang Sun, David Duvenaud, and Roger Grosse. Noisy natural gradient as variational inference. arXiv preprint arXiv:1712.02390, 2017.

## Appendix A Law of the unconscious statistician

Let be our probability space. Let be a random variable following the (Lebesgue) density and being its pushforward measure, and write with being its density and being its pushforward measure, and . Then

 E[A] =∫RdAd(gϕ∘ϵ)∗P =∫Rd(^RD(Θ)+logqϕ(Θ))qϕ(Θ)dΘ =∫RdA∘Θdϵ∗P =∫Rd(^RD(gϕ(ϵ))+logqϕ(gϕ(ϵ)))q0(ϵ)dϵ =∫Rd(^RD(gϕ(ϵ))+logq0(ϵ)−log∣∣ ∣∣det∂gϕ(ϵ)∂ϵ∣∣ ∣∣)q0(ϵ)dϵ

## Appendix B Derivation and interpretation of Property 1

We first derive Property 1 algebraically, and give an interpretation that can be genralized to higher dimensional tensor operation. Recall that we have the following givens:

• Assume is a random Gaussian matrix.

• Assume and .

• .

• .

If we rescale elementwise by before inducing the column-wise and row-wise correlation, we have: (superscript is Hadamard power)

 Σ:=Var(vec(M+A(E∘S)B)) =Var((B⊤⊗A)vec(E∘S)) =(B⊤⊗A)diag(vec(S2))(B⊤⊗A)⊤ =(B⊤⊗A)diag(vec(S2))(B⊗A⊤)

If is a matrix of ones, the RHS equals , which is the covariance of the matrix normal.

Generally, might not be a matrix of ones. But we can still compute the determinant and trace of the covariance matrix (useful in computing KL):

 det(Σ)=det(A)2pdet(B)2n∏ijS2ij
 Tr(Σ) =∑iΣii =∑i∑j(B⊤⊗A)ij%vec(S2)j(B⊗A⊤)ji =∑i∑j(B2⊤⊗A2)ijvec(S2)j =∑i((B2⊤⊗A2)% vec(S2))i =∑ij(A2S2B2)ij

#### Interpretation of the determinant and trace.

The determinant measures the change in volume due to the linear map. Since each operation (elementwise multiplication with , left-multiplication with , and right-multiplication with ) is an invertible map, the determinant of the composition is a product of determinants. After elementwise multiplication with (hence )888The power 2 comes from the fact that we are looking at the determinant of the covariance. Direct computation of the likelihood involves , which is equivalent to the log-determinant of the invertible map., we apply the same linear map to the columns of ; in vector form, this corresponds to left-multiplication with a block diagonal of ’s, hence . The same reasoning explains .

The trace of the covariance can be written as , i.e. the sum of marginal variances. Each of the is a linear combination of the entries of , which have unit variance and are uncorrelated, so by the additive property of variance of sum of uncorrelated random variables and the quadratic scaling property of variance, .

## Appendix C Connection to triangular maps.

Much of the recent work on normalizing flows has been dedicated to inverse autoregressive transformations (Kingma et al., 2016; Huang et al., 2018; Müller et al., 2018; De Cao et al., 2019; Jaini et al., 2019), as they are general enough to induce any density function (Hyvärinen and Pajunen, 1999; Bogachev et al., 2005; Villani, 2008). When such transformations are used for and , the overall transformation is also a triangle map, since depends on for and . Such a function has some “blind spots” similar to the ones discovered by Van den Oord et al. (2016). One avenue for improvement is to design a transformation that increase the connectivity. Another avenue for improvement is to condition each (row-wise or column-wise) transformation on a learnable embedding of the row/column, such that each row/column is transformed by a slightly different function than another 999We try this idea in the preliminary stage of the project, but find it harder to optimize. This is potentially due to the extra parameters that have to be learned..

## Appendix D Tail bound of empirical KL

We begin with some preliminaries and lemmas in Section D.1, and prove the main result in Section D.2.

### d.1 Basic tail bounds and Bernstein inequality

The tools developed in this section is to translate the coefficients (such as variance) of sub-Gaussian random variables and sub-exponential random variables. We start with the definition of sub-Gaussians:

###### Definition 1.

We write if is a random variable satisfying

 P(|X|>t)≤2exp(−t22L2)

We write as the Gamma function: . Note that for positive integers ,

. The following lemma gives an upper-bound on the moments of a sub-Gaussian.

###### Lemma 4.

For , for any integer , .

###### Proof.

Since is non-negative, similar to Lemma 6, we have

 E[|X|p] =∫∞0P(|X|p≥s)ds=∫∞0P(|X|≥t)ptp−1dt ≤2p∫∞0e−t2/2L2tp−1dt=≤p(2L2)p/2∫∞0e−uup/2−1du=p(2L2)p/2Γ(p/2)

where we let and . ∎

The following definition is the main tool for translating the coefficients.

###### Definition 2.

Let be a random variable. For integer , define the -Orlicz norm as

 ||X||ψk:=inf{t>0:E[exp(|X|k/tk)]≤2}

i.e, the smallest constant for which the super-exponential moment of is bounded by 2. The Orlicz norm is infinity if there’s no finite for which exists.

It is easy to verify that the Orlicz norm is indeed a norm. We call the sub-Gaussian norm, and the sub-exponential norm. Note that .

The following lemma upper bounds the sub-Gaussian norm by its variance.

If , .

###### Proof.

By power series expansion of the exponential function,

 E[ecX2]=1+∞∑p=1cpE[X2p]p!≤1+∞∑p=1cpp!2(2L2)pp!=1+2∞∑p=1(2cL2)p

where we used Lemma 4 for the inequality. The RHS converges and is equal to if . Thus, . ∎

The following lemma gives an upper bound on the moments of sub-exponential random variables.

###### Lemma 6.

If for some , , then .

###### Proof.

By Markov’s inequality,

 P(|X|>t)≤E[exp(|X|/C)]exp(t/C)≤2e−t/C

For , since is non-negative,

 E[|X|p] =∫∞0P(|X|p≥s)ds=∫∞0P(|X|≥t)ptp−1dt ≤2p∫∞0e−t/Ctp−1dt=2pCp∫∞0e−uup−1du=2pCpΓ(p)=2Cpp!

where we let and . ∎

Finally, we derive a concentration bound for sub-exponential random variables.

###### Theorem 7.

(Bernstein’s inequality for sub-exponential random variables) Let be independent real-valued random variables satisfying for some , with mean , and let . Then, for any , the following concentration bound holds:

 P(¯X−μX>ϵ)≤exp(−nϵ22(4C2+Cϵ))
###### Proof.

Let and . Then by Lemma 6, and for integers : . Then by Corollary 2.11 of Boucheron et al. (2013), we have

 P(¯X−μX>ϵ)=P(n∑i=1(Xi−μX)>nϵ)≤exp(−nϵ22(4C2+Cϵ))

### d.2 Proof of Theorem 3

Since is -Lipschitz, according to Theorem 5.5 and 5.6 of Boucheron et al. (2013), . And we have that

due to triangle inequality of the norm. Now since is -Lipschitz, its expectation can be bounded by

Since is standard-normally distributed, follows the chi distribution with degrees of freedom, which has an expectation that can be upper-bounded using Gautschi’s inequality (using Wendel’s version of the upper bound):

 E[||ϵ||]=√2Γ((d+1)/2)Γ(d/2)≤√2(d2)1/2=√d

Combining the above and using Lemma 5, we have

 ||g2||ψ1≤(6L2+L√log2(√d+||g−1(0)||))2

Setting to be the RHS and applying Theorem 7 yield the desired result.

## Appendix E Experimental Details

For the predictive tasks (Section 5.1

), we use a cosine annealing schedule for the learning rate, scaling down to 0.01 of the initial learning rate, and pretrain a deterministic network for 10 epochs using the Adam optimizer with a learning rate of 0.001, to initialize the mean of the Gaussian

, and train for 200 epochs.

#### LeNet-5 MNIST.

We use a linear annealing schedule of the coefficient (from 0 back to 1) for 50,000 iterations. We use the Adam optimizer with a learning rate of 0.0005. The result we get for K-Linear uses polyak averaging with exponential decay coefficient 0.995. We use the volume preserving version of RealNVP for the K-Nonlinear. We use the standard Gaussian prior for .

#### LeNet-5 CIFAR-5

We use the same architecture as Louizos and Welling (2017) (192 convolutional kernels and 1,000 hidden units for the fully connected layers). We use a linear annealing schedule of the coefficient (from 0 back to 1) for 20,000 iterations for Diag, and no annealing for K-Linear and K-Nonlinear. We use the Adam optimizer with a learning rate of 0.0003, 0.0003, 0.0005 for Diag, K-Linear and K-Nonlinear, respectively. We use the volume preserving version of RealNVP for K-Nonlinear. We use the standard Gaussian prior for .

#### Vgg-16 Cifar-10

We use the modified version of VGG-16 proposed by Zhang et al. (2017). We use a learning rate of 0.0005 for all experiments but K-Nonlinear in the regular setup (where we use 0.001). We use the isotropic Gaussian prior with variance being 0.1, and set to be [0.5, 0.1, 0.1, 0.5] in the regular setup and [0.5, 0.1, 0.1, 0.1] in the data augmented setup for Diag, K-Diag, K-Linear, and K-Nonlinear, respectively. We use the volume preserving version of RealNVP for the K-Nonlinear.

#### PAC-Bayes MLP

We follow the same steps as Dziugaite and Roy (2017), except we did not discretize the prior variance after tuning. In practice this does not affect the bound much. We also did not initialize the mean of in our setup using SGD for our experiments. We train the stochastic network for 300 epochs, with a learning rate of 0.002. The bound holds with probability at least 0.965 over the choice of prior and the training set. The and coefficients in Dziugaite and Roy (2017) are set as 100 and 0.1. We use the volume preserving version of IAF for the K-Nonlinear.

#### PAC-Bayes LeNet-5

The same setup as PAC-Bayes MLP, except with polyak averaging with coefficient 0.995. We use the volume preserving version of IAF for the K-Nonlinear.

#### Bandit Benchmark

All the models share the same architechture: one hidden layer with 50 units. We use the volume preserving version of RealNVP for K-NonLinear. We train models every 50 time steps for 200 training iterations using a batch-size of 512.