Stochastic Neural Network with Kronecker Flow

06/10/2019 · Chin-Wei Huang et al. · Element AI Inc, Université de Montréal

Recent advances in variational inference enable the modelling of highly structured joint distributions, but are limited in their capacity to scale to the high-dimensional setting of stochastic neural networks. This limitation motivates a need for scalable parameterizations of the noise generation process, in a manner that adequately captures the dependencies among the various parameters. In this work, we address this need and present the Kronecker Flow, a generalization of the Kronecker product to invertible mappings designed for stochastic neural networks. We apply our method to variational Bayesian neural networks on predictive tasks, PAC-Bayes generalization bound estimation, and approximate Thompson sampling in contextual bandits. In all setups, our methods prove to be competitive with existing methods and better than the baselines.


1 Introduction

Stochastic neural networks (SNN) are a central tool in many subfields of machine learning, including (1) Bayesian deep learning (MacKay, 1992; Blundell et al., 2015; Hernández-Lobato and Adams, 2015; Gal and Ghahramani, 2016), (2) exploration in reinforcement learning (Ian et al., 2013; Osband et al., 2016; Riquelme et al., 2018), and (3) statistical learning theory such as PAC-Bayesian learning (McAllester, 1999; Langford and Seeger, 2001; Dziugaite and Roy, 2017). Perturbations of the network parameters induce a distribution over the model, and this intrinsic uncertainty is the subject of great interest to machine learning practitioners and theoreticians alike. For example, deep Bayesian models are often used to measure uncertainty and to determine whether the model itself is inherently familiar with the unseen data. This is especially important in the context of autonomous vehicles, where decisions must be made to meet specific safety standards (McAllister et al., 2017). Conversely, the lack of confidence can be leveraged to efficiently guide exploration in reinforcement learning, via randomizing the approximate value function (Azizzadenesheli et al., 2018; Touati et al., 2018) or maximizing intrinsic rewards (Houthooft et al., 2016).

Furthermore, a considerable proportion of statistical learning theory is devoted to understanding what implies generalization, or what constitutes an appropriate measure of complexity (Bartlett et al., 2017; Arora et al., 2018; Neyshabur et al., 2017). PAC-Bayesian learning theory (McAllester, 1999) specifically explores the generalization property of a randomized prediction rule, and has recently been studied in the context of stochastic neural networks (Dziugaite and Roy, 2017). In this particular study, the working hypothesis is that good generalization can be guaranteed on the premise that stochastic gradient descent (Robbins and Monro, 1951) finds a solution that exhibits certain structural properties (such as flatness).

From a more practical perspective, considerable effort has been devoted to modelling uncertainty through the injection of independent noise into the network parameters (Graves, 2011; Blundell et al., 2015; Kingma et al., 2015). Noise independence largely restricts the expressivity of the noise distribution, and thus the resulting uncertainty measures are ill-calibrated (Minka et al., 2005; Turner and Sahani, 2011). Attempts have been made to correlate the parameters of a neural network (Louizos and Welling, 2017; Krueger et al., 2017; Pawlowski et al., 2017), for example by adapting expressive non-linear invertible transformations developed in the variational inference literature (Rezende and Mohamed, 2015; Kingma et al., 2016; Huang et al., 2018), or via implicit methods (Goodfellow et al., 2014). However, these methods are limited by their inability to scale well. Louizos and Welling (2017), for instance, resort to a specific multiplicative noise sampled from a lower-dimensional space and have to use an auxiliary method to bound the entropy. Krueger et al. (2017), on the other hand, give up on injecting noise into the entire set of parameters and model only the distribution of the scale and shift parameters of the pre-activations.

In attempts to address some of the challenges articulated above and efficiently model the joint distribution of a network’s parameters, we propose the Kronecker Flow, an invertible transformation-based method inspired by the Kronecker product. Specifically, we notice that the Kronecker product can be thought of as left-transforming a matrix via a linear map, and then right-transforming it using another linear map. Our contributions are as follows.

  1. We extend this idea to more general invertible mappings to induce non-linear dependencies, and apply this trick to parameterizing deep stochastic neural networks.

  2. We apply our method to predictive tasks and show that our methods work better on larger architectures compared to existing methods.

  3. We are the first to apply flow-based methods to tighten the PAC-Bayes bound.

  4. Our methods prove to be competitive with other methods for approximate Thompson sampling in contextual bandit problems.

2 Background

Stochastic neural networks with parameter perturbation normally follow the stochastic process θ ∼ q, ŷ = f_θ(x), where θ is the parameter of the neural network f_θ, which outputs the prediction probability vector for classification or the predicted values for regression. We let D = {(x_i, y_i) : i ∈ [n]} be the training set of size n¹, H(q) = −E_{θ∼q}[log q(θ)] be the differential entropy, β be the coefficient controlling the amount of noise injected into the model and the degree of regularization, ℓ be the loss function, and L̂(θ) = (1/n) Σ_{i∈[n]} ℓ(f_θ(x_i), y_i) be the empirical risk.

¹We use the notation [n] to compactly describe the set of integers {1, …, n}.

2.1 Variational inference

Bayesian inference updates our prior belief over the model parameters according to Bayes' rule as more information, in the form of the likelihood of the training set, becomes available. Inference can be cast as an optimization problem, where one seeks to maximize the variational lower bound (also known as the evidence lower bound, or ELBO) on the log marginal likelihood:

log p(D) ≥ E_{θ∼q}[log p(D|θ)] − KL(q(θ) ‖ p(θ)),     (1)

where q is the variational approximate posterior and p(D|θ) can be decomposed into ∏_{i∈[n]} p(y_i|x_i, θ) due to the conditional independence assumption. The optimal q is the true posterior, i.e. q(θ) = p(θ|D). In our case, we use θ to parameterize a neural network. Prediction can be carried out via the predictive posterior

p(y|x, D) = ∫ p(y|x, θ) p(θ|D) dθ ≈ (1/m) Σ_{j∈[m]} p(y|x, θ_j)

for θ_j drawn i.i.d. from q, where we use the variational distribution q to approximate p(θ|D) and a Monte Carlo estimate to approximate the integral.

The prior distribution can be used to encode some form of inductive bias, such as one that favors parameter values closer to some point chosen a priori. We choose the prior to be an isotropic Gaussian centered at the random initialization θ₀, i.e. p(θ) = N(θ; θ₀, σ_p²I). The entropy term inside the KL ensures the variational posterior does not collapse to a point estimate. Both the prior term and the entropy term can be thought of as some form of regularizer, so we attach a coefficient β in front of them as a hyperparameter².

²Like the coefficient in Zhang et al. (2017).
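As a concrete reference point, the following minimal PyTorch sketch (our illustration, not the authors' released code) estimates this β-weighted objective for a fully factorized Gaussian posterior using the reparameterization trick; `nll_fn`, `theta0`, `sigma_p`, and `beta` are placeholder names we introduce here.

```python
import torch

def elbo_loss(w_mu, w_logstd, theta0, sigma_p, nll_fn, beta, n_samples=1):
    """Negative beta-ELBO for a fully factorized Gaussian posterior.

    nll_fn(theta) should return the average negative log-likelihood on a
    minibatch given a flat parameter vector theta (placeholder interface).
    """
    std = torch.exp(w_logstd)
    total = 0.0
    for _ in range(n_samples):
        eps = torch.randn_like(w_mu)
        theta = w_mu + std * eps  # reparameterization: theta = mu + sigma * eps
        total = total + nll_fn(theta)
    nll = total / n_samples
    # Analytic KL between N(mu, sigma^2) and the prior N(theta0, sigma_p^2 I).
    kl = 0.5 * (((w_mu - theta0) ** 2 + std ** 2) / sigma_p ** 2
                - 1.0 - 2.0 * (w_logstd - torch.log(torch.tensor(sigma_p)))).sum()
    return nll + beta * kl  # beta scales the regularizing terms as in Section 2.1
```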

2.2 PAC-Bayes generalization bound

Another use case of stochastic neural networks is to understand generalization via minimizing the PAC-Bayes bound, which usually takes the following form. Assume the data are distributed according to some data distribution 𝒟. Then for any δ > 0 and a fixed prior P over models, with probability at least 1 − δ over the choice of the training set D ∼ 𝒟ⁿ, for all posteriors Q,

Δ(L̂(Q), L(Q)) ≤ C(Q, P, n, δ),

where Δ is some sort of "distance" function that is usually convex (Germain et al., 2009), L̂(Q) = E_{θ∼Q}[L̂(θ)] is the empirical risk (with a bounded loss such as the zero-one loss), L(Q) is the test loss, and C is a measure of complexity that scales proportionally with the Kullback–Leibler (KL) divergence KL(Q‖P).

For instance, Dziugaite and Roy (2017) minimize the following bound, originally due to McAllester (1999) and then tightened by Langford and Seeger (2001):

Theorem 1.

Let ℓ be the zero-one loss. For any δ > 0 and data distribution 𝒟, and any distribution P on the space of parameters, with probability at least 1 − δ over the choice of a training set D ∼ 𝒟ⁿ, for all distributions Q on the space of parameters,

kl(L̂(Q) ‖ L(Q)) ≤ (KL(Q‖P) + log(2√n/δ)) / n,     (2)

where the kl on the LHS is a function of the losses, namely the KL divergence between Bernoulli distributions with parameters L̂(Q) and L(Q).

We refer to the above bound as the McAllester bound. The KL divergence on the RHS of the bound, also known as the information gain, tells us to what extent the posterior is dependent on the training data. The sharper and more confident Q is, and the farther away it is from the prior P, the larger the KL will be, which in turn is reflected in a larger bound on the generalization gap. This is consistent with the traditional notion of the bias–variance trade-off.

Alternatively, we consider the following bound due to Catoni (2007):

Theorem 2.

With the same setup as Theorem 1, and with a fixed β > 0, the following bound holds with probability 1 − δ over D:

L(Q) ≤ (1 / (1 − e^{−β})) [ 1 − exp( −β L̂(Q) − (KL(Q‖P) + log(1/δ)) / n ) ].     (3)

We refer to this bound as the Catoni bound. We notice the linear relationship (also noticed by Germain et al. (2016)) between the empirical risk and the KL divergence inside the exponent. This allows us to make use of the linearity of expectation to perform a change of variable (see the next section). We also note that the optimal β in Equation 3 is always larger than 1, so the PAC-Bayes bound is actually more conservative than Bayesian inference in this sense.
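Evaluating the Catoni bound given an empirical risk and a KL estimate is a one-liner; the sketch below (our illustration, following the reconstruction of Equation 3 above) makes the trade-off between the two terms explicit.

```python
import math

def catoni_bound(emp_risk, kl, n, delta, beta):
    """Catoni-style PAC-Bayes bound (Equation 3 as reconstructed above).

    emp_risk: empirical risk E_Q[L_hat] in [0, 1]; kl: KL(Q || P);
    n: training set size; beta: the fixed trade-off coefficient.
    """
    inside = beta * emp_risk + (kl + math.log(1.0 / delta)) / n
    return (1.0 - math.exp(-inside)) / (1.0 - math.exp(-beta))

# Illustrative numbers: risk 0.05, KL = 5000 nats, n = 60000, delta = 0.025
print(catoni_bound(0.05, 5000.0, 60000, 0.025, beta=2.0))
```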

2.3 Normalizing flows

Minimization of Equations 1 and 3 requires (i) computing the gradient with respect to the parameters of the (PAC-)Bayesian posterior q, and (ii) computing the entropy of q. One approach to do this is via a change of variable under an invertible mapping. Let z be a random variable in ℝᵈ, and g_φ be a bijection parameterized by φ. Let θ = g_φ(z) and q be its density. Then we can rewrite the loss function as³

E_{θ∼q}[L̂(θ) + log q(θ)] = E_{z∼q₀}[ L̂(g_φ(z)) + log q₀(z) − log |det J_{g_φ}(z)| ],

where we apply the change of variable (see Appendix A for the detailed derivation). The log-determinant (logdet) term ensures that we obtain a valid probability density function after g_φ is applied; g_φ can itself be a sequence of invertible mappings, hence referred to as a normalizing flow (Rezende and Mohamed, 2015). This way, the random variable and the parameters are decoupled, so that we can differentiate the integrand to obtain an unbiased estimate of the gradient (fixing some z). We let q₀ be the standard normal.

³Since the weighting coefficient β can be absorbed into the loss function L̂, we neglect it for simplicity now.
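The decoupling is easy to express in code; the following short sketch (our illustration, with an assumed interface where the flow `g` returns both the transformed sample and its log-determinant) computes a Monte Carlo estimate of the objective above.

```python
import torch

def flow_objective(g, base_samples, loss_fn, log_q0):
    """Monte Carlo estimate of E_q[loss + log q(theta)] after the change of
    variable theta = g(z), z ~ q0 (a sketch; g must return (theta, logdet)).
    """
    theta, logdet = g(base_samples)          # g: z -> (g(z), log|det J_g(z)|)
    # Change of variable: log q(theta) = log q0(z) - log|det J_g(z)|.
    log_q = log_q0(base_samples) - logdet
    return (loss_fn(theta) + log_q).mean()   # loss plus negative entropy
```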

3 Kronecker Flows

We consider maximizing the ELBO and minimizing the Catoni bound via normalizing flow-based SNNs. Conventionally, mean-field approximation using factorized distributions (such as multivariate Gaussian with diagonal covariance) has been well explored in the variational inference (VI) literature (Blundell et al., 2015). We are interested in better capturing the structure in the parameter space as restricted VI methods are known to exhibit overconfidence (Minka et al., 2005; Turner and Sahani, 2011). However, the parameters of a neural network are usually very high dimensional (on the order of millions), requiring a novel way to parameterize the joint distribution over the parameters.

In their general form, the parameters of a neural network can be represented by a collection of tensors, i.e. θ = {W⁽¹⁾, …, W⁽ᴸ⁾}. While our method below can easily be generalized to higher-dimensional tensors (such as convolutional kernels), to simplify notation we describe the matrix form.

3.1 Linear Kronecker Flow

The matrix-variate normal (MN) distribution generalizes the multivariate normal distribution to matrix-valued random variables, such as the weight matrices of a neural network (Louizos and Welling, 2016). The matrix normal is a multivariate normal distribution whose covariance matrix is a Kronecker product (Σ₂ ⊗ Σ₁), which allows us to model the correlation among the parameters to some degree.

More concretely, assume Z is an m × n random Gaussian matrix, and A ∈ ℝⁿˣⁿ and B ∈ ℝᵐˣᵐ are real-valued matrices. Then W = BZAᵀ has a matrix normal distribution, as

vec(W) = (A ⊗ B) vec(Z) ∼ N(0, (AAᵀ) ⊗ (BBᵀ)),

where vec is the vectorization of a matrix that concatenates all the columns. This allows us to represent the covariance matrix in a more compact manner (m²n² parameters versus m² + n² parameters for the Kronecker product).
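The vectorization identity and the resulting Kronecker-factored covariance are easy to check numerically; below is a small numpy sanity check of our own.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4
A, B = rng.standard_normal((n, n)), rng.standard_normal((m, m))
Z = rng.standard_normal((m, n))

# vec(B Z A^T) == (A kron B) vec(Z), with vec stacking columns ('F' order).
lhs = (B @ Z @ A.T).flatten(order="F")
rhs = np.kron(A, B) @ Z.flatten(order="F")
assert np.allclose(lhs, rhs)

# Hence Cov(vec(B Z A^T)) = (A A^T) kron (B B^T) for Z with iid N(0,1) entries.
assert np.allclose(np.kron(A, B) @ np.kron(A, B).T, np.kron(A @ A.T, B @ B.T))
```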

Limitation of the Kronecker product.

The Kronecker product covariance matrix is not a strict generalization of the diagonal covariance matrix. To observe this, let Σ₁ = diag(a) and Σ₂ = diag(b) (this is the case of Louizos and Welling (2016)), and Σ = diag(σ), where a ∈ ℝᵐ, b ∈ ℝⁿ, and σ ∈ ℝᵐⁿ. Then Σ₁ ⊗ Σ₂ is also a diagonal matrix of size mn × mn. Equating Σ₁ ⊗ Σ₂ = Σ to solve for a and b will result in mn nonlinear equations, a_i b_j = σ_{ij}, with m + n variables, which can be over-determined for mn > m + n. For example, let m = n = 2, σ₁₁ = σ₁₂ = σ₂₁ = 1, and σ₂₂ = c for some c ≠ 1. Then the nonlinear system below does not have a solution:

a₁b₁ = 1,  a₁b₂ = 1,  a₂b₁ = 1,  a₂b₂ = c.

To see this, dividing a₁b₁ by a₁b₂ and dividing a₂b₁ by a₂b₂ yield b₁/b₂ = 1 and b₁/b₂ = 1/c, respectively, which doesn't have a solution if c ≠ 1. This is because the Kronecker product essentially enforces parameter sharing, which can heavily restrict the matrices it can represent.
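Taking c = 2, the obstruction can be checked mechanically: the constraint σ_{ij} = a_i b_j asks the 2 × 2 matrix of σ's to have rank one, which fails here (a small numpy check of our own).

```python
import numpy as np

# sigma_ij = a_i * b_j is a rank-one constraint on the matrix of sigmas.
Sigma = np.array([[1.0, 1.0],
                  [1.0, 2.0]])
print(np.linalg.matrix_rank(Sigma))  # 2 -> no exact (a, b) exists

# The best Kronecker (rank-one) fit is the top singular component, with
# residual equal to the second singular value:
U, s, Vt = np.linalg.svd(Sigma)
print(s[1])  # > 0: the irreducible approximation error
```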

To remedy the above limitation, we can further decouple the reparameterization of the parameter matrix into two parts: (1) one that models the marginal variance and (2) one that models the correlations. Assume S is a positive-valued m × n matrix, and let W = A(S ⊙ Z)Bᵀ, where ⊙ denotes elementwise multiplication, A ∈ ℝᵐˣᵐ, and B ∈ ℝⁿˣⁿ. Then vec(W) follows a Gaussian distribution with the following property, which is useful in calculating the KL divergence:

Property 1.

Let W be given as above. Then

  1. det Cov(vec(W)) = det(A)²ⁿ det(B)²ᵐ ∏_{i,j} S_{ij}², and

  2. tr Cov(vec(W)) = Σ_{i,j} S_{ij}² ‖A_{:,i}‖² ‖B_{:,j}‖².

See Appendix B for the derivation and interpretation of the property. Naive implementations of this can be inefficient and numerically unstable, as the entropy term involves computing the log-determinants of A and B, requiring standard automatic differentiation libraries to resort to singular value decomposition when the matrix is near-singular. Thus, we choose to parameterize A and B as lower triangular matrices⁴ with ones on the diagonal, leaving the uncertainty to be modeled by S. This means det Cov(vec(W)) = ∏_{i,j} S_{ij}².

⁴This can be achieved via masking.

(a) Random Gaussian matrix. (b) Random 3D Gaussian tensor.

Figure 1: Minimizing the KL divergence between q and a randomly initialized distribution p. The x-axis indicates the shape of the random matrix/tensor, sorted according to dimensionality. The shaded area is the error bar showing the standard deviation around the mean performance, averaged across 25 trials.

Simulation.

To validate the limited expressiveness of the Kronecker product, we randomly initialize p to be the density of a multivariate Gaussian with mean zero and with covariance being the square of a random standard Gaussian matrix. We choose the dimensionality such that it can be decomposed into a product of integers, and parameterize q using an independent Gaussian (dubbed Diag), the Kronecker product with diagonal factors (K-Diag), and the Kronecker product with elementwise scaling (K-Linear). We minimize KL(q‖p); see Figure 1 for the results. We also conduct the same experiment with 3D tensors (instead of matrices). We see that K-Diag consistently underperforms compared to Diag, which indicates that parameter sharing does restrict the family of distributions it can represent, while K-Linear is consistently better.

3.2 Nonlinear Kronecker Flow

In this section, we generalize the Kronecker product to more general non-linear mappings. In Appendix C, we make a connection to non-decreasing triangular maps (Villani, 2008), which are general enough to model any probability distribution.

First, notice that left-multiplying Z by B amounts to introducing linear correlation among the rows, applied to each of the columns. Likewise, right-multiplying by Aᵀ amounts to correlating the column entries of each row. Inspired by this, we consider applying an invertible mapping to each row of the random weight matrix, and another invertible mapping to each column of the matrix. We call this the Kronecker Flow⁵.

⁵To differentiate this from the K-Linear of the previous section, we refer to the variant using non-linear maps as K-Nonlinear.

Specifically, let g: ℝᵐ → ℝᵐ and h: ℝⁿ → ℝⁿ be invertible mappings. We define the matrix-to-matrix function K: ℝᵐˣⁿ → ℝᵐˣⁿ as K = K_h ∘ K_g, with the following batch operations (for i ∈ [m] and j ∈ [n]):

K_g(W)_{:,j} = g(W_{:,j}),    K_h(W)_{i,:} = h(W_{i,:}).

It is easy to verify that K is invertible. Due to the partial dependency of g and h, the Jacobians of the vectorized forms (after proper permutation) are block-diagonal, so we have

log |det J_K(W)| = Σ_{j∈[n]} log |det J_g(W_{:,j})| + Σ_{i∈[m]} log |det J_h(K_g(W)_{i,:})|.

In practice, we use the volume preserving version of RealNVP (Dinh et al., 2016) and the inverse autoregressive flow (IAF) (Kingma et al., 2016) to parameterize g and h in our experiments. The K-Linear of the previous section can be thought of as using a linear map for g and h.
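The sketch below (our illustration) instantiates this with volume-preserving additive couplings for g and h, so the total log-determinant is zero; swapping in RealNVP or IAF layers, as the paper does, would simply add per-column and per-row logdet terms.

```python
import torch

def additive_coupling(x, shift_net):
    """Volume-preserving coupling on the last dim of x: the second half is
    shifted by a function of the first half, so log|det J| = 0."""
    d = x.shape[-1] // 2
    x1, x2 = x[..., :d], x[..., d:]
    return torch.cat([x1, x2 + shift_net(x1)], dim=-1)

def kronecker_flow(W, g_shift, h_shift):
    """Apply a flow g to every column and a flow h to every row of W (m x n)."""
    W = additive_coupling(W.T, g_shift).T  # columns of W, batched over n
    W = additive_coupling(W, h_shift)      # rows of W, batched over m
    return W, torch.zeros(())              # both couplings are volume preserving

m, n = 6, 8
g = torch.nn.Linear(m // 2, m - m // 2)    # stand-ins for the coupling networks
h = torch.nn.Linear(n // 2, n - n // 2)
W, logdet = kronecker_flow(torch.randn(m, n), g, h)
```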

4 Concentration of empirical KL with normalizing flows

In their study, Dziugaite and Roy (2017) use an independent Gaussian for Q to minimize the McAllester bound, so they can compute the KL between Gaussians analytically. This is no longer feasible when we use more flexible families for Q, such as normalizing flows. Moreover, a Monte Carlo estimate might result in underestimating the bound after inverting the KL between Bernoullis on the LHS of Equation 2 (which is a concave function; see Appendix A of Reeb et al. (2018) for an illustration). This necessitates a high-probability bound on the concentration of the empirical estimate.

In Section 2.3, we established that KL(q‖p) can be written in the following form:

KL(q‖p) = E_{z∼q₀}[ log q₀(z) − log |det J_g(z)| − log p(g(z)) ],

where both q₀ and p are standard Gaussian (the mean and variance can be absorbed into the invertible mapping if this is not the case).

The first term in the KL can be computed analytically. The second term usually can be almost surely bounded (e.g. using block neural autoregressive flows), so that we can use a Hoeffding-type concentration, or it can simply be made zero using e.g. volume preserving flows. The challenge now lies in the third term, which has the quadratic form ½‖g(z)‖², neglecting the normalizing constant.

Now assume g is L-Lipschitz⁶. Let X = ‖g(z)‖. Then X is an L-Lipschitz function of z:

| ‖g(z)‖ − ‖g(z′)‖ | ≤ ‖g(z) − g(z′)‖ ≤ L‖z − z′‖.

⁶The following flows are all Lipschitz (with proper activation functions): the volume preserving versions of Dinh et al. (2016) and Kingma et al. (2016), Berg et al. (2018), Behrmann et al. (2018), De Cao et al. (2019), etc.

This is key in deriving a tail bound on X², as Lipschitz functions of canonical Gaussian random variables are sub-Gaussian, meaning they have a tail that decays faster than that of a Gaussian random variable. The following theorem provides a concentration bound for the empirical average of X² similar to that of a chi-square random variable, as X² (the square of a sub-Gaussian) is sub-exponential.

Theorem 3.

Let X = ‖g(z)‖ be defined as above, with Lipschitz constant L, and let X₁, …, X_k be i.i.d. copies of X. Then the following concentration bound holds:

P( (1/k) Σ_{i∈[k]} X_i² − E[X²] ≥ t ) ≤ exp( −(k/2) · t² / (4b² + 2bt) ),

where b = ‖X²‖_{ψ₁}.

Note that in practice the empirical KL that we use is inversely scaled by the size of the training set n (see Equation 2), so the Lipschitz constant can be made small in practice to dominate the dimensionality.
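The quadratic term is cheap to probe empirically; the numpy snippet below (our illustration) uses a unit lower-triangular linear map as a stand-in for a volume-preserving Lipschitz flow and shows the Monte Carlo average of ½‖g(z)‖² concentrating around its mean.

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 100, 10000
# A unit lower-triangular matrix is a simple volume-preserving linear "flow".
T = np.tril(0.1 * rng.standard_normal((d, d)), -1) + np.eye(d)
L = np.linalg.norm(T, 2)          # its Lipschitz constant (spectral norm)

z = rng.standard_normal((k, d))
quad = 0.5 * np.sum((z @ T.T) ** 2, axis=1)   # the quadratic term 0.5 ||g(z)||^2
print(quad.mean(), 0.5 * np.trace(T @ T.T))   # empirical average vs. exact mean
print(quad.std() / np.sqrt(k))                # Monte Carlo error, O(1/sqrt(k))
```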

5 Experiments

We evaluate our proposed method on two prediction tasks (Section 5.1), PAC-Bayes bound minimization (Section 5.2), and contextual bandits (Section 5.3). For the two prediction tasks, we use the MNIST handwritten digit dataset (Lecun et al., 1998) and CIFAR-10 (Krizhevsky, 2009). See Appendix E for a detailed description.

Dataset  L2   FFG  MNFG  Diag  K-Diag  K-Linear  K-Nonlinear
MNIST    0.6  0.9  0.7   0.92  0.67    0.70      0.60
CIFAR-5  24   22   16    19.0  -       16.8      17.4
Table 1: Test error (%) with LeNet on MNIST and the first 5 classes of CIFAR-10. The first 3 columns are from Louizos and Welling (2017). K-Diag on CIFAR-5 diverged, so we do not include the result.
Setup  SGD    KFAC   BBB    Noisy-KFAC  Diag   K-Diag  K-Linear  K-Nonlinear
R      18.21  17.61  17.18  14.48       17.71  16.71   14.65     14.74
D      11.65  11.11  11.69  10.65       10.69  13.65   11.35     9.88
Table 2: Test error (%) with the modified version of VGG16 on CIFAR-10. The first 4 columns are from Zhang et al. (2017). R means regular training and D means training with data augmentation.

5.1 Classification

In this section, we evaluate the generalization performance of our method applied to Bayesian neural networks. We consider two architectures: LeNet-5 (Lecun et al., 1998) and a modified version of VGG-16 (Simonyan and Zisserman, 2014) proposed by Zhang et al. (2017).

We first compare to the multiplicative normalizing flow (MNFG) proposed by Louizos and Welling (2017), applying our method to LeNet-5 (see Table 1). Our Diag matches the performance of their FFG (fully factorized Gaussian). K-Diag outperforms Diag in this case, perhaps due to the smaller number of parameters which makes it easier to optimize. K-Nonlinear yields the best generalization error in this case. On the CIFAR-5 experiment (we take the first 5 classes of CIFAR-10), our methods are on par with MNFG.

Second, we compare with the noisy K-FAC proposed by Zhang et al. (2017), applying our methods to the larger VGG-16 architecture (see Table 2). Noisy K-FAC applies an approximate natural gradient method. Despite this advantage, our methods (K-Linear and K-Nonlinear) have similar prediction accuracy in the regular setup. We also include the results with data augmentation (horizontal flip and random crop), where K-Nonlinear outperforms all the other methods.

5.2 PAC-Bayes bound minimization

Bound          Pinsker bound             Catoni bound                             Catoni bound (LeNet-5)
Flow           Diag        K-Linear      Diag        K-Linear    K-Nonlinear      D      K-D    K-L    K-N
Hidden layers  1     2     1     2       1     2     1     2     1     2          -      -      -      -
               6.62  6.00  6.09  5.90    8.04  7.66  8.10  8.33  5.96  5.90       2.12   2.95   2.00   2.01
               6.66  6.12  5.98  5.96    7.78  7.70  7.98  8.26  5.83  5.76       2.31   2.87   1.91   2.14
bound          23.77 25.94 21.69 25.33   24.11 26.41 22.88 26.43 20.41 22.53      10.83  12.96  10.09  10.03
KL             5968  7829  5292  7554    5001  6555  4334  5996  4725  5921       3177   3477   2913   2873
Table 3: PAC-Bayes bound estimation: We minimize the Pinsker bound (an upper bound on the McAllester bound) and the Catoni bound using different flows, and estimate the McAllester bound at inference time using Newton's method. Columns 1 and 2 refer to the number of hidden layers of the MLP; the last four columns (D, K-D, K-L, K-N) are for LeNet-5.

For the PAC-Bayes bound estimation, we minimize Equation 3. We follow the recipe of Dziugaite and Roy (2017). We upper bound the zero-one loss by the cross-entropy divided by log k (where k is the number of classes) to make the upper bound tight. We set the prior to be N(θ₀, σ_p²I), where θ₀ is the initial value of the parameters, and apply a union bound to tune the prior variance σ_p². We also tune the coefficient β as a parameter during training⁷, and report the McAllester bound for comparison (since it is the tightest). For more details, see Dziugaite and Roy (2017).

⁷We are allowed to do so since we treat Equation 3 as an optimization objective, rather than report it as a bound. We report the McAllester bound, which holds for any Q, even if it depends on β.
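At inference time, inverting the Bernoulli KL on the LHS of Equation 2 has no closed form, but the KL is monotone in its second argument, so a simple search suffices. The sketch below (our illustration; the table caption mentions Newton's method, for which bisection is a robust drop-in) computes the McAllester bound from an empirical risk estimate.

```python
import math

def binary_kl(q, p):
    """kl(q || p) between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    return (q * math.log((q + eps) / (p + eps))
            + (1 - q) * math.log((1 - q + eps) / (1 - p + eps)))

def kl_inverse(emp_risk, budget, tol=1e-9):
    """Largest p with kl(emp_risk || p) <= budget, via bisection."""
    lo, hi = emp_risk, 1.0 - 1e-12
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binary_kl(emp_risk, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

def mcallester_bound(emp_risk, kl, n, delta):
    # Complexity term follows the reconstruction of Equation 2 above.
    return kl_inverse(emp_risk, (kl + math.log(2 * math.sqrt(n) / delta)) / n)

print(mcallester_bound(0.02, 5000.0, 60000, 0.025))  # illustrative numbers
```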

We test with a multi-layer perceptron with 1 or 2 hidden layers of 600 neurons, and with LeNet-5, evaluated on the MNIST dataset (see Table 3). For further clarification, we follow the steps of Dziugaite and Roy (2017) by minimizing the McAllester bound, using Pinsker's inequality to bound the inverse of the Bernoulli KL (which we call the Pinsker bound). Since this bound has a square root in the complexity term, we can only use the Gaussian family with an analytic form of the KL. The result we obtain is slightly looser than Dziugaite and Roy (2017), since we have a 10-class problem while they deal with a binary version of MNIST. We see that the bound can indeed be improved by capturing the correlation among the parameters. We then compare to minimizing the Catoni bound, which is slightly looser since the linear relationship between the empirical risk and the KL term penalizes the latter more when the KL is larger. However, by modelling the non-linear dependencies, K-Nonlinear clearly outperforms the other methods (even compared to the ones minimizing the Pinsker bound). This indicates there exists a considerable amount of structure in the parameter space that may explain the gap between the test error and the generalization bound.

We also notice that, despite the linear relationship, the Catoni bound focuses more on the complexity term than the ELBO does. For example, the empirical risks of LeNet-5 in Table 3 are much larger than the test errors in Table 1. There are two reasons: (1) the optimal β in Equation 3 is larger than 1 (depending on the relative value of the KL), and (2) to properly upper bound the zero-one loss, we scale down the cross-entropy loss by log k during optimization. These two factors make it a more conservative training algorithm than Bayesian inference. This explains the hyperparameter tuning that is usually done in practice; e.g. when we decrease the β coefficient, we allow the model to focus more on minimizing the empirical risk, at the cost of a weaker generalization guarantee.

5.3 Contextual bandit

Bandit     SGD            fBNN           Diag           K-Diag         K-Linear       K-Nonlinear
Mushroom   4.06 ± 1.23    3.91 ± 1.55    2.16 ± 0.51    2.41 ± 1.28    1.85 ± 0.27    3.47 ± 0.83
Statlog    1.29 ± 0.35    0.73 ± 0.03    1.01 ± 0.06    0.84 ± 0.22    0.81 ± 0.01    0.79 ± 0.08
Covertype  30.01 ± 0.37   32.03 ± 0.70   28.42 ± 0.52   29.19 ± 0.28   28.13 ± 0.22   28.06 ± 0.26
Financial  6.08 ± 0.83    7.27 ± 1.90    7.43 ± 0.99    5.88 ± 0.44    5.88 ± 0.61    5.78 ± 0.49
Jester     56.24 ± 3.36   59.70 ± 4.31   59.34 ± 3.92   57.17 ± 3.15   57.66 ± 3.67   57.96 ± 4.48
Adult      79.31 ± 0.83   84.45 ± 1.43   76.32 ± 0.16   77.28 ± 0.03   75.94 ± 0.21   77.30 ± 0.46
Table 4: Cumulative regret incurred by different algorithms on the bandit benchmarks described in Riquelme et al. (2018). Values reported are the mean over 3 independent trials ± the standard error of the mean, normalized with respect to the performance of the uniform policy.

Uncertainty modeling lies at the heart of the exploration–exploitation dilemma in sequential decision-making. In order to maximize its collected cumulative rewards, an agent should trade off exploring different actions to gain more knowledge about the reward estimates vs. exploiting the current estimates and allocating resources to the actions that are likely rewarding. Thompson sampling (TS) (Thompson, 1933) is one of the popular approaches that deal with this trade-off, by maintaining a posterior distribution over reward models and randomizing actions on the basis of their probability of being optimal.

In this section, we investigate the effectiveness of our proposed method for performing approximate Thompson sampling in the particular setting of contextual bandits. In this setting, at each time t, the agent sees a d-dimensional context X_t, selects one of the K available actions, a_t, and earns a reward r_t generated by the environment. The agent aims to minimize its cumulative regret, defined as R_T = E[ Σ_{t∈[T]} (r_t* − r_t) ], where r_t* is the highest expected reward given the context, and the expectation is over the randomness of both the environment and the agent's choice of actions.
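For orientation, the sketch below shows the generic approximate Thompson sampling loop we run (our illustration; `env` and `posterior` are assumed interfaces, not a real library, and the update schedule mirrors the one described in the next paragraph).

```python
import numpy as np

def thompson_sampling(env, posterior, n_steps, update_every=50):
    """Generic approximate Thompson sampling loop (illustrative sketch)."""
    history = []
    for t in range(n_steps):
        context = env.observe()                      # d-dimensional context X_t
        theta = posterior.sample()                   # one draw of the parameters
        rewards = posterior.predict(theta, context)  # estimated reward per action
        action = int(np.argmax(rewards))             # act greedily w.r.t. the sample
        reward = env.step(action)
        history.append((context, action, reward))
        if (t + 1) % update_every == 0:
            posterior.update(history)                # e.g. 200 iterations, batch 512
    return history
```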

We compare different methods on a range of real-world bandit problems introduced by Riquelme et al. (2018). We train the models every 50 time steps for 200 iterations using a batch size of 512. We ran each experiment with 3 different random seeds, and we report the means and standard errors of the cumulative regret, normalized with respect to the uniform baseline, in Table 4. We include the functional variational Bayesian neural network (fBNN), recently introduced by Sun et al. (2019), as a baseline, using their open-sourced implementation of fBNN in the bandit setting. From Table 4, we see that across the 6 bandit problems, our proposed methods (K-Linear and K-Nonlinear) provide competitive and consistent results. They outperform the other baselines in 4 out of 6 problems.

6 Conclusion

In this work, we present the Kronecker Flow, a flow-based method for inducing complex distributions, inspired by the Kronecker product. Our methods scale to larger architectures such as VGG-16, since they take advantage of the shape of the parameters. We demonstrate that our methods work better than the vanilla Kronecker product with diagonal matrices in multiple setups, including classification and approximate Thompson sampling in contextual bandits, and outperform existing methods in the Bayesian neural network literature. We are also the first to apply flow-based methods to obtain a tighter numerical generalization bound.

References

Appendix A Law of the unconscious statistician

Let (Ω, ℱ, P) be our probability space. Let Z be a random variable following the (Lebesgue) density q₀, with μ_Z being its pushforward measure, and write Θ = g(Z), with q being its density and μ_Θ being its pushforward measure. Then

E_{Θ∼q}[ L̂(Θ) ] = ∫ L̂ dμ_Θ = ∫ L̂ ∘ g dμ_Z = E_{Z∼q₀}[ L̂(g(Z)) ].

Appendix B Derivation and interpretation of Property 1

We first derive Property 1 algebraically, and then give an interpretation that can be generalized to higher-dimensional tensor operations. Recall that we have the following givens:

  • Assume Z is an m × n random Gaussian matrix with i.i.d. standard normal entries.

  • Assume A ∈ ℝᵐˣᵐ and B ∈ ℝⁿˣⁿ.

  • S ∈ ℝᵐˣⁿ is positive-valued.

  • W = A(S ⊙ Z)Bᵀ.

If we rescale Z elementwise by S before inducing the column-wise and row-wise correlation, we have (superscript ⊙2 is the Hadamard power):

Cov(vec(W)) = (B ⊗ A) diag(vec(S^{⊙2})) (B ⊗ A)ᵀ.

If S is a matrix of ones, the RHS equals (BBᵀ) ⊗ (AAᵀ), which is the covariance of the matrix normal.

Generally, S might not be a matrix of ones. But we can still compute the determinant and trace of the covariance matrix (useful in computing the KL):

det Cov(vec(W)) = det(A)²ⁿ det(B)²ᵐ ∏_{i,j} S_{ij}²,    tr Cov(vec(W)) = Σ_{i,j} S_{ij}² ‖A_{:,i}‖² ‖B_{:,j}‖².

Interpretation of the determinant and trace.

The determinant measures the change in volume due to the linear map. Since each operation (elementwise multiplication with S, left-multiplication with A, and right-multiplication with Bᵀ) is an invertible map, the determinant of the composition is a product of determinants. After the elementwise multiplication with S (hence the factor ∏_{i,j} S_{ij}²)⁸, we apply the same linear map A to each of the columns; in vectorized form, this corresponds to left-multiplication with a block diagonal matrix of A's, hence the factor det(A)²ⁿ. The same reasoning explains det(B)²ᵐ.

⁸The power 2 comes from the fact that we are looking at the determinant of the covariance. Direct computation of the likelihood involves the unsquared product, which is equivalent to the log-determinant of the invertible map.

The trace of the covariance can be written as Σ_{i,j} Var(W_{ij}), i.e. the sum of the marginal variances. Each W_{ij} is a linear combination of the entries of Z, which have unit variance and are uncorrelated, so by the additive property of the variance of a sum of uncorrelated random variables and the quadratic scaling property of variance, Var(W_{ij}) = Σ_{k,l} A_{ik}² S_{kl}² B_{jl}². Summing over i and j gives the trace formula above.
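Both formulas are easy to verify numerically; below is a small numpy check of our own, written against the reconstruction above.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 2
A, B = rng.standard_normal((m, m)), rng.standard_normal((n, n))
S = np.abs(rng.standard_normal((m, n))) + 0.1

# Cov(vec(W)) for W = A (S * Z) B^T, with vec stacking columns ('F' order):
C = np.kron(B, A) @ np.diag((S ** 2).flatten(order="F")) @ np.kron(B, A).T

# Determinant: det(A)^(2n) * det(B)^(2m) * prod(S^2)
det_formula = (np.linalg.det(A) ** (2 * n) * np.linalg.det(B) ** (2 * m)
               * np.prod(S ** 2))
assert np.allclose(np.linalg.det(C), det_formula)

# Trace: sum_{ij} S_ij^2 * ||A[:, i]||^2 * ||B[:, j]||^2
tr_formula = sum(S[i, j] ** 2 * (A[:, i] ** 2).sum() * (B[:, j] ** 2).sum()
                 for i in range(m) for j in range(n))
assert np.allclose(np.trace(C), tr_formula)
```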

Appendix C Connection to triangular maps.

Much of the recent work on normalizing flows has been dedicated to inverse autoregressive transformations (Kingma et al., 2016; Huang et al., 2018; Müller et al., 2018; De Cao et al., 2019; Jaini et al., 2019), as they are general enough to induce any density function (Hyvärinen and Pajunen, 1999; Bogachev et al., 2005; Villani, 2008). When such transformations are used for g and h, the overall transformation is also a triangular map, since each output coordinate depends only on the coordinates that precede it in the ordering. Such a function has some "blind spots" similar to the ones discovered by Van den Oord et al. (2016). One avenue for improvement is to design a transformation that increases the connectivity. Another avenue for improvement is to condition each (row-wise or column-wise) transformation on a learnable embedding of the row/column, such that each row/column is transformed by a slightly different function than another⁹.

⁹We tried this idea in the preliminary stage of the project, but found it harder to optimize. This is potentially due to the extra parameters that have to be learned.

Appendix D Tail bound of empirical KL

We begin with some preliminaries and lemmas in Section D.1, and prove the main result in Section D.2.

D.1 Basic tail bounds and Bernstein inequality

The tools developed in this section serve to translate the coefficients (such as the variance) between sub-Gaussian and sub-exponential random variables. We start with the definition of sub-Gaussians:

Definition 1.

We write X ∈ 𝒢(σ²) if X is a random variable satisfying E[exp(λX)] ≤ exp(λ²σ²/2) for all λ ∈ ℝ.

We write Γ for the Gamma function: Γ(t) = ∫₀^∞ x^{t−1} e^{−x} dx. Note that for positive integers k, Γ(k) = (k − 1)!. The following lemma gives an upper bound on the moments of a sub-Gaussian.

Lemma 4.

For X ∈ 𝒢(σ²), for any integer k ≥ 1, E[|X|ᵏ] ≤ k (2σ²)^{k/2} Γ(k/2).

Proof.

Since |X| is non-negative, similar to Lemma 6, we have

E[|X|ᵏ] = ∫₀^∞ P(|X|ᵏ > t) dt ≤ ∫₀^∞ 2 exp( −t^{2/k} / (2σ²) ) dt = k (2σ²)^{k/2} Γ(k/2),

where we let u = t^{2/k}/(2σ²) and dt = (k/2)(2σ²)^{k/2} u^{k/2−1} du. ∎

The following definition is the main tool for translating the coefficients.

Definition 2.

Let X be a random variable. For integer p ≥ 1, define the p-Orlicz norm as

‖X‖_{ψ_p} = inf{ t > 0 : E[exp(|X|ᵖ / tᵖ)] ≤ 2 },

i.e., the smallest constant t for which the super-exponential moment of X/t is bounded by 2. The Orlicz norm is infinity if there is no finite t for which the expectation exists.

It is easy to verify that the Orlicz norm is indeed a norm. We call ‖·‖_{ψ₂} the sub-Gaussian norm, and ‖·‖_{ψ₁} the sub-exponential norm. Note that ‖X²‖_{ψ₁} = ‖X‖²_{ψ₂}.

The following lemma upper bounds the sub-Gaussian norm by its variance.

Lemma 5.

If X ∈ 𝒢(σ²), then ‖X‖_{ψ₂} ≤ √6 σ.

Proof.

By the power series expansion of the exponential function,

E[exp(X²/t²)] = 1 + Σ_{k≥1} E[X^{2k}] / (k! t^{2k}) ≤ 1 + Σ_{k≥1} 2k (2σ²)ᵏ Γ(k) / (k! t^{2k}) = 1 + 2 Σ_{k≥1} (2σ²/t²)ᵏ,

where we used Lemma 4 for the inequality. The RHS converges and is equal to 2 if 2σ²/t² = 1/3, i.e. t² = 6σ². Thus, ‖X‖_{ψ₂} ≤ √6 σ. ∎

The following lemma gives an upper bound on the moments of sub-exponential random variables.

Lemma 6.

If ‖X‖_{ψ₁} ≤ b for some b < ∞, then E[|X|ᵏ] ≤ 2 bᵏ Γ(k + 1).

Proof.

By Markov's inequality,

P(|X| > t) = P( exp(|X|/b) ≥ exp(t/b) ) ≤ 2 exp(−t/b).

For the k-th moment, since |X| is non-negative,

E[|X|ᵏ] = ∫₀^∞ P(|X|ᵏ > t) dt ≤ ∫₀^∞ 2 exp( −t^{1/k}/b ) dt = 2 bᵏ Γ(k + 1),

where we let u = t^{1/k}/b and dt = k bᵏ u^{k−1} du. ∎

Finally, we derive a concentration bound for sub-exponential random variables.

Theorem 7.

(Bernstein's inequality for sub-exponential random variables) Let X₁, …, X_k be independent real-valued random variables satisfying ‖X_i‖_{ψ₁} ≤ b, with mean μ = E[(1/k) Σ_i X_i], and let X̄ = (1/k) Σ_i X_i. Then, for any t > 0, the following concentration bound holds:

P( X̄ − μ ≥ t ) ≤ exp( −(k/2) · t² / (4b² + 2bt) ).

Proof.

Let Y_i = X_i − E[X_i]. By the triangle inequality of the Orlicz norm, ‖Y_i‖_{ψ₁} = O(b), and by Lemma 6 the moments of Y_i satisfy a Bernstein-type condition for integers k ≥ 2. Then by Corollary 2.11 of Boucheron et al. (2013), we have the desired bound. ∎

D.2 Proof of Theorem 3

Since g is L-Lipschitz, according to Theorems 5.5 and 5.6 of Boucheron et al. (2013), X − E[X] ∈ 𝒢(L²). And we have that

‖X‖_{ψ₂} ≤ ‖X − E[X]‖_{ψ₂} + ‖E[X]‖_{ψ₂}

due to the triangle inequality of the norm. Now since g is L-Lipschitz, its expectation can be bounded by E[X] ≤ ‖g(0)‖ + L·E[‖z‖]. Since z is standard-normally distributed, ‖z‖ follows the chi distribution with d degrees of freedom, which has an expectation that can be upper-bounded using Gautschi's inequality (using Wendel's version of the upper bound):

E[‖z‖] = √2 · Γ((d+1)/2) / Γ(d/2) ≤ √d.

Combining the above and using Lemma 5 yields an explicit bound on ‖X²‖_{ψ₁} = ‖X‖²_{ψ₂}. Setting b to be the resulting bound and applying Theorem 7 yield the desired result.

Appendix E Experimental Details

For the predictive tasks (Section 5.1), we use a cosine annealing schedule for the learning rate, scaling down to 0.01 of the initial learning rate, and pretrain a deterministic network for 10 epochs using the Adam optimizer with a learning rate of 0.001 to initialize the mean of the Gaussian, and then train for 200 epochs.

LeNet-5 MNIST.

We use a linear annealing schedule for the β coefficient (from 0 to 1) over 50,000 iterations. We use the Adam optimizer with a learning rate of 0.0005. The result we get for K-Linear uses Polyak averaging with an exponential decay coefficient of 0.995. We use the volume preserving version of RealNVP for K-Nonlinear. We use a standard Gaussian prior.

LeNet-5 CIFAR-5

We use the same architecture as Louizos and Welling (2017) (192 convolutional kernels and 1,000 hidden units for the fully connected layers). We use a linear annealing schedule for the β coefficient (from 0 to 1) over 20,000 iterations for Diag, and no annealing for K-Linear and K-Nonlinear. We use the Adam optimizer with learning rates of 0.0003, 0.0003, and 0.0005 for Diag, K-Linear, and K-Nonlinear, respectively. We use the volume preserving version of RealNVP for K-Nonlinear. We use a standard Gaussian prior.

VGG-16 CIFAR-10

We use the modified version of VGG-16 proposed by Zhang et al. (2017). We use a learning rate of 0.0005 for all experiments except K-Nonlinear in the regular setup (where we use 0.001). We use an isotropic Gaussian prior with variance 0.1, and set β to [0.5, 0.1, 0.1, 0.5] in the regular setup and [0.5, 0.1, 0.1, 0.1] in the data augmented setup for Diag, K-Diag, K-Linear, and K-Nonlinear, respectively. We use the volume preserving version of RealNVP for K-Nonlinear.

PAC-Bayes MLP

We follow the same steps as Dziugaite and Roy (2017), except we did not discretize the prior variance after tuning; in practice this does not affect the bound much. We also did not initialize the mean of the posterior using SGD in our experiments. We train the stochastic network for 300 epochs, with a learning rate of 0.002. The bound holds with probability at least 0.965 over the choice of the prior and the training set. The b and c coefficients of Dziugaite and Roy (2017) are set to 100 and 0.1. We use the volume preserving version of IAF for K-Nonlinear.

PAC-Bayes LeNet-5

The same setup as PAC-Bayes MLP, except with Polyak averaging with coefficient 0.995. We use the volume preserving version of IAF for K-Nonlinear.

Bandit Benchmark

All the models share the same architecture: one hidden layer with 50 units. We use the volume preserving version of RealNVP for K-Nonlinear. We train the models every 50 time steps for 200 training iterations using a batch size of 512.

Bandit problem number of actions number of contexts
Mushroom 2 50000
Statlog 7 43500
Covertype 7 50000
Financial 8 3713
Jester 8 19181
Adult 14 45222
Table 5: Description of the bandit problems: the number of actions and the number of contexts used in the experiments. Compared to the Riquelme et al. (2018) benchmark, we restrict ourselves to 50,000 contexts for Covertype instead of 150,000.