Bayesian machine learning is a popular framework for dealing with uncertainty in a principled way by integrating over model parameters rather than finding point estimates (Bishop, 2006; Barber, 2012; Ghahramani, 2015). Unfortunately, exact inference is usually not feasible due to the intractable normalization constant of the posterior. A popular alternative is variational inference (Wainwright et al., 2008), where a tractable approximate distribution is optimized to resemble the true posterior as closely as possible. Due to its amenability to stochastic gradient descent (Hoffman et al., 2013; Kingma & Welling, 2013; Titsias & Lázaro-Gredilla, 2014; Rezende & Mohamed, 2015), variational inference is scalable to large models and datasets.
The most common choice for the variational posterior is a factorized Gaussian. Outside of Bayesian inference, parameter noise has been found to be an effective regularizer (Graves et al., 2013; Plappert et al., 2018; Fortunato et al., 2018), e.g. for training neural networks. In combination with $L_2$-regularization, additive Gaussian parameter noise corresponds to variational inference with a Gaussian approximate posterior with fixed variance. Interestingly, it has been observed that flexible posteriors can perform worse than simple ones (Turner & Sahani, 2011; Trippe & Turner, 2018; Braithwaite & Kleijn, 2018; Shu et al., 2018).
Variational inference follows the Minimum Description Length (MDL) principle (Rissanen, 1978, 1983; Hinton & van Camp, 1993), a formalization of Occam’s Razor. Loosely speaking, it states that of two models describing the data equally well, the ‘simpler’ one should be preferred. However, MDL is only an objective for compressing the training data and the model, and makes no formal statement about generalization to unseen data. Yet, generalization to new data is a key property of a machine learning algorithm.
Recent work (Xu & Raginsky, 2017; Bu et al., 2019; Bassily et al., 2018; Russo & Zou, 2015) has proposed upper bounds on the generalization error as a function of the mutual information between model parameters and training data. These bounds state that the gap between train and test error can be reduced by limiting the mutual information. However, to the best of our knowledge, these bounds have so far not been linked to specific inference methods.
In this work, we show that Gaussian mean field inference in models with Gaussian priors can be reinterpreted as point estimation in corresponding noisy models. This leads to an upper bound on the mutual information between model parameters and data through the data processing inequality. Our result holds for both supervised and unsupervised models. We discuss the connection to generalization bounds from Xu & Raginsky (2017) and Bu et al. (2019), suggesting that Gaussian mean field aids generalization. In our experiments, we show that limiting model capacity via mutual information is an effective measure of regularization, further supporting our theoretical framework.
2 Regularization through Mean Field
In our derivation, we denote a generic model as $p(\theta, \mathcal{D})$ with unobserved variables $\theta$ and data $\mathcal{D}$. We refer to $\theta$ as the model parameters; however, in latent variable models $\theta$ can also include the per-data-point latents. The model consists of a prior $p(\theta)$ and a likelihood $p(\mathcal{D} \mid \theta)$. Ideally, one would like to find the posterior $p(\theta \mid \mathcal{D}) = p(\theta)\, p(\mathcal{D} \mid \theta) / p(\mathcal{D})$, where $p(\mathcal{D})$ is the normalizer. However, calculating $p(\mathcal{D})$ is typically intractable. Variational inference finds an approximation by maximizing the evidence lower bound (ELBO)

$$\log p(\mathcal{D}) \geq \mathbb{E}_{q(\theta)}\left[\log p(\mathcal{D} \mid \theta)\right] - \mathrm{KL}\left(q(\theta) \,\|\, p(\theta)\right) \tag{1}$$

w.r.t. the approximate posterior $q(\theta)$. Our focus in this work lies on Gaussian mean field inference, so

$$q(\theta) = \prod_i \mathcal{N}(\theta_i; \mu_i, \sigma_i^2)$$

is a fully factorized normal distribution with learnable mean $\mu$ and variance $\sigma^2$. The prior is also chosen to be component-wise independent, $p(\theta) = \prod_i p(\theta_i)$. The generative and inference models for this setting are shown in Figure 1a.
2.1 Fixed-Variance Gaussian Mean Field Inference
When the variance of the approximate posterior is fixed to some constant $\sigma_n^2$, the ELBO can be written (up to an additive constant) as

$$\mathcal{L}(\mu) = \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, \sigma_n^2 I)}\left[\log p(\mathcal{D} \mid \mu + \varepsilon)\right] - \sum_i \frac{\mu_i^2}{2 \sigma_p^2} + \text{const} \tag{2}$$

which is optimized with respect to the mean $\mu$. We use $i$ to denote the parameter index, and $\sigma_p^2$ denotes the variance of the Gaussian prior.
To show how Gaussian mean field implicitly limits learned information, we extend the model with a noisy version of the parameters and let the likelihood depend on those noisy parameters. We choose the noise distribution to be the same as the inference distribution for the original model and find a lower bound on the log-joint of the noisy model. This leads to the same objective as mean-field variational inference in the original model.
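To make this equivalence concrete, the following sketch evaluates such an objective on a toy linear-Gaussian likelihood (the data, the likelihood, and names such as `fixed_var_elbo` are illustrative assumptions of ours, not taken from the paper): the expected log-likelihood under additive Gaussian parameter noise, minus the weight-decay term arising from the Gaussian prior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data; a Gaussian likelihood is assumed purely for illustration.
X = rng.normal(size=(50, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=50)

def log_lik(w):
    """Gaussian log-likelihood of the data given parameters w (up to constants)."""
    resid = y - X @ w
    return -0.5 * np.sum(resid ** 2)

def fixed_var_elbo(mu, sigma_n=0.1, sigma_p=1.0, n_samples=256):
    """Monte Carlo estimate of E_eps[log p(D | mu + eps)] - sum_i mu_i^2 / (2 sigma_p^2).

    This is simultaneously (a) the fixed-variance mean field ELBO up to a constant
    and (b) a lower bound on the log-joint of the corresponding noisy model.
    """
    eps = sigma_n * rng.normal(size=(n_samples, mu.size))
    expected_ll = np.mean([log_lik(mu + e) for e in eps])
    weight_decay = np.sum(mu ** 2) / (2.0 * sigma_p ** 2)
    return expected_ll - weight_decay
```

Optimizing `mu` under this objective is ordinary point estimation with parameter noise and $L_2$-regularization, which is the observation the argument below builds on.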
Specifically, we define the noisy model as visualized in Figure 1b. We use $\tilde{p}$ to emphasize the distinction between distributions of the modified noisy model and the original one. As in the original model, $\theta$ represents the parameters (with the same prior), i.e. $\tilde{p}(\theta) = p(\theta)$. We denote the noisy parameters as $\tilde{\theta} = \theta + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma_n^2 I)$, so that $\tilde{p}(\tilde{\theta} \mid \theta) = \mathcal{N}(\tilde{\theta}; \theta, \sigma_n^2 I)$. The likelihood remains unchanged, i.e. $\tilde{p}(\mathcal{D} \mid \tilde{\theta}) = p(\mathcal{D} \mid \tilde{\theta})$, except that it now depends on the noisy parameters instead of the ‘clean’ ones.
We now show that maximizing a lower bound on the log joint probability of the noisy model results in an identical objective as for variational inference in the clean model:

$$\log \tilde{p}(\mathcal{D}, \theta) = \log p(\theta) + \log \int \tilde{p}(\mathcal{D} \mid \tilde{\theta})\, \tilde{p}(\tilde{\theta} \mid \theta)\, \mathrm{d}\tilde{\theta} \tag{3}$$
$$= \log p(\theta) + \log \mathbb{E}_{\tilde{p}(\tilde{\theta} \mid \theta)}\left[p(\mathcal{D} \mid \tilde{\theta})\right] \tag{4}$$
$$\geq \log p(\theta) + \mathbb{E}_{\tilde{p}(\tilde{\theta} \mid \theta)}\left[\log p(\mathcal{D} \mid \tilde{\theta})\right] \tag{5}$$
$$= \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, \sigma_n^2 I)}\left[\log p(\mathcal{D} \mid \mu + \varepsilon)\right] - \sum_i \frac{\mu_i^2}{2 \sigma_p^2} + \text{const} \tag{6}$$

where Equation 5 follows from Jensen’s inequality as in Equation 1. In the final equation we have replaced $\theta$ with $\mu$ (which is simply a change of names since we are maximizing the objective over this free variable) to emphasize that the objective functions are identical.
Since $\mathcal{D}$ is independent of $\theta$ given $\tilde{\theta}$, the joint

$$\tilde{p}(\mathcal{D}, \tilde{\theta}, \theta) = \tilde{p}(\mathcal{D} \mid \tilde{\theta})\, \tilde{p}(\tilde{\theta} \mid \theta)\, \tilde{p}(\theta)$$

forms a Markov chain $\theta \rightarrow \tilde{\theta} \rightarrow \mathcal{D}$, and the data processing inequality (Cover & Thomas, 2012) limits the mutual information between learned parameters and data through

$$I(\theta; \mathcal{D}) \leq I(\theta; \tilde{\theta}). \tag{7}$$

The upper bound is given by

$$I(\theta; \tilde{\theta}) = \frac{d}{2} \log\left(1 + \frac{\sigma_p^2}{\sigma_n^2}\right), \tag{8}$$

where $d$ denotes the number of parameters. Here, we exploit that $\theta_i$ and $\varepsilon_i$ are Gaussian with $\theta_i \sim \mathcal{N}(0, \sigma_p^2)$ and $\varepsilon_i \sim \mathcal{N}(0, \sigma_n^2)$. This quantity is known as the capacity of channels with Gaussian noise in signal processing (Cover & Thomas, 2012). Intuitively, a high prior variance corresponds to a large capacity, while a high noise variance reduces it. Any desired capacity can be achieved by simply adjusting the signal-to-noise ratio $\sigma_p^2 / \sigma_n^2$.
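The capacity formula and its inversion are simple enough to state in code. The following is a minimal sketch (function names are our own) of the per-parameter Gaussian channel capacity in nats, and of choosing the noise scale that realizes a desired capacity for a given prior scale:

```python
import math

def channel_capacity_nats(sigma_p, sigma_n):
    """Per-parameter capacity 0.5 * log(1 + sigma_p^2 / sigma_n^2), in nats."""
    return 0.5 * math.log(1.0 + (sigma_p / sigma_n) ** 2)

def noise_scale_for_capacity(capacity_nats, sigma_p):
    """Invert the capacity formula: the sigma_n achieving a desired capacity."""
    return sigma_p / math.sqrt(math.exp(2.0 * capacity_nats) - 1.0)
```

For example, a unit prior with unit noise gives $\tfrac{1}{2}\log 2 \approx 0.35$ nats per parameter; shrinking the noise raises the capacity without bound.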
2.2 Generalization Error vs. Limited Information
Intuitively, we characterize overfitting as learning too much information about the training data, suggesting that limiting the amount of information extracted from the training data into the hypothesis should improve generalization. This idea has recently been formalized by Xu & Raginsky (2017); Bu et al. (2019); Bassily et al. (2018); Russo & Zou (2015) showing that limiting mutual information between data and learned parameters bounds the expected generalization error under certain assumptions.
Specifically, their work characterizes the following process: Assume that our training dataset $\mathcal{D}$ is sampled from a true distribution $p^*(\mathcal{D})$. Based on this training set, a learning algorithm subsequently returns a distribution over hypotheses given by $p(\theta \mid \mathcal{D})$. The process defines a mutual information $I(\theta; \mathcal{D})$ on the joint distribution $p^*(\mathcal{D})\, p(\theta \mid \mathcal{D})$. Under certain assumptions on the loss function, Xu & Raginsky (2017) derive a bound on the generalization error of the learning algorithm in expectation over this sampling process. Bu et al. (2019) relax the condition on the loss and prove applicability to a simple estimation algorithm involving an $L_2$-loss.
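For concreteness, the bound of Xu & Raginsky (2017) for loss functions that are $\sigma$-sub-Gaussian under the true data distribution can be written (adapting the hypothesis-and-sample notation of their paper to the variables used here) as:

```latex
\left| \mathbb{E}\!\left[ \operatorname{gen}\!\left(p^*, \, p(\theta \mid \mathcal{D})\right) \right] \right|
\;\leq\; \sqrt{\frac{2\sigma^2}{n}\, I(\theta; \mathcal{D})}
```

where $n$ is the number of training examples and $\operatorname{gen}$ denotes the expected difference between test and training loss. Any upper bound on $I(\theta; \mathcal{D})$, such as the channel capacity above, therefore directly bounds the expected generalization error under these assumptions.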
Exact Bayesian inference returns the true posterior $p(\theta \mid \mathcal{D})$ on a model $p(\theta, \mathcal{D})$. The theorem then states that a bound on $I(\theta; \mathcal{D})$ limits the expected generalization error as described in Bu et al. (2019) if the model captures the nature of the generating process in the marginal $p(\mathcal{D})$. This is a common assumption necessary to justify any (variational) Bayesian approach.
Exact inference is intractable in deep models; instead, one typically learns variational or point estimates of the posterior. That is also true for the objective on the noisy model above, where we only used a point estimate as given by Equation 6. Therefore, the assumption of exact inference is not met. Yet, we believe that those bounds motivate the expectation that variational inference aids generalization by limiting the learned information: if we performed exact inference on the noisy model in the last section, the given mutual information bound would translate into a bound on the generalization error via Xu & Raginsky (2017) and Bu et al. (2019). Therefore, we are optimistic that the gap between variational inference and those generalization bounds can be closed either by performing more accurate inference in the noisy model or by taking the dynamics of the training algorithm into account when bounding mutual information (see subsection 5.2 for further discussion).
2.3 Learned-Variance Gaussian Mean Field Inference
The variance in Gaussian mean field inference is typically learned for each parameter (Kingma et al., 2015; Rezende & Mohamed, 2015; Blundell et al., 2015).
As in the fixed-variance case, one can obtain a capacity constraint. This holds even for a generalization of the objective from Equation 1 where the KL-term is scaled by some factor $\beta$ (Higgins et al. (2017) propose using $\beta > 1$ to learn ‘disentangled’ representations in variational autoencoders). Further, $\beta$ is commonly annealed from $0$ to $1$ for expressive models (e.g. Bowman et al. (2015); Blundell et al. (2015); Sønderby et al. (2016)). In the following, we quantify a general capacity depending on $\beta$, where $\beta = 1$ recovers the standard variational objective. For notational simplicity, we here assume a prior variance of $\sigma_p^2 = 1$. It is straightforward to adapt the derivation to the general case.
In this case, the objective can be written as

$$\mathcal{L}(\mu, \sigma) = \mathbb{E}_{\theta}\left[\log p(\mathcal{D} \mid \theta)\right] - \beta \sum_i \mathrm{KL}\left(\mathcal{N}(\mu_i, \sigma_i^2) \,\|\, \mathcal{N}(0, 1)\right) \tag{9}$$

where now both $\mu$ and $\sigma$ represent learned vectors, and $\theta$ denotes a variable composed of pairwise independent Gaussian components with means and variances given by the elements of $\mu$ and $\sigma$.
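The per-component KL term in this objective has the standard closed form for Gaussians, $\mathrm{KL}(\mathcal{N}(\mu_i, \sigma_i^2) \,\|\, \mathcal{N}(0,1)) = \tfrac{1}{2}(\mu_i^2 + \sigma_i^2 - 1) - \log \sigma_i$. A minimal sketch of the $\beta$-scaled complexity term (names are our own):

```python
import math

def kl_gauss_std_normal(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ) for a single parameter component."""
    return 0.5 * (mu ** 2 + sigma ** 2 - 1.0) - math.log(sigma)

def beta_complexity_term(mus, sigmas, beta):
    """The beta-scaled complexity term of the objective: beta * sum_i KL_i."""
    return beta * sum(kl_gauss_std_normal(m, s) for m, s in zip(mus, sigmas))
```

Each component’s KL vanishes when the posterior matches the prior ($\mu_i = 0$, $\sigma_i = 1$) and grows as the mean moves away from zero or the variance deviates from one.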
Similar to the previous section, we show a lower bound on the log-joint of a new noisy model to be identical to Equation 9. Specifically, we define the noisy model (Figure 1c) with independent priors $\tilde{p}(\theta_i)$ and $\tilde{p}(\sigma_i^2)$, where the latter involves the Gamma distribution. As previously done in subsection 2.1, we define the noise-injected parameters as $\tilde{\theta}_i = \theta_i + \sigma_i \varepsilon_i$ with $\varepsilon_i \sim \mathcal{N}(0, 1)$, and the likelihood as $\tilde{p}(\mathcal{D} \mid \tilde{\theta}) = p(\mathcal{D} \mid \tilde{\theta})$.
The priors are chosen such that, with Jensen’s inequality, we find a lower bound on the log-joint probability of this model that recovers the objective from Equation 9.
In the noisy model, the data processing inequality and the independence of dimensions imply a bound

$$I(\theta; \mathcal{D}) \leq d \cdot C(\beta),$$

where the capacity per dimension $C(\beta)$ is derived in Appendix A. Figure 2 shows numerical results for various values of $\beta$. Standard variational inference ($\beta = 1$) already results in a finite capacity per dimension. We observe that higher $\beta$ corresponds to smaller capacity, which is given by the mutual information between our new latents and $\tilde{\theta}$. This formalizes the intuition that a higher weight on the complexity term in our objective increases regularization by decreasing a limit on the capacity.
2.4 Supervised and Unsupervised Learning
The above derivations apply to any learning algorithm that is purely trained with Gaussian mean-field inference. This covers supervised and unsupervised tasks.
In supervised learning, the training data typically consists of pairs of inputs and labels, and a loss is assigned to each pair that depends on the trained model, e.g. neural network parameters. When all parameters are learned with one of the discussed mean-field methods, the given bounds apply.
The derivation also covers unsupervised methods with per-data-point latents and even amortized inference such as VAEs (Kingma et al., 2015; Rezende & Mohamed, 2015), again as long as all learned variables are inferred via Gaussian mean field. While this might be helpful for finding generalizing representations, the focus of the experiments is on validating the generalizing behavior of mean field variational Bayes on neural network parameters in overfitting regimes, namely small datasets and complex models.
2.5 Flexible Variational Distributions
The objective function for variational inference is maximized when the approximate posterior is equal to the true one. This motivates the development of flexible families of posterior distributions (Rezende & Mohamed, 2015; Kingma et al., 2016; Salimans et al., 2015; Ranganath et al., 2016; Huszár, 2017; Chen et al., 2018; Vertes & Sahani, 2018; Burda et al., 2015; Cremer et al., 2017). In the case of exact inference, a bound on generalization as discussed in subsection 2.2 only applies if the model itself has finite mutual information between data and parameters. However, estimating mutual information is generally a hard problem, particularly in high-dimensional, non-linear models. This makes it hard to state a generic bound, which is why we focus on the case of Gaussian mean field inference.
3 Related Work
Regularization in Neural Networks
Gaussian mean field is intimately related to other popular regularization approaches in deep learning: As apparent from Equation 6, fixed-variance Gaussian mean field applied to training neural network weights is equivalent to $L_2$-regularization (weight decay) combined with Gaussian parameter noise (Graves et al., 2013; Plappert et al., 2018; Fortunato et al., 2018) on all network weights. Molchanov et al. (2017) show that additive parameter noise results in multiplicative noise on the unit activations. The resulting dependencies between noise components on the layer output can be ignored without significantly changing empirical results (Wang & Manning, 2013). This is in turn equivalent to scaled Gaussian dropout (Kingma et al., 2015).
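The first step of this chain is easy to verify numerically for a single linear unit: additive Gaussian weight noise with scale $\sigma$ induces Gaussian noise on the pre-activation with variance $\sigma^2 \lVert x \rVert^2$. The following check is our own illustration, not code from any of the cited works:

```python
import numpy as np

rng = np.random.default_rng(1)

# One linear unit with a fixed input x and weight vector w.
x = rng.normal(size=10)
w = rng.normal(size=10)
sigma = 0.3  # scale of the additive weight noise (illustrative value)

# Sample pre-activations under additive weight noise: a = (w + eps) @ x.
eps = sigma * rng.normal(size=(100_000, 10))
activations = (w + eps) @ x

# Additive weight noise appears as Gaussian noise on the unit's pre-activation
# with variance sigma^2 * ||x||^2, i.e. it depends on the input, not the weights.
predicted_var = sigma ** 2 * np.sum(x ** 2)
```

Empirically, the variance of `activations` matches `predicted_var` closely, while the mean stays at the noise-free value `w @ x`.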
The Information Bottleneck principle of Tishby et al. (2000) and Shamir et al. (2010) aims to find a representation $Z$ of some input $X$ that is most useful to predict an output $Y$. For this purpose, the objective is to maximize the amount of information the representation contains about the output under a bounded amount of information about the input:

$$\max I(Z; Y) \quad \text{subject to} \quad I(Z; X) \leq C.$$

They describe a training procedure using the softly constrained objective

$$\max I(Z; Y) - \beta\, I(Z; X),$$

where $\beta$ controls the trade-off.
Alemi et al. (2016) suggest a variational approximation to this objective. For the task of reconstruction, where labels are identical to inputs ($Y = X$), this results exactly in the $\beta$-VAE objective (Achille & Soatto, 2017; Alemi et al., 2018). This is in accordance with our result from subsection 2.3 that there is a maximum capacity per latent dimension that decreases for higher $\beta$. Setting $\beta > 1$, as suggested by Higgins et al. (2017) for obtaining disentangled representations, corresponds to a lower capacity per latent component than achieved by standard variational inference.
Both Tishby et al. (2000) and Higgins et al. (2017) introduce $\beta$ as a trade-off parameter without a quantitative interpretation. With our information-theoretic perspective, we quantify the implied capacity and provide a link to the generalization error. Further, both methods are concerned with the information in the latent representation. They do not limit the mutual information with the model parameters, leaving them vulnerable to model overfitting under our theoretical assumptions. We experimentally validate this vulnerability and explore the effect of filling this gap by applying Gaussian mean field inference to the model parameters.
Information Estimation with Neural Networks
Multiple recent techniques (Belghazi et al., 2018; van den Oord et al., 2018; Hjelm et al., 2018) propose the use of neural networks for obtaining a lower bound on the mutual information. This is useful in settings when we want to maximize mutual information, e.g. between the data and a lower-dimensional representation. In contrast, we show that Gaussian variational inference on variables with a Gaussian prior implicitly places an upper bound on the mutual information between these variables and the data, and explore its regularizing effect.
4 Experiments

In this section, we analyze the implications of applying Gaussian mean field inference of fixed scale to the model parameters in the supervised and unsupervised context. Our theoretical results suggest that varying the capacity affects the generalization capability; we show this effect in small data regimes and how it changes with the training set size. Furthermore, we investigate whether capacity is the only predictor of generalization or whether varying priors and architectures also have an effect. Finally, we demonstrate qualitatively how the capacity bounds are reflected in fashion MNIST reconstruction.
4.1 Supervised Learning
We begin with a supervised classification task on the CIFAR10 dataset, training only on a subset of the first 5000 samples. We use six 3×3 convolutional layers with 128 channels each, followed by ReLU activations; every second layer uses stride 2 to reduce the input dimensionality. Finally, the last layer is a linear projection which parameterizes a categorical distribution. The capacity of each parameter in this network is set to specific values given by Equation 8.
Figure 3 shows that decreasing the model capacity per dimension (by increasing the noise) reduces the training log-likelihood and increases the test log-likelihood until both of them meet at an optimal capacity. One can observe that very small capacities lead to a signal that is too noisy, and good predictions are no longer possible. In short, regimes of underfitting and overfitting arise depending on the capacity.
4.2 Unsupervised Learning
We now evaluate the regularizing effect of fixed-scale Gaussian mean field inference in an unsupervised setting for MNIST image reconstruction. To this end, we use a VAE (Kingma & Welling, 2013) with 2 latent dimensions and a 3-layer neural network parameterizing the conditional factorized Gaussian distribution. As usual, it is trained using the free energy objective, but different from the original work, we also use Gaussian mean field inference for the model parameters. Again, we use a small training set of 200 examples for the following experiments unless noted otherwise.
Varying Model Capacity and Priors
In our first experiment, we analyze generalization by inspecting the test ELBO when varying the model capacity, as shown in Figure 4a. Similar to the supervised case, we observe that there is a certain model capacity range that explains the data very well, while less or more capacity results in noise drowning and overfitting, respectively. In the same figure, we also investigate whether the information-theoretic model capacity can predict generalization independently of the specific prior distribution. Since we merely state an upper bound on mutual information in subsection 2.1, the prior may have an effect in practice which cannot be explained by the capacity alone. Figure 4a shows that, indeed, while the general behavior remains the same for different model priors, the generalization error is not entirely independent of the prior. Furthermore, the observation that all curves descend at larger capacities, for all priors, suggests that weight decay (Krogh & Hertz, 1992) of fixed scale without parameter noise is not sufficient to regularize arbitrarily large networks. In Figure 4b we investigate the extreme case of dropping the prior entirely and switching to maximum-likelihood learning by using an improper uniform prior. This approach recovers Gaussian dropout (Srivastava et al., 2014; Kingma et al., 2015). Dropping the prior sets the bottleneck capacity to infinity and should lead to worse generalization. Comparing the test ELBO of this Gaussian dropout variant to the original Gaussian mean field inference in Figure 4b confirms this result for larger capacities. For larger noise scales, generalization still works well, a result that is not explained by our information-theoretic framework but is plausible given the limited architecture deployed.
Varying Training Set Size
Figure 5a shows how limiting the capacity affects the test ELBO for varying amounts of training data. Models with very small capacity extract less information from the data into the model, thus yielding a good test ELBO somewhat independently of the dataset size. This is visible as a graph that ascends very little with more training data (e.g. total model capacity 330 kbits). Note that we here report the capacity of the entire model, which is the sum of the capacities of each parameter. In order to improve the test ELBO, more information from the data has to be extracted into the model. But clearly, this leads to non-generalizing information being extracted when the dataset is small, causing overfitting. Only for larger datasets does the extracted information generalize. This is visible as a strongly ascending test ELBO with larger dataset sizes and bad generalization for small datasets. We can therefore conclude that the information bottleneck needs to be chosen based on the amount of data that is available. Intuitively, when more data is available, more information should be extracted into the model.
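The total capacities quoted in this experiment (e.g. in kbits) are simply the per-parameter channel capacity summed over all parameters. A small helper, with names of our own choosing, assuming a shared prior and noise scale across parameters:

```python
import math

def total_capacity_kbits(n_params, sigma_p, sigma_n):
    """Total model capacity in kbits: n_params * 0.5 * log2(1 + sigma_p^2 / sigma_n^2) / 1000."""
    per_param_bits = 0.5 * math.log2(1.0 + (sigma_p / sigma_n) ** 2)
    return n_params * per_param_bits / 1000.0
```

With unit prior and unit noise each parameter contributes half a bit, so a model with one thousand parameters has a total capacity of 0.5 kbits; decreasing the noise scale raises the budget.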
Varying Model Size
Furthermore, we inspect how the size of the model (here, in terms of the number of layers) affects generalization in Figure 5b. Similar to varying the prior distribution, we are interested in how well the total capacity predicts generalization and what role the architecture plays. It can be observed that larger networks are more resilient to larger total capacities before they start overfitting. This indicates that the total capacity is less important than the individual capacity (i.e. noise) per parameter. Nevertheless, larger networks are more prone to overfitting at very large model capacities. This makes sense as their functional form is less constrained, an aspect that is not captured by our theory.
Finally, we plot test reconstruction means for the binarized fashion MNIST dataset under the same setup for various capacities in Figure 6. In accordance with previous experiments, we observe that if the capacity is chosen too small, the model does not learn anything useful, while too large capacities result in overconfidence. This can be observed in most means being close to either 0 or 1. An intermediate capacity, on the other hand, makes sensible predictions (given that it was trained only on 200 samples) with sensible uncertainty, visible through gray pixels that correspond to high entropy.
5 Discussion

In this section, we discuss how the capacity can be set, as well as the effect of model architecture and learning dynamics.
5.1 Choosing the Capacity
We have obtained a new trade-off parameter, the capacity, that has a simple quantitative interpretation: It determines how many bits to maximally extract from the training set. In contrast, for the parameter $\beta$ introduced in Tishby et al. (2000) and Higgins et al. (2017), a clear interpretation is not known. Yet, it may still be hard to set the capacity optimally. Simple mechanisms such as evaluation on a validation set may be used to determine its value. We expect that more theoretically rigorous methods could be developed.
Furthermore, in this paper, we have focused on the regularization that Gaussian mean field inference implies on the model parameters. The same concept is valid for data-dependent latent variables, for instance in VAEs, as discussed in subsection 2.4. In VAEs, Gaussian mean field inference on the latents leads to a restricted latent capacity, but leaves the capacity of the model unbounded. This leaves VAEs vulnerable to model overfitting, as demonstrated in the experiments, and setting $\beta > 1$ as done in Higgins et al. (2017) is not sufficient to control model complexity. This motivates limiting the capacity between the data and both the per-datapoint latents and the model parameters. The interaction between the two is an interesting future research direction.
5.2 Role of Learning Dynamics and Architecture
As discussed in subsection 2.2, it is necessary to perform exact inference in the noisy model for the bounds on the generalization error to hold. However, this assumption is not met. In practice, $p(\theta \mid \mathcal{D})$ encodes the complete learning algorithm, which in deep learning typically includes parameter initialization and the dynamics of stochastic gradient descent optimization.
Our experiments confirm the relevance of these other factors: $L_2$-regularization works in practice, even though no noise is added to the parameters. This could be explained by the fact that noise is already implicitly added through stochastic gradient descent (Lei et al., 2018) or through the output distribution of the network. Similarly, Gaussian dropout (Graves et al., 2013; Plappert et al., 2018; Fortunato et al., 2018) without a prior on the parameters helps generalization. Again, early stopping combined with the finite reach of gradient descent steps effectively shapes a prior of finite variance in parameter space. This could also formalize why the annealing schedule employed by Bowman et al. (2015); Blundell et al. (2015) and Sønderby et al. (2016) is effective.
This observed dependence on other factors suggests that quantifying mutual information of the actual distribution created by the learning dynamics might be a promising approach to explain why neural networks often generalize well on their own. This idea is in accordance with recent work that links the learning dynamics of small neural networks to generalization behavior (Li & Liang, 2018).
On the other hand, the architecture choice also had an influence on generalization, which is expected by our theory since we only formulate a bound on mutual information that is completely agnostic to the actual model choice. Tightening this bound based on the model architecture and output distribution is usually hard, as discussed in subsection 2.5, but might be possible.
Another promising direction would be to approximately sample from the exact posterior on network parameters (e.g. as done by Marceau-Caron & Ollivier (2017)) on a capacity-limited architecture, instead of the usual approach of point estimation. In the limit of infinite training time, this would fully realize the discussed bound on the expected generalization error.
6 Conclusion

We have explained the regularizing effects observed in Gaussian mean field approaches from an information-theoretic perspective. The derivation features a capacity that can be naturally interpreted as a limit on the amount of information extracted about the given data by the inferred model. We validated its practicality for both supervised and unsupervised learning.
How this capacity should be set for parameters and latent variables depending on task and data is an interesting direction of research. We exploited a theoretical link of mutual information and generalization error. While this work is restricted to Gaussian mean field, incorporating the effect of learning dynamics on mutual information in future work might allow understanding why overparameterized neural networks still generalize well to unseen data.
Appendix A Capacity in Learned-Variance Gaussian Mean Field Inference
The capacity per dimension for the model discussed in subsection 2.3 is given by the mutual information between the latents of a single dimension and the corresponding noise-injected parameter $\tilde{\theta}_i$, induced by the priors chosen there. Numerical results for the capacity with varying $\beta$ are given below and plotted in Figure 2.
- Achille & Soatto (2017) Achille, A. and Soatto, S. Emergence of invariance and disentangling in deep representations. arXiv preprint arXiv:1706.01350, 2017.
- Alemi et al. (2018) Alemi, A., Poole, B., Fischer, I., Dillon, J., Saurous, R. A., and Murphy, K. Fixing a broken ELBO. In International Conference on Machine Learning, 2018.
- Alemi et al. (2016) Alemi, A. A., Fischer, I., Dillon, J. V., and Murphy, K. Deep variational information bottleneck. In International Conference on Learning Representations, 2016.
- Barber (2012) Barber, D. Bayesian Reasoning and Machine Learning. 2012.
- Bassily et al. (2018) Bassily, R., Moran, S., Nachum, I., Shafer, J., and Yehudayoff, A. Learners that use little information. In Algorithmic Learning Theory, 2018.
- Belghazi et al. (2018) Belghazi, I., Rajeswar, S., Baratin, A., Hjelm, R. D., and Courville, A. Mine: Mutual information neural estimation. In International Conference on Machine Learning, 2018.
- Bishop (2006) Bishop, C. M. Pattern Recognition and Machine Learning. 2006.
- Blundell et al. (2015) Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight uncertainty in neural networks. In International Conference on Machine Learning, 2015.
- Bowman et al. (2015) Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A. M., Jozefowicz, R., and Bengio, S. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.
- Braithwaite & Kleijn (2018) Braithwaite, D. and Kleijn, W. B. Bounded information rate variational autoencoders. arXiv preprint arXiv:1807.07306, 2018.
- Bu et al. (2019) Bu, Y., Zou, S., and Veeravalli, V. V. Tightening mutual information based bounds on generalization error. arXiv preprint arXiv:1901.04609, 2019.
- Burda et al. (2015) Burda, Y., Grosse, R., and Salakhutdinov, R. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
- Chen et al. (2018) Chen, T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. Neural ordinary differential equations. arXiv preprint arXiv:1806.07366, 2018.
- Cover & Thomas (2012) Cover, T. M. and Thomas, J. A. Elements of information theory. John Wiley & Sons, 2012.
- Cremer et al. (2017) Cremer, C., Morris, Q., and Duvenaud, D. Reinterpreting importance-weighted autoencoders. In International Conference on Learning Representations Workshop, 2017.
- Fortunato et al. (2018) Fortunato, M., Azar, M. G., Piot, B., Menick, J., Hessel, M., Osband, I., Graves, A., Mnih, V., Munos, R., Hassabis, D., Pietquin, O., Blundell, C., and Legg, S. Noisy networks for exploration. In International Conference on Learning Representations, 2018.
- Ghahramani (2015) Ghahramani, Z. Probabilistic machine learning and artificial intelligence. Nature, 2015.
- Graves et al. (2013) Graves, A., Mohamed, A.-r., and Hinton, G. Speech recognition with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2013.
- Higgins et al. (2017) Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. β-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017.
- Hinton & van Camp (1993) Hinton, G. and van Camp, D. Keeping neural networks simple by minimising the description length of weights. In Computational Learning Theory, 1993.
- Hjelm et al. (2018) Hjelm, R. D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Trischler, A., and Bengio, Y. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.
- Hoffman et al. (2013) Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. Stochastic variational inference. The Journal of Machine Learning Research, 14(1), 2013.
- Huszár (2017) Huszár, F. Variational inference using implicit distributions. arXiv preprint arXiv:1702.08235, 2017.
- Kingma & Welling (2013) Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
- Kingma et al. (2015) Kingma, D. P., Salimans, T., and Welling, M. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, 2015.
- Kingma et al. (2016) Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, 2016.
- Krogh & Hertz (1992) Krogh, A. and Hertz, J. A. A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems, 1992.
- Lei et al. (2018) Lei, D., Sun, Z., Xiao, Y., and Wang, W. Y. Implicit regularization of stochastic gradient descent in natural language processing: Observations and implications. arXiv preprint arXiv:1811.00659, 2018.
- Li & Liang (2018) Li, Y. and Liang, Y. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, 2018.
- Marceau-Caron & Ollivier (2017) Marceau-Caron, G. and Ollivier, Y. Natural langevin dynamics for neural networks. In International Conference on Geometric Science of Information. Springer, 2017.
- Molchanov et al. (2017) Molchanov, D., Ashukha, A., and Vetrov, D. Variational dropout sparsifies deep neural networks. In International Conference on Machine Learning, 2017.
- Plappert et al. (2018) Plappert, M., Houthooft, R., Dhariwal, P., Sidor, S., Chen, R. Y., Chen, X., Asfour, T., Abbeel, P., and Andrychowicz, M. Parameter space noise for exploration. In International Conference on Learning Representations, 2018.
- Ranganath et al. (2016) Ranganath, R., Tran, D., Altosaar, J., and Blei, D. Operator variational inference. In Advances in Neural Information Processing Systems, 2016.
- Rezende & Mohamed (2015) Rezende, D. J. and Mohamed, S. Variational inference with normalizing flows. In International Conference on Machine Learning, 2015.
- Rissanen (1978) Rissanen, J. Modeling by shortest data description. Automatica, 1978.
- Rissanen (1983) Rissanen, J. A universal prior for integers and estimation by minimum description length. The Annals of Statistics, 1983.
- Russo & Zou (2015) Russo, D. and Zou, J. How much does your data exploration overfit? Controlling bias via information usage. arXiv preprint arXiv:1511.05219, 2015.
- Salimans et al. (2015) Salimans, T., Kingma, D., and Welling, M. Markov chain Monte Carlo and variational inference: Bridging the gap. In International Conference on Machine Learning, 2015.
- Shamir et al. (2010) Shamir, O., Sabato, S., and Tishby, N. Learning and generalization with the information bottleneck. Theoretical Computer Science, 2010.
- Shu et al. (2018) Shu, R., Bui, H. H., Zhao, S., Kochenderfer, M. J., and Ermon, S. Amortized inference regularization. arXiv preprint arXiv:1805.08913, 2018.
- Sønderby et al. (2016) Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K., and Winther, O. Ladder variational autoencoders. In Advances in Neural Information Processing Systems, 2016.
- Srivastava et al. (2014) Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 2014.
- Tishby et al. (2000) Tishby, N., Pereira, F. C., and Bialek, W. The information bottleneck method. arXiv preprint physics/0004057, 2000.
- Titsias & Lázaro-Gredilla (2014) Titsias, M. and Lázaro-Gredilla, M. Doubly stochastic variational Bayes for non-conjugate inference. In International Conference on Machine Learning, 2014.
- Trippe & Turner (2018) Trippe, B. and Turner, R. Overpruning in variational Bayesian neural networks. arXiv preprint arXiv:1801.06230, 2018.
- Turner & Sahani (2011) Turner, R. and Sahani, M. Two problems with variational expectation maximisation for time-series models. Bayesian Time series models, 2011.
- van den Oord et al. (2018) van den Oord, A., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
- Vertes & Sahani (2018) Vertes, E. and Sahani, M. Flexible and accurate inference and learning for deep generative models. arXiv preprint arXiv:1805.11051, 2018.
- Wainwright et al. (2008) Wainwright, M. J., Jordan, M. I., et al. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 2008.
- Wang & Manning (2013) Wang, S. and Manning, C. Fast dropout training. In International Conference on Machine Learning, 2013.
- Xu & Raginsky (2017) Xu, A. and Raginsky, M. Information-theoretic analysis of generalization capability of learning algorithms. In Advances in Neural Information Processing Systems, 2017.