Information Theoretic Lower Bounds on Negative Log Likelihood

by   Luis A. Lastras, et al.

In this article we use rate-distortion theory, a branch of information theory devoted to the problem of lossy compression, to shed light on an important problem in latent variable modeling of data: is there room to improve the model? One way to address this question is to find an upper bound on the probability (equivalently a lower bound on the negative log likelihood) that the model can assign to some data as one varies the prior and/or the likelihood function in a latent variable model. The core of our contribution is to formally show that the problem of optimizing priors in latent variable models is exactly an instance of the variational optimization problem that information theorists solve when computing rate-distortion functions, and then to use this to derive a lower bound on negative log likelihood. Moreover, we will show that if changing the prior can improve the log likelihood, then there is a way to change the likelihood function instead and attain the same log likelihood, and thus rate-distortion theory is of relevance to both optimizing priors as well as optimizing likelihood functions. We will experimentally argue for the usefulness of quantities derived from rate-distortion theory in latent variable modeling by applying them to a problem in image modeling.




1 Introduction

A statistician plans to use a latent variable model

$$p(x) = \int p(z)\,\ell(x|z)\,dz \qquad (1)$$

where $p(z)$ is known as the prior over the latent variables, and $\ell(x|z)$ is the likelihood of the data conditional on the latent variables; we will often use $(p(z), \ell(x|z))$ as a shorthand for the model. Frequently, both the prior and the likelihood are parametrized and the statistician's job is to find reasonable parametric families for both - an optimization algorithm then chooses the parameter within those families. The task of designing these parametric families can sometimes be time consuming - this is one of the key modeling tasks when one adopts the representation learning viewpoint in machine learning.

In this article we ask how much $p(z)$ can be improved if one fixes $\ell(x|z)$, and vice versa, with the goal of equipping the statistician with tools to make decisions on where to invest her time. One way to answer whether $p(z)$ can be improved for fixed $\ell(x|z)$ is to drop the assumption that $p(z)$ must belong to a particular family and ask how a model could improve in an unrestricted setting. Mathematically, given data $\{x_1, \dots, x_N\}$, the first problem we study is the following optimization problem: for a fixed $\ell(x|z)$,

$$\min_{p(z)} \; -\frac{1}{N}\sum_{i=1}^{N} \log \int p(z)\,\ell(x_i|z)\,dz \qquad (2)$$

which, as we will show, is also indirectly connected to the problem of determining if $\ell(x|z)$ can be improved for a given fixed $p(z)$. The quantity being optimized in (2) is called the average negative log likelihood of the data, and is used whenever one assumes that the data have been drawn statistically independently at random. Note that in this paper we are overloading the meaning of $p(z)$: in (1) it refers to the prior in the “current” latent variable model in the statistician's hands, and in (2) and other similar equations, it refers to a prior that can be optimized. We believe that the meaning will be clear from the context.

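Obviously, for any given $p(z)$, from the definition (1) we have the trivial upper bound

$$\min_{p(z)} \; -\frac{1}{N}\sum_{i=1}^{N} \log \int p(z)\,\ell(x_i|z)\,dz \;\le\; -\frac{1}{N}\sum_{i=1}^{N} \log p(x_i) \qquad (3)$$

Can we give a good lower bound? A lower bound could tell us how far we can improve the model by changing the prior. The answer turns out to be in the affirmative. In an important paper, Lindsay (1983) proved several facts about the problem of optimizing priors in latent variable models. In particular, he showed that

$$\min_{p(z)} \; -\frac{1}{N}\sum_{i=1}^{N} \log \int p(z)\,\ell(x_i|z)\,dz \;\ge\; -\frac{1}{N}\sum_{i=1}^{N} \log p(x_i) \;-\; \sup_{z} \log\left(\frac{1}{N}\sum_{i=1}^{N} \frac{\ell(x_i|z)}{p(x_i)}\right) \qquad (4)$$

This result is very general - it holds for both discrete and continuous latent variable spaces, scalar or vector. It is also sharp - if you plug in the right prior, the upper and lower bounds match. It also has the advantage that the lower bound is written as a function of the trivial upper bound (3) - if someone proposes a latent variable model $p(x)$ which uses a likelihood function $\ell(x|z)$, the optimal negative log likelihood value when we optimize the prior is thus known to be within a gap of

$$\sup_{z} \log\left(\frac{1}{N}\sum_{i=1}^{N} \frac{\ell(x_i|z)}{p(x_i)}\right) \qquad (5)$$

bits. The individual quantities under the $\sup$,

$$c(z) = \frac{1}{N}\sum_{i=1}^{N} \frac{\ell(x_i|z)}{p(x_i)} \qquad (6)$$

have a specific functional meaning: they can be regarded as multiplicative factors that tell you how to modify the prior to improve the log likelihood (see the Blahut-Arimoto algorithm (Blahut, 1972)):

$$p'(z) = p(z)\,c(z) \qquad (7)$$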
Lindsay (1983) derived his results with no apparent reference to earlier published work on rate-distortion theory, which is how information theorists study the problem of lossy compression (Shannon, 1959). There is no reason to expect that this connection would have been made at the time, as it is not immediately obvious that the problems are connected. However, the quantity $c(z)$ and the lower bound (4) can in fact be derived from ideas in Berger's book on rate-distortion theory (Berger, 1971); moreover, Lindsay's classical result that the optimal prior in a latent variable model has finite support with size equal to the size of the training data can, drawing the appropriate connection, be seen as a consequence of a similar result in rate-distortion theory also contained in Berger (1971).

With time, the fundamental connection between log likelihood optimization in latent variable modeling and the computation of rate-distortion functions became clearer. Although not explicitly stated in these terms, Rose (1998) and Neal & Hinton (1998) restate the optimal log likelihood as a problem of minimizing the variational free energy of a certain statistical system; this expression is identical to the one that is optimized in rate-distortion theory. The Information Bottleneck method of Tishby et al. (1999) is a highly successful idea that exists in this boundary, having created a subfield of research that remains relevant nowadays (Tishby & Zaslavsky, 2015; Shwartz-Ziv & Tishby, 2017). Slonim & Weiss (2002) showed a fundamental connection between maximum likelihood and the information bottleneck method: “every fixed point of the IB-functional defines a fixed point of the (log) likelihood and vice versa”. Banerjee et al. (2004) defined a rate-distortion problem where the output alphabet is finite and is jointly optimized with the test channel. By specializing to the case where the distortion measure is a Bregman divergence, they showed the mathematical equivalence between this problem and that of maximum likelihood estimation where the likelihood function is an exponential family distribution derived from the Bregman divergence.

Watanabe & Ikeda (2015) study a variation of maximum likelihood estimation for latent variable models where the maximum likelihood criterion is replaced with the entropic risk measure. The autoencoder concept extensively used in the neural network community is arguably directly motivated by the encoder/decoder concepts in lossy/lossless compression.

Giraldo & Príncipe (2013) proposed using a matrix based expression motivated by rate-distortion ideas in order to train autoencoders while avoiding estimating mutual information directly. Recently, Alemi et al. (2018) exploited the $\beta$-VAE loss function of Higgins et al. (2017) to explicitly introduce a trade-off between rate and distortion in the latent variable modeling problem, where the notions of rate and distortion have similarities to those used in our article.

Latent variable modeling is undergoing an exciting moment as it is a form of representation learning, the latter having been shown to be an important tool in reaching remarkable performance in difficult machine learning problems while simultaneously avoiding feature engineering. This prompted us to look deeper into rate-distortion theory as a tool for developing a theory of representation learning. What excites us is the possibility of using a large repertoire of tools, algorithms and theoretical results in lossy compression in meaningful ways in latent variable modeling; notably, we believe that beyond what we will state in this article, Shannon's famous lossy source coding theorem, the information theory of multiple senders and receivers, and the rich Shannon-style equalities and inequalities involving entropy and mutual information are all of fundamental relevance to the classical problem (1) and more complex variations of it.

The goal of this article is to lay down a firm foundation that we can use to build towards the program above. We start by proving the fundamental equivalence between these two fields by using simple convexity arguments, avoiding the variational calculus arguments that had been used before. We then take an alternate path to proving (4), also involving simple convexity arguments inspired by earlier results in rate-distortion theory.

We then focus on what is a common problem - instead of improving a prior for a fixed likelihood function, improve the likelihood function for a fixed prior. Interestingly, rate-distortion theory is still relevant to this problem, although the question that we are able to answer with it is smaller in scope. Through a simple change of variable argument, we will argue that if the negative log likelihood can be improved by modifying the prior, exactly the same negative log likelihood can be attained by modifying the likelihood function instead. Thus if rate-distortion theory predicts that there is scope for improvement for a prior, the same holds for the likelihood function. The converse does not hold: while rate-distortion theory can precisely determine when a prior can no longer be improved, the same cannot be said for the likelihood function.

Finally, we test whether the lower bound derived and the corresponding fundamental quantity are useful in practice when making modeling decisions by applying these ideas to a problem in image modeling for which there have been several recent results involving Variational Autoencoders.

2 Technical preliminaries

In our treatment, we will use notation that is commonly used in information theory. Given two distributions $P$ and $Q$, the KL divergence from $P$ to $Q$ is denoted as $D_{KL}(P \| Q)$, which is known to be nonnegative. For a discrete random variable $X$, we denote its entropy $H(X)$. Given two random variables $X$, $Z$ with joint distribution $P_{X,Z}$ and marginals $P_X$, $P_Z$, their mutual information is $I(X;Z) = D_{KL}(P_{X,Z} \| P_X P_Z)$.

We assume that the data comes from some arbitrary alphabet $\mathcal{X}$ which could be continuous or discrete. For example, it could be the set of all sentences of length up to a given number of words, the set of all real valued images of a given dimension, or the set of all real valued time series with a given number of samples.

Let $X$ be a random variable distributed uniformly over the training data $\{x_1, \dots, x_N\}$. The entropy of this random variable is $H(X)$, and the fundamental lossless compression theorem of Shannon tells us that any compression technique for independently, identically distributed data samples following the law of $X$ must use at least $H(X)$ bits per sample. But what if one is willing to allow losses in the data reconstruction? This is where rate-distortion theory comes into play. Shannon introduced the idea of a reconstruction alphabet $\mathcal{Z}$ (which need not be equal to $\mathcal{X}$), and a distortion function $d(x, z)$ which is used to measure the cost of reproducing an element of $\mathcal{X}$ with an element of $\mathcal{Z}$. He then introduced an auxiliary random variable $Z$, which is obtained by passing $X$ through a channel $Q(z|x)$, and defined the rate-distortion function as

$$R(D) = \min_{Q(z|x)\,:\;E[d(X,Z)] \le D} \; I(X;Z) \qquad (8)$$

Shannon's fundamental lossy source coding theorem in essence shows that $R(D)$ plays the same role as the entropy plays for lossless compression - it represents the minimum number of bits per sample needed to compress $X$ within a fidelity $D$, showing both necessity and sufficiency when the number of samples being compressed approaches infinity. This fundamental result, though, is not needed for this paper; its relevance to the modeling problem will be shown in a subsequent publication. What we will use instead is the theory of how you compute rate-distortion functions. In particular, by choosing a uniform distribution over $\{x_1, \dots, x_N\}$ and by not requiring that these are unique, we are in effect setting the source distribution to be the empirical distribution observed in the training data.

3 The main mathematical results

In latent variable modeling, the correct choice for the distortion function turns out to be

$$d(x, z) = -\log \ell(x|z).$$

The fundamental reason for this will be revealed in the statement of Theorem 1 below. Computing a rate-distortion function starts by noting that the constraint in the optimization (8) defining the rate-distortion function can be eliminated by writing down the Lagrangian:

$$\min_{Q(z|x)} \; I(X;Z) - \beta\,E[\log \ell(X|Z)],$$

where $\beta > 0$ plays the role of the Lagrange multiplier.

Our first result connects the prior optimization problem (2) with the optimization of the Lagrangian:

Theorem 1 (prior optimization is an instance of rate-distortion)

For any $\beta > 0$,

$$\min_{p(z)} \; -\frac{1}{N}\sum_{i=1}^{N} \log \int p(z)\,\ell(x_i|z)^{\beta}\,dz \;=\; \min_{Q(z|x)} \; \Big\{ I(X;Z) - \beta\,E[\log \ell(X|Z)] \Big\} \qquad (9)$$

The two central actors in this result, $p(z)$ and $Q(z|x)$, both have very significant roles in both latent variable modeling and rate-distortion theory. The prior $p(z)$ is known as the “output marginal” in rate-distortion theory, as it is related to the distribution of the compressed output in a lossy compression system. On the other hand, $Q(z|x)$, which is called the “test channel” in rate-distortion theory, is the conditional distribution that one uses when optimizing the famous Evidence Lower Bound (ELBO), which is an upper bound to the negative log likelihood (in contrast, (4) is a lower bound on the same). In the context of variational autoencoders, which are a form of latent variable modeling, $Q(z|x)$ is also called the “stochastic encoder”. An optimal prior according to (2) is also an optimal output distribution for lossy compression, and conversely, from an optimal test channel in lossy compression one can derive an optimal prior for modeling.

3.1 Proof of Theorem 1

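The proof consists of proving “less than or equal” first in (9) and then “greater than or equal”. We start with the following lemma, which is a simple generalization of the Evidence Lower Bound (ELBO). It can be proved through a straightforward application of Jensen's inequality.

Lemma 1

For any conditional distribution $Q(z|x)$, prior $p(z)$, likelihood $\ell(x|z)$, and for any $\beta > 0$ and $x$,

$$-\log \int p(z)\,\ell(x|z)^{\beta}\,dz \;\le\; D_{KL}\big(Q(\cdot|x) \,\|\, p\big) - \beta\,E_{Q(\cdot|x)}[\log \ell(x|Z)] \qquad (10)$$

Taking the expectation w.r.t. $X$ in (10) (assuming $X$ is uniform over $\{x_1, \dots, x_N\}$), and through an elementary rearrangement of terms, one derives that for any $p(z)$, any conditional distribution $Q(z|x)$, any $\beta > 0$, and defining $(X, Z)$ to be random variables distributed according to $\frac{1}{N} Q(z|x)$, with $Q_Z(z) = \frac{1}{N}\sum_{i} Q(z|x_i)$ denoting the marginal of $Z$,

$$-\frac{1}{N}\sum_{i=1}^{N} \log \int p(z)\,\ell(x_i|z)^{\beta}\,dz \;\le\; I(X;Z) + D_{KL}(Q_Z \,\|\, p) - \beta\,E[\log \ell(X|Z)] \qquad (11)$$

Since we are free to choose $p(z)$ whatever way we want, we set $p(z) = Q_Z(z)$, thus eliminating the divergence term in (11).

Since this is true for any conditional distribution $Q(z|x)$, we can take the infimum in the right hand side, proving “less than or equal”. To prove the other direction, let $p(z)$ be any distribution and define

$$Q^*(z|x) = \frac{p(z)\,\ell(x|z)^{\beta}}{\int p(z')\,\ell(x|z')^{\beta}\,dz'}$$

Let $(X, Z)$ be distributed according to $\frac{1}{N} Q^*(z|x)$. Then

$$\min_{Q(z|x)} \Big\{ I(X;Z) - \beta\,E[\log \ell(X|Z)] \Big\} \;\le\; I(X;Z) - \beta\,E[\log \ell(X|Z)] \;\le\; E\big[D_{KL}(Q^*(\cdot|X) \,\|\, p)\big] - \beta\,E[\log \ell(X|Z)] \;=\; -\frac{1}{N}\sum_{i=1}^{N} \log \int p(z)\,\ell(x_i|z)^{\beta}\,dz,$$

where the mutual information and expectations after the first inequality are computed under $Q^*$, the second inequality uses $E[D_{KL}(Q^*(\cdot|X) \| p)] = I(X;Z) + D_{KL}(Q^*_Z \| p) \ge I(X;Z)$, and the final equality follows by substituting the definition of $Q^*$.

Since this is true for any distribution $p(z)$, we can take the minimum in the right hand side, completing the proof.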
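The two steps of the proof can be checked numerically on a toy problem. The following is a minimal sketch, assuming a finite latent space so that integrals become sums; the array `lik` (standing in for $\ell(x_i|z_k)$), the sizes, and the value of `beta` are all illustrative choices and not part of the original experiments. It builds the tilted posterior $p(z)\ell(x|z)^\beta / \sum_{z'} p(z')\ell(x|z')^\beta$ and verifies that the expected KL divergence decomposes into mutual information plus a divergence between marginals, so that the Lagrangian sits at or below the negative log likelihood term.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, beta = 5, 3, 0.7                       # toy sizes and multiplier (assumptions)
lik = rng.uniform(0.1, 1.0, size=(N, K))     # lik[i, k] stands in for l(x_i | z_k)
prior = rng.dirichlet(np.ones(K))            # an arbitrary prior p(z)

w = prior * lik**beta                        # p(z) l(x_i|z)^beta, shape (N, K)
lhs = -np.mean(np.log(w.sum(axis=1)))        # -(1/N) sum_i log sum_z p(z) l(x_i|z)^beta

Q = w / w.sum(axis=1, keepdims=True)         # tilted posterior Q*(z | x_i)
Qz = Q.mean(axis=0)                          # marginal of Z when X is uniform over the data

mutual_info = np.mean(np.sum(Q * np.log(Q / Qz), axis=1))   # I(X; Z) under Q*
kl_marginal = np.sum(Qz * np.log(Qz / prior))               # D_KL(Q*_Z || p)
distortion = -np.mean(np.sum(Q * np.log(lik), axis=1))      # E[-log l(X|Z)]

rhs_with_divergence = mutual_info + kl_marginal + beta * distortion
lagrangian = mutual_info + beta * distortion

print(lhs, rhs_with_divergence, lagrangian)
```

With this particular choice of channel the bound with the divergence term holds with equality, and dropping the nonnegative divergence term can only lower the value, which is the inequality used in the second half of the proof.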

3.2 Lower bound on the negative log likelihood: optimizing priors

The goal in this section is to derive the lower bound (4). Theorem 1 suggests that the problem of lower bounding negative log likelihood in a latent variable modeling problem may be related to the problem of lower bounding a rate-distortion function. This is true - mathematical tricks in the latter can be adapted to the former. The twist is that in information theory, these lower bounds apply to (8), where the object being optimized is the test channel $Q(z|x)$, whereas in latent variable modeling we want them to apply directly to (1), where the object being optimized is the prior $p(z)$.

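The beginning point is an elementary result, inspired by the arguments in Theorem 2.5.3 of Berger (1971), which can be proved using the inequality $\log \frac{1}{u} \ge 1 - u$:

Lemma 2

For any $\lambda(x) > 0$, $p(z)$, $\beta > 0$ and $x$,

$$-\log \int p(z)\,\ell(x|z)^{\beta}\,dz \;\ge\; \log \lambda(x) + 1 - \lambda(x) \int p(z)\,\ell(x|z)^{\beta}\,dz$$

Taking the expectation with respect to $X$ in Lemma 2 we obtain

$$-\frac{1}{N}\sum_{i=1}^{N} \log \int p(z)\,\ell(x_i|z)^{\beta}\,dz \;\ge\; \frac{1}{N}\sum_{i=1}^{N} \log \lambda(x_i) + 1 - \int p(z) \left[\frac{1}{N}\sum_{i=1}^{N} \lambda(x_i)\,\ell(x_i|z)^{\beta}\right] dz$$

For any $q(x) > 0$, if you substitute $\lambda(x) = \left[ q(x) \int p(z)\,\frac{1}{N}\sum_{j=1}^{N} \frac{\ell(x_j|z)^{\beta}}{q(x_j)}\,dz \right]^{-1}$ then we obtain

$$-\frac{1}{N}\sum_{i=1}^{N} \log \int p(z)\,\ell(x_i|z)^{\beta}\,dz \;\ge\; -\frac{1}{N}\sum_{i=1}^{N} \log q(x_i) - \log \int p(z) \left[\frac{1}{N}\sum_{i=1}^{N} \frac{\ell(x_i|z)^{\beta}}{q(x_i)}\right] dz$$

We now eliminate any dependence on $p(z)$ on the right hand side by taking the $\sup$ over $z$, and then take the $\min$ over $p(z)$ on both sides, obtaining that for any $q(x) > 0$,

$$\min_{p(z)} \; -\frac{1}{N}\sum_{i=1}^{N} \log \int p(z)\,\ell(x_i|z)^{\beta}\,dz \;\ge\; -\frac{1}{N}\sum_{i=1}^{N} \log q(x_i) - \sup_{z} \log\left(\frac{1}{N}\sum_{i=1}^{N} \frac{\ell(x_i|z)^{\beta}}{q(x_i)}\right)$$

The $p(z)$ we eliminated refers to the “optimal” prior, which we don't know (which is why we want to eliminate it). We now bring in the “suboptimal” prior: since the above is true for any $q(x)$, set $q(x) = \int p(z)\,\ell(x|z)\,dz$, where $(p(z), \ell(x|z))$ is the latent variable model (1) for a given $p(z)$. Note now that the expression on the left is a log likelihood only for $\beta = 1$, so choosing this value and assuming that $X$ is distributed uniformly over the training data $\{x_1, \dots, x_N\}$, we obtain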

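Theorem 2 (information theoretic lower bound on negative log likelihood)

$$\min_{p(z)} \; -\frac{1}{N}\sum_{i=1}^{N} \log \int p(z)\,\ell(x_i|z)\,dz \;\ge\; -\frac{1}{N}\sum_{i=1}^{N} \log p(x_i) - \sup_{z} \log\left(\frac{1}{N}\sum_{i=1}^{N} \frac{\ell(x_i|z)}{p(x_i)}\right)$$

This bound is sharp. If you plug in the $p(z)$ that attains the $\min$, then the $\sup$ is exactly zero, and conversely, if $p(z)$ does not attain the $\min$, then there will be slackness in the bound. Its sharpness can be argued in a parallel manner to the original rate-distortion theory setting. In particular, by examining the variational problem defining the latent variable modeling problem, we can deduce that an optimal $p(z)$ must satisfy

$$\frac{1}{N}\sum_{i=1}^{N} \frac{\ell(x_i|z)}{p(x_i)} \le 1 \quad \text{for all } z, \text{ with equality wherever } p(z) > 0.$$

Thus for an optimal $p(z)$ the $\sup$ in Theorem 2 is exactly zero and the lower bound matches exactly with the trivial upper bound (3).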
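To make the sandwich between the trivial upper bound and the lower bound concrete, here is a minimal sketch on a finite latent space. Everything in it is an illustrative assumption: the random matrix `lik` stands in for $\ell(x_i|z_k)$, the prior starts uniform, and the multiplicative update `p * c` is the Blahut-Arimoto style rule with $c(z) = \frac{1}{N}\sum_i \ell(x_i|z)/p(x_i)$. It is not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 8, 5                                   # toy sizes (assumptions)
lik = rng.uniform(0.05, 1.0, size=(N, K))     # lik[i, k] stands in for l(x_i | z_k)
prior = np.full(K, 1.0 / K)                   # initial prior p(z)

def bounds(p, lik):
    """Return (lower bound, trivial upper bound, c) for prior p."""
    px = lik @ p                                  # p(x_i) = sum_k p(z_k) l(x_i|z_k)
    c = np.mean(lik / px[:, None], axis=0)        # c(z_k) = (1/N) sum_i l(x_i|z_k) / p(x_i)
    upper = -np.mean(np.log(px))                  # average negative log likelihood
    return upper - np.log(c.max()), upper, c

lower0, upper0, _ = bounds(prior, lik)

p = prior.copy()
for _ in range(5000):                             # repeated multiplicative updates
    _, _, c = bounds(p, lik)
    p = p * c                                     # stays normalized because sum_k p_k c_k = 1

lower_T, upper_T, _ = bounds(p, lik)
print("initial:", lower0, upper0)
print("final:  ", lower_T, upper_T)
```

The initial lower bound should never exceed the negative log likelihood reached after the updates, and the gap between the two bounds should shrink as the prior approaches optimality, which is the sharpness property discussed above.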

The reader may be wondering: what if we don't know $p(x)$ exactly? We stress that $p(x)$ represents not the true model, but rather the proposed model for the data. Still, when performing variational inference, the most typical result is that we know how to evaluate exactly a lower bound to $\log p(x)$ (this is true when we are using the Evidence Lower Bound or importance sampling (Burda et al., 2016)). In that case, you obtain an upper bound to the gap $\sup_z \log c(z)$ (see (6) for the definition of $c(z)$). If the statistician wants to get a tighter bound, she can simply increase the number of samples in the importance sampling approach; by Theorem 1 of Burda et al. (2016) we know that this methodology will approximate $\log p(x)$ arbitrarily closely as the number of samples grows.

3.3 Optimizing the likelihood function for a fixed prior.

A very common assumption is to fix the prior to a simple distribution, for example, a unit variance, zero mean Gaussian random vector, and to focus all work on designing a good likelihood function $\ell(x|z)$. Can rate-distortion theory, which we have only shown to be relevant to the problem of optimizing a prior, say anything about this setting?

We assume that $\mathcal{Z}$ is a Euclidean space in some dimension $k$. Assume that we start with a latent variable model $(p(z), \ell(x|z))$, and then we notice that there is a better choice $\hat{p}(z)$ for the same $\ell(x|z)$, in the sense that

$$-\frac{1}{N}\sum_{i=1}^{N} \log \int \hat{p}(z)\,\ell(x_i|z)\,dz \;<\; -\frac{1}{N}\sum_{i=1}^{N} \log \int p(z)\,\ell(x_i|z)\,dz \qquad (14)$$

Further assume that there is a function $g: \mathcal{Z} \to \mathcal{Z}$ with the property that if $Z$ is a random variable distributed as $p(z)$, then $g(Z)$ is a random variable distributed as $\hat{p}(z)$. For example, $p(z)$ may describe a unit variance memoryless Gaussian vector and $\hat{p}(z)$ may describe a Gaussian vector with a given correlation matrix; then it is known that there is a linear mapping $g$ such that $g(Z)$ has distribution $\hat{p}(z)$. Another example is when both $p$ and $\hat{p}$ are product distributions $p(z) = \prod_j p_j(z_j)$, $\hat{p}(z) = \prod_j \hat{p}_j(z_j)$, where each of the components $p_j$, $\hat{p}_j$ is a continuous distribution. Then from the probability integral transform and the inverse probability integral transform we can deduce that such a $g$ exists. We do point out that when $p(z)$ is discrete (for example, when $\mathcal{Z}$ is finite), or when it is a mixture of discrete and continuous components, then such a $g$ in general cannot be guaranteed to exist.

We regard $p$, $\hat{p}$ as probability measures on the $\sigma$-algebra of $\mathcal{Z}$. In the Appendix, we prove that if $g$ is a measurable function the two following Lebesgue integrals are identical:

$$\int \ell(x|g(z))\,dp(z) = \int \ell(x|z)\,d\hat{p}(z) \qquad (15)$$

The last integral in (15) is identical to the integral in the left hand side of (14). Therefore, one can define a new likelihood function $\hat{\ell}(x|z) = \ell(x|g(z))$, and the negative log likelihood of the latent variable model $(p(z), \hat{\ell}(x|z))$ will be identical to that of $(\hat{p}(z), \ell(x|z))$. A key consequence is that if the lower bound in Theorem 2 shows signs of slackness, then it automatically implies that the likelihood function admits improvement for a fixed prior. It is important to note that this is only one possible type of modification to a likelihood function. Thus if rate-distortion theory predicts that a prior cannot be improved any more, then this does not imply that the likelihood function cannot be improved - it only means that there is a specific class of improvements that are ruled out.
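The change of variable argument can be illustrated in one dimension. The sketch below is a toy check under assumed distributions that do not come from the experiments: the original prior is a standard Gaussian, the hypothetical improved prior is $N(\mu, \sigma^2)$, the map is $g(z) = \mu + \sigma z$, and the likelihood is a unit variance Gaussian centred on the latent value. It compares "new likelihood with old prior" against "old likelihood with new prior" by numerical integration on a fine grid.

```python
import numpy as np

def normal_pdf(x, mean, std):
    """Density of a univariate Gaussian."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

mu, sigma = 1.5, 0.6                 # hypothetical improved prior N(mu, sigma^2)

def g(z):
    """Maps Z ~ N(0, 1) to g(Z) ~ N(mu, sigma^2)."""
    return mu + sigma * z

z = np.linspace(-12.0, 12.0, 200001)  # integration grid over the latent space
dz = z[1] - z[0]
x = 0.8                               # a single (made up) data point

# Toy likelihood l(x|z) = N(x; z, 1).
new_lik_old_prior = np.sum(normal_pdf(x, g(z), 1.0) * normal_pdf(z, 0.0, 1.0)) * dz
old_lik_new_prior = np.sum(normal_pdf(x, z, 1.0) * normal_pdf(z, mu, sigma)) * dz

# For this Gaussian toy case both integrals equal N(x; mu, 1 + sigma^2).
closed_form = normal_pdf(x, mu, np.sqrt(1.0 + sigma**2))
print(new_lik_old_prior, old_lik_new_prior, closed_form)
```

The three printed values should agree to numerical precision, showing that composing the likelihood with $g$ reproduces the marginal obtained by changing the prior.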

4 Experimental validation

The purpose of this section is to answer the question: are the theoretical results introduced in this article of practical consequence? The way we intend to answer this question is to pose a hypothetical situation involving a specific image modeling problem where there has been a significant amount of innovation in the design of priors and likelihood functions, and then to see if our lower bound, or quantities motivated by the lower bound, can be used, without the benefit of hindsight to help guide the modeling task.

We stress that it makes little sense to choose an “optimal” likelihood function or “optimal” prior if the resulting model overfits the training data. We have taken specific steps to ensure that we do not overfit. The methodology that we follow, borrowed from Tomczak & Welling (2018), involves training a model with checkpoints, which store the best model found from the standpoint of its performance on a validation data set. The training is allowed to go on even as the validation set performance degrades, but only up to a certain number of epochs (50), during which an even better model may be found or not. If a better model is not found within the given number of epochs, then the training is terminated early, and the model used is the best model as per the checkpoints. We then apply our results on that model. Thus, if it turns out that our mathematical results suggest that it is difficult to improve the model, then we have the best of both worlds - a model that is believed to not overfit while simultaneously being close to the best log likelihood one can expect (when optimizing a prior), or a model that cannot be improved much more in a specific way (when improving the likelihood function).
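The checkpoint-with-patience procedure described above can be sketched generically as follows. This is an illustration only: `train_one_epoch` and `validation_loss` are hypothetical callables supplied by the user, and the loop is not taken from the VampPrior code base.

```python
def train_with_early_stopping(train_one_epoch, validation_loss,
                              patience=50, max_epochs=10_000):
    """Keep the best model seen on validation data; stop after `patience`
    consecutive epochs without improvement."""
    best_loss = float("inf")
    best_state = None
    epochs_since_best = 0
    for epoch in range(max_epochs):
        state = train_one_epoch(epoch)        # hypothetical: returns the current model state
        loss = validation_loss(state)         # hypothetical: lower is better
        if loss < best_loss:
            best_loss, best_state = loss, state   # checkpoint the new best model
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                break                         # terminate early, keep the checkpoint
    return best_state, best_loss

# Toy usage: the "model state" is just the epoch index, with made up validation losses.
losses = [5.0, 4.0, 4.5, 3.0, 3.5, 3.6, 3.7, 3.8]
best_state, best_loss = train_with_early_stopping(
    train_one_epoch=lambda epoch: epoch,
    validation_loss=lambda state: losses[state],
    patience=3,
    max_epochs=len(losses),
)
print(best_state, best_loss)
```

In the toy run the best validation loss occurs at epoch 3, and training stops three epochs later without finding a better model.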

The main analysis tool revolves around the quantity $c(z)$ defined in (6) as $c(z) = \frac{1}{N}\sum_{i=1}^{N} \ell(x_i|z)/p(x_i)$. We know that if $\sup_z \log c(z) > 0$, then it is possible to improve the negative log likelihood of a model by either changing the prior while keeping the likelihood function fixed, or by changing the likelihood while keeping the prior fixed, the scope of improvement being identical in both cases, and upper bounded by $\sup_z \log c(z)$ nats. We also know that if $\sup_z \log c(z) = 0$ then the prior cannot be improved any more (for the given likelihood function), but the likelihood function may still be improved.

Thus $\sup_z \log c(z)$ is in principle an attractive quantity to help guide the modeling task. If the latent space $\mathcal{Z}$ is finite, as is the case for example with discrete variational autoencoders (Rolfe, 2017), then it is straightforward to compute this quantity provided $|\mathcal{Z}|$ is not too large. However if $\mathcal{Z}$ is continuous, in most practical situations computing this quantity won't be possible.

The alternative is to choose samples $\{z_1, \dots, z_m\}$ in some reasonable way, and then to compute some statistic of $\{\log c(z_1), \dots, \log c(z_m)\}$. Recall that the individual elements $c(z)$ can be used to improve the prior using the Blahut-Arimoto update rule:

$$p'(z) = p(z)\,c(z)$$

If $\log c(z)$ is close to a zero (infinite dimensional) vector, then the update rule would modify the prior very little, and because computing a rate-distortion function is a convex optimization problem, this implies that we are closer to optimality. Inspired by this observation, we will compute the following two statistics:

$$\max_{i} \log c(z_i), \qquad \mathrm{Std}_{i}\big[\log c(z_i)\big] \qquad (16)$$

which we will call the glossy statistics, in reference to the fact that they are supporting a generative modeling process using lossy compression ideas. The core idea is that the magnitude of these statistics is an indicator of the degree of sub-optimality of the prior. As discussed previously, sub-optimality of a prior for a fixed likelihood function immediately implies sub-optimality of the likelihood function for a fixed prior. These statistics will vary depending on how the sampling has been done, and thus can only be used as a qualitative metric of the optimality of the prior.
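A numerically stable way to compute statistics of this kind is to work in the log domain. The sketch below is a hedged illustration rather than the authors' implementation: it assumes the user already has a matrix `log_lik` of log likelihood values $\log \ell(x_i|z_j)$ and a vector `log_px` of (estimated) $\log p(x_i)$, and it takes the standard deviation to be the population standard deviation around the mean, which is an assumption about the exact definition.

```python
import numpy as np

def glossy_statistics(log_lik, log_px):
    """Max and standard deviation of log c(z_j), where
    c(z_j) = (1/N) sum_i l(x_i | z_j) / p(x_i).

    log_lik[i, j] = log l(x_i | z_j), shape (N, m)
    log_px[i]     = log p(x_i) (or an estimate of it), shape (N,)
    """
    a = log_lik - log_px[:, None]                 # log of each ratio l(x_i|z_j) / p(x_i)
    shift = a.max(axis=0)                         # per-sample shift to avoid overflow
    log_c = shift + np.log(np.mean(np.exp(a - shift), axis=0))
    return log_c.max(), log_c.std()

# Toy usage with made up log likelihood values.
rng = np.random.default_rng(3)
log_lik = rng.normal(-90.0, 5.0, size=(4, 3))
log_px = rng.normal(-88.0, 1.0, size=4)
max_stat, std_stat = glossy_statistics(log_lik, log_px)
print(max_stat, std_stat)
```

Subtracting the per-column maximum before exponentiating matters for image models, where log likelihood differences of hundreds or thousands of nats would otherwise overflow.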

In Variational Inference it is a common practice to introduce a “helper” distribution $Q(z|x)$ and to optimize the ELBO lower bound of the log likelihood $\log p(x)$. Beyond its use as an optimization trick, the distribution $Q(z|x)$, also called the “encoder” in the variational autoencoder setting (Kingma & Welling, 2014), plays an important practical role: it allows us to map a data sample $x$ to a distribution over the latent space.

Note that one wants to find values of $z$ for which $\log c(z)$ is “large”. One way to find good such instances is to note that the dependency on $z$ is through $\ell(x_i|z)$. Additionally note that $Q(z|x)$ in essence is designed to predict, for a given $x$, values of $z$ for which $\ell(x|z)$ will be large. Our proposal thus is to set $z_i$ to be the mean of $Q(z|x_i)$ for $i = 1, \dots, N$ (and thus $m = N$). There are many additional variants of this procedure that will in general result in different estimates for the glossy statistics. We settled on this choice as we believe that the general conclusions that one can draw from this experiment are well supported by this choice.

4.1 Experiments and interpretation

Our experimental setup is an extension of the publicly available source code that the authors of the VampPrior article (Tomczak & Welling, 2018) used in their work. We use both a synthetic dataset as well as standard datasets. The purpose of the synthetic dataset is to illustrate how the statistics proposed evolve during training of a model where the true underlying latent variable model is known. For space reasons, the results for the synthetic dataset are included in the Appendix. The image modeling data sets that we use are Static MNIST (Larochelle & Murray, 2011), OMNIGLOT (Lake et al., 2015), Caltech 101 Silhouettes (Marlin et al., 2010), Frey Faces (FreyFaces), Histopathology (Tomczak & Welling, 2016) and CIFAR (Krizhevsky, 2009). In order to explore a variety of prior and likelihood function combinations, we extended the VampPrior code to create the samples $\{z_i\}$ using the strategy described above, and implemented the computation of the $c(z_i)$ quantities, both of which are straightforward given that the ELBO optimization gives us $Q(z|x)$ and an easy means to evaluate $\ell(x|z)$ for any desired value of $z$.

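| Likelihood function | Prior class | staticMNIST NLL | max stat | std stat | Omniglot NLL | max stat | std stat |
|---|---|---|---|---|---|---|---|
| VAE | standard | 88.44 | 15.7 | 34.3 | 107.83 | 18.4 | 27.3 |
| VAE | VampPrior | 85.78 | 12.1 | 37.6 | 107.62 | 19.0 | 25.6 |
| HVAE ($L = 2$) | standard | 86.21 | 11.8 | 38.3 | 103.40 | 16.3 | 33.3 |
| HVAE ($L = 2$) | VampPrior | 84.96 | 9.8 | 40.1 | 103.78 | 14.6 | 31.6 |
| PixelHVAE ($L = 2$) | standard | 80.41 | 7.3 | 3.0 | 90.80 | 10.3 | 1.0 |
| PixelHVAE ($L = 2$) | VampPrior | 79.86 | 7.7 | 4.1 | 90.99 | 8.0 | 0.8 |

| Likelihood function | Prior class | Caltech 101 NLL | max stat | std stat | Frey Faces NLL | max stat | std stat |
|---|---|---|---|---|---|---|---|
| VAE | standard | 128.63 | 79.4 | 160.5 | 1808.19 | 89.3 | 331.5 |
| VAE | VampPrior | 127.64 | 63.6 | 143.7 | 1746.33 | 82.3 | 349.1 |
| HVAE ($L = 2$) | standard | 125.82 | 59.4 | 204.8 | 1798.42 | 170.2 | 347.0 |
| HVAE ($L = 2$) | VampPrior | 121.02 | 47.2 | 210.7 | 1761.82 | 152.0 | 325.5 |
| PixelHVAE ($L = 2$) | standard | 85.79 | 18.4 | 7.4 | 1687.10 | 45.7 | 55.1 |
| PixelHVAE ($L = 2$) | VampPrior | 86.34 | 20.8 | 7.5 | 1676.04 | 9.4 | 33.8 |

| Likelihood function | Prior class | Histopathology NLL | max stat | std stat | cifar10 NLL | max stat | std stat |
|---|---|---|---|---|---|---|---|
| VAE | standard | 3295.07 | 167.8 | 324.5 | 13812.19 | 198.1 | 1325.3 |
| VAE | VampPrior | 3286.61 | 69.5 | 348.9 | 13817.33 | 628.1 | 1303.9 |
| HVAE ($L = 2$) | standard | 3149.11 | 127.3 | 475.3 | 13400.96 | 479.3 | 1634.4 |
| HVAE ($L = 2$) | VampPrior | 3127.67 | 92.2 | 484.9 | 13399.95 | 195.6 | 1539.0 |
| PixelHVAE ($L = 2$) | standard | 2636.78 | -0.0 | 0.0 | 10915.34 | 178.4 | 105.5 |
| PixelHVAE ($L = 2$) | VampPrior | 2625.61 | -0.1 | 0.0 | 10961.67 | 246.7 | 36.3 |

Table 1: Glossy statistics for our experiments - test NLL in nats. Lower is better. $L$ denotes the number of stochastic layers.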

For the choice of priors, we use the standard zero mean, unit variance Gaussian prior as well as the variational mixture of posteriors prior parametric family from Tomczak & Welling (2018) with 500 pseudo inputs for all experiments. Our choices for the autoencoder architectures are a single stochastic layer Variational Autoencoder (VAE) with two hidden layers (300 units each), a two stochastic layer hierarchical Variational Autoencoder (HVAE) described in Tomczak & Welling (2018), as well as a PixelHVAE, a variant of PixelCNN (van den Oord et al., 2016) which uses the HVAE idea described earlier. Each of these (VAE, HVAE, PixelHVAE) represents a distinct likelihood function class. We are aware that there are many more choices of priors and models than considered in these experiments; we believe that the main message of the paper is sufficiently conveyed with the restricted choices we have made. In all cases the dimension of the latent vector is 40 for both the first and second stochastic layers. We use Adam (Kingma & Ba, 2015) as the optimization algorithm. The test log likelihood reported is not an ELBO evaluation - it is obtained through importance sampling (Burda et al., 2016). We refer the reader to Tomczak & Welling (2018) for a precise description of their setup.

We now refer the reader to Table 1. We want the reader to focus on this prototypical setting:

A modeling expert has chosen a Gaussian unit variance, zero mean prior (denoted by ”standard”), and has decided on a simple likelihood function class (denoted by ”VAE”). Can these choices be improved?

The goal is to answer this question without actually doing the task of improving the parametric families. This setting is the first row for each of the data sets in Table 1. We first discuss the MNIST experiment (top left in the table). The modeling expert computes the max and std statistics, and notices that they are relatively large compared to the negative log likelihood that was estimated. Given the discussion in this paper, the expert can conclude that the negative log likelihood can be further improved by either updating the prior or the likelihood function.

A modeling expert may then decide to improve the prior (second row of the table), or improve the likelihood function (third and fifth rows of the table). We see that in all these cases, there are improvements in the negative log likelihood. It would be incorrect to say that this was predicted by the glossy statistics in the first row; instead what we can say is that this was allowed - it can still be the case that the statistics allow an improvement on the negative log likelihood, but a particular enhancement proposal does not result in any improvement. Next notice that the std glossy statistic for the fifth row is much smaller than in the previous configurations. The suggestion is that improving the negative log likelihood by improving the prior is becoming harder. Indeed, in the sixth row we see that a more complex prior class did not result in an improvement in the negative log likelihood. As discussed previously, this does not mean that the likelihood function class can no longer be improved. It does mean though that a particular class of enhancements to the likelihood function - in particular, transforming the latent variable with an invertible transformation - will likely not be a fruitful direction for improvement.

The general pattern described above repeats in other data sets. For example, for Caltech 101 the reduction of the max and std glossy statistics when PixelHVAE is introduced is even more pronounced than with MNIST, suggesting a similar conclusion more strongly. A result that jumps out though is the PixelHVAE result for Histopathology, which prompted us to take a close look - here the statistics are actually zero. It turns out that this is an instance where the likelihood function learned to ignore the latent variable, a phenomenon initially reported by Bowman et al. (2016). It also serves as a cautionary tale: the glossy statistics only tell us whether for a fixed likelihood function one can improve the prior, or whether for a fixed prior, a certain type of change to the likelihood function will actually be an improvement to the negative log likelihood. If the likelihood function is ignoring the latent variable, any prior would be optimal, and no transformation of that prior would result in a better likelihood function, which is what the statistics report. We stress that the exact numbers being reported matter little - a simple change in the sampling process for producing the $\{z_i\}$ will immediately result in different numbers. However, our experimentation with changing the sampling process resulted in essentially the same general conclusions we are deriving above.

Based on these experiments, we claim that the information theoretically motivated quantity and its associated glossy statistics (16) do provide useful guidance in the modeling task, and given how easy they are to estimate, could be regularly checked by a modeling expert to gauge whether their model can be further improved by either improving the prior or the likelihood function.

5 Conclusions and future work

The main goal of this article was to argue strongly for the inclusion of rate-distortion theory as key for developing a theory of representation learning. We showed how some classical results in latent variable modeling can be seen as relatively simple consequences of results in rate-distortion function computation, and further argued that these results help in understanding whether priors or likelihood functions can be improved further (the latter with some limitations), demonstrating this with experimental results on an image modeling problem. There is a large repertoire of tools, algorithms and theoretical results in lossy compression that we believe can be applied in meaningful ways to latent variable modeling. For example, while rate-distortion function computation is an important subject in information theory, the true crown jewel is Shannon’s famous source coding theorem; to date we are not aware of this important result being connected directly to the problem of latent variable modeling. Similarly, rate-distortion theory has evolved since Shannon’s original publication to treat multiple sources and sinks; we believe that these extensions are of relevance in more complex modeling tasks. This research will be the subject of future work.


  • Alemi et al. (2018) Alexander A. Alemi, Ben Poole, Ian Fischer, Joshua V. Dillon, Rif A. Saurous, and Kevin Murphy. Fixing a broken ELBO. In Proceedings of the 35th International Conference on Machine Learning. 2018.
  • Banerjee et al. (2004) Arindam Banerjee, Inderjit Dhillon, Joydeep Ghosh, and Srujana Merugu. An information theoretic analysis of maximum likelihood mixture estimation for exponential families. In Proceedings of the Twenty-first International Conference on Machine Learning, ICML ’04, pp. 8–, New York, NY, USA, 2004. ACM. ISBN 1-58113-838-5.
  • Berger (1971) T. Berger. Rate Distortion Theory: A Mathematical Basis for Data Compression. Prentice-Hall electrical engineering series. Prentice-Hall, 1971.
  • Blahut (1972) R. Blahut. Computation of channel capacity and rate-distortion functions. IEEE Transactions on Information Theory, 18(4):460–473, Jul 1972.
  • Bowman et al. (2016) Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. Generating sentences from a continuous space. In The SIGNLL Conference on Computational Natural Language Learning. 2016.
  • Burda et al. (2016) Yuri Burda, Roger B. Grosse, and Ruslan Salakhutdinov. Importance Weighted Autoencoders. In The International Conference on Learning Representations (ICLR). 2016.
  • (7) FreyFaces. Frey Faces Data Set. URL
  • Giraldo & Príncipe (2013) Luis Gonzalo Sánchez Giraldo and José C. Príncipe. Rate-distortion auto-encoders. CoRR, abs/1312.7381, 2013. URL
  • Higgins et al. (2017) I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. Beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. In The International Conference on Learning Representations (ICLR), Toulon. 2017.
  • Kingma & Ba (2015) Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR). 2015.
  • Kingma & Welling (2014) D.P. Kingma and M. Welling. Auto-Encoding Variational Bayes. In The International Conference on Learning Representations (ICLR), Banff. 2014.
  • Krizhevsky (2009) Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
  • Lake et al. (2015) Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human-level concept learning through probabilistic program induction. In Science, volume 350, pp. 1332–1338. 2015.
  • Larochelle & Murray (2011) Hugo Larochelle and Iain Murray. The neural autoregressive distribution estimator. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS). 2011.
  • Lindsay (1983) Bruce G. Lindsay. The geometry of mixture likelihoods: A general theory. Ann. Statist., 11(1):86–94, 03 1983.
  • Marlin et al. (2010) Benjamin Marlin, Kevin Swersky, Bo Chen, and Nando de Freitas. Inductive principles for restricted Boltzmann machine learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9, pp. 509–516, 2010.
  • Neal & Hinton (1998) Radford M. Neal and Geoffrey E. Hinton. A View of the EM Algorithm that Justifies Incremental, Sparse, and other Variants, pp. 355–368. Springer Netherlands, Dordrecht, 1998.
  • Rolfe (2017) Jason Tyler Rolfe. Discrete variational autoencoders. In The International Conference on Learning Representations (ICLR), Toulon. 2017.
  • Rose (1998) Kenneth Rose. Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, (11):2210–2239, November 1998.
  • Shannon (1959) C. E. Shannon. Coding theorems for a discrete source with a fidelity criterion. In IRE Nat. Conv. Rec., Pt. 4, pp. 142–163. 1959.
  • Shwartz-Ziv & Tishby (2017) Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. CoRR, abs/1703.00810, 2017. URL
  • Slonim & Weiss (2002) Noam Slonim and Yair Weiss. Maximum likelihood and the information bottleneck. In Proceedings of the 15th International Conference on Neural Information Processing Systems, NIPS’02, pp. 351–358, Cambridge, MA, USA, 2002. MIT Press.
  • Tishby et al. (1999) N. Tishby, F. Pereira, and W. Bialek. The information bottleneck method. In Proceedings of the 37-th Annual Allerton Conference on Communication, Control and Computing, pp. 368–377. 1999.
  • Tishby & Zaslavsky (2015) Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop. 2015.
  • Tomczak & Welling (2016) Jakub M. Tomczak and Max Welling. Improving variational auto-encoders using householder flow. Neural Information Processing Systems Workshop on Bayesian Deep Learning, 2016.
  • Tomczak & Welling (2018) Jakub M. Tomczak and Max Welling. VAE with a VampPrior. In The 21st International Conference on Artificial Intelligence and Statistics. 2018.
  • van den Oord et al. (2016) Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel Recurrent Neural Networks. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA. 2016.
  • Watanabe & Ikeda (2015) Kazuho Watanabe and Shiro Ikeda. Entropic risk minimization for nonparametric estimation of mixing distributions. Machine Learning, 99(1):119–136, Apr 2015.

6 Appendix

6.1 Supporting proof for Equation (15)

In the paper we argue that if one is able to improve a latent variable model by changing the prior, then the same improvement can be attained by changing the likelihood function instead. In order to prove this, we claimed that if is a measurable function the two following Lebesgue integrals are identical:


This can be seen as follows: for a measurable set , by definition

and therefore setting for any given real ,

One definition of the Lebesgue integral is through the Riemann integral:

This demonstrates (15).
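The three steps of this argument can be written out as follows. This is a sketch in assumed notation, which may differ from the paper's own symbols: $P$ is a prior on the latent space, $g$ is a measurable map on that space, $Q = P \circ g^{-1}$ is the prior induced by $g$, and $h$ is a nonnegative measurable function such as $z \mapsto \ell(x \mid z)$.

```latex
% Assumed notation: P prior, g measurable map, Q = P o g^{-1}, h >= 0 measurable.
\begin{align*}
% Step 1: definition of the induced prior on a measurable set A.
Q(A) &= P\bigl(g^{-1}(A)\bigr) = P\bigl(\{z : g(z) \in A\}\bigr), \\
% Step 2: choose A = {w : h(w) > t} for a given real t.
Q\bigl(\{w : h(w) > t\}\bigr) &= P\bigl(\{z : h(g(z)) > t\}\bigr), \\
% Step 3: express each Lebesgue integral as a Riemann integral of tail measures.
\int h(w)\, dQ(w)
  &= \int_0^\infty Q\bigl(\{w : h(w) > t\}\bigr)\, dt
   = \int_0^\infty P\bigl(\{z : h(g(z)) > t\}\bigr)\, dt
   = \int h(g(z))\, dP(z).
\end{align*}
```

Read with $h(z) = \ell(x \mid z)$, the left-hand side is the marginal likelihood under the changed prior $Q$ with the original likelihood, and the right-hand side is the marginal likelihood under the original prior $P$ with the likelihood composed with $g$, which is the sense in which a change of prior can be traded for a change of likelihood function.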

6.2 Experiments on a synthetic data set

The theme of the article is how we can use information theoretic methods to understand how much more a model can be improved. One way to illustrate this is to create training and test data samples coming from a known “true” latent variable model, and then to showcase the statistics that we are proposing in the context of a training event. As the training goes on, we would hope to see the statistics “close in” on the true test negative log likelihood.

For the synthetic data set, the true latent variable model has a discrete latent alphabet, each element of which denotes a digit that we took from a static binarized MNIST data set (Larochelle & Murray, 2011). We assume that the prior is the uniform distribution over this alphabet, and that the likelihood is a probability law which adds i.i.d. Bernoulli noise to each of the binary pixels of the digit associated with the latent variable. Following the MNIST data set structure, we created a data set with 50K training, 10K validation and 10K test images by sampling from the latent variable model. The 10K test images have a true negative log likelihood score of 78.82.
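The construction of such a data set, and the exact evaluation of its true negative log likelihood, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the templates are random binary vectors standing in for binarized MNIST digits, the alphabet size and noise level (`num_templates`, `flip_prob`) are hypothetical, and "adding" Bernoulli noise is interpreted as flipping a pixel with probability `flip_prob`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the latent alphabet and the noise level.
num_templates, num_pixels, flip_prob = 10, 784, 0.05
templates = rng.integers(0, 2, size=(num_templates, num_pixels))

def sample_dataset(num_samples):
    """Draw z uniformly, then flip each pixel of template z with prob. flip_prob."""
    z = rng.integers(0, num_templates, size=num_samples)
    flips = rng.random((num_samples, num_pixels)) < flip_prob
    x = np.where(flips, 1 - templates[z], templates[z])
    return x, z

def true_nll(x):
    """Exact average negative log likelihood (nats) under the generating model."""
    # mismatches[i, k] = number of pixels where image i differs from template k.
    mismatches = (x[:, None, :] != templates[None, :, :]).sum(axis=2)
    log_lik = (mismatches * np.log(flip_prob)
               + (num_pixels - mismatches) * np.log1p(-flip_prob))
    # Marginalize over the uniform prior with a numerically stable log-sum-exp.
    log_marginal = np.logaddexp.reduce(log_lik, axis=1) - np.log(num_templates)
    return -log_marginal.mean()

x_test, z_test = sample_dataset(1000)
nll = true_nll(x_test)
print(f"true test NLL: {nll:.2f} nats")
```

Because the latent alphabet is finite, the marginal likelihood is a finite sum and the reference score is exact rather than estimated. The value printed here depends on the assumed templates and noise level and is not comparable to the 78.82 reported above.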

We refer the reader to Figure 1 for our experimental results on the synthetic data set. For this experiment, we consider only a single stochastic layer VAE with the standard zero mean, unit variance prior. This figure is a graphical depiction of the upper and lower bounds (3) and (4), applied to the test data set negative log likelihood. The figure shows how the estimate of the lower bound starts to approach the true negative log likelihood as the epochs progress - the gap between the upper and lower bounds is exactly


which can be computed exactly since the latent variable space is finite. The figure also overlays on a second axis the square root of the glossy variance statistic which we may also call the glossy std statistic.

What we were hoping to show is that the statistics initially indicate a large potential margin for improvement which then shrinks as training converges. As we can see from the figure, this intuition is confirmed. The gap between upper and lower bounds (18) starts large, but does improve significantly as the training progresses. We also see improvements in the standard deviation statistic, although they are not as dramatic as those in the gap between the upper and lower bounds; we suspect we would have had to wait longer to see this effect.

Figure 1: Upper bound and glossy lower bound estimate on negative log likelihood as the training evolves for the synthetic data set. The glossy standard deviation statistic is also overlaid.