Finding a model able to identify the underlying causal factors (hidden representation) of the visible data is a key problem in machine learning research. In the literature it is possible to distinguish many ways to learn such a representation (dinh2016density, ; hinton2006fast, ; maddison2017filtering, ; radford2015unsupervised, ), but recently two families of models have become dominant: the Variational AutoEncoder (VAE) (kingma2013auto, ; rezende2014stochastic, ) and the Generative Adversarial Network (GAN) (goodfellow2014generative, ). Although the two families follow different approaches, they share the same principle: assuming there exists a representation $z$ for any observation $x$ distributed according to $p(x)$, find the probability $p_\theta(x)$ approximating the original one according to a specific metric; in particular, VAE minimizes a variational upper bound of the Kullback-Leibler divergence, $D_{KL}(p(x)\,\|\,p_\theta(x))$, whereas GAN minimizes the Jensen-Shannon divergence, $D_{JS}(p(x)\,\|\,p_\theta(x))$. However, since in both cases the objective function does not depend on the learned representation, $z$, it is not guaranteed that these methods will learn a useful representation for the generative model.
In particular, VAE, a method defined to learn a good representation, tends to ignore the encoded variable $z$ when the decoder is particularly powerful, i.e. it does not learn a useful representation. Common suggestions to overcome the poor representation issue are to bound the encoding capacity (chen2016variational, ; higgins2017beta, ) or to maximize the mutual information between the visible and hidden representations (alemi2017fixing, ); indeed, a useful representation is one containing the salient properties of the visible data.
Starting from such observations, in this manuscript we propose a method that maximizes a variational lower bound of the mutual information between the visible and hidden representation while maintaining a bound on the entropy of the encoded data. The derived model is a variational autoencoder having the same form as the Wasserstein AutoEncoder (WAE) (tolstikhin2017wasserstein, ), an autoencoder minimizing the optimal transport between the visible and generated data. Thanks to the information description we are able to highlight the role of the network capacity: the amount of information that can be stored by the representation. In particular, we observe that in order to learn a good representation it is not necessary to minimize the encoding mutual information channel as suggested in (higgins2017beta, ), but it is sufficient to bound the capacity of the network, i.e. the entropy of the hidden term. The obtained results lead us to argue that an unsupervised network should optimize a capacity-constrained InfoMax measure, a principle slightly different from the Information Bottleneck (tishby2000information, ).
The paper is organized as follows: in the second section we briefly describe the VAE and its variants; in the third and fourth sections we describe the Variational InfoMax (VIMAE) method and the related work. We conclude with the experimental results and the conclusions.
The aim of this section is to describe VAE, understand its principal issues and describe the two most relevant approaches to overcome such issues.
2.1 Notation and preliminary definitions
We use calligraphic letters (e.g. $\mathcal{X}$) for sets, capital letters (e.g. $X$) for random variables, and lower case letters (e.g. $x$) for their samples. With abuse of notation we denote both the probability and the corresponding density with lower case letters (e.g. $p(x)$).
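Given two probability distributions $p$ and $q$, the $f$-divergence
$$D_f(p\,\|\,q) = \int f\!\left(\frac{p(x)}{q(x)}\right) q(x)\,dx,$$
where $f$ is a convex function with $f(1) = 0$, is an (intuitive) measure of the distance between the distributions $p$ and $q$. In the case $f(t) = t\log t$, $D_f$ is called the Kullback-Leibler (KL) divergence, $D_{KL}(p\,\|\,q)$.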
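As a small illustration of this definition, the following sketch evaluates the $f$-divergence for discrete distributions and checks that the generator $f(t) = t\log t$ recovers the KL divergence. The two probability vectors and all function names are hypothetical and chosen only for the example.

```python
import math

def f_divergence(p, q, f):
    """Discrete f-divergence: sum_x q(x) * f(p(x) / q(x)); assumes q(x) > 0."""
    return sum(qx * f(px / qx) for px, qx in zip(p, q))

def kl_generator(t):
    """Generator f(t) = t log t of the KL divergence, with f(0) = 0 by continuity."""
    return t * math.log(t) if t > 0 else 0.0

def kl_divergence(p, q):
    """Direct definition: sum_x p(x) * log(p(x) / q(x)), in nats."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.3, 0.2]  # hypothetical distribution
q = [0.4, 0.4, 0.2]  # hypothetical reference distribution

kl_via_f = f_divergence(p, q, kl_generator)
kl_direct = kl_divergence(p, q)
print(kl_via_f, kl_direct)  # the two values agree up to rounding
```

The divergence is zero when the two arguments coincide and strictly positive otherwise, which is the property the rest of the paper relies on.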
Given two random variables $X$ and $Z$, with joint distribution $p(x,z)$ and marginals $p(x)$ and $p(z)$, the mutual information
$$I(X;Z) = D_{KL}\big(p(x,z)\,\|\,p(x)\,p(z)\big)$$
is a measure of the reduction of uncertainty in $X$ due to the knowledge of $Z$.
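The definition of mutual information as a KL divergence can be made concrete on a discrete joint table. The sketch below is illustrative only; the two joint tables (an independent pair and a perfect copy) are hypothetical examples.

```python
import math

def mutual_information(joint):
    """I(X;Z) = KL(p(x,z) || p(x)p(z)) for a joint table joint[x][z], in nats."""
    p_x = [sum(row) for row in joint]
    p_z = [sum(col) for col in zip(*joint)]
    return sum(
        p_xz * math.log(p_xz / (p_x[i] * p_z[j]))
        for i, row in enumerate(joint)
        for j, p_xz in enumerate(row)
        if p_xz > 0
    )

independent = [[0.25, 0.25], [0.25, 0.25]]  # Z carries no information about X
copy = [[0.5, 0.0], [0.0, 0.5]]             # Z determines X exactly

mi_independent = mutual_information(independent)
mi_copy = mutual_information(copy)
print(mi_independent, mi_copy)
```

In the independent case the mutual information vanishes, while in the copy case it equals the entropy of a fair binary variable, $\log 2$ nats.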
2.2 Variational autoencoder
From now on let us assume that the unknown distribution of the data $p(x)$ coincides with the empirical one $p_D(x)$, and that the distribution of the latent representation $p(z)$ is known. In this context the VAE is a model solving the following optimization problem: find the generative model $p_\theta(x|z)$, specified by the parameters $\theta$ of the associated neural network, maximizing the ELBO objective
$$E_{p_D(x)}\big[E_{q_\phi(z|x)}[\log p_\theta(x|z)]\big] - E_{p_D(x)}\big[D_{KL}(q_\phi(z|x)\,\|\,p(z))\big], \qquad (2)$$
a lower bound of the intractable marginal likelihood $E_{p_D(x)}[\log p_\theta(x)]$. The ELBO objective is optimized by a regularized autoencoder, with encoder and decoder parameterizing, respectively, the inference and generative distributions, $q_\phi(z|x)$ and $p_\theta(x|z)$, and with regularizer defined by the rate term $E_{p_D(x)}[D_{KL}(q_\phi(z|x)\,\|\,p(z))]$, measuring the excess number of bits required to encode samples from the encoder using the optimal code designed for $p(z)$.
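To make the two terms of the ELBO concrete, the sketch below evaluates the rate term in closed form under an assumption that is not stated in the text: a diagonal Gaussian encoder and a standard normal prior. The function names, the example means and log-variances, and the placeholder log-likelihood value are all hypothetical.

```python
import math

def gaussian_rate(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)) in nats (assumed Gaussian case)."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

def elbo(log_likelihood, mu, log_var):
    """Single-datum ELBO: reconstruction term minus the rate term."""
    return log_likelihood - gaussian_rate(mu, log_var)

# An encoder output equal to the prior has zero rate: it encodes nothing.
rate_at_prior = gaussian_rate([0.0, 0.0], [0.0, 0.0])

# A more concentrated, shifted posterior pays a positive rate.
rate_informative = gaussian_rate([1.0, -2.0], [-1.0, -1.0])

print(rate_at_prior, rate_informative)
print(elbo(-10.0, [0.0, 0.0], [0.0, 0.0]))  # placeholder log-likelihood of -10 nats
```

The example shows why a pure rate penalty pushes the encoder towards the prior: the only way to reduce the penalty to zero is to output the same distribution for every input.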
2.3 Uninformative representation issue
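As underlined in the introduction, the main issue of VAE is that it may learn an uninformative representation and a $z$-independent generative model $p_\theta(x|z)$. Such issues are intrinsic to the ELBO objective (2), which reaches the optimum when $p_\theta(x|z) = p_D(x)$ for every $z$ (zhao2017infovae, ). The rate term, which can be rewritten as
$$E_{p_D(x)}\big[D_{KL}(q_\phi(z|x)\,\|\,p(z))\big] = I_q(X;Z) + D_{KL}(q_\phi(z)\,\|\,p(z)), \qquad q_\phi(z) = E_{p_D(x)}[q_\phi(z|x)],$$
is a penalty on the encoding capacity, and reaches the optimum when $q_\phi(z|x) = p(z)$, with $I_q(X;Z) = 0$, i.e. when $z$ does not encode any information about the input $x$.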
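The decomposition of the rate term into an information term and a marginal divergence can be checked numerically on a toy discrete model. This is a sketch under stated assumptions: the two-symbol data distribution, the encoder table and the uniform prior are hypothetical, and the aggregate posterior is the encoder averaged over the data.

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions, in nats."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

p_data = [0.5, 0.5]                 # hypothetical data distribution over x
encoder = [[0.9, 0.1], [0.2, 0.8]]  # hypothetical q(z|x), rows indexed by x
prior = [0.5, 0.5]                  # hypothetical prior over z

# Aggregate posterior q(z) = sum_x p_D(x) q(z|x).
q_z = [sum(p_data[x] * encoder[x][z] for x in range(2)) for z in range(2)]

rate = sum(p_data[x] * kl(encoder[x], prior) for x in range(2))
mutual_info = sum(p_data[x] * kl(encoder[x], q_z) for x in range(2))
marginal_kl = kl(q_z, prior)

print(rate, mutual_info + marginal_kl)  # both sides of the decomposition
```

Both printed numbers coincide, so lowering the rate necessarily lowers the encoded information, the marginal mismatch, or both.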
We now describe the two most relevant models that try to overcome the uninformative representation issue.
In (zhao2017infovae, ) the InfoVAE family of models was proposed, a generalization of the VAE model optimizing the objective
with and two real hyper-parameters.
The main advantage of this definition is that it is possible to consider separately the two components of the rate term. In particular, in (zhao2017infovae, ) it was observed that by eliminating the information penalty (), the generative performance of the model improves and the representation is more informative.
In (higgins2017beta, ), starting from the observation that the optimal case is rare, but most of the features learned by VAE are uninformative, an opposite approach is proposed: put a high penalty on the rate term. The $\beta$-VAE family is a particular case of InfoVAE. This idea, which at first sight looks counter-intuitive, is based on the observation that, by the additive property of the KL divergence for factorized distributions,
$$D_{KL}(q_\phi(z|x)\,\|\,p(z)) = \sum_i D_{KL}(q_\phi(z_i|x)\,\|\,p(z_i)),$$
increasing the penalty associated with the rate is equivalent to penalizing the informativeness of most features, leaving few features containing the relevant information. A similar conclusion was derived in (chen2016variational, ) where, starting from a bits-back coding argument, it is highlighted that minimizing the encoding capacity bounds the extra term added by the variational approximation in (2), $D_{KL}(q_\phi(z|x)\,\|\,p_\theta(z|x))$, measuring the extra code length for using an imprecise inference distribution $q_\phi(z|x)$.
We conclude this section by observing that although the InfoVAE and $\beta$-VAE approaches are antithetic, in both cases the hyper-parameter associated to the KL divergence term is bigger than $1$.
3 The Model
3.1 The Variational InfoMax (VIMAE)
Assuming that the distributions associated to the two random variables $X$ and $Z$ are known, the InfoMax objective is defined as: find the joint distribution $q_\phi(x,z)$ maximizing the mutual information $I(X;Z)$.
Since the definition via KL divergence is computationally intractable, it is necessary to re-write the mutual information as
$$I(X;Z) = H(X) - H(X|Z),$$
where $H(X)$ is the entropy of $X$, a measure of the information contained by the random variable, and $H(X|Z)$ is the conditional entropy, a measure of the information about $X$ lost by $Z$. Since the data distribution $p_D(x)$ is fixed, the entropy $H(X)$ is constant. Then, in order to maximize the mutual information, it is sufficient to minimize the conditional entropy.
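Excluding some special cases (bell1997independent, ), minimizing the conditional entropy is infeasible. Thus it is necessary to consider a variational lower bound of the original objective. In the same fashion as in (agakov2004algorithm, ), we see that by the non-negativity of the KL divergence, for any $p_\theta(x|z)$ the conditional entropy is bounded by the reconstruction accuracy term $-E_{q_\phi(x,z)}[\log p_\theta(x|z)]$; indeed:
$$H(X|Z) = -E_{q_\phi(x,z)}[\log p_\theta(x|z)] - E_{q_\phi(z)}\big[D_{KL}(q_\phi(x|z)\,\|\,p_\theta(x|z))\big] \le -E_{q_\phi(x,z)}[\log p_\theta(x|z)].$$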
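Then the associated variational objective to maximize is given by:
$$\max_{\theta,\phi}\; E_{p_D(x)}\big[E_{q_\phi(z|x)}[\log p_\theta(x|z)]\big] \quad \text{subject to} \quad q_\phi(z) = p(z).$$
In order to proceed to a numerical optimization of this objective, and to optimize the variational conditional entropy by minimizing the reconstruction loss of the associated autoencoder, as done in VAE, it is necessary to remove the condition $q_\phi(z) = p(z)$ and consider the following relaxed form
$$E_{p_D(x)}\big[E_{q_\phi(z|x)}[\log p_\theta(x|z)]\big] - \lambda\, D_f(q_\phi(z)\,\|\,p(z)), \qquad (9)$$
where $\lambda$ is a hyper-parameter associated to the generic $f$-divergence $D_f(q_\phi(z)\,\|\,p(z))$, penalizing all the $q_\phi(z)$ far from $p(z)$. In this way the objective (9) is optimized by a variational autoencoder model with regularizer defined by $D_f(q_\phi(z)\,\|\,p(z))$.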
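The variational bound on the conditional entropy used in this derivation can be checked on a toy discrete model: the cross-entropy of any decoder is at least the true conditional entropy, with equality for the exact posterior over $x$ given $z$. The data distribution, encoder table and the deliberately imperfect decoder below are hypothetical.

```python
import math

p_data = [0.5, 0.5]                 # hypothetical data distribution over x
encoder = [[0.9, 0.1], [0.2, 0.8]]  # hypothetical q(z|x), rows indexed by x

# Joint q(x, z) = p_D(x) q(z|x) and its marginal over z.
joint = [[p_data[x] * encoder[x][z] for z in range(2)] for x in range(2)]
q_z = [joint[0][z] + joint[1][z] for z in range(2)]

def cross_entropy(decoder):
    """-E_{q(x,z)}[log p(x|z)] for a decoder table decoder[z][x], in nats."""
    return -sum(
        joint[x][z] * math.log(decoder[z][x]) for x in range(2) for z in range(2)
    )

# The exact conditional q(x|z) attains the conditional entropy H(X|Z).
exact_decoder = [[joint[x][z] / q_z[z] for x in range(2)] for z in range(2)]
conditional_entropy = cross_entropy(exact_decoder)

# Any other decoder gives a larger value, i.e. a looser bound.
loose_decoder = [[0.6, 0.4], [0.3, 0.7]]
bound = cross_entropy(loose_decoder)

print(conditional_entropy, bound)
```

Minimizing the reconstruction loss over the decoder therefore tightens the bound, while minimizing it over the encoder lowers the conditional entropy itself.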
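From now on let us assume $D_f = D_{KL}$. In this case the regularizer is approximated via the Maximum Mean Discrepancy (MMD) (zhao2017infovae, ), defined as
$$MMD_k\big(q_\phi(z), p(z)\big) = \Big\| \int k(z,\cdot)\,p(z)\,dz - \int k(z,\cdot)\,q_\phi(z)\,dz \Big\|_{\mathcal{H}_k},$$
where $\mathcal{H}_k$ is the Reproducing Kernel Hilbert Space associated to a positive definite kernel $k$.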
In VAE we observed that an uninformative representation was caused by the non-informativeness of the encoding map $q_\phi(z|x)$. Since from equation (9) it is not clear how $q_\phi(z|x)$ behaves, we consider an equivalent representation (zhao2017infovae, ):
From (11) we see that the InfoMax objective (9) can be read as a composition of three sub-objectives: find a generative model resembling the visible representation (first term); maximize the encoded mutual information (fourth term); and learn an inferred distribution $q_\phi(z|x)$ close to the generative posterior $p_\theta(z|x)$. Then the optimum is obtained by $q_\phi(z|x)$ such that $I_q(X;Z)$ is maximal, confirming the validity of the approximation made above.
3.2 Channel capacity
The divergence term in (9) can be rewritten as
$$D_{KL}(q_\phi(z)\,\|\,p(z)) = -H(F(Z)), \qquad (12)$$
where $F$ is the cumulative distribution associated to $p(z)$, i.e. $F'(z) = p(z)$. Thanks to the relationship in (12) we see that minimizing the KL divergence is equivalent to maximizing a constrained entropy of the latent variable. In this way it is possible to interpret $p(z)$ as a constraint on the shape of the distribution $q_\phi(z)$ and on its entropy, since the entropy in (12) is never greater than zero and is equal to zero only when $q_\phi(z) = p(z)$.
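A short sketch of why the KL divergence from the prior can be read as a negative entropy, in the one-dimensional case and assuming the cumulative distribution $F$ of the prior is differentiable and invertible. The auxiliary variable $u = F(z)$ and its density $q_U$ are introduced here only for illustration.

```latex
\begin{aligned}
u &= F(z), \qquad du = p(z)\,dz, \qquad q_U(u) = \frac{q_\phi(z)}{p(z)}, \\
H(F(Z)) &= -\int_0^1 q_U(u)\,\log q_U(u)\,du
         = -\int q_\phi(z)\,\log\frac{q_\phi(z)}{p(z)}\,dz
         = -D_{KL}\big(q_\phi(z)\,\|\,p(z)\big) \le 0 .
\end{aligned}
```

Since $F(Z)$ takes values in $[0,1]$, its differential entropy is at most that of the uniform distribution, which is zero; the maximum is attained exactly when $F(Z)$ is uniform, that is when $q_\phi(z) = p(z)$.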
This observation, although simple, is theoretically relevant: it allows us to interpret the divergence penalty as a bound on the network capacity, i.e. a bound on the information that can be stored by the variable $Z$, and consequently it suggests viewing the variational InfoMax as an approximation of the following objective:
By the constrained InfoMax objective (13), and by the disentangled representation learned by $\beta$-VAE, we deduce that the ability to learn the relevant factors of the visible data is associated to the constraint on the capacity of the channel. Intuitively, since a small capacity network can contain only a small amount of information, the network has to transfer only the relevant features. In order to test this assumption in the experiments (see below) we trained the model assuming $z$ is logistically distributed with unit variance. We chose such a distribution for two reasons: it has less entropy than a Gaussian distribution with the same variance, and it is a common assumption in natural science that the hidden factors of the visible data are logistically distributed (hyvarinen2009natural, ).
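The entropy comparison between the two priors can be verified with the closed-form differential entropies of the univariate Gaussian and logistic distributions. The sketch below uses the standard formulas $\tfrac{1}{2}\log(2\pi e\,\sigma^2)$ and $\log s + 2$, with the logistic scale $s$ chosen so that the variance $s^2\pi^2/3$ equals one; the function names are hypothetical.

```python
import math

def gaussian_entropy(variance):
    """Differential entropy (nats) of a univariate Gaussian."""
    return 0.5 * math.log(2.0 * math.pi * math.e * variance)

def logistic_entropy(variance):
    """Differential entropy (nats) of a univariate logistic distribution."""
    scale = math.sqrt(3.0 * variance) / math.pi  # variance = scale^2 * pi^2 / 3
    return math.log(scale) + 2.0

h_gauss = gaussian_entropy(1.0)
h_logistic = logistic_entropy(1.0)
print(h_gauss, h_logistic)  # approximately 1.419 and 1.405 nats
```

The gap per dimension is small, about 0.014 nats, which is consistent with the Gaussian being the maximum-entropy distribution for a fixed variance.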
4 Related work
Autoencoder models are one of the most widely used families of neural networks for extracting features in an unsupervised way (bengio2013representation, ), and their relationship with Information Theory has been well established since the first unregularized autoencoders (baldi1989neural, ). The classical unregularized autoencoders, minimizing the reconstruction loss, maximize an unbounded information, i.e. they look for a solution in an unconstrained latent space. A solution in this wide space is good only for reconstruction performance, because the representation can contain all the information that can be stored in that space; however, it is impossible to sample from this representation, because the prior is unknown, and moreover such a representation is, in general, not robust to input noise (vincent2008extracting, ).
Many regularized models have been proposed, but the best known is VAE, which minimizes the expected code length of communicating $x$. As we observed in the previous sections, it is not guaranteed that this method finds a useful representation, and in the second section we illustrated the two principal ways to improve VAE.
The objective (9) was derived independently in (tolstikhin2017wasserstein, ) and (zhao2017infovae, ). The derivation in (tolstikhin2017wasserstein, ) is of particular relevance because it allows us to describe an informative model as the one minimizing the transport cost between the original and generated data.
Finally, we underline that if we wish to use a Jensen-Shannon divergence in (9), it is necessary to consider an adversarial network model, discriminating the true samples from the fake ones (goodfellow2014generative, ). In the latter case the obtained model is equivalent to the Adversarial AutoEncoder (makhzani2015adversarial, ). We conclude by remarking that in all the cases cited above the InfoMax objective was never maximized using a prior different from a Gaussian.
Information theoretic literature
Information theory is strongly related to neural networks, and not only to autoencoders. Originally the InfoMax objective was applied to a self-organized system with a single hidden layer (bell1997independent, ; linsker1989application, ), where the bound on the capacity was given by the number of hidden neurons. Recently the (naive) InfoMax has given way to a new information-theoretic principle: the Information Bottleneck (IB) (tishby2000information, ). The idea of this principle is that a feed-forward neural network trained for a task $Y$ tends to learn a minimally sufficient representation of the data, maximizing the following objective:
$$I(Z;Y) - \beta\, I(X;Z). \qquad (14)$$
Although it was shown that in the general case this principle does not hold true (michael2018on, ), the principle was used successfully as a regularization technique in both unsupervised (higgins2017beta, ) and supervised (alemi2016deep, ) settings. We observe that the VIM (9) and IB (14) differ only in the constraint term, respectively the capacity and the encoding information, and coincide in the case of a deterministic encoder.
Here we empirically evaluate the Variational InfoMax model. The section is divided into two parts: in the first part we compare the ability of VAE, $\beta$-VAE and VIMAE to infer the representation $z$, observing that the latter two models are able to learn a posterior fitting the prior $p(z)$ well.
In the second part we compare the quality of the representation learned by the different models, paying particular attention to the differences between a small-entropy, logistically distributed representation and a Gaussian distributed one. In order to evaluate the representation we compare the models on the following tasks: generation, reconstruction and semi-supervised learning; the last two tasks are performed with both corrupted and clean input data in order to evaluate the robustness of the learned features.
Since we approximate the KL divergence, $D_{KL}(q_\phi(z)\,\|\,p(z))$, with the MMD distance, the regularizer in the autoencoder has the following form:
with and . We used the inverse multi-quadratics kernel , with . We chose this kernel because it is characterized by heavy tails and is therefore suitable for measuring the distance from a leptokurtic distribution, such as the logistic distribution that we used as prior $p(z)$.
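A minimal sketch of how an MMD regularizer with an inverse multi-quadratics kernel can be estimated from samples. Everything specific here is an assumption made for illustration: the kernel form $c/(c + \|a-b\|^2)$ with scale `c=1.0`, the unbiased U-statistic estimator, the sample sizes, and the stand-in "encoded" samples; none of these values are taken from the experiments.

```python
import math
import random

def imq_kernel(a, b, c=1.0):
    """Inverse multi-quadratics kernel c / (c + ||a - b||^2); c is hypothetical."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return c / (c + sq_dist)

def mmd2_unbiased(xs, ys, kernel=imq_kernel):
    """Unbiased U-statistic estimate of the squared MMD between two samples."""
    n, m = len(xs), len(ys)
    k_xx = sum(kernel(xs[i], xs[j]) for i in range(n) for j in range(n) if i != j)
    k_yy = sum(kernel(ys[i], ys[j]) for i in range(m) for j in range(m) if i != j)
    k_xy = sum(kernel(x, y) for x in xs for y in ys)
    return k_xx / (n * (n - 1)) + k_yy / (m * (m - 1)) - 2.0 * k_xy / (n * m)

def sample_logistic(rng, dim, scale=math.sqrt(3.0) / math.pi):
    """Draw a point with i.i.d. unit-variance logistic coordinates (inverse CDF)."""
    point = []
    for _ in range(dim):
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # avoid log(0)
        point.append(scale * math.log(u / (1.0 - u)))
    return point

rng = random.Random(0)
prior_sample = [sample_logistic(rng, 2) for _ in range(200)]
matched_codes = [sample_logistic(rng, 2) for _ in range(200)]  # same law as prior
shifted_codes = [[rng.gauss(2.0, 1.0) for _ in range(2)] for _ in range(200)]

mmd_matched = mmd2_unbiased(matched_codes, prior_sample)
mmd_shifted = mmd2_unbiased(shifted_codes, prior_sample)
print(mmd_matched, mmd_shifted)
```

The estimate stays close to zero when the codes follow the prior and grows when they drift away from it, which is the behaviour required of the regularizer.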
5.1 The shape of Z
The experiments in this subsection are performed with an autoencoder trained on the MNIST data-set, a collection of 70k monochromatic handwritten digits, where both the inference and generative distributions are modelled by 3-layer deep neural nets with 256 hidden units in each layer.
Figure 1 plots the 2d representations learned by the different methods, and we observe that VAE is not able to learn a hidden representation fitting the prior $p(z)$. This phenomenon is consistent with the $\beta$-VAE hypothesis and shows that the theoretical null-rate issue, $D_{KL}(q_\phi(z|x)\,\|\,p(z)) = 0$, is rare and happens only with really powerful autoregressive generative models like PixelRNN (oord2016pixel, ) and PixelCNN++ (salimans2017pixelcnn++, ). This example is particularly useful for understanding the need to penalize the capacity term, as done in $\beta$-VAE and VIMAE. These models are able to learn a representation fitting $p(z)$ fairly well.
5.2 The role of capacity
with batch normalization (ioffe2015batch, ). We consider two data-sets: MNIST, a standard benchmark with ground-truth labels, to evaluate the quality of the learned representation, and CelebA (liu2015deep, ), consisting of roughly 203k center-cropped faces, in order to compare the generative quality of the pictures. After considering many values for the hyper-parameters, we chose them, in accordance with what is suggested in (tolstikhin2017wasserstein, ), separately for the MNIST and CelebA experiments.
We define a good representation as one containing the relevant properties of the visible data. In order to evaluate such quality, following the approach proposed in (rifai2011higher, ), we evaluate the accuracy of an SVM directly trained on the learned features of the data. Proceeding as in (zhao2017infovae, ), we train the M1+TSVM (kingma2014semi, ) and use the semi-supervised performance over 1000 samples as an approximate metric to verify the relevance and the quality of the learned representation. In order to evaluate the robustness of the learned features, we ran the same procedure on the representation associated with corrupted data. In particular we consider two types of noise corruption: Gaussian and mask. In the Gaussian case, we add to each MNIST pixel a value sampled from a Gaussian distribution; in the masking case a fraction of the elements is forced to be 0 according to a Bernoulli distribution. Higher classification performance suggests that the learned representation contains the relevant information and, in the case of corrupted input data, that it is robust.
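The two corruption schemes described above can be sketched as follows. This is an illustration under assumptions: the noise is taken to be zero-mean, no clipping is applied, and the values `sigma=0.2` and `drop_fraction=0.5`, the flat all-ones image and the function names are hypothetical placeholders rather than the settings used in the experiments.

```python
import random

def add_gaussian_noise(pixels, sigma, rng):
    """Add independent zero-mean Gaussian noise to every pixel (assumed form)."""
    return [p + rng.gauss(0.0, sigma) for p in pixels]

def mask_pixels(pixels, drop_fraction, rng):
    """Set each pixel to 0 independently with probability drop_fraction."""
    return [0.0 if rng.random() < drop_fraction else p for p in pixels]

rng = random.Random(0)
image = [1.0] * 784  # hypothetical flattened 28x28 image

noisy = add_gaussian_noise(image, sigma=0.2, rng=rng)
masked = mask_pixels(image, drop_fraction=0.5, rng=rng)

zero_share = masked.count(0.0) / len(masked)
print(zero_share)  # close to the requested masking fraction
```

The corrupted inputs are then passed through the trained encoder, and the classifier is evaluated on the resulting features.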
From the classification scores listed in table 1, it is clear that VAE, unlike the other methods, is not able to learn a relevant representation and is not robust to noise. Particularly relevant are the last two rows, where the two VIMAEs are compared: VIMAE-n and VIMAE-l, with normal and logistic priors, respectively. The normal distribution, having larger entropy than the logistic distribution, is able to store more information, and the associated representation has the best classification score with clean data; but such information is not completely relevant, and the representation is not as robust as the one learned by VIMAE-l. This phenomenon is particularly clear in the highly corrupted data case, where it is necessary to identify and extract the relevant features of the visible data.
Generation and reconstruction
The models that we are considering are defined as generative models: given a sample $z \sim p(z)$, they should be able to generate a new datum similar to the original ones. Figure 2 shows the reconstructions and the generated samples obtained from the different models, and we observe that, although all the models are able to reconstruct (all of them are autoencoders), VAE and $\beta$-VAE do not generate good samples. Such behaviour is not surprising, and it is correlated with the inability to learn a posterior fitting $p(z)$; indeed, in the generative case the representation is sampled from $p(z)$ and not from $q_\phi(z|x)$ as in the reconstruction case.
In order to emphasize the ability of VIMAE to learn the relevant properties, in the second row of figure 2 we plot the reconstructions obtained when the model is fed with fifty percent masked digits. Such an experiment is particularly useful to see that VIMAE does not reconstruct the corrupted data, but the associated clean one. Indeed, the VIMAE reconstruction is the one minimizing the transport between the (corrupted) input and a (clean) element of the training set.
The experiments with MNIST showed that the VIMAE models outperform the others. For this reason we decided to compare the two variants of VIMAE when trained on a more challenging data-set: CelebA (liu2015deep, ). From figure 3 we observe that the differences between the two models, VIMAE-n and VIMAE-l, in both reconstruction and generation are minimal. The small difference in generative performance is also confirmed by the Fréchet Inception Distance (heusel2017gans, ) in table 2, where, in accordance with what was seen in table 1, we notice that the Gaussian prior variant, VIMAE-n, containing more information in the representation, has slightly better generative performance than the logistic prior variant, VIMAE-l.
To address the uninformativeness of the representation learned by VAE, we proposed a variational method that learns a generative model by maximizing the mutual information between the visible and hidden representations. In particular, the method maximizes a capacity-constrained InfoMax, where the constraint is given by the choice of the prior distribution $p(z)$.
We described the role of information capacity in the major variational autoencoder models and deduced that, by reducing the capacity, a network tends to learn more robust and relevant features. The deduction was confirmed by computational experiments.
Future work will include the generalization of the capacity-constrained InfoMax to autoregressive models and to a supervised setting.
- (1) D. Barber and F. Agakov. The im algorithm: a variational approach to information maximization. Advances in Neural Information Processing Systems, 16:201, 2004.
- (2) A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410, 2016.
- (3) A. A. Alemi, B. Poole, I. Fischer, J. V. Dillon, R. A. Saurous, and K. Murphy. Fixing a broken elbo. arXiv preprint arXiv:1711.00464, 2017.
- (4) P. Baldi and K. Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural networks, 2(1):53–58, 1989.
- (5) A. J. Bell and T. J. Sejnowski. The “independent components” of natural scenes are edge filters. Vision research, 37(23):3327–3338, 1997.
- (6) Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
- (7) X. Chen, D. P. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, and P. Abbeel. Variational lossy autoencoder. arXiv preprint arXiv:1611.02731, 2016.
- (8) L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using real nvp. arXiv preprint arXiv:1605.08803, 2016.
- (9) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
- (10) M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, G. Klambauer, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a nash equilibrium. arXiv preprint arXiv:1706.08500, 12(1), 2017.
- (11) I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, volume 3, 2017.
- (12) G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006.
- (13) A. Hyvärinen, J. Hurri, and P. O. Hoyer. Natural image statistics: A probabilistic approach to early computational vision, volume 39. Springer Science & Business Media, 2009.
- (14) S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
- (15) D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling. Semi-supervised learning with deep generative models. In Advances in neural information processing systems, pages 3581–3589, 2014.
- (16) D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
- (17) R. Linsker. An application of the principle of maximum information preservation to linear systems. In Advances in neural information processing systems, pages 186–194, 1989.
- (18) Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pages 3730–3738, 2015.
- (19) C. J. Maddison, J. Lawson, G. Tucker, N. Heess, M. Norouzi, A. Mnih, A. Doucet, and Y. Teh. Filtering variational objectives. In Advances in Neural Information Processing Systems, pages 6573–6583, 2017.
- (20) A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
- (21) A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.
- (22) A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
- (23) D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
- (24) S. Rifai, G. Mesnil, P. Vincent, X. Muller, Y. Bengio, Y. Dauphin, and X. Glorot. Higher order contractive auto-encoder. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 645–660. Springer, 2011.
- (25) T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma. Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017.
- (26) A. M. Saxe, Y. Bansal, J. Dapello, M. Advani, A. Kolchinsky, B. D. Tracey, and D. D. Cox. On the information bottleneck theory of deep learning. In International Conference on Learning Representations, 2018.
- (27) N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.
- (28) I. Tolstikhin, O. Bousquet, S. Gelly, and B. Schoelkopf. Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558, 2017.
- (29) P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pages 1096–1103. ACM, 2008.
- (30) S. Zhao, J. Song, and S. Ermon. Infovae: Information maximizing variational autoencoders. arXiv preprint arXiv:1706.02262, 2017.