Finding a proper data representation can be a crucial part of a machine learning approach. In many cases, when a property has to be inferred from a sample, finding such a representation is the main purpose of the method. For instance, a classification task aims to discover a data representation that has a useful high-level interpretation for a human, such as a class. Unsupervised learning aims to find patterns in unlabeled data that help to describe it or to perform a relevant task. Recent deep neural network approaches tackle this problem from the perspective of representation learning, where the goal is to learn a representation that captures semantic properties of the data. If the learned representations of salient properties are interpretable and disentangled, they improve generalization and make downstream tasks easier and more robust (Lake et al., 2016).
Over the last decade, generative models have become popular in unsupervised learning research. The intuition is that generative modeling may make it possible to discover latent representations and their relation to observations. The two most popular training frameworks for such models are Generative Adversarial Networks (GAN) (Goodfellow et al., 2014) and Variational Autoencoders (VAE) (Kingma and Welling, 2013; Rezende et al., 2014). The latter is a powerful method for unsupervised learning of directed probabilistic latent variable models. Training a model within the VAE framework allows performing both inference and generation. The model is trained by maximizing the evidence lower bound (ELBO), which is a clear objective and results in stable training. However, the latent variable in the ELBO is marginalized, so the ELBO does not assess the model's ability to do inference or the quality of the latent code (Huszár, 2017; Alemi et al., 2018). Therefore, a high ELBO does not necessarily mean that useful latent representations were learned. Moreover, a sufficiently powerful decoder can ignore the conditioning on the latent code (Bowman et al., 2016; Chen et al., 2016b).
The key idea of our approach is to maximize the mutual information (MI) between samples from the posterior distribution (represented by the encoder) and observations. Unfortunately, computing MI exactly is hard and may be intractable. To overcome this, our framework employs Variational Information Maximization (Barber and Agakov, 2003) to obtain a lower bound on the true MI. This technique relies on approximation by an auxiliary distribution, which we represent by an additional inference network. The obtained lower bound on MI is used as a regularizer added to the original VAE objective to force the latent representations to have a strong relationship with observations and to prevent the model from ignoring them.
We conducted experiments on VAE models trained on MNIST and FashionMNIST with two latent distributions: Gaussian, and joint Gaussian and discrete. We compare, qualitatively and quantitatively, models trained using the pure ELBO objective with models trained with the introduced MI regularizer.
2 Related Work
A number of works propose approaches to improve the latent representations learned by VAEs. In (Bowman et al., 2016), the authors vary the weight of the KL divergence component of the objective function during training by gradually increasing it from 0 to 1. $\beta$-VAE (Higgins et al., 2017) employs a weighting coefficient $\beta$ that scales the KL divergence term. It balances latent channel capacity and independence constraints against reconstruction accuracy to improve the disentanglement of representations (Burgess et al., 2018). Alemi et al. (2018) introduced an information-theoretic framework to characterize the tradeoff between compression and reconstruction accuracy. The authors use bounds on mutual information to derive a rate-distortion curve. On this curve, different points represent a family of models with the same ELBO but different characteristics. The authors also state that the proposed framework generalizes the aforementioned $\beta$-VAE in the sense that the coefficient $\beta$ controls the MI between observations and latent variables. InfoVAE (Zhao et al., 2017) modifies the objective to weight the preference between correct inference and fitting the data distribution, and to specify how much the model should rely on the latent variables. In (Chen et al., 2018), the authors decompose the ELBO to determine the source of disentanglement in $\beta$-VAE. Additionally, that work introduces $\beta$-TCVAE, which is able to discover more interpretable representations. The authors of InfoGAN (Chen et al., 2016a) address the problem of entangled latent representations in GANs by maximizing the mutual information between a part of the latent code and the samples produced by the generator. For that purpose, they also employ Variational Information Maximization (Barber and Agakov, 2003).
3 Information Maximization for VAE
3.1 Variational Autoencoder
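The VAE (Kingma and Welling, 2013; Rezende et al., 2014) is trained by maximizing the ELBO
$$\mathcal{L}_{\mathrm{ELBO}}(\theta, \phi) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) \,\|\, p(z)) \le \log p_\theta(x), \tag{1}$$
where $q_\phi(z|x)$ is an approximate posterior distribution represented by an encoder neural network with parameters $\phi$, $p_\theta(x|z)$ is a decoder network parametrized by $\theta$, and $p(z)$ is the prior. By passing samples from the prior distribution $p(z)$ to the decoder, it is possible to generate new data samples.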
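As a minimal sketch of how a single-sample ELBO estimate is commonly computed, the following assumes a diagonal Gaussian posterior with a standard normal prior and a Bernoulli decoder over binary pixels. The Bernoulli likelihood and all function names are assumptions made for illustration; the text above does not specify them.

```python
import math


def gaussian_kl(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ) in nats."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


def bernoulli_log_likelihood(x, probs, eps=1e-7):
    """log p(x|z) for pixels x in [0, 1] given decoder output probabilities."""
    total = 0.0
    for xi, pi in zip(x, probs):
        pi = min(max(pi, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += xi * math.log(pi) + (1.0 - xi) * math.log(1.0 - pi)
    return total


def elbo(x, probs, mu, log_var):
    """Single-sample ELBO estimate: reconstruction term minus KL term."""
    return bernoulli_log_likelihood(x, probs) - gaussian_kl(mu, log_var)
```

Here `probs` stands for the decoder output for one latent sample drawn from the encoder, and `mu`, `log_var` are the encoder outputs for the same input.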
3.2 Variational Mutual Information
Mutual information (MI) between samples of the posterior distribution $z \sim q_\phi(z|x)$ and observations $x$ is formally defined as
$$I(z; x) = H(z) - H(z|x), \tag{2}$$
where $H(\cdot)$ denotes the entropy of the corresponding variables.
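$I(z; x)$ is intractable, since computing it requires the intractable true posterior $p(z|x)$. Following the reasoning in (Chen et al., 2016a), we obtain a lower bound on the MI between observations and latent variables:
$$\begin{aligned} I(z; x) &= H(z) - H(z|x) \\ &= \mathbb{E}_{x \sim p_\theta(x|z)}\big[\mathbb{E}_{z' \sim p(z|x)}[\log p(z'|x)]\big] + H(z) \\ &= \mathbb{E}_{x \sim p_\theta(x|z)}\big[D_{KL}(p(\cdot|x) \,\|\, Q(\cdot|x)) + \mathbb{E}_{z' \sim p(z|x)}[\log Q(z'|x)]\big] + H(z) \\ &\ge \mathbb{E}_{x \sim p_\theta(x|z)}\big[\mathbb{E}_{z' \sim p(z|x)}[\log Q(z'|x)]\big] + H(z), \end{aligned} \tag{3}$$
where $Q(z|x)$ is an auxiliary distribution. We represent it by a neural network that takes the decoder output as input. We treat $H(z)$ as a constant for simplicity.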
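The problem with the obtained lower bound is that it requires sampling from the posterior $p(z|x)$ in the inner expectation. This can be overcome by applying the following lemma used in InfoGAN (Chen et al., 2016a). The proof can be found in Appendix A.
Lemma 3.1. For random variables $X$, $Y$ and a function $f(x, y)$ under suitable regularity conditions:
$$\mathbb{E}_{x \sim X,\, y \sim Y|x}[f(x, y)] = \mathbb{E}_{x \sim X,\, y \sim Y|x,\, x' \sim X|y}[f(x', y)]. \tag{4}$$
With this we can re-define the variational lower bound on MI:
$$I(z; x) \ge \mathbb{E}_{z \sim q_\phi(z|x),\, x' \sim p_\theta(x|z)}[\log Q(z|x')] + H(z). \tag{5}$$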
With this lower bound, it is now possible to evaluate the mutual information between observations and latent variables for a fixed VAE, and to increase it by maximizing the bound.
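Finally, we define the mutual information maximization regularizer for the variational autoencoder as
$$\mathrm{MI}(\theta, \phi, Q) = \mathbb{E}_{z \sim q_\phi(z|x),\, x' \sim p_\theta(x|z)}[\log Q(z|x')] + H(z), \tag{6}$$
which can be estimated using Monte Carlo sampling.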
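The Monte Carlo estimate mentioned above can be sketched as follows: average the log-probability that the auxiliary network assigns to the codes that actually generated each decoded sample, then add the entropy of the code distribution. The function names and the way the log-probabilities are supplied are hypothetical; in practice they would come from the auxiliary network evaluated on decoder outputs.

```python
import math


def categorical_entropy(probs):
    """Entropy H(c) of a categorical distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mi_lower_bound(log_q_samples, code_entropy):
    """Monte Carlo estimate of E[log Q(code | decoded sample)] + H(code).

    log_q_samples: one value log Q(z_k | x'_k) per sampled pair, where x'_k
    was decoded from the code z_k.
    """
    return sum(log_q_samples) / len(log_q_samples) + code_entropy
```

For a uniform 10-category code, an auxiliary network that always outputs the uniform distribution yields an estimate of 0, while one that always recovers the code yields $\ln 10$.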
3.3 Resulting Framework
In our Variational Mutual Information Maximization Framework, we combine the ELBO with the proposed regularizer to form the objective
$$\mathcal{L}(\theta, \phi, Q) = \mathcal{L}_{\mathrm{ELBO}}(\theta, \phi) + \lambda\, \mathrm{MI}(\theta, \phi, Q), \tag{7}$$
where $\lambda$ is a scaling coefficient that controls the impact of $\mathrm{MI}(\theta, \phi, Q)$ on VAE training. For each training batch, we first maximize the objective with respect to the auxiliary distribution $Q$ to tighten the lower bound on mutual information. Then, we maximize it with respect to the parameters of the VAE ($\theta$ and $\phi$), so that the regularizer increases the MI between latent codes $z$ and observations $x$. See Fig. 1 for a visualization of the proposed model.
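The two-phase update per batch described above can be sketched as follows. This is only an illustration of the update order: the gradient callables, the plain gradient-ascent rule, and the learning rate are hypothetical placeholders, not the optimizer or code used in the experiments.

```python
def ascent_step(params, grads, lr):
    """One plain gradient-ascent step on a flat list of parameters."""
    return [p + lr * g for p, g in zip(params, grads)]


def train_on_batch(batch, vae_params, q_params,
                   grad_bound_wrt_q, grad_objective_wrt_vae, lr=1e-3):
    """Alternating update for one batch.

    grad_bound_wrt_q(batch, vae_params, q_params): gradient of the MI lower
        bound with respect to the auxiliary network parameters.
    grad_objective_wrt_vae(batch, vae_params, q_params): gradient of
        ELBO + lambda * MI with respect to the encoder/decoder parameters.
    """
    # Phase 1: tighten the lower bound by updating the auxiliary network only.
    q_params = ascent_step(
        q_params, grad_bound_wrt_q(batch, vae_params, q_params), lr)
    # Phase 2: update the VAE using the freshly updated auxiliary network.
    vae_params = ascent_step(
        vae_params, grad_objective_wrt_vae(batch, vae_params, q_params), lr)
    return vae_params, q_params
```

The point of the ordering is that the VAE update in phase 2 sees the auxiliary network after its own update, so the regularizer it optimizes is the tighter bound.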
4 Experimental Setup
4.1 VAE with Gaussian latent
In this setting, we employ a VAE with a 32-dimensional Gaussian latent variable $z$ with prior $p(z) = \mathcal{N}(0, I)$. We train and compare two identically initialized networks with the same hyperparameters on MNIST. One is trained using only the ELBO objective (1) and the other with mutual information maximization (7). For the latter case, we select only two components of the latent code vector, forming the sub-vector $\hat{z} = (z_i, z_j)$, for mutual information maximization and define the regularizer as
$$\mathrm{MI}(\theta, \phi, Q) = \mathbb{E}_{z \sim q_\phi(z|x),\, x' \sim p_\theta(x|z)}[\log Q(\hat{z}|x')] + H(\hat{z}). \tag{8}$$
We select only two components of the latent code because their impact on observations is straightforward to illustrate in 2D visualizations: it is enough to manipulate their individual values, without any latent space interpolation.
4.2 VAE with joint Gaussian and discrete latent
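In this section, we define the setting for a VAE model with a joint latent distribution of continuous and discrete (categorical) variables. We define $z$ as the 16-dimensional Gaussian part of the latent code with prior $p(z) = \mathcal{N}(0, I)$ and $c$ as the discrete part with 10 categories and a uniform prior $p(c)$. In this setting, the encoder network represents the joint posterior approximation $q_\phi(z, c|x)$, the decoder network is $p_\theta(x|z, c)$ and the prior is $p(z, c)$. Then, the resulting ELBO objective has the form
$$\mathcal{L}_{\mathrm{ELBO}}(\theta, \phi) = \mathbb{E}_{q_\phi(z, c|x)}[\log p_\theta(x|z, c)] - D_{KL}(q_\phi(z, c|x) \,\|\, p(z, c)). \tag{9}$$
By the assumption that $z$ and $c$ are mutually and conditionally independent, we can decompose the KL divergence term as
$$D_{KL}(q_\phi(z, c|x) \,\|\, p(z, c)) = D_{KL}(q_\phi(z|x) \,\|\, p(z)) + D_{KL}(q_\phi(c|x) \,\|\, p(c)). \tag{10}$$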
For the categorical latent variable, we employ the continuous differentiable relaxation technique proposed by (Jang et al., 2016; Maddison et al., 2016). The categorical variable $c$ in our setting has 10 categories; let $\pi_1, \dots, \pi_{10}$ be the respective probabilities that define this distribution. We represent categorical samples as 10-dimensional one-hot vectors. Also, let $g_1, \dots, g_{10}$ be i.i.d. Gumbel(0, 1) samples. Then, using the softmax function, we can draw samples from a relaxed version of this categorical distribution, with each component of the 10-dimensional sample vector $y$ defined as
$$y_i = \frac{\exp((\log \pi_i + g_i)/\tau)}{\sum_{j=1}^{10} \exp((\log \pi_j + g_j)/\tau)} \tag{11}$$
for $i = 1, \dots, 10$, where $\tau$ is the temperature hyperparameter. For this VAE trained with mutual information maximization, we maximize the MI between the observation $x$ and the categorical latent variable $c$. In that case, we define our mutual information maximization regularizer term as
$$\mathrm{MI}(\theta, \phi, Q) = \mathbb{E}_{(z, c) \sim q_\phi(z, c|x),\, x' \sim p_\theta(x|z, c)}[\log Q(c|x')] + H(c). \tag{12}$$
In this setting, we train and compare two identically initialized VAE models with the same hyperparameters on MNIST and FashionMNIST. One is trained using only the ELBO objective (eq. 9) and the other with the regularizer added following eq. 7.
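The Gumbel-softmax relaxation used for the categorical latent in this section can be sketched with the standard library alone. The function names, the clamping constant, and the example temperature are assumptions for illustration; the category probabilities are assumed to be strictly positive.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) sample via the inverse CDF."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
    return -math.log(-math.log(u))


def gumbel_softmax_sample(probs, temperature, rng):
    """Relaxed one-hot sample: softmax((log pi_i + g_i) / temperature)."""
    logits = [(math.log(p) + sample_gumbel(rng)) / temperature for p in probs]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
y = gumbel_softmax_sample([0.1] * 10, temperature=0.5, rng=rng)
```

Lower temperatures push `y` toward a one-hot vector, while higher temperatures make it closer to uniform.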
5 Experimental Results
5.1 VAE with Gaussian latent
As mentioned before, we trained two identically initialized VAE models: one using the ELBO objective and one with the added regularizer for the sub-part $\hat{z} = (z_i, z_j)$ of the latent code. In Fig. 2 we provide a qualitative comparison of the impact of these two components of the code on produced samples. For each latent code, we vary $z_i$ and $z_j$ from -3 to 3 while keeping the remaining part fixed, and decode the result. As Fig. 2 (a) shows, in the vanilla VAE $\hat{z}$ does not have much impact on the output samples. In contrast, in the VAE trained with the MI regularizer, $\hat{z}$ has a significant impact on the output samples. For this model, the outputs morph between three digit types as the code changes. Moreover, particular combinations of these two components of the 32-dimensional code morph the original sample into the digits 1 and 6 regardless of the original sample type. This indicates that the regularizer indeed forces this part of the learned latent codes to have high MI and a strong relationship with observations.
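The traversal described above, varying two components of a code from -3 to 3 while keeping the rest fixed, can be sketched as below. The function names, the component indices, and the 7 steps per axis are hypothetical choices for illustration; each resulting code would then be passed through the decoder.

```python
def linspace(lo, hi, n):
    """n evenly spaced values from lo to hi inclusive (n >= 2)."""
    step = (hi - lo) / (n - 1)
    return [lo + k * step for k in range(n)]


def traversal_grid(base_code, i, j, lo=-3.0, hi=3.0, steps=7):
    """Grid of latent codes where components i and j sweep [lo, hi].

    All other components are copied unchanged from base_code.
    """
    values = linspace(lo, hi, steps)
    grid = []
    for vi in values:
        row = []
        for vj in values:
            code = list(base_code)  # copy so the base code is not modified
            code[i] = vi
            code[j] = vj
            row.append(code)
        grid.append(row)
    return grid


grid = traversal_grid([0.5] * 32, i=0, j=1)
```

`grid[r][c]` is a full 32-dimensional code, so the decoded images can be laid out directly as a 2D figure.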
We also compare the resulting models for different values of the regularizer's scaling coefficient $\lambda$ in Fig. 3. With a low value of $\lambda$, the impact of the selected sub-vector $\hat{z}$ is the same as in the vanilla VAE. However, as $\lambda$ increases, the impact of $\hat{z}$ on observations also increases.
5.2 VAE with joint Gaussian and discrete latent
In this section, we compare two identically initialized VAE models with joint Gaussian and discrete latent variables that were trained in different ways. One model is trained using only the ELBO of the form given by eq. 9. The second one is trained with the added regularizer (eq. 12) for MI maximization between data samples and the categorical part of the learned latent code.
As Fig. 4 (a) shows, in the VAE trained using the pure ELBO, the categorical part of the latent code has no influence on produced samples. Thus, even though our strong prior assumption that the data have 10 categories was incorporated into the latent variable, the trained model ignores it and does not assign any interpretable representation to the categorical variable.
In contrast, for the VAE trained with MI maximization between observations and the categorical code, produced samples respond very differently to changes in the latent categorical variable. As the categorical part of the latent code varies between the 10 categories, the samples change in a class-wise manner. For most of the samples, a particular value of the categorical variable changes them to the same digit type while preserving other features of the original sample, such as thickness and angle. We interpret this as the model generalizing and disentangling digit type from style, represented by the categorical and Gaussian parts of the latent code respectively.
For a quantitative comparison, we applied the categorical component of the encoder as a classifier on the MNIST classification task. The VAE trained with the ELBO objective alone reaches 21% classification accuracy on MNIST, while the VAE with the MI regularizer achieves 74% accuracy.
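One common way to score an unsupervised categorical code as a classifier is to map each category to the label that occurs most often among the samples assigned to it, and then measure accuracy under that mapping. The text does not state how categories were matched to digit labels, so the majority-vote mapping and the function name below are assumptions for illustration only.

```python
from collections import Counter, defaultdict


def majority_vote_accuracy(categories, labels):
    """Accuracy after mapping each category to its most frequent true label.

    categories: predicted category index per sample (argmax of the encoder's
        categorical output).
    labels: ground-truth class per sample.
    """
    votes = defaultdict(Counter)
    for c, y in zip(categories, labels):
        votes[c][y] += 1
    mapping = {c: counter.most_common(1)[0][0] for c, counter in votes.items()}
    correct = sum(1 for c, y in zip(categories, labels) if mapping[c] == y)
    return correct / len(labels), mapping


acc, mapping = majority_vote_accuracy([0, 0, 1, 1, 1], [7, 7, 3, 3, 7])
```

In this toy call, category 0 maps to label 7 and category 1 to label 3, so four of the five samples are counted as correct.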
In Fig. 6 we compare histograms of the categorical latent variable probabilities collected during the training of both models. For the vanilla VAE, the probabilities are mostly concentrated around 0.1 and do not reach the area around 1. For the VAE with maximized MI for the categorical variable, the probability values are concentrated around 0.5 and 0.9. This behavior is expected: when the category probabilities are uniform regardless of the input, the variable carries no information for the decoder network, and the resulting model ignores it. We therefore interpret this observation as evidence that the VAE with maximized MI between data samples and the categorical part of the latent variable indeed makes more use of this variable.
As mentioned before, our Variational Mutual Information Maximization Framework can be used to evaluate MI between latent variables and observations for a fixed VAE by obtaining the lower bound on MI (eq. 5). In Fig. 6, we provide plots of the lower bound MI estimate between observations and the categorical part of the latent codes during the training of two models. One was trained using only the ELBO, without MI maximization, and the other with MI maximization. The VAE model with MI maximization has a higher MI lower bound estimate during training than the one trained without the regularizer.
We also provide plots of the KL divergence estimate for the categorical distribution in Figure 8 for models with and without MI maximization. In theory, the KL divergence estimate is an upper bound on the MI between latent variables and observations, and our experimental results are consistent with this. Moreover, the upper bound for the model with maximized MI approaches the maximum possible value of MI for the categorical variable, which is modeled as a discrete distribution with 10 values and a uniform prior. In this case, the maximum possible MI is the entropy of the 10-category uniform distribution, $\ln 10 \approx 2.3$ nats. As the plot shows, the KL divergence estimate is close to this value for the model with MI maximization for the categorical latent variable.
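The value of roughly 2.3 quoted above can be checked with a short sketch that computes the KL divergence of a categorical posterior from a uniform prior. The function name is hypothetical; the two example posteriors are the extreme cases of a fully uniform and a fully confident encoder output.

```python
import math


def categorical_kl_to_uniform(probs):
    """KL( Cat(probs) || Uniform(K) ) in nats, with K = len(probs)."""
    k = len(probs)
    return sum(p * math.log(p * k) for p in probs if p > 0.0)


uniform_posterior = [0.1] * 10           # encoder ignores the input
one_hot_posterior = [1.0] + [0.0] * 9    # encoder is fully confident

kl_ignored = categorical_kl_to_uniform(uniform_posterior)
kl_confident = categorical_kl_to_uniform(one_hot_posterior)
max_mi = math.log(10)  # entropy of a uniform 10-category variable
```

A posterior that matches the prior gives a KL of 0, while a one-hot posterior gives exactly $\ln 10$, the ceiling that the per-sample categorical KL term can reach.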
The network that represents the auxiliary distribution $Q$ can be seen as a classifier network or a feature extractor network. When we maximize the regularizer (eq. 6) with respect to the parameters of this network, maximizing the first term is the same as minimizing the negative log-likelihood, as is done when training classification models. Thus, we can interpret the whole training procedure as training the $Q$ network to correctly classify a sample in terms of its original generative factors, or to better extract them from the sample. Then, when maximizing with respect to the VAE, we force the model to make these features more extractable and classifiable from the produced sample.
7 Discussion and Conclusion
In this work, we have presented a method for evaluation, control, and maximization of mutual information between latent variables and observations in VAEs. In comparison to other related works, it provides an explicit and tractable objective. Using the proposed technique, it is possible to compare MI across different fixed VAE networks. Moreover, our experimental results illustrate that the Variational Mutual Information Maximization Framework can indeed strengthen the relationship between latent variables and observations. It also improves the learned representations (section 5.2). However, it comes with an increase in computational and memory cost, since the mutual information lower bound estimate requires an auxiliary distribution, which we represent by an additional encoder neural network that has to be trained.
We believe that our work, with further analysis and improvements, has the potential to fill the gaps between previous theoretical insights for VAEs from an information theory perspective and empirical results. The KL divergence term in a VAE, by the analysis in (Makhzani and Frey, 2017; Kim and Mnih, 2018), is an upper bound on the true mutual information between latent codes and observations. Our empirical results are consistent with this insight: the KL divergence estimate for the categorical latent at the end of training is lower for the VAE without MI maximization (0.89) than for the VAE with MI maximization (2.22). See Fig. 5 for the KL divergence estimates collected during training for both models.
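For reference, the upper-bound relation cited above follows from a standard decomposition of the KL term averaged over the data. The symbols $p_D(x)$ for the data distribution, $q_\phi(z|x)$ for the encoder, $p(z)$ for the prior, $q_\phi(z)$ for the aggregate posterior, and $I_q$ for the MI under the joint $p_D(x)\,q_\phi(z|x)$ are notation assumed here for illustration:

```latex
\begin{aligned}
\mathbb{E}_{p_D(x)}\big[ D_{KL}(q_\phi(z|x) \,\|\, p(z)) \big]
  &= I_q(x; z) + D_{KL}(q_\phi(z) \,\|\, p(z)) \\
  &\ge I_q(x; z),
  \qquad q_\phi(z) = \mathbb{E}_{p_D(x)}\big[ q_\phi(z|x) \big].
\end{aligned}
```

Because the second KL term is non-negative, the averaged KL term can only overestimate the MI, and the gap closes when the aggregate posterior matches the prior.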
On top of that, in (Alemi et al., 2018) the authors state that in $\beta$-VAE, varying the coefficient that scales the KL divergence makes it possible to vary the MI between observations and latent variables, $I(z; x)$. In (Burgess et al., 2018), along with the KL scaling coefficient, the authors propose a constant that the KL divergence estimate should match, which explicitly controls the upper bound on mutual information.
In (Dupont, 2018), the authors tackle the problem of training a VAE model with a joint Gaussian and categorical latent that is similar to the one we define in section 4.2. They report that the model ignores the categorical variable even when trained in the $\beta$-VAE setting. Thus, they follow the weighting and constraining technique of (Burgess et al., 2018) for each KL divergence term to force the VAE not to ignore this part of the latent code. In our work, we tackle the same problem from a different perspective. They increase and control the upper bound on MI. In contrast, we maximize a lower bound on MI. Which approach is more suitable for a particular problem is an open question and a possible future research direction. However, our explicit formulation of MI maximization, which leads to similar results, connects to the theoretical insight that the KL divergence controls the MI between observations and latent codes.
References
- Alemi et al. (2018). Fixing a broken ELBO. In International Conference on Machine Learning, pp. 159–168.
- Barber and Agakov (2003). The IM algorithm: a variational approach to information maximization. In NIPS, pp. 201–208.
- Bowman et al. (2016). Generating sentences from a continuous space. CoNLL 2016, pp. 10.
- Burgess et al. (2018). Understanding disentangling in beta-VAE. arXiv preprint arXiv:1804.03599.
- Chen et al. (2018). Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pp. 2610–2620.
- Chen et al. (2016a). InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2172–2180.
- Chen et al. (2016b). Variational lossy autoencoder. arXiv preprint arXiv:1611.02731.
- Dupont (2018). Learning disentangled joint continuous and discrete representations. In Advances in Neural Information Processing Systems, pp. 710–720.
- Ford and Oliver (2018). Correcting a proof in the InfoGAN paper.
- Goodfellow et al. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
- Higgins et al. (2017). Beta-VAE: learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, Vol. 3.
- Huszár (2017). Is maximum likelihood useful for representation learning.
- Jang et al. (2016). Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144.
- Kim and Mnih (2018). Disentangling by factorising. In International Conference on Machine Learning, pp. 2654–2663.
- Kingma and Welling (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
- Lake et al. (2016). Building machines that learn and think like people. arXiv preprint arXiv:1604.00289.
- Maddison et al. (2016). The concrete distribution: a continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712.
- Makhzani and Frey (2017). PixelGAN autoencoders. In Advances in Neural Information Processing Systems, pp. 1975–1985.
- Rezende et al. (2014). Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pp. 1278–1286.
- Zhao et al. (2017). InfoVAE: information maximizing variational autoencoders. arXiv preprint arXiv:1706.02262.
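Appendix A Proof of Lemma 3.1
Lemma 3.1. For random variables $X$, $Y$ and a function $f(x, y)$ under suitable regularity conditions:
$$\mathbb{E}_{x \sim X,\, y \sim Y|x}[f(x, y)] = \mathbb{E}_{x \sim X,\, y \sim Y|x,\, x' \sim X|y}[f(x', y)].$$
This proof was originally introduced in (Ford and Oliver, 2018).
Make expectations explicit:
$$\mathbb{E}_{x \sim X,\, y \sim Y|x}[f(x, y)] = \mathbb{E}_{x \sim P(X)}\, \mathbb{E}_{y \sim P(Y|X=x)}[f(x, y)]$$
By definition of $P(Y|X=x)$ and $P(X|Y=y)$:
$$= \mathbb{E}_{y \sim P(Y)}\, \mathbb{E}_{x \sim P(X|Y=y)}[f(x, y)]$$
Rename $x$ to $x'$:
$$= \mathbb{E}_{y \sim P(Y)}\, \mathbb{E}_{x' \sim P(X|Y=y)}[f(x', y)]$$
By the law of total expectation:
$$= \mathbb{E}_{x \sim P(X)}\, \mathbb{E}_{y \sim P(Y|X=x)}\, \mathbb{E}_{x' \sim P(X|Y=y)}[f(x', y)]$$
Make expectations implicit:
$$= \mathbb{E}_{x \sim X,\, y \sim Y|x,\, x' \sim X|y}[f(x', y)]$$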
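The lemma proved in this appendix can be checked numerically on a small discrete joint distribution. The sketch below computes both sides by explicit summation; the function name, the example joint table, and the test function `f` are hypothetical, and all marginals are assumed to be strictly positive.

```python
def lemma_sides(joint, f):
    """Return (lhs, rhs) of the lemma for a discrete joint distribution.

    joint[x][y] = P(X = x, Y = y).
    lhs = E_{x ~ X, y ~ Y|x}[f(x, y)]
    rhs = E_{x ~ X, y ~ Y|x, x' ~ X|y}[f(x', y)]
    """
    xs = range(len(joint))
    ys = range(len(joint[0]))
    p_x = [sum(joint[x][y] for y in ys) for x in xs]
    p_y = [sum(joint[x][y] for x in xs) for y in ys]

    lhs = sum(joint[x][y] * f(x, y) for x in xs for y in ys)

    rhs = 0.0
    for x in xs:
        for y in ys:
            p_y_given_x = joint[x][y] / p_x[x]
            # Inner expectation over x' ~ P(X | Y = y).
            inner = sum((joint[x2][y] / p_y[y]) * f(x2, y) for x2 in xs)
            rhs += p_x[x] * p_y_given_x * inner
    return lhs, rhs


lhs, rhs = lemma_sides([[0.1, 0.2], [0.3, 0.4]],
                       lambda x, y: (x + 1) * (y + 2))
```

Both sides agree up to floating-point error, which matches the statement that resampling $x'$ from the posterior given $y$ does not change the expectation.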
Appendix B Histograms of encoded digits into particular one-hot categorical vector
We counted how many digits of each class from the MNIST dataset are encoded into each one-hot vector (categorical variable) for VAE models trained with and without MI maximization. We present these results in Figures 9 and 10. For the VAE with MI maximization, the digit images from particular classes align well with particular one-hot vectors. In contrast, for the VAE without MI maximization, the digit images of each class are distributed uniformly across all one-hot vectors.
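The counts behind such histograms can be collected with a simple contingency table, sketched below. The function name and argument layout are hypothetical; `categories` is assumed to hold the index of the one-hot vector each image was encoded into, and `labels` the true digit class.

```python
def count_digits_per_category(categories, labels,
                              num_categories=10, num_classes=10):
    """table[c][y] = number of images of digit y encoded into category c."""
    table = [[0] * num_classes for _ in range(num_categories)]
    for c, y in zip(categories, labels):
        table[c][y] += 1
    return table


table = count_digits_per_category([0, 0, 1, 2, 2, 2], [3, 3, 8, 5, 5, 1])
```

Each row of `table` corresponds to one histogram: a row dominated by a single entry indicates that the category is aligned with one digit class, while a flat row indicates that the category carries no class information.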