1 Introduction
The problem of approximating conditional probability distributions is central to supervised learning. Although learning a complex many-to-one mapping is straightforward if a sufficient amount of data is available [7, 13], most methods fail when it comes to structured-prediction problems, where a distribution with multiple modes (one-to-many mapping) has to be modelled [16].
Conditional variational autoencoders (CVAEs) [14] are a class of latent-variable models for approximating one-to-many functions. They define a lower bound on the intractable marginal likelihood by introducing a variational posterior distribution. The learned generative model and the corresponding (approximate) posterior distribution of the latent variables provide a decoder/encoder pair that captures semantically meaningful features of the data. In this paper we address the issue of learning informative encodings/latent representations with the goal of increasing the generalisation capacity of CVAEs.
In contrast to variational autoencoders (VAEs) [6, 12], the decoder of CVAEs is a function of both the latent variable $z$ and the condition $x$. Because the condition is directly available to the decoder, the model is not incentivised to learn an informative latent representation. To tackle this problem, we propose to apply a VAE-like decoder that depends only on the latent variable. This modification requires that the model is capable of learning a rich encoding. We follow the line of argument in [4], where the expressiveness of the generative model is increased by introducing a flexible prior, and show that a multimodal prior substantially improves optimisation.
Building on that, we propose to apply a learnable mixture distribution as prior. We show that the classical mixture-of-Gaussians prior suffers from focusing on outliers during optimisation, which results in a badly trained generative model. Instead of learning the means and variances of the respective mixture components directly, we address this issue by introducing a Gaussian mixture prior, inspired by [17], that is parameterised through both the encoder and the decoder and evaluated at learned pseudo latent variables.
2 Methods
2.1 Preliminaries: Conditional VAEs
In structured-prediction problems, each condition $x$ can be related to several targets $y$ (one-to-many mapping), which results in a multimodal conditional distribution $p(y \mid x)$. Conditional latent-variable models (CLVMs), defined by
$$p_\theta(y \mid x) = \int p_\theta(y \mid x, z)\, p(z \mid x)\, \mathrm{d}z, \qquad (1)$$
are capable of modelling multimodality by means of latent variables $z$. However, in most cases the integral in Eq. (1) is intractable. Amortised variational inference [6, 12] addresses this issue by approximating $p_\theta(y \mid x)$ through maximising the evidence lower bound (ELBO):
$$\log p_\theta(y \mid x) \ge \mathbb{E}_{q_\phi(z \mid x, y)}\big[\log p_\theta(y \mid x, z)\big] - \mathrm{KL}\big(q_\phi(z \mid x, y)\,\|\,p(z \mid x)\big), \qquad (2)$$
where the parameters of the approximate posterior $q_\phi(z \mid x, y)$, the likelihood $p_\theta(y \mid x, z)$, and the prior $p(z \mid x)$ are defined as neural-network functions of the conditioning variables. This model is known as the conditional variational autoencoder (CVAE) [14]. In the following, we refer to the neural networks representing $q_\phi(z \mid x, y)$ and $p_\theta(y \mid x, z)$ as encoder and decoder, respectively.
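To make the ELBO in Eq. (2) concrete, the following minimal sketch estimates it for a single condition/target pair. It assumes diagonal Gaussian distributions for the approximate posterior, the prior, and the likelihood; the function names and the `decode` callable (a hypothetical stand-in for the decoder network, already conditioned on the condition) are illustrative assumptions and not the implementation used in the experiments.

```python
import math
import random


def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """Closed-form KL(N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)))."""
    return sum(
        0.5 * (math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0)
        for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p)
    )


def gaussian_log_pdf(y, mu, var):
    """Log-density of a diagonal Gaussian evaluated at y."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (yi - m) ** 2 / v)
        for yi, m, v in zip(y, mu, var)
    )


def elbo_estimate(y, q_mu, q_var, p_mu, p_var, decode, n_samples=64, seed=0):
    """Monte Carlo estimate of the ELBO for one (condition, target) pair.

    q_mu, q_var: parameters of the approximate posterior (encoder output).
    p_mu, p_var: parameters of the Gaussian prior.
    decode:      hypothetical callable z -> (mean, var) of the likelihood.
    """
    rng = random.Random(seed)
    recon = 0.0
    for _ in range(n_samples):
        # Reparameterised sample from the approximate posterior.
        z = [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(q_mu, q_var)]
        y_mu, y_var = decode(z)
        recon += gaussian_log_pdf(y, y_mu, y_var)
    # Expected reconstruction term minus the KL regulariser.
    return recon / n_samples - gaussian_kl(q_mu, q_var, p_mu, p_var)
```

For a linear-Gaussian toy model with prior $\mathcal{N}(0, 1)$ and likelihood $\mathcal{N}(z, 1)$, the exact posterior for $y = 1$ is $\mathcal{N}(0.5, 0.5)$; plugging it in as approximate posterior makes the bound tight, so the estimate approaches $\log \mathcal{N}(1; 0, 2)$.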
2.2 Incentivising Informative Latent Representations
In the CVAE, the likelihood $p_\theta(y \mid x, z)$ is conditioned on $x$ and $z$. Therefore, the model is not incentivised to learn an informative latent representation. Rather, the latent variables can be viewed as an aid for enabling multimodality in $p_\theta(y \mid x)$. To fully exploit the generalisation capacity of CVAEs, we argue that an informative latent representation is necessary, i.e. one in which $z$ determines $y$ completely, so that the conditional mutual information $I(x; y \mid z) = 0$. Following this line of argument, $x$ and $y$ are conditionally independent given $z$, and thus $p_\theta(y \mid x, z) = p_\theta(y \mid z)$, leading to the following CLVM:
$$p_\theta(y \mid x) = \int p_\theta(y \mid z)\, p(z \mid x)\, \mathrm{d}z. \qquad (3)$$
This modification forces the model to learn a richer latent representation because all the information given by the training data has to be encoded.
However, the model must also be capable of learning such a complex latent representation. In CVAEs, the prior $p(z \mid x)$ is usually defined as a Gaussian distribution, which limits the flexibility of the model and hence leads to worse generalisation, as addressed in [4] and shown in Sec. 4.2 and 4.3. We build on the line of argument in [4], where this limitation is tackled by introducing an expressive prior. The KL divergence in Eq. (2) can be viewed as a regulariser to avoid overfitting. Therefore, a flexible prior allows for learning a more complex latent representation and automatically leads to a more expressive generative model $p_\theta(y \mid x)$.
2.3 Modelling Low-Density Regions
In the previous section, we discussed the need for expressive priors in our setting. Next, we specify an important property the prior has to possess. In most models within the VAE/CVAE framework, the prior is defined as a unimodal distribution. This leads to a significant shortcoming, illustrated by the following structured-prediction task: generating grasping poses (targets) for a certain object (condition). Imagine that a generated grasping pose is located in the middle of a plate instead of on its edge. Generating targets between the modes of $p(y \mid x)$ can therefore rule out a model for such applications.
To understand the cause, let us assume a dataset consisting of only a single condition with different targets. Thus, $p(y \mid x) = p(y)$, $p(z \mid x) = p(z)$, and $q(z \mid x, y) = q(z \mid y)$ (note that this is equivalent to a vanilla VAE). We want to represent $p(y)$ by transforming $p(z)$ through a bijective function $f$, i.e. $y = f(z)$. By applying the change of variables, we derive:
$$p(y) = p(z)\,\left|\det \frac{\partial f(z)}{\partial z}\right|^{-1}.$$
In this context, we define the magnification factor $\mathrm{MF} = \left|\det \frac{\partial f(z)}{\partial z}\right|$ [2]. Setting $p(y) = 0$ requires either $p(z) = 0$ or $\mathrm{MF} \to \infty$. Thus, zero-density regions can only be represented in $p(y)$ if either the original density $p(z)$ is zero or the MF becomes infinitely large (see Sec. 4.1.1 for a visualisation). For example, when using a Gaussian distribution as prior, near-zero-density regions occur only at its tails. If $f$ is the likelihood neural network and we assume it to be continuous, zero-density regions can only be obtained from these tails. For zero densities elsewhere, infinitely large MF values are required. Thus, the derivative of $f$ becomes infinitely large, $\|\partial f(z) / \partial z\| \to \infty$, leading to a badly conditioned optimisation problem. The above line of argument applies equally to datasets with multiple conditions.
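The relation between the magnification factor and low-density regions can be checked numerically in one dimension. The sketch below is an illustration under stated assumptions: it uses a standard normal latent density and a hypothetical bijection $f(z) = z + a \tanh(bz)$ (not taken from the paper), which pushes mass away from $y = 0$ and thereby creates two modes. The density at $y = 0$ only shrinks as the derivative of $f$ at $z = 0$ grows.

```python
import math


def std_normal_pdf(z):
    """Density of the unimodal latent distribution N(0, 1)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def f(z, a, b):
    """Hypothetical strictly increasing bijection used for illustration."""
    return z + a * math.tanh(b * z)


def magnification_factor(z, a, b):
    """|df/dz| for the map above; in 1D the Jacobian determinant is the derivative."""
    return 1.0 + a * b * (1.0 - math.tanh(b * z) ** 2)


def density_y_at(z, a, b):
    """Change of variables: p(y) at y = f(z) equals p(z) / |df/dz|."""
    return std_normal_pdf(z) / magnification_factor(z, a, b)


if __name__ == "__main__":
    a = 2.0
    for b in (1.0, 10.0, 100.0, 1000.0):
        mf = magnification_factor(0.0, a, b)
        print(f"b={b:7.1f}  MF(0)={mf:9.1f}  p(y=0)={density_y_at(0.0, a, b):.6f}")
```

Because the latent density at $z = 0$ is fixed at its maximum, the only way to lower $p(y = 0)$ is to increase the magnification factor, which is the ill-conditioning discussed above.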
2.4 Expressive Priors for Conditional VAEs
A natural approach to address the difficulties introduced in Sec. 2.2 and 2.3 is a flexible multimodal prior. This could be realised by a conditional mixture-of-Gaussians (CMoG) prior $p(z \mid x)$ with $K$ mixture components. As in the case of the vanilla CVAE, the parameters of the prior are represented by a neural network. Unfortunately, this approach performs badly, especially in high-dimensional latent spaces (see Sec. 4.2).
We suspect that this is mainly due to the following reason: the prior is optimised through minimising $\mathrm{KL}\big(q_\phi(z \mid x, y)\,\|\,p(z \mid x)\big)$ (see Eq. 2). The optimal Bayes prior is the aggregated posterior, which represents the manifold of the encoded data. Since the parameters of each mixture component of the CMoG prior are learned independently, nothing prevents mixture components from leaving the manifold of the encoded data by focusing on outliers (see Sec. 4.2 for experimental support). This leads to a badly trained generative model. Thus, the problem is that the prior is not incentivised to stay on the manifold of the encoded data.
Instead of learning the mean and variance of each mixture component of the prior directly, we tackle the above issue by introducing a parameterisation through both the encoder and the decoder. This approach is inspired by the VampPrior [17] (VAE framework), which is parameterised through the encoder. When extending it to the CVAE framework, we obtain the conditional VampPrior $p(z \mid x) = \frac{1}{K} \sum_{k=1}^{K} q_\phi(z \mid x, y_k)$, which is evaluated at learned pseudo targets $y_k$. However, pseudo latent variables $z_k$ would require fewer parameters and are thus less complex to optimise for representing the manifold of the encoded data. Evaluating the conditional VampPrior at decoded $z_k$ would make use of this advantage (see Sec. 4.2 for experimental support). Below, we introduce the conditional decoder-based Vamp (CDV) prior:
$$p(z \mid x) = \frac{1}{K} \sum_{k=1}^{K} q_\phi\big(z \mid x, \mu_\theta(z_k)\big), \qquad (4)$$
where $\mu_\theta(z_k)$ is the mean of the likelihood $p_\theta(y \mid z_k)$ and the pseudo latent variables $z_k$ are defined as functions of the condition $x$ and approximated by a single neural network, which is trained through backpropagation. Thus, the parameters of the prior comprise those of the encoder, the decoder, and this network. As an additional feature, this approach requires fewer parameters than the CMoG prior, since only the pseudo latent variables have to be learned instead of the means and variances of the CMoG prior.
The CLVM in Eq. 3 was introduced to incentivise a more informative latent representation for achieving a higher generalisation capacity. This step demands a flexible multimodal prior that allows the model to capture semantically meaningful features of the data. The CDV prior meets these requirements and, in contrast to a classical Gaussian mixture prior, facilitates a well-trained generative model.
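The structure of the CDV prior can be sketched as follows. This is a minimal illustration, assuming a uniform mixture over $K$ components (as in the VampPrior) and a diagonal Gaussian encoder; `encoder`, `decoder_mean`, and `pseudo_latents` are hypothetical callables standing in for the encoder network, the mean of the likelihood, and the network that maps a condition to its pseudo latent variables.

```python
import math


def log_normal_diag(z, mu, var):
    """Log-density of a diagonal Gaussian evaluated at z."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (zi - m) ** 2 / v)
        for zi, m, v in zip(z, mu, var)
    )


def logsumexp(values):
    """Numerically stable log(sum(exp(values)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def cdv_prior_log_prob(z, x, encoder, decoder_mean, pseudo_latents):
    """Log-density of the prior at z for condition x.

    pseudo_latents(x) -> list of K pseudo latent variables (hypothetical network)
    decoder_mean(z_k) -> mean of the likelihood for pseudo latent z_k
    encoder(x, y)     -> (mean, var) of the approximate posterior
    """
    z_ks = pseudo_latents(x)
    component_log_probs = []
    for z_k in z_ks:
        y_k = decoder_mean(z_k)      # decode the pseudo latent variable
        mu, var = encoder(x, y_k)    # re-encode it together with the condition
        component_log_probs.append(log_normal_diag(z, mu, var))
    # Uniform mixture weights 1/K (an assumption of this sketch).
    return logsumexp(component_log_probs) - math.log(len(z_ks))
```

Because every mixture component is an encoder output for a decoded pseudo latent variable, the components are tied to the same mapping that encodes the training data instead of being free means and variances.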
3 Related Work
Learning informative latent representations in VAEs is an ongoing field of research [1, 5, 15]. The connection between informative latent representations and a flexible prior was pointed out in [4] and motivated through Bits-Back Coding. Several additional works improved VAEs by learning more complex priors [10, 17]. The reason for increasing the expressiveness of the prior is a lower KL divergence and thus a better-trained decoder, leading to higher-quality samples from the generative model. Based on that, it can be derived that the optimal Bayes prior is the aggregated posterior [17]. The VampPrior [17] approximates the aggregated posterior by a uniform mixture of approximate posteriors, evaluated at learned pseudo inputs in the observable space.
In contrast to the (conditional) VampPrior, the CDV prior is parameterised through both the encoder and the decoder, and evaluated at learned pseudo latent variables. Since the latent space generally has a lower dimension than the observable space, pseudo latent variables need fewer parameters and are easier to optimise for approximating the aggregated posterior.
Several applications based on the concept of CVAEs have been published: they can be used for filling in pixels given a partial image [14], for image inpainting conditioned on visual attributes (e.g., colour and gender) [21], or for predicting events by conditioning the distribution of possible movements on a scene [19]. As in [14], we use CVAEs to complete images, with the aim of obtaining the widest possible variety of generations, i.e. a classical one-to-many mapping. However, there is an additional difficulty: the mapping is learned from a dataset of one-to-one mappings in order to validate the generalisation capacity of the models.
Another important field where CVAEs are applied is robot grasping: earlier work has focused on detecting robust grasping poses [9, 11], while recent work is often based on structured prediction with the idea of learning multimodal conditional probability distributions for generating grasping poses [18]. In [9, 11], classifiers are applied to detect whether a grasping pose is robust. A problem here is that suitable grasping poses need to be proposed by hand. In our approach, CVAEs are used to generate grasping poses for unknown objects. Afterwards, similar to [9], a discriminator is applied to validate them.
4 Experiments
We conduct five experiments to compare the introduced models. First, we visualise the difficulty of unimodal priors on a simplified task. Building on that, we demonstrate on a synthetic toy dataset that CMoG- and CDV-CVAEs are capable of modelling near-zero-density regions. Second, we show on modified versions of MNIST and Fashion-MNIST that the variety of generated samples is significantly larger when combining the CVAE with the CMoG or CDV prior. Finally, we compare the CVAE with the CDV-CVAE on real-world data, the Cornell Robot Grasping dataset.
To train our models, we applied a linear annealing scheme [3] for the first epoch. This is especially important for the CDV-CVAE because it is sensitive to over-regularisation by the KL term in the initial optimisation phase.
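A linear annealing schedule of this kind can be sketched as below. The sketch assumes that the weight of the KL term rises linearly from 0 to 1 over the optimisation steps of the first epoch and stays at 1 afterwards; the function names and the per-step granularity are assumptions made for illustration.

```python
def kl_weight(step, steps_per_epoch, annealing_epochs=1):
    """Linearly increase the KL weight from 0 to 1 during the annealing phase."""
    total_steps = steps_per_epoch * annealing_epochs
    if total_steps <= 0:
        return 1.0
    return min(1.0, step / total_steps)


def annealed_objective(recon_log_lik, kl, step, steps_per_epoch):
    """Objective to maximise: reconstruction term minus the down-weighted KL term."""
    return recon_log_lik - kl_weight(step, steps_per_epoch) * kl
```

With a small weight at the start of training, the encoder can first learn to place information in the latent variables before the regulariser pulls the posterior towards the prior.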
4.1 Modelling Low-Density Regions
4.1.1 Visualisation of the Problem
To reduce complexity, we trained a vanilla VAE with a Gaussian prior on a simple toy dataset consisting of four Gaussian distributions. This toy dataset can be interpreted as a simplified structured-prediction task with only one condition and four targets.
Fig. (a) shows the two-dimensional latent space, which depicts the aggregated posterior of the model. Each of the four Gaussians is encoded by a different colour. To map from a unimodal to a multimodal distribution, the decoder has to model large gradients, as discussed in Sec. 2.3. The magnification factor is visualised by the greyscale in Fig. (a), which represents the Jacobian of the decoder. The support of the aggregated posterior is noticeably smaller than the support of the prior. Since the decoder is a continuous function, a gap at the boundaries of different classes in the latent space (as shown in Fig. (a)) represents the distance between the modes in the observable space. The size of the gap depends on the gradients that our model is able to achieve: the higher the gradient, the smaller the gap in the latent space.
When sampling from the generative model, we first sample from the prior. If the sample comes from a region that is not supported by the aggregated posterior, the decoded sample will end up between two modes, as demonstrated in Fig. (b).
4.1.2 Synthetic Toy Dataset
In this experiment we reused a synthetic toy dataset [16] for validating models for structured-prediction tasks. It consists of one-dimensional one-to-many mappings (see Fig. (a)): the horizontal axis represents the conditions and the vertical axis the targets. Even though the dataset is simple, the abrupt changes in the number and location of the targets are quite challenging to model.
For all three models, we used latent spaces with two dimensions. CMoG-CVAE and CDV-CVAE outperformed the original CVAE, as shown in Fig. 8. Multimodal priors facilitate the modelling of near-zero-density regions between different modes (Fig. (c), (d)), as discussed in Sec. 2.3. Fig. 13 shows how the CDV prior distribution changes with the condition $x$.
4.2 Verifying the Generalisation Capacity
We created modified versions of MNIST [8] and Fashion-MNIST [20] to evaluate the generalisation capacity of the different models. For this purpose, we split binarised MNIST/Fashion-MNIST images into two parts: a conditional part, the lower third of the image (its last pixels), and a target part, the upper two-thirds (its first pixels). The dataset therefore has only one target per condition. The goal is to investigate whether the models are able to define a set of new targets for each condition of the test set, in other words, whether they can learn a one-to-many mapping from a one-to-one mapping.
In all three models, we used a 32-dimensional latent space. CMoG-CVAEs (Fig. (b)) and CDV-CVAEs (Fig. (c)) were able to represent a multimodal likelihood distribution, in contrast to vanilla CVAEs (Fig. (a)). This is shown by the significantly larger variety of generated targets per condition.
To measure the variety of the generated targets, we trained a classifier on MNIST/Fashion-MNIST and sampled 10 targets for each condition of the test set. Afterwards, we used the classifier to determine how many different classes were generated per condition. Fig. 20 shows the results for the different models and datasets. Note that we only took into account sampled targets that could be clearly assigned to a class, in particular to avoid treating poor generations as additional classes. For both datasets, CMoG- and CDV-CVAEs learned to generate several classes per condition, and thus a one-to-many mapping from a one-to-one mapping. Additionally, CMoG- and CDV-CVAEs achieved a larger variety of generations within the same class (see Fig. 17).
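The variety measure described above can be sketched as follows. The text does not state how a target is judged to be "clearly assigned" to a class, so the confidence threshold `min_confidence` used here is a hypothetical choice, as is the input layout (per condition, a list of class-probability vectors produced by the classifier for the sampled targets).

```python
def distinct_classes_per_condition(class_probs, min_confidence=0.9):
    """Count the distinct, confidently predicted classes among the samples of each condition.

    class_probs: list over conditions; each entry is a list over sampled targets,
                 each of which is a list of class probabilities from the classifier.
    min_confidence: hypothetical threshold below which a sample is ignored.
    """
    counts = []
    for samples in class_probs:
        classes = set()
        for probs in samples:
            best = max(range(len(probs)), key=probs.__getitem__)
            # Ambiguous (poor) generations must not count as additional classes.
            if probs[best] >= min_confidence:
                classes.add(best)
        counts.append(len(classes))
    return counts
```

Averaging the returned counts over the test set gives one number per model, which allows a direct comparison of the variety achieved by the different priors.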
Based on the above results, we can deduce that CMoG- and CDV-CVAEs have a higher generalisation capacity. The larger variety of the generations is due to the structure of the priors: since they are mixtures of distributions, each target is represented by one or more mixture components. However, as discussed in Sec. 2.4, CMoG priors perform badly, especially in high-dimensional latent spaces. This becomes evident from the large number of poor generations in Fig. (b). To verify our hypothesis that the poor generations are caused by mixture components of the CMoG prior that focused on outliers during optimisation, we encoded our training data (MNIST) and measured the Euclidean distance to the respective mean of each prior component. Fig. 24 shows the number of nearest neighbours (encoded data points) as a function of the Euclidean distance in the latent space. Each line represents one of the 32 mixture components. In contrast to the CDV prior (Fig. (c)), four mixture components of the CMoG prior (Fig. (a)) have a significantly larger distance to the encoded data. This reinforces the conclusion that these mixture components focused on outliers during the optimisation process. We obtain poor generations like those in Fig. (b) if a generated target is based on one of these four components, because the decoder $p_\theta(y \mid z)$ is only optimised (see Eq. 2) to decode samples that lie on the manifold of the encoded training data.
Additionally, we show that the CDV prior outperforms the conditional VampPrior (Fig. (b)), where one mixture component has a significantly larger distance to the encoded data. As discussed in Sec. 2.4, we suspect that this is due to the higher dimension of the pseudo targets $y_k$, which makes them more complex to optimise than pseudo latent variables $z_k$.
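The distance analysis used to detect components that drifted away from the encoded data can be sketched as follows. The sketch assumes that the encoded training data are given as a list of latent mean vectors and that each prior component is summarised by its mean; the function name and the choice of radii are illustrative.

```python
import bisect
import math


def neighbours_within(component_mean, encoded_points, radii):
    """Number of encoded data points within each Euclidean radius of a component mean."""
    distances = sorted(math.dist(component_mean, z) for z in encoded_points)
    # bisect_right on the sorted distances counts all points with distance <= r.
    return [bisect.bisect_right(distances, r) for r in radii]


if __name__ == "__main__":
    encoded = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
    radii = [0.5, 1.0, 2.0, 3.0]
    print(neighbours_within([0.0, 0.0], encoded, radii))    # component on the data manifold
    print(neighbours_within([10.0, 10.0], encoded, radii))  # component far from the data
```

A component whose curve stays at zero over a wide range of radii has no encoded data nearby, which is the signature of a component that focused on outliers.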
4.3 Generating Grasping Poses



In this experiment we want to assess the generalisation capabilities of the CVAE and the CDV-CVAE on a real-world dataset. To this end, we use the Cornell Robot Grasping dataset, which consists of 885 conditions (greyscale images of objects) and 5,110 targets (proposed grasping poses) [9]. The latent spaces of both models are 16-dimensional. For training, we resized the condition images. Furthermore, we adapted how the grasping poses are represented: the rectangles (original representation) were redefined by a centre, a short and a long axis, and a rotation angle.
Fig. (a) shows a selection of objects and proposed grasping poses defined by the test dataset. Fig. (b) and Fig. (c) depict grasping poses generated by the CVAE and the CDV-CVAE, respectively. As discussed in Sec. 4.1 and 4.2, CDV-CVAEs have a higher capability of modelling one-to-many mappings and enable a larger variety of generated targets.
To verify whether the CDV-CVAE has actually learned to generate more realistic grasping poses for unknown objects, we apply an approach similar to the one proposed in [9]. It is based on a discriminator for validating proposed grasping poses. For this purpose, we trained the discriminator in equal parts with samples from the joint and the marginal empirical distributions. Subsequently, we generated 10 grasping poses for each condition in the test set and filtered out those with a discrimination score below 0.99. As a result, a larger share of the grasping poses generated by the CDV-CVAE than of those generated by the CVAE was above this threshold. This allows the conclusion that the CDV-CVAE is a useful extension to the CVAE framework.
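The validation procedure can be sketched in two steps: building a balanced training set for the discriminator and computing the share of generated poses that pass the threshold. The sketch rests on an assumption about the text: "samples from the marginal empirical distribution" is interpreted as pairing each condition with a target drawn independently from the dataset. The function names are hypothetical, and the discriminator itself is not shown.

```python
import random


def make_discriminator_data(pairs, rng):
    """Build a training set with equal parts of matching and mismatched pairs.

    pairs: list of (condition, target) tuples from the dataset.
    Returns (condition, target, label) triples with label 1 for pairs from the
    joint distribution and label 0 for independently combined pairs.
    """
    targets = [y for _, y in pairs]
    positives = [(x, y, 1) for x, y in pairs]
    # Assumption: negatives combine each condition with a randomly drawn target.
    negatives = [(x, rng.choice(targets), 0) for x, _ in pairs]
    return positives + negatives


def accepted_fraction(scores, threshold=0.99):
    """Share of generated grasping poses whose score is not below the threshold."""
    return sum(score >= threshold for score in scores) / len(scores)
```

Comparing `accepted_fraction` for the poses generated by two models, with the same discriminator and threshold, yields the kind of comparison reported above.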
5 Conclusion
In this paper, we have introduced a modified conditional latent-variable model to incentivise informative latent representations. To enable the model to capture semantically meaningful features of the data, we have proposed an expressive multimodal prior that facilitates, in contrast to a classical Gaussian mixture prior, a well-trained generative model.
We have shown that our approach increases the generalisation capacity of CVAEs: on modified versions of MNIST and Fashion-MNIST it achieves a significantly larger variety of generated targets, and on the Cornell Robot Grasping dataset it generates more realistic grasping poses. Additionally, we have demonstrated that a straightforward application of CVAEs to structured-prediction problems has difficulty representing multimodal distributions and that our approach overcomes this limitation.
References
[1] Alemi, A.A., Poole, B., Fischer, I., Dillon, J.V., Saurous, R.A., Murphy, K.: Fixing a broken ELBO. ICML (2018)
[2] Bishop, C.M., Svensén, M., Williams, C.K.I.: Magnification factors for the SOM and GTM algorithms. Proceedings Workshop on Self-Organizing Maps (1997)
[3] Bowman, S.R., Vilnis, L., Vinyals, O., Dai, A.M., Jozefowicz, R., Bengio, S.: Generating sentences from a continuous space. CoNLL (2016)
[4] Chen, X., Kingma, D.P., Salimans, T., Duan, Y., Dhariwal, P., Schulman, J., Sutskever, I., Abbeel, P.: Variational lossy autoencoder. CoRR (2016)
[5] Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., Lerchner, A.: beta-VAE: Learning basic visual concepts with a constrained variational framework. ICLR (2017)
[6] Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. CoRR (2013)
[7] Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. NeurIPS (2012)
[8] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., et al.: Gradient-based learning applied to document recognition. Proceedings of the IEEE (1998)
[9] Lenz, I., Lee, H., Saxena, A.: Deep learning for detecting robotic grasps. The International Journal of Robotics Research (2015)
[10] Nalisnick, E., Smyth, P.: Stick-breaking variational autoencoders. ICLR (2017)
[11] Pinto, L., Gupta, A.: Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. ICRA (2016)
[12] Rezende, D.J., Mohamed, S., Wierstra, D.: Stochastic backpropagation and approximate inference in deep generative models. ICML (2014)
[13] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. ICLR (2015)
[14] Sohn, K., Lee, H., Yan, X.: Learning structured output representation using deep conditional generative models. NeurIPS (2015)
[15] Sønderby, C.K., Raiko, T., Maaløe, L., Sønderby, S.K., Winther, O.: Ladder variational autoencoders. NeurIPS (2016)
[16] Tang, Y., Salakhutdinov, R.R.: Learning stochastic feedforward neural networks. NeurIPS (2013)
[17] Tomczak, J., Welling, M.: VAE with a VampPrior. AISTATS (2018)
[18] Veres, M., Moussa, M., Taylor, G.W.: Modeling grasp motor imagery through deep conditional generative models. IEEE Robotics and Automation Letters (2017)
[19] Walker, J., Doersch, C., Gupta, A., Hebert, M.: An uncertain future: Forecasting from static images using variational autoencoders. ECCV (2016)
[20] Xiao, H., Rasul, K., Vollgraf, R.: Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747 (2017)
[21] Yan, X., Yang, J., Sohn, K., Lee, H.: Attribute2Image: Conditional image generation from visual attributes. ECCV (2016)