The problem of approximating conditional probability distributions is central to the field of supervised learning. Although learning a complex many-to-one mapping is straightforward if a sufficient amount of data is available [7, 13], most methods fail when it comes to structured-prediction problems, where a distribution with multiple modes (one-to-many mapping) has to be modelled.
Conditional variational autoencoders (CVAEs)  are a class of latent variable models for approximating one-to-many functions. They define a lower bound on the intractable marginal likelihood by introducing a variational posterior distribution. The learned generative model and the corresponding (approximate) posterior distribution of the latent variables provide a decoder/encoder pair that captures semantically meaningful features of the data. In this paper we address the issue of learning informative encodings/latent representations with the goal of increasing the generalisation capacity of CVAEs.
In contrast to variational autoencoders (VAEs) [6, 12], the decoder of CVAEs is a function of both the latent variable and the condition. Thus, the model is not incentivised to learn an informative latent representation. To tackle this problem, we propose to apply a VAE-like decoder that depends only on the latent variable. This modification requires that the model is capable of learning a rich encoding. We follow an existing line of argument, in which the expressiveness of the generative model is increased by introducing a flexible prior, and show that a multimodal prior substantially improves optimisation.
Building on that, we propose to apply a learnable mixture distribution as prior. We show that the classical mixture-of-Gaussians prior tends to focus on outliers during optimisation, which results in a badly trained generative model. Instead of learning the means and variances of the respective mixture components directly, we address this issue by introducing a Gaussian mixture prior, inspired by the VampPrior, that is parameterised through both the encoder and the decoder and evaluated at learned pseudo latent variables.
2.1 Preliminaries: Conditional VAEs
In structured-prediction problems, each condition $x$ can be related to several targets $y$ (one-to-many mapping), which results in a multimodal conditional distribution $p(y \mid x)$. Conditional latent variable models (CLVMs), defined by

$$p(y \mid x) = \int p(y \mid x, z)\, p(z \mid x)\, \mathrm{d}z, \tag{1}$$

are capable of modelling multimodality by means of latent variables $z$. However, in most cases the integral in Eq. (1) is intractable. Amortised variational inference [6, 12] addresses this issue by approximating $\log p(y \mid x)$ through maximising the evidence lower bound (ELBO):

$$\log p(y \mid x) \geq \mathbb{E}_{q(z \mid x, y)}\big[\log p(y \mid x, z)\big] - \mathrm{KL}\big(q(z \mid x, y) \,\|\, p(z \mid x)\big), \tag{2}$$

where the parameters of the approximate posterior $q(z \mid x, y)$, the likelihood $p(y \mid x, z)$, and the prior $p(z \mid x)$ are defined as neural-network functions of the conditioning variables. This model is known as the conditional variational autoencoder (CVAE). Consequently, we will refer to the neural networks representing $q(z \mid x, y)$ and $p(y \mid x, z)$ as encoder and decoder, respectively.
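The ELBO described above can be illustrated with a minimal sketch. It assumes diagonal Gaussian distributions for the approximate posterior, the prior, and the likelihood, and it represents the decoder for a fixed condition as a hypothetical callable `decode(z)` returning a mean vector and a shared standard deviation. None of these names come from the paper, and the network architectures are left out entirely.

```python
import math
import random


def gaussian_log_pdf(value, mean, std):
    """Log-density of a univariate Gaussian."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(std) - 0.5 * ((value - mean) / std) ** 2


def kl_diag_gaussians(mean_q, std_q, mean_p, std_p):
    """KL(q || p) between two diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for mq, sq, mp, sp in zip(mean_q, std_q, mean_p, std_p):
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2) - 0.5
    return total


def elbo_estimate(y, decode, mean_q, std_q, mean_p, std_p, num_samples=64, seed=0):
    """Monte Carlo estimate of E_q[log p(y | x, z)] - KL(q(z | x, y) || p(z | x)).

    `decode` is an assumed stand-in for the decoder at a fixed condition x.
    """
    rng = random.Random(seed)
    reconstruction = 0.0
    for _ in range(num_samples):
        # Reparameterised sample z = mean + std * eps with eps ~ N(0, 1).
        z = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean_q, std_q)]
        mean_y, std_y = decode(z)
        reconstruction += sum(gaussian_log_pdf(t, m, std_y) for t, m in zip(y, mean_y))
    reconstruction /= num_samples
    return reconstruction - kl_diag_gaussians(mean_q, std_q, mean_p, std_p)
```

The KL term is computed in closed form, so only the reconstruction term is estimated by sampling.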
2.2 Incentivising Informative Latent Representations
In the CVAE, the likelihood $p(y \mid x, z)$ is conditioned on both the condition $x$ and the latent variable $z$. Therefore, the model is not incentivised to learn an informative latent representation; rather, the latent variables merely assist in enabling multimodality in $p(y \mid x)$. To fully exploit the generalisation capacity of CVAEs, we argue that an informative latent representation is necessary, i.e. one in which $z$ determines the target $y$ completely, so that $x$ carries no additional information about $y$ once $z$ is known. Following this line of argument, the likelihood depends on the latent variable only, $p(y \mid x, z) = p(y \mid z)$, leading to the following CLVM:

$$p(y \mid x) = \int p(y \mid z)\, p(z \mid x)\, \mathrm{d}z. \tag{3}$$

This modification forces the model to learn a richer latent representation because all the information given by the training data has to be encoded.
However, the model must also be capable of learning such a complex latent representation. In the case of CVAEs, the prior $p(z \mid x)$ is usually defined as a Gaussian distribution, which limits the flexibility of the model and hence leads to worse generalisation, as shown in Sec. 4.2 and 4.3. We build on an existing line of argumentation in which this limitation is tackled by introducing an expressive prior. The KL-divergence in Eq. 2 can be viewed as a regulariser that prevents over-fitting. Therefore, a flexible prior allows for learning a more complex latent representation and automatically leads to a more expressive generative model.
2.3 Modelling Low-Density Regions
In the previous section, we discussed the need for expressive priors in our setting. Next, we specify an important property the prior has to possess. In most models within the VAE/CVAE framework, the prior is defined as a unimodal distribution. This leads to a significant shortcoming, illustrated by the following structured-prediction task: generating grasping poses (targets) for a certain object (condition). Imagine a generated grasping pose is located in the middle of a plate instead of on the edge. Hence, generating targets between modes of the conditional distribution might be an exclusion criterion.
To understand the cause, let us assume a dataset consisting of only a single condition with different targets. Thus, the approximate posterior, the likelihood, and the prior no longer depend on the condition (note that this is equivalent to a vanilla VAE). We want to represent the target density $p(y)$ by transforming the prior $p(z)$ through a bijective function $f$, i.e. $y = f(z)$. By applying the change of variables, we derive:

$$p(y) = p(z) \left| \det \frac{\partial f(z)}{\partial z} \right|^{-1}.$$

In this context, we define the magnification factor $\mathrm{MF} = \left| \det \frac{\partial f(z)}{\partial z} \right|$. Setting $p(y) = 0$ requires either $p(z) = 0$ or $\mathrm{MF} \to \infty$. Thus, zero-density regions in $p(y)$ can only be represented if either the original density $p(z)$ is zero or the magnification factor becomes infinitely large (see Sec. 4.1.1 for a visualisation). For example, when using a Gaussian distribution as prior, near-zero-density regions occur only at its tails. If $f$ is the likelihood neural network and we assume it to be continuous, zero-density regions can only be obtained in the tails. For zero densities elsewhere, infinitely large MF values are required. Thus, the derivative of $f$ becomes infinitely large, leading to a badly conditioned optimisation problem. The above line of argument applies equally to datasets with multiple conditions.
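The change-of-variables argument can be checked numerically in one dimension. The sketch below uses a standard normal prior and a hypothetical bijective decoder `z + gain * tanh(4 z)`, chosen only for illustration and not taken from the paper. It shows that the density between the two resulting modes shrinks only as fast as the derivative of the decoder grows.

```python
import math


def standard_normal_pdf(z):
    """Density of the unimodal prior."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def decoder(z, gain):
    """Hypothetical bijective 1-D decoder; a larger `gain` pushes the two modes apart."""
    return z + gain * math.tanh(4.0 * z)


def magnification_factor(z, gain):
    """|df/dz| for the decoder above, i.e. the 1-D case of |det J_f(z)|."""
    return 1.0 + 4.0 * gain * (1.0 - math.tanh(4.0 * z) ** 2)


def target_density(z, gain):
    """Density of y = f(z), evaluated at the image of z via the change of variables."""
    return standard_normal_pdf(z) / magnification_factor(z, gain)


if __name__ == "__main__":
    for gain in (0.0, 1.0, 10.0, 100.0):
        # z = 0 maps to y = 0, the midpoint between the two modes.
        print(gain, magnification_factor(0.0, gain), target_density(0.0, gain))
```

Because the prior density at `z = 0` is fixed, the density at the midpoint can only approach zero if the magnification factor diverges, which is the optimisation difficulty described above.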
2.4 Expressive Priors for Conditional VAEs
A natural approach to address the difficulties introduced in Sec. 2.2 and 2.3 is a flexible multimodal prior. This could be realised by a conditional mixture of Gaussians (CMoG) prior with $K$ mixture components. As in the case of the vanilla CVAE, the parameters of the prior are represented by a neural network. Unfortunately, this approach performs poorly, especially in high-dimensional latent spaces (see Sec. 4.2).
We suspect that this is mainly due to the following reason: the prior is optimised by minimising the KL-divergence between the approximate posterior and the prior (see Eq. 2). The optimal Bayes prior is the aggregated posterior, which represents the manifold of the encoded data. Since the parameters of each mixture component of the CMoG prior are learned independently, nothing prevents mixture components from leaving the manifold of the encoded data by focusing on outliers (see Sec. 4.2 for experimental support). This leads to a badly trained generative model. The problem is thus that the prior is not incentivised to stay on the manifold of the encoded data.
Instead of learning the mean and variance of each mixture component of the prior directly, we tackle the above issue by introducing a parameterisation through both the encoder and the decoder. This approach is inspired by the VampPrior (VAE framework), which is parameterised through the encoder. Extending it to the CVAE framework yields the conditional VampPrior, which is evaluated at learned pseudo targets. However, pseudo latent variables $z_k$ would require fewer parameters and are thus less complex to optimise for representing the manifold of the encoded data. Evaluating the conditional VampPrior at decoded $z_k$ makes use of this advantage (see Sec. 4.2 for experimental support). Below, we introduce the conditional decoder-based Vamp (CDV) prior:

$$p(z \mid x) = \frac{1}{K} \sum_{k=1}^{K} q\big(z \mid x, \mu(z_k)\big),$$

where $\mu(z_k)$ is the mean of the likelihood $p(y \mid z_k)$ and the $z_k$ are defined as functions of the condition $x$ and approximated by a single neural network, which is trained through backpropagation. Thus, the prior is parameterised jointly by the encoder, the decoder, and this network. As an additional feature, this approach requires fewer parameters than the CMoG prior, since only the pseudo latent variables have to be learned instead of the means and variances of the CMoG prior.
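As a rough sketch of how such a prior could be evaluated, the function below computes the log-density of a mixture of encoder posteriors at decoded pseudo latent variables. The uniform mixture weights, the diagonal Gaussian posterior, and the three callables `pseudo_latent_net`, `decoder_mean`, and `encoder` are assumptions made for illustration; they stand in for the networks described above and are not the authors' implementation.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def diag_gaussian_log_pdf(z, mean, std):
    """Log-density of a diagonal Gaussian, summed over dimensions."""
    return sum(
        -0.5 * math.log(2.0 * math.pi) - math.log(s) - 0.5 * ((v - m) / s) ** 2
        for v, m, s in zip(z, mean, std)
    )


def cdv_prior_log_pdf(z, x, pseudo_latent_net, decoder_mean, encoder):
    """log p(z | x) as a uniform mixture of encoder posteriors at decoded pseudo latents.

    pseudo_latent_net(x) -> list of K pseudo latent vectors (assumed interface)
    decoder_mean(z_k)    -> mean of the likelihood at z_k (assumed interface)
    encoder(x, y)        -> (mean, std) of the approximate posterior (assumed interface)
    """
    pseudo_latents = pseudo_latent_net(x)
    component_log_pdfs = []
    for z_k in pseudo_latents:
        pseudo_target = decoder_mean(z_k)      # decode the pseudo latent variable
        mean, std = encoder(x, pseudo_target)  # encode the decoded pseudo target
        component_log_pdfs.append(diag_gaussian_log_pdf(z, mean, std))
    return log_sum_exp(component_log_pdfs) - math.log(len(pseudo_latents))
```

Since every component passes through the encoder, its location is tied to where the encoder places data, which is the property that keeps components on the manifold of the encoded data.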
The CLVM in Eq. 3 was introduced to incentivise a more informative latent representation and thereby achieve a higher generalisation capacity. This step demands a flexible multimodal prior that allows the model to capture semantically meaningful features of the data. The CDV prior meets these requirements and, in contrast to a classical Gaussian mixture prior, facilitates a well-trained generative model.
3 Related Work
Learning informative latent representations in VAEs is an ongoing field of research [5, 15, 1]. The connection between informative latent representations and a flexible prior has been pointed out in earlier work and motivated through Bits-Back Coding. Several additional works improved VAEs by learning more complex priors [10, 17]. The motivation for increasing the expressiveness of the prior is a lower KL-divergence and thus a better trained decoder, leading to higher-quality samples from the generative model. Based on that, it can be derived that the optimal Bayes prior is the aggregated posterior. The VampPrior approximates the aggregated posterior by a uniform mixture of approximate posteriors, evaluated at learned pseudo inputs in the observable space.
In contrast to the (conditional) VampPrior, the CDV prior is parameterised through both the encoder and the decoder, and evaluated at learned pseudo latent variables. Since the latent space generally has a lower dimension than the observable space, pseudo latent variables need fewer parameters and are easier to optimise for approximating the aggregated posterior.
Several applications based on the concept of CVAEs have been published: they can be used for filling in pixels given a partial image, for image inpainting conditioned on visual attributes (e.g., colour and gender), or for predicting events by conditioning the distribution of possible movements on a scene. As in the image-completion setting, we use CVAEs to complete images, with the aim of obtaining the widest possible variety of generations, i.e. a classical one-to-many mapping. However, there is an additional difficulty: to validate the generalisation capacity of the models, the mapping is learned from a dataset of one-to-one mappings.
Another important field where CVAEs are applied is robot grasping: earlier work has focused on detecting robust grasping poses [9, 11], while recent work is often based on structured prediction with the idea of learning multimodal conditional probability distributions for generating grasping poses. In [9, 11], classifiers are applied to detect whether a grasping pose is robust. A problem here is that suitable grasping poses need to be proposed by hand. In our approach, CVAEs are used to generate grasping poses for unknown objects. Afterwards, similar to prior work, a discriminator is applied to validate them.
4 Experiments
We conduct five experiments to compare the introduced models. First, we visualise the difficulty of unimodal priors on a simplified task. Building on that, we demonstrate on a synthetic toy dataset that CMoG- and CDV-CVAEs are capable of modelling near-zero-density regions. Next, we show on modified versions of MNIST and Fashion-MNIST that the variety of generated samples is significantly larger when combining the CVAE with the CMoG or CDV prior. Finally, we compare the CVAE with the CDV-CVAE on real-world data, the Cornell Robot Grasping dataset.
To train our models, we applied a linear annealing scheme for the first epoch. This is especially important for the CDV-CVAE because it is sensitive to over-regularisation by the KL-term in the initial optimisation phase.
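One common way to realise such a scheme is to scale the KL-term by a weight that grows linearly during the first epoch. The sketch below assumes that the weight starts at 0 and reaches 1 at the end of the annealing period; the paper does not state the start and end values, so these are illustrative choices, as are the function names.

```python
def kl_weight(step, steps_per_epoch, annealing_epochs=1):
    """Linearly increase the KL weight from 0 to 1 over the annealing period.

    The start value 0 and end value 1 are assumptions for illustration.
    """
    annealing_steps = steps_per_epoch * annealing_epochs
    return min(1.0, step / annealing_steps)


def annealed_objective(reconstruction_term, kl_term, step, steps_per_epoch):
    """ELBO with the KL-term down-weighted during the first epoch."""
    return reconstruction_term - kl_weight(step, steps_per_epoch) * kl_term
```

With a small weight early in training, the encoder can first learn a useful encoding before the KL-term pulls the approximate posterior towards the prior.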
4.1 Modelling Low-Density Regions
4.1.1 Visualisation of the Problem
To reduce complexity, we trained a vanilla VAE with a Gaussian prior on a simple toy dataset consisting of four Gaussian distributions. This toy dataset can be interpreted as a simplified structured-prediction task with only one condition and four targets.
Fig. (a) shows the two-dimensional latent space, which depicts the aggregated posterior of the model. Each of the four Gaussians is encoded by a different colour. To map from a unimodal to a multimodal distribution, the decoder has to model large gradients, as discussed in Sec. 2.3. The magnification factor is visualised by the greyscale in Fig. (a), which represents the Jacobian of the decoder. The support of the aggregated posterior is noticeably smaller than the support of the prior. Since the decoder is a continuous function, a gap at the boundaries of different classes in the latent space (as shown in Fig. (a)) represents the distance between the modes in the observable space. The size of the gap depends on the gradients that our model is able to achieve: the higher the gradient, the smaller the gap in the latent space.
When sampling from the generative model, we first sample from the prior. If the sample comes from a region which is not supported by the aggregated posterior, the decoded sample will end up between two modes, as demonstrated in Fig. (b).
4.1.2 Synthetic Toy Dataset
In this experiment we reused a synthetic toy dataset for validating models for structured-prediction tasks. It consists of one-dimensional one-to-many mappings (see Fig. (a)): the horizontal axis represents the conditions and the vertical axis the targets. Even though the dataset is simple, the abrupt changes in the number and location of the targets are quite challenging to model.
For all three models, we used latent spaces with two dimensions. CMoG-CVAE () and CDV-CVAE () outperformed the original CVAE () as shown in Fig. 8. Multimodal priors facilitate the modelling of near-zero-density regions between different modes (Fig. (c), (d)), as discussed in Sec. 2.3. Fig. 13 shows how the CDV prior distribution changes with the condition.
4.2 Verifying the Generalisation Capacity
We created a modified version of MNIST  and Fashion-MNIST  to evaluate the generalisation capacity of the different models. For this purpose, we split binarised MNIST/Fashion-MNIST images into two parts: a conditional part, the lower third (last pixels) of the image—and a target part, the upper two-thirds (first pixels). The dataset has therefore only one target per condition. The goal is to investigate whether the models are able to define a set of new targets for each condition of the test set. In other words, whether they can learn a one-to-many from a one-to-one mapping.
In all three models, we used a 32-dimensional latent space. CMoG-CVAEs (Fig. (b)) and CDV-CVAEs (Fig. (c)) were able to represent a multimodal likelihood distribution, in contrast to vanilla CVAEs (Fig. (a)). This is shown by the significantly larger variety of generated targets per condition.
To measure the variety of the generated targets, we trained a classifier on MNIST/Fashion-MNIST and sampled 10 targets for each condition of the test set. Afterwards, we used the classifier to determine how many different classes were generated per condition. Fig. 20 shows the results for the different models and datasets. Note that we only took sampled targets into account, which could be clearly assigned to a class—especially to avoid treating poor generations as additional classes. In case of both datasets, CMoG- and CDV-CVAEs learned to generate several classes per condition, and thus a one-to-many from a one-to-one mapping. Additionally, CMoG- and CDV-CVAEs achieved a larger variety of generations within the same class (see Fig. 17).
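The counting procedure can be sketched as follows. The input is assumed to be a list of class-probability vectors, one per sampled target, produced by the trained classifier. The confidence threshold `min_confidence` is a hypothetical stand-in for the criterion of being "clearly assigned to a class", whose exact form is not given here.

```python
def count_distinct_classes(class_probabilities, min_confidence=0.9):
    """Number of distinct predicted classes among confidently classified samples."""
    classes = set()
    for probs in class_probabilities:
        best = max(range(len(probs)), key=probs.__getitem__)
        # Skip ambiguous samples so poor generations do not count as extra classes.
        if probs[best] >= min_confidence:
            classes.add(best)
    return len(classes)


def mean_classes_per_condition(probabilities_per_condition, min_confidence=0.9):
    """Average number of distinct classes generated per condition."""
    counts = [
        count_distinct_classes(samples, min_confidence)
        for samples in probabilities_per_condition
    ]
    return sum(counts) / len(counts)
```

In the setting above, each inner list would hold the classifier outputs for the 10 targets sampled for one test condition.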
Based on the above results, we can deduce that CMoG- and CDV-CVAEs have a higher generalisation capacity. The larger variety of the generations is due to the structure of the priors: since they are mixtures of distributions, each target is represented by one or more mixture components. However, as discussed in Sec. 2.4, CMoG priors perform poorly, especially in high-dimensional latent spaces. This becomes evident from the large number of poor generations in Fig. (b). To verify our hypothesis that the poor generations are caused by mixture components of the CMoG prior that focused on outliers during optimisation, we encoded our training data (MNIST) and measured the Euclidean distance to the respective mean of each prior component. Fig. 24 shows the number of nearest neighbours (encoded data points) as a function of the Euclidean distance in the latent space. Each line represents one of the 32 mixture components. In contrast to the CDV prior (Fig. (c)), four mixture components of the CMoG prior (Fig. (a)) have a significantly larger distance to the encoded data. This reinforces the conclusion that these mixture components focused on outliers during the optimisation process. We obtain poor generations as in Fig. (b) if a generated target is based on one of these four components, because the decoder is only optimised (see Eq. 2) to decode samples that lie on the manifold of the encoded training data.
Additionally, we show that the CDV prior outperforms the conditional VampPrior (Fig. (b)), where one mixture component has a significantly larger distance to the encoded data. As discussed in Sec. 2.4, we suspect that this is due to the higher dimension of the pseudo targets, which makes them more complex to optimise than pseudo latent variables.
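The distance analysis can be reproduced in outline with a few lines. The sketch below counts, for one mixture component, how many encoded data points lie within each of a list of radii around the component mean. The function name and the list-based interface are illustrative assumptions.

```python
import bisect
import math


def neighbour_counts(component_mean, encoded_points, radii):
    """Number of encoded points within each radius of a prior component mean."""
    distances = sorted(math.dist(component_mean, point) for point in encoded_points)
    # bisect_right returns how many sorted distances are <= radius.
    return [bisect.bisect_right(distances, radius) for radius in radii]
```

A component on the manifold of the encoded data accumulates neighbours at small radii, whereas a component that has drifted towards outliers shows zero neighbours until the radius becomes large.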
4.3 Generating Grasping Poses
In this experiment we want to assess the generalisation capabilities of CVAE and CDV-CVAE on a real-world dataset. To this end, we use the Cornell Robot Grasping dataset, which consists of 885 conditions ( pixels greyscale images of objects) and 5,110 targets (proposed grasping poses) . The latent spaces of both models are 16-dimensional. For training, we resized the conditions to pixels. Furthermore, we adapted the way how the grasping poses are represented: the rectangles (original representation) were redefined by a centre, a short and long axis, and a rotation angle.
Fig. (a) shows a selection of objects and proposed grasping poses defined by the test dataset. Fig. (b) and Fig. (c) depict grasping poses generated by the CVAE and CDV-CVAE, respectively. As discussed in Sec. 4.1 and 4.2, CDV-CVAEs are better able to model one-to-many mappings and enable a larger variety of generated targets.
To verify whether the CDV-CVAE has actually learned to generate more realistic grasping poses for unknown objects, we apply a similar approach as proposed in . It is based on a discriminator for validating proposed grasping poses. For this purpose, we trained the discriminator in equal parts with samples from joint and marginal empirical distribution and , respectively. Subsequently, we generated 10 grasping poses for each condition in the test set and filtered out those with a discrimination score below 0.99. As a result, of the grasping poses generated by the CDV-CVAE were above this threshold, whereas the CVAE reached . This allows the conclusion that the CDV-CVAE is a useful extension to the CVAE framework.
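A minimal sketch of the two steps, under stated assumptions: the training batch pairs each condition with its own target for the joint samples (label 1) and with an independently drawn target for the marginal samples (label 0), and generated poses are kept when their score is at least the threshold. The function names, the labelling convention, and the sampling with replacement are illustrative choices rather than details taken from the paper.

```python
import random


def make_discriminator_batch(pairs, batch_size, seed=0):
    """Build a batch with equal parts joint (label 1) and marginal (label 0) samples.

    `pairs` is a list of (condition, target) tuples from the training set.
    """
    rng = random.Random(seed)
    half = batch_size // 2
    batch = []
    for _ in range(half):
        condition, target = rng.choice(pairs)  # sample from the joint distribution
        batch.append((condition, target, 1))
    for _ in range(half):
        condition = rng.choice(pairs)[0]       # condition and target are drawn
        target = rng.choice(pairs)[1]          # independently of each other
        batch.append((condition, target, 0))
    return batch


def acceptance_rate(scores, threshold=0.99):
    """Fraction of generated poses whose discriminator score reaches the threshold."""
    if not scores:
        return 0.0
    return sum(score >= threshold for score in scores) / len(scores)
```

Comparing `acceptance_rate` across models then gives the kind of percentage reported above.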
5 Conclusion
In this paper, we have introduced a modified conditional latent variable model to incentivise informative latent representations. To enable the model to capture semantically meaningful features of the data, we have proposed an expressive multimodal prior that, in contrast to a classical Gaussian mixture prior, facilitates a well-trained generative model.
We have shown that our approach increases the generalisation capacity of CVAEs: on modified versions of MNIST and Fashion-MNIST it achieves a significantly larger variety of generated targets, and on the Cornell Robot Grasping dataset it generates more realistic grasping poses. Additionally, we have demonstrated that a straightforward application of CVAEs to structured-prediction problems has difficulty representing multimodal distributions and that our approach overcomes this limitation.
-  Alemi, A.A., Poole, B., Fischer, I., Dillon, J.V., Saurous, R.A., Murphy, K.: Fixing a broken ELBO. ICML (2018)
-  Bishop, C.M., Svensén, M., Williams, C.K.I.: Magnification factors for the SOM and GTM algorithms. Proceedings Workshop on Self-Organizing Maps (1997)
-  Bowman, S.R., Vilnis, L., Vinyals, O., Dai, A.M., Jozefowicz, R., Bengio, S.: Generating sentences from a continuous space. CoNLL (2016)
-  Chen, X., Kingma, D.P., Salimans, T., Duan, Y., Dhariwal, P., Schulman, J., Sutskever, I., Abbeel, P.: Variational Lossy Autoencoder. CoRR (2016)
-  Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., Lerchner, A.: beta-VAE: Learning basic visual concepts with a constrained variational framework. ICLR (2017)
-  Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. CoRR (2013)
-  LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., et al.: Gradient-based learning applied to document recognition. Proceedings of the IEEE (1998)
-  Lenz, I., Lee, H., Saxena, A.: Deep learning for detecting robotic grasps. The International Journal of Robotics Research (2015)
-  Nalisnick, E., Smyth, P.: Stick-breaking variational autoencoders. ICLR (2017)
-  Pinto, L., Gupta, A.: Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. ICRA (2016)
-  Rezende, D.J., Mohamed, S., Wierstra, D.: Stochastic backpropagation and approximate inference in deep generative models. ICML (2014)
-  Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. ICLR (2015)
-  Sohn, K., Lee, H., Yan, X.: Learning structured output representation using deep conditional generative models. NeurIPS (2015)
-  Sønderby, C.K., Raiko, T., Maaløe, L., Sønderby, S.K., Winther, O.: Ladder variational autoencoders. NeurIPS (2016)
-  Tang, Y., Salakhutdinov, R.R.: Learning Stochastic Feedforward Neural Networks. NeurIPS (2013)
-  Tomczak, J., Welling, M.: VAE with a VampPrior. AISTATS (2018)
-  Veres, M., Moussa, M., Taylor, G.W.: Modeling grasp motor imagery through deep conditional generative models. IEEE Robotics and Automation Letters (2017)
-  Walker, J., Doersch, C., Gupta, A., Hebert, M.: An uncertain future: Forecasting from static images using variational autoencoders. ECCV (2016)
-  Xiao, H., Rasul, K., Vollgraf, R.: Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747 (2017)
-  Yan, X., Yang, J., Sohn, K., Lee, H.: Attribute2image: Conditional image generation from visual attributes. ECCV (2016)