Deep generative models offer promising results in generating diverse, realistic samples, such as images, text, motion, and sound, from purely unlabeled data. One successful family of such models is variational autoencoders (VAEs) [17], the stochastic variant of autoencoders, which, thanks to strong and expressive decoders, can generate high-quality samples.
VAEs are a family of generative models that utilize neural networks to learn the distribution of the data. To this end, VAEs first learn to generate a latent variable $z$ given the data $x$, i.e., approximate the posterior distribution $q_\phi(z|x)$, where $\phi$ are the parameters of a neural network, the encoder, whose goal is to model the variation of the data. From this latent random variable $z$, VAEs then generate a sample $x$ by learning $p_\theta(x|z)$, where $\theta$ denotes the parameters of another neural network, the decoder, whose goal is to maximize the log likelihood of the data.

These two networks, i.e., the encoder ($q_\phi(z|x)$) and the decoder ($p_\theta(x|z)$), are trained jointly, using a prior over the latent variable. This prior is usually the standard Normal distribution, $\mathcal{N}(0, I)$. Note that VAEs use a variational approximation of the posterior, i.e., $q_\phi(z|x)$, rather than the true posterior. This enables the model to maximize the variational lower bound of the log likelihood with respect to the parameters $\phi$ and $\theta$, given by

$$\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - KL\big(q_\phi(z|x)\,\|\,p(z)\big), \tag{1}$$

where the second term encodes the KL divergence between the posterior and the prior distributions. In practice, the posterior distribution is approximated by a Gaussian $\mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$, whose parameters are output by the encoder. Note that $\sigma$ is a vector, and we define $\sigma^2$ as a vector whose elements are the squared elements of $\sigma$. To facilitate optimization, the reparameterization trick [17] is used. That is, the latent variable is computed as

$$z = \mu + \sigma \odot \epsilon, \tag{2}$$

where $\epsilon$ is a vector sampled from the standard Normal distribution $\mathcal{N}(0, I)$.

As an extension to VAEs, CVAEs use auxiliary information, i.e., the conditioning variable or observation $c$, to generate the data $x$. In the standard setting, both the encoder and the decoder are conditioned on the conditioning variable $c$. That is, the encoder is denoted as $q_\phi(z|x, c)$ and the decoder as $p_\theta(x|z, c)$. Then, the objective of the model becomes

$$\log p_\theta(x|c) \geq \mathbb{E}_{q_\phi(z|x,c)}\big[\log p_\theta(x|z,c)\big] - KL\big(q_\phi(z|x,c)\,\|\,p(z|c)\big). \tag{3}$$
As illustrated in Fig. 1, in practice, conditioning is typically done by concatenation; the input of the encoder is the concatenation of the data $x$ and the condition $c$, i.e., $[x; c]$, and that of the decoder is the concatenation of the latent variable $z$ and the condition $c$, i.e., $[z; c]$. Thus, the prior distribution is still $p(z)$, and the latent variable is sampled independently of the conditioning one. However, given Eq. 3, one should use $p(z|c)$. Sampling independently of the condition means that it is left to the decoder to combine the information from the latent and conditioning variables to generate a data sample. We see this as a major limitation when using CVAEs in practice, and, to the best of our knowledge, this problem has been neither studied nor addressed in the literature.
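To make the limitation concrete, the following is a minimal sketch of test-time sampling in a concatenation-based CVAE. It is an illustration, not the implementation of any specific model: the function names, the toy condition vector, and the latent dimensionality are all assumptions. The point it shows is that the latent variable is drawn from $\mathcal{N}(0, I)$ without ever looking at the condition, which is only appended afterwards.

```python
import random

def reparameterize(mu, sigma, rng):
    """Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I), element-wise."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def concat_decoder_input(condition, latent_dim, rng):
    """Test-time decoder input [z; c] of a concatenation-based CVAE.

    z is drawn from the standard Normal prior, so it never depends on `condition`.
    """
    z = reparameterize([0.0] * latent_dim, [1.0] * latent_dim, rng)
    return z + list(condition)

rng = random.Random(0)
condition = [0.3, -1.2, 0.7]  # hypothetical condition features
decoder_input = concat_decoder_input(condition, latent_dim=2, rng=rng)
```

Changing `condition` alters only the trailing entries of `decoder_input`; the distribution of the leading latent entries stays the same, so any coupling between the two has to be learned by the decoder.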
In this paper, we introduce an approach to overcome this limitation by explicitly making the sampling of the latent variable depend on the condition. In other words, instead of using $p(z)$ as prior distribution, we truly use $p(z|c)$. This not only respects the theory behind the design of CVAEs, but, as we show empirically, it also leads to samples of higher quality that preserve the context of the conditioning signal. To achieve this, we develop a CVAE architecture that learns a distribution not only of the latent variable but also of the conditioning one. We then use this distribution as a prior over the latent variable, making its sampling explicitly dependent on the condition. As such, we name our method CPP-VAE, for Condition Posterior as Prior.
We empirically show the effectiveness of our approach for problems that are stochastic in nature. In particular, we focus on scenarios where the training dataset is deterministic, i.e., one condition per data sample, and the conditioning signal is strong enough for an expressive decoder to generate a plausible output from it. Thus, the model does not need to see an informative latent variable to generate a sample of high quality. We also show that, by unifying latent variable sampling and conditioning, we can mitigate the posterior collapse problem, a known problem of VAEs with expressive decoders. This is mainly because the decoder no longer receives two separate sources of information, i.e., the latent variable and the condition; the model is thus prevented from identifying the latent variable and then ignoring it.
As a stochastic problem, we evaluate our approach on diverse human motion prediction, that is, forecasting future 3D poses given a sequence of observed ones. In this context, existing methods typically fail to model the stochastic nature of human motion, either because they learn a deterministic mapping from the observations to the output, or because the stochastic latent variables they combine with the observations can be ignored by the model. As an alternative application, we also evaluate our approach on image captioning, i.e., generating diverse and plausible captions describing an image. Our empirical results show that not only does our approach yield a much wider variety of plausible samples than concatenation-based stochastic methods, but it also preserves the semantic information of the condition, such as the type of action performed by the person in motion prediction or visual image elements in captioning, without explicitly exploiting this information.
Remark: Relation to the posterior collapse problem.
Training conditional generative latent variable models is challenging due to posterior collapse, typically occurring in scenarios when the conditioning signal is strong and the decoder is expressive enough to generate a plausible sample given only the condition. This phenomenon of posterior collapse is even more severe when training on a deterministic dataset, i.e., having one sample per condition. Posterior collapse manifests itself by the KL divergence term becoming zero, which means that, regardless of the input, the approximate posterior distribution is equal to the prior distribution. In other words, there is no semantic connection between the encoder and the decoder, and thus the latent variable drawn from the approximate posterior does not convey any useful information to obtain an input-dependent reconstruction. In this case, the decoder generates samples that approximate the mean of the whole training set, minimizing the reconstruction loss. We found that one of the major reasons behind posterior collapse in the case of conditional VAEs with strong conditioning signals and expressive decoders is rooted in the conventional way of conditioning, i.e., through concatenation of the latent variable and the condition. Concatenation allows the decoder to decouple the latent variable from the deterministic condition, thus allowing the decoder to optimize its reconstruction loss given only the condition.
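The symptom described above, a KL term that vanishes regardless of the input, can be checked directly from the encoder outputs. The sketch below assumes a diagonal Gaussian posterior and a standard Normal prior; the function names and the threshold `tol` are illustrative choices, not values taken from our experiments.

```python
import math

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) for a diagonal Gaussian."""
    return 0.5 * sum(m * m + s * s - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))

def is_collapsed(posterior_params, tol=1e-3):
    """Flag posterior collapse when the average KL over inputs falls below `tol`.

    posterior_params: list of (mu, sigma) pairs, one pair per input.
    `tol` is a hypothetical threshold chosen for illustration.
    """
    kls = [kl_to_standard_normal(mu, sigma) for mu, sigma in posterior_params]
    return sum(kls) / len(kls) < tol

# Posterior identical to the prior for every input: the collapsed case.
collapsed = [([0.0, 0.0], [1.0, 1.0])] * 4
# Input-dependent posteriors: the latent variable carries information.
informative = [([1.5, -0.5], [0.3, 0.4]), ([-1.0, 0.8], [0.2, 0.5])]
```

Here `is_collapsed(collapsed)` holds while `is_collapsed(informative)` does not, mirroring the distinction between a posterior that ignores the input and one that encodes it.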
2 Unifying Sampling Latent Variable and Conditioning
In this section, we introduce our approach as a general framework with a new conditioning scheme for CVAEs that is capable of generating diverse and plausible samples where the latent variables are sampled from an appropriate region of the prior distribution. In essence, our framework consists of two autoencoders, one acting on the conditioning signal and the other on the samples we wish to learn the distribution of. The latent representation of the condition then serves as conditioning variable to generate the desired samples.
As discussed above, we are interested in problems that are stochastic in nature; given one condition, multiple plausible and natural samples are likely. However, our training data is insufficiently sampled, in that for any given condition, the dataset contains only a single observed sample, in effect making the data appear deterministic. Moreover, in these cases, the condition provides the core signal to generate a good sample, even in a deterministic model. Therefore, it is highly likely that a CVAE trained for this task learns to ignore the latent variable and rely only on the condition to produce its output (related to the posterior collapse problem in strongly conditioned VAEs). Below, we address this by forcing the sampling of the random latent variable to depend on the conditioning one. By making this dependency explicit, we (1) sample an informative latent variable given the condition, and thus generate a sample of higher quality, and (2) prevent the network from ignoring the latent variable in the presence of a strong condition, thus enabling it to generate diverse outputs.
Note that conditioning the VAE encoder via standard strategies, e.g., concatenation, is perfectly fine, since the two inputs to the encoder are deterministic and useful to compress the sample into the latent space. However, conditioning the VAE decoder requires special care that we focus on below.
2.1 Stochastically Conditioning the Decoder
We propose to make the sampling of the latent variable from the prior/posterior distribution explicitly depend on the condition instead of treating these two variables as independent. To this end, we first learn the distribution of the condition via a simple VAE, which we refer to as CS-VAE because this VAE acts on the conditioning signal. The goal of CS-VAE is to reconstruct the condition given its latent representation. We take the prior of CS-VAE as a standard Normal distribution $\mathcal{N}(0, I)$. Following [17], this allows us to express a sample from the CS-VAE posterior as a transformation of a sample from a Normal distribution via the reparameterization trick. That is, we write

$$z_c = \mu_c + \sigma_c \odot \epsilon, \tag{4}$$

where $\mu_c$ and $\sigma_c$ are the parameters of the posterior distribution generated by the VAE encoder, and $\epsilon \sim \mathcal{N}(0, I)$.

Following the same strategy for the VAE on the data, which we call CPP-VAE, would amount to treating the conditioning and the data latent variables independently, which we want to avoid. Therefore, as illustrated in Fig. 2 (Right), we instead define the CPP-VAE posterior as not directly normally distributed, but conditioned on the posterior of CS-VAE. To this end, we extend the standard reparameterization as

$$z = \mu + \sigma \odot z_c = (\mu + \sigma \odot \mu_c) + (\sigma \odot \sigma_c) \odot \epsilon, \tag{5}$$

where $z_c$ comes from Eq. 4. In fact, $z_c$ in Eq. 4 is a sample from the scaled and translated version of $\mathcal{N}(0, I)$ given $\mu_c$ and $\sigma_c$, and $z$ in Eq. 5 is a sample from the scaled and translated version of $\mathcal{N}(\mu_c, \mathrm{diag}(\sigma_c^2))$ given $\mu$ and $\sigma$. Since we have access to the observations during both training and testing, we always sample $z_c$ from the condition posterior. As $z$ is sampled given $z_c$, one expects the latent variable $z$ to carry information about the strong condition, and thus a sample generated from $z$ to correspond to a plausible sample given the condition. This extended reparameterization trick allows us to avoid conditioning the CPP-VAE decoder by concatenating the latent variable with a deterministic representation of the condition, thus mitigating posterior collapse. However, it changes the variational family of the CPP-VAE posterior. In fact, the posterior is no longer $\mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$, but a Gaussian distribution with mean $\mu + \sigma \odot \mu_c$ and covariance matrix $\mathrm{diag}((\sigma \odot \sigma_c)^2)$. This will be accounted for when designing the KL divergence loss discussed below.
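The change of variational family can be verified numerically. The sketch below assumes the two-step sampling reads $z_c = \mu_c + \sigma_c \odot \epsilon$ followed by $z = \mu + \sigma \odot z_c$, with all operations element-wise; the function names and the toy parameter values are illustrative. It compares the empirical mean and standard deviation of the sampled $z$ with the implied values $\mu + \sigma \odot \mu_c$ and $\sigma \odot \sigma_c$.

```python
import math
import random

def sample_condition_latent(mu_c, sigma_c, rng):
    """CS-VAE sample: z_c = mu_c + sigma_c * eps, eps ~ N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu_c, sigma_c)]

def extended_reparameterize(mu, sigma, z_c):
    """CPP-VAE sample: z = mu + sigma * z_c (element-wise)."""
    return [m + s * zc for m, s, zc in zip(mu, sigma, z_c)]

def implied_posterior(mu, sigma, mu_c, sigma_c):
    """Mean and standard deviation of z implied by the two steps above."""
    mean = [m + s * mc for m, s, mc in zip(mu, sigma, mu_c)]
    std = [s * sc for s, sc in zip(sigma, sigma_c)]
    return mean, std

def column_stats(samples):
    """Per-dimension empirical mean and standard deviation."""
    n = len(samples)
    cols = list(zip(*samples))
    means = [sum(col) / n for col in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / n)
            for col, m in zip(cols, means)]
    return means, stds

rng = random.Random(0)
mu, sigma = [0.5, -1.0], [0.8, 1.5]        # toy CPP-VAE encoder outputs
mu_c, sigma_c = [2.0, 0.1], [0.5, 0.3]     # toy CS-VAE encoder outputs
samples = [extended_reparameterize(mu, sigma,
                                   sample_condition_latent(mu_c, sigma_c, rng))
           for _ in range(20000)]
emp_mean, emp_std = column_stats(samples)
exp_mean, exp_std = implied_posterior(mu, sigma, mu_c, sigma_c)
```

With 20,000 samples, `emp_mean` and `emp_std` agree with `exp_mean` and `exp_std` up to sampling noise, which is why the KL term has to be written for this Gaussian rather than for $\mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$.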
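To learn the parameters of our model, we rely on the availability of a dataset $D = \{(c_i, x_i)\}_{i=1}^{N}$ containing $N$ training samples. Each training sample is a pair of condition and desired sample. For CS-VAE, which learns the distribution of the condition, we define the loss as the KL divergence between its posterior and the standard Gaussian prior, that is,

$$\mathcal{L}_{prior}^{CS\text{-}VAE} = KL\big(\mathcal{N}(\mu_c, \mathrm{diag}(\sigma_c^2))\,\|\,\mathcal{N}(0, I)\big) = -\frac{1}{2}\sum_{j=1}^{d}\big(1 + \log \sigma_{c_j}^2 - \mu_{c_j}^2 - \sigma_{c_j}^2\big). \tag{6}$$

By contrast, for CPP-VAE, we define the loss as the KL divergence between the posterior of CPP-VAE and the posterior of CS-VAE, i.e., of the condition. To this end, we freeze the weights of CS-VAE before computing the KL divergence, since we do not want to move the posterior of the condition but that of the data. The KL divergence is then computed as the divergence between two multivariate Normal distributions, encoded by their mean vectors and covariance matrices, as

$$\mathcal{L}_{prior}^{CPP\text{-}VAE} = KL\big(\mathcal{N}(\mu + \sigma \odot \mu_c, \mathrm{diag}((\sigma \odot \sigma_c)^2))\,\|\,\mathcal{N}(\mu_c, \mathrm{diag}(\sigma_c^2))\big). \tag{7}$$

Let $\Sigma = \mathrm{diag}(\sigma^2)$, $\Sigma_c = \mathrm{diag}(\sigma_c^2)$, $d$ be the dimensionality of the latent space, and $\mathrm{tr}\{\cdot\}$ the trace of a square matrix. The loss in Eq. 7 can then be written as (see Appendix 0.B for more details on the KL divergence between two multivariate Gaussians and the derivation of Eq. 8)

$$\mathcal{L}_{prior}^{CPP\text{-}VAE} = -\frac{1}{2}\Big[\log|\Sigma| + d - \mathrm{tr}\{\Sigma\} - \big(\mu_c - (\mu + \Sigma^{\frac{1}{2}}\mu_c)\big)^{\top}\Sigma_c^{-1}\big(\mu_c - (\mu + \Sigma^{\frac{1}{2}}\mu_c)\big)\Big]. \tag{8}$$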
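As a sanity check of this loss, the sketch below implements both the general KL divergence between two diagonal Gaussians and the simplified form $-\frac{1}{2}[\log|\Sigma| + d - \mathrm{tr}\{\Sigma\} - r^{\top}\Sigma_c^{-1}r]$ with $r = \mu_c - (\mu + \sigma \odot \mu_c)$, and compares them on toy values. It assumes diagonal covariances throughout; the function names and numbers are illustrative and the code is not taken from our implementation.

```python
import math

def kl_diag_gaussians(m1, s1, m2, s2):
    """General KL( N(m1, diag(s1^2)) || N(m2, diag(s2^2)) )."""
    return 0.5 * sum(
        math.log((b * b) / (a * a)) - 1.0 + (a * a) / (b * b)
        + (n - m) ** 2 / (b * b)
        for m, a, n, b in zip(m1, s1, m2, s2))

def cpp_vae_kl(mu, sigma, mu_c, sigma_c):
    """Simplified KL between the CPP-VAE posterior and the CS-VAE posterior.

    Uses S = diag(sigma^2), S_c = diag(sigma_c^2), r = mu_c - (mu + sigma * mu_c).
    """
    d = len(mu)
    log_det = sum(math.log(s * s) for s in sigma)           # log|S|
    trace = sum(s * s for s in sigma)                        # tr{S}
    maha = sum((mc - (m + s * mc)) ** 2 / (sc * sc)          # r^T S_c^{-1} r
               for m, s, mc, sc in zip(mu, sigma, mu_c, sigma_c))
    return -0.5 * (log_det + d - trace - maha)

mu, sigma = [0.2, -0.4], [0.7, 1.3]        # toy CPP-VAE encoder outputs
mu_c, sigma_c = [1.0, -2.0], [0.5, 0.9]    # toy CS-VAE encoder outputs

simplified = cpp_vae_kl(mu, sigma, mu_c, sigma_c)
general = kl_diag_gaussians(
    [m + s * mc for m, s, mc in zip(mu, sigma, mu_c)],   # posterior mean
    [s * sc for s, sc in zip(sigma, sigma_c)],           # posterior std
    mu_c, sigma_c)
```

The two values coincide up to floating-point error, and the loss is zero exactly when $\mu = 0$ and $\sigma = 1$, i.e., when the CPP-VAE posterior equals the condition posterior.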
After computing the loss in Eq. 8, we unfreeze CS-VAE and update it with its previous gradient. Trying to match the posterior of CPP-VAE to that of CS-VAE allows us to effectively use our extended reparameterization trick in Eq. 5. Furthermore, we use the standard reconstruction loss for both CS-VAE and CPP-VAE, minimizing the negative log-likelihood (NLL) or the mean squared error (MSE) of the condition and the corresponding data, depending on the task. We refer to the reconstruction losses as $\mathcal{L}_{rec}^{CS\text{-}VAE}$ and $\mathcal{L}_{rec}^{CPP\text{-}VAE}$ for CS-VAE and CPP-VAE, respectively. Thus, our complete loss is

$$\mathcal{L} = \lambda\big(\mathcal{L}_{prior}^{CS\text{-}VAE} + \mathcal{L}_{prior}^{CPP\text{-}VAE}\big) + \mathcal{L}_{rec}^{CS\text{-}VAE} + \mathcal{L}_{rec}^{CPP\text{-}VAE}. \tag{9}$$

In practice, since our VAE appears within a recurrent model, we weigh the KL divergence terms by a function $\lambda$ corresponding to the KL annealing weight of [5]. We start from $\lambda = 0$, forcing the model to encode as much information in $z$ as possible, and gradually increase it to $\lambda = 1$ during training, following a logistic curve. We then continue training with $\lambda = 1$.
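A logistic annealing schedule of this kind can be sketched as follows. The midpoint and steepness values are hypothetical placeholders, not the settings used in our experiments, and the weighted sum assumes that a single weight multiplies both KL terms while the reconstruction terms are left unweighted.

```python
import math

def kl_weight(step, midpoint=5000, steepness=0.002):
    """Logistic KL-annealing weight, rising from about 0 to about 1.

    `midpoint` and `steepness` are illustrative values only.
    """
    return 1.0 / (1.0 + math.exp(-steepness * (step - midpoint)))

def total_loss(rec_cs, rec_cpp, kl_cs, kl_cpp, step):
    """Reconstruction terms plus annealed KL terms of CS-VAE and CPP-VAE."""
    lam = kl_weight(step)
    return rec_cs + rec_cpp + lam * (kl_cs + kl_cpp)

early = kl_weight(0)        # close to 0: KL terms are almost ignored
middle = kl_weight(5000)    # halfway through the ramp
late = kl_weight(10000)     # close to 1: full KL regularization
```

Early in training the loss is dominated by reconstruction, which lets the encoder place information in the latent variable before the KL terms start to pull the posteriors together.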
In short, our method can be interpreted as a simple yet effective framework (designed for CVAEs) for altering the variational family of the posterior such that (1) a latent variable from this posterior distribution is explicitly sampled given the condition, at both training and inference time, and (2) the model is prevented from posterior collapse, because the KL loss of Eq. 8 maintains a positive mismatch between the two distributions.
In this paper, we mainly focus on stochastic human motion prediction, where given partial observation, the task is to generate diverse and plausible continuations. Additionally, to show that our CPP-VAE generalizes to other domains, we tackle the problem of stochastic image captioning, where given an image representation, the task is to generate diverse yet related captions.
3.1 Diverse Human Motion Prediction
To evaluate the effectiveness of our approach on the task of stochastic human motion prediction, we use the Human3.6M dataset [15], the largest publicly available motion capture (mocap) dataset. Human3.6M comprises more than 800 long indoor motion sequences performed by 11 subjects, leading to 3.6M frames. Each frame contains a person annotated with 3D joint positions and rotation matrices for all 32 joints. In our experiments, for our approach and the replicated VAE-based baselines, we represent each joint in 4D quaternion space. We follow the standard preprocessing and evaluation settings used in [27, 11, 29, 16]. We also evaluate our approach on a real-world dataset, Penn Action [38], which contains 2326 sequences of 15 different actions, where for each person, 13 joints are annotated in 2D space. The results on Penn Action are provided in Appendix 0.F.
We report the estimated upper bound on the reconstruction error as ELBO, along with the KL-divergence on the held-out test set. Additionally, we use quality and diversity [36, 2, 37] metrics (which should be considered together), a context metric, and the training KL at convergence. To measure the diversity of the motions generated by a stochastic model, we make use of the average distance between all pairs of the $K$ motions generated from the same observation. To measure quality, we train a binary classifier to discriminate real (ground-truth) samples from fake (generated) ones. The accuracy of this classifier on the test set is inversely related to the quality of the generated motions. Context is measured by the performance of a good action classifier [22] trained on ground-truth motions. The classifier is then tested on each of the $K$ motions generated from each observation. For $N$ observations and $K$ continuations per observation, the accuracy is measured by computing the argmax over each prediction's probability vector, and we report context as the mean class accuracy on the $K \times N$ motions. The same number $K$ of motions per test observation is used for all metrics. We also provide qualitative results in Appendix 0.L. For all experiments related to motion prediction, we use 16 frames (i.e., 640ms) as observation to generate the next 60 frames (i.e., 2.4sec).
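The diversity metric can be sketched in a few lines. The distance function is not specified above, so the example assumes a Euclidean distance between flattened motions; the function name and the toy inputs are illustrative.

```python
import math
from itertools import combinations

def diversity(motions):
    """Average pairwise distance between motions generated from one observation.

    Each motion is a flat list of floats (all frames and joints concatenated).
    Euclidean distance is an assumption made for this sketch.
    """
    if len(motions) < 2:
        raise ValueError("need at least two motions to measure diversity")
    pairs = list(combinations(motions, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

identical = diversity([[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]])  # collapsed sampler
spread = diversity([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])     # varied samples
```

A model whose samples are all identical scores zero, which is why this number has to be read together with the quality metric: highly diverse but implausible motions would also score well here.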
In Table 1, we compare our approach (with the architecture described in Appendix 0.H) with the state-of-the-art stochastic motion prediction models [35, 2, 33, 4]. Note that one should consider the reported metrics jointly to truly evaluate a stochastic model. For instance, while MT-VAE [35] and HP-GAN [4] generate high-quality motions, these motions are not diverse. Conversely, while Pose-Knows [33] generates diverse motions, they are of low quality. Our approach, by contrast, generates motions that are both of high quality and diverse. This is also the case of Mix-and-Match [2], which, however, preserves much less context. In fact, none of the baselines effectively conveys the context of the observation to the generated motions. As shown in Table 2, the upper bound for the context on Human3.6M is 0.60 (i.e., the performance of the classifier [22] given the ground-truth motions). Our approach yields a context of 0.54 when given only about 20% of the data. Altogether, our approach yields diverse, high-quality and context-preserving predictions. This is further evidenced by the t-SNE [25] plots of Fig. 5, where different samples of various actions are better separated for our approach than for, e.g., MT-VAE [35]. We refer the reader to the human motion prediction related work section in Appendix 0.C for a brief overview of the baselines, and to Appendix 0.D for further discussion of these baselines and deeper insight into their behavior under different evaluation metrics.
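Table 1: Comparison of stochastic motion prediction models on Human3.6M.
| Method | ELBO (KL) | Diversity | Quality | Context | Training KL |
| MT-VAE [35] | 0.51 (0.06) | 0.26 | 0.45 | 0.42 | 0.08 |
| Pose-Knows [33] | 2.08 (N/A) | 1.70 | 0.13 | 0.08 | N/A |
| HP-GAN [4] | 0.61 (N/A) | 0.48 | 0.47 | 0.35 | N/A |
| Mix-and-Match [2] | 0.55 (2.03) | 3.52 | 0.42 | 0.37 | 1.98 |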
[Figure 5: t-SNE plots of generated motions. Panels: MT-VAE, CPP-VAE, CPP-VAE.]
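Table 2: Context metric on Human3.6M.
| Setting | Observation | Future motion | Context |
| Lower bound | GT | Zero velocity | 0.38 |
| Upper bound (GT poses as future motion) | GT | GT | 0.60 |
| Ours (sampled motions as future motion) | GT | Sampled from CPP-VAE | 0.54 |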
Evaluating Sampling Quality.
To further evaluate the sampling quality, we evaluate stochastic baselines using the standard mean angle error (MAE) metric in Euler space. To this end, we use the best of the $K$ generated motions for each observation (a.k.a. S-MSE). A model that generates more diverse motions has a higher chance of generating a motion close to the ground-truth one. As shown in Table 3, this is the case with our approach and Mix-and-Match [2], which both yield higher diversity. However, our approach performs better thanks to its context-preserving latent representation and the higher quality of its generated motions.
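The best-of-$K$ protocol can be sketched as follows. The error function used here, a mean absolute difference between flattened angle vectors, is a simplified stand-in chosen for illustration and is not the exact Euler-space error of the benchmark; all names are hypothetical.

```python
def mean_abs_error(pred, target):
    """Stand-in error: mean absolute difference between two flat angle vectors."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)

def best_of_k_error(samples, ground_truth, error_fn=mean_abs_error):
    """Error of the generated sample that is closest to the ground truth."""
    return min(error_fn(s, ground_truth) for s in samples)

ground_truth = [1.0, 2.0]
samples = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]   # K = 3 toy samples
best = best_of_k_error(samples, ground_truth)
best_first_two = best_of_k_error(samples[:2], ground_truth)
best_first_one = best_of_k_error(samples[:1], ground_truth)
```

Because the minimum can only decrease as samples are added, a sampler that covers more of the plausible futures is favored by this metric, provided its samples remain close to realistic motions.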
In Table 4, we compare our approach with the state-of-the-art deterministic motion prediction models [27, 16, 12, 9, 11] using the MAE metric in Euler space. To have a fair comparison, we generate one motion per observation by setting the latent variable to the distribution mode, i.e., $z = \mu_c$. This allows us to generate a plausible motion without having access to the ground-truth. To compare against the deterministic baselines, we follow the standard setting, and thus use 50 frames (i.e., 2sec) as observation to generate the next 25 frames (i.e., 1sec). Surprisingly, despite having a very simple motion decoder architecture (one-layer GRU network) with a very simple reconstruction loss function (MSE), this motion-from-mode strategy yields results that are competitive with those of the baselines that use sophisticated architectures and advanced loss functions. We argue that learning a good, context-preserving latent representation of human motion is the contributing factor to the success of our approach. This representation could, however, be used in conjunction with sophisticated motion decoders and reconstruction losses, which we leave for future research.
In Appendix 0.E, we study alternative designs to condition the VAE encoder and decoder.
Table 5: Results on the Deterministic-MSCOCO captioning task.
| Model | ELBO (KL) | Perplexity | Quality | Diversity | Context | Training KL |
| Conditional VAE | 2.86 (0.00) | 17.46 | 0.39 | 0.00 | 0.44 | 0.00 |
3.2 Diverse Image Captioning
For the task of conditional text generation, we focus on stochastic image captioning. To demonstrate the effectiveness of our approach, we report results on the MSCOCO [23] captioning task with the original train/test splits of 83K and 41K images, respectively. The MSCOCO dataset has five captions per image. However, we make it deterministic by removing four captions per image, yielding a Deterministic-MSCOCO captioning dataset. Note that the goal of this experiment is not to advance the state of the art in image captioning, but rather to explore the effectiveness of our approach on a different task, where we have a strong conditioning signal and an expressive decoder in the presence of a deterministic dataset.
A brief review of the recent work on diverse text generation is given in Appendix 0.J.
We compare CPP-VAE (with the architecture described in Appendix 0.I) with a standard CVAE and with its autoregressive, non-variational counterpart. (Note that CPP-VAE is agnostic to the choice of data encoder/decoder architecture. Thus, one could use more sophisticated architectures, which we leave for future research.) For quantitative evaluation, we report the ELBO (the negative log-likelihood), along with the KL-divergence and the Perplexity of the reconstructed captions on the held-out test set. We also quantitatively measure the diversity, the quality, and the context of sampled captions.

To measure the context, we rely on the BLEU1 score, making sure that the sampled captions represent elements that appear in the image. For CVAE and CPP-VAE, we compute the average BLEU1 score for $K$ captions sampled per image and report the mean over the $N$ images. To measure the diversity, we measure the BLEU4 score between every pair of sampled captions per image. The smaller the BLEU4 is, the more diverse the captions are. The diversity metric is then 1-BLEU4, i.e., the higher the better. To measure the quality, we use a metric similar to that in our human motion prediction experiments, obtained by training a binary classifier to discriminate real (ground-truth) captions from fake (generated) ones. The accuracy of this classifier on the test set is inversely related to the quality of the generated captions. We expect a good stochastic model to have high quality and high diversity at the same time, while capturing the context of the given image. We provide qualitative examples for all the methods in Appendix 0.M.

As shown in Table 5, a CVAE learns to ignore the latent variable as it can minimize the caption reconstruction loss given solely the image representation. As a result, all the captions generated at test time are identical, despite sampling multiple latent variables. This can be further seen in the ELBO and Perplexity of the reconstructed captions. We expect a model that gets as input the captions and the image to have a much lower reconstruction loss than the autoregressive baseline (which gets only the image as input). This is not the case with CVAE, indicating that the connection between the encoder and the decoder, i.e., the latent variable, does not carry essential information about the input caption. Nevertheless, the quality of the generated samples is reasonably good, as also illustrated in the qualitative evaluations in Appendix 0.M. CPP-VAE, on the other hand, is able to effectively handle this situation by unifying the sampling of the latent variable and the conditioning, leading to diverse but high-quality captions, as reflected by the ELBO of our approach in Table 5 and the qualitative results in Appendix 0.M. Additional quantitative evaluations and ablation studies for image captioning are provided in Appendix 0.K.
In this paper, we have studied the problem of conditionally generating diverse sequences, with a focus on scenarios where the conditioning signal is strong enough for an expressive decoder to generate plausible samples from it alone. In standard CVAEs, the sampling of latent variables is completely independent of the conditioning signal. However, these two variables should be tied together such that the latent variable is sampled given the condition. We have addressed this problem by forcing the sampling of the latent variable to depend on the conditioning one. By making this dependency explicit, the model receives a latent variable that carries information about the condition during both training and test time. This further prevents the network from ignoring the latent variable in the presence of a strong condition, thus enabling it to generate diverse outputs. To demonstrate the effectiveness of our approach, we have investigated two application domains: stochastic human motion prediction and diverse image captioning. In both cases, our CPP-VAE model was able to generate diverse and plausible samples, as well as to retain contextual information, leading to semantically meaningful predictions. In the future, we will apply our approach to other problems that rely on strong conditions, such as image inpainting and super-resolution, for which only deterministic datasets are available.
References
[1] (2018) VIENA: a driving anticipation dataset. In Asian Conference on Computer Vision, pp. 449–466.
[2] (2019) Learning variations in human motion via mix-and-match perturbation. arXiv preprint arXiv:1908.00733.
[3] (2016) Deep action-and context-aware sequence learning for activity recognition and anticipation. arXiv preprint arXiv:1611.05520.
[4] (2018) HP-GAN: probabilistic 3D human motion prediction via GAN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1418–1427.
[5] (2015) Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349.
[6] (2019) Mixture content selection for diverse sequence generation. arXiv preprint arXiv:1909.01953.
[7] (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
[8] (2019) Implicit deep latent variable models for text generation. arXiv preprint arXiv:1908.11527.
[9] (2015) Recurrent network models for human dynamics. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4346–4354.
[10] (2017) Learning human motion models for long-term predictions. In 2017 International Conference on 3D Vision (3DV), pp. 458–466.
[11] (2018) Adversarial geometry-aware human motion prediction. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 786–803.
[12] (2018) Few-shot human motion prediction via meta-learning. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 432–450.
[13] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[14] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
[15] (2014) Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (7), pp. 1325–1339.
[16] (2016) Structural-RNN: deep learning on spatio-temporal graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5308–5317.
[17] (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[18] (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
[19] (2019) VideoFlow: a flow-based generative model for video. arXiv preprint arXiv:1903.01434.
[20] (2018) BiHMP-GAN: bidirectional 3D human motion prediction GAN. arXiv preprint arXiv:1812.02591.
[21] (2019) A surprisingly effective fix for deep latent variable modeling of text. arXiv preprint arXiv:1909.00868.
[22] (2018) Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation. arXiv preprint arXiv:1804.06055.
[23] (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740–755.
[24] (2018) Human motion modeling using DVGANs. arXiv preprint arXiv:1804.10652.
[25] (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605.
[26] (2019) Learning trajectory dependencies for human motion prediction. In ICCV.
[27] (2017) On human motion prediction using recurrent neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4674–4683.
[28] (2019) Modeling human motion with quaternion-based neural networks. arXiv preprint arXiv:1901.07677.
[29] (2018) QuaterNet: a quaternion-based recurrent model for human motion. arXiv preprint arXiv:1805.06485.
[30] (2018) Action anticipation by predicting future dynamic images. In Proceedings of the European Conference on Computer Vision (ECCV).
[31] (2017) Encouraging LSTMs to anticipate actions very early. In The IEEE International Conference on Computer Vision (ICCV).
[32] (2019) Mixture models for diverse machine translation: tricks of the trade. arXiv preprint arXiv:1902.07816.
[33] (2017) The pose knows: video forecasting by generating pose futures. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 3352–3361.
[34] (1989) A learning algorithm for continually running fully recurrent neural networks. Neural Computation 1 (2), pp. 270–280.
[35] (2018) MT-VAE: learning motion transformations to generate multimodal human dynamics. In European Conference on Computer Vision, pp. 276–293.
[36] (2019) Diversity-sensitive conditional generative adversarial networks. In International Conference on Learning Representations.
[37] (2019) Diverse trajectory forecasting with determinantal point processes. arXiv preprint arXiv:1907.04967.
[38] (2013) From actemes to action: a strongly-supervised representation for detailed action understanding. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2248–2255.
Appendix 0.A Detailed Technical Background on Evidence Lower Bound
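To solve the maximum likelihood problem, we would like to have $p_\theta(x|z)$ and $p_\theta(z|x)$. Using Variational Inference, we aim to approximate the true posterior $p_\theta(z|x)$ with another distribution $q_\phi(z|x)$. This distribution is computed via another neural network parameterized by $\phi$ (called variational parameters), such that $q_\phi(z|x) \approx p_\theta(z|x)$. Using such an approximation, Variational Autoencoders, or VAEs in short, are able to optimize the marginal likelihood in a tractable way. The optimization objective of VAEs is a variational lower bound, also known as the evidence lower bound, or ELBO in short. Recall that variational inference aims to find an approximation of the posterior that represents the true one. One way to do this is to minimize the divergence between the approximate and the true posterior using the Kullback-Leibler divergence, or KL divergence in short. That is,
$$\mathrm{KL}\big(q_\phi(z|x) \,\|\, p_\theta(z|x)\big) = -\sum_{z} q_\phi(z|x) \log \frac{p_\theta(z|x)}{q_\phi(z|x)}.$$
This can be seen as an expectation,
$$\mathrm{KL}\big(q_\phi(z|x) \,\|\, p_\theta(z|x)\big) = \mathbb{E}_{q_\phi(z|x)}\big[\log q_\phi(z|x) - \log p_\theta(z|x)\big].$$
The second term above, i.e., the true posterior, can, according to Bayes' theorem, be written as $p_\theta(z|x) = p_\theta(x|z)\,p(z) / p_\theta(x)$. The data distribution $p_\theta(x)$ is independent of the latent variable $z$, and can thus be pulled out of the expectation term,
$$\mathrm{KL}\big(q_\phi(z|x) \,\|\, p_\theta(z|x)\big) = \mathbb{E}_{q_\phi(z|x)}\big[\log q_\phi(z|x) - \log p_\theta(x|z) - \log p(z)\big] + \log p_\theta(x).$$
By rearranging the terms of the above equation, we can write,
$$\log p_\theta(x) - \mathrm{KL}\big(q_\phi(z|x) \,\|\, p_\theta(z|x)\big) = \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - \mathbb{E}_{q_\phi(z|x)}\big[\log q_\phi(z|x) - \log p(z)\big].$$
The second expectation term in the above equation is, by definition, the KL divergence between the approximate posterior and the prior distributions. Thus, this can be written as
$$\log p_\theta(x) - \mathrm{KL}\big(q_\phi(z|x) \,\|\, p_\theta(z|x)\big) = \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - \mathrm{KL}\big(q_\phi(z|x) \,\|\, p(z)\big).$$
In the above equation, $\log p_\theta(x)$ is the log-likelihood of the data, which we would like to optimize. $\mathrm{KL}(q_\phi(z|x) \,\|\, p_\theta(z|x))$ is the KL divergence between the approximate and the true posterior distributions; while it is not computable, we know from its definition that it is non-negative. $\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$ is the reconstruction term, and $\mathrm{KL}(q_\phi(z|x) \,\|\, p(z))$ is the KL divergence between the approximate posterior distribution and a prior over the latent variable. The last term can be seen as a regularizer of the latent representation. Therefore, the intractability and non-negativity of $\mathrm{KL}(q_\phi(z|x) \,\|\, p_\theta(z|x))$ only allow us to optimize a lower bound of the log-likelihood of the data,
$$\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - \mathrm{KL}\big(q_\phi(z|x) \,\|\, p(z)\big),$$
which we call the variational or evidence lower bound (ELBO).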
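The decomposition of the log-likelihood into the ELBO and a non-negative KL gap can be checked numerically on a toy problem. The following is a minimal sketch, assuming a discrete latent variable with three states so that the true posterior is tractable; the distributions `prior`, `likelihood`, and `q` are made-up numbers chosen only for illustration and are not taken from our experiments.

```python
import math

# Toy model with a discrete latent z in {0, 1, 2}; all numbers are made up.
prior = [0.5, 0.3, 0.2]          # p(z)
likelihood = [0.10, 0.40, 0.70]  # p(x | z) for one fixed observation x
q = [0.2, 0.3, 0.5]              # approximate posterior q(z | x)

# Exact marginal likelihood and true posterior (tractable only because z is tiny).
p_x = sum(pz * px_z for pz, px_z in zip(prior, likelihood))
posterior = [pz * px_z / p_x for pz, px_z in zip(prior, likelihood)]


def kl(a, b):
    """KL(a || b) for two discrete distributions given as lists."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)


# ELBO = E_q[log p(x|z)] - KL(q || prior)
reconstruction = sum(qz * math.log(px_z) for qz, px_z in zip(q, likelihood))
elbo = reconstruction - kl(q, prior)

# Gap between the log-likelihood and the ELBO: KL(q || true posterior)
gap = kl(q, posterior)

print(f"log p(x)      = {math.log(p_x):.6f}")
print(f"ELBO + gap    = {elbo + gap:.6f}")
print(f"ELBO          = {elbo:.6f}")
```

The first two printed values coincide, and the ELBO is never larger than the log-likelihood, since the gap is a KL divergence. In a VAE the gap cannot be evaluated because the true posterior is unavailable, which is why only the bound is optimized.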
Appendix 0.B KL Divergence Between Two Gaussian Distributions
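In our approach, the model encourages the posterior of CPP-VAE to be close to the one of the CS-VAE. In general, the KL divergence between two distributions $p$ and $q$ is defined as
$$\mathrm{KL}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx.$$
In a general case, one can have a multivariate Gaussian distribution $\mathcal{N}(\mu, \Sigma)$ in $\mathbb{R}^d$, where $\mu$ and $\Sigma$ are predicted by the encoder network of the VAE. The density function of such a distribution is
$$p(x) = \frac{1}{\sqrt{(2\pi)^d \, |\Sigma|}} \exp\Big(-\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu)\Big).$$
Thus, the KL divergence between two multivariate Gaussians is computed as
$$\mathrm{KL}(p \,\|\, q) = \frac{1}{2} \Big[ \log \frac{|\Sigma_2|}{|\Sigma_1|} - d + \mathrm{tr}\big(\Sigma_2^{-1} \Sigma_1\big) + (\mu_2 - \mu_1)^\top \Sigma_2^{-1} (\mu_2 - \mu_1) \Big],$$
where $\mathrm{tr}(\cdot)$ is the trace operation. In this expression, the covariance matrix $\Sigma_1$ and mean $\mu_1$ correspond to distribution $p$, and the covariance matrix $\Sigma_2$ and mean $\mu_2$ correspond to distribution $q$.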
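Given the KL divergence between two multivariate Gaussians, we can then compute the KL divergence between the posterior of the CPP-VAE, with mean $\mu + \sigma \odot \mu_c$ and covariance matrix $\Sigma_c \Sigma$, and the posterior distribution of the CS-VAE, with mean $\mu_c$ and covariance matrix $\Sigma_c$. Here, $\mu$ and $\sigma$ are output by the CPP-VAE's encoder and $\mu_c$ and $\sigma_c$ by the CS-VAE's encoder. Let $\Sigma = \mathrm{diag}(\sigma^2)$, $\Sigma_c = \mathrm{diag}(\sigma_c^2)$, $d$ the dimensionality of the latent space, and $\mathrm{tr}\{\cdot\}$ the trace of a square matrix. The KL term of the loss in Eq. (7) can then be written as (see the general expression for two multivariate Gaussians earlier in this appendix)
$$\frac{1}{2} \Big[ \log \frac{|\Sigma_c|}{|\Sigma_c \Sigma|} - d + \mathrm{tr}\{\Sigma_c^{-1} \Sigma_c \Sigma\} + \big(\mu_c - (\mu + \sigma \odot \mu_c)\big)^\top \Sigma_c^{-1} \big(\mu_c - (\mu + \sigma \odot \mu_c)\big) \Big].$$
Since $|\Sigma_c \Sigma| = |\Sigma_c| \, |\Sigma|$, $|\Sigma_c|$ will be cancelled out in the $\log$ term, which yields
$$\frac{1}{2} \Big[ -\log |\Sigma| - d + \mathrm{tr}\{\Sigma\} + \big(\mu_c - (\mu + \sigma \odot \mu_c)\big)^\top \Sigma_c^{-1} \big(\mu_c - (\mu + \sigma \odot \mu_c)\big) \Big].$$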
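For diagonal covariance matrices, the KL divergence between two Gaussians reduces to sums over the latent dimensions. The following is a minimal sketch of this computation; the function name `kl_diag_gaussians` and all numerical values are hypothetical. The last part additionally assumes that the first distribution has a mean of the form $\mu + \sigma \odot \mu_c$ and an element-wise product of standard deviations $\sigma \odot \sigma_c$, which is our reading of the posterior-as-prior setting, and checks that the condition's covariance then drops out of the log-determinant and trace terms.

```python
import math


def kl_diag_gaussians(mu1, sigma1, mu2, sigma2):
    """KL( N(mu1, diag(sigma1^2)) || N(mu2, diag(sigma2^2)) ) for lists of floats."""
    d = len(mu1)
    log_det_ratio = sum(math.log(s2 ** 2) - math.log(s1 ** 2)
                        for s1, s2 in zip(sigma1, sigma2))
    trace = sum((s1 / s2) ** 2 for s1, s2 in zip(sigma1, sigma2))
    mahalanobis = sum(((m2 - m1) / s2) ** 2
                      for m1, m2, s2 in zip(mu1, mu2, sigma2))
    return 0.5 * (log_det_ratio - d + trace + mahalanobis)


# Hypothetical encoder outputs (2-D latent space for readability).
mu, sigma = [0.1, -0.2], [0.5, 1.5]        # assumed CPP-VAE encoder outputs
mu_c, sigma_c = [1.0, 0.3], [0.8, 2.0]     # assumed CS-VAE encoder outputs

# Sanity check: KL to a standard Normal matches the usual VAE closed form.
kl_std = kl_diag_gaussians(mu, sigma, [0.0, 0.0], [1.0, 1.0])
kl_std_closed = 0.5 * sum(m * m + s * s - 1.0 - math.log(s * s)
                          for m, s in zip(mu, sigma))

# Assumed posterior-as-prior case: first Gaussian has product covariance.
mu_pp = [m + s * mc for m, s, mc in zip(mu, sigma, mu_c)]
sigma_pp = [s * sc for s, sc in zip(sigma, sigma_c)]
kl_full = kl_diag_gaussians(mu_pp, sigma_pp, mu_c, sigma_c)

# Simplified form: sigma_c only survives in the Mahalanobis term.
mahalanobis = sum(((mc - mp) / sc) ** 2
                  for mc, mp, sc in zip(mu_c, mu_pp, sigma_c))
kl_simplified = 0.5 * (sum(-math.log(s * s) - 1.0 + s * s for s in sigma)
                       + mahalanobis)

print(abs(kl_std - kl_std_closed) < 1e-12, abs(kl_full - kl_simplified) < 1e-12)
```

Working with standard deviations rather than full covariance matrices keeps the sketch free of matrix inversions, which is sufficient here because the encoders only predict per-dimension means and standard deviations.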
Appendix 0.C Stochastic Human Motion Prediction Related Work
Most motion prediction methods are based on deterministic models [29, 28, 11, 16, 27, 12, 9, 10], casting motion prediction as a regression task where only one outcome is possible given the observation. While this may produce accurate predictions, it fails to reflect the stochastic nature of human motion, where multiple futures can be highly likely for a single given series of observations. Modeling stochasticity is the topic of this paper, and we therefore focus the discussion below on the other methods that have attempted to do so.
The general trend to incorporate variations in the predicted motions consists of combining information about the observed pose sequence with a random vector. In this context, two types of approaches have been studied: techniques that directly incorporate the random vector into the RNN decoder, and those that make use of an additional CVAE.
In the first class of methods, one approach samples a random vector at each time step and adds it to the pose input of the RNN decoder. By relying on different random vectors at each time step, however, this strategy is prone to generating discontinuous motions. To overcome this, another approach makes use of a single random vector to generate the entire sequence. This vector is both employed to alter the initialization of the decoder and concatenated with a pose embedding at each iteration of the RNN. By relying on concatenation, these two methods contain parameters that are specific to the random vector, and thus give the model the flexibility to ignore this information. In HP-GAN, instead of using concatenation, the random vector is added to the hidden state produced by the RNN encoder. While addition prevents having parameters that are specific to the random vector, this vector is first transformed by multiplication with a learnable parameter matrix, and thus can again be zeroed out so as to remove the source of diversity, as observed in our experiments.
The second category of stochastic methods introduces an additional CVAE between the RNN encoder and decoder. This allows them to learn a more meaningful transformation of the noise, combined with the conditioning variables, before passing the resulting information to the RNN decoder. In this context, Pose-Knows proposes to directly use the pose as conditioning variable. As will be shown in our experiments, while this approach is able to maintain some degree of diversity, albeit less than ours, it yields motions of lower quality because of its use of independent random vectors at each time step.
Instead of perturbing the pose, the more recent MT-VAE uses the RNN decoder hidden state as conditioning variable in the CVAE, concatenating it with the random vector. While this approach generates high-quality motions, it suffers from the fact that the CVAE decoder gives the model the flexibility to ignore the random vector, which therefore yields low-diversity outputs. Similar to MT-VAE, Mix-and-Match perturbs the hidden states, but replaces the deterministic concatenation operation with a stochastic perturbation of the hidden state with the noise. Through such a perturbation, the decoder is not able to decouple the noise from the condition, which is what happens with concatenation. However, since the perturbation is not learned and is a non-parametric operation, the quality of the generated motions is comparably low.
Generating diverse plausible motions given limited observations has many applications, especially when the motions are generated in an action-agnostic manner, as done in our work. For instance, our model can be used for human action forecasting [3, 30, 31, 1], where one seeks to anticipate the action as early as possible and where human motion/poses are one of the modalities utilized.
Appendix 0.D Further Discussion on the Performance of Stochastic Baselines
The MT-VAE model tends to ignore the random variable $z$, thus ignoring the root of variation. As a consequence, it achieves a low diversity, much lower than ours, but produces samples of high quality, albeit almost identical (see the qualitative comparison of different baselines in the appendix). To further confirm that the MT-VAE ignores the latent variable, we performed an additional experiment where, at test time, we sampled each element of the random vector independently from a distribution other than the prior $\mathcal{N}(0, I)$. This led to neither loss of quality nor increase of diversity of the generated motions. Experiments on the HP-GAN model evidence the limited diversity of the sampled motions despite its use of random noise during inference. Note that the authors of HP-GAN mentioned in their paper that the random noise was added to the hidden state. Only by studying their publicly available code (https://github.com/ebarsoum/hpgan) did we understand the precise way this combination was done. In fact, the addition relies on a parametric, linear transformation of the noise vector. That is, the perturbed hidden state is obtained as
$$\tilde{h} = h + W z,$$
where $h$ is the original hidden state, $z$ the noise vector, and $W$ a learnable parameter matrix.
Because the parameters of this transformation are learned, the model has the flexibility to ignore the noise, which leads to the low diversity of sampled motions. Note that the authors of HP-GAN acknowledged that, despite their best efforts, they noticed very little variation between predictions obtained with different noise values. Since the perturbation is ignored, however, the quality of the generated motions is high.
The other baseline, Pose-Knows, produces motions with higher diversity than the aforementioned two baselines, but of much lower quality. The main reason behind this is that the random vectors that are concatenated to the poses at each time-step are sampled independently of each other, which translates to discontinuities in the generated motions. This problem might be mitigated by sampling the noise in a time-dependent, autoregressive manner, as has been done for video generation. Doing so, however, goes beyond the scope of our analysis.
The Mix-and-Match approach yields sampled motions with higher diversity and reasonable quality. The architecture of Mix-and-Match is very close to that of MT-VAE, but replaces the deterministic concatenation operation with a stochastic perturbation of the hidden state with the noise. Through such a perturbation, the decoder is not able to decouple the noise from the condition, which is what happens with concatenation. However, since the perturbation is not learned and is a non-parametric operation, the quality of the generated motions is lower than ours and that of the other baselines (except for Pose-Knows). We see the Mix-and-Match perturbation as a workaround to the posterior collapse problem that sacrifices the quality and the context of the sampled motions. We also provide a more complete discussion of related work on diverse human motion prediction in Appendix 0.C.
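The effect of a learned linear transformation of the noise can be illustrated with a small example. The following is a minimal sketch, assuming the perturbation takes the form of the hidden state plus a matrix-vector product with the noise; the function `perturb`, the matrices `W_active` and `W_collapsed`, and all values are hypothetical and are not taken from the HP-GAN code.

```python
import random


def perturb(h, W, z):
    """Return h + W z, where W is a list of rows (len(h) rows, len(z) columns)."""
    return [hi + sum(w * zj for w, zj in zip(row, z)) for hi, row in zip(h, W)]


random.seed(0)
h = [0.3, -0.1, 0.8]                            # hidden state from the encoder
z_a = [random.gauss(0.0, 1.0) for _ in range(2)]  # two different noise samples
z_b = [random.gauss(0.0, 1.0) for _ in range(2)]

W_active = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.2]]  # noise influences the state
W_collapsed = [[0.0, 0.0] for _ in h]              # weights driven to zero

# With non-zero weights, different noise gives different hidden states.
print(perturb(h, W_active, z_a) != perturb(h, W_active, z_b))
# With zeroed weights, the noise is ignored and every sample is identical.
print(perturb(h, W_collapsed, z_a) == perturb(h, W_collapsed, z_b))
```

If training drives the matrix towards zero, every noise sample produces the same hidden state, so the decoder generates the same motion regardless of the sampled noise.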
Appendix 0.E Ablation Study on Different Means of Conditioning
In addition to the experiments in the main paper, we also study various designs to condition the VAE encoder and decoder. As discussed before, conditioning the VAE encoder can be safely done by concatenating two deterministic sources of information, i.e., the representations of the past and the future, since both sources are useful to compress the future motion into the latent space. In Table 6, we use either a deterministic representation of the observation or a stochastic one as the conditioning variable for the encoder. Similarly, for the decoder, we compare the use of either of these variables via concatenation with that of our modified reparameterization trick (Eq. 5). This shows that, to condition the decoder, reparameterization is highly effective at addressing posterior collapse. Furthermore, for the encoder, a deterministic condition works better than a stochastic one. When both the encoder and the decoder are conditioned via deterministic conditioning variables, i.e., row 2 in Table 6, the model learns to ignore the latent variable and rely solely on the condition, as evidenced by the KL term tending to zero.
| Encoder Conditioning | Decoder Conditioning | CPP-VAE's Training KL |
| Concatenation () | Reparameterization () | 6.92 |
| Concatenation () | Concatenation () | 0.04 |
| Concatenation () | Concatenation () | 0.61 |
| Concatenation () | Reparameterization () | 8.07 |
Appendix 0.F Experimental Results on Penn Action Dataset
As a complementary experiment, we evaluate our approach on the Penn Action dataset, which contains 2326 sequences of 15 different actions, where, for each person, 13 joints are annotated in 2D space. Most sequences have fewer than 50 frames, and the task is to generate the next 35 frames given the first 15. Results are provided in Table 7. Note that the upper bound for the Context metric is 0.74, i.e., the classification performance given the Penn Action ground-truth motions.
| Method | ELBO (KL) | Diversity | Quality | Context | Training KL |
| Autoregressive Counterpart | 0.048 (N/A) | 0.00 | 0.46 | 0.51 | N/A |
Appendix 0.G Pseudo-code for CPP-VAE
Here, we provide forward-pass pseudo-code for both CS-VAE and CPP-VAE.
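The two forward passes can be sketched in a few lines of Python. The following is a minimal, hedged illustration rather than our implementation: it uses toy layer sizes, random weights, and single linear layers in place of the multi-layer decoders, and it takes the hidden states of the motion encoders as given vectors instead of running GRUs. It further assumes that the extended reparameterization replaces the standard Normal noise with the latent variable sampled by the CS-VAE, i.e., that the CPP-VAE latent is its mean plus its standard deviation times the CS-VAE sample. All function and class names are hypothetical.

```python
import math
import random

random.seed(0)


def linear(n_in, n_out):
    """Create a randomly initialized fully connected layer as (weights, bias)."""
    W = [[random.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


def dense(layer, x):
    """Apply a fully connected layer to the vector x."""
    W, b = layer
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(v):
    return [max(0.0, a) for a in v]


def tanh(v):
    return [math.tanh(a) for a in v]


class GaussianEncoder:
    """Embedding followed by two branches: means and ReLU-activated std devs."""

    def __init__(self, n_in, n_embed, n_latent):
        self.embed = linear(n_in, n_embed)
        self.mean = linear(n_embed, n_latent)
        self.std = linear(n_embed, n_latent)

    def __call__(self, x):
        e = relu(dense(self.embed, x))
        return dense(self.mean, e), relu(dense(self.std, e))


def cs_vae_forward(enc, dec, h_obs):
    """CS-VAE: sample z_c with the standard reparameterization trick, then decode."""
    mu_c, sigma_c = enc(h_obs)
    eps = [random.gauss(0.0, 1.0) for _ in mu_c]
    z_c = [m + s * e for m, s, e in zip(mu_c, sigma_c, eps)]
    return z_c, tanh(dense(dec, z_c))


def cpp_vae_forward(enc, dec, h_future, h_obs, z_c):
    """CPP-VAE: encoder sees [h_future, h_obs]; z_c replaces the noise (assumed)."""
    mu, sigma = enc(h_future + h_obs)  # list concatenation
    z = [m + s * zc for m, s, zc in zip(mu, sigma, z_c)]
    return z, tanh(dense(dec, z))


# Toy sizes (hypothetical stand-ins for the hidden-state and latent sizes).
H, E, Z = 8, 6, 4
cs_enc, cs_dec = GaussianEncoder(H, E, Z), linear(Z, H)
cpp_enc, cpp_dec = GaussianEncoder(2 * H, E, Z), linear(Z, H)

h_obs = [random.uniform(-1.0, 1.0) for _ in range(H)]
h_future = [random.uniform(-1.0, 1.0) for _ in range(H)]

z_c, h_obs_rec = cs_vae_forward(cs_enc, cs_dec, h_obs)
z, h_future_rec = cpp_vae_forward(cpp_enc, cpp_dec, h_future, h_obs, z_c)
print(len(z_c), len(z), len(h_obs_rec), len(h_future_rec))  # 4 4 8 8
```

In this sketch the condition enters the CPP-VAE decoder only through the sampled latent variable, so the decoder has no separate deterministic path to the condition that would let it ignore the latent.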
Appendix 0.H Stochastic Human Motion Prediction Architecture
Our motion prediction model follows the architecture depicted in Fig. 2 (a). Below, we describe the architecture of each component in our model. Note that human poses, consisting of 32 joints in the case of the Human3.6M dataset, are represented in 4D quaternion space. Thus, each pose at each time-step is represented with a vector of size $32 \times 4 = 128$. All the tensor sizes described below ignore the mini-batch dimension for simplicity.
Observed motion encoder, or the CS-VAE's motion encoder, is a single-layer GRU network with 1024 hidden units. Given an observation sequence, the observed motion encoder maps the sequence of observed poses into a single hidden representation of size 1024, i.e., the hidden state of the last time-step. This hidden state acts as the condition to the CPP-VAE's encoder and the direct input to the CS-VAE's encoder.
CS-VAE, similar to any variational autoencoder, has an encoder and a decoder. The CS-VAE's encoder is a fully-connected network with ReLU non-linearities, mapping the hidden state of the motion encoder to an embedding. Then, to generate the mean and standard deviation vectors, two fully connected branches are considered. These map the embedding to a vector of means of size 128 and a vector of standard deviations of size 128, where 128 is the length of the latent variable. Note that we apply a ReLU non-linearity to the vector of standard deviations to make sure it is non-negative. We then use the reparameterization trick to sample a latent variable of size 128. The CS-VAE's decoder consists of multiple fully-connected layers, mapping the latent variable to a variable of size 1024, acting as the initial hidden state of the observed motion decoder. Note that we apply a Tanh non-linearity to the generated hidden state to mimic the properties of a GRU hidden state.
Observed motion decoder, or the CS-VAE's motion decoder, is similar to its motion encoder, except for the fact that it reconstructs the motion autoregressively. Additionally, it is initialized with the reconstructed hidden state, i.e., the output of the CS-VAE's decoder. The output of each GRU cell at each time-step is then fed to a fully-connected layer, mapping the GRU output to a vector of size 128, which represents a human pose with 32 joints in 4D quaternion space. To decode the motions, we use a teacher forcing technique during training. At each time-step, the network chooses, with a probability $p$ of using the ground truth, whether to use its own output at the previous time-step or the ground-truth pose as input. Starting from its initial value, we decrease $p$ linearly at each training epoch such that, after a certain number of epochs, the model becomes completely autoregressive, i.e., uses only its own output as input to the next time-step. Note that, at test time, motions are generated completely autoregressively, i.e., with $p = 0$.
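The teacher forcing schedule described above can be sketched as follows. This is a minimal illustration under stated assumptions: the initial probability `p_init=1.0` and the number of decay epochs `decay_epochs=10` are hypothetical values, the decoder is replaced by a toy step function, and the recurrent hidden state that a GRU would carry between steps is omitted.

```python
import random


def teacher_forcing_prob(epoch, p_init=1.0, decay_epochs=10):
    """Linearly decay the probability of feeding the ground truth; 0 = autoregressive."""
    return max(0.0, p_init * (1.0 - epoch / decay_epochs))


def decode(step_fn, first_input, ground_truth, p_teacher, rng=random):
    """Unroll a decoder for len(ground_truth) steps with scheduled teacher forcing."""
    outputs, x = [], first_input
    for target in ground_truth:
        y = step_fn(x)
        outputs.append(y)
        # Next input: ground truth with probability p_teacher, else the model's output.
        x = target if rng.random() < p_teacher else y
    return outputs


def toy_step(x):
    """Toy stand-in for one decoder step, so the effect of the input is visible."""
    return x + 1


truth = [10, 20, 30, 40]
print([teacher_forcing_prob(e) for e in (0, 5, 10, 15)])  # [1.0, 0.5, 0.0, 0.0]
print(decode(toy_step, 0, truth, p_teacher=1.0))  # [1, 11, 21, 31]
print(decode(toy_step, 0, truth, p_teacher=0.0))  # [1, 2, 3, 4]
```

With a probability of one, every step is fed the ground truth, whereas with a probability of zero the decoder consumes only its own outputs, which corresponds to the test-time behaviour.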
Note that the future motion encoder and decoder have exactly the same architectures as the observed motion ones. The only difference is their input, where the future motion is represented by the poses that follow the observed ones in a sequence. In the following, we describe the architecture of CPP-VAE for motion prediction.
CPP-VAE is a conditional variational autoencoder. Its encoder's input is a representation of the future motion, i.e., the last hidden state of the future motion encoder, conditioned on the hidden state of the observed motion encoder. The conditioning is done by concatenation; thus, the input to the encoder is a representation of size $2 \times 1024 = 2048$. The CPP-VAE's encoder, similar to the CS-VAE's encoder, maps its input representation to an embedding. Then, to generate the mean and standard deviation vectors, two fully connected branches are considered, mapping the embedding to a vector of means of size 128 and a vector of standard deviations of size 128, where 128 is the length of the latent variable. Note that we apply a ReLU non-linearity to the vector of standard deviations to make sure it is non-negative. To sample the latent variable, we use our extended reparameterization trick, explained in Eq. 5. This unifies the conditioning and sampling of the latent variable. Then, similar to CS-VAE, the latent variable is fed to the CPP-VAE's decoder, which is a fully connected network that maps the latent representation of size 128 to a reconstructed hidden state of size 1024 for the future motion. Note that we apply a Tanh non-linearity to the generated hidden state to mimic the properties of a GRU hidden state.
Appendix 0.I Diverse Image Captioning Architecture
Our diverse image captioning model follows the architecture depicted in Fig. 2 (a). Below, we describe the architecture of each component in our model. Note, all tensor sizes described below ignore the mini-batch dimension for simplicity.
Image encoder is, here, ResNet152 pretrained on ImageNet. Given the encoder, the conditioning signal is a feature representation. Note that, to avoid an undesirable equilibrium in the reconstruction loss of the CS-VAE, we freeze ResNet152 during training.
CS-VAE is a standard variational autoencoder. The encoder of the CS-VAE maps the input image representation to an embedded representation. Then, to generate the mean and standard deviation vectors, two fully connected branches are considered, mapping the embedding to a vector of means of size 256 and a vector of standard deviations of size 256, where 256 is the length of the latent variable. The decoder of the CS-VAE maps the sampled latent variable of size 256 to a representation of the same size as the input. The generated representation acts as a reconstructed image representation. During training, we learn the reconstruction by computing the smoothed $\ell_1$ loss between the generated representation and the image feature (of the frozen ResNet152).
Caption encoder is a single-layer GRU network with a hidden size of 1024. Each word in the caption is represented through a randomly initialized embedding layer that maps each word to a vector representation. The caption encoder gets a caption as input and generates a hidden representation of size 1024.
CPP-VAE is a conditional variational autoencoder. As the input to its encoder, we first concatenate the image representation to the caption representation of size 1024. The encoder then maps this representation to an embedded representation. Then, to generate the mean and standard deviation vectors, two fully connected branches are considered, mapping the embedding to a vector of means of size 256 and a vector of standard deviations of size 256, where 256 is the length of the latent variable. To sample the latent variable, we make use of our extended reparameterization trick, explained in Eq. 5. This unifies the conditioning and sampling of the latent variable. The CPP-VAE's decoder then maps this latent representation to a vector through a few fully-connected layers. We then apply batch normalization to this representation, which then acts as the first token to the caption decoder.
Caption decoder is also a single-layer GRU network with a hidden size of 1024. Its first token is the representation generated by the CPP-VAE's decoder, while the rest of the tokens are represented by the words in the corresponding caption. To decode the caption, we use a teacher forcing technique during training. At each time-step, the network chooses, with a probability $p$ of using the ground truth, whether to use its own output at the previous time-step or the ground-truth token as input. Starting from its initial value, we decrease $p$ linearly at each training epoch such that, after a certain number of epochs, the model becomes completely autoregressive, i.e., uses only its own output as input to the next time-step. Note that, at test time, captions are generated completely autoregressively, i.e., with $p = 0$.
Appendix 0.J Diverse Text Generation Related Work
There are a number of studies which utilize generative models for language modeling. For instance, one line of work uses VAEs and LSTMs in an unconditional language modeling problem, where posterior collapse may occur if the VAE is not trained well. To handle the problem of posterior collapse in language modeling, other authors try to directly match the aggregated posterior to the prior. They argue that this can be considered an extension of variational autoencoders with a regularization that maximizes mutual information, thereby addressing the posterior collapse issue. VAEs have also been used for language modeling in a further study, which observed that, for language modeling with VAEs, it is hard to find a good balance between language modeling and representation learning. To improve the training of VAEs in such scenarios, its authors first pretrain the inference network in an autoencoder fashion such that the inference network learns a good representation of the data in a deterministic manner. Then, they train the whole VAE while considering a weight for the KL term during training. However, the second step modifies the way VAEs optimize the variational lower bound. The proposed technique also prevents the model from being trained end-to-end.
Unlike these approaches, our method considers the case of conditional sequence (text) generation where the conditioning signal (the image to be captioned in our case) is strong enough such that the caption generator can rely solely on that.
A recent work proposes to separate the diversification from the generation when it comes to sequence generation and language modeling. The diversification stage uses a mixture of experts (MoE) to sample different binary masks on the source sequence for diverse content selection. The generation stage uses a standard encoder-decoder model given each selected content from the source sequence. While shown to be effective in generating diverse sequences, it relies heavily on the selection part, where one needs to select the information in the source that is most important to generate the target sequence. Thus, the diversity of the generated target sequence depends on the diversity of the selected parts of the source sequence. Similarly, another work utilizes MoE for the task of diverse machine translation. While this task is considered to be diverse text generation, and the approach is shown to be highly successful in generating diverse translations of each source sentence, it relies on the availability of a stochastic dataset, i.e., having access to multiple target sequences for each source sentence during training.
While these approaches are successful in generating diverse sentences given the conditioned sequence, unlike our approach that works with deterministic datasets, they assume having access to a stochastic dataset.
Appendix 0.K Ablation Study on Diverse Image Captioning
In addition to the experiments in the main paper, in Table 8, we also evaluate our approach, as well as the autoregressive baseline and the CVAE, in terms of the BLEU1, BLEU2, BLEU3, and BLEU4 scores of the captions generated at test time. For the autoregressive baseline, the model generates one caption per image; thus, it is straightforward to compute the BLEU scores. For the CVAE, we consider the best BLEU score among all sampled captions according to the best-matching ground-truth caption. For our model, we consider the caption from the mode of the distribution. Although the caption sampled from CPP-VAE is not chosen based on the best match with the ground-truth caption (as is done for the CVAE), it shows promising quality in terms of BLEU scores. For the sake of completeness and fairness, we also provide the results with the best of the sampled captions for our approach.
| Method | BLEU1 | BLEU2 | BLEU3 | BLEU4 |
| Conditional VAE (best of sampled captions) | 0.44 | 0.38 | 0.20 | 0.17 |
| CPP-VAE (caption from mode) | 0.44 | 0.37 | 0.20 | 0.14 |
| CPP-VAE (best of sampled captions) | 0.45 | 0.39 | 0.23 | 0.18 |
The results in Table 8 clearly show the effectiveness of sampling from the mode in our approach. In this case, one could simply rely on the mode of the distribution to achieve a reasonably high-quality caption.
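The difference between the two evaluation protocols, i.e., scoring a single caption from the mode versus keeping the best of several sampled captions, can be sketched as follows. This is a minimal illustration only: clipped unigram precision is used as a simplified stand-in for BLEU1 (without the brevity penalty), and the captions and function names are made up for the example.

```python
from collections import Counter


def unigram_precision(candidate, reference):
    """Clipped unigram precision, a simplified stand-in for BLEU1."""
    cand, ref = candidate.split(), Counter(reference.split())
    if not cand:
        return 0.0
    clipped = sum(min(n, ref[w]) for w, n in Counter(cand).items())
    return clipped / len(cand)


def best_reference_score(caption, references):
    """Score a caption against its best-matching ground-truth caption."""
    return max(unigram_precision(caption, r) for r in references)


def best_of_samples(samples, references):
    """Return (score, caption) of the best sample under the protocol above."""
    return max((best_reference_score(s, references), s) for s in samples)


# Made-up captions for illustration.
references = ["a dog runs on the beach", "a brown dog is running"]
mode_caption = "a dog is running"
samples = ["a cat sits on a sofa",
           "a dog runs on the sand",
           "a brown dog is running fast"]

mode_score = best_reference_score(mode_caption, references)
best_score, best_caption = best_of_samples(samples + [mode_caption], references)
print(f"mode: {mode_score:.2f}  best of samples: {best_score:.2f}")
```

Because the best-of-samples protocol takes a maximum over a set of captions, its score can never be lower than that of any single caption in the set, which is why the mode-based score is the more conservative of the two.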
Appendix 0.L Human Motion Prediction Qualitative Results
Here we provide a number of qualitative results on diverse human motion prediction on the Human3.6M dataset. As can be seen in Figures 7 to 12, the motions generated by our approach are diverse and natural, and mostly within the context of the observed motion.
Appendix 0.M Diverse Image Captioning Qualitative Results
In this section, we provide a number of qualitative examples of captions generated by our approach. As illustrated in Figures 13 to 19, there are five different ground-truth captions per image. However, as mentioned in the paper, during training we only utilize one (i.e., training with a deterministic dataset). While the captions generated by our approach are diverse, they all describe the image adequately. Note that it is a feature of our approach to generate a caption from the mode of its distribution, which usually yields a good descriptive caption. This is also evidenced by the quantitative results in Table 8, where the BLEU scores for the caption from the mode are relatively high compared to the other baselines. Note that, for the conditional VAE, all sampled captions are identical, despite sampling multiple latent variables. Therefore, we provide only one caption for this baseline.