Significant advances in deep generative models have brought impressive results in a wide range of domains such as image synthesis, text generation, and video prediction. Despite this success, unconstrained generation is still a few steps away from practical applications, since it lacks intuitive and handy mechanisms for incorporating human manipulation into the generation process. In view of this limitation, conditional and controllable generative models have received an increasing amount of attention. Most existing work achieves controllability by conditioning the generation on attributes, text, user inputs, or scene graphs [40, 36, 42, 16]. However, despite the considerable progress in still image generation, controllable video generation is yet to be well explored.
Typically, humans create a video by breaking down the entire story into separate scenes, taking shots for each scene individually, and finally merging every piece of footage to form the final edit. This requires a smooth transition not only across frames but also across different video clips, posing constraints on both the start- and end-frames of a video sequence so that it aligns with the preceding and subsequent context. We introduce point-to-point video generation (p2p generation), which controls the generation process with two control points: the targeted start- and end-frames. Enforcing consistency on the two control points allows us to regularize the context of the generated intermediate frames, and it also provides a straightforward strategy for merging multiple videos. Moreover, in comparison with the standard video generation setting, which requires a consecutive sequence of initial frames, p2p generation only needs a pair of individual frames. Such a setting is more accessible in real-world scenarios, e.g., generating videos from images with similar content crawled from the Internet. Finally, p2p generation is preferable to attribute-based methods for more sophisticated video generation tasks that involve hard-to-describe attributes. Attribute-based methods depend heavily on the attributes provided in the datasets, whereas p2p generation avoids the burden of collecting and annotating meticulous attributes.
Point-to-point generation has two major challenges: i) control point consistency (CPC) should be achieved without sacrificing generation quality and diversity, and ii) generations of various lengths should all satisfy control point consistency. Following the recent progress in video generation and future frame prediction, we introduce a global descriptor, which carries information about the targeted end-frame, and a time counter, which provides temporal hints for dynamic-length generation, to form a conditional variational autoencoder (CVAE [28]). In addition, to balance generation quality, diversity, and CPC, we propose to maximize a modified variational lower bound of the conditional data likelihood. We also add an alignment loss to ensure that the latent spaces of the encoder and decoder align with each other. We further present a skip-frame training strategy that makes our model more time-counter-aware, so that it adjusts its generation procedure accordingly and thus achieves better CPC. Extensive experiments are conducted on Stochastic Moving MNIST (SM-MNIST) [29, 3], Weizmann Human Action [8], and Human3.6M (3D skeleton data) [13] to evaluate the effectiveness of the proposed method. A series of qualitative results further highlights the merits of p2p generation and the capability of our model.
2 Related Work
Several generative models for video generation have been proposed subsequent to [26, 29], and controllability in video generation has also been explored in recent years [19, 23, 12, 9, 38, 10]. In this section, we first review methods for video generation and then briefly go through methods that tackle controllability in video generation.
A number of approaches use GANs [7] or apply an adversarial loss during training to generate videos [1, 20, 22, 24, 27, 32, 33]. For example, Mathieu et al. [24] leverage a multi-scale framework along with an adversarial loss for better prediction. Vondrick et al. [32] use a generator with two pathways to predict the foreground and background, and then use a discriminator to distinguish between generated and real videos.
Video generation can also be tackled by learning how to transform observed frames into synthetic future frames [5, 15, 21, 33, 37]. Another strategy is to decompose a video into a static part that is shared across frames (content) and a varying part (motion) that describes the dynamics in the video [4, 11, 30, 31, 35]. For example, Denton et al. [4] use two encoders, one for motion and one for content, to encode the frames, and introduce an adversarial loss on the motion encoder to achieve disentanglement.
Several methods rely on VAEs [17] to capture the uncertain nature of videos [2, 3, 6, 10, 18, 20, 34, 39]. Babaeizadeh et al. [2] extend prior work with a variational inference framework so that their model can predict multiple plausible futures on real-world data. Walker et al. [34] extract high-level information (e.g., pose) and use a VAE to predict possible future human movements. Jayaraman et al. [14] predict the most certain frame first and break down the original problem so that the predictor can complete the semantic sub-goals coherently.
While the methods mentioned above achieve good results on video prediction, the generation process is often uncontrollable and hence leads to unconstrained outputs. To preserve the ability to generate diverse outputs while achieving control point consistency, we build upon the VAE for point-to-point video generation.
Controllability in Video Generation.
Several methods attempt to guide the video generation process in different settings. Hu et al. [12] use an image and a motion stroke to synthesize the video. The method of Hao et al. [9] conditions on the start frame and a user-provided trajectory to steer the appearance and motion of the subsequent frames. Text or language features can also be used as instructions to produce the outputs [19, 23, 38]. The attribute-based approach proposed by He et al. [10] exploits the attributes (e.g., identity, action) in the dataset for transient control. Although these existing methods all provide freedom in controlling the generated video, they come with limitations. Conditioning on language suffers from its ambiguous nature, which does not allow precise control. Attribute control depends on data labels, which might not be available for every dataset. User-provided input is intuitive but requires supervision that must come along with ground truth during training. In contrast, our method i) only conditions on the target frame, which can be acquired without any cost, ii) can incorporate more detailed descriptions of the control points (e.g., the precise look and action of a person, or the joints of a skeleton) to provide exact control, and iii) can be trained in a fully unsupervised fashion. These advantages over previous methods in controlling the start- and target-frames motivate us to develop our solution to point-to-point video generation.
Given a pair of control points (the targeted start- and end-frames) and the generation length, we aim to generate a sequence of the specified length whose start- and end-frames are consistent with the control points. To maintain quality and diversity in p2p generation, we present a conditional video generation model (Sec. 3.2) that maximizes a modified variational lower bound (Sec. 3.3). To further improve CPC under various lengths, we propose a novel skip-frame training strategy (Sec. 3.4) and a latent alignment loss (Sec. 3.5).
3.1 VAE and Video Generation
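A variational autoencoder (VAE) leverages a simple prior $p_\theta(z)$ (e.g., Gaussian) and a complex likelihood $p_\theta(x \mid z)$ (e.g., a neural network) on a latent variable $z$ to maximize the data likelihood $p_\theta(x)$. A variational neural network $q_\phi(z \mid x)$ is introduced to approximate the intractable latent posterior $p_\theta(z \mid x)$, allowing joint optimization over $\theta$ and $\phi$:
$$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - D_{\mathrm{KL}}\left(q_\phi(z \mid x) \,\|\, p_\theta(z)\right). \tag{1}$$
The intuition behind the inequality is to reconstruct the data $x$ with a latent variable $z$ sampled from the posterior $q_\phi(z \mid x)$ while simultaneously minimizing the KL divergence between the prior $p_\theta(z)$ and the posterior $q_\phi(z \mid x)$.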
Video generation commonly adopts the VAE framework accompanied by a recurrent model (e.g., an LSTM), where the VAE handles the generation process and the recurrent model captures the dynamic dependencies in sequential generation. However, a simple choice of prior in a VAE, such as a fixed Gaussian, is confined to drawing samples randomly at each timestep regardless of the temporal dependencies across frames. Accordingly, existing works resort to parameterizing the prior with a learnable function $p_\psi$ conditioned on previous frames. The variational lower bound over the entire sequence is:
$$\sum_{t=1}^{T} \Big[ \mathbb{E}_{q_\phi(z_{1:t} \mid x_{1:t})} \log p_\theta(x_t \mid x_{1:t-1}, z_{1:t}) - D_{\mathrm{KL}}\big(q_\phi(z_t \mid x_{1:t}) \,\|\, p_\psi(z_t \mid x_{1:t-1})\big) \Big]. \tag{2}$$
In comparison with a standard VAE, the former term describes the reconstruction from latent variables sampled from the posterior conditioned on data up to the current frame. The latter term ensures that the prior conditioned on data up to the previous frame does not deviate from the posterior, and it also serves as a regularizer on the learning of the posterior. In this work, we inherit and modify the network architecture of this line of work and adapt it for p2p generation.
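As a minimal sketch of the KL term described above, the snippet below computes the divergence between a posterior and a learned prior at each timestep and sums it over a sequence. It assumes that both distributions are diagonal Gaussians parameterized by a mean and a log-variance, which is a common choice but is not stated in the text; the function names are illustrative only.

```python
import math


def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) for two diagonal Gaussians given as means and log-variances.

    Assumption: q plays the role of the posterior and p the learned prior.
    """
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total


def sequence_kl(posteriors, priors):
    """Sum the per-timestep KL terms over a sequence.

    Each element of `posteriors` and `priors` is a (mean, log-variance) pair
    for one timestep.
    """
    return sum(
        kl_diag_gaussians(mu_q, lv_q, mu_p, lv_p)
        for (mu_q, lv_q), (mu_p, lv_p) in zip(posteriors, priors)
    )
```

The divergence is zero when the prior matches the posterior exactly, which is the condition under which the prior can stand in for the posterior at test time.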
3.2 Global Descriptor and Time Counter
For a deep network to achieve p2p generation under various lengths, i) the model should be aware of the information in the control points, and ii) the model should be able to perceive the passage of time and generate the targeted end-frame at the designated timestep. While the targeted start-frame is already fed as an initial frame, we adopt a straightforward strategy to incorporate the control points into the model at every timestep: we feed it a global descriptor, i.e., features encoded from the targeted end-frame. In addition, to make our model aware of when to generate the targeted end-frame given the generation length, we introduce a time counter whose value indicates how far the generation has progressed from the beginning of the sequence toward the targeted end-frame. As shown in Fig. 2(a), the posterior and the prior are modeled by a shared-weight encoder and two different LSTMs, and the likelihood is modeled by a third LSTM along with a decoder that maps latent vectors to image space. During training, inference at each timestep draws the latent variable from the posterior, which has access to the current frame.
At test time, since we have no access to the current frame, the latent variable is instead sampled from the prior distribution.
Recall that the KL divergence term in (2) enforces the alignment between the posterior and the prior, allowing the prior to serve as a proxy for the posterior at test time. Moreover, by introducing the global descriptor and the time counter, (2) is extended to a variational lower bound of the conditional data likelihood, where the conditioning is on the targeted end-frame and the time counter. In addition, we propose a latent space alignment loss between the encoder and the decoder to mitigate the mismatch between the encoding and decoding processes, as shown in (6).
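To illustrate how the two conditioning signals might be supplied at each timestep, the sketch below builds a time counter for a given generation length and concatenates it with a frame feature and the global descriptor. The normalization of the counter to the range from 0 to 1 and the use of plain concatenation are assumptions made for illustration, and both function names are hypothetical.

```python
def time_counter(length):
    """Return one counter value per timestep, rising linearly from 0.0 to 1.0.

    Assumption: 0.0 marks the start of the sequence and 1.0 the targeted end-frame.
    """
    if length < 2:
        raise ValueError("a sequence needs at least a start- and an end-frame")
    return [t / (length - 1) for t in range(length)]


def step_inputs(frame_feature, end_feature, counter):
    """Concatenate the current-frame feature, the global descriptor, and the counter."""
    return list(frame_feature) + list(end_feature) + [counter]


# Example: conditioning vectors for a 5-step generation with 2-d features.
counters = time_counter(5)
inputs = [step_inputs([0.1, 0.2], [0.9, 0.8], c) for c in counters]
```

Because the counter depends on the requested length, the same model can be asked for sequences of different lengths simply by changing the argument to `time_counter`.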
3.3 Control Point Consistency on Prior
Although introducing the time counter and the global descriptor of control points gives the model the capability of achieving CPC, these inputs alone do not let us further reinforce the generated end-frame to conform to the targeted end-frame. While this conformity is already part of the reconstruction objective, naively increasing the weight of the final timestep in the reconstruction term of (2) results in unstable training behavior and degrades generation quality and diversity. To tackle this problem, we propose to separate the CPC from the reconstruction loss on the posterior and pose it on the prior instead, yielding a modified lower bound of the conditional data likelihood with a learnable prior.
While the first two terms of the modified bound are the same as in the bound of a conditional VAE (CVAE), the third term allows more flexible tuning of the behavior of the additionally introduced condition without degrading the maximum likelihood estimate in the first term.
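One plausible way to write a bound with this three-term structure is sketched below. The notation is assumed rather than taken from the text: $c$ stands for the conditioning on the targeted end-frame and the time counter, $T$ for the final timestep, and $\beta$ and $\alpha_{\mathrm{cpc}}$ for the weights on the KL and CPC terms (setting $\beta = 1$ and $\alpha_{\mathrm{cpc}} = 0$ recovers a plain conditional bound).

```latex
\sum_{t=1}^{T} \Big[
    \mathbb{E}_{q_\phi(z_{1:t} \mid x_{1:t}, c)}
        \log p_\theta(x_t \mid x_{1:t-1}, z_{1:t}, c)
    - \beta \, D_{\mathrm{KL}}\big(
        q_\phi(z_t \mid x_{1:t}, c) \,\|\, p_\psi(z_t \mid x_{1:t-1}, c)
      \big)
\Big]
+ \alpha_{\mathrm{cpc}} \,
    \mathbb{E}_{p_\psi(z_T \mid x_{1:T-1}, c)}
        \log p_\theta(x_T \mid x_{1:T-1}, z_T, c)
```

In this reading, the first two terms involve only the posterior and the KL regularizer, while the last term scores the end-frame decoded from a latent variable drawn from the prior, so its weight can be tuned without rescaling the reconstruction term.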
3.4 Skip-Frame Training
A well-functioning p2p generation model should be aware of the time counter in order to achieve CPC under various lengths. However, most video datasets have a fixed frame rate. As a result, the model may exploit the fixed frequency across frames and ignore the time counter. We introduce skip-frame training to make the model more aware of the time counter. Specifically, we randomly drop frames while computing the reconstruction loss and KL divergence (the first two terms in (5)). The LSTMs are hence forced to take the time counter into consideration in order to handle the random skipping in the recurrence. Such an adaptation of the maximum likelihood estimate of the posterior further incorporates CPC into the learning of the posterior.
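The random dropping described above could be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: each interior frame is dropped independently with probability `skip_prob`, the start- and end-frames are always kept, and the time counter is normalized to the range from 0 to 1. All names are hypothetical.

```python
import random


def sample_kept_timesteps(length, skip_prob, rng):
    """Choose which timesteps contribute to the loss in one training pass.

    Assumption: the first and last timesteps (the control points) are never dropped.
    """
    kept = [0]
    for t in range(1, length - 1):
        if rng.random() >= skip_prob:
            kept.append(t)
    kept.append(length - 1)
    return kept


def counters_for(kept, length):
    """Time-counter values for the kept timesteps, still measured on the full sequence."""
    return [t / (length - 1) for t in kept]


rng = random.Random(0)
kept = sample_kept_timesteps(10, 0.5, rng)
counters = counters_for(kept, 10)
```

Since the gaps between consecutive kept timesteps vary from pass to pass, the recurrent model cannot rely on a constant step size and has to read the counter to know where it is in the sequence.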
3.5 Final Objective
To summarize, our final objective maximizes the modified variational lower bound of the conditional data likelihood under a skip-frame training strategy. It includes hyperparameters that balance KL regularization, CPC, and latent space alignment, respectively, and a constant that determines the rate of skip-frame training.
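A minimal sketch of how such an objective might be assembled from its parts is given below. The weight names `beta`, `alpha_cpc`, and `alpha_align`, the use of a mean squared distance for the alignment term, and the function names are all assumptions for illustration; the per-timestep quantities are expected to have been computed on the frames kept by skip-frame training.

```python
def latent_alignment(enc_latent, dec_latent):
    """Mean squared distance between two latent vectors (assumed form of the alignment loss)."""
    return sum((a - b) ** 2 for a, b in zip(enc_latent, dec_latent)) / len(enc_latent)


def p2p_objective(recon_ll, kl, cpc_ll, align_dist, beta, alpha_cpc, alpha_align):
    """Scalar objective to maximize.

    recon_ll:   per-timestep reconstruction log-likelihoods (posterior samples)
    kl:         per-timestep KL divergences between posterior and prior
    cpc_ll:     log-likelihood of the targeted end-frame under a prior sample
    align_dist: latent alignment distance (lower is better)
    """
    return (
        sum(recon_ll)
        - beta * sum(kl)
        + alpha_cpc * cpc_ll
        - alpha_align * align_dist
    )
```

For example, `p2p_objective([-1.0, -2.0], [0.5, 0.5], -0.25, 2.0, beta=1.0, alpha_cpc=4.0, alpha_align=0.5)` combines a reconstruction total of -3.0 with penalties of 1.0 each from the KL, CPC, and alignment terms.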
[Tables: per-method SSIM and PSNR with confidence intervals, including the + C + A variant; the numerical entries are not recoverable.]
To evaluate the effectiveness of our method, we conduct qualitative and quantitative analyses on three datasets, Stochastic Moving MNIST [3], Weizmann Action [8], and Human3.6M [13], to measure CPC, quality, and diversity. This section is organized as follows: we describe the datasets in Sec. 4.1 and the evaluation metrics in Sec. 4.2; the quantitative results are shown in Sec. 4.3-4.6; finally, the qualitative results are presented in Sec. 4.7.
We evaluate our method on three common testbeds. “Stochastic Moving MNIST” was introduced by [3] as a modified version of [29]. Each training sequence is generated by sampling one or two digits from the MNIST training set, and a trajectory is then formed by sampling a starting location within the frame and an initial velocity vector. The velocity vector is re-sampled each time a digit reaches the border. The “Weizmann Action” dataset contains 90 videos of 9 people performing 10 actions. We center-crop each frame using the available bounding boxes and follow prior work to form the training and test sets. “Human3.6M” is a large-scale dataset with 3.6 million 3D human poses performed by 11 professional actors, providing more than 800 sequences in total. We use normalized 3D skeletons of 17 joints for our experiments. Following prior work, we use subjects 1, 5, 6, 7, and 8 for training and subjects 9 and 11 for testing.
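The trajectory sampling described for Stochastic Moving MNIST can be sketched as below. This is a simplified illustration, not the dataset's reference generator: it assumes a constant speed, a uniformly random direction, and re-sampling of the direction whenever the next step would leave the frame. The function name and all parameters are hypothetical.

```python
import math
import random


def bouncing_trajectory(steps, frame_size, digit_size, speed, rng):
    """Sample top-left positions of one digit over `steps` frames."""
    limit = frame_size - digit_size  # largest valid top-left coordinate
    if not 0 < speed <= limit / 2:
        # With speed <= limit / 2 a valid direction always exists, so the
        # re-sampling loop below terminates with probability one.
        raise ValueError("speed must be positive and at most half the free space")

    def sample_velocity():
        angle = rng.uniform(0.0, 2.0 * math.pi)
        return speed * math.cos(angle), speed * math.sin(angle)

    x, y = rng.uniform(0.0, limit), rng.uniform(0.0, limit)
    vx, vy = sample_velocity()
    path = []
    for _ in range(steps):
        path.append((x, y))
        # Re-sample the velocity whenever the digit would cross the border.
        while not (0.0 <= x + vx <= limit and 0.0 <= y + vy <= limit):
            vx, vy = sample_velocity()
        x, y = x + vx, y + vy
    return path


path = bouncing_trajectory(steps=30, frame_size=64, digit_size=28, speed=4.0,
                           rng=random.Random(0))
```

Because the new velocity is random, two sequences that agree up to a border hit can diverge afterwards, which is what makes the dataset stochastic.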
4.2 Evaluation Metrics
We measure the structural similarity (SSIM) and peak signal-to-noise ratio (PSNR) for Stochastic Moving MNIST and Weizmann Action following [5, 2, 3]. For Human3.6M, we calculate the mean squared error (MSE). To assess the learning of the prior and the posterior, we introduce Sampling and Reconstruction metrics (referred to as “S-” and “R-”), where the evaluation is performed on generations from the prior and the posterior, respectively. For each test sequence, we generate 100 samples and compute the following metrics with confidence intervals:
Control Point Consistency (S-CPC): We compute the mean SSIM/PSNR/MSE between the generated end-frame and the targeted end-frame since CPC should be achieved for all samples.
Diversity (S-Div): Adapting a concept from prior work, we compute the variance of SSIM/PSNR across all samples with the ground-truth sequences as reference. For MSE, we instead calculate the variance of the difference between generated and ground-truth sequences, because MSE only measures the distance between joints while ignoring their relative positions, which would result in a biased estimate of diversity.
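The two sampling metrics above can be sketched in a few lines. The code assumes that each frame is a flat list of values, that a sequence-level score is the average of per-frame scores, and that the metric is passed in as a function; these choices and the function names are illustrative. PSNR and MSE are shown because they are short to write out, whereas SSIM would normally come from an image-processing library.

```python
import math


def mse(a, b):
    """Mean squared error between two frames given as flat lists."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def psnr(a, b, max_val=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical frames."""
    err = mse(a, b)
    if err == 0:
        return float("inf")
    return 10.0 * math.log10(max_val ** 2 / err)


def s_cpc(samples, target_end, metric):
    """Mean metric between each sample's final frame and the targeted end-frame."""
    scores = [metric(sample[-1], target_end) for sample in samples]
    return sum(scores) / len(scores)


def s_div(samples, ground_truth, metric):
    """Variance across samples of the sequence-level score against the ground truth."""
    scores = []
    for sample in samples:
        per_frame = [metric(f, g) for f, g in zip(sample, ground_truth)]
        scores.append(sum(per_frame) / len(per_frame))
    mean = sum(scores) / len(scores)
    return sum((s - mean) ** 2 for s in scores) / len(scores)
```

A model that reaches the target in every sample scores well on `s_cpc`, while `s_div` stays high only if the samples differ from one another along the way.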
4.3 Quantitative Results
We show quantitative analysis on generation quality, diversity, and CPC over the three datasets (SM-MNIST, Weizmann, and Human3.6M) in Table 1, Table 2, and Table 3, respectively. R-Best shows that the posteriors are learned well in all settings. In all tables, the model with CPC and alignment losses (+C+A) outperforms the model with only the CPC loss (+C) in S-CPC, which shows the effectiveness of the alignment loss. Recall from Sec. 3.2 that the encoder and decoder are modeled by separate LSTMs; the alignment loss aligns the two latent spaces to alleviate the mismatch between the encoding and decoding processes. Moreover, in all tables, the full model (Ours) with skip-frame training further improves over +C+A in S-CPC, where the gain mainly results from better use of the time counter. Finally, the S-CPC gain on Weizmann is smaller than on SM-MNIST and Human3.6M since, unlike the latter two, its data are captured against noisy backgrounds, which is more challenging for CPC.
For generation quality, all three tables show comparable results in S-Best, highlighting that our method maintains quality while achieving CPC. In addition, Table 3 shows an interesting finding: Ours achieves clearly superior performance not only in S-CPC but also in S-Best. The main reason is that Human3.6M contains 3D skeletons with highly diverse actions, giving rise to considerably flexible generation. A long unconstrained generation may easily deviate from the ground truth, causing a high S-Best error, while p2p generation gradually converges to the targeted end-frame, confining the S-Best error (more details in Sec. 4.5).
For generation diversity, our method attains either comparable or better performance in Table 1 and Table 3. This indicates that our method generates diverse samples while reaching the same targeted end-frame. In contrast, our method suffers a larger drop in S-Div in Table 2. This is expected, since Weizmann data often involve video sequences with unvarying actions, e.g., walking at a fixed speed; posing constraints at the end-frame therefore significantly reduces the space of possible generations, leading to low generation diversity. Overall, we demonstrate a significant improvement in CPC with our method while reaching generation quality and diversity comparable to the baseline.
4.4 CPC in Generation with Various Length
We show the CPC performance of all models under generation of different lengths on the Human3.6M dataset in Fig. 3. First, our methods achieve CPC under various lengths even though the model has only seen sequences of a limited range of lengths during training, showing that our models generalize well across generation lengths. Notably, with skip-frame training (red line), our model achieves even better CPC than the other variants since it is able to leverage the information provided by the time counter. However, our method performs slightly worse at length 10 than at longer lengths, because the model has less time budget for planning its trajectory toward the targeted end-frame and because the training data do not contain any sequences shorter than 20 frames.
4.5 Diversity Through Time
We further evaluate the diversity of our methods by investigating their behavior through time in Fig. 4. First, a downward trend can be observed near the end of the green line, showing that the model tries to reach the targeted end-frame as the time counter approaches the end. With skip-frame training (red line), the diversity rises toward the middle segment and converges near the start- and end-frames. This suggests that our full model is aware of its status, such as how far it is from the end-frame and how much time budget remains, and can thus plan ahead to achieve CPC. Furthermore, since it perceives its remaining time budget well, it can “go wild”, i.e., explore all possible trajectories, while still being capable of getting back to the targeted end-frame on time.
4.6 CPC Weight on Prior vs. Posterior
Finally, we assess the effect of posing different CPC weights on the prior versus the posterior by comparing quality and diversity in SSIM (Fig. 5). Across weights, the diversity behaves comparably whether CPC is posed on the prior or on the posterior. Nonetheless, posing CPC on the prior (blue line) does not result in degradation at any CPC weight, in contrast to posing CPC on the posterior. This shows that our method is more robust to different CPC weights in terms of generation quality and diversity.
4.7 Qualitative Results
Generation with various lengths. In Fig. 6, we show with examples across all three datasets that our model maintains high CPC for all lengths while producing diverse results (highlighted in the red panels).
Multiple control-points generation. In Fig. 7, we show generated videos with multiple control points. The first row highlights a transition across different attributes or actions (e.g., “run” to “skip” in the Weizmann dataset). The second and third rows show two generated videos with the same set of multiple control points (i.e., stand; sit and lean to the left side). Note that these are two unique videos with diverse frames at the transitional timesteps. By placing each control point as a breakpoint in a long generation, we can achieve fine-grained controllability directly from frame exemplars. More examples, including a few failure cases, are shown in the supplementary materials.
Loop generation. In Fig. 8, we show that our method can be used to generate infinite looping videos by forcing the targeted start- and end-frames to be the same.
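Chaining control points into one long video, and closing a loop by reusing the first frame as the last, could look like the sketch below. The generator here is a stand-in that linearly interpolates between two frames so that the example runs on its own; in practice it would be replaced by a trained p2p model with the same call signature. All function names are hypothetical.

```python
def interpolate_segment(start, end, length):
    """Stand-in for a p2p generator: linear interpolation between two frames."""
    return [
        [s + (e - s) * t / (length - 1) for s, e in zip(start, end)]
        for t in range(length)
    ]


def chain_control_points(control_points, lengths, generate=interpolate_segment):
    """Generate one segment per consecutive pair of control points and join them."""
    video = [list(control_points[0])]
    pairs = zip(control_points, control_points[1:], lengths)
    for start, end, length in pairs:
        segment = generate(start, end, length)
        video.extend(segment[1:])  # skip the frame shared with the previous segment
    return video


# A loop: the first and last control points are the same frame.
loop = chain_control_points([[0.0], [1.0], [0.0]], lengths=[3, 3])
```

Each segment starts exactly where the previous one ended, so consistency at the control points is what makes the joined video seamless; with a stochastic generator, repeated calls would give different transitions between the same control points.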
In this paper, we propose point-to-point (p2p) generation, which controls the generation process with two control points, the targeted start- and end-frames, to provide better controllability in video generation. To achieve control point consistency (CPC) while maintaining generation quality and diversity, we propose to maximize a modified variational lower bound for a conditional video generation model, together with a novel skip-frame training strategy and a latent space alignment loss to further reinforce CPC. Extensive quantitative analysis is conducted to demonstrate the effectiveness of our model, along with qualitative results that highlight the merits of p2p generation and the capability of our model. Overall, our work opens up a new dimension in video generation that is promising for further exploration.
- [1] S. Aigner and M. Körner. Futuregan: Anticipating the future frames of video sequences using spatio-temporal 3d convolutions in progressively growing autoencoder gans. arXiv preprint arXiv:1810.01325, 2018.
- [2] M. Babaeizadeh, C. Finn, D. Erhan, R. H. Campbell, and S. Levine. Stochastic variational video prediction. In Proceedings of the International Conference on Learning Representations, 2018.
- [3] E. Denton and R. Fergus. Stochastic video generation with a learned prior. In Proceedings of the International Conference on Machine Learning, 2018.
- [4] E. L. Denton and V. Birodkar. Unsupervised learning of disentangled representations from video. In Advances in Neural Information Processing Systems, pages 4414–4423, 2017.
- [5] C. Finn, I. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video prediction. In Advances in Neural Information Processing Systems, pages 64–72, 2016.
- [6] K. Fragkiadaki, J. Huang, A. Alemi, S. Vijayanarasimhan, S. Ricco, and R. Sukthankar. Motion prediction under multimodality with conditional stochastic networks. arXiv preprint arXiv:1705.02082, 2017.
- [7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
- [8] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. Transactions on Pattern Analysis and Machine Intelligence, 29(12):2247–2253, December 2007.
- [9] Z. Hao, X. Huang, and S. Belongie. Controllable video generation with sparse trajectories. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7854–7863, 2018.
- [10] J. He, A. Lehrmann, J. Marino, G. Mori, and L. Sigal. Probabilistic video generation using holistic attribute control. In Proceedings of the European Conference on Computer Vision (ECCV), pages 452–467, 2018.
- [11] J.-T. Hsieh, B. Liu, D.-A. Huang, L. F. Fei-Fei, and J. C. Niebles. Learning to decompose and disentangle representations for video prediction. In Advances in Neural Information Processing Systems, pages 515–524, 2018.
- [12] Q. Hu, A. Waelchli, T. Portenier, M. Zwicker, and P. Favaro. Video synthesis from a single image and motion stroke. arXiv preprint arXiv:1812.01874, 2018.
- [13] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1325–1339, July 2014.
- [14] D. Jayaraman, F. Ebert, A. A. Efros, and S. Levine. Time-agnostic prediction: Predicting predictable video frames. In Proceedings of the International Conference on Learning Representations, 2019.
- [15] X. Jia, B. De Brabandere, T. Tuytelaars, and L. V. Gool. Dynamic filter networks. In Advances in Neural Information Processing Systems, pages 667–675, 2016.
- [16] J. Johnson, A. Gupta, and L. Fei-Fei. Image generation from scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1219–1228, 2018.
- [17] D. P. Kingma and M. Welling. Auto-encoding variational bayes. In Proceedings of the International Conference on Learning Representations, 2014.
- [18] A. X. Lee, R. Zhang, F. Ebert, P. Abbeel, C. Finn, and S. Levine. Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523, 2018.
- [19] Y. Li, M. R. Min, D. Shen, D. E. Carlson, and L. Carin. Video generation from text. arXiv preprint arXiv:1710.00421, 2017.
- [20] X. Liang, L. Lee, W. Dai, and E. P. Xing. Dual motion gan for future-flow embedded video prediction. In Proceedings of the IEEE International Conference on Computer Vision, pages 1744–1752, 2017.
- [21] Z. Liu, R. A. Yeh, X. Tang, Y. Liu, and A. Agarwala. Video frame synthesis using deep voxel flow. In Proceedings of the IEEE International Conference on Computer Vision, pages 4463–4471, 2017.
- [22] W. Lotter, G. Kreiman, and D. Cox. Unsupervised learning of visual structure using predictive generative networks. In Workshop Track of International Conference on Learning Representations, 2016.
- [23] T. Marwah, G. Mittal, and V. N. Balasubramanian. Attentive semantic video generation using captions. In Proceedings of the IEEE International Conference on Computer Vision, pages 1426–1434, 2017.
- [24] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. In Proceedings of the International Conference on Learning Representations, 2016.
- [25] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In Proceedings of the International Conference on Learning Representations, 2016.
- [26] M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, and S. Chopra. Video (language) modeling: a baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604, 2014.
- [27] M. Saito, E. Matsumoto, and S. Saito. Temporal generative adversarial nets with singular value clipping. In Proceedings of the IEEE International Conference on Computer Vision, pages 2830–2839, 2017.
- [28] K. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491, 2015.
- [29] N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using lstms. In Proceedings of the International Conference on Machine Learning, pages 843–852, 2015.
- [30] S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz. Mocogan: Decomposing motion and content for video generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1526–1535, 2018.
- [31] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee. Decomposing motion and content for natural video sequence prediction. In Proceedings of the International Conference on Learning Representations, 2017.
- [32] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In Advances in Neural Information Processing Systems, pages 613–621, 2016.
- [33] C. Vondrick and A. Torralba. Generating the future with adversarial transformers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1020–1028, 2017.
- [34] J. Walker, K. Marino, A. Gupta, and M. Hebert. The pose knows: Video forecasting by generating pose futures. In Proceedings of the IEEE International Conference on Computer Vision, pages 3332–3341, 2017.
- [35] O. Wiles, A. Koepke, and A. Zisserman. Self-supervised learning of a facial attribute embedding from video. In Proceedings of the British Machine Vision Conference, 2018.
- [36] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1316–1324, 2018.
- [37] T. Xue, J. Wu, K. Bouman, and B. Freeman. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In Advances in Neural Information Processing Systems, pages 91–99, 2016.
- [38] S. Yamamoto, A. Tejero-de Pablos, Y. Ushiku, and T. Harada. Conditional video generation using action-appearance captions. arXiv preprint arXiv:1812.01261, 2018.
- [39] X. Yan, A. Rastogi, R. Villegas, K. Sunkavalli, E. Shechtman, S. Hadap, E. Yumer, and H. Lee. Mt-vae: Learning motion transformations to generate multimodal human dynamics. In Proceedings of the European Conference on Computer Vision (ECCV), pages 265–281, 2018.
- [40] X. Yan, J. Yang, K. Sohn, and H. Lee. Attribute2image: Conditional image generation from visual attributes. In European Conference on Computer Vision, pages 776–791. Springer, 2016.
- [41] D. Yang, S. Hong, Y. Jang, T. Zhao, and H. Lee. Diversity-sensitive conditional generative adversarial networks. In Proceedings of the International Conference on Learning Representations, 2019.
- [42] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pages 597–613. Springer, 2016.
Appendix A Overview
The supplementary material is organized as follows:
First, we provide an overview video (video link: https://drive.google.com/open?id=1kS9f2oNGFPO_hp7iWZmvtXLPnOrhl9qW), which briefly summarizes our work. Second, we provide more quantitative results on all datasets: SM-MNIST, Weizmann Human Action, and Human3.6M in Sec. B. Furthermore in Sec. C, we present more qualitative evaluations with respect to i) “Generation with various length” in Figs. 14-16 (more examples at https://drive.google.com/open?id=1ueQHNx56MWoqL9ilHjZuBZourg4VrbKc); ii) “Multiple control-points generation” in Fig. 17 (more examples at https://drive.google.com/open?id=1OUOd2LjmKwHwVpRwldUEIgvzfpWucYjt); iii) “Loop generation” in Fig. 18 (more examples at https://drive.google.com/open?id=1kb8PCIR2_lkE1JS6NlwyglxKlChSBSbF). Finally, the implementation details are described in Sec. D.
Appendix B Quantitative Results
B.1 Performance Under Various Length
In this section, we investigate control point consistency, generation quality, and diversity under generation of different lengths on the SM-MNIST, Weizmann Action, and Human3.6M datasets (refer to Sec. 4.4 in the main paper).
Control Point Consistency (S-CPC):
In Fig. 9, we show the performance of CPC on the three datasets, where for SM-MNIST and Weizmann (the first and the second column), the higher (SSIM) the better, and for Human3.6M (the last column), the lower (MSE) the better. Our method (red line) significantly outperforms other baselines on all datasets, while different components of our method including CPC on prior, latent space alignment, and skip-frame training all introduce performance gain.
In Fig. 10, we demonstrate that our method is able to sustain the generation quality on the three datasets, with the higher (SSIM) the better for SM-MNIST and Weizmann (the first and second columns), and the lower (MSE) the better for Human3.6M (the last column). Our method (red line) achieves superior quality on Human3.6M since its data contain 3D skeletons with highly diverse actions, and imposing a targeted end-frame largely confines the S-Best error (more details in Sec. 4.5 of the main paper). On the other hand, for SM-MNIST and Weizmann, our method suffers only a marginal performance drop in comparison with the baselines. We point out that the generation quality on SM-MNIST gradually declines with increasing generation length since the two digits are prone to overlapping with each other in a longer sequence, resulting in blurry generation after the encounter. This could potentially be solved by representation disentanglement [31, 4, 30, 11, 35], which is beyond the scope of this paper and left to future work. Overall, we establish that our method attains comparable generation quality while achieving CPC.
Finally, we show the generation diversity on the three datasets in Fig. 11, where for all columns, the higher (SSIM or MSE) the better. We can observe that our method (red line) achieves superior performance on Human3.6M and comparable performance on SM-MNIST. In contrast, the Weizmann dataset consists of video sequences with steady, fixed-speed actions, which greatly reduces the range of possible generations once a constraint is imposed on the end-frame (red line in the middle column). All in all, despite the limitations of the datasets themselves, our method is capable of generating diverse sequences while simultaneously achieving CPC.
B.2 Performance Through Time
In this section, we perform a more detailed analysis on generation quality and diversity through time (refer to Sec. 4.5 in the main paper).
In Fig. 12, we show the generation quality at each timestep on the three datasets, with the higher (SSIM) the better for SM-MNIST and Weizmann (the first and second columns), and the lower (MSE) the better for Human3.6M (the last column). We can observe a consistent trend across methods and datasets: the quality progressively decreases as the timestep grows. This is expected, since the generated sequences deviate step by step from the ground truth and accumulate compounding error as the generation moves further from the given start-frame. Remarkably, for all methods that take CPC into consideration (orange, green, and red lines), the generation quality recovers strongly at the end of the sequence, since achieving CPC ensures that the generated end-frame converges to the targeted end-frame and thus yields a better S-Best at the last timestep. Finally, the quality boost at the end-frame is lower on the Weizmann dataset (the middle column) since, unlike the other two (the first and last columns), its data are captured against noisy backgrounds, which poses more challenges to CPC and consequently lowers the quality at the end-frame.
In Fig. 13, we demonstrate the generation diversity through time on the three datasets, with the higher (SSIM or MSE) the better in all columns. A consistent trend is shared across all datasets (all columns) for our method (red line): the diversity is high in the intermediate frames but reaches zero at the two control points, namely the targeted start- and end-frames. This suggests that our method is able to plan ahead, generate high-diversity frames at timesteps far from the end, and finally converge to the targeted end-frame with diversity approaching zero. In addition, we point out that the diversity curve of the Weizmann dataset (the middle column) indicates slightly worse performance than the results on the other two datasets (the first and third columns), since Weizmann data feature unvarying actions, e.g., walking at a fixed speed, which greatly reduces the potential diversity at the intermediate frames.
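The shape of such a diversity curve can be illustrated with a small sketch. Here diversity at a timestep is taken to be the mean pairwise MSE between samples generated from the same control points; this definition, the function names, and the toy one-dimensional frames are assumptions made for the example and are not necessarily the metric used in the main paper.

```python
from itertools import combinations


def frame_mse(frame_a, frame_b):
    """Mean squared error between two frames given as flat lists of floats."""
    return sum((a - b) ** 2 for a, b in zip(frame_a, frame_b)) / len(frame_a)


def diversity_through_time(samples):
    """Mean pairwise MSE between samples, computed separately at each timestep."""
    num_steps = len(samples[0])
    if any(len(sequence) != num_steps for sequence in samples):
        raise ValueError("all samples must have the same length")
    pairs = list(combinations(range(len(samples)), 2))
    if not pairs:
        raise ValueError("need at least two samples to measure diversity")
    curve = []
    for t in range(num_steps):
        total = sum(frame_mse(samples[i][t], samples[j][t]) for i, j in pairs)
        curve.append(total / len(pairs))
    return curve


# Three hypothetical samples that share the start-frame and the end-frame.
samples = [
    [[0.0], [1.0], [4.0]],
    [[0.0], [3.0], [4.0]],
    [[0.0], [2.0], [4.0]],
]
curve = diversity_through_time(samples)
```

With these toy samples the curve is zero at both control points and positive in between, which is the qualitative pattern described for the red line.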
Appendix C Qualitative Results
Generation with various length.
In Fig. 14, Fig. 15, and Fig. 16, we demonstrate the generation results with various lengths on SM-MNIST, Weizmann Action, and Human3.6M datasets. For more generated examples, please see https://drive.google.com/open?id=1ueQHNx56MWoqL9ilHjZuBZourg4VrbKc.
Multiple control-points generation.
In Fig. 17, given multiple targeted start- and end-frames, we show our model’s ability to merge multiple generated clips into a longer video. For more generated examples, please see https://drive.google.com/open?id=1OUOd2LjmKwHwVpRwldUEIgvzfpWucYjt.
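The merging step can be sketched as a simple concatenation in which each shared control point is kept only once. This is an illustrative sketch, not the authors' pipeline: `merge_clips` is a hypothetical helper, frames are represented by string labels, and the exact-equality check assumes that every clip ends precisely on its targeted end-frame (with real generated frames a tolerance-based comparison would be needed).

```python
def merge_clips(clips):
    """Concatenate clips whose boundaries share a control-point frame.

    Each clip is a list of frames. The end-frame of one clip is expected to be
    the start-frame of the next, so the duplicate is dropped at every junction.
    """
    if not clips:
        return []
    merged = list(clips[0])
    for clip in clips[1:]:
        if clip[0] != merged[-1]:
            raise ValueError("adjacent clips must share a control-point frame")
        merged.extend(clip[1:])  # skip the duplicated control point
    return merged


# Hypothetical frame labels: f0, f1, f2 are control points, the rest are generated.
clip_a = ["f0", "a1", "a2", "f1"]
clip_b = ["f1", "b1", "f2"]
video = merge_clips([clip_a, clip_b])
```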
Loop generation.
In Fig. 18, by setting the targeted start- and end-frames to be the same, we can achieve loop generation. For more generated examples, please see https://drive.google.com/open?id=1kb8PCIR2_lkE1JS6NlwyglxKlChSBSbF.
Appendix D Implementation Details
We provide the training details and network architecture in this section.
D.1 Training Details
We implement our model in PyTorch. For SM-MNIST and Weizmann Action the input and output image size is, and for Human3.6M the input comprises the joint positions of size . Note that while our p2p generation models are fed with the targeted end-frames, the baseline method SVG , which is not CPC-aware, is introduced with one additional frame such that all methods are compared under the same number of input frames. For the reconstruction loss in , we use -loss. All models are trained with Adam optimizer, learning rate of , and batch size of for SM-MNIST, Weizmann Action and Human3.6M respectively. The weights in the full objective function and other details regarding each dataset are summarized as follows:
For the weights in , we set , , , . And the length of training sequences is .
For the weights in , we set , , , . The length of training sequences is for Weizmann Action and we augment the dataset by flipping each sequence so that our model can learn to generate action sequences that proceed toward both directions.
For the weights in the objective function : , , , . The length of training sequences for Human3.6M is . Besides, we speed up the training sequences to since the adjacent frames in the original sequences are often too similar to each other, which may prevent the model from learning diverse actions.
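The two data preparation steps mentioned above, flipping for Weizmann Action and speeding up for Human3.6M, can be sketched as follows. This is a minimal sketch under assumptions: flipping is interpreted here as a horizontal mirror of every frame, speeding up is interpreted as keeping every `stride`-th frame, and the `stride` value in the usage lines is purely hypothetical because the factor used in the paper is not given in the text above.

```python
def flip_horizontal(sequence):
    """Mirror every frame left-to-right; a frame is a list of pixel rows."""
    return [[row[::-1] for row in frame] for frame in sequence]


def speed_up(sequence, stride):
    """Keep every `stride`-th frame, starting from the first one."""
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    return sequence[::stride]


# Toy 2x2 frames: a bright column moving from left to right.
frame0 = [[1, 0], [1, 0]]
frame1 = [[0, 1], [0, 1]]
sequence = [frame0, frame1]

# Augmented set: the original plus its mirror image (motion in the other direction).
flipped = flip_horizontal(sequence)
augmented = [sequence, flipped]

# Hypothetical stride of 3 on a ten-frame sequence of frame indices.
fast = speed_up(list(range(10)), stride=3)
```

Subsampling makes adjacent training frames differ more from each other, which is the stated motivation for speeding up the Human3.6M sequences.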
D.2 Network Architecture
The networks for the three datasets all contain the following main components: i) posterior , ii) prior , and iii) generator . The encoder is shared by , and the global descriptor. We choose DCGAN as the backbone of our encoder and decoder for SM-MNIST and Weizmann Action, and choose a multilayer perceptron (MLP) for Human3.6M. The hyper-parameters for the decoder, encoder,, and for each dataset are listed below:
For the networks we set , ; one-layer, hidden units for , one-layer, hidden units for , two-layer, hidden units for .
We use , ; one-layer, hidden units for , one-layer, hidden units for , two-layer, hidden units for .
The networks have , ; one-layer, hidden units for , one-layer, hidden units for , two-layer, hidden units for . The encoder MLP consists of 2 residual layers with hidden size of , followed by one fully-connected layer and activated by function; the decoder MLP is the mirrored version of the encoder but without in the output layer.