A Two-Stream Variational Adversarial Network for Video Generation

12/03/2018 ∙ by Ximeng Sun, et al. ∙ Boston University ∙ berkeley college

Video generation is an inherently challenging task, as it requires the model to generate realistic content and motion simultaneously. Existing methods generate both motion and content together using a single generator network, but this approach may fail on complex videos. In this paper, we propose a two-stream video generation model that separates content and motion generation into two parallel generators, called Two-Stream Variational Adversarial Network (TwoStreamVAN). Our model outputs a realistic video given an input action label by progressively generating and fusing motion and content features at multiple scales using adaptive motion kernels. In addition, to better evaluate video generation models, we design a new synthetic human action dataset to bridge the difficulty gap between over-complicated human action datasets and simple toy datasets. Our model significantly outperforms existing methods on the standard Weizmann Human Action and MUG Facial Expression datasets, as well as our new dataset.




1 Introduction

Despite great progress in static image generation using methods such as GANs and VAEs [5, 13, 24, 26, 28], pixel-level video generation has yet to achieve similarly impressive results. Some previous methods have succeeded on simplified toy data [35, 33, 39], but others have struggled on large, highly complex human action datasets, e.g. UCF101 [32]. The problem is very challenging due to the high dimensionality of the data and the need to effectively model both spatial content and temporal dynamics.

Content and motion are two complementary aspects of video composition. To generate a video of a certain action (see examples in Fig. 1), the model must not only re-create the appearance of both foreground and background in each frame, but also produce the action-relevant movement consistently across the entire sequence. In pixel-level video prediction [20, 33, 36] or generation, existing methods [37, 29, 33] directly extend image generation models, i.e. they generate the entire video or each frame from a unified spatio-temporal latent code without separating content and motion, which leads to unsatisfying results. MC-Net [36] and MoCoGAN [35] attempt to separate content and motion by introducing separate encoders or latent spaces, but they simply concatenate the motion and content codes as input to a single generator that predicts each frame. This requires the generator network to approximate an overly complicated function with both content and motion in one vector, instead of separating the decoding via two streams for more accurate modeling.

Figure 1: Our proposed Two-Stream Variational Adversarial Network (TwoStreamVAN) for action-conditional video generation. We separate content and motion generation into two parallel streams that take in two codes sampled from the content and motion latent distributions, respectively. Motion is combined with content via our multi-scale motion fusion mechanism. We show samples generated by TwoStreamVAN trained on our new large-scale Syn-Action dataset.

In this paper we propose a novel video generation approach, called Two-Stream Variational Adversarial Network (TwoStreamVAN), with two generators that decode the separate content and motion embeddings (Fig. 1). Rather than over-estimate the single generator’s ability, we introduce two parallel generators that process content and motion separately and fuse them together to predict the next frame. We design the fusion approach based on our intuition that motion can be represented as consistent refinements of content. Specifically, the motion between adjacent frames usually happens within a local window. Thus, we define the motion and content fusion as a refinement of each pixel’s value through adaptive convolutional kernels applied to its local neighborhood. Our design of the adaptive convolution kernels is inspired by single-scale fusion approaches for video frame interpolation [22, 23], but overcomes several of their limitations via a novel multi-scale approach.

A key advantage of our two-stream generator model is its ability to learn the parameters of each stream separately and thus more accurately and efficiently. We greatly reduce the high cost of learning the sequence’s behaviour by relying solely on the content stream for image reconstruction, rather than learning both streams with the video-level task. In addition, we can better fit each specific dataset by adjusting the sample ratio of content and motion according to their relative difficulty.

We evaluate our approach on two standard video generation benchmarks, Weizmann Human Action [9] and MUG Facial Expression [1]. To further test the abilities of the model, we construct a new dataset of more complex actions. Current video datasets are either too easy (e.g. Moving MNIST [33] and Shape Motion [39, 35]) or too complicated (e.g. UCF-101 [32] and HMDB-51 [17]) for testing state-of-the-art generation models. We propose a large synthetic human action dataset, called Syn-Action, which contains 6,000 unique videos with 10 different actors performing 20 different actions (e.g. running, squatting, etc.), created using a library of video game actions, Mixamo [6].

To summarize, we make the following contributions: We propose a model for generating videos conditioned on action classes which separates the content and motion generation into two streams; design a multi-scale motion fusion mechanism and a more efficient dual-task learning scheme; and create a large-scale synthetic video generation dataset of moderate difficulty available to the whole community. We evaluate our proposed model on three video datasets both quantitatively and qualitatively, and demonstrate performance superior to several strong baselines.

2 Related Work

Generative Models. VAEs [16, 11, 10, 40] and GANs [8, 24, 26, 5] are two conceptually different families of deep generative models for image generation. VAEs provide probabilistic descriptions of observations in latent spaces and can give interpretable latent variables, but they may generate blurry and unrealistic images in their vanilla form. GANs use an adversarial training paradigm in which a discriminator encourages the generation of crisper images. However, they suffer from both mode collapse [41] and unexpected bizarre artifacts. Larsen et al. [18] and Makhzani et al. [19] combine VAE and GAN training and propose a Variational Adversarial Network (VAN) to learn an interpretable latent space as well as to generate realistic images. In light of VAN’s success in image generation, we construct both content and motion streams with a VAN to solve the video-level generation problem.

Video Generation. Video generation is a challenging instance of unsupervised learning for video tasks. VideoGAN adopts two neural networks to generate foreground and background separately, under the assumption of a static background. However, the assumption does not hold in the general case. TGAN [29] removes this assumption and divides the generation into two steps: a temporal generator is used to generate correlated latent codes for all frames and an image generator is used to decode each code into a single frame. Instead of starting from a general spatiotemporal latent code as in VideoGAN and TGAN, MoCoGAN [35] enhances the performance by introducing separate content and motion latent codes. Apart from all GAN models above, VideoVAE [12] shows the VAE’s ability to produce video, proposing a structured latent space and an encoder-generator architecture to generate the video recurrently. Inspired by content-motion separation and construction of latent spaces in previous works, we are the first to separate content and motion in VAE latent spaces. We further introduce content and motion generators to model spatial content and temporal dynamics respectively. Note that, although TGAN contains two generators, our model is very different from theirs in two aspects. First, the two generators in our model are parallel, while those in TGAN are sequential. Second, TGAN adopts a single image generator to decode each frame from a spatiotemporal latent code, while we design two generators to solve spatial and temporal generation separately.

Multi-scale Motion Estimation and Prediction. Motions at multiple scales commonly appear in real-world videos. To estimate motion between adjacent frames, [34, 27, 14] build image pyramids and achieve good optical flow estimates in both supervised and unsupervised settings. Instead of first predicting the optical flow and then warping the current frame in generative tasks, Xue et al. [39] use a pure VAE method to generate motion kernels at multiple scales from the difference map of neighboring frames. However, their motion kernels are convolved with the entire feature map. When different motions happen in different areas of the same frame, it is hard to interpret the meaning of such general motion kernels for the overall image. Furthermore, [39] only predicts the next frame from the current one. With no need to infer the content, maintain motion consistency or minimize accumulated errors over the sequence, the task in [39] is less complicated than video generation. In this paper, we explicitly account for the spatial variance of motion and produce motion kernels specific to each location and scale. Also, we design our model to generate the video sequence without receiving any visual clue as input.

Figure 2: We propose a Two-Stream Variational Adversarial Network to generate the next frame from the current frame. At training time, the Content and Motion Encoders generate latent distributions by viewing the current frame and the difference map, respectively. From these latent distributions, we sample the content and motion latent vectors and employ a convLSTM to obtain a motion embedding that encodes the history of temporal dynamics. The Content and Motion Generators take in these vectors and then decode a content hidden layer and motion kernels along with a motion mask at each scale. The content hidden layers, motion kernels and motion masks serve as inputs to our multi-scale motion fusion (Fig. 3). The Image and Video Discriminators encourage the model to generate both realistic content and motion.

3 Approach

We define action-conditioned video generation as follows. Suppose we have a fixed set of action classes. For each action, consider a short video clip with a fixed number of frames, together with the set of all videos in this class. We seek a function that generates a plausible video of the given action from a latent vector.

We further separate content and motion, such that the latent vector consists of two independent codes, one for content and one for motion.

MoCoGAN and MC-Net separate content and motion in the embedding space, but then simply concatenate the two embeddings and force a single generator to learn an over-complicated decoding function that maps them to the full video. Instead of using a single generator, we decompose the content and motion generations via their dedicated generator subnets. We propose a novel Two-Stream Variational Adversarial Network (Fig. 2), containing two separate VAN streams (combining Conditional VAE [31] and Conditional GAN [21]) with interactions at several stages. Each VAN stream contains an encoder, a generator and a discriminator.

The Content VAN Stream consists of a Content Encoder, a Content Generator and an Image Discriminator. The Content Encoder approximates a conditional latent distribution of content by observing a single frame. The approximated latent distribution encodes the spatial information of that frame. The Content Generator decodes a content vector sampled from the content distribution into a frame. The Image Discriminator helps to generate a realistic single frame via GAN training.

Similarly, the Motion VAN Stream consists of a Motion Encoder, a Motion Generator and a Video Discriminator. The Motion Encoder approximates a conditional latent distribution of motion by observing the difference map between neighboring frames. The approximated latent distribution encodes the temporal changes in that difference map. A convLSTM [38] between the Motion Encoder and the Motion Generator produces the motion embedding from the sequence of motion vectors sampled from the approximated motion distributions at all previous time steps. Instead of decoding motion along with content in a single generator, we introduce the Motion Generator to generate adaptive convolution kernels and motion masks at all scales from the content and motion embeddings, and use them to refine the content hidden layers (see Sec. 3.1). The Video Discriminator encourages the Motion Generator to generate realistic motion for the given action through vanilla GAN training as well as auxiliary action classification [24].

3.1 Multi-scale Motion Fusion

Figure 3: Motion fusion for a single location on its local window at one scale, illustrated for one example kernel size. The content hidden layer comes from the Content Generator; the convolutional kernels and the motion mask come from the Motion Generator. We first recover the 2D kernel from its flattened form. Then we compute an intermediate content feature by convolving the recovered kernel with the local window. Finally, we update the content hidden layer with the intermediate feature, guided by the motion mask, to get the new hidden layer.

At any given pixel, the motion between adjacent frames usually happens within a local window. Video frame interpolation models [22, 23] proposed to fuse motion with the static image via an adaptive convolution. They refine the pixel value to its value in the next frame by convolving a large patch-wise kernel with the local patch centered at that pixel. Adopting a single fusion at the full resolution of the image, they set the kernel size proportionally to the largest motion in the dataset, to cover all motions. This approach has three drawbacks: 1) it is computationally expensive due to the huge number of parameters in large kernels; 2) it is ineffective for representing motion at all scales simultaneously; 3) it requires prior knowledge of the motion statistics of each dataset to set the kernel size.

To overcome the drawbacks above, we propose a new multi-scale fusion approach guided by motion masks. We apply adaptive convolutions not only to the image, but also to the content hidden layers at multiple scales in the Content Generator. At each scale, we only apply small kernels, independent of the dataset-specific largest motion, which significantly reduces our computational cost. The Motion Generator learns to separate motion of different scales at the corresponding hidden layers: large motions come from low resolution layers and small motions from high resolution layers. We also introduce a motion mask that defines the area where motion happens between neighboring frames. The Motion Generator learns to deactivate certain areas via these masks if no motion exists at that location and scale. This allows us to apply our fusion scheme at all resolutions without any prior knowledge about the motion statistics of the dataset.

To generate precise kernels and motion masks, the Motion Generator needs to know the details of both motion and content, especially around the area where the motion occurs. Hence, we compute an outer product of the content embedding and the motion embedding and provide it as input to the Motion Generator.
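As a rough illustration of what an outer product of two embeddings provides, the sketch below pairs every content feature with every motion feature. It assumes both embeddings are plain 1-D vectors with toy values; the function name `outer_product` is hypothetical and the actual embeddings in the model may be feature maps rather than flat vectors.

```python
def outer_product(content, motion):
    """Return the outer product of two 1-D embeddings as a nested list."""
    return [[c * m for m in motion] for c in content]


# Toy embeddings (hypothetical values, not taken from the paper).
content_embedding = [1.0, 2.0, 3.0]
motion_embedding = [0.5, -1.0]

joint = outer_product(content_embedding, motion_embedding)
# joint[i][j] == content_embedding[i] * motion_embedding[j],
# so every content feature interacts with every motion feature.
```

Compared with concatenation, which keeps the two embeddings side by side, the outer product exposes all pairwise interactions to the network that consumes it.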

We also assume that motion of different scales is captured in different channels of the motion feature map, and thus use only a fraction of the channels of the motion feature map to compute the kernels and the mask at the current scale. The other channels are reserved for generating motion features at larger resolutions.

Let n be the kernel size of the adaptive convolutions. At each scale, the Motion Generator generates adaptive convolution kernels to refine the content hidden layer of that scale. For each location, we first recover a 2D convolution kernel from its flattened form and then convolve it with the local window around that location to produce an intermediate content representation.

In adaptive convolutions, the number of parameters the Motion Generator needs to generate grows quadratically with the kernel size n. Instead of setting n proportional to the biggest motion in the dataset, we propose to use small kernels at all scales, to capture both large and small motions. Therefore, multi-scale motion fusion greatly reduces computational costs by leveraging small kernels, and in our experiments we show its effectiveness for capturing motion at different scales in each video.
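The per-location adaptive convolution described above can be sketched in a few lines. This is a minimal single-channel illustration, not the authors' implementation: the function names, the zero padding at the borders and the row-major flattening order are all assumptions made for the example.

```python
def adaptive_convolution(feature, kernels, n):
    """Apply a different n x n kernel at every location of a 2-D feature map.

    feature: H x W nested list (a single channel, for brevity).
    kernels: H x W nested list; kernels[i][j] is a flat list of n * n weights.
    Neighbours outside the map count as zero (zero padding, an assumption).
    """
    height, width = len(feature), len(feature[0])
    radius = n // 2
    out = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            flat = kernels[i][j]
            total = 0.0
            for di in range(n):
                for dj in range(n):
                    y, x = i + di - radius, j + dj - radius
                    if 0 <= y < height and 0 <= x < width:
                        # flat[di * n + dj] is entry (di, dj) of the 2-D kernel
                        # recovered from its flattened, row-major form.
                        total += flat[di * n + dj] * feature[y][x]
            out[i][j] = total
    return out


def kernel_parameters_per_location(n):
    """Number of weights that must be generated for one location."""
    return n * n


feature = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
# A kernel with a single 1 left of the centre copies the left neighbour,
# which moves the content one pixel to the right.
shift_right = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
kernels = [[shift_right for _ in range(3)] for _ in range(3)]
shifted = adaptive_convolution(feature, kernels, 3)
```

Under this per-location count, a 17 x 17 kernel needs 289 weights at each location while a 5 x 5 kernel needs 25, which is the kind of saving that small kernels are meant to deliver.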

The motion mask is generated along with the adaptive convolution kernels. We update the content feature map with the intermediate content representation, guided by the motion mask, to obtain the new content map.

To preserve content that stays unchanged between neighboring frames, our approach only needs to deactivate the mask at the corresponding locations. This relaxes the requirement for the Motion Generator to learn a kernel with only 1 in the center and 0s in other entries, which results in faster and better convergence. As shown in Sec. 4.5.2, masks for small resolutions are deactivated in small-motion areas. Hence, we simply apply our proposed fusion to all scales during generation without considering the dataset’s motion statistics.
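One plausible form of the mask-guided update is a per-location convex blend between the old content feature and the refined one. The sketch below assumes exactly that form and assumes mask entries in the range 0 to 1; the update equation did not survive in this text, so both the formula and the name `masked_update` are illustrative assumptions rather than the paper's definition.

```python
def masked_update(content, refined, mask):
    """Blend refined features into the content map under a motion mask.

    Assumed form, per location:
        new = mask * refined + (1 - mask) * content
    A mask value of 0 keeps the old content; a value of 1 takes the refined one.
    """
    return [
        [m * r + (1.0 - m) * c for c, r, m in zip(c_row, r_row, m_row)]
        for c_row, r_row, m_row in zip(content, refined, mask)
    ]


content = [[1.0, 2.0]]   # toy content hidden layer
refined = [[3.0, 4.0]]   # toy output of the adaptive convolution
mask = [[0.0, 1.0]]      # first location static, second location moving

updated = masked_update(content, refined, mask)
```

With this form, preserving a static pixel only requires a zero in the mask, which matches the observation that the generator no longer has to learn an identity kernel for such locations.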

3.2 Learning

We introduce an alternating dual-task learning scheme. Specifically, the Content Stream is learned via image reconstruction, while the Motion Stream is learned via video prediction. We alternate training, such that each stream is trained while the other is fixed.
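The alternation can be pictured as a simple schedule that decides which stream receives the next update. The sketch below is a hypothetical block-wise schedule: the function name, the block structure and the `content_steps` / `motion_steps` parameters are assumptions for illustration, chosen so that the ratio between content and motion updates can be tuned per dataset.

```python
def training_schedule(num_steps, content_steps, motion_steps):
    """Yield 'content' or 'motion' for each training step.

    The streams alternate in blocks: content_steps updates of the Content
    Stream, then motion_steps updates of the Motion Stream, repeated.
    While one stream is updated, the other is assumed to be frozen.
    """
    if content_steps + motion_steps <= 0:
        raise ValueError("at least one stream must receive updates")
    cycle = ["content"] * content_steps + ["motion"] * motion_steps
    for step in range(num_steps):
        yield cycle[step % len(cycle)]


# One content update followed by two motion updates, repeated.
plan = list(training_schedule(6, content_steps=1, motion_steps=2))
```

A dataset with harder motion than content could use a larger `motion_steps` value, and the reverse for a dataset with complex appearance but simple motion.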

Content Learning.

The Content Stream solely focuses on reconstructing the current frame without modeling motion. Therefore, image-level reconstruction is adequate to train the entire Content Stream and is also computationally cheaper than learning it together with the Motion Stream via the video-level task. Furthermore, if videos have high complexity, the Content Stream can be more easily pre-trained than the non-separable content-motion learning done via a single generator.

The Image Discriminator learns to distinguish real images from fake images. Larsen et al. [18] observe that discriminating based on samples from the approximated latent distribution, in addition to samples solely from the true latent distribution, gives better results, since samples from the approximated distribution look more realistic. Thus, there are two kinds of negative examples: one generated from a sample of the approximated latent distribution and the other generated from a sample of the true latent distribution.

For each update, we define a content GAN loss, a content VAE loss and an overall content loss, and use the overall loss to update the Content Encoder and the Content Generator.

Meanwhile, we use the GAN loss to update the Image Discriminator.

Motion Learning.

We train the Motion Stream to predict 10 frames after observing the first frame. During training, the Motion Stream reconstructs the whole sequence recurrently. At every time step, it reconstructs the current frame from the previous frame.

Similarly to Content Learning, we have two different kinds of fake videos, generated from samples of the approximated and the true latent distributions. However, since we refine the content hidden layer at each scale, we also introduce a loss between the refined content hidden layer of the previous frame and the content hidden layer of the current frame. In addition to vanilla GAN training, the Video Discriminator predicts the action category with an auxiliary classifier [24].


In this task, we define a motion GAN loss, a motion VAE loss and an overall loss, and use the overall loss to update the Motion Encoder and the Motion Generator.

Meanwhile, we use the GAN loss to update the Video Discriminator.


To stabilize the learning as in Larsen et al. [18], GAN loss is only back-propagated to the generators and not to the encoders in both Content and Motion Learning.

We provide implementation details in the supplementary material.

3.3 Generating a Video at Test Time

While training relies on observing the ground truth, at test time generating a video of any desired length begins with sampling in the latent space. We generate the first frame from a randomly sampled content vector, and then generate each following frame from the content embedding of the last frame together with a current motion embedding computed recurrently by the convLSTM. To produce the motion embedding at each time step, the convLSTM updates its hidden state with the embedding of the last difference map computed by the Motion Encoder (if available) and an extra motion vector. Additionally, our method easily adapts to generating videos from a specified starting frame, by replacing the sampled content vector with the content embedding of that frame computed by the Content Encoder.

4 Experiments

4.1 Datasets

4.1.1 Syn-Action Dataset

To compensate for the deficiencies of over-complicated large human action datasets (e.g. UCF-101 [32] and HMDB-51 [17]) and simple toy synthetic datasets (e.g. Moving MNIST [33] and Shape Motion [39, 35]), we build a large synthetic human action dataset of moderate difficulty, the Syn-Action Dataset, specifically for the video generation task. Using the Unity game engine, we create 6000 unique videos with 10 actors performing 20 different actions.

Every synthetic action is akin to real human actions but easy to distinguish from other actions. For every action class, we pick 2 unique synthetic action models from Mixamo [6]. We then apply synthetic action models on 10 different characters. To further increase the diversity of the dataset, we use 5 different backgrounds and 3 different recording viewpoints, i.e. the left, right and the frontal view of the actor. With a single actor performing different recognizable actions (kicking, hooking, etc.) in the scene, Syn-Action achieves appropriate complexity to examine a model’s ability to generate realistic content and motion.
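The factors listed above multiply out to the stated dataset size. The short check below assumes that every combination of action, action model, character, background and viewpoint is rendered exactly once; the variable names are illustrative only.

```python
from itertools import product

actions = range(20)             # 20 action classes
models_per_action = range(2)    # 2 synthetic action models per class
characters = range(10)          # 10 characters (actors)
backgrounds = range(5)          # 5 backgrounds
viewpoints = ("left", "right", "frontal")

# One video per combination of all five factors.
videos = list(product(actions, models_per_action, characters, backgrounds, viewpoints))
print(len(videos))  # 6000
```

Under that assumption, each action class contributes 2 x 10 x 5 x 3 = 300 videos, for 6000 in total.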

We provide each video with four different annotations: actor identity, action class, background and viewpoint. However, we only use the action class to generate videos.

                      Weizmann (# action = 10)           MUG (# action = 6)                 Syn-Action (# action = 20)
Method                Inter-Entropy  Intra-Entropy  IS   Inter-Entropy  Intra-Entropy  IS   Inter-Entropy  Intra-Entropy  IS
MoCoGAN [35]          4.38   0.31   58.52                1.78   0.17   5.03                 2.90   0.36   12.75
SGVAN                 4.34   0.04   73.73                1.79   0.13   5.29                 2.98   0.18   16.43
TwoStreamVAN (C)      4.36   0.12   69.83                1.79   0.10   5.42                 2.97   0.17   16.53
TwoStreamVAN (M)      4.31   0.29   55.99                1.79   0.11   5.32                 2.97   0.15   16.79
TwoStreamVAN          4.40   0.05   77.11                1.79   0.09   5.48                 2.99   0.09   18.27
VideoVAE * [12]       4.37   0.11   70.10                -      -      -                    -      -      -
TwoStreamVAN *        4.45   0.02   83.74                1.77   0.04   5.65                 2.99   0.06   18.86
Exp Bound             4.50   0.01   88.94                1.79   0.01   5.91                 3.00   0.01   19.85
Math Bound            4.50   0.00   90.00                1.79   0.00   6.00                 3.00   0.00   20.00
Table 1: Quantitative Results on Weizmann, MUG and Syn-Action Datasets. * indicates the video starts with a specific frame. For Acc, Inter-Entropy and IS, the higher value is better; while for Intra-Entropy, the lower value is better. Compared with all baselines, our TwoStreamVAN model achieves the best results on all metrics.

4.1.2 Standard Datasets

In addition to our proposed Syn-Action Dataset, we evaluate our model on two other standard datasets: Weizmann Human Action [9] and MUG Facial Expression [1]. The Weizmann Human Action Dataset contains 90 videos of 9 actors performing 10 different actions. MUG Facial Expression Dataset [1] contains 3528 videos with 52 actors performing 6 different facial expressions.

With these three datasets, we cover a large range of motion, from large human actions (e.g. running, jumping) to subtle facial expressions (e.g. happiness, disgust) and include both periodic and non-periodic motion.

4.2 Evaluation Metrics

Quantitative evaluation of generative models remains a challenging problem, and there is no consensus on the measurement which best evaluates the realism and diversity of the generated results. Thus, instead of relying on a single measurement, we utilize four different metrics to examine both the realism and diversity of generated videos: Classification Accuracy (Acc), Inception Score (IS) [30], Inter-Entropy H(y) [12] and Intra-Entropy H(y|v) [12], where v is the video under evaluation and y is the action predicted by a classifier. All these metrics utilize a pre-trained classifier for evaluation. Because there is no universal classifier available for all video datasets, we train a classifier separately on each dataset. We show the classifier’s performance by computing the same metrics on each test set, which only consists of real videos. We call these values the Experimental Bound, in addition to the Mathematical Bound. To make a fair comparison, we compute metrics on 10-frame video clips generated by each model.
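The three entropy-based metrics can be computed directly from the classifier's per-video class probabilities. The sketch below assumes the usual definitions: Inter-Entropy is the entropy of the marginal class distribution over all evaluated videos, Intra-Entropy is the average entropy of the per-video prediction, and the Inception Score is the exponential of their difference (equivalently, of the mean KL divergence between per-video and marginal distributions). The function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def evaluate(predictions):
    """Compute (Inter-Entropy, Intra-Entropy, Inception Score).

    predictions: one list of class probabilities per video.
    """
    num_videos = len(predictions)
    num_classes = len(predictions[0])
    # Marginal class distribution over all evaluated videos.
    marginal = [sum(p[k] for p in predictions) / num_videos
                for k in range(num_classes)]
    inter_entropy = entropy(marginal)                                   # diversity
    intra_entropy = sum(entropy(p) for p in predictions) / num_videos   # confidence
    inception_score = math.exp(inter_entropy - intra_entropy)
    return inter_entropy, intra_entropy, inception_score


# Ideal case with 6 classes: every class appears equally often and every
# prediction is fully confident (one-hot).
num_classes = 6
one_hot = [[1.0 if k == c else 0.0 for k in range(num_classes)]
           for c in range(num_classes)]
inter, intra, score = evaluate(one_hot)
```

In this ideal case the metrics reach the natural logarithm of the class count, zero, and the class count respectively, which is consistent with the Mathematical Bound row of Table 1 for 6, 20 and 90 classes.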

4.3 Baselines

We compare against several existing works to show our model’s superiority in generating videos of the given action.

For existing works, we compare to MoCoGAN [35] (we use the categorical MoCoGAN implemented by its authors) and VideoVAE [12] (because VideoVAE’s implementation is not public, we only compare with the quantitative results on the Weizmann Dataset reported in the paper), which are the current state-of-the-art.

We also design several ablated variants of TwoStreamVAN to examine key components of our model:

SGVAN adopts a single generator to generate a single frame from the concatenation of content and motion vectors. Other parts are the same as in TwoStreamVAN. This comparison explores the benefit of our parallel Content and Motion Generators.

TwoStreamVAN(C) removes the content code from the Motion Generator’s inputs. It helps to check the necessity of providing a spatial embedding to the motion generation.

TwoStreamVAN(M) applies the motion fusion to content hidden layers at multiple scales without the guidance of motion masks. This comparison helps us to examine the effectiveness of motion masks.

4.4 Results

Quantitative Results.

We compute the quantitative metrics of all baselines and our TwoStreamVAN (see Table 1) on Weizmann, MUG and Syn-Action Datasets. We train a normal action classifier on MUG and Syn-Action Datasets, and train a classifier to distinguish each actor-action pair on Weizmann to compare with VideoVAE. However, we still report the accuracy of action classification since the video generation is only conditioned on the given action. Note that classifiers are not shared among different datasets. Thus, comparisons among datasets are meaningless.

From the action accuracy, we observe that MoCoGAN almost lost control of the given action and did not generate the correct action in most videos, while our TwoStreamVAN achieves high accuracy on all datasets. TwoStreamVAN also improves on MoCoGAN’s Inception Score on all three datasets: from 58.52 to 77.11 on Weizmann, from 5.03 to 5.48 on MUG and from 12.75 to 18.27 on Syn-Action (Table 1). Meanwhile, TwoStreamVAN achieves both higher Inter-Entropy and lower Intra-Entropy, indicating that TwoStreamVAN generates more diverse and more realistic videos than MoCoGAN. Compared with VideoVAE on Weizmann, our model also achieves better IS, Inter-Entropy and Intra-Entropy.

Compared to the SGVAN, TwoStreamVAN(C) and TwoStreamVAN(M), our model pushes all metrics even closer to their bounds. These results reveal that our full model benefits from all key components in our design, namely the two parallel generators, the content input provided to the motion generator and the guidance of the motion mask in fusion.

(a) Visualization of videos generated by Our Model
(b) Randomly sampled frames from different models
Figure 4: We provide five generated videos of TwoStreamVAN on Weizmann Human Action, MUG Facial Expression and our new Syn-Action datasets. Our model generates realistic and correct motions of the given action. We further randomly sample 16 frames from generated videos of TwoStreamVAN and MoCoGAN. TwoStreamVAN results in better content quality with fewer distortions in every single frame. More qualitative results are presented in the supplementary material.
Qualitative Results.

We visualize videos generated by our TwoStreamVAN model. For each dataset, we provide 3 generated videos conditioned on the given action class and 2 generated videos starting from a specified frame (Fig. 4(a)). Our TwoStreamVAN model succeeds in generating correct motions for different given actions.

To evaluate the quality of content generation, we randomly sample 16 generated frames from TwoStreamVAN and MoCoGAN’s results respectively (Fig. 4(b)). Even though MoCoGAN generated crisp frames, it suffered from severe distortions or bizarre artifacts in the content generation across all three datasets. In comparison, TwoStreamVAN yields more realistic content generation.

User Study on Syn-Action.

The Syn-Action Dataset requires video generation models to handle more diverse human action videos. To further test the ability of TwoStreamVAN and MoCoGAN, we conduct a user study via AMTurk [4]. We first ask users to choose the better-looking one (AB Test) from a pair of videos generated by two models (2000 pairs in total). Then we tell users the target action and let them choose again. For these two questions, we compute the mean answer and bootstrap the standard deviation of user preference for the two models (Table 2). A majority of users prefer TwoStreamVAN in both cases, indicating that our model generates visually satisfying videos.
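Bootstrapping the standard deviation of a preference rate can be sketched as follows. The votes are encoded as 1 when a user prefers one model and 0 otherwise; the function name, the number of resamples and the toy vote counts are all hypothetical, since the study's raw answers are not given here.

```python
import random
import statistics


def bootstrap_std(votes, num_resamples=1000, seed=0):
    """Estimate the standard deviation of the mean preference by bootstrap.

    Each resample draws len(votes) answers with replacement and records
    the mean; the spread of those means is returned.
    """
    rng = random.Random(seed)
    n = len(votes)
    means = []
    for _ in range(num_resamples):
        resample = [votes[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    return statistics.pstdev(means)


# Toy data: 70 of 100 hypothetical answers prefer the first model.
votes = [1] * 70 + [0] * 30
mean_preference = sum(votes) / len(votes)
spread = bootstrap_std(votes)
```

The reported figure for each question would then be the mean preference together with this bootstrap spread.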

Tasks Better Looking Better Looking — Action
Table 2: User Study on Syn-Action Dataset. A majority of users prefer TwoStreamVAN in both tasks.

4.5 Ablation Studies

4.5.1 Multi-scale vs. Single-scale Motion Fusion

To examine the effectiveness of multi-scale motion fusion, we train four TwoStreamVAN models on the Weizmann Human Action Dataset, in which we apply fusion at 1, 2, 3 and 4 scales respectively. We add fusion scales from the highest resolution to the lowest resolution. In each fusion, we implement motion kernels with the fixed size n = 5. Moreover, we train a model in which we apply large motion kernels with n = 17 to the output image of the Content Stream, to imitate the single fusion used in video frame interpolation works [22, 23].

In Table 3, it is not surprising that the performance drops when we reduce n from 17 to 5 with a single fusion on the image. As we increase the number of fusion scales, the IS, Inter-Entropy and Intra-Entropy recover and finally outperform those of the model with a single large fusion on the full resolution image.

To further analyze the multi-scale fusion, we measure the Intra-Entropy of large motions, e.g. running (Table 3). The more scales the fusion is applied at, the lower the Intra-Entropy is, indicating that more realistic large motions are generated. We pick similar videos (Fig. 5) generated by different models and zoom in on the actor’s legs, where the large motion happens. When we only apply the fusion with kernel size 5 at the highest 2 resolutions, the model fails to handle the large motion around the legs and produces blobs. After increasing the number of fusion layers, TwoStreamVAN generates an even sharper outline than the model using the single large fusion.

scales   n    IS      Inter-Entropy   Intra-Entropy   Intra-Entropy (run)
1        17   75.89   4.38            0.05            0.104
1        5    74.95   4.39            0.07            0.166
2        5    74.57   4.38            0.07            0.096
3        5    77.83   4.40            0.05            0.062
4        5    77.11   4.40            0.05            0.059
Table 3: Quantitative results of models varying in the kernel size and the number of scales at which fusion is applied. For single fusion, the performance drops when the kernel size is reduced. After applying fusion at multiple scales, the performance recovers and finally outperforms the single fusion with large kernels.
Figure 5: Qualitative comparison among similar sequences generated by single fusion with large kernels and multi-scale motion fusion with small kernels. When we apply fusion at 3 or 4 scales, our model generates the sharpest and clearest outlines of the runner’s leg among all models.
Figure 6: Qualitative comparison of TwoStreamVAN(M) and TwoStreamVAN. Motion masks improve the generation in small-motion areas (the background for Weizmann and Syn-Action, and the area around the eyes for MUG).
Figure 7: We overlay the motion mask at each scale on the current frame. Motion masks at lower scales are only activated in large-motion areas, while a large area of the motion mask at the highest scale is activated to handle small changes between neighboring frames.

4.5.2 Visualization of Motion Masks

We have already shown that motion masks boost the quantitative performance of TwoStreamVAN (see Sec. 4.4). To examine how motion masks help, we visualize randomly generated frames from TwoStreamVAN(M) and TwoStreamVAN (Fig. 6). We observe that TwoStreamVAN(M) does a worse job in small-motion areas. On the Weizmann Human Action and Syn-Action datasets, it corrupts background patterns. On the MUG Facial Expression Dataset, it generates unexpected brick patterns around the eyes. With the help of motion masks, our full TwoStreamVAN does not suffer from these problems. This observation is consistent with our claim (in Sec. 3.1) that motion masks help to preserve static pixel values during generation.

To show the correlation between activated areas on the masks and the actual motion locations and scales, we overlay the motion mask at each scale on the current frame (Fig. 7). At low resolutions, the mask is only activated in large-motion areas, e.g. the torso for bending, the arms and legs for jumping jacks, and the arms for waving. At the highest resolution, the activation also covers the background area to handle small changes, e.g. lighting changes or small camera movements. Because small-motion areas are deactivated in the lower-resolution motion masks, we can safely apply motion fusion at all scales.

5 Conclusion

In this paper, we propose a novel Two-Stream Variational Adversarial Network to tackle the conditional video generation problem. We decompose content and motion into dedicated VAN streams, which yields a more efficient dual-task learning mechanism. Although content and motion are generated separately, we fuse motion with content at multiple scales via adaptive convolutions guided by motion masks. To better evaluate video generation models, we build a large synthetic human action dataset. In experiments, our model achieves the best quantitative and qualitative results among current state-of-the-art methods and several variants of our model across three different datasets. Furthermore, our in-depth analysis reveals that multi-scale fusion outperforms single-scale fusion and that motion masks stabilize small-motion generation and enhance the overall performance.


References

  • [1] N. Aifanti, C. Papachristou, and A. Delopoulos. The MUG facial expression database. In Image analysis for multimedia interactive services (WIAMIS), 2010 11th international workshop on, pages 1–4. IEEE, 2010.
  • [2] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179, 2015.
  • [3] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48. ACM, 2009.
  • [4] M. Buhrmester, T. Kwang, and S. D. Gosling. Amazon’s mechanical turk: A new source of inexpensive, yet high-quality, data? Perspectives on psychological science, 6(1):3–5, 2011.
  • [5] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pages 2172–2180, 2016.
  • [6] S. Corazza and E. Gambaretto. Real time generation of animation-ready 3d character models, Feb. 25 2014. US Patent 8,659,596.
  • [7] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256, 2010.
  • [8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [9] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. Transactions on Pattern Analysis and Machine Intelligence, 29(12):2247–2253, December 2007.
  • [10] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra. Draw: A recurrent neural network for image generation. International Conference on Machine Learning, 2015.
  • [11] I. Gulrajani, K. Kumar, F. Ahmed, A. A. Taiga, F. Visin, D. Vazquez, and A. Courville. Pixelvae: A latent variable model for natural images. International Conference on Learning Representations, 2017.
  • [12] J. He, A. Lehrmann, J. Marino, G. Mori, and L. Sigal. Probabilistic video generation using holistic attribute control. European Conference on Computer Vision, 2018.
  • [13] S. Hong, D. Yang, J. Choi, and H. Lee. Inferring semantic layout for hierarchical text-to-image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7986–7994, 2018.
  • [14] J. J. Yu, A. W. Harley, and K. G. Derpanis. Back to basics: Unsupervised learning of optical flow via brightness constancy and motion smoothness. In European Conference on Computer Vision, pages 3–10. Springer, 2016.
  • [15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015.
  • [16] D. P. Kingma and M. Welling. Auto-encoding variational bayes. International Conference on Learning Representations, 2014.
  • [17] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: a large video database for human motion recognition. In Proceedings of the International Conference on Computer Vision (ICCV), 2011.
  • [18] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. International Conference on Machine Learning, 2016.
  • [19] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. International Conference on Learning Representations, 2016.
  • [20] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. International Conference on Learning Representations, 2016.
  • [21] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
  • [22] S. Niklaus, L. Mai, and F. Liu. Video frame interpolation via adaptive convolution. In IEEE Conference on Computer Vision and Pattern Recognition, volume 1, page 3, 2017.
  • [23] S. Niklaus, L. Mai, and F. Liu. Video frame interpolation via adaptive separable convolution. In IEEE International Conference on Computer Vision, 2017.
  • [24] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier GANs. In Proceedings of the 34th International Conference on Machine Learning, pages 2642–2651, 2017.
  • [25] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.
  • [26] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. International Conference on Learning Representations, 2016.
  • [27] A. Ranjan and M. J. Black. Optical flow estimation using a spatial pyramid network. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, page 2. IEEE, 2017.
  • [28] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text-to-image synthesis. In Proceedings of The 33rd International Conference on Machine Learning, 2016.
  • [29] M. Saito, E. Matsumoto, and S. Saito. Temporal generative adversarial nets with singular value clipping. In IEEE International Conference on Computer Vision (ICCV), volume 2, page 5, 2017.
  • [30] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
  • [31] K. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491, 2015.
  • [32] K. Soomro, A. R. Zamir, and M. Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
  • [33] N. Srivastava, E. Mansimov, and R. Salakhudinov. Unsupervised learning of video representations using lstms. In International conference on machine learning, pages 843–852, 2015.
  • [34] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz. Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8934–8943, 2018.
  • [35] S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz. Mocogan: Decomposing motion and content for video generation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [36] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee. Decomposing motion and content for natural video sequence prediction. ICLR, 2017.
  • [37] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In Advances In Neural Information Processing Systems, pages 613–621, 2016.
  • [38] S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting. In Advances in neural information processing systems, pages 802–810, 2015.
  • [39] T. Xue, J. Wu, K. Bouman, and B. Freeman. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In Advances in Neural Information Processing Systems, pages 91–99, 2016.
  • [40] X. Yan, J. Yang, K. Sohn, and H. Lee. Attribute2image: Conditional image generation from visual attributes. In European Conference on Computer Vision, pages 776–791. Springer, 2016.
  • [41] Z. Zhou, H. Cai, S. Rong, Y. Song, K. Ren, W. Zhang, Y. Yu, and J. Wang. Activation maximization generative adversarial nets. International Conference on Learning Representations, 2018.


A   Implementation Details

We implement our model using PyTorch [25]. We use Xavier initialization [7] for each layer and use the Adam optimizer [15] with initial learning rate , first decay rate and second decay rate . We train our model for a total of 500K iterations with batch size 16, which takes 2 days on a TITAN V GPU, and the ratio between Content Learning and Motion Learning iterations is 3:2. To generate the more complicated content in the Syn-Action Dataset, we pre-train the Content Stream for 300K iterations on the image reconstruction task. At test time, we warm up the network for two time steps before generating videos on each dataset.

Inspired by the curriculum learning approach [3] and the scheduled sampling mechanism [2], we design the motion learning as follows. At a very early stage, we introduce a very simple learning task, where the Motion Stream is trained to predict the next frame solely from the current frame without modeling the history. This task is gradually replaced by the sequence training task using the scheduled sampling strategy, such that at the beginning the model is trained for one-step prediction given the entire history, while by the end of training the model is fully auto-regressive.
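The scheduled sampling part of this curriculum can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the linear decay shape, and the warm-up and decay lengths are all assumptions chosen for the example (only the 500K total iterations comes from the text above).

```python
import random


def ground_truth_prob(step, warmup_steps, decay_steps):
    """Probability of feeding a ground-truth frame at training step `step`.

    Stays at 1.0 during warm-up (one-step prediction from real history),
    then decays to 0.0 (fully auto-regressive). The linear decay is an
    assumption made for this sketch.
    """
    if step < warmup_steps:
        return 1.0
    progress = (step - warmup_steps) / decay_steps
    return max(0.0, 1.0 - progress)


def mix_history(ground_truth, generated, prob, rng):
    """Per time step, keep the real frame with probability `prob`,
    otherwise use the frame the model generated itself."""
    return [gt if rng.random() < prob else gen
            for gt, gen in zip(ground_truth, generated)]


rng = random.Random(0)
# Hypothetical schedule: 100K warm-up steps, then decay over 400K steps.
for step in (0, 100_000, 300_000, 500_000):
    p = ground_truth_prob(step, warmup_steps=100_000, decay_steps=400_000)
    history = mix_history(["gt0", "gt1", "gt2"], ["gen0", "gen1", "gen2"], p, rng)
    print(step, round(p, 2), history)
```

Sampling the choice independently per time step follows the general idea of scheduled sampling [2]; other decay shapes (e.g. exponential) would fit the same interface.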

We provide the hyper-parameters of the model and loss functions for the Weizmann Human Action [9], MUG Facial Expression [1] and Syn-Action Datasets (Table 4). Since we adopt the scheduled sampling mechanism in the Motion Learning, we slowly increase along with the process of the scheduled sampling, to restrict the KL divergence between the approximated latent distribution and the real latent distribution within a reasonable range. This helps to stabilize the motion sampling at test time. Our implementation will be available.

B   Details of Experimental Setup

B.1   Data Splitting and Pre-processing

We use three datasets to evaluate our model and other baselines: Weizmann Human Action [9], MUG Facial Expression [1] and Syn-Action datasets.

Weizmann Human Action. Following [12], we use the first for the training and save the last for the test.

MUG Facial Expression. We use of the entire dataset for the training and save for the test.

Syn-Action Dataset. We use of the whole dataset for the training and save for the test.

On all datasets, we crop the video centered at the actor or the face. To augment data, we further crop the video with a random small offset before down-sampling each frame to at each iteration. We adjust the frame sampling rate based on action types to make motion observable between adjacent frames.
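The crop-with-offset augmentation described above might look like the following sketch. The helper names, the frame and crop sizes, the maximum offset, and the nearest-neighbour down-sampling are all assumptions made for illustration; the text does not specify them.

```python
import random


def jittered_box(center, crop, max_offset, height, width, rng):
    """Top-left corner of a square crop around `center` (row, col),
    shifted by a random small offset and clamped to stay inside the frame."""
    row = center[0] + rng.randint(-max_offset, max_offset) - crop // 2
    col = center[1] + rng.randint(-max_offset, max_offset) - crop // 2
    row = min(max(row, 0), height - crop)
    col = min(max(col, 0), width - crop)
    return row, col


def crop_and_downsample(clip, top, left, crop, out_size):
    """Crop every frame with the same box, then down-sample each crop to
    out_size x out_size by nearest-neighbour sampling (an assumption)."""
    idx = [(i * crop) // out_size for i in range(out_size)]
    return [[[frame[top + r][left + c] for c in idx] for r in idx]
            for frame in clip]


rng = random.Random(0)
# Toy clip: 3 frames of 8x8 pixels, each pixel a distinct integer.
clip = [[[t * 100 + r * 8 + c for c in range(8)] for r in range(8)]
        for t in range(3)]
top, left = jittered_box(center=(4, 4), crop=4, max_offset=1,
                         height=8, width=8, rng=rng)
small = crop_and_downsample(clip, top, left, crop=4, out_size=2)
print(top, left, small[0])
```

Drawing one offset per clip, rather than per frame, keeps the crop window fixed over time so that the augmentation does not introduce artificial camera motion.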

B.2   Definition of Evaluation Metrics

Let $v$ be a generated video and $y$ be the label for $v$, which is assigned by the pre-trained classifier. We introduce the definitions of Classification Accuracy (Acc), Inter-Entropy $H(y)$, Intra-Entropy $H(y|v)$ and Inception Score (IS), and explain how they measure the diversity and realism of generative models.

Classification Accuracy (Acc) is the accuracy of action classification on the generated videos. Assuming that the classifier is nearly perfect, a higher classification accuracy indicates that the model generates more recognizable videos of the correct class.

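Inter-Entropy $H(y)$ is the entropy of the marginal distribution $p(y)$ obtained from all videos:

$$H(y) = -\sum_{y} p(y) \log p(y), \qquad p(y) = \frac{1}{N} \sum_{i=1}^{N} p(y|v_i),$$

where $p(y|v_i)$ is the class distribution that the classifier predicts for the $i$-th of the $N$ generated videos. If all classes are equally represented in the generated samples, $H(y)$ achieves its maximum value. Therefore, a higher $H(y)$ indicates that the model generates more diverse results.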

Intra-Entropy $H(y|v)$ is the entropy of the conditional class distribution $p(y|v)$ of a single video $v$:

$$H(y|v) = -\sum_{y} p(y|v) \log p(y|v).$$

The more confident the classifier is in its prediction, the lower $H(y|v)$ is, and thus the more realistic the video is. In this paper, we report the average $H(y|v)$ over all generated videos to evaluate their overall realism.

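Inception Score (IS) is widely adopted to evaluate generative models. For the video-level task, it measures the KL divergence between the conditional label distribution $p(y|v)$ and the marginal distribution $p(y)$:

$$\mathrm{IS} = \exp\left(\mathbb{E}_{v}\left[D_{KL}\big(p(y|v)\,\|\,p(y)\big)\right]\right).$$

Since $\mathbb{E}_{v}[D_{KL}(p(y|v)\,\|\,p(y))] = H(y) - \mathbb{E}_{v}[H(y|v)]$, the Inception Score favors a higher Inter-Entropy $H(y)$ and a lower Intra-Entropy $H(y|v)$. It therefore measures both the realism and the diversity of the generated videos.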
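The four metrics can be computed from the classifier outputs as in the sketch below. It is an illustration under stated assumptions, not the evaluation code used in the paper: the function names are hypothetical, natural logarithms are assumed, and the Inception Score is taken as the exponential of the Inter-Entropy minus the average Intra-Entropy.

```python
import math


def entropy(dist):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def evaluate(probs, targets):
    """probs[i][c]: classifier probability of class c for generated video i.
    targets[i]: the action label that video i was conditioned on."""
    n, k = len(probs), len(probs[0])
    predicted = [max(range(k), key=row.__getitem__) for row in probs]
    acc = sum(p == t for p, t in zip(predicted, targets)) / n
    # Marginal class distribution over all generated videos.
    marginal = [sum(row[c] for row in probs) / n for c in range(k)]
    inter = entropy(marginal)                         # Inter-Entropy
    intra = sum(entropy(row) for row in probs) / n    # average Intra-Entropy
    return {"acc": acc, "inter_entropy": inter, "intra_entropy": intra,
            "inception_score": math.exp(inter - intra)}


# Toy classifier outputs for three generated videos and two classes.
scores = evaluate([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]], targets=[0, 1, 1])
print(scores)
```

As a sanity check under this reading, and assuming the relevant columns of Table 3 hold the Inception Score, Inter-Entropy and Intra-Entropy, the rounded entries of its first row give exp(4.38 − 0.05) ≈ 75.9, close to the reported score of 75.89.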

Content Loss Motion Loss Model Arch
Weizmann 7 10 10 512 100
MUG 5 10 10 512 100
Syn-Action 7 10 10 1024 100
Table 4: Hyper-Parameters for Weizmann Human Action, MUG Facial Expression and Syn-Action Datasets. and are the dimensionality of content and motion latent spaces

C   More Visualization Results

In this section, we provide more qualitative visualizations of generated videos from TwoStreamVAN and MoCoGAN [35] on each dataset (Fig. 8 for Weizmann Human Action [9], Fig. 9 for MUG Facial Expression [1], and Figs. 10 and 11 for our Syn-Action Dataset). We provide 2 videos as examples for each given action class. We recommend that readers view the video version of this visual comparison (https://youtu.be/76JS7N5aMSw) in the supplementary material.

Figure 8: We provide 2 videos of one action class generated by TwoStreamVAN and MoCoGAN respectively on Weizmann Human Action Dataset.
Figure 9: We provide 2 videos of one action class generated by TwoStreamVAN and MoCoGAN respectively on MUG Facial Expression.
Figure 10: We provide 2 videos of one action class generated by TwoStreamVAN and MoCoGAN respectively on the first 10 classes of Syn-Action.
Figure 11: We provide 2 videos of one action class generated by TwoStreamVAN and MoCoGAN respectively on the second 10 classes of Syn-Action.