1 Introduction

¹ Work partially done during an internship with Adobe Research.
Modeling the dynamics of human motion — both facial and full-body motion — is a fundamental problem in computer vision, graphics, and machine intelligence, with applications ranging from virtual characters [1, 2] and video-based animation and editing [3, 4, 5] to human-robot interfaces [6]. Human motion is known to be highly structured and can be modeled as a sequence of atomic units that we refer to as motion modes. A motion mode captures the short-term temporal dynamics of a human action (e.g., smiling or walking), including its related stylistic attributes (e.g., how wide the smile is, how fast the walk is). Over the long term, a human action sequence can be segmented into a series of motion modes with transitions between them (e.g., a transition from a neutral expression to smiling to laughing). This structure is well known (referred to as basis motions [7] or walk cycles) and widely used in computer animation.
This paper leverages this structure to learn to generate human motion sequences, i.e., given a short human action sequence (the present motion mode), we want to synthesize the action going forward (the future motion mode). We hypothesize that (1) each motion mode can be represented as a low-dimensional feature vector, and (2) transitions between motion modes can be modeled as transformations of these features. As shown in Figure 1, we present a novel model termed Motion Transformation Variational Auto-Encoders (MT-VAE) for learning motion sequence generation. Our MT-VAE is implemented using an LSTM encoder-decoder that embeds each short sub-sequence into a feature vector that can be decoded to reconstruct the motion. We further assume that the transition between the current and future modes can be captured by a certain transformation. In this paper, we demonstrate that the proposed MT-VAE learns a motion feature representation in an unsupervised way.
A challenge with human motion is that it is inherently multimodal, i.e., the same initial motion mode could transition into different motion modes (e.g., a smile could transition to a frown, a smile while looking left, a wider smile, etc.). A deterministic model would not be able to learn these variations and may collapse to a single-mode distribution. Our MT-VAE supports stochastic sampling of the feature transformations to generate multiple plausible output motion modes from a single input. This allows us to model transitions that may be rare (or potentially absent) in the training set.
We demonstrate our approach on both facial and full human body motions. In both domains, we conduct extensive ablation studies and comparisons with previous work showing that our generation results are more plausible (i.e., better preserve the structure of human dynamics) and diverse (i.e., explore multiple motion modes). We further demonstrate applications like 1) analogy-based motion transfer (e.g., transferring the act of smiling from one pose to another pose) and 2) future video synthesis (i.e., generating multiple possible future videos given input frames with human motions). Our key contributions are summarized as follows:
We propose a generative motion model that consists of a sequence-level motion feature embedding and feature transformations, and show that it can be trained in an unsupervised manner.
We show that stochastically sampling the transformation space is able to generate future motion dynamics that are diverse and plausible.
We demonstrate applications of the learned model to challenging tasks like motion transfer and future video synthesis for both facial and human body motions.
2 Related Work
Understanding and modeling human motion dynamics has been a long-standing problem for decades [8, 9, 10]. Due to the high dimensionality of video data, early work mainly focused on learning hierarchical spatio-temporal representations for video event and action recognition [11, 12, 13]. In recent years, predicting and synthesizing motion dynamics using deep neural networks has become a popular research topic. Walker et al. [14] and Fischer et al. [15] learn to synthesize dense flow in the future from a single image. Walker et al. [16] extended the deterministic prediction framework by modeling flow uncertainty using variational auto-encoders. Chao et al. [17] proposed a recurrent neural network that generates the movement of 3D human joints from a single observation with an in-network 3D projection layer. Going one step further, Villegas et al. [18] and Walker et al. [19] explored hierarchical structure (e.g., 2D human joints) for future motion prediction using recurrent neural networks. Li et al. [20] proposed an auto-conditioned recurrent framework to generate long-term human motion dynamics through time. Besides human motion, face synthesis and editing is another active topic in vision and graphics. Methods for reenacting and interpolating face sequences in video have been developed [3, 21, 22, 23] based on a 3D morphable face representation [24]. Very recently, Suwajanakorn et al. [5] introduced a speech-driven face synthesis system that learns to generate lip motions with a recurrent neural network.
Besides the flow representation, motion synthesis has been explored in a broader context, namely video generation: synthesizing a future video sequence given one or more frames as initialization. Early works employed patch-based methods for short-term video generation using a mean squared error loss [25] or a perceptual loss [26]. Given an atomic action as an additional condition, previous works extended these with action-conditioned (i.e., rotation, location, etc.) architectures that enable better semantic control in video generation [27, 28, 29, 30]. Due to the difficulty of holistic video frame prediction, the idea of disentangling video factors into motion and content has been explored in [31, 32, 33, 34, 35, 36]. Video generation has also been approached with architectures that output a multinomial distribution over the possible pixel values for each pixel in the generated frame.
The notion of feature transformations has also been exploited for other tasks. Mikolov et al. showcased the additive compositional property of word vectors learned in an unsupervised way from language data; Kulkarni et al. and Reed et al. [40] suggested that additive transformations can be learned via reconstruction or prediction tasks from parallel paired image data. In the video domain, Wang et al. studied a transformation-aware representation for semantic human action classification; Zhou et al. investigated time-lapse video generation given additional class labels.
Multimodal conditional generation has recently been explored for images [43, 44], sketch drawings , natural language [46, 47], and video prediction [48, 49]. As noted in previous work, learning to generate diverse and plausible visual data is very challenging for the following reasons: first, mode collapse may occur without one-to-many pairs. Collecting sequence data where one-to-many pairs exist is non-trivial. Second, posterior collapse could happen when the generation model is based on a recurrent neural network.
3 Problem Formulation and Methods
We start by giving an overview of our problem. We are given a sequence of observations $X = x_{1:T} = (x_1, \ldots, x_T)$, where $x_t$ is a $D$-dimensional vector representing the observation at time $t$. These observations encode the structure of the moving object and can be represented in different ways, e.g., as keypoint locations or as shape and pose parameters. Changes in these observations encode the motion that we are interested in modeling. We refer to the entire sequence as a motion mode. Given a motion mode $X$, we aim to build a model that is capable of predicting a future motion mode $Y = x_{T+1:T+M}$, where $x_{T+m}$ represents the predicted $m$-th step in the future. We first start with a discussion of two potential baseline models that could be used for this task (Section 3.1), and then present our method (Section 3.2).
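Concretely, a motion mode is just a fixed-length window of the observation sequence. A minimal sketch of the past/future split (the window lengths and the 64-dim Human3.6M-style observations are illustrative choices):

```python
import numpy as np

def split_motion_modes(sequence, T, M):
    """Split a motion sequence into a current mode (first T steps)
    and a future mode (next M steps).

    sequence: array of shape (T + M, D), one D-dimensional
    observation (e.g., keypoints or shape/pose parameters) per step.
    """
    assert sequence.shape[0] >= T + M
    current = sequence[:T]        # observed motion mode X, shape (T, D)
    future = sequence[T:T + M]    # mode to be predicted Y, shape (M, D)
    return current, future

# Example: 48 frames of 64-dim keypoint vectors (32 joints x 2 coordinates).
seq = np.random.randn(48, 64)
cur, fut = split_motion_modes(seq, T=16, M=32)
```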
Prediction LSTM for Sequence Generation.
Figure 2(a) shows a simple encoder-decoder LSTM [50, 25] as a baseline for the motion prediction task. At time $t$, the encoder LSTM takes the motion $x_t$ as input and updates its internal representation. After going through the entire motion mode $X$, it outputs a fixed-length feature $e$ as an intermediate representation. We initialize the internal representation of the decoder LSTM using the computed feature $e$. At step $m$ of the decoding stage, the decoder LSTM predicts the motion $\hat{x}_{T+m}$. This way, the decoder LSTM gradually predicts the entire future motion mode within $M$ steps. We denote the encoder LSTM as a function $f_{\mathrm{enc}}$ and the decoder LSTM as a function $f_{\mathrm{dec}}$. As a design choice, we initialize the decoder LSTM with the last observed frame as an additional input for smoother prediction.
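The encoder-decoder data flow can be sketched with a toy recurrent cell standing in for the trained LSTM (random, untrained weights and toy dimensions; this illustrates the autoregressive decoding loop, not the actual model):

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 4, 8  # observation and hidden dimensions (toy sizes)

# Toy recurrent cell standing in for the LSTM (random, untrained weights).
W_h = rng.normal(scale=0.1, size=(H, H))
W_x = rng.normal(scale=0.1, size=(H, D))
W_o = rng.normal(scale=0.1, size=(D, H))

def cell(h, x):
    return np.tanh(W_h @ h + W_x @ x)

def encode(X):
    """Run the encoder over the current mode X (T, D); the final
    hidden state serves as the fixed-length motion feature e."""
    h = np.zeros(H)
    for x in X:
        h = cell(h, x)
    return h

def decode(e, x_last, M):
    """Initialize the decoder with feature e and the last observed
    frame, then autoregressively predict M future steps."""
    h, x, out = e, x_last, []
    for _ in range(M):
        h = cell(h, x)
        x = W_o @ h          # predicted next observation, fed back in
        out.append(x)
    return np.stack(out)

X = rng.normal(size=(10, D))
Y_hat = decode(encode(X), X[-1], M=5)   # shape (5, D)
```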
Vanilla VAE for Sequence Generation.
As the deterministic LSTM model fails to reflect the multimodal nature of human motion, we consider a statistical model $p_\theta(Y \mid X)$, parameterized by $\theta$, that defines a distribution over future sequences given the observed sequence $X$, instead of a single outcome. To model the multimodality (i.e., the same $X$ can transition to different $Y$'s), a latent variable $z$ (sampled from a prior distribution) is introduced to capture the inherent uncertainty. The future sequence is generated as follows:
Sample a latent variable $z \sim p(z)$;
Given $z$ and $X$, generate a sequence of length $M$: $Y = f_{\mathrm{dec}}(X, z)$.
Training maximizes a variational lower bound (Eq. 1): $\log p_\theta(Y \mid X) \ge -\mathrm{KL}\big(q_\phi(z \mid X, Y) \,\|\, p(z)\big) + \mathbb{E}_{q_\phi(z \mid X, Y)}\big[\log p_\theta(Y \mid X, z)\big]$. In Eq. 1, $q_\phi(z \mid X, Y)$ is referred to as an auxiliary posterior that approximates the true posterior $p_\theta(z \mid X, Y)$. Specifically, the prior $p(z)$ is assumed to be $\mathcal{N}(0, I)$, and the posterior $q_\phi(z \mid X, Y)$ is modeled as a Gaussian whose mean and variance are predicted by the recognition network. Intuitively, the first term in Eq. 1 regularizes the auxiliary posterior with the prior $p(z)$. The second term can be considered an auto-encoding loss, where we refer to $q_\phi$ as an encoder or recognition model, and $p_\theta$ as a decoder or generation model.
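The two standard VAE ingredients used here, the closed-form KL term against the N(0, I) prior and reparameterized sampling, can be sketched in NumPy as follows (a minimal illustration, not the trained model):

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), the regularization
    term of the VAE objective, computed in closed form."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

def reparameterize(mu, logvar, rng):
    """Draw z = mu + sigma * eps so that sampling stays differentiable
    with respect to the posterior parameters."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

rng = np.random.default_rng(0)
mu, logvar = np.zeros(8), np.zeros(8)
z = reparameterize(mu, logvar, rng)
kl = kl_to_standard_normal(mu, logvar)   # 0 when posterior equals prior
```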
As shown in Figure 2(b), the vanilla VAE model adopts a similar LSTM encoder and decoder for sequence processing. In contrast to the Prediction LSTM model, the vanilla VAE decoder takes both the motion feature and the latent variable $z$ into account. Ideally, this allows the model to generate diverse motion sequences by drawing different samples from the latent space. However, the semantic role of the latent variable in this vanilla VAE model is not straightforward, and it may not effectively represent long-term trends (e.g., dynamics within a specific motion mode or during a change of modes).
3.2 Motion-to-Motion Transformations in Latent Space
To further improve motion sequence generation beyond the vanilla VAE, we propose to explicitly enforce the structure of motion modes in the latent space. We assume that (1) each motion mode can be represented as a low-dimensional feature vector, and (2) transitions between motion modes can be modeled as transformations of these features. Our design is also supported by early studies on hierarchical motion modeling and prediction [8, 54, 55].
We present a Motion Transformation VAE (or MT-VAE) (Fig. 2(c)) with four components:
An LSTM encoder maps the input sequences $X$ and $Y$ into motion features $e_X = f_{\mathrm{enc}}(X)$ and $e_Y = f_{\mathrm{enc}}(Y)$, respectively.
A latent encoder computes the transformation $z \in \mathbb{R}^{N_z}$ in the latent space from the concatenated motion features $e_X$ and $e_Y$. Here, $N_z$ indicates the latent space dimension.
A latent decoder synthesizes the future motion feature $\hat{e}_Y$ from the latent transformation $z$ and the current motion feature $e_X$.
An LSTM decoder synthesizes the future sequence given the motion feature: $\hat{Y} = f_{\mathrm{dec}}(\hat{e}_Y)$.
Similar to the Prediction LSTM, we use an LSTM encoder/decoder to map motion modes into the feature space. The MT-VAE further maps these features into latent transformations and stochastically samples these transformations. As we demonstrate, this change makes the model more expressive and leads to more plausible results. Finally, in the sequence decoding stage of MT-VAE, we feed the synthesized motion feature $\hat{e}_Y$ as input to the decoder LSTM, with its internal state initialized using the same motion feature and the last observed frame as an additional input.
3.3 Additive Transformations in Latent Space
Although the MT-VAE explicitly models motion transformations in the latent space, this space might be under-constrained because the transformations are computed from the vector concatenation of the motion features $e_X$ and $e_Y$ in our latent encoder. To better regularize the transformation space, we present an additive variant of MT-VAE, depicted in Figure 2(d). To distinguish between the two variants, we call the previous model MT-VAE (concat) and this model MT-VAE (add), respectively. Our model is inspired by the recent success of deep analogy-making methods [40, 31], where a relation (or transformation) between two examples can be represented as a difference in the embedding space. In this model, we strictly constrain the latent encoding and decoding steps as follows:
Our latent encoder computes the difference between the two motion features via $d = e_Y - e_X$; it then maps the difference feature $d$ into a transformation $z$ in the latent space.
Our latent decoder reconstructs the difference feature $\hat{d}$ from the latent variable $z$ and the current motion feature $e_X$.
Finally, we apply a simple additive interaction to reconstruct the future motion feature: $\hat{e}_Y = e_X + \hat{d}$.
In step one, we infer the latent variable $z$ from the difference of $e_Y$ and $e_X$ (instead of applying a linear layer on the concatenated vectors). Intuitively, the latent code is expected to capture the mode transition from the current motion to the future motion rather than a concatenation of the two modes. In step two, we reconstruct the difference feature $\hat{d}$ from the latent variable $z$, where $z$ is obtained from the recognition model during training. In this design, the feature difference depends on both the latent transformation $z$ and the current motion feature $e_X$. Alternatively, we can make our latent decoder context-free by removing the motion feature $e_X$ from its input. This way, the latent decoder must hallucinate the motion difference solely from the latent space. We provide this ablation study in Section 4.1.
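Once the learned fully-connected layers are abstracted away, the additive encode/decode steps reduce to simple vector arithmetic. A sketch with toy linear maps standing in for the learned networks (the names `project` and `reconstruct` are illustrative, not from the paper):

```python
import numpy as np

def latent_encode(e_cur, e_fut, project):
    """MT-VAE (add) latent encoder: embed the *difference* of motion
    features, so z represents the mode transition, not the modes."""
    return project(e_fut - e_cur)

def latent_decode(z, e_cur, reconstruct):
    """Latent decoder: map z back to a difference vector, then add it
    to the current motion feature."""
    d_hat = reconstruct(z, e_cur)
    return e_cur + d_hat

# Toy linear maps standing in for the learned fully-connected layers;
# they are exact inverses here so the round trip is lossless.
project = lambda d: 2.0 * d                 # encoder: difference -> latent
reconstruct = lambda z, e: 0.5 * z          # decoder: latent -> difference

e_A, e_B = np.array([1.0, 2.0]), np.array([1.5, 2.5])
e_C = np.array([0.0, -1.0])
z = latent_encode(e_A, e_B, project)
e_D = latent_decode(z, e_C, reconstruct)    # analogy: e_D - e_C == e_B - e_A
```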
Besides the architecture-wise regularization, we introduce two additional objectives while training our model.
As mentioned previously, our training objective in Eq. 1 is composed of a KL term and a reconstruction term at each frame. The KL term regularizes the latent space, while the reconstruction term ensures that the data can be explained by our generative model. However, we do not have direct regularization in the feature space. We therefore introduce a cycle-consistency loss, given in Eq. 2 (for MT-VAE (concat)) and Eq. 3 (for MT-VAE (add)). Figure 3 illustrates the cycle consistency in detail.
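Since Eqs. 2 and 3 are not reproduced here, the following is one plausible instantiation of a feature-space cycle under the additive model (an assumption, not the paper's exact formulation): decode a latent into a feature difference, re-encode it, and penalize the latent mismatch.

```python
import numpy as np

def cycle_consistency_loss(e_cur, z, lat_dec, lat_enc):
    """Feature-space cycle: decode z into a future motion feature,
    re-encode that feature, and penalize the squared distance between
    the original and recovered latent."""
    e_fut_hat = e_cur + lat_dec(z, e_cur)        # latent -> feature
    z_hat = lat_enc(e_fut_hat - e_cur)           # feature -> latent
    return np.sum((z - z_hat) ** 2)

# Toy inverse pair: decoding then encoding recovers z, so the loss
# is (numerically) zero.
lat_dec = lambda z, e: 0.5 * z
lat_enc = lambda d: 2.0 * d
loss = cycle_consistency_loss(np.array([1.0, -2.0]), np.array([0.3, 0.7]),
                              lat_dec, lat_enc)
```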
In our preliminary experiments, we also investigated a consistency loss with a bigger cycle (involving the actual motion sequences) during training but we found it ineffective as a regularization term in our setting. We hypothesize that vanishing or exploding gradients make the cycle-consistency objective less effective, which is a known issue when training recurrent neural networks.
Specific to our motion generation task, we introduce a motion coherence loss in Eq. 4 that encourages a smooth transition in velocity during the first few steps of prediction, where the velocity at each step is defined as the difference between consecutive observations (with the last observed frame defining the velocity at the boundary). Intuitively, such a loss prevents the generated sequence from deviating too far from the future sequence sampled from the prior.
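A sketch of such a velocity-smoothness penalty (the window size K and the exact velocity definition at the boundary are assumptions):

```python
import numpy as np

def motion_coherence_loss(observed, predicted, K=3):
    """Penalize velocity discontinuities across the observed/predicted
    boundary: compare per-step velocities over the first K predicted
    steps against the last observed velocity."""
    joint = np.concatenate([observed[-1:], predicted], axis=0)
    vel_pred = np.diff(joint, axis=0)[:K]      # velocities after the boundary
    vel_last = observed[-1] - observed[-2]     # last observed velocity
    return np.mean(np.sum((vel_pred - vel_last) ** 2, axis=-1))

obs = np.cumsum(np.ones((5, 2)), axis=0)             # constant-velocity motion
pred = obs[-1] + np.cumsum(np.ones((4, 2)), axis=0)  # continues the motion
loss = motion_coherence_loss(obs, pred)              # 0.0 for a smooth continuation
```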
Finally, we summarize our overall loss in Eq. 5, where $\lambda_{\mathrm{cycle}}$ and $\lambda_{\mathrm{motion}}$ are two balancing hyper-parameters for cycle consistency and motion coherence, respectively.
4 Experiments

The evaluation is conducted on datasets covering two representative human motion modeling tasks: Affect-in-the-wild (Aff-Wild) for facial motions and Human3.6M for full-body motions. The Aff-Wild dataset contains more than 400 video clips (2,000 minutes in total) collected from YouTube with natural facial expression and head motion patterns. To better focus on face motion modeling (e.g., expressions and head movements), we leveraged a 3D morphable face model [58, 24] (covering face identity, face expression, and pose) in our experiments. We fitted 198-dim identity coefficients, 29-dim expression coefficients, and 6-dim pose parameters to each frame with a pre-trained 3DMM-CNN model, followed by an optimization-based face fitting algorithm. This disentangled representation allows us to study face motion modeling without being distracted by unrelated factors such as facial identity, background scene, and illumination of the environment. We trained our model with 80% of the data on the expression and pose parameters, since these are the main factors that change over time. Human3.6M is a large-scale database containing more than 800 human motion sequences captured from 11 professional actors (3.6 million frames in total) in an indoor environment. For experiments on Human3.6M, we used the raw 2D trajectories of 32 keypoints and further normalized the coordinates to a fixed range. We used subjects 1, 5, 6, 7, and 8 for training, and tested on subjects 9 and 11. We used 5% of the training data for model validation.
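For the Human3.6M keypoints, the preprocessing amounts to scaling trajectories by the frame size and flattening the per-step joint coordinates. A sketch (the [0, 1] target range and frame dimensions are assumptions):

```python
import numpy as np

def normalize_keypoints(joints, width, height):
    """Map raw 2D joint trajectories of shape (T, 32, 2) in pixel units
    into a fixed coordinate range ([0, 1] here, as an assumption) by
    dividing by the frame size, then flatten to 64-dim per-step vectors."""
    scaled = joints / np.array([width, height], dtype=float)
    return scaled.reshape(joints.shape[0], -1)   # (T, 64)

# Hypothetical frame size; real values depend on the camera.
joints = np.random.rand(10, 32, 2) * [1000, 1002]
x = normalize_keypoints(joints, width=1000, height=1002)
```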
Our MT-VAE model consists of four components: a sequence encoder network, a sequence decoder network, a latent encoder network, and a latent decoder network. We build our sequence encoder and decoder using Long Short-Term Memory units (LSTMs). We used a 1-layer LSTM with 1,024 hidden units for both networks. For experiments on the Aff-Wild dataset, the input to our sequence encoder is the 35-dimensional expression-pose representation (29 expression and 6 pose parameters) per timestep, and we recursively predict the future parameters using our sequence decoder. For experiments on the Human3.6M dataset, we used the 64-dimensional xy-coordinate representation (32 joints with 2 coordinates per joint) instead. Given the past and future motion features extracted by our sequence encoder network, we build three fully-connected layers with skip connections in our latent encoder network. We adopted a similar architecture (three fully-connected layers with skip connections) for our latent decoder network. For all models (including baselines), we fixed the bottleneck latent dimension at 512 and found this configuration sufficient to generate both face and full-body motions.
We used ADAM for optimization in all experiments. For training, we used a mini-batch size of 256 and a learning rate of 0.0001 with default ADAM settings. For experiments on Aff-Wild, we trained models to predict 32 steps into the future given a varying number of observed frames between 8 and 16. For experiments on Human3.6M, we trained models to predict 64 steps into the future given a varying number of observed frames between 10 and 20. To stabilize training, we applied layer normalization in both the LSTMs and the fully-connected layers. To encourage our latent variable to capture motion patterns, we applied the KL annealing technique during training, in which we gradually increased the weight of the KL term from 0 to 1. For experiments on Aff-Wild only, we applied dropout with ratio 0.8 to both the sequence encoder and decoder networks to learn more robust features.
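The KL annealing schedule is a simple ramp from 0 to 1. A linear sketch (the schedule length is an assumption; the paper does not state it):

```python
def kl_weight(step, anneal_steps=10000):
    """Linearly anneal the KL weight from 0 to 1 over the first
    `anneal_steps` training updates, then hold it at 1."""
    return min(1.0, step / float(anneal_steps))
```

During training, the KL term of Eq. 1 would be multiplied by `kl_weight(step)` so that early updates focus on reconstruction before the latent space is regularized.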
We used the Prediction LSTM as a deterministic baseline; a similar model has been used in previous work for learning the dynamics of human motion [63, 17]. We implemented the vanilla VAE model as our stochastic baseline; a similar model has been utilized in [33, 16, 19] for stochastic flow prediction from a single image. During training, we used a distance-based reconstruction term. We conducted an extensive hyper-parameter search for the vanilla VAE and our MT-VAE variants over the smoothing window, the motion coherence ratio, and the cycle-consistency ratio. All models achieved their best performance under the same smoothing window and motion coherence ratio, while the best-performing MT-VAE (add) uses a different cycle-consistency ratio from the other models.
Please visit the website for more visualizations: https://goo.gl/2Q69Ym.
4.1 Multimodal Motion Generation
We evaluate our model’s capacity to generate diverse and plausible future motion patterns for a given sequence on the Aff-Wild and Human3.6M test sets. Given an observed sequence as initialization, we generated multiple future motion trajectories using our proposed sampling and generation process. For the Prediction LSTM model, we only sample one future motion trajectory, since the predicted future is deterministic.
We evaluate our model and baselines quantitatively using the minimum squared error metric and conditional log-likelihood metric, which have been used in evaluating conditional generative models [43, 16, 53, 48]. As defined in Eq. 6, Reconstruction minimum squared error (or R-MSE) measures the squared error of the closest reconstruction to ground-truth when sampling latent variables from the recognition model. This is a measure of the quality of reconstruction given both current and future sequences. As defined in Eq. 7, Sampling minimum squared error (or S-MSE) measures the squared error of the closest sample to ground-truth when sampling latent variables from prior. This is a measure of how close our samples are to the reference future sequences.
In terms of generation diversity and quality, a good generative model is expected to achieve low R-MSE and S-MSE values, given a sufficient number of samples. Note that the posterior collapse issue is typically characterized by low S-MSE but high R-MSE, as the latent variable sampled from the recognition model is being ignored to some extent. In addition, we measure the test conditional log-likelihood of the ground-truth sequences under our model via Parzen window estimation (with a bandwidth determined based on the validation set). We believe that Parzen window estimation is a reasonable approach for our setting, as the dimensionality of the data (sequences of keypoints) is not too high (unlike the case of high-resolution videos). For each example, we drew multiple samples from the recognition model to compute the R-MSE metric, and multiple samples from the prior to compute the S-MSE and conditional log-likelihood metrics. On Aff-Wild, we evaluate the models on 32-step expression coefficient prediction (29 × 32 = 928 dimensions in total). On Human3.6M, we evaluate the models on 64-step 2D joint prediction (64 × 64 = 4096 dimensions in total). Please note that these measurements are approximate, as we do not evaluate model performance for every sub-sequence (essentially, every frame can serve as a starting point). Instead, we repeat the evaluations every 16 frames on the Aff-Wild dataset and every 100 frames on the Human3.6M dataset.
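Both metrics reduce to a minimum over per-sample squared errors; the only difference is whether the latent variables are drawn from the recognition model (R-MSE) or the prior (S-MSE):

```python
import numpy as np

def min_mse(samples, ground_truth):
    """Minimum squared error over a set of sampled future sequences.
    With z drawn from the recognition model this gives R-MSE; with z
    drawn from the prior it gives S-MSE."""
    errors = [np.sum((s - ground_truth) ** 2) for s in samples]
    return min(errors)

gt = np.zeros((4, 2))
samples = [np.ones((4, 2)), np.full((4, 2), 0.1), np.full((4, 2), 2.0)]
best = min_mse(samples, gt)   # closest sample is the 0.1 one: 8 * 0.01 = 0.08
```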
As we see in Table 1, data-driven baselines that simply repeat the motion computed from the last-step velocity, or averaged over the observed sequence, perform poorly on both datasets. In contrast, the Prediction LSTM baseline greatly reduces the S-MSE metric compared to these simple baselines, thanks to its deep sequence encoder and decoder architecture modeling more complex motion dynamics through time. Among all three models using latent variables, our MT-VAE (add) model achieves the best quantitative performance. Compared to MT-VAE (concat), which adopts vector concatenation, our additive version achieves lower reconstruction error with similar sampling error. This suggests that the MT-VAE (add) model further regularizes the learning of motion transformations.
We provide qualitative side-by-side comparisons across different models in Figure 4. For Aff-Wild, we render 3D face models using the generated expression-pose parameters along with the original identity parameters. For Human3.6M, we directly visualize the generated 2D keypoints. As shown in the generated sequences, our MT-VAE model is able to generate multiple diverse and plausible sequences in the future. In comparison, the sequences generated by Vanilla VAE are less realistic. For example, given a sitting down motion (lower-left part in Fig. 4) as initialization, the vanilla model fails to predict the motion trend (sitting down), while creating some artifacts (e.g., scale change) in the future prediction. Also note that MT-VAE produces more natural transitions from the last observed frame to the first generated one (see mouth shapes in the face motion examples and distances between two legs in full-body examples). This demonstrates that MT-VAE learns a more robust and structure-preserving representation of motion sequences compared to other baselines.
Crowd-sourced Human Evaluations.
We conducted crowd-sourced human evaluations via Amazon Mechanical Turk (AMT) on 50 videos (10 Turkers per video) from the Human3.6M dataset. The evaluation presents the past action and 5 generated future actions for each method to a human evaluator, and asks the person to select the most (1) realistic and (2) diverse results. In this evaluation, we also added a comparison to a recently published work on stochastic video prediction, which we refer to as SVG. Table 2 presents the percentage of users who selected each method for each task. The Prediction LSTM produces the most realistic but the least diverse results; Babaeizadeh et al. produce the most diverse but the least realistic results; our MT-VAE model (we use the additive variant here) achieves a good balance between realism and diversity.
| Metric | Vanilla VAE | SVG | Our MT-VAE (add) | Pred LSTM |
| Method | R-MSE (test) | S-MSE (test) | Test CLL |
| MT-VAE (add) | 0.75 ± 0.01 | 2.87 ± 0.05 | 1.141 ± 0.009 |
| MT-VAE (add) w/o Motion Coherence | 1.01 ± 0.02 | 2.93 ± 0.04 | 1.012 ± 0.014 |
| MT-VAE (add) w/o Cycle Consistency | 1.18 ± 0.03 | 2.71 ± 0.05 | 0.927 ± 0.019 |
| MT-VAE (add) Context-free Decoder | 0.31 ± 0.05 | 4.05 ± 0.05 | 1.299 ± 0.007 |
We analyze variations of our MT-VAE (add) model on Human3.6M. As we see in Table 3, removing the cycle consistency or motion coherence loss results in a drop in reconstruction performance. This shows that cycle consistency and motion coherence encourage the motion feature to preserve motion structure and hence be more discriminative in nature. We also evaluate a context-free version of the MT-VAE (add) model, where the transformation vector is not conditioned on the input feature $e_X$. This version produces a poor S-MSE value, since it is challenging for the additive latent decoder to hallucinate the transformation vector solely from the latent variable $z$.
4.2 Analogy-based Motion Transfer
We evaluate our model on an additional task: transfer by analogy. In this analogy-making experiment, we are given three motion sequences A, B (the subsequent motion of A), and C (a different motion sequence). The objective is to recognize the transition from A to B and transfer it to C. This experiment demonstrates whether our learned latent space models the mode transition across motion sequences. Moreover, this task has numerous graphics applications, such as transferring expressions and their styles, video dubbing, gait style transfer, and video-driven animation.
In this experiment, we compare the Prediction LSTM, the vanilla VAE, and our MT-VAE variants. For the stochastic models, we compute the latent variable $z$ from motion sequences A and B via the latent encoder, and then decode it with motion sequence C to synthesize the transferred motion D. For the Prediction LSTM model, we perform the analogy-making directly in the feature space, since there is no notion of a latent space in that model. As shown in Figure 5, our MT-VAE model is able to combine the transformation learned from the A-to-B transition with the structure in sequence C. The other baselines fail at either adapting the mode transition from A to B or preserving the structure in C. The analogy-based motion transfer task is significantly more challenging than motion generation, since the combination of the three reference motion sequences A, B, and C may never appear in the training data. Yet, our model is able to synthesize realistic motions. Please note that motion modes may not explicitly correspond to semantic motions, as we learn the motion transformation in an unsupervised manner.
4.3 Towards Multimodal Hierarchical Video Generation
As an application, we showcase that our multimodal motion generation framework can be directly used for generating diverse and realistic pixel-level video frames in the future. We trained a keypoint-conditioned image generation model that takes both a previous image frame A and a predicted motion structure B (e.g., a rendered face or human joints) as input, and hallucinates an image C by combining the image content adapted from A with the motion adapted from B. In Figure 6, we show a comparison of video generated in a deterministic way by the Prediction LSTM (i.e., a single future) and in a stochastic way driven by the predicted motion sequences (i.e., multiple futures) from our MT-VAE (add) model. We use our generated motion sequences to perform video generation experiments on Aff-Wild (with 8 input frames observed) and Human3.6M (with 16 input frames observed).
5 Conclusion

Our goal in this work is to learn a conditional generative model for human motion. This is an extremely challenging problem in the general case and can require a significant amount of training data to generate realistic results. Our work demonstrates that it can be accomplished with minimal supervision by enforcing a strong structure on the problem. In particular, we model long-term human dynamics as a set of motion modes with transitions between them, and construct a novel network architecture that strongly regularizes this space and allows for stochastic sampling. We have demonstrated that this same idea can be used to model both facial and full-body motion, independent of the representation used (i.e., shape parameters or keypoints).
We thank Zhixin Shu and Haoxiang Li for their assistance with face tracking and fitting codebase. We thank Yuting Zhang, Seunghoon Hong, and Lajanugen Logeswaran for helpful comments and discussions. This work was supported in part by Adobe Research Fellowship to X. Yan, a gift from Adobe, ONR N00014-13-1-0762, and NSF CAREER IIS-1453651.
References

-  de Aguiar, E., Stoll, C., Theobalt, C., Ahmed, N., Seidel, H.P., Thrun, S.: Performance capture from sparse multi-view video. ACM Trans. Graph. 27(3) (August 2008) 98:1–98:10
-  Beeler, T., Hahn, F., Bradley, D., Bickel, B., Beardsley, P., Gotsman, C., Sumner, R.W., Gross, M.: High-quality passive facial performance capture using anchor frames. ACM Trans. Graph. 30(4) (July 2011) 75:1–75:10
-  Yang, F., Wang, J., Shechtman, E., Bourdev, L., Metaxas, D.: Expression flow for 3d-aware face component transfer. In: ACM Transactions on Graphics (TOG). Volume 30., ACM (2011) 60
-  Suwajanakorn, S., Seitz, S.M., Kemelmacher-Shlizerman, I.: What makes tom hanks look like tom hanks. In: ICCV. (2015) 3952–3960
-  Suwajanakorn, S., Seitz, S.M., Kemelmacher-Shlizerman, I.: Synthesizing obama: learning lip sync from audio. ACM Transactions on Graphics (TOG) 36(4) (2017) 95
-  Sermanet, P., Lynch, C., Hsu, J., Levine, S.: Time-contrastive networks: Self-supervised learning from multi-view observation. arXiv preprint arXiv:1704.06888 (2017)
-  Rose, C., Guenter, B., Bodenheimer, B., Cohen, M.F.: Efficient generation of motion transitions using spacetime constraints. In: SIGGRAPH. (1996)
-  Bregler, C.: Learning and recognizing human dynamics in video sequences. In: CVPR, IEEE (1997) 568–574
-  Efros, A.A., Berg, A.C., Mori, G., Malik, J.: Recognizing action at a distance. In: ICCV, IEEE (2003) 726–733
-  Gorelick, L., Blank, M., Shechtman, E., Irani, M., Basri, R.: Actions as space-time shapes. PAMI 29(12) (2007) 2247–2253
-  Laptev, I.: On space-time interest points. International journal of computer vision 64(2-3) (2005) 107–123
-  Wang, H., Kläser, A., Schmid, C., Liu, C.L.: Action recognition by dense trajectories. In: CVPR, IEEE (2011) 3169–3176
-  Wang, J., Liu, Z., Wu, Y., Yuan, J.: Mining actionlet ensemble for action recognition with depth cameras. In: CVPR, IEEE (2012) 1290–1297
-  Walker, J., Gupta, A., Hebert, M.: Dense optical flow prediction from a static image. In: ICCV, IEEE (2015) 2443–2451
-  Fischer, P., Dosovitskiy, A., Ilg, E., Häusser, P., Hazırbaş, C., Golkov, V., Van der Smagt, P., Cremers, D., Brox, T.: Flownet: Learning optical flow with convolutional networks. arXiv preprint arXiv:1504.06852 (2015)
-  Walker, J., Doersch, C., Gupta, A., Hebert, M.: An uncertain future: Forecasting from static images using variational autoencoders. In: ECCV. (2016)
-  Chao, Y.W., Yang, J., Price, B., Cohen, S., Deng, J.: Forecasting human dynamics from static images. In: CVPR. (2017)
-  Villegas, R., Yang, J., Zou, Y., Sohn, S., Lin, X., Lee, H.: Learning to generate long-term future via hierarchical prediction. In: ICML. (2017)
-  Walker, J., Marino, K., Gupta, A., Hebert, M.: The pose knows: Video forecasting by generating pose futures. In: ICCV, IEEE (2017) 3352–3361
-  Li, Z., Zhou, Y., Xiao, S., He, C., Huang, Z., Li, H.: Auto-conditioned recurrent networks for extended complex human motion synthesis. In: ICLR. (2018)
-  Yang, F., Bourdev, L., Shechtman, E., Wang, J., Metaxas, D.: Facial expression editing in video using a temporally-smooth factorization. In: CVPR, IEEE (2012) 861–868
-  Thies, J., Zollhofer, M., Stamminger, M., Theobalt, C., Nießner, M.: Face2face: Real-time face capture and reenactment of rgb videos. In: CVPR. (2016) 2387–2395
-  Averbuch-Elor, H., Cohen-Or, D., Kopf, J., Cohen, M.F.: Bringing portraits to life. ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2017) 36(6) (2017) 196
-  Blanz, V., Vetter, T.: A morphable model for the synthesis of 3d faces. In: Proceedings of the 26th annual conference on Computer graphics and interactive techniques, ACM Press/Addison-Wesley Publishing Co. (1999) 187–194
-  Srivastava, N., Mansimov, E., Salakhudinov, R.: Unsupervised learning of video representations using lstms. In: ICML. (2015) 843–852
-  Mathieu, M., Couprie, C., LeCun, Y.: Deep multi-scale video prediction beyond mean square error. In: ICLR. (2016)
-  Hinton, G.E., Krizhevsky, A., Wang, S.D.: Transforming auto-encoders. In: International Conference on Artificial Neural Networks, Springer (2011) 44–51
-  Oh, J., Guo, X., Lee, H., Lewis, R.L., Singh, S.: Action-conditional video prediction using deep networks in atari games. In: NIPS. (2015)
-  Finn, C., Goodfellow, I.J., Levine, S.: Unsupervised learning for physical interaction through video prediction. In: NIPS. (2016)
-  Yang, J., Reed, S.E., Yang, M.H., Lee, H.: Weakly-supervised disentangling with recurrent transformations for 3d view synthesis. In: NIPS. (2015) 1099–1107
-  Villegas, R., Yang, J., Hong, S., Lin, X., Lee, H.: Decomposing motion and content for natural video sequence prediction. In: ICLR. (2017)
-  Denton, E.L., Birodkar, V.: Unsupervised learning of disentangled representations from video. In: NIPS. (2017) 4417–4426
-  Xue, T., Wu, J., Bouman, K., Freeman, B.: Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In: NIPS. (2016) 91–99
-  Vondrick, C., Pirsiavash, H., Torralba, A.: Generating videos with scene dynamics. In: NIPS. (2016) 613–621
-  Tulyakov, S., Liu, M.Y., Yang, X., Kautz, J.: Mocogan: Decomposing motion and content for video generation. In: CVPR. (2018)
-  Wichers, N., Villegas, R., Erhan, D., Lee, H.: Hierarchical long-term video prediction without supervision. In: ICML. (2018)
-  Kalchbrenner, N., Oord, A.v.d., Simonyan, K., Danihelka, I., Vinyals, O., Graves, A., Kavukcuoglu, K.: Video pixel networks. arXiv preprint arXiv:1610.00527 (2016)
-  Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: NIPS. (2013) 3111–3119
-  Kulkarni, T.D., Whitney, W.F., Kohli, P., Tenenbaum, J.: Deep convolutional inverse graphics network. In: NIPS. (2015) 2539–2547
-  Reed, S.E., Zhang, Y., Zhang, Y., Lee, H.: Deep visual analogy-making. In: NIPS. (2015) 1252–1260
-  Wang, X., Farhadi, A., Gupta, A.: Actions ~ transformations. In: CVPR. (2016) 2658–2667
-  Zhou, Y., Berg, T.L.: Learning temporal transformations from time-lapse videos. In: ECCV, Springer (2016) 262–277
-  Sohn, K., Yan, X., Lee, H.: Learning structured output representation using deep conditional generative models. In: NIPS. (2015) 3483–3491
-  Zhu, J.Y., Zhang, R., Pathak, D., Darrell, T., Efros, A.A., Wang, O., Shechtman, E.: Toward multimodal image-to-image translation. In: NIPS. (2017) 465–476
-  Ha, D., Eck, D.: A neural representation of sketch drawings. In: ICLR. (2018)
-  Bowman, S.R., Vilnis, L., Vinyals, O., Dai, A.M., Jozefowicz, R., Bengio, S.: Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349 (2015)
-  Hu, Z., Yang, Z., Liang, X., Salakhutdinov, R., Xing, E.P.: Controllable text generation. arXiv preprint arXiv:1703.00955 (2017)
-  Babaeizadeh, M., Finn, C., Erhan, D., Campbell, R.H., Levine, S.: Stochastic variational video prediction. In: ICLR. (2018)
-  Denton, E., Fergus, R.: Stochastic video generation with a learned prior. In: ICML. (2018)
-  Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation 9(8) (1997) 1735–1780
-  Kingma, D.P., Welling, M.: Auto-encoding variational bayes. In: ICLR. (2014)
-  Gregor, K., Danihelka, I., Graves, A., Rezende, D., Wierstra, D.: Draw: A recurrent neural network for image generation. In: ICML. (2015) 1462–1471
-  Yan, X., Yang, J., Sohn, K., Lee, H.: Attribute2image: Conditional image generation from visual attributes. In: ECCV, Springer (2016) 776–791
-  Smith, K.A., Vul, E.: Sources of uncertainty in intuitive physics. Topics in cognitive science 5(1) (2013) 185–199
-  Lan, T., Chen, T.C., Savarese, S.: A hierarchical representation for future action prediction. In: ECCV, Springer (2014) 689–704
-  Zafeiriou, S., Kollias, D., Nicolaou, M.A., Papaioannou, A., Zhao, G., Kotsia, I.: Aff-wild: Valence and arousal in-the-wild challenge. In: CVPR Workshops, IEEE (2017)
-  Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments. PAMI 36(7) (2014) 1325–1339
-  Paysan, P., Knothe, R., Amberg, B., Romdhani, S., Vetter, T.: A 3d face model for pose and illumination invariant face recognition. In: AVSS, IEEE (2009)
-  Tran, A.T., Hassner, T., Masi, I., Medioni, G.: Regressing robust and discriminative 3d morphable models with a very deep neural network. In: CVPR. (2017)
-  Zhu, X., Lei, Z., Liu, X., Shi, H., Li, S.Z.: Face alignment across large poses: A 3d solution. In: CVPR. (2016) 146–155
-  Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
-  Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)
-  Fragkiadaki, K., Levine, S., Felsen, P., Malik, J.: Recurrent network models for human dynamics. In: ICCV, IEEE (2015) 4346–4354