1 Introduction
¹ Work partially done during an internship with Adobe Research.

Modeling the dynamics of human motion — both facial and full-body motion — is a fundamental problem in computer vision, graphics, and machine intelligence, with applications ranging from virtual characters [1, 2] and video-based animation and editing [3, 4, 5] to human-robot interfaces [6]. Human motion is known to be highly structured and can be modeled as a sequence of atomic units that we refer to as motion modes. A motion mode captures the short-term temporal dynamics of a human action (e.g., smiling or walking), including its related stylistic attributes (e.g., how wide the smile is, how fast the walk is). Over the long term, a human action sequence can be segmented into a series of motion modes with transitions between them (e.g., a transition from a neutral expression to smiling to laughing). This structure is well known (referred to as basis motions [7] or walk cycles) and widely used in computer animation.

This paper leverages this structure to learn to generate human motion sequences: given a short human action sequence (the present motion mode), we want to synthesize the action going forward (the future motion mode). We hypothesize that (1) each motion mode can be represented as a low-dimensional feature vector, and (2) transitions between motion modes can be modeled as
transformations of these features. As shown in Figure 1, we present a novel model termed Motion Transformation Variational Auto-Encoder (MT-VAE) for learning motion sequence generation. Our MT-VAE is implemented using an LSTM encoder-decoder that embeds each short subsequence into a feature vector that can be decoded to reconstruct the motion. We further assume that the transition between the current and future modes can be captured by a certain transformation. In this paper, we demonstrate that the proposed MT-VAE learns a motion feature representation in an unsupervised way.

A challenge with human motion is that it is inherently multimodal, i.e., the same initial motion mode can transition into different motion modes (e.g., a smile could transition to a frown, a smile while looking left, or a wider smile). A deterministic model cannot learn these variations and may collapse to a single-mode distribution. Our MT-VAE supports stochastic sampling of the feature transformations to generate multiple plausible output motion modes from a single input. This allows us to model transitions that may be rare (or even absent) in the training set.
We demonstrate our approach on both facial and full-body human motions. In both domains, we conduct extensive ablation studies and comparisons with previous work, showing that our generation results are more plausible (i.e., better preserve the structure of human dynamics) and diverse (i.e., explore multiple motion modes). We further demonstrate applications such as (1) analogy-based motion transfer (e.g., transferring the act of smiling from one pose to another) and (2) future video synthesis (i.e., generating multiple possible future videos given input frames with human motion). Our key contributions are summarized as follows:

We propose a generative motion model that consists of a sequence-level motion feature embedding and feature transformations, and show that it can be trained in an unsupervised manner.

We show that stochastically sampling the transformation space is able to generate future motion dynamics that are diverse and plausible.

We demonstrate applications of the learned model to challenging tasks like motion transfer and future video synthesis for both facial and human body motions.
2 Related Work
Understanding and modeling human motion dynamics has been a long-standing problem [8, 9, 10]. Due to the high dimensionality of video data, early work mainly focused on learning hierarchical spatio-temporal representations for video event and action recognition [11, 12, 13]. In recent years, predicting and synthesizing motion dynamics with deep neural networks has become a popular research topic. Walker et al. [14] and Fischer et al. [15] learn to synthesize dense future optical flow from a single image. Walker et al. [16] extended this deterministic prediction framework by modeling flow uncertainty with variational autoencoders. Chao et al. [17] proposed a recurrent neural network that generates 3D human joint movements from a single observation, using an in-network 3D projection layer. Going one step further, Villegas et al.
[18] and Walker et al. [19] explored hierarchical structure (e.g., 2D human joints) for future motion prediction using recurrent neural networks. Li et al. [20] proposed an auto-conditioned recurrent framework to generate long-term human motion dynamics through time. Besides body motion, face synthesis and editing is another active topic in vision and graphics. Methods for reenacting and interpolating face sequences in video have been developed [3, 21, 22, 23] based on a 3D morphable face representation [24]. Very recently, Suwajanakorn et al. [5] introduced a speech-driven face synthesis system that learns to generate lip motion with a recurrent neural network.

Beyond flow representations, motion synthesis has been explored in the broader context of video generation, i.e., synthesizing future video frames given one or more observed frames as initialization. Early work employed patch-based methods for short-term video generation with mean squared error loss [25] or perceptual loss [26]. Given an atomic action as an additional condition, later work developed action-conditioned (e.g., rotation, location) architectures that enable better semantic control in video generation [27, 28, 29, 30]. Due to the difficulty of holistic video frame prediction, the idea of disentangling video into motion and content factors has been explored in [31, 32, 33, 34, 35, 36]. Video generation has also been approached with architectures that output, for each pixel of the generated frame, a multinomial distribution over the possible pixel values [37].
The notion of feature transformations has also been exploited for other tasks. Mikolov et al. [38] showcased the additive compositional property of word vectors learned in an unsupervised way from language data; Kulkarni et al. [39] and Reed et al. [40] suggested that additive transformations can be learned via reconstruction or prediction tasks from parallel paired image data. In the video domain, Wang et al. [41] studied a transformation-aware representation for semantic human action classification; Zhou et al. [42] investigated time-lapse video generation given additional class labels.
Multimodal conditional generation has recently been explored for images [43, 44], sketch drawings [45], natural language [46, 47], and video prediction [48, 49]. As noted in previous work, learning to generate diverse and plausible visual data is very challenging for two reasons: first, mode collapse may occur when one-to-many pairs are absent, and collecting sequence data in which such pairs exist is non-trivial; second, posterior collapse can occur when the generation model is based on a recurrent neural network.
3 Problem Formulation and Methods
We start by giving an overview of our problem. We are given a sequence of observations $x_{1:M} = (x_1, \ldots, x_M)$, where $x_t$ is a $D$-dimensional vector representing the observation at time $t$. These observations encode the structure of the moving object and can be represented in different ways, e.g., as keypoint locations or as shape and pose parameters. Changes in these observations encode the motion that we are interested in modeling. We refer to the entire subsequence as a motion mode. Given a motion mode $A = x_{1:M}$, we aim to build a model that is capable of predicting a future motion mode $B = x_{M+1:M+T}$, where $x_{M+t}$ represents the predicted $t$-th step in the future, $t \in \{1, \ldots, T\}$. We first discuss two potential baseline models for this task (Section 3.1), and then present our method (Section 3.2).
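The problem setup above can be sketched in a few lines of Python; the function and variable names here are illustrative (not from the paper), and a random array stands in for a real observation stream:

```python
import numpy as np

# A stream of observations is a (T_total, D) array; a "motion mode" is a
# short window of it. We split the stream into an observed mode A and the
# future mode B that a model should predict.
def split_motion_modes(stream, current_len, future_len):
    """Return (current mode A, future mode B) from the start of the stream."""
    assert stream.shape[0] >= current_len + future_len
    A = stream[:current_len]                          # observed motion mode
    B = stream[current_len:current_len + future_len]  # mode to be predicted
    return A, B

# e.g., 2D keypoints: 32 joints x 2 coordinates = 64-dim observation per step
stream = np.random.default_rng(0).normal(size=(100, 64))
A, B = split_motion_modes(stream, current_len=16, future_len=64)
```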
3.1 Preliminaries
Prediction LSTM for Sequence Generation.
Figure 2(a) shows a simple encoder-decoder LSTM [50, 25] as a baseline for the motion prediction task. At time $t$, the encoder LSTM takes the observation $x_t$ as input and updates its internal representation. After going through the entire motion mode $A = x_{1:M}$, it outputs a fixed-length feature $e_A$ as an intermediate representation. We initialize the internal state of the decoder LSTM using the computed feature $e_A$. At step $t$ of the decoding stage, the decoder LSTM predicts the motion $\hat{x}_{M+t}$. This way, the decoder LSTM gradually predicts the entire future motion mode within $T$ steps. We denote the encoder LSTM as a function $f^{\text{enc}}$ and the decoder LSTM as a function $f^{\text{dec}}$. As a design choice, we initialize the decoder LSTM with the last observed frame as additional input for smoother prediction.
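The encoder-decoder baseline can be sketched as follows; a single tanh recurrent cell with random weights stands in for the trained LSTM, so this shows only the wiring, not the learned model:

```python
import numpy as np

rng = np.random.default_rng(0)

class TinyRNNCell:
    """Minimal recurrent cell standing in for an LSTM (random weights)."""
    def __init__(self, in_dim, hid_dim):
        self.W = rng.normal(0, 0.1, (hid_dim, in_dim + hid_dim))
    def step(self, x, h):
        return np.tanh(self.W @ np.concatenate([x, h]))

def encode(cell, sequence, hid_dim):
    h = np.zeros(hid_dim)
    for x in sequence:            # consume the whole observed motion mode A
        h = cell.step(x, h)
    return h                      # fixed-length motion feature e_A

def decode(cell, readout_W, e, x_last, steps):
    h, x = e.copy(), x_last      # init state from feature; feed last frame
    outputs = []
    for _ in range(steps):       # recursively predict the future mode
        h = cell.step(x, h)
        x = readout_W @ h
        outputs.append(x)
    return np.stack(outputs)

D, H = 64, 32
enc, dec = TinyRNNCell(D, H), TinyRNNCell(D, H)
readout = rng.normal(0, 0.1, (D, H))
A = rng.normal(size=(16, D))
e_A = encode(enc, A, H)
B_hat = decode(dec, readout, e_A, A[-1], steps=64)
```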
Vanilla VAE for Sequence Generation.
Since the deterministic LSTM model fails to reflect the multimodal nature of human motion, we consider a statistical model, $p_\theta(B \mid A)$, parameterized by $\theta$. Given the observed sequence $A$, the model estimates a probability distribution over possible future sequences $B$ instead of a single outcome. To model the multimodality (i.e., the same $A$ can transition to different $B$'s), a latent variable $z$ (sampled from a prior distribution $p(z)$) is introduced to capture the inherent uncertainty. The future sequence is generated as follows:
Sample a latent variable $z \sim p(z)$;

Given $z$ and $A$, generate a sequence of length $T$: $\hat{B} = \hat{x}_{M+1:M+T} \sim p_\theta(B \mid z, A)$;
Following previous work on VAEs [51, 43, 52, 53, 16, 33, 19], the objective is to maximize the variational lower bound of the conditional log-probability $\log p_\theta(B \mid A)$:
$$\mathcal{L}_{\text{VAE}} = -D_{\text{KL}}\big(q_\phi(z \mid A, B)\,\|\,p(z)\big) + \mathbb{E}_{q_\phi(z \mid A, B)}\big[\log p_\theta(B \mid z, A)\big] \quad (1)$$
In Eq. 1, $q_\phi(z \mid A, B)$ is referred to as an auxiliary posterior that approximates the true posterior $p_\theta(z \mid A, B)$. Specifically, the prior $p(z)$ is assumed to be $\mathcal{N}(0, I)$. The posterior $q_\phi(z \mid A, B)$ is a multivariate Gaussian with mean $\mu_\phi(A, B)$ and diagonal covariance $\sigma^2_\phi(A, B)$, respectively. Intuitively, the first term in Eq. 1 regularizes the auxiliary posterior with the prior $p(z)$. The second term can be considered an auto-encoding loss, where we refer to $q_\phi$ as the encoder or recognition model and $p_\theta$ as the decoder or generation model.

As shown in Figure 2(b), the vanilla VAE model adopts a similar LSTM encoder and decoder for sequence processing. In contrast to the Prediction LSTM, the vanilla VAE decoder takes both the motion feature $e_A$ and the latent variable $z$ into account. Ideally, this allows the model to generate diverse motion sequences by drawing different samples from the latent space. However, the semantic role of the latent variable in this vanilla VAE is not straightforward, and it may not effectively represent long-term trends (e.g., dynamics within a specific motion mode or during a change of modes).
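The two terms of Eq. 1 for a diagonal-Gaussian posterior can be sketched as follows; the closed-form KL divergence and the reparameterized sample are standard VAE machinery, and the values plugged in are illustrative:

```python
import numpy as np

# KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions.
def kl_to_standard_normal(mu, log_var):
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

# Reparameterization: z = mu + sigma * eps, so gradients can flow
# through mu and log_var during training.
def reparameterized_sample(mu, log_var, rng):
    return mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)

rng = np.random.default_rng(0)
mu, log_var = np.zeros(512), np.zeros(512)   # posterior equals the prior here
kl = kl_to_standard_normal(mu, log_var)      # exactly 0 in this case
z = reparameterized_sample(mu, log_var, rng)
```

The second (reconstruction) term of Eq. 1 would then be the decoder's log-likelihood of $B$ given this $z$ and the motion feature of $A$.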
3.2 Motion-to-Motion Transformations in Latent Space
To further improve motion sequence generation beyond the vanilla VAE, we propose to explicitly enforce the structure of motion modes in the latent space. We assume that (1) each motion mode can be represented as a low-dimensional feature vector, and (2) transitions between motion modes can be modeled as transformations of these features. Our design is also supported by early studies on hierarchical motion modeling and prediction [8, 54, 55].
We present a Motion Transformation VAE (MT-VAE) (Fig. 2(c)) with four components:

An LSTM encoder $f^{\text{enc}}$ maps the input sequences into motion features via $e_A = f^{\text{enc}}(A)$ and $e_B = f^{\text{enc}}(B)$, respectively.

A latent encoder $h^{\text{enc}}$ computes the transformation $z \in \mathbb{R}^{N_z}$ in the latent space from the concatenated motion features: $z = h^{\text{enc}}([e_A; e_B])$. Here, $N_z$ indicates the latent space dimension.

A latent decoder $h^{\text{dec}}$ synthesizes the future motion feature from the latent transformation $z$ and the current motion feature via $\hat{e}_B = h^{\text{dec}}([z; e_A])$.

An LSTM decoder $f^{\text{dec}}$ synthesizes the future sequence given the motion feature: $\hat{B} = f^{\text{dec}}(\hat{e}_B)$.
Similar to the Prediction LSTM, we use an LSTM encoder/decoder to map motion modes into the feature space. The MT-VAE further maps these features into latent transformations and stochastically samples these transformations. As we demonstrate, this change makes the model more expressive and leads to more plausible results. Finally, in the sequence decoding stage of the MT-VAE, we feed the synthesized motion feature $\hat{e}_B$ as input to the decoder LSTM, with the internal state initialized from the same motion feature and the last observed frame as additional input.
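The wiring of the latent path in the concat variant can be sketched as follows; random linear maps stand in for the learned latent encoder/decoder networks, and the dimensions are taken from the experimental setup (1,024-dim motion features, 512-dim latent space):

```python
import numpy as np

# Sketch of the MT-VAE (concat) latent path. W_enc / W_dec are illustrative
# stand-ins for the learned latent encoder and decoder networks.
rng = np.random.default_rng(0)
F, Z = 1024, 512                            # motion-feature and latent dims

W_enc = rng.normal(0, 0.01, (Z, 2 * F))     # latent encoder: [e_A; e_B] -> z
W_dec = rng.normal(0, 0.01, (F, Z + F))     # latent decoder: [z; e_A] -> e_B_hat

e_A, e_B = rng.normal(size=F), rng.normal(size=F)  # from the LSTM encoder
z = W_enc @ np.concatenate([e_A, e_B])             # latent transformation
e_B_hat = W_dec @ np.concatenate([z, e_A])         # future motion feature
# e_B_hat would now seed the LSTM decoder to synthesize the future sequence.
```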
3.3 Additive Transformations in Latent Space
Although the MT-VAE explicitly models motion transformations in latent space, this space might be unconstrained because the transformations are computed from a vector concatenation of the motion features $e_A$ and $e_B$ in our latent encoder $h^{\text{enc}}$. To better regularize the transformation space, we present an additive variant of the MT-VAE, depicted in Figure 2(d). To distinguish the two variants, we call the previous model MT-VAE (concat) and this model MT-VAE (add), respectively. Our model is inspired by the recent success of deep analogy-making methods [40, 31], where a relation (or transformation) between two examples can be represented as a difference in the embedding space. In this model, we strictly constrain the latent encoding and decoding steps as follows:

Our latent encoder first computes the difference between the two motion features via $T_{B|A} = e_B - e_A$; it then maps this difference feature into a transformation in the latent space via $z = h^{\text{enc}}(T_{B|A})$.

Our latent decoder reconstructs the difference feature from the latent variable $z$ and the current motion feature via $\hat{T}_{B|A} = h^{\text{dec}}([z; e_A])$.

Finally, we apply a simple additive interaction to reconstruct the future motion feature via $\hat{e}_B = e_A + \hat{T}_{B|A}$.
In step one, we infer the latent variable from the difference of $e_B$ and $e_A$ (instead of applying a linear layer to the concatenated vectors). Intuitively, the latent code is expected to capture the mode transition from the current motion to the future motion, rather than a concatenation of the two modes. In step two, we reconstruct the transformation from the latent variable via $\hat{T}_{B|A} = h^{\text{dec}}([z; e_A])$, where $z$ is obtained from the recognition model. In this design, the feature difference depends on both the latent transformation $z$ and the current motion feature $e_A$. Alternatively, we can make the latent decoder context-free by removing the motion feature $e_A$ from its input. In that case, the latent decoder must hallucinate the motion difference solely from the latent space. We provide this ablation study in Section 4.1.
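The additive latent path can be sketched as follows; as before, random linear maps are illustrative stand-ins for the learned networks:

```python
import numpy as np

# Sketch of the MT-VAE (add) latent path: z encodes the feature *difference*,
# and the future feature is recovered as e_A plus a decoded difference.
rng = np.random.default_rng(0)
F, Z = 1024, 512
W_enc = rng.normal(0, 0.01, (Z, F))         # difference feature -> latent z
W_dec = rng.normal(0, 0.01, (F, Z + F))     # [z; e_A] -> reconstructed difference

e_A, e_B = rng.normal(size=F), rng.normal(size=F)
T = e_B - e_A                                # mode transition as a difference
z = W_enc @ T
T_hat = W_dec @ np.concatenate([z, e_A])     # context-dependent decoding
e_B_hat = e_A + T_hat                        # simple additive interaction
# A "context-free" decoder would instead map z alone to T_hat.
```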
Besides this architecture-wise regularization, we introduce two additional objectives while training our model.
Cycle Consistency.
As mentioned previously, our training objective in Eq. 1 is composed of a KL term and a per-frame reconstruction term. The KL term regularizes the latent space, while the reconstruction term ensures that the data can be explained by our generative model. However, we do not have direct regularization in the feature space. We therefore introduce a cycle-consistency loss, given in Eq. 2 for MT-VAE (concat) and Eq. 3 for MT-VAE (add). Figure 3 illustrates the cycle consistency in detail.
$$\mathcal{L}_{\text{cycle}} = \mathbb{E}_{z \sim p(z)} \big\| z - h^{\text{enc}}\big([\,e_A;\ h^{\text{dec}}([z; e_A])\,]\big) \big\|^2 \quad (2)$$
$$\mathcal{L}_{\text{cycle}} = \mathbb{E}_{z \sim p(z)} \big\| z - h^{\text{enc}}\big(h^{\text{dec}}([z; e_A])\big) \big\|^2 \quad (3)$$
In preliminary experiments, we also investigated a consistency loss with a larger cycle (involving the actual motion sequences) during training, but found it ineffective as a regularization term in our setting. We hypothesize that vanishing or exploding gradients, a known issue when training recurrent neural networks, make that cycle-consistency objective less effective.
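The cycle for the additive variant can be sketched as: sample a latent code from the prior, decode it to a feature difference, re-encode that difference, and penalize the mismatch with the original code. The linear maps below are illustrative stand-ins for the learned latent networks:

```python
import numpy as np

# Cycle-consistency sketch for MT-VAE (add): z -> decoded difference -> z'.
rng = np.random.default_rng(0)
F, Z = 1024, 512
W_enc = rng.normal(0, 0.01, (Z, F))          # latent encoder stand-in
W_dec = rng.normal(0, 0.01, (F, Z + F))      # latent decoder stand-in
e_A = rng.normal(size=F)                     # current motion feature

z = rng.standard_normal(Z)                   # sample from the prior
T_hat = W_dec @ np.concatenate([z, e_A])     # latent decode
z_cycle = W_enc @ T_hat                      # re-encode the decoded feature
cycle_loss = np.mean((z - z_cycle) ** 2)     # penalized during training
```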
Motion Coherence.
Specific to our motion generation task, we introduce a motion coherence loss in Eq. 4 that encourages a smooth transition in velocity over the first $K$ steps of prediction. We define the velocity $v_t = x_t - x_{t-1}$, and similarly $\hat{v}_t = \hat{x}_t - \hat{x}_{t-1}$ when $t > M$ (with $\hat{x}_M = x_M$). Intuitively, this loss prevents the generated sequence from deviating too far from the reference future sequence when the latent variable is sampled from the prior.
$$\mathcal{L}_{\text{motion}} = \sum_{t=M+1}^{M+K} \big\| \hat{v}_t - v_t \big\|^2 \quad (4)$$
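The motion coherence term can be sketched as a frame-difference (velocity) match over the first predicted steps; the function name and array shapes are illustrative:

```python
import numpy as np

# Match velocities (frame differences) between the predicted and reference
# futures over the first K predicted steps, anchored at the last observed frame.
def motion_coherence_loss(x_last, pred, ref, K):
    """x_last: (D,) final observed frame; pred/ref: (T, D) future sequences."""
    v_pred = np.diff(np.vstack([x_last, pred[:K]]), axis=0)  # predicted velocity
    v_ref = np.diff(np.vstack([x_last, ref[:K]]), axis=0)    # reference velocity
    return np.mean((v_pred - v_ref) ** 2)

rng = np.random.default_rng(0)
x_last = rng.normal(size=64)
future = rng.normal(size=(64, 64))
loss = motion_coherence_loss(x_last, future, future, K=4)   # identical -> 0
```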
Finally, we summarize our overall loss in Eq. 5, where $\lambda_{\text{cycle}}$ and $\lambda_{\text{motion}}$ are two balancing hyper-parameters for cycle consistency and motion coherence, respectively.
$$\mathcal{L} = -\mathcal{L}_{\text{VAE}} + \lambda_{\text{cycle}} \mathcal{L}_{\text{cycle}} + \lambda_{\text{motion}} \mathcal{L}_{\text{motion}} \quad (5)$$
4 Experiments
Datasets.
The evaluation is conducted on datasets covering two representative human motion modeling tasks: Affect-in-the-Wild (Aff-Wild) [56] for facial motion and Human3.6M [57] for full-body motion. The Aff-Wild dataset contains more than 400 video clips (2,000 minutes in total) collected from YouTube, with natural facial expressions and head motion patterns. To focus on face motion modeling (e.g., expressions and head movements), we leveraged a 3D morphable face model [58, 24] (covering face identity, face expression, and pose) in our experiments. We fitted 198-dim identity coefficients, 29-dim expression coefficients, and 6-dim pose parameters to each frame with a pre-trained 3DMM-CNN [59] model, followed by an optimization-based face fitting algorithm [60]. This disentangled representation allows us to study face motion modeling without being distracted by unrelated factors such as facial identity, background scene, and illumination of the environment. We trained our model on 80% of the data, using the expression and pose parameters, since these are the main factors that change over time. Human3.6M is a large-scale database containing more than 800 human motion sequences performed by 11 professional actors (3.6 million frames in total) in an indoor environment. For experiments on Human3.6M, we used the raw 2D trajectories of 32 keypoints and normalized them into a fixed coordinate range. We used subjects 1, 5, 6, 7, and 8 for training and tested on subjects 9 and 11. We held out 5% of the training data for model validation.
Architecture Design.
Our MT-VAE model consists of four components: a sequence encoder network, a sequence decoder network, a latent encoder network, and a latent decoder network. We build the sequence encoder and decoder using Long Short-Term Memory (LSTM) units [50], with a one-layer LSTM of 1,024 hidden units for both networks. For experiments on the Aff-Wild dataset, the input to the sequence encoder is a 35-dimensional expression-pose representation (29 expression and 6 pose parameters) per timestep, and we recursively predict the future parameters with the sequence decoder. For experiments on Human3.6M, we use a 64-dimensional xy-coordinate representation (32 joints with 2 coordinates each) instead. Given the past and future motion features extracted by the sequence encoder, we build three fully-connected layers with skip connections for the latent encoder network, and adopt a similar architecture (three fully-connected layers with skip connections) for the latent decoder network. For all models (including baselines), we fix the bottleneck latent dimension to 512 and find this configuration sufficient to generate both face and full-body motions.
Implementation Details.
We used ADAM [61] for optimization in all experiments, with a mini-batch size of 256 and a learning rate of 0.0001 under default ADAM settings ($\beta_1 = 0.9$, $\beta_2 = 0.999$). For experiments on Aff-Wild, we trained models to predict 32 steps into the future given a varying number of observed frames between 8 and 16. For experiments on Human3.6M, we trained models to predict 64 steps into the future given a varying number of observed frames between 10 and 20. To stabilize training, we applied layer normalization [62] in both the LSTMs and the fully-connected layers. To encourage the latent variable to capture motion patterns, we applied the KL annealing technique [46] during training, gradually increasing the weight of the KL term from 0 to 1. For experiments on Aff-Wild only, we applied dropout with ratio 0.8 to both the sequence encoder and decoder networks to learn more robust features.
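KL annealing can be sketched as a simple schedule; the linear ramp shape and the warm-up length below are assumptions for illustration, not values from the paper:

```python
# Linear KL annealing: ramp the KL-term weight from 0 to 1 over the first
# `warmup` training steps (schedule shape and warmup length are illustrative).
def kl_weight(step, warmup=10000):
    return min(1.0, step / warmup)

# During training, the ELBO's KL term would be scaled by kl_weight(step),
# so early optimization focuses on reconstruction before regularization.
```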
We used the Prediction LSTM [18] as a deterministic baseline; similar models have been used in previous work on learning human motion dynamics [63, 17]. We implemented the vanilla VAE model [48] as our stochastic baseline; similar models have been used in [33, 16, 19] for stochastic flow prediction from a single image. During training, we used the $\ell_2$ distance as the reconstruction term. We conducted an extensive hyper-parameter search for the vanilla VAE and our MT-VAE variants, enumerating the smoothing window $K$, the motion coherence weight $\lambda_{\text{motion}}$, and the cycle-consistency weight $\lambda_{\text{cycle}}$. All models achieved their best performance with the same smoothing window and motion coherence weight; the best-performing MT-VAE (add) uses a different cycle-consistency weight from the other models.
Please visit the website for more visualizations: https://goo.gl/2Q69Ym.
4.1 Multimodal Motion Generation


We evaluate our model’s capacity to generate diverse and plausible future motion patterns for a given sequence on the Aff-Wild and Human3.6M test sets. Given a sequence as initialization, we generate multiple future motion trajectories using the proposed sampling and generation process. For the Prediction LSTM model, we sample only one future trajectory, since its prediction is deterministic.
Quantitative Evaluations.
We evaluate our model and the baselines quantitatively using minimum squared error and conditional log-likelihood metrics, which have been used to evaluate conditional generative models [43, 16, 53, 48]. As defined in Eq. 6, the Reconstruction minimum squared error (RMSE) measures the squared error of the closest reconstruction to the ground truth when sampling latent variables from the recognition model; it measures reconstruction quality given both the current and future sequences. As defined in Eq. 7, the Sampling minimum squared error (SMSE) measures the squared error of the closest sample to the ground truth when sampling latent variables from the prior; it measures how close our samples are to the reference future sequences.
$$\text{RMSE} = \min_{s \in \{1, \ldots, S\}} \big\| B - \hat{B}(z^{(s)}, A) \big\|^2, \quad z^{(s)} \sim q_\phi(z \mid A, B) \quad (6)$$
$$\text{SMSE} = \min_{s \in \{1, \ldots, S\}} \big\| B - \hat{B}(z^{(s)}, A) \big\|^2, \quad z^{(s)} \sim p(z) \quad (7)$$
In terms of generation diversity and quality, a good generative model is expected to achieve low RMSE and SMSE values given a sufficient number of samples. Note that the posterior collapse issue is usually characterized by low SMSE but high RMSE, as the latent variable sampled from the recognition model is being partially ignored. In addition, we measure the test conditional log-likelihood of the ground-truth sequences under our model via Parzen window estimation (with a bandwidth determined on the validation set). We believe Parzen window estimation is a reasonable approach in our setting, as the dimensionality of the data (sequences of keypoints) is not too high (unlike high-resolution videos). For each example, we draw a fixed number of latent samples from the recognition model to compute the RMSE metric, and from the prior to compute the SMSE and conditional log-likelihood metrics. On Aff-Wild, we evaluate the models on 32-step expression coefficient prediction (29 × 32 = 928 dimensions in total). On Human3.6M, we evaluate the models on 64-step 2D joint prediction (64 × 64 = 4096 dimensions in total). Note that these measurements are approximate, as we do not evaluate every possible subsequence (essentially, every frame can serve as a starting point); instead, we repeat the evaluation every 16 frames on Aff-Wild and every 100 frames on Human3.6M.
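The min-over-samples structure shared by both metrics can be sketched as follows (the sampling source, recognition model vs. prior, is what distinguishes RMSE from SMSE); the arrays here are random placeholders:

```python
import numpy as np

# Squared error of the best of S generated futures against the ground truth.
def min_squared_error(samples, ground_truth):
    """samples: (S, T, D) generated futures; ground_truth: (T, D)."""
    errors = np.sum((samples - ground_truth[None]) ** 2, axis=(1, 2))
    return errors.min()

rng = np.random.default_rng(0)
gt = rng.normal(size=(64, 64))           # 64-step, 64-dim ground-truth future
samples = rng.normal(size=(50, 64, 64))  # 50 sampled futures
samples[7] = gt                          # one sample matches exactly
best = min_squared_error(samples, gt)    # 0 for the exact match
```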
As shown in Table 1, data-driven approaches that simply repeat the motion computed from the last-step velocity, or averaged over the observed sequence, perform poorly on both datasets. In contrast, the Prediction LSTM [18] baseline greatly reduces the SMSE metric compared to these simple data-driven approaches, thanks to its deep sequence encoder-decoder architecture that models more complex motion dynamics through time. Among the three models using latent variables, our MT-VAE (add) model achieves the best quantitative performance. Compared to MT-VAE (concat), which uses vector concatenation, the additive version achieves lower reconstruction error with similar sampling error. This suggests that the MT-VAE (add) model further regularizes the learning of motion transformations.
Qualitative Results.
We provide qualitative side-by-side comparisons across different models in Figure 4. For Aff-Wild, we render 3D face models using the generated expression-pose parameters along with the original identity parameters. For Human3.6M, we directly visualize the generated 2D keypoints. As shown in the generated sequences, our MT-VAE model is able to generate multiple diverse and plausible sequences in the future. In comparison, the sequences generated by the vanilla VAE are less realistic. For example, given a sitting-down motion (lower-left part of Fig. 4) as initialization, the vanilla model fails to predict the motion trend (sitting down) while creating artifacts (e.g., scale changes) in the future prediction. Also note that the MT-VAE produces more natural transitions from the last observed frame to the first generated one (see the mouth shapes in the face motion examples and the distances between the two legs in the full-body examples). This demonstrates that the MT-VAE learns a more robust and structure-preserving representation of motion sequences than the other baselines.
Crowdsourced Human Evaluations.
We conducted crowdsourced human evaluations via Amazon Mechanical Turk (AMT) on 50 videos (10 Turkers per video) from the Human3.6M dataset. The evaluation presents the past action and five generated future actions per method to a human evaluator, who selects the most (1) realistic and (2) diverse results. In this evaluation, we also compare against recently published work on stochastic video prediction [49], which we refer to as SVG. Table 2 shows the percentage of evaluators who selected each method for each criterion. The Prediction LSTM produces the most realistic but least diverse results; the vanilla VAE of Babaeizadeh et al. [48] produces the most diverse but least realistic results; our MT-VAE model (the additive variant here) achieves a good balance between realism and diversity.
Metric         | Vanilla VAE [48] | SVG [49] | Our MT-VAE (add) | Pred LSTM [18]
Realism (%)    | 19.2             | 23.8     | 26.4             | 30.6
Diversity (%)  | 51.6             | 22.3     | 26.1             | 0.0
Method                             | RMSE (test) | SMSE (test) | Test CLL
MT-VAE (add)                       | 0.75 ± 0.01 | 2.87 ± 0.05 | 1.141 ± 0.009
MT-VAE (add) w/o Motion Coherence  | 1.01 ± 0.02 | 2.93 ± 0.04 | 1.012 ± 0.014
MT-VAE (add) w/o Cycle Consistency | 1.18 ± 0.03 | 2.71 ± 0.05 | 0.927 ± 0.019
MT-VAE (add) Context-free Decoder  | 0.31 ± 0.05 | 4.05 ± 0.05 | 1.299 ± 0.007
Ablation Study.
We analyze variations of our MT-VAE (add) model on Human3.6M. As shown in Table 3, removing cycle consistency or motion coherence results in a drop in reconstruction performance. This shows that cycle consistency and motion coherence encourage the motion feature to preserve motion structure and hence be more discriminative in nature. We also evaluate a context-free version of the MT-VAE (add) model, where the transformation vector is not conditioned on the input feature $e_A$. This version produces a poor SMSE value, since it is challenging for the additive latent decoder to hallucinate the transformation vector solely from the latent variable $z$.
4.2 Analogybased Motion Transfer
We evaluate our model on an additional task: transfer by analogy. In this analogy-making experiment, we are given three motion sequences A, B (the subsequent motion of A), and C (a different motion sequence). The objective is to recognize the transition from A to B and transfer it to C. This experiment demonstrates whether our learned latent space models the mode transition across motion sequences. Moreover, this task has numerous graphics applications, such as transferring expressions and their styles, video dubbing, gait style transfer, and video-driven animation [22].
In this experiment, we compare the Prediction LSTM, the vanilla VAE, and our MT-VAE variants. For the stochastic models, we compute the latent variable from motion sequences A and B via the latent encoder (e.g., $z = h^{\text{enc}}(e_B - e_A)$ for the additive variant), and then decode using motion sequence C in place of the current mode (e.g., $\hat{e}_D = e_C + h^{\text{dec}}([z; e_C])$). For the Prediction LSTM, we perform analogy-making directly in the feature space, since that model has no notion of a latent space. As shown in Figure 5, our MT-VAE model is able to combine the transformation learned from the A-to-B transition with the structure in sequence C. The other baselines fail at either adapting the mode transition from A to B or preserving the structure in C. The analogy-based motion transfer task is significantly more challenging than motion generation, since the combination of the three reference motion sequences A, B, and C may never appear in the training data; yet our model is able to synthesize realistic motions. Note that motion modes may not explicitly correspond to semantic motions, as we learn the motion transformations in an unsupervised manner.
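The analogy-transfer procedure for the additive variant can be sketched as follows; random linear maps again stand in for the learned latent networks, so only the data flow is shown:

```python
import numpy as np

# Analogy-based transfer sketch (additive variant): infer the A -> B
# transition in latent space, then replay it on sequence C's motion feature.
rng = np.random.default_rng(0)
F, Z = 1024, 512
W_enc = rng.normal(0, 0.01, (Z, F))         # latent encoder stand-in
W_dec = rng.normal(0, 0.01, (F, Z + F))     # latent decoder stand-in

e_A, e_B, e_C = (rng.normal(size=F) for _ in range(3))
z = W_enc @ (e_B - e_A)                     # transition recognized from A -> B
e_D = e_C + W_dec @ np.concatenate([z, e_C])  # transition applied to C
# e_D would then be decoded by the sequence decoder into the transferred motion.
```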
4.3 Towards Multimodal Hierarchical Video Generation
As an application, we show that our multimodal motion generation framework can be used directly to generate diverse and realistic pixel-level video frames in the future. We trained the keypoint-conditioned image generation model of [18], which takes a previous image frame and a predicted motion structure (e.g., a rendered face or human joints) as input and hallucinates the next frame by combining the image content of the former with the motion of the latter. In Figure 6, we compare video generated deterministically by the Prediction LSTM (i.e., a single future) and stochastically, driven by the predicted motion sequences (i.e., multiple futures) from our MT-VAE (add) model. We use our generated motion sequences to perform video generation experiments on Aff-Wild (with 8 observed input frames) and Human3.6M (with 16 observed input frames).
5 Conclusions
Our goal in this work is to learn a conditional generative model for human motion. This is an extremely challenging problem in the general case and can require a significant amount of training data to generate realistic results. Our work demonstrates that it can be accomplished with minimal supervision by enforcing a strong structure on the problem. In particular, we model long-term human dynamics as a set of motion modes with transitions between them, and we construct a novel network architecture that strongly regularizes this space and allows for stochastic sampling. We have demonstrated that the same idea can be used to model both facial and full-body motion, independent of the representation used (i.e., shape parameters or keypoints).
Acknowledgements.
We thank Zhixin Shu and Haoxiang Li for their assistance with face tracking and fitting codebase. We thank Yuting Zhang, Seunghoon Hong, and Lajanugen Logeswaran for helpful comments and discussions. This work was supported in part by Adobe Research Fellowship to X. Yan, a gift from Adobe, ONR N000141310762, and NSF CAREER IIS1453651.
References
 [1] de Aguiar, E., Stoll, C., Theobalt, C., Ahmed, N., Seidel, H.P., Thrun, S.: Performance capture from sparse multiview video. ACM Trans. Graph. 27(3) (August 2008) 98:1–98:10
 [2] Beeler, T., Hahn, F., Bradley, D., Bickel, B., Beardsley, P., Gotsman, C., Sumner, R.W., Gross, M.: Highquality passive facial performance capture using anchor frames. ACM Trans. Graph. 30(4) (July 2011) 75:1–75:10
 [3] Yang, F., Wang, J., Shechtman, E., Bourdev, L., Metaxas, D.: Expression flow for 3D-aware face component transfer. In: ACM Transactions on Graphics (TOG). Volume 30., ACM (2011) 60
 [4] Suwajanakorn, S., Seitz, S.M., Kemelmacher-Shlizerman, I.: What makes Tom Hanks look like Tom Hanks. In: ICCV. (2015) 3952–3960
 [5] Suwajanakorn, S., Seitz, S.M., Kemelmacher-Shlizerman, I.: Synthesizing Obama: learning lip sync from audio. ACM Transactions on Graphics (TOG) 36(4) (2017) 95
 [6] Sermanet, P., Lynch, C., Hsu, J., Levine, S.: Timecontrastive networks: Selfsupervised learning from multiview observation. arXiv preprint arXiv:1704.06888 (2017)
 [7] Rose, C., Guenter, B., Bodenheimer, B., Cohen, M.F.: Efficient generation of motion transitions using spacetime constraints. In: SIGGRAPH. (1996)
 [8] Bregler, C.: Learning and recognizing human dynamics in video sequences. In: CVPR, IEEE (1997) 568–574
 [9] Efros, A.A., Berg, A.C., Mori, G., Malik, J.: Recognizing action at a distance. In: ICCV, IEEE (2003) 726–733
 [10] Gorelick, L., Blank, M., Shechtman, E., Irani, M., Basri, R.: Actions as spacetime shapes. PAMI 29(12) (2007) 2247–2253
 [11] Laptev, I.: On space-time interest points. International Journal of Computer Vision 64(2-3) (2005) 107–123
 [12] Wang, H., Kläser, A., Schmid, C., Liu, C.L.: Action recognition by dense trajectories. In: CVPR, IEEE (2011) 3169–3176
 [13] Wang, J., Liu, Z., Wu, Y., Yuan, J.: Mining actionlet ensemble for action recognition with depth cameras. In: CVPR, IEEE (2012) 1290–1297
 [14] Walker, J., Gupta, A., Hebert, M.: Dense optical flow prediction from a static image. In: ICCV, IEEE (2015) 2443–2451
 [15] Fischer, P., Dosovitskiy, A., Ilg, E., Häusser, P., Hazırbaş, C., Golkov, V., Van der Smagt, P., Cremers, D., Brox, T.: Flownet: Learning optical flow with convolutional networks. arXiv preprint arXiv:1504.06852 (2015)
 [16] Walker, J., Doersch, C., Gupta, A., Hebert, M.: An uncertain future: Forecasting from static images using variational autoencoders. In: ECCV. (2016)
 [17] Chao, Y.W., Yang, J., Price, B., Cohen, S., Deng, J.: Forecasting human dynamics from static images. In: CVPR. (2017)
 [18] Villegas, R., Yang, J., Zou, Y., Sohn, S., Lin, X., Lee, H.: Learning to generate long-term future via hierarchical prediction. In: ICML. (2017)
 [19] Walker, J., Marino, K., Gupta, A., Hebert, M.: The pose knows: Video forecasting by generating pose futures. In: ICCV, IEEE (2017) 3352–3361
 [20] Li, Z., Zhou, Y., Xiao, S., He, C., Huang, Z., Li, H.: Auto-conditioned recurrent networks for extended complex human motion synthesis. In: ICLR. (2018)
 [21] Yang, F., Bourdev, L., Shechtman, E., Wang, J., Metaxas, D.: Facial expression editing in video using a temporally-smooth factorization. In: CVPR, IEEE (2012) 861–868
 [22] Thies, J., Zollhofer, M., Stamminger, M., Theobalt, C., Nießner, M.: Face2face: Real-time face capture and reenactment of RGB videos. In: CVPR. (2016) 2387–2395
 [23] Averbuch-Elor, H., Cohen-Or, D., Kopf, J., Cohen, M.F.: Bringing portraits to life. ACM Transactions on Graphics (Proceeding of SIGGRAPH Asia 2017) 36(6) (2017) 196
 [24] Blanz, V., Vetter, T.: A morphable model for the synthesis of 3d faces. In: Proceedings of the 26th annual conference on Computer graphics and interactive techniques, ACM Press/Addison-Wesley Publishing Co. (1999) 187–194
 [25] Srivastava, N., Mansimov, E., Salakhutdinov, R.: Unsupervised learning of video representations using lstms. In: ICML. (2015) 843–852
 [26] Mathieu, M., Couprie, C., LeCun, Y.: Deep multi-scale video prediction beyond mean square error. In: ICLR. (2016)
 [27] Hinton, G.E., Krizhevsky, A., Wang, S.D.: Transforming autoencoders. In: International Conference on Artificial Neural Networks, Springer (2011) 44–51
 [28] Oh, J., Guo, X., Lee, H., Lewis, R.L., Singh, S.: Action-conditional video prediction using deep networks in atari games. In: NIPS. (2015)
 [29] Finn, C., Goodfellow, I.J., Levine, S.: Unsupervised learning for physical interaction through video prediction. In: NIPS. (2016)
 [30] Yang, J., Reed, S.E., Yang, M.H., Lee, H.: Weakly-supervised disentangling with recurrent transformations for 3d view synthesis. In: NIPS. (2015) 1099–1107
 [31] Villegas, R., Yang, J., Hong, S., Lin, X., Lee, H.: Decomposing motion and content for natural video sequence prediction. In: ICLR. (2017)
 [32] Denton, E.L., Birodkar, V.: Unsupervised learning of disentangled representations from video. In: NIPS. (2017) 4417–4426
 [33] Xue, T., Wu, J., Bouman, K., Freeman, B.: Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In: NIPS. (2016) 91–99
 [34] Vondrick, C., Pirsiavash, H., Torralba, A.: Generating videos with scene dynamics. In: NIPS. (2016) 613–621
 [35] Tulyakov, S., Liu, M.Y., Yang, X., Kautz, J.: Mocogan: Decomposing motion and content for video generation. In: CVPR. (2018)
 [36] Wichers, N., Villegas, R., Erhan, D., Lee, H.: Hierarchical long-term video prediction without supervision. In: ICML. (2018)
 [37] Kalchbrenner, N., Oord, A.v.d., Simonyan, K., Danihelka, I., Vinyals, O., Graves, A., Kavukcuoglu, K.: Video pixel networks. arXiv preprint arXiv:1610.00527 (2016)
 [38] Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: NIPS. (2013) 3111–3119
 [39] Kulkarni, T.D., Whitney, W.F., Kohli, P., Tenenbaum, J.: Deep convolutional inverse graphics network. In: NIPS. (2015) 2539–2547
 [40] Reed, S.E., Zhang, Y., Zhang, Y., Lee, H.: Deep visual analogymaking. In: NIPS. (2015) 1252–1260
 [41] Wang, X., Farhadi, A., Gupta, A.: Actions ~ transformations. In: CVPR. (2016) 2658–2667
 [42] Zhou, Y., Berg, T.L.: Learning temporal transformations from timelapse videos. In: ECCV, Springer (2016) 262–277
 [43] Sohn, K., Yan, X., Lee, H.: Learning structured output representation using deep conditional generative models. In: NIPS. (2015) 3483–3491
 [44] Zhu, J.Y., Zhang, R., Pathak, D., Darrell, T., Efros, A.A., Wang, O., Shechtman, E.: Toward multimodal image-to-image translation. In: NIPS. (2017) 465–476
 [45] Ha, D., Eck, D.: A neural representation of sketch drawings. In: ICLR. (2018)
 [46] Bowman, S.R., Vilnis, L., Vinyals, O., Dai, A.M., Jozefowicz, R., Bengio, S.: Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349 (2015)
 [47] Hu, Z., Yang, Z., Liang, X., Salakhutdinov, R., Xing, E.P.: Controllable text generation. arXiv preprint arXiv:1703.00955 (2017)
 [48] Babaeizadeh, M., Finn, C., Erhan, D., Campbell, R.H., Levine, S.: Stochastic variational video prediction. In: ICLR. (2018)
 [49] Denton, E., Fergus, R.: Stochastic video generation with a learned prior. In: ICML. (2018)
 [50] Hochreiter, S., Schmidhuber, J.: Long shortterm memory. Neural computation 9(8) (1997) 1735–1780
 [51] Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. In: ICLR. (2014)
 [52] Gregor, K., Danihelka, I., Graves, A., Rezende, D., Wierstra, D.: Draw: A recurrent neural network for image generation. In: ICML. (2015) 1462–1471
 [53] Yan, X., Yang, J., Sohn, K., Lee, H.: Attribute2image: Conditional image generation from visual attributes. In: ECCV, Springer (2016) 776–791
 [54] Smith, K.A., Vul, E.: Sources of uncertainty in intuitive physics. Topics in cognitive science 5(1) (2013) 185–199
 [55] Lan, T., Chen, T.C., Savarese, S.: A hierarchical representation for future action prediction. In: ECCV, Springer (2014) 689–704
 [56] Zafeiriou, S., Kollias, D., Nicolaou, M.A., Papaioannou, A., Zhao, G., Kotsia, I.: Aff-wild: Valence and arousal in-the-wild challenge. In: CVPR Workshops. (2017)
 [57] Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments. PAMI 36(7) (2014) 1325–1339
 [58] Paysan, P., Knothe, R., Amberg, B., Romdhani, S., Vetter, T.: A 3d face model for pose and illumination invariant face recognition. In: AVSS, Genova, Italy, IEEE (2009)
 [59] Tran, A.T., Hassner, T., Masi, I., Medioni, G.: Regressing robust and discriminative 3d morphable models with a very deep neural network. In: CVPR. (2017)
 [60] Zhu, X., Lei, Z., Liu, X., Shi, H., Li, S.Z.: Face alignment across large poses: A 3d solution. In: CVPR. (2016) 146–155
 [61] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
 [62] Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)
 [63] Fragkiadaki, K., Levine, S., Felsen, P., Malik, J.: Recurrent network models for human dynamics. In: ICCV, IEEE (2015) 4346–4354