Learning Diverse Stochastic Human-Action Generators by Learning Smooth Latent Transitions

12/21/2019 ∙ by Zhenyi Wang, et al. ∙ Duke University University at Buffalo 9

Human-motion generation is a long-standing challenging task due to the requirement of accurately modeling complex and diverse dynamic patterns. Most existing methods adopt sequence models such as RNNs to directly model transitions in the original action space. Due to high dimensionality and potential noise, such modeling of action transitions is particularly challenging. In this paper, we focus on skeleton-based action generation and propose to model smooth and diverse transitions on a latent space of action sequences with much lower dimensionality. Conditioned on a latent sequence, actions are generated by a frame-wise decoder shared by all latent action-poses. Specifically, an implicit RNN is defined to model smooth latent sequences, whose randomness (diversity) is controlled by noise in the input. Different from standard action-prediction methods, our model can generate action sequences from pure noise without any conditional action poses. Remarkably, it can also generate unseen actions from mixed classes during training. Our model is learned within a bi-directional generative-adversarial-net framework, which not only generates diverse action sequences of a particular class or a mixture of classes, but also learns to classify action sequences within the same model. Experimental results show the superiority of our method in both diverse action-sequence generation and classification, relative to existing methods.








Human-action generation is an important task for modeling the dynamic behavior of human activities, with many real-world applications such as video synthesis [48], action classification [22, 20, 53, 29, 37, 9] and action prediction [31, 49, 2]. Directly generating human actions from scratch is particularly challenging due to the complexity and high dimensionality of natural scenes. One promising workaround is to first generate easier-to-deal-with skeleton-based action sequences, based on which natural sequences are then rendered. This paper thus focuses on skeleton-based action-sequence generation.

Figure 1: Generating actions from pure noise, our model learns smooth latent-frame transitions via an implicit LSTM-based RNN, which are then decoded into an action sequence via a shared decoder. The action sequence endows a flexible implicit distribution induced by the input noise. Our model can not only generate actions whose classes are seen during training (e.g., throw, kick), but also generate actions of unseen mixed classes (e.g., throw+kick).

Skeleton-based human action generation can be categorized into action synthesis (also referred to as generation) [24] and prediction. Action synthesis refers to synthesizing a whole action sequence from scratch, with controllable label information; whereas action prediction refers to predicting the remaining action-poses given a portion of seed frames. These two tasks are closely related, e.g., the latter can be considered a conditional variant of the former. In general, however, action synthesis is considered more challenging because little input information is available. Existing action-prediction methods can be categorized into deterministic [31, 4, 49, 17] and stochastic [2, 25, 16] approaches. Predicted action sequences in deterministic approaches are not associated with randomness; thus, there is no variance once input sub-sequences are given. By contrast, stochastic approaches induce probability distributions over predicted sequences. In most cases, stochastic (probabilistic) approaches are preferable, as they allow one to generate different action sequences conditioned on the same context.

For diverse action generation, models are required to be stochastic, so that the synthesis process can be considered as drawing samples from an action-sequence probability space. As a result, one approach to action synthesis is to learn a stochastic generative model, which induces probability distributions over the space of action sequences from which we can easily sample. Once such a model is learned, new actions can be generated by merely sampling from it.

Among various deep generative models, the generative adversarial network (GAN) [12] is one of the state-of-the-art methods, with applications to various tasks such as image generation [30], character creation [19], video generation [44] and prediction [7, 26, 28]. However, most existing GAN-based methods for action generation directly learn frame transitions in the original action space. In other words, these works define action generators with recurrent neural networks (RNNs) that directly produce action sequences [28, 16, 26, 50]. Such models are usually difficult to train due to the complexity and high dimensionality of the action space.

In this paper, we overcome this issue by breaking the generator into two components: a smooth-latent-transition component (SLTC) and a global skeleton-decoder component (GSDC). Figure 1 illustrates the key features of and some results from our model. Specifically,

  • The SLTC is responsible for generating smooth latent frames, each of which corresponds to the latent representation of a generated action-pose. The SLTC is modeled by an implicit LSTM, which takes a sequence of independent noises plus a one-hot class vector as input, and outputs a latent-frame sequence. Our method inherits the advantage of RNNs in generating variable-length sequences, but operates on a much lower-dimensional latent space, an advantage over existing methods that model transitions directly in the action space.

  • The GSDC is responsible for decoding each latent frame to an output action pose, via a shared (global) decoder implemented by a deep neural network (DNN). Note that at this stage, only a mapping from a single latent frame to an action-pose needs to be learned, i.e., no sequence modeling is needed, making generation much simpler.
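To make the two-component design concrete, the following minimal NumPy sketch pairs an implicit LSTM (fresh noise plus the class label at every step) with a single shared frame-wise decoder. All shapes, the random weights, and the linear decoder stand-in are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def lstm_step(x, h, c, W):
    """One LSTM step. W maps [x; h] to the four gate pre-activations."""
    z = W @ np.concatenate([x, h])
    i, f, o, g = np.split(z, 4)
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

def generate_sequence(label_onehot, T, latent_dim=8, noise_dim=4, pose_dim=50):
    """SLTC: implicit LSTM driven by fresh noise at each step, then
    GSDC: one shared frame-wise decoder applied to every latent frame."""
    in_dim = noise_dim + label_onehot.size
    W = rng.standard_normal((4 * latent_dim, in_dim + latent_dim)) * 0.1
    # shared decoder: a linear stand-in for the paper's MLP
    W_dec = rng.standard_normal((pose_dim, latent_dim)) * 0.1
    h, c = np.zeros(latent_dim), np.zeros(latent_dim)
    latents, poses = [], []
    for _ in range(T):
        eps = rng.standard_normal(noise_dim)        # independent noise each step
        x_in = np.concatenate([eps, label_onehot])  # noise + one-hot class
        h, c = lstm_step(x_in, h, c, W)
        latents.append(h)
        poses.append(W_dec @ h)                     # frame-wise decoding, no recurrence
    return np.stack(latents), np.stack(poses)

label = np.eye(10)[3]                  # class 3 of 10
z, x = generate_sequence(label, T=16)
print(z.shape, x.shape)                # (16, 8) (16, 50)
```

Note that the decoder is applied independently to each latent frame; only the latent transition is sequential.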

Our model is learned by adopting the bi-directional GAN framework [10, 8], consisting of a stochastic action generator, an action classifier and an action discriminator. These three networks compete with each other adversarially. At equilibrium, the generator can learn to generate diverse action sequences that match the training data. In addition, the classifier is able to learn to classify both real and synthesized action sequences. Our contributions are summarized as follows:

  • We propose a novel stochastic action-sequence generator architecture, which benefits from the ability to learn smooth latent transitions. The proposed architecture eases the training of the RNN-based sequence generator, and at the same time learns to generate higher-quality and more diverse actions.

  • We propose to learn an action-sequence classifier simultaneously within the bi-directional GAN framework, achieving both action generation and classification.

  • Extensive experiments are conducted, demonstrating the superiority of our proposed framework.

Related Works

Skeleton-based action prediction has been studied for years. One of the most popular approaches to human motion prediction (conditioned on a portion of seed action-poses) is based on recurrent neural networks [31, 49, 25]. For skeleton-based action generation, switching linear models [35, 3, 33] were proposed to model the stochastic dynamics of human motions. However, it is difficult to select a suitable number of switching states for best modeling, and such models usually require a large amount of training data due to their large size. The restricted Boltzmann machine (RBM) has also been applied to motion generation [40, 39, 41], but inference for RBMs is known to be particularly challenging. Gaussian-process latent variable models [47, 42] and their variants [46] have been applied to this task as well; one problem with such methods, however, is that they are not scalable enough to deal with large-scale data.

Among deep-learning-based methods, RNNs are probably one of the most successful models [11]. However, most existing models assume Gaussian or Gaussian-mixture output distributions. Different from our implicit representation, these methods are not expressive enough to capture the diversity of human actions.

In contrast to action prediction, limited work has been done on diverse action generation, apart from some preliminary work. Specifically, the motion-graph approach [32] needs to extract motion primitives from prerecorded data; the diversity and quality of actions are restricted by the way the primitives and the transitions between them are defined. Variational autoencoders and GANs have also been applied to motion generation [16, 50, 21]. However, these methods directly learn motion transitions with an RNN, so the error of the current frame accumulates into the next, making it difficult to generate long action sequences, especially for aperiodic motions such as eating and drinking.

Another distinction between our model and existing methods is that the latter typically require some seed action-frames as input to a generator [52, 2, 50, 21] learned under the GAN framework; whereas our model is designed to generate action sequences from scratch, and is learned under the bi-directional GAN framework to achieve simultaneous action generation and classification.

The Proposed Model

We first illustrate our whole model in Figure 2, followed by detailed descriptions of specific components.

Figure 2: The proposed action-generation model (top), with detailed structures of the action-sequence generator (G) and discriminator (D). The classifier (C) has the same structure as D except that it outputs a class label instead of a binary value.

Problem Setup and Challenges

Our training dataset is represented as $\mathcal{D}=\{(\mathbf{x}^{(i)}_{1:T_i}, y^{(i)})\}_{i=1}^{N}$, where $\mathbf{x}_t \in \mathbb{R}^{d}$ represents one action-pose of dimension $d$; $T_i$ is the length of the $i$-th sequence; and $y^{(i)}$ is the corresponding one-hot label vector of the sequence***Our model can also be applied to data without labels by simply removing $y$ from the generator. We focus on the labeled case.. Our basic goal is to train a stochastic sequence generator $G$ using a DNN parametrized by $\theta$. Hence, given a label $y$ and a sequence of random noises $\epsilon_{1:T}$ (specified later), $G$ is supposed to generate a new action sequence following

$$\tilde{\mathbf{x}}_{1:T} = G_{\theta}(\epsilon_{1:T}, y), \tag{1}$$

where $T$ is the length of the sequence, which can be specified flexibly in $G$.

Remark 1

Similar to implicit generative models such as GAN, we call the generator form (1) an implicit generator, in the sense that the generated sequence is a random sequence endowing an implicit probability distribution with an unknown density function induced by the random noise $\epsilon_{1:T}$. A traditional way of modeling action sequences usually defines $G$ as an RNN, which typically defines a Gaussian distribution (explicit) for each $\mathbf{x}_t$, referred to as explicit modeling. An implicit distribution is typically much more flexible than an explicit one, as the density is not restricted to a particular distribution class.


There are a few challenges. The first relates to how to define an expressive-enough generator for diverse action generation. We adopt an implicit model of action sequences without an explicit form assumption; thus, the generator has the representation power to generate more sophisticated, higher-quality and more diverse action sequences. The second challenge is to find an appropriate generator structure. One straightforward way is to define the generator as an RNN that outputs action sequences directly, similar to [16, 50]. However, it is well known that an RNN with high-dimensional outputs is challenging to train [34]. In recent years, attention [1] and the Transformer model [43] have been developed to enhance or replace RNN-based models; they address the long-term-dependency problem in seq2seq models. They are not directly applicable to our setting because our model is not a seq2seq model: our inputs are pure random noise. Instead, we propose a novel generator structure, where smooth latent-action transitions are first generated via an RNN and then fed into a shared frame-wise (non-sequential) decoder that maps each latent pose to its corresponding action-pose. The detailed structure of the generator is illustrated in Figure 2 and described below.

Stochastic Action-Sequence Generator

Our action-sequence generator consists of two components, the SLTC and the GSDC. The detailed structure of the generator is illustrated in Figure 2.

Learning smooth latent transitions

Instead of directly modeling sequential transitions in the action space, we propose to model them in a latent action-sequence space. To this end, we decompose $G$ into a composition of an implicit LSTM and a shared frame-wise decoder. The LSTM generator (a.k.a. SLTC) models smooth latent action transitions, and the shared decoder (a.k.a. GSDC) models the frame-wise mapping from latent space to action space. Specifically, denote by $\mathbf{z}_t$ the latent representation of an action-pose $\mathbf{x}_t$. We define $\mathbf{z}_{1:T}$ to be the outputs of an implicit LSTM, written as:

$$\mathbf{z}_t = \text{LSTM}_{\phi}([\epsilon_t; y], \mathbf{z}_{t-1}), \tag{2}$$

where $[\epsilon_t; y]$ is the input of the LSTM at time $t$ (the noises $\epsilon_t$ are independent of each other for all $t$); and $\phi$ is the parameter of the LSTM network. We call (2) an implicit LSTM because its input consists of independent noise at each time step in both the training and testing (generation) stages (see the generator graph in Figure 2). This generator differs from a standard LSTM, where at test time the output of the previous time step is used as the input of the current step. In addition, the noise in the input induces a much more flexible implicit distribution on $\mathbf{z}_{1:T}$; whereas a standard LSTM defines an explicit distribution such as a Gaussian, restricting the representation power. Another advantage of adopting an implicit LSTM as a latent-frame generator is that the length of a generated action sequence can be controlled in the latent space instead of the action space. In general, the dimension of $\mathbf{z}_t$ is much smaller than that of the action poses, making the LSTM easier to train. Finally, modeling latent representations with an LSTM also encourages latent transitions to be smooth, which is a desired property for action-sequence generation.

To further ease the training of the LSTM, we propose a variant whose outputs are defined as residual latent sequences. That is, instead of modeling $\mathbf{z}_t$ directly as in (2), we define the following generating process:

$$\mathbf{z}_t = \mathbf{z}_{t-1} + \text{LSTM}_{\phi}([\epsilon_t; y], \mathbf{z}_{t-1}). \tag{3}$$
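The residual update can be sketched as follows; the per-step increment here is a random stand-in for the LSTM output, under the assumption that small increments yield a smooth latent trajectory:

```python
import numpy as np

rng = np.random.default_rng(1)

def residual_latents(T, latent_dim=8, step_scale=0.1):
    """Residual variant: the recurrent net outputs an increment dz_t,
    and the latent frame is z_t = z_{t-1} + dz_t. The increment below is
    a random stand-in for the LSTM output."""
    z = np.zeros(latent_dim)
    traj = []
    for _ in range(T):
        dz = step_scale * rng.standard_normal(latent_dim)  # LSTM-output stand-in
        z = z + dz                                         # residual update
        traj.append(z.copy())
    return np.stack(traj)

traj = residual_latents(32)
# consecutive frames stay close, so the latent trajectory is smooth
step_norms = np.linalg.norm(np.diff(traj, axis=0), axis=1)
print(step_norms.max())
```

The net only has to learn small per-step changes rather than absolute latent positions, which is what eases training.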


The shared frame-wise decoder

The second component, the GSDC, is a shared frame-wise decoder mapping one latent frame $\mathbf{z}_t$ to the corresponding action pose $\mathbf{x}_t$. Specifically, given $\mathbf{z}_{1:T}$ from the SLTC, we have, for all $t$,

$$\mathbf{x}_t = \text{Dec}_{\psi}(\mathbf{z}_t),$$

where $\text{Dec}_{\psi}$ represents a decoder implemented as any DNN with parameter $\psi$, mapping an input latent frame to an output action pose. In experiments, we use a simple MLP structure for $\text{Dec}_{\psi}$.

The whole action sequence generator

Stacking the above two components constitutes our implicit action generator. To further enforce smooth transitions, we penalize changes between consecutive generated frames, in both the latent space and the action space, with the following regularizer, similar to [5]:

$$\mathcal{L}_{s} = \lambda_1 \sum_{t=1}^{T-1} \|\mathbf{z}_{t+1}-\mathbf{z}_t\|_2^2 + \lambda_2 \sum_{t=1}^{T-1} \|\mathbf{x}_{t+1}-\mathbf{x}_t\|_2^2, \tag{4}$$

where $\lambda_1$ and $\lambda_2$ control the relative importance of the corresponding regularizer terms.
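A minimal sketch of such a consecutive-frame penalty (the exact weighting and norm used in the paper may differ):

```python
import numpy as np

def smoothness_penalty(z, x, lam_z=1.0, lam_x=1.0):
    """Penalize changes between consecutive frames, in the latent space (z)
    and in the decoded action space (x). lam_z / lam_x weight the two terms."""
    dz = np.sum((z[1:] - z[:-1]) ** 2)   # sum of squared latent differences
    dx = np.sum((x[1:] - x[:-1]) ** 2)   # sum of squared pose differences
    return lam_z * dz + lam_x * dx

z = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])  # toy latent frames
x = 2.0 * z                                          # toy decoded poses
print(smoothness_penalty(z, x))                      # 2.0 + 8.0 = 10.0
```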

Action-Sequence Classifier

Modeling action-sequence generation and classification simultaneously enables information sharing between the generator and the classifier, and is thus expected to boost model performance. As a result, we define a classifier with a bi-directional LSTM [36], whose outputs are fed to a fully connected layer and a softmax layer for classification. The purpose of adopting the bi-directional LSTM is to effectively model frame-wise relations from both directions, which has been shown to be more effective than single-direction modeling in sequence models [18, 38, 13]. Please refer to Figure 2 for the detailed structure of our sequence classifier.

Action Discriminator and Model Training

Bi-directional GAN based training

The proposed action-sequence generator and classifier constitute a pair of networks that translate between each other, i.e., inverting the classifier achieves the same goal as the generator. To train these two networks effectively, we borrow ideas from the bi-directional GAN and define a discriminator $D$ to play an adversarial game with the generator and the classifier. Specifically, action-label pairs come from two sources: one starts from a random label $y$ and then generates an action sequence via the generator $G$; the other starts from a randomly-sampled training action sequence and then generates a label via the classifier $C$. Let $p(y)$ be a prior distribution over labels (we adopt a uniform distribution in our method); $p_G(\mathbf{x}_{1:T}|y)$ be the implicit distribution induced by the generator; $q(\mathbf{x}_{1:T})$ be the empirical action-sequence distribution of the training data; and $q_C(y|\mathbf{x}_{1:T})$ be the conditional label distribution induced by the classifier given a sequence $\mathbf{x}_{1:T}$. Our model updates the generator $G$, the classifier $C$, and the discriminator $D$ alternately, following the GAN training procedure. Similar to the classifier, the discriminator is also defined by a bi-directional LSTM. The bi-directional GAN is trained to match the joint distributions $p(y)\,p_G(\mathbf{x}_{1:T}|y)$ and $q(\mathbf{x}_{1:T})\,q_C(y|\mathbf{x}_{1:T})$, via the following min-max game:

$$\min_{G,C}\max_{D} \; \mathbb{E}_{p(y)\,p_G(\mathbf{x}_{1:T}|y)}\big[\log D(\mathbf{x}_{1:T}, y)\big] + \mathbb{E}_{q(\mathbf{x}_{1:T})\,q_C(y|\mathbf{x}_{1:T})}\big[\log\big(1 - D(\mathbf{x}_{1:T}, y)\big)\big]. \tag{5}$$


In addition, motivated by CycleGAN [54] and ALICE [27], a cycle-consistency loss is introduced:

$$\mathcal{L}_{c} = \mathbb{E}_{p(y),\,\epsilon}\big[\text{CE}\big(y,\; C(G(\epsilon_{1:T}, y))\big)\big], \tag{6}$$

where $\text{CE}$ denotes the cross entropy between two distributions. Combining (4), (5) and (6) constitutes the final loss of our model.
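The cycle-consistency term reduces to a cross entropy between the label fed to the generator and the classifier's prediction on the generated sequence; a toy sketch with hypothetical class probabilities:

```python
import numpy as np

def cross_entropy(y_onehot, probs, eps=1e-12):
    """CE between the label given to the generator and the classifier's
    predicted distribution over the generated sequence."""
    return -np.sum(y_onehot * np.log(probs + eps))

y = np.eye(4)[2]                                   # label fed to the generator
probs_good = np.array([0.01, 0.01, 0.97, 0.01])    # classifier recovers the label
probs_bad = np.array([0.40, 0.30, 0.05, 0.25])     # classifier is confused
print(cross_entropy(y, probs_good) < cross_entropy(y, probs_bad))  # True
```

Minimizing this term pushes the classifier to recover, from a generated sequence, exactly the label that produced it.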

Figure 3: Latent space with dimension = 2. The trajectories intersect with each other due to some similar frames in different action sequences.
Figure 4: Action diversity of generated latent trajectories. Left: Greeting; Right: Posing.
Figure 5: Diversity of generated action sequences.

Shared frame-wise decoder pretraining

It is useful to pretrain the shared frame-wise decoder with the training data. To this end, we use the conditional WGAN-GP model [15] to train a generator that generates independent action poses from a given label. This generator, denoted $G_f$, corresponds to the shared decoder in our model. To match the input with our frame decoder, we replace the original input with a random sample from a simple distribution $p_0(\mathbf{z})$, e.g., the standard Gaussian. The discriminator, denoted $D_f$, is an auxiliary network that is discarded after pretraining. The objective function is defined as:

$$\min_{G_f}\max_{D_f} \; \mathbb{E}_{p_f(\mathbf{x})}\big[D_f(\mathbf{x}, y)\big] - \mathbb{E}_{p_0(\mathbf{z})}\big[D_f(G_f(\mathbf{z}, y), y)\big] - \lambda\,\mathbb{E}_{\hat{\mathbf{x}}}\big[\big(\|\nabla_{\hat{\mathbf{x}}} D_f(\hat{\mathbf{x}}, y)\|_2 - 1\big)^2\big], \tag{7}$$

where $p_f(\mathbf{x})$ denotes the frame distribution of the training data; and $\lambda$ controls the magnitude of the gradient penalty that enforces the Lipschitz constraint.

Finally, the whole training procedure of our model is described in the Algorithm in the Appendix.


We evaluate our proposed model for diverse action generation on two datasets, in terms of both action-sequence quality and diversity. We also conduct extensive human evaluation of the results generated by different models. An ablation study, implementation details and more results are provided in the Appendix. Code is also made available at https://github.com/zheshiyige/Learning-Diverse-Stochastic-Human-Action-Generators-by-Learning-Smooth-Latent-Transitions.

Datasets & Baselines


We adopt the Human-3.6m dataset [6] and the NTU dataset [37]. Human-3.6m is a large-scale dataset for human activity recognition and analysis. Following [5], we subsample the video frames to 16 fps to obtain more significant action variations. Our model is trained on 10 classes of actions: Directions, Discussion, Eating, Greeting, Phoning, Posing, Sitting, SittingDown, Walking, and Smoking.

NTU RGB+D is a large action dataset collected with Microsoft Kinect v2 cameras [37]. For our purpose, we only use the 2D skeleton locations of 25 major body joints in the corresponding depth/IR frames. As with Human-3.6m, we sample 10 action classes for training and testing: drinking water, throw, sitting down, wear jacket, standing up, hand waving, kicking something, jump up, make a phone call, and cross hands in front. We follow the evaluation protocols of previous work on this dataset, adopting cross-subject and cross-view recognition accuracy. For cross-subject evaluation, sequences for training (20 subjects) and testing (20 subjects) come from different subjects. For cross-view evaluation, the training set consists of action sequences collected by two cameras, and the test set consists of the remaining data. After splitting and removing missing or incomplete sequences, there are 2260 training and 1070 testing action sequences for cross-subject evaluation, and 2213 training and 1117 testing sequences for cross-view evaluation.


Generating action sequences from scratch is a relatively unexplored field. The models most related to ours are the recently proposed EPVA-adv model [51], an action generator trained with a VAE [16], and the model proposed in [5] for human-action generation. In the experiments, we compare our model with these three, as well as other task-specific baselines.

Detailed Results

Figure 6: Randomly selected action sequences generated on the Human-3.6m dataset (first row: Smoking) and the NTU RGB+D dataset (last row: Drinking).


                 With seed motion              Without seed motion
        E2E     EPVA    EPVA-adv      E2E     EPVA    EPVA-adv   Habibie et al. [16]   Cai et al. [5]   Ours
        0.304   0.305   0.339         0.991   0.996   0.977      0.452                 0.419            0.195
        0.305   0.326   0.335         0.805   0.806   0.792      0.467                 0.436            0.218

Table 1: Comparisons of our model with [16, 5, 51] in terms of maximum mean discrepancy (the lower the better).
(a) α1 = 0.5, α2 = 0.5
(b) α1 = 0.3, α2 = 0.7
Figure 9: Latent space of mixed classes of actions with different mixing proportions α1 and α2.
Figure 10: Novel mixed action sequences generated on human3.6 dataset. First row: generated sequence of “Walking”; Second row: generated sequence of “Phoning”; Third row: generated sequence of mixed “Walking” + “Phoning”.

Latent frame transitions

To show the effectiveness of our proposed latent-transition mechanism, we visualize the learned latent representations for selected classes.

Latent trajectories of different classes on the Human-3.6m dataset are plotted in Figure 3, with a latent-frame dimension of 2. More results with higher dimensionalities are provided in the Appendix. It is interesting to observe that, in the 2-dimensional latent space, some latent trajectories intersect with each other. This is reasonable because action poses in different action categories can be similar, e.g., smoking (green) and eating (blue) in Figure 3 (left).

To demonstrate the diversity of the learned action generator, we plot multiple latent trajectories for selected classes in Figure 4, all starting from the same initial point. It is clear that, as time goes on, the generated latent frames become more diverse, a distinct property lacking in the deterministic generators of most action-prediction models such as [31, 49].

To better illustrate the diversity of the action sequences, we compare our model with three recently proposed models [16, 5, 51]. The mean and variance of each action pose over time for a set of trajectories are plotted in Figure 5. Remarkably, trajectories from our model are much more diverse than those of the baselines. For a quantitative comparison, the standard deviations for different action classes are listed in the Appendix, along with more diverse action-generation results. All these results indicate the superiority of our model in generating diverse action sequences.

Quality of generated action sequences

We adopt two metrics to measure the quality of the generated actions: action-classification accuracy and the maximum mean discrepancy (MMD) between generated and real action sequences. The latter metric is adapted from measuring the quality of images generated by GAN-based models, and has been used in [45] for measuring the quality of action sequences. For action classification, we use the trained classifier from our model to classify both real (testing) actions and a set of randomly generated actions from our model. A good generator should generate action sequences that yield classification accuracies similar to or better than real data. We compare our model with a baseline that uses the same classifier structure but is trained independently on the training data. For every action class, we randomly sample 100 sequences for testing. The classification accuracies are shown in Table 2. Our model achieves comparable performance on real and generated action sequences, outperforming the baseline model by a large margin in general. In some cases, the accuracy on generated action sequences is even higher than that on real data, because both the real and the generated data are involved in training the classifier under the bi-directional GAN framework [8]. The average cross-subject and cross-view classification accuracies across classes are 0.824 and 0.885, respectively, for our model.

Split Data drinking throw sitting down wear jacket standing up hand waving kick jump phoning cross hands
cross-view Baseline_T 0.71 0.90 0.92 0.94 0.93 0.73 0.85 0.76 0.87 0.92
Baseline_G 0.65 0.21 0.25 0.38 0.75 0.61 0.66 0.46 0.75 0.13
Ours_T 0.82 0.92 0.94 0.96 0.95 0.76 0.86 0.93 0.84 0.88
Ours_G 0.87 0.94 0.98 0.93 0.95 0.81 0.97 0.79 0.91 0.82
cross-sub Baseline_T 0.76 0.80 0.90 0.90 0.92 0.73 0.6 0.67 0.86 0.74
Baseline_G 0.60 0.15 0.12 0.57 0.74 0.67 0.34 0.26 0.59 0.21
Ours_T 0.82 0.65 0.93 0.95 0.94 0.83 0.74 0.69 0.82 0.81
Ours_G 0.83 0.86 0.92 0.91 0.90 0.87 0.86 0.82 0.81 0.76
Table 2: Cross-view and cross-subject evaluation of classification accuracies on the NTU RGB+D dataset. Baseline_T denotes the independently trained classifier tested on real data, and Baseline_G the same classifier tested on generated sequences. Ours_T denotes our model tested on real data, and Ours_G our model tested on generated sequences.

The MMD measures the discrepancy of two distributions based on their samples (the generated and real action sequences in our case). Since our data are represented as sequences, we propose two sequence-level MMD metrics (more details in Appendix B). Following [45], we vary the kernel bandwidth over a range of values and report the maximum value computed. We compare our model with [16, 5, 51]. [51] contains three models: E2E, EPVA and EPVA-adv. These models require a few seed frames as inputs to the generator. To adapt these models to our setting, we define two variants: 1) given labels and only the first frame as seed input; 2) given only labels and no seed frames. The results are reported in Table 1. It is interesting to see that, without the need of a seed action-pose, our model obtains much lower MMD scores than the baselines with no seed action-poses. Figure 6 further plots some examples of generated action sequences on the two datasets, which further demonstrates the high quality of our generated actions.
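For reference, a biased empirical MMD² with an RBF kernel [14] can be computed as below; flattening sequences into vectors and fixing a single bandwidth are simplifying assumptions, while the paper's sequence-level variants are described in its Appendix B:

```python
import numpy as np

def mmd_rbf(X, Y, bandwidth=1.0):
    """Biased empirical MMD^2 with an RBF kernel between two sample sets.
    X: (n, d) and Y: (m, d) arrays of (flattened) sequences."""
    def k(A, B):
        # pairwise squared distances, then Gaussian kernel
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-d2 / (2 * bandwidth**2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(0)
same = mmd_rbf(rng.normal(0, 1, (200, 5)), rng.normal(0, 1, (200, 5)))
diff = mmd_rbf(rng.normal(0, 1, (200, 5)), rng.normal(3, 1, (200, 5)))
print(same < diff)  # matched distributions give a much smaller MMD
```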

Novel action generation by mixing action classes

Another distinct feature of our model is its ability to generate novel action sequences by manipulating the input class variable $y$. One interesting way to do this is to mix several action classes with a soft label $\boldsymbol{\alpha}$ whose elements satisfy $\alpha_i \ge 0$ and $\sum_i \alpha_i = 1$. When feeding such a soft-mixed label into our model, due to the smoothness of the learned latent space, one would expect the generated sequence to contain features from all of the mixed classes. To demonstrate this, we consider mixing two classes. Figure 9 plots the latent trajectories for different mixing coefficients. As expected, the trajectories of mixed classes smoothly interpolate between the original trajectories. For better visualization, let the first two classes correspond to walking and phoning. We set the mixing coefficients and generate the corresponding action sequences by feeding the soft label into the generator. The generated sequences are shown in Figure 10. It is interesting to see that the sequence with mixed classes indeed contains motions of both the legs and the hands, corresponding to walking and phoning, respectively.
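Constructing such a soft-mixed label is straightforward; the class indices below are hypothetical:

```python
import numpy as np

def mix_labels(class_a, class_b, alpha, num_classes=10):
    """Soft label for generating a mixed action: entries are non-negative
    and sum to one, e.g. alpha*class_a + (1-alpha)*class_b."""
    eye = np.eye(num_classes)
    return alpha * eye[class_a] + (1 - alpha) * eye[class_b]

# hypothetical indices for Walking and Phoning, mixed equally
y_mix = mix_labels(8, 4, alpha=0.5)
print(y_mix.sum(), y_mix.min())  # 1.0 0.0
```

Feeding `y_mix` in place of a one-hot label is all that changes at generation time.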

Model Average Score
[16] 2.445
[51] 2.387
[5] 2.847
Ours 3.378
Table 3: Human evaluations.

Human evaluation

We ran perceptual studies on Amazon Mechanical Turk (AMT) to assess the realism of generated actions. There were 120 participants in this three-round evaluation. Every worker was assigned a collection of evaluation tasks, each of which consists of four videos generated by one of the four models, i.e., [16, 51, 5] and ours. The worker was asked to evaluate each group of videos on a scale from 1 to 5; the higher the score, the more realistic the action. Table 3 summarizes the results, which clearly show the superiority of our method over the others. The scoring standard and detailed experimental design are provided in the Appendix.

Ablation study

We conduct an extensive ablation study to better understand each component of our model, including the smoothness term, the cycle-consistency loss, and residual latent-sequence prediction. More details are described in the Appendix.


We propose a new framework for stochastic diverse action generation, which induces flexible implicit distributions over generated action sequences. Different from existing action-prediction methods, our model does not require conditional action poses in the generation process, although it can easily be generalized to this setting. At its core is a latent-action generator that learns smooth latent transitions, which are then fed to a shared decoder to generate the final action sequences. Our model is formulated within the bi-directional GAN framework, with a sequence classifier that simultaneously learns action classification. Experiments on two publicly available datasets demonstrate the effectiveness of the proposed model, which obtains better results than related baseline models.


  • [1] D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In ICLR, Cited by: Challenges.
  • [2] E. Barsoum, J. Kender, and Z. Liu (2017) HP-gan: probabilistic 3d human motion prediction via gan. External Links: Link Cited by: Introduction, Introduction, Related Works.
  • [3] A. Bissacco (2005) Modeling and learning contact dynamics in human motion. In CVPR, Cited by: Related Works.
  • [4] J. Bütepage (2017) Deep representation learning for human motion prediction and classification. In IEEE CVPR, Cited by: Introduction.
  • [5] H. Cai, C. Bai, Y. Tai, and C. Tang (2018) Deep video generation, prediction and completion of human action sequences. In ECCV, Cited by: Table 4, 1st item, The whole action sequence generator, Datasets, Baselines, Latent frame transitions, Quality of generated action sequences, Human evaluation, Table 1, Table 3.
  • [6] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu (2014) Human3.6M: large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE TPAMI. Cited by: Datasets.
  • [7] E. Denton and V. Birodkar (2017) Unsupervised learning of disentangled representations from video. In NIPS, Cited by: Introduction.
  • [8] J. Donahue, P. Krähenbühl, and T. Darrell (2017) Adversarial feature learning. In ICLR, Cited by: Introduction, Quality of generated action sequences.
  • [9] Y. Du, W. Wang, and L. Wang (2015) Hierarchical recurrent neural network for skeleton based action recognition. In CVPR, Cited by: Introduction.
  • [10] V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville (2017) Adversarially learned inference. In ICLR, Cited by: Introduction.
  • [11] K. Fragkiadaki, S. Levine, P. Felsen, and J. Malik (2015) Recurrent network models for human dynamics. In ICCV, Cited by: Related Works.
  • [12] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In NIPS, Cited by: Introduction.
  • [13] A. Graves (2012) Sequence transduction with recurrent neural networks. In ICML Representation Learning Workshop, Cited by: Action-Sequence Classifier.
  • [14] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. J. Smola (2012) A kernel two-sample test. JMLR. Cited by: Appendix B.
  • [15] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of wasserstein gans. In NIPS, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Cited by: Shared frame-wise decoder pretraining.
  • [16] I. Habibie, D. Holden, J. Schwarz, J. Yearsley, and T. Komura (2017) A recurrent variational autoencoder for human motion synthesis. In BMVC, Cited by: Table 4, Introduction, Introduction, Related Works, Challenges, Baselines, Latent frame transitions, Quality of generated action sequences, Human evaluation, Table 1, Table 3.
  • [17] F. G. Harvey and C. Pal (2018) Recurrent transition networks for character locomotion. In SIGGRAPH Asia, Cited by: Introduction.
  • [18] Z. Huang, W. Xu, and K. Yu (2015) Bidirectional lstm-crf models for sequence tagging. External Links: Link Cited by: Action-Sequence Classifier.
  • [19] Y. Jin, J. Zhang, M. Li, Y. Tian, and H. Zhu (2017) Towards the automatic anime characters creation with generative adversarial networks. External Links: Link Cited by: Introduction.
  • [20] Q. Ke, M. Bennamoun, S. An, F. Sohel, and F. Boussaid (2017) A new representation of skeleton sequences for 3d action recognition. In CVPR, Cited by: Introduction.
  • [21] M. A. Kiasari, D. S. Moirangthem, and M. Lee (2018) Human action generation with generative adversarial networks. External Links: Link Cited by: Related Works, Related Works.
  • [22] T. S. Kim and A. Reiter (2017) Interpretable 3d human action analysis with temporal convolutional networks. In BNMW CVPR, Cited by: Introduction.
  • [23] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR, Cited by: Appendix C.
  • [24] L. Kovar, M. Gleicher, and F. Pighin (2002) Motion graphs. In SIGGRAPH, Cited by: Introduction.
  • [25] J. N. Kundu, M. Gor, and R. V. Babu (2019) BiHMP-gan: bidirectional 3d human motion prediction gan. In AAAI, Cited by: Introduction, Related Works.
  • [26] A. X. Lee, R. Zhang, F. Ebert, P. Abbeel, C. Finn, and S. Levine (2018) Stochastic adversarial video prediction. External Links: Link Cited by: Introduction.
  • [27] C. Li, H. Liu, C. Chen, Y. Pu, L. Chen, R. Henao, and L. Carin (2017) ALICE: towards understanding adversarial learning for joint distribution matching. In NIPS, Cited by: Bi-directional GAN based training.
  • [28] X. Liang, L. Lee, W. Dai, and E. P. Xing (2017) Dual motion gan for future-flow embedded video prediction. In ICCV, Cited by: Introduction.
  • [29] J. Liu, A. Shahroudy, D. Xu, and G. Wang (2016) Spatio-temporal lstm with trust gates for 3d human action recognition. In ECCV, Cited by: Introduction.
  • [30] L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars, and L. Van Gool (2017) Pose guided person image generation. In NIPS, Cited by: Introduction.
  • [31] J. Martinez, M. J. Black, and J. Romero (2017) On human motion prediction using recurrent neural networks. In CVPR, Cited by: Introduction, Introduction, Related Works, Latent frame transitions.
  • [32] J. Min and J. Chai (2012) MotionGraphs++: a compact generative model for semantic motion analysis and synthesis. In ACM TOG, Cited by: Related Works.
  • [33] S. M. Oh, J. M. Rehg, T. Balch, and F. Dellaert (2005) Learning and inference in parametric switching linear dynamical systems. In ICCV, Cited by: Related Works.
  • [34] R. Pascanu, T. Mikolov, and Y. Bengio (2013) On the difficulty of training recurrent neural networks. In ICML, Cited by: Challenges.
  • [35] V. Pavlović, J. M. Rehg, and J. MacCormick (2001) Learning switching linear models of human motion. In NIPS, Cited by: Related Works.
  • [36] M. Schuster and K. K. Paliwal (1997) Bidirectional recurrent neural networks. IEEE TSP. Cited by: Action-Sequence Classifier.
  • [37] A. Shahroudy, J. Liu, T. Ng, and G. Wang (2016-06) NTU rgb+d: a large scale dataset for 3d human activity analysis. In CVPR, Cited by: Introduction, Datasets, Datasets.
  • [38] M. Sundermeyer, T. Alkhouli, J. Wuebker, and H. Ney (2014) Translation modeling with bidirectional recurrent neural networks. In EMNLP, Cited by: Action-Sequence Classifier.
  • [39] I. Sutskever, G. Hinton, and G. Taylor (2008) The recurrent temporal restricted boltzmann machine. In NIPS, Cited by: Related Works.
  • [40] G. W. Taylor, G. E. Hinton, and S. Roweis (2007) Modeling human motion using binary latent variables. In NIPS, Cited by: Related Works.
  • [41] G. W. Taylor and G. E. Hinton (2009) Factored conditional restricted boltzmann machines for modeling motion style. In ICML, Cited by: Related Works.
  • [42] R. Urtasun, D. J. Fleet, A. Geiger, J. Popović, T. J. Darrell, and N. D. Lawrence (2008) Topologically-constrained latent variable models. In ICML, Cited by: Related Works.
  • [43] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS, Cited by: Challenges.
  • [44] C. Vondrick, H. Pirsiavash, and A. Torralba (2016) Generating videos with scene dynamics. In NIPS, Cited by: Introduction.
  • [45] J. Walker, K. Marino, A. Gupta, and M. Hebert (2017) The pose knows: video forecasting by generating pose futures. In ICCV, Cited by: Quality of generated action sequences, Quality of generated action sequences.
  • [46] J. M. Wang, D. J. Fleet, and A. Hertzmann (2007) Multifactor gaussian process models for style-content separation. In ICML, Cited by: Related Works.
  • [47] J. M. Wang, D. J. Fleet, and A. Hertzmann (2008) Gaussian process dynamical models for human motion. IEEE TPAMI. Cited by: Related Works.
  • [48] T. Wang, M. Liu, J. Zhu, G. Liu, A. Tao, J. Kautz, and B. Catanzaro (2018) Video-to-video synthesis. In NeurIPS, Cited by: Introduction.
  • [49] Y. Wang, L. Gui, X. Liang, and J. M. F. Moura (2018) Adversarial geometry-aware human motion prediction. In ECCV, Cited by: Introduction, Introduction, Related Works, Latent frame transitions.
  • [50] Z. Wang, J. Chai, and S. Xia (2018) Combining recurrent neural networks and adversarial training for human motion modelling, synthesis and control. In https://arxiv.org/pdf/1806.08666.pdf, Cited by: Introduction, Related Works, Related Works, Challenges.
  • [51] N. Wichers, R. Villegas, D. Erhan, and H. Lee (2018) Hierarchical long-term video prediction without supervision. In ICML, Cited by: Table 4, Baselines, Latent frame transitions, Quality of generated action sequences, Human evaluation, Table 1, Table 3.
  • [52] T. Xue, J. Wu, K. L. Bouman, and W. T. Freeman (2016) Visual dynamics: probabilistic future frame synthesis via cross convolutional networks. In NIPS, Cited by: Related Works.
  • [53] S. Yan, Y. Xiong, and D. Lin (2018) Spatial temporal gcns for skeleton-based action recognition. In AAAI, Cited by: Introduction.
  • [54] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, Cited by: Bi-directional GAN based training.

Appendix A Algorithm

0:  Generator G; discriminator D; classifier C. Training data X. Number n_d of updating steps for the discriminator; weights λ1 and λ2 for the regularization loss; weight λc for the cycle-consistency loss.
  Pretrain the shared frame-wise decoder.
  for i = 1 to N do
     for j = 1 to n_d do
        Sample a minibatch of noise samples.
        Sample a minibatch of action sequences from the training data.
        Update the discriminator by stochastic gradient ascent on the adversarial objective.
     end for
     Sample a minibatch of noise samples.

     Update the generator and classifier parameters by stochastic gradient descent on the combined objective.

  end for
Algorithm 1 Stochastic action generation via learning smooth latent transitions.
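The alternating updates of Algorithm 1 can be sketched as follows. This is a minimal illustration, not the paper's implementation: the module interfaces, the `dim_z` attribute, and the use of a plain Wasserstein-style critic loss are assumptions, and the smoothness regularizers weighted by λ1 and λ2 are omitted for brevity.

```python
import torch

def train(G, D, C, data_loader, n_critic=5, epochs=10, lam_c=0.1):
    """Sketch of Algorithm 1: n_critic discriminator (critic) updates per
    generator/classifier update. G, D, C stand in for the paper's generator,
    discriminator and classifier; lam_c weights the cycle-consistency term,
    here approximated by a classification loss on generated sequences."""
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
    opt_gc = torch.optim.Adam(list(G.parameters()) + list(C.parameters()), lr=1e-4)
    for _ in range(epochs):
        for real, labels in data_loader:
            for _ in range(n_critic):
                z = torch.randn(real.size(0), G.dim_z)
                fake = G(z, labels).detach()  # do not backprop into G here
                # ascend the discriminator objective (descend its negative)
                loss_d = -(D(real).mean() - D(fake).mean())
                opt_d.zero_grad(); loss_d.backward(); opt_d.step()
            z = torch.randn(real.size(0), G.dim_z)
            fake = G(z, labels)
            # descend the generator + classifier objective
            loss_g = (-D(fake).mean()
                      + lam_c * torch.nn.functional.cross_entropy(C(fake), labels))
            opt_gc.zero_grad(); loss_g.backward(); opt_gc.step()
```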

Appendix B Maximum Mean Discrepancy (MMD)

where X̂c and Xc denote generated and real test sequences of action class c, respectively; x̂ct and xct denote the t-th frames of the generated and real test sequences of action class c, respectively; and MMD²(·, ·) is the unbiased MMD estimator [14]

MMD²(X, Y) = 1/(n(n−1)) Σ_{i≠j} k(x_i, x_j) + 1/(m(m−1)) Σ_{i≠j} k(y_i, y_j) − 2/(nm) Σ_{i,j} k(x_i, y_j),

with the Gaussian kernel k(x, y) = exp(−‖x − y‖² / (2σ²)).
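The unbiased estimator above can be computed directly; a NumPy sketch (the bandwidth σ is an assumption, as the paper does not state its value):

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel matrix between sample sets x (n, d) and y (m, d)."""
    d2 = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd2_unbiased(x, y, sigma=1.0):
    """Unbiased MMD^2 estimate between samples x (n, d) and y (m, d)."""
    n, m = len(x), len(y)
    kxx = gaussian_kernel(x, x, sigma)
    kyy = gaussian_kernel(y, y, sigma)
    kxy = gaussian_kernel(x, y, sigma)
    # exclude the diagonal self-similarity terms for unbiasedness
    term_x = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
    return term_x + term_y - 2.0 * kxy.mean()
```

Lower values indicate that the generated frames are distributed more like the real ones; samples from clearly different distributions give a larger score.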

Appendix C Implementation Details

We use the Adam optimizer [23] to learn all network parameters. The weight of the gradient penalty for WGAN-GP is set to 10. The learning rate for the generator and discriminator losses of the WGAN-GP is set to 0.001, and to 0.0001 for the generator and discriminator losses of our sequence-generation model. The weight for the cycle-consistency loss is set to 0.1, and the weights λ1 and λ2 for the regularization loss are set to 0.05 and 0.00005, respectively. The fully connected layers of the LSTM discriminator and classifier have 1024 hidden units; the LSTM generator has 256 hidden units.
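The WGAN-GP gradient penalty mentioned above (weight 10) follows the standard formulation of [15], penalizing the critic's gradient norm on interpolates between real and generated batches. A sketch of that term, not the paper's exact code:

```python
import torch

def gradient_penalty(discriminator, real, fake, weight=10.0):
    """WGAN-GP penalty: drive the critic's gradient norm toward 1 on random
    interpolates between real and fake samples."""
    batch = real.size(0)
    # one interpolation coefficient per sample, broadcast over remaining dims
    eps = torch.rand(batch, *([1] * (real.dim() - 1)), device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = discriminator(interp)
    grads = torch.autograd.grad(outputs=scores.sum(), inputs=interp,
                                create_graph=True)[0]
    grad_norm = grads.view(batch, -1).norm(2, dim=1)
    return weight * ((grad_norm - 1) ** 2).mean()
```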

Appendix D More Experiment Results

The novel actions mixed from the two classes "Throw" and "Kick", trained on the NTU RGB+D dataset, are shown in Figure 53. It is interesting to see that the mixed-class sequence indeed contains actions with both hands and legs, corresponding to "Throw" and "Kick", respectively. Latent trajectories generated with different latent dimensions are shown in Figure 51. Figures 16 and 20 show further examples of generated action sequences on the two datasets. Table 4 shows the standard deviation of the distance of the generated action sequences to the mean action for each action class.

Figures 26, 32 and 38 show more diverse action-generation results on the Human3.6M dataset.

For latent dimensions larger than 2, the latent trajectories from different classes are separated quite well. Figure 52 shows more visualizations of the latent space in higher dimensions. Apart from the effect of t-SNE, we suspect that in a higher-dimensional latent space our model is flexible enough to separate trajectories of different classes, since the input of the generator contains label information.

Appendix E Ablation Study

We run ablation experiments over the components of our model to better understand the effects of each component.

Effect of smooth regularizer:

We remove the regularizer on consecutive frames. Table 5 shows the ablation study for the smoothness term in the loss function. "no smoothness" means that we remove the regularization terms on both the latent and action spaces; "only action" means that we regularize only the action space; "only latent" means that we regularize only the latent space; "latent and action" means that we regularize both.

The results show that regularizing both the latent space and the real action space outperforms regularizing only one of the two, or using no regularization at all.
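The exact form of the regularizer is not reproduced here; a plausible sketch penalizes squared differences between consecutive frames in both spaces (the squared-difference form and the default weights are assumptions):

```python
import torch

def smoothness_loss(latent_seq, action_seq, w_latent=0.05, w_action=0.05):
    """Penalize large jumps between consecutive frames in the latent space
    (batch, T, d_z) and the action space (batch, T, d_x)."""
    dz = latent_seq[:, 1:] - latent_seq[:, :-1]   # consecutive latent diffs
    dx = action_seq[:, 1:] - action_seq[:, :-1]   # consecutive action diffs
    return (w_latent * dz.pow(2).sum(-1).mean()
            + w_action * dx.pow(2).sum(-1).mean())
```

Setting either weight to zero recovers the "only latent" and "only action" ablation variants.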

Effect of cycle consistency loss:

We add or remove the classifier in the model, i.e., add or remove the cycle consistency loss. Table 2 shows the results of the ablation study for the cycle consistency loss, which is shown in Equation (6).

Effect of transition residual:

We compare predicting the latent-transition residual against predicting the latent transition directly. Table 6 shows the results of the ablation study for the latent-transition residual. "direct latent" means that the LSTM predicts the next latent code directly; "residual latent" means that the LSTM predicts the residual of the latent transition, as in the section on learning smooth latent transitions.

Appendix F Human Evaluation

Figure 11 shows the experiment design for the human evaluation. On each page, we give five scoring standards with word descriptions as well as video examples, to ensure workers share the same grading standard. We then provide four sample videos from each model for the blind test.

Figure 11: Human evaluation screenshot for Amazon Mechanical Turk.
(a) Direction
(b) Greeting
(c) Sitting
(d) Eating
Figure 16: Several action sequences generated by training on the Human3.6M dataset
(a) Hand waving
(b) Throw
(c) Sit down
Figure 20: Several action sequences generated by training on the NTU RGB+D dataset
(a) Eating sequence 1
(b) Eating sequence 2
(c) Eating sequence 3
(d) Eating sequence 4
(e) Eating sequence 5
Figure 26: Diverse action sequences generated by training on the Human3.6M dataset
(a) Smoking sequence 1
(b) Smoking sequence 2
(c) Smoking sequence 3
(d) Smoking sequence 4
(e) Smoking sequence 5
Figure 32: Diverse action sequences generated by training on the Human3.6M dataset
(a) Direction sequence 1
(b) Direction sequence 2
(c) Direction sequence 3
(d) Direction sequence 4
(e) Direction sequence 5
Figure 38: Diverse action sequences generated by training on the Human3.6M dataset
(a) walking
(b) smoking
(c) greeting
(d) posing
(e) walking
(f) smoking
(g) greeting
(h) posing
(i) walking
(j) smoking
(k) greeting
(l) posing
Figure 51: Action diversity of generated latent space with different latent dimensions. The latent space dimension of first row, second row, third row is 2, 6 and 12 respectively.
Figure 52: Latent space with different dimensions. Left: dim = 6; Right: dim = 12.
Figure 53: Novel mixed action sequences generated on the NTU RGB+D dataset. First row: generated sequence of "Throw"; second row: generated sequence of "Kick"; third row: generated sequence of mixed "Throw" + "Kick".