Learning Bidirectional LSTM Networks for Synthesizing 3D Mesh Animation Sequences

10/04/2018 · by Yi-Ling Qiao, et al. · Institute of Computing Technology, Chinese Academy of Sciences; Cardiff University

In this paper, we present a novel method for learning to synthesize 3D mesh animation sequences with long short-term memory (LSTM) blocks and mesh-based convolutional neural networks (CNNs). Synthesizing realistic 3D mesh animation sequences is a challenging and important task in computer animation. To achieve this, researchers have long focused on shape analysis to develop new interpolation and extrapolation techniques. However, such techniques have limited learning capabilities and therefore can produce unrealistic animation. Deep architectures that operate directly on mesh sequences remain unexplored due to several major barriers: meshes with irregular triangulation, sequences containing rich temporal information, and flexible deformations. To address these, we utilize convolutional neural networks defined on triangular meshes along with a shape deformation representation to extract useful features, followed by LSTM cells that iteratively process the features. To allow completion of a missing mesh sequence from given endpoints, we propose a new weight-shared bidirectional structure. The bidirectional generation loss also helps mitigate error accumulation over iterations. Benefiting from all these technical advances, our approach outperforms existing methods in sequence prediction and completion both qualitatively and quantitatively. Moreover, our network can generate follow-up frames conditioned on initial shapes and improves in accuracy as more bootstrap models are provided, which other works in the geometry processing domain cannot achieve.

Introduction

Synthesizing high-quality 3D mesh sequences is of great significance in computer graphics and animation. In recent years, many techniques [Bogo et al.2014, Dou et al.2016, Stoll et al.2010] have been developed to capture 3D shape animations, which are represented by sequences of triangular meshes with detailed geometry. Analyzing such animation sequences to synthesize new realistic 3D mesh sequences is very useful in practice for the film and game industry. Although deep learning has achieved significant success in synthesizing a variety of media types, directly synthesizing mesh animation sequences with deep learning methods remains unexplored. In this paper, we propose a novel long short-term memory (LSTM) [Hochreiter and Schmidhuber1997] architecture to learn from mesh sequences and perform sequence generation, prediction and completion.

A major challenge in achieving this is to go beyond individual meshes and understand the temporal relationships among them. Previous work on mesh data performs clustering and shape analysis [Huang, Kalogerakis, and Marlin2015, Sidi et al.2011] on whole datasets. However, none of these methods pays attention to temporal information, which is crucial for animation sequences. Thanks to the development of deep learning methods such as the recurrent neural network (RNN) and its variants LSTM [Hochreiter and Schmidhuber1997] and the gated recurrent unit (GRU) [Cho et al.2014], sequences can now be manipulated much more easily. Based on RNNs, impressive results have been achieved in tasks involving video, audio and text, e.g., video prediction [Mathieu, Couprie, and LeCun2015, Oh et al.2015], music composition [Lyu et al.2015], text generation [Vinyals et al.2015] and completion [Melamud, Goldberger, and Dagan2016].

However, applying deep learning methods to triangle meshes is not a trivial task due to their irregular topology and high dimensionality. Fully connected networks are often used for text or audio; in contrast, 3D shapes exhibit spatial locality, which makes them well suited to convolutional neural networks (CNNs). Unlike 2D images, though, shapes do not have regular topology. Recent efforts have been made to lift 2D CNNs to 3D data [Kalogerakis et al.2017], including multi-view [Su et al.2015] and 3D voxel [Riegler, Ulusoy, and Geiger2017, Wu et al.2016] representations. Alternatively, meshes can be treated as graphs, and a recent review [Bronstein et al.2017] summarizes state-of-the-art deep learning methods in the spectral and spatial domains. In order to reduce the number of parameters and extract intrinsic features, we utilize a CNN [Duvenaud et al.2015] defined on a shape deformation representation [Gao et al.2017b] that can effectively represent flexible and large-scale deformations. In summary, to analyze 3D mesh animation sequences, we propose a novel bidirectional LSTM architecture combined with mesh convolutions. The main contributions of this paper are:

  1. We propose the first deep learning method for mesh animation sequences, which allows generating sequences conditioned on given shapes, completing missing mesh sequences based on keyframes with realism and diversity, and improving the generation of mesh sequences as more initial frames are provided. These capabilities significantly advance state-of-the-art techniques.

  2. We design a weight-shared bidirectional LSTM architecture that is able to boost performance and generate two sequences in opposite directions. Bidirectional generation also stabilizes the training process and helps complete a sequence in a more natural way.

In the following, we first review relevant work, then present our feature representation, network architecture, and loss functions. In the Experiments section, we show extensive experimental results to justify our design and compare our work with previous methods both qualitatively and quantitatively. Finally, we draw conclusions.

Related Work

Sequence Generation with RNNs. The recurrent neural network and its variants, such as LSTM [Hochreiter and Schmidhuber1997] and GRU [Cho et al.2014], have been widely used for sequential data, including text [Bowman et al.2015, Mikolov et al.2011], video [Mathieu, Couprie, and LeCun2015, Oh et al.2015] and audio [Chung et al.2015, Marchi et al.2014]. [Srivastava, Mansimov, and Salakhudinov2015] learn representations of video by LSTM in an unsupervised manner. PredNet [Lotter, Kreiman, and Cox2016] learns to predict future frames by comparing errors between prediction and observation. [Yu et al.2017] incorporate policy gradients with generative adversarial nets (GAN) [Goodfellow et al.2014] and LSTM to generate sequences. Attempts have also been made to predict video frames using CNNs [Vondrick, Pirsiavash, and Torralba2016]. To avoid predicting videos directly in the high-dimensional pixel space, some work uses high-level abstractions such as human poses [Walker et al.2017, Cai et al.2017] to assist with generation. In the human motion area, researchers utilize RNNs to predict or generate realistic motion sequences. [Fragkiadaki et al.2015] propose an encoder-recurrent-decoder (ERD) to learn spatial embeddings and temporal sequences of videos and motion capture. [Gregor et al.2015] generate image sequences with a sequential variational auto-encoder, where two RNN chains are used to encode and decode the sampled sequences, respectively. However, such approaches, which iteratively take the output as input to the next stage, can cause error accumulation and make the sequence freeze or diverge. To address this problem, [Li et al.2017] present Auto-Conditioned RNNs (acRNNs) whose inputs are previous output frames interleaved with ground truth. With ground truth frames at the beginning of a sequence, acRNN can also generate output sequences conditioned on given input sequences. [Martinez, Black, and Romero2017] build a sequence-to-sequence architecture which is able to predict multiple actions, but without spatial encoding modules. Using an encoder-decoder structure, [Bütepage et al.2017] extract feature representations of human motion for prediction and classification. [Cai et al.2017] use GAN and LSTM to generate actions or complete sequences by optimizing the input vector of the GAN.

3D Shape Generation. Generating 3D shapes is an important task in the graphics and vision communities. Its downstream applications include shape prediction, reconstruction and sequence completion. Nevertheless, such tasks are challenging due to the high dimensionality and irregular connectivity of mesh data. Previous work mostly generates 3D shapes via interpolation or extrapolation in parameterized representations. [Huber, Perl, and Rumpf2017] propose to interpolate shapes in a Riemannian shell space. Based on existing shapes, data-driven methods (e.g. [Gao et al.2017a]) can generate realistic samples. However, such traditional methods, which focus on shape representations and shape analysis, have limited learning capabilities. More recently, [Tan et al.2018a] propose to use variational autoencoders (VAEs) to map mesh models into a latent space and generate new models by decoding latent vectors. Locally deformed shapes can also be generated by a combination of deep learning and sparse regularization [Tan et al.2018b]. While these learning based methods can produce new shapes that are more diverse and realistic, the temporal information of mesh animation sequences is not fully explored.

Methodology

Figure 1: Architecture of our network. (a) shows that our network is composed of an LSTM module $\mathcal{L}$ and a mesh convolution module $Conv$. Taking the network at time step $t$ as an example, the input to $Conv$ is the deformation representation $X_t$. The interface between $Conv$ and $\mathcal{L}$ is a fully connected layer, which outputs a low-dimensional vector $z_t$ into $\mathcal{L}$. $Conv^{T}$, a stack of transpose convolution layers, mirrors $Conv$ and shares weights with it. The output of $Conv^{T}$ is the feature change $\Delta X_{t+1}$. $X_t + \Delta X_{t+1}$ gives the predicted feature $\hat{X}_{t+1}$ for time step $t+1$, which is fed back into $Conv$ iteratively. (b) is our bidirectional LSTM. Both chains have the same architecture as in (a), and the only difference is their opposite direction. The forward chain takes the first model as input and the backward chain takes the last. They share weights and their predictions are constrained to match each other.

Mesh Sequence Representation

Mesh animation sequences are typically represented as a set of meshes with the same vertex connectivity and different vertex positions. Such meshes can be obtained by consistent remeshing or mesh deformation, and have become very common thanks to improved scanning and modeling techniques. These animated mesh sequences usually contain large-scale and complex deformations. In this work, we represent shapes using a shape deformation representation [Gao et al.2017b], a state-of-the-art representation that works well for large-scale deformations and is suitable for deep learning methods. Assume the mesh sequence dataset contains $n$ shapes and each mesh is denoted as $S_m$ ($1 \le m \le n$). We denote $p_{m,i}$ as the $i$-th vertex of the $m$-th model. $T_{m,i}$ represents the deformation gradient defined in each 1-ring vertex neighborhood, which is computed as

$$T_{m,i} = \arg\min_{T} \sum_{j \in N_i} c_{ij} \left\| (p_{m,i} - p_{m,j}) - T (p_{1,i} - p_{1,j}) \right\|_2^2 \qquad (1)$$

where $N_i$ is the set of 1-ring neighbors of the $i$-th vertex, and $c_{ij}$ is the cotangent weight used to avoid discretization bias [Levi and Gotsman2015]. The deformation gradient matrix is decomposed into a rotation matrix and a scaling matrix: $T_{m,i} = R_{m,i} S_{m,i}$. The difficulty in representing large-scale deformations is that the same rotation matrix can be mapped to two rotation axes with opposite directions, and the associated rotation angle can include different numbers of cycles. To solve this rotation ambiguity, a global integer programming based method [Gao et al.2017b] is applied to obtain an as-consistent-as-possible assignment, which outputs a feature vector $f_m$. The mesh representation $X_m$ is eventually produced by linearly normalizing each dimension of $f_m$ into a fixed range [Tan et al.2018a].
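To make Eq. 1 concrete, the following NumPy sketch solves the per-vertex least-squares problem and uses a polar decomposition to split each deformation gradient into rotation and scaling. The function and argument names are hypothetical, and the integer-programming step that resolves rotation ambiguities in [Gao et al.2017b] is omitted.

```python
import numpy as np
from scipy.linalg import polar

def deformation_gradient(p_ref, p_def, neighbors, cot_w):
    """Per-vertex deformation gradients T_i (cf. Eq. 1), decomposed as T_i = R_i S_i.

    p_ref, p_def : (V, 3) vertex positions of the reference and deformed mesh
    neighbors    : list of 1-ring neighbor index lists, one per vertex
    cot_w        : list of matching cotangent weights, one per vertex
    """
    V = p_ref.shape[0]
    R, S = np.zeros((V, 3, 3)), np.zeros((V, 3, 3))
    for i in range(V):
        A = np.zeros((3, 3))  # weighted covariance of reference edges
        B = np.zeros((3, 3))  # weighted cross-covariance of deformed vs. reference edges
        for j, c in zip(neighbors[i], cot_w[i]):
            e_ref = p_ref[i] - p_ref[j]
            e_def = p_def[i] - p_def[j]
            A += c * np.outer(e_ref, e_ref)
            B += c * np.outer(e_def, e_ref)
        T = B @ np.linalg.pinv(A)      # least-squares minimizer of Eq. 1
        R[i], S[i] = polar(T)          # polar decomposition T = R S
    return R, S
```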

Generative Model

The overall architecture of our approach is illustrated in Fig. 1. In this illustration, we denote LSTM cells as $\mathcal{L}$, $Conv$ refers to the mesh convolutional operations [Duvenaud et al.2015, Tan et al.2018b] and $Conv^{T}$ represents transpose convolutions. For each convolutional filter, the output at a vertex is computed by a weighted sum of its 1-ring neighbors along with a bias:

$$y_i = W_{self}\, x_i + W_{nb}\, \frac{1}{d_i} \sum_{j=1}^{d_i} x_{n_{ij}} + b \qquad (2)$$

where $x_i$ and $y_i$ are the input and output at the $i$-th vertex, $W_{self}$, $W_{nb}$ and $b$ are the filter's weights and bias, $d_i$ is the degree of the $i$-th vertex, and $n_{ij}$ is the $j$-th 1-ring neighbor of the $i$-th vertex. The interface between the LSTM module and the mesh convolution layers is a fully connected layer. Given the LSTM state $s_t$ and model $X_t$, we first describe how to generate the next model $X_{t+1}$. First we put $X_t$ into the mesh convolutional sub-network $Conv$, which outputs a low-dimensional latent vector $z_t$. After that, $z_t$ is sent to the LSTM cell, whose output takes the form $(s_{t+1}, \hat{z}_{t+1}) = \mathcal{L}(s_t, z_t)$, where $s_{t+1}$ represents the updated state and $\hat{z}_{t+1}$ is the updated latent vector. $\hat{z}_{t+1}$ is then passed to the transpose mesh convolution $Conv^{T}$. Similar to many sequence generation algorithms, the output of $Conv^{T}$ is defined as the difference $\Delta X_{t+1}$ between the next and current models, instead of $X_{t+1}$ itself, to alleviate error accumulation. In the end, the generated model is simply worked out as $\hat{X}_{t+1} = X_t + \Delta X_{t+1}$. Consecutive models are generated iteratively in the same way. For simplicity, the whole process in one iteration is denoted as $(\hat{X}_{t+1}, s_{t+1}) = \mathrm{Gen}(X_t, s_t)$.
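As an illustration of Eq. 2, the sketch below applies one such filter over all vertices. The weight names (`W_self`, `W_nb`) and the tanh activation are assumptions chosen for readability, not necessarily the paper's exact choices.

```python
import numpy as np

def mesh_conv(x, neighbors, W_self, W_nb, b):
    """One mesh convolution in the spirit of Eq. 2 (hypothetical parameter names).

    x         : (V, C_in) per-vertex input features
    neighbors : list of 1-ring neighbor index lists, one per vertex
    W_self    : (C_in, C_out) weight applied to the vertex itself
    W_nb      : (C_in, C_out) weight applied to the averaged 1-ring neighbors
    b         : (C_out,) bias
    """
    V = x.shape[0]
    y = np.empty((V, W_self.shape[1]))
    for i in range(V):
        nb_mean = x[list(neighbors[i])].mean(axis=0)   # average over the d_i neighbors
        y[i] = x[i] @ W_self + nb_mean @ W_nb + b
    return np.tanh(y)   # nonlinearity (the paper's actual activation may differ)
```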
Fig. 1 illustrates the whole process of generating sequential data using our model. Suppose that we already have a set of models $X_1, \dots, X_n$. To extend the sequence, we would like to predict its future models $X_{n+1}, \dots, X_{n+k}$. Our method first puts the existing models into the network in their order, letting the LSTM cell update its state to $s_n$ from an initial state $s_0$. When it comes to the $n$-th model, the network outputs $\hat{X}_{n+1}$, which is afterwards treated as the input, and this process repeats $k$ times, leading to the follow-up sequence $\hat{X}_{n+1}, \dots, \hat{X}_{n+k}$.
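The iterative prediction loop can be summarized by the following Python sketch, where `encode`, `decode` and `lstm_step` stand in for the trained $Conv$, $Conv^{T}$ and LSTM modules (all names are hypothetical placeholders).

```python
def extend_sequence(models, k, encode, decode, lstm_step, s0):
    """Condition on the known frames in `models`, then predict k further frames.

    encode / decode / lstm_step are placeholders for the trained Conv, Conv^T
    and LSTM modules; s0 is the initial LSTM state.
    """
    s = s0
    # Warm-up: feed the known frames in order so the LSTM state absorbs them.
    for X in models:
        s, z_hat = lstm_step(s, encode(X))
    # Generation: each predicted frame is fed back as the next input.
    generated, X = [], models[-1]
    for _ in range(k):
        X = X + decode(z_hat)          # the network predicts the feature *change*
        generated.append(X)
        s, z_hat = lstm_step(s, encode(X))
    return generated
```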

Bidirectional Generation

Sequence generation is a promising yet challenging problem for various data forms such as video, music and text, not only because of the potentially tricky ways to exploit temporal information but also because of the difficulty of obtaining enough training data. When data is scarce for a specific application, which is often the case for 3D model datasets, training can be problematic. However, unlike text, video and audio, 3D model sequences are more flexible. On the one hand, the order of 3D shape sequences is less strict, i.e., the inverse of a motion can also be reasonable. On the other hand, there are usually multiple plausible paths between two shapes. Based on these two observations, we propose a bidirectional generation constraint, which avoids restricting results to specific deformation paths, as shown in Fig. 1. From a 3D model dataset, we arbitrarily choose two models $X_a$ and $X_b$ as the endpoints of two inverse $L$-length sequences, where the forward chain starts from $X_a$ and the backward chain starts from $X_b$. Letting the two chains have opposite initial states, we expect them to generate similar models, i.e., the $t$-th frame of the forward chain should match the $(L+1-t)$-th frame of the backward chain.
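The constraint can be sketched as follows, assuming a hypothetical `generator(X, s, n)` helper that rolls one LSTM chain for n steps from frame X with initial state s; negating the state for the backward chain is our reading of "opposite initial states".

```python
def bidirectional_loss(X_a, X_b, s0, L, generator):
    """Weight-shared bidirectional constraint (illustrative sketch).

    The *same* generator (same weights) serves both directions; only the start
    frame and the sign of the initial state differ.
    """
    fwd = generator(X_a, s0, L)     # forward chain: X_a towards X_b
    bwd = generator(X_b, -s0, L)    # backward chain with the opposite initial state
    # Frame t of the forward chain should match frame L-1-t of the backward chain.
    return sum(((f - b) ** 2).mean() for f, b in zip(fwd, reversed(bwd))) / L
```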

Loss Function

In this paper, the loss function is composed of three terms:

$$L = L_{recon} + L_{bid} + L_{reg} \qquad (3)$$

To illustrate this, let the ground truth models be $X_1, \dots, X_L$, which are expected to be the results of the forward sequence and the backward sequence. The reconstruction loss $L_{recon}$ forces both sequences to resemble samples from the dataset. Meanwhile, as described before, the bidirectional sequences share weights and should have similar outputs, which is ensured by the bidirectional loss $L_{bid}$. Furthermore, $L_{reg}$ contains a KL divergence term and an $\ell_2$ loss to regularize the network. The KL divergence between the distribution of the low-dimensional vector $z$ and a Gaussian distribution is computed so as to obtain a good latent mapping. Therefore, we have $L_{KL} = D_{KL}\left(q(z \mid X)\, \|\, p(z)\right)$, where $q(z \mid X)$ is the posterior distribution and $p(z)$ is the Gaussian prior distribution. In our experiments, these terms are combined with fixed weights.
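The sketch below assembles the three terms of Eq. 3 for a diagonal-Gaussian latent. The weighting factors are placeholders, since the paper's exact values are not reproduced here, and all names are illustrative.

```python
import numpy as np

def total_loss(fwd, bwd, gt, z_mean, z_logvar, params,
               w_bd=1.0, w_kl=1.0, w_l2=1.0):
    """Three-term training loss in the spirit of Eq. 3 (sketch with placeholder weights)."""
    mse = lambda a, b: np.mean([(x - y) ** 2 for x, y in zip(a, b)])
    l_recon = mse(fwd, gt) + mse(bwd, gt[::-1])            # both chains fit the data
    l_bd = mse(fwd, bwd[::-1])                             # bidirectional agreement
    # KL(q(z|X) || N(0, I)) for a diagonal-Gaussian posterior, plus l2 weight decay
    l_kl = -0.5 * np.mean(1.0 + z_logvar - z_mean ** 2 - np.exp(z_logvar))
    l_l2 = sum((w ** 2).sum() for w in params)
    return l_recon + w_bd * l_bd + w_kl * l_kl + w_l2 * l_l2
```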

Experiments

Figure 2: Qualitative results of our method on Dyna [Pons-Moll et al.2015], handstand [Vlasic et al.2008] and horse [Sumner and Popović2004]. We give one source model to the network and it generates the following four shapes. This is the first approach able to generate a whole sequence from only one mesh.
Figure 3: Shape feature change between subsequent frames for different methods. This line graph depicts the amount of feature change between consecutive frames. The proposed method (blue) shows stable and visible changes. Networks without the KL loss or BD-training suffer from frozen sequences, and the one without $\ell_2$ regularization shows significant jerk.

Framework Evaluation

We now evaluate the effectiveness of different components in our framework.

Bidirectional generation. We propose a weight-shared bidirectional LSTM (BD-LSTM) to better utilize temporal information and facilitate sequence completion. Fig. 6 demonstrates that our BD-LSTM can produce results with better diversity. On the other hand, the bidirectional structure and the $L_{bid}$ term impose stronger constraints during training and consequently help predict more accurate sequences. According to the numerical results in Tab. 2, our method is more effective than existing methods [Tan et al.2018a, Gao et al.2017b]. Moreover, our method benefits from multiple initial frames, as well as the bidirectional constraint. In Tab. 1, we show the results when we do not use our BD-LSTM or leave out the $L_{bid}$ term.

Loss terms. Error accumulation is a common problem in sequence generation tasks [Gregor et al.2015, Li et al.2017, Martinez, Black, and Romero2017]. Generated meshes usually freeze because results tend to stay at an average shape, or even diverge to random results. To address this problem, we use three methods: 1) KL divergence to regularize the internal distribution, 2) an $\ell_2$ regularization loss to mitigate overfitting, and 3) bidirectional generation to impose an additional constraint. To justify these terms, we train models with each of the three left out in turn. For unidirectional sequences, we only use one direction of the LSTM. The line graph in Fig. 3 shows the representation change between adjacent frames. The four networks are trained on Dyna [Pons-Moll et al.2015] for 7000 iterations, and tested on 32 randomly chosen sequences. From the test one can see that without the KL term or the BD-LSTM, the sequence tends to freeze, while the $\ell_2$ regularization helps to reduce jerk.

Initial frames. Generating a sequence based on initial frames is an important application. In theory, the more bootstrap frames we have, the more knowledge we obtain about the sequence, and therefore the more accurate the prediction should be. Previous mesh generation approaches, however, are based on interpolation/extrapolation, which can only use two of the existing models (endpoints). Our method can take advantage of all input frames by feeding them into the LSTM. A previous human motion prediction method also uses several initial frames to start the recurrent network [Li et al.2017]. We test 1 and 3 initial frames in Tab. 2 to show that more initial frames reduce the distance between prediction and ground truth.

Method | Ours | Std. BD-LSTM | Unidir. LSTM | w/o $L_{bid}$
Per-vertex position error | 88 | 123 | 114 | 103

Table 1: Average vertex position errors on the Punching dataset [Pons-Moll et al.2015] with different network architectures. Our weight-shared BD-LSTM outperforms the standard BD-LSTM and the unidirectional LSTM. We also observe a decrease in accuracy if the $L_{bid}$ term is omitted.
(a) Ground Truth (b) [Gao et al.2017b] (c) [Tan et al.2018a] (d) Ours
Figure 4: Comparison with other work on sequence generation. In this experiment, two consecutive frames are fed into the network, and we aim to predict future shapes. (a) shows the ground truth of the relevant shapes; (b) is obtained by linear extrapolation on the feature of [Gao et al.2017b]; (c) is extrapolation on a deep-learned feature [Tan et al.2018a]; (d) is our result. We can see that extrapolation-based generation fails when predicting frames further away: (b) totally fails on the furthest predicted frame, and (c) also produces abnormal deformation, as highlighted in the red circles. In contrast, our method forms a natural cycle and avoids exceeding the limits (following the horse's stride).

Figure 5: Conditional generation. Trained with a mixture of different Dyna datasets, our network can output sequences conditioned on the first two input shapes. For example, on the right of the figure, we feed two fit female shapes into the network. The second frame in (c) lifts her right leg while that in (d) lifts the left leg. Our model can perceive their differences and predict subsequent motion according to the condition. Similar results can be observed in the fat male example on the left.
Figure 6: Diversified sequence completion. We show the completion results produced by different methods. The source (first) and target (last) shapes are shared among all the sequences. (a) is the ground truth from the f_50004_jumpingJacks dataset [Pons-Moll et al.2015]; (b)(c)(d) use our BD-LSTM; (e) is the optimization+unidirectional baseline strategy described in the text; (f), (g) and (h) are interpolation results using [Tan et al.2018a], [Gao et al.2017a] and [Gao et al.2017b], respectively. We can see that (f)(g)(h) generate almost identical shapes because interpolation follows the shortest path between source and target. Compared to (e), our method can generate diverse, plausible results for users to choose from.
Method | Punching (5 / 10 / 15) | ShakeArm (5 / 10 / 15) | Handstand (5 / 10 / 15) | Horse (5 / 10 / 15)
Ours + 1 IF | 175 / 156 / 285 | 335 / 381 / 319 | 323 / 527 / 516 | 603 / 869 / 1328
Ours + 3 IF | 95 / 84 / 107 | 301 / 226 / 290 | 212 / 379 / 428 | 451 / 329 / 671
[Tan et al.2018a] | 132 / 240 / 457 | 291 / 433 / 688 | 93 / 489 / 797 | 286 / 713 / 1032
[Gao et al.2017b] | 294 / 361 / 413 | 391 / 472 / 110 | 487 / 401 / 1589 | 334 / 1051 / 1568

Table 2: Comparison of variants of our method and previous work on per-vertex position error (average distance between vertex positions of ground truth and prediction); the three sub-columns (5, 10, 15) index increasingly distant predicted frames. In this experiment, we observe that more initial frames (IF) improve the performance. Our method outperforms [Tan et al.2018a] and extrapolation on [Gao et al.2017b] features, since the latter suffer from error accumulation and their accuracy degrades as the sequence moves on.

Sequence Generation

We now evaluate the sequence generation capability of the proposed method: starting from some initial frames, sequence generation predicts future frames.

Generating sequences. As far as we are aware, this is the first work to learn and generate arbitrarily long mesh sequences. Given two initial frames, previous approaches generate meshes through extrapolation [Tan et al.2018b]. However, simply extrapolating shapes fails to capture long-term temporal information, e.g., the periodicity of the sequence. With the help of the LSTM, our model can record history information and iterate to generate realistic mesh sequences of any length, even if the number of models in the dataset is limited. In the experiment, we feed the first two mesh models to the LSTM and let it generate the following frames. Qualitative and quantitative results are shown in Fig. 2 and Tab. 2, respectively. We compare our model with the ground truth as well as previous extrapolation-based methods [Tan et al.2018a, Gao et al.2017b]. Fig. 4 plots the predictions at increasingly distant future frames. We can see that both extrapolation methods fail on the furthest frame, because linearly extending the motion path eventually exceeds the plausible deformation space. In contrast, our method is aware of the periodicity of the sequence and is able to turn back once it reaches the extreme point, producing natural motion cycles.

Conditional generation. Another promising application of our method is to generate sequences of various shapes conditioned on the provided initial frames. Previous approaches achieve conditional human motion generation on video [Srivastava, Mansimov, and Salakhudinov2015] and skeletons [Cai et al.2017], but not on 3D shape sequences. To illustrate the effectiveness of our method, we take Dyna [Pons-Moll et al.2015] as an example. In this collection of datasets, there are female and male models of different subjects and actions. All meshes in the different datasets have the same number of vertices and share connectivity, so we train our model on a mixture of those datasets. In testing, we feed bootstrap models with a certain body mass index (BMI), gender and motion as input, and get the following sequences as output. We show our results in Fig. 5. We observe that our method can generate human shapes of different subjects and genders. Furthermore, even if the first frame is the same, the network can produce different action sequences according to the second frame.

Sequence Completion

We now consider another important application, namely sequence completion, which produces in-between shapes given two endpoint frames.

Completion based on key frames. Completing a mesh sequence based on given anchors is an important application in animation. In our approach, we clip the target sequence at keyframes. For each segment, we run our bidirectional network by treating the two keyframes as endpoints. Once the forward and backward sequences converge at a model, we stitch them to form a whole sequence. Since the computation is identical for each segment, for illustration we show an example of completing one segment constrained by two key frames. Fig. 6 shows an example on the Dyna dataset [Pons-Moll et al.2015]. (f)(g)(h) are all interpolation-based: those methods generate shapes along the shortest path between the endpoints, which are almost still because of the high similarity between the first and last models.

Generating novel sequences. Previous interpolation-based methods usually adopt a deterministic strategy to complete sequences, resulting in monotonous sequences. Our work, however, is able to produce diversified sequence completion results. By assigning a random vector to the LSTM state, the network generates different sequences, as shown in Fig. 6. In the real world, there is often more than one possible motion between two static poses, and our model can therefore better describe such characteristics of human motion than other generation methods.
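A rough sketch of how diversified completions could be drawn is given below, reusing the hypothetical `generator(X, s, n)` helper from the bidirectional sketch; stitching at the midpoint is a simplification of the convergence test described above.

```python
import numpy as np

def complete_segment(X_src, X_tgt, L, generator, n_samples=5, state_dim=128):
    """Diversified completion between two keyframes (illustrative sketch)."""
    completions = []
    for _ in range(n_samples):
        s0 = np.random.randn(state_dim)     # a random state gives a different path
        fwd = generator(X_src, s0, L)       # forward half, from the source keyframe
        bwd = generator(X_tgt, -s0, L)      # backward half, from the target keyframe
        # Stitch at the midpoint, where the two chains are assumed to (roughly) meet.
        completions.append(fwd[:L // 2] + list(reversed(bwd))[L // 2:])
    return completions
```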
To test alternative completion strategies, we also implement an optimization+unidirection strategy [Cai et al.2017]. Given the source model $X_1$ and the target model $X_L$, we first find the optimal LSTM initial state

$$s_0^{*} = \arg\min_{s_0} \left\| \hat{X}_L(X_1, s_0) - X_L \right\|_2^2 \qquad (4)$$

where $\hat{X}_L(X_1, s_0)$ denotes the last frame of the unidirectional sequence generated from $X_1$ with initial state $s_0$. After solving this optimization problem with CMA-ES [Hansen and Ostermeier2001], we compute the completed sequence by rolling out the network from $X_1$ with $s_0^{*}$. The result is shown in Fig. 6 (e). Compared to interpolation strategies, the optimization+unidirection algorithm can achieve more realistic morphing, but it does not provide the diverse possible choices of our BD-approach.
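For this baseline, the initial-state search of Eq. 4 might be run with an off-the-shelf CMA-ES implementation, roughly as follows; the `generator` helper and dimensions are assumptions, not the authors' code.

```python
import numpy as np
import cma  # CMA-ES implementation following [Hansen and Ostermeier 2001]

def fit_initial_state(X_src, X_tgt, L, generator, state_dim=128):
    """Optimization+unidirection baseline (sketch): search for an initial LSTM
    state whose forward rollout from X_src ends near the target keyframe."""
    def objective(s0):
        rollout = generator(X_src, np.asarray(s0), L)
        return float(((rollout[-1] - X_tgt) ** 2).mean())   # distance to the target

    res = cma.fmin(objective, np.zeros(state_dim), 0.5)      # (objective, x0, sigma0)
    s_best = np.asarray(res[0])
    return generator(X_src, s_best, L)   # the completed (unidirectional) sequence
```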

Implementation Details

We use TensorFlow as the framework of our implementation. Experiments are performed on a PC with an Intel Core i7-2600 CPU and an NVIDIA Tesla K40c GPU. We use the Adam optimizer [Kingma and Ba2014] to update weights, with default Adam parameters as in [Kingma and Ba2014]. For each dataset, we randomly exclude a subsequence, which takes up 20% of the dataset, as a test set. A training process takes 7000 iterations, lasting around 8 hours. In each iteration, we generate 8 sequences, each containing 32 shapes. For datasets where motion is slow [Pons-Moll et al.2015], we sample every other model in the sequences. For all experiments, the LSTM has 3 layers and 128 hidden dimensions, and the initial states are set to fixed values. The mesh convolution module is composed of 3 layers with a nonlinear activation function, and the transpose convolution module mirrors it and shares the same weights.
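For reference, the hyperparameters stated above can be collected in one place; this is a convenience summary, not taken from released code, and values not given in the text (activation, initial state) are left out.

```python
# Training configuration as described in the text above.
config = dict(
    optimizer="Adam, default parameters [Kingma and Ba 2014]",
    lstm_layers=3,
    lstm_hidden_dim=128,
    mesh_conv_layers=3,          # transpose convolutions mirror these and share weights
    training_iterations=7000,    # roughly 8 hours on a Tesla K40c
    sequences_per_iteration=8,
    frames_per_sequence=32,
    held_out_fraction=0.2,       # randomly excluded subsequence used as the test set
)
```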

Conclusion

In this paper, we propose the first deep architecture to generate mesh animation sequences, which can not only predict future frames given initial frames, but also complete mesh sequences based on key frames and generate sequences conditioned on given shapes. Extensive qualitative and quantitative evaluation demonstrates that our method achieves state-of-the-art generation results, and our completion strategy is also able to produce diverse realistic results.

References

  • [Bogo et al.2014] Bogo, F.; Romero, J.; Loper, M.; and Black, M. J. 2014. FAUST: Dataset and evaluation for 3D mesh registration. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 3794–3801.
  • [Bowman et al.2015] Bowman, S. R.; Vilnis, L.; Vinyals, O.; Dai, A. M.; Jozefowicz, R.; and Bengio, S. 2015. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349.
  • [Bronstein et al.2017] Bronstein, M. M.; Bruna, J.; LeCun, Y.; Szlam, A.; and Vandergheynst, P. 2017. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine 34(4):18–42.
  • [Bütepage et al.2017] Bütepage, J.; Black, M. J.; Kragic, D.; and Kjellström, H. 2017. Deep representation learning for human motion prediction and classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [Cai et al.2017] Cai, H.; Bai, C.; Tai, Y.-W.; and Tang, C.-K. 2017. Deep video generation, prediction and completion of human action sequences. arXiv:1711.08682.
  • [Cho et al.2014] Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
  • [Chung et al.2015] Chung, J.; Kastner, K.; Dinh, L.; Goel, K.; Courville, A. C.; and Bengio, Y. 2015. A recurrent latent variable model for sequential data. In NIPS, 2980–2988.
  • [Dou et al.2016] Dou, M.; Khamis, S.; Degtyarev, Y.; Davidson, P.; Fanello, S. R.; Kowdle, A.; Escolano, S. O.; Rhemann, C.; Kim, D.; Taylor, J.; Kohli, P.; Tankovich, V.; and Izadi, S. 2016. Fusion4d: Real-time performance capture of challenging scenes. ACM Trans. Graph. 35(4):114:1–114:13.
  • [Duvenaud et al.2015] Duvenaud, D. K.; Maclaurin, D.; Iparraguirre, J.; Bombarell, R.; Hirzel, T.; Aspuru-Guzik, A.; and Adams, R. P. 2015. Convolutional networks on graphs for learning molecular fingerprints. In NIPS, 2224–2232.
  • [Fragkiadaki et al.2015] Fragkiadaki, K.; Levine, S.; Felsen, P.; and Malik, J. 2015. Recurrent network models for human dynamics. In IEEE International Conference on Computer Vision (ICCV), 4346–4354.
  • [Gao et al.2017a] Gao, L.; Chen, S.-Y.; Lai, Y.-K.; and Xia, S. 2017a. Data-driven shape interpolation and morphing editing. Computer Graphics Forum 36(8):19–31.
  • [Gao et al.2017b] Gao, L.; Lai, Y.-K.; Yang, J.; Zhang, L.-X.; Kobbelt, L.; and Xia, S. 2017b. Sparse data driven mesh deformation. arXiv preprint arXiv:1709.01250.
  • [Goodfellow et al.2014] Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In NIPS, 2672–2680.
  • [Gregor et al.2015] Gregor, K.; Danihelka, I.; Graves, A.; Rezende, D. J.; and Wierstra, D. 2015. Draw: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623.
  • [Hansen and Ostermeier2001] Hansen, N., and Ostermeier, A. 2001. Completely derandomized self-adaptation in evolution strategies. Evolutionary computation 9(2):159–195.
  • [Hochreiter and Schmidhuber1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
  • [Huang, Kalogerakis, and Marlin2015] Huang, H.; Kalogerakis, E.; and Marlin, B. 2015. Analysis and synthesis of 3d shape families via deep-learned generative models of surfaces. Computer Graphics Forum 34(5):25–38.
  • [Huber, Perl, and Rumpf2017] Huber, P.; Perl, R.; and Rumpf, M. 2017. Smooth interpolation of key frames in a riemannian shell space. Computer Aided Geometric Design 52:313–328.
  • [Kalogerakis et al.2017] Kalogerakis, E.; Averkiou, M.; Maji, S.; and Chaudhuri, S. 2017. 3D shape segmentation with projective convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [Kingma and Ba2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [Levi and Gotsman2015] Levi, Z., and Gotsman, C. 2015. Smooth rotation enhanced as-rigid-as-possible mesh animation. IEEE Trans. Vis. Comp. Graph. 21(2):264–277.
  • [Li et al.2017] Li, Z.; Zhou, Y.; Xiao, S.; He, C.; and Li, H. 2017. Auto-conditioned lstm network for extended complex human motion synthesis. arXiv preprint arXiv:1707.05363.
  • [Lotter, Kreiman, and Cox2016] Lotter, W.; Kreiman, G.; and Cox, D. 2016. Deep predictive coding networks for video prediction and unsupervised learning. arXiv:1605.08104.
  • [Lyu et al.2015] Lyu, Q.; Wu, Z.; Zhu, J.; and Meng, H. 2015. Modelling high-dimensional sequences with LSTM-RTRBM: Application to polyphonic music generation. In IJCAI, 4138–4139.
  • [Marchi et al.2014] Marchi, E.; Ferroni, G.; Eyben, F.; Gabrielli, L.; Squartini, S.; and Schuller, B. 2014. Multi-resolution linear prediction based features for audio onset detection with bidirectional lstm neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2164–2168.
  • [Martinez, Black, and Romero2017] Martinez, J.; Black, M. J.; and Romero, J. 2017. On human motion prediction using recurrent neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4674–4683.
  • [Mathieu, Couprie, and LeCun2015] Mathieu, M.; Couprie, C.; and LeCun, Y. 2015. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440.
  • [Melamud, Goldberger, and Dagan2016] Melamud, O.; Goldberger, J.; and Dagan, I. 2016. Context2vec: Learning generic context embedding with bidirectional LSTM. In SIGNLL Conference on Computational Natural Language Learning, 51–61.
  • [Mikolov et al.2011] Mikolov, T.; Kombrink, S.; Burget, L.; Černockỳ, J.; and Khudanpur, S. 2011. Extensions of recurrent neural network language model. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5528–5531.
  • [Oh et al.2015] Oh, J.; Guo, X.; Lee, H.; Lewis, R. L.; and Singh, S. 2015. Action-conditional video prediction using deep networks in atari games. In NIPS, 2863–2871.
  • [Pons-Moll et al.2015] Pons-Moll, G.; Romero, J.; Mahmood, N.; and Black, M. J. 2015. Dyna: A model of dynamic human shape in motion. ACM Transactions on Graphics (TOG) 34(4):120.
  • [Riegler, Ulusoy, and Geiger2017] Riegler, G.; Ulusoy, A. O.; and Geiger, A. 2017. Octnet: Learning deep 3D representations at high resolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 3.
  • [Sidi et al.2011] Sidi, O.; van Kaick, O.; Kleiman, Y.; Zhang, H.; and Cohen-Or, D. 2011. Unsupervised co-segmentation of a set of shapes via descriptor-space spectral clustering. ACM Transactions on Graphics (TOG) 30(6).
  • [Srivastava, Mansimov, and Salakhudinov2015] Srivastava, N.; Mansimov, E.; and Salakhudinov, R. 2015. Unsupervised learning of video representations using LSTMs. In International Conference on Machine Learning, 843–852.
  • [Stoll et al.2010] Stoll, C.; Gall, J.; de Aguiar, E.; Thrun, S.; and Theobalt, C. 2010. Video-based reconstruction of animatable human characters. ACM Trans. Graph. 29(6):139:1–139:10.
  • [Su et al.2015] Su, H.; Maji, S.; Kalogerakis, E.; and Learned-Miller, E. 2015. Multi-view convolutional neural networks for 3D shape recognition. In IEEE International Conference on Computer Vision, 945–953.
  • [Sumner and Popović2004] Sumner, R. W., and Popović, J. 2004. Deformation transfer for triangle meshes. ACM Transactions on Graphics (TOG) 23(3):399–405.
  • [Tan et al.2018a] Tan, Q.; Gao, L.; Lai, Y.-K.; and Xia, S. 2018a. Variational autoencoders for deforming 3d mesh models. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [Tan et al.2018b] Tan, Q.; Gao, L.; Lai, Y.-K.; Yang, J.; and Xia, S. 2018b. Mesh-based autoencoders for localized deformation component analysis. In AAAI.
  • [Vinyals et al.2015] Vinyals, O.; Toshev, A.; Bengio, S.; and Erhan, D. 2015. Show and tell: A neural image caption generator. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3156–3164.
  • [Vlasic et al.2008] Vlasic, D.; Baran, I.; Matusik, W.; and Popović, J. 2008. Articulated mesh animation from multi-view silhouettes. ACM Transactions on Graphics (TOG) 27(3):97.
  • [Vondrick, Pirsiavash, and Torralba2016] Vondrick, C.; Pirsiavash, H.; and Torralba, A. 2016. Generating videos with scene dynamics. In NIPS, 613–621.
  • [Walker et al.2017] Walker, J.; Marino, K.; Gupta, A.; and Hebert, M. 2017. The pose knows: Video forecasting by generating pose futures. In IEEE International Conference on Computer Vision (ICCV), 3352–3361.
  • [Wu et al.2016] Wu, J.; Zhang, C.; Xue, T.; Freeman, B.; and Tenenbaum, J. 2016. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In NIPS, 82–90.
  • [Yu et al.2017] Yu, L.; Zhang, W.; Wang, J.; and Yu, Y. 2017. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI, 2852–2858.