Introduction
Synthesizing high-quality 3D mesh sequences is of great significance in computer graphics and animation. In recent years, many techniques [Bogo et al.2014, Dou et al.2016, Stoll et al.2010] have been developed to capture 3D shape animations, which are represented by sequences of triangular meshes with detailed geometry. Analyzing such animation sequences to synthesize new, realistic 3D mesh sequences is very useful in practice for the film and game industry. Although deep learning has achieved significant success in synthesizing a variety of media types, directly synthesizing mesh animation sequences with deep learning methods remains unexplored. In this paper, we propose a novel long short-term memory (LSTM) [Hochreiter and Schmidhuber1997] architecture to learn from mesh sequences and perform sequence generation, prediction and completion. A major challenge is to go beyond individual meshes and understand the temporal relationships among them. Previous work on mesh data performs clustering and shape analysis [Huang, Kalogerakis, and Marlin2015, Sidi et al.2011] on whole datasets, but none of it pays attention to temporal information, which is crucial for animation sequences. Thanks to the development of deep learning methods such as the recurrent neural network (RNN) and its variants LSTM [Hochreiter and Schmidhuber1997] and gated recurrent unit (GRU) [Cho et al.2014], one can more easily manipulate sequences. Based on RNNs, impressive results have been achieved in tasks involving video, audio and text, e.g., video prediction [Mathieu, Couprie, and LeCun2015, Oh et al.2015], music composition [Lyu et al.2015], image captioning [Vinyals et al.2015] and text completion [Melamud, Goldberger, and Dagan2016]. However, applying deep learning methods to triangle meshes is not a trivial task due to their irregular topology and high dimensionality. Fully connected networks are often used for text or audio; 3D shapes, by contrast, have spatial locality, which makes them suitable for convolutional neural networks (CNNs). Unlike 2D images, however, shapes do not have regular topology. Recent efforts have been made to lift 2D CNNs to 3D data [Kalogerakis et al.2017], including multi-view [Su et al.2015] and 3D voxel [Riegler, Ulusoy, and Geiger2017, Wu et al.2016] representations. Alternatively, meshes can be treated as graphs, and a recent survey [Bronstein et al.2017] summarizes state-of-the-art deep learning methods in the spectral and spatial domains. To reduce the number of parameters and extract intrinsic features, we utilize a CNN [Duvenaud et al.2015] defined on a shape deformation representation [Gao et al.2017b] that can effectively represent flexible and large-scale deformations. In summary, to analyze 3D mesh animation sequences, we propose a novel bidirectional LSTM architecture combined with mesh convolutions. The main contributions of this paper are:
We propose the first method to cope with mesh animation sequences, which allows generating sequences conditioned on given shapes, completing missing mesh sequences based on keyframes with both realism and diversity, and improving the generation of mesh sequences as more initial frames are provided. These capabilities significantly advance state-of-the-art techniques.

We design a share-weight bidirectional LSTM architecture that boosts performance and generates two sequences in opposite directions. Bidirectional generation also stabilizes the training process and helps to complete a sequence in a more natural way.
In the following, we first review relevant work, and then present our feature representation, network architecture, and loss functions. In the Experiments section, we show extensive experimental results to justify our design and compare our work with previous work both qualitatively and quantitatively. Finally, we draw conclusions.
Related Work
Sequence Generation with RNNs. The recurrent neural network and its variants, such as LSTM [Hochreiter and Schmidhuber1997] and GRU [Cho et al.2014], have been widely used for sequential data, including text [Bowman et al.2015, Mikolov et al.2011], video [Mathieu, Couprie, and LeCun2015, Oh et al.2015] and audio [Chung et al.2015, Marchi et al.2014]. [Srivastava, Mansimov, and Salakhudinov2015] learn video representations with LSTMs in an unsupervised manner. PredNet [Lotter, Kreiman, and Cox2016] learns to predict future frames by comparing errors between prediction and observation. [Yu et al.2017] combine policy gradients with generative adversarial nets (GAN) [Goodfellow et al.2014] and LSTM to generate sequences. Attempts have also been made to predict video frames using CNNs [Vondrick, Pirsiavash, and Torralba2016]. To avoid predicting videos directly in the high-dimensional pixel space, some work uses high-level abstractions such as human poses [Walker et al.2017, Cai et al.2017] to assist with generation. In the human motion area, researchers utilize RNNs to predict or generate realistic motion sequences. [Fragkiadaki et al.2015] propose an encoder-recurrent-decoder (ERD) model to learn spatial embeddings and temporal sequences of videos and motion capture. [Gregor et al.2015] generate image sequences with a sequential variational autoencoder, where two RNN chains are used to encode and decode the sampled sequences respectively. However, such approaches, which iteratively take the output as the input to the next stage, can cause error accumulation and make the sequence freeze or diverge. To address this problem, [Li et al.2017] present auto-conditioned RNNs (acRNNs), whose inputs are previous output frames interleaved with ground truth. With ground truth frames at the beginning of a sequence, acRNNs can also generate output sequences conditioned on given input sequences. [Martinez, Black, and Romero2017] build a sequence-to-sequence architecture that is able to predict multiple actions, but it has no spatial encoding modules. Using an encoder-decoder structure, [Bütepage et al.2017] extract feature representations of human motion for prediction and classification. [Cai et al.2017] use GAN and LSTM to generate actions or complete sequences by optimizing the input vector of the GAN.
3D Shape Generation. Generating 3D shapes is an important task in the graphics and vision communities. Its downstream applications include shape prediction, reconstruction and sequence completion. Nevertheless, such tasks are challenging due to the high dimensionality and irregular connectivity of mesh data. Previous work mostly generates 3D shapes via interpolation or extrapolation in parameterized representations. [Huber, Perl, and Rumpf2017] propose to interpolate shapes in a Riemannian shell space. Based on existing shapes, data-driven methods (e.g., [Gao et al.2017a]) can generate realistic samples. However, such traditional methods, focusing on shape representations and shape analysis, have limited learning capabilities. More recently, [Tan et al.2018a] propose to use variational autoencoders (VAEs) to map mesh models into a latent space and generate new models by decoding latent vectors. Locally deformed shapes can also be generated by a combination of deep learning and sparse regularization [Tan et al.2018b]. While these learning-based methods can produce new shapes that are more diverse and realistic, the temporal information of mesh animation sequences has not been fully explored.
Methodology
Mesh Sequence Representation
Mesh animation sequences are typically represented as a set of meshes with the same vertex connectivity and different vertex positions. Such meshes can be obtained by consistent remeshing or mesh deformation, and have become very common thanks to improved scanning and modeling techniques. These animated mesh sequences usually contain large-scale and complex deformations. In this work, we represent shapes using a shape deformation representation [Gao et al.2017b], a state-of-the-art representation which copes well with large-scale deformations and is suitable for deep learning methods. Assume the mesh sequence dataset contains $N$ shapes, the $k$-th of which is denoted as $S_k$ ($k = 1, \dots, N$). We denote by $\mathbf{p}_{k,i}$ the $i$-th vertex of the $k$-th model. $\mathbf{T}_{k,i}$ represents the deformation gradient defined on each 1-ring vertex neighborhood, relative to a reference model $S_1$, which is computed as

$$\mathbf{T}_{k,i} = \mathop{\arg\min}_{\mathbf{T}} \sum_{j \in \mathcal{N}_i} c_{ij} \left\| (\mathbf{p}_{k,i} - \mathbf{p}_{k,j}) - \mathbf{T}(\mathbf{p}_{1,i} - \mathbf{p}_{1,j}) \right\|_2^2 \qquad (1)$$

where $\mathcal{N}_i$ is the 1-ring neighborhood of the $i$-th vertex, and $c_{ij}$ is the cotangent weight used to avoid discretization bias [Levi and Gotsman2015]. The deformation gradient matrix $\mathbf{T}_{k,i}$ is decomposed into a rotation matrix $\mathbf{R}_{k,i}$ and a scaling matrix $\mathbf{S}_{k,i}$: $\mathbf{T}_{k,i} = \mathbf{R}_{k,i}\mathbf{S}_{k,i}$. The difficulty in representing large-scale deformations is that the same rotation matrix corresponds to two rotation axes with opposite directions, and the associated rotation angle can differ by any number of full cycles. To resolve this ambiguity, a global integer programming based method [Gao et al.2017b] is applied to obtain as-consistent-as-possible assignments, which yields a feature vector $\mathbf{f}_k$ for each mesh. The mesh representation $\mathbf{x}_k$ is finally produced by linearly normalizing each dimension of $\mathbf{f}_k$ into $[-0.95, 0.95]$ [Tan et al.2018a].
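To make the representation concrete, the following is a minimal NumPy sketch of Eq. (1), computing per-vertex deformation gradients relative to a reference mesh. The data layout (`neighbors` as per-vertex index lists, `cot_w` as the matching cotangent weights) and the function name are assumptions of this sketch; the integer-programming rotation disambiguation of [Gao et al.2017b] is not reproduced here.

```python
import numpy as np

def deformation_gradients(ref_verts, verts, neighbors, cot_w):
    """Eq. (1): T_i = argmin_T sum_j c_ij ||(p_i - p_j) - T (p'_i - p'_j)||^2."""
    T = np.zeros((len(verts), 3, 3))
    for i, nbrs in enumerate(neighbors):
        e = verts[i] - verts[nbrs]            # (k, 3) edges on this mesh
        e0 = ref_verts[i] - ref_verts[nbrs]   # (k, 3) edges on the reference
        w = np.asarray(cot_w[i])              # (k,) cotangent weights
        A = (e0 * w[:, None]).T @ e0          # 3x3 normal equations
        B = (e0 * w[:, None]).T @ e
        T[i] = np.linalg.solve(A, B).T        # least-squares T with T e0_j ≈ e_j
    return T
```

The rotation and scaling factors $\mathbf{R}_{k,i}$ and $\mathbf{S}_{k,i}$ can then be extracted from each $\mathbf{T}_{k,i}$ by polar decomposition.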
Generative Model
The overall architecture of our approach is illustrated in Fig. 1. In this illustration, we denote the LSTM cell by $\mathcal{L}$, the mesh convolutional operations [Duvenaud et al.2015, Tan et al.2018b] by $\mathrm{Conv}$, and the transpose convolutions by $\mathrm{Conv}^{T}$. For each convolutional filter, the output at a vertex is computed as a weighted sum over its 1-ring neighbors along with a bias:

$$\mathbf{y}_i = \mathbf{W}_{\mathrm{point}}\,\mathbf{x}_i + \mathbf{W}_{\mathrm{neighbor}}\,\frac{1}{d_i}\sum_{j=1}^{d_i} \mathbf{x}_{n_{ij}} + \mathbf{b} \qquad (2)$$

where $\mathbf{x}_i$ and $\mathbf{y}_i$ are the input and output at the $i$-th vertex, $\mathbf{W}_{\mathrm{point}}$, $\mathbf{W}_{\mathrm{neighbor}}$ and $\mathbf{b}$ are the filter's weights and bias, $d_i$ is the degree of the $i$-th vertex, and $n_{ij}$ is the $j$-th 1-ring neighbor of the $i$-th vertex. The interface between the LSTM module and the mesh convolution layers is a fully connected layer.
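As a sanity check of Eq. (2), here is a minimal NumPy sketch of one mesh convolution layer; the parameter names follow the equation rather than any released code, and connectivity is again given as per-vertex neighbor index lists.

```python
import numpy as np

def mesh_conv(x, neighbors, W_point, W_neighbor, b):
    """Eq. (2): x is (V, C_in); W_point, W_neighbor are (C_in, C_out); b is (C_out,)."""
    y = np.empty((x.shape[0], W_point.shape[1]))
    for i, nbrs in enumerate(neighbors):
        nbr_avg = x[nbrs].mean(axis=0)  # (1/d_i) * sum over 1-ring neighbors
        y[i] = x[i] @ W_point + nbr_avg @ W_neighbor + b
    return y
```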
Given the LSTM state $s_t$ and model $\mathbf{x}_t$, we first describe how to generate the next model $\mathbf{x}_{t+1}$.
First we feed $\mathbf{x}_t$ into the mesh convolutional sub-network $\mathrm{Conv}$, which outputs a low-dimensional latent vector $\mathbf{z}_t$. After that, $\mathbf{z}_t$ is sent to the LSTM cell, whose output takes the form $(s_{t+1}, \hat{\mathbf{z}}_{t+1}) = \mathcal{L}(s_t, \mathbf{z}_t)$, where $s_{t+1}$ represents the updated state and $\hat{\mathbf{z}}_{t+1}$ is the updated latent vector. $\hat{\mathbf{z}}_{t+1}$ is then passed to the transpose mesh convolution $\mathrm{Conv}^{T}$. Similar to many sequence generation algorithms, the output of $\mathrm{Conv}^{T}$ is defined as the difference $\Delta\mathbf{x}_{t+1}$ between the next and current models, instead of $\mathbf{x}_{t+1}$ itself, to alleviate error accumulation. In the end, the generated model is simply obtained as $\mathbf{x}_{t+1} = \mathbf{x}_t + \Delta\mathbf{x}_{t+1}$. Consecutive models are generated iteratively in the same way. For simplicity, the whole process in one iteration is denoted as $(s_{t+1}, \mathbf{x}_{t+1}) = \mathcal{M}(s_t, \mathbf{x}_t)$.
Fig. 1 illustrates the whole process of generating sequential data using our model. Suppose that we already have a set of models $\{\mathbf{x}_1, \dots, \mathbf{x}_m\}$. To extend the sequence, we would like to predict its future models $\{\mathbf{x}_{m+1}, \dots, \mathbf{x}_{m+n}\}$. Our method first feeds the existing models into the network in order, letting the LSTM cell update its state from an initial state $s_0$ to $s_m$. When the $m$-th model is reached, the network outputs $\mathbf{x}_{m+1}$, which is afterwards treated as the next input, and this process repeats $n$ times, producing the follow-up sequence $\{\mathbf{x}_{m+1}, \dots, \mathbf{x}_{m+n}\}$.
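The following Python sketch summarizes this roll-out. Here `conv`, `conv_T` and `lstm` stand for the trained encoder, decoder and LSTM cell; their interfaces are illustrative assumptions, not the actual implementation.

```python
def step(state, x):
    z = conv(x)                     # mesh convolutions -> latent vector z_t
    state, z_next = lstm(state, z)  # LSTM cell updates its state
    dx = conv_T(z_next)             # decode the *difference* to the next model
    return state, x + dx            # x_{t+1} = x_t + dx_{t+1}

def generate(init_models, state, n):
    for x in init_models:           # warm up the state on the given frames
        state, pred = step(state, x)
    seq = []
    for _ in range(n):              # then feed predictions back as inputs
        seq.append(pred)
        state, pred = step(state, pred)
    return seq                      # [x_{m+1}, ..., x_{m+n}]
```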
Bidirectional Generation
Sequence generation is a promising yet challenging problem across data forms such as video, music and text, not only because of the potentially tricky ways to exploit temporal information, but also because of the difficulty of obtaining enough training data. When data is scarce for a specific application, which is often the case for 3D model datasets, training can be problematic. However, unlike text, video and audio, 3D model sequences are more flexible. On the one hand, the order of a 3D shape sequence is less strict, i.e., the reverse of a motion can also be plausible. On the other hand, there are usually multiple plausible paths between two shapes. Based on these two observations, we propose a bidirectional generation constraint, which avoids restricting results to specific deformation paths, as shown in Fig. 1. From a 3D model dataset, we arbitrarily choose two models $\mathbf{x}_a$ and $\mathbf{x}_b$ as endpoints of two inverse length-$n$ sequences $\{\mathbf{f}_t\}$ and $\{\mathbf{b}_t\}$, such that $\mathbf{f}_1 = \mathbf{b}_n = \mathbf{x}_a$ and $\mathbf{f}_n = \mathbf{b}_1 = \mathbf{x}_b$. Letting the two chains start from opposite initial states, we expect them to generate similar models, satisfying $\mathbf{f}_t \approx \mathbf{b}_{n+1-t}$ for all $t$.
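In code, the constraint amounts to rolling out two share-weight chains from the two endpoints and pairing their frames in opposite temporal order. This is a sketch built on the `generate` roll-out above; using $s_0$ and $-s_0$ as "opposite" initial states is an assumption of the illustration.

```python
fwd = generate([x_a], s0, n - 1)   # f_2..f_n, with f_1 = x_a
bwd = generate([x_b], -s0, n - 1)  # b_2..b_n, with b_1 = x_b
# (f_t, b_{n+1-t}) pairs that the bidirectional loss pulls together
pairs = list(zip([x_a] + fwd, reversed([x_b] + bwd)))
```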
Loss Function
In this paper, the loss function is composed of three terms:

$$L = L_{\mathrm{recon}} + \lambda_1 L_{\mathrm{bidir}} + \lambda_2 L_{\mathrm{reg}} \qquad (3)$$
To illustrate this, let the ground truth models be $\{\mathbf{g}_1, \dots, \mathbf{g}_n\}$, which are the expected results of the forward sequence $\{\mathbf{f}_t\}$ and the backward sequence $\{\mathbf{b}_t\}$. The reconstruction loss $L_{\mathrm{recon}}$ forces both sequences to resemble samples from the dataset. Meanwhile, as described above, the bidirectional sequences share weights and should have similar outputs, which is enforced by the bidirectional loss $L_{\mathrm{bidir}}$. Furthermore, $L_{\mathrm{reg}}$ contains a KL divergence term and an $\ell_2$ loss to regularize the network. The KL divergence between the distribution of the low-dimensional latent vector $\mathbf{z}$ and a Gaussian distribution is computed so as to obtain a good mapping. Therefore, we have $L_{\mathrm{KL}} = D_{\mathrm{KL}}\big(q(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})\big)$, where $q(\mathbf{z} \mid \mathbf{x})$ is the posterior distribution and $p(\mathbf{z})$ is the Gaussian prior. In our experiments, the weights $\lambda_1$ and $\lambda_2$ are fixed.
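A hedged TensorFlow sketch of Eq. (3) is given below. The exact weighting values and the precise form of each term are not spelled out in the text, so the mean-squared errors and the placeholders `lam1`/`lam2` are assumptions.

```python
import tensorflow as tf

def total_loss(fwd, bwd, gt, z_mean, z_logvar, weights, lam1=1.0, lam2=1.0):
    # L_recon: both chains should match the ground-truth frames
    l_recon = tf.reduce_mean(tf.square(fwd - gt)) + \
              tf.reduce_mean(tf.square(bwd - gt[::-1]))
    # L_bidir: forward and time-reversed backward chains should agree
    l_bidir = tf.reduce_mean(tf.square(fwd - bwd[::-1]))
    # L_reg: KL(q(z|x) || N(0, I)) plus an l2 penalty on the weights
    l_kl = -0.5 * tf.reduce_mean(
        1.0 + z_logvar - tf.square(z_mean) - tf.exp(z_logvar))
    l_l2 = tf.add_n([tf.nn.l2_loss(w) for w in weights])
    return l_recon + lam1 * l_bidir + lam2 * (l_kl + l_l2)
```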
Experiments
Framework Evaluation
We now evaluate the effectiveness of different components of our framework.

Bidirectional generation. We propose a share-weight bidirectional LSTM (BDLSTM) to better exploit temporal information and to facilitate sequence completion. Fig. 6 demonstrates that our BDLSTM produces results with better diversity. On the other hand, the bidirectional and regularization loss terms impose stronger constraints during training and consequently help predict more accurate sequences. According to the numerical results in Tab. 2, our method is more effective than existing methods [Tan et al.2018a, Gao et al.2017b]. Moreover, our method benefits from multiple initial frames, as well as from the bidirectional constraint. In Tab. 1, we show the results when we do not use our share-weight BDLSTM or leave out the KL term.

Loss terms. Error accumulation is a common problem in sequence generation tasks [Gregor et al.2015, Li et al.2017, Martinez, Black, and Romero2017]. Generated meshes often freeze, because results tend to converge to an average shape, or even diverge to random results. To address this problem, we use three means: 1) KL divergence to regularize the latent distribution, 2) an $\ell_2$ regularization loss to mitigate overfitting, and 3) bidirectional generation to impose an additional constraint. To justify these terms, we train models each with one of the three removed. For unidirectional sequences, we only use one direction of the LSTM. The line graphs in Fig. 3 show the representation change between adjacent frames $\mathbf{x}_t$ and $\mathbf{x}_{t+1}$. The four networks are trained on Dyna [PonsMoll et al.2015] for 7,000 iterations and tested on 32 randomly chosen sequences. From the test one can see that without the KL term or the BDLSTM, the sequence tends to freeze, while the $\ell_2$ regularization helps to reduce jerk.

Initial frames. Generating a sequence based on initial frames is an important application. In principle, the more bootstrap frames we have, the more we know about the sequence, and therefore the more accurate the prediction should be. Previous mesh generation approaches, however, are based on interpolation/extrapolation, which can only use two of the existing models (the endpoints). Our method can take advantage of all input frames by feeding them into the LSTM. A previous human motion prediction method similarly uses initial frames to start the recurrent network [Li et al.2017]. We test 1 and 3 initial frames in Tab. 2 to show that more initial frames reduce the distance between prediction and ground truth.
Ours  |  Std. BDLSTM  |  Unidir. LSTM  |  no KL
 88   |      123      |       114      |   103
Table 1: Per-vertex position error.
Figure 4: Predictions on the horse sequence. (a) Ground Truth; (b) [Gao et al.2017b]; (c) [Tan et al.2018a]; (d) Ours. (b) exceeds the plausible deformation range on later frames, and (c) also produces abnormal deformation, as highlighted in the red circles. In contrast, our method forms a natural cycle and avoids exceeding the limits (following the horse's stride).
Method             |    Punching    |    ShakeArm    |   Handstand    |      Horse
                   |  5    10   15  |  5    10   15  |  5    10   15  |  5     10    15
Ours + 1 IF        | 175  156  285  | 335  381  319  | 323  527  516  | 603   869  1328
Ours + 3 IF        |  95   84  107  | 301  226  290  | 212  379  428  | 451   329   671
[Tan et al.2018a]  | 132  240  457  | 291  433  688  |  93  489  797  | 286   713  1032
[Gao et al.2017b]  | 294  361  413  | 391  472  110  | 487  401 1589  | 334  1051  1568
Table 2: Per-vertex position error on the 5th, 10th and 15th predicted frames (IF = initial frames).
Sequence Generation
We now evaluate the sequence generation capability of the proposed method. Starting from some initial frames, sequence generation predicts future frames.

Generating sequences. As far as we are aware, this is the first work to learn and generate arbitrarily long mesh sequences. Given two initial frames, previous methods generate subsequent meshes through extrapolation [Tan et al.2018b]. However, simply extrapolating shapes fails to capture long-term temporal information, e.g., the periodicity of the sequence. With the help of the LSTM, our model records history information and iterates to generate realistic mesh sequences of any length, even if the number of models in the dataset is limited. In this experiment, we feed the first two mesh models to the LSTM and let it generate the following frames. Qualitative and quantitative results are shown in Fig. 2 and Tab. 2, respectively. We compare our model with the ground truth as well as previous extrapolation-based methods [Tan et al.2018a, Gao et al.2017b]. Fig. 4 plots the predictions on the 5th, 10th and 15th future frames. Both extrapolation methods fail on the 15th frame, because linearly extending the motion path eventually exceeds the plausible deformation space. In contrast, our method is aware of the periodicity of the sequence and able to turn back once reaching the extreme pose, producing natural motion cycles.

Conditional generation. Another promising application of our method is to generate sequences of various shapes conditioned on the provided initial frames. Previous approaches achieve conditional human motion generation on video [Srivastava, Mansimov, and Salakhudinov2015] and skeletons [Cai et al.2017], but not on 3D shape sequences. To illustrate the effectiveness of our method, we take Dyna [PonsMoll et al.2015] as an example. This collection of datasets contains female and male models of different subjects and actions. All meshes across the datasets have the same number of vertices and share connectivity, so we train our model on a mixture of those datasets. At test time, we feed bootstrap models with a certain body mass index (BMI), gender and motion as input, and obtain the following sequence as output. We show our results in Fig. 5. We observe that our method can generate human shapes of different subjects and genders. Furthermore, even when the first frame is the same, the network can produce different action sequences according to the second frame.
Sequence Completion
We now consider another important application, namely sequence completion, which produces in-between shapes given two endpoint frames.
Completion based on key frames.
Completing a mesh sequence based on given anchor frames is an important application in animation. In our approach, we clip the target sequence at the keyframes. For each segment, we run our bidirectional network, treating the two keyframes as endpoints. Once the forward and backward sequences converge at a model, we stitch them to form a whole sequence. Since the computation is identical for each segment, for illustration we show an example of completing one segment constrained by two keyframes. Fig. 6 shows an example on the Dyna dataset [PonsMoll et al.2015]. Panels (f), (b) and (h) are all interpolation-based; these methods generate shapes along the shortest path between the endpoints, which are almost still because of the high similarity between the first and last models.
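A sketch of this segment-wise completion, reusing the `generate` roll-out from the Generative Model section, is shown below. Choosing the stitch point as the frame where the two chains agree best is our reading of "converge at a model", not a quote of the original algorithm.

```python
import numpy as np

def complete(keyframes, s0, n):
    out = [keyframes[0]]
    for x_a, x_b in zip(keyframes, keyframes[1:]):
        fwd = generate([x_a], s0, n)         # forward chain from keyframe a
        rev = generate([x_b], -s0, n)[::-1]  # backward chain, time-aligned
        # stitch at the frame where forward and backward chains converge
        t = int(np.argmin([np.linalg.norm(f - r) for f, r in zip(fwd, rev)]))
        out += fwd[:t + 1] + rev[t + 1:] + [x_b]
    return out
```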
Generating novel sequences.
Previous interpolation-based methods usually adopt a deterministic strategy to complete sequences, which results in monotonous output. Our work, in contrast, can produce diversified completion results. By assigning a random vector to the LSTM state, the network generates different sequences, as shown in Fig. 6. In the real world there is often more than one possible motion between two static poses, and our model can therefore better capture this characteristic of human motion than other generation methods.
To test alternative completion strategies, we also implement an optimization+unidirection strategy [Cai et al.2017]. Given the source model $\mathbf{x}_a$ and the target model $\mathbf{x}_b$, we first find the optimal LSTM initial state

$$s^{\ast} = \mathop{\arg\min}_{s} \left\| \mathcal{G}_n(s, \mathbf{x}_a) - \mathbf{x}_b \right\|_2 \qquad (4)$$

where $\mathcal{G}_n(s, \mathbf{x}_a)$ denotes the $n$-th model generated by iterating $\mathcal{M}$ from initial state $s$ and start model $\mathbf{x}_a$. After solving this optimization problem with CMA-ES [Hansen and Ostermeier2001], we roll out the completed sequence from $s^{\ast}$ and $\mathbf{x}_a$. The result is shown in Fig. 6 (e). Compared to interpolation strategies, the optimization+unidirection algorithm achieves more realistic morphing, but it does not provide diverse choices as our bidirectional approach does.
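A sketch of this baseline with the `cma` package (an off-the-shelf CMA-ES implementation [Hansen and Ostermeier2001]) follows; `state_dim` and the `unflatten_state` helper are hypothetical, standing in for however the LSTM state is vectorized.

```python
import numpy as np
import cma

def objective(s_flat, x_a, x_b, n):
    seq = generate([x_a], unflatten_state(s_flat), n)  # roll out from state s
    return float(np.linalg.norm(seq[-1] - x_b))        # distance to the target

es = cma.CMAEvolutionStrategy(np.zeros(state_dim), 0.5)
while not es.stop():
    candidates = es.ask()
    es.tell(candidates, [objective(s, x_a, x_b, n) for s in candidates])
best_state = unflatten_state(es.result.xbest)          # s* of Eq. (4)
```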
Implementation Details
We use TensorFlow as the framework of our implementation. Experiments are performed on a PC with an Intel Core i7-2600 CPU and an NVIDIA Tesla K40c GPU. We use the Adam optimizer [Kingma and Ba2014] with its default parameters to update the weights. For each dataset, we randomly exclude a subsequence, taking up 20% of the dataset, as a test set. A training run takes 7,000 iterations, lasting around 8 hours. In each iteration, we generate 8 sequences, each containing 32 shapes. For datasets where the motion is slow [PonsMoll et al.2015], we sample every other model in the sequences. For all experiments, the LSTM has 3 layers and 128 hidden dimensions, with fixed initial states. The mesh convolution module is composed of 3 layers with $\tanh$ as the activation function. The transpose convolutions mirror the convolution layers and share the same weights.
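For concreteness, a minimal sketch of this configuration in modern TensorFlow/Keras (an assumption; the original implementation predates these APIs):

```python
import tensorflow as tf

# 3-layer LSTM with 128 hidden units, as stated above. Zero initial
# states are an assumption of this sketch; the text does not give the value.
cells = [tf.keras.layers.LSTMCell(128) for _ in range(3)]
lstm = tf.keras.layers.RNN(tf.keras.layers.StackedRNNCells(cells),
                           return_sequences=True, return_state=True)
# Adam with the default parameters of [Kingma and Ba2014]
optimizer = tf.keras.optimizers.Adam()
```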
Conclusion
In this paper, we propose the first deep architecture to generate mesh animation sequences, which can not only predict future frames given initial frames, but also complete mesh sequences based on keyframes and generate sequences conditioned on given shapes. Extensive qualitative and quantitative evaluation demonstrates that our method achieves state-of-the-art generation results, and our completion strategy is also able to produce diverse, realistic results.
References

[Bogo et al.2014] Bogo, F.; Romero, J.; Loper, M.; and Black, M. J. 2014. FAUST: Dataset and evaluation for 3D mesh registration. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3794–3801.
 [Bowman et al.2015] Bowman, S. R.; Vilnis, L.; Vinyals, O.; Dai, A. M.; Jozefowicz, R.; and Bengio, S. 2015. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349.
 [Bronstein et al.2017] Bronstein, M. M.; Bruna, J.; LeCun, Y.; Szlam, A.; and Vandergheynst, P. 2017. Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine 34(4):18–42.
 [Bütepage et al.2017] Bütepage, J.; Black, M. J.; Kragic, D.; and Kjellström, H. 2017. Deep representation learning for human motion prediction and classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
 [Cai et al.2017] Cai, H.; Bai, C.; Tai, Y.-W.; and Tang, C.-K. 2017. Deep video generation, prediction and completion of human action sequences. arXiv preprint arXiv:1711.08682.
 [Cho et al.2014] Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
 [Chung et al.2015] Chung, J.; Kastner, K.; Dinh, L.; Goel, K.; Courville, A. C.; and Bengio, Y. 2015. A recurrent latent variable model for sequential data. In NIPS, 2980–2988.
 [Dou et al.2016] Dou, M.; Khamis, S.; Degtyarev, Y.; Davidson, P.; Fanello, S. R.; Kowdle, A.; Escolano, S. O.; Rhemann, C.; Kim, D.; Taylor, J.; Kohli, P.; Tankovich, V.; and Izadi, S. 2016. Fusion4D: Real-time performance capture of challenging scenes. ACM Trans. Graph. 35(4):114:1–114:13.
 [Duvenaud et al.2015] Duvenaud, D. K.; Maclaurin, D.; Iparraguirre, J.; Bombarell, R.; Hirzel, T.; Aspuru-Guzik, A.; and Adams, R. P. 2015. Convolutional networks on graphs for learning molecular fingerprints. In NIPS, 2224–2232.
 [Fragkiadaki et al.2015] Fragkiadaki, K.; Levine, S.; Felsen, P.; and Malik, J. 2015. Recurrent network models for human dynamics. In IEEE International Conference on Computer Vision (ICCV), 4346–4354.
 [Gao et al.2017a] Gao, L.; Chen, S.-Y.; Lai, Y.-K.; and Xia, S. 2017a. Data-driven shape interpolation and morphing editing. Computer Graphics Forum 36(8):19–31.
 [Gao et al.2017b] Gao, L.; Lai, Y.-K.; Yang, J.; Zhang, L.-X.; Kobbelt, L.; and Xia, S. 2017b. Sparse data driven mesh deformation. arXiv preprint arXiv:1709.01250.
 [Goodfellow et al.2014] Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In NIPS, 2672–2680.
 [Gregor et al.2015] Gregor, K.; Danihelka, I.; Graves, A.; Rezende, D. J.; and Wierstra, D. 2015. DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623.
 [Hansen and Ostermeier2001] Hansen, N., and Ostermeier, A. 2001. Completely derandomized selfadaptation in evolution strategies. Evolutionary computation 9(2):159–195.
 [Hochreiter and Schmidhuber1997] Hochreiter, S., and Schmidhuber, J. 1997. Long shortterm memory. Neural computation 9(8):1735–1780.
 [Huang, Kalogerakis, and Marlin2015] Huang, H.; Kalogerakis, E.; and Marlin, B. 2015. Analysis and synthesis of 3D shape families via deep-learned generative models of surfaces. Computer Graphics Forum 34(5):25–38.
 [Huber, Perl, and Rumpf2017] Huber, P.; Perl, R.; and Rumpf, M. 2017. Smooth interpolation of key frames in a Riemannian shell space. Computer Aided Geometric Design 52:313–328.
 [Kalogerakis et al.2017] Kalogerakis, E.; Averkiou, M.; Maji, S.; and Chaudhuri, S. 2017. 3D shape segmentation with projective convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
 [Kingma and Ba2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
 [Levi and Gotsman2015] Levi, Z., and Gotsman, C. 2015. Smooth rotation enhanced as-rigid-as-possible mesh animation. IEEE Trans. Vis. Comp. Graph. 21(2):264–277.
 [Li et al.2017] Li, Z.; Zhou, Y.; Xiao, S.; He, C.; and Li, H. 2017. Auto-conditioned LSTM network for extended complex human motion synthesis. arXiv preprint arXiv:1707.05363.
 [Lotter, Kreiman, and Cox2016] Lotter, W.; Kreiman, G.; and Cox, D. 2016. Deep predictive coding networks for video prediction and unsupervised learning. arXiv preprint arXiv:1605.08104.
 [Lyu et al.2015] Lyu, Q.; Wu, Z.; Zhu, J.; and Meng, H. 2015. Modelling high-dimensional sequences with LSTM-RTRBM: Application to polyphonic music generation. In IJCAI, 4138–4139.
 [Marchi et al.2014] Marchi, E.; Ferroni, G.; Eyben, F.; Gabrielli, L.; Squartini, S.; and Schuller, B. 2014. Multi-resolution linear prediction based features for audio onset detection with bidirectional LSTM neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2164–2168.
 [Martinez, Black, and Romero2017] Martinez, J.; Black, M. J.; and Romero, J. 2017. On human motion prediction using recurrent neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4674–4683.
 [Mathieu, Couprie, and LeCun2015] Mathieu, M.; Couprie, C.; and LeCun, Y. 2015. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440.
 [Melamud, Goldberger, and Dagan2016] Melamud, O.; Goldberger, J.; and Dagan, I. 2016. Context2vec: Learning generic context embedding with bidirectional LSTM. In SIGNLL Conference on Computational Natural Language Learning, 51–61.
 [Mikolov et al.2011] Mikolov, T.; Kombrink, S.; Burget, L.; Černocký, J.; and Khudanpur, S. 2011. Extensions of recurrent neural network language model. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5528–5531.
 [Oh et al.2015] Oh, J.; Guo, X.; Lee, H.; Lewis, R. L.; and Singh, S. 2015. Action-conditional video prediction using deep networks in Atari games. In NIPS, 2863–2871.
 [PonsMoll et al.2015] Pons-Moll, G.; Romero, J.; Mahmood, N.; and Black, M. J. 2015. Dyna: A model of dynamic human shape in motion. ACM Transactions on Graphics (TOG) 34(4):120.
 [Riegler, Ulusoy, and Geiger2017] Riegler, G.; Ulusoy, A. O.; and Geiger, A. 2017. OctNet: Learning deep 3D representations at high resolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

 [Sidi et al.2011] Sidi, O.; van Kaick, O.; Kleiman, Y.; Zhang, H.; and Cohen-Or, D. 2011. Unsupervised co-segmentation of a set of shapes via descriptor-space spectral clustering. ACM Transactions on Graphics (TOG) 30(6).
 [Srivastava, Mansimov, and Salakhudinov2015] Srivastava, N.; Mansimov, E.; and Salakhudinov, R. 2015. Unsupervised learning of video representations using LSTMs. In International Conference on Machine Learning, 843–852.
 [Stoll et al.2010] Stoll, C.; Gall, J.; de Aguiar, E.; Thrun, S.; and Theobalt, C. 2010. Video-based reconstruction of animatable human characters. ACM Trans. Graph. 29(6):139:1–139:10.
 [Su et al.2015] Su, H.; Maji, S.; Kalogerakis, E.; and LearnedMiller, E. 2015. Multiview convolutional neural networks for 3D shape recognition. In IEEE International Conference on Computer Vision, 945–953.
 [Sumner and Popović2004] Sumner, R. W., and Popović, J. 2004. Deformation transfer for triangle meshes. ACM Transactions on Graphics (TOG) 23(3):399–405.
 [Tan et al.2018a] Tan, Q.; Gao, L.; Lai, Y.-K.; and Xia, S. 2018a. Variational autoencoders for deforming 3D mesh models. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
 [Tan et al.2018b] Tan, Q.; Gao, L.; Lai, Y.-K.; Yang, J.; and Xia, S. 2018b. Mesh-based autoencoders for localized deformation component analysis. In AAAI.
 [Vinyals et al.2015] Vinyals, O.; Toshev, A.; Bengio, S.; and Erhan, D. 2015. Show and tell: A neural image caption generator. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3156–3164.
 [Vlasic et al.2008] Vlasic, D.; Baran, I.; Matusik, W.; and Popović, J. 2008. Articulated mesh animation from multiview silhouettes. ACM Transactions on Graphics (TOG) 27(3):97.
 [Vondrick, Pirsiavash, and Torralba2016] Vondrick, C.; Pirsiavash, H.; and Torralba, A. 2016. Generating videos with scene dynamics. In NIPS, 613–621.
 [Walker et al.2017] Walker, J.; Marino, K.; Gupta, A.; and Hebert, M. 2017. The pose knows: Video forecasting by generating pose futures. In IEEE International Conference on Computer Vision (ICCV), 3352–3361.
 [Wu et al.2016] Wu, J.; Zhang, C.; Xue, T.; Freeman, B.; and Tenenbaum, J. 2016. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In NIPS, 82–90.
 [Yu et al.2017] Yu, L.; Zhang, W.; Wang, J.; and Yu, Y. 2017. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI, 2852–2858.