Music-oriented Dance Video Synthesis with Pose Perceptual Loss

12/13/2019 ∙ by Xuanchi Ren, et al. ∙ 45

We present a learning-based approach with pose perceptual loss for automatic music video generation. Our method can produce a realistic dance video that conforms to the beats and rhythm of almost any given music. To achieve this, we first generate a human skeleton sequence from music and then apply a learned pose-to-appearance mapping to generate the final video. When generating skeleton sequences, we utilize two discriminators to capture different aspects of the sequence and propose a novel pose perceptual loss to produce natural dances. We also provide a new cross-modal evaluation of dance quality, which estimates the similarity between the two modalities of music and dance. Finally, a user study demonstrates that dance videos synthesized by the presented approach are surprisingly realistic. The results are shown in the supplementary video at




Code Repositories


(ACM MM 20 Oral) PyTorch implementation of Self-supervised Dance Video Synthesis Conditioned on Music



Fork of the paper Music-oriented Dance Video Synthesis with Pose Perceptual Loss


1 Introduction

Music videos have become unprecedentedly popular all over the world. Nearly all of the top 10 most-viewed YouTube videos are music videos with dancing. While these music videos are made by professional artists, we wonder if an intelligent system can automatically generate personalized and creative music videos. In this work, we study automatic dance music video generation, given almost any music. We aim to synthesize a coherent and photo-realistic dance video that conforms to the given music. With such music video generation technology, a user can share a personalized music video on social media. In Figure 1, we show some images of our synthesized dance video given the music “I Wish” by Cosmic Girls.

The dance video synthesis task is challenging for various technical reasons. Firstly, the mapping between dance motion and background music is ambiguous: different artists may compose distinctive dance motion given the same music. This suggests that a simple machine learning model with $L_1$ or $L_2$ distance [19, 35] can hardly capture the relationship between dance and music. Secondly, it is technically difficult to model the space of human body dance. The model should avoid generating unnatural dancing movements, since even slight deviations from normal human poses can appear unnatural. Thirdly, no high-quality dataset is available for our task. Previous motion datasets [18, 33] mostly focus on action recognition. Tang et al. [35] provide a 3D joint dataset for our task. However, when we tried to use it, we found that the dance motion and the music are not aligned.

Nowadays, a large number of music videos with dancing are available online and can be used for the music video generation task. To build a dataset for our task, we apply OpenPose [4, 5, 42] to get dance skeleton sequences from online videos. However, the skeleton sequences acquired by OpenPose are very noisy: some estimated human poses are inaccurate. Correcting such a dataset by removing inaccurate poses is time-consuming and thus not suitable for extensive applications. Furthermore, prior work [19, 35, 43] trains the network with only $L_1$ or $L_2$ distance, which has been shown in [24] to disregard some specific motion characteristics. To tackle these challenges, we propose a novel pose perceptual loss so that our model can be trained on noisy data (imperfect human poses) obtained by OpenPose.

Dance synthesis has been well studied in the literature by searching dance motion in a database using music as a query [1, 16, 31]. These approaches cannot generalize well to music beyond the training data and lack creativity, which is the most indispensable factor of dance. To overcome such obstacles, we choose the generative adversarial network (GAN) [12] to deal with the cross-modal mapping. However, Cai et al. [3] showed that human pose constraints are too complicated to be captured by an end-to-end model trained with a direct GAN method. Thus, we propose to use two discriminators that focus on local coherence and global harmony, respectively.

In summary, the contributions of our work are:

  • With the proposed pose perceptual loss, our model can be trained on a noisy dataset (without human labels) to synthesize realistic dance video that conforms to almost any given music.

  • With the Local Temporal Discriminator and the Global Content Discriminator, our framework can generate a coherent dance skeleton sequence that matches the length, rhythm, and the emotion of music.

  • For our task, we build a dataset containing paired music and skeleton sequences, which will be made public for research. To evaluate our model, we also propose a novel cross-modal evaluation that measures the similarity between music and a dance skeleton sequence.

2 Related Work

Figure 2: Our framework for human skeleton sequence synthesis. The input is music signals, which are divided into pieces of 0.1-second music. The generator contains an audio encoder, a bidirectional GRU, and a pose generator. The output skeleton sequence of the generator is fed into the Global Content Discriminator with the music. The generated skeleton sequence is then divided into overlapping sub-sequences, which are fed into the Local Temporal Discriminator.

GAN-based Video Synthesis. A generative adversarial network (GAN) [12] is a popular approach for image generation. The images generated by a GAN are usually sharper and more detailed than those generated with $L_1$ and $L_2$ distance. Recently, GANs have also been extended to video generation tasks [21, 25, 36, 37]. The simplest changes made to GANs for videos are proposed in [29, 37]. The GAN model in [37] replaced the standard 2D convolutional layer with a 3D convolutional layer to capture temporal features, although this approach is limited to a fixed time length. TGAN [29] overcame this limitation, but at the cost of constraints imposed on the latent space. MoCoGAN [36] combines the advantages of RNN-based GAN models and sliding window techniques so that motion and content are disentangled in the latent space.

Another advantage of GAN models is that they are widely applicable to many tasks, including the cross-modal audio-to-video problem. Chen et al. [34] proposed a GAN-based encoder-decoder architecture using CNNs to convert between audio spectrograms and frames. Furthermore, Vougioukas et al. [38] adapted a temporal GAN to automatically synthesize a talking character conditioned on speech signals.

Dance Motion Synthesis. A line of work focuses on the mapping between acoustic and motion features. On the basis of labeling music with joint positions and angles, Shiratori et al. [31, 16] incorporated gravity and beats as additional features for predicting dance motion. Recently, Alemi et al. [1] proposed to combine the acoustic feature with the motion features of previous frames. However, these approaches are entirely dependent on the prepared database and may only create rigid motion when it comes to music with similar acoustic features.

Recently, Yaota et al. [43] accomplished dance synthesis using standard deep learning models. The most recent work is by Tang et al. [35], who proposed a model based on an LSTM-autoencoder architecture to generate dance pose sequences. Their approach is trained with an $L_2$ distance loss, and their evaluation only includes comparisons with randomly sampled dances that are not on a par with those by real artists. Their approach may not work well on the noisy data obtained by OpenPose.

3 Overview

To generate a dance video from music, we split our system into two stages. In the first stage, we propose an end-to-end model that directly generates a dance skeleton sequence according to the audio input. In the second stage, we apply an improved pix2pixHD GAN [39, 7] to transfer the dance skeleton sequence to a dance video. In this overview, we will mainly describe the first stage, as shown in Figure 2.

Let $V$ be the number of joints of the human skeleton, where each joint is a 2D coordinate $(x, y)$ of dimension 2. We formulate a dance skeleton sequence $X$ as a sequence of human skeletons across $T$ consecutive frames in total: $X \in \mathbb{R}^{T \times 2V}$, where each skeleton frame $X_t \in \mathbb{R}^{2V}$ is a vector containing all $(x, y)$ joint locations. Our goal is to learn a function $G: \mathbb{R}^{TS} \rightarrow \mathbb{R}^{T \times 2V}$ that maps audio signals with sample rate $S$ per frame to a joint location vector sequence.

Generator. The generator is composed of a music encoding part and a pose generator. The input audio signals are divided into pieces of 0.1-second music. These pieces are encoded using 1D convolution and then fed into a bi-directional 2-layer GRU in chronological order, resulting in one output hidden state per piece. These hidden states are fed into the pose generator, a multi-layer perceptron, to produce a skeleton sequence.
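The first step of the generator, cutting the audio into 0.1-second pieces, can be sketched as follows. This is only an illustration under stated assumptions: the helper name `split_into_pieces`, the 16 kHz sample rate, and the choice to drop trailing samples are hypothetical and not taken from the paper.

```python
from typing import List, Sequence


def split_into_pieces(samples: Sequence[float], sample_rate: int,
                      piece_seconds: float = 0.1) -> List[List[float]]:
    """Cut a mono waveform into consecutive, non-overlapping pieces.

    Trailing samples that do not fill a whole piece are dropped
    (an assumption made for this sketch).
    """
    piece_len = int(round(sample_rate * piece_seconds))
    if piece_len <= 0:
        raise ValueError("piece length must be at least one sample")
    n_pieces = len(samples) // piece_len
    return [list(samples[i * piece_len:(i + 1) * piece_len])
            for i in range(n_pieces)]


# A 5-second clip at a hypothetical 16 kHz sample rate.
waveform = [0.0] * (5 * 16000)
pieces = split_into_pieces(waveform, sample_rate=16000)
```

With 0.1-second pieces, each 5-second clip corresponds to 50 pieces, that is, one piece per skeleton frame at 10 FPS.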


Local Temporal Discriminator. The output skeleton sequence is divided into overlapping sub-sequences. These sub-sequences are then fed into the Local Temporal Discriminator, which is a two-branch convolutional network. In the end, a small classifier outputs one score per sub-sequence, which determines the realism of that skeleton sub-sequence.
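Dividing a skeleton sequence into overlapping sub-sequences can be sketched with a sliding window. The window length and stride below are hypothetical values chosen for illustration; the paper's actual settings are not assumed here.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def overlapping_windows(frames: Sequence[Frame], window: int,
                        stride: int) -> List[List[Frame]]:
    """Return every full window of `window` frames, advancing by `stride`.

    Windows overlap whenever stride < window.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    last_start = len(frames) - window
    return [list(frames[s:s + window])
            for s in range(0, last_start + 1, stride)]


# 50 frames (a 5-second clip at 10 FPS); integers stand in for skeletons.
sequence = list(range(50))
sub_sequences = overlapping_windows(sequence, window=10, stride=5)
```

Each sub-sequence would then receive its own realism score, so the discriminator judges local coherence rather than the sequence as a whole.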

Figure 3: The overview of the pose perceptual loss based on ST-GCN. The loss compares the ground-truth skeleton sequence with the skeleton sequence generated by our first-stage generator.

Global Content Discriminator. The input to the Global Content Discriminator includes the music and the dance skeleton sequence. For the pose part, the skeleton sequence is encoded by the pose discriminator into a pose feature. For the music part, similar to the sub-network of the generator, the music is encoded using 1D convolution and then fed into a bi-directional 2-layer GRU. The resulting hidden states are passed to the self-attention component of [22] to obtain a comprehensive music feature. In the end, we concatenate the pose feature and the music feature along channels and use a small classifier, composed of a 1D convolutional layer and a fully-connected (FC) layer, to determine if the skeleton sequence matches the music.

Pose Perceptual Loss. Recently, the Graph Convolutional Network (GCN) has been extended to model skeletons, since the human skeleton is graph-structured data. The features extracted by a GCN therefore retain high-level spatial structural information about the relations between different body parts. Matching activations in a pre-trained GCN network gives a better constraint on both the detail and the layout of a pose than traditional measures such as $L_1$ distance and $L_2$ distance. Figure 3 shows the pipeline of the pose perceptual loss. With the pose perceptual loss, our output skeleton sequences do not need an additional smoothing step or any other post-processing.

4 Pose Perceptual Loss

Perceptual loss or feature matching loss [2, 8, 11, 15, 27, 39, 40] is a popular loss for measuring the similarity between two images in image processing and synthesis. For tasks that generate human skeleton sequences [3, 19, 35], only $L_1$ or $L_2$ distance is used for measuring pose similarity. With an $L_1$ or $L_2$ loss, we find that our model tends to generate poses conservatively (repeatedly) and fails to capture the semantic relationship across motion appropriately. Moreover, the datasets generated by OpenPose [4, 5, 42] are very noisy, as shown in Figure 5. Correcting inaccurate human poses on a large number of videos is labor-intensive and undesirable: a two-minute video at 10 FPS has 1200 poses to verify. To tackle these difficulties, we propose a novel pose perceptual loss.

The idea of perceptual loss was originally studied in the image domain, where it is used to match activations in a visual perception network such as VGG-19 [32, 8]. To use the traditional perceptual loss, we would need to draw the generated skeletons on images, which is complicated and seemingly suboptimal. Instead of projecting pose joint coordinates to an image, we propose to directly match activations in a pose recognition network that takes human skeleton sequences as input. Such a network is mainly aimed at pose recognition or prediction tasks, and ST-GCN [44] is a Graph Convolutional Network (GCN) that can serve as the perception network in our case. ST-GCN utilizes a spatial-temporal graph to form a hierarchical representation of skeleton sequences and is capable of automatically learning both spatial and temporal patterns from data. To test the impact of the pose perceptual loss on our noisy dataset, we prepare a 20-video dataset that contains much noise due to wrong pose detections by OpenPose. As shown in Figure 4, our generator can stably generate poses with the pose perceptual loss.

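Given a pre-trained GCN network $\Phi$, we define a collection of layers of $\Phi$ as $\{\Phi_l\}$. For a training pair $(P, M)$, where $P$ is the ground truth skeleton sequence and $M$ is the corresponding piece of music, our perceptual loss is

$$\mathcal{L}_P = \sum_{l} \lambda_l \left\lVert \Phi_l(P) - \Phi_l(G(M)) \right\rVert_1 .$$

Here $G$ is the first-stage generator in our framework. The hyperparameters $\{\lambda_l\}$ balance the contribution of each layer $l$ to the loss.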
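The loss described above, a weighted sum over layers of the distance between the activations of the ground-truth and the generated sequence, can be sketched in a few lines. The sketch makes several assumptions: activations are given as flat lists (in practice they would come from the pre-trained ST-GCN), the per-layer distance is a summed absolute difference, and the function names and layer weights are hypothetical.

```python
from typing import Sequence


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute differences between two activation vectors."""
    if len(a) != len(b):
        raise ValueError("activations must have the same size")
    return sum(abs(x - y) for x, y in zip(a, b))


def pose_perceptual_loss(real_feats: Sequence[Sequence[float]],
                         fake_feats: Sequence[Sequence[float]],
                         layer_weights: Sequence[float]) -> float:
    """Weighted sum of per-layer distances between two sets of activations.

    real_feats[l] / fake_feats[l] hold the layer-l activations of the
    ground-truth and generated skeleton sequence, respectively.
    """
    if not (len(real_feats) == len(fake_feats) == len(layer_weights)):
        raise ValueError("need one weight and one activation pair per layer")
    return sum(w * l1_distance(r, f)
               for w, r, f in zip(layer_weights, real_feats, fake_feats))


# Toy activations for two layers (illustrative numbers only).
real_feats = [[1.0, 2.0], [0.0, 0.0, 1.0]]
fake_feats = [[1.0, 0.0], [1.0, 0.0, 1.0]]
loss = pose_perceptual_loss(real_feats, fake_feats, layer_weights=[1.0, 0.5])
```

Because the comparison happens in feature space rather than on raw joint coordinates, a single badly detected joint in the ground truth has less direct influence than it would under a coordinate-wise distance.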

5 Implementation

5.1 Pose Discriminator

To evaluate whether a skeleton sequence is an excellent dance, we believe the most indispensable factors are the intra-frame representation for joint co-occurrences and the inter-frame representation for skeleton temporal evolution. To extract features of a pose sequence, we explore multi-stream CNN-based methods and adopt the Hierarchical Co-occurrence Network framework [20] to enable the discriminators to differentiate real and fake pose sequences.

Two-Stream CNN. The input of the pose discriminator is a skeleton sequence. The temporal difference of the sequence is interpolated to have the same shape as the sequence itself. Then the skeleton sequence and the temporal difference are fed into the network directly as two streams of inputs. Their feature maps are fused by concatenation along channels, and then we use convolutional and fully-connected layers to extract features.
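Preparing the second input stream can be sketched as follows: take frame-to-frame differences, then resample them to the original number of frames. Linear interpolation along the time axis is an assumption of this sketch, and the helper names are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def temporal_difference(seq: Sequence[Vector]) -> List[List[float]]:
    """Frame-to-frame differences; the result has one frame fewer than seq."""
    return [[b - a for a, b in zip(prev, cur)]
            for prev, cur in zip(seq, seq[1:])]


def interpolate_to_length(seq: Sequence[Vector], length: int) -> List[List[float]]:
    """Linearly resample seq along the time axis to `length` frames."""
    if not seq or length <= 0:
        raise ValueError("need a non-empty sequence and a positive length")
    if len(seq) == 1 or length == 1:
        return [list(seq[0]) for _ in range(length)]
    out = []
    for i in range(length):
        pos = i * (len(seq) - 1) / (length - 1)  # position in source frames
        lo = int(pos)
        hi = min(lo + 1, len(seq) - 1)
        frac = pos - lo
        out.append([(1 - frac) * a + frac * b
                    for a, b in zip(seq[lo], seq[hi])])
    return out


# Four frames of a single joint (x, y); illustrative values only.
skeletons = [[0.0, 0.0], [1.0, 2.0], [3.0, 2.0], [6.0, 2.0]]
motion = interpolate_to_length(temporal_difference(skeletons), len(skeletons))
```

After this step both streams have the same number of frames, so their feature maps can be concatenated along channels.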

5.2 Local Temporal Discriminator

One of the objectives of the pose generator is the temporal coherence of the generated skeleton sequence. For example, when a man moves his left foot, his right foot should keep still for multiple frames. Similar to PatchGAN [14, 47, 40], we propose the Local Temporal Discriminator, a 1D version of PatchGAN, to achieve coherence between consecutive frames. The Local Temporal Discriminator consists of a trimmed pose discriminator and a small classifier.

Figure 4: In each section, the first image is a skeleton generated by the model without pose perceptual loss, and the second image is a skeleton generated by the model with pose perceptual loss according to the same piece of music.

5.3 Global Content Discriminator

Dance is closely related to music, and the harmony between music and dance is a crucial criterion for evaluating a dance sequence. Inspired by [38], we propose the Global Content Discriminator to deal with the relationship between music and dance.

As mentioned previously, the music is encoded as a sequence of hidden states. Though a GRU can capture long-term dependencies, it is still challenging for a GRU to encode the entire music information. In our experiments, using only the last hidden state to represent the music feature causes the beginning part of the skeleton sequence to collapse. Therefore, we use the self-attention mechanism [22] to assign a weight to each hidden state and obtain a comprehensive embedding. In the next part, we briefly describe the self-attention mechanism used in our framework.

Self-attention mechanism. Given , we can compute its weight at each time step by


where is -th element of the while and . is the assigned weight for -th time step in the sequence of hidden states. Thus, the music feature can be computed by multiplying the scores and , written as .
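As an illustration of weighting hidden states and summing them into one music feature, here is a minimal attention-pooling sketch in the style of [22]. It assumes scores of the form $w_2^\top \tanh(W_1 h_t)$ normalized with a softmax; the names `W1` and `w2` and their toy values are hypothetical, and in a real model these weights would be learned.

```python
import math
from typing import List, Sequence, Tuple


def attention_pool(hidden: Sequence[Sequence[float]],
                   W1: Sequence[Sequence[float]],
                   w2: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Return (weights, pooled) for T hidden states of size k.

    W1 has shape (l, k) and w2 has length l, both assumed for this sketch.
    """
    scores = []
    for h in hidden:
        proj = [math.tanh(sum(w * x for w, x in zip(row, h))) for row in W1]
        scores.append(sum(a * b for a, b in zip(w2, proj)))
    # Numerically stable softmax over time steps.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of hidden states gives the pooled feature.
    k = len(hidden[0])
    pooled = [sum(weights[t] * hidden[t][j] for t in range(len(hidden)))
              for j in range(k)]
    return weights, pooled


hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, music_feature = attention_pool(hidden_states,
                                        W1=[[0.5, -0.5]], w2=[0.0])
```

With `w2` set to zero every time step gets the same weight and the pooled feature is the plain average; non-zero weights let the model emphasize some time steps over others.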

5.4 Other Loss Function

GAN loss The Local Temporal Discriminator () is trained on overlapping skeleton sequences that are sampled using from a whole skeleton sequence. The Global Content Discriminator () distinguishes the harmony between the skeleton sequence and the input music . Besides, we have and the ground truth skeleton sequence . We also apply a gradient penalty [13] term in . Therefore, the adversarial loss is defined as


where is the weight for the gradient penalty term.

$L_1$ distance. Given a ground truth dance skeleton sequence with the same shape as the generated skeleton sequence, the reconstruction loss at the joint level is the $L_1$ distance between the two sequences.

Figure 5: Noisy data caused by occlusion and overlapping. For the first part of the K-pop dataset, there is a large number of such skeletons. For the second part of the K-pop dataset, there are few inaccurate skeletons.

Feature matching loss We adopt the feature matching loss from  [39] to stabilize the training of Global Content Discriminator :


where is the number of layers in and denotes the layer of . In addition, we omit the normalization term of the original to fit our architecture.

Full Objective. Our full objective is


where , , and represent the weights for each loss term.

5.5 Pose to Video

Recently, researchers have been studying motion transfer, especially for transferring dance motion between two videos [7, 23, 40, 46]. Among these methods, we adopt the approach proposed by Chan et al. [7] for its simplicity and effectiveness. Given a skeleton sequence and a video of a target person, the framework can transfer the movement of the skeleton sequence to the target person. We used a third-party implementation.

Category Number
Ballet 1165
Break 3171
Cha 4573
Flamenco 2271
Foxtrot 2981
Jive 3765
Latin 2205
Pasodoble 2945
Quickstep 2776
Rumba 4459
Samba 3143
Square 5649
Swing 3528
Tango 3321
Tap 2860
Waltz 3046
(a) Let’s Dance Dataset.
Category Number
Clean Train 1636
Clean Val 146
Noisy Train 656
Noisy Val 74
(b) K-pop.
Category Number
Pop 4334
Rock 4324
Instrumental 4299
Electronic 4333
Folk 4346
International 4341
Hip-Hop 4303
Experimental 4323
(c) FMA.
Table 1: The detail of our datasets. All the datasets are cut into pieces of 5s. Number means the number of the pieces. For Let’s Dance Dataset and FMA, 70% is for training, 5% is for validation, and 25% is for testing.

6 Experiments

6.1 Datasets

K-pop dataset. To build our dataset, we apply OpenPose [4, 5, 42] to some online videos to obtain the skeleton sequences. In total, we collected 60 videos, each about 3 minutes long and featuring a single female dancer, and split these videos into two datasets. The first part with 20 videos is very noisy, as shown in Figure 5. This dataset is used to test the performance of the pose perceptual loss on noisy data. 18 videos of this part are for training, and 2 videos are for evaluation. The second part with 40 videos is relatively clean and is used for our automatic dance video generation task. 37 videos of this part are for training, and 3 videos are for evaluation. The details of this dataset are shown in Table 1(b).

Let’s Dance Dataset. Castro et al. [6] released a dataset containing 16 classes of dance, presented in Table 1(a). The dataset provides human skeleton sequences for pose recognition. Though large motion datasets with skeleton sequences exist [18, 30, 33], we choose the Let’s Dance Dataset to pre-train our ST-GCN for the pose perceptual loss, as dance is different from normal human motion.

Metric       Cross-modal   BRISQUE
Rand Frame   0.151         -
Rand Seq     0.204         -
$L_1$        0.312         40.66
Global D     0.094         40.93
Local D      0.068         41.46
Our model    0.046         41.11
Table 2: Results of our model and baselines. On the cross-modal evaluation, lower is better. For BRISQUE, higher is better. The details of the baselines are shown in Section 6.3.

Figure 6: Synthesized music video conditioned on the music “LIKEY” by TWICE. For each 5-second dance video, we show 4 frames. The top row shows the skeleton sequence, and the bottom row shows the synthesized video frames conditioned on different target videos.

FMA. Our cross-modal evaluation requires the extraction of music features. To this end, we adopt CRNN [9] and train it on the Free Music Archive (FMA) dataset. FMA provides genre information together with the music content for genre classification. The details of FMA are shown in Table 1(c).

6.2 Experimental Setup

All the models are trained on an Nvidia GeForce GTX 1080 Ti GPU. For the first stage in our framework, the model is implemented in PyTorch [28] and takes approximately one day to train for 400 epochs. For the hyperparameters, we set

, , , , . For the self attention mechanism, we set

. For the loss function, the hyperparameters

are set to be and Though the weight of distance loss is relatively large, the absolute value of the loss is quite small. We used Adam [17] for all the networks with a learning rate of 0.003 for the generator and 0.003 for the Local Temporal Discriminator and 0.005 for the Global Content Discriminator.

For the second stage, which transfers pose to video, the model takes approximately three days to train, and we adopt the same hyperparameters as [7]. For pre-training ST-GCN and CRNN, we also use Adam [17] with a learning rate of 0.002. ST-GCN achieves 46% precision on the Let’s Dance Dataset. CRNN is pre-trained on FMA, and its top-2 accuracy is 67.82%.

6.3 Evaluation

We will evaluate the following baselines and our model.

  • $L_1$. In this condition, we use only the $L_1$ distance to train the generator.

  • Global D. Based on $L_1$, we add a Global Content Discriminator.

  • Local D. Based on Global D, we add a Local Temporal Discriminator.

  • Our model. Based on Local D, we add pose perceptual loss. These conditions are used in Table 2.

6.3.1 User Study

To evaluate the quality of the generated skeleton sequences (our main contribution), we conduct a user study comparing the synthesized skeleton sequence with the ground-truth skeleton sequence. We randomly sample 10 pairs of sequences with different lengths and render the sequences as videos. To make this study fair, we verify the ground truth skeletons and re-annotate the noisy ones. In the user study, each participant watches the video of the synthesized skeleton sequence and the video of the ground truth skeleton sequence in random order. The participant then chooses one of two options: 1) The first video is better. 2) The second video is better. As shown in Figure 7, in 43.0% of the comparisons, participants vote for our synthesized skeleton sequence. This user study shows that our model can choreograph at a level similar to that of real artists.

Figure 7: Results of the user study on comparisons between the synthesized skeleton sequence and the ground truth. There are 27 participants in total, including seven dancers. In nearly half of the comparisons, users cannot tell which skeleton sequence is better given the music. To make the results reliable, we make sure there is no unclean skeleton in the study.
Figure 8: Cross-modal evaluation. We first project all the music pieces in the training set of the K-pop dataset into an embedding dictionary. We train the pose metric network based on the K-means clustering result of the embedding dictionary. For the K-means clustering, we choose K = 5, according to the Silhouette Coefficient. The similarity between the input music and the synthesized skeleton sequence is measured with the learned pose metric.

Figure 9: Our synthesized music video with a male student as a dancer.

6.3.2 Cross-modal Evaluation

It is challenging to evaluate whether a dance sequence is suitable for a piece of music. To the best of our knowledge, there is no existing method to evaluate the mapping between music and dance. Therefore, we propose a two-step cross-modal metric, as shown in Figure 8, to estimate the similarity between music and dance.

We are given a training set of pairs, each consisting of a dance skeleton sequence and the corresponding music. With a pre-trained music feature extractor [9], we aggregate all the music embeddings of the training set in an embedding dictionary.

The input to our evaluation is a piece of music. With our generator, we obtain the synthesized skeleton sequence for this music. The first step is to find a skeleton sequence that represents the music. We first obtain the music feature with the pre-trained extractor and then find its nearest neighbor in the embedding dictionary. We use the skeleton sequence that corresponds to this nearest neighbor to represent the input music. The second step is to measure the similarity between the two skeleton sequences, the synthesized one and the representative one, with the metric learning objective based on a triplet architecture and Maximum Mean Discrepancy proposed by Coskun et al. [10]. More implementation details about this metric are given in the supplementary materials.
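The first step of the evaluation, looking up the training dance whose music is closest to the input music, can be sketched as a nearest-neighbor search. The sketch assumes Euclidean distance between embeddings and represents the dictionary as a list of (embedding, skeleton sequence) pairs; both choices, like the function names and toy values, are hypothetical.

```python
from typing import List, Sequence, Tuple, TypeVar

Skeletons = TypeVar("Skeletons")


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def representative_dance(music_feature: Sequence[float],
                         dictionary: List[Tuple[Sequence[float], Skeletons]]
                         ) -> Skeletons:
    """Return the skeleton sequence paired with the nearest music embedding."""
    if not dictionary:
        raise ValueError("embedding dictionary is empty")
    _, skeletons = min(
        dictionary,
        key=lambda entry: squared_distance(music_feature, entry[0]))
    return skeletons


# Toy dictionary: strings stand in for real skeleton sequences.
embedding_dictionary = [
    ([0.0, 0.0], "dance_a"),
    ([1.0, 1.0], "dance_b"),
    ([5.0, 5.0], "dance_c"),
]
match = representative_dance([0.9, 1.2], embedding_dictionary)
```

The returned sequence would then be compared with the generator's output using the learned pose metric, which is the second step of the evaluation.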

6.3.3 Quantitative Evaluation

To evaluate the quality of the final results in comparison to other conditions, Chan et al. [7] propose to perform a transfer within the same video, since there is otherwise no reference for the synthesized frame, and use SSIM [41] and LPIPS [45] to measure the videos. For our task, such metrics are not applicable because there are no reference frames for the generated dance video. We therefore apply BRISQUE [26], a no-reference image quality assessment method, to measure the quality of our final generated dance video.

As shown in Table 2, by utilizing the Global Content Discriminator and the Local Temporal Discriminator, even for a single frame result, the score is better. For the addition of the pose perceptual loss, the poses become plausible, and then transferring the diverse poses to the frames may lead to the decline of the score. Furthermore, more significant differences can be observed in our video. To validate our proposed evaluation, we also try two random conditions:

  • Rand Frame. Randomly select 50 frames from the training dataset for the input music instead of feeding the music into the generator.

  • Rand Seq. Randomly select a skeleton sequence from the training dataset for the input music instead of feeding the music into the generator.

    To make the random results stable, we make ten random processes and get the average score.

7 Conclusion

We have presented a two-stage framework to generate dance videos, given any music. With our proposed pose perceptual loss, our model can be trained on dance videos with noisy pose skeleton sequences (no human labels). Our approach can create arbitrarily long, good-quality videos. We hope that this pipeline of synthesizing skeleton sequences and dance videos, combined with the pose perceptual loss, can support more future work, including more creative video synthesis for artists.


  • [1] O. Alemi, J. Françoise, and P. Pasquier (2017) GrooveNet: real-time music-driven dance movement generation using artificial neural networks. Cited by: §1, §2.
  • [2] J. Bruna, P. Sprechmann, and Y. LeCun (2016) Super-resolution with deep convolutional sufficient statistics. In ICLR, Cited by: §4.
  • [3] H. Cai, C. Bai, Y. Tai, and C. Tang (2018) Deep video generation, prediction and completion of human action sequences. In ECCV, Cited by: §1, §4.
  • [4] Z. Cao, G. Hidalgo, T. Simon, S. Wei, and Y. Sheikh (2018) OpenPose: realtime multi-person 2D pose estimation using Part Affinity Fields. In arXiv:1812.08008, Cited by: §1, §4, §6.1.
  • [5] Z. Cao, T. Simon, S. Wei, and Y. Sheikh (2017) Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, Cited by: §1, §4, §6.1.
  • [6] D. Castro, S. Hickson, P. Sangkloy, B. Mittal, S. Dai, J. Hays, and I. Essa (2018) Let’s dance: learning from online dance videos. In arXiv:1801.07388, Cited by: §6.1.
  • [7] C. Chan, S. Ginosar, T. Zhou, and A. A. Efros (2019) Everybody dance now. In ICCV, Cited by: §3, §5.5, §6.2, §6.3.3.
  • [8] Q. Chen and V. Koltun (2017) Photographic image synthesis with cascaded refinement networks. In ICCV, Cited by: §4, §4.
  • [9] K. Choi, G. Fazekas, M. B. Sandler, and K. Cho (2017) Convolutional recurrent neural networks for music classification. In ICASSP, Cited by: §6.1, §6.3.2.
  • [10] H. Coskun, D. J. Tan, S. Conjeti, N. Navab, and F. Tombari (2018) Human motion analysis with deep metric learning. In ECCV, Cited by: §6.3.2.
  • [11] A. Dosovitskiy and T. Brox (2016) Generating images with perceptual similarity metrics based on deep networks. In NeurIPS, Cited by: §4.
  • [12] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In NeurIPS, Cited by: §1, §2.
  • [13] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of wasserstein gans. In NeurIPS, Cited by: §5.4.
  • [14] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In CVPR, Cited by: §5.2.
  • [15] J. Johnson, A. Alahi, and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. In ECCV, Cited by: §4.
  • [16] J. W. Kim, H. Fouad, and J. K. Hahn (2006) Making them dance. In AAAI Fall Symposium: Aurally Informed Performance, Cited by: §1, §2.
  • [17] D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In ICLR, Cited by: §6.2, §6.2.
  • [18] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre (2011) HMDB: a large video database for human motion recognition. In ICCV, Cited by: §1, §6.1.
  • [19] J. Lee, S. Kim, and K. Lee (2018) Listen to dance: music-driven choreography generation using autoregressive encoder-decoder network. CoRR. Cited by: §1, §1, §4.
  • [20] C. Li, Q. Zhong, D. Xie, and S. Pu (2018) Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation. In IJCAI, Cited by: §5.1.
  • [21] Y. Li, M. R. Min, D. Shen, D. E. Carlson, and L. Carin (2017) Video generation from text. arXiv:1710.00421. Cited by: §2.
  • [22] Z. Lin, M. Feng, C. N. dos Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio (2017) A structured self-attentive sentence embedding. In ICLR, Cited by: §3, §5.3.
  • [23] W. Liu, Z. Piao, J. Min, W. Luo, L. Ma, and S. Gao (2019) Liquid warping GAN: A unified framework for human motion imitation, appearance transfer and novel view synthesis. In ICCV, Cited by: §5.5.
  • [24] J. Martinez, M. J. Black, and J. Romero (2017) On human motion prediction using recurrent neural networks. In CVPR, Cited by: §1.
  • [25] M. Mathieu, C. Couprie, and Y. LeCun (2016) Deep multi-scale video prediction beyond mean square error. In ICLR, Cited by: §2.
  • [26] A. Mittal, A. K. Moorthy, and A. C. Bovik (2012) No-reference image quality assessment in the spatial domain. IEEE Trans. Image Processing. Cited by: §6.3.3.
  • [27] A. M. Nguyen, A. Dosovitskiy, J. Yosinski, T. Brox, and J. Clune (2016) Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. In NeurIPS, Cited by: §4.
  • [28] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §6.2.
  • [29] M. Saito, E. Matsumoto, and S. Saito (2017) Temporal generative adversarial nets with singular value clipping. In ICCV, Cited by: §2.
  • [30] A. Shahroudy, J. Liu, T. Ng, and G. Wang (2016) NTU RGB+D: A large scale dataset for 3d human activity analysis. In CVPR, Cited by: §6.1.
  • [31] T. Shiratori, A. Nakazawa, and K. Ikeuchi (2006) Dancing-to-music character animation. Comput. Graph. Forum. Cited by: §1, §2.
  • [32] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In ICLR, Cited by: §4.
  • [33] K. Soomro, A. R. Zamir, and M. Shah (2012) UCF101: A dataset of 101 human actions classes from videos in the wild. CoRR. Cited by: §1, §6.1.
  • [34] C. T. and R. R.R. (1998) Audio-visual integration in multimodal communication. In IEEE, Cited by: §2.
  • [35] T. Tang, J. Jia, and H. Mao (2018) Dance with melody: an lstm-autoencoder approach to music-oriented dance synthesis. In ACM Multimedia, Cited by: §1, §1, §2, §4.
  • [36] S. Tulyakov, M. Liu, X. Yang, and J. Kautz (2018) MoCoGAN: decomposing motion and content for video generation. In CVPR, Cited by: §2.
  • [37] C. Vondrick, H. Pirsiavash, and A. Torralba (2016) Generating videos with scene dynamics. In NeurIPS, Cited by: §2.
  • [38] K. Vougioukas, S. Petridis, and M. Pantic (2018) End-to-end speech-driven facial animation with temporal gans. In BMVC, Cited by: §2, §5.3.
  • [39] T. Wang, M. Liu, J. Zhu, A. Tao, J. Kautz, and B. Catanzaro (2018) High-resolution image synthesis and semantic manipulation with conditional gans. In CVPR, Cited by: §3, §4, §5.4.
  • [40] T. Wang, M. Liu, J. Zhu, N. Yakovenko, A. Tao, J. Kautz, and B. Catanzaro (2018) Video-to-video synthesis. In NeurIPS, Cited by: §4, §5.2, §5.5.
  • [41] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Processing. Cited by: §6.3.3.
  • [42] S. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh (2016) Convolutional pose machines. In CVPR, Cited by: §1, §4, §6.1.
  • [43] Cited by: §1, §2.
  • [44] S. Yan, Y. Xiong, and D. Lin (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. In AAAI, Cited by: §4.
  • [45] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, Cited by: §6.3.3.
  • [46] Y. Zhou, Z. Wang, C. Fang, T. Bui, and T. L. Berg (2019) Dance dance generation: motion transfer for internet videos. CoRR. Cited by: §5.5.
  • [47] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, Cited by: §5.2.