(ACM MM 20 Oral) PyTorch implementation of Self-supervised Dance Video Synthesis Conditioned on Music
We present a learning-based approach with pose perceptual loss for automatic music video generation. Our method can produce a realistic dance video that conforms to the beat and rhythm of almost any given music. To achieve this, we first generate a human skeleton sequence from music and then apply a learned pose-to-appearance mapping to generate the final video. When generating skeleton sequences, we use two discriminators to capture different aspects of the sequence and propose a novel pose perceptual loss to produce natural dances. We also provide a new cross-modal evaluation of dance quality, which estimates the similarity between the two modalities of music and dance. Finally, a user study demonstrates that the dance videos synthesized by the presented approach are surprisingly realistic. The results are shown in the supplementary video at https://youtu.be/0rMuFMZa_K4
Fork of the paper Music-oriented Dance Video Synthesis with Pose Perceptual Loss.
Music videos have become unprecedentedly popular all over the world. Nearly all the top 10 most-viewed YouTube videos (https://www.digitaltrends.com/web/most-viewed-youtube-videos/) are music videos with dancing. While these music videos are made by professional artists, we wonder if an intelligent system can automatically generate personalized and creative music videos. In this work, we study automatic dance music video generation, given almost any music. We aim to synthesize a coherent and photo-realistic dance video that conforms to the given music. With such music video generation technology, a user can share a personalized music video on social media. In Figure 1, we show some images of our synthesized dance video given the music “I Wish” by Cosmic Girls.
The dance video synthesis task is challenging for several technical reasons. First, the mapping between dance motion and background music is ambiguous: different artists may compose distinctive dance motion given the same music. This suggests that a simple machine learning model trained with an $L_1$ or $L_2$ distance [19, 35] can hardly capture the relationship between dance and music. Second, it is technically difficult to model the space of human dance motion. The model should avoid generating unnatural dancing movements, and even slight deviations from normal human poses can appear unnatural. Third, no high-quality dataset is available for our task. Previous motion datasets [18, 33] mostly focus on action recognition. Tang et al. provide a 3D joint dataset for our task, but when we tried to use it we found that the dance motion and the music are not aligned.
Nowadays, a large number of music videos with dancing are available online and can be used for the music video generation task. To build a dataset for our task, we apply OpenPose [4, 5, 42] to get dance skeleton sequences from online videos. However, the skeleton sequences acquired by OpenPose are very noisy: some estimated human poses are inaccurate. Correcting such a dataset by removing inaccurate poses is time-consuming and thus not suitable for extensive applications. Furthermore, prior work [19, 35, 43] trains the network with only an $L_1$ or $L_2$ distance, which has been shown to disregard some specific motion characteristics. To tackle these challenges, we propose a novel pose perceptual loss so that our model can be trained on noisy data (imperfect human poses) obtained by OpenPose.
Dance synthesis has been well studied in the literature by searching for dance motion in a database using music as a query [1, 16, 31]. These approaches cannot generalize well to music beyond the training data and lack creativity, which is the most indispensable factor of dance. To overcome such obstacles, we choose the generative adversarial network (GAN) to deal with cross-modal mapping. However, Cai et al. showed that human pose constraints are too complicated to be captured by an end-to-end model trained with a direct GAN method. Thus, we propose to use two discriminators that focus on local coherence and global harmony, respectively.
In summary, the contributions of our work are:
With the proposed pose perceptual loss, our model can be trained on a noisy dataset (without human labels) to synthesize realistic dance video that conforms to almost any given music.
With the Local Temporal Discriminator and the Global Content Discriminator, our framework can generate a coherent dance skeleton sequence that matches the length, rhythm, and the emotion of music.
For our task, we build a dataset containing paired music and skeleton sequences, which will be made public for research. To evaluate our model, we also propose a novel cross-modal evaluation that measures the similarity between music and a dance skeleton sequence.
GAN-based Video Synthesis. A generative adversarial network (GAN) is a popular approach for image generation. The images generated by a GAN are usually sharper and more detailed than those produced with an $L_1$ or $L_2$ distance. Recently, GANs have also been extended to video generation tasks [21, 25, 36, 37]. The simplest adaptations of GANs to video are proposed in [29, 37]. One such GAN model replaced the standard 2D convolutional layer with a 3D convolutional layer to capture temporal features, although this approach is limited to a fixed time span. TGAN overcame the limitation, but at the cost of constraints imposed on the latent space. MoCoGAN generates videos by combining the advantages of RNN-based GAN models and sliding window techniques so that motion and content are disentangled in the latent space.
Another advantage of GAN models is that they are widely applicable to many tasks, including the cross-modal audio-to-video problem. Chen et al. proposed a GAN-based encoder-decoder architecture using CNNs to convert between audio spectrograms and frames. Furthermore, Vougioukas et al. adapted a temporal GAN to automatically synthesize a talking character conditioned on speech signals.
Dance Motion Synthesis. A line of work focuses on the mapping between acoustic and motion features. On the basis of labeling music with joint positions and angles, Shiratori et al. [31, 16] incorporated gravity and beats as additional features for predicting dance motion. Recently, Alemi et al. proposed to combine the acoustic features with the motion features of previous frames. However, these approaches depend entirely on the prepared database and may only create rigid motion for music with similar acoustic features.
Recently, Yaota et al. accomplished dance synthesis using standard deep learning models. The most recent work is by Tang et al., who proposed a model based on an LSTM-autoencoder architecture to generate dance pose sequences. Their approach is trained with a distance loss, and their evaluation only includes comparisons with randomly sampled dances that are not on a par with those by real artists. Their approach may not work well on the noisy data obtained by OpenPose.
To generate a dance video from music, we split our system into two stages. In the first stage, we propose an end-to-end model that directly generates a dance skeleton sequence according to the audio input. In the second stage, we apply an improved pix2pixHD GAN [39, 7] to transfer the dance skeleton sequence to a dance video. In this overview, we will mainly describe the first stage, as shown in Figure 2.
Let $V$ be the number of joints of the human skeleton; each joint is a 2D coordinate $(x, y)$, so its dimension is 2. We formulate a dance skeleton sequence $X$ as a sequence of human skeletons across $T$ consecutive frames in total, $X \in \mathbb{R}^{T \times 2V}$, where each skeleton frame $X_t \in \mathbb{R}^{2V}$ is a vector containing all $(x, y)$ joint locations. Our goal is to learn a function $G$ that maps audio signals with sample rate $S$ per frame to a joint location vector sequence.
Generator. The generator is composed of a music encoding part and a pose generator. The input audio signals are divided into pieces of 0.1-second music. These pieces are encoded using 1D convolution and then fed into a bi-directional 2-layer GRU in chronological order, resulting in a sequence of output hidden states. These hidden states are fed into the pose generator, a multi-layer perceptron, to produce a skeleton sequence.
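The first step of the generator, cutting the audio into 0.1-second pieces, can be sketched as follows. This is only an illustration: the function name `split_into_pieces`, the 16 kHz sample rate, and the choice to drop trailing samples that do not fill a whole piece are assumptions, not details taken from the paper.

```python
def split_into_pieces(samples, sample_rate, piece_seconds=0.1):
    """Split a 1-D list of audio samples into consecutive fixed-length pieces.

    Trailing samples that do not fill a whole piece are dropped (an assumption).
    """
    piece_len = int(round(sample_rate * piece_seconds))
    if piece_len <= 0:
        raise ValueError("sample_rate * piece_seconds must cover at least one sample")
    n_pieces = len(samples) // piece_len
    return [samples[i * piece_len:(i + 1) * piece_len] for i in range(n_pieces)]


# One second of silent dummy audio at a hypothetical 16 kHz sample rate.
audio = [0.0] * 16000
pieces = split_into_pieces(audio, sample_rate=16000)
print(len(pieces), len(pieces[0]))
```

Each piece would then be passed to the 1D convolutional encoder, so the number of pieces determines the number of GRU time steps.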
Local Temporal Discriminator. The output skeleton sequence is divided into overlapping sub-sequences. These sub-sequences are then fed into the Local Temporal Discriminator, which is a two-branch convolutional network. In the end, a small classifier outputs scores that determine the realism of these skeleton sub-sequences.
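Dividing a skeleton sequence into overlapping sub-sequences amounts to a sliding window over frames. The sketch below shows one way to do it; the helper name `overlapping_windows` and the window and stride values are hypothetical, since the text does not state the sizes used.

```python
def overlapping_windows(frames, window, stride):
    """Return sub-sequences of `frames`; they overlap whenever stride < window."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    last_start = len(frames) - window
    return [frames[s:s + window] for s in range(0, last_start + 1, stride)]


# Ten dummy frames, identified by their index for readability.
frames = list(range(10))
subs = overlapping_windows(frames, window=4, stride=2)
print(subs)
```

With a stride smaller than the window, neighbouring sub-sequences share frames, so each frame is judged in more than one local context.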
Global Content Discriminator. The input to the Global Content Discriminator includes the music and the dance skeleton sequence. For the pose part, the skeleton sequence is encoded using the pose discriminator. For the music part, similar to the sub-network of the generator, the music is encoded using 1D convolution and then fed into a bi-directional 2-layer GRU; the resulting output is passed to a self-attention component to obtain a comprehensive music feature representation. In the end, we concatenate the pose and music features along channels and use a small classifier, composed of a 1D convolutional layer and a fully-connected (FC) layer, to determine whether the skeleton sequence matches the music.
Pose Perceptual Loss.
Recently, Graph Convolutional Networks (GCNs) have been extended to model skeletons, since the human skeleton is graph-structured data. The features extracted by a GCN therefore retain high-level spatial structural information between different body parts. Matching activations in a pre-trained GCN network gives a better constraint on both the detail and layout of a pose than traditional measures such as $L_1$ and $L_2$ distance. Figure 3 shows the pipeline of the pose perceptual loss. With the pose perceptual loss, our output skeleton sequences do not need an additional smoothing step or any other post-processing.
Perceptual loss or feature matching loss [2, 8, 11, 15, 27, 39, 40] is a popular loss to measure the similarity between two images in image processing and synthesis. For tasks that generate human skeleton sequences [3, 19, 35], only an $L_1$ or $L_2$ distance is used for measuring pose similarity. With an $L_1$ or $L_2$ loss, we find that our model tends to generate poses conservatively (repeatedly) and fails to capture the semantic relationship across motion appropriately. Moreover, the datasets generated by OpenPose [4, 5, 42] are very noisy, as shown in Figure 5. Correcting inaccurate human poses on a large number of videos is labor-intensive and undesirable: a two-minute video at 10 FPS has 1200 poses to verify. To tackle these difficulties, we propose a novel pose perceptual loss.
The idea of perceptual loss was originally studied in the image domain, where it is used to match activations in a visual perception network such as VGG-19 [32, 8]. To use the traditional perceptual loss, we would need to draw generated skeletons on images, which is complicated and seemingly suboptimal. Instead of projecting pose joint coordinates to an image, we propose to directly match activations in a pose recognition network that takes human skeleton sequences as input. Such networks are mainly aimed at pose recognition or prediction tasks, and ST-GCN is a Graph Convolutional Network (GCN) that can play the role of the visual perception network in our case. ST-GCN uses a spatial-temporal graph to form a hierarchical representation of skeleton sequences and is capable of automatically learning both spatial and temporal patterns from data. To test the impact of the pose perceptual loss on our noisy dataset, we prepare a 20-video dataset that contains much noise due to wrong pose detections by OpenPose. As shown in Figure 4, our generator can stably generate poses with the pose perceptual loss.
Given a pre-trained GCN network, we select a collection of its layers. For a training pair consisting of the ground-truth skeleton sequence and the corresponding piece of music, our perceptual loss matches, layer by layer, the activations of the ground-truth sequence with the activations of the sequence produced by the first-stage generator in our framework. Per-layer hyperparameters balance the contribution of each layer to the loss.
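A layer-wise activation-matching loss of this kind is commonly written in the following form. The symbols are assumptions introduced here for illustration: $\Phi_l$ is the activation of the $l$-th selected layer of the pre-trained GCN, $y$ the ground-truth skeleton sequence, $x$ the music, $G$ the first-stage generator, and $\lambda_l$ the per-layer weight. The choice of the $L_1$ norm is likewise an assumption.

```latex
\mathcal{L}_{P} \;=\; \sum_{l} \lambda_{l}\,
  \bigl\lVert \Phi_{l}(y) - \Phi_{l}\bigl(G(x)\bigr) \bigr\rVert_{1}
```

Setting some $\lambda_l$ larger emphasizes the corresponding layer, for instance deeper layers that encode more global structure.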
To evaluate whether a skeleton sequence is an excellent dance, we believe the most indispensable factors are the intra-frame representation of joint co-occurrences and the inter-frame representation of skeleton temporal evolution. To extract features of a pose sequence, we explore multi-stream CNN-based methods and adopt the Hierarchical Co-occurrence Network framework to enable the discriminators to differentiate real and fake pose sequences.
Two-Stream CNN. The input of the pose discriminator is a skeleton sequence. Its temporal difference is interpolated to have the same shape as the skeleton sequence. The skeleton sequence and the temporal difference are then fed into the network directly as two streams of inputs. Their feature maps are fused by concatenation along channels, and then we use convolutional and fully-connected layers to extract features.
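The two input streams can be illustrated with plain Python lists. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the interpolation is linear along the time axis, and the streams are joined at the input for brevity, whereas the text fuses feature maps after each stream has been processed.

```python
def temporal_difference(seq):
    """Frame-to-frame difference: T frames in, T - 1 difference frames out."""
    return [[b - a for a, b in zip(prev, cur)] for prev, cur in zip(seq, seq[1:])]


def resample_linear(seq, target_len):
    """Linearly interpolate a list of frames along time to `target_len` frames."""
    n = len(seq)
    if n == 1 or target_len == 1:
        return [list(seq[0]) for _ in range(target_len)]
    out = []
    for i in range(target_len):
        pos = i * (n - 1) / (target_len - 1)  # fractional index into seq
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(seq[lo], seq[hi])])
    return out


# Four frames of a single joint given as (x, y).
skeletons = [[0.0, 0.0], [1.0, 2.0], [3.0, 2.0], [6.0, 5.0]]
diff = temporal_difference(skeletons)                   # 3 frames
diff_same_len = resample_linear(diff, len(skeletons))   # back to 4 frames
two_stream = [s + d for s, d in zip(skeletons, diff_same_len)]  # channel concat
print(len(diff), len(diff_same_len), len(two_stream[0]))
```

After resampling, both streams have the same number of frames, which is what makes concatenation along channels possible.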
One of the objectives of the pose generator is the temporal coherence of the generated skeleton sequence. For example, when a dancer moves the left foot, the right foot should keep still for multiple frames. Similar to PatchGAN [14, 47, 40], we propose a Local Temporal Discriminator, a 1D version of PatchGAN, to achieve coherence between consecutive frames. The Local Temporal Discriminator consists of a trimmed pose discriminator and a small classifier.
Dance is closely related to music, and the harmony between music and dance is a crucial criterion for evaluating a dance sequence. Inspired by prior work, we propose the Global Content Discriminator to deal with the relationship between music and dance.
As mentioned previously, the music is encoded as a sequence of hidden states. Though a GRU can capture long-term dependencies, it is still challenging for a GRU to encode the entire music information. In our experiments, using only the last hidden state to represent the music feature causes the beginning part of the skeleton sequence to collapse. Therefore, we use the self-attention mechanism to assign a weight to each hidden state and obtain a comprehensive embedding. In the next part, we briefly describe the self-attention mechanism used in our framework.
Self-attention mechanism. Given the sequence of hidden states, the mechanism computes a scalar score for each time step using learned weight matrices and normalizes the scores across time, so that each time step in the sequence of hidden states receives an assigned weight. The music feature is then computed by multiplying the weights with the hidden states.
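For reference, a standard structured self-attention pooling has the form below. It is given as a sketch rather than the paper's exact formula: the symbols $H$ (matrix of hidden states, one row $h_i$ per time step), $W_{s1}$ and $W_{s2}$ (learned weight matrices), $r_i$ (score), $a_i$ (weight), and $m$ (music feature) are assumptions, and the original normalization may differ from the softmax used here.

```latex
r = W_{s2}\,\tanh\!\bigl(W_{s1} H^{\top}\bigr), \qquad
a_i = \frac{\exp(r_i)}{\sum_{j} \exp(r_j)}, \qquad
m = \sum_{i} a_i\, h_i
```

Because every hidden state contributes to $m$ through its weight, information from the beginning of the music is not lost the way it can be when a single hidden state is used.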
GAN loss. The Local Temporal Discriminator is trained on overlapping skeleton sub-sequences sampled from a whole skeleton sequence. The Global Content Discriminator judges the harmony between the skeleton sequence and the input music. Besides the generated skeleton sequence, we have the ground-truth skeleton sequence. We also apply a gradient penalty term. The adversarial loss therefore combines the two discriminator terms with the gradient penalty term, which is scaled by its own weight.
Distance loss. Given a ground-truth dance skeleton sequence with the same shape as the generated sequence, the reconstruction loss at the joint level is the distance between the two sequences.
Feature matching loss. We adopt the feature matching loss to stabilize the training of the Global Content Discriminator. The loss is accumulated over the layers of the discriminator, comparing each layer's features for real and generated input. In addition, we omit the normalization term of the original formulation to fit our architecture.
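Full Objective. Our full objective combines the adversarial loss with the distance loss, the feature matching loss, and the pose perceptual loss, where three weights control the contribution of the loss terms.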
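One way to write such a weighted combination is shown below. The notation is an assumption made for illustration: $\mathcal{L}_{adv}$ is the adversarial loss, $\mathcal{L}_{dist}$ the joint-level distance loss, $\mathcal{L}_{FM}$ the feature matching loss, $\mathcal{L}_{P}$ the pose perceptual loss, and $\lambda_{dist}$, $\lambda_{FM}$, $\lambda_{P}$ the three weights. Which term carries which weight in the original is not recoverable from the text.

```latex
\mathcal{L} \;=\; \mathcal{L}_{adv}
  \;+\; \lambda_{dist}\,\mathcal{L}_{dist}
  \;+\; \lambda_{FM}\,\mathcal{L}_{FM}
  \;+\; \lambda_{P}\,\mathcal{L}_{P}
```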
Recently, researchers have been studying motion transfer, especially for transferring dance motion between two videos [7, 23, 40, 46]. Among these methods, we adopt the approach proposed by Chan et al. for its simplicity and effectiveness. Given a skeleton sequence and a video of a target person, the framework can transfer the movement of the skeleton sequence to the target person. We used a third-party implementation (https://github.com/CUHKSZ-TQL/EverybodyDanceNow_reproduce_pytorch).
K-pop dataset. To build our dataset, we apply OpenPose [4, 5, 42] to some online videos to obtain the skeleton sequences. In total, we collected 60 videos of about 3 minutes each, featuring a single female dancer, and split these videos into two datasets. The first part, with 20 videos, is very noisy, as shown in Figure 5. This dataset is used to test the performance of the pose perceptual loss on noisy data: 18 of its videos are for training, and 2 are for evaluation. The second part, with 40 videos, is relatively clean and is used for our automatic dance video generation task: 37 of its videos are for training, and 3 are for evaluation. The details of this dataset are shown in Table 0(c).
Let’s Dance Dataset. Castro et al. released a dataset containing 16 classes of dance, presented in Table 0(a). The dataset provides human skeleton sequences for pose recognition. Although enormous motion datasets [18, 30, 33] with skeleton sequences exist, we choose the Let’s Dance Dataset to pre-train our ST-GCN for the pose perceptual loss, since dance is different from normal human motion.
FMA. Our cross-modal evaluation requires extracting music features. To this end, we adopt CRNN and train it on the Free Music Archive (FMA) dataset. FMA provides genre information and music content for genre classification. Details of FMA are shown in Table 0(c).
All the models are trained on an Nvidia GeForce GTX 1080 Ti GPU. For the first stage in our framework, the model is implemented in PyTorch and takes approximately one day to train for 400 epochs. For the hyperparameters, we set , , , , . For the self-attention mechanism, we set . For the loss function, the hyperparameters are set to be and . Though the weight of the distance loss is relatively large, the absolute value of the loss is quite small. We used Adam for all the networks, with a learning rate of 0.003 for the generator, 0.003 for the Local Temporal Discriminator, and 0.005 for the Global Content Discriminator.
For the second stage, which transfers pose to video, the model takes approximately three days to train, and its hyperparameters follow those of the adopted motion-transfer method. To pre-train ST-GCN and CRNN, we also used Adam, with a learning rate of 0.002. ST-GCN achieves 46% precision on the Let’s Dance Dataset. CRNN is pre-trained on FMA, and its top-2 accuracy is 67.82%.
We will evaluate the following baselines and our model.
Distance baseline. In this condition, we train the generator with the distance loss only.
Global D. Based on the distance baseline, we add a Global Content Discriminator.
Local D. Based on Global D, we add a Local Temporal Discriminator.
Our model. Based on Local D, we add pose perceptual loss. These conditions are used in Table 2.
To evaluate the quality of the generated skeleton sequences (our main contribution), we conduct a user study comparing the synthesized skeleton sequence and the ground-truth skeleton sequence. We randomly sample 10 pairs of sequences with different lengths and render the sequences as videos. To make this study fair, we verify the ground-truth skeletons and re-annotate the noisy ones. In the user study, each participant watches the video of the synthesized skeleton sequence and the video of the ground-truth skeleton sequence in random order. The participant then chooses one of two options: 1) The first video is better. 2) The second video is better. As shown in Figure 7, participants vote for our synthesized skeleton sequence in 43.0% of the comparisons. This user study shows that our model can choreograph at a level similar to that of real artists.
It is challenging to evaluate whether a dance sequence is suitable for a piece of music. To the best of our knowledge, there is no existing method to evaluate the mapping between music and dance. Therefore, we propose a two-step cross-modal metric, as shown in Figure 8, to estimate the similarity between music and dance.
Given a training set of pairs, each consisting of a dance skeleton sequence and the corresponding music, we use a pre-trained music feature extractor to aggregate all the music embeddings in an embedding dictionary.
The input to our evaluation is a piece of music. With our generator, we obtain the synthesized skeleton sequence. The first step is to find a skeleton sequence that represents the music. We first extract the music feature with the pre-trained extractor and find its nearest neighbor in the embedding dictionary. We then use the skeleton sequence paired with that neighbor to represent the music. The second step is to measure the similarity between the two skeleton sequences with the metric learning objective based on a triplet architecture and Maximum Mean Discrepancy proposed by Coskun et al. More implementation details about this metric are given in the supplementary materials.
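The first step of the metric, looking up the training clip whose music embedding is closest to the query, can be sketched as below. The clip names, the 2-D embeddings, and the use of Euclidean distance are all assumptions for illustration; the text does not state which distance is used for the nearest-neighbor search.

```python
import math


def nearest_neighbor(query, dictionary):
    """Return the key whose embedding is closest to `query` (Euclidean distance)."""
    return min(dictionary, key=lambda k: math.dist(query, dictionary[k]))


# Hypothetical embedding dictionary built from the training music.
music_embeddings = {
    "clip_a": [0.0, 0.0],
    "clip_b": [1.0, 1.0],
    "clip_c": [5.0, 5.0],
}
# Skeleton sequence paired with each training clip (placeholders here).
skeletons_by_clip = {"clip_a": "skeleton_a", "clip_b": "skeleton_b", "clip_c": "skeleton_c"}

query_embedding = [0.9, 1.2]  # feature of the input music
best = nearest_neighbor(query_embedding, music_embeddings)
representative = skeletons_by_clip[best]
print(best, representative)
```

The representative skeleton sequence is then compared with the synthesized one in the second step, using the learned skeleton similarity metric.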
To compare the quality of their final method against other conditions, Chan et al. propose to transfer motion within the same video, since there is otherwise no reference for the synthesized frame, and use SSIM and LPIPS to measure the videos. For our task, such metrics are not applicable because there are no reference frames for the generated dance video. We therefore apply BRISQUE, a no-reference image quality assessment, to measure the quality of our final generated dance video.
As shown in Table 2, using the Global Content Discriminator and the Local Temporal Discriminator improves the score even for single-frame results. With the addition of the pose perceptual loss, the poses become plausible, and transferring the more diverse poses to the frames may lead to a decline in the score. More significant differences can be observed in our video. To validate our proposed evaluation, we also try two random conditions:
Rand Frame. Randomly select 50 frames from the training dataset for the input music instead of feeding the music into the generator.
Rand Seq. Randomly select a skeleton sequence from the training dataset for the input music instead of feeding the music into the generator.
To make the random results stable, we repeat each random process ten times and report the average score.
We have presented a two-stage framework to generate dance videos, given any music. With our proposed pose perceptual loss, our model can be trained on dance videos with noisy pose skeleton sequences (no human labels). Our approach can create arbitrarily long, good-quality videos. We hope that this pipeline, which synthesizes a skeleton sequence and then a dance video and is trained with the pose perceptual loss, can support more future work, including more creative video synthesis for artists.
GrooveNet: real-time music-driven dance movement generation using artificial neural networks.
Convolutional recurrent neural networks for music classification. In ICASSP.
Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. In NeurIPS.
Temporal generative adversarial nets with singular value clipping. In ICCV.
The unreasonable effectiveness of deep features as a perceptual metric. In CVPR.