Music2Dance: Music-driven Dance Generation using WaveNet

02/02/2020 ∙ by Wenlin Zhuang, et al.

In this paper, we propose a novel system, named Music2Dance, for addressing the problem of fully automatic music choreography. Our key idea is to adapt WaveNet, which was originally designed for speech generation, to human motion synthesis. To bridge the large differences between these two tasks, we propose a novel network structure. Specifically, the music features, which serve as the local condition for our network, are first extracted by considering the characteristics of rhythm and melody. In addition, the dance type is designed as the global condition for the network. Both conditions are utilized to stabilize the network training. Beyond the network architecture, another main challenge is the lack of data. To tackle this obstacle, we captured synchronized music-dance pairs performed by professional dancers, and thus built a high-quality music-dance pair dataset. Experiments demonstrate the performance of the proposed system and show that the proposed method achieves state-of-the-art results.




1 Introduction

As an art of human motion, dance plays an important role in culture, sports and related fields, e.g., art programs, rhythmic gymnastics, and figure skating. Conventionally, dance is accompanied by music to enhance its artistic appeal, and the combination of music and choreography needs careful design and meticulous arrangement. In general, choreography should not only reflect the artistic quality of dance, but also express the content of the music. Nevertheless, choreographing to music requires considerable effort and time, and in particular the professional experience of choreographers [33]. Easy, fully automatic choreography with music remains difficult, and has thus become an interesting research topic in the field of computer vision and computer graphics [10, 16, 32, 50]. In this paper, we focus on the problem of automatic choreography with music.

Figure 1: Our method can directly generate realistic dance motion sequences from the input music (wave).

Early researchers addressed this problem by modifying local motion with perceptual cues extracted from music [10], by stitching motion segments [26, 40], or by building a dataset and synthesizing new dances by matching music features to candidates in the dataset [49, 33]. In recent years, with the development of deep learning [31, 29], more and more researchers have begun to investigate this challenging problem with deep learning. GrooveNet [2] adopted Factored Conditional Restricted Boltzmann Machines and Recurrent Neural Networks (RNN), but it was only trained and tested on a small dataset. Lee et al. [32] used an encoder-decoder model to generate 2D dance motion. Tang et al. [50] trained an LSTM-autoencoder (Long Short-Term Memory) to generate dance motion directly from music features. However, their methods could not generate different dance types from the same model; that is, each dance type requires its own trained model. Yalta et al. [55] tried a weakly-supervised learning method, but their generated dance motions are simple and repetitive. It is worth noting that real dance motion is complex and varied, and generating it at high quality is far from solved.

Beyond the technical challenge, lack of data is another obstacle. Existing 3D human motion datasets include CMU motion-capture [13], SFU motion-capture [48], Mixamo [1], SBU Kinect interaction [56], and HDM05 [42]. However, these datasets mainly cover walking, running and jumping and lack dance motion; more importantly, they are not music-dance pair datasets. Existing datasets that do contain dance motions also have many drawbacks. Shiratori et al. [49] built a database of simple dance motions, but it is not open source. Lee et al. [32] only constructed 2D dance motions, using OpenPose [9] to detect human poses. Tang et al. [50] collected hours of dance motion (4 dance types) with motion capture devices, but the quality of the dance motions is not high, and the dance motions do not match the music (they are not music-dance pairs). The lack of high-quality music-dance pair data makes it difficult for research on this topic to progress.

To address the above-mentioned challenges, we build a high-quality music-dance pair dataset and propose a complete music2dance system. Our dataset contains two dance types, modern and curtilage dance, both of which are accurately aligned with the corresponding music. Modern dance consists of 94,155 frames (FPS=60), approximately 26.15min, and curtilage dance consists of 114,192 frames (FPS=60), approximately 31.72min, for a total of 208,347 frames, 57.87min. It should be noted that the motion we collect contains finger motion, which is a very important part of dance motion and can make the dance more natural and rich (in this paper, natural means the dance motions are realistic, and rich means the dance steps are varied). The dataset will be made publicly available in the future.
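The reported durations follow directly from the frame counts and the 60 FPS capture rate. A minimal sketch of that arithmetic (the variable and function names are illustrative, not taken from the paper):

```python
# Frame counts and capture rate as reported for the dataset.
FPS = 60
FRAMES = {"modern": 94_155, "curtilage": 114_192}


def minutes(frames: int, fps: int = FPS) -> float:
    """Convert a frame count to a duration in minutes."""
    return frames / fps / 60.0


durations = {name: minutes(count) for name, count in FRAMES.items()}
total_frames = sum(FRAMES.values())
total_minutes = minutes(total_frames)

for name, mins in durations.items():
    print(f"{name}: {mins:.2f} min")
print(f"total: {total_frames} frames, {total_minutes:.2f} min")
```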

Our proposed music2dance system includes music feature extraction, music attribute classification, music-driven dance generation and post-processing.

1) Music feature extraction. As the input of the whole system, music features play a decisive role. We learn from professional choreographers that the most important part of music choreography is to create dance according to the rhythm and melody of the music. In our system, we use a special representation of music that characterizes its rhythm and melody.

2) Music attribute classification. In general, a piece of music can determine the dance type, e.g., smoothing music is suitable for modern dance, not curtilage dance, so it is necessary to classify the music so that our system can automatically determine the dance type.

3) Music-driven dance generation. This is the core part of our whole system, and we adopt WaveNet [43], which is well established in speech generation. WaveNet is an autoregressive model, which makes it suitable for motion generation. We take the music features as the local condition and the dance type (music attribute) as the global condition. The local and global conditions determine the generated dance, which should conform not only to the music rhythm and melody, but also to the corresponding dance type.

4) Post-processing. The generated dance motion has some artifacts such as foot sliding, and needs to be processed. Different from previous methods [40, 54, 21], we put forward the Foot Constraint Model to solve the foot sliding problem, because our dance motions are too complex to handle directly with inverse kinematics (IK). Experimental results show that the dance motions generated by our method are not only consistent with the music, but also high-quality, natural and rich. We build an LSTM-based model as the baseline, and the comparison also highlights the advantages of our method.

The main contributions of this work include:

  • We propose a music2dance system, which takes music as input and outputs high-quality dance motions that are suitable for music.

  • To our knowledge, this is the first work that utilizes WaveNet for motion generation. Compared with other sequence models, the quality of motion generation is greatly improved.

  • We build a high-quality music-dance pair dataset, and the dance motion includes finger motion.

2 Background

Music Feature. In [32, 50, 55], the researchers used the Mel spectrum, Mel Frequency Cepstral Coefficients (MFCC) or the short-time Fourier transform (STFT) spectrum as the music feature. The Mel spectrum and the STFT spectrum are obtained by Fourier transform. Although such features are widely used in speech recognition, they are low-level features that all audio contains, and are therefore not well suited as music features. The most basic features in music are beat, rhythm and melody, and these are also the most important dependencies in dance generation. As far as we know, much of the work in music information retrieval is about how to extract music features. Onset expresses the beginning of a music note and is the most basic form of music rhythm [20, 14, 3]. Beat is another form of rhythm, and there is a lot of work on beat detection [5, 27, 28]. Melody, one of the most important music features, can be expressed via the chroma feature [18, 25, 41]. Most importantly, the chroma feature is highly robust to changes in timbre and instruments. Therefore, we adopt onset, beat and chroma as the music features in our method.

Motion Generation. Early methods for motion generation can be divided into Hidden Markov Models (HMMs) [6, 7], statistical dynamic models [35, 44, 53, 22, 12, 30, 54] and low-dimensional statistical models [11] for human poses. The most famous approach is the motion graph [26, 40, 47]. Recent work focuses on deep learning methods to model human motion, including RNN [15, 23, 34, 36, 38, 19], fully connected networks [8, 21], reinforcement learning [46, 45], and generative adversarial networks (GAN) [52]. Although there is a lot of research work, most of it is about motion prediction [15, 23, 38, 36], which can only predict short-term motion and cannot generate long-term motion. Li et al. [36] proposed the auto-conditioned LSTM to address error accumulation and generate long-term sequences, but the quality of the generated motion is not high enough. In addition, a series of works on motion synthesis and control [21, 46] can generate long-term, high-quality motion, but the generated motions are walking, running, jumping and other simple motions; these methods are unable to generate complex motions such as dance. The method we propose addresses this problem and generates complex, natural and high-quality dance motion.

Sequence Model: WaveNet. WaveNet [43] is a sequence generation model that has been used in speech generation. The basic principle of WaveNet is conditional probability, with the predictive distribution for each sample conditioned on all previous ones. Based on WaveNet, we propose our dance motion generation model.
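WaveNet conditions each sample only on earlier ones by using causal, dilated convolutions. The sketch below is a plain-Python illustration of a single-channel causal dilated convolution; it is not the authors' implementation, and the function name and zero-padding choice are assumptions made for the example.

```python
def causal_dilated_conv(x, weights, dilation=1):
    """1D causal convolution over a single channel.

    y[t] depends only on x[t], x[t - d], x[t - 2d], ... where d is the dilation.
    weights[-1] multiplies the current sample x[t]; samples before the start
    of the sequence are treated as zeros (an assumed padding choice).
    """
    num_taps = len(weights)
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k in range(num_taps):
            idx = t - dilation * (num_taps - 1 - k)
            if idx >= 0:
                acc += weights[k] * x[idx]
        y.append(acc)
    return y


signal = [1.0, 2.0, 3.0, 4.0, 5.0]
out = causal_dilated_conv(signal, weights=[1.0, 1.0], dilation=2)
print(out)
```

Because no tap reaches forward in time, changing a later input sample never alters an earlier output, which is what allows frame-by-frame autoregressive generation.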

3 Dataset

Figure 2: Our music-dance pair data collection process. There are three steps: 1) playing music and professional dancers dancing with music, 2) motion capture and collection system collecting human motion, 3) repairing motion, clipping and registration according to music.

3.1 Music data

In order to automatically determine the music attribute, we need to collect a large amount of music data. The music attribute has two types: smoothing music and fast-rhythm music. We downloaded songs from a music website and then carefully labeled the music attribute of each. In total, we collected about 35.7 hours of music, of which 18.2 hours are smoothing music (suitable for modern dance) and the rest is fast-rhythm music (suitable for curtilage dance).

3.2 Music-Dance pair

To the best of our knowledge, there are few datasets of dance motions for network training. Tang et al. [50] attempted to collect dance motions with motion capture devices, but the quality of the dance motions is low, and their captured motions do not match the corresponding music. In order to achieve high-quality dance motion generation, we have collected high-quality music-dance pair data. Our data collection process includes three steps: 1) playing music and having professional dancers dance to it, 2) collecting human motions with a motion capture system, 3) repairing, clipping and registering the motions according to the music, finally producing music-dance pair data, as shown in Figure 2.

We asked two professional dancers (a man and a woman) to perform the modern and curtilage dance motions respectively. To get high-quality dance motions, we spent a lot of time on data repair and alignment, and obtained about 26.15min of modern dance motion (94,155 frames, FPS=60) and about 31.72min of curtilage dance motion (114,192 frames, FPS=60). Notably, each frame of our dance motion contains 55 joints including fingers, which helps to improve the naturalness of the dance motion.

4 Methodology

Figure 3: Our music2dance system framework. Our system takes music audio as input and outputs complex, natural and high-quality dance motion. The whole system consists of four parts: music feature extraction, music attribute classification, music-driven dance generation, and post-processing.

4.1 System Framework

Our goal is to generate high-quality dance motion that not only matches the music style, but is also rich and natural. To achieve this goal, we construct a complete music2dance system, which includes four parts: music feature extraction, music attribute classification, music-driven dance generation, and post-processing. Each part has its own specific function, and music-driven dance generation is the core of the system.

4.2 Music Feature Extraction

Instead of using the Mel spectrum or other acoustic features, we adopt high-level music features: onset, beat and chroma. There are many methods to extract these music features in music information retrieval. The most common libraries are librosa [39] and madmom [4]. We find that the music features extracted by madmom are more accurate than those extracted by librosa. Therefore, we use madmom to extract the music features: onset, beat and chroma. The onset feature is a 1D vector whose value represents the probability of an onset. The beat feature, including the beat and downbeat, can fully express the rhythm information. The chroma feature closely relates to the twelve different pitch classes and can characterize the music melody, as shown in Figure 4. The frame rate (FPS) of the three music features is 10, and the dimension of each frame is 15 (onset 1, beat 2, chroma 12).
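As an illustration of the 15-dimensional frame layout (onset 1, beat 2, chroma 12) at 10 FPS, the following sketch concatenates per-frame activations into feature frames. The input tracks are placeholders standing in for the output of a feature extractor such as madmom; the function name and the ordering of the components within a frame are assumptions.

```python
FPS = 10        # feature frames per second, as stated in the text
FRAME_DIM = 15  # onset (1) + beat and downbeat (2) + chroma (12)


def build_feature_frames(onset, beat, downbeat, chroma):
    """Concatenate per-frame activations into 15-dimensional feature frames.

    onset, beat, downbeat: lists of floats, one value per frame.
    chroma: list of 12-element lists, one per frame.
    """
    n = len(onset)
    if not (len(beat) == len(downbeat) == len(chroma) == n):
        raise ValueError("all feature tracks must have the same number of frames")
    frames = []
    for t in range(n):
        if len(chroma[t]) != 12:
            raise ValueError("chroma must have 12 pitch classes per frame")
        frames.append([onset[t], beat[t], downbeat[t], *chroma[t]])
    return frames


# Hypothetical 2-second clip (20 frames) filled with placeholder zeros.
n_frames = 2 * FPS
features = build_feature_frames(
    onset=[0.0] * n_frames,
    beat=[0.0] * n_frames,
    downbeat=[0.0] * n_frames,
    chroma=[[0.0] * 12 for _ in range(n_frames)],
)
print(len(features), len(features[0]))
```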

4.3 Music Attribute Classification

The music attributes include two types: music that is suitable for modern dance (e.g., smoothing music) and music that is suitable for curtilage dance (e.g., fast-rhythm music). In our method, the input of the Music Attribute Classifier is the music features (onset, beat and chroma, described in Section 4.2), and the output is the music attribute (two categories, represented by a one-hot code).

The input is computed within a sliding window, and the window size must be chosen carefully, because the music attribute cannot be determined from a short music clip (2-3 seconds). A song contains many climactic and gentle passages, so a small clip gives an inaccurate estimate of the music attribute. In our experiment, the window size is 30 seconds; at 10 FPS with 15 features per frame, each input window therefore covers 300 frames of 15 dimensions. For the model, we adopt 3 temporal convolution layers and 1 Bi-LSTM (Bi-directional LSTM) layer as a high-dimensional feature extractor, followed by a fully connected layer for classification, as shown in Figure 5. The kernel size of the temporal convolution is set to 2, and the feature channels are, in order, 1-16-24-32. To reduce the dimension, the stride of the third temporal convolution layer is set to 2, and the dropout probability is set to 0.1. We adopt cross entropy as the loss function.
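The window and layer sizes above can be checked with a little arithmetic. The sketch below slices a feature sequence into 30-second windows and computes the temporal length after three convolutions with kernel size 2 and strides 1, 1, 2. The 1-second hop between windows and the absence of padding are assumptions, since the text does not state them.

```python
FPS = 10
WINDOW_SECONDS = 30
FRAME_DIM = 15
WINDOW_FRAMES = WINDOW_SECONDS * FPS  # 300 frames per window


def sliding_windows(frames, size=WINDOW_FRAMES, hop=FPS):
    """Yield overlapping windows; the hop (1 second) is an assumed value."""
    for start in range(0, len(frames) - size + 1, hop):
        yield frames[start:start + size]


def conv_out_len(length, kernel, stride=1):
    """Output length of a 1D convolution without padding (assumed)."""
    return (length - kernel) // stride + 1


# Hypothetical 40-second song represented by placeholder feature frames.
song = [[0.0] * FRAME_DIM for _ in range(40 * FPS)]
windows = list(sliding_windows(song))

length = WINDOW_FRAMES
for stride in (1, 1, 2):  # the third layer uses stride 2
    length = conv_out_len(length, kernel=2, stride=stride)

print(len(windows), length)
```

Under these assumptions, a 300-frame window shrinks to 299, 298 and then 149 time steps before reaching the Bi-LSTM.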

4.4 Music-driven Dance Generation

In this section, we elaborate on how to achieve dance generation based on music features. Firstly, we introduce the representation of the motion feature in detail (Section  4.4.1), then describe how to map the music features to motion domain (Section  4.4.2). The most crucial part, how to generate dance motion, is introduced in Section  4.4.3.

4.4.1 Motion Representation

Each frame in the motion data contains 55 joints. One is the root joint, whose motion is represented by a translation and a rotation relative to the world coordinate system, and the remaining joints are represented by rotations relative to their parent joints. To better describe the motion feature, we modify the root joint representation. For the rotation around the Y-axis (the vertical axis of the human pose), we use the relative rotation between the current frame and the previous frame, and the horizontal translation of the root joint is defined in the local coordinate system of the previous frame. This has a great advantage: no matter where the previous frame has moved to and which direction it faces, our method can describe the motion of the next frame, which makes the data representation invariant to global position and heading. The joint rotation motion can thus be described as follows:
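The root-relative encoding described in this paragraph can be sketched as follows. This is a minimal illustration, not the authors' code: the state layout `(x, z, yaw)`, the rotation sign convention and all function names are assumptions, and the vertical (Y) translation is omitted for brevity.

```python
import math


def rot_y(x, z, angle):
    """Rotate the horizontal vector (x, z) about the vertical Y axis."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * x + s * z, -s * x + c * z)


def wrap(angle):
    """Wrap an angle to the interval (-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))


def encode(prev, curr):
    """World-space root states (x, z, yaw) -> delta in the previous frame's local coordinates."""
    dx, dz = curr[0] - prev[0], curr[1] - prev[1]
    local_dx, local_dz = rot_y(dx, dz, -prev[2])
    return (local_dx, local_dz, wrap(curr[2] - prev[2]))


def decode(prev, delta):
    """Inverse of encode: recover the current world-space root state."""
    world_dx, world_dz = rot_y(delta[0], delta[1], prev[2])
    return (prev[0] + world_dx, prev[1] + world_dz, wrap(prev[2] + delta[2]))


def move(state, turn, shift_x, shift_z):
    """Apply a global rigid transform (turn about Y, then shift) to a root state."""
    x, z = rot_y(state[0], state[1], turn)
    return (x + shift_x, z + shift_z, wrap(state[2] + turn))


prev, curr = (1.0, 2.0, 0.3), (1.5, 2.2, 0.5)
delta = encode(prev, curr)
# The same two frames placed somewhere else in the world, facing another way.
delta_moved = encode(move(prev, 1.2, -4.0, 7.0), move(curr, 1.2, -4.0, 7.0))
print(delta, delta_moved)
```

The two encoded deltas agree, which is the invariance property the paragraph refers to: the encoding does not depend on where the character stands or which way it faces.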


However, adopting such a representation alone would produce large accumulated errors. Because each joint from the root to the end-effector is rotated relative to its parent joint, a large error appears at the end-effector position if the rotations are inaccurate, which greatly reduces the quality of the dance motion. To solve this problem, we add the end-effector position to the motion representation, so that our model predicts the end-effector position in each predicted frame. This effectively helps to eliminate the accumulated error. The end-effector motion is described as:


Here the index ranges over the end-effectors. In our method, we use the left/right toe_end as the foot end-effectors and one joint as the head end-effector. Since our motion data includes finger motion and there are too many end-effectors in the hand, we use the left/right hand as the end-effectors of the two hands, respectively. In addition, foot sliding is a common problem in human motion generation. Like many previous methods [54, 21], we add foot constraints to the motion feature: we adopt a 2D vector to describe whether the left/right foot is in contact with, and fixed to, the ground. The ground truth of the foot constraints is obtained by detecting the left/right foot position and speed in each frame. Finally, the motion feature includes:
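Ground-truth foot constraints of the kind described above are commonly derived by thresholding foot height and speed. The sketch below shows one such rule; the thresholds, the metre units, the Y-up convention and the function name are all assumptions for illustration and are not values from the paper.

```python
import math


def contact_labels(positions, fps=60, max_height=0.05, max_speed=0.2):
    """Label each frame 1 if the foot is low and slow, else 0.

    positions: list of (x, y, z) foot positions, Y up, assumed to be in metres.
    max_height (m) and max_speed (m/s) are hypothetical thresholds.
    """
    labels = []
    for t, pos in enumerate(positions):
        prev = positions[t - 1] if t > 0 else pos
        speed = math.dist(pos, prev) * fps
        labels.append(1 if pos[1] <= max_height and speed <= max_speed else 0)
    return labels


# Hypothetical trajectories: the left foot lifts and lands, the right foot stays planted.
left = [(0.0, 0.01, 0.0), (0.0, 0.01, 0.0), (0.05, 0.2, 0.0), (0.1, 0.01, 0.0)]
right = [(0.2, 0.0, 0.0)] * 4

# One 2D constraint vector per frame: (left contact, right contact).
contacts = list(zip(contact_labels(left), contact_labels(right)))
print(contacts)
```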

Figure 4: Music feature. Music is represented by wave, with Mel spectrum as its basic feature. Chroma, beat(beat, downbeat) and onset are its high-level features.

4.4.2 Music Feature Mapping

In our method, we take the music features as a condition to generate dance motion. However, we find that if we directly take the music features as the local condition of WaveNet, the performance of generated dance motion is very poor. The reason is that the music features and the dance motion feature are in two different domains, and they are difficult to fuse directly. We use a model to map the music features to the dance motion domain, so that the features can be fused. We directly use the model adopted in music attribute classification, 3 temporal convolution layers and 1 Bi-LSTM layer (without fully connected layer), to map the music features to the motion feature domain. The music feature mapping is described as:


Here, the mapping function is the music feature mapping model, and its output is the high-dimensional feature.

4.4.3 Dance Motion Generation

As a conditional probabilistic and autoregressive model, WaveNet can model the conditional distribution of the motion sequence. It takes the high-dimensional feature as the local condition and the music attribute (dance type) as the global condition, and outputs this conditional distribution.

Figure 5: Music Attribute Classification. Our model consists of 3 temporal convolution layers, 1 Bi-LSTM (Bi-directional LSTM) layer, and 1 fully connected layer. The stride of the third temporal convolution layer is set to 2, and the dimension is reduced.

The conditional distribution can be described as:
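In the standard WaveNet formulation, the joint distribution factorizes autoregressively over time steps. Written with assumed notation ($x_t$ for the motion feature at frame $t$, $T$ for the number of frames, $\mathbf{c}$ for the local condition and $g$ for the global condition; these symbols are chosen for illustration and may differ from the authors' own), the factorization reads:

```latex
p(\mathbf{x} \mid \mathbf{c}, g) = \prod_{t=1}^{T} p\left(x_t \mid x_1, \ldots, x_{t-1}, \mathbf{c}, g\right)
```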


In order to predict the distribution of the current frame, a Gaussian Mixture Model (GMM) is adopted to model the probabilistic distribution of the motion feature in the current frame. The distribution of the GMM is:


A GMM imposes requirements on its parameters, which are the number of Gaussian modes and, for each mode, the mean vector and covariance matrix. To meet these requirements, we define the outputs of our model for each dimension of the motion feature and transform them so that the requirements are satisfied:


The loss function is defined as the negative log likelihood:


In our experiment, is set to 1. The negative log GMM loss is computed over the joint rotation motion and the end-effector position motion. For the foot constraints, binary cross entropy (BCE) loss is adopted:


The loss function in the training phase is therefore:


To balance the two loss functions, we introduce a weighting parameter, which is set to 0.1 in our experiment.
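A minimal sketch of such a combined objective is given below, under several assumptions that the text does not confirm: the GMM uses diagonal covariances, mixture weights come from a softmax, standard deviations are predicted in log space, and the 0.1 weight multiplies the BCE term. All function and parameter names are illustrative.

```python
import math


def softmax(logits):
    """Turn raw scores into positive mixture weights that sum to one."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gmm_nll(x, weight_logits, means, log_sigmas):
    """Negative log likelihood of vector x under a diagonal-covariance GMM.

    weight_logits: K raw scores; means and log_sigmas: K lists of D values.
    """
    weights = softmax(weight_logits)
    log_terms = []
    for w, mu, log_sigma in zip(weights, means, log_sigmas):
        log_prob = math.log(w)
        for xd, md, lsd in zip(x, mu, log_sigma):
            z = (xd - md) / math.exp(lsd)
            log_prob += -0.5 * math.log(2 * math.pi) - lsd - 0.5 * z * z
        log_terms.append(log_prob)
    top = max(log_terms)  # log-sum-exp for numerical stability
    return -(top + math.log(sum(math.exp(v - top) for v in log_terms)))


def bce(pred, target, eps=1e-7):
    """Mean binary cross entropy between predicted probabilities and 0/1 targets."""
    total = 0.0
    for p, y in zip(pred, target):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(pred)


def total_loss(x, gmm_params, contact_pred, contact_true, lam=0.1):
    """GMM NLL on the motion features plus a weighted BCE on the foot constraints."""
    return gmm_nll(x, *gmm_params) + lam * bce(contact_pred, contact_true)


# One Gaussian mode, one feature dimension, and an uninformative contact prediction.
params = ([0.0], [[0.0]], [[0.0]])
loss = total_loss([0.0], params, contact_pred=[0.5, 0.5], contact_true=[1, 0])
print(loss)
```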

In the inference phase, the predicted motion feature can be sampled from the GMM model,


The predicted motion feature can be described as follows,

Method                           Accuracy (%)
Baseline (FC + Bi-LSTM)          88.0
Ours (Temporal Conv + Bi-LSTM)   92.1

Table 1: Music attribute classification comparison on testing data (Accuracy).

Method                 Mean Squared Error (MSE)
Baseline (LSTM)        0.0028
Ours (Temporal Conv)   0.0010

Table 2: Post-processing comparison on testing data (MSE).

4.5 Post-processing

Foot sliding is a common problem in motion generation. Similar to other methods [40, 54, 21], we first attempted to solve this problem with IK, which requires very high accuracy of the predicted foot constraints (95%). However, we find that the accuracy of the predicted foot constraints is not sufficiently high (about 85%-90%) due to the irregularity and complexity of dance motion, and jitter occurs if we adopt IK. Therefore, we propose a Foot Constraint Model to solve the problem. The Foot Constraint Model consists of 3 temporal convolution layers with kernel size 3 and dilations of 1, 3 and 9, respectively.

The goal of the Foot Constraint Model is to remove foot sliding from the motion generated by the WaveNet, so we only deal with the motion of the lower body (including the root joint). The input of the model consists of the joint rotation motion of the lower body, the positions of the foot end-effectors (left/right toe_end), and the foot constraints. The output is the increment of the lower body joint rotation, so it can be described as:
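For a stack of stride-1 dilated convolutions, the temporal receptive field can be computed in closed form. The short sketch below applies that formula to the stated kernel size and dilations; the function name is illustrative and stride 1 is assumed.

```python
def receptive_field(kernel_size, dilations):
    """Input frames seen by one output frame of stacked stride-1 dilated convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


rf = receptive_field(kernel_size=3, dilations=[1, 3, 9])
print(rf)
```

With kernel size 3 and dilations 1, 3 and 9, each output frame therefore sees 27 consecutive input frames, without gaps, using only three layers.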


Here, the mapping function is the Foot Constraint Model. In the training phase, we simulate foot sliding data by applying Gaussian noise and Gaussian smoothing to the ground-truth data. Finally, we adopt the MSE loss function together with a smoothing loss (the smoothing factor is set to 0.1 in our experiment).
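The training setup can be illustrated on a single scalar channel. In the sketch below, clean data is corrupted with Gaussian noise followed by Gaussian smoothing, and the loss adds a smoothness penalty weighted by 0.1 to the MSE. The noise level, kernel radius, edge handling and the exact form of the smoothing loss (mean squared frame-to-frame difference) are assumptions, not details from the paper.

```python
import math
import random


def gaussian_kernel(radius, sigma):
    """Normalized Gaussian filter taps of length 2 * radius + 1."""
    taps = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    total = sum(taps)
    return [v / total for v in taps]


def smooth(seq, radius=2, sigma=1.0):
    """Gaussian smoothing with the sequence ends clamped (assumed edge handling)."""
    kernel = gaussian_kernel(radius, sigma)
    n = len(seq)
    out = []
    for t in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = min(max(t + j - radius, 0), n - 1)
            acc += w * seq[idx]
        out.append(acc)
    return out


def corrupt(clean, noise_std=0.01, seed=0):
    """Simulate sliding: add Gaussian noise, then blur it with a Gaussian filter."""
    rng = random.Random(seed)
    noisy = [v + rng.gauss(0.0, noise_std) for v in clean]
    return smooth(noisy)


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def smoothness(seq):
    """Mean squared frame-to-frame difference (an assumed form of the smoothing loss)."""
    return sum((seq[t] - seq[t - 1]) ** 2 for t in range(1, len(seq))) / (len(seq) - 1)


def training_loss(pred, target, smooth_weight=0.1):
    return mse(pred, target) + smooth_weight * smoothness(pred)


# Hypothetical channel: a planted foot (constant) followed by a steady move.
clean = [0.0] * 10 + [0.1 * i for i in range(10)]
corrupted = corrupt(clean)
print(training_loss(corrupted, clean))
```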

5 Experiment

The implementation details are described in supplementary materials. In this section, we demonstrate our approach by evaluating our results (Section  5.1) and analyzing our methods (Section  5.2).

Figure 6: Result of the user study. This figure shows the scores given by 25 users, including the means and standard deviations for dance motion generated by the Baseline (LSTM) and our method.
(a) Example of generated modern dance.
(b) Example of generated curtilage dance.
Figure 7: Dance motion generated by our method. We rendered the generated dance motion with meshes and textures. Our method can obtain high-quality, rich dance motion, and the dance motion includes finger motion (blue arrow). See the video in supplementary materials.

5.1 Result

Our entire system is difficult to fully quantify because it is a generation task for motion synthesis, and we break it down into three parts. Among them, both music attribute classification and post-processing can be quantitatively evaluated. Therefore, these two parts are firstly evaluated, and then user study is adopted to evaluate the dance generated by the system.

Music Attribute. To verify the effectiveness of our approach, we built an FC+Bi-LSTM model as the baseline. It replaces the temporal convolution layers in our music attribute classifier with fully connected layers. The training data and strategies are exactly the same as for our model. We use prediction accuracy to evaluate the final results, and our method obtains better results than the baseline, as shown in Table 1.

Post-processing. Our Foot Constraint Model is a stacked temporal convolution network (3 layers). To illustrate its effectiveness, we propose a baseline: 1 fully connected layer + LSTM + 1 fully connected layer. We compare the performance on the test data, and use MSE for evaluation, and our method can obtain better performance, as shown in Table  2.

Our music-driven dance generation result. In this section, we demonstrate the ability of our model on different music styles. Similar to [15, 23, 38], we built a motion generation model based on RNN (LSTM) as the baseline. In the LSTM-based model, the music features are mapped to the motion feature domain (the high-dimensional feature described in Section 4.4.2), and the high-dimensional feature is then merged with the motion feature as the input of the LSTM network.

We quantitatively evaluated the generated dance motion via a user study. 20 dance motions were generated by our model and by the baseline respectively. The 20 dance motions include 10 modern dance motions and 10 curtilage dance motions (25s each). We asked 25 users to score these dance motions. The score is based on the naturalness of the dance motion (25%), its richness (25%), whether it fits the music rhythm and melody (40%), and foot sliding (10%). More scoring indicators are described in the supplementary materials. We report the mean scores and standard deviations for the dance motions generated by our model and the baseline, as shown in Figure 6. The result of the user study shows that our generated dance motions are clearly superior to those of the baseline. Our mean score reaches 8.452 (modern dance) and 8.196 (curtilage dance), and the standard deviations are significantly smaller than those of the baseline, especially for modern dance. In our motion data, modern dance is more complicated than curtilage dance, which is an important reason why the modern dance generated by the baseline is worse than its curtilage dance (lower mean score, larger standard deviation). Our approach can generate complex modern dance, and its score is slightly higher than that of curtilage dance. This suggests that our method is more robust to complex dance motion.
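The weighting scheme of the user study can be written out as a small calculation. The ratings below are invented placeholders on an assumed 0-10 scale and are not data from the study; only the four weights come from the text.

```python
from statistics import mean, stdev

# Criterion weights as stated in the text.
WEIGHTS = {
    "naturalness": 0.25,
    "richness": 0.25,
    "music_match": 0.40,
    "foot_sliding": 0.10,
}


def overall_score(ratings):
    """Weighted sum of one user's per-criterion ratings."""
    return sum(WEIGHTS[name] * ratings[name] for name in WEIGHTS)


# Hypothetical ratings from three users (placeholders, not study results).
users = [
    {"naturalness": 8, "richness": 9, "music_match": 8, "foot_sliding": 7},
    {"naturalness": 9, "richness": 8, "music_match": 9, "foot_sliding": 8},
    {"naturalness": 7, "richness": 8, "music_match": 8, "foot_sliding": 9},
]
scores = [overall_score(u) for u in users]
print(f"mean={mean(scores):.3f}, std={stdev(scores):.3f}")
```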

In addition, we rendered the generated dance motions with meshes and textures. We show two examples in Figure  7, one for modern dance and another for curtilage dance. Both examples show that our method can obtain complex, rich and high-quality dance motion.

5.2 Discussion

To analyze our method, we discuss each part of it in this section. The first part (Mel spectrum vs. music features (onset, beat, chroma)) is discussed using curtilage dance, and the last two parts (music feature mapping, with/without end-effector position) are discussed using modern dance.

(a) left hand
(b) right hand
Figure 8: The comparison between mel spectrum and music features in left/right hand height. The generated dance has a richer variation of left/right hand motion with music features as input, especially right hand motion.

Mel spectrum vs. music features. We trained the model with the Mel spectrum and with the music features respectively, and tested both on the same music clip. When we use the Mel spectrum as input, the generated dance motion is very stiff and exhibits jitter (we smooth the motion with a Gaussian filter with a large kernel size). We compared the motions generated from the same music clip and plotted the height of the left/right hand over time (480 frames), as shown in Figure 8. When the music features are used as input, the generated dance is richer and matches the music better. This suggests that the music features in our method generalize better.

Figure 9: The comparison of with/without music feature mapping. PCA is used to reduce the dimension of end-effector position features, so that we can compare the variety of generated dance motion. Higher dance richness can be obtained when using music feature mapping.

Music Feature Mapping. The music feature mapping module is a very important part. We tried to directly take the music features as the local condition of WaveNet without the music feature mapping, and adopted the same training data and strategies to train the model. We found that the richness of the dance motion generated by this model is poor (only simple motion or standing still). When the model is trained with the music feature mapping, it reaches a lower GMM loss. In order to better compare the results, we extracted the end-effector position (an important motion feature) of the generated motion and used Principal Component Analysis (PCA) to reduce the dimension and visualize the results, as shown in Figure 9. The comparison shows that we get richer motion when we use the music feature mapping module.

(a) left toe_end
(b) right toe_end
Figure 10: The comparison of models with and without the end-effector position feature in the motion feature. Left/right toe_end motion generated by the model with end-effector position has more variety. In this comparison experiment, we do not adopt post-processing.

With/Without end-effector position. In Section 4.4.1, we explain why the end-effector position feature is added to the motion feature. In order to verify its advantages, we trained a model without the end-effector position feature. Since there are no end-effectors in the motion generated by that model (no toe_end position), we do not adopt post-processing in this comparison experiment. We compared the dance motions generated from the same music clip and plotted the height of the toe_end position over time, as shown in Figure 10. It shows that adding the end-effector position yields more diverse dance motions.
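The accumulation argument from Section 4.4.1 can be illustrated with planar forward kinematics: when every joint rotation is relative to its parent, the same small per-joint angular error produces a larger end-effector displacement the longer the chain is. The chain layout, unit bone length and 2-degree error below are hypothetical.

```python
import math


def end_effector(angles, bone_length=1.0):
    """Planar forward kinematics; each angle is relative to the parent bone."""
    x = y = 0.0
    heading = 0.0
    for angle in angles:
        heading += angle  # parent-relative rotations accumulate along the chain
        x += bone_length * math.cos(heading)
        y += bone_length * math.sin(heading)
    return (x, y)


def drift(n_joints, per_joint_error):
    """End-effector displacement caused by the same small error at every joint."""
    true_angles = [0.0] * n_joints
    noisy_angles = [a + per_joint_error for a in true_angles]
    return math.dist(end_effector(true_angles), end_effector(noisy_angles))


errors = [drift(n, math.radians(2.0)) for n in (1, 3, 5)]
print(errors)
```

Predicting end-effector positions directly, as the method does, gives the model a target that is not subject to this compounding.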

6 Conclusion

In this paper, we propose a music2dance system, and it can generate realistic dance motion from the input music. We build a high-quality music-dance pair dataset, including two dance types. We adopt WaveNet as our motion generation model, which can generate dance motion of different dance types from the same model. Our results demonstrate the power of our method. However, there are still some defects: the generated dance motions are not as good as professional dancers, and there is slight foot sliding. These are the focus of our future work.


  • [1] Adobe. Adobe mixamo dataset., 2017.
  • [2] Omid Alemi, Jules Françoise, and Philippe Pasquier. Groovenet: Real-time music-driven dance movement generation using artificial neural networks. networks, 8(17):26, 2017.
  • [3] Sebastian Böck, Andreas Arzt, Florian Krebs, and Markus Schedl. Online real-time onset detection with recurrent neural networks. In Proceedings of the 15th International Conference on Digital Audio Effects (DAFx-12), York, UK, 2012.
  • [4] Sebastian Böck, Filip Korzeniowski, Jan Schlüter, Florian Krebs, and Gerhard Widmer. Madmom: A new python audio and music signal processing library. In Proceedings of the 24th ACM international conference on Multimedia, pages 1174–1178. ACM, 2016.
  • [5] Sebastian Böck, Florian Krebs, and Gerhard Widmer. Joint beat and downbeat tracking with recurrent neural networks. In ISMIR, pages 255–261, 2016.
  • [6] Richard Bowden. Learning statistical models of human motion. In IEEE Workshop on Human Modeling, Analysis and Synthesis, CVPR, volume 2000, 2000.
  • [7] Matthew Brand and Aaron Hertzmann. Style machines. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pages 183–192. ACM Press/Addison-Wesley Publishing Co., 2000.
  • [8] Judith Butepage, Michael J Black, Danica Kragic, and Hedvig Kjellstrom. Deep representation learning for human motion prediction and classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6158–6166, 2017.
  • [9] Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Openpose: realtime multi-person 2d pose estimation using part affinity fields. arXiv preprint arXiv:1812.08008, 2018.
  • [10] Marc Cardle, Loic Barthe, Stephen Brooks, and Peter Robinson. Music-driven motion editing: Local motion transformations guided by music analysis. In Proceedings 20th Eurographics UK Conference, pages 38–44. IEEE, 2002.
  • [11] Jinxiang Chai and Jessica K Hodgins. Performance animation from low-dimensional control signals. ACM Transactions on Graphics (ToG), 24(3):686–696, 2005.
  • [12] Jinxiang Chai and Jessica K Hodgins. Constraint-based motion optimization using a statistical dynamic model. In ACM Transactions on Graphics (TOG), volume 26, page 8. ACM, 2007.
  • [13] CMU. Carnegie-mellon motion capture database., 2010.
  • [14] Florian Eyben, Sebastian Böck, Björn Schuller, and Alex Graves. Universal onset detection with bidirectional long-short term memory neural networks. In Proc. 11th Intern. Soc. for Music Information Retrieval Conference, ISMIR, Utrecht, The Netherlands, pages 589–594, 2010.
  • [15] Katerina Fragkiadaki, Sergey Levine, Panna Felsen, and Jitendra Malik. Recurrent network models for human dynamics. In Proceedings of the IEEE International Conference on Computer Vision, pages 4346–4354, 2015.
  • [16] Satoru Fukayama and Masataka Goto. Music content driven automated choreography with beat-wise motion connectivity constraints. Proceedings of SMC, pages 177–183, 2015.
  • [17] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256, 2010.
  • [18] Emilia Gómez. Tonal description of music audio signals. Department of Information and Communication Technologies, 2006.
  • [19] Anand Gopalakrishnan, Ankur Mali, Dan Kifer, Lee Giles, and Alexander G Ororbia. A neural temporal model for human motion prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12116–12125, 2019.
  • [20] Curtis Hawthorne, Erich Elsen, Jialin Song, Adam Roberts, Ian Simon, Colin Raffel, Jesse Engel, Sageev Oore, and Douglas Eck. Onsets and frames: Dual-objective piano transcription. arXiv preprint arXiv:1710.11153, 2017.
  • [21] Daniel Holden, Jun Saito, and Taku Komura. A deep learning framework for character motion synthesis and editing. ACM Transactions on Graphics (TOG), 35(4):138, 2016.
  • [22] Eugene Hsu, Kari Pulli, and Jovan Popović. Style translation for human motion. In ACM Transactions on Graphics (TOG), volume 24, pages 1082–1089. ACM, 2005.
  • [23] Ashesh Jain, Amir R Zamir, Silvio Savarese, and Ashutosh Saxena. Structural-rnn: Deep learning on spatio-temporal graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5308–5317, 2016.
  • [24] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [25] Filip Korzeniowski and Gerhard Widmer. Feature learning for chord recognition: The deep chroma extractor. arXiv preprint arXiv:1612.05065, 2016.
  • [26] Lucas Kovar, Michael Gleicher, and Frédéric Pighin. Motion graphs. In ACM SIGGRAPH 2008 classes, page 51. ACM, 2008.
  • [27] Florian Krebs, Sebastian Böck, Matthias Dorfer, and Gerhard Widmer. Downbeat tracking using beat synchronous features with recurrent neural networks. In ISMIR, pages 129–135, 2016.
  • [28] Florian Krebs, Sebastian Böck, and Gerhard Widmer. Rhythmic pattern modeling for beat and downbeat tracking in musical audio. In ISMIR, pages 227–232, 2013.
  • [29] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [30] Manfred Lau, Ziv Bar-Joseph, and James Kuffner. Modeling spatial and temporal variation in motion data. In ACM Transactions on Graphics (TOG), volume 28, page 171. ACM, 2009.
  • [31] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436–444, 2015.
  • [32] Juheon Lee, Seohyun Kim, and Kyogu Lee. Listen to dance: Music-driven choreography generation using autoregressive encoder-decoder network. arXiv preprint arXiv:1811.00818, 2018.
  • [33] Minho Lee, Kyogu Lee, and Jaeheung Park. Music similarity-based approach to generating dance motion sequence. Multimedia tools and applications, 62(3):895–912, 2013.
  • [34] Chen Li, Zhen Zhang, Wee Sun Lee, and Gim Hee Lee. Convolutional sequence to sequence model for human dynamics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5226–5234, 2018.
  • [35] Yan Li, Tianshu Wang, and Heung-Yeung Shum. Motion texture: a two-level statistical model for character motion synthesis. In ACM transactions on graphics (ToG), volume 21, pages 465–472. ACM, 2002.
  • [36] Zimo Li, Yi Zhou, Shuangjiu Xiao, Chong He, and Hao Li. Auto-conditioned lstm network for extended complex human motion synthesis. arXiv preprint arXiv:1707.05363, 3, 2017.
  • [37] James MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1, pages 281–297. Oakland, CA, USA, 1967.
  • [38] Julieta Martinez, Michael J Black, and Javier Romero. On human motion prediction using recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2891–2900, 2017.
  • [39] Brian McFee, Colin Raffel, Dawen Liang, Daniel PW Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. librosa: Audio and music signal analysis in python. In Proceedings of the 14th python in science conference, volume 8, 2015.
  • [40] Jianyuan Min and Jinxiang Chai. Motion graphs++: a compact generative model for semantic motion analysis and synthesis. ACM Transactions on Graphics (TOG), 31(6):153, 2012.
  • [41] Meinard Müller and Sebastian Ewert. Chroma toolbox: Matlab implementations for extracting variants of chroma-based audio features. In Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR), 2011. hal-00727791, version 2-22 Oct 2012. Citeseer, 2011.
  • [42] Meinard Müller, Tido Röder, Michael Clausen, Bernhard Eberhardt, Björn Krüger, and Andreas Weber. Documentation mocap database hdm05. 2007.
  • [43] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
  • [44] Vladimir Pavlovic, James M Rehg, and John MacCormick. Learning switching linear models of human motion. In Advances in neural information processing systems, pages 981–987, 2001.
  • [45] Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. Deepmimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics (TOG), 37(4):143, 2018.
  • [46] Xue Bin Peng, Glen Berseth, KangKang Yin, and Michiel Van De Panne. Deeploco: Dynamic locomotion skills using hierarchical deep reinforcement learning. ACM Transactions on Graphics (TOG), 36(4):41, 2017.
  • [47] Alla Safonova and Jessica K Hodgins. Construction and optimal search of interpolated motion graphs. ACM Transactions on Graphics (TOG), 26(3):106, 2007.
  • [48] SFU. Sfu motion capture database., 2017.
  • [49] Takaaki Shiratori, Atsushi Nakazawa, and Katsushi Ikeuchi. Dancing-to-music character animation. In Computer Graphics Forum, volume 25, pages 449–458. Wiley Online Library, 2006.
  • [50] Taoran Tang, Jia Jia, and Hanyang Mao. Dance with melody: An lstm-autoencoder approach to music-oriented dance synthesis. In 2018 ACM Multimedia Conference on Multimedia Conference, pages 1598–1606. ACM, 2018.
  • [51] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
  • [52] Ruben Villegas, Jimei Yang, Duygu Ceylan, and Honglak Lee. Neural kinematic networks for unsupervised motion retargetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8639–8648, 2018.
  • [53] Jack Wang, Aaron Hertzmann, and David J Fleet. Gaussian process dynamical models. In Advances in neural information processing systems, pages 1441–1448, 2006.
  • [54] Shihong Xia, Congyi Wang, Jinxiang Chai, and Jessica Hodgins. Realtime style transfer for unlabeled heterogeneous human motion. ACM Transactions on Graphics (TOG), 34(4):119, 2015.
  • [55] Nelson Yalta, Shinji Watanabe, Kazuhiro Nakadai, and Tetsuya Ogata. Weakly supervised deep recurrent neural networks for basic dance step generation. arXiv preprint arXiv:1807.01126, 2018.
  • [56] Kiwon Yun, Jean Honorio, Debaleena Chattopadhyay, Tamara L Berg, and Dimitris Samaras. Two-person interaction detection using body-pose features and multiple instance learning. In 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pages 28–35. IEEE, 2012.

1 Implementation detail

There are three important parts in our whole system: music attribute classification, music-driven dance generation, and post-processing. The three models are trained separately, and the training details are described as follows.

1.1 Music Attribute Classification

We divided the music data into training data (about 80%) and test data (20%). We used Adam [24] to optimize the model on 1 NVIDIA GTX 2080Ti GPU with a batch size of 128 for 30 epochs. The initial learning rate is  and is divided by 10 at the 18th and the 25th epoch.
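
The step schedule described above can be sketched as a small helper. This is only an illustration: the function name `step_lr` is hypothetical, epochs are assumed to be counted from 1, and `base_lr` is a placeholder because the initial learning rate is not given in the text.

```python
def step_lr(epoch, base_lr, milestones=(18, 25), factor=10.0):
    """Learning rate at a given epoch under a step-decay schedule.

    Assumptions (not stated in the paper): epochs are 1-indexed and the
    drop takes effect at the start of each milestone epoch.
    """
    # Count how many milestones have been reached so far.
    drops = sum(1 for m in milestones if epoch >= m)
    # Each reached milestone divides the learning rate by `factor`.
    return base_lr / (factor ** drops)


if __name__ == "__main__":
    base_lr = 1.0  # placeholder value, NOT the paper's learning rate
    for epoch in (1, 17, 18, 24, 25, 30):
        print(epoch, step_lr(epoch, base_lr))
```

Calling the same helper with `milestones=(1000,)` or `milestones=(100,)` gives single-drop schedules of the same form.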

1.2 Music-Driven Dance Generation

1.2.1 Training details

In the WaveNet model, we stack 20 residual blocks with 32 channels, and the maximum dilation is . WaveNet is very difficult to train. We add Gaussian noise to both the input and the ground truth, and apply dropout (0.4) to the input motion feature. We use Xavier normal initialization [17] and RMSprop [51] as the optimizer. To improve performance, we adopt a two-phase training strategy. Phase 1: we train for 1300 epochs; the learning rate is initialized to  and is divided by 10 at the 1000th epoch. Phase 2: multi-stage training. The prediction from a first pass during training is used as the input of a second pass, and the model is trained again for 200 epochs, with an initial learning rate of  that is divided by 10 at the 100th epoch.
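
The noise and dropout regularization mentioned above might look roughly like the following sketch. Several details are assumptions, since the text does not specify them: the noise standard deviation `noise_std`, the use of element-wise (rather than frame-wise) inverted dropout, and all function names are hypothetical. Motion frames are represented as plain lists of floats.

```python
import random


def add_gaussian_noise(frames, noise_std, rng):
    """Return a copy of `frames` with i.i.d. Gaussian noise added to every value."""
    return [[x + rng.gauss(0.0, noise_std) for x in frame] for frame in frames]


def dropout(frames, drop_prob, rng):
    """Inverted dropout: zero each value with probability `drop_prob`,
    and rescale the surviving values by 1 / (1 - drop_prob)."""
    keep = 1.0 - drop_prob
    return [[x / keep if rng.random() >= drop_prob else 0.0 for x in frame]
            for frame in frames]


def corrupt_training_pair(inputs, targets, noise_std=0.01, drop_prob=0.4, seed=0):
    """Noise on both input and ground truth; dropout on the input only.

    noise_std=0.01 is an illustrative placeholder, not a value from the paper.
    """
    rng = random.Random(seed)
    noisy_inputs = dropout(add_gaussian_noise(inputs, noise_std, rng), drop_prob, rng)
    noisy_targets = add_gaussian_noise(targets, noise_std, rng)
    return noisy_inputs, noisy_targets
```

The corruption is applied only at training time; at test time the clean motion features would be used.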

1.2.2 Data clustering

We cluster the dance data because its distribution is not uniform, which helps improve the performance of our model. In our experiment, each motion sample is a window of 480 frames, and we adopt k-means [37] to cluster the motion samples. Note that the motion feature used for k-means clustering is the joint position, not the joint rotation. After the clustering results are obtained, the training samples are drawn by a sliding window according to the probability of each sample. The window size is set to 480 frames and the stride to 3 frames.
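
A minimal sketch of the sliding-window sampling is given below. The window size (480) and stride (3) come from the text; the weighting rule is an assumption, since the text does not define the per-sample probability. Here each window is weighted inversely to the size of its cluster so that every cluster is drawn equally often. The cluster labels are taken as given (for example from a k-means run), and all function names are hypothetical.

```python
import random
from collections import Counter


def window_starts(num_frames, window=480, stride=3):
    """Start indices of all full windows in a sequence of `num_frames` frames."""
    if num_frames < window:
        return []
    return list(range(0, num_frames - window + 1, stride))


def cluster_balanced_weights(cluster_ids):
    """Per-window sampling probabilities that give each cluster equal total mass.

    Assumed rule (not from the paper): weight = 1 / (num_clusters * cluster_size).
    """
    counts = Counter(cluster_ids)
    num_clusters = len(counts)
    return [1.0 / (num_clusters * counts[c]) for c in cluster_ids]


def sample_windows(starts, weights, num_samples, seed=0):
    """Draw window start indices with replacement according to `weights`."""
    rng = random.Random(seed)
    return rng.choices(starts, weights=weights, k=num_samples)


if __name__ == "__main__":
    starts = window_starts(489)          # four windows: 0, 3, 6, 9
    labels = [0, 0, 0, 1]                # hypothetical k-means cluster ids
    weights = cluster_balanced_weights(labels)
    print(sample_windows(starts, weights, num_samples=8))
```

With this rule, a window from a rare cluster is drawn more often than one from a large cluster, which is one way to counter the non-uniform data distribution.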

Figure 11: GMM loss (With/Without cluster).

Score level   Content description
0             completely do not move with music
2             not like human motion
4             like human motion, but does not match music at all
6             slightly natural, slightly match music, but not rich, large foot sliding
8             natural, match music, but not rich, no large foot sliding (junior dancers)
9             natural, match music, but not rich, no large foot sliding (general dancers)
10            natural, highly match music, no foot sliding, very rich (professional dancers)

Table 3: Scoring level.

1.2.3 The performance of motion clustering

Dance motion clustering helps address the problem of unbalanced samples in training, and thus improves training performance. We plot the GMM loss in training phase 1 (1300 epochs) in Figure 11. When k-means is adopted for motion clustering, the loss is about 3 lower than without clustering. In addition, the dances generated by the model trained with clustering are richer when we test the generated results.

1.3 Post-processing

We directly simulated the dance motion to get the foot sliding motion. Our dataset was divided into training data (85%) and test data (15%). As in the dance generation model, training samples are drawn by a sliding window of 480 frames with a stride of 3. The model is optimized with Adam [24] for 200 epochs, and the initial learning rate is .

User Study Scoring

We asked 25 people to score these dance motions. The score is based on the naturalness of the dance motion (25%), its richness (25%), how well it fits the music rhythm and melody (40%), and foot sliding (10%). More scoring details are described in Table 3.
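
To make the weighting concrete, the sketch below combines four per-criterion sub-scores (each on the 0 to 10 scale) into one overall score and averages over raters. This is only one possible reading: the text does not say whether raters scored each criterion separately, so the weighted sum, the criterion keys, and the function names are all assumptions.

```python
# Criterion weights taken from the percentages in the text.
WEIGHTS = {
    "naturalness": 0.25,
    "richness": 0.25,
    "music_match": 0.40,   # fit to music rhythm and melody
    "foot_sliding": 0.10,
}


def overall_score(sub_scores):
    """Weighted sum of per-criterion scores (assumed aggregation rule)."""
    return sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS)


def mean_score(all_raters):
    """Average the overall score over a list of raters' sub-score dicts."""
    return sum(overall_score(s) for s in all_raters) / len(all_raters)


if __name__ == "__main__":
    # Hypothetical ratings from two raters, for illustration only.
    raters = [
        {"naturalness": 8, "richness": 6, "music_match": 8, "foot_sliding": 6},
        {"naturalness": 8, "richness": 8, "music_match": 8, "foot_sliding": 8},
    ]
    print(mean_score(raters))
```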