Generative Autoregressive Networks for 3D Dancing Move Synthesis from Music

11/11/2019 ∙ by Hyemin Ahn, et al. ∙ Seoul National University, Delft University of Technology

This paper proposes a framework that generates a sequence of three-dimensional human dance poses for a given piece of music. The proposed framework consists of three components: a music feature encoder, a pose generator, and a music genre classifier. We focus on integrating these components to generate realistic 3D human dancing moves from music, which can be applied to artificial agents and humanoid robots. The trained dance pose generator, which is a generative autoregressive model, is able to synthesize a dance sequence longer than 5,000 pose frames. Experimental results on dance sequences generated from various songs show how the proposed method generates human-like dancing moves for the given music. In addition, a generated 3D dance sequence is applied to a humanoid robot, showing that the proposed framework can make a robot dance just by listening to music.




I Introduction

Dance is one of the most important forms of performing arts and has emerged in all known cultures. As a subcategory of theatrical dance, choreography associated with music is also one of the most popular forms, and it has usually been designed and physically performed by professional choreographers. Recently, there have been a number of new attempts to profit commercially from dancing characters [kda]. Specifically, a vocal group whose members are all fictional 3D animated characters succeeded in attracting substantial attention with its music and choreography. This choreography was obtained by using expensive motion capture equipment with professional artists, which is a time-consuming and costly task. Moreover, since the obtained choreography can only be used for a single song, the motion capture process needs to be repeated to create choreography for another piece of music, unless the agent mimics the human ability to generate dances. To overcome this limitation, this paper proposes a novel framework for automatic dance generation which can synthesize a sequence of 3D dancing moves from music.

There have been several attempts at 3D human motion modeling. Some studies address 3D joint sequence prediction for character motion synthesis, such as human locomotion generation [holden2016deep, pavllo2019modeling]. However, these generate a motion by itself, while our objective is to generate a motion that depends on a specific source sequence. Regarding motion generation conditioned on a source sequence, [ahn2018text2action] proposed a network that generates the motion of a 3D upper-body skeleton from a language sentence describing a specific human behavior. With music as the input condition sequence, [shlizerman2018audio] succeeded in producing the upper-body motion of an avatar playing a violin or a piano when classical music is provided as input. Our goal is similar to [shlizerman2018audio] in that we regard music as input; however, we suggest a methodology for the more challenging task of generating a 3D full-body human dance sequence appropriate to the provided music.

Even for a human, generating a dance sequence is a challenging task that requires talent and experience. One needs to understand the rhythmic features and mood of a given piece of music, and create a motion sequence that meets the aesthetic criteria constrained by the musical content. In this paper, we aim to enable an intelligent system to carry out such a process using a data-driven approach, namely deep neural networks. The proposed framework is composed as follows: a music feature encoder, which generates a feature vector containing the information of the given music; a set of pose generators, each trained on the dataset of one genre, such as cha-cha, rumba, tango, and waltz dances [relate_1]; and a music genre classifier, which selects the pose generator to use based on the classified genre of the given music.

Fig. 1: An example of a dance sequence generated for the song Call me maybe by Carly Rae Jepsen. From the top, we have an input audio signal, a generated audio feature, and the generated 3D dancing moves. Note that the dimension of the plotted audio feature has been reduced using principal component analysis (PCA) for better visualization.

The proposed pose generator is a generative autoregressive model which does not employ the structure of recurrent neural networks (RNNs) [lstm]. Our generator predicts the next pose frame based on the music feature and the set of poses generated during a certain period of time in the past. It consists of dilated causal convolution layers, which have been shown to be effective when generating sequences with a large number of frames [bai2018empirical, oord2016wavenet], and fully connected (FC) layers. In Section IV, it is shown that our model is efficient in that it provides better qualitative performance with fewer parameters than RNN-based models [relate_1].

Experimental results show examples of generated dance sequences, each sequence with more than 5,000 pose frames. Comparison with other baseline methods [relate_1] demonstrates that the proposed method is more effective in synthesizing dynamic movements according to the given music. In addition, we apply the generated pose sequence to a humanoid robot NAO, so that the robot can also dance to the given music.

The remainder of the paper is structured as follows. Related work on dance generation is introduced in Section II. Section III describes the structure of the proposed framework, which consists of a music feature encoder, a pose generator, and a music genre classifier. Section IV shows the dance sequences generated from various songs and compares the proposed method with other baseline methods. A demonstration of the proposed method using a humanoid robot NAO is also provided.

II Related Work

There have been several approaches for generating 3D dancing moves of a human [relate_4, relate_1, DBLP:journals/corr/abs-1811-00818]. In [relate_4], the authors proposed a model that can randomly generate a choreography based on a dataset of movements of dancing humans recorded by a commercial sensor array (Microsoft Kinect). Although the generated dance sequence can serve as guidance for a human's choreography design process, it does not utilize audio information as input, so the resulting dance has no correlation with the music. On the other hand, the model proposed in [relate_1] is a recurrent neural network (RNN) based auto-encoder which learns the relationship between music and pose features. The model encodes the music feature, which consists of pre-extracted acoustic features and three temporal indices, into a latent vector in order to predict the pose features.

The acoustic feature used in [relate_1] is based on a set of hand-crafted features that are popular in the music and audio domain, and the temporal indices are related to each audio frame's location and beats per minute (BPM) information. The pose feature is a set of joint positions relative to the center of mass of the skeleton. The method proposed in [relate_1] is conceptually similar to ours in that it learns the relationship between audio features and poses. To show the effectiveness of our proposed method, we analyze the dance generation results with respect to the number of network parameters. Experimental results in Section IV-B show that our model can synthesize more realistic dances than [relate_1] even with fewer network parameters.

Our approach shares a similar concept with [DBLP:journals/corr/abs-1811-00818] in that it predicts the next pose based on the audio feature and the pose data from a certain past period. However, unlike the moving 3D human shape generated by our methodology, the pose skeleton generated by [DBLP:journals/corr/abs-1811-00818] does not provide realistic dancing moves, since the root joint is always fixed at a specific position and the pose data lies in two-dimensional coordinates.

Fig. 2: An overview of the proposed system. It first determines the genre of the input music, chooses a pose generator for the determined genre, and generates a pose based on the selected model.

III Methodology

III-A Overall Structure

We focus on building a machine learning-based framework for generating a 3D pose sequence of a dancing human when music is given as input. Figure 2 shows an overview of the proposed system. The proposed dance generation system is composed of a music feature encoder, a pose generator, and a music genre classifier. In order to synthesize a dance movement tailored to the given music, the proposed framework works in the following order.

  1. Convert the input music into a set of audio feature vectors.

  2. Determine the genre of the music by considering the entire audio feature vector.

  3. Choose one pose generator according to the identified genre.

  4. Based on the chosen model, estimate the pose for all frames.

After the music feature encoder generates a set of audio feature vectors, one for each time frame of the given audio, the music genre classifier estimates the genre of the input music at each time frame from a fixed-length window of audio features. Each estimate is one of a fixed number of genres. Given the genre estimated by considering the entire set of audio features (see Figure 2), the proposed system chooses the pose generator of that genre to generate dance poses for all audio frames.

To generate the pose vector at a given time, the pose generator takes the audio feature at that time and a fixed-length window of previously generated pose vectors. Since our model is autoregressive, the generated 3D pose vector is appended to the set of generated poses and is used for generating the pose at the next time step.
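The autoregressive loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `generate_dance` and `toy_generator` are hypothetical names, the toy generator stands in for a trained network, and the window length and pose dimension are placeholders rather than the values used in the paper.

```python
from collections import deque


def generate_dance(audio_features, pose_generator, pose_dim, window_len):
    """Generate one pose per audio frame, feeding past outputs back in."""
    # The pose window starts as zero vectors (the mean pose after normalization).
    window = deque([[0.0] * pose_dim for _ in range(window_len)], maxlen=window_len)
    poses = []
    for feature in audio_features:
        pose = pose_generator(feature, list(window))
        poses.append(pose)
        window.append(pose)  # the oldest pose is dropped automatically
    return poses


def toy_generator(feature, past_poses):
    """Hypothetical stand-in for the trained network (not the paper's model)."""
    last = past_poses[-1]
    return [0.5 * p + feature for p in last]


if __name__ == "__main__":
    print(generate_dance([1.0, 1.0, 1.0], toy_generator, pose_dim=2, window_len=4))
```

Using a bounded `deque` keeps the window at a fixed length, so each step sees only the most recent poses.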

III-B Music Feature Encoder

To accelerate the learning process, transfer learning is considered for the lower layers, where the music signal is encoded. Specifically, we trained a neural network that estimates the beats per minute (BPM) of the given audio signal, which has been reported as an effective source task for transfer to tasks in which temporal dynamics are an important feature, such as dance music genre classification [Kim2019]. After pre-training, we employed the lower layers of the BPM estimation model as the music encoder layers of the proposed system.

III-B1 Dataset and Preprocessing

Training a BPM estimator requires a dataset of data points, each of which is a tuple of a music audio signal and the corresponding BPM value. We exploited the Million Song Dataset (MSD) [Bertin-MahieuxEWL11], which provides various metadata on commercial songs. MSD includes BPM information for all entries, which can be directly used for our learning setup. (Note that the BPM provided by MSD is computed by a BPM estimation algorithm, which is well known for its frequent octave errors, where the erroneous estimate is an integer multiple of the ground-truth value.)

Further, we standardized the BPM values to regularize the loss function.

As for the audio signal, we used 30-second music preview snippets at a sampling rate of 22,050 Hz, collected with the 7digital API, to extract the low-level audio feature that serves as the input data. More specifically, we applied μ-law encoding and decoding to the raw audio [OordDZSVGKSK16] at a fixed quantization resolution. The decoding process simply applies the inverse function of the encoding process to the quantized data. While the quantization compresses the original audio samples efficiently, it is reported that the encoded representation can still be used effectively as the data source of the learning process [OordDZSVGKSK16]. We applied the encoding when storing and serving the data so that it can be used efficiently at scale, and applied the decoding on the fly when the data is input to the models for optimization.
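For readers unfamiliar with μ-law companding, the following Python sketch shows one common form of the encode/decode pair. It is an illustration, not the authors' code: the value `MU = 255` (256 levels) is an assumption, since the quantization resolution is not preserved in this text, and the function names are hypothetical.

```python
import math

MU = 255  # assumed resolution (256 levels); not taken from the paper


def mu_law_encode(x, mu=MU):
    """Map a sample in [-1, 1] to an integer code in [0, mu]."""
    x = max(-1.0, min(1.0, x))
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int(round((compressed + 1.0) / 2.0 * mu))


def mu_law_decode(code, mu=MU):
    """Inverse of mu_law_encode, up to quantization error."""
    compressed = 2.0 * code / mu - 1.0
    return math.copysign(math.expm1(abs(compressed) * math.log1p(mu)) / mu, compressed)


if __name__ == "__main__":
    for sample in (-1.0, -0.25, 0.0, 0.5, 1.0):
        code = mu_law_encode(sample)
        print(sample, code, mu_law_decode(code))
```

The logarithmic mapping spends more codes on quiet samples, which is why the round-trip error is smallest near zero and largest near full scale.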

Finally, we employed 200,000 / 10,000 / 13,673 tracks for the training, validation, and the testing of the model, respectively.

III-B2 Network Architecture

As for the network architecture, we employed the 1-dimensional convolutional neural network (CNN) architecture introduced in [KimLN18], which is a 1-dimensional analogue of VGG-like networks [Simonyan2014VeryRecognition]. It consists of cascaded convolution and pooling operations, which have been shown to be effective not only on image recognition tasks but also on audio and music related tasks [Kim2019]. In addition to the base architecture, we added several components that are reported as effective on audio and music related tasks [OordDZSVGKSK16, KimLN18]: batch normalization, residual connections [he2016deep], and dropout [Srivastava2014Dropout:Overfitting]. Finally, we chose the gated tanh function as the core non-linearity [OordDZSVGKSK16] for the convolution blocks and the rectified linear unit (ReLU) [Nair2010RectifiedMachines] for the later fully connected (FC) layers. A general overview of the architecture can be found in Figure 3.

Fig. 3: The general architecture of the BPM estimator. The network consists of a series of ConvBlocks, each containing convolution, pooling, and other operations. The raw audio signal encoded by these blocks is passed to the following regression layers. The numbers inside the brackets refer to the relevant hyperparameters of each block: for FC layers, they indicate the number of output units; for ConvBlocks, the triplet refers to the number of output kernels, the size of each kernel, and the ratio of subsampling, respectively. The rightmost block illustrates the inner structure of each ConvBlock.

III-B3 Training

The network is trained by minimizing the mean squared error between the ground-truth BPM and the estimate with respect to the parameters of the BPM estimator. At every iteration, the input data is randomly selected from the training set, and a 2-second chunk (44,100 samples) is randomly cropped from the original 30 seconds, which is common in the music and audio task domain [Kim2019, KimLN18]. We applied the Adam optimizer for robust optimization [Kingma2014Adam:Optimization], updating the parameters for 600 epochs in total. To adapt the training to the scale we tested, we applied mini-batch stochastic gradient updates.
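The data handling around this training objective can be illustrated with a short Python sketch. It only shows the three ingredients named in the text (standardized BPM targets, a random 2-second crop, and a mean squared error); the helper names are hypothetical, and the network and optimizer are deliberately left out.

```python
import random

SAMPLE_RATE = 22050            # Hz, as stated for the preview snippets
CROP_LEN = 2 * SAMPLE_RATE     # 2 seconds = 44,100 samples


def standardize(values):
    """Rescale values to zero mean and unit standard deviation."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]


def random_crop(signal, crop_len, rng=random):
    """Return a contiguous chunk of crop_len samples at a random offset."""
    start = rng.randrange(len(signal) - crop_len + 1)
    return signal[start:start + crop_len]


def mse(targets, predictions):
    """Mean squared error between two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


if __name__ == "__main__":
    clip = [0.0] * (30 * SAMPLE_RATE)          # placeholder 30-second clip
    print(len(random_crop(clip, CROP_LEN)))    # length of one training chunk
    print(standardize([60.0, 120.0, 180.0]))   # toy BPM targets
```

Cropping a fresh chunk at every iteration acts as a simple form of data augmentation, since the model rarely sees the same 2 seconds twice.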


III-B4 Transfer

After pre-training, we transferred the first 10 ConvBlocks to the main system as the music recognition pipeline that encodes the input audio into a feature vector. The feature is computed from the output of the 10th ConvBlock for the audio signal temporally centered on the MoCap frame where the pose frame is located. The first dimension of the output tensor is the temporal dimension, which still needs to be reduced to represent the music input for each pose frame. We applied a normalized Gaussian window for pooling. It summarizes the temporal feature with weights that are larger for time steps close to the pose frame.
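Pooling with a normalized Gaussian window can be sketched as a weighted average over the temporal axis. In the Python sketch below, the function name `gaussian_pool` and the width parameter `sigma` are assumptions; the window width used by the authors is not given in this text.

```python
import math


def gaussian_pool(frames, sigma):
    """Collapse a (time, channels) nested list into one vector per channel.

    Weights follow a Gaussian centered on the middle time step and are
    normalized to sum to one, so the center contributes the most.
    """
    n = len(frames)
    centre = (n - 1) / 2.0
    weights = [math.exp(-0.5 * ((t - centre) / sigma) ** 2) for t in range(n)]
    total = sum(weights)
    weights = [w / total for w in weights]
    channels = len(frames[0])
    return [sum(w * frame[c] for w, frame in zip(weights, frames))
            for c in range(channels)]


if __name__ == "__main__":
    # Three time steps, one channel: the middle step dominates the result.
    print(gaussian_pool([[0.0], [1.0], [0.0]], sigma=1.0))
```

Because the weights are normalized, a constant input is returned unchanged, which makes the pooled feature comparable across window lengths.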

III-C Pose Generator

III-C1 Dataset and Preprocessing

For training our pose generator, we use the dataset from [relate_1], which consists of four types of 3D human dance poses synchronized with songs of each genre. This dataset provides 3D motion data obtained from a motion capture system at 25 fps. However, the center of mass of the human pose is always fixed in the middle, so that the human always remains in the center. Since the movement of the center of mass is one of the crucial components of dance, we preprocessed the motion data again so that the human moves along the footsteps.

After the preprocessing, we convert the dataset into the format suggested in [holden2016deep]. Following [holden2016deep], each pose vector consists of 63-dimensional 3D joint positions defined in the body's local coordinate system, the forward direction of the body (3 dimensions), the global velocity of the body in the floor plane (3 dimensions), the rotational velocity of the body around the vertical axis (1 dimension), and foot contact labels of the left/right heel or toe (4 dimensions), so that the total dimension of a pose vector is 63 + 3 + 3 + 1 + 4 = 74. The foot contact label is computed from the height of the foot above the floor plane. After the conversion, all pose vectors in the training dataset are normalized with their mean and standard deviation.

Fig. 4: The architecture of the proposed pose generator. The numbers inside the brackets refer to the relevant hyperparameters of each block: for fully connected (FC) layers, they indicate the number of output units; for ConvBlock, the quadruplet indicates the number of output kernels, the size of each kernel, the ratio of subsampling in max-pooling, and the stride value in max-pooling.

III-C2 Network Architecture

When a person dances, one considers the characteristics of the current music and how one has been moving. Therefore, in order to generate the pose vector at the current time, the proposed pose generator considers the audio feature extracted at that time and the set of pose vectors generated over a preceding window of frames, whose length is fixed in our experiment. In the test phase, the initial pose window is filled with zero-valued vectors, which correspond to the mean pose after the normalization.

As shown in Figure 4, the proposed network encodes a feature from the pose window using a series of ConvBlocks. Each ConvBlock consists of a 1D convolutional layer, a max-pooling layer, and a leaky ReLU layer, as shown in Figures 4 and 5. This dilated convolution structure resembles [bai2018empirical] in that it is causal and the receptive field can be enlarged with fewer parameters and layers. In our proposed network, the length of the receptive field is halved after passing each ConvBlock, as shown in Figure 5. Note that ConvBlock is also used in our music genre classifier.
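To make the halving behaviour concrete, here is a single-channel Python sketch of a block with the three stages named above (causal 1D convolution, max-pooling, leaky ReLU). The kernel values, the pooling size and stride of 2, and the leaky slope of 0.01 are assumptions for illustration; the actual hyperparameters are those shown in the figures.

```python
def causal_conv1d(seq, kernel, dilation=1):
    """Causal convolution: output at t depends only on inputs at or before t."""
    out = []
    for t in range(len(seq)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:  # implicit zero padding on the left
                acc += w * seq[idx]
        out.append(acc)
    return out


def max_pool1d(seq, size=2, stride=2):
    """Max-pooling; with size 2 and stride 2 the length is halved."""
    return [max(seq[i:i + size]) for i in range(0, len(seq) - size + 1, stride)]


def leaky_relu(seq, slope=0.01):
    """Leaky ReLU with an assumed negative slope."""
    return [x if x > 0 else slope * x for x in seq]


def conv_block(seq, kernel):
    """Convolution, then pooling, then non-linearity, in the order described."""
    return leaky_relu(max_pool1d(causal_conv1d(seq, kernel)))


if __name__ == "__main__":
    x = [1.0, -2.0, 3.0, 4.0, -5.0, -6.0, 7.0, 8.0]
    for _ in range(3):
        x = conv_block(x, kernel=[1.0])
        print(len(x), x)
```

Stacking such blocks shrinks a window of length 8 to 4, 2, and finally 1, so a short stack is enough to summarize the whole pose window into a single feature.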

Fig. 5: The architecture of the ConvBlock used in the proposed pose generator and music genre classifier.

After concatenating the encoded pose feature vector with the audio feature vector, the proposed pose generator synthesizes the output pose by passing the result through a series of fully connected layers and leaky ReLU layers. Since our model is autoregressive, the generated pose is appended to the pose window so that it can be used for generating the next pose vector.

III-C3 Training

When training a model that generates a sequence, one can follow the strategy called teacher forcing [teacher_forcing], which gives the ground-truth previous output to the model as input when generating the current output. This method has the advantage that the model can quickly learn how to generate the data from the ground-truth dataset. However, when a model is autoregressive, it can leave the model vulnerable to its own prediction errors once it is exposed to its own generation results in the test phase. To address this, one can use a strategy called student forcing in the training phase, which gives the generated previous output to the model as input when generating the current output, but it takes a long training time to overcome the errors from the model's imperfect outputs. Therefore, we start training the network with the teacher-forcing method and slowly increase the ratio of selecting the student-forcing method.

At each training step, whether the ground-truth value or the generated value of the previous pose is used for generating the next pose is selected according to the probability of choosing the teacher-forcing method. This probability starts at an initial value and decays by a constant factor at every training step. In addition, the loss between the ground-truth pose vector and the estimated pose vector is minimized by the Adam optimizer [Kingma2014Adam:Optimization].
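The gradual shift from teacher forcing to student forcing can be sketched as a decaying probability plus a random choice at each step. In this Python sketch the function names are hypothetical, and the initial probability and decay factor in the usage example are placeholders, since the values used in the paper are not preserved in this text.

```python
import random


def teacher_prob_schedule(initial_prob, decay, steps):
    """Yield the teacher-forcing probability for each training step."""
    prob = initial_prob
    for _ in range(steps):
        yield prob
        prob *= decay  # multiplicative decay at every training step


def choose_previous_pose(ground_truth_pose, generated_pose, teacher_prob, rng=random):
    """Pick the ground-truth pose with probability teacher_prob, else the model's own output."""
    return ground_truth_pose if rng.random() < teacher_prob else generated_pose


if __name__ == "__main__":
    rng = random.Random(0)
    for step, prob in enumerate(teacher_prob_schedule(1.0, 0.5, 5)):
        source = choose_previous_pose("ground truth", "generated", prob, rng)
        print(step, prob, source)
```

Early in training the model almost always sees ground-truth poses, and as the probability decays it is increasingly exposed to its own outputs, which matches the motivation given above.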

III-D Music Genre Classifier

Fig. 6: The architecture of the music genre classifier. The numbers inside the brackets refer to the relevant hyperparameters of each ConvBlock: the quadruplet indicates the number of output kernels, the size of each kernel, the ratio of subsampling in max-pooling, and the stride value in max-pooling.

III-D1 Dataset and Preprocessing

For training the music genre classifier, we also use the dataset in [relate_1], which provides songs of four music genres: Cha-Cha, Rumba, Tango, and Waltz. The audio signal of each song is converted into a set of audio features, which is used for training the music genre classifier.

III-D2 Network Architecture

The proposed music genre classifier takes a fixed-length window of audio features as input. The structure of the music genre classifier is based on the ConvBlock, which is also employed in our pose generator. After the window passes through a series of ConvBlocks, a softmax layer, and an argmax layer, a genre estimate is produced as output, taking one of as many values as there are genres in the training dataset. Based on the set of genre estimates over the whole song, the most frequently estimated genre is chosen as the final genre of the music. Note that we have a pose generator for every genre, and the pose generator of the chosen genre is selected to synthesize the dance pose sequence for all time frames.

III-D3 Training

For training the proposed music genre classifier, we use a cross-entropy loss function and minimize it using the Adam optimizer [Kingma2014Adam:Optimization]. For validation, the dataset is divided into 56 songs for training and six songs for testing. After training, the proposed music genre classifier does not classify every test audio feature vector correctly. However, when classifying the genre of a song, it classified all six test songs correctly. This is because the genre of a song is determined by taking into account the genre estimates over the entire song, so even if a few estimates are misclassified, the genre of the song can still be classified correctly.
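The song-level decision amounts to a majority vote over per-window genre estimates. A minimal Python sketch, with a hypothetical function name and made-up estimates, shows why a few misclassified windows do not change the final genre:

```python
from collections import Counter


def song_genre(window_estimates):
    """Return the genre that is estimated most often across the song."""
    return Counter(window_estimates).most_common(1)[0][0]


if __name__ == "__main__":
    # Made-up per-window estimates with two errors among seven windows.
    estimates = ["tango", "tango", "waltz", "tango", "rumba", "tango", "tango"]
    print(song_genre(estimates))
```

As long as the correct genre is the most frequent estimate, isolated errors are outvoted.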

Fig. 7: Generation results for four music genres. (Left) Movement seen from above. As the time step increases, the color of the skeleton darkens. The red line on the ground indicates the path of the center of mass. (Right) The corresponding sequence plotted at 1 fps.
Fig. 8: Analysis of which part of the training dance data matches the generated dance. Poses highlighted in the same color are the ones with the highest similarity.
Fig. 9: Comparison of generation results with [relate_1] and the baseline method. (Left) Movement seen from above. As the time step increases, the color of the skeleton darkens. The red line on the ground indicates the path of the center of mass. (Right) The corresponding sequence plotted at 1 fps.

IV Experiments

IV-A Dance Generation Results

Figure 7 shows the dance sequences generated from four songs. The result is based on the pose generator selected according to the genre determined by the music genre classifier. These dances are generated from unheard songs, and the supplementary video shows the related results more vividly. The supplementary video shows that the proposed framework can generate the movements of a person dancing to the beat of the given music, and produce a dance appropriate for the classified genre.

To verify whether the trained pose generator has learned to dance from the training dataset, we analyzed which parts of the training data the generated dance originates from. After dividing the generated dance and the training dance data into segments of a certain time interval, we determined which sections of the generated dance and the training data are most similar using dynamic time warping (DTW) [dtw]. Figure 8 shows which parts of the training dance data match the dance generated for There's nothing holding me back by Shawn Mendes. The pose sequences highlighted in the same color are the ones with the highest similarity. The result shows that the generated dance has a pattern similar to the training data, but its details differ from the training data so that it can match the given input music. Based on this, we claim that the proposed system is able to adapt dance patterns learned from the training data to unheard input music, and that it can generate a new dance sequence different from the training dataset.
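The segment matching can be illustrated with the classic dynamic-programming form of DTW. The Python sketch below is a generic textbook version, not the authors' analysis code: the Euclidean frame distance, the helper names, and the toy one-dimensional "poses" are all assumptions.

```python
def euclidean(p, q):
    """Euclidean distance between two pose vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def dtw_distance(seq_a, seq_b, dist=euclidean):
    """Dynamic time warping cost between two sequences of vectors."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def most_similar_segment(query, candidates):
    """Index of the candidate segment with the smallest DTW cost to the query."""
    return min(range(len(candidates)), key=lambda k: dtw_distance(query, candidates[k]))


if __name__ == "__main__":
    generated = [[0.0], [1.0], [2.0]]
    training_segments = [[[5.0], [5.0]], [[0.0], [1.0], [1.0], [2.0]]]
    print(most_similar_segment(generated, training_segments))
```

DTW tolerates differences in timing, so a training segment that performs the same movement at a slightly different speed still scores as a close match.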

IV-B Comparison with [relate_1] and Baseline Models

In this section, we compare the proposed pose generator with the generative network suggested in [relate_1]. The network proposed in [relate_1] consists of an LSTM-based autoencoder that maps classic audio features into ones suitable for dance generation, and an LSTM-based generator that synthesizes dance poses based on the generated audio features. Since no source code for the network is provided, we implemented it according to the description in [relate_1]. The difference from the original paper is that each batch of size 4 is formed by randomly selecting sections of 250 frames from the audio and pose dataset. In addition, since their method of masking the audio feature does not seem to contribute to improving dance quality, the feature masking part has been excluded. Also, we employed the pose vector representation used in this paper, so that the center of the generated dance pose can move.

TABLE I: Number of network parameters of the models in Figure 9

Model          Number of parameters
Proposed       4.15M
[5] (Small)    4.17M
[5] (Big)      13.47M
LSTM-RNN       4.87M

Comparison results of our pose generator and the network proposed in [relate_1] are shown in Figure 9. In this figure, all dances are generated from the same section of the same song (Single Ladies - Beyoncé). In the dance sequences generated by the network proposed in [relate_1], we observed many unnatural parts, such as the body moving without walking. Table I shows the number of network parameters of the models in Figure 9. Even when the number of parameters of the network proposed in [relate_1] is set to be similar to or larger than that of our network, the same result has been observed. We speculate that the authors of [relate_1] did not observe this phenomenon because they used the data after the preprocessing procedure that keeps the center of the human pose from moving. For a more vivid comparison, refer to the supplementary video.

In order to verify the effectiveness of our music feature encoder, we analyze the results when the input audio feature for the pose generator of [relate_1] is the one generated by our music feature encoder. The result is shown in Figure 9 with the label LSTM-RNN. It shows that the pose generator of [relate_1] improves when using our high-dimensional audio feature instead of the audio feature encoded by their LSTM-based autoencoder. However, slight body and foot sliding is still observed, even with a number of network parameters similar to that of our proposed network. (See the supplementary video for details.)

Fig. 10: Generated dancing moves for song Dancing is not a crime by Panic! at the Disco (top), a simulated dancing robot (middle), and a real robot dancing to the song (bottom).

IV-C Dance Generation for a Humanoid Robot NAO

This section shows the experimental results when the dance pose sequence generated by our proposed framework is applied to a humanoid robot NAO. To make the humanoid robot follow the generated dance pose sequence, we solved the inverse kinematics to calculate the joint angle values for controlling the robot. The calculated joint angle values are passed to the API provided by NAO, and a path is replanned considering the maximum joint velocity and collisions between body parts. However, this replanning process can make the robot's dance differ from the original dance if the dance moves too fast. This is an inevitable problem due to the limitations of the robot hardware. To overcome this hardware limitation, we use a simple remedy to generate a video of a dancing robot: making the robot dance slower, shooting the video, and replaying the video faster.

When the robot dance is implemented in the simulator, all body parts of the robot are moved. However, when using the real robot, only the upper body parts were moved, since the robot could lose balance and fall. Figure 10 shows the result of transferring the generated Rumba dance to the robot (Music: Dancing is not a crime by Panic! at the Disco). In this figure, the results of a 10-second dance are shown at 1 fps. This dance was not fast enough to be implemented on the robot at the original speed. For more vivid results, please refer to the supplementary video.

V Conclusion

In this paper, we have proposed a machine learning based framework for synthesizing a 3D dance pose sequence of a human when music is given as input. The proposed framework consists of three parts: a music feature encoder, a pose generator, and a music genre classifier. From the given input music, the music feature encoder extracts a set of audio features. Based on this, the genre of the music is determined by the music genre classifier, and the pose generator trained for that genre is used to generate the dance pose sequence for all frames. The proposed pose generator is a generative autoregressive model, which takes the current output pose as an input for generating the next pose frame.

The disadvantage of the proposed method is that the pose generator must be trained separately for each genre. When a single model was trained on all dance genres, we observed that unrealistic dancing moves were generated. In order to construct a model that can learn the patterns of various dance genres, it will be necessary to apply a multi-task learning technique, which is our future work.