Imagine listening to a piece of music with an inspiring, touching melody and dancing rhythmically to it as a passionate performer, enjoying the pleasure of expressing body and emotion. In reality, however, such a scene lives mostly in our imagination, since most of us have no idea how to dance for lack of professional dance training. It would be encouraging to develop a dance machine that generates realistic dance motion sequences from music, together with several customized key poses provided by users. In this work, our goal is to take a piece of music as well as some key poses and synthesize continuous human dance motion sequences. The generated motion sequences are expected not only to follow the rhythm of the music, but also to stay consistent with the key poses. To this end, we design a novel Transformer-based architecture which involves two single-modal transformer encoders for music and initial seed motion embedding, and a cross-modal transformer decoder for motion generation controlled by key pose constraints. We show that our model stably generates various dance sequences with different key poses driven by the same music.
1 Introduction
Dance consists of rhythmic movements performed along with music and is of aesthetic value. It is an art form practiced in many cultures for social, celebratory, and entertainment purposes. People express their emotions by dancing to different types of music. Recently, with the popularization of online media platforms, more and more users record dancing videos performed by themselves and, for entertainment or business, upload and share them on various online social applications. However, dancing is an art full of skills: a good dancer needs expensive professional training in many aspects, including basic dance motions as well as choreography. This makes it extremely difficult for most people to give an impressive dance performance. On the one hand, music-driven dance motion synthesis has a wide range of applications, including digital humans, dance assistance, gaming, etc. On the other hand, research on dance synthesis helps researchers explore better techniques for cross-modal sequence-to-sequence generation tasks. Hence, the idea of developing an automatic dance motion generation system with computer vision and artificial intelligence techniques arises naturally, for its potential commercial value as well as for academic exploration.
When it comes to music-driven dance generation with deep learning techniques from computer vision and computer graphics, music-motion paired data plays a significant role. In general, a large quantity of clean data is necessary to train a deep neural network. However, such data is collected with motion capture devices while professional dancers perform, which makes the process tedious and expensive. Regenerating dance motions from a limited set of music-motion pairs is one of the most efficient ways to enlarge a dataset at low cost. Our method can regenerate new dance motion sequences by randomly sampling key poses at different positions; the synthesized dance motion matches the original music while varying from pose to pose.
Synthesizing dance motions from music has been researched for many years and is becoming a hot topic in motion synthesis. Existing methods for dance synthesis fall roughly into two categories, i.e., retrieval-based frameworks built on motion graphs [arikan2002interactive, shiratori2006dancing, kim2003rhythmic, kim2006making, chen2021choreomaster] and deep generative model based methods [alemi2017groovenet, tang2018dance, li2021ai, li2021dancenet3d, huang2020dance, zhang2021dance, ren2019music, sun2020deepdance, lee2019dancing, ren2020self]. Early works mainly focus on motion graphs. The key idea of such frameworks is to generate motions by finding an optimal path in a pre-built motion graph. Each node in the graph represents a motion clip, while each directed edge carries the cost between two associated motion clips, accounting for the correlation and consistency of the transition. Kim et al. [kim2003rhythmic] propose to synthesize a new motion from unlabelled example motions that preserves the rhythmic pattern by traversing the motion graph. Aristidou et al. [aristidou2018style] attempt to synthesize style-coherent animation accounting for stylistic variations of the movement. The most up-to-date dance generation system based on a motion graph is proposed in [chen2021choreomaster]. In this work, Chen et al. train a deep neural network to learn a choreomusical embedding and incorporate it into a novel choreography-oriented graph-based motion synthesis framework with various choreographic rules considered. Although graph-based methods are explicitly interpretable, since the optimal path in the graph is straightforward to inspect, their drawbacks remain: the output dance synthesized with a motion graph is essentially a composition of existing motion clips in the database, which restricts the diversity of the dance.
Recently, benefiting from the development of deep learning in cross-modal understanding and sequence-to-sequence generation, deep generative models have become the standard scheme for music-driven dance synthesis. Tang et al. [tang2018dance]
use an LSTM-autoencoder to synthesize dance choreography with temporal indexes. To generate dance with controllable style, Zhang et al. [zhang2021dance] propose to transfer dance styles to dance motions, i.e., generating dance with a specific style. The Transformer has shown powerful capability in sequence modeling, and several works apply transformers to dance synthesis [huang2020dance, li2021ai, li2021dancenet3d, li2022danceformer]. To generate long dance sequences, Huang et al. [huang2020dance] propose a novel curriculum learning strategy to alleviate the error accumulation of auto-regressive models. Aristidou et al. [aristidou2022rhythm] present a neural framework to generate music-driven dance motion that forms a global structure respecting the culture of the dance genre. Duan et al. [duanautomatic2021] collect a large-scale dancing dataset with labeled and unlabeled music. Besides, Li et al. [li2021ai] release a new public multi-modal dataset of 3D dance motion and music, which marks a large step forward for dance motion synthesis research.
In that work, they also propose a full-attention cross-modal transformer network for 3D dance motion generation. Generating dance with deep neural networks yields greater diversity than graph-based methods. However, the fundamental shortcoming of such methods is the lack of controllability. In our proposed method, we attempt to generate dance motions with poses that are expected to appear in the output dance, which makes the generation process explicit and controllable.
2 Related Work
2.1 Human Motion Synthesis
Predicting future frames with initial pose or past motion has attracted researchers for a long time. Early works mainly focus on statistical sequential models for motion sequence prediction. Galata et al. [galata2001learning]
make use of variable-length Markov models for efficient behaviour representation and achieve good performance for long-term temporal motion prediction. Brand et al. [brand2000style]
approach the problem of stylistic motion synthesis by learning motion patterns from highly varied motion sequences with style-specific Hidden Markov Models (HMMs). Different from motion prediction with an initial pose, Petrovich et al. [petrovich2021action] attempt to synthesize motion from scratch conditioned on action categories. They design a Transformer-based architecture for encoding and decoding sequences of human poses parameterized by the SMPL model, in conjunction with generative VAE training.
2.2 Human Pose and Motion Modeling
A recent trend in 3D human pose and shape estimation is to use deep neural networks to learn a parametric human body representation. The Skinned Multi-Person Linear Model (SMPL) is a realistic 3D model of the human body that is widely used for various tasks, e.g., human pose and shape estimation [kolotouros2019learning], motion synthesis [petrovich2021action], action recognition [varol2021synthetic], etc. For a robust representation of the human body, graph neural networks (GNNs) are a good choice due to the intrinsic graph structure of the body. For motion sequence modeling, deep recurrent neural networks (RNNs) and transformer-based networks show powerful capability for sequence processing and achieve top performance on different tasks.
2.3 Cross-Modal Sequence Generation
Beyond motion synthesis conditioned on action categories, sequence generation with cross-modal data, involving vision, audio, text, etc., is becoming a research trend. In natural language processing and computer vision, a subset of prior works attempt to translate text instructions into action motions for virtual agents [hatori2018interactively]. Text-based motion synthesis aims to generate realistic motion sequences following the semantic meaning of the input text; the cross-modal data are expected to be semantically aligned while generating motions. Ahuja et al. [ahuja2019language2pose] propose to learn a joint embedding space for pose and language with curriculum learning. Semantic understanding of the input text plays an important role in this task. Ghosh et al. [ghosh2021synthesis] propose a hierarchical two-stream sequential model to explore a finer joint-level mapping between input sentences and pose sequences. Similar to natural language, audio data is also used as input for motion generation. Music-driven dance synthesis is a difficult task due to the complicated choreography involved. To generate long sequences of realistic dance motion, Transformer-based architectures are utilized in many prior works [li2021ai, huang2020dance, li2021dancenet3d].
3 Method
In this section, we present our approach to generating dance driven by music under key pose constraints. The proposed framework involves two transformer encoders for motion and music representation, respectively. To explore the correspondence between music and motion sequences, a cross-modal transformer is utilized to fuse the two representations and generate dance motions.
3.1 Framework and Formulation
We briefly discuss the formulation of our task. We are given a piece of music represented as $M = (m_1, \dots, m_{T_M})$, a seed sequence of motion represented as $X = (x_1, \dots, x_{T_X})$, and a sequence of key poses $P = \{(p_k, t_k)\}_{k=1}^{K}$, where $T_M$ and $T_X$ are the lengths of $M$ and $X$, respectively. $K$ is the number of key poses in $P$, while $t_k$ is the timestamp corresponding to key pose $p_k$, where $1 \le k \le K$. Our goal is to generate the motion sequence from time step $T_X + 1$ to $T_M$, under the constraint that the key poses are expected to appear in the generated motion at the corresponding time steps, which means the mean error, denoted $E_{kp}$, between the input key poses and the generated poses at those steps should be as small as possible.
Fig. 2 illustrates our dance motion generation framework. The model takes multiple inputs, including a piece of driving music, a segment of initial seed motion, and a sequence of key poses with their corresponding positions. Our dance generation framework consists of the following components.
Transformer encoders for motion and music. To generate dance motions, we need a representative embedding of music features. We utilize transformers to encode music, considering their powerful capability in sequence modeling, and adopt a two-stream transformer architecture to encode music and motion, respectively.
Motion generation with cross-modal transformer. The motion and music embeddings from the transformer encoders are concatenated along the time dimension and fed to a cross-modal transformer to learn the correspondence between the two modalities. The output of this transformer, which encodes the choreography contained in the correspondence between motion and music, is projected to the pose space with a linear layer to produce the predicted motion.
3.2 Transformer Encoders for Music and Motion
The driving music and initial seed motion are represented as $M = (m_1, \dots, m_{T_M})$ and $X = (x_1, \dots, x_{T_X})$, respectively. The input of the music transformer is a set of music features extracted from the raw audio, including 12-D chroma, 2-D downbeat, and 1-D onset features. We concatenate all these features, resulting in a 15-D music representation $m_t \in \mathbb{R}^{15}$. The transformer encodes the sequence with full attention among all time steps, which makes it widely used for modeling sequences with long-term dependencies. The transformer consists of scaled dot-product attention units. Scaled dot-product attention takes three inputs, queries, keys, and values, represented as $Q$, $K$, and $V$, respectively. These inputs are first embedded into the same feature space of dimension $D$ by a linear layer. Denoting the function of the linear layer as $f$, the attention weights are calculated as follows,
$$A = \mathrm{softmax}\!\left( \frac{f(Q)\, f(K)^\top}{\sqrt{D}} \right).$$
The output of scaled dot-product attention is as follows,
$$\mathrm{Attention}(Q, K, V) = A \, f(V).$$
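The scaled dot-product attention above can be sketched in a few lines of NumPy; here a single shared weight matrix `W` stands in for the linear embedding `f` (an illustrative simplification, not the paper's implementation):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, W):
    """Scaled dot-product attention with a shared linear embedding W.

    Q, K, V: (T, D_in) sequences; W: (D_in, D) projection into the
    common D-dimensional attention space (the f(.) in the text).
    """
    q, k, v = Q @ W, K @ W, V @ W
    D = q.shape[-1]
    scores = q @ k.T / np.sqrt(D)                    # scaled dot products
    w = np.exp(scores - scores.max(-1, keepdims=True))
    A = w / w.sum(-1, keepdims=True)                 # row-wise softmax
    return A @ v                                     # attention output
```

Each row of `A` sums to one, so every output step is a convex combination of the embedded values.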
In transformers, masking is a flexible mechanism for different tasks, controlling the attention relationships across the whole sequence. For auto-regressive sequence generation, the mask is designed as upper triangular to enable causal attention. The music features are first embedded into a high-dimensional latent space via a linear projection $f^A$. The queries, keys, and values of the music transformer are identical and set to the output of this linear layer. We use $\mathrm{Trans}^A$ to represent the music transformer (to avoid confusion with the motion transformer and to simplify notation, we use the superscript "A", for "audio", for music-related modules), and the output of the music transformer encoder is denoted as follows,
$$h^A = \mathrm{Trans}^A\!\left(f^A(M)\right).$$
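The upper-triangular causal mask described above can be illustrated as follows; entries above the diagonal are blocked so each time step attends only to itself and the past (a minimal standalone sketch):

```python
import numpy as np

def causal_mask(T):
    """Upper-triangular mask: True marks blocked (future) positions."""
    return np.triu(np.ones((T, T), dtype=bool), k=1)

def masked_softmax(scores, mask):
    """Softmax over attention scores with blocked entries suppressed."""
    scores = np.where(mask, -1e9, scores)  # effectively -inf for blocked
    w = np.exp(scores - scores.max(-1, keepdims=True))
    return w / w.sum(-1, keepdims=True)
```

With all-zero scores, the first row of the resulting attention matrix is one-hot: step 0 can only attend to itself.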
Each frame in a dance motion sequence consists of the human body pose and a global translation vector. The body pose is represented by 24 joints with 3-D rotation angles, and the 3-D translation vector indicates the global position of the root joint. We use the 6-D rotation representation proposed in [zhou2019continuity], considering its continuity. Thus, the motion data of each frame is represented as a 147-D vector by concatenating the rotation and translation vectors. Given an initial dance motion clip, denoted as $X$, we first embed the input into a high-dimensional feature space with a linear layer $f^X$. The motion transformer following the linear layer models the temporal correlations of the input embedding, generating a sequence of motion representations with attention over the whole input sequence. Analogously to the music transformer, we use $\mathrm{Trans}^X$ to represent the motion transformer, and the output of the motion transformer encoder is denoted as follows,
$$h^X = \mathrm{Trans}^X\!\left(f^X(X)\right).$$
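The 147-D per-frame representation (24 joints × 6-D rotation + 3-D root translation) can be assembled as below; following [zhou2019continuity], the 6-D representation keeps the first two columns of each 3×3 rotation matrix (a sketch for illustration, not the authors' code):

```python
import numpy as np

def rotmat_to_6d(R):
    """6-D rotation representation: first two columns of a 3x3 matrix."""
    return R[:, :2].T.reshape(6)

def frame_vector(joint_rotations, translation):
    """Pack 24 joint rotations and the 3-D root translation into the
    147-D per-frame motion vector: 24 * 6 + 3 = 147."""
    sixd = np.concatenate([rotmat_to_6d(R) for R in joint_rotations])
    return np.concatenate([sixd, np.asarray(translation, dtype=float)])
```

Unlike axis-angle or quaternions, this 6-D encoding is a continuous function of the rotation, which is why the paper adopts it.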
3.3 Motion Generation
To learn the correspondence between music and motion, the music and motion representations are concatenated along the time dimension as one of the inputs of the cross-modal transformer. The cross-modal transformer models multiple inputs, including the joint concatenation of music and motion embeddings, the key pose embedding, and the local positional embedding indicating where the key poses are expected to be located in the output motion sequence.
3.3.1 Key Pose Embedding.
For the key poses, we focus only on the posture of the human body and ignore the global translation. The trajectory, which consists of the global translations, affects the overall impression of the generated dance motion; however, the translation of an isolated key pose is meaningless and may even restrict dance motion generation. In consideration of the above, we concentrate on the posture of the human body with joint rotations only and discard the global translations. We embed the key poses into feature vectors of the same dimension as the music and motion representations calculated by Eq. 3 and Eq. 4, via a linear projection layer $f^P$. The positions without key poses are padded with zeros. The sequence of key pose embeddings is represented as follows,
$$h^P_t = \begin{cases} f^P(p_k), & t = t_k, \\ 0, & \text{otherwise}, \end{cases}$$
where $p_k$ is the key pose introduced in Sec. 3.1.
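The zero-padded key pose embedding can be sketched as follows; the linear projection is a plain weight matrix `W` here for illustration:

```python
import numpy as np

def key_pose_embedding(key_poses, timestamps, T, W):
    """Embed each key pose and place it at its timestamp; frames
    without a key pose are zero-padded.

    key_poses: (K, P) rotation-only pose vectors; timestamps: K ints;
    T: sequence length; W: (P, D) linear projection into the shared
    music/motion feature dimension.
    """
    h = np.zeros((T, W.shape[1]))
    for pose, t in zip(key_poses, timestamps):
        h[t] = pose @ W
    return h
```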
3.3.2 Local Positional Embedding.
Unlike Recurrent Neural Networks (RNNs), which process a temporal sequence step by step through recurrence, the Transformer ditches ordered temporal input in favor of a full attention mechanism across the entire sequence. Although such a design can capture long dependencies and accelerates training and inference through parallel computation, the model is deficient in that temporal-order information is omitted. To give the model a sense of order, a standard solution is to incorporate a position representation into the input sequence.
In the task of motion generation with key poses, the output pose of each frame is modeled from the whole input sequence. However, key poses at different positions contribute differently to the output pose: in general, the significance of a key pose for the output pose of the current frame depends on the distance between them, and key poses closer to the output pose contribute more to the prediction. Based on this consideration, we propose a local position embedding module to incorporate the position information of the adjacent key poses on both sides. The local position embedding mechanism is illustrated in Fig. 3. We first briefly review the ordinary positional encoding of [vaswani2017attention]. The position information is encoded with a positional embedding matrix $PE \in \mathbb{R}^{L \times D}$, where $L$ is the maximum possible sequence length and $D$ is the input dimension of the Transformer. The elements in row $pos$ of $PE$ are represented as follows,
$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/D}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/D}}\right).$$
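The sinusoidal positional embedding matrix of [vaswani2017attention] can be built as follows (assuming an even input dimension D):

```python
import numpy as np

def positional_embedding(L, D):
    """PE[pos, 2i]   = sin(pos / 10000^(2i/D)),
       PE[pos, 2i+1] = cos(pos / 10000^(2i/D))."""
    pos = np.arange(L)[:, None]
    i2 = np.arange(0, D, 2)[None, :]
    angles = pos / np.power(10000.0, i2 / D)
    PE = np.zeros((L, D))
    PE[:, 0::2] = np.sin(angles)
    PE[:, 1::2] = np.cos(angles)
    return PE
```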
The local positional embedding is derived from $PE$. For the current input at time step $t$, we denote the left and right adjacent key poses as $p_l$ and $p_r$, at temporal positions $t_l$ and $t_r$, respectively. The local positional embedding consists of two parts, corresponding to the relative positions of the left and right adjacent key poses. We obtain the ordinary positional embedding of the key pose on the left side by indexing the positional embedding matrix with the relative temporal distance $t - t_l$, represented as
$$e^l_t = PE_{(t - t_l)}.$$
Similarly, the embedding $e^r_t$ of the key pose on the right side is as follows,
$$e^r_t = PE_{(t_r - t)}.$$
The local relative positional embedding for time step $t$ is the concatenation of $e^l_t$ and $e^r_t$, represented as follows,
$$e_t = [e^l_t;\, e^r_t],$$
where $[\cdot;\cdot]$ denotes the concatenation operation.
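Indexing and concatenating the two relative embeddings can be sketched as below; `PE` is any positional embedding matrix (e.g. the sinusoidal one), passed in as an argument for illustration:

```python
import numpy as np

def local_positional_embedding(t, t_left, t_right, PE):
    """Concatenate the positional embeddings indexed by the relative
    distances to the adjacent key poses on the left and right."""
    e_left = PE[t - t_left]    # distance to the left key pose
    e_right = PE[t_right - t]  # distance to the right key pose
    return np.concatenate([e_left, e_right])
```

Frames close to a key pose thus receive a low-index (and hence distinctive) embedding on that side, which is how the model senses key-pose proximity.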
3.3.3 Cross-Modal Transformer.
We use a cross-modal Transformer to learn the correlation between the initial motion embedding and the music embedding. The motion representation $h^X$ and music representation $h^A$ are concatenated along the time axis as one of the cross-modal Transformer inputs. The other information cues to be modeled by the cross-modal Transformer are the key pose embedding and the local positional embedding. Thus, the overall input of the cross-modal Transformer is the summation of the local positional embedding, the key pose embedding, and the music-motion joint embedding. Denoting the overall input as $Z$, it is derived as follows,
$$Z = [h^X;\, h^A] + h^P + e + PE,$$
where $h^X$, $h^A$, and $h^P$ correspond to the embeddings of motion, music, and key poses, respectively, $e$ is the local positional embedding introduced in the previous section, and $PE$ is the ordinary position encoding calculated in Eq. 7. We denote the mapping function of the cross-modal Transformer as $\mathrm{Trans}^C$. Passing the overall input embedding through $\mathrm{Trans}^C$, we obtain the output sequence $h^C = \mathrm{Trans}^C(Z)$. We use a linear transformation layer $f^{out}$ to project the output sequence into the final predicted motions, which contain a sequence of poses and global translation vectors; the final predictions are as follows,
$$\hat{Y} = f^{out}(h^C).$$
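Assembling the cross-modal Transformer input and the final linear projection can be sketched as follows; how the key-pose and positional terms are aligned with the concatenated music-motion sequence is an assumption of this sketch, and a real model would run the summed input through the transformer layers before the output projection:

```python
import numpy as np

def cross_modal_input(h_motion, h_music, h_key, e_local, PE):
    """Overall input Z: music and motion embeddings concatenated along
    the time axis, summed with the key-pose embedding, the local
    positional embedding, and the ordinary positional encoding."""
    joint = np.concatenate([h_motion, h_music], axis=0)
    T = joint.shape[0]
    return joint + h_key[:T] + e_local[:T] + PE[:T]

def predict_motion(Z, W_out):
    """Linear projection of the (transformed) sequence into the 147-D
    pose + translation space."""
    return Z @ W_out
```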
To optimize the network, we design a weighted reconstruction loss on motions. We define the consistency error between the key poses and the corresponding generated poses at the same time steps. Given a sequence of key poses $P$ and the predicted motions $\hat{Y}$, the consistency error is calculated as the mean squared error (MSE) as follows,
$$E_{kp} = \frac{1}{K} \sum_{k=1}^{K} \left\| p_k - \hat{y}_{t_k} \right\|_2^2,$$
where $K$ is the number of key poses and $t_k$ corresponds to the time step of key pose $p_k$. As illustrated in the previous section, our goal is to generate the motion sequence under the constraint that the key poses appear in the generated motion at the corresponding time steps, which means $E_{kp}$ is supposed to be as small as possible. To this end, the error at each key pose position is enlarged by multiplying it with a large weight $w_t$. The weighted reconstruction loss between the predicted motions $\hat{Y}$ and the ground-truth sequence $Y$ is calculated as follows,
$$\mathcal{L} = \frac{1}{T} \sum_{t=1}^{T} w_t \left\| y_t - \hat{y}_t \right\|_2^2.$$
$\lambda$ is a tunable hyper-parameter which balances the smoothness of the whole dance motion sequence against the consistency with the key poses, while $\sigma$ controls the range affected by the key poses. An example for a fixed distribution of key poses with different parameter pairs $(\lambda, \sigma)$ is shown in Fig. 4. From this figure, we notice that the maximum weight appears at the key frame where the key pose is located, and the weight decreases as the distance from the key pose increases. A small $\sigma$ means a sharp decline of the weight in the reconstruction loss defined in Eq. 14. Obviously, there is a trade-off between the smoothness of the whole generated dance motion and the consistency with the key poses. With the increase of $\lambda$, the model learns to generate dance motions with a smaller consistency error $E_{kp}$, which makes the output poses at the key frames much closer to the input key poses.
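As an illustrative sketch of the weighted loss, the Gaussian-shaped weight below is an assumption: the text specifies only that the weight peaks at each key frame and decays with distance, with λ scaling the peak and σ its range:

```python
import numpy as np

def key_frame_weights(T, key_steps, lam, sigma):
    """Per-frame weights: 1 everywhere, raised near key frames.
    The Gaussian bump shape is a hypothetical choice for illustration."""
    t = np.arange(T)[:, None]
    k = np.asarray(key_steps)[None, :]
    bumps = np.exp(-((t - k) ** 2) / (sigma ** 2))
    return 1.0 + lam * bumps.max(axis=1)

def weighted_reconstruction_loss(pred, gt, weights):
    """Weighted MSE between predicted and ground-truth motion frames."""
    per_frame = ((pred - gt) ** 2).mean(axis=-1)
    return float((weights * per_frame).mean())
```

A larger `lam` magnifies errors at key frames (improving consistency at the cost of smoothness), while a larger `sigma` widens the affected neighborhood.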
4 Experiments
4.1 Dataset and Implementation Details
We conduct our experiments on AIST++ [li2021ai], a large-scale 3D human dance motion dataset consisting of 3D motion paired with music. AIST++ supports a variety of cross-modal analysis and synthesis tasks, including dance generation conditioned on music, human motion prediction, etc. We split the dataset for training and evaluation following the setup in [li2021ai].
We use the Adam optimizer [kingma2014adam]
with an initial learning rate of 1e-4 and a batch size of 20. The learning rate drops to 1e-5 and 1e-6 after 100k and 250k iterations, respectively. We train two models with different numbers of parameters. For the large model, the cross-modal transformer has 12 layers with 10 attention heads and a hidden size of 800 per layer. For the light model, the number of layers and the hidden size are reduced to 8 layers with 4 attention heads and a hidden size of 256 per layer. The motion and music transformers in both the large and light models have 4 layers each. The whole framework is implemented in PyTorch, and experiments are performed on an NVIDIA Tesla V100 GPU.
4.2 Experimental Results
4.2.1 Evaluation metric.
We utilize two different evaluation metrics for dance generation, measuring key pose consistency and smoothness, respectively. On one hand, the synthesized dance motion is controlled by the given key poses, so it is important to keep the generated poses at key frames as similar as possible to the given key poses. Hence, we use the mean squared error (MSE) between the generated and input key poses, formulated in Eq. 13. On the other hand, smoothness is one of the most important evaluation metrics in motion synthesis. For smoothness evaluation, we utilize the coefficient of variation, defined as the ratio of the standard deviation to the mean, of the temporal difference of the generated pose sequence, which is formulated as follows,
$$s = \frac{\mathrm{std}(d)}{\mathrm{mean}(d)},$$
where $\mathrm{std}(\cdot)$ and $\mathrm{mean}(\cdot)$ correspond to the standard deviation and mean value of the sequence, respectively, and $d$ denotes the temporal difference of the generated motion sequence. A smaller smoothness score means the generated dance tends to be smoother. In addition, we use the beat hit rate to evaluate whether the dance motion beats and musical beats are aligned with each other. The beat hit rate is defined as $B_a / B_m$, where $B_a$ is the number of dance motion beats that are aligned with musical beats, and $B_m$ is the number of musical beats. A dance motion beat at time $t$ is regarded as aligned if a musical beat occurs in its adjacent temporal interval $[t - \Delta/\mathrm{FPS},\ t + \Delta/\mathrm{FPS}]$, where $\mathrm{FPS}$ is the frame rate of the dancing video and $\Delta$ is a hyper-parameter controlling the range of the temporal interval.
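The two metrics can be computed as follows (a sketch; beat times are taken in seconds, and the extraction of motion and music beats is outside the snippet):

```python
import numpy as np

def smoothness_score(motion):
    """Coefficient of variation of the temporal difference of a pose
    sequence of shape (T, D); smaller means smoother motion."""
    d = np.linalg.norm(np.diff(motion, axis=0), axis=-1)
    return float(d.std() / d.mean())

def beat_hit_rate(motion_beats, music_beats, fps, delta):
    """B_a / B_m: motion beats with a musical beat within +-delta/fps
    seconds, divided by the number of musical beats."""
    hits = sum(
        any(abs(db - mb) <= delta / fps for mb in music_beats)
        for db in motion_beats
    )
    return hits / len(music_beats)
```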
4.2.2 Study on different hyper-parameters.
Our model is optimized with the weighted reconstruction loss defined in Eq. 14, which has two hyper-parameters, $\lambda$ and $\sigma$. The results with different combinations of $\lambda$ and $\sigma$ are given in Tab. 1-Tab. 4. From the results in Tab. 1 and Tab. 3, we notice that $\sigma$ has less effect on the consistency error than $\lambda$. With the increase of $\lambda$, the consistency error of the key poses gradually decreases, as also shown in Fig. 5. This is reasonable, since a larger $\lambda$ enforces the network to pay more attention to the error at the key frames.
As for the smoothness scores in Tab. 2 and Tab. 4, the smoothness score moves in the opposite direction to the consistency error, as shown in Fig. 6. With the increase of $\lambda$, the smoothness score gradually increases as well, which means a large value of $\lambda$ can negatively affect the smoothness of the synthesized dancing videos. These results indicate a trade-off between the consistency error with the key poses and the smoothness score of the synthesized dance motion, as also mentioned in Sec. 3.4. Although a large value of $\lambda$ enforces the generated key poses to have high similarity with the given anchor poses, it does degrade the smoothness of the whole synthesized dancing video. We can adjust $\lambda$ to generate dance motions according to our preference for consistency or smoothness.
4.2.3 Beat hit rate.
The results of beat hit rate with different values of the hyper-parameter $\Delta$ are shown in Tab. 5 and Fig. 7. In our experiments, the video FPS is set to 30, and the temporal range hyper-parameter $\Delta$ ranges from 1 to 5. A small value of $\Delta$ indicates a strict alignment between dance motion beats and musical beats. The beat hit rate of the synthesized dance motion is close to that of the real dance motion, which means our algorithm performs well on beat alignment between the generated dance motion and the input music.
We visualize the synthesized dance motion sequences with different seed motions under controllable key pose constraints, as shown in Fig. 8. We synthesize two dance motion sequences with different seed motions, driven by the same music and the same sequence of anchor key poses. From the visualization, we can see that the synthesized poses at the key frames have the same appearance as the input anchor key poses, maintaining good consistency, while smoothness is preserved as well. The diversity of the generated dance motions is achieved through variation of the seed motions. Besides, we also visualize the alignment between dance motion beats and musical beats, shown in Fig. 9. A dance motion beat is defined as a local extremum of the kinetic velocity. We mark the dance motion beats with red circles. As we can see, the dance motion beats are aligned with the music beats (orange dotted lines).
5 Conclusion
In this paper, we propose a novel framework for dance motion synthesis based on music and key pose constraints. We target synthesizing high-quality dance motion driven by music as well as customized poses provided by users. Our model consists of single-modal transformer encoders for the music and motion representations, and a cross-modal transformer decoder for dance motion generation. We use a local positional embedding to encode the relative position information of adjacent key poses; with this mechanism, the cross-modal transformer is sensitive to the key poses and their corresponding positions. Extensive experiments show that our dance synthesis model achieves satisfactory performance on both quantitative and qualitative evaluations, which demonstrates the effectiveness of the proposed method.