American poet Henry Wadsworth Longfellow called music the universal language of mankind. “Music is the art of arranging sounds in time to produce a composition through the elements of melody, harmony, rhythm, and timbre” (Wikipedia). Given this richness, researchers have extensively explored automatic music generation. Jukebox [5] is a recent work that generates music of different genres, artists, and lyrics.
Music is naturally connected with dance. It sets the mood, maintains a dancer’s beat, and can amplify a dance’s emotional affect. In fact, music and movement are represented similarly in the brain, using a shared neural code [15]. This connection between dance and music makes it interesting to study the two complementary art forms together. While several deep learning [8, 10] and search-based [16] solutions have been proposed to generate dance for given music, the other direction of automatically generating music for a dance is still relatively unexplored.
Such an algorithm has applications in creating short-form videos popular on social media platforms, in home assistants that can generate music as a family is playfully dancing around, in adding rhythm to our workouts making them more enjoyable, etc. By allowing people to generate music via their body movements, the tool enables an interesting confluence of our senses.
As a preliminary step in this direction, we represent music as a sequence of notes from the C major pentatonic scale. Dance is processed as a series of human poses extracted from an existing video or an incoming video feed. We generate a sequence of notes that aligns with the movement pattern in the dance poses. We present a search-based offline approach that generates music after processing the entire dance video and an online approach that uses a deep neural network to generate music in real-time. We have integrated this online approach in a live demo. A video of the demo can be found here: https://sites.google.com/view/dance2music/live-demo. We also present a strong baseline and compare our approaches on 10 dance clips via human studies. We also quantitatively evaluate our online approach on 45 videos.
Generating dance from music. Several works automatically generate dance from music; the inverse is relatively underexplored. Early works [6, 11] generate dance from music using retrieval-based methods. The authors of [16] propose a search-based approach to generate dance from music such that the overall spatio-temporal movement pattern matches the overall structure of the music. Several deep learning approaches [8, 10] have also been proposed to do the same.
Generating sound from silent videos. Several approaches reproduce missing sound from a video clip of people playing musical instruments [7] or of someone sneezing or coughing [4]. Our task is to generate novel music that goes well with the dance, rather than reproduce reality. In fact, there may have been no music playing with the dance in reality.
Controlling audio from body movement. Several works use sensors to control music generation through hand movements and eye gaze [14, 18]. The system in [13] tracks movement by attaching sensors to the bodies of dancers to guide music generation. In contrast, our setup has a lower barrier to entry (it uses just a camera) and can also generate music post hoc (from a dance video).
Our goal is to generate music that goes well with an input dance. Similar to [16], we hypothesize that if notes are (dis)similar when the associated poses in the dance are (dis)similar, the music will feel synced with the dance.
Our approach maximizes this correlation between poses and notes.
Dance representation. In this preliminary work, we assume a single dancer in the video. We represent dance as a sequence of poses. For an existing dance video, we extract 30 frames per second and use OpenPose [2] to estimate the person’s pose for each extracted frame. Each pose consists of 18 2D keypoints (see https://github.com/CMU-Perceptual-Computing-Lab/openpose/blob/master/doc/02_output.md#pose-output-format-coco) and is represented by a 36-dimensional continuous vector normalized by the image size so that the values are between -1 and 1. The same representation is used when processing a live dance video feed. (To run a live demo on CPU, we use the Intel OpenVINO toolkit [9] because of its optimized inference on CPU.)
Music representation. We use piano notes in the C major pentatonic scale – C4, D4, E4, G4 and A4 – to compose our music. The pentatonic scale is found in different forms in most of the world’s music [12]. Additionally, notes in a pentatonic scale are so consonant with each other that even a random sequence is pleasing to listen to.
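As a minimal sketch of the two representations above: the exact normalization is not spelled out in the text, so the code assumes that pixel coordinates are divided by the image size and rescaled from [0, 1] to [-1, 1]. The names `pose_to_vector` and `note_to_index` are hypothetical, and the ordinal note encoding (0 to 4, ordered by frequency) follows the one used later in the approach.

```python
# Illustrative sketch; function names and the normalization formula are assumptions.
NOTES = ["C4", "D4", "E4", "G4", "A4"]  # C major pentatonic, ordered by frequency


def pose_to_vector(keypoints, width, height):
    """Flatten 18 (x, y) pixel keypoints into a 36-d vector with values in [-1, 1].

    Assumption: coordinates are divided by the image size, then mapped
    from [0, 1] to [-1, 1].
    """
    if len(keypoints) != 18:
        raise ValueError("expected 18 keypoints per pose")
    vector = []
    for x, y in keypoints:
        vector.append(2.0 * x / width - 1.0)
        vector.append(2.0 * y / height - 1.0)
    return vector


def note_to_index(name):
    """Ordinal encoding of a note: C4 -> 0, D4 -> 1, E4 -> 2, G4 -> 3, A4 -> 4."""
    return NOTES.index(name)
```

Keeping notes as small integers makes it straightforward to compare two notes by the absolute difference of their indices.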
In our offline approach, we generate music for an existing dance video. Given a sequence of $N$ dance frames and a hyper-parameter $k$, we generate a music sequence of length $n = N/k$. $k$ – the music interval – denotes the number of dance frames after which we play a musical note. Based on an initial qualitative assessment, we set $k$ to be 6 in our experiments. Recall that our aim is to generate music that is maximally correlated with the dance.
We bring dance and music to a unified similarity matrix representation and then compute the Pearson correlation between the two vectorized matrices.
Let $D$ be the dance similarity matrix and $M$ be the music similarity matrix: entry $(i, j)$ of $D$ captures how similar pose[i] and pose[j] are, and entry $(i, j)$ of $M$ captures how similar note[i] and note[j] are.
Here pose[i] denotes the 36-dimensional representation of the $i$-th dance frame, and note[i] represents the note played at the $i$-th music interval. We represent the notes C4, D4, E4, G4 and A4 in an ordinal manner as 0, 1, 2, 3 and 4, ordered by frequency. The matrices capture how similar two poses or two notes are at two points in time. In order to compute the correlation between the two, we resize $M$ to be of the same size as $D$ using nearest neighbour interpolation. Recall that $M$ is smaller than $D$ because we only play a note every $k$ frames, where $k$ is the music interval.
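The following sketch illustrates the pipeline described above: build the two similarity matrices, resize the music matrix with nearest neighbour interpolation, and correlate the flattened matrices. The similarity functions are an assumption, since the defining formulas are not reproduced in this text: similarity is taken as negative Euclidean distance for poses and negative absolute difference for ordinal notes. All function names are hypothetical.

```python
import math


def similarity_matrix(items, distance):
    """Pairwise similarity, taken here as negative distance (an assumption)."""
    return [[-distance(a, b) for b in items] for a in items]


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def note_distance(a, b):
    return abs(a - b)


def resize_nearest(matrix, size):
    """Nearest neighbour resize of a square matrix to size x size."""
    n = len(matrix)
    index = [min(n - 1, i * n // size) for i in range(size)]
    return [[matrix[r][c] for c in index] for r in index]


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def dance_music_correlation(poses, notes):
    """Correlate the dance matrix with the resized music matrix."""
    dance = similarity_matrix(poses, euclidean)
    music = resize_nearest(similarity_matrix(notes, note_distance), len(poses))
    flat_dance = [v for row in dance for v in row]
    flat_music = [v for row in music for v in row]
    return pearson(flat_dance, flat_music)
```

Because Pearson correlation is invariant to positive scaling and shifting of either input, any similarity that is an affine, order-preserving function of these distances would give the same score in this sketch.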
Our objective is to choose a sequence of notes such that the Pearson correlation between the corresponding music and dance similarity matrices is maximized. As the search space is exponential in the number of notes (e.g., $5^{60}$ sequences for a 12-second video, which has 60 note slots at 30 frames per second and one note every 6 frames), an exhaustive search is infeasible.
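To make the size of the search space concrete, the short calculation below counts note slots and candidate sequences, using the frame rate (30 frames per second), the music interval (6 frames) and the 5-note scale stated in the text. The function names are hypothetical, and the count treats every slot as free.

```python
FPS = 30            # frames extracted per second (from the text)
MUSIC_INTERVAL = 6  # frames between consecutive notes (from the text)
NUM_NOTES = 5       # notes in the C major pentatonic scale


def num_note_slots(seconds):
    """Number of notes played over a video of the given length."""
    return seconds * FPS // MUSIC_INTERVAL


def search_space_size(seconds):
    """Number of distinct note sequences when every slot is free."""
    return NUM_NOTES ** num_note_slots(seconds)
```

For a 12-second clip this gives 60 note slots and more than 10^41 candidate sequences, which is why a heuristic search is used instead of enumeration.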
Search. We use beam search to find the sequence with the highest correlation with the dance. With the first note fixed at E4, the generation advances from left (start of the video) to right (end of the video). At each step of beam search, we maintain a candidate list of $b$ sequences of notes (the beam) that currently have the highest correlation with the dance. We append each of the 5 candidate notes to each sequence, resulting in $5b$ sequences. Each of these sequences is scored using its correlation with the dance. We keep the top $b$ and discard the rest. This is repeated until all notes have been generated. Instead of using the entire history of dance (and notes) in computing the correlation, we use the local history (of 10 notes and the associated 60 frames). We find that using the local history allows us to capture more minute details in the dance, while using the global history results in more “flat” music (see Figure 2). This approach is illustrated in Figure 3. We use a fixed beam size $b$ in our experiments.
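The search procedure above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: similarity is taken as negative distance, note i is assumed to cover frames i*k to (i+1)*k - 1, and the default `beam_size=3` is a hypothetical value because the beam size used in the experiments is not given here.

```python
import math

NUM_NOTES = 5   # C4, D4, E4, G4, A4 encoded as 0..4
FIRST_NOTE = 2  # E4, fixed as the first note in the text


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def local_score(poses, notes, k, history):
    """Correlation of pose and note similarities over the local history."""
    recent = notes[-history:]
    end = len(notes) * k
    frames = poses[end - len(recent) * k:end]
    dance, music = [], []
    for i, p in enumerate(frames):
        for j, q in enumerate(frames):
            dance.append(-math.dist(p, q))  # assumed pose similarity
            # i // k maps a frame to its note: nearest neighbour upsampling
            music.append(-abs(recent[i // k] - recent[j // k]))
    return pearson(dance, music)


def beam_search(poses, k=6, beam_size=3, history=10):
    """Return the best note sequence found; beam_size is a hypothetical default."""
    n = len(poses) // k
    beams = [[FIRST_NOTE]]
    for _ in range(n - 1):
        candidates = [seq + [note] for seq in beams for note in range(NUM_NOTES)]
        candidates.sort(key=lambda s: local_score(poses, s, k, history), reverse=True)
        beams = candidates[:beam_size]
    return beams[0]
```

For example, `beam_search(poses)` on a list of 36-dimensional pose vectors returns one note index per 6 frames. Scoring only the last `history` notes keeps the cost of each step independent of the video length.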
The above approach cannot be run in real-time due to beam search – the note to be produced at a given time is not decided until the computation for future notes has been done. To enable real-time note generation, we model the problem using neural networks. We run the offline approach on a set of videos to collect paired dance and music data. Using this data, we train a neural network to take in as input the local history of the dance poses and generated notes, and produce as output the note to be played next. We model this as a 5-way classification task (to generate one of the 5 notes).
Figure 4 shows the model architecture. We process the past dance similarity matrix via 6 CNN layers, each of kernel size 3, with 64, 128, 128, 256, 512 and 32 filters respectively. Max pooling is done after the first and third convolutions, and ReLU activation is used after each CNN layer. Global average pooling is performed to get the dance feature representation. The past music note sequence is fed as input to an LSTM with hidden dimension 32, and the resulting features are concatenated with the dance features. Three fully connected (FC) layers of sizes 512, 256 and 128, each followed by ReLU activation, are applied over the concatenated features. Finally, an FC layer of size 5 is applied to get the final logits. The network is trained with the Adam optimizer for 200 epochs with a learning rate of 2e-4. The dance similarity matrix and the music sequence are padded to ensure that each data point in a batch is of the same size.
For generating music on-the-fly as dance frames come in, we start with the first note being E4, and then sample iteratively from our trained model. The note predicted at time $t$, along with the local history of both generated notes and dance poses, is fed as input for the note prediction at time $t+1$.
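The iterative generation loop can be sketched as below. The trained network is replaced by a hypothetical callable `predict_next(recent_poses, recent_notes)` that returns a note index from 0 to 4; the window sizes (10 notes, 6 frames per note) follow the local history used in the offline approach, and applying them here is an assumption.

```python
from collections import deque

FIRST_NOTE = 2  # E4, the fixed first note
HISTORY = 10    # notes of local history (assumed to match the offline approach)
K = 6           # music interval: frames per note


def generate_online(frame_stream, predict_next, k=K, history=HISTORY):
    """Yield one note index for every k incoming pose frames.

    predict_next is a stand-in for the trained model; it receives the
    recent poses and recent notes and returns the next note index.
    """
    recent_poses = deque(maxlen=history * k)
    recent_notes = deque(maxlen=history)
    for count, pose in enumerate(frame_stream, start=1):
        recent_poses.append(pose)
        if count % k:
            continue  # a note is emitted only at the end of each interval
        if recent_notes:
            note = predict_next(list(recent_poses), list(recent_notes))
        else:
            note = FIRST_NOTE
        recent_notes.append(note)
        yield note
```

Because the function is a generator, it can consume a live feed frame by frame and emit each note as soon as its interval ends.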
The task that the neural network is modelling is difficult because it has access to partial information. For the offline approach, the note generated at any time is determined based on not just past but also future notes and frames. However, our neural network doesn’t have access to future notes and frames (thus enabling the “online” use case).
Baseline. Inspired by the view [1] that creativity is a combination of quality (value) and surprise (novelty), we design a baseline that generates a sequence of notes that is both in sync with the dance and unpredictable. We select a random note when the similarity between poses across a music interval is below a threshold; otherwise, we continue playing the previous note. We use a fixed percentile of all similarities over the training dance frames as the threshold. We choose this percentile instead of the median to ensure that a note change occurs frequently and the generated music is not flat.
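A minimal sketch of this baseline follows. Several details are assumptions: similarity is taken as negative Euclidean distance between the poses at the start of consecutive music intervals, the first note is drawn at random, and the threshold is passed in by the caller because the percentile used in the experiments is not given here. The `percentile` helper uses the nearest-rank method, which is one of several common definitions.

```python
import math
import random


def percentile(values, q):
    """Nearest-rank percentile of values, with q in [0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def baseline_notes(poses, threshold, k=6, num_notes=5, seed=None):
    """Repeat the previous note unless the pose changed enough over an interval."""
    rng = random.Random(seed)
    notes = [rng.randrange(num_notes)]  # assumed: random first note
    for start in range(k, len(poses) - k + 1, k):
        # Assumed similarity: negative distance across one music interval.
        similarity = -math.dist(poses[start - k], poses[start])
        if similarity < threshold:
            notes.append(rng.randrange(num_notes))  # surprise: pick a random note
        else:
            notes.append(notes[-1])                 # sync: hold the previous note
    return notes
```

A threshold could be obtained with, for instance, `percentile(training_similarities, q)` for some chosen `q`, where `training_similarities` is a hypothetical list of interval similarities computed over the training videos.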
Examples of music composed by our various algorithms can be found here: https://sites.google.com/view/dance2music.
Dataset. We perform our experiments using the AIST dataset [17] and videos from [3]. The former consists of street dance performed by different people across 10 different genres. Each dance video is captured from 9 angles; we use the ones with a front-facing camera view. The latter consists of short YouTube videos where a single subject dances in front of a static camera. All videos are seconds long.
Automatic metrics. We start by evaluating how well the neural network in our online approach mimics the offline approach. We train the model on 455 videos from the AIST dataset, and evaluate it on 45 videos. It achieves a test accuracy of 73.5% at the task of predicting the next note accurately (that is, match the note produced by the offline approach). Note that chance performance would be 20%. We also compute the Pearson correlation between the generated note sequence and the dance sequence – the metric the offline approach was optimizing for – across all the test videos. The mean correlation of the online model is 0.33. As a reference, the mean correlation of the offline approach (which our neural network model is mimicking) is 0.38. Figure 5 shows dance and music similarity matrices for both approaches.
Human Study. Next, we conduct human studies to compare our approaches to the strong baseline presented earlier. We use 10 videos – 8 from AIST and 2 from [3]. This gives us 10 video pairs for each comparison type: offline vs. baseline and online vs. baseline. For each comparison, we show subjects a pair of videos that contain the same dance but with music generated by the different approaches. We ask subjects: Which music composition goes better with the dance? Each pair is evaluated by 3 subjects. 18 subjects (11 male, 7 female) voluntarily participated in the study. Their ages range from 16 to 49 years. Each subject evaluated up to 3 video pairs for each comparison type.
A one-sample proportion hypothesis test suggests that for our sample size of 30 (3 subjects × 10 video pairs), a “win ratio” over 0.633 (19 out of 30), or below 0.367 (11 out of 30) is statistically significant at 95% confidence.
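The threshold above can be checked with a short calculation. The text names only a one-sample proportion test, so the sketch assumes an exact one-sided binomial test against a chance win ratio of 0.5; the function names are hypothetical.

```python
from math import comb


def upper_tail(wins, trials, p=0.5):
    """P(X >= wins) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, x) * p ** x * (1 - p) ** (trials - x)
        for x in range(wins, trials + 1)
    )


def min_significant_wins(trials, alpha=0.05, p=0.5):
    """Smallest number of wins whose one-sided p-value is below alpha."""
    for wins in range(trials + 1):
        if upper_tail(wins, trials, p) < alpha:
            return wins
    return None
```

Under this assumption, `min_significant_wins(30)` returns 20, so any win count above 19 out of 30 is significant at the 5% level, which is consistent with the stated threshold.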
Table 1 shows the results of our human study. For offline vs. baseline, offline was preferred 77% of the time (23 out of 30), which is statistically significant. Our search-based approach finds a sequence of notes such that (dis)similar poses in the dance have (dis)similar notes associated with them. This result shows that subjects find its music to go better with the dance than that of a strong baseline that includes both sync (quality) and randomness (novelty).
Subjects preferred the online approach over the baseline 70% of the time (21 out of 30), which is also statistically significant. This, combined with the automatic metrics reported above where the online approach mimics the offline approach well, suggests that our neural network model is a promising direction for generating music on-the-fly that goes well with a live dance. A video of a live demo of our online approach can be found here: https://sites.google.com/view/dance2music/live-demo.
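Table 1: Human study results.
| Comparison | Preferred approach | Wins (out of 30) | Win ratio |
|---|---|---|---|
| Offline vs. Baseline | Offline | 23 | 0.77 |
| Online vs. Baseline | Online | 21 | 0.70 |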
Conclusion and Future Work
We present a preliminary study of automatic music generation conditioned on input dance. We present two approaches, one for offline generation for an existing video, and one for real-time generation for an ongoing dance. We compare the approaches against a strong baseline using human studies. Human subjects find the music from our offline and online approaches to be significantly more aligned with the dance than that of our baseline. Future work involves exploring more dimensions of music (increasing the number of notes and instruments, including chords, playing notes at different frequencies instead of at every music interval), and more varied videos (different dance forms, more than one person in a video).
Acknowledgment. We thank Larry Zitnick, Kristin Galvin and David Kant for helpful discussions.
References
[1] The creative mind. Abacus, London, 1992.
[2] OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. IEEE PAMI 43 (1).
[3] Everybody Dance Now. In ICCV, 2019.
[4] Generating Visually Aligned Sound From Videos. TIP, 2020.
[5] Jukebox: A Generative Model For Music. arXiv preprint arXiv:2005.00341, 2020.
[6] Example-Based Automatic Music-Driven Conventional Dance Motion Synthesis. IEEE TVCG 18 (3), 2012.
[7] Foley Music: Learning To Generate Music From Videos. In ECCV, 2020.
[8] Dance Revolution: Long-Term Dance Generation With Music Via Curriculum Learning. arXiv preprint arXiv:2006.06119, 2020.
[9] OpenVINO Toolkit. https://software.intel.com/en-us/openvino-toolkit, 2020.
[10] Dancing To Music. arXiv preprint arXiv:1911.02001, 2019.
[11] Music Similarity-Based Approach To Generating Dance Motion Sequence. Multimedia Tools and Applications 62 (3), 2013.
[12] Pentatonic Scale. https://www.britannica.com/art/pentatonic-scale, 2011.
[13] SICIB: An Interactive Music Composition System Using Body Movements. Computer Music Journal 25 (2), 2001.
[14] Infotainment Devices Control By Eye Gaze And Gesture Recognition Fusion. IEEE Transactions on Consumer Electronics 54 (2), 2008.
[15] Visual And Auditory Brain Areas Share A Neural Code For Perceived Emotion. bioRxiv, 2019.
[16] Feel The Music: Automatically Generating A Dance For An Input Song. arXiv preprint arXiv:2006.11905, 2020.
[17] AIST Dance Video Database: Multi-Genre, Multi-Dancer, And Multi-Camera Database For Dance Information Processing. In ISMIR, 2019.
[18] Hand Gesture Based Music Player Control In Vehicle. In IEEE I2CT, 2019.