With the blooming of mobile technologies, daily updates of personal news from celebrities, athletes, and people of influence have become a cultural phenomenon. Most content posted on social media platforms such as Facebook and Twitter has gradually drifted from pure text to multimedia such as videos and pictures. According to Twitter Marketing UK, tweets with videos, which are the most shared media type, are retweeted six times more than those with photos, and three times more than those with GIFs.
The most popular videos on social media, shared and liked by millions, are often accompanied by background music carefully edited to complement the footage. Although the improved cameras of modern mobile devices have made filming high-quality videos much easier, producing suitable background music to match those videos remains challenging. Copyrighted music created by human experts might not synchronize with the movements of people in the video, and even when expert-created music would fit, copyright restrictions often prevent us from using it.
To tackle the above problems, an automatic music generation system is helpful. In addition to avoiding copyright infringement when using existing music, automatic music generation for video makes editing a multimedia post easier and more efficient. To this end, we propose a novel model to generate symbolic piano music for video. To the best of our knowledge, no previous work has dealt with generating music for video. Given that the goal of the task is to generate music from a given video, video music generation (VMG) is challenging in several aspects.
Firstly, there is no existing dataset on which we can train models for the VMG task. Previous work on background music recommendation [16, 11, 17] extracts emotion from facial expressions, voice, physiological signals, and text, and then maps it onto musical elements including melody, rhythm, tempo, and text. Hence, that work focuses on modeling overall features rather than local ones (e.g., matching a dancing person, frame by frame, to individual music notes). We expect the generated piano notes to suit every video frame. Therefore, we release a new dataset, MIDI and Video Edited for Synchronous Piano Notes and Music Videos, composed of over 7 hours of piano scores with fine alignment between pop music videos and MIDI tracks.
Secondly, no previous work deals with the video-to-music generation (VMG) task, which requires a mechanism to model multi-modal video and music jointly. We are inspired by the success of video caption generation, but unlike text, music is an artistic creation that cannot be interpreted consistently. Moreover, music sequences exhibit more diversity and longer-term structure than caption sequences. Recent advances in attention-based neural networks, with their ability to handle long-term structure, have made it possible to train such a music generation model. Thus, we propose Video Music Transformer (VMT), a novel attention-based multi-modal model that generates music for a given video. Experimental results show that VMT substantially outperforms a Seq2Seq baseline on both music smoothness and video relevance.
Our contributions can be summarized as follows:
We propose VMT, a novel attention-based multi-modal model to generate music for a given video.
Given the lack of training data aligning video with MIDI notes, we release a new dataset composed of over 7 hours of videos, including over 2500 video fragments with aligned MIDI files.
2 Dataset
One crucial challenge of this work lies in the lack of training data: at the time of writing, there is no existing dataset for video-to-MIDI music generation. Music Transformer released a rich classical piano MIDI dataset; however, most classical pieces are accompanied only by recordings of live performances rather than edited music videos. In this paper, we therefore create a new dataset for the task of video-to-music generation.
2.1 Data Collection
We collected pop music as the training data for our task for the following reasons:
In many music video fragments, the singer’s moves are synchronized with the groove of the music.
Pop music videos are rich in camera angles, shots, and movements, which benefits learning from a variety of videos and their corresponding music.
Although an existing classical music MIDI dataset is available, it is hard to find corresponding videos, since classical pieces seldom have music videos. In contrast, most pop songs have official music videos, usually filmed by famous directors and edited by experts; compared with other genres, these videos tend to fit the music's emotion and representation.
We collected the most famous music videos from YouTube channels Vevo, the world’s largest all-premium music video provider, and Warner Music, a major music company with interests in recorded music, music publishing, and artist services.
2.2 Piano Sheet Collection
For the piano MIDI, we first tried optical music recognition for automatic sheet generation. However, unlike pure-piano classical music, most pop songs are polyphonic, and the performance of optical music recognition on them is imprecise.
Instead, we collected piano scores from MuseScore (https://musescore.com), a popular free and open-source score writer whose website hosts more than 1M scores shared by more than 200k musicians. With such a large community, we can easily find suitable music for our research.
During this exploration, we found an author named ZakuraMusic (https://musescore.com/zakuramusic), who has more than 11.3k followers and has shared 257 music scores (as of the time of this writing). Most of these scores are piano rearrangements of famous pop songs, and their accurate BPM annotations make the alignment much more straightforward.
2.3 Music Alignment
Music alignment is the biggest issue in dataset collection. Without perfect alignment, it is not possible to teach our model the relationship between the music and the emotion in the video. In the following, we describe the challenges we faced during alignment:
Most pop music songs do not provide their BPM.
To deal with this problem, we first find the average BPM of each video and then generate a MIDI file from the piano sheet at the same BPM. Next, we use OpenShot (https://www.openshot.org) for the rearrangement. By overlaying the generated MIDI on the original video, we can manually check the correctness of the alignment and find the rearranged measures in the video.
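The retiming step above amounts to scaling every note timestamp by the ratio of the two tempos. Below is a minimal sketch of this idea on plain note tuples; the `Note` layout and the function name are our own illustration, not the authors' pipeline:

```python
from typing import List, Tuple

# A note is (start_sec, end_sec, pitch); this layout is illustrative only.
Note = Tuple[float, float, int]

def retime_notes(notes: List[Note], sheet_bpm: float, video_bpm: float) -> List[Note]:
    """Stretch note timings so a sheet written at sheet_bpm plays at video_bpm.

    A beat at video_bpm lasts (60 / video_bpm) seconds, so every timestamp
    is scaled by sheet_bpm / video_bpm.
    """
    scale = sheet_bpm / video_bpm
    return [(start * scale, end * scale, pitch) for start, end, pitch in notes]

# Example: a sheet written at 120 BPM aligned to a video measured at 96 BPM.
sheet = [(0.0, 0.5, 60), (0.5, 1.0, 64)]
aligned = retime_notes(sheet, sheet_bpm=120.0, video_bpm=96.0)
```

The same scaling applied per musical phrase handles the non-constant-BPM case described next.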
The BPM in performance is usually not constant.
Since humans are not perfect metronomes, the speed in their performance is usually not constant. Fortunately, most pop music songs are adjusted before release. For those videos with non-constant BPM, we find the BPM of each musical phrase, and either modify the speed of the video or the piano sheet.
The songs in videos differ from the original ones.
In music videos, it is common to remix the song by adding or removing some fragments. In this case, we either remove the added fragment from the video or modify the piano sheet to fit it. Sometimes there are tempo changes tied to effects such as explosions, slow-motion clips, or dream-fantasy clips; by using accelerando (speeding up) and rallentando (slowing down) in the piano sheet, we can recreate these effects in the MIDI. Some music videos also contain pauses in the music, usually narration or dialogue that is not relevant to the music itself; removing these pieces should not affect the quality of the music generation.
2.4 Dataset Summary
| Songs | Fragments | Length (hrs) | Notes (k) |
After the above steps, we collected 128 music videos and their corresponding music sheets. We split this dataset into training (90 songs), validation (10 songs), and testing (28 songs) sets. Due to CUDA memory limitations, we divided each song into 10-second fragments. As shown in tab:example, we provide a video-to-piano-MIDI dataset containing samples extracted from pop music videos. Each sample consists of a 10-second fragment of a music video, with frames scaled down, and the rearranged piano MIDI. The total length of this dataset is over 7 hours.
3 Proposed Model
Our model is inspired by multi-modal sequence-to-sequence approaches to video and language, but deviates from their text generation framework.
Unlike text sequences, a musical sequence contains recurring phrases, called motifs, which are of special importance or characteristic of a composition. The occurrences of motifs can be sparse. Furthermore, musicians often include improvisations to create surprise and variation between performances. Hence, our model is required to maintain long-range coherence. Intuitively, an end-to-end attention-based model is an effective approach to realize this: the attention mechanism creates weighted representations of frame sequences, which are then used to predict piano notes.
In ssec:model-conv2d, we show the structure of the convolutional layers in detail; these layers encode the video, capturing abstract features such as emotion and movement. ssec:model-baseline describes the baseline model to which VMT is compared in the experiments. Finally, we formulate the attention mechanisms, dot-product self-attention and encoder-decoder attention, used to predict piano notes in ssec:model-vmt.
Notation and Task Definition.
We denote by V = (v_1, ..., v_T) a sequence of video frames, where T is the length of the sequence. Each frame v_t consists of three channels of RGB data and is encoded as PNG using TensorFlow. The target of our video-to-music task is a piano event sequence E = (e_1, ..., e_N). To generate symbolic music in the manner of language modeling, we use the performance encoding proposed by to represent the MIDI data. Hence, each piano event e_i belongs to a vocabulary S; because our dataset contains only piano music, the vocabulary size is |S| = 310. In this work, our model takes V as input and outputs a sequence of piano event probabilities.
3.1 Convolutional-2D Encoding
As shown in fig:conv, for each frame v_t we use a 3-layer 2D convolution with LeakyReLU activation, layer normalization, and zero padding, as given by (1): each frame is convolved with a kernel matrix with a fixed stride and number of filters. Finally, the output passes through an average pooling layer to obtain a flattened embedding vector x_t that represents one frame.
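The frame-encoding pipeline above can be sketched in NumPy as follows. The filter counts, kernel size, stride, and toy frame size below are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def conv2d(x, kernels, stride=2, pad=1):
    """Zero-padded 2D convolution. x: (C_in, H, W); kernels: (C_out, C_in, k, k)."""
    c_out, c_in, k, _ = kernels.shape
    x = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h = (x.shape[1] - k) // stride + 1
    w = (x.shape[2] - k) // stride + 1
    out = np.zeros((c_out, h, w))
    for o in range(c_out):
        for i in range(h):
            for j in range(w):
                patch = x[:, i*stride:i*stride+k, j*stride:j*stride+k]
                out[o, i, j] = np.sum(patch * kernels[o])
    return out

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def encode_frame(frame, kernel_stack):
    """Apply three conv blocks, then average-pool to a flat embedding vector."""
    h = frame
    for kernels in kernel_stack:
        h = layer_norm(leaky_relu(conv2d(h, kernels)))
    return h.mean(axis=(1, 2))  # one value per output channel

rng = np.random.default_rng(0)
frame = rng.normal(size=(3, 32, 32))            # one RGB frame (toy size)
stack = [rng.normal(size=(8, 3, 3, 3)) * 0.1,   # hypothetical filter counts
         rng.normal(size=(16, 8, 3, 3)) * 0.1,
         rng.normal(size=(32, 16, 3, 3)) * 0.1]
embedding = encode_frame(frame, stack)          # shape: (32,)
```

Each video frame is thus reduced to a single fixed-length vector before entering the sequence model.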
3.2 Baseline Model
The task of video-to-music conversion can be viewed as a sequence-to-sequence problem. Most competitive multi-modal sequence-to-sequence models, including video captioning and language translation [2, 5], utilize an encoder-decoder structure. Thus, we implement an encoder-decoder model with GRU layers as our baseline model. As shown in fig:seq2seq, we feed the frame vectors into the encoder. Both the encoder and the decoder are 3-layer gated recurrent units (GRUs). Each GRU has a reset gate, an update gate, and a new gate, whose formulas are given by (2). The reset and update gates help the model determine how much information from the hidden state should be passed on.
Specifically, we use the same GRU layer for each layer in the encoder and decoder. Benefiting from this design, the decoder is capable of carrying over information from the frame sequence. We also use encoder-decoder attention, identical to the inter-attention of VMT described in section 3.3, to weight the output hidden states of the encoder and obtain a representation that serves as the input to the decoder. Finally, we apply the softmax function to the decoder output to obtain the probability of each performance event. For training, we use the negative log-likelihood loss as our objective function.
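The GRU gate equations referenced as (2) can be sketched in NumPy as below; the weight names are generic, not the paper's notation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h_prev, params):
    """One GRU step with reset (r), update (z), and new (n) gates."""
    Wr, Ur, br, Wz, Uz, bz, Wn, Un, bn = params
    r = sigmoid(Wr @ x + Ur @ h_prev + br)        # reset gate
    z = sigmoid(Wz @ x + Uz @ h_prev + bz)        # update gate
    n = np.tanh(Wn @ x + Un @ (r * h_prev) + bn)  # new (candidate) state
    return (1 - z) * n + z * h_prev               # interpolated hidden state

rng = np.random.default_rng(1)
d_in, d_h = 16, 8
params = tuple(rng.normal(size=s) * 0.1
               for s in [(d_h, d_in), (d_h, d_h), (d_h,)] * 3)
h = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):  # run a 5-step toy input sequence
    h = gru_cell(x, h, params)
```

The update gate z interpolates between the previous hidden state and the candidate state, which is how the GRU decides how much past information to keep.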
3.3 Video-Music Transformer
As shown in fig:vmt, the model architecture of VMT consists of four modules: the 2D convolution encoding, the performance event embedding layer, the encoder, and the decoder. The details of the 2D convolution encoding are given in ssec:model-conv2d, where frames are encoded separately into frame vectors. The frame vectors are then fed into dot-product attention. In order to make use of the order of the frame sequence, we add positional encodings to the frame vectors and the performance event embeddings at the bottom of the encoder and decoder, respectively. In this work, we use sine and cosine functions of different frequencies: PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)), where pos is the position, i indexes the dimension, and d is the model dimension.
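The sinusoidal positional encoding, as introduced in the cited Transformer work, can be sketched as:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal encodings: sin on even dimensions, cos on odd dimensions."""
    pos = np.arange(max_len)[:, None]            # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]        # even dimension indices
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

# 1024 matches the maximum target length used in the experiments;
# d_model = 512 matches the hidden dimension.
pe = positional_encoding(max_len=1024, d_model=512)
```

These encodings are simply added to the frame vectors and event embeddings so that the attention layers can distinguish positions.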
For the performance event embedding at each decoder time step, we use an embedding matrix whose rows are event vectors of the same dimension as the model; the values of this matrix are learned during training.
For each layer in the encoder and decoder, we use dot-product attention followed by a feed-forward network. Since queries, keys, and values can be computed in different ways, there are two kinds of dot-product attention in our model: intra-attention and inter-attention. At every time step, the intra-attention weights different positions of a single sequence to compute a representation of it, over the frame sequence in the encoder and over the event sequence in the decoder. In contrast, the inter-attention weights different positions of the encoder output vectors against the decoder hidden state; the hidden state produced by inter-attention thus combines the information from the encoder with the performance events generated so far. The output of the intra-attention or inter-attention is then passed to the feed-forward network.
Finally, we then pass the hidden state in the last layer to a linear layer and softmax function. The output is the probability of generating a performance event. We compute the negative log-likelihood loss for the target event as our objective function.
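Both attention variants reduce to the same scaled dot-product operation, differing only in where the queries, keys, and values come from. A minimal sketch (the sequence lengths and dimension below are toy values; 40 frames per clip matches the experimental setting):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))
    return weights @ V, weights

rng = np.random.default_rng(2)
frames = rng.normal(size=(40, 64))   # encoder states (40 frame vectors)
events = rng.normal(size=(10, 64))   # decoder states (performance events)

# Intra-attention: queries, keys, and values all come from one sequence.
intra_out, _ = dot_product_attention(frames, frames, frames)
# Inter-attention: decoder queries attend over the encoder outputs.
inter_out, w = dot_product_attention(events, frames, frames)
```

In the full model this is wrapped with multiple heads, residual connections, and the feed-forward layers described above.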
4 Experiments
In this section, we describe the experimental settings and report the results.
4.1 Experimental Setting
We train the VMT model and the Seq2Seq model on a single NVIDIA P40 GPU. Due to memory limitations, we down-sampled each video by extracting 40 frames from every 10 seconds with OpenCV and reduced the frame size. As mentioned previously, our dataset consists of aligned pairs of video and piano music, each 10 seconds long. Seq2Seq is trained with three hidden layers; both the encoder and decoder in VMT are trained with six hidden layers, and all hidden states have dimension 512. We keep a dropout of 0.1 on all layers and attention weights for both the Seq2Seq and VMT models. The learning rate is linearly warmed up over the first 8000 steps to its peak value and then decayed with the inverse square root of the step number. For the optimizer, we use Adam, which converges to better sets of model parameters. Our implementation uses a batch size of 4 video frame sequences. We train the VMT model for a total of 50,000 steps over 18 hours.
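The warmup-then-decay schedule can be written as a small function. The peak value below (1e-4) is a placeholder of our own, not the paper's figure:

```python
import math

def learning_rate(step, peak, warmup_steps=8000):
    """Linear warmup to `peak` over warmup_steps, then inverse-sqrt decay."""
    if step <= warmup_steps:
        return peak * step / warmup_steps
    return peak * math.sqrt(warmup_steps / step)

# 1e-4 is an assumed placeholder peak value for illustration.
schedule = [learning_rate(s, peak=1e-4) for s in (4000, 8000, 32000)]
```

At step 8000 the rate reaches its peak; at four times the warmup length it has decayed to half the peak.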
In particular, the maximum length of the target sequences of VMT is set to 1024. Since we use performance encoding for the target piano notes, the vocabulary contains 310 events: 88 NOTE_ON, 88 NOTE_OFF, 100 TIME_SHIFT, and 32 VELOCITY events, plus two special tokens representing the start and end of event sequences.
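The event vocabulary can be enumerated as below. The text confirms 88 note events per type, the VELOCITY events, the two special tokens, and the total of 310; the exact TIME_SHIFT and VELOCITY bin counts here follow standard performance encoding and are our assumption:

```python
def build_vocab():
    """Enumerate performance-encoding events for the 88 piano keys.

    88 NOTE_ON + 88 NOTE_OFF + 100 TIME_SHIFT + 32 VELOCITY + 2 specials = 310.
    The TIME_SHIFT/VELOCITY bin counts are assumed from standard performance
    encoding, not stated explicitly in this section.
    """
    vocab = ["<START>", "<END>"]                         # sequence delimiters
    vocab += [f"NOTE_ON_{p}" for p in range(21, 109)]    # MIDI pitches A0..C8
    vocab += [f"NOTE_OFF_{p}" for p in range(21, 109)]
    vocab += [f"TIME_SHIFT_{t}" for t in range(1, 101)]  # assumed time bins
    vocab += [f"VELOCITY_{v}" for v in range(32)]        # assumed velocity bins
    return vocab

vocab = build_vocab()
```

Under these assumptions the enumeration reproduces the stated vocabulary size of 310.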
4.2 User Evaluation
In this paper, we work on the new task of video-to-music translation, for which there is not yet a credible benchmark. Therefore, we compare the proposed VMT model with Seq2Seq and with the original music from the testing dataset. We conduct a user evaluation involving 23 participants, with each participant evaluating 30 randomly sampled examples from the testing set. In each trial, the participant is asked to compare three pieces of music accompanying the same video: one generated by VMT, one by Seq2Seq, and the original soundtrack. Participants score each piece for the smoothness of the music and its relevance to the video, on a scale from 0 to 5. Since our dataset is collected from pop music videos, we provide a button, “I know this song or music video. Skip!”, to prevent familiarity biases from affecting the accuracy of the evaluation. For example, “Adele — Hello”, “Ed Sheeran — Shape of you” and “Lady Gaga — Poker face” had high skip rates compared with the other samples in the testing set. We report in tab:user the scores that the two models and the original music received.
As shown in tab:user, although the original soundtracks remain far ahead in both smoothness and relevance, the proposed VMT model substantially outperforms the baseline Seq2Seq model. Each score in tab:user is the average over 690 samples graded by the 23 participants. Interestingly, for both the generated music and the original soundtracks, the video-relevance scores are lower than the smoothness scores. Since our dataset consists of pop music videos and pop music, it is possible that restricting the model to generate only piano MIDI impedes the experience of watching the video and consequently hurts the relevance scores.
4.3 Case Study
We choose two samples from the testing set and visualize the MIDI files generated by both VMT and Seq2Seq, compared with the original soundtracks. fig:three graphs shows the notes played over time, i.e., the x-axis and y-axis are pitch and seconds, ranging over C2–C7 and 0–10, respectively.
Taking a closer look at fig:seq1 and fig:seq2: although some samples generated with Seq2Seq achieve high scores on smoothness and relevance, they exhibit two critical problems.
Firstly, as shown in fig:seq1, even though the length of all videos is ten seconds, Seq2Seq is incapable of generating music whose length is exactly ten seconds.
That is, Seq2Seq fails to correctly generate end tokens, resulting in an inconsistency between the music and the video. We removed the portions of music beyond ten seconds for the user evaluation.
Moreover, while decoding the output event sequences from the Seq2Seq model with the performance encoding module, we encountered warnings that pitches with NOTE_ON events but no matching NOTE_OFF events would be removed.
Thus, we suppose that the selected metrics overrated the samples generated with Seq2Seq, and we verify that Seq2Seq lacks the ability to maintain long-term sequence generation.
In contrast, the examples generated with VMT (fig:vmt1 and fig:vmt2) have lengths consistent with the videos and produce no warnings when decoded with the performance encoding module.
Secondly, we examine the rhythmicity and harmonicity of the note sequences. Seq2Seq tends to generate notes repeatedly (fig:seq2), which causes a lack of melodic motion. This example nevertheless received a higher relevance score, because users mistook the repeated notes for a drum beat. This also demonstrates that the video-relevance metric is ambiguous, and that users need calibration examples before performing the evaluation.
As shown in fig:three graphs, the green vertical boxes indicate recurring phrases of notes. Both examples generated with VMT (fig:vmt1 and fig:vmt2) contain a motif, which is most often thought of in melodic terms. We observe that VMT not only generates recurring phrases but also maintains pitch variability in its harmonic, melodic, and rhythmic aspects.
In summary, we compare the examples from both models with the original soundtracks (fig:gt1 and fig:gt2). The visualizations (fig:vmt1 and fig:vmt2) show that VMT is capable of generating note sequences with recurring phrases and melodic and harmonic structure. Hence, the training procedure passes knowledge of musical structure to the VMT model, which composes music in a human-like way.
5 Related work
The task closest to video-to-music generation is music recommendation for video. There are two approaches to music recommendation: emotion-based and correlation-based. The first emotion-based model focuses on user emotions, constructing a mixed-media graph to detect music emotion rather than directly using music emotion labels.
On the other hand, the EMV-matchmaker proposed to extract video and music features separately and then utilize temporal phase sequences to connect music and video. Before that, most research on video music recommendation was based on the similarity of user preferences.
Music video generation combined content and emotion features: given a video, their model predicts acoustic features to match the music.
To sum up, these works treat the task as retrieving music for a video, which may suffer from copyright issues and the insufficient diversity of the database. In contrast, music generated by a model can be more varied and can better fit the video.
With the advancement of deep learning, music generation models have improved dramatically. Some works are based on symbolic music, that is, the training targets are MIDI files. The first neural-network-based model for music generation was a connectionist approach to algorithmic composition. Along with the development of model architectures, many music generation models are based on RNNs [10, 14], due to the sequential nature of the input.
MIDI-VAE, based on a variational autoencoder, is capable of style transfer on symbolic music by changing pitches and instruments. A branch of multi-track music generation studies appeared along with the release of the Lakh MIDI Dataset. LakhNES is a transformer-architecture model with a pre-training technique for generating chiptune music. MuseGAN generates polyphonic multi-track music, using convolutions in both the generators and the discriminators, conditioned on intra-track and inter-track features.
On the other hand, following the generative model of raw audio waveforms, an autoencoder model was trained to learn music features and generate music mixing bass, flute, and organ spectra. The music synthesis model Mel2Mel learns an instrument embedding to predict a Mel spectrogram from a given note sequence. The release of MAESTRO enabled a pipeline of transcribing, composing, and synthesizing audio waveforms, also known as Wave2Midi2Wave. GANSynth, trained on the NSynth dataset, is capable of independently controlling pitch and timbre to generate audio.
6 Conclusion and Future Work
We propose Video Music Transformer (VMT), a novel attention-based multi-modal model that generates piano music for a given video. We release a new dataset composed of over 7 hours of piano scores finely aligned with video. Experiments show that VMT outperforms the Seq2Seq model on both music smoothness and video relevance. In future work, we plan to explore proper benchmarks for video relevance, covering emotion, rhythm, and motion connections, and to develop the model architecture further to encourage the model to learn these features.
-  (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §3.1.
-  (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §3.2.
-  (2018) MIDI-VAE: modeling dynamics and instrumentation of music with applications to style transfer. arXiv preprint arXiv:1809.07600. Cited by: §5.
-  (2015) Microsoft COCO captions: data collection and evaluation server. arXiv preprint arXiv:1504.00325. Cited by: §3.
-  (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078. Cited by: §3.2.
-  (2019) LakhNES: improving multi-instrumental music generation with cross-domain pre-training. arXiv preprint arXiv:1907.04868. Cited by: §5.
-  (2018) MuseGAN: multi-track sequential generative adversarial networks for symbolic music generation and accompaniment. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §5.
-  (2019) GANSynth: adversarial neural audio synthesis. arXiv preprint arXiv:1902.08710. Cited by: §5.
-  (2017) Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1068–1077. Cited by: §5.
-  (2017) DeepBach: a steerable model for Bach chorales generation. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1362–1371. Cited by: §5.
-  (2010) Music emotion classification and context-based music recommendation. Multimedia Tools and Applications 47 (3), pp. 433–460. Cited by: §1.
-  (2018) Enabling factorized piano music modeling and generation with the maestro dataset. arXiv preprint arXiv:1810.12247. Cited by: §5.
-  (2018) Music transformer. arXiv preprint arXiv:1809.04281. Cited by: §2.
-  (2016) . In Proceedings of Deep Reinforcement Learning Workshop, NIPS, Cited by: §5.
-  (2019) Neural music synthesis for flexible timbre control. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 176–180. Cited by: §5.
-  (2005) Emotion-based music recommendation by association discovery from film music. In Proceedings of the 13th annual ACM international conference on Multimedia, pp. 507–510. Cited by: §1, §5.
-  (2015) EMV-matchmaker: emotional temporal course modeling and matching for automatic music video generation. In Proceedings of the 23rd ACM international conference on Multimedia, pp. 899–902. Cited by: §1, §5.
-  (2016) Automatic music video generation based on emotion-oriented pseudo song prediction and matching. In Proceedings of the 24th ACM international conference on Multimedia, pp. 372–376. Cited by: §5.
-  (2016) C-rnn-gan: continuous recurrent neural networks with adversarial training. arXiv preprint arXiv:1611.09904. Cited by: §5.
-  (2016) WaveNet: a generative model for raw audio. arXiv preprint arXiv:1609.03499. Cited by: §5.
-  (2018) This time with feeling: learning expressive musical performance. Neural Computing and Applications, pp. 1–13. Cited by: §3.
-  (2016) Learning-based methods for comparing sequences, with applications to audio-to-midi alignment and matching. Ph.D. Thesis, Columbia University. Cited by: §5.
-  (2019) VideoBERT: a joint model for video and language representation learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7464–7473. Cited by: §3.
-  (1989) A connectionist approach to algorithmic composition. Computer Music Journal 13 (4), pp. 27–43. Cited by: §5.
-  (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §3.3.
-  (2015) Sequence to sequence-video to text. In Proceedings of the IEEE international conference on computer vision, pp. 4534–4542. Cited by: §3.2, §3.