There is growing interest in multimodal techniques for solving Music Information Retrieval (MIR) problems. Music performances have a high multimodal content and the different modalities involved are highly correlated: sounds are emitted by the motion of the performing player, and in chamber music performances the scores constitute an additional encoding that may also be leveraged for the automatic analysis of music.
A fundamental problem in music analysis, and in audio processing in general, is Blind Source Separation (BSS). Given a mixture of signals, BSS consists in recovering the individual signals the mixture is composed of. Mathematically, a mixture of sounds can be expressed as the sum of individual sources: $m(t) = \sum_{i=1}^{N} s_i(t)$. Thus, the BSS problem consists in recovering each $s_i(t)$ for a given $m(t)$. In speech, it is also known as the Cocktail Party problem, which refers to the task of recognizing an individual's speech in noisy social environments. The single-channel source separation problem can be approached from an audio-only perspective using techniques such as independent component analysis (ICA), sparse decomposition, nonnegative matrix factorization (NMF), computational auditory scene analysis (CASA), probabilistic latent component analysis (PLCA) or deep learning techniques (e.g. [2, 33]).
On the other hand, by visually inspecting the scene we may extract information about the number of sound sources, their type, their spatio-temporal location and also their motion, which naturally relates to the emitted sound. Besides, it is possible to carry out self-supervised tasks in which one modality supervises the other. This entails another research field, cross-modal correspondence (CMC). We can find pioneering works for both problems, BSS and CMC: audio-visual data were first used for sound localization [12, 16] and for speech separation. In the context of music, visual information has also proven to help model-based methods both in source separation [22, 27] and localization. With the flourishing of deep learning techniques, many recent works exploit both audio and video content to perform music source separation [10, 38, 36], source association, localization or both. Some CMC works explore features generated from synchronization [26, 18] and prove these features are reusable for source separation. These works use networks that have been trained in a self-supervised way, using pairs of corresponding/non-corresponding audio-visual signals for localization purposes or the mix-and-separate approach for source separation [39, 10, 38, 36]. Although deep learning made it possible to solve classical problems in a different way, it also contributed to creating new research fields like cross-modal generation, in which the main aim is to generate video from audio [25, 4] or vice versa. More recent works related to human motion make use of skeletons as an inner representation of the body which can be further converted into video [30, 11], which shows the potential of skeletons.
The main contribution of this paper is Solos, a new dataset of musical performance recordings of soloists that can be used to train deep neural networks for any of the aforementioned fields. Unlike a similar dataset of musical instruments and its extended version, our dataset contains the same type of chamber orchestra instruments present in the URMP dataset. Solos is a dataset of 755 real-world recordings gathered from YouTube which provides several features missing in the aforementioned datasets: skeletons and high quality timestamps. Source localization is usually learned indirectly by networks. Thus, providing a practical localization ground-truth is not straightforward. Nevertheless, networks often point to the player's hands as if they were the sound source. We expect that hand localization can provide additional cues to improve audio-visual BSS, or can be used as ground-truth source localization. In order to show the benefits of using Solos we trained some popular BSS architectures and compared their results.
II Related Work
The University of Rochester Multi-Modal Music Performance Dataset (URMP)  is a dataset with 44 multi-instrument video recordings of classical music pieces. Each instrument present in a piece was recorded separately, both with video and high-quality audio with a stand-alone microphone, in order to have ground-truth individual tracks. Although playing separately, the instruments were coordinated by using a conducting video with a pianist playing in order to set the common timing for the different players. After synchronization, the audio of the individual videos was replaced by the high-quality audio of the microphone and then different recordings were assembled to create the mixture: the individual high-quality audio recordings were added up to create the audio mixture and the visual content was composited in a single video with a common background where all players were arranged at the same level from left to right. For each piece, the dataset provides the musical score in MIDI format, the high-quality individual instrument audio recordings and the videos of the assembled pieces. The instruments present in the dataset, shown in Figure 1, are common instruments in chamber orchestras. In spite of all its good characteristics, it is a small dataset and thus not appropriate for training deep learning architectures.
Two other datasets of audio-visual recordings of musical instrument performances have been presented recently: MUSIC and MusicES. MUSIC consists of 536 recordings of solos and 149 videos of duets across 11 categories: accordion, acoustic guitar, cello, clarinet, erhu, flute, saxophone, trumpet, tuba, violin and xylophone. This dataset was gathered by querying YouTube. MusicES is an extension of MUSIC to around three times its original size, with approximately 1475 recordings, but spread across 9 categories instead: accordion, guitar, cello, flute, saxophone, trumpet, tuba, violin and xylophone. There are 7 categories common to MUSIC and Solos: violin, cello, flute, clarinet, saxophone, trumpet and tuba. MusicES and Solos have 6 categories in common (the former ones except clarinet). Solos and MusicES are complementary: there is only a small intersection of 5% between them, which means both datasets can be combined into a bigger one.
We can find in the literature several examples which show the utility of audio-visual datasets. The Sound of Pixels performs audio source separation by generating audio spectral components, which are then selected using visual features from the video stream to obtain the separated sources. This idea was further extended in order to separate the different sounds present in the mixture in a recursive way: at each stage, the system separates the most salient source from the ones remaining in the mixture. The Sound of Motions uses dense trajectories obtained from optical flow to condition audio source separation, being able to separate even same-instrument mixtures. Visual conditioning is also used in another approach to separate different instruments; during training, a classification loss is applied to the separated sounds to enforce object consistency, and a co-separation loss forces the estimated individual sounds to produce the original mixtures once reassembled. A further work developed an energy-based method which minimizes a Non-Negative Matrix Factorization term with an activation matrix that is forced to be aligned with a matrix containing per-source motion information. This motion matrix contains the average magnitude velocities of the clustered motion trajectories in each player's bounding box.
Recent works show the rising use of skeletons in audio-visual tasks. In Audio to body dynamics, the authors show it is possible to predict skeletons reproducing the movements of players playing instruments such as piano or violin. Skeletons have proven to be useful for establishing audio-visual correspondences, such as body or finger motion with note onsets or pitch fluctuations, in chamber music performances. A recent work tackles the source separation problem in a similar way to Sound of Motions, but replaces the dense trajectories with skeleton information.
Solos (dataset available at https://juanfmontesinos.github.io/Solos/) was constructed aiming to have the same categories as the URMP dataset, so that URMP can be used as a testing dataset in a real-world scenario. This way we aim to establish a standard way of evaluating the performance of source separation algorithms, avoiding the use of mix-and-separate in testing. Solos consists of 755 recordings distributed amongst 13 categories as shown in Figure 1, with an average of 58 recordings per category and an average duration of 5:16 min. It is interesting to highlight that, for 8 out of 13 categories, the median resolution is HD, despite being a YouTube-gathered dataset. Per-category statistics can be found in Table I. These recordings were gathered by querying YouTube using the tags solo and auditions in several languages such as English, Spanish, French, Italian, Chinese or Russian.
TABLE I: Per-category statistics (columns: Category, # Recordings, Mean duration, Median resolution).
Solos is not only a set of recordings. We also provide OpenPose body and hand skeletons for each frame of each recording, and timestamps indicating useful parts. To do so, video streams are re-sampled to 25 FPS keeping the audio stream intact. An iterative process returns timestamps for which there are at least N frames with a detected hand and no more than M consecutive mispredictions. In practice we use N=150 and M=5, that is, a minimum of 6 seconds of video with at most 5 consecutive frames with hand mispredictions. At this point, we have segments of video in which hands are detected. To refine these results we further applied an energy-based silence detector, which allows us to discard those segments in which the instrument is not being played, e.g., transitions, music sheet changes, etc.
Besides, we perform a linear interpolation of the mispredicted keypoints in a relative coordinate basis. Directly interpolating the absolute coordinates would lead to deformations of the skeleton and inaccuracies. Since skeletons are tree-like graphs, it is possible to interpolate the relative coordinates of each joint (node in the graph) with respect to its parent node. Then, the absolute coordinates of the joint are recovered as the sum of the absolute coordinates of its parent and the estimated relative coordinates with respect to the parent. Let us denote by $r_i^t$ the relative coordinates of the $i$-th joint with respect to its parent at time $t$, and by $\hat{r}_i^t$ the estimated value of $r_i^t$ when the $i$-th joint is mispredicted. $\hat{r}_i^t$ can be linearly interpolated using the relative coordinates of the closest detected $i$-th joint before time $t$ (i.e. $r_i^{t_a}$ where $t_a < t$), and analogously the closest detected $i$-th joint after time $t$ (i.e. $r_i^{t_b}$ where $t_b > t$). For example, given a sequence in which the joint is detected at times $t_a$ and $t_b$ and mispredicted at every time in between, the interpolation at time $t$ can be calculated as:
$$\hat{r}_i^t = r_i^{t_a} + \frac{t - t_a}{t_b - t_a}\left(r_i^{t_b} - r_i^{t_a}\right)$$
OpenPose maps mispredicted joints to the origin of coordinates. We empirically found that such a big jump in the position of a joint induces noise. Using interpolated coordinates helps to address this problem.
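The interpolation in relative coordinates can be illustrated with a minimal sketch. It is not the code used to build Solos: the function name `interpolate_skeleton`, the array layout and the `parents` list are assumptions made for the example, a joint lying exactly at the origin is treated as mispredicted, and frames before the first or after the last detection simply reuse the nearest detected relative coordinates.

```python
import numpy as np


def interpolate_skeleton(joints, parents):
    """Repair mispredicted joints by interpolating parent-relative coordinates.

    joints:  array of shape (T, J, 2) with absolute coordinates; a joint at
             exactly (0, 0) is treated as mispredicted (assumption).
    parents: parents[j] is the parent index of joint j, -1 for the root.
             Parents are assumed to appear before their children.
    """
    joints = np.asarray(joints, dtype=float)
    n_frames, n_joints, n_dims = joints.shape
    times = np.arange(n_frames)
    valid = np.any(joints != 0.0, axis=-1)  # (T, J) detection flags
    out = joints.copy()

    for j in range(n_joints):
        p = parents[j]
        if p < 0:
            # The root has no parent: interpolate its absolute coordinates.
            rel = joints[:, j]
            known = valid[:, j]
            base = np.zeros_like(rel)
        else:
            # Relative coordinates are known only where joint and parent
            # were both detected.
            rel = joints[:, j] - joints[:, p]
            known = valid[:, j] & valid[:, p]
            base = out[:, p]  # parent positions, already repaired

        missing = ~valid[:, j]
        if not missing.any() or not known.any():
            continue

        # Linear interpolation over time, one coordinate at a time.
        rel_hat = np.stack(
            [np.interp(times, times[known], rel[known, c]) for c in range(n_dims)],
            axis=-1,
        )
        # Absolute position = parent position + estimated relative offset.
        out[missing, j] = base[missing] + rel_hat[missing]
    return out
```

Because the offset is added to the parent position at the same frame, a limb follows its parent even when the parent moves quickly, which is the deformation that interpolating absolute coordinates would introduce.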
In order to show the suitability of Solos, we have focused on the blind source separation problem and have trained The Sound of Pixels (SoP) and the Multi-head U-Net (MHU-Net) models on the new dataset. We have carried out four experiments: we have evaluated the SoP pre-trained model provided by the authors, we have trained SoP from scratch, we have fine-tuned the pre-trained SoP network on our dataset and we have trained the Multi-head U-Net from scratch. MHU-Net has been trained to separate mixtures with the number of sources varying from two to seven, following a curriculum learning procedure as it improves the results. SoP has been trained according to the optimal strategy described in the original work.
Evaluation is performed on the URMP dataset  using the real mixtures they provide. URMP tracks are sequentially split in 6s-duration segments. Metrics are obtained from all the resulting splits.
IV-A Architectures and training details
We have chosen The Sound of Pixels as baseline since its weights are publicly available and the network is trained in a straightforward way. SoP is composed of three main sub-networks: a Dilated ResNet as video-analysis network, a U-Net as audio-processing network and an audio synthesizer network. We also compare its results against a Multi-head U-Net.
U-Net is an encoder-decoder architecture with skip connections in between. Skip connections help to recover the original spatial structure. MHU-Net is a step forward as it consists of as many decoders as possible sources. Each decoder is specialized in a single source, thus improving performance.
The U-Net used in SoP follows a variant of this architecture which was tuned for singing voice separation. Instead of having two convolutions per block followed by max-pooling, it uses a single convolution with a bigger kernel and striding. The original work proposes a central block with learnable parameters, whereas the central block is a static latent space in SoP. U-Net has been widely used as the backbone of several architectures for tasks such as image generation, noise suppression and super-resolution, image segmentation or audio source separation. The SoP U-Net consists of 7 blocks with 32, 64, 128, 256, 512, 512 and 512 channels respectively (6 blocks for the MHU-Net). The latent space can be considered as the last output of the encoder. The output of the U-Net is a set of 32 spectral components (channels) which are the same size as the input spectrogram, in the case of SoP, and a single source per decoder in the case of MHU-Net.
Dilated ResNet is a ResNet-like architecture which makes use of dilated convolutions to keep the receptive field while increasing the resulting spatial resolution. Given a representative frame, visual features are obtained using the Dilated ResNet. These visual features are simply a vector of 32 elements (corresponding to the number of output channels of the U-Net) which are used to select the proper spectral components. This selection is performed by the audio synthesizer network, which consists of 32 learnable parameters, $\alpha_k$, plus a bias, $\beta$. This operation can be mathematically described as follows:
$$\hat{M}(t,f) = \sum_{k=1}^{32} \alpha_k\, v_k\, S_k(t,f) + \beta$$
where $S_k(t,f)$ is the $k$-th predicted spectral component at time-frequency bin $(t,f)$, $v_k$ is the $k$-th visual feature and $\hat{M}(t,f)$ is the resulting prediction for the source.
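The selection step can be sketched in a few lines of NumPy. This is an illustration under assumptions rather than the SoP implementation: the function name `select_components`, the array shapes, and the symbols `alpha` (32 learnable scales) and `beta` (bias) are chosen for the example, and any output activation applied afterwards is left out.

```python
import numpy as np


def select_components(components, visual_features, alpha, beta):
    """Weighted selection of spectral components by visual features.

    components:      (K, F, T) spectral components produced by the U-Net.
    visual_features: (K,) visual feature vector of one source.
    alpha:           (K,) learnable scales (hypothetical name).
    beta:            scalar learnable bias (hypothetical name).
    Returns an (F, T) map: sum_k alpha_k * v_k * S_k(t, f) + beta.
    """
    weights = np.asarray(alpha) * np.asarray(visual_features)  # (K,)
    # Contract the channel axis of the components with the K weights.
    return np.tensordot(weights, np.asarray(components), axes=1) + beta
```

A visual feature close to zero switches the corresponding spectral component off, so the visual network can only reduce the loss by producing features that are specific to the instrument in view.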
Figure 2 illustrates the SoP configuration. It is interesting to highlight that making the visual network select the spectral components forces it to indirectly learn instrument localization, which can be inferred via activation maps.
On the one hand, MHU-Net has been trained using a curriculum learning strategy that consists of gradually increasing the number of sources present in the mixture from two to four. When the loss stays on a plateau for more than 160,000 iterations, the number of sources is increased by one. We have used a mean-square error loss and the ADAM optimizer, together with weight decay and dropout in the decoder. We have also halved the learning rate (LR) if the loss stays on a plateau for more than 400,000 iterations.
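The plateau-driven curriculum can be sketched as a small scheduler. The text does not say how a plateau is detected, so the rule used here (no improvement over the best loss seen so far) is an assumption, as are the class name `SourceCurriculum` and the `min_delta` parameter.

```python
class SourceCurriculum:
    """Increase the number of mixed sources when the loss stops improving.

    A plateau is assumed to mean that the loss has not improved on its best
    value by more than `min_delta` for more than `patience` iterations.
    """

    def __init__(self, min_sources=2, max_sources=4, patience=160_000, min_delta=0.0):
        self.n_sources = min_sources
        self.max_sources = max_sources
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.stale = 0  # iterations since the last improvement

    def step(self, loss):
        """Record one training loss and return the current number of sources."""
        if loss < self.best - self.min_delta:
            self.best = loss
            self.stale = 0
        else:
            self.stale += 1
        if self.stale > self.patience and self.n_sources < self.max_sources:
            self.n_sources += 1
            # The task just became harder, so restart plateau tracking.
            self.best = float("inf")
            self.stale = 0
        return self.n_sources
```

In a training loop, `step` would be called once per iteration and its return value used to decide how many solo recordings are mixed for the next sample.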
On the other hand, SoP has been trained using different LRs for the U-Net and for the Dilated ResNet, as the latter was pre-trained on ImageNet. We have applied a weight on the gradients based on the magnitude of the mixture spectrogram, so that time-frequency points of the predicted sources contribute to the loss according to the energy of the corresponding time-frequency points in the mixture spectrogram. This reduces overfitting since, for a given source, a time-frequency bin with a low value may be assigned either one or zero in the ground-truth mask depending on the recorded noise, and such weights help reduce its impact on the training. We used different training strategies for SoP and MHU-Net, as the optimal training for SoP harms the performance of the MHU-Net.
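The effect of the energy-based weighting can be illustrated with a weighted squared error. The exact weighting function is not given in the text; the `log1p` compression and the clipping bounds `w_min` and `w_max` below are assumptions for the example, as is the function name.

```python
import numpy as np


def energy_weighted_mse(pred_mask, gt_mask, mix_mag, w_min=1e-3, w_max=10.0):
    """Squared mask error weighted by the energy of the mixture spectrogram.

    pred_mask, gt_mask: (F, T) predicted and ground-truth masks.
    mix_mag:            (F, T) magnitude spectrogram of the mixture.
    The weight log1p(mix_mag), clipped to [w_min, w_max], is an assumed
    choice; any monotonic function of the mixture energy plays the same role.
    """
    weight = np.clip(np.log1p(mix_mag), w_min, w_max)
    squared_error = (np.asarray(pred_mask) - np.asarray(gt_mask)) ** 2
    return float(np.sum(weight * squared_error) / np.sum(weight))
```

With this weighting, a wrong mask value in a nearly silent bin costs far less than the same error in an energetic bin, which is the behaviour described above for noisy low-energy bins.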
IV-B Data pre-processing
In order to train the aforementioned architectures, audio is re-sampled to 11025 Hz and 16 bit. Samples fed into the network have a duration of 6 s. We use the Short-time Fourier Transform (STFT) to obtain time-frequency representations of the waveforms. The STFT is computed using a Hanning window of length 1022 and a hop length of 256, so that we obtain a spectrogram of size 512×256 for a 6 s sample. Later on, we apply a log re-scale on the frequency axis, expanding lower frequencies and compressing higher ones. Lastly, we convert magnitude spectrograms into dB w.r.t. the minimum value of each spectrogram and normalize between -1 and 1.
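A minimal NumPy sketch of this pre-processing is given below, assuming a centred STFT with reflect padding and a symmetric Hanning window. The input length of 65535 samples (just under 6 s at 11025 Hz) is an assumption chosen so that exactly 256 frames result; the logarithmic re-scaling of the frequency axis is omitted, and the function names are illustrative.

```python
import numpy as np


def magnitude_stft(wave, n_fft=1022, hop=256):
    """Magnitude STFT with a Hanning window, returned as (frequency, time)."""
    window = np.hanning(n_fft)
    # Centre the frames by reflect-padding half a window on each side.
    padded = np.pad(np.asarray(wave, dtype=float), n_fft // 2, mode="reflect")
    n_frames = 1 + (len(padded) - n_fft) // hop
    frames = np.stack([padded[i * hop : i * hop + n_fft] for i in range(n_frames)])
    spectrum = np.fft.rfft(frames * window, axis=1)  # (n_frames, n_fft // 2 + 1)
    return np.abs(spectrum).T


def to_normalised_db(mag, eps=1e-8):
    """Convert magnitudes to dB relative to the minimum, scaled to [-1, 1]."""
    db = 20.0 * np.log10(np.maximum(mag, eps))
    db = db - db.min()  # dB with respect to the smallest value
    return 2.0 * db / max(db.max(), eps) - 1.0
```

A window of 1022 samples yields 1022/2 + 1 = 512 frequency bins, which explains the otherwise unusual window length.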
IV-C Ground-truth mask
Before introducing the ground-truth mask computations we would like to point out some considerations. The standard floating-point audio format requires a waveform to be bounded between -1 and 1. When creating artificial mixtures, the resulting waveforms may fall outside these bounds. This can help neural networks to find shortcuts to overfit. To avoid this behaviour, spectrograms are clamped according to the equivalent bounds in the time-frequency domain.
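The discrete Short-time Fourier Transform can be computed as:
$$X(m,k) = \sum_{n=0}^{L-1} x(n + mH)\, w(n)\, e^{-j 2\pi k n / L}$$
where $w$ is a window of length $L$ and $H$ is the hop length. Since $|x(n)| \le 1$, it can be easily shown that:
$$|X(m,k)| \le \sum_{n=0}^{L-1} |w(n)|$$
i.e., that the magnitude STFT of an audio signal bounded between [-1,1] is bounded between $0$ and $\sum_{n=0}^{L-1} |w(n)|$. Thus, given the STFT of N waveforms, the spectrogram of a mixture of sounds is defined the following way: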
which is equivalent to:
For training Sound of Pixels we have used complementary binary masks as ground-truth masks, defined as:
The Multi-head U-Net has been trained with complementary ratio masks, defined as:
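As an illustration only, complementary masks of both kinds can be sketched as follows. The definitions are assumptions commonly used in source separation, not necessarily the exact ones of this work: the binary mask marks the bins where a source has the largest magnitude among all sources, and the ratio mask divides the magnitude of a source by the sum of the magnitudes of all sources. Function names are hypothetical.

```python
import numpy as np


def binary_masks(source_mags):
    """Complementary binary masks: one winning source per time-frequency bin.

    source_mags: sequence of N magnitude spectrograms of shape (F, T).
    Ties are resolved in favour of the source with the lowest index.
    """
    mags = np.stack(source_mags)  # (N, F, T)
    winner = np.argmax(mags, axis=0)  # (F, T) index of the dominant source
    source_ids = np.arange(mags.shape[0])[:, None, None]
    return (source_ids == winner).astype(float)


def ratio_masks(source_mags, eps=1e-8):
    """Complementary ratio masks: each source's share of the summed magnitude."""
    mags = np.stack(source_mags)  # (N, F, T)
    return mags / (mags.sum(axis=0, keepdims=True) + eps)
```

In both cases the masks of all sources add up to one in every bin, which is what makes them complementary.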
Results are reported in terms of mean and standard deviation. As can be observed, Sound of Pixels evaluated using its original weights performs the worst. One possible reason could be the absence of some of the URMP categories in the MUSIC dataset. If we train the network from scratch on Solos, results improve by almost 1 dB. However, it is possible to achieve an even better result by fine-tuning the network, pre-trained on MUSIC, on Solos. We hypothesize that the improvement occurs because the network is exposed to much more training data. Moreover, the results in the table show that it is possible to reach higher performance by using more powerful architectures like MHU-Net.
We have presented Solos, a new audio-visual dataset of music recordings of soloists, suitable for different self-supervised learning tasks such as source separation using the mix-and-separate strategy, sound localization, cross-modal generation and finding audio-visual correspondences. There are 13 different instruments in the dataset; those are common instruments in chamber orchestras and the ones included in the University of Rochester Multi-Modal Music Performance (URMP) dataset. The characteristics of URMP (a small dataset of real performances with ground-truth individual stems) make it a suitable dataset for testing purposes but, to the best of our knowledge, to date there is no existing large-scale dataset with the same instruments as in URMP. Two different networks for audio-visual source separation based on the U-Net architecture have been trained on the new dataset and further evaluated on URMP, showing the impact of training on the same set of instruments as the test set. Moreover, Solos provides skeletons and timestamps for video intervals where hands are sufficiently visible. This information could be useful for training purposes and also for learning to solve the task of sound localization.
This work has received funding from the MICINN/FEDER UE project with reference PGC2018-098625-B-I00, H2020-MSCA-RISE-2017 project with reference 777826 NoMADS, ERC Innovation Programme (grant 770376, TROMPA), Spanish Ministry of Economy and Competitiveness under the María de Maeztu Units of Excellence Program (MDM-2015-0502) and the Social European Funds. We also thank Nvidia for the donation of GPUs.
[1] Objects that sound. In Proceedings of the IEEE European Conference on Computer Vision.
[2] Monoaural audio source separation using deep convolutional neural networks. In International Conference on Latent Variable Analysis and Signal Separation, pp. 258–266.
[3] (2008) Chapter 7 - Frequency domain processing. In Digital Signal Processing System Design (Second Edition), N. Kehtarnavaz (Ed.), pp. 175–196.
[4] (2017) Deep cross-modal audio-visual generation. In Proceedings of the on Thematic Workshops of ACM Multimedia 2017, pp. 349–357.
[5] (1953) Some experiments on the recognition of speech, with one and with two ears. The Journal of the Acoustical Society of America 25 (5), pp. 975–979.
[6] (2000) Audio-visual segmentation and “the cocktail party effect”. In Advances in Multimodal Interfaces—ICMI 2000, pp. 32–40.
[7] (2019) Interleaved multitask learning for audio source separation with independent databases. arXiv abs/1908.05182.
[8] (1996) Prediction-driven computational auditory scene analysis. Ph.D. Thesis, Massachusetts Institute of Technology.
[9] Music gesture for visual sound separation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10478–10487.
[10] (2019) Co-separating sounds of visual objects. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3879–3888.
[11] (2019) Learning individual styles of conversational gesture. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3497–3506.
[12] (2000) Audio vision: using audio-visual synchrony to locate sounds. In Advances in Neural Information Processing Systems, pp. 813–819.
[13] (2000) Independent component analysis: algorithms and applications. Neural Networks 13 (4-5), pp. 411–430.
[14] Image-to-image translation with conditional adversarial networks. arXiv.
[15] (2017) Singing voice separation with deep U-Net convolutional networks. In 18th International Society for Music Information Retrieval Conference, pp. 23–27.
[16] (2005) Pixels that sound. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), Vol. 1, pp. 88–95.
[17] (2014) Adam: a method for stochastic optimization. CoRR abs/1412.6980.
[18] (2018) Cooperative learning of audio and video models from self-supervised synchronization. In Advances in Neural Information Processing Systems, pp. 7763–7774.
[19] (2019) Creating a multitrack classical music performance dataset for multimodal music analysis: challenges, insights, and applications. IEEE Transactions on Multimedia 21 (2), pp. 522–535.
[20] (2017) See and listen: score-informed association of sound tracks to players in chamber music performance videos. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2906–2910.
[21] (2019) Online audio-visual source association for chamber music performances. Transactions of the International Society for Music Information Retrieval 2 (1).
[22] (2017) Audiovisual source association for string ensembles through multi-modal vibrato analysis. Proc. Sound and Music Computing (SMC).
[23] (2018) Photographic image synthesis with improved U-Net. In 2018 Tenth International Conference on Advanced Computational Intelligence (ICACI), pp. 402–407.
[24] (2016) Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In Advances in Neural Information Processing Systems, pp. 2802–2810.
[25] (2019) Speech2face: learning the face behind a voice. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7539–7548.
[26] (2018) Audio-visual scene analysis with self-supervised multisensory features. arXiv preprint arXiv:1804.03641.
[27] (2017) Guiding audio source separation by video object information. In 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 61–65.
[28] (2007) Mixing audiovisual speech processing and blind source separation for the extraction of speech signals from convolutive mixtures. IEEE Transactions on Audio, Speech, and Language Processing 15 (1), pp. 96–108.
[29] (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
[30] (2017) Audio to body dynamics. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR).
[31] (2006) A probabilistic latent variable model for acoustic modeling. Advances in Models for Acoustic Processing, NIPS 148, pp. 8–1.
[32] (2002) Separation of audio-visual speech sources: a new approach exploiting the audio-visual coherence of speech stimuli. EURASIP Journal on Advances in Signal Processing 2002 (11), pp. 382823.
[33] (2018) Wave-U-Net: a multi-scale neural network for end-to-end audio source separation. arXiv preprint arXiv:1806.03185.
[34] (2006) Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech, and Language Processing 14 (4), pp. 1462–1469.
[35] (2007) Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria. IEEE Transactions on Audio, Speech, and Language Processing 15 (3), pp. 1066–1074.
[36] (2019) Recursive visual sound separation using minus-plus net. In Proceedings of the IEEE International Conference on Computer Vision, pp. 882–891.
[37] (2017) Dilated residual networks. In Computer Vision and Pattern Recognition (CVPR).
[38] (2019) The sound of motions. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1735–1744.
[39] (2018) The sound of pixels. In The European Conference on Computer Vision (ECCV).
[40] (2019) Vision-infused deep audio inpainting. In The IEEE International Conference on Computer Vision (ICCV).
[41] (2018) Visual to sound: generating natural sound for videos in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3550–3558.
[42] (2001) Blind source separation by sparse decomposition in a signal dictionary. Neural Computation 13 (4), pp. 863–882.