Humans combine sensory information of different origins to obtain a comprehensive picture of the world. Indeed, the information that we perceive with different senses often originates from the same object or event at the physical level. For instance, sound is a vibration which is commonly produced by the movement of an object. In the case of musical instrument sounds, we can even see the movements and associate a particular sound with its source [27, 52]. Also, each of those instruments has its own unique visual characteristics, such as shape and color, which help us to recognize it. When listening online through video streaming, we are often given all this information together with associated comments and any prior knowledge we may have about a piece, its instrumentation or an artist. All this helps us interpret the music we are exposed to, such that, while listening, we can focus on the individual sources of the sound and identify them. Moreover, phenomena such as synesthesia can also support audio-visual correspondence studies: some articles report correlations between loudness and visual size or light intensity, as well as between musical timbre and arbitrary visual shapes.
In this paper, we focus on Single Channel Source Separation (SCSS). This task is usually solved in the audio domain, but in this work we explore the effects of integrating two additional kinds of context data, namely instrument labels and their visual properties.
We work with audio-visual recordings of musical ensembles with several families of instruments that can be commonly found in a symphonic orchestra, such as strings, woodwinds and brass instruments, that is to say, mostly chamber music. Source separation in such a setup is known to be a very challenging task, and it has been approached with multi-channel score-informed methods or timbre-informed methods. It is worth emphasizing that the above studies operate on multi-channel recordings for which no clear ground truth was available. Besides, once a musical piece has been recorded, there is no simple way to unmix it.
The problem has several sources of complexity, to name a few:
The instruments within a family could be quite similar to one another;
The number of sources in the mixture is unknown in advance;
There is a high overlap in time and frequency between sources.
Even for instruments which have essentially different timbre, tone color, and playing techniques, such as clarinet and viola, some musicians may mimic the sound of one while playing the other.
As for combining different modalities of information, for many years the key technical problem was the huge gap (both in dimensions and content) between the representations of the modalities. One of the common approaches consisted in feature construction followed by dimensionality reduction [21, 18]. With the advent of deep learning techniques, the problem of the dimensionality mismatch can be considered solved, while finding a proper way of fusing different data representations remains an issue.
Another limitation of previous works is that the evaluation was done in somewhat unrealistic settings: typically, mixes of only two sources are considered, and instruments from the same family are rarely present. In contrast to previous work, we added viola and double bass to the string instruments, and trombone to the brass instruments, increasing the overall variety of timbres. Besides performing the source separation, our method (in non-conditioned settings and when conditioned on visual information) associates the outputs with the different types of instruments, implicitly providing information about the presence of those instruments in the mix.
This work explores conditioning techniques at different levels of a primary source separation network. We are not the first to propose Conditioned-U-Net for source separation or audio-visual source separation [14, 53, 52, 23, 31]. However, unlike prior approaches that were trained with an arbitrary choice of additional data integration, we conduct a thorough study identifying the optimal type of conditioning and comparing possible conditioning strategies with two types of context data: the presence or absence of instruments in the mixture and the video stream data. Another notable contribution of our approach is that training is done by employing a curriculum learning strategy on mixtures of up to 7 sources, and evaluation is carried out on real-world mixtures from the URMP dataset, which has up to 4 different instruments per piece, often from the same family. The complexity of the task makes the present approach suitable as a baseline for future research. To facilitate that, the present study is reproducible: we provide pretrained models, code, data and all the training parameters. The supplementary materials and examples are available at https://www.upf.edu/web/mdm-dtic/-/conditioned-u-net.
This paper provides an overview of existing techniques for source separation, audio-visual methods and conditioning strategies in Section 2, indicating relations and differences with the present work. In Section 3 we formalize our approach for performing source separation conditioned on context data. This is followed by Section 4 where we describe the experimental setup and implementation details. Finally, we discuss the obtained results in Section 5 and provide conclusions in Section 6.
2 Related work
2.1 Single channel source separation
Single channel source separation (SCSS) consists in estimating the individual sources $s_i(t)$ given a mono mixture time-domain signal $x(t)$ of $N$ sources:
$$x(t) = \sum_{i=1}^{N} s_i(t).$$
Instead of predicting the signals, a general approach for solving SCSS involves the estimation of masks for Short-Time Fourier Transform (STFT) values of the mixture. In this case, we consider a time-frequency representation of the mixture $X(f,t)$ and the sources $S_i(f,t)$, and the goal of the source separation method is to learn a real-valued (or complex-valued) mask $M_i$ for each source $s_i$.
In this work we only consider two types of real-valued masks, namely ideal ratio or soft masks $M^{r}_i$:
$$M^{r}_i(f,t) = \frac{|S_i(f,t)|}{|X(f,t)|},$$
and ideal binary masks $M^{b}_i$, which take the value 1 at the time-frequency bins dominated by source $i$ and 0 elsewhere.
Here $|S_i(f,t)|$ and $|X(f,t)|$ indicate the magnitude of the STFT value of $s_i$ and $x$, respectively, at frequency $f$ and time frame $t$.
We obtain the STFT magnitude values of the separated sources $|\hat{S}_i|$ by multiplying the STFT magnitude of the mixture $|X|$ by the estimated masks $\hat{M}_i$, i.e. $|\hat{S}_i(f,t)| = \hat{M}_i(f,t)\,|X(f,t)|$. Then, the waveforms of the source signals are recovered by applying the inverse STFT to the predicted magnitude, using the phase of the mixture.
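The masking step described above can be sketched with NumPy on toy data. This is only an illustration under stated assumptions: the ratio mask is clipped to [0, 1], the binary mask uses a hypothetical dominance threshold of 0.5 relative to the mixture magnitude, and the function names are ours, not taken from the paper's code. The inverse STFT is omitted.

```python
import numpy as np

def ratio_mask(source_mag, mix_mag, eps=1e-8):
    """Soft mask: share of the mixture magnitude attributed to one source.
    Clipping to [0, 1] is an assumption of this sketch."""
    return np.clip(source_mag / (mix_mag + eps), 0.0, 1.0)

def binary_mask(source_mag, mix_mag, threshold=0.5):
    """Binary mask: 1 where the source dominates the bin.
    The 0.5 threshold is a hypothetical choice, not the paper's definition."""
    return (source_mag > threshold * mix_mag).astype(np.float32)

def apply_mask(mask, mix_stft):
    """Masked mixture magnitude combined with the phase of the mixture."""
    return mask * np.abs(mix_stft) * np.exp(1j * np.angle(mix_stft))

# Toy complex "STFTs" of two sources, shape (frequency bins, time frames).
rng = np.random.default_rng(0)
s1 = rng.normal(size=(4, 5)) + 1j * rng.normal(size=(4, 5))
s2 = rng.normal(size=(4, 5)) + 1j * rng.normal(size=(4, 5))
mix = s1 + s2

m_soft = ratio_mask(np.abs(s1), np.abs(mix))
m_bin = binary_mask(np.abs(s1), np.abs(mix))
est = apply_mask(m_bin, mix)  # complex estimate, ready for an inverse STFT
```

Because only magnitudes are masked, the estimate inherits the mixture phase in every bin that the mask keeps.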
The mask estimation step has always been an essential component of model-based source separation algorithms [6, 32, 36, 7, 35, 49]. Consequently, the masking-based approach for training neural networks has received a lot of attention recently and has been very successful in SCSS [8, 50, 19]. While being consistent in the estimation objective, many authors propose additional schemes and techniques with the aim of raising the separation performance. For instance, prior work reports an improvement of 0.7 dB in the scale-invariant signal-to-distortion ratio (SI-SDR) metric by integrating mixture-consistency and STFT-consistency constraints into the training pipeline. Although most of the existing work estimates binary or ratio masks, the direct estimation of STFT magnitude values has also been used in practice, together with loss function computation in the time-frequency or time domain while internally estimating the masks.
It is worth noting that the set of methods which has been successfully used in source separation is very diverse, and the optimal choice of an architecture remains an open research question. Some examples include LSTMs and BLSTMs [47, 45], fully-connected architectures, U-Nets [19, 10], GANs [44, 9], as well as combinations of the above [47, 20]. Some research works suggest the estimation of each source separately with a dedicated network [8, 45], while other approaches employ one-to-many encoder-decoder networks with a shared encoder and one decoder per source. Overall, the use of an individual network for each source seems to provide better performance, but it comes at the cost of increased training time.
There have been diverse proposals for loss functions, which include distance-based losses (such as $L_1$ and $L_2$ distances) on estimated spectrograms [8, 47, 19, 10] and on ratio and binary masks, binary cross entropy on binary masks [53, 52], as well as negative SI-SDR [24, 30] and SNR as objective functions.
2.2 Audio-visual approaches and source separation
2.2.1 Audio-visual model-based methods
Audio and visual information are related, and, often, for every particular sound we can see or imagine its visual source of origin. In the real world, they have a causal relation. In addition, the study of some misattribution effects (e.g. ventriloquism) has shown that people tend to relate audio and visual events if they happen simultaneously. With this in mind, a correlation approach for source localisation was proposed as early as 2000. It consisted in calculating intensity changes in audio and video and computing correlations between the audio and every pixel in a sequence of frames. The authors showed that the method can successfully identify the speaking person at every time frame in videos of two people speaking in turns.
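A minimal sketch of such a correlation-based localisation is given below. It assumes one audio energy value per video frame and grayscale frames, and uses a plain Pearson correlation of frame-to-frame changes; the cited method may differ in its exact features and normalisation, and the function name is hypothetical.

```python
import numpy as np

def audio_pixel_correlation(audio_energy, frames, eps=1e-12):
    """Pearson correlation between audio and per-pixel intensity changes.

    audio_energy: shape (T,), one value per video frame (assumption).
    frames: shape (T, H, W), grayscale pixel intensities.
    Returns a (H, W) map of correlation values.
    """
    d_audio = np.diff(audio_energy)
    d_video = np.diff(frames, axis=0)
    d_audio = d_audio - d_audio.mean()
    d_video = d_video - d_video.mean(axis=0)
    numerator = np.tensordot(d_audio, d_video, axes=(0, 0))
    denominator = np.linalg.norm(d_audio) * np.sqrt((d_video ** 2).sum(axis=0))
    return numerator / (denominator + eps)

# Toy example: pixel (1, 2) follows the audio energy, the rest is noise.
rng = np.random.default_rng(0)
audio_energy = rng.uniform(size=60)
frames = rng.uniform(size=(60, 3, 4))
frames[:, 1, 2] = audio_energy
corr = audio_pixel_correlation(audio_energy, frames)
```

The pixel with the highest correlation is taken as the most likely location of the sounding source.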
Along the same line, Kidron et al. present a method that detects pixels associated with a sound source while filtering out other dynamic pixels. The method uses a refined version of canonical correlation analysis and, in contrast to previous studies which mostly focus on speech applications, it can handle different types of sounding sources, not only people speaking but also musical instruments being played. The authors also discuss the chorus ambiguity phenomenon, when several people sing in synchrony, and in this particular case they accept the detection of any of the faces as a successful result. The main concern raised by the authors is the extreme locality of the pixel regions associated with an audio event, which they overcome by introducing a sparsity constraint. That work was further extended by incorporating temporal information for matching visual and audio onsets.
More recent research, which focuses solely on chamber music performances, explores the association of musical scores with their spatio-temporal visual locations in video recordings. First, the authors perform audio-score alignment based on chroma features and Dynamic Time Warping, thereby automatically obtaining video-score alignment. Next, they use optical flow to compute bow-stroke motion velocities and correlate them with audio onsets. The further video analysis consists in fitting a Gaussian Mixture Model for player detection and computing a histogram of motion magnitudes for fine-grained localisation of a high-motion region.
Parekh et al. look for sparse motion patterns which are similar to audio activation matrices obtained with Non-negative Matrix Factorization (NMF). In particular, from the visual modality, the authors compute frame-wise average magnitude velocities of clustered motion trajectories. Then, a linear transformation which maps the motion velocity matrix to the spectral activation matrix is used to constrain the non-negative least squares cost function, together with a sparsity constraint. Both the NMF and the audio-motion transformation are jointly optimized. The results show a noticeable drop in signal-to-distortion ratio (SDR) when going from duos to quartets (from 7.14dB to 0.67dB for the best method when using soft masks for reconstruction). Thus, the proposed method has trouble separating sounds of the same instrument, although it addresses this problem for the first time. Interestingly, the authors only focus on the motion component of videos, ignoring other visual characteristics such as shape, color, and texture.
2.2.2 Audio-visual deep learning methods
With the breakthrough of deep learning techniques, the whole area of audio-visual learning has received a significant boost, especially for problems formulated in unsupervised and self-supervised manners. Along this line of research there are works focused on representation learning with further applications in audio classification, action recognition and source localization [4, 2, 3, 42, 23, 15, 37, 28]. Most of them combine features from two-stream networks (one tower for the audio and another one for the visual modality) either by concatenating them or by having an additional attention module. Some of them employ time synchrony for the samples of the same video [34, 23], while others learn to extract features by identifying whether the audio sample corresponds to the given visual data [42, 23, 4]. More recent work also focuses on the usage of audio for distilling redundant visual information to reduce computational costs.
Different objective functions such as cross-entropy [2, 3], KL-divergence [4, 15], contrastive or triplet losses are exploited in audio-visual deep learning. Distinctively, Korbar et al. use curriculum learning by first training the network with easy examples (correspondence is defined as being sampled from the same video) and then with hard/superhard examples (correspondence is defined as time-synchrony with/without time shift within the same video).
At the same time, the field of visually assisted source separation has emerged [12, 34, 29, 38, 51], in particular with an explicit focus on musical data [53, 52, 13, 14, 51]. Starting with capturing only visual appearance features [53, 13, 14, 51], there has been a shift towards capturing and integrating motion data.
To combine the data obtained from different modalities, commonly used approaches include late fusion, conditioning at the bottleneck via tile-and-multiply, concatenation, attention mechanisms [53, 52], and FiLM conditioning [11, 52].
Unlike previous studies, in the present work we analyse different ways to combine audio and visual information and extend prior work to multiple sources whose number is not known in advance.
It is also worth noting that two antecedent works in audio-visual source separation explore approaches which can be applied to estimating multiple sources [14, 51], separating one source at a time. However, they have only been trained on artificial mixtures of up to 4 sources and real mixtures of 2 sources. A previously proposed separation enhancement scheme consists in extracting one source at a time from a residual audio mixture while considering the maximum visual energy at every step, following an idea proposed earlier. The authors train the network with mixtures of 2 and 3 instruments, and test it on mixtures of up to 5 instruments.
Concurrently, the idea of co-separation has been proposed. The method consists in guiding source separation by integrating visual features of a detected musical instrument at the bottleneck of the primary U-Net, while the training is done using a mix-and-separate approach with a combination of separation and consistency losses. The latter is defined as a cross-entropy loss between ground truth instrument labels and the predictions obtained with an additional classifier on the preliminary separated sources.
2.3 Conditioned source separation
In the previous section we reviewed an existing research line in source separation which combines information from the visual and audio modalities. It can be reformulated as audio source separation conditioned on visual information. We observe that, while there are several strategies of data fusion (e.g. concatenation or co-processing), another possibility is to modulate the activations of a primary audio network by a context vector extracted from another modality, which is known as Feature-wise Linear Modulation (FiLM). The conceptual idea of FiLM conditioning is simple: it takes a set of learned features and scales and shifts them according to a context vector. The scaling parameters $\gamma$ and shifting parameters $\beta$ are learned based on an input context vector $c$ by an arbitrary function $f$ which is called the FiLM-generator:
$$(\gamma, \beta) = f(c).$$
The learned parameters modulate a neural network's activations $F_k$, where $k$ refers to a feature or feature map, via a feature-wise affine transformation:
$$\mathrm{FiLM}(F_k \mid \gamma_k, \beta_k) = \gamma_k F_k + \beta_k.$$
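The feature-wise affine transformation can be illustrated with a short NumPy sketch. The FiLM generator here is assumed to be a single linear layer with randomly initialised weights, purely for illustration; the dimensions (13 context entries, 8 feature maps) and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def film_generator(context, w_gamma, b_gamma, w_beta, b_beta):
    """Hypothetical linear FiLM generator: context vector -> (gamma, beta)."""
    gamma = context @ w_gamma + b_gamma
    beta = context @ w_beta + b_beta
    return gamma, beta

def film(features, gamma, beta):
    """Scale and shift each feature map; features has shape (channels, H, W)."""
    return gamma[:, None, None] * features + beta[:, None, None]

n_context, n_channels = 13, 8
context = np.zeros(n_context)
context[[0, 3]] = 1.0  # e.g. two instruments marked as present

w_gamma = rng.normal(scale=0.1, size=(n_context, n_channels))
w_beta = rng.normal(scale=0.1, size=(n_context, n_channels))
b_gamma = np.ones(n_channels)   # bias of 1 keeps the modulation near identity
b_beta = np.zeros(n_channels)

features = rng.normal(size=(n_channels, 4, 6))
gamma, beta = film_generator(context, w_gamma, b_gamma, w_beta, b_beta)
modulated = film(features, gamma, beta)
```

One scale and one shift per feature map keeps the number of conditioning parameters independent of the spatial size of the activations.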
Other studies consider weak conditioning in source separation using only labels of target sources [31, 43], in contrast to strong conditioning where the context vector could be available frame-wise [46, 41]. The employed weak label conditioning techniques include FiLM and tile-and-multiply. For strong conditioning, a binary vocal activity vector and a vocals magnitude vector have been used for singing voice separation with an attention mechanism.
Later, the idea has been explored in the context of universal source separation with conditioning on classification embeddings. First, the method extracts the context embeddings with the classification network, then upsamples and normalizes them, which is followed by conditioning of the primary source separation network either by concatenation with the network's activations or by gating the activations with the embeddings. Another work goes along this line and trains a source separation model based solely on weak labels. The method consists in training a classifier network and using the classification loss (with an additional constraint for the estimated sources to sum to the mixture) as the objective function for separation.
We find various strategies to integrate side information, and different modules of the network being conditioned. However, most of the studies inject the context vector at the bottleneck of an encoder-decoder architecture, with the rare exception of early fusion. The authors of the latter report that integration of the context vector at every layer of the primary network leads to overfitting.
In this work, we study the effect of integrating two types of context information, namely labels and visual context, at different locations of the network, while keeping the architecture fixed and simple.
We use a mix-and-separate approach for training, such that every mixture is generated on the fly and is, therefore, unique. To create a mixture, we take the following steps: (1) we sample an arbitrary subset of instruments; (2) for each sampled instrument, we pick a random segment from one of the audio recordings of that instrument category; and (3) we sum the time-domain values of the segments and clip them to the range $[-1, 1]$. Given a magnitude spectrogram of the mixture, our network learns to predict real-valued masks, one mask per potential instrument present in the mixture (we use 13 different instruments in our experiments, see Figure 1(a) U-Net). Each output mask is associated with a certain kind of instrument, and their order is fixed to reduce the source permutation effect.
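The three mixing steps can be sketched as follows. This is a simplified illustration, not the released training code: the dictionary of recordings, the function name and the clipping range of [-1, 1] are assumptions, and the toy data uses 4 instrument classes instead of the full set.

```python
import numpy as np

def make_mixture(recordings, n_sources, segment_len, rng):
    """recordings: dict mapping instrument name -> list of 1-D audio arrays.

    Returns the clipped mixture and a (n_instruments, segment_len) array of
    sources in a fixed instrument order; absent instruments stay silent.
    """
    names = sorted(recordings)  # fixed output order
    chosen = rng.choice(len(names), size=n_sources, replace=False)  # step 1
    sources = np.zeros((len(names), segment_len))
    for k in chosen:
        takes = recordings[names[k]]
        audio = takes[rng.integers(len(takes))]
        start = rng.integers(0, len(audio) - segment_len + 1)       # step 2
        sources[k] = audio[start:start + segment_len]
    mixture = np.clip(sources.sum(axis=0), -1.0, 1.0)               # step 3
    return mixture, sources

rng = np.random.default_rng(0)
recordings = {
    name: [rng.uniform(-0.8, 0.8, size=1000) for _ in range(2)]
    for name in ["cello", "flute", "trumpet", "violin"]
}
mixture, sources = make_mixture(recordings, n_sources=3, segment_len=256, rng=rng)
```

Keeping silent rows for absent instruments matches the fixed ordering of the output masks, so each row can serve as the target for one mask.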
Additionally, we employ a curriculum learning strategy for training, gradually increasing the number of sources in the mixture. Consequently, the predictions of the network are sparse, meaning that many sources should be silent (and many masks are all zeros) as only a subset of instruments is present in the mix.
3.1 U-Net and Multi-Head U-Net baselines
As the focus of this work is on studying the effect of different types of conditioning, we leave for future research the analysis of different source separation networks and adopt two simple U-Net versions as the baseline architectures, given that U-Net has been extensively used and demonstrated good performance [19, 10, 53, 52, 14].
U-Net is an encoder-decoder architecture with skip connections, such that the activations of every layer of the encoder are concatenated with the activations of the corresponding (mirrored) layer of the decoder, which can be considered a light form of conditioning by itself. Following [53, 52], we have chosen one of the architectures they propose and set the number of layers to 6. We employ two variants of the architecture, namely: (a) a baseline U-Net architecture as pictured in Figure 1(a), which outputs 13 masks after the last upconvolutional layer, and (b) a Multi-Head U-Net (MHU-Net) as pictured in Figure 1(b), which has a single shared encoder and 13 decoders, where each dedicated decoder yields a mask for its corresponding instrument.
Audio is resampled to 11025 Hz before preprocessing. We use a Hann window, and the STFT is computed for every segment of approx. 6 seconds (65535 audio samples) with a window size of 1022 and a hop size of 256, which results in a matrix of $512 \times 256$ STFT bins. Those parameters are taken from prior work and some of them have been proven to work well, e.g. the resulting time step of about 23ms (256 samples at 11025 Hz) is in line with the best-performing window size of 25ms reported for universal sound separation. Next, we study a few preprocessing strategies over the STFT representation, including linear and log-sampled frequency scales for the STFT, as well as log-scale and dB-scale with normalization for the STFT magnitude values, as discussed in Section 4.4.
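The STFT dimensions and durations implied by these parameters can be checked with a few lines of arithmetic. The frame count below assumes centered (padded) framing, as is the default in common STFT implementations; this is an assumption about the setup rather than a documented detail.

```python
sample_rate = 11025   # Hz
segment_len = 65535   # samples per training segment
n_fft = 1022          # window size in samples
hop = 256             # hop size in samples

n_freq_bins = n_fft // 2 + 1        # one-sided spectrum
n_frames = segment_len // hop + 1   # assumes centered (padded) framing

segment_seconds = segment_len / sample_rate
window_ms = 1000 * n_fft / sample_rate
hop_ms = 1000 * hop / sample_rate

print(n_freq_bins, n_frames)
print(round(segment_seconds, 2), round(window_ms, 1), round(hop_ms, 1))
```

At this sampling rate, the hop of 256 samples lasts about 23 ms, while the 1022-sample window spans about 93 ms.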
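The choice of the loss function depends on the type of the mask. For binary masks, at each time-frequency bin we compute the binary cross entropy (BCE) loss:
$$\mathcal{L}_{\mathrm{BCE}}(y, \hat{y}) = -\left[ w\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \right],$$
where $y$ and $\hat{y}$ represent ground truth and predicted mask values and $w$ is a positive weight which is used to compensate for the class imbalance in the mask values.
For ratio masks we employ the smooth $L_1$ loss, which is defined as:
$$\mathcal{L}_{\mathrm{smooth}\,L_1}(d) = \begin{cases} 0.5\, d^2, & d < 1, \\ d - 0.5, & \text{otherwise}, \end{cases}$$
where $d = |y - \hat{y}|$ refers to the distance between ground truth and predicted mask values.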
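Both losses can be written compactly in NumPy. The sketch below assumes the common forms of a positively weighted BCE and of the smooth L1 loss with a transition point at 1, averaged over all bins; the actual implementation may differ in reduction and weighting details.

```python
import numpy as np

def weighted_bce(target, pred, pos_weight=1.0, eps=1e-7):
    """Binary cross entropy with a weight on the positive class."""
    pred = np.clip(pred, eps, 1.0 - eps)  # avoid log(0)
    loss = -(pos_weight * target * np.log(pred)
             + (1.0 - target) * np.log(1.0 - pred))
    return loss.mean()

def smooth_l1(target, pred):
    """Quadratic for small errors, linear for large ones (threshold at 1)."""
    d = np.abs(target - pred)
    return np.where(d < 1.0, 0.5 * d ** 2, d - 0.5).mean()

target = np.array([1.0, 0.0])
pred = np.array([0.5, 0.5])
bce_plain = weighted_bce(target, pred)                     # equals log(2)
bce_weighted = weighted_bce(target, pred, pos_weight=2.0)  # equals 1.5 * log(2)
```

A positive weight larger than 1 penalises missed active bins more strongly, which is one way to counter masks that are mostly zeros.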
3.2 Conditioned U-Net
In this section we describe the conditioning strategies and the types of context data which we use in our Conditioned U-Net architecture (Figure 1(c, d)).
3.2.1 Weak label conditioning
We study weak conditioning for source separation, which means that instrument labels are available at the level of individual recordings. They indicate the presence or absence of each instrument in the mix, which is encoded in a binary indicator vector $c \in \{0, 1\}^K$, where $K$ is the total number of instrument classes considered.
Then, we use $c$ as a conditioning context vector and compare three types of FiLM conditioning, introduced (1) at the bottleneck, (2) at all encoder layers, and (3) at the final decoder layer, as indicated in Figure 1(c). More formally, for each conditioned layer $l$ we have activations or embeddings $z_l$, and the conditioning is as follows:
$$\hat{z}_l = \gamma_l(c) \odot z_l + \beta_l(c).$$
Furthermore, we explore simple multiplicative conditioning with the binary indicator vector:
$$\hat{M}_i = c_i\, \tilde{M}_i,$$
where $c_i$ is the $i$-th component of the context vector and $\tilde{M}_i$ is the preliminary mask as predicted by the (MH)U-Net.
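The multiplicative conditioning amounts to zeroing the masks of instruments that are labelled as absent. A minimal sketch with toy dimensions (the mask size and the chosen label indices are arbitrary, and random values stand in for network outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
n_instruments = 13

# Binary indicator vector: instruments 0 and 5 are present in this toy mix.
labels = np.zeros(n_instruments)
labels[[0, 5]] = 1.0

# Preliminary masks, shape (instruments, freq bins, frames); random values
# stand in for the predictions of the separation network.
prelim_masks = rng.uniform(size=(n_instruments, 8, 10))

# Broadcast the label of each instrument over its whole mask.
final_masks = labels[:, None, None] * prelim_masks
```

Masks of present instruments are left untouched, so this form of conditioning can only silence outputs, not refine them.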
3.2.2 Visual conditioning
In the case of visually-informed source separation, we consider both static characteristics and motion-aware conditioning. Nonetheless, we would like to note that learning temporal information from videos is a challenging task which is still under research. Therefore, visually-informed methods mostly use a single frame for conditioning [53, 14, 13], with the exception of approaches based on dense trajectories and deep-learned dense trajectories.
As in [36, 14], we assume that the rough spatial location of each source is given (e.g. it can be obtained by a segmentation or human detection algorithm). Keeping this assumption in mind, we use uncropped frames from individual videos for training and evaluation. In a real-life scenario (e.g. for testing) we use a bounding box around every player.
For visual context conditioning, we take a single video frame corresponding to the beginning of the audio source sample. We use a pretrained ResNet-50 to extract a visual feature vector of size 2048 for every present source, and then concatenate them, obtaining a visual context vector of size $2048 \cdot N_{\max}$, where $N_{\max}$ is the maximum number of sources in the mixture. The context vector for the unavailable sources is set to all zeros. As in the case of weak label conditioning, we compare three alternatives for the FiLM conditioning (see Figure 1(c)).
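Assembling the visual context vector can be sketched as below. Random vectors stand in for ResNet-50 features, and the maximum of 7 sources is taken from the curriculum setting described in this paper; the function name and the slot ordering are assumptions of this sketch.

```python
import numpy as np

def build_visual_context(features_per_source, max_sources, feat_dim=2048):
    """Concatenate per-source feature vectors into one context vector.

    Slots of unavailable sources are left as zeros.
    """
    context = np.zeros((max_sources, feat_dim), dtype=np.float32)
    for i, feat in enumerate(features_per_source[:max_sources]):
        context[i] = feat
    return context.reshape(-1)

rng = np.random.default_rng(0)
# Stand-ins for the features of three visible players.
feats = [rng.normal(size=2048) for _ in range(3)]
context = build_visual_context(feats, max_sources=7)
```

A fixed-size vector lets the same FiLM generator handle mixtures with any number of sources up to the maximum.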
For visual-motion conditioning, we first extract visual feature vectors with the pretrained ResNet-50 at a fixed frame rate within a selected segment. We then pass the obtained sequence of vectors through a small uni-directional LSTM network, with the aim of capturing motion characteristics while keeping visual information. We take the last LSTM hidden state of size 1024 for every sequence and concatenate the obtained features, resulting in a motion context vector of size $1024 \cdot N_{\max}$, where $N_{\max}$ is the maximum number of sources in the mixture. Due to the large computational cost, and based on the results of the ablation study (Section 4.4), we only report this approach with FiLM conditioning at the bottleneck of the audio U-Net.
In what follows, we thoroughly evaluate the proposed method on various setups. In particular, we compare the different conditioned networks with respect to several performance metrics.
The original URMP dataset consists of 44 arrangements (of which 11 are duets, 12 are trios, 14 are quartets, and 7 are quintets). Each source track was recorded separately with an external coordination, and the final mixes were assembled afterwards. The instrumentation is a typical one for chamber and orchestral music, and includes such families of instruments as strings (violin, viola, cello and double bass), woodwinds (flute, oboe, clarinet, bassoon, saxophone), and brass (trumpet, horn, trombone, tuba). The dataset is constructed to reflect the complexity of the musical world where the same instrument within a section can appear more than once.
As in this work we only tackle the problem of separating sources of different instruments, we mix source tracks of the same instrument within the same piece and consider the resulting mix as a single source. For example, for a string quartet (which consists of 2 violins, viola and cello), we join the two violin source tracks, which results in a corresponding "trio" where the two violins are considered as a single source. Also, we remove four pieces (02_Sonata_vn_vn, 04_Allegro_fl_fl, 05_Entertainer_tpt_tpt, 06_Entertainer_sax_sax) from the dataset as they are duets of the same instrument and there would be nothing to separate. After this preprocessing, we are left with 12 duets, 20 trios and 8 quartets in the final set.
The Solos dataset consists of 755 YouTube videos of solo musical performances of the same 13 instrument categories as in the URMP dataset. It has a total duration of about 66 hours. A major part of the dataset consists of audition performances, which ensures, together with manual and semi-automatic checking, a good quality of audio and video. The dataset is positioned as a tool to facilitate training with the mix-and-separate strategy while being complementary to the URMP dataset. The latter allows proper evaluation on real-world mixtures.
Several studies indicate that widely-adopted source separation metrics such as signal to distortion ratio (SDR), signal to interference ratio (SIR), and signal to artifacts ratio (SAR) do not always agree with human perception [24, 22, 53, 14]. Recently, scale-invariant and scale-dependent SDR (SI-SDR, SD-SDR) metrics have been proposed in order to tackle this issue.
Unlike previous works [14, 51, 53, 52], our method produces sparse outputs since many predicted sources are expected to be silent. However, all the above metrics are ill-defined for silent sources and targets. To address this issue, we also compute the cumulative predicted energy at silence (PES) and energy at predicted silence (EPS), as proposed in prior work. For SI-SDR and SD-SDR larger values indicate better performance, while for PES and EPS smaller values indicate better performance. For numerical stability of the log function, in our implementation we add a small constant, which imposes a lower bound on the metrics.
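A simplified sketch of these metrics is given below. SI-SDR follows its standard definition via projection of the estimate onto the reference. The energy-based helper computes a whole-segment energy in dB, which corresponds to PES when applied to a prediction whose target is silent and to EPS when applied to a target whose prediction is silent. The constant `eps = 1e-8` is a hypothetical value chosen for illustration (it gives a floor of -80 dB) and is not taken from the paper.

```python
import numpy as np

def si_sdr(reference, estimate, eps=1e-8):
    """Scale-invariant SDR in dB for 1-D signals."""
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference   # projection of the estimate on the reference
    noise = estimate - target    # everything not explained by the reference
    return 10 * np.log10((np.sum(target ** 2) + eps) / (np.sum(noise ** 2) + eps))

def energy_db(signal, eps=1e-8):
    """Energy in dB; eps (hypothetical value) sets the lower bound."""
    return 10 * np.log10(np.sum(signal ** 2) + eps)

t = np.arange(1000)
reference = np.sin(0.05 * t)
estimate = reference + 0.1 * np.cos(0.31 * t)

score = si_sdr(reference, estimate)
pes_when_perfectly_silent = energy_db(np.zeros_like(reference))
```

Rescaling the estimate leaves SI-SDR unchanged, whereas an all-zero prediction for a silent target reaches the lower bound set by `eps`.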
4.3 Training and implementation details
Our U-Net is composed of 6 blocks in the encoder and 6 blocks in the decoder. Each encoder block consists of a convolutional layer followed by batch normalization with an optional conditioning layer (for FiLM-encoder conditioning), and ReLU non-linearity. A decoder block consists of a bilinear upsampling layer, a convolutional layer, batch normalization, ReLU non-linearity, and a dropout layer.
The network is trained for 500k iterations with a batch size of 16, the Adam optimizer, and an initial learning rate which is halved after 25k iterations with no improvement on the validation set.
We opted for a curriculum learning strategy. It consists in starting the training with mixtures of only 2 sources, and gradually increasing the maximum number of sources up to 7. The increment is carried out if the validation loss does not decrease for 10k iterations.
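The curriculum schedule can be sketched as a small stateful helper. The class name and the decision to reset the best validation loss after each increment are assumptions of this sketch; the text only specifies the starting point (2 sources), the maximum (7) and the patience (10k iterations).

```python
class CurriculumScheduler:
    """Raise the maximum number of sources when validation loss plateaus."""

    def __init__(self, start=2, maximum=7, patience=10_000):
        self.max_sources = start
        self.maximum = maximum
        self.patience = patience
        self.best_loss = float("inf")
        self.best_iteration = 0

    def update(self, iteration, val_loss):
        """Call after each validation pass; returns the current maximum."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_iteration = iteration
        elif (iteration - self.best_iteration >= self.patience
              and self.max_sources < self.maximum):
            self.max_sources += 1
            # Assumption: restart the plateau detection for the harder task.
            self.best_loss = float("inf")
            self.best_iteration = iteration
        return self.max_sources


scheduler = CurriculumScheduler()
stages = [scheduler.update(it, loss)
          for it, loss in [(0, 1.0), (5_000, 1.0), (10_000, 1.0)]]
```

During training, each sample would then draw its number of sources between 2 and the value returned by `update`.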
For training and evaluation we utilize the mix-and-separate procedure by creating artificial mixtures from individual videos of Solos. Every training sample has an arbitrary number of sources with an upper bound of the maximum number of sources at the current curriculum stage. For testing, we use real mixtures from the URMP dataset.
4.4 Baseline ablation study
In preparation for the conditioned source separation analysis, and to define the optimal hyperparameters of the baseline U-Net architecture described in Section 3.1, we conduct a series of ablation experiments. We examine the following set of hyperparameters: (1) linear vs. log frequency scale for the STFT representation, (2) binary vs. ratio mask estimation, (3) data augmentation with normally-distributed noise, (4) log vs. dB-normalized scale for the STFT values, (5) the use of curriculum learning, and (6) the effectiveness of Multi-Head U-Net vs. vanilla U-Net.
Our final baseline configuration is a model which takes a dB-normalized and log-frequency scaled STFT as input. It has a single decoder and predicts binary masks. We have opted out of augmenting the input with normally-distributed noise and have used curriculum learning.
5 Results and Discussion
5.1 Ablation studies
We report the metrics obtained by our baseline models in the ablation study in Table I, and the full list of hyperparameters is given in Appendix A. The experiments can be matched by the experiment ID. We also provide the metrics for two baselines, the upper bound separation quality (U) with ideal binary masks (IBM), and the mixture metrics (L) which reproduce the input mixture at every possible output source.
Even though the results for a multi-decoder architecture (exp. ids: 3-8) have a higher separation quality, they double the required computational cost and therefore we have opted out of training the MHU-Net architecture. Table I shows that ratio masks (exp. 4), when compared to binary masks (exp. 3), give higher (SI-/SD-)SDR but perform much worse in terms of PES. In particular, the increment in SI-SDR is 8.4dB and in SDR is 3.6dB, while the drop in PES is 29.7dB. In practice, we noticed that, when training with ratio masks, the to-be-silent output sources eventually end up being the original mixture with a lowered volume. Therefore, in all following experiments we predict binary masks. Further study on combining binary and soft masks may help solve this issue. We also observed that augmenting the input data with normally-distributed noise does not improve separation performance, and other more advanced techniques are needed.
Figure 2 shows the performance measured by SI-SDR, SD-SDR and PES in Exp. 4 for each instrument in the URMP dataset. The results emphasize the fluctuations between the instruments. We can see that for bassoon, tuba, horn, and viola the mean SI-SDR is about -6.5dB, which is quite poor. In contrast, for some string instruments such as cello, double bass and violin, SI-SDR is higher (with the maximum mean value of 1.8dB for cello). Saxophone is a special case: its performance metrics are good on average, but the standard deviation is the highest among all the instruments.
Overall, the ablation studies indicate that different aspects of the separation quality measured by the standard metrics can be enhanced by applying different learning strategies. Notably, the curriculum learning technique helps improve overall separation quality for all the metrics measured. The next significant improvement of 3.4dB in SI-SDR is obtained by changing the frequency scale of the STFT representation, followed by the multi-decoder U-Net architecture (1.6dB improvement in SI-SDR) and dB-normalized STFT values (8.4dB improvement in SI-SDR), which improve (SI-/SD-)SDR but worsen PES (by 3.4dB and 5.7dB, respectively).
5.2 Conditioning on labels
We further study weak label conditioning of the single-decoder U-Net model. We provide results for linear-frequency scale and log-frequency scale STFT inputs and four conditioning schemes as described in Section 3.2.1. The summary of weak label conditioned source separation is shown in Table II.
We observe that the best performance in terms of (SI-/SD-)SDR is obtained with multiplicative conditioning of the output masks, but it also leads to high PES, even worse than in the case of ratio masks in the ablation study. Specifically, the label-multiply conditioning method achieves -2.8dB and -3.0dB of SI-SDR for the linear-frequency scale (exp. 12) and log-frequency scale (exp. 16), while yielding 7.4dB and 8.9dB in PES, respectively.
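As an illustration of multiplicative conditioning on the output masks, the sketch below scales each predicted mask by the corresponding entry of the weak label vector. This is one plausible reading of the label-multiply scheme, not a reproduction of the implementation described in Section 3.2.1; the function name `label_multiply` and the array shapes are assumptions.

```python
import numpy as np

def label_multiply(masks, labels):
    """Scale every predicted mask by its instrument label.

    masks:  (n_sources, n_freq, n_frames) predicted masks in [0, 1].
    labels: (n_sources,) binary vector, 1 if the instrument is in the mix.
    Both argument layouts are assumptions made for this sketch.
    """
    masks = np.asarray(masks, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Broadcast the label over the frequency and time axes.
    return masks * labels[:, None, None]


# Three candidate instruments, the second one is absent from the mixture.
predicted = np.full((3, 2, 2), 0.5)
conditioned = label_multiply(predicted, [1, 0, 1])
```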
Within the FiLM conditioning experiments, we note that FiLM-bottleneck conditioning clearly outperforms the FiLM-encoder and FiLM-final types of conditioning by a mean margin of 1.8dB in SI-SDR and 0.7dB in SDR. We found that FiLM-encoder and FiLM-final conditioning may lead to overfitting and even worsen the results w.r.t. the non-conditioned U-Net, while FiLM-bottleneck conditioning consistently improves the results in all tested settings.
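For readers unfamiliar with FiLM, the following minimal sketch shows the feature-wise affine transformation applied to a stack of feature maps, with the scale and shift produced by linear projections of a conditioning vector. The weight names, shapes, and the use of plain linear projections are assumptions for illustration; the sketch does not specify where in the U-Net (encoder, bottleneck, or final layer) the transformation is applied.

```python
import numpy as np

def film(features, condition, w_gamma, b_gamma, w_beta, b_beta):
    """Apply a feature-wise linear modulation (FiLM) to feature maps.

    features:  (channels, height, width) feature maps.
    condition: (cond_dim,) conditioning vector, e.g. instrument labels
               or a visual feature vector.
    w_*, b_*:  hypothetical projection weights (channels, cond_dim)
               and biases (channels,).
    """
    gamma = w_gamma @ condition + b_gamma  # per-channel scale
    beta = w_beta @ condition + b_beta     # per-channel shift
    return gamma[:, None, None] * features + beta[:, None, None]


features = np.ones((2, 2, 2))
condition = np.array([1.0, 0.0])
w_gamma = np.array([[2.0, 0.0], [0.0, 1.0]])
b_gamma = np.zeros(2)
w_beta = np.zeros((2, 2))
b_beta = np.array([0.5, -0.5])
modulated = film(features, condition, w_gamma, b_gamma, w_beta, b_beta)
```

In this toy example the first channel is scaled by 2 and shifted by 0.5, while the second channel is scaled by 0 and shifted by -0.5, showing how a conditioning vector can amplify or suppress individual channels.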
Even though the log-scale STFT input outperforms the linear-scale STFT input in the case of no conditioning or FiLM-encoder conditioning, there is no significant difference for FiLM-bottleneck and label-multiply conditioning, and there is a drop in performance for FiLM-final conditioning.
Figure 3 shows scatter plots of input SI-SDR versus improvement in SI-SDR for each segment in the URMP dataset. Subfigure (a) demonstrates results for the model of exp. 12 with multiplicative label conditioning from Table II. Subfigure (b) displays the upper bound results obtained with ideal binary masks. The figure indicates the potential upper bound separation performance that can be achieved on this dataset.
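The quantities on the two axes of such a scatter plot can be sketched as follows: SI-SDR is computed for the unprocessed mixture (the input SI-SDR) and for the separated estimate, and the improvement is their difference. The code uses the standard scale-invariant definition, in which the reference is rescaled by the optimal projection coefficient; the function names and the small stabilising constant `eps` are assumptions of this sketch.

```python
import numpy as np

def si_sdr(reference, estimate, eps=1e-8):
    """Scale-invariant SDR in dB between two 1-D time-domain signals."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    # Optimal scaling of the reference towards the estimate.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    residual = estimate - target
    return 10 * np.log10((np.sum(target ** 2) + eps) / (np.sum(residual ** 2) + eps))

def si_sdr_improvement(reference, estimate, mixture):
    """Output SI-SDR minus input SI-SDR (mixture used as the estimate)."""
    return si_sdr(reference, estimate) - si_sdr(reference, mixture)


reference = np.array([1.0, 0.0, -1.0, 0.0])
interference = np.array([0.0, 1.0, 0.0, -1.0])
mixture = reference + interference
estimate = reference + 0.1 * interference  # interference attenuated by 20 dB
improvement = si_sdr_improvement(reference, estimate, mixture)
```

In this toy case the input SI-SDR is approximately 0dB and the output SI-SDR approximately 20dB, so the improvement is close to 20dB.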
5.3 Conditioning on visual information
We compare visually conditioned U-Net with its corresponding non-conditioned and label conditioned baselines.
Table III shows the performance of the single-frame visually conditioned U-Net given the same FiLM locations as in the label conditioning case. It also reports the results of conditioning on a visual-motion context vector learned from 15 and 50 frames per segment (with the frame rate set to 2.5 fps and 8.3 fps, respectively). Lastly, we report the results for the Sound-of-Pixels (SoP) method. SoP-unet7 stands for the original method trained on the Music dataset published in . We used the officially provided weights and evaluated the model on the URMP dataset. SoP-unet7-ft denotes the version which was fine-tuned on the Solos dataset. SoP-unet5-Solos denotes a model with 5 blocks in the U-Net which is trained from scratch. In all SoP networks, the visual and audio networks are trained simultaneously, whereas in our conditioning experiments the visual network is frozen in all cases except FiLM-bottleneck-ft.
The results show that the visually conditioned U-Net, analogously to the label conditioned U-Net, outperforms its non-conditioned baseline only in the case of FiLM-bottleneck conditioning, whereas the FiLM-encoder and FiLM-final methods result in a performance drop of up to 2dB in SI-SDR. FiLM-bottleneck single-frame conditioning slightly outperforms its hypothetical label conditioned upper bound (Exp. 14 from Table II), and FiLM-bottleneck-ft outperforms the baseline by a margin of 0.8dB. Additionally, among the experiments where both audio and visual subnetworks are trained, the FiLM-bottleneck-ft architecture outperforms SoP-unet7 trained on Music in both SDR and SIR. However, it performs worse in SDR than SoP-unet5-Solos trained on Solos, while still performing better in SIR. SoP-unet7-ft, trained on Music and fine-tuned on Solos, performs best out of all visually conditioned networks, which indicates that the performance can still be improved by employing datasets which are bigger and of better quality. The experiments with the visual-motion context vector indicate the need for a better motion representation, as the results show a performance drop w.r.t. single-frame visual conditioning.
5.4 Unsuccessful attempts
We would like to report several strategies which did not improve source separation performance in our experiments.
In one of the experiments, we used loss while directly predicting spectrogram values instead of using the masking-based approach. However, the network failed to converge. We hypothesise that this behaviour is due to the higher complexity of the spectrograms and the sparsity of the outputs.
We also unsuccessfully attempted to employ multi-task learning in order to further regularize the embedding space. In these experiments we jointly optimized classification and separation losses, trying to predict which instruments are present in the mixture by using the bottleneck U-Net features as input to a small classifier consisting of a single fully-connected layer. While the training generally converged, both the classification and the separation performance were lower than those of the stand-alone models.
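The joint objective of such a multi-task setup can be sketched as a weighted sum of a separation loss and a classification loss computed from the bottleneck features with a single fully-connected layer. The choice of binary cross-entropy for both terms, the weighting factor `weight`, and all names below are assumptions made for illustration; they are not taken from the experiments described above.

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between predictions and targets."""
    pred = np.clip(np.asarray(pred, dtype=float), eps, 1 - eps)
    target = np.asarray(target, dtype=float)
    return float(np.mean(-(target * np.log(pred) + (1 - target) * np.log(1 - pred))))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def joint_loss(pred_masks, true_masks, bottleneck, w, b, labels, weight=1.0):
    """Separation loss plus weighted instrument-classification loss.

    bottleneck: bottleneck features, flattened before the classifier.
    w, b:       hypothetical weights of the single fully-connected layer.
    labels:     binary instrument-presence vector.
    """
    separation = bce(pred_masks, true_masks)
    logits = w @ np.asarray(bottleneck, dtype=float).ravel() + b
    classification = bce(sigmoid(logits), labels)
    return separation + weight * classification, separation, classification


pred_masks = np.array([[0.9, 0.1]])
true_masks = np.array([[1.0, 0.0]])
bottleneck = np.array([1.0, -1.0])
w = np.array([[2.0, 0.0], [0.0, 2.0]])
b = np.zeros(2)
labels = np.array([1.0, 0.0])
total, separation, classification = joint_loss(
    pred_masks, true_masks, bottleneck, w, b, labels
)
```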
In our experiments we observe that the use of external information generally improves the separation performance. FiLM-encoder conditioning leads to overfitting and only improves on SIR. FiLM-final conditioning improves on SAR but not on the rest of the metrics. FiLM-bottleneck and Label-multiply conditioning improve over all their corresponding baselines in all the metrics except PES, and the same behaviour is observed while predicting ratio masks and using Multi-Head U-Net.
From the results we observe that the U-Net conditioned on the visual context vector improves over the unconditioned versions in terms of (SI-/SD-)SDR but performs worse in terms of PES and SIR. A possible explanation for this observation may lie in the capacity of the network to learn playing/non-playing activity from the visual information. However, it may still confuse musical instruments from the same family (such as viola and violin), which may result in more interference and mispredictions when both of them are present in a mixture, which is a common case for the URMP dataset.
By inspecting the results obtained by the Sound-of-Pixels method, we highlight the importance of addressing the source separation problem in a real-world scenario: the method was previously tested in mix-and-separate settings, where the reported results had an average SDR of 8dB. Our results demonstrate the need for testing on real mixtures rather than relying on the mix-and-separate approach. Notably, even the 5-block Sound-of-Pixels model trained on Solos performs better than the 7-block Sound-of-Pixels model trained on Music. Joint fine-tuning of the original Sound-of-Pixels model improves the quality of source separation by 1.2dB in SDR, which also indicates the need to enlarge the datasets and enhance their quality.
Following , we confirm that directly integrating visual information from multiple frames in the form of visual features worsens separation results. Even though we know from the literature that source separation can benefit from integrating motion information [52, 36, 27], we would like to note that all the aforementioned methods use complex pre-processing in order to extract reliable motion features, which draws attention to the problem of closing the gap between motion and audio representations.
Another point that should be noted is that all sources of information should be combined correctly, preserving the synchrony between them. While this is less important for single-frame visual and weak label information, it may become crucial for successful conditioning on temporal data such as motion, pitch, and musical scores. Consequently, a different baseline source separation architecture, such as an RNN-based network, may improve on the current results due to its sequential nature, which better preserves time-domain information.
Given the above observation, we note that the U-Net architecture may be a limitation of our study, and the results may differ for other baseline architectures.
Given that the best results in terms of different metrics are achieved with different setups (e.g. binary and ratio masks), we would like to emphasise that a further enhancement may be obtained by combining the best of both worlds, as has been proposed in . Finally, we would like to note the opportunity to surpass the current performance by employing additional constraints for the loss functions as in , , or by weighting the loss values of the masks with the magnitude values of the mixture [53, 14], as this may help to avoid treating every time-frequency bin equally and to focus attention on the areas where most of the energy is concentrated.
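The idea of weighting the mask loss by the mixture magnitude can be sketched as follows: the per-bin binary cross-entropy is multiplied by a weight proportional to the mixture magnitude, so that errors in high-energy bins contribute more than errors in near-silent bins. The normalisation of the weights to sum to one and the function name are assumptions of this sketch; the cited works may use a different weighting function.

```python
import numpy as np

def weighted_mask_bce(pred_mask, true_mask, mix_mag, eps=1e-7):
    """Binary cross-entropy over mask bins, weighted by mixture magnitude."""
    pred = np.clip(np.asarray(pred_mask, dtype=float), eps, 1 - eps)
    true = np.asarray(true_mask, dtype=float)
    mix = np.asarray(mix_mag, dtype=float)
    per_bin = -(true * np.log(pred) + (1 - true) * np.log(1 - pred))
    # Weights proportional to the mixture magnitude (assumed normalisation).
    weights = mix / (mix.sum() + eps)
    return float(np.sum(weights * per_bin))


pred = np.array([[0.9, 0.1]])   # second bin is badly predicted
true = np.array([[1.0, 1.0]])
loss_error_in_quiet_bin = weighted_mask_bce(pred, true, np.array([[10.0, 1.0]]))
loss_error_in_loud_bin = weighted_mask_bce(pred, true, np.array([[1.0, 10.0]]))
```

The same prediction error is penalised more heavily when it falls in the bin that carries most of the mixture energy.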
We tackle the problem of Single Channel Source Separation for multi-instrument polyphonic music conditioned on external data. In this work we have shown that the use of extra information such as (1) binary vectors indicating the presence or absence of musical instruments in the mix and (2) visual feature vectors extracted from the corresponding video frames improves the separation performance.
We also show that different types of conditioning have different effects w.r.t. the performance metrics. We have conducted a thorough study of FiLM-conditioning introduced at three possible locations of the primary source separation U-Net model. We have demonstrated that the best results can be obtained with FiLM-bottleneck conditioning and with multiplicative label conditioning on the predicted masks.
The results shown in the present work indicate that a real-case scenario such as source separation for chamber quartets is challenging, and that there is still a significant performance gap of about 13dB between the state-of-the-art separation methods and ideal binary masks.
Potential improvements could include modifying the U-Net architecture or combining binary and soft masks to obtain a good balance between SDR and PES. Another possibility would be to integrate an advanced motion analysis network and employ audio-motion synchrony for conditioning the network, or to condition on musical scores.
Appendix A Hyperparameters of the experiments from Section 4.4
We provide the full set of model hyperparameters used in the experiments in Section 4.4 and Section 5.2 in Table IV. Note that there is only a single difference within each pair of experiments compared in Table I. For the experiments in Section 5.3 the model parameters are set as described in Section 4.4.
| ID | STFT F-scale | STFT V-scale | model | noise | mask | bias | loss | curr. | cond. type |
Appendix B Per-experiment bar plots with source separation performance results
This work was funded in part by ERC Innovation Programme (grant 770376, TROMPA); Spanish Ministry of Economy and Competitiveness under the María de Maeztu Units of Excellence Program (MDM-2015-0502) and the Social European Funds; the MICINN/FEDER UE project with reference PGC2018-098625-B-I00; and the H2020-MSCA-RISE-2017 project with reference 777826 NoMADS. We gratefully acknowledge NVIDIA for the donation of GPUs used for the experiments.
[1] (2014) Audiovisual correspondence between musical timbre and visual shapes. Frontiers in Human Neuroscience 8, pp. 352.
[2] Look, listen and learn. In Proceedings of the IEEE International Conference on Computer Vision, pp. 609–617.
[3] (2018) Objects that sound. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 435–451.
[4] (2016) SoundNet: learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems, pp. 892–900.
[5] Harmony in motion. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8.
[6] (2013) Nonnegative signal factorization with learnt instrument models for sound source separation in close-microphone recordings. EURASIP Journal on Advances in Signal Processing 2013 (1), pp. 184.
[7] (2011) Musical instrument sound multi-excitation model for non-negative spectrogram factorization. IEEE Journal of Selected Topics in Signal Processing 5 (6), pp. 1144–1158.
[8] Monoaural audio source separation using deep convolutional neural networks. In International Conference on Latent Variable Analysis and Signal Separation, pp. 258–266.
[9] (2017) Singing voice separation using generative adversarial networks. In ML4Audio Workshop, 31st Conf. Neural Information Processing Systems (NIPS 2017).
[10] (2019-08) Interleaved Multitask Learning for Audio Source Separation with Independent Databases. arXiv e-prints, arXiv:1908.05182.
[11] (2018) Feature-wise transformations. Distill. Note: https://distill.pub/2018/feature-wise-transformations
[12] (2018) Looking to listen at the cocktail party: a speaker-independent audio-visual model for speech separation. ACM Transactions on Graphics (TOG) 37 (4), pp. 112.
[13] (2018) Learning to separate object sounds by watching unlabeled video. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 35–53.
[14] (2019) Co-separating sounds of visual objects. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3879–3888.
[15] (2019) Listen to look: action recognition by previewing audio.
[16] (2016) Combining mask estimates for single channel audio source separation using deep neural networks. Interspeech2016 Proceedings.
[17] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[18] (2000) Audio vision: using audio-visual synchrony to locate sounds. In Advances in Neural Information Processing Systems, pp. 813–819.
[19] (2017) Singing voice separation with deep U-Net convolutional networks. In 18th International Society for Music Information Retrieval Conference, pp. 23–27.
[20] (2019) Universal sound separation. arXiv preprint arXiv:1905.03330.
[21] (2005) Pixels that sound. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), Vol. 1, pp. 88–95.
[22] (2019) Fréchet audio distance: a reference-free metric for evaluating music enhancement algorithms. Proc. Interspeech 2019, pp. 2350–2354.
[23] (2018) Cooperative learning of audio and video models from self-supervised synchronization. In Advances in Neural Information Processing Systems, pp. 7763–7774.
[24] (2019) SDR – half-baked or well done?. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 626–630.
[25] (2004) An analysis and comparison of the clarinet and viola versions of the two sonatas for clarinet (or viola) and piano Op. 120 by Johannes Brahms. Ph.D. Thesis, University of Cincinnati.
[26] (2016-12) Creating a musical performance dataset for multimodal music analysis: challenges, insights, and applications. IEEE Transactions on Multimedia PP.
[27] (2017) See and listen: score-informed association of sound tracks to players in chamber music performance videos. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2906–2910.
[28] (2019-04) Weakly-Supervised Visual Instrument-Playing Action Detection in Videos. IEEE Transactions on Multimedia 21 (4), pp. 887–901.
[29] (2019) Audio–visual deep clustering for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (11), pp. 1697–1712.
[30] (2018) Tasnet: time-domain audio separation network for real-time, single-channel speech separation. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 696–700.
[31] (2019) Conditioned-u-net: introducing a control mechanism in the u-net for multiple source separations. In 20th International Society for Music Information Retrieval Conference (ISMIR).
[32] (2016) Score-informed source separation for multichannel orchestral recordings. Journal of Electrical and Computer Engineering 2016.
[33] (2020) Solos: a dataset for audio-visual music source separation and localization. Under review for EUSIPCO 2020.
[34] (2018) Audio-visual scene analysis with self-supervised multisensory features. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 631–648.
[35] (2009) Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation. IEEE Transactions on Audio, Speech, and Language Processing 18 (3), pp. 550–563.
[36] (2017) Guiding audio source separation by video object information. In Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017 IEEE Workshop on, pp. 61–65.
[37] (2019) Weakly supervised representation learning for audio-visual scene analysis. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, pp. 416–428.
[38] (2019) Identify, locate and separate: audio-visual object extraction in large video collections using weak supervision. In 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 268–272.
[39] (2019) Finding strength in weakness: learning to separate sounds with weak supervision.
[40] (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
[41] (2019) Weakly informed audio source separation. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 268–272.
[42] (2019) Learning to localize sound sources in visual scenes: analysis and applications.
[43] (2019) End-to-end sound source separation conditioned on instrument labels. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 306–310.
[44] (2018) Adversarial semi-supervised audio source separation applied to singing voice extraction. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2391–2395.
[45] Open-Unmix – A Reference Implementation for Music Source Separation. Journal of Open Source Software.
[46] (2019) Improving universal sound separation using sound classification.
[47] (2017) Improving music source separation based on deep neural networks through data augmentation and network blending. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pp. 261–265.
[48] (2006-07) Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech, and Language Processing 14 (4), pp. 1462–1469.
[49] (2007) Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria. IEEE Transactions on Audio, Speech, and Language Processing 15 (3), pp. 1066–1074.
[50] (2019) Differentiable consistency constraints for improved deep speech enhancement. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 900–904.
[51] (2019) Recursive visual sound separation using minus-plus net. In Proceedings of the IEEE International Conference on Computer Vision, pp. 882–891.
[52] (2019) The sound of motions. arXiv preprint arXiv:1904.05979.
[53] (2018) The sound of pixels. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 570–586.