Multi-modal perception is essential to capture the richness of real-world sensory data for objects, scenes, and events. In particular, the sounds made by objects, whether actively generated or incidentally emitted, offer valuable signals about their physical properties and spatial locations—the cymbals crash on stage, the bird tweets up in the tree, the truck revs down the block, the silverware clinks in the drawer.
Objects often generate sounds while coexisting or interacting with other surrounding objects. Thus, rather than observe them in isolation, we hear them intertwined with sounds coming from other sources. Likewise, a realistic video records the various objects with a single audio channel that mixes all their acoustic frequencies together. Automatically separating the sounds of each object in a video is of great practical interest, with applications including audio denoising, audio-visual video indexing, instrument equalization, audio event remixing, and dialog following.
Whereas traditional methods assume access to multiple microphones or carefully supervised clean audio samples [20, 46, 7], recent methods tackle the audio(-visual) source separation problem using a “mix-and-separate” paradigm to train deep neural networks in a self-supervised manner [39, 48, 6, 29, 50]. Namely, such methods randomly mix audio/video clips, and the learning objective is to recover the original unmixed signals. For example, one can create “synthetic cocktail parties” that mix clean speech with other sounds, add pseudo “off-screen” human speakers to other real videos, or superimpose audio from clips of musical instruments.
There are two key limitations with this current training strategy. First, it implicitly assumes that the original real training videos are dominated by single-source clips containing one primary sound maker. However, gathering a large number of such clean “solo” recordings is expensive and will be difficult to scale beyond particular classes like human speakers and musical instruments. Second, it implicitly assumes that the sources in a recording are independent. However, it is precisely the correlations between real sound sources (objects) that make the source separation problem most challenging at test time. Such correlations can go uncaptured by the artificially mixed training clips.
Towards addressing these shortcomings, we introduce a new strategy for learning to separate audio sources. Our key insight is a novel co-separation training objective that learns from naturally occurring multi-source videos. (Throughout, we use “multi-source video” as shorthand for video containing multiple sounds in its single-channel audio.) During training, our co-separation network considers pairs of training videos and, rather than simply separate their artificially mixed soundtracks, it must also generate audio tracks that are consistently identifiable at the object level across all training samples. In particular, using noisy object detections from the unlabeled training video, we devise a loss requiring that within an individual training video, each separated audio track should be distinguishable as its proper object. For example, when two training instances both contain a guitar plus other instruments, there is pressure to make the separated guitar tracks consistently identifiable by sound. See Fig. 1.
We call our idea “co-separation” as a loose analogy to image co-segmentation, whereby jointly segmenting two related images can be easier than segmenting them separately, since it allows disentangling a shared foreground object from differently cluttered backgrounds. Note, however, that our co-separation operates during training only; unlike co-segmentation, at test time our method performs separation on an individual video input.
Our method design offers the following advantages. First, co-separation allows training with “in the wild” sound mixes. It has the potential to benefit from the variability and richness of unlabeled multi-source video. Second, it enhances the supervision beyond “mix-and-separate”. By enforcing separation within a single video at the object-level, our approach exposes the learner to natural correlations between sound sources. Finally, objects with similar appearance from different videos can partner with each other to separate their sounds jointly, thereby regularizing the learning process. In this way, our method is able to learn well from multi-source videos, and successfully separate an object sound in a test video even if the object has never been observed individually during training.
We experiment on three benchmark datasets and demonstrate the advantages discussed above. Our approach yields state-of-the-art results on separation and denoising. Most notably, it outperforms the prior methods and baselines by a large margin when learning from noisy AudioSet videos. Overall, co-separation is a promising direction to learn audio-visual separation from multi-source videos.
2 Related Work
Audio-Only Source Separation
Audio source separation has a rich history in signal processing. While many methods assume audio captured by multiple microphones, some tackle the “blind” separation problem with single-channel audio [20, 5, 46, 7], most recently with deep learning [18, 16, 42]. Mix-and-separate style training is now commonly used for audio-only source separation to create artificial training examples [19, 16, 48]. Our approach adapts the mix-and-separate idea. However, different from all of the above, we leverage visual object detection to guide sound source separation. Furthermore, as discussed above, our co-separation framework is more flexible in terms of training data and can generalize to multi-source videos.
Audio-Visual Source Separation
Early methods for audio-visual source separation focus on mutual information, subspace analysis [40, 32], matrix factorization [31, 37], and correlated onsets [3, 25]. Recent methods leverage deep learning for separating speech [6, 29, 1, 9], musical instruments [50, 11, 49], and other objects. Similar to the audio-only methods, almost all use a “mix-and-separate” training paradigm to perform video-level separation by artificially mixing training videos. In contrast, we perform source separation at the object level to explicitly model sounds coming from different visual objects, and our model enforces separation within a video during training.
Most related to our work are the “sound of pixels” (SoP) and multi-instance learning (AV-MIML) approaches. AV-MIML also focuses on learning object sound models from unlabeled video, but its two-stage method heavily relies on NMF to perform separation, which limits its performance and practicability. Furthermore, whereas AV-MIML simply uses image classification to obtain weak labels on video frames, our approach detects localized objects and our end-to-end network learns visual object representations in concert with the audio streams. SoP outputs a sound for each pixel, whereas we predict sounds for visual objects. More importantly, SoP works best when clean solo videos are available to perform video-level “mix-and-separate” training. Our method instead disentangles mixed sounds of objects within an individual training video, allowing more flexible training with multi-source data, as we demonstrate in results.
Localizing Sounds in Video Frames
Localization entails identifying the pixels where the sound of a video comes from, but not separating the audio. Multi-modal embeddings, mutual information [17, 8], and recent deep learning methods [2, 38, 43] are all ways to recover regions responsible for sounds. Different from all these methods, our goal is to separate the sounds of multiple objects from a single-channel signal. We localize potential sound sources via object detection, and use the localized object regions to guide the separation learning process.
Generating Sounds from Video
Sound generation methods synthesize a sound track from a visual input [30, 52, 4]. Given both visual input and monaural audio, recent methods generate spatial (binaural or ambisonic) audio [11, 28]. Unlike any of the above, our work aims to separate an existing real audio track, not synthesize plausible new sounds.
Our approach learns to leverage localized object detection to visually guide audio source separation. In the following, we first formalize our object-level audio-visual source separation task (Sec. 3.1). Then we introduce our framework for learning object sound models from unlabeled video and our Co-Separation deep network architecture (Sec. 3.2). Finally, we present our training criteria and inference procedures (Sec. 3.3).
3.1 Problem Formulation
Given an unlabeled video clip $V$ with accompanying audio $x(t)$, we denote by $\mathcal{V} = \{O_1, \ldots, O_N\}$ the set of $N$ objects detected in the video frames. We treat each object as a potential sound source, and $x(t) = \sum_{n=1}^{N} s_n(t)$ is the observed single-channel linear mixture of these sources, where $s_n(t)$ are time-discrete signals responsible for each object. Our goal of object-level audio-visual source separation is to separate the sound $s_n(t)$ for each object $O_n$ from $x(t)$.
Following [19, 16, 48, 50, 29, 11, 6], we start with the commonly adopted “mix-and-separate” idea to self-supervise source separation. Given two training videos $V_1$ and $V_2$ with corresponding audios $x_1(t)$ and $x_2(t)$, we use a pre-trained object detector to find objects in both videos. Then, we mix the audios of the two videos and obtain the mixed signal $x_m(t) = x_1(t) + x_2(t)$. The mixed audio is transformed into a magnitude spectrogram $X^M \in \mathbb{R}_+^{F \times T}$ consisting of $F$ frequency bins and $T$ short-time Fourier transform (STFT) frames, which encodes how the signal’s frequency content changes over time.
Our learning objective is to separate the sound each object makes from $x_m(t)$ conditioned on the localized object regions. For example, Fig. 3 illustrates a scenario of mixing two videos $V_1$ and $V_2$ with two objects $O_1$, $O_2$ detected in $V_1$ and one object $O_3$ detected in $V_2$. The goal is to separate $s_1(t)$, $s_2(t)$, and $s_3(t)$ for objects $O_1$, $O_2$, and $O_3$ from the mixture signal $x_m(t)$, respectively. To perform separation, we predict a spectrogram mask $\mathcal{M}_n$ for each object. We use real-valued ratio masks and obtain the predicted magnitude spectrogram by soft masking the mixture spectrogram: $X_n = X^M \odot \mathcal{M}_n$. Finally, we use the inverse short-time Fourier transform (ISTFT) to reconstruct the waveform sound for each object source.
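The mix-and-separate setup with ratio masks can be illustrated on toy data. The following is a minimal sketch, not the authors' implementation: spectrograms are small nested lists, magnitudes are treated as exactly additive (which holds only approximately for real audio), and the helper names `mix`, `ratio_mask`, and `apply_mask` are hypothetical.

```python
def mix(spec_a, spec_b):
    """Element-wise sum of two magnitude spectrograms (lists of rows)."""
    return [[a + b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(spec_a, spec_b)]


def ratio_mask(source, mixture, eps=1e-8):
    """Real-valued ratio mask: the fraction of each bin owned by `source`."""
    return [[s / (m + eps) for s, m in zip(row_s, row_m)]
            for row_s, row_m in zip(source, mixture)]


def apply_mask(mask, mixture):
    """Soft masking: element-wise product of a mask and the mixture."""
    return [[k * m for k, m in zip(row_k, row_m)]
            for row_k, row_m in zip(mask, mixture)]


# Toy 2x2 magnitude spectrograms for two videos (assumed values).
spec_1 = [[1.0, 0.0], [2.0, 4.0]]
spec_2 = [[3.0, 1.0], [2.0, 0.0]]

mixture = mix(spec_1, spec_2)          # what the network would observe
mask_1 = ratio_mask(spec_1, mixture)   # ideal mask for video 1
recovered_1 = apply_mask(mask_1, mixture)
```

With an ideal mask, `recovered_1` matches `spec_1` up to the small `eps` term; a learned mask only approximates this.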
Going beyond video-level mix-and-separation, the key insight of our approach is to simultaneously enforce separation within a single video at the object level. This enables our method to learn object sound models even from multi-source training videos. Our new co-separation framework can capture the correlations between sound sources and is able to learn from noisy Web videos, as detailed next.
3.2 Co-Separation Framework
Next we present our Co-Separation training framework and our network architecture to perform separation.
First, we train an object detector for a vocabulary of $C$ objects. In general, this detector should cover any potential sound-making object categories that may appear in training videos. Our implementation uses the Faster R-CNN object detector with a ResNet-101 backbone trained with Open Images. For each unlabeled training video, we use the pre-trained object detector to find objects in all video frames. Then, we gather all object detections across frames to obtain a video-level pool of objects. See Supp. for details.
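Gathering per-frame detections into a video-level pool could look like the sketch below. The detection tuple layout `(label, confidence, box)`, the function name `pool_detections`, and the 0.5 confidence threshold are all assumptions made for illustration; the actual procedure is described in the supplementary material.

```python
from collections import defaultdict


def pool_detections(frame_detections, min_confidence=0.5):
    """Collect detections from all frames into one pool per object class.

    frame_detections: list (one entry per frame) of lists of
        (label, confidence, box) tuples. This layout is hypothetical.
    Returns a dict mapping label -> list of (confidence, frame_index, box),
    sorted from most to least confident.
    """
    pool = defaultdict(list)
    for frame_index, detections in enumerate(frame_detections):
        for label, confidence, box in detections:
            if confidence >= min_confidence:  # drop low-confidence boxes
                pool[label].append((confidence, frame_index, box))
    for label in pool:
        pool[label].sort(key=lambda item: item[0], reverse=True)
    return dict(pool)


frames = [
    [("guitar", 0.9, (0, 0, 10, 10)), ("violin", 0.3, (5, 5, 8, 8))],
    [("guitar", 0.7, (1, 1, 9, 9)), ("violin", 0.8, (4, 4, 9, 9))],
]
video_pool = pool_detections(frames)
```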
We use the localized object regions to guide the source separation process. Fig. 2 illustrates our audio-visual separator network that performs audio-visual feature aggregation and source separation. A related design for multi-modal feature fusion is also used in [11, 28, 29] for audio spatialization and separation. However, unlike those models, our separator network combines the visual features of a localized object region and the audio features of the mixed audio to predict a magnitude spectrogram mask for source separation.
The network takes a detected object region and the mixed audio signal as input, and separates the portion of the sound responsible for the object. We use a ResNet-18 network to extract visual features after a ResNet block with size $H' \times W' \times D_v$, where $H'$ and $W'$ denote the spatial (frame) dimensions of the feature map and $D_v$ its channel dimension. We then pass the visual feature through a convolution layer to reduce the channel dimension, and use a fully-connected layer to obtain an aggregated visual feature vector.
On the audio side, we adopt a U-Net style network for its effectiveness in dense prediction tasks, similar to [50, 29, 11]. The network takes the magnitude spectrogram $X^M$ of the mixed audio as input and passes it through a series of convolution layers to extract an audio feature map of dimension $T' \times F' \times D$. We replicate the visual feature vector $T' \times F'$ times, tile the copies to match the audio feature dimension, and then concatenate the audio and visual feature maps along the channel dimension. Then a series of up-convolutions are performed on the concatenated audio-visual feature map to generate a multiplicative spectrogram mask $\mathcal{M}$. We find spectrogram masks to work better than direct prediction of spectrograms or raw waveforms for source separation, confirming reports in [47, 6, 11]. The separated spectrogram $X_n$ for the input object $O_n$ is obtained by multiplying its mask $\mathcal{M}_n$ and the spectrogram of the mixed audio:
$$X_n = X^M \odot \mathcal{M}_n. \tag{1}$$
Finally, ISTFT is applied to the spectrogram to produce the separated time-domain signal. See Supp. for more details.
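The tile-and-concatenate fusion step can be sketched with plain nested lists. This is only an illustration of the shape bookkeeping: the channel-last layout, the toy sizes, and the function name `tile_and_concat` are assumptions, and a real implementation would use tensor operations instead.

```python
def tile_and_concat(audio_map, visual_vec):
    """Append the same visual vector to the channels of every audio cell.

    audio_map: nested list of shape [rows][cols][audio_channels].
    visual_vec: flat list of visual feature values.
    Returns a nested list of shape [rows][cols][audio_channels + len(visual_vec)].
    """
    return [[list(cell) + list(visual_vec) for cell in row]
            for row in audio_map]


# Toy audio feature map: 2 x 3 positions, 2 channels each.
audio_features = [[[0.1, 0.2] for _ in range(3)] for _ in range(2)]
# Toy aggregated visual feature vector with 4 channels.
visual_feature = [1.0, 2.0, 3.0, 4.0]

fused = tile_and_concat(audio_features, visual_feature)
```

Every position of `fused` carries the same visual descriptor, so the up-convolutions that follow can condition the mask on the object at each time-frequency location.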
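Our proposed co-separation framework first detects objects in a pair of videos, then mixes their audios at the video level, and finally separates the sounds for each detected object class. As shown in Fig. 3, for each video pair, we randomly sample a high-confidence object window for each class detected in either video, and use the localized object region to guide audio source separation using the audio-visual separator network. For each object $O_n$, we predict a mask $\mathcal{M}_n$, and then generate the corresponding magnitude spectrogram $X_n$.
Let $\mathcal{V}_1$ and $\mathcal{V}_2$ denote the set of objects for the two videos $V_1$ and $V_2$. We want to separate the sounds of their corresponding objects together from the audio mixture of $V_1$ and $V_2$. For each video, summing up the separated sounds of all objects should ideally reconstruct the audio signal for that video. Namely,
$$x_1(t) = \sum_{O_i \in \mathcal{V}_1} s_i(t), \qquad x_2(t) = \sum_{O_i \in \mathcal{V}_2} s_i(t), \tag{2}$$
where $|\mathcal{V}_1|$ and $|\mathcal{V}_2|$ are the number of detected objects for $V_1$ and $V_2$. For simplicity of notation, we defer presenting how we handle background sounds (those unattributable to detected objects) until later in this section. Because we are operating in the frequency domain, the above relationship will only hold approximately due to phase interference. As an alternative, we approximate Eq. (2) by enforcing the following relationship on the predicted magnitude spectrograms:
$$X^{V_1} \approx \sum_{O_i \in \mathcal{V}_1} X_i, \qquad X^{V_2} \approx \sum_{O_i \in \mathcal{V}_2} X_i, \tag{3}$$
where $X^{V_1}$ and $X^{V_2}$ are the magnitude spectrograms for $x_1(t)$ and $x_2(t)$. Therefore, we minimize the following co-separation loss over the separated magnitude spectrograms:
$$L_{\text{co-separation\_spect}} = \Big\| \sum_{O_i \in \mathcal{V}_1} X_i - X^{V_1} \Big\|_1 + \Big\| \sum_{O_i \in \mathcal{V}_2} X_i - X^{V_2} \Big\|_1, \tag{4}$$
which approximates to minimizing the following loss function over their predicted ratio masks:
$$L_{\text{co-separation\_mask}} = \Big\| \sum_{O_i \in \mathcal{V}_1} \mathcal{M}_i - \mathcal{M}^{V_1} \Big\|_1 + \Big\| \sum_{O_i \in \mathcal{V}_2} \mathcal{M}_i - \mathcal{M}^{V_2} \Big\|_1, \tag{5}$$
where $\mathcal{M}^{V_1}$ and $\mathcal{M}^{V_2}$ are the ground-truth spectrogram ratio masks for the two videos, respectively. Namely,
$$\mathcal{M}^{V_1} = \frac{X^{V_1}}{X^{V_1} + X^{V_2}}, \qquad \mathcal{M}^{V_2} = \frac{X^{V_2}}{X^{V_1} + X^{V_2}}. \tag{6}$$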
In practice, we find that computing the loss over masks (vs. spectrograms) makes the network easier to train. We hypothesize that the sigmoid after the last layer of the audio-visual separator bounds the masks, making them more constrained and structured compared to spectrograms. In short, the proposed co-separation loss provides supervision to the network to only separate the audio portion responsible for the input visual object, so that the corresponding audios for each of the pair of input videos can be reconstructed.
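The mask-based co-separation loss can be sketched on flattened toy spectrograms as follows. This is an illustrative reading of the loss described above, assuming an L1 distance summed over bins; the function names and the toy values are hypothetical.

```python
def ground_truth_masks(spec_v1, spec_v2, eps=1e-8):
    """Ratio masks of each video's spectrogram within their mixture."""
    mask_v1 = [a / (a + b + eps) for a, b in zip(spec_v1, spec_v2)]
    mask_v2 = [b / (a + b + eps) for a, b in zip(spec_v1, spec_v2)]
    return mask_v1, mask_v2


def summed(masks):
    """Bin-wise sum of the predicted masks of all objects in one video."""
    return [sum(values) for values in zip(*masks)]


def l1(pred, target):
    return sum(abs(p - t) for p, t in zip(pred, target))


def co_separation_mask_loss(masks_v1, masks_v2, spec_v1, spec_v2):
    """Summed object masks of each video should match that video's mask."""
    gt_v1, gt_v2 = ground_truth_masks(spec_v1, spec_v2)
    return l1(summed(masks_v1), gt_v1) + l1(summed(masks_v2), gt_v2)


# Flattened toy magnitude spectrograms (two bins each).
spec_v1, spec_v2 = [1.0, 3.0], [3.0, 1.0]

# Video 1 has two detected objects, video 2 has one.
good = co_separation_mask_loss(
    [[0.10, 0.50], [0.15, 0.25]], [[0.75, 0.25]], spec_v1, spec_v2)
# Masks assigned to the wrong video are penalized.
bad = co_separation_mask_loss(
    [[0.75, 0.25]], [[0.25, 0.75]], spec_v1, spec_v2)
```

Note that the loss constrains only the sum of the object masks within each video; how the sum is divided among objects is left to the object-consistency term.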
In addition to the co-separation loss that enforces separation, we also introduce an object-consistency loss for each predicted audio spectrogram. The intuition is that if the sources are well-separated, the predicted “category” of the separated spectrogram should be consistent with the category of the visual object that initially guides its separation. Specifically, for the predicted spectrogram of each object, we introduce another ResNet-18 audio classifier that targets the weak labels of the input visual objects. We use the following cross-entropy loss:
$$L_{\text{object-consistency}} = \frac{1}{|\mathcal{V}_1| + |\mathcal{V}_2|} \sum_{i=1}^{|\mathcal{V}_1| + |\mathcal{V}_2|} \sum_{c=1}^{C} -y_{i,c} \log(p_{i,c}), \tag{7}$$
where $C$ is the number of classes, $y_{i,c}$ is a binary indicator on whether class label $c$ is the correct classification for predicted spectrogram $X_i$, and $p_{i,c}$ is the predicted probability for class $c$.
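For one-hot labels, the cross-entropy described above reduces to the negative log-probability of the correct class, averaged over the separated spectrograms. A minimal sketch, assuming the classifier outputs are already normalized probabilities and using a hypothetical function name:

```python
import math


def object_consistency_loss(probabilities, labels, eps=1e-12):
    """Mean cross-entropy between predicted class probabilities and labels.

    probabilities: one list of class probabilities per separated spectrogram.
    labels: index of the visual object's class for each spectrogram.
    """
    total = 0.0
    for probs, label in zip(probabilities, labels):
        # With a one-hot target, only the correct class contributes.
        total += -math.log(max(probs[label], eps))
    return total / len(labels)


# Two separated spectrograms, two classes (toy values).
probs = [[0.5, 0.5], [0.25, 0.75]]
labels = [0, 1]
loss = object_consistency_loss(probs, labels)
```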
Not all sounds in a video will be attributable to a visually detected object. To account for ambient sounds, off-screen sounds, and noise, we incorporate an “adaptable” audio class, as follows. During training, we pair each video with a visual scene feature in addition to the detected objects from the pre-trained object detector. Then an additional mask responsible for the scene context is also predicted in Eq. (5) for both $V_1$ and $V_2$ to be optimized jointly. This step arms the network with the flexibility to assign noise or unrelated sounds to this “adaptable” class, leading to cleaner separation for sounds of the detected visual objects. These adaptable objects (ideally ambient sounds, noise, etc.) are collectively designated as having the “extra” audio label. The separated spectrograms for these adaptable objects are also trained to match their category label by the object-consistency loss in Eq. (7).
Putting it all together, during training the network needs to discover separations for the multi-source videos that 1) minimize the co-separation loss, such that the two source videos’ object sounds reassemble to produce their original video-level audio tracks, respectively, while also 2) minimizing the object consistency loss, such that separated sounds for any instances of the same visual object are reliably identifiable as that sound. We stress that our model achieves the latter without any pre-trained audio model and without any single-source audio examples for the object class. The object consistency loss only knows that same-object sounds should be similar after training the network—not what any given object is expected to sound like.
3.3 Training and Inference
We minimize the following combined loss function and train our network end to end:
$$L = L_{\text{co-separation\_mask}} + \lambda L_{\text{object-consistency}}, \tag{8}$$
where $\lambda$ is the weight for the object-consistency loss.
We use a per-pixel $L_1$ loss for the co-separation loss, and weight the gradients by the magnitude of the spectrogram of the mixed audio. The network uses the weighted gradients to perform back-propagation, thereby emphasizing predictions on more informative parts of the spectrogram.
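One simple way to obtain magnitude-weighted gradients is to weight each per-bin loss term by the mixture magnitude, so that louder bins contribute proportionally larger gradients. Whether the weighting is applied in exactly this form is an assumption; the sketch below only illustrates the idea, together with the combined objective using the weight 0.05 reported in the implementation details.

```python
def weighted_l1(pred_mask, target_mask, mixture_mag):
    """Per-bin L1 error, weighted by the mixture magnitude, then averaged."""
    weighted = [w * abs(p - t)
                for p, t, w in zip(pred_mask, target_mask, mixture_mag)]
    return sum(weighted) / len(weighted)


def total_loss(co_separation, object_consistency, lam=0.05):
    """Combined objective: co-separation plus weighted object consistency."""
    return co_separation + lam * object_consistency


# An error in a loud bin (magnitude 4.0) dominates the loss;
# the quiet bin (magnitude 0.1) has no error here.
co_sep = weighted_l1([0.5, 0.5], [1.0, 0.5], [4.0, 0.1])
combined = total_loss(co_sep, 2.0)
```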
During testing, our model takes a single realistic multi-source video to perform source separation. Similarly, we first detect objects in the video frames by using the pre-trained object detector. For each detected object class, we use the most confident object region(s) as the visual input to separate the portion of the sound responsible for this object category from its accompanying audio. We use a sliding window approach to process videos segment by segment with a small hop size, and average the audio predictions on all overlapping parts.
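The sliding-window inference with overlap averaging can be sketched as follows. The segment length, hop size, and function name are illustrative assumptions; each segment stands for the separated audio predicted for one window.

```python
def overlap_average(segments, hop, total_length):
    """Average per-window predictions wherever windows overlap.

    segments: list of equal-length prediction lists, where segment k
        starts at sample k * hop.
    """
    sums = [0.0] * total_length
    counts = [0] * total_length
    for k, segment in enumerate(segments):
        start = k * hop
        for j, value in enumerate(segment):
            if start + j < total_length:
                sums[start + j] += value
                counts[start + j] += 1
    # Samples not covered by any window are left at zero.
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


# Two windows of length 4 with hop 2 overlap on samples 2 and 3.
merged = overlap_average([[1, 1, 1, 1], [3, 3, 3, 3]], hop=2, total_length=6)
```

A smaller hop means more windows cover each sample, which smooths the prediction at the cost of more forward passes.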
We perform audio-visual source separation on video clips of 10s, and we pool all the detected objects in the video frames. Therefore, our approach assumes that each detected object within this period of 10s is a potential sound source, although it may only sound in some of the frames. Objects that are detected but do not make sound at all are treated as learning noise, and we expect our deep network to adapt by learning from large-scale training videos. We leave it as future work to explicitly model silent visual objects.
We now validate our approach for audio-visual source separation and compare to existing methods.
The MUSIC dataset from MIT contains YouTube videos crawled with keyword queries. It contains 685 untrimmed videos of musical solos and duets, with 536 solo videos and 149 duet videos. The dataset is relatively clean and collected for the purpose of training audio-visual source separation models. It includes 11 instrument categories: accordion, acoustic guitar, cello, clarinet, erhu, flute, saxophone, trumpet, tuba, violin and xylophone. Following the authors’ public dataset file of video IDs, we hold out the first/second video in each category as validation/test data, and the rest as training data. We split all videos into 10s clips during both training and testing, for a total of 8,928/259/269 train/val/test clips, respectively.
AudioSet consists of challenging 10s video clips, many of poor quality and containing a variety of sound sources. Following prior work, we filter the dataset to extract video clips of 15 musical instruments. We use the videos from the “unbalanced” split for training, and videos from the “balanced” split as validation/test data, for a total of 113,756/456/456 train/val/test clips, respectively.
AudioSet-SingleSource is a dataset, assembled in prior work, of AudioSet videos containing only a single sounding object. We use the 15 videos (from the “balanced” split) of musical instruments for evaluation only.
On both MUSIC and AudioSet, we compose the test sets following standard practice [3, 50, 29, 10] by mixing the audio from two single-source videos. This ensures the ground truth separated sounds are known for quantitative evaluation. There are 550 and 105 such test pairings for MUSIC and AudioSet, respectively (the result of pairwise mixing 10 random clips per the 11 classes for MUSIC and pairwise mixing all 15 clips for AudioSet). For qualitative results (Supp.), we apply our method to real multi-source test videos. In either case, we train our method with multi-source videos, as specified below.
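The pairing counts can be checked with a short calculation. The sketch assumes that MUSIC contributes 10 mixes for every unordered pair of its 11 instrument categories, and that AudioSet contributes one mix for every unordered pair of its 15 single-source clips; this reading of the pairing protocol is an interpretation, not a stated procedure.

```python
from math import comb

# MUSIC: 11 instrument categories, 10 random mixes per category pair.
music_pairs = comb(11, 2) * 10

# AudioSet: all unordered pairs among the 15 single-source clips.
audioset_pairs = comb(15, 2)
```

Under these assumptions, `music_pairs` is 550 and `audioset_pairs` is 105, matching the reported totals.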
4.2 Implementation Details
Our deep network is implemented in PyTorch. For all experiments, we sub-sample the audio at 11kHz, and the input audio sample is approximately 6s long. STFT is computed using a Hann window size of 1022 and a hop length of 256, producing a time-frequency audio representation. The spectrogram is then re-sampled on a log-frequency scale to obtain the magnitude spectrogram used as input. The settings are the same as in prior work for fair comparison.
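The size of the STFT output follows from the window and hop settings. The sketch below assumes a one-sided transform whose FFT size equals the window size, centered framing (one frame per hop, plus one), and a sampling rate of 11025 Hz as a stand-in for "11kHz"; none of these details are stated explicitly, and the function name is hypothetical.

```python
def stft_shape(num_samples, window_size=1022, hop_length=256, centered=True):
    """Return (frequency_bins, frames) for a one-sided STFT."""
    freq_bins = window_size // 2 + 1
    if centered:
        # The signal is padded so that frames are centered on hop multiples.
        frames = 1 + num_samples // hop_length
    else:
        frames = 1 + (num_samples - window_size) // hop_length
    return freq_bins, frames


# Roughly 6 seconds of audio at an assumed 11025 Hz.
bins, frames = stft_shape(6 * 11025)
```

A window of 1022 samples gives 512 frequency bins, and about 6 s of audio at this hop length gives on the order of 256 frames.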
Our object detector is trained on images of object categories from the Open Images dataset. We filter out low confidence object detections for each video, and keep the top two detected categories (this agrees with the number of objects detected by our pre-trained detector in most training videos; we did not try any other values). See Supp. for details. During co-separation training, we randomly sample 64 pairs of videos for each batch. We sample a confident object detection for each class as its input visual object, paired with a random scene image sampled from the ADE dataset as the adaptable object. The object window is resized, and a randomly cropped region is used as the input to the network. We use horizontal flipping, color and intensity jittering as data augmentation. $\lambda$ is set to 0.05 in Eq. (8). The network is trained using an Adam optimizer with weight decay. We use a smaller starting learning rate for the ResNet-18 visual feature extractor than for the rest of the network because it is pre-trained on ImageNet.
Table 1: Average audio source separation results on a held-out MUSIC test set. We show the performance of our method and the baselines when training on only single-source videos (solo) and multi-source videos (solo + duet). Note that NMF-MFCC is non-learned, so its results do not vary across training sets. Higher is better for all metrics. Note that SDR and SIR capture separation accuracy; SAR captures only the absence of artifacts (and hence can be high even if separation is poor). Standard error is approximately 0.2 for all metrics.
4.3 Quantitative Results on Source Separation
We compare to the following baselines:
AV-Mix-and-Separate: A “mix-and-separate” baseline using the same audio-visual separation network as our model to do video-level separation. We use multi-label hinge loss to enforce video-level consistency, i.e., the class of each separated spectrogram should agree with the objects present in that training video.
AV-MIML: An existing audio-visual source separation method that uses audio bases learned from unlabeled videos to supervise an NMF separation process. The audio bases are learned from a deep multi-instance multi-label (MIML) learning network. We use the results reported by its authors for AudioSet and AV-Bench; the authors do not report results in SDR and do not report results for MUSIC.
We use the widely used mir_eval library to evaluate the source separation and report the standard metrics: Signal-to-Distortion Ratio (SDR), Signal-to-Interference Ratio (SIR), and Signal-to-Artifact Ratio (SAR).
Table 1 presents results on MUSIC as a function of the training source: single-source videos (solo) or multi-source videos (solo + duet). Our method consistently outperforms all baselines in separation accuracy, as captured by the SDR and SIR metrics. (Note that SAR measures the artifacts present in the separated signal, but not the separation accuracy. So, a less well-separated signal can achieve high(er) SAR values. In fact, naively copying the original input twice (i.e., doing no separation) results in SAR 80 in our setting.) While the SoP method works well when training only on solo videos, it fails to make use of the additional duets, and its performance degrades when training on the multi-source videos. In contrast, our method actually improves when trained on a combination of solos and multi-source duets, achieving its best performance. This experiment highlights precisely the limitation of the mix-and-separate training paradigm when presented with multi-source training videos, and it demonstrates that our co-separation idea can successfully overcome that limitation.
Our method also outperforms all baselines, including SoP, when training on solos. Our better accuracy versus the AV-Mix-and-Separate baseline and SoP shows that our object-level co-separation idea is essential. The NMF-MFCC baseline can only return ungrounded separated signals. Therefore, we evaluate both possible matchings and take its best results (to the baseline’s advantage). Our method still achieves large gains, and we also have the benefit of matching the separated sounds to semantically meaningful visual objects in the video.
Table 2 shows the results when training on AudioSet-Unlabeled and testing on mixes of AudioSet-SingleSource. Our method outperforms all prior methods and the baselines by a large margin on this challenging dataset. It demonstrates that our framework can better learn from the noisy and less curated “in the wild” videos of AudioSet, which contains many multi-source videos.
Next we devise an experiment to test explicitly how well our method can learn to separate sound for objects it has not observed individually during training. We train our model and the best baseline, SoP, on the following four categories: violin solo, saxophone solo, violin+guitar duet, and violin+saxophone duet, and test by randomly mixing and separating violin, saxophone, and guitar test solo clips. Table 3 shows the results. We can see that although our system is not trained on any guitar solos, it can learn better from multi-source videos that contain guitar and other sounds. Our method consistently performs well on all three combinations, while SoP performs well only on the violin+guitar mixture. We hypothesize the reason is that it can learn by mixing the large quantity of violin solos and the guitar solo moments within the duets to perform separation, but it fails to disentangle other sound source correlations. Our method scores worse in terms of SAR, which again measures artifacts, but not separation quality.
Table 3: Sound-of-Pixels vs. Co-Separation (Ours).
As a side product of our audio-visual source separation system, we can also use our model to perform visually-guided audio denoising. As mentioned in Sec. 3.2, we use an additional scene image to capture ambient/unseen sounds and noise. Therefore, given a test video with noise, we can use the top detected visual object in the video to guide our system to separate out the noise.
Table 4 shows the results on AV-Bench [32, 10]. Though our method learns only from unlabeled video and does not explicitly model the low-rank nature of noise as in prior work, we obtain state-of-the-art performance on 2 of the 3 videos. One prior method uses motion in manually segmented regions, which may help on Guitar Solo, where the hand’s motion strongly correlates with the sound.
4.4 Qualitative Results
Audio-Visual Separation Video Examples.
Our video results (http://vision.cs.utexas.edu/projects/coseparation/) show qualitative separation results. We use our system to discover and separate object sounds for realistic multi-source videos. They lack ground truth, but the results can be manually inspected for quality.
Learned Audio Embedding.
We examine the embedding of the discovered sounds for various input objects in 20K AudioSet clips. We use the features extracted at the last layer of the ResNet-18 audio classifier as the audio representation for the separated spectrograms. The sounds our method learned from multi-source videos tend to cluster by object category, demonstrating that the separator discovers sounds characteristic of the corresponding objects.
Using Discovered Sounds to Detect Objects.
Finally, we use our trained audio-visual source separation network for visual object discovery using 912 noisy unseen videos from AudioSet. Given the pool of videos, we generate object region proposals using Selective Search. Then we pass these region proposals to our network together with the audio of its accompanying video, and retrieve the visual proposals that achieve the highest audio classification scores according to our object consistency loss.
Fig. 5 shows the top retrieved proposals for several categories after removing duplicates from the same video. We can see that our method has learned a good mapping between the visual and audio modalities; the best visual object proposals usually best activate the audio classifier. The last column shows failure cases where the wrong object is detected with high confidence. They usually come from objects of similar texture or shape, like the stripes on the man’s t-shirt and the shadow of the harp.
We presented an object-level audio-visual source separation framework that associates localized object regions in videos to their characteristic sounds. Our Co-Separation approach can leverage noisy object detections as supervision to learn from large-scale unlabeled videos. We achieve state-of-the-art results on visually-guided audio source separation and audio denoising. As future work, we plan to explore spatio-temporal object proposals and incorporate object motion to guide separation, which may especially benefit object sounds with similar frequencies.
-  T. Afouras, J. S. Chung, and A. Zisserman. The conversation: Deep audio-visual speech enhancement. In Interspeech, 2018.
-  R. Arandjelović and A. Zisserman. Objects that sound. In ECCV, 2018.
-  Z. Barzelay and Y. Y. Schechner. Harmony in motion. In CVPR, 2007.
-  L. Chen, S. Srivastava, Z. Duan, and C. Xu. Deep cross-modal audio-visual generation. In on Thematic Workshops of ACM Multimedia, 2017.
-  D. P. W. Ellis. Prediction-driven computational auditory scene analysis. PhD thesis, Massachusetts Institute of Technology, 1996.
-  A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein. Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation. In SIGGRAPH, 2018.
-  C. Févotte, N. Bertin, and J.-L. Durrieu. Nonnegative matrix factorization with the itakura-saito divergence: With application to music analysis. Neural computation, 2009.
-  J. W. Fisher III, T. Darrell, W. T. Freeman, and P. A. Viola. Learning joint statistical models for audio-visual fusion and segregation. In NeurIPS, 2001.
-  A. Gabbay, A. Shamir, and S. Peleg. Visual speech enhancement. In Interspeech, 2018.
-  R. Gao, R. Feris, and K. Grauman. Learning to separate object sounds by watching unlabeled video. In ECCV, 2018.
-  R. Gao and K. Grauman. 2.5d visual sound. In CVPR, 2019.
-  J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter. Audio set: An ontology and human-labeled dataset for audio events. In ICASSP, 2017.
-  D. Griffin and J. Lim. Signal estimation from modified short-time fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1984.
-  X. Guo, S. Uhlich, and Y. Mitsufuji. Nmf-based blind source separation using a linear predictive coding error clustering criterion. In ICASSP, 2015.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
-  J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe. Deep clustering: Discriminative embeddings for segmentation and separation. In ICASSP, 2016.
-  J. R. Hershey and J. R. Movellan. Audio vision: Using audio-visual synchrony to locate sounds. In NeurIPS, 2000.
-  P.-S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis. Deep learning for monaural speech separation. In ICASSP, 2014.
-  P.-S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis. Joint optimization of masks and deep recurrent neural networks for monaural source separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015.
-  A. Hyvärinen and E. Oja. Independent component analysis: algorithms and applications. Neural networks, 2000.
-  S. Innami and H. Kasai. Nmf-based environmental sound source separation using time-variant gain features. Computers & Mathematics with Applications, 2012.
-  R. Jaiswal, D. FitzGerald, D. Barry, E. Coyle, and S. Rickard. Clustering nmf basis functions using shifted nmf for monaural sound source separation. In ICASSP, 2011.
-  E. Kidron, Y. Y. Schechner, and M. Elad. Pixels that sound. In CVPR, 2005.
-  I. Krasin, T. Duerig, N. Alldrin, V. Ferrari, S. Abu-El-Haija, A. Kuznetsova, H. Rom, J. Uijlings, S. Popov, S. Kamali, M. Malloci, J. Pont-Tuset, A. Veit, S. Belongie, V. Gomes, A. Gupta, C. Sun, G. Chechik, D. Cai, Z. Feng, D. Narayanan, and K. Murphy. Openimages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://storage.googleapis.com/openimages/web/index.html, 2017.
-  B. Li, K. Dinesh, Z. Duan, and G. Sharma. See and listen: Score-informed association of sound tracks to players in chamber music performance videos. In ICASSP, 2017.
-  E. F. Lock, K. A. Hoadley, J. S. Marron, and A. B. Nobel. Joint and individual variation explained (jive) for integrated analysis of multiple data types. The annals of applied statistics, 2013.
-  L. v. d. Maaten and G. Hinton. Visualizing data using t-sne. JMLR, 2008.
-  P. Morgado, N. Vasconcelos, T. Langlois, and O. Wang. Self-supervised generation of spatial audio for 360 video. In NeurIPS, 2018.
-  A. Owens and A. A. Efros. Audio-visual scene analysis with self-supervised multisensory features. In ECCV, 2018.
-  A. Owens, P. Isola, J. McDermott, A. Torralba, E. H. Adelson, and W. T. Freeman. Visually indicated sounds. In CVPR, 2016.
-  S. Parekh, S. Essid, A. Ozerov, N. Q. Duong, P. Pérez, and G. Richard. Motion informed audio source separation. In ICASSP, 2017.
-  J. Pu, Y. Panagakis, S. Petridis, and M. Pantic. Audio-visual object localization and separation using low-rank and sparsity. In ICASSP, 2017.
-  C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto, D. Liang, D. P. Ellis, and C. C. Raffel. mir_eval: A transparent implementation of common mir metrics. In ISMIR, 2014.
-  S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.
-  O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, 2015.
-  C. Rother, T. Minka, A. Blake, and V. Kolmogorov. Cosegmentation of image pairs by histogram matching-incorporating a global constraint into mrfs. In CVPR, 2006.
-  F. Sedighin, M. Babaie-Zadeh, B. Rivet, and C. Jutten. Two multimodal approaches for single microphone source separation. In 24th European Signal Processing Conference, 2016.
-  A. Senocak, T.-H. Oh, J. Kim, M.-H. Yang, and I. So Kweon. Learning to localize sound source in visual scenes. In CVPR, 2018.
-  A. J. Simpson, G. Roma, and M. D. Plumbley. Deep karaoke: Extracting vocals from musical mixtures using a convolutional deep neural network. In International Conference on Latent Variable Analysis and Signal Separation, 2015.
-  P. Smaragdis and M. Casey. Audio/visual independent components. In International Conference on Independent Component Analysis and Signal Separation, 2003.
-  M. Spiertz and V. Gnann. Source-filter based clustering for monaural blind source separation. In 12th International Conference on Digital Audio Effects, 2009.
-  D. Stoller, S. Ewert, and S. Dixon. Adversarial semi-supervised audio source separation applied to singing voice extraction. In ICASSP, 2018.
-  Y. Tian, J. Shi, B. Li, Z. Duan, and C. Xu. Audio-visual event localization in unconstrained videos. In ECCV, 2018.
-  J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. IJCV, 2013.
-  T. Virtanen. Sound source separation using sparse coding with temporal continuity objective. In International Computer Music Conference, 2003.
-  T. Virtanen. Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria. IEEE transactions on audio, speech, and language processing, 2007.
-  D. Wang and J. Chen. Supervised speech separation based on deep learning: An overview. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018.
-  D. Yu, M. Kolbæk, Z.-H. Tan, and J. Jensen. Permutation invariant training of deep models for speaker-independent multi-talker speech separation. In ICASSP, 2017.
-  H. Zhao, C. Gan, W.-C. Ma, and A. Torralba. The sound of motions. arXiv preprint arXiv:1904.05979, 2019.
-  H. Zhao, C. Gan, A. Rouditchenko, C. Vondrick, J. McDermott, and A. Torralba. The sound of pixels. In ECCV, 2018.
-  B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Scene parsing through ade20k dataset. In CVPR, 2017.
-  Y. Zhou, Z. Wang, C. Fang, T. Bui, and T. L. Berg. Visual to sound: Generating natural sound for videos in the wild. In CVPR, 2018.
Appendix A Supplementary Video
In our supplementary video, we show example separation results. We use our system to discover and separate object sounds for realistic multi-source videos from the AudioSet dataset and duets in the MUSIC dataset. We compare to our best audio-visual baseline (Sound-of-Pixels, Zhao et al., ECCV 2018) and the audio-only baseline (Spiertz & Gnann, DAFx 2009). The AV-MIML baseline is trained on a different set of object categories and is therefore not available for comparison. The Sound-of-Pixels baseline originally performs video-level mix-and-separate source separation. To perform source separation at the object level for realistic videos during testing, we feed the localized object region to its visual stream and the multi-source audio to its audio stream, so that it separates the sound attributable to the input visual object. We can thereby obtain sounds grounded to each detected object, as in our method.
From the examples, we can see that our co-separation approach can discover and separate object sounds for realistic multi-source videos. Our method generates cleaner separations than the baseline methods, and it can also ground the separated sounds to the meaningful visual objects in the video. In the last separation example of piano and trumpet, the piano is silent in that video. Our model properly captures this in the separation, creating a “silent” separation track for the object even though it is visible and detected visually in the frame. In the last two failure cases, we show that our model can be constrained by the breadth of the pre-trained object detector. Furthermore, it has difficulty performing separation in diverse scenes with unmodeled sounds such as human voice.
Appendix B Details of Object Detection
We train an object detector on 30k images of 15 object categories from the Open Images dataset. The 15 object categories include: Banjo, Cello, Drum, Guitar, Harp, Harmonica, Oboe, Piano, Saxophone, Trombone, Trumpet, Violin, Flute, Accordion, and Horn. We use the public PyTorch implementation (https://github.com/jwyang/faster-rcnn.pytorch) of Faster R-CNN to train an object detector with a ResNet-101 backbone.
Then we use our pre-trained object detector to find objects in video frames for the AudioSet-Unlabeled dataset. We extract 80 frames from each unlabeled 10s video clip, and perform object detection on each frame. We use the following filtering procedures to reduce the noise of the obtained detections: 1) We only keep object detections whose confidence exceeds a fixed threshold; 2) If two object detections of different classes overlap by more than a fixed threshold, we only keep the one with the larger confidence; 3) We only keep the top two detected categories of the largest confidence, because this agrees with the number of objects detected by our pre-trained detector in most training videos.
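The three filtering steps can be sketched as follows. This is a minimal illustration, not the authors' code: the `Detection` type, the `filter_detections` helper, the use of intersection-over-union as the overlap measure, the greedy highest-confidence-first resolution of overlaps, and the threshold values in the usage lines are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str    # object category, e.g. "Guitar"
    score: float  # detector confidence in [0, 1]
    box: tuple    # (x1, y1, x2, y2) in pixels


def iou(a, b):
    """Intersection-over-union of two boxes (assumed overlap measure)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_detections(dets, min_score, max_overlap, top_k=2):
    # Step 1: drop low-confidence detections.
    kept = sorted((d for d in dets if d.score > min_score),
                  key=lambda d: d.score, reverse=True)
    # Step 2: among overlapping detections of different classes,
    # keep only the more confident one (greedy, highest score first).
    survivors = []
    for d in kept:
        clash = any(s.label != d.label and iou(s.box, d.box) > max_overlap
                    for s in survivors)
        if not clash:
            survivors.append(d)
    # Step 3: keep only the top_k categories by their best confidence.
    best = {}
    for d in survivors:
        best[d.label] = max(best.get(d.label, 0.0), d.score)
    top = set(sorted(best, key=best.get, reverse=True)[:top_k])
    return [d for d in survivors if d.label in top]


# Usage with placeholder thresholds (hypothetical values).
example = [
    Detection("Guitar", 0.9, (0, 0, 10, 10)),
    Detection("Banjo", 0.8, (1, 1, 10, 10)),    # overlaps the guitar box
    Detection("Violin", 0.7, (20, 20, 30, 30)),
    Detection("Piano", 0.6, (40, 40, 50, 50)),
    Detection("Flute", 0.2, (60, 60, 70, 70)),  # below the confidence threshold
]
filtered = filter_detections(example, min_score=0.5, max_overlap=0.5)
```

In this sketch the detections of a whole clip would be pooled before step 3, so that the two retained categories are chosen per video rather than per frame; that choice is also an assumption.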
Appendix C Details of Audio-Visual Separator Network
Our audio-visual separator network consists of a visual branch and an audio branch. The visual branch takes images of dimension 224x224x3 as input, and extracts a feature map of dimension 7x7x512 through an ImageNet pre-trained ResNet-18 network. The visual feature map is then passed through a 1x1 convolution layer to reduce the channel dimension, producing a feature map of dimension 7x7x128. The feature map is then flattened and passed through a fully-connected layer to produce an aggregated visual feature vector of dimension 512.
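The shape bookkeeping of the visual branch can be traced with a few lines of arithmetic. The sketch below is only an illustration: the helper names are hypothetical, and it relies on the standard fact that ResNet-18 downsamples its input by a factor of 32 before its pooling layer.

```python
def visual_branch_shapes(height=224, width=224):
    """Trace (H, W, C) shapes through the visual branch described above."""
    total_stride = 32  # ResNet-18 downsampling factor before pooling
    h, w = height // total_stride, width // total_stride
    return {
        "input": (height, width, 3),
        "resnet18_features": (h, w, 512),
        "after_1x1_conv": (h, w, 128),    # channel reduction 512 -> 128
        "flattened": (h * w * 128,),
        "visual_vector": (512,),          # output of the fully-connected layer
    }


def fc_weight_count(shapes):
    """Number of weights (excluding biases) in the fully-connected layer."""
    return shapes["flattened"][0] * shapes["visual_vector"][0]


shapes = visual_branch_shapes()
```

For a 224x224 input the flattened feature has 7 * 7 * 128 = 6272 entries, so the fully-connected layer maps 6272 inputs to 512 outputs.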
The audio branch is of a U-NET style architecture, namely an encoder-decoder network with skip connections. It consists of 7 convolution layers and 7 up-convolution layers. All convolutions and up-convolutions use 4 x 4 spatial filters applied with stride 2, and are followed by a BatchNorm layer and a ReLU. After the last layer in the decoder, an up-convolution is followed by a Sigmoid layer to bound the values of the spectrogram mask. The encoder uses leaky ReLUs with a slope of 0.2, while ReLUs in the decoder are not leaky. Skip connections are added between each layer i in the encoder and layer n − i in the decoder, where n is the total number of layers. The skip connections concatenate activations from layer i to layer n − i.
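The layer pairing can be checked by following spatial sizes through the network. The sketch below assumes the common U-Net convention that encoder layer i is paired with decoder layer n − i, a padding of 1 for every 4 x 4, stride-2 layer, and a 256 x 256 input spectrogram; the padding and input size are illustrative assumptions rather than values stated in the text.

```python
def conv_out(size, kernel=4, stride=2, pad=1):
    """Output size of a strided convolution along one axis."""
    return (size + 2 * pad - kernel) // stride + 1


def upconv_out(size, kernel=4, stride=2, pad=1):
    """Output size of a transposed (up-) convolution along one axis."""
    return (size - 1) * stride - 2 * pad + kernel


def unet_sizes(size=256, depth=7):
    """Spatial size after each of the 2 * depth layers, keyed by layer index."""
    sizes = {}
    for layer in range(1, depth + 1):              # encoder: layers 1..depth
        size = conv_out(size)
        sizes[layer] = size
    for layer in range(depth + 1, 2 * depth + 1):  # decoder: layers depth+1..2*depth
        size = upconv_out(size)
        sizes[layer] = size
    return sizes


def skip_pairs(depth=7):
    """(encoder layer i, decoder layer n - i) pairs, with n = 2 * depth."""
    n = 2 * depth
    return [(i, n - i) for i in range(1, depth)]


sizes = unet_sizes()
pairs = skip_pairs()
```

Under these assumptions each of the 7 encoder layers halves the resolution (256 down to 2), each decoder layer doubles it, and every pair (i, n − i) has matching spatial size, which is what makes channel-wise concatenation of the skip connection possible. The bottleneck layer (i = 7) has no distinct partner and is therefore left out of `skip_pairs`.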
The U-NET encoder produces an audio feature map with a channel dimension of 512 after the last convolution layer. The visual feature vector of dimension 512 is replicated at every spatial location of the audio feature map to produce a visual feature map of the same spatial size. Then we concatenate the audio and visual features along the channel dimension to produce an audio-visual feature map. The series of up-convolutions in the U-NET is finally performed on the concatenated audio-visual feature map to generate a multiplicative spectrogram mask.
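The replicate-and-concatenate fusion step can be illustrated with plain nested lists. This is a toy sketch: `tile_and_concat` is a hypothetical helper, and the 2 x 2 spatial size used in the usage lines is an assumption chosen for the example, not a value taken from the text.

```python
def tile_and_concat(audio_map, visual_vec):
    """Append the same visual vector to the channels of every spatial cell.

    audio_map: nested list indexed as [row][col][channel].
    visual_vec: flat list of visual feature values.
    """
    return [[cell + visual_vec for cell in row] for row in audio_map]


# Toy example: a 2 x 2 audio map with 512 channels and a 512-d visual vector.
rows, cols, channels = 2, 2, 512
audio_map = [[[0.0] * channels for _ in range(cols)] for _ in range(rows)]
visual_vec = [1.0] * 512
fused = tile_and_concat(audio_map, visual_vec)
fused_shape = (len(fused), len(fused[0]), len(fused[0][0]))
```

Concatenating two 512-channel maps gives 1024 channels at every spatial location, while the spatial size is unchanged.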
Appendix D Ablation Study
We perform an ablation study to examine the impact of the key components of our Co-Separation framework. Table 5 compares the source separation performance of several variants of our model on the AudioSet dataset. We compare our model with three variants: one that uses only the co-separation loss, one that uses only the object-consistency loss, and one that removes the “adaptable” class. We can see that the object-consistency loss alone does not suffice to learn source separation, but together with the co-separation loss we obtain the best performance. The “adaptable” class is not essential to our system, but arms the network with the flexibility to assign noise or unrelated sounds to it, leading to better separation performance as shown in the table.
| Variant | SDR | SIR | SAR |
| co-separation loss only | 3.65 | 6.13 | 13.2 |
| object-consistency loss only | 0.14 | 0.14 | 45.0 |
| without “adaptable” class | 3.70 | 5.30 | 14.4 |