Cyclic Co-Learning of Sounding Object Visual Grounding and Sound Separation

by Yapeng Tian, et al.
University of Rochester

There are rich synchronized audio and visual events in our daily life. Inside the events, audio scenes are associated with the corresponding visual objects; meanwhile, sounding objects can indicate and help to separate their individual sounds in the audio track. Based on this observation, in this paper, we propose a cyclic co-learning (CCoL) paradigm that can jointly learn sounding object visual grounding and audio-visual sound separation in a unified framework. Concretely, we can leverage grounded object-sound relations to improve the results of sound separation. Meanwhile, benefiting from discriminative information from separated sounds, we improve training example sampling for sounding object grounding, which builds a co-learning cycle for the two tasks and makes them mutually beneficial. Extensive experiments show that the proposed framework outperforms the compared recent approaches on both tasks, and they can benefit from each other with our cyclic co-learning.




1 Introduction

Seeing and hearing are two of the most important senses for human perception. Even though the auditory and visual information may be discrepant, the percept is unified through multisensory integration [bulkin2006seeing]. This phenomenon is considered to derive from the characteristics of specific neural cells, as researchers in cognitive neuroscience found that the superior temporal sulcus in the temporal cortex of the brain can simultaneously respond to visual, auditory, and tactile signals [hikosaka1988polysensory, stein1993merging]. Accordingly, we tend to unconsciously correlate different sounds with their visual producers, even in a noisy environment. For example, in a cocktail-party scenario that contains multiple sounding and silent instruments, as shown in Fig. 1, we can effortlessly filter out the silent ones, identify the different sounding objects, and simultaneously separate the sound of each playing instrument, even when faced with a static visual image.

Figure 1: Our model can perform audio-visual joint perception to simultaneously identify silent and sounding objects and separate sounds for individual audible objects.

For computational models, the multi-modal sound separation and sounding object alignment capacities are reflected in audio-visual sound separation (AVSS) and sound source visual grounding (SSVG), respectively. AVSS aims to separate sounds for individual sound sources with help from visual information, and SSVG tries to identify objects that make sounds in visual scenes. These two tasks have primarily been explored in isolation in the literature. This disparity with human perception motivates us to address them in a co-learning manner, where we leverage the joint modeling of the two tasks to discover objects that make sounds and separate their corresponding sounds without using annotations.

Although existing works on AVSS [ephrat2018looking, owens2018audio, gao2018learning, zhao2018sound, gao2019co, xu2019recursive, gan2020music] and SSVG [kidron2005pixels, senocak2018learning, tian2018audio, arandjelovic2018objects, qian2020multiple] are abundant, it is non-trivial to jointly learn the two tasks. Previous AVSS methods implicitly assume that all objects in video frames make sounds. They learn to directly separate sounds guided by encoded features from either entire video frames [ephrat2018looking, owens2018audio, zhao2018sound] or detected objects [gao2019co], without parsing which objects are sounding in unlabelled videos. At the same time, the SSVG methods mostly focus on the simple scenario with a single sound source, barely exploring the realistic cocktail-party environment [senocak2018learning, arandjelovic2018objects]. Therefore, these methods blindly use information from silent objects to guide separation learning, and blindly use information from sound separation to identify sounding objects.

To address these drawbacks and enable the co-learning of both tasks, we introduce a new sounding object-aware sound separation strategy. It aims to separate sounds guided only by sounding objects, where the audiovisual scenario usually consists of multiple sounding and silent objects. To address this challenging task, SSVG can help identify each sounding object from a mixture of visual objects. This objective differs from previous approaches, which make great efforts to improve the localization precision of the sound source in simple audiovisual scenarios [senocak2018learning, tian2018audio, arandjelovic2018objects, owens2018audio]. With such approaches, it is challenging to discriminatively discover isolated sounding objects inside scenarios via the predicted audible regions visualized by heatmaps [senocak2018learning, arandjelovic2018objects, hu2019deep]. For example, two nearby sounding objects might be grouped together in a heatmap, and we have no good principle for extracting individual objects from a single located region.

To enable the co-learning, we propose to directly discover individual sounding objects in visual scenes from visual object candidates. With the help of grounded objects, we learn sounding object-aware sound separation. Clearly, a good grounding model can help to mitigate learning noise from silent objects and improve separation. However, this one-way relationship between the two tasks cannot ensure that separation further enhances grounding, because the tasks interact only loosely during sounding object selection. To alleviate the problem, we use separation results to help sample more reliable training examples for grounding. This closes the co-learning cycle, and both grounding and separation performance improve, as illustrated in Fig. 2. Experimental results show that the two tasks can be mutually beneficial with the proposed cyclic co-learning, which leads to noticeable performance gains and outperforms recent methods on sounding object visual grounding and audio-visual sound separation tasks.

The main contributions of this paper are as follows: (1) We propose to perform sounding object-aware sound separation with the help of the visual grounding task, which essentially analyzes the natural but previously ignored cocktail-party audiovisual scenario. (2) We propose a cyclic co-learning framework between AVSS and SSVG to make these two tasks mutually beneficial. (3) Extensive experiments and an ablation study validate that our models can outperform recent approaches, and that the tasks can benefit from each other with our cyclic co-learning.

Figure 2: Cyclic Co-learning of sounding object visual grounding and audio-visual sound separation, enabled by sounding object-aware sound separation. Given detected objects from a video and the sound mixture, our model can recognize whether they are audible via a grounding network (Audible?) and separate their sounds with an audio-visual sound separation network to help determine potential sounding and silent sources (Which?).

2 Related Work

Sound Source Visual Grounding: Sound source visual grounding aims to identify the visual objects associated with specific sounds in daily audiovisual scenarios. This task is closely related to the visual localization problem of sound sources, which aims to find pixels that are associated with specific sounds. Early works in this field use canonical correlation analysis [kidron2005pixels] or mutual information [hershey2000audio, Hu2020CrossTaskTF] to detect visual locations that make the sound in terms of localization or segmentation. Recently, deep audio-visual models have been developed to locate sounding pixels based on audio-visual embedding similarities [arandjelovic2018objects, owens2018audio, hu2019deep, hu2020curriculum], cross-modal attention mechanisms [senocak2018learning, tian2018audio, afouras2020self], vision-to-sound knowledge transfer [gan2019self], and sounding class activation mapping [qian2020multiple]. These approaches can predict audible regions, shown as heat-map visualizations, for a single sound source in simple scenarios, but cannot explicitly detect multiple isolated sounding objects when they are sounding at the same time. In the most recent work, Hu et al. [hu2020discriminative] propose to discriminatively identify sounding and silent objects in the cocktail-party environment, but rely on reliable visual knowledge of objects learned from manually selected single-source videos. Unlike the previous methods, we focus on finding each individual sounding object in cocktail-party scenarios with multiple sounding and silent objects without any human annotations, and the grounding model is cooperatively learned with the audio-visual sound separation task.

Audio-Visual Sound Separation: Given the long history of research on sound source separation in signal processing, we only survey recent audio-visual sound source separation methods [ephrat2018looking, owens2018audio, gao2018learning, zhao2018sound, gao2019co, xu2019recursive, zhao2019sound, rouditchenko2019self, gan2020music] in this section. These approaches separate visually indicated sounds for different audio sources (e.g., speech in [ephrat2018looking, owens2018audio], music instruments in [zhao2018sound, gao2019co, zhao2019sound, gan2020music], and universal sources in [gao2018learning, rouditchenko2019self]) with a commonly used mix-and-separate strategy to build training examples [hershey2016deep, huang2015joint]. Concretely, they generate the corresponding sounds w.r.t. the given visual objects [zhao2018sound] or object motions in the video [zhao2019sound], where the objects are assumed to be sounding during separation. Hence, if a visual object belonging to the training audio source categories is silent, these models usually fail to separate an all-zero sound for it. Due to the commonly occurring overlap between audio spectrograms, these methods will introduce artifacts during training and make wrong predictions during inference. Unlike previous approaches, we propose to perform sounding object-aware audio separation to alleviate the problem. Moreover, sounding object visual grounding is explored in a unified framework with the separation task.

Audio-Visual Video Understanding: Since the auditory modality, which contains scenes synchronized with the visual modality, is widely available in videos, it has attracted a lot of interest in recent years. Besides sound source visual grounding and separation, a range of audio-visual video understanding tasks including audio-visual action recognition [gao2019listentolook, kazakos2019epic, korbar2019scsampler], audio-visual event localization [tian2018audio, lin2019dual, wu2019DAM], audio-visual video parsing [tian2020avvp], cross-modal generation [chen2018lip, chen2019hierarchical, gao20192, zhou2020sep, Zhou2021pose, Xu2021visual], and audio-visual video captioning [rahman2019watch, Tian_2019_CVPR_Workshops, wang2018watch] have been explored. Different from these works, we introduce a cyclic co-learning framework for both grounding and separation tasks and show that they can be mutually beneficial.

3 Method

3.1 Overview

Given an unlabeled video clip $V$ with the synchronized sound $s(t)$ (a time-discrete audio signal), $O = \{O_1, \dots, O_N\}$ are $N$ detected objects in the video frames and the sound mixture is $s(t) = \sum_{n=1}^{N} s_n(t)$. Here, $s_n(t)$ is the separated sound of the object $O_n$. When it is silent, $s_n(t) = 0$. Our co-learning aims to recognize each sounding object $O_n$ and then separate its sound $s_n(t)$.

The framework, as illustrated in Fig. 2, mainly contains two modules: a sounding object visual grounding network and an audio-visual sound separation network. The sounding object visual grounding network can discover isolated sounding objects from object candidates inside video frames. We learn the grounding model from sampled positive and negative audio-visual pairs. To learn sound separation in the framework, we adopt a commonly used mix-and-separate strategy [gao2019co, hershey2016deep, zhao2018sound] during training. Given two training video and sound pairs $\{V^{(1)}, s^{(1)}(t)\}$ and $\{V^{(2)}, s^{(2)}(t)\}$, we obtain a mixed sound:

$$s_m(t) = s^{(1)}(t) + s^{(2)}(t), \tag{1}$$

and find object candidates $O^{(1)}$ and $O^{(2)}$ from the two videos. The sounding object visual grounding network will recognize audible objects from $O^{(1)}$ and $O^{(2)}$, and the audio-visual sound separation network will separate sounds for the grounded objects. Sounds are processed in a time-frequency space with the short-time Fourier transform (STFT).
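The mix-and-separate input described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the authors' pipeline: the function names (`mix`, `stft_magnitude`, `hann_window`), the toy tones, and the tiny window size are all assumptions chosen so the example runs quickly. It shows that two "solo" sounds are summed sample by sample and that the resulting mixture is then viewed as a magnitude spectrogram.

```python
import cmath
import math


def hann_window(size):
    """Periodic Hann window of the given length."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / size) for i in range(size)]


def stft_magnitude(signal, win_size, hop):
    """Magnitude STFT computed with a direct DFT (slow; illustration only)."""
    window = hann_window(win_size)
    n_bins = win_size // 2 + 1
    frames = []
    for start in range(0, len(signal) - win_size + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(win_size)]
        frames.append([
            abs(sum(frame[i] * cmath.exp(-2j * math.pi * k * i / win_size)
                    for i in range(win_size)))
            for k in range(n_bins)
        ])
    return frames  # indexed as [frame][frequency bin]


def mix(sound_1, sound_2):
    """Mix-and-separate input: sample-wise sum of two waveforms."""
    return [a + b for a, b in zip(sound_1, sound_2)]


# Toy "solo" sounds: two pure tones sampled at 64 Hz (hypothetical values).
sample_rate = 64
sound_1 = [math.sin(2.0 * math.pi * 8 * t / sample_rate) for t in range(sample_rate)]
sound_2 = [math.sin(2.0 * math.pi * 24 * t / sample_rate) for t in range(sample_rate)]
mixture = mix(sound_1, sound_2)
mixture_spec = stft_magnitude(mixture, win_size=16, hop=8)
```

With a 16-sample window at 64 Hz, each frequency bin spans 4 Hz, so the two tones fall in bins 2 and 6 of every frame of `mixture_spec`. A separation network would be trained to recover the spectrogram of each solo sound from this mixture.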

With the sounding object-aware sound separation, we can co-learn the two tasks and improve audio-visual sound separation with the help of sounding object visual grounding. However, the separation performance will rely heavily on the grounding model, and the grounding task might not benefit from co-learning due to the weak feedback from separation. To evolve both models simultaneously, we propose a cyclic co-learning strategy, as illustrated in Fig. 3, which has an additional backward process that utilizes separation results to directly improve training sample mining for sounding object visual grounding. In this manner, we can make the two tasks mutually beneficial.

3.2 Sounding Object Visual Grounding

Videos contain various sounds and visual objects, and not all objects are audible. To find sounding objects in videos $V^{(1)}$ and $V^{(2)}$ and further utilize grounding results for separation, we formulate sounding object visual grounding as a binary matching problem.

Sounding Object Candidates: To better support the audio-visual matching problem, we follow the widely adopted image representation strategy of visual object proposals in image captioning [karpathy2015deep, anderson2018bottom], which has also been employed in recent work on audio-visual learning [gao2019co]. Concretely, the potential audible visual objects are first proposed from videos using an object detector. In our implementation, we use the Faster R-CNN [ren2015faster] object detector trained on the Open Images dataset [krasin2017openimages] from [gao2019co] to detect objects from video frames in $V^{(1)}$ and $V^{(2)}$ and obtain $O^{(1)}$ and $O^{(2)}$. Next, we learn to recognize sounding objects in $O^{(1)}$ and $O^{(2)}$ associated with $s^{(1)}(t)$ and $s^{(2)}(t)$, respectively. For simplicity, we use an object $O$ and a sound $s(t)$ as an example to illustrate our grounding network.

Audio Network: The raw waveform $s(t)$ is transformed to an audio spectrogram $S$ with the STFT. A VGG-like [simonyan2014very] 2D CNN, VGGish, followed by global max pooling (GMP), is used to extract an audio embedding $f_s$ from $S$.

Visual Network: The visual network extracts features from a detected visual object $O$. We use the pre-trained ResNet-18 [he2016deep] model before the last fully-connected layer to extract a visual feature map and perform GMP to obtain a visual feature vector $f_o$ for $O$.

Grounding Module: The audio-visual grounding module takes the audio feature $f_s$ and the visual object feature $f_o$ as inputs to predict whether the visual object $O$ is one of the sound makers for $s(t)$. We solve it using a two-class classification network. It first concatenates $f_s$ and $f_o$ and then uses a 3-layer Multi-Layer Perceptron (MLP) with a Softmax function to output a probability score $g(s, O) \in [0, 1]$. Here, if $g(s, O) \geq 0.5$, $\{s, O\}$ is a positive pair and $s$ and $O$ are matched; otherwise, $O$ is not a sound source.
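The grounding module can be sketched as a small classifier head. The code below is a hedged illustration in plain Python rather than the authors' implementation: the feature dimensions, hidden sizes, random weights, ReLU activations, the ordering of the two Softmax outputs, and the 0.5 decision threshold are all assumptions, and every function name is hypothetical. It only shows the data flow: concatenate the audio and visual features, apply three fully-connected layers, and read a matching probability from a two-class Softmax.

```python
import math
import random


def linear(x, weights, bias):
    """Fully-connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def init_mlp(rng, sizes):
    """Random weights for an MLP with the given layer sizes (illustrative only)."""
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def grounding_score(audio_feat, visual_feat, layers):
    """Concatenate both features, apply the MLP, and return P(matched)."""
    hidden = list(audio_feat) + list(visual_feat)
    for index, (weights, bias) in enumerate(layers):
        hidden = linear(hidden, weights, bias)
        if index < len(layers) - 1:
            hidden = [max(0.0, v) for v in hidden]  # ReLU (assumed activation)
    probs = softmax(hidden)  # [P(not matched), P(matched)] by assumption
    return probs[1]


rng = random.Random(0)
audio_feat = [rng.random() for _ in range(8)]   # stands in for the audio embedding
visual_feat = [rng.random() for _ in range(8)]  # stands in for the object feature
mlp = init_mlp(rng, [16, 12, 6, 2])             # 3 layers, 2 output classes
score = grounding_score(audio_feat, visual_feat, mlp)
is_sounding = score >= 0.5                      # assumed decision threshold
```

With untrained random weights the score itself is meaningless; the point is only the shape of the computation that the training losses later supervise.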

Training and Inference: To train the sounding object visual grounding network, we need to sample positive/matched and negative/mismatched audio and visual object pairs. It is straightforward to obtain negative pairs by composing audio and objects from different videos. For example, $s^{(1)}$ from $V^{(1)}$ and a randomly selected object $O^{(2)}_r$ from $V^{(2)}$ can serve as a negative pair. However, positive audio-visual pairs are hard to extract since not all objects are audible in videos. If an object $O^{(1)}_n$ from $V^{(1)}$ is not an audio source, the object and $s^{(1)}$ will be a negative pair, even though they are from the same video. To address the problem, we cast the positive sample mining as a multiple instance learning problem and sample the most confident pair as a positive sample with a grounding loss as the measurement:

$$\hat{O}^{(1)}_n = \arg\min_{O^{(1)}_n} f\big(g(s^{(1)}, O^{(1)}_n),\, y_{pos}\big), \tag{2}$$

where $f(\cdot)$ is a cross-entropy function; $y_{pos}$ is a one-hot encoding for positive pairs; $\hat{O}^{(1)}_n$ and $s^{(1)}$ will be the positive audio-visual pair for training. With the sampled negative and positive data, we can define the loss function to learn the sounding object visual grounding:

$$l_{grd} = f\big(g(s^{(1)}, \hat{O}^{(1)}_n),\, y_{pos}\big) + f\big(g(s^{(1)}, O^{(2)}_r),\, y_{neg}\big), \tag{3}$$

where $y_{neg}$ is the negative label. The visual grounding network can be end-to-end optimized with sampled training pairs via $l_{grd}$.
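The multiple-instance positive mining and the resulting grounding loss can be illustrated as follows. This is a sketch under stated assumptions, not the paper's code: the grounding network is replaced by a list of already-computed matching probabilities, the cross-entropy is written for a single "matched" probability, and all names (`mine_positive`, `grounding_loss`) and numbers are hypothetical.

```python
import math


def cross_entropy(p_matched, label):
    """Binary cross-entropy on the 'matched' probability (label 1 = positive)."""
    eps = 1e-12
    p = min(max(p_matched, eps), 1.0 - eps)
    return -math.log(p) if label == 1 else -math.log(1.0 - p)


def mine_positive(scores_same_video):
    """Pick the object whose pairing with the video's own sound has the lowest
    positive-label loss, i.e. the most confident candidate."""
    losses = [cross_entropy(p, 1) for p in scores_same_video]
    best = min(range(len(losses)), key=losses.__getitem__)
    return best, losses[best]


def grounding_loss(scores_same_video, score_other_video):
    """Mined positive term plus a negative term for an object from another video."""
    best, positive_term = mine_positive(scores_same_video)
    negative_term = cross_entropy(score_other_video, 0)
    return best, positive_term + negative_term


# Hypothetical scores for three objects detected in the first video, paired with
# that video's sound, and one randomly chosen object from the second video.
scores_video_1 = [0.2, 0.9, 0.4]
score_random_other = 0.1
positive_index, loss = grounding_loss(scores_video_1, score_random_other)
```

Here the second object (index 1) is mined as the positive example, since minimizing the positive-label cross-entropy is equivalent to taking the highest matching probability.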

During inference, we can feed audio-visual pairs $\{O^{(1)}_n, s^{(1)}\}$ and $\{O^{(2)}_n, s^{(2)}\}$ into the trained model to find sounding objects inside the two videos. To facilitate audio-visual sound separation, we need to detect sounding objects from the sound mixture $s_m$ rather than $s^{(1)}$ and $s^{(2)}$, since the individual sounds are unavailable at the testing stage for the separation task.

3.3 Sounding Object-Aware Separation

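Given detected objects in $O^{(1)}$ and $O^{(2)}$, we separate sounds for each object from the sound mixture $s_m$ and mute separated sounds of silent objects.

Using an audio-visual sound separation network, we can predict sound spectrograms $\{S^{(1)}_n\}_{n=1}^{N}$ and $\{S^{(2)}_n\}_{n=1}^{N}$ for objects in $O^{(1)}$ and $O^{(2)}$, respectively. According to the waveform relationship in Eq. 1, we can approximate the spectrogram magnitude relationship as $S^{(1)} \approx \sum_{n=1}^{N} S^{(1)}_n$ and $S^{(2)} \approx \sum_{n=1}^{N} S^{(2)}_n$, where $S^{(1)}$ and $S^{(2)}$ are the spectrograms of $s^{(1)}$ and $s^{(2)}$. To learn the separation network, we can optimize it with an L1 loss function:

$$l_{sep} = \Big\| S^{(1)} - \sum_{n=1}^{N} S^{(1)}_n \Big\|_1 + \Big\| S^{(2)} - \sum_{n=1}^{N} S^{(2)}_n \Big\|_1. \tag{4}$$

However, not all objects are audible and spectrograms from different objects contain overlapping content. Therefore, even if an object $O^{(1)}_n$ is not sounding, a non-zero sound spectrogram $S^{(1)}_n$ can still be separated for it from the spectrogram of the sound mixture $S_m$, which will introduce errors during training. To address the problem, we propose a sounding object-aware separation loss function:

$$l^{*}_{sep} = \Big\| S^{(1)} - \sum_{n=1}^{N} g^{*}(s_m, O^{(1)}_n)\, S^{(1)}_n \Big\|_1 + \Big\| S^{(2)} - \sum_{n=1}^{N} g^{*}(s_m, O^{(2)}_n)\, S^{(2)}_n \Big\|_1, \tag{5}$$

where $g^{*}(\cdot)$ is a binarized value of $g(\cdot)$. If an object is not a sound source, $g^{*}(\cdot)$ will be equal to zero. Thus, the sounding object-aware separation can help to reduce training errors from silent objects in Eq. 4.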
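The effect of gating separated spectrograms by binarized grounding scores can be seen on a toy example. The sketch below is illustrative only and makes several assumptions: spectrograms are tiny nested lists, only the term for one of the two mixed videos is computed (the full loss would add the same term for the other video), the binarization threshold of 0.5 is assumed, and the function names are hypothetical.

```python
def weighted_sum(spectrograms, gates):
    """Element-wise sum of spectrograms, each scaled by its gate."""
    rows, cols = len(spectrograms[0]), len(spectrograms[0][0])
    return [[sum(g * s[r][c] for g, s in zip(gates, spectrograms))
             for c in range(cols)] for r in range(rows)]


def l1_distance(a, b):
    """Sum of absolute element-wise differences."""
    return sum(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def separation_loss(target, predictions, scores=None, threshold=0.5):
    """L1 loss for one video. With `scores`, predictions of objects judged
    silent are gated to zero before summing (sounding object-aware variant)."""
    if scores is None:
        gates = [1.0] * len(predictions)
    else:
        gates = [1.0 if s >= threshold else 0.0 for s in scores]
    return l1_distance(target, weighted_sum(predictions, gates))


# Hypothetical 2x2 spectrograms: one truly sounding object, one silent object
# whose prediction leaks energy from the mixture.
target = [[1.0, 0.0], [0.0, 1.0]]
pred_sounding = [[0.9, 0.0], [0.0, 0.9]]
pred_silent = [[0.3, 0.0], [0.0, 0.3]]
grounding_scores = [0.8, 0.2]  # mixture-conditioned grounding probabilities

plain = separation_loss(target, [pred_sounding, pred_silent])
aware = separation_loss(target, [pred_sounding, pred_silent], grounding_scores)
```

In this toy case the plain loss is about 0.4 because the leaked energy of the silent object is charged against the target, while the gated loss is about 0.2 and reflects only the error of the sounding object.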

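In addition, we introduce additional grounding loss terms to guide the grounding model learning from the sound mixture. Since we have no sounding object annotations, we adopt a similar positive sample mining strategy as in Eq. 2 and define a grounding loss as follows:

$$l^{*}_{grd} = f\big(g(s_m, \hat{O}^{(1)}_n),\, y_{pos}\big) + f\big(g(s_m, \hat{O}^{(2)}_n),\, y_{pos}\big), \tag{6}$$

where $\hat{O}^{(1)}_n$ and $\hat{O}^{(2)}_n$ are the most confident objects in the two videos with respect to the sound mixture $s_m$.
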
3.4 Co-learning in a Cycle

Figure 3: Cyclic co-learning. Facilitated by sounding object visual grounding, our model can employ sounding object-aware sound separation to improve separation. Meanwhile, separation results can help to do effective training sample mining for grounding.

Combining grounding and separation loss terms, we can learn the two tasks in a unified way with a co-learning objective function: $l_{col} = l_{grd} + l^{*}_{grd} + l^{*}_{sep}$.

Although our sounding object visual grounding and audio-visual sound separation models can be learned together, the two tasks interact only loosely in $l_{col}$. Clearly, a good grounding network can help improve the separation task. However, the grounding task might not be able to benefit from co-learning, since there is no strong feedback from separation to guide learning the grounding model. To further strengthen the interaction between the two tasks, we propose a cyclic co-learning strategy, which can make them benefit from each other.

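If an object $O^{(1)}_n$ makes sound in video $V^{(1)}$, the separated spectrogram $S^{(1)}_n$ should be close to $S^{(1)}$; otherwise, the difference between $S^{(1)}_n$ and $S^{(1)}$ should be larger than the difference between $S^{(1)}$ and a spectrogram separated for a sounding object. We use a distance to measure the dissimilarity of spectrograms, $d^{(1)}_n = d(S^{(1)}_n, S^{(1)})$, where $d^{(1)}_n$ will be small for a sounding object $O^{(1)}_n$. Based on this observation, we select the object with the minimum $d^{(1)}_n$, which makes the dominant sound in $V^{(1)}$, to compose positive samples for sounding object visual grounding. Let $n_1 = \arg\min_n d^{(1)}_n$ and $n_2 = \arg\min_n d^{(2)}_n$. We can re-formulate the grounding loss terms as:

$$l_{grd} = f\big(g(s^{(1)}, O^{(1)}_{n_1}),\, y_{pos}\big) + f\big(g(s^{(1)}, O^{(2)}_r),\, y_{neg}\big),$$

$$l^{*}_{grd} = f\big(g(s_m, O^{(1)}_{n_1}),\, y_{pos}\big) + f\big(g(s_m, O^{(2)}_{n_2}),\, y_{pos}\big).$$

In addition, if $d^{(1)}_n$ is very large, the object $O^{(1)}_n$ is very likely not audible, which can help us sample potential negative examples for mixed sound grounding. Specifically, we select the objects that are associated with the largest $d^{(1)}_n$ and $d^{(2)}_n$, and these distances must be larger than a threshold $\epsilon$. Let $n'_1 = \arg\max_n d^{(1)}_n$, s.t. $d^{(1)}_{n'_1} > \epsilon$, and $n'_2 = \arg\max_n d^{(2)}_n$, s.t. $d^{(2)}_{n'_2} > \epsilon$. We can update $l^{*}_{grd}$ with learning from potential negative samples:

$$l^{*}_{grd} = f\big(g(s_m, O^{(1)}_{n_1}),\, y_{pos}\big) + f\big(g(s_m, O^{(2)}_{n_2}),\, y_{pos}\big) + f\big(g(s_m, O^{(1)}_{n'_1}),\, y_{neg}\big) + f\big(g(s_m, O^{(2)}_{n'_2}),\, y_{neg}\big).$$

Finally, we can co-learn the two tasks in a cycle by optimizing the joint cyclic co-learning loss function: $l_{ccol} = l_{grd} + l^{*}_{grd} + l^{*}_{sep}$. Inside cyclic co-learning, as illustrated in Fig. 3, we use visual grounding to improve sound separation and enhance visual grounding based on feedback from sound separation. The learning strategy can make the tasks help each other in a cycle and significantly improve performance for both tasks.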
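The separation-driven sample mining described above can be sketched as a small selection routine. This is an illustration under explicit assumptions rather than the authors' code: the extracted text does not preserve which distance is used, so a squared L2 distance is assumed here, the threshold value is arbitrary, and the function names are hypothetical.

```python
def squared_l2(a, b):
    """Squared L2 distance between two spectrograms (assumed distance)."""
    return sum((x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def cyclic_sample(target, predictions, threshold):
    """Use separation results to pick grounding samples for one video.

    Returns (positive_index, negative_index_or_None, distances): the object
    whose separated spectrogram is closest to the video's own sound becomes
    the positive; the farthest object becomes a negative only if its distance
    exceeds the threshold.
    """
    distances = [squared_l2(p, target) for p in predictions]
    positive = min(range(len(distances)), key=distances.__getitem__)
    farthest = max(range(len(distances)), key=distances.__getitem__)
    is_negative = farthest != positive and distances[farthest] > threshold
    return positive, (farthest if is_negative else None), distances


# Hypothetical 2x2 spectrograms for a video with one sounding and one silent object.
target = [[1.0, 0.0], [0.0, 1.0]]
separated = [
    [[0.9, 0.0], [0.0, 0.8]],  # close to the target: likely the sounding object
    [[0.1, 0.2], [0.3, 0.0]],  # far from the target: likely silent
]
positive, negative, distances = cyclic_sample(target, separated, threshold=0.5)
```

In this toy case the first object is selected as the positive sample and the second as a potential negative; raising the threshold above the second object's distance would yield no negative sample for this video.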

4 Experiments

4.1 Experimental Setting

Dataset: In our experiments, 520 musical solo videos available online from the widely used MIT MUSIC dataset [zhao2018sound] are used. The dataset includes 11 musical instrument categories: accordion, acoustic guitar, cello, clarinet, erhu, flute, saxophone, trumpet, tuba, violin, and xylophone. The dataset is relatively clean and sounding instruments are usually visible in videos. We split it into training/validation/testing sets, which have 468/26/26 videos from different categories, respectively. To train and test our cyclic co-learning model, we randomly select three other videos for each video to compose training and testing samples. Let us denote the four videos as A, B, C, and D. We compose A and B together as $V^{(1)}$ and C and D together as $V^{(2)}$, while the sounds of $V^{(1)}$ and $V^{(2)}$ come only from A and C, respectively. Thus, objects from B and D in the composed samples are inaudible. Finally, we have 18,720/260/260 composed samples in our training/val/test sets for the two tasks.
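The sample composition described above can be sketched as follows. This is a hypothetical illustration, not the authors' data loader: the video identifiers, the dictionary layout, and the function names are assumptions. It shows how one anchor video A is combined with three other randomly chosen videos so that B and D contribute only silent objects.

```python
import random


def compose_sample(video_ids, anchor, rng):
    """Pair the anchor (A) with three other videos (B, C, D).

    The first composed video shows A and B but carries only A's audio; the
    second shows C and D but carries only C's audio.
    """
    others = [v for v in video_ids if v != anchor]
    b, c, d = rng.sample(others, 3)
    return {
        "V1": {"frames_from": (anchor, b), "audio_from": anchor},
        "V2": {"frames_from": (c, d), "audio_from": c},
    }


def silent_videos(sample):
    """Videos whose objects are visible but inaudible in the composed sample."""
    return [v for part in sample.values()
            for v in part["frames_from"] if v != part["audio_from"]]


rng = random.Random(0)
video_ids = [f"video_{i:03d}" for i in range(468)]  # hypothetical identifiers
sample = compose_sample(video_ids, "video_000", rng)
inaudible = silent_videos(sample)
```

As a sanity check on the reported counts, 18,720 training samples equal 468 × 40 and 260 validation or test samples equal 26 × 10, which would be consistent with composing 40 and 10 samples per video, respectively; the source does not state these per-video numbers.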

Evaluation Metrics: For sounding object visual grounding, we feed detected audible and silent objects in videos into different grounding models and evaluate their binary classification accuracy. We use the mir_eval library [raffel2014mir_eval] to measure sound separation performance in terms of two standard metrics: Signal-to-Distortion Ratio (SDR) and Signal-to-Interference Ratio (SIR).

| Methods | OTS [arandjelovic2018objects] | DMC [hu2019deep] | Grounding only | CoL | CCoL |
|---|---|---|---|---|---|
| Single Sound | 58.7 | 65.3 | 72.0 | 67.0 | 84.5 |
| Mixed Sound | 51.8 | 52.6 | 61.4 | 58.2 | 75.9 |

Table 1: Sounding object visual grounding performance (%). Top-2 results are highlighted.

Figure 4: Qualitative results of sounding object visual grounding for both audible and silent objects. We use two icons to denote whether objects in video frames are audible or not and grounded sounding objects from DMC and CCoL are shown in green boxes. Our CCoL model can effectively identify both sounding and silent objects, while the DMC fails. Note that 2-sound mixtures are used as inputs.
Figure 5: Qualitative results of audio-visual sound separation for both audible and silent objects. Our CCoL model can well mitigate learning noise from silent objects during training and generate more accurate sounds.

Implementation Details: We sub-sample audio signals at 11kHz, and each video sample is approximately 6 seconds. The STFT is calculated using a Hann window size of 1022 and a hop length of 256, and each 1D audio waveform is transformed to a time-frequency spectrogram. Then, it is re-sampled to = 256. The video frame rate is set as 1 fps and we randomly select 3 frames per 6-second video. Objects extracted from video frames are resized to and then randomly cropped to as inputs to our network. is set to . We use a soft sound spectrogram masking strategy as in [gao2019co, zhao2018sound] to generate individual sounds from audio mixtures and adopt an audio-visual sound separation network from [zhao2018sound]. More details about the network can be found in our appendix. Our networks are optimized by Adam [kingma2014adam]. Since the sounding object visual grounding and audio-visual sound separation tasks are mutually related, we need to learn good initial models so that they can benefit from cyclic co-learning. To this end, we learn our CCoL model in three steps in a curriculum learning [bengio2009curriculum] manner. Firstly, we train the sounding object visual grounding network with the grounding loss $l_{grd}$. Secondly, we co-learn the grounding network, initialized with the pre-trained weights, and the separation network, optimized by the co-learning objective $l_{col}$. Thirdly, we use the cyclic co-learning loss $l_{ccol}$ to further fine-tune the two models.
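The STFT settings above determine the spectrogram size before re-sampling, which can be checked with simple arithmetic. The helper below is a sketch under assumptions that the text does not confirm: "11kHz" is read as 11,025 Hz, the clip is taken to be 65,535 samples long (about 5.94 seconds), and a centered framing convention is assumed in which a padded signal yields one frame per hop plus one. The function name is hypothetical.

```python
def stft_shape(num_samples, win_size, hop, center=True):
    """Return (frequency bins, time frames) of a one-sided STFT.

    With center=True the signal is assumed to be padded so that frames are
    centered on multiples of the hop; otherwise only full windows are used.
    """
    n_bins = win_size // 2 + 1
    if center:
        n_frames = 1 + num_samples // hop
    else:
        n_frames = 1 + (num_samples - win_size) // hop
    return n_bins, n_frames


sample_rate = 11025   # assumed reading of "11kHz"
num_samples = 65535   # assumed clip length, roughly 6 seconds
bins, frames = stft_shape(num_samples, win_size=1022, hop=256)
duration_seconds = num_samples / sample_rate
```

Under these assumptions a window of 1022 gives 512 frequency bins and the clip gives 256 time frames, so the spectrogram is 512 × 256 before the re-sampling step mentioned above.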

Figure 6: Real-world sounding object visual grounding and audio-visual sound separation. The guitar and violin are playing and the flute is visible but not audible. Our model can identify sounding objects: guitar and violin and silent object: flute. Moreover, it can simultaneously separate individual sounds for each instrument, while the SoP using the same noisy training data as ours fails to associate objects with the corresponding sounds, thus obtains poor separation results.

4.2 Sounding Object Visual Grounding

We compare our methods to two recent methods: OTS [arandjelovic2018objects] and DMC [hu2019deep]. In addition, we conduct an ablation study to investigate the proposed models. The Grounding only model is trained only with the grounding losses $l_{grd}$ and $l^{*}_{grd}$; the co-learning (CoL) model jointly learns visual grounding and sound separation using the co-learning objective $l_{col}$; and the cyclic co-learning (CCoL) model further strengthens the interaction between the two tasks and is optimized via the cyclic co-learning loss $l_{ccol}$. We evaluate sounding object visual grounding performance on both solo and mixed sounds.

Table 1 and Figure 4 show quantitative and qualitative sounding object visual grounding results, respectively. Even our Grounding only model already outperforms OTS and DMC, which validates the effectiveness of the proposed positive sample mining approach. We can also see that CoL, which jointly learns grounding and separation, achieves worse performance than the Grounding only model. This demonstrates that the weak interaction inside CoL does not let the grounding task benefit from the separation task. However, by using separation results to help the grounding example sampling, our CCoL is significantly superior to both the Grounding only and CoL models. The results demonstrate that sounding object visual grounding can benefit from separation with our cyclic learning.

| Methods | RPCA [huang2012singing] | SoP [zhao2018sound] | CoSep [gao2019co] | Random Obj | CoL | CCoL | Oracle |
|---|---|---|---|---|---|---|---|
| SDR | -0.48 | 3.42 | 2.04 | 4.20 | 6.50 | 7.27 | 7.71 |
| SIR | 3.13 | 4.98 | 6.21 | 6.90 | 11.81 | 12.77 | 11.42 |

Table 2: Audio-visual sound separation performance. Top-2 results are highlighted.

4.3 Audio-Visual Sound Separation

To demonstrate the effectiveness of our CCoL framework on audio-visual sound separation, we compare it in Tab. 2 to a classical factorization-based method, RPCA [huang2012singing]; two recent state-of-the-art methods, SoP [zhao2018sound] and CoSep [gao2019co]; and two baselines, Random Obj and CoL. SoM [zhao2019sound] and Music Gesture [gan2020music] address music sound separation by incorporating dynamic visual motions and show promising results. However, like SoP and CoSep, they do not consider the silent object problem. Moreover, since no source code is available for the two approaches, we do not include them in the comparison. Note that SoP and CoSep are trained using source code provided by the authors and the same training data (including audio preprocessing) as ours. In addition, we show separation results of an Oracle, which feeds ground truth grounding labels of mixed sounds to train the audio-visual separation network.

We can see that our CoL outperforms the compared SoP, CoSep, and Random Obj, and CCoL is better than CoL. The results demonstrate that sounding object visual grounding in the co-learning can help to mitigate training errors from silent video objects in separation, and separation performance can be further improved with the help of enhanced grounding model by cyclic co-learning. Compared to the Oracle model, it is reasonable to see that CCoL has slightly lower SDR. A surprising observation is that CoL and CCoL achieve better results in terms of SIR. One possible reason is that our separation networks can explore various visual objects as inputs during joint grounding and separation learning, which might make the models more robust on SIR.

Moreover, we report quantitative results for testing videos with both audible and silent objects in Tab. 3. Both SoP and CoSep are blind to whether a visual object makes sound and will generate non-zero audio waveforms for silent objects, and Random Obj is limited in distinguishing silent from sounding objects; thus, they have even lower SDRs and SIRs. However, both CoL and CCoL are capable of recognizing the audibility of objects and employ sounding object-aware separation, which helps the two models achieve significantly better results. The experimental results validate the superiority of the proposed sounding object-aware separation mechanism.

We further show qualitative separation results for audible and silent objects in Fig. 5. We can see that both SoP and CoSep generate nonzero spectrograms for the silent Guitar and our CCoL can separate much better Erhu sound. The results can validate that the CCoL model is more robust to learning noise from silent objects during training and can effectively perform sounding object-aware sound separation.

Moreover, we train our model by letting the videos 'B' and 'D' be randomly chosen from "with audio" and "silent". In this way, each training video contains one or two sounding objects and the corresponding mixed sounds will be from up to four different sources. The SDR/SIR scores from SoP, CoL, and CCoL are 2.73/4.08, 5.79/11.43, and 6.46/11.72, respectively. The results further validate that the proposed cyclic co-learning framework can learn from both solo and duet videos and is superior to the naive co-learning model.

From the sounding object visual grounding and audio-visual sound separation results, we can conclude that our cyclic co-learning framework can make the two tasks benefit from each other and significantly improve both visual grounding and sound separation performance.

| Methods | SoP [zhao2018sound] | CoSep [gao2019co] | Random Obj | CoL | CCoL | GT |
|---|---|---|---|---|---|---|
| SDR | -11.35 | -15.11 | -11.34 | 14.78 | 91.07 | 264.44 |
| SIR | -10.40 | -12.81 | -9.09 | 15.68 | 82.82 | 260.35 |

Table 3: Audio-visual sound separation performance (with silent objects). To help readers better appreciate the results, we include SDR and SIR from ground truth sounds.

Figure 7: Real-world sounding object visual grounding and audio-visual sound separation for a challenging outdoor video with a 4-sound mixture and five instruments. From left to right, the instruments are saxophone, accordion, trombone, clarinet and tuba. Our grounding network can successfully find audible and silent objects, and our separation network can separate the trombone music, meanwhile suppressing unrelated sounds from other playing instruments. However, the separated sound by SoP contains much stronger noisy sounds from the accordion and tuba. Note that the trombone is an unseen category, and we have no 4-sound mixture during training. The example can demonstrate the generalization capacity of our method.

4.4 Real-World Grounding and Separation

Besides synthetic data, our grounding and separation models can also handle real-world videos, in which multiple audible and silent objects might exist in single video frames. The example shown in Fig. 6 consists of three different instruments: guitar, flute, and violin in the video, in which guitar and violin are playing and making sounds. We can see that our grounding model can successfully identify the sounding objects: guitar and violin and silent object: flute. Meanwhile, our sounding object-aware separation model can separate individual sounds for each music instrument from the sound mixture. However, the SoP using the same noisy training data as our method obtains poor separation results because it is limited in associating objects with corresponding sounds.

A more challenging example is illustrated in Fig. 7. There are five different instruments in the video, and four of them are playing together. Although our grounding and separation networks are not trained with 4-sound mixtures, our sounding object visual grounding network accurately finds the audible and silent objects in the video, and our audio-visual separation network separates the trombone sound and suppresses unrelated sounds from the other three playing instruments better than SoP.

From the above results, we learn that our separation and grounding models trained using the synthetic data can be generalized to handle real-world challenging videos. In addition, the results further demonstrate that our models can effectively identify audible and silent objects and automatically discover the association between objects and individual sounds without relying on human annotations.

5 Limitation and Discussion

Our sounding object visual grounding model first processes video frames and obtains object candidates with a Faster R-CNN [ren2015faster] object detector trained on a subset of the Open Images dataset [krasin2017openimages]. As stated in Sec. 3.2, this strategy of using visual object proposals has been widely employed in image captioning [karpathy2015deep, anderson2018bottom] and in recent work on audio-visual learning [gao2019co]. As in these previous works [karpathy2015deep, anderson2018bottom, gao2019co], the performance of the grounding model relies heavily on the quality of the object detector. If a sounding object is not detected by the detector, our model will fail to ground it. Thus, an accurate and effective object detector is very important for the grounding model. At present, one possible way to mitigate this limitation is to generate dense object proposals, which reduces the chance that a sounding object is missed.

6 Conclusion and Future Work

In this paper, we introduce a cyclic co-learning framework that jointly learns sounding object visual grounding and audio-visual sound separation. With the help of sounding object visual grounding, we perform sounding object-aware sound separation to improve audio-visual sound separation. To further facilitate the learning of sounding object visual grounding, we use the separation model to help mine training samples, which places the learning of the two tasks in a cycle and simultaneously enhances both grounding and separation performance. Extensive experiments validate that the two problems are highly coherent and benefit from each other under our cyclic co-learning, and that the proposed model improves over the compared approaches on both sounding object visual grounding and audio-visual sound separation.

There is various audio and visual content with different modality compositions in real-world videos. Besides the silent object issue, there are other challenging cases that affect audio-visual learning, as discussed below.

There might be multiple instances of the same instrument in a video. Our separation model mainly uses encoded visual semantic information to separate sounds; however, the visual dynamics that can provide additional discriminative information to distinguish different instances of the same instrument are not exploited. In the future, we will consider how to incorporate dynamic visual information to further strengthen our model's ability to separate multiple sounds of the same type, as in [zhao2019sound, gan2020music].

Sound sources are not always visible in videos. For example, guitar sound might only be background music. In this case, there are no corresponding visual objects that can be used as conditions to separate the guitar sound for existing audio-visual sound separation methods. To address this problem, we might need to first parse input sound mixtures and video frames to recognize invisible sounds and then use other reliable conditions (e.g., retrieved video frames or other semantic information) to separate the sounds.

7 Acknowledgement

Y. Tian and C. Xu were supported by NSF 1741472, 1813709, and 1909912. D. Hu was supported by Fundamental Research Funds for the Central Universities, the Research Funds of Renmin University of China (NO. 21XNLG17), the Beijing Outstanding Young Scientist Program NO. BJJWZYJH012019100020098 and the 2021 Tencent AI Lab Rhino-Bird Focused Research Program (NO. JR202141). The article solely reflects the opinions and conclusions of its authors but not the funding agents.



We include this appendix to describe more details about the audio-visual sound separation network and our implementation. Moreover, we provide more evaluation results for silent sound separation.

Audio-Visual Sound Separation Network

We adopt the same audio-visual separator as in [zhao2018sound], which consists of three modules: audio network, visual network, and audio-visual sound synthesizing network.

We use a time-frequency representation of sound and project the raw waveform to a spectrogram with the short-time Fourier transform (STFT). The audio network transforms the magnitude of the STFT spectrogram $X$ of the input audio mixture into a $K$-channel feature map $f_a$ with a U-Net [ronneberger2015u] structure, where $K$ is a hyperparameter set in our experiments. We use a ResNet-18 [he2016deep] followed by a linear layer to predict a $K$-dimensional object feature $f_o$ for each object $O$. The audio-visual sound synthesizing network takes $f_a$ and $f_o$ as inputs and outputs a spectrogram mask $M$ with a linearly transformed dot product of the audio and visual features. The separated sound spectrogram

$$\hat{S} = M \odot X$$

can be obtained by masking the sound mixture, where $\odot$ is the element-wise multiplication operator. The waveform of the object can be reconstructed by the inverse short-time Fourier transform.
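As a rough illustration of the masking step described above, the following NumPy sketch computes a mask from a dot product over the channel axis and applies it to a mixture magnitude spectrogram. The function names, the `scale` and `bias` parameters, the array shapes, and the sigmoid squashing are all assumptions made for illustration; the text only states that the mask is a linearly transformed dot product of the audio and visual features.

```python
import numpy as np

def separation_mask(audio_feat, obj_feat, scale=1.0, bias=0.0):
    """Sketch of a mask from a linearly transformed dot product.

    audio_feat: (K, F, T) feature map from the audio network (assumed shape).
    obj_feat:   (K,) feature vector for one object (assumed shape).
    scale, bias: hypothetical parameters of the linear transform.
    """
    # Dot product over the channel axis K at every time-frequency bin -> (F, T).
    dot = np.tensordot(obj_feat, audio_feat, axes=([0], [0]))
    # Linear transform followed by a sigmoid (assumption) to keep values in (0, 1).
    return 1.0 / (1.0 + np.exp(-(scale * dot + bias)))

def separate(mixture_mag, audio_feat, obj_feat, scale=1.0, bias=0.0):
    """Mask the mixture magnitude spectrogram element-wise."""
    mask = separation_mask(audio_feat, obj_feat, scale, bias)
    return mask * mixture_mag

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K, F, T = 4, 8, 10                      # toy sizes, not the paper's settings
    mixture_mag = np.abs(rng.normal(size=(F, T)))
    audio_feat = rng.normal(size=(K, F, T))
    obj_feat = rng.normal(size=(K,))
    separated = separate(mixture_mag, audio_feat, obj_feat)
    print(separated.shape)
```

In a full pipeline, the masked magnitude would be combined with the mixture phase and inverted with the inverse STFT to recover a waveform; that step is omitted here.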

Implementation Details

We train our networks using the PyTorch library with 4 NVIDIA 1080Ti GPUs. The batch size and epoch number are set to 48 and 60, respectively. The learning rate is decayed by a constant multiplicative factor at two scheduled epochs. For the three-step training, we train 60 epochs for each step. We conduct ablation studies with several models: Grounding Only, Random Obj, CoL, and CCoL, which are defined as follows:

Grounding Only: The Grounding Only model is trained only with the grounding losses;

Random Obj: This baseline randomly selects objects from the detected object proposals to train an audio-visual sound separation model;

CoL: The co-learning model (CoL) jointly learns visual grounding and sound separation with a joint objective;

CCoL: The cyclic co-learning model (CCoL) further strengthens the interaction between the two tasks through its cyclic training objective.
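The step-wise learning rate decay mentioned in the implementation details can be sketched as a small helper. This is only an illustration of the general rule (multiply the rate by a factor each time a milestone epoch is reached); the function name and the base rate, milestones, and decay factor used in the example are placeholders, not the values used in our experiments.

```python
def step_decay_lr(base_lr, epoch, milestones, gamma):
    """Return the learning rate at `epoch` under multiplicative step decay.

    base_lr:    initial learning rate (placeholder value in the example below).
    milestones: epochs at which the rate is multiplied by `gamma`.
    gamma:      multiplicative decay factor.
    """
    # Count how many milestones have already been reached.
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed

if __name__ == "__main__":
    # Hypothetical settings chosen only to show the shape of the schedule.
    for epoch in (0, 19, 20, 39, 40, 59):
        print(epoch, step_decay_lr(1e-3, epoch, milestones=(20, 40), gamma=0.1))
```

In PyTorch, `torch.optim.lr_scheduler.MultiStepLR` applies the same rule given `milestones` and `gamma`.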

Additional Evaluation

To further validate silent sound separation performance, we compute the sound energy of the separated sounds of silent objects. We empirically set a threshold of 20 to decide whether the separation is successful. The success rates for silent object classification are 8.6% and 92.3% for SoP and CCoL, respectively. An interesting observation is that the success rate of our CCoL is even higher than the corresponding grounding accuracy, which demonstrates that our model tends to generate weak sounds for silent objects.
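The energy-based check above can be sketched as follows. Two points are assumptions made for illustration: energy is taken as the sum of squared samples of the separated waveform, and a separation counts as successful when that energy falls below the threshold. The function names are hypothetical.

```python
import numpy as np

def sound_energy(waveform):
    """Energy of a separated waveform, assumed here to be the sum of squares."""
    return float(np.sum(np.square(np.asarray(waveform, dtype=float))))

def silent_success_rate(separated_waveforms, threshold=20.0):
    """Fraction of silent-object outputs whose energy is below the threshold.

    A low-energy output is treated as a successful (near-silent) separation;
    this direction of the comparison is an assumption.
    """
    if len(separated_waveforms) == 0:
        raise ValueError("need at least one separated waveform")
    successes = [sound_energy(w) < threshold for w in separated_waveforms]
    return sum(successes) / len(successes)

if __name__ == "__main__":
    quiet = np.zeros(1000)          # energy 0, counted as success
    loud = np.ones(1000)            # energy 1000, counted as failure
    print(silent_success_rate([quiet, loud]))
```

The absolute scale of the energy depends on waveform normalization and clip length, so a threshold such as 20 is only meaningful relative to the preprocessing used.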