Audiovisual concurrency provides potential cues for perceiving and understanding the outside world. Such concurrency stems from the simple phenomenon that sound is produced by the oscillation of objects, and it pervades our daily life: the talking crowd, the barking dog, the roaring machine, etc. These inherent and pervasive correspondences give us a reference for distinguishing and correlating different audiovisual messages, and in turn contribute to learning diversified visual appearances from the sounds they produce, or to perceiving various acoustic signals from their diversified sound-makers. For example, we can visually localize a source by hearing its sound, or separate a sound from a chaotic acoustic scene with visual guidance.
As expected, machine intelligence should also take advantage of the inherent audiovisual concurrency to acquire human-like audiovisual processing ability. In recent years, some pioneering works have addressed this challenge, evolving from cross-modal knowledge transfer [4, 24] to directly determining whether audiovisual messages correspond or not [1, 17, 20]. However, the learning capacity of these models is generally limited by the heterogeneous complexity of audiovisual scenes, i.e., the arbitrary number of sound-sources, as shown in Fig. 1. On the one hand, it is easy to align a sound with its visual source in a simple scene with a single sound, but much harder in a complex scene with multiple sounds, because one-to-one audiovisual alignment annotations are lacking. On the other hand, most existing models use simple and complex audiovisual data indiscriminately, which could confuse them when analyzing and aligning diversified audiovisual content without auxiliary annotations. Hence, we argue that differentiated audiovisual learning w.r.t. heterogeneous scene complexity should be explored to achieve robust audiovisual perception.
Further, effective audiovisual learning can benefit interesting cross-modal perception tasks, i.e., audiovisual sound localization and separation. Recent approaches that focus on these tasks have shown considerable performance [2, 12, 26, 31]. However, these methods usually focus on simple audiovisual scenes and cannot derive concrete representations for a specified sound-maker (i.e., object). Moreover, in the audiovisual sound separation task, these works normally assume that external visual knowledge from supervised training is necessary for effective separation guidance [31, 12, 30], whereas we argue that it is not.
In this paper, we strive to take a step towards the goal of human-like audiovisual learning ability. The primary challenge is how to distinguish and align different sounds and sound-makers with only scene-level consistency, especially when faced with heterogeneous audiovisual complexity. The second challenge is how to derive effective visual representations for various sound-makers without referring to external knowledge, and then use them to serve audiovisual sound localization and separation tasks.
To address these two challenges, we propose to grade the heterogeneous scenes into a set of audiovisual curricula of different difficulty levels and perform differentiated audiovisual learning, from the easy ones to the hard ones. The core insight is that we can easily analyze and align the audiovisual content in simple scenes with a single sound, which in turn provides prior alignment knowledge for learning in complex scenes. As the audiovisual alignment is inferred by grouping and comparing the distinctive channel responses of both modalities, we can further use the aligned visual representations of the sound-maker for audiovisual sound localization and separation. Concretely, our contributions are threefold:
We develop a flexible audiovisual learning model that derives effective unimodal representations and infers the latent alignment between sound and sound-maker for both simple and complex scenes. This model employs a soft-clustering module as the pattern detector and correlates the clustered patterns via a structured alignment objective in a space shared by both modalities.
We propose a novel Curriculum AudioVisual Learning (CAVL) strategy, where the difficulty level is determined by the number of sound-sources in the scene. Extensive experimental results show that this simple learning strategy not only makes our model much easier to train, but also improves the learning and alignment of audiovisual content. Besides, we also develop a counting model for estimating the audiovisual scene complexity.
We further deploy the learned audiovisual model in cross-modal perception tasks. In audiovisual sound localization, our model shows considerable improvements over existing models. Moreover, it can provide effective visual representation for sound separation, based on which our approach has comparable performance to the methods that utilize external visual knowledge from supervised pre-training.
2 Related Work
Audiovisual self-supervised learning
As audiovisual concurrency is inherently free of human annotation and provides a correspondence signal between the two modalities, it has drawn great attention in the self-supervised learning community, where self-supervised learning means training a model from the input data itself, without human annotation. In the early stage, a research group at MIT first regarded audiovisual concurrency as a bridge for cross-modal knowledge transfer, where the student network of one modality is supervised with the predictions of the teacher network of the other modality [4, 24]. More recently, a learning criterion was proposed to model audiovisual concurrency directly without teacher supervision [1, 20], i.e., the audiovisual network learns to predict whether a sound and an image come from the same video. It is surprising that, with such simple supervision, both modality networks learn to respond to specific visual appearances and acoustic messages [1, 20]. On the other hand, as scene-level consistency lacks specific annotations between audiovisual components, it usually works well in simple scenes with a single sound but suffers from inefficiency and inaccuracy in complex scenes. Compared with these approaches, our method can learn to align multiple audiovisual components, even when faced with complex audiovisual scenes.
Sound localization in visual modality
Visually localizing the sound-maker is a typical audiovisual perception task. Existing approaches address this challenge mainly by correlating pixels and sound based on their concurrency, where canonical correlation analysis [18], embedded scalar products [2], attention mechanisms [26], and class activation maps [23] have been proposed for effectively identifying the pixels of sound. Although these methods have shown promising visualization results, we consider that localization should deliver more than that: it can also yield effective visual representations of the sound-source for further perception of vision and sound, which is desirable but has been ignored previously.
Audiovisual sound separation
The visual information of the sound-maker is considered to provide an effective reference for separating the corresponding sound from a complex scene [3]. Motivated by this, a number of approaches have been developed for robust audiovisual sound separation across different types of sound, ranging from speech separation [10, 23] and music separation [12, 30, 29] to object sound separation [11, 17]. To achieve considerable performance in realistic environments, these methods resort to reliable visual representations of the sound-maker, which are obtained from an ImageNet pre-trained visual network [11, 31, 29] or an off-the-shelf object detector [10, 12], and then correlate them with the sound embeddings in a common space. Compared to these methods, our approach does not need to refer to external visual knowledge trained with human annotations.
Curriculum learning
Compared with a disordered arrangement, starting from easier samples or tasks and then gradually increasing the difficulty level can contribute to better learning performance [9, 6]. This is called curriculum learning and has been widely applied in image classification [14, 8] and speech recognition [7]. Different from conventional supervised learning, self-supervised learning lacks effective human annotation and is therefore more vulnerable to training order [22]. Recently, Korbar et al. [20] employed curriculum learning for choosing the negative samples in audiovisual temporal synchronization. However, few works focus on how to improve audiovisual learning performance with a curriculum learning strategy when faced with heterogeneous scene complexity.
3.1 Audiovisual Learning Model
Given synchronized audio and visual messages (we use a sound spectrogram and an image to represent the audio and visual message, respectively), which are separated from unlabelled videos, we aim to train an audiovisual network effectively from a cold start so that it can generate robust unimodal representations and perform effective cross-modal perception. The whole framework is shown in Fig. 2.
3.1.1 Learning representation via clustering
As the filters of convolutional networks exhibit class-relevant activation [32, 17], it becomes feasible to discover and disentangle the audio and visual components by analyzing and integrating their channel representations. Concretely, we first employ convolutional networks to model each modality and embed the data into feature maps of size $h \times w \times d$, where $h$ and $w$ are the frame size and $d$ is the number of channels. These feature maps are then reshaped into a set of $n = hw$ feature vectors $\{u_1, \dots, u_n\}$. Due to the distinct activations of different modal components, reshaped feature vectors in the $d$-dimensional channel space should follow similar distributions when they describe the same modal component and dissimilar ones otherwise. Hence, we propose to integrate these feature vectors by performing soft K-means clustering in the channel space for each modality.
Formally, the objective function of soft K-means clustering [5] can be formulated as

$$\mathcal{F}(C) = \sum_{i=1}^{n} \sum_{j=1}^{k} w_{ij}\, d_{ij}, \qquad (1)$$

where $c_j$ is the $j$-th cluster center, $k$ is the number of sound-sources in the audiovisual scene, and $d_{ij} = \| u_i - c_j \|_2^2$ is the squared Euclidean distance between the current feature point $u_i$ and the center $c_j$, measuring their similarity. $w_{ij}$ is the indicator variable, which gives the degree of assignment and is obtained by applying the softmax function over the distances $d_{ij}$, i.e.,

$$w_{ij} = \frac{\exp(-\beta\, d_{ij})}{\sum_{l=1}^{k} \exp(-\beta\, d_{il})}, \qquad (2)$$

where the hyper-parameter $\beta$ is called the stiffness parameter and controls the scalability of the assignments.
Eq. 1 is a minimization problem over two parts, i.e., the assignments and the centers, and the Expectation-Maximization algorithm [21] can be employed to solve it effectively. In the E-step, we fix the cluster centers and update the assignments via Eq. 2. In the M-step, we fix the assignments and re-compute the centers with the updated assignments from the E-step, i.e.,

$$c_j = \frac{\sum_{i=1}^{n} w_{ij}\, u_i}{\sum_{i=1}^{n} w_{ij}}. \qquad (3)$$

By alternately executing the E- and M-steps, we aim to find $k$ centers, each of which should correspond to a certain modal component, such as a specific object or sound. Meanwhile, the corresponding cluster assignment can be interpreted as a spatial mask over the feature map that indicates the location of the sound-source in both modalities, as shown in Fig. 2.
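The alternating E- and M-steps described above can be sketched in a few lines. This is a minimal illustration in plain Python rather than the network module used in the paper: the function name `soft_kmeans`, the fixed iteration count, the toy 2-D points and the use of squared Euclidean distance are all assumptions made for the example.

```python
import math

def soft_kmeans(points, centers, beta=1.0, iters=20):
    """Alternate E- and M-steps of soft K-means; returns (centers, weights)."""
    weights = []
    for _ in range(iters):
        # E-step: softmax over negative (squared) distances to each center.
        weights = []
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            m = min(d)  # shift distances for numerical stability
            e = [math.exp(-beta * (dj - m)) for dj in d]
            z = sum(e)
            weights.append([ej / z for ej in e])
        # M-step: each center becomes the assignment-weighted mean of all points.
        new_centers = []
        for j in range(len(centers)):
            wj = max(sum(w[j] for w in weights), 1e-12)  # guard against an empty cluster
            new_centers.append([
                sum(w[j] * p[t] for w, p in zip(weights, points)) / wj
                for t in range(len(points[0]))
            ])
        centers = new_centers
    return centers, weights

# Toy data (hypothetical): two well-separated groups of 2-D feature vectors.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centers, weights = soft_kmeans(points, [[1.0, 1.0], [9.0, 9.0]], beta=1.0)
```

Each row of `weights` sums to one and plays the role of the soft assignment of one spatial position; reading one column back over the spatial grid gives the mask mentioned in the text. A larger `beta` makes the assignments harder.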
3.1.2 Audiovisual alignment objective
For a given audiovisual scene, although the contained sounds and objects have been described as different clustering centers, it is still difficult to align them directly with supervision only at the level of the entire scene. Fortunately, the pervasive concurrency of sound and sound-maker can help to infer the latent alignment by comparing the matching degree of different sound-object pairs, where a valid pair should possess a higher matching degree. Concretely, for each audio clustering center $c^a_i$, we aim to find the proper visual clustering center $c^v_j$ (i.e., sound-maker) based on their concurrency, which can be formulated as

$$s_i = \min_{1 \le j \le k_v} \| c^a_i - c^v_j \|_2, \qquad i = 1, \dots, k_a, \qquad (4)$$

where $k_a$ and $k_v$ are the numbers of audio centers and visual centers, respectively (as the visual background is usually irrelevant to sound, we set $k_v = k_a + 1$ and use the additional visual center to represent it). By minimizing Eq. 4, each audio center is aligned to the nearest visual center; meanwhile, we can also derive from it the similarity score of sounds and objects in the entire scene, denoted $s(a, v)$.
For an arbitrary scene, the self-supervision signal only confirms whether the audio and visual information come from the same scene (video) or not. To leverage such supervision effectively, we employ the contrastive loss to train the audiovisual network and infer the latent alignment simultaneously; this loss has shown consistent and robust behavior in two-stream network optimization [19, 20]. Concretely, the contrastive loss is written as

$$\mathcal{L}_c = \sum_{i,j} \Big[ y_{ij}\, s(a_i, v_j)^2 + (1 - y_{ij}) \max\big(\delta - s(a_i, v_j),\, 0\big)^2 \Big], \qquad (5)$$

where $a_i$ and $v_j$ stand for the sound and image from scene $i$ and $j$, respectively. $y_{ij}$ is an indicator of each sound-image pair, i.e., $y_{ij} = 1$ if $i = j$, otherwise $y_{ij} = 0$. In practice, the negative samples are randomly sampled from the training set. Generally, Eq. 5 encourages the audiovisual network to have higher matching confidence for the aligned sound-image pair than for the mismatched ones by introducing the margin hyper-parameter $\delta$.
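To make the alignment step concrete, the sketch below matches each audio center to its nearest visual center and then applies a margin-based contrastive loss to a scene-level score. It is an illustration under stated assumptions, not the paper's implementation: the names `align_centers` and `contrastive_loss`, the use of plain Euclidean distance, the summation of per-center distances into one scene score, and the margin value are all hypothetical choices.

```python
import math

def align_centers(audio_centers, visual_centers):
    """Match every audio center to its nearest visual center.

    Returns (scene_score, matches): the sum of nearest-center distances
    (an assumed way of aggregating to scene level) and the matched indices.
    """
    score, matches = 0.0, []
    for ca in audio_centers:
        d = [math.sqrt(sum((x - y) ** 2 for x, y in zip(ca, cv)))
             for cv in visual_centers]
        j = min(range(len(d)), key=d.__getitem__)
        matches.append(j)
        score += d[j]
    return score, matches

def contrastive_loss(score, same_scene, margin=1.0):
    """Standard margin-based contrastive loss on a distance-like score."""
    if same_scene:
        return score ** 2                      # pull matched pairs together
    return max(margin - score, 0.0) ** 2       # push mismatched pairs past the margin

# Toy example: one audio center, two visual centers (object + background).
score, matches = align_centers([[0.0, 0.0]], [[3.0, 4.0], [0.0, 1.0]])
loss_pos = contrastive_loss(score, same_scene=True)
loss_neg = contrastive_loss(0.25, same_scene=False, margin=1.0)
```

In this form, a matched pair is penalized for any remaining distance, while a mismatched pair contributes nothing once its score exceeds the margin.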
3.2 Curriculum Learning
3.2.1 Curriculum Procedure
Audiovisual scenes in the wild usually contain different numbers of sound-sources, and we find that directly performing audiovisual learning with such data makes the model very difficult to optimize and also lowers the alignment performance. To settle this problem, we propose to train the audiovisual model step by step, starting from simple scenes and then gradually increasing the difficulty level, where the number of sound-sources serves as the measure of audiovisual scene complexity. Intuitively, for a simple scene with a single sound-source, it is easy to visually separate the sound-maker from the background and then align it to the unique sound, such as the accordion example in Fig. 1. By contrast, we find that if training starts with complex audiovisual scenes (e.g., with three sound-sources), the model converges much more slowly and reaches worse results. A model trained with simple scenes, however, contributes prior knowledge for distinguishing different sound-makers and sounds, and also provides a reference for alignment. Therefore, the audiovisual learning model can be further optimized with complex scenes, which leads to better results.
In practice, to perform curriculum learning effectively, all the audiovisual data are sorted from simple to complex before training, according to the number of sound-sources in the scene. For each learning stage, the cluster numbers are set according to the number of sources, e.g., one audio center and two visual centers for a scene with a single source. Based on these graded audiovisual data, we can train the audiovisual learning model in a curriculum fashion.
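The grading step can be sketched as a simple grouping of clips by their source count. The helper name `build_curricula`, the `(clip_id, n_sources)` input format and the clip identifiers are hypothetical; the choice of one extra visual cluster follows the background-center convention described in Sec. 3.1.2.

```python
from collections import defaultdict

def build_curricula(clips):
    """Group clips by number of sound-sources, ordered from simple to complex.

    `clips` is an iterable of (clip_id, n_sources) pairs. Each returned stage
    is (n_audio_clusters, n_visual_clusters, clip_ids), with one extra visual
    cluster reserved for the background.
    """
    buckets = defaultdict(list)
    for clip_id, n_sources in clips:
        buckets[n_sources].append(clip_id)
    return [(n, n + 1, buckets[n]) for n in sorted(buckets)]

# Hypothetical clip list; training would visit the stages in this order.
stages = build_curricula([("a", 2), ("b", 1), ("c", 1), ("d", 3)])
```

A training loop would then iterate over `stages`, initializing each stage from the weights learned in the previous one.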
3.2.2 Complexity Estimation
Since the audiovisual scene complexity is crucial for curriculum training, it is worthwhile to learn to model and estimate the number of sound-sources in a given scene. Formally, the discrete Poisson probability distribution for count data is given by

$$P(Y = y \mid \lambda) = \frac{e^{-\lambda}\, \lambda^{y}}{y!},$$

where $\lambda$ is interpreted as the expected number of events in the interval and $y!$ is the factorial of $y$. In this task, $y$ is the number of sound-sources in the audiovisual scene. Consequently, we propose to model $\lambda$ as a function of the input sound $a$ via the audio network, written as $\lambda = f(a)$, where the function $f(\cdot)$ denotes the counting network of sound-sources. By taking the negative log-likelihood w.r.t. the Poisson distribution, we obtain the Poisson regression loss

$$\mathcal{L}_p = \lambda - y \log \lambda + \log(y!),$$

where the term $\log(y!)$ can be ignored, as it is constant w.r.t. model training. After training the counting network, we can estimate the scene complexity by identifying the count that holds the maximal probability.
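As a small worked sketch of the Poisson regression loss and of reading off the most probable count, the snippet below uses only the standard library. The function names and the `max_count` search bound are assumptions for illustration; in the paper the rate would come from the audio counting network rather than being passed in by hand.

```python
import math

def poisson_nll(lam, y, include_constant=False):
    """Negative log-likelihood of count y under a Poisson with rate lam."""
    loss = lam - y * math.log(lam)
    if include_constant:
        loss += math.lgamma(y + 1)  # log(y!), constant w.r.t. the model
    return loss

def poisson_pmf(y, lam):
    """Poisson probability of observing count y, computed in log space."""
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))

def most_probable_count(lam, max_count=10):
    """Count in [0, max_count] with the highest Poisson probability."""
    return max(range(max_count + 1), key=lambda y: poisson_pmf(y, lam))

# Hypothetical predicted rate for one clip.
estimated_sources = most_probable_count(2.6)
```

The loss is minimized when the predicted rate equals the observed count, which is why dropping the `log(y!)` term does not change the optimum.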
3.3 Audiovisual Perception
3.3.1 Localizing sounds in visual modality
Considering that the audiovisual learning model has learned to align objects and sounds in the training phase, we can directly identify the potential object that produces a given sound by comparing their similarity, i.e.,

$$j^{*} = \arg\min_{1 \le j \le k_v} \| c^a - c^v_j \|_2 .$$

For the clustering center $c^a$ of a sound-source, we compare it with all $k_v$ visual centers $c^v_j$ and select the closest one, $c^v_{j^{*}}$, as the visual representation of the sound-source. As the corresponding assignment indicates the correlation between all the visual feature vectors and $c^v_{j^{*}}$, we can reshape it back to the spatial size of the visual feature map and regard it as the location mask of the sound-maker to achieve visual localization. To better visualize the object position, we can further resize the assignment to the size of the input image.
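The localization step amounts to a nearest-center lookup followed by a reshape. The sketch below assumes the soft assignments are stored as one row per spatial position in row-major order; the function name `localize` and the toy values are hypothetical.

```python
def localize(audio_center, visual_centers, visual_weights, height, width):
    """Select the visual center nearest to the audio center and return
    (index, mask), where mask is that center's soft assignment reshaped
    to a height x width grid (row-major order assumed)."""
    d = [sum((a - v) ** 2 for a, v in zip(audio_center, c)) for c in visual_centers]
    j = min(range(len(d)), key=d.__getitem__)
    flat = [w[j] for w in visual_weights]  # one weight per spatial position
    mask = [flat[r * width:(r + 1) * width] for r in range(height)]
    return j, mask

# Toy 2x2 feature map with two visual centers.
visual_weights = [[0.9, 0.1], [0.2, 0.8], [0.3, 0.7], [0.6, 0.4]]
idx, mask = localize([1.0, 0.0], [[0.0, 5.0], [1.0, 0.1]], visual_weights, 2, 2)
```

Upsampling `mask` to the input resolution (for instance with bilinear interpolation) would give the visualization described in the text.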
3.3.2 Audiovisual sound separation
To further validate the effectiveness of the inferred object representation, we propose to perform sound separation based on visual guidance. The representative audiovisual separation network in [10, 12] is adopted, as shown in Fig. 6. Concretely, the separation network takes the visual clustering center as the sound-maker representation in the scene, and aims to separate its produced sound from the mixed audio signal. Alternatively, we can also use the assignment to point out the location of the sound-maker and regard it as the object mask over the visual feature maps. Then, a sound-maker-aware max-pooling can be performed over the masked feature maps to obtain a robust object representation.
A variant of U-Net [25] is used to perform sound-source separation, similar to [12, 31]. The network takes the spectrogram of the mixed sound as input and encodes it into audio feature maps via stacked convolution layers. The visual representation is replicated and tiled to match the size of the embedded audio feature maps. We then concatenate the two modalities and feed them into stacked up-convolution layers to generate a spectrogram mask. The separation loss is written as
where means the audiovisual sound separation network and is the mask of the spectrogram magnitudes of the target sound and mixed sound , i.e.,
With the masked spectrogram, we can use the Inverse Short-Time Fourier Transform (ISTFT) to produce the separated sound signal w.r.t. a specific sound-maker.
AudioSet-Balanced
AudioSet [13] is an audio event dataset consisting of 2,084,320 human-annotated 10-second video clips. These clips are collected from YouTube, so many of them are of poor quality and contain multiple sound-sources. A hierarchical ontology of 632 event classes is employed to annotate these data, which means that the same sound can be annotated with different labels. For example, the sound of barking is annotated as Animal, Pets, and Dog. Such hierarchical annotation makes it extremely difficult to precisely estimate the number of sound-sources in a clip. Hence, we propose to filter the original annotations by keeping only the third-level labels, e.g., Dog (the detailed filtering process can be found in the supplemental materials). In the original setting, all the videos are split into Evaluation/Balanced-Train/Unbalanced-Train sets. For efficiency, we only use the Balanced-Train set for training. These data are divided into different curricula according to the number of contained sound-sources, e.g., the first curriculum consists of videos with a single sound-source. Finally, all the 19,443 valid video clips are divided into four curricula of 9,239/7,098/2,685/421 clips, respectively. For each curriculum set, the input audio is a 10 s mono sound and the image is randomly selected from the video. Note that none of the semantic labels are used during training.
MUSIC
The MIT MUSIC dataset contains 685 videos, with 536 musical solo and 149 duet videos. These videos cover 11 instrument categories. They are also collected from YouTube, but are cleaner than the ones in AudioSet and hence better suited to the sound separation task. As the duet videos do not have ground truths for the sounds in their mixtures, we only use the solo videos for training and testing. Following [12], the first and second video of each instrument category are selected for validation and testing, respectively, and the remaining solo videos are used for training. Note that, as some videos have been removed by YouTube users, the final training set contains about 467 videos. All the videos are randomly split into 10 s clips.
4.2 Network and Implementation Details
Our audiovisual learning network is a two-stream network, where the off-the-shelf VGGish network [15] is employed for sound and VGG16 [27] for vision. For each modality, the feature maps are the outputs of the final convolution layer of the network. Detailed architecture descriptions of both the audiovisual learning model and the separation model can be found in the supplemental materials.
For all the experiments, the input audio of 10s long is represented in log magnitude spectrogram of , which is achieved by STFT (with the window size of 1022 and hop length of 256) and log-frequency projection. For the visual modality, we directly reshape the input image into . The network is trained with Adam optimizer, where the starting learning rate is for the first curriculum, then gradually times for the next one. For example, the learning rate for the third curriculum is .
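As a quick check on what the stated STFT settings imply, the arithmetic below computes the number of frequency bins and frames for an unpadded STFT. The sample rate is a hypothetical value (the section does not state it), and the helper assumes the FFT size equals the window size; padding conventions in a real STFT implementation can change the frame count.

```python
def stft_shape(n_samples, win_size=1022, hop=256):
    """Frequency bins and frame count of an unpadded, one-sided STFT."""
    n_bins = win_size // 2 + 1                   # one-sided spectrum
    n_frames = 1 + (n_samples - win_size) // hop
    return n_bins, n_frames

sample_rate = 16000  # hypothetical; not given in this section
bins, frames = stft_shape(10 * sample_rate)
```

A window of 1022 samples yields 512 frequency bins, a convenient power of two before the log-frequency projection.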
4.3 Curriculum Learning Evaluation
4.3.1 Learning comparison
In this section, we aim to have an insight into how the curriculum strategy influences the audiovisual learning performance. Concretely, to evaluate the effects of different audiovisual complexities, the original set and the curriculum set of and are selected for training the audiovisual network, respectively. As shown in Fig. 4, it is obvious that the network trained with the simple curriculum of enjoys the fastest convergence and lowest training loss, while the one trained with suffers from the worst performance. Such phenomena indicate that the model performance is significantly affected by the audiovisual complexity, and the simple scene can provide better learning performance.
Further, we want to know what the model can benefit from pre-curriculum, i.e., the effects of curriculum initialization. In Fig. 4, we show the training accuracy of audiovisual model on the set, which are initialized from random and the model trained with the set, respectively. Surprisingly, the model initialized from enjoys the great advantages compared with the random one. Curriculum learning indeed helps to accelerate and improve the audiovisual learning performance.
4.3.2 Acoustic Scene Classification
The unimodal representation learned by the audiovisual model is also influenced by the curriculum strategy. To assess this influence, we perform acoustic scene classification by using the trained audiovisual model as a feature extractor. The ESC-50 dataset is chosen for evaluation, and we follow the pre-processing and train/test split of prior work. Table 1 shows the comparison results, where the sound-source-level alignment is Eq. 5 and the scene-level alignment directly compares the audio and visual representations without clustering. These results can be summarized in three points. First, learning with the simple curriculum provides a better initialization. Second, sound-source matching exploits the audiovisual concurrency better than scene-level matching, especially in complex scenes. Third, direct video-level matching in complex scenes may degrade the pre-trained network. This is probably because the chaotic audiovisual correlation confuses the scene matching objective, an effect that has been ignored before.
|+ Sound-source-level alignment||56.75|
|+ Scene-level alignment||47.25|
4.3.3 Poisson Regression
In this section, we evaluate the performance of audiovisual complexity estimation. Table 2 shows the Poisson regression results when training on AudioSet-Balanced-Train and testing on AudioSet-Evaluation, where the number of sound-sources ranges from 1 to 5. Compared to the chance results, the audio Poisson regression network is clearly superior in both accuracy and Mean Absolute Error (MAE). Moreover, the results can be further improved by adopting the network pre-trained with the audiovisual learning objective, i.e., Eq. 5. Intuitively, if the network has been trained with a higher-level curriculum, it will better estimate the number of sound-sources in the audio modality. Such evidence shows that complex audiovisual data can be effectively modeled by self-supervised learning, especially when adopting the curriculum learning strategy.
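For reference, the two counting metrics can be computed as follows. This is a generic sketch with a hypothetical function name and made-up predictions, not the evaluation script behind Table 2.

```python
def count_metrics(pred, true):
    """Accuracy and mean absolute error of predicted source counts."""
    n = len(true)
    acc = sum(p == t for p, t in zip(pred, true)) / n
    mae = sum(abs(p - t) for p, t in zip(pred, true)) / n
    return acc, mae

# Made-up predictions and ground-truth counts for four clips.
acc, mae = count_metrics([1, 2, 2, 5], [1, 3, 2, 3])
```

Accuracy only rewards exact matches, whereas MAE also reflects how far off a wrong count is, so the two can rank models differently.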
4.4 Audiovisual Sound Localization
In this task, we aim to visualize the object location where the sound is produced. The AudioSet-Balanced-Train dataset is adopted for training, and the human-annotated SoundNet-Flickr [4, 26] dataset is used for testing. As the training and testing datasets come from different sources, it is more challenging to perform exact localization.
Fig. 5 shows some qualitative examples w.r.t. sound-source locations. Compared with the human-annotations, our model can predict proper visual localization for the corresponding sound. And we can find that the annotated bounding boxes are sometimes too rough to provide exact source locations, while our model can address this challenge due to the clustering advantage of spatial assignments.
We further perform a quantitative evaluation. Following [26], the same 250 audiovisual pairs are selected from the annotated SoundNet-Flickr dataset for evaluation, and consensus Intersection over Union (cIoU) and AUC are used as evaluation metrics. To evaluate the effectiveness of curriculum learning, our models trained at different curriculum levels are also considered, as shown in Fig. 3. First, our models outperform all the other methods by a large margin, which demonstrates that our model can better capture and align different sound-sources, even when faced with multi-source scenes. Second, besides the aligned visual center, we also evaluate the unaligned visual centers. As expected, they suffer a large decline in both metrics, which indicates that our model can distinguish the sound-maker from the background and align it with the produced sound. Third, our model further trained on the multi-source curriculum is worse than the one trained only on the single-source curriculum. This is because the test videos are all single-source, so the multi-source videos may blur the alignment knowledge learned from single-source scenes.
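The sketch below shows one common formulation of the two localization metrics, written from our understanding of how cIoU is usually defined on a consensus ground-truth map with values in [0, 1]; the exact definition should be checked against the cited benchmark. The function names, the 0.5 binarization threshold and the 20-step integration grid are assumptions.

```python
def ciou(pred, gt, tau=0.5):
    """Consensus IoU between a flattened prediction map and a consensus map.

    Pixels with pred > tau count as predicted. The numerator sums consensus
    weights inside the prediction; the denominator adds one for every
    predicted pixel that no annotator marked.
    """
    num = sum(g for p, g in zip(pred, gt) if p > tau)
    den = sum(gt) + sum(1 for p, g in zip(pred, gt) if p > tau and g == 0)
    return num / den if den else 0.0

def success_auc(scores, steps=20):
    """Area under the success-rate curve as the cIoU threshold goes 0 -> 1."""
    ts = [i / steps for i in range(steps + 1)]
    rates = [sum(s >= t for s in scores) / len(scores) for t in ts]
    # Trapezoidal rule over equally spaced thresholds.
    return sum((rates[i] + rates[i + 1]) / 2 / steps for i in range(steps))

# Toy 4-pixel example.
example_ciou = ciou([0.9, 0.8, 0.1, 0.7], [1.0, 0.5, 0.0, 0.0])
```

Reporting the success rate at a single cIoU threshold alongside the AUC gives both a point estimate and a threshold-independent summary.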
4.5 Sound Separation
We evaluate the audiovisual sound separation performance on the MIT-MUSIC dataset and more results on AudioSet are in the supplemental materials. In this task, effective separation depends on the quality of visual representation of sound-maker. To address this challenge, most of existing methods resort to the ImageNet pre-trained or fine-tuned sound-maker detector [11, 12, 30, 31]. In contrast, we use the sound localization technique to automatically extract the visual representation of sound-maker, which is implemented with audiovisual alignment objective (i.e., Eq. 5) without any human-supervision. Fig. 7 shows some examples of solo and duet scene. It is obvious that our model can localize most instruments, especially the different instruments in the duet scene. Then, we can use the corresponding visual centers or masked visual features as the representation of sound-maker for sound separation, as introduced in Sec.3.3.2. Note that, as the extraction of visual representation does not use any extra human-annotation, it is more general and flexible than existing methods.
To evaluate the separation results accurately, we use synthetic mixture audio for evaluation, similar to [30, 12]. The standard metrics of Signal-to-Distortion Ratio (SDR), Signal-to-Interference Ratio (SIR), and Signal-to-Artifact Ratio (SAR) are used. Our model is compared with the audio-only separation method NMF-MFCC and the audiovisual separation models AV-Mix-Sep, Sound-of-Pixels, and Co-Separation. Table 4 shows the separation results, where the sound localization model is trained with solo videos. Although the previous methods use additional visual knowledge to guide sound separation, e.g., the ImageNet-pretrained visual model in AV-Mix-Sep and the fine-tuned instrument detector in Co-Separation, our model still shows comparable results in both SDR and SIR (SAR measures the artifacts in the separated results rather than separation accuracy [12]). Note that our results are achieved with fewer training samples than the others, which demonstrates that our sound localization technique can provide an effective visual representation of a specific sound-maker. Moreover, our separation model based on masked visual features performs much better than the one based on the visual center. This is probably because the masked visual features provide a more detailed representation of the sound-maker than the aggregated one. To validate the effectiveness of curriculum learning, the sound localization model is further trained with the duet videos. Table 5 shows the ablation results. The performance gain in SDR and SAR indicates that our audiovisual alignment model can utilize complex scenes to further improve its cross-modal perception ability.
In this paper, we developed an audiovisual learning model that discovers, then aligns the sounds and sound-makers in arbitrary audiovisual scenes. A curriculum learning strategy is proposed to effectively train the model w.r.t. the number of sound-source. Further, we deployed the well-trained audiovisual model into practical perception tasks. We achieved noticeable audiovisual localization performance, and the localized object representation made a considerable boost to sound separation.
Look, listen and learn.
2017 IEEE International Conference on Computer Vision (ICCV), pp. 609–617. Cited by: §1, §2, §4.3.2.
-  (2017) Objects that sound. arXiv preprint arXiv:1712.06651. Cited by: §1, §2.
-  (1992) A review of the cocktail party effect.. Cited by: §2.
-  (2016) Soundnet: learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems, pp. 892–900. Cited by: §1, §2, §4.4.
Lecture notes on data science: soft k-means clustering. Technical report Technical Report, Univ. Bonn, DOI: 10.13140/RG. 2.1. 3582.6643. Cited by: §3.1.1, §3.1.1.
Proceedings of the 26th annual international conference on machine learning, pp. 41–48. Cited by: §2.
A curriculum learning method for improved noise robustness in automatic speech recognition. In Signal Processing Conference, Cited by: §2.
-  (2011) Natural language processing (almost) from scratch. Journal of Machine Learning Research 12 (1), pp. 2493–2537. Cited by: §2.
Learning and development in neural networks: the importance of starting small. Cognition 48 (1), pp. 71–99. Cited by: §2.
-  (2018) Looking to listen at the cocktail party: a speaker-independent audio-visual model for speech separation. arXiv preprint arXiv:1804.03619. Cited by: §2, §3.3.2.
-  (2018) Learning to separate object sounds by watching unlabeled video. arXiv preprint arXiv:1804.01665. Cited by: §2, §4.5, §6.5.1, §6.5.1, Table 6.
-  (2019) Co-separating sounds of visual objects. arXiv preprint arXiv:1904.07750. Cited by: §1, §2, §3.3.2, §3.3.2, §4.1, §4.5, §4.5, Table 4, §6.5.1, Table 6, footnote 4.
-  (2017) Audio set: an ontology and human-labeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 776–780. Cited by: §4.1, §6.5.1.
-  (2016) Multi-modal curriculum learning for semi-supervised image classification. IEEE Transactions on Image Processing 25 (7), pp. 3249–3260. Cited by: §2.
-  (2017) CNN architectures for large-scale audio classification. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pp. 131–135. Cited by: §4.2.
-  (2005) Multisensory integration: space, time and superadditivity. Current Biology 15 (18), pp. R762–R764. Cited by: §1.
Deep multimodal clustering for unsupervised audiovisual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9248–9257. Cited by: §1, §1, §2, §2, §3.1.1, §4.3.2, Table 3.
-  (2005) Pixels that sound. In IEEE Computer Society Conference on Computer Vision & Pattern Recognition, Cited by: §2.
Siamese neural networks for one-shot image recognition. In ICML deep learning workshop, Vol. 2. Cited by: §3.1.2.
-  (2018) Co-training of audio and video representations from self-supervised temporal synchronization. arXiv preprint arXiv:1807.00230. Cited by: §1, §2, §2, §3.1.2.
-  (1996) The expectation-maximization algorithm. IEEE Signal processing magazine 13 (6), pp. 47–60. Cited by: §3.1.1.
-  (2018) CASSL: curriculum accelerated self-supervised learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6453–6460. Cited by: §2.
-  (2018) Audio-visual scene analysis with self-supervised multisensory features. arXiv preprint arXiv:1804.03641. Cited by: §1, §2, §2, §2.
-  (2016) Ambient sound provides supervision for visual learning. In European Conference on Computer Vision, pp. 801–816. Cited by: §1, §2.
-  (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §3.3.2, §6.4.
-  (2018) Learning to localize sound source in visual scenes. arXiv preprint arXiv:1803.03849. Cited by: §1, §2, Figure 5, §4.4, §4.4, Table 3.
-  (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §4.2.
-  (2009) Source-filter based clustering for monaural blind source separation. In Proceedings of the 12th International Conference on Digital Audio Effects, Cited by: §4.5, Table 4, Table 6.
-  (2019) Recursive visual sound separation using minus-plus net. arXiv preprint arXiv:1908.11602. Cited by: §2.
-  (2019) The sound of motions. arXiv preprint arXiv:1904.05979. Cited by: §1, §2, §4.5, §4.5.
-  (2018) The sound of pixels. arXiv preprint arXiv:1804.03160. Cited by: §1, §2, §3.3.2, Figure 6, §4.5, §4.5, Table 4, §6.5.1, Table 6.
Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2921–2929. Cited by: §3.1.1.
6 Supplemental Materials
6.1 Curriculum Settings
The audio events in AudioSet are annotated via a hierarchical ontology. For example, the sound of barking (the fourth-level label) is simultaneously annotated as Animal (the first-level label), Pets (the second-level label), and Dog (the third-level label), i.e., Animal → Pets → Dog → Bark. Such hierarchical annotation makes it extremely difficult to precisely estimate the number of sound-sources in a clip, since one source can contribute several labels. Considering that the third-level labels generally relate to the common objects in our surroundings, we propose to filter the original annotation by keeping only the third-level labels. This filtering process consists of two steps, i.e., removing the parent annotations and reducing the child annotations. Concretely, for each third-level label of a video clip, we remove all its parent annotations (the first- and second-level labels) if they appear in the label list of the same clip. Meanwhile, each fourth-level label is replaced by its parent third-level label. For example, the hierarchical ontology tells us that the third-level label for Bark is Dog, so we use Dog to replace Bark. After these two steps, the annotation list of each video clip should contain only third-level labels. Finally, we use the number of remaining labels as the indicator of the number of sound-sources.
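The filtering steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the `PARENT` map is a hypothetical toy fragment of the ontology, the function names are invented for this example, and the sketch applies the child-reduction step first so that the parents of reduced labels are also removed. Upper-level labels with no third-level descendant in the clip are left untouched, since the text does not say how they are handled.

```python
from typing import Dict, List, Optional

# Hypothetical toy ontology (assumption): child label -> parent label,
# with None marking a first-level label.
PARENT: Dict[str, Optional[str]] = {
    "Animal": None,
    "Pets": "Animal",
    "Dog": "Pets",
    "Bark": "Dog",
    "Cat": "Pets",
    "Meow": "Cat",
}


def ancestors(label: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return all ancestors of `label`, nearest parent first."""
    chain = []
    while parent.get(label) is not None:
        label = parent[label]
        chain.append(label)
    return chain


def depth(label: str, parent: Dict[str, Optional[str]]) -> int:
    """Level of a label in the ontology (first-level labels have depth 1)."""
    return len(ancestors(label, parent)) + 1


def filter_labels(labels: List[str], parent: Dict[str, Optional[str]]) -> List[str]:
    # Reduce each fourth-level label to its third-level parent.
    reduced: List[str] = []
    for label in labels:
        if depth(label, parent) == 4:
            label = parent[label]
        if label not in reduced:  # avoid duplicates such as Dog + Bark
            reduced.append(label)

    # Remove the parents of every third-level label present in the clip.
    to_remove = set()
    for label in reduced:
        if depth(label, parent) == 3:
            to_remove.update(ancestors(label, parent))
    return [label for label in reduced if label not in to_remove]


clip_labels = ["Animal", "Pets", "Dog", "Bark"]
filtered = filter_labels(clip_labels, PARENT)
num_sound_sources = len(filtered)  # used as the sound-source count
```

For the barking example in the text, the four hierarchical labels collapse to the single label Dog, so the clip is counted as having one sound-source.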
6.2 Audiovisual Learning Network
The audiovisual learning network is a two-stream network, which consists of an audio pathway and a visual pathway. We use the off-the-shelf VGGish network for the audio pathway, where we discard its last three fully-connected layers and last max-pooling layer. The output of the audio pathway is the feature maps of the final convolution layer of VGGish, with the size of . A similar procedure is applied to the visual pathway, but the model is replaced with VGG16, and the size of the output visual feature maps is .
For the output features of both modalities, we use the Reshape operator to transform them into a set of vectors, i.e., reshaping from to for audio and from to for visual modality. We use a fully-connected layer of to encode these reshaped features in the channel space. Then, the modality-specific clustering module is applied, with which we expect to discover concrete audio and visual contents. The whole network is trained with the contrastive loss.
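The reshape-then-encode step can be illustrated with a small NumPy sketch. All sizes here (`H`, `W`, `C`, `D`) are hypothetical placeholders chosen for the example, not the dimensions used in the paper, and the random weights stand in for a learned fully-connected layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (assumption): spatial grid H x W, C input channels,
# D output units of the fully-connected layer.
H, W, C, D = 4, 3, 8, 5

feature_map = rng.standard_normal((H, W, C))

# Reshape: one C-dimensional vector per spatial location.
vectors = feature_map.reshape(H * W, C)

# Fully-connected layer applied in the channel space, i.e. the same
# weights are shared by every location.
weight = rng.standard_normal((C, D))
bias = np.zeros(D)
encoded = vectors @ weight + bias  # shape (H * W, D)
```

Each row of `encoded` corresponds to one location of the original feature map, which is the set-of-vectors form that a clustering module can operate on.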
6.3 Poisson Regression Network
The Poisson regression network is built on the VGGish audio network. We remove the last three fully-connected layers of VGGish and apply GlobalMaxPooling over the output feature maps to obtain a feature vector. Then, two fully-connected layers of are employed to project the feature vector into a predicted value for the Poisson average value of . Finally, we train the regression network w.r.t. the Poisson regression loss. The SGD optimizer is used, whose momentum is set to and the learning rate is initialized at and decays by .
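As a point of reference, a Poisson regression loss is commonly taken to be the negative log-likelihood of the observed count under the predicted rate. The sketch below assumes that standard form; this section does not spell out the exact loss, so the function name and formulation are illustrative only.

```python
import math


def poisson_nll(rate: float, count: int) -> float:
    """Negative log-likelihood of `count` under a Poisson distribution.

    P(count | rate) = rate**count * exp(-rate) / count!
    so -log P = rate - count * log(rate) + log(count!).
    `rate` must be positive; `count` is a non-negative integer.
    """
    return rate - count * math.log(rate) + math.lgamma(count + 1)


# The loss for an observed count of 2 sources is smallest when the
# predicted rate equals 2.
losses = {rate: poisson_nll(rate, 2) for rate in (1.0, 2.0, 3.0)}
best_rate = min(losses, key=losses.get)
```

In practice the network output would typically be passed through a positive-valued function (for example an exponential) before being used as the rate, which is a further assumption beyond the text.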
6.4 Sound Separation Network
Our audiovisual sound separation network consists of two parts: one extracts the visual representation of the sound-maker and the other performs sound separation. Concretely, the visual branch is essentially the audiovisual learning network. It takes an image and the corresponding sound as inputs, and extracts the visual representation of the specific sound-maker in the scene. To automate this process, we use audiovisual scenes with a single sound-source, because we can then localize the sound-maker directly by comparing the different visual representations with the unique sound, without distinguishing them manually. Then, we use the corresponding clustering center as the representation of the localized sound-maker, which is a vector.
The sound separation branch is a variant of U-Net, which consists of an encoder and a decoder. The encoder contains 6 convolution layers with channels, while the decoder contains 6 up-convolution layers with channels. The convolution layers in the encoder use filters, each followed by a BatchNormalization and a LeakyReLU (with a slope of ) layer. The up-convolution layers in the decoder also use filters, but are followed by a BatchNormalization and a ReLU layer. The last up-convolution layer is followed by a sigmoid function to match the value range of the spectrogram mask. Similar to the original U-Net, we apply skip-connections between symmetric encoder and decoder layers.
To integrate the above two branches, the visual representation is first passed to a fully-connected layer of , then a BatchNormalization and a LeakyReLU (with a slope of ) layer. The resulting visual vector is replicated to match the size of the encoded audio feature maps. Then, the audio and visual feature maps are concatenated together and fed to the decoder of the sound separation branch.
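The replicate-and-concatenate fusion can be sketched in NumPy as below. The shapes and the function name `fuse` are hypothetical and chosen only for illustration; the sketch covers the tiling and channel-wise concatenation, not the preceding fully-connected, BatchNormalization and LeakyReLU layers.

```python
import numpy as np


def fuse(audio_maps: np.ndarray, visual_vec: np.ndarray) -> np.ndarray:
    """Tile a visual vector over the audio feature grid and concatenate.

    audio_maps: array of shape (H, W, C_a), the encoded audio feature maps.
    visual_vec: array of shape (C_v,), the projected visual representation.
    Returns an array of shape (H, W, C_a + C_v).
    """
    h, w, _ = audio_maps.shape
    # Replicate the visual vector at every spatial position.
    tiled = np.broadcast_to(visual_vec, (h, w, visual_vec.shape[0]))
    # Concatenate along the channel axis.
    return np.concatenate([audio_maps, tiled], axis=-1)


# Hypothetical sizes (assumption): a 2 x 2 grid with 4 audio channels
# and a 3-dimensional visual vector.
audio_maps = np.zeros((2, 2, 4))
visual_vec = np.array([1.0, 2.0, 3.0])
fused = fuse(audio_maps, visual_vec)
```

Every spatial position of `fused` carries the same visual vector next to its own audio features, so the decoder sees the visual guidance at each location.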
6.5 More Results
6.5.1 Sound Separation
AudioSet-Instrument Following , all the clips in AudioSet are filtered to construct a subset of 15 musical instruments. The filtered clips from the Unbalanced-Train set constitute the training dataset, and the ones from the Balanced-Train set are split for validation and testing. As some video clips in AudioSet have been removed by their uploaders, the whole instrument dataset is smaller than the one in , with about 99,882/456/456 train/val/test clips. As 93,679 video clips in the training dataset have a single sound-source, we directly use them for training the audiovisual learning model. Then, we use the trained model to extract the visual representation of the localized sound-source, with which we train the separation model, as shown in Fig. 7.
Table 6 shows the sound separation results. Our model outperforms most of the compared methods, even though some of them adopt additional visual knowledge. For example, AV-MIML and Sound-of-Pixels use an ImageNet pre-trained model as the visual extractor. Besides, we also show the results of CAVL-. These results are obtained by directly treating the audio assignment in the audiovisual learning model as the spectrogram mask. However, as the audio assignment is computed over the embedded feature maps, it is too coarse for fine-grained spectrogram prediction. Hence, this method does not work for sound separation.