Log In Sign Up

Large Scale Audiovisual Learning of Sounds with Weakly Labeled Data

by   Haytham M. Fayek, et al.

Recognizing sounds is a key aspect of computational audio scene analysis and machine perception. In this paper, we advocate that sound recognition is inherently a multi-modal audiovisual task in that it is easier to differentiate sounds using both the audio and visual modalities as opposed to one or the other. We present an audiovisual fusion model that learns to recognize sounds from weakly labeled video recordings. The proposed fusion model utilizes an attention mechanism to dynamically combine the outputs of the individual audio and visual models. Experiments on the large scale sound events dataset, AudioSet, demonstrate the efficacy of the proposed model, which outperforms the single-modal models, and state-of-the-art fusion and multi-modal models. We achieve a mean Average Precision (mAP) of 46.16 on Audioset, outperforming prior state of the art by approximately +4.35 mAP (relative: 10.4


page 4

page 6


Towards Good Practices for Multi-modal Fusion in Large-scale Video Classification

Leveraging both visual frames and audio has been experimentally proven e...

Watch, Listen and Tell: Multi-modal Weakly Supervised Dense Event Captioning

Multi-modal learning, particularly among imaging and linguistic modaliti...

Multi-Modal Perception Attention Network with Self-Supervised Learning for Audio-Visual Speaker Tracking

Multi-modal fusion is proven to be an effective method to improve the ac...

Predicting Depression Severity by Multi-Modal Feature Engineering and Fusion

We present our preliminary work to determine if patient's vocal acoustic...

Is this Harmful? Learning to Predict Harmfulness Ratings from Video

Automatically identifying harmful content in video is an important task ...

After All, Only The Last Neuron Matters: Comparing Multi-modal Fusion Functions for Scene Graph Generation

From object segmentation to word vector representations, Scene Graph Gen...

Squeeze-Excitation Convolutional Recurrent Neural Networks for Audio-Visual Scene Classification

The use of multiple and semantically correlated sources can provide comp...

1 Introduction

Sound recognition is key for intelligent agents that perceive and interact with the world, similar to how the perception of different sounds is critical for humans to understand and interact with the environment. Sounds are primarily produced by actions on or interactions between objects, and hence we often learn to immediately associate sounds with visual objects and entities. This visual association is often necessary for recognizing and understanding the acoustic activities occurring around us. For example, it is easier to distinguish between sounds produced by a Vacuum Cleaner and sounds produced by a Hair Dryer by listening to and seeing the appliances, as opposed to one or the other.

Sound recognition is inherently a multi-modal audiovisual task. Thus, we ought to build machine learning models for sound recognition that are multi-modal, inspired by how humans perceive sound. In addition to reducing uncertainties in certain situations as instantiated above, audiovisual learning of sounds can lead to a more holistic understanding of the environment. For example, an alarm clock can be designed to produce

Church Bell sounds; while an audio only sound recognition model can perhaps label the sound as Church Bells, which would not be incorrect, it nevertheless does not represent the actual event. Hence, systems designed to do audiovisual perception of sounds would lead to better and more complete understanding of these phenomena. However, most prior work on sound recognition focuses on learning to recognize sounds only from audio [32, 24].

Multi-modal learning approaches, which aim to train a single model by using the audio and visual modalities simultaneously are one class of well established audiovisual learning approaches. However, these methods present their own set of challenges. In these methods, it is often necessary to appropriately synchronize the audio and visual components and ensure temporal consistencies for a single model to consume both modalities. Moreover, training a single multi-modal model which uses both modalities is a difficult feat due to different learning dynamics for the audio and visual modalities [33]. Further, training and inference in multi-modal models is computationally cumbersome, particularly for large datasets [10].

Another class of audiovisual learning methods can be described as Fusion approaches. Fusion approaches aim to combine two independent single-modal models. This offers several advantages. We can build strong individual models for each modality separately, with the freedom to design modal-specific models without requiring to factor in the unnecessary complexities arising out of multi-modal models and training. Furthermore, fusion approaches are often more interpretable than multi-modal models as it is easier to qualitatively analyze and understand the contributions of each modality through the individual models. Finally, fusion approaches allow combining multiple models per modality, which can be advantageous in improving performance.

In this paper, we propose fusion based approaches for audiovisual learning of sounds. We propose methods that learn to combine individual audio and visual models that were respectively trained on each modality separately. We first build state-of-the-art audio and visual sound recognition systems. We then propose attention fusion models to dynamically combine these audio and visual models. Specifically, our fusion models learn to pay attention to the appropriate modality in a sample-specific class-specific

manner. Our models are designed for weakly supervised learning, where we train models using weakly labeled data 

[22]. We analyze our proposed models on the largest dataset for sound events, Audioset [11], and show that the proposed models outperform state-of-the-art single-modal models, baseline fusion models, and multi-modal models. The results and ensuing analysis attest to the importance of using both the audio and visual modalities for sound recognition and efficacy of our attention fusion models.

2 Related Work

Learning Sound Events.

Numerous prior works have proposed a variety of supervised methods for detecting and classifying sound events 

[32, 24]. Large-scale sound event detection has been possible primarily through weakly supervised learning [22] and the release of large-scale weakly labeled sound events datasets, such as Audioset [11]

. Most of the recent methods rely on deep neural networks, in particular, deep Convolutional Neural Networks (ConvNets) 

[21, 1, 19, 9]. While and pooling have been shown to be effective in handling weakly labeled data [24], more involved methods such as adaptive pooling [28] or attention based pooling [34, 19] have also been proposed.

Audiovisual Learning.

There is a profusion of work on multi-modal audiovisual learning for video recognition and understanding in recent years [7]. Progress in multi-modal learning was facilitated by the availability of large-scale video datasets [16, 11, 13], advances in self-supervised learning of multi-modal representations that exploit cross-modal information [5, 4, 29, 20]

, custom loss functions that synchronize learning multiple modalities 

[33], and others. Despite the recent progress, multi-modal audiovisual learning remains a challenging problem. Primary reasons include difficulties in designing multi-modal neural network architectures that work well across various datasets and tasks, jointly learning multiple modalities, training and inference efficiency, etc.

Audiovisual Fusion.

Late fusion approaches are a popular to combine audio and visual models [12]. Unweighted late fusion such as averaging the outputs of the audio and visual models can demonstrate competitive performance in addition to not requiring any additional parameters and training beyond the audio and visual single-modal models [25]. Nevertheless, weighted late or mid-level fusion has been extensively studied as it provides avenues to adaptively combine information between the different modalities [25, 36, 17]. Further, attention mechanisms [6] were used to dynamically combine the outputs of the single-modal models [27, 31, 15, 37, 26].

Audiovisual Learning of Sounds

Prior work on audiovisual learning of sounds, whether through single-model multi-modal learning or through fusion of individual models independently trained on the respective modalities, is scarce. [33] describes a multi-modal learning approach in which a single model is trained jointly for both the audio and visual modalities and applies this to sound recognition. Such methods are susceptible to several difficulties as outlined in Section 1. We, on the other hand, propose fusion models for audiovisual learning of sounds and show in Section 2 (Table 3) the effectiveness of our fusion models compared with the multi-modal models in [33]. Another work related to ours is [30], which attempts to learn representations from weakly labeled audiovisual data and use these representations in different tasks including sound event recognition; therein, the audio and visual models are only combined through average fusion. Perhaps worth mentioning are other works such as as [5, 4, 2]; they attempted to use the visual modality to learn representations for the audio modality that can be used for downstream sound event classification tasks. However, in all prior work, the downstream sound classification tasks are done on small-scale datasets, which does not provide in-depth understanding of audiovisual learning for sounds at scale.

3 Audiovisual Models for Sounds

We learn sound events in the weakly supervised setting, where only recording-level labels are provided for each training recording, with no temporal information. The fundamentals of weakly supervised learning of sound events are based on Multiple Instance Learning (MIL) [22]. In MIL, the learning problem is formulated in terms of bags and respective labels, ; each bag is a collection of instances [3]. A bag is labeled as positive for a class, , if at least one instance within the bag is positive. On the other hand, if all instances within the bag are negative for a given class, then the label is negative, . For weakly labeled sound recognition, recordings which are tagged with the presence of a sound class become a positive bag for that class and negative otherwise. The audio and visual models in Sections 3.1 and 3.2 follow this formulation.

3.1 Audio Model

In the MIL setting for weakly labeled sound event recognition, each recording becomes a bag as required in the MIL framework. The instances in the bag

are segments of the recording. Assuming that each instance is represented by a feature vector

; the training data takes the form of to , where the bag consists of instances .

In the audio model, the key idea is that the learner can be trained by first mapping the instance-level predictions to the corresponding bag-level predictions, and then these bag-level predictions can be used to compute a loss function to be minimized, akin to the classic MI-SVM approach for MIL [3]. Herein, the function was used to aggregate the instance-level predictions.

Let be the function we wish to learn; then the training involves minimizing the following loss function.


where, is the loss function measuring divergence between predictions and targets, and represents learnable parameters of . Note that the function , used to aggregate the instance-level predictions from to the bag-level predictions, can itself contain learnable parameters .

The audio model is a ConvNet that maps Log-scaled Mel-filter-bank feature representations of the entire audio recording to (multiple) labels, trained by minimizing the loss function defined in Equation 1. The sampling rate of all audio recordings is ; Mel-filter-bank representations are obtained for windows in the audio recording, shifted by , which amounts to 100 frames per second of audio.

The overall architecture of the network is detailed in Table 1. The network produces instance and bag representations for any given input at Block B5. The bag consists of -dimensional instance representations. The network is designed for a receptive field of ~ second ( frames). Hence, it outputs -dimensional features for ~ second of audio, every ~ seconds ( frames). Three convolutional layers (B6 to B8) are then used to predict outputs for each segment, which are then aggregated via the layer that uses global average pooling. The number of segments depends on the size of the input. As shown in Table 1, for an input with frames, segments are produced.

The loss function for the training audio recording is defined as:


where, is the output for the class and is the binary cross entropy loss function. Furthermore, our audio model training also incorporates the sequential co-supervision method described in [21].

3.2 Visual Model

Similar to the audio model, the visual model for sounds is based on the MIL framework. Herein, the entire video is the bag and instances of the bag are the frames of the video. Specifically, we sample frames from the videos to form the bags. Each frame (instance) is then represented by -feature representations obtained from a ResNet-152 [14]

, pre-trained on Imagenet 

[8]. This yields the video bag representations for each recording. We obtain a single -dimensional vector representation for each bag by averaging the feature representations for each instance. This idea of mapping bags to a single vector representations was previously employed in [23, 35] to make MIL algorithms scalable.

We then train a hidden layer Deep Fully Connected Neural Network (DNN) to recognize sound events. The number of units in each layer are respectively, where is the number of classes. The first and second hidden layers are followed by a dropout layer with a dropout rate of

. Rectified Linear Units (ReLU) activation is used in all layers except for the last layer, where sigmoid activation is used. Similar to the audio model, the network is trained with the binary cross entropy loss function.

Stage Layers Output Size
Input Unless specified – (S)tride = 1, (P)adding = 1
Block B1 Conv: 64,
Conv: 64,
Pool: (S:4)
Block B2 Conv: 128,
Conv: 128,
Pool: (S:2)
Block B3 Conv: 256,
Conv: 256,
Pool: (S:2)
Block B4 Conv: 512,
Conv: 512,
Pool: (S:2)
Block B5 Conv: 2048, (P:0)
Block B6 Conv: 1024,
Block B7 Conv: 1024,
Block B8 Conv: C,
G () Global average pooling
Table 1: Audio Convolutional Neural Network (ConvNet) architecture. The number of filters and size of each filter varies between convolutional layers; e.g., Conv: 64, denotes a convolutional layer of 64 filters, each of size .

denotes the number of classes. All convolutional layers except in B8 are followed by batch normalization and Rectified Linear Units (ReLU); the convolutional layer in B8 is followed by a sigmoid function.

Figure 1: Audiovisual fusion for sound recognition. The audio and visual models map the respective input to segment-level representations which are then used to obtain single-modal predictions, and , respectively. The attention fusion function, , ingests the single-modal predictions, and , to produce weights for each modality, and . The single-modal audio and visual predictions, and , are mapped to and via functions, and , respectively, and fused using the attention weights, and . The fused output, , is then mapped to multi-modal multi-label predictions via function .

3.3 Baseline Audiovisual Fusion

Our audiovisual learning of sounds focuses on developing fusion models that combine outputs from models trained individually on the audio and visual modalities. The following is a description of a number of baseline fusion methods.

Average Fusion.

We simply compute the arithmetic mean of the outputs of the audio and visual models. Despite its simplicity, average fusion serves as a strong baseline. [30] used average fusion to combine audio and visual predictions.


We train a linear layer with a sigmoid activation function that ingests the (mean-standard-deviation) normalized outputs of the audio and visual models, and predicts the probability of the sound classes. The model is

regularized with the regularization parameter set to .

Multi-Layer Perceptron (MLP).

We train a fully connected neural network with a single hidden layer that ingests concatenated outputs of the audio and visual models, and predicts the probability of the sound classes. The hidden layer comprises units with batch normalization and ReLU activation. Furthermore, a dropout layer with a dropout rate of is used. The output layer is a fully connected layer with units and a sigmoid activation function.

3.4 Attention Audiovisual Fusion

The audio and visual models have been separately trained to map the audio and visual modalities, respectively, to the various sound classes. The performance of each class varies between modalities. This variation can be sample-specific as well. For a given sound class, in some videos, the audio modality might be sufficient to correctly recognize the sound, whereas in other videos, the visual modality might be more suited. Thus, we ought to fuse the outputs from the audio and visual models in a sample-specific and class-specific manner. To this end, we design an attention mechanism to combine the audio and visual models.

Figure 1 shows the general schema of the attention fusion mechanism. Let and represent the outputs of the audio and visual models for a given video recording respectively. We learn an attention function, , that generates weights and for the audio and visual modalities respectively. We constrain the attention weights to sum to for each class, i.e., , where is a vector of ones.

The attention function is implemented as a neural network, :


where, denotes the learnable parameters of network , and denotes the concatenation of and . The output layer in has a sigmoid activation function.

The final output that combines the audio and visual models is obtained as follows:


where, and are neural networks that process inputs and into and , respectively, which are then combined through attention weights ( and ), as in Equation 5. and are the learnable parameters of and respectively. The combined outputs, , are passed through another neural network with a sigmoid activation function to produce the final outputs .

Note that and are generic functions representing all possible transforms: linear, non-linear, or even the identity of and respectively. For the identity case, , and

does not contain any learnable parameters. For the linear transform case,

does not contain any non-linear activation functions. Similarly, might just be a sigmoid function, i.e., , and hence, containing no learnable parameters. Otherwise, it can be a neural network with one or more layers with a sigmoid activation at the output layer. We also attempted to learn the attention weights with a single modality, either or instead of the combined . In this case, is either obtained through self-attention or through cross-modal attention . Similarly, is obtained, and both and are normalized to sum to for each class. However, attention weights learned from single modality was found to be inferior to learning them through both modalities as in Equation 3.

4 Experiments

Experiments are conducted on AudioSet which is described in Section 4.1. The experiments and results are presented in Section 4.2.

4.1 Audioset Dataset

Audioset [11] is the largest dataset for sound events. The dataset provides YouTube videos for sound events. Each video clip is approximately in length annotated by a human with multiple labels denoting the sound events present in the video clip. The average number of labels per video clip is 2.7. The dataset is weakly labeled as the labels of each video clip denote the absence or presence of sound events but do not contain any temporal information. The distribution of sound event classes in the training set is severely unbalanced, ranging from around million videos for the most represented class, Music, to approximately videos for the least represented class, Screech.

We use the predefined Unbalanced set for training and Eval set for performance evaluation. The training set comprises approximately million videos, whereas the evaluation set comprises approximately videos. The evaluation set contains at least videos for each class. In practice, the total video count is slightly less than the numbers listed above due to the unavailability of links when this work was carried out. We sample approximately videos from the training set to use as the validation set.

4.2 Performance Evaluation of Fusion Methods

Experimental Setup.

The outputs (predictions) of the audio and visual models are used as input to all fusion models. These outputs are normalized to zero mean and unit standard deviation prior to feeding them into the fusion models (except for average fusion). The mean and standard deviation are computed using the training set only.

The details of the audio model training can be found in [21]

. The network in the video model is trained for 20 epochs using the Adam optimizer 


with a mini-batch size of 144. Hyperparameters such as the learning rate are selected using the validation set, which is also used to select the best model during training.

In the fusion experiments, all neural networks are trained using Adam for 100 epochs. The mini-batch size is set to . The validation set is used to select the best hyperparameters, such as the learning rate for each model, and the best model during training. Similar to prior work on Audioset, we use Average Precision (AP) and Area Under Receiver Operating Characteristic (ROC) Curve (AUC) to measure the performance per class. Mean Average Precision (mAP) and mean Area Under ROC Curve (mAUC) computed over all classes are used for comparison.

Model mAP mAUC
Audio 38.35 97.12
Visual 25.73 91.30
Average Fusion 42.84 97.63
Regression Fusion 43.10 95.35
MLP Fusion 45.60 96.83
Attention Fusion 46.16 97.51
Table 2: Comparison of mAP and mAUC for different fusion methods for combining audio and visual models.
Model mAP mAUC
Audio (ResNet-50)  [9] 38.0 97.10
Audio (Ours) 38.35 97.12
Visual (R(2+1)D-101)  [33] 18.80 91.80
Visual (Ours) 25.73 91.30
AudioVisual (G-Blend)  [33] 41.80 97.50
AudioVisual (Ours) 46.16 97.51
Table 3: mAP and mAUC for state-of-the-art audio, visual, and audiovisual sound recognition models on AudioSet.

Comparison of Fusion Methods.

Table 2 summarizes the results. The audio model achieves 38.35 mAP and 97.12 mAUC, whereas the visual model achieves 25.73 mAP and 91.30 mAUC. The superiority of the audio model over the visual model is expected due to the nature of the task.

Average fusion of the audio and visual outputs achieves 42.84 mAP, an absolute improvement of 4.49 mAP (relative: 11.7%) over the audio model, and an absolute improvement of 17.11 mAP (relative: 66.5%) over the visual model. The regression fusion model leads to a minor improvement over average fusion: +0.26 mAP. The MLP fusion model leads to considerable improvement over average fusion: +2.76 mAP.

Our attention fusion model achieves 46.16 mAP, which is an absolute improvement of 7.81 mAP (relative: 20.4%) over the audio model. It also outperforms all baseline fusion methods: an absolute improvement of 3.32 mAP (relative: 7.7%) over average fusion. In this case, takes in the concatenated normalized outputs from the audio () and visual () models. is a neural network with a single hidden layer of 512 units; batch normalization and ReLU activation is applied on this layer which is followed by a dropout layer with a dropout rate of 0.5. The output layer has 527 units and a sigmoid activation function. , , and are all single layer networks with 512 units and sigmoid activations.

We also study different variations of the proposed fusion model. The unimodal approaches uses only a single modality in the attention network, either or as opposed to using and combined through concatenation. When is obtained by using and using , we denote it as self unimodal attention. However, if uses and vice versa, we call it cross unimodal attention. In these cases, there are two attention networks, and but other aspects of the architecture remains identical to the one described above.

We also consider another variation where and are simply identity functions that do not contain learnable parameters. Similarly, in another variation, we make an identity mapping. The performance of these fusion models are listed in Table 4

. The multi-modal attention model is slightly better than the unimodal attention models. However, removing neural networks

, , and leads to significant deterioration in performance.

Model mAP mAUC
Attention (self unimodal) 46.04 97.48
Attention (cross unimodal) 46.06 97.47
Attention (multi-modal w/o and ) 38.84 95.34
Attention (multi-modal w/o ) 40.84 95.15
Attention (multi-modal) 46.16 97.51
Table 4: Ablation of attention fusion models.

Comparison with State-of-the-Art.

Table 3 shows comparison with state-of-the-art models on Audioset. Our audio model is slightly better than the state-of-the-art performance on Audioset. Note that [9] also shows a performance of 39.2 mAP. However, this is obtained by a ensembeling outputs from multiple models and the best single model performance (which is fairer to compare with) is 38.0. To the best of our knowledge, [33] is the only prior work that reported visual and audiovisual models for sound events on Audioset. Our visual model is +6.93 mAP (relative: 36.8%) better than [33]. More importantly, our audiovisual model is +4.35 mAP (relative: 10.4%) compared with [33] and sets a new state-of-the-art on Audioset.

5 Discussions

Figure 2: Audio, visual, and audiovisual models Average Precision (AP). The horizontal and vertical axes denote the audio and visual models APs respectively for all classes. The diameter of each bubble denotes the absolute change in AP over the single-modal models (audio or visual, whichever is better). Blue denotes improvement over the single-modal model, whereas red indicates deterioration over the single-modal model, as depicted in the color bar. Several sound classes of interest are annotated accordingly.

5.1 Class Specific Analysis

Figure 2 provides a per class depiction of the audio, visual, and audiovisual models. As expected, the audio model outperforms the video model for the majority of sound events. For sounds such as Heartbeat, Jingle Bell, and Whispering, audio recognition of sounds is considerably better than visual recognition of sounds. However, for classes such as Patter, Shuffling Cards, and Bus, the visual model outperforms the audio model by a large margin. Overall, the visual model outperforms the audio model for 117 sound classes by a small or large margin. For some classes, both audio and visual models perform well (e.g., Turkey, Train, and Music Accordion), or both fail to achieve good performance (e.g., Squish and Hum).

For the audiovisual fusion model, we see that for most classes, audiovisual learning leads to an improvement over the better of the audio or visual models, as observable by mostly blue bubbles in Figure 2. Figure 3 shows a histogram of the absolute deterioration and improvement in performance through audiovisual learning compared to the better of the audio or visual models. For over 83% of classes (439 of 527), we see an improvement, whereas for the remaining classes, audiovisual learning leads to deterioration in performance, compared with the better of the audio or visual models. Furthermore, for over 17% of classes (91 out of 527), audiovisual learning achieves an absolute improvement of more than +10 mAP. On the other hand, for only 15 classes audiovisual learning leads to an absolute deterioration of more than 5 mAP, with only 3 classes with more than 10 deterioration in mAP. For most cases, the deterioration is minor.

Figure 3: Histogram of change in AP for our audiovisual attention fusion model with respect to single modality models (audio or visual whichever is better).
Figure 4: Mean of attention weights for all classes in AudioSet.

From Figure 2, we can see some examples of classes with large improvements in the audiovisual model, e.g., Explosion, Ringtone Clicking, Vacuum Cleaner, and Bathtub. Several of these classes with large improvement have similar AP using audio and visual models.

For several sound classes, one of the modalities is much more superior. One may argue for using only the appropriate single-modal models. However, we see that in several of these cases the other modality, despite being inferior on its own, can provide complimentary information, which can lead to improved performance through audiovisual learning. Consider for example, classes such as Angry Music and Plop, as shown in Figure 2. While the audio modality is much more useful for these classes compared to the visual modality, audiovisual learning still leads to considerable improvement as the information from the visual modality compliments the audio modality. The same inference applies for classes such as Bus and Motorboat, where the visual modality is much more superior compared to the audio modality. However, there are exceptions to the above observations. On the audio side, Whispering and Police Siren, and on the video side, Shuffling Cards and Writing, are examples of such exceptions.

One final observation we would like to point out from Figure 2 is that, in general for classes with low performance for the audio and visual models (less than 10 AP), audiovisual fusion also does not lead to any improvement. We believe that a multi-modal framework in which the audio and visual modalities are jointly used to train a single model might be more suitable in these situations.

5.2 Analysis of Attention

We analyze the attention weights obtained for each sample for each class. For each class, we compute the mean attention weight () by averaging over all evaluation examples which contains the

class. This gives us an estimate of the importance of the audio modality for the

class (and consequently for the visual modality as well). Figure 4 is a histogram plot of the mean attention weights. The right side of the plot with higher

represents dominance of the audio modality, whereas on the left side the visual modality is the dominant modality. The distribution is clearly skewed towards the audio modality, which is expected since the audio model is better than the visual model for more than 75% of the sound classes. There are more than 110 sounds for which on average, the audio modality gets all the weight (

). Interestingly, there are around 38 sound classes where the visual modality has complete dominance (). Furthermore, for more than 70 sound classes the visual modality on average gets more than 90% weight (). This signifies that for several sounds the visual modality can play a vital role in their recognition.

6 Conclusions

We propose audiovisual models for learning sounds. Instead of jointly using audio and visual modalities in a single model, we propose to combine the individual models that were trained separately on each modality. To this end, we use a state-of-the-art audio based sound event recognition model and propose a novel vision based model for recognizing sound events. On the Audioset dataset of sound events, it outperforms a prior vision model for sounds by +6.93 mAP (relative: 36.9%). We then propose an attention mechanism for audiovisual fusion that dynamically fuses the outputs from the audio and visual models in a sample-specific and class-specific manner. Our proposed audiovisual learning model improves state-of-the-art performance on Audioset by approximately +4.35 mAP (relative: 10.4%). We also provide thorough analysis of the role of the audio and visual modalities for various sound classes. The outcomes of this analysis can play a crucial role in the development of future models for audiovisual learning of sounds.


  • [1] S. Adavanne, H. M. Fayek, and V. Tourbabin (2019-10) Sound event classification and detection with weakly labeled data. In Detection and Classification of Acoustic Scenes and Events Workshop (DCASE), New York, USA, pp. 15–19. Cited by: §2.
  • [2] H. Alwassel, D. Mahajan, L. Torresani, B. Ghanem, and D. Tran (2019) Self-supervised learning by cross-modal audio-video clustering. arXiv preprint arXiv:1911.12667. Cited by: §2.
  • [3] S. Andrews, I. Tsochantaridis, and T. Hofmann (2003) Support vector machines for multiple-instance learning. In NIPS, pp. 577–584. Cited by: §3.1, §3.
  • [4] R. Arandjelovic and A. Zisserman (2017) Look, listen and learn. In CVPR, pp. 609–617. Cited by: §2, §2.
  • [5] Y. Aytar, C. Vondrick, and A. Torralba (2016) Soundnet: learning sound representations from unlabeled video. In NIPS, pp. 892–900. Cited by: §2, §2.
  • [6] D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §2.
  • [7] T. Baltrušaitis, C. Ahuja, and L. Morency (2018) Multimodal machine learning: a survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 41 (2), pp. 423–443. Cited by: §2.
  • [8] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In CVPR, pp. 248–255. Cited by: §3.2.
  • [9] L. Ford, H. Tang, F. Grondin, and J. Glass (2019) A deep residual network for large-scale acoustic scene analysis. In Interspeech, Cited by: §2, §4.2, Table 3.
  • [10] R. Gao, T. Oh, K. Grauman, and L. Torresani (2019) Listen to look: action recognition by previewing audio. arXiv preprint arXiv:1912.04487. Cited by: §1.
  • [11] J. Gemmeke, D. Ellis, D. Freedman, A. Jansen, W. Lawrence, C. Moore, M. Plakal, and M. Ritter (2017) Audio set: an ontology and human-labeled dataset for audio events. In ICASSP, pp. 776–780. Cited by: §1, §2, §2, §4.1.
  • [12] B. Ghanem, J. Niebles, C. Snoek, F. Heilbron, H. Alwassel, V. Escorcia, R. Khrisna, S. Buch, and C. Dao (2018) The activitynet large-scale activity recognition challenge 2018 summary. arXiv preprint arXiv:1808.03766. Cited by: §2.
  • [13] C. Gu, C. Sun, D. Ross, C. Vondrick, C. Pantofaru, Y. Li, S. Vijayanarasimhan, G. Toderici, S. Ricco, R. Sukthankar, et al. (2018) AVA: a video dataset of spatio-temporally localized atomic visual actions. In CVPR, pp. 6047–6056. Cited by: §2.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §3.2.
  • [15] C. Hori, T. Hori, G. Wichern, J. Wang, T. Lee, A. Cherian, and T. K. Marks (2018) Multimodal attention for fusion of audio and spatiotemporal features for video description. In

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops

    pp. 2528–2531. Cited by: §2.
  • [16] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. (2017) The kinetics human action video dataset. arXiv preprint arXiv:1705.06950. Cited by: §2.
  • [17] E. Kazakos, A. Nagrani, A. Zisserman, and D. Damen (2019-10) EPIC-fusion: audio-visual temporal binding for egocentric action recognition. In ICCV, Cited by: §2.
  • [18] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.2.
  • [19] Q. Kong, C. Yu, Y. Xu, T. Iqbal, W. Wang, and M. D. Plumbley (2019) Weakly labelled audioset tagging with attention neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP) 27 (11), pp. 1791–1802. Cited by: §2.
  • [20] B. Korbar, D. Tran, and L. Torresani (2018) Cooperative learning of audio and video models from self-supervised synchronization. In NeurIPS, pp. 7763–7774. Cited by: §2.
  • [21] A. Kumar and V. K. Ithapu (2020) SeCoST: sequential co-supervision for weakly labeled audio event detection. In ICASSP, Cited by: §2, §3.1, §4.2.
  • [22] A. Kumar and B. Raj (2016) Audio event detection using weakly labeled data. In ACM Multimedia, pp. 1038–1047. Cited by: §1, §2, §3.
  • [23] A. Kumar and B. Raj (2016) Weakly supervised scalable audio content analysis. In 2016 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. Cited by: §3.2.
  • [24] A. Kumar (2018) Acoustic intelligence in machines. Ph.D. Thesis, Carnegie Mellon University. Cited by: §1, §2.
  • [25] Z. Lan, L. Bao, S. Yu, W. Liu, and A. G. Hauptmann (2012) Double fusion for multimedia event detection. In International Conference on Multimedia Modeling, pp. 173–185. Cited by: §2.
  • [26] Y. Lin, Y. Li, and Y. F. Wang (2019) Dual-modality seq2seq network for audio-visual event localization. In ICASSP, pp. 2002–2006. Cited by: §2.
  • [27] X. Long, C. Gan, G. De Melo, J. Wu, X. Liu, and S. Wen (2018) Attention clusters: purely attention based local feature integration for video classification. In CVPR, pp. 7834–7843. Cited by: §2.
  • [28] B. McFee, J. Salamon, and J. P. Bello (2018) Adaptive pooling operators for weakly labeled sound event detection. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 26 (11), pp. 2180–2193. Cited by: §2.
  • [29] A. Owens and A. A. Efros (2018) Audio-visual scene analysis with self-supervised multisensory features. In ECCV, pp. 631–648. Cited by: §2.
  • [30] S. Parekh, S. Essid, A. Ozerov, N. Duong, P. Pérez, and G. Richard (2019) Weakly supervised representation learning for audio-visual scene analysis. IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP) 28, pp. 416–428. Cited by: §2, §3.3.
  • [31] G. Sterpu, C. Saam, and N. Harte (2018)

    Attention-based audio-visual fusion for robust automatic speech recognition

    In Proceedings of the 20th ACM International Conference on Multimodal Interaction, pp. 111–115. Cited by: §2.
  • [32] T. Virtanen, M. D. Plumbley, and D. Ellis (2017) Computational analysis of sound scenes and events. Springer. Cited by: §1, §2.
  • [33] W. Wang, D. Tran, and M. Feiszli (2019) What makes training multi-modal networks hard?. arXiv preprint arXiv:1905.12681. Cited by: §1, §2, §2, §4.2, Table 3.
  • [34] Y. Wang, J. Li, and F. Metze (2019) A comparison of five multiple instance learning pooling functions for sound event detection with weak labeling. In ICASSP, pp. 31–35. Cited by: §2.
  • [35] X. Wei, J. Wu, and Z. Zhou (2014) Scalable multi-instance learning. In IEEE International Conference on Data Mining, pp. 1037–1042. Cited by: §3.2.
  • [36] Z. Wu, Y. Jiang, X. Wang, H. Ye, and X. Xue (2016) Multi-stream multi-class fusion of deep networks for video classification. In ACM Multimedia, pp. 791–800. Cited by: §2.
  • [37] P. Zhou, W. Yang, W. Chen, Y. Wang, and J. Jia (2019) Modality attention for end-to-end audio-visual speech recognition. In ICASSP, pp. 6565–6569. Cited by: §2.