
AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio Visual Event Localization

by   Tanvir Mahmud, et al.
The University of Texas at Austin

An audio-visual event (AVE) is denoted by the correspondence of the visual and auditory signals in a video segment. Precise localization of the AVEs is very challenging since it demands effective multi-modal feature correspondence to ground the short and long range temporal interactions. Existing approaches struggle in capturing the different scales of multi-modal interaction due to ineffective multi-modal training strategies. To overcome this limitation, we introduce AVE-CLIP, a novel framework that integrates the AudioCLIP pre-trained on large-scale audio-visual data with a multi-window temporal transformer to effectively operate on different temporal scales of video frames. Our contributions are three-fold: (1) We introduce a multi-stage training framework to incorporate AudioCLIP pre-trained with audio-image pairs into the AVE localization task on video frames through contrastive fine-tuning, effective mean video feature extraction, and multi-scale training phases. (2) We propose a multi-domain attention mechanism that operates on both temporal and feature domains over varying timescales to fuse the local and global feature variations. (3) We introduce a temporal refining scheme with event-guided attention followed by a simple-yet-effective post-processing step to handle significant variations of the background over diverse events. Our method achieves state-of-the-art performance on the publicly available AVE dataset with a 5.9% improvement over existing approaches.



1 Introduction

Temporal reasoning of multi-modal data plays a significant role in human perception in diverse environmental conditions [10, 38]. Grounding the multi-modal context is critical to current and future tasks of interest, especially those that guide current research efforts in this space, e.g., embodied perception of automated agents [29, 4, 8], human-robot interaction with multi-sensor guidance [25, 6, 2], and active sound source localization [35, 22, 18, 27]. Similarly, audio-visual event (AVE) localization demands complex multi-modal correspondence of grounded audio-visual perception [24, 7]. The simultaneous presence of the audio-visual cues over a video frame denotes an audio-visual event. As shown in Fig. 1, the speech of the person is audible in all of the frames. However, the individual speaking is visible in only a few particular frames, which represent the AVE. Precise detection of such events greatly depends on the contextual understanding of the multi-modal features over the video frames.

Figure 1: Example of an audio-visual event (AVE) representing an individual speaking. The person's voice is audible in all of the frames, but only the frames in which the person is visible are identified as the AVE.

Learning the inter-modal audio-visual feature correspondence over the video frames is one of the major challenges of AVE localization. Effective multi-modal training strategies can significantly improve performance by enhancing the relevant features. Earlier work integrates audio and image encoders pre-trained on large-scale uni-modal (image/audio) datasets [5, 9] to improve performance [36, 17, 32, 7, 30]. However, such a uni-modal pre-training scheme struggles to extract relevant inter-modal features that are particularly significant for AVEs. Recently, following the widespread success of CLIP [19] pre-trained on large-scale vision-language datasets, AudioCLIP [12] has integrated an audio encoder into the vision-language models with large-scale pre-training on audio-image pairs. To enhance the audio-visual feature correspondence for AVEs, we integrate the image and audio encoders from AudioCLIP with effective contrastive fine-tuning that exploits the large-scale pre-trained knowledge from multi-modal datasets instead of uni-modal ones.

Effective audio-visual fusion for multi-modal reasoning over entire video frames is another major challenge for proper utilization of the uni-modal features. Recently, several approaches have focused on using the grounded multi-modal features to generate temporal attention for operating on the intra-modal feature space [37, 17, 32]. Other recent work has applied recursive temporal attention on the aggregated multi-modal features [7, 17, 31]. However, these existing approaches attempt to generalize audio-visual context over the whole video frame and hence struggle to extract local variational patterns that are particularly significant at event transitions. Though generalized multi-modal context over long intervals is of great importance for categorizing diverse events, local changes of multi-modal features are critical for precise event detection at transition edges. To solve this dilemma, we introduce a multi-window temporal transformer based fusion scheme that operates on different timescales to guide attention over sharp local changes with short temporal windows, as well as extract the global context across long temporal windows.

The background class, representing uncorrelated audio-visual frames, varies considerably across different AVEs and surroundings (Figure 1). In many cases, it becomes difficult to distinguish the background from the event regions due to subtle variations [37]. Xu et al. [30] suggest that joint binary classification of the event regions (event/background) along with the multi-class event prediction improves overall performance through better discrimination of the event-oriented features. Inspired by this, we introduce a temporal feature refining scheme for guiding temporal attention over the event regions to introduce sharp contrast with the background. Moreover, we introduce a simple post-processing algorithm that filters out incorrect predictions in between event transitions by exploiting the high temporal locality of event/background frames in AVEs (Figure 1). By unifying these strategies in the AVE-CLIP framework, we achieve state-of-the-art performance on the AVE dataset, outperforming existing approaches by a considerable margin.

The major contributions of this work are summarized as follows:

  • We introduce AVE-CLIP to exploit AudioCLIP pre-trained on large-scale audio-image pairs for improving inter-modal feature correspondence on video AVEs.

  • We propose a multi-window temporal transformer based fusion scheme that operates on different timescales of AVE frames to extract local and global variations of multi-modal features.

  • We introduce a temporal feature refinement scheme through event-guided temporal attention followed by a simple yet effective post-processing method to increase contrast with the background.

Figure 2: Schematic representation of the proposed method. In stage 1, contrastive fine-tuning is carried out on the pre-trained AudioCLIP [12] image and audio encoders with audio-image pairs. In stage 2, video and audio features are extracted with the fine-tuned encoders. In stage 3, multi-scale training is carried out at various temporal scales with the proposed multi-window temporal fusion module, followed by temporal event refinement and post-processing to enhance event detection.

2 Related Work

Audio Visual Event Localization

AVE localization, introduced by Tian et al. [24], targets the identification of different types of events (e.g., individual man/woman speaking, crying babies, frying food, musical instruments, etc.) at each temporal instance based on audio-visual correspondence. The authors introduced a residual learning method with LSTM-guided audio-visual attention relying on simple concatenation and addition fusion. A dual attention matching (DAM) module was introduced by Wu et al. [28] for operating on event-relevant features. Zhou et al. [37] proposed a positive sample propagation scheme by pruning out the weaker multi-modal interactions. Xuan et al. [32, 33] proposed a discriminative multi-modal attention module for sequential learning with an eigenvalue-based objective function. Duan et al. [7] introduced joint co-learning with cyclic attention over the aggregated multi-modal features. Lin and Wang [17] introduced a transformer-based approach that operates on groups of video frames based on audio-visual attention. Xu et al. [30] introduced multi-modal relation-aware audio-visual representation learning with an interaction module. Different from existing approaches, AVE-CLIP exploits temporal features from various windows by extracting short- and long-range multi-modal interactions along with temporal refinement of the event frames.

Sound Source Localization

The sound source localization task [35] identifies the sounding object in the corresponding video based on the auditory signal. Arda et al. [22] introduced an audio-visual classification model that can be adapted for sound source localization without explicit training by utilizing simple multi-modal attention. Wu et al. [27] proposed an encoder-decoder based framework to operate on the continuous feature space through likelihood measurements of the sounding sources. Qian et al. [18] attempted multiple source localization by exploiting gradient-weighted class activation map (Grad-CAM) correspondence on the audio-visual signal. A self-supervised audio-visual matching scheme was introduced by Hu et al. [15] with dictionary learning of the sounding objects. Afouras et al. [1] utilized optical flow features along with multimodal attention maps targeting both source localization and audio source separation.

Large Scale Contrastive Pre-training

To improve data-efficiency on diverse target tasks, large-scale pre-training of very deep neural networks has been found to be effective for transfer learning [16]. CLIP [19] introduced vision-language pre-training with self-supervised contrastive learning on large-scale datasets, an approach that received great attention for achieving superior performance on numerous multimodal vision-language tasks [21, 26, 34]. Recently, AudioCLIP [12] has extended the existing CLIP framework by integrating the audio modality with large-scale training utilizing audio-image pairs [9]. Such large-scale pre-training on audio-visual data can be very effective for enhancing multi-modal feature correspondence.

3 Proposed Method

In this paper, we introduce AVE-CLIP, a framework that integrates image and audio encoders from AudioCLIP with a multi-window temporal transformer based fusion scheme for AVE localization. Our method comprises three training stages, as presented in Figure 2. Initially, we start with the pre-trained weights of the image and audio encoders from AudioCLIP. In stage 1, we extract image and audio segments of corresponding events to fine-tune the pre-trained encoders on the target AVE-localization frames (Section 3.2). In stage 2, these fine-tuned encoders are deployed to extract the video and audio features from successive video frames and audio segments, respectively (Section 3.3). Later, in stage 3, we introduce multi-scale training on the extracted audio and video features with the multi-window temporal fusion (MWTF) module that operates on different temporal windows for generalizing the local and global temporal context (Section 3.4). This is followed by temporal refinement of the fused feature through event-guided temporal attention generated with event-label supervision (Section 3.5), along with a hybrid loss function used in training (Section 3.6) and a simple post-processing algorithm that enhances prediction performance during inference by exploiting the temporal locality of the AVEs (Section 3.7).

3.1 Preliminary

Given a video sequence S of duration T seconds, a set of non-overlapping video segments {V_t} and synchronized audio segments {A_t}, each of duration 1 s, are extracted for t = 1, …, T. Each video segment V_t consists of a number of image frames and the corresponding audio segment A_t consists of a number of audio samples. Each audio-video segment pair is labeled either as event or background. Along with this generic event/background label e_t, each segment of the entire video is labeled with a particular event category, so that the set of one-hot encoded labels for the video sequence across the C event categories is denoted by Y = {y_t}, with y_t ∈ {0, 1}^C. Here, we utilize the class labels y_t to generate the event labels e_t: if y_t assigns a segment to the background category, then e_t = 0; otherwise e_t = 1, distinguishing event segments from background segments.
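The label construction above can be sketched as follows, a minimal illustration assuming one-hot class labels per 1 s segment with the background assigned to index 0 (the index convention is our assumption, not the authors' notation):

```python
import numpy as np

def event_labels(class_labels: np.ndarray, background_idx: int = 0) -> np.ndarray:
    """class_labels: (T, C) one-hot labels per segment.
    Returns (T,) binary event labels: 1 for event, 0 for background."""
    return (class_labels.argmax(axis=1) != background_idx).astype(np.int64)

# A 4-segment clip: background, event class 3, event class 3, background.
y = np.eye(5)[[0, 3, 3, 0]]
print(event_labels(y))  # -> [0 1 1 0]
```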

Figure 3: The three phases of the Multi-window Temporal Fusion (MWTF) module. In the split phase, the aggregated features are divided into separate temporal blocks based on the window length. In the fusion phase, multi-domain fusion is carried out on each window. In the aggregation phase, temporal merging ('M') followed by feature concatenation is carried out. The window length can be varied, and shared weights can be used in all fusion modules irrespective of window lengths.
Figure 4: Representation of the multi-domain fusion process. Attention maps are generated for both the temporal and feature axes and applied to the input features through non-linear projection.

3.2 Contrastive Fine-Tuning on Audio-Image Pairs

We extract positive and negative audio-image pairs from the target dataset, where a positive pair corresponds to the same AVE and a negative one represents a mismatch. Initially, we start with the pre-trained audio and image encoders from AudioCLIP [12]. Afterwards, we initiate fine-tuning on the extracted audio-image pairs utilizing the InfoNCE loss L = L_v→a + L_a→v, where L_v→a represents the image-to-audio matching loss and L_a→v represents the audio-to-image matching loss. L_v→a is given by

L_v→a = −(1/N) Σ_{k=1}^{N} log [ exp(v_k · a_k / τ) / Σ_{l=1}^{N} exp(v_k · a_l / τ) ]

where N denotes the total number of audio-image pairs in a batch, a_k and v_k represent the normalized audio and image features of the k-th pair, respectively, and τ is a trainable temperature. Similarly, we construct the audio-to-image matching loss L_a→v.
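The symmetric contrastive objective can be sketched numerically as follows, a minimal NumPy version assuming L2-normalized feature matrices with positives on the diagonal (function and variable names are ours, not the paper's):

```python
import numpy as np

def info_nce(v: np.ndarray, a: np.ndarray, tau: float = 0.07) -> float:
    """v, a: (N, d) L2-normalized image/audio features of N matched pairs.
    Returns the symmetric InfoNCE loss (image->audio + audio->image)."""
    logits = v @ a.T / tau  # (N, N) pairwise similarity scores

    def ce_diag(l: np.ndarray) -> float:
        # Cross-entropy with the matching pair (diagonal) as the target.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return ce_diag(logits) + ce_diag(logits.T)
```

With perfectly matched features the loss approaches zero; mismatched pairs drive it up, which is what pushes the encoders toward audio-visual correspondence.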

3.3 Video Feature Extraction

The fine-tuned audio and image encoders are deployed to extract features from the whole video sequence. To generate the feature map v_t of each video segment containing multiple image frames, we take the mean of the per-frame feature maps. Afterwards, all feature maps from the video segments are concatenated to generate the video feature V of a particular sequence. Similarly, the audio features a_t of each segment are concatenated to generate the audio feature A of the video sequence, and thus:

V = v_1 ⊕ v_2 ⊕ … ⊕ v_T,   A = a_1 ⊕ a_2 ⊕ … ⊕ a_T

where ⊕ denotes feature concatenation and T denotes the number of segments in a sequence.
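The mean-pooling step above reduces to a single reduction over the frame axis; a small sketch, with shapes chosen purely for illustration:

```python
import numpy as np

def video_features(frame_feats: np.ndarray) -> np.ndarray:
    """frame_feats: (T, F, d) — T segments, F frames per segment, d-dim
    per-frame features. Returns (T, d): one mean feature per segment,
    stacked along the temporal axis (the concatenation above)."""
    return frame_feats.mean(axis=1)

feats = np.ones((10, 16, 512))      # 10 segments x 16 frames x 512 dims
print(video_features(feats).shape)  # -> (10, 512)
```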

3.4 Multi-scale Training with Multi-window Temporal Fusion (MWTF) Transformer

For better discrimination of the local feature variations particularly at the event transition edges, it is required to fuse multi-modal features over short temporal windows. However, the general context of the entire video is essential for better event classification. The proposed Multi-Window Temporal Fusion (MWTF) module effectively solves this issue by incorporating multi-domain attention over various temporal scales of the entire video, an approach that addresses the impact of both local and global variations (Figure  3).

Initially, to re-scale the video feature representation in accordance with the audio representation for early fusion, we adopt the Audio-Guided Video Attention (AGVA) module presented before [24, 37, 7]. The re-scaled video feature and the corresponding audio feature are processed with separate bidirectional long short-term memory (BiLSTM) layers. Afterwards, temporal aggregation of the resulting audio-visual features is carried out to generate the aggregated feature F.

In the MWTF module, we incorporate sub-modules that operate on different timescales depending on the window length w. The basic operations in each sub-module are divided into three phases: split, fusion, and aggregation. In the split phase of a sub-module, the aggregated feature F is segmented into blocks of length w along the temporal axis, generating {F_1, F_2, …, F_{T/w}}. In addition, it is possible to use varying window lengths within a sub-module, as long as they total the number of time steps T.

Following the split action, the multi-domain attention guided fusion operation is carried out on each feature block of the sub-module. The multi-domain attention operation is illustrated in Figure 4. Considering the two-domain distribution of each block F_i, we introduce joint temporal attention (TA) and feature attention (FA) mechanisms by reusing the weights of similar transformations.

Firstly, each block of features F_i is transformed to generate a query vector Q_i, a key vector K_i, and a value vector V_i, such that

Q_i = σ_q(F_i W_q),   K_i = σ_k(F_i W_k),   V_i = σ_v(F_i W_v)

where W_q, W_k, and W_v are learnable projection matrices and σ_q, σ_k, and σ_v denote the corresponding activation functions.

Afterwards, we process the query and key vectors on the temporal and feature domains to generate the score matrices S_i^t and S_i^f, respectively, such that

S_i^t = Q_i K_i^T,   S_i^f = Q_i^T K_i.

Then, we generate the temporal attention map A_i^t by applying softmax over the rows of S_i^t. In addition, the feature attention map A_i^f is generated by applying softmax over the columns of S_i^f.

These multi-domain attention maps are sequentially applied on each axis of V_i to generate the modified feature map F̂_i by

F̂_i = A_i^t V_i A_i^f.

Finally, in the aggregation phase, the modified feature maps of all blocks in a sub-module are temporally concatenated to generate F̂^(w), such that

F̂^(w) = F̂_1 ⊕_t F̂_2 ⊕_t … ⊕_t F̂_{T/w}

where ⊕_t denotes feature concatenation along the temporal axis.

Afterwards, the modified feature maps from all sub-modules are concatenated along the channel axis, maintaining the temporal relation, to generate the fused output F_out as

F_out = F̂^(w_1) ⊕_c F̂^(w_2) ⊕_c … ⊕_c F̂^(w_K)

where ⊕_c denotes feature concatenation along the channel axis and K is the number of sub-modules.
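One MWTF sub-module (split, multi-domain attention, temporal aggregation) can be sketched as below. This is an illustrative reading of the description above, not the authors' exact implementation: the projection matrices, the scaled dot-product form, and the order of applying the two attention maps are our assumptions.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mwtf_submodule(feat, w, Wq, Wk, Wv):
    """feat: (T, d) aggregated audio-visual feature; w: window length
    dividing T; Wq/Wk/Wv: (d, d) projection matrices. Returns (T, d)."""
    T, d = feat.shape
    out = []
    for start in range(0, T, w):                      # split phase
        blk = feat[start:start + w]                   # (w, d) block
        Q, K, V = blk @ Wq, blk @ Wk, blk @ Wv
        At = softmax(Q @ K.T / np.sqrt(d), axis=1)    # (w, w) temporal map
        Af = softmax(Q.T @ K / np.sqrt(w), axis=0)    # (d, d) feature map
        out.append(At @ V @ Af)                       # fuse on both axes
    return np.concatenate(out, axis=0)                # aggregation phase
```

Running sub-modules with different `w` and concatenating their outputs along the channel axis yields the multi-window fusion described above.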

3.5 Event-Guided Temporal Refinement

As the background class represents misaligned audio-visual pairs from all other classes, it often becomes difficult to distinguish it from the event classes in case of subtle variations. To enhance the contrast between the event segments and the background, we introduce supervised event-guided temporal attention (EGTA) over the event region. Following EGTA, we refine the contrasted event segments with an additional stage of single-window fusion for better discrimination of the event categories.

To generate the EGTA mask m, the fusion vector F_out is passed through a BiLSTM module as

m = σ(BiLSTM(F_out) W_e + b_e)

where W_e and b_e are learnable parameters and σ represents the sigmoid activation function.

Afterwards, in the refining phase, we apply the EGTA mask over the fusion vector to generate the event-concentrated vector F_r by

F_r = (m 1^T) ⊙ F_out

where 1 denotes a broadcasting vector with all ones, and ⊙ represents element-wise multiplication.

As a last step, we incorporate a single-window fusion with window length w = T to refine the event-concentrated vector F_r. Finally, we obtain the final event category prediction ŷ after applying another sequential BiLSTM layer as

ŷ_t = softmax(BiLSTM(F_r) W_p + b_p)

where ŷ_t ∈ R^C for the C categories of AVE.
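The gating step of the refinement can be sketched as follows. This is our reading of the masking operation: a per-timestep event score is squashed to [0, 1] and broadcast across the feature axis; the sigmoid gate and the non-residual form are assumptions for illustration.

```python
import numpy as np

def refine(fused: np.ndarray, mask_logits: np.ndarray) -> np.ndarray:
    """fused: (T, d) fusion features; mask_logits: (T,) EGTA scores.
    Event-like timesteps (large positive score) pass through; background
    timesteps (large negative score) are suppressed toward zero."""
    m = 1.0 / (1.0 + np.exp(-mask_logits))  # sigmoid -> event mask in [0, 1]
    return fused * m[:, None]               # broadcast over feature axis
```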

3.6 Loss Function

To guide the event attention for refinement, we use an event label loss L_e. Also, for the multi-class prediction ŷ, an event category loss L_c is incorporated. The total loss is obtained by combining L_e and L_c as

L = λ_1 L_e + λ_2 L_c

where λ_1 and λ_2 denote weighting factors, and the two terms are computed against the binary event labels e_t and the one-hot encoded multi-class event categories y_t over the time-frame, respectively.

3.7 Post-Processing Algorithm during Inference

Due to the sequential nature of AVEs, it is expected that they have high locality over the entire video frame. Therefore, AVEs are typically clustered together and isolated non-AVEs can be viewed as anomalies. We exploit this property to filter the generated event prediction during inference for obtaining the final prediction . Here, we consider a window length to represent the minimum number of consecutive predictions required for considering any change as an anomaly. As such, all non-matching values are corrected according to the prevailing one.
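The filtering described above can be sketched as a sliding-window majority correction. The exact rule is our reading of the text: each per-second prediction is replaced by the prevailing value within a window of k consecutive predictions, so isolated anomalies are overwritten while genuine runs survive.

```python
import numpy as np

def smooth_predictions(preds, k: int = 3) -> np.ndarray:
    """preds: per-second class predictions. Returns the filtered sequence,
    where each position takes the majority value of its k-wide window."""
    src = np.asarray(preds)
    out = src.copy()
    T = len(src)
    for t in range(T):
        lo, hi = max(0, t - k // 2), min(T, t + k // 2 + 1)
        vals, counts = np.unique(src[lo:hi], return_counts=True)
        out[t] = vals[counts.argmax()]  # replace with prevailing value
    return out
```

For example, a lone background frame inside an event run is corrected, while a genuine event/background transition is preserved.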

4 Experiments and Analysis

4.1 Experimental Setup

Audio-Visual Event Dataset

The Audio-Visual Event (AVE) dataset, introduced by Tian et al. [24], is widely used for the audio-visual event localization task. The dataset contains video clips, along with their audio, covering different events including daily human activities, instrument performances, animal actions, and vehicle activities. Each video clip is 10 seconds long with temporal start/end annotations for all events. Following existing work [24, 37, 17], the standard training/validation/test splits are considered for evaluation of the proposed method.

Evaluation Metrics

Following existing work [32, 37, 17, 7], we consider the final classification accuracy of multi-class events over the entire video as the evaluation metric. Along with the background, all annotated event classes are considered for per-second prediction over the video duration, where the video sampling rate varies across clips. The background category includes all the misaligned audio-visual segments that don't belong to any of the main categories.

Implementation Details

We incorporate the pre-trained audio and image encoders from the AudioCLIP [12] framework, which fine-tunes the pre-trained CLIP [19] framework with audio-image pairs extracted from the large-scale AudioSet [9] dataset. The audio encoder is the ESResNeXt model [11], based on the ResNeXt-50 [3] architecture, and the image encoder is a ResNet-50 [13] model. We use a combination of four MWTF modules in our experiments, chosen empirically. The weights of the hybrid loss function are also chosen empirically. For evaluation, each network combination is trained on AMD EPYC CPUs with Quadro GV100 and A100-PCIE-40GB GPUs.

4.2 Comparison of State-of-the-Art methods

AVE-CLIP is compared against several state-of-the-art methods in Table 1. The multi-modal approaches outperform the uni-modal ones by a great margin, as expected given the richer context provided by multi-modal analysis over its uni-modal counterpart. The multi-modal audio-visual fusion strategy plays a very critical role in AVE localization performance. Traditionally, various audio-visual co-attention fusion schemes have been explored to enhance the temporal event features, providing comparable performances. Recently, Lin and Wang [17] introduced a transformer-based multi-modal fusion approach that incorporates instance-level attention to follow visual context over consecutive frames. However, the proposed AVE-CLIP architecture achieves the best performance with an accuracy of 83.7%, outperforming the corresponding transformer-based approach by 6.9%. Moreover, AVE-CLIP provides 5.9% higher accuracy compared to the best-performing co-attention approach proposed by Zhou et al. [37].

Method                               Accuracy (%)
Uni-modal:
  Audio-based [14]                   59.5
  Video-based [23]                   55.3
Multi-modal (with co-attention):
  AVEL [24]                          74.7
  DAM [28]                           74.5
  PSP [37]                           77.8
  AVIN [20]                          75.2
  RFJC [7]                           76.2
Multi-modal (with transformer):
  AV-Transformer [17]                76.8
  AVE-CLIP (ours)                    83.7
Table 1: Performance comparison of the state-of-the-art methods on AVE classification. Various uni-modal methods and multi-modal fusion strategies are compared.

Image/Audio Encoders                          Accuracy (%)
with AudioCLIP encoders (w/o fine-tuning)     81.1
with AudioCLIP encoders (with fine-tuning)    83.7
without AudioCLIP encoders                    79.3
Table 2: Impact of the pre-trained AudioCLIP image and audio encoders with contrastive fine-tuning on the AVE-CLIP framework.

4.3 Ablation Study

To analyze the effect of the individual modules in the proposed AVE-CLIP, we carried out a detailed ablation study over the baseline approach. The final AVE-CLIP architecture integrates the best-performing configurations of the building blocks.

Strategy                                              Accuracy (%)
with PSP [37]                                         77.8
with AV-Transformer [17]                              76.8
with only MWTF and AudioCLIP encoders (ours)          82.0
with MWTF + Refiner (ours)                            82.5
with MWTF + EGTA + Refiner (ours)                     83.2
with MWTF + EGTA + Refiner + Post-Processing (ours)   83.7
Table 3: Impact of various building blocks on the performance of the proposed AVE-CLIP architecture.
                  Window Length, w
Attention domain   2s     5s     10s    Variable
Temporal           78.1   79.3   81.2   79.4
Feature            78.5   79.1   81.6   79.2
Multi-domain       79.0   79.8   81.6   79.7
Table 4: Accuracy (%) obtained with a single fusion block of the MWTF module for different attention domains with various window lengths. The variable window length denotes a combination of windows.
Window combination     Acc. (%)   Acc. (%) (with two att.)
10s                    –          81.6
10s + 5s               82.4       82.7
10s + 3s*              82.1       82.5
5s + 2s                80.6       81.2
10s + 5s + 3s*         83.2       82.8
10s + 5s + 2s          82.9       83.0
10s + 5s + 3s* + 2s    83.7       83.3
Table 5: Impact of using different combinations of windows in the Multi-window Temporal Fusion (MWTF) module. '*' indicates the use of variable window lengths.

Effect of Contrastive Fine-tuning in Encoders

We incorporate pre-trained image and audio encoders from AudioCLIP into the AVE-CLIP framework, subjected to contrastive fine-tuning in training stage 1 (Section 3.2). The effect of these encoders on the final performance of AVE-CLIP is summarized in Table 2. For the baseline comparisons, we adopted a VGG-19 backbone pre-trained on ImageNet [5] to extract the video features, and a VGG-like network pre-trained on AudioSet [9] to extract audio features, following existing work [32, 17, 37]. We observe that the best performance of 83.7% is achieved by the AudioCLIP encoders with contrastive fine-tuning, which improves accuracy by 4.4% over the uni-modal encoders. Moreover, the contrastive fine-tuning phase improves the accuracy of AVE-CLIP by 2.6%, which shows its effectiveness for AVE localization.

Effect of Multi-window Temporal Fusion (MWTF)

For analyzing the effect of the MWTF module in AVE-CLIP, the rest of the modules (temporal refining, post-processing) are replaced with simple fully connected layers followed by a softmax classifier. Moreover, to compare with other multi-modal fusion schemes, the PSP-based fusion [37] and the AV-Transformer-based fusion [17] are considered. The resulting accuracy is summarized in Table 3. The proposed MWTF module with AudioCLIP encoders provides significant improvements over the existing counterparts, which shows its effectiveness.

The MWTF module provides a generic framework to operate on various temporal resolutions with shared weights for effectively guiding the multi-modal attention. To analyze the effect of various temporal window lengths, the performance when using a single fusion block in the MWTF module is explored in Table 4. The model performs better with increasing window length, achieving the best performance with the 10 s window length. Moreover, we observe consistent performance improvements with multi-domain attention over the single-domain counterparts. Although the fusion scheme with a smaller attention window achieves more discriminating features emphasizing high-frequency local variations, it misses the global context which is particularly significant for differentiating event categories.

The performance for combinations of varying window lengths is provided in Table 5. By combining smaller fusion windows with the 10 s window, performance increases considerably when compared with the baseline. Despite the lower performance of smaller window lengths in isolation, these configurations are better at extracting discriminatory features, which are particularly critical for determining the sharp edges of event transitions. Hence, the combination of larger and smaller window features is effective for generalizing global low-frequency features as well as local high-frequency variations at the transition edges. Moreover, we observe that independent weights over the different fusion modules perform better than their shared counterparts with fewer windows. However, when the number of windows increases, such advantages appear to shrink due to the increased complexity of the fusion mechanism.

Effect of Event-Guided Temporal Feature Refining

Downstream from the fusion module, AVE-CLIP includes temporal feature refining, which consists of two phases: event-guided temporal attention (EGTA) mask generation, and the corresponding feature refinement. Table 3 shows the effect of temporal refinement with EGTA, which produces an improvement of 1.2% in accuracy over the MWTF-only configuration. Moreover, the performance of different combinations for feature refining is provided in Table 6. It is possible to generate the EGTA mask without event-guided supervision, which simplifies the loss function to a simple cross-entropy loss. However, with event-label supervision, the model distinguishes the event frames better against backgrounds, which in turn provides better performance. For the refiner, the single-window fusion with a 10 s window generates the best performance, since multi-window fusion becomes gradually saturated in this phase.

Effect of Post-Processing Algorithm

Considering the sequential nature of the events, the proposed post-processing method is found to be very effective for achieving better predictions during inference. As the prediction of the event category is generated on a per-second basis, incorrect predictions can be reduced by considering a window of consecutive predictions. The effects of different window lengths on the post-processing method are summarized in Table 7. The best performance is achieved for a 3 s window length. With a smaller window length, the effect of filtering is reduced over longer events, whereas larger windows reduce the performance on shorter events.

Method                          Accuracy (%)
EGTA     with supervision       83.7
         without supervision    83.1
Refiner  window = 10s           83.7
         window = 5s            83.2
         window = (10s + 5s)    83.6
Table 6: The effect of event label supervision on the Event-Guided Temporal Attention (EGTA) module and the effect of various fusion window lengths on the Refiner module.

Window length   1s     2s     3s     4s     5s
Accuracy (%)    83.2   83.5   83.7   82.9   82.3
Table 7: The effect of different window lengths in the post-processing module during inference. The best-performing modules are considered in the AVE-CLIP architecture for the baseline.

4.4 Qualitative Analysis

The qualitative performance of the proposed AVE-CLIP is demonstrated in Figure 5 for two audio-visual events. For comparative analysis, we also show the performance of the PSP model [37]. In the first example, the AVE is a moving helicopter. Though the helicopter is visible in the first frame, that frame is background due to the absence of the flying-helicopter sound. Only the middle three frames capture the AVE through audio-visual correspondence. Our proposed method perfectly distinguishes the helicopter event, whereas PSP [37] fails at the challenging first frame. The second example, a person playing the violin, is very challenging given that the violin is hardly visible. Though the sound of the violin is present throughout, the violin is visible in only a few frames, which represent the AVE. The PSP [37] method generates some incorrect predictions at event transitions. However, the proposed AVE-CLIP perfectly distinguishes the event frames, which demonstrates its effectiveness in generalizing local variations. Furthermore, AVE-CLIP achieves better performance in many challenging cases that demand different scales of temporal reasoning throughout the video.

Figure 5: Visual representation of the performance of PSP [37] and AVE-CLIP on two AVEs (helicopter and violin). AVE-CLIP localizes event transitions better.

5 Conclusion

In this paper, we introduced AVE-CLIP, which combines AudioCLIP encoders with a multi-scale temporal-fusion transformer architecture to improve AVE localization performance. We show that AudioCLIP encoders with contrastive fine-tuning significantly improve multi-modal representations for AVE localization. Our results show that local feature variations are essential for detecting event transitions, while global variations are critical for identifying different event classes. The proposed multi-window fusion module exploits both local and global variations with multi-domain attention, thereby significantly improving performance. Temporal refining of the event frames simplifies the event classification task, which improves multi-class AVE localization performance. Finally, by exploiting the sequential nature of AVEs with a simple post-processing scheme, we achieve state-of-the-art performance on the AVE dataset.


This research was supported in part by the Office of Naval Research, Minerva Program, and a UT Cockrell School of Engineering Doctoral Fellowship.


  • [1] T. Afouras, A. Owens, J. S. Chung, and A. Zisserman (2020) Self-supervised learning of audio-visual objects from video. In ECCV, pp. 208–224. Cited by: §2.
  • [2] P. Chakraborty, S. Ahmed, M. A. Yousuf, A. Azad, S. A. Alyami, and M. A. Moni (2021) A human-robot interaction system calculating visual focus of human’s attention level. IEEE Access 9, pp. 93409–93421. Cited by: §1.
  • [3] F. Chollet (2017) Xception: deep learning with depthwise separable convolutions. In CVPR, pp. 1251–1258. Cited by: §4.1.
  • [4] A. Das, S. Datta, G. Gkioxari, S. Lee, D. Parikh, and D. Batra (2018) Embodied question answering. In CVPR, pp. 1–10. Cited by: §1.
  • [5] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In CVPR, pp. 248–255. Cited by: §1, §4.3.
  • [6] G. Du, M. Chen, C. Liu, B. Zhang, and P. Zhang (2018) Online robot teaching with natural human–robot interaction. IEEE Transactions on Industrial Electronics 65 (12), pp. 9571–9581. Cited by: §1.
  • [7] B. Duan, H. Tang, W. Wang, Z. Zong, G. Yang, and Y. Yan (2021) Audio-visual event localization via recursive fusion by joint co-attention. In WACV, pp. 4013–4022. Cited by: §1, §1, §1, §2, §3.4, §4.1, Table 1.
  • [8] C. Gan, Y. Zhang, J. Wu, B. Gong, and J. B. Tenenbaum (2020) Look, listen, and act: towards audio-visual embodied navigation. In ICRA, pp. 9701–9707. Cited by: §1.
  • [9] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter (2017) Audio set: an ontology and human-labeled dataset for audio events. In ICASSP, pp. 776–780. Cited by: §1, §2, §4.1, §4.3.
  • [10] E. Ghaleb, M. Popa, and S. Asteriadis (2019) Multimodal and temporal perception of audio-visual cues for emotion recognition. In ACII, pp. 552–558. Cited by: §1.
  • [11] A. Guzhov, F. Raue, J. Hees, and A. Dengel (2021) ESResNe(X)t-fbsp: learning robust time-frequency transformation of audio. In IJCNN, pp. 1–8. Cited by: §4.1.
  • [12] A. Guzhov, F. Raue, J. Hees, and A. Dengel (2022) AudioCLIP: extending CLIP to image, text and audio. In ICASSP, pp. 976–980. Cited by: Figure 2, §1, §2, §3.2, §4.1.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §4.1.
  • [14] S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, et al. (2017) CNN architectures for large-scale audio classification. In ICASSP, pp. 131–135. Cited by: Table 1.
  • [15] D. Hu, R. Qian, M. Jiang, X. Tan, S. Wen, E. Ding, W. Lin, and D. Dou (2020) Discriminative sounding objects localization via self-supervised audiovisual matching. NeurIPS 33, pp. 10077–10087. Cited by: §2.
  • [16] Y. Li, S. Xie, X. Chen, P. Dollar, K. He, and R. Girshick (2021) Benchmarking detection transfer learning with vision transformers. arXiv preprint arXiv:2111.11429. Cited by: §2.
  • [17] Y. Lin and Y. F. Wang (2020) Audiovisual transformer with instance attention for audio-visual event localization. In ACCV, Cited by: §1, §1, §2, §4.1, §4.1, §4.2, §4.3, §4.3, Table 1, Table 3.
  • [18] R. Qian, D. Hu, H. Dinkel, M. Wu, N. Xu, and W. Lin (2020) Multiple sound sources localization from coarse to fine. In ECCV, pp. 292–308. Cited by: §1, §2.
  • [19] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. (2021) Learning transferable visual models from natural language supervision. In ICML, pp. 8748–8763. Cited by: §1, §4.1.
  • [20] J. Ramaswamy (2020) What makes the sound?: a dual-modality interacting network for audio-visual event localization. In ICASSP, pp. 4372–4376. Cited by: Table 1.
  • [21] A. Sanghi, H. Chu, J. G. Lambourne, Y. Wang, C. Cheng, M. Fumero, and K. R. Malekshan (2022) CLIP-Forge: towards zero-shot text-to-shape generation. In CVPR, pp. 18603–18613. Cited by: §2.
  • [22] A. Senocak, H. Ryu, J. Kim, and I. S. Kweon (2022) Less can be more: sound source localization with a classification model. In WACV, pp. 3308–3317. Cited by: §1, §2.
  • [23] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: Table 1.
  • [24] Y. Tian, J. Shi, B. Li, Z. Duan, and C. Xu (2018) Audio-visual event localization in unconstrained videos. In ECCV, pp. 247–263. Cited by: §1, §2, §3.4, §4.1, Table 1.
  • [25] A. Tsiami, P. P. Filntisis, N. Efthymiou, P. Koutras, G. Potamianos, and P. Maragos (2018) Far-field audio-visual scene perception of multi-party human-robot interaction for children and adults. In ICASSP, pp. 6568–6572. Cited by: §1.
  • [26] C. Wang, M. Chai, M. He, D. Chen, and J. Liao (2022) CLIP-NeRF: text-and-image driven manipulation of neural radiance fields. In CVPR, pp. 3835–3844. Cited by: §2.
  • [27] Y. Wu, R. Ayyalasomayajula, M. J. Bianco, D. Bharadia, and P. Gerstoft (2021) Sslide: sound source localization for indoors based on deep learning. In ICASSP, pp. 4680–4684. Cited by: §1, §2.
  • [28] Y. Wu, L. Zhu, Y. Yan, and Y. Yang (2019) Dual attention matching for audio-visual event localization. In ICCV, pp. 6292–6300. Cited by: §2, Table 1.
  • [29] F. Xia, A. R. Zamir, Z. He, A. Sax, J. Malik, and S. Savarese (2018) Gibson env: real-world perception for embodied agents. In CVPR, pp. 9068–9079. Cited by: §1.
  • [30] H. Xu, R. Zeng, Q. Wu, M. Tan, and C. Gan (2020) Cross-modal relation-aware networks for audio-visual event localization. In ACM Multimedia, pp. 3893–3901. Cited by: §1, §1, §2.
  • [31] X. Xu, B. Dai, and D. Lin (2019) Recursive visual sound separation using minus-plus net. In ICCV, pp. 882–891. Cited by: §1.
  • [32] H. Xuan, L. Luo, Z. Zhang, J. Yang, and Y. Yan (2021) Discriminative cross-modality attention network for temporal inconsistent audio-visual event localization. IEEE Transactions on Image Processing 30, pp. 7878–7888. Cited by: §1, §1, §2, §4.1, §4.3.
  • [33] H. Xuan, Z. Zhang, S. Chen, J. Yang, and Y. Yan (2020) Cross-modal attention network for temporal inconsistent audio-visual event localization. In AAAI, Vol. 34, pp. 279–286. Cited by: §2.
  • [34] R. Zhang, Z. Guo, W. Zhang, K. Li, X. Miao, B. Cui, Y. Qiao, P. Gao, and H. Li (2022) PointCLIP: point cloud understanding by CLIP. In CVPR, pp. 8552–8562. Cited by: §2.
  • [35] H. Zhao, C. Gan, A. Rouditchenko, C. Vondrick, J. McDermott, and A. Torralba (2018) The sound of pixels. In ECCV, pp. 570–586. Cited by: §1, §2.
  • [36] H. Zhou, X. Xu, D. Lin, X. Wang, and Z. Liu (2020) Sep-stereo: visually guided stereophonic audio generation by associating source separation. In ECCV, pp. 52–69. Cited by: §1.
  • [37] J. Zhou, L. Zheng, Y. Zhong, S. Hao, and M. Wang (2021) Positive sample propagation along the audio-visual event line. In CVPR, pp. 8436–8444. Cited by: §1, §1, §2, §3.4, Figure 5, §4.1, §4.1, §4.2, §4.3, §4.3, §4.4, Table 1, Table 3.
  • [38] Z. Zhu, W. Wu, W. Zou, and J. Yan (2018) End-to-end flow correlation tracking with spatial-temporal attention. In CVPR, pp. 548–557. Cited by: §1.