Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning

08/13/2020
by Ying Cheng et al.

When watching videos, a visual event is often accompanied by an audio event, e.g., speech accompanying lip motion or music from a played instrument. This underlying correlation between audio and visual events can serve as a free supervisory signal for training a neural network on the pretext task of audio-visual synchronization. In this paper, we propose a novel self-supervised framework with a co-attention mechanism that learns generic cross-modal representations from unlabelled videos in the wild and benefits downstream tasks. Specifically, we explore three different co-attention modules that focus on the discriminative visual regions correlated with the sounds, and capture the interactions between them. Experiments show that our model achieves state-of-the-art performance on the pretext task while using fewer parameters than existing methods. To further evaluate the generalizability and transferability of our approach, we apply the pre-trained model to two downstream tasks, i.e., sound source localization and action recognition. Extensive experiments demonstrate that our model is competitive with other self-supervised methods, and that it can handle challenging scenes containing multiple sound sources.
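To make the two ingredients named in the abstract concrete, here is a minimal PyTorch sketch of a cross-modal co-attention block and a binary synchronization head for the pretext task. This is an illustrative sketch, not the authors' exact architecture: the feature dimensions, scaled dot-product attention, residual connections, mean pooling, and the names CoAttention and SyncHead are all assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoAttention(nn.Module):
    """One plausible cross-modal co-attention block (hypothetical sketch):
    visual tokens attend over audio tokens and vice versa."""
    def __init__(self, dim=512):
        super().__init__()
        self.q_v, self.k_a, self.v_a = (nn.Linear(dim, dim) for _ in range(3))
        self.q_a, self.k_v, self.v_v = (nn.Linear(dim, dim) for _ in range(3))
        self.scale = dim ** -0.5

    def forward(self, vis, aud):
        # vis: (B, Nv, D) spatial visual tokens; aud: (B, Na, D) audio tokens.
        # Visual queries attend over audio keys/values (residual update).
        attn_va = F.softmax(self.q_v(vis) @ self.k_a(aud).transpose(1, 2) * self.scale, dim=-1)
        vis_out = vis + attn_va @ self.v_a(aud)
        # Audio queries attend over visual keys/values.
        attn_av = F.softmax(self.q_a(aud) @ self.k_v(vis).transpose(1, 2) * self.scale, dim=-1)
        aud_out = aud + attn_av @ self.v_v(vis)
        return vis_out, aud_out

class SyncHead(nn.Module):
    """Binary head for the synchronization pretext task: predict whether
    the audio and visual streams are temporally aligned."""
    def __init__(self, dim=512):
        super().__init__()
        self.fc = nn.Linear(2 * dim, 1)

    def forward(self, vis, aud):
        # Pool tokens, concatenate the two modalities, classify sync/off-sync.
        fused = torch.cat([vis.mean(dim=1), aud.mean(dim=1)], dim=-1)
        return self.fc(fused).squeeze(-1)  # logits for BCEWithLogitsLoss

# Toy usage with random tensors (shapes are illustrative assumptions).
vis = torch.randn(8, 49, 512)   # e.g., a 7x7 visual feature map, flattened
aud = torch.randn(8, 16, 512)   # e.g., 16 spectrogram time steps
co_attn, head = CoAttention(), SyncHead()
v, a = co_attn(vis, aud)
logits = head(v, a)
labels = torch.randint(0, 2, (8,)).float()  # 1 = synchronized pair
loss = F.binary_cross_entropy_with_logits(logits, labels)
```

In such a setup, negative pairs would typically be formed by pairing a video clip with audio from a different or temporally shifted clip, so the head learns to detect misalignment without any manual labels.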
