Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Temporal Synchronicity

11/09/2021
by   Pritam Sarkar, et al.
We present CrissCross, a self-supervised framework for learning audio-visual representations. Our framework introduces a novel notion: in addition to learning intra-modal and standard 'synchronous' cross-modal relations, CrissCross also learns 'asynchronous' cross-modal relations. We show that by relaxing the temporal synchronicity between the audio and visual modalities, the network learns strong time-invariant representations. Our experiments show that strong augmentations for both modalities, combined with relaxed cross-modal temporal synchronicity, optimize performance. To pretrain the proposed framework, we use three datasets of varying sizes: Kinetics-Sound, Kinetics-400, and AudioSet. The learned representations are evaluated on a number of downstream tasks, namely action recognition, sound classification, and retrieval. CrissCross achieves state-of-the-art performance on action recognition (UCF101 and HMDB51) and sound classification (ESC50). The code and pretrained models will be made publicly available.
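The core sampling idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; the function name `sample_av_pairs` and the index-based clip sampling are illustrative assumptions, showing only how a 'synchronous' pair (audio and visual clips starting at the same time) differs from an 'asynchronous' pair (independently drawn audio start, relaxing temporal alignment):

```python
import random

def sample_av_pairs(num_frames, clip_len, sync=True, rng=None):
    """Sample start indices for a visual clip and an audio clip.

    With sync=True the two clips start at the same frame (the standard
    'synchronous' cross-modal pair). With sync=False the audio start is
    drawn independently of the visual start, yielding an 'asynchronous'
    pair that relaxes cross-modal temporal synchronicity.

    Note: names and sampling scheme are hypothetical, for illustration only.
    """
    rng = rng or random.Random()
    last_start = num_frames - clip_len  # last valid clip start index
    v_start = rng.randint(0, last_start)
    a_start = v_start if sync else rng.randint(0, last_start)
    return v_start, a_start
```

Under this sketch, a pretraining batch could mix both kinds of pairs, so the network sees positives that agree in time as well as positives drawn from different moments of the same video.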

