Audio-Visual Contrastive Learning with Temporal Self-Supervision

02/15/2023
by Simon Jenni, et al.

We propose a self-supervised learning approach for videos that learns representations of both the RGB frames and the accompanying audio without human supervision. Unlike images, which capture only static scene appearance, videos also contain sound and temporal scene dynamics. To leverage the temporal and aural dimensions inherent to videos, our method extends temporal self-supervision to the audio-visual setting and integrates it with multi-modal contrastive objectives. As temporal self-supervision, we pose playback speed and direction recognition in both modalities and propose intra- and inter-modal temporal ordering tasks. Furthermore, we design a novel contrastive objective in which the usual pairs are supplemented with additional sample-dependent positives and negatives sampled from the evolving feature space. In our model, we apply such losses among video clips and between videos and their temporally corresponding audio clips. We verify our model design in extensive ablation experiments and evaluate the video and audio representations in transfer experiments on action recognition and retrieval on UCF101 and HMDB51, audio classification on ESC50, and robust video fingerprinting on VGG-Sound, achieving state-of-the-art results.
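
To make the playback speed and direction recognition task mentioned above more concrete, the following is a minimal sketch of how one training example for such a pretext task could be built. It is not the authors' implementation: the function name make_temporal_sample, the SPEEDS frame-skip factors, and the clip length are all hypothetical choices made for illustration, and the "frames" are plain integers standing in for decoded video frames. The same indexing could presumably be applied to windows of an audio signal, but that is an assumption as well.

```python
import random

# Hypothetical frame-skip factors; these values are assumed, not taken from the paper.
SPEEDS = (1, 2, 4, 8)


def make_temporal_sample(frames, clip_len, rng):
    """Build one pretext example: (clip, speed_label, direction_label)."""
    # Pick a playback speed class and the matching frame-skip step.
    speed_label = rng.randrange(len(SPEEDS))
    step = SPEEDS[speed_label]

    # Number of source frames the clip spans at this speed.
    span = (clip_len - 1) * step + 1
    if span > len(frames):
        raise ValueError("sequence too short for the chosen speed")

    # Pick a random start and subsample every `step`-th frame.
    start = rng.randrange(len(frames) - span + 1)
    clip = frames[start:start + span:step]

    # Pick a playback direction: 0 = forward, 1 = reversed.
    direction_label = rng.randrange(2)
    if direction_label == 1:
        clip = clip[::-1]

    return clip, speed_label, direction_label


rng = random.Random(0)
frames = list(range(64))  # stand-in for decoded frames (or audio windows)
clip, speed_label, direction_label = make_temporal_sample(frames, 8, rng)
```

A classifier trained on top of the clip encoder would then predict speed_label and direction_label from the clip alone, so the labels come for free from the sampling procedure rather than from human annotation.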


research
12/15/2022

MAViL: Masked Audio-Video Learners

We present Masked Audio-Video Learners (MAViL) to train audio-visual rep...
research
12/01/2021

PreViTS: Contrastive Pretraining with Video Tracking Supervision

Videos are a rich source for self-supervised learning (SSL) of visual re...
research
04/06/2021

Strumming to the Beat: Audio-Conditioned Contrastive Video Textures

We introduce a non-parametric approach for infinite video texture synthe...
research
04/26/2022

Robust Audio-Visual Instance Discrimination via Active Contrastive Set Mining

The recent success of audio-visual representation learning can be largel...
research
05/12/2022

Weakly-Supervised Action Detection Guided by Audio Narration

Videos are more well-organized curated data sources for visual concept l...
research
06/24/2020

Labelling unlabelled videos from scratch with multi-modal self-supervision

A large part of the current success of deep learning lies in the effecti...
research
04/29/2022

On Negative Sampling for Audio-Visual Contrastive Learning from Movies

The abundance and ease of utilizing sound, along with the fact that audi...
