Co-Training of Audio and Video Representations from Self-Supervised Temporal Synchronization

06/30/2018
by Bruno Korbar, et al.

There is a natural correlation between the visual and auditory elements of a video. In this work we leverage this connection to learn general and effective features for both audio and video analysis from self-supervised temporal synchronization. We demonstrate that a calibrated curriculum learning scheme, a careful choice of negative examples, and the use of a contrastive loss are critical ingredients for obtaining powerful multi-sensory representations from models optimized to discern temporal synchronization of audio-video pairs. Without further fine-tuning, the resulting audio features achieve performance superior or comparable to the state of the art on established audio classification benchmarks (DCASE2014 and ESC-50). At the same time, our visual subnet provides a very effective initialization to improve the accuracy of video-based action recognition models: compared to learning from scratch, our self-supervised pretraining yields a remarkable gain of +16.7 recognition accuracy on UCF101 and a boost of +13.0
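
To make the role of the contrastive loss more concrete, the following is a minimal sketch of the standard margin-based contrastive loss applied to audio-video pairs. The abstract does not give the exact formulation, so the Euclidean distance, the default margin, and the names `contrastive_loss`, `video_feats`, `audio_feats`, and `labels` are all illustrative assumptions rather than the authors' implementation.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(video_feats, audio_feats, labels, margin=1.0):
    """Mean margin-based contrastive loss over a batch of audio-video pairs.

    labels[i] is 1 when pair i is temporally synchronized (positive)
    and 0 when it is not (negative). The margin value is an assumption.
    """
    total = 0.0
    for v, a, y in zip(video_feats, audio_feats, labels):
        d = euclidean_distance(v, a)
        # Positive pairs are pulled together (penalize squared distance).
        # Negative pairs are pushed apart until they exceed the margin.
        total += y * d ** 2 + (1 - y) * max(margin - d, 0.0) ** 2
    return total / len(labels)


if __name__ == "__main__":
    # Hypothetical 2-D embeddings: one in-sync pair, one out-of-sync pair.
    video = [[0.1, 0.2], [0.9, 0.4]]
    audio = [[0.1, 0.25], [0.1, 0.3]]
    print(contrastive_loss(video, audio, labels=[1, 0]))
```

Under this formulation, a synchronized pair contributes loss in proportion to how far apart its two embeddings are, while an out-of-sync pair contributes only while its embeddings are closer than the margin.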


research
11/03/2020

Learning Representations from Audio-Visual Spatial Alignment

We introduce a novel self-supervised pretext task for learning represent...
research
07/16/2022

LAVA: Language Audio Vision Alignment for Contrastive Video Pre-Training

Generating representations of video data is of key importance in advanci...
research
01/23/2020

Audiovisual SlowFast Networks for Video Recognition

We present Audiovisual SlowFast Networks, an architecture for integrated...
research
02/13/2020

Self-supervised learning for audio-visual speaker diarization

Speaker diarization, which is to find the speech segments of specific sp...
research
01/16/2020

Learning Spatiotemporal Features via Video and Text Pair Discrimination

Current video representations heavily rely on learning from manually ann...
research
03/29/2021

Robust Audio-Visual Instance Discrimination

We present a self-supervised learning method to learn audio and video re...
research
12/14/2018

On Attention Modules for Audio-Visual Synchronization

With the development of media and networking technologies, multimedia ap...
