Contrastive Learning of Global and Local Audio-Visual Representations

04/07/2021
by   Shuang Ma, et al.
0

Contrastive learning has delivered impressive results in many audio-visual representation learning scenarios. However, existing approaches optimize for learning either global representations useful for tasks such as classification, or local representations useful for tasks such as audio-visual source localization and separation. While they produce satisfactory results in their intended downstream scenarios, they often fail to generalize to tasks that they were not originally designed for. In this work, we propose a versatile self-supervised approach to learn audio-visual representations that generalize to both the tasks which require global semantic information (e.g., classification) and the tasks that require fine-grained spatio-temporal information (e.g. localization). We achieve this by optimizing two cross-modal contrastive objectives that together encourage our model to learn discriminative global-local visual information given audio signals. To show that our approach learns generalizable video representations, we evaluate it on various downstream scenarios including action/sound classification, lip reading, deepfake detection, and sound source localization.

READ FULL TEXT

page 1

page 8

research
11/09/2021

Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Temporal Synchronicity

We present CrissCross, a self-supervised framework for learning audio-vi...
research
11/03/2020

Learning Representations from Audio-Visual Spatial Alignment

We introduce a novel self-supervised pretext task for learning represent...
research
07/18/2022

Contrastive Environmental Sound Representation Learning

Machine hearing of the environmental sound is one of the important issue...
research
11/30/2019

Probing the State of the Art: A Critical Look at Visual Representation Evaluation

Self-supervised research improved greatly over the past half decade, wit...
research
06/26/2022

Exploiting Transformation Invariance and Equivariance for Self-supervised Sound Localisation

We present a simple yet effective self-supervised framework for audio-vi...
research
08/13/2020

What Should Not Be Contrastive in Contrastive Learning

Recent self-supervised contrastive methods have been able to produce imp...
research
04/29/2022

On Negative Sampling for Audio-Visual Contrastive Learning from Movies

The abundance and ease of utilizing sound, along with the fact that audi...

Please sign up or login with your details

Forgot password? Click here to reset