Contrastive Learning of Global and Local Audio-Visual Representations

04/07/2021
by   Shuang Ma, et al.
0

Contrastive learning has delivered impressive results in many audio-visual representation learning scenarios. However, existing approaches optimize for learning either global representations useful for tasks such as classification, or local representations useful for tasks such as audio-visual source localization and separation. While they produce satisfactory results in their intended downstream scenarios, they often fail to generalize to tasks that they were not originally designed for. In this work, we propose a versatile self-supervised approach to learn audio-visual representations that generalize to both the tasks which require global semantic information (e.g., classification) and the tasks that require fine-grained spatio-temporal information (e.g. localization). We achieve this by optimizing two cross-modal contrastive objectives that together encourage our model to learn discriminative global-local visual information given audio signals. To show that our approach learns generalizable video representations, we evaluate it on various downstream scenarios including action/sound classification, lip reading, deepfake detection, and sound source localization.

READ FULL TEXT
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 8

11/09/2021

Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Temporal Synchronicity

We present CrissCross, a self-supervised framework for learning audio-vi...
11/30/2019

Probing the State of the Art: A Critical Look at Visual Representation Evaluation

Self-supervised research improved greatly over the past half decade, wit...
08/13/2020

What Should Not Be Contrastive in Contrastive Learning

Recent self-supervised contrastive methods have been able to produce imp...
08/14/2020

Self-supervised Contrastive Video-Speech Representation Learning for Ultrasound

In medical imaging, manual annotations can be expensive to acquire and s...
04/01/2021

Enriched Music Representations with Multiple Cross-modal Contrastive Learning

Modeling various aspects that make a music piece unique is a challenging...
09/18/2021

V-SlowFast Network for Efficient Visual Sound Separation

The objective of this paper is to perform visual sound separation: i) we...
11/23/2021

Towards Learning Universal Audio Representations

The ability to learn universal audio representations that can solve dive...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.