Contrastive Learning of Global and Local Audio-Visual Representations

by   Shuang Ma, et al.

Contrastive learning has delivered impressive results in many audio-visual representation learning scenarios. However, existing approaches optimize for learning either global representations useful for tasks such as classification, or local representations useful for tasks such as audio-visual source localization and separation. While they produce satisfactory results in their intended downstream scenarios, they often fail to generalize to tasks that they were not originally designed for. In this work, we propose a versatile self-supervised approach to learn audio-visual representations that generalize to both the tasks which require global semantic information (e.g., classification) and the tasks that require fine-grained spatio-temporal information (e.g. localization). We achieve this by optimizing two cross-modal contrastive objectives that together encourage our model to learn discriminative global-local visual information given audio signals. To show that our approach learns generalizable video representations, we evaluate it on various downstream scenarios including action/sound classification, lip reading, deepfake detection, and sound source localization.



There are no comments yet.


page 1

page 8


Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Temporal Synchronicity

We present CrissCross, a self-supervised framework for learning audio-vi...

Probing the State of the Art: A Critical Look at Visual Representation Evaluation

Self-supervised research improved greatly over the past half decade, wit...

What Should Not Be Contrastive in Contrastive Learning

Recent self-supervised contrastive methods have been able to produce imp...

Self-supervised Contrastive Video-Speech Representation Learning for Ultrasound

In medical imaging, manual annotations can be expensive to acquire and s...

Enriched Music Representations with Multiple Cross-modal Contrastive Learning

Modeling various aspects that make a music piece unique is a challenging...

V-SlowFast Network for Efficient Visual Sound Separation

The objective of this paper is to perform visual sound separation: i) we...

Towards Learning Universal Audio Representations

The ability to learn universal audio representations that can solve dive...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.