Robust Audio-Visual Instance Discrimination

03/29/2021
by   Pedro Morgado, et al.
0

We present a self-supervised learning method to learn audio and video representations. Prior work uses the natural correspondence between audio and video to define a standard cross-modal instance discrimination task, where a model is trained to match representations from the two modalities. However, the standard approach introduces two sources of training noise. First, audio-visual correspondences often produce faulty positives since the audio and video signals can be uninformative of each other. To limit the detrimental impact of faulty positives, we optimize a weighted contrastive learning loss, which down-weighs their contribution to the overall loss. Second, since self-supervised contrastive learning relies on random sampling of negative instances, instances that are semantically similar to the base instance can be used as faulty negatives. To alleviate the impact of faulty negatives, we propose to optimize an instance discrimination loss with a soft target distribution that estimates relationships between instances. We validate our contributions through extensive experiments on action recognition tasks and show that they address the problems of audio-visual instance discrimination and improve transfer learning performance.

READ FULL TEXT

page 4

page 14

page 15

research
04/27/2020

Audio-Visual Instance Discrimination with Cross-Modal Agreement

We present a self-supervised learning approach to learn audio-visual rep...
research
04/26/2022

Robust Audio-Visual Instance Discrimination via Active Contrastive Set Mining

The recent success of audio-visual representation learning can be largel...
research
11/09/2021

Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Temporal Synchronicity

We present CrissCross, a self-supervised framework for learning audio-vi...
research
06/30/2018

Co-Training of Audio and Video Representations from Self-Supervised Temporal Synchronization

There is a natural correlation between the visual and auditive elements ...
research
03/30/2023

Soft Neighbors are Positive Supporters in Contrastive Visual Representation Learning

Contrastive learning methods train visual encoders by comparing views fr...
research
04/28/2022

Unsupervised Voice-Face Representation Learning by Cross-Modal Prototype Contrast

We present an approach to learn voice-face representations from the talk...
research
10/13/2020

Are all negatives created equal in contrastive instance discrimination?

Self-supervised learning has recently begun to rival supervised learning...

Please sign up or login with your details

Forgot password? Click here to reset