Audio-Visual Instance Discrimination with Cross-Modal Agreement

04/27/2020
by Pedro Morgado, et al.

We present a self-supervised learning approach to learn audio-visual representations from video and audio. Our method uses contrastive learning for cross-modal discrimination of video from audio and vice versa. We show that optimizing for cross-modal discrimination, rather than within-modal discrimination, is important to learn good representations from video and audio. With this simple but powerful insight, our method achieves state-of-the-art results when finetuned on action recognition tasks. While recent work in contrastive learning defines positive and negative samples as individual instances, we generalize this definition by exploring cross-modal agreement. We group together multiple instances as positives by measuring their similarity in both the video and the audio feature spaces. Cross-modal agreement creates better positive and negative sets, and allows us to calibrate visual similarities by seeking within-modal discrimination of positive instances.
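The two ideas in the abstract, cross-modal discrimination and cross-modal agreement, can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The function names, the InfoNCE-style form of the loss, the temperature value, and the use of the minimum of the video and audio similarities as the agreement score are all assumptions made for illustration; the abstract does not specify them. Embeddings are assumed to be non-zero vectors given as lists of floats, with the i-th video and i-th audio embedding coming from the same instance.

```python
import math


def l2_normalize(v):
    # Scale a non-zero vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_modal_nce(queries, keys, temperature=0.07):
    # Assumed InfoNCE-style loss: the i-th query (one modality) must pick out
    # the i-th key (the other modality) among all keys in the batch.
    queries = [l2_normalize(q) for q in queries]
    keys = [l2_normalize(k) for k in keys]
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, k) / temperature for k in keys]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


def cross_modal_loss(video, audio, temperature=0.07):
    # Discriminate video from audio and audio from video (both directions).
    return (cross_modal_nce(video, audio, temperature)
            + cross_modal_nce(audio, video, temperature))


def agreement_positives(video, audio, index, k=1):
    # Hypothetical agreement score: an instance j counts as similar to `index`
    # only if it is similar in BOTH feature spaces, so we take the minimum.
    v = [l2_normalize(x) for x in video]
    a = [l2_normalize(x) for x in audio]
    scores = []
    for j in range(len(v)):
        if j == index:
            continue
        score = min(dot(v[index], v[j]), dot(a[index], a[j]))
        scores.append((score, j))
    scores.sort(reverse=True)
    # Return the indices of the k instances with the highest agreement.
    return [j for _, j in scores[:k]]
```

Under these assumptions, `cross_modal_loss` is close to zero when each video embedding is most similar to its own audio embedding and grows when the pairing is shuffled. `agreement_positives` skips an instance that looks similar in only one modality, which is the intuition behind grouping positives by similarity in both feature spaces.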

