Learning Audio-Visual Correlations from Variational Cross-Modal Generation

02/05/2021
by   Ye Zhu, et al.
0

People can easily imagine the potential sound while seeing an event. This natural synchronization between audio and visual signals reveals their intrinsic correlations. To this end, we propose to learn the audio-visual correlations from the perspective of cross-modal generation in a self-supervised manner, the learned correlations can be then readily applied in multiple downstream tasks such as the audio-visual cross-modal localization and retrieval. We introduce a novel Variational AutoEncoder (VAE) framework that consists of Multiple encoders and a Shared decoder (MS-VAE) with an additional Wasserstein distance constraint to tackle the problem. Extensive experiments demonstrate that the optimized latent representation of the proposed MS-VAE can effectively learn the audio-visual correlations and can be readily applied in multiple audio-visual downstream tasks to achieve competitive performance even without any given label information during training.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
11/09/2021

Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Temporal Synchronicity

We present CrissCross, a self-supervised framework for learning audio-vi...
research
08/13/2020

Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning

When watching videos, the occurrence of a visual event is often accompan...
research
10/28/2022

Multimodal Transformer for Parallel Concatenated Variational Autoencoders

In this paper, we propose a multimodal transformer using parallel concat...
research
08/26/2023

Video and Audio are Images: A Cross-Modal Mixer for Original Data on Video-Audio Retrieval

Cross-modal retrieval has become popular in recent years, particularly w...
research
08/18/2023

V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models

Building artificial intelligence (AI) systems on top of a set of foundat...
research
12/30/2021

Audio-to-symbolic Arrangement via Cross-modal Music Representation Learning

Could we automatically derive the score of a piano accompaniment based o...
research
08/15/2021

Audio2Gestures: Generating Diverse Gestures from Speech Audio with Conditional Variational Autoencoders

Generating conversational gestures from speech audio is challenging due ...

Please sign up or login with your details

Forgot password? Click here to reset