Deep Cross-Modal Audio-Visual Generation

04/26/2017
by Lele Chen, et al.

Cross-modal audio-visual perception has been a long-standing topic in psychology and neurology, and various studies have found strong correlations in human perception of auditory and visual stimuli. Despite prior work in computational multimodal modeling, the problem of cross-modal audio-visual generation has not been systematically studied in the literature. In this paper, we make the first attempt to solve this cross-modal generation problem by leveraging deep generative adversarial training. Specifically, we use conditional generative adversarial networks to achieve cross-modal audio-visual generation of musical performances. We explore different encoding methods for audio and visual signals, and work on two scenarios: instrument-oriented generation and pose-oriented generation. Because we are the first to explore this new problem, we compose two new datasets of paired images and sounds of musical performances on different instruments. Our experiments, using both classification and human evaluations, demonstrate that our model can generate one modality (audio or visual) from the other (visual or audio) to a good extent. Our experiments on various design choices, together with the datasets, will facilitate future research in this new problem space.
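The abstract names conditional generative adversarial networks but does not describe the architecture, so the following is only a minimal sketch of the generic conditional GAN idea in the sound-to-image direction, not the authors' model. It assumes conditioning by simple concatenation, single-layer toy networks, and the standard GAN losses; every name and dimension here (`NOISE_DIM`, `AUDIO_DIM`, `IMAGE_DIM`, `generate`, `discriminate`) is hypothetical.

```python
import math
import random

def concat_condition(x, condition):
    """Conditioning by concatenation: the network input is [x ; condition]."""
    return list(x) + list(condition)

def linear(x, weights, bias):
    """One dense layer; weights holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Standard GAN discriminator loss for one real pair and one generated pair."""
    return -(math.log(d_real + eps) + math.log(1.0 - d_fake + eps))

def generator_loss(d_fake, eps=1e-12):
    """Non-saturating generator loss: push D's score on the fake pair toward 1."""
    return -math.log(d_fake + eps)

def init(rows, cols, rng):
    """Small random weight matrix (toy initialisation)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
NOISE_DIM, AUDIO_DIM, IMAGE_DIM = 4, 3, 6  # toy sizes, assumed for illustration

g_w, g_b = init(IMAGE_DIM, NOISE_DIM + AUDIO_DIM, rng), [0.0] * IMAGE_DIM
d_w, d_b = init(1, IMAGE_DIM + AUDIO_DIM, rng), [0.0]

def generate(noise, audio_code):
    """Generator: maps (noise, audio encoding) to a flattened toy 'image'."""
    hidden = linear(concat_condition(noise, audio_code), g_w, g_b)
    return [math.tanh(v) for v in hidden]

def discriminate(image, audio_code):
    """Discriminator: scores how plausible an (image, audio encoding) pair is."""
    return sigmoid(linear(concat_condition(image, audio_code), d_w, d_b)[0])

audio_code = [0.2, -0.5, 0.9]  # stand-in for an encoded sound clip
real_image = [rng.uniform(-1, 1) for _ in range(IMAGE_DIM)]
noise = [rng.gauss(0, 1) for _ in range(NOISE_DIM)]
fake_image = generate(noise, audio_code)

d_real = discriminate(real_image, audio_code)
d_fake = discriminate(fake_image, audio_code)
loss_d = discriminator_loss(d_real, d_fake)
loss_g = generator_loss(d_fake)
```

The point of the sketch is that both networks see the same conditioning signal: the generator must produce an image consistent with the sound, and the discriminator judges the pair rather than the image alone. Swapping the roles of the two modalities would give the image-to-sound direction under the same assumptions.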


research
07/01/2021

OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation

In this paper, we propose an Omni-perception Pre-Trainer (OPT) for cross...
research
11/22/2017

CMCGAN: A Uniform Framework for Cross-Modal Visual-Audio Mutual Generation

Visual and audio modalities are two symbiotic modalities underlying vide...
research
04/19/2019

Listen to the Image

Visual-to-auditory sensory substitution devices can assist the blind in ...
research
12/02/2020

Cross-Modal Terrains: Navigating Sonic Space through Haptic Feedback

This paper explores the idea of using virtual textural terrains as a mea...
research
08/20/2022

Learning in Audio-visual Context: A Review, Analysis, and New Perspective

Sight and hearing are two senses that play a vital role in human communi...
research
08/18/2023

V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models

Building artificial intelligence (AI) systems on top of a set of foundat...
research
09/27/2021

Audio-to-Image Cross-Modal Generation

Cross-modal representation learning allows to integrate information from...
