Audio-to-Image Cross-Modal Generation

09/27/2021
by Maciej Żelaszczyk et al.

Cross-modal representation learning makes it possible to integrate information from different modalities into a single representation. At the same time, research on generative models tends to focus on the visual domain, with less emphasis on other domains such as audio or text, potentially missing the benefits of shared representations. Studies that successfully link more than one modality in the generative setting are rare. In this context, we verify that variational autoencoders (VAEs) can be trained to reconstruct image archetypes from audio data. Specifically, we consider VAEs in an adversarial training framework to ensure more variability in the generated data, and we find a trade-off between the consistency and the diversity of the generated images: this trade-off can be governed by scaling the reconstruction loss up or down, respectively. Our results further suggest that even when the generated images are relatively inconsistent (diverse), the features critical for proper image classification are preserved.
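The trade-off described above can be pictured with a toy composite objective in which a single weight scales the reconstruction term against the KL and adversarial terms. This is a minimal sketch under assumed loss choices (MSE reconstruction, Gaussian KL, non-saturating generator loss); the function name, signature, and weighting scheme are illustrative and not the authors' exact formulation.

```python
import numpy as np

def vae_gan_loss(x, x_hat, mu, logvar, d_fake, recon_weight=1.0):
    """Hypothetical composite loss for a VAE trained adversarially.

    A larger recon_weight favours consistency with the input archetype;
    a smaller one gives the adversarial term more influence and hence
    more diversity in the generated images.
    """
    # Pixel-wise reconstruction error (mean squared error).
    recon = np.mean((x - x_hat) ** 2)
    # KL divergence between the approximate posterior N(mu, exp(logvar))
    # and the standard normal prior, averaged over all latent dimensions.
    kl = -0.5 * np.mean(1.0 + logvar - mu ** 2 - np.exp(logvar))
    # Non-saturating generator loss: push discriminator probabilities
    # for generated samples (d_fake) towards 1.
    adv = -np.mean(np.log(d_fake + 1e-8))
    return recon_weight * recon + kl + adv
```

Sweeping `recon_weight` up or down then moves the objective along the consistency/diversity axis without changing the other terms.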


Related research

- 07/20/2022: Cross-Modal Contrastive Representation Learning for Audio-to-Image Generation
  Multiple modalities for certain information provide a variety of perspec...
- 11/22/2017: CMCGAN: A Uniform Framework for Cross-Modal Visual-Audio Mutual Generation
  Visual and audio modalities are two symbiotic modalities underlying vide...
- 04/26/2017: Deep Cross-Modal Audio-Visual Generation
  Cross-modal audio-visual perception has been a long-lasting topic in psy...
- 07/03/2019: Cascade Attention Guided Residue Learning GAN for Cross-Modal Translation
  Since we were babies, we intuitively develop the ability to correlate th...
- 12/30/2021: Audio-to-symbolic Arrangement via Cross-modal Music Representation Learning
  Could we automatically derive the score of a piano accompaniment based o...
- 07/12/2021: Visual-Tactile Cross-Modal Data Generation using Residue-Fusion GAN with Feature-Matching and Perceptual Losses
  Existing psychophysical studies have revealed that the cross-modal visua...
- 02/16/2022: Cross-Modal Common Representation Learning with Triplet Loss Functions
  Common representation learning (CRL) learns a shared embedding between t...
