Cascade Attention Guided Residue Learning GAN for Cross-Modal Translation

07/03/2019
by   Bin Duan, et al.
7

Since we were babies, we intuitively develop the ability to correlate the input from different cognitive sensors such as vision, audio, and text. However, in machine learning, this cross-modal learning is a nontrivial task because different modalities have no homogeneous properties. Previous works discover that there should be bridges among different modalities. From neurology and psychology perspective, humans have the capacity to link one modality with another one, e.g., associating a picture of a bird with the only hearing of its singing and vice versa. Is it possible for machine learning algorithms to recover the scene given the audio signal? In this paper, we propose a novel Cascade Attention-Guided Residue GAN (CAR-GAN), aiming at reconstructing the scenes given the corresponding audio signals. Particularly, we present a residue module to mitigate the gap between different modalities progressively. Moreover, a cascade attention guided network with a novel classification loss function is designed to tackle the cross-modal learning task. Our model keeps the consistency in high-level semantic label domain and is able to balance two different modalities. The experimental results demonstrate that our model achieves the state-of-the-art cross-modal audio-visual generation on the challenging Sub-URMP dataset.

READ FULL TEXT

page 2

page 6

page 7

research
10/26/2021

Learning Explicit and Implicit Latent Common Spaces for Audio-Visual Cross-Modal Retrieval

Learning common subspace is prevalent way in cross-modal retrieval to so...
research
07/07/2022

AV-Gaze: A Study on the Effectiveness of Audio Guided Visual Attention Estimation for Non-Profilic Faces

In challenging real-life conditions such as extreme head-pose, occlusion...
research
07/20/2022

Cross-Modal Contrastive Representation Learning for Audio-to-Image Generation

Multiple modalities for certain information provide a variety of perspec...
research
11/11/2021

Learning Signal-Agnostic Manifolds of Neural Fields

Deep neural networks have been used widely to learn the latent structure...
research
10/06/2019

Neural Multisensory Scene Inference

For embodied agents to infer representations of the underlying 3D physic...
research
02/11/2021

A Fractal Approach to Characterize Emotions in Audio and Visual Domain: A Study on Cross-Modal Interaction

It is already known that both auditory and visual stimulus is able to co...
research
09/27/2021

Audio-to-Image Cross-Modal Generation

Cross-modal representation learning allows to integrate information from...

Please sign up or login with your details

Forgot password? Click here to reset