CMCGAN: A Uniform Framework for Cross-Modal Visual-Audio Mutual Generation

by Wangli Hao, et al.

Visual and audio are two symbiotic modalities underlying videos, carrying both common and complementary information. If this information can be mined and fused sufficiently, the performance of related video tasks can be significantly enhanced. However, due to environmental interference or sensor faults, sometimes only one modality is available while the other is corrupted or missing. Recovering the missing modality from the existing one, based on the common information shared between them and prior knowledge of the specific modality, yields substantial benefits for various vision tasks. In this paper, we propose a Cross-Modal Cycle Generative Adversarial Network (CMCGAN) to handle cross-modal visual-audio mutual generation. Specifically, CMCGAN is composed of four kinds of subnetworks (audio-to-visual, visual-to-audio, audio-to-audio, and visual-to-visual), which are organized in a cycle architecture. CMCGAN has several remarkable advantages. First, it unifies visual-audio mutual generation in a common framework via a joint corresponding adversarial loss. Second, by introducing a latent vector with a Gaussian distribution, it handles the dimension and structure asymmetry between the visual and audio modalities effectively. Third, it can be trained end-to-end for greater convenience. Building on CMCGAN, we develop a dynamic multimodal classification network to handle the missing-modality problem. Extensive experiments validate that CMCGAN obtains state-of-the-art cross-modal visual-audio generation results. Furthermore, the generated modality achieves effects comparable to those of the original modality, which demonstrates the effectiveness and advantages of our proposed method.
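The cycle organization described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the dimensions, the single-linear-map "subnetworks", and the cycle-consistency loss are stand-ins, not the paper's actual encoder-decoder networks or its joint adversarial objective. The sketch only shows how four subnetworks arranged in a cycle, each fed an extra Gaussian latent vector, let source and target modalities have different dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative feature sizes (assumptions, not the paper's values).
AUDIO_DIM, VISUAL_DIM, LATENT_DIM = 128, 1024, 64

def make_subnet(in_dim, out_dim):
    """Stand-in for one CMCGAN subnetwork: here just a linear map."""
    W = rng.standard_normal((in_dim + LATENT_DIM, out_dim)) * 0.01
    def forward(x):
        # A Gaussian latent vector is concatenated to the input, so the
        # source and target modalities need not share dimensionality.
        z = rng.standard_normal(LATENT_DIM)
        return np.concatenate([x, z]) @ W
    return forward

# The four kinds of subnetworks, organized in a cycle:
# audio-to-visual, visual-to-audio, audio-to-audio, visual-to-visual.
a2v = make_subnet(AUDIO_DIM, VISUAL_DIM)
v2a = make_subnet(VISUAL_DIM, AUDIO_DIM)
a2a = make_subnet(AUDIO_DIM, AUDIO_DIM)
v2v = make_subnet(VISUAL_DIM, VISUAL_DIM)

audio = rng.standard_normal(AUDIO_DIM)
recon_audio = v2a(a2v(audio))               # one pass around the A->V->A cycle
cycle_loss = np.mean((audio - recon_audio) ** 2)  # cycle-consistency term
```

In the real model each arrow would be an encoder-decoder network trained adversarially, and the cycle loss would be combined with the joint corresponding adversarial loss so that all four subnetworks are optimized end-to-end.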


