M3D-GAN: Multi-Modal Multi-Domain Translation with Universal Attention

07/09/2019
by Shuang Ma, et al.

Generative adversarial networks have led to significant advances in cross-modal/domain translation. However, these networks are typically designed for a single task (e.g., dialogue generation or image synthesis, but not both). We present a unified model, M3D-GAN, that can translate across a wide range of modalities (e.g., text, image, and speech) and domains (e.g., attributes in images or emotions in speech). The model consists of modality subnets, which convert data from different modalities into unified representations, and a unified computing body, in which data from all modalities share the same network architecture. A universal attention module, trained jointly with the rest of the network, learns to encode a wide range of domain information into a highly structured latent space. We use this latent space to control synthesis in novel ways, such as producing diverse realistic pictures from a sketch or varying the emotion of synthesized speech. We evaluate our approach on extensive benchmark tasks, including image-to-image translation, text-to-image synthesis, image captioning, text-to-speech, speech recognition, and machine translation, and achieve state-of-the-art performance on several of these tasks.
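The data flow the abstract describes can be sketched in plain Python: per-modality subnets project differently sized inputs into one unified representation, a universal-attention module soft-attends over a learned memory of domain codes, and a single shared body processes every modality. All specifics below (dimensions, linear subnets, the residual fusion in the body, the names `encode`, `universal_attention`, `unified_body`) are illustrative assumptions for exposition, not the paper's actual networks, which are deep GAN components.

```python
import math
import random

random.seed(0)

UNIFIED_DIM = 8       # size of the shared representation (toy value, assumed)
N_DOMAIN_CODES = 4    # entries in the universal-attention memory (assumed)

def rand_matrix(rows, cols, scale=0.1):
    """Random weight matrix standing in for a learned projection."""
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    """Matrix-vector product over plain Python lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

# Modality subnets: one projection per modality into the unified space.
# Input sizes are illustrative (e.g. word vectors, CNN features, mel frames).
modality_subnets = {
    "text":   rand_matrix(UNIFIED_DIM, 30),
    "image":  rand_matrix(UNIFIED_DIM, 64),
    "speech": rand_matrix(UNIFIED_DIM, 20),
}

# Universal attention memory: learned domain codes shared across all tasks.
domain_memory = rand_matrix(N_DOMAIN_CODES, UNIFIED_DIM)

def encode(modality, x):
    """Modality subnet: map raw input to the unified representation."""
    return [math.tanh(h) for h in matvec(modality_subnets[modality], x)]

def universal_attention(h):
    """Soft-attend over the domain memory; the result conditions synthesis."""
    scores = [sum(c * hi for c, hi in zip(code, h)) for code in domain_memory]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]                  # softmax weights
    return [sum(w * code[i] for w, code in zip(weights, domain_memory))
            for i in range(UNIFIED_DIM)]

def unified_body(h, domain_code):
    """Shared computing body: the same parameters serve every modality."""
    return [math.tanh(a + b) for a, b in zip(h, domain_code)]

# The same body processes all three modalities.
for modality, dim in [("text", 30), ("image", 64), ("speech", 20)]:
    x = [random.random() for _ in range(dim)]
    h = encode(modality, x)
    out = unified_body(h, universal_attention(h))
    assert len(out) == UNIFIED_DIM
```

Because the domain code enters the body as an explicit conditioning vector, varying which memory entries receive attention weight would vary the output style (e.g. the emotion of synthesized speech) while the shared body stays fixed, which is the control mechanism the abstract describes.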


