Fusion-S2iGan: An Efficient and Effective Single-Stage Framework for Speech-to-Image Generation

05/17/2023
by Zhenxing Zhang, et al.

The goal of speech-to-image transformation is to produce a photo-realistic image directly from a speech signal. Various recent studies have focused on this task and achieved promising performance. However, current speech-to-image approaches are based on a stacked modular framework that suffers from three vital issues: 1) training separate networks is time-consuming and inefficient, and the convergence of the final generative model depends strongly on the preceding generators; 2) the quality of precursor images is ignored by this architecture; 3) multiple discriminator networks must be trained. To this end, we propose an efficient and effective single-stage framework, called Fusion-S2iGan, that yields perceptually plausible and semantically consistent image samples from given spoken descriptions. Fusion-S2iGan introduces a visual+speech fusion module (VSFM), constructed from a pixel-attention module (PAM), a speech-modulation module (SMM) and a weighted-fusion module (WFM), to inject the speech embedding from a speech encoder into the generator while improving the quality of the synthesized pictures. Fusion-S2iGan spreads the bimodal information over all layers of the generator network to reinforce the visual feature maps at various hierarchical levels of the architecture. We conduct a series of experiments on four benchmark data sets, i.e., CUB birds, Oxford-102, Flickr8k and Places-subset. The experimental results demonstrate the superiority of Fusion-S2iGan over state-of-the-art multi-stage models, with a performance level close to that of traditional text-to-image approaches.
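The abstract names three fusion operations inside the VSFM: pixel attention (PAM), speech-conditioned modulation (SMM), and weighted fusion (WFM). The sketch below illustrates, on toy 1-D feature vectors, one plausible form of each operation: a sigmoid per-pixel gate, a FiLM-style affine modulation whose scale/shift would be predicted from the speech embedding, and a convex combination of two streams. All function names, signatures, and the exact formulas are assumptions for illustration; the abstract does not specify the modules' internals.

```python
import math


def pixel_attention(visual, scores):
    # Hypothetical PAM: gate each visual feature ("pixel") with a
    # sigmoid of an attention score; gates lie in (0, 1).
    return [v * (1.0 / (1.0 + math.exp(-s))) for v, s in zip(visual, scores)]


def speech_modulation(visual, gamma, beta):
    # Hypothetical SMM: FiLM-style affine modulation, where gamma/beta
    # would be predicted from the speech embedding by small MLPs.
    return [g * v + b for v, g, b in zip(visual, gamma, beta)]


def weighted_fusion(stream_a, stream_b, w):
    # Hypothetical WFM: convex combination of two feature streams with
    # a learned weight w in [0, 1].
    return [w * a + (1.0 - w) * b for a, b in zip(stream_a, stream_b)]


if __name__ == "__main__":
    visual = [1.0, 2.0]
    modulated = speech_modulation(visual, gamma=[0.5, 2.0], beta=[0.1, -0.1])
    fused = weighted_fusion(modulated, visual, w=0.5)
    print(modulated)  # approximately [0.6, 3.9]
    print(fused)      # approximately [0.8, 2.95]
```

In the full model these operations would act on 2-D feature maps at every generator level, which is how the bimodal information is spread across all layers of the network.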


Related research

- DTGAN: Dual Attention Generative Adversarial Networks for Text-to-Image Generation (11/05/2020)
  Most existing text-to-image generation methods adopt a multi-stage modul...
- S2IGAN: Speech-to-Image Generation via Adversarial Learning (05/14/2020)
  An estimated half of the world's languages do not have a written form, m...
- DF-GAN: Deep Fusion Generative Adversarial Networks for Text-to-Image Synthesis (08/13/2020)
  Synthesizing high-resolution realistic images from text descriptions is ...
- DiverGAN: An Efficient and Effective Single-Stage Framework for Diverse Text-to-Image Generation (11/17/2021)
  In this paper, we present an efficient and effective single-stage framew...
- TSTNN: Two-stage Transformer based Neural Network for Speech Enhancement in the Time Domain (03/18/2021)
  In this paper, we propose a transformer-based architecture, called two-s...
- A Context-Aware Feature Fusion Framework for Punctuation Restoration (03/23/2022)
  To accomplish the punctuation restoration task, most existing approaches...
- Direct Speech-to-image Translation (04/07/2020)
  Direct speech-to-image translation without text is an interesting and us...
