S2IGAN: Speech-to-Image Generation via Adversarial Learning

An estimated half of the world's languages do not have a written form, making it impossible for these languages to benefit from any existing text-based technologies. In this paper, a speech-to-image generation (S2IG) framework is proposed which translates speech descriptions into photo-realistic images without using any text information, thus allowing unwritten languages to potentially benefit from this technology. The proposed S2IG framework, named S2IGAN, consists of a speech embedding network (SEN) and a relation-supervised densely-stacked generative model (RDG). SEN learns the speech embedding with the supervision of the corresponding visual information. Conditioned on the speech embedding produced by SEN, the proposed RDG synthesizes images that are semantically consistent with the corresponding speech descriptions. Extensive experiments on two public benchmark datasets, CUB and Oxford-102, demonstrate the effectiveness of the proposed S2IGAN in synthesizing high-quality and semantically-consistent images from the speech signal, yielding good performance and a solid baseline for the S2IG task.


1 Introduction

The recent development of deep learning and Generative Adversarial Networks (GANs) [goodfellow2014generative, mirza2014conditional, balaji2019conditional] has led to many efforts on the task of image generation conditioned on natural language [reed2016generative, zhang2018stackgan++, xu2018attngan, qiao2019mirrorgan, yin2019semantics, tan2019semantics]. Although great progress has been made, most existing natural language-to-image generation systems use text descriptions as their input, a task also referred to as Text-to-Image Generation (T2IG). Recently, a speech-based task was proposed in which face images are synthesized conditioned on speech [oh2019speech2face, wen2019face]. This task, however, only considers the acoustic properties of the speech signal, not its language content. Here, we present a natural language-to-image generation system that is based on a spoken description, bypassing the need for text. We refer to this new task as Speech-to-Image Generation (S2IG). It is similar to the recently proposed speech-to-image translation task [li2020direct].

This work is motivated by the fact that an estimated half of the world's 7,000 languages do not have a written form [lewis2015ethnologue] (so-called unwritten languages), which makes it impossible for these languages to benefit from any existing text-based technologies, including text-to-image generation. The linguistic rights included in the Universal Declaration of Human Rights state that it is a human right to communicate in one's native language. For these unwritten languages, it is therefore essential to develop a system that bypasses text and maps speech descriptions directly to images. Moreover, even though existing knowledge and methodology make a 'speech2text2image' pipeline possible, directly mapping speech to images may well be more efficient and straightforward.

In order to synthesize plausible images based on speech descriptions, speech embeddings need to be learned that capture the semantic information depicted in the image. To that end, we decompose the task of S2IG into two stages, i.e., a speech semantic embedding stage and an image generation stage. Specifically, the proposed speech-to-image generation model via adversarial learning (which we refer to as S2IGAN) consists of a Speech Embedding Network (SEN), which is trained to obtain speech embeddings by modeling and co-embedding speech and images together, and a novel Relation-supervised Densely-stacked Generative Model (RDG), which takes random noise and the speech embedding produced by SEN as input to synthesize photo-realistic images in a multi-step (coarse-to-fine) way.

In this paper, we present our attempt to generate images directly from the speech signal bypassing text. This task requires specific training material consisting of speech and image pairs. Unfortunately, no such database, with the right amount of data, exists for an unwritten language. The results for our proof-of-concept are consequently presented on two databases with English descriptions, i.e., CUB [wah2011caltech] and Oxford-102 [nilsback2008automated]. The benefit of using English as our working language is that we can compare our S2IG results to T2IG results in the literature. Our results are also compared to those of [li2020direct].

2 Approach

Given a speech description, our goal is to generate an image that is semantically aligned with the input speech. To this end, S2IGAN consists of two modules, i.e., SEN to create the speech embeddings and RDG to synthesize the images using these speech embeddings.

Figure 1: Framework of the relation-supervised densely-stacked generative model (RDG). I_s represents a real image from the same class as the ground-truth image (I_gt); x̂ represents a fake image synthesized by the framework; I_d represents a real image from a different class than I_gt. r_p, r_n, and r_u indicate the labels of the three types of relations. SED and IED are pre-trained in SEN.

2.1 Datasets

CUB [wah2011caltech] and Oxford-102 [nilsback2008automated] are two commonly-used datasets in the field of T2IG [reed2016generative, zhang2018stackgan++], and were also adopted in the most recent S2IG work [li2020direct]. CUB is a fine-grained bird dataset containing 11,788 bird images belonging to 200 categories, and Oxford-102 is a fine-grained flower dataset containing 8,189 images of flowers from 102 categories. Each image in both datasets has 10 text descriptions collected by [reed2016learning]. Since no speech descriptions are available for either dataset, we generated speech from the text descriptions using Tacotron 2 [shen2018natural], a text-to-speech system (https://github.com/NVIDIA/tacotron2).

2.2 Speech Embedding Network (SEN)

Given an image-speech pair, SEN tries to find a common space for both modalities, so that we can minimize the modality gap and obtain visually-grounded speech embeddings. SEN is a dual encoder framework, including an image encoder and a speech encoder, which is similar to the model structure in [merkx2019language].

The image encoder (IED) adopts Inception-v3 [szegedy2016rethinking] pre-trained on ImageNet [russakovsky2015imagenet] to extract visual features. On top of it, a single linear layer is employed to convert the visual features to a common space of visual and speech embeddings. As a result, we obtain an image embedding from IED.

The speech encoder (SED) employs a structure similar to that of [merkx2019language]. Specifically, it consists of a two-layer 1-D convolutional block, two-layer bi-directional gated recurrent units (GRU) [cho2014learning], and a self-attention layer. Finally, speech is represented by a speech embedding in the common space. The input to SED consists of log Mel filter bank spectrograms, obtained from the speech signal using 40 Mel-spaced filter banks with a 25 ms Hamming window and a 10 ms shift.

More details of SEN, including an illustration of the framework, can be found on the project website: https://xinshengwang.github.io/project/s2igan/.

2.2.1 Objective Function

To minimize the distance between the image and speech features of a matched pair, while keeping the features discriminative against those of other bird (CUB) or flower (Oxford-102) classes, we propose a matching loss and a distinctive loss.

Matching loss is designed to minimize the distance of a matched image-speech pair. Specifically, in a batch of image-speech embedding pairs {(v_i, s_i)}_{i=1}^n, where n is the batch size, the probability of the speech embedding s_i matching the image embedding v_i is

P(s_i | v_i) = exp(β S(v_i, s_i)) / Σ_{j=1}^n m_{i,j} exp(β S(v_i, s_j)),

where β is a smoothing factor, set to 10 following [xu2018attngan], and S(v_i, s_j) is the cosine similarity score of v_i and s_j. As in a mini-batch we only treat (v_i, s_i) as a positive matched pair, we use a mask m to deactivate the effect of other pairs from the same class. Specifically,

m_{i,j} = 0 if s_j matches v_i and j ≠ i, and m_{i,j} = 1 otherwise,

where "s_j matches v_i" means that they come from the same class. The loss function is then defined as the negative log probability of P(s_i | v_i):

L_1 = -Σ_{i=1}^n log P(s_i | v_i).

Reversely, we also calculate the loss L_2 based on P(v_i | s_i), the probability of v_i matching s_i. The matching loss is then calculated as

L_match = L_1 + L_2.
Distinctive loss is designed to ensure that the embedding space is optimally discriminative with respect to the instance classes. Specifically, both speech and image features in the embedding space are converted to a label space by adding a perception layer, i.e., ṽ_i = W v_i and s̃_i = W s_i, where W ∈ R^{K×d} and K is the number of classes. The loss function is given by

L_dist = -Σ_{i=1}^n ( log P(c_i | ṽ_i) + log P(c_i | s̃_i) ),

where P(c_i | ṽ_i) and P(c_i | s̃_i) represent the softmax probabilities of ṽ_i and s̃_i belonging to their corresponding class c_i.

The total loss for training SEN is finally given by

L_SEN = L_match + L_dist.
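The distinctive loss is a plain cross-entropy in the projected label space. A minimal numpy sketch, with an assumed shared projection `W` and synthetic data:

```python
# Sketch of the distinctive loss: project embeddings to a K-class label
# space and apply cross-entropy against the instance class.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def distinctive_loss(V, S, W, labels):
    """-sum_i [log P(c_i | W v_i) + log P(c_i | W s_i)]."""
    loss = 0.0
    for X in (V, S):
        P = softmax(X @ W.T)                      # (n, K) class probabilities
        loss -= np.log(P[np.arange(len(labels)), labels] + 1e-12).sum()
    return loss

rng = np.random.default_rng(1)
n, d, K = 4, 8, 3                                 # batch, embedding dim, classes
V, S = rng.normal(size=(n, d)), rng.normal(size=(n, d))
W = rng.normal(size=(K, d))                       # perception layer
labels = np.array([0, 0, 1, 2])
L_dist = distinctive_loss(V, S, W, labels)
# total SEN objective would then be L_SEN = L_match + L_dist
print(float(L_dist))
```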
2.3 Relation-supervised Densely-stacked Generative Model (RDG)

After learning the visually-grounded and class-discriminative speech embeddings, we employ RDG to generate images conditioned on these speech embeddings. RDG consists of two sub-modules, which are a Densely-stacked Generator (DG) and a Relation Supervisor (RS), see Figure 1.

2.3.1 Densely-stacked Generator (DG)

RDG uses the multi-step generation structure [zhang2018stackgan++, qiao2019mirrorgan, yin2019semantics] because of its previously shown performance. This structure generates images from small scale (low-resolution) to large scale (high-resolution) step by step. Specifically, in our model, 64×64, 128×128, and 256×256 pixel images are generated over three steps. To fully exploit the information in the hidden features of each step, we design a densely-stacked generator. With the speech embedding s as input, the generated image x̂_i in each stacked generator can be expressed as follows:

h_0 = F_0(z, F^ca(s)),
h_i = F_i(h_0, ..., h_{i-1}, F^ca(s)), i ∈ {1, 2},
x̂_i = G_i(h_i), i ∈ {0, 1, 2},

where z is a noise vector sampled from a normal distribution, and F^ca represents Conditioning Augmentation [zhang2017stackgan, zhang2018stackgan++], which augments the speech features and thus produces more image-speech pairs; it is a popular and useful strategy adopted by most recent text-to-image generation models [tan2019semantics, xu2018attngan, qiao2019mirrorgan]. h_i is the hidden feature produced by the non-linear transformation F_i and is fed to the generator G_i to obtain image x̂_i.
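Conditioning Augmentation is the reparameterized Gaussian sampling step that turns one speech embedding into many conditioning vectors. A hedged numpy sketch; the layer shapes and the linear maps `W_mu`, `W_sig` are illustrative assumptions:

```python
# Sketch of Conditioning Augmentation: map the embedding to (mu, sigma)
# and sample a condition via the reparameterization trick.
import numpy as np

def conditioning_augmentation(s, W_mu, W_sig, rng):
    mu = s @ W_mu                                  # mean of the condition
    log_sigma = s @ W_sig                          # log standard deviation
    eps = rng.normal(size=mu.shape)                # fresh noise per sample
    return mu + np.exp(log_sigma) * eps

rng = np.random.default_rng(2)
s = rng.normal(size=(1, 64))                       # speech embedding from SEN
W_mu = rng.normal(size=(64, 16))
W_sig = rng.normal(size=(64, 16)) * 0.01           # small sigma head
c1 = conditioning_augmentation(s, W_mu, W_sig, rng)
c2 = conditioning_augmentation(s, W_mu, W_sig, rng)
print(c1.shape)  # (1, 16): same description, two distinct condition vectors
```

Because the noise is resampled each call, `c1` and `c2` differ while staying close to the same mean, which smooths the conditioning manifold and effectively multiplies the number of image-speech pairs.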

2.3.2 Relation Supervisor (RS)

To ensure that the generator produces high-quality images that are semantically aligned with the spoken description, we propose a relation supervisor (RS) to provide a strong relation constraint on the generation process. Specifically, we form an image set {x̂, I_gt, I_s, I_d} for each generated image x̂, denoting the generated fake image, the ground-truth image, a real image from the same class as I_gt, and a real image from a different, randomly-sampled class, respectively. We then define three types of relations: 1) a positive relation r_p, between I_gt and I_s; 2) a negative relation r_n, between I_gt and I_d; and 3) an undesired relation r_u, between I_gt and an identical copy of itself. A relation classifier is trained to classify these three relations. We expect the relation between x̂ and I_gt to be close to the positive relation r_p, because x̂ should semantically align with its corresponding I_gt; however, it should not be identical to I_gt, to ensure the diversity of the generated results. Therefore, the loss function for training the RS is defined as:

L_RS = -( log P(r_p | f_p) + log P(r_n | f_n) + log P(r_u | f_u) + log P(r_p | f_x̂) ),

where f is a relation vector produced by RS from an input pair of images with the corresponding relation, e.g., f_p = RS(I_s, I_gt), and f_x̂ = RS(x̂, I_gt) is the relation vector between x̂ and I_gt, which is pushed toward the positive relation. Note that we apply RS only to the last generated image, i.e., x̂_2, for computational efficiency.

2.3.3 Objective Function

The final objective function of RDG is defined as:

L = L_G + λ L_RS, with L_G = Σ_{i=0}^{2} L_{G_i},

where the loss function for the i-th generator G_i is defined as:

L_{G_i} = -½ E_{x̂_i ~ p_{G_i}} [log D_i(x̂_i)] - ½ E_{x̂_i ~ p_{G_i}} [log D_i(x̂_i, s)].

The loss function for the corresponding discriminators of RDG is given by:

L_D = Σ_{i=0}^{2} L_{D_i},

where the loss function for the i-th discriminator is given by:

L_{D_i} = -½ E_{x_i ~ p_{data_i}} [log D_i(x_i)] - ½ E_{x̂_i ~ p_{G_i}} [log(1 - D_i(x̂_i))]
          -½ E_{x_i ~ p_{data_i}} [log D_i(x_i, s)] - ½ E_{x̂_i ~ p_{G_i}} [log(1 - D_i(x̂_i, s))].

Here, the first two terms form the unconditional loss that discriminates fake from real images, and the last two terms form the conditional loss that discriminates whether the image and the speech description match. x̂_i is drawn from the model distribution p_{G_i} at the i-th scale, and x_i from the real image distribution p_{data_i} at the same scale. The generators and discriminators are trained alternately.
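The per-scale losses combine an unconditional real/fake term with a conditional image-speech matching term. A hedged numpy sketch with a toy linear discriminator (all names and shapes are illustrative, not the paper's architecture):

```python
# Sketch of one scale's GAN losses: unconditional + speech-conditional terms.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_loss(D, real, fake, s):
    """Discriminator: real/fake terms plus conditional matching terms."""
    return -(np.log(sigmoid(D(real)) + 1e-12)
             + np.log(1 - sigmoid(D(fake)) + 1e-12)
             + np.log(sigmoid(D(real, s)) + 1e-12)
             + np.log(1 - sigmoid(D(fake, s)) + 1e-12))

def g_loss(D, fake, s):
    """Generator: make the fake score high, both with and without the condition."""
    return -(np.log(sigmoid(D(fake)) + 1e-12)
             + np.log(sigmoid(D(fake, s)) + 1e-12))

rng = np.random.default_rng(4)
w, u = rng.normal(size=16), rng.normal(size=8)
def D(x, s=None):                                 # toy linear discriminator
    score = x @ w
    return score + s @ u if s is not None else score

real, fake, s = rng.normal(size=16), rng.normal(size=16), rng.normal(size=8)
Ld, Lg = float(d_loss(D, real, fake, s)), float(g_loss(D, fake, s))
print(Ld, Lg)
```

In training, `d_loss` and `g_loss` would be minimized alternately at each of the three scales, with the relation-supervisor term added to the generator objective.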

2.4 Evaluation Metrics

We use two types of metrics to evaluate the performance of our S2IGAN model. To evaluate the diversity and quality of the generated images, we use two popular quantitative evaluation metrics for generative models, following [zhang2018stackgan++]: the Inception score (IS) [salimans2016improved] and the Fréchet inception distance (FID) [heusel2017gans]. A higher IS indicates more diversity, and a lower FID indicates a smaller distance between the generated and real image distributions, i.e., better generated images.

The visual-semantic consistency between the generated images and their speech descriptions is evaluated through a content-based image retrieval experiment between the real and the generated images, measured with mAP scores. Specifically, we randomly chose two real images from each class of the test set to form a query pool. We then used these query images to retrieve generated fake images belonging to their corresponding classes, with the features of all images extracted by the pre-trained Inception-v3. A higher mAP indicates a closer feature distance between the fake images and their ground-truth images, which indirectly indicates a higher semantic consistency between the generated images and their corresponding speech descriptions.
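The retrieval-based mAP evaluation can be sketched as follows: query features rank gallery features by cosine similarity, and average precision is computed per query. The data here is synthetic and perfectly class-separated (so mAP is exactly 1.0); the function names are illustrative.

```python
# Sketch of the content-based retrieval mAP used for semantic consistency.
import numpy as np

def average_precision(ranked_relevance):
    """AP over a ranked list of 0/1 relevance flags."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return float(np.mean(precisions)) if precisions else 0.0

def mean_ap(queries, q_labels, gallery, g_labels):
    qn = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    gn = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    aps = []
    for q, lab in zip(qn, q_labels):
        order = np.argsort(-(gn @ q))             # most similar first
        aps.append(average_precision(g_labels[order] == lab))
    return float(np.mean(aps))

# toy data: real-image queries vs. generated-image gallery features
queries = np.array([[1.0, 0.0], [0.0, 1.0]])
q_labels = np.array([0, 1])
gallery = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
g_labels = np.array([0, 0, 1, 1])
print(mean_ap(queries, q_labels, gallery, g_labels))  # 1.0
```

In the actual evaluation, the 2-D toy vectors would be Inception-v3 features of real query images and generated gallery images.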

3 Results

| Method          | Input  | CUB mAP | CUB FID | CUB IS    | Oxford-102 mAP | Oxford-102 FID | Oxford-102 IS |
|-----------------|--------|---------|---------|-----------|----------------|----------------|---------------|
| StackGAN-v2     | text   | 7.01    | 20.94   | 4.02±0.03 | 9.88           | 50.38          | 3.35±0.07     |
| MirrorGAN*      | text   | –       | –       | 4.56±0.05 | –              | –              | –             |
| SEGAN*          | text   | –       | –       | 4.67±0.04 | –              | –              | –             |
| [li2020direct]* | speech | –       | 18.37   | 4.09±0.04 | –              | 54.76          | 3.23±0.05     |
| StackGAN-v2     | speech | 8.09    | 18.94   | 4.14±0.04 | 12.18          | 54.33          | 3.69±0.08     |
| S2IGAN          | speech | 9.04    | 14.50   | 4.29±0.04 | 13.40          | 48.64          | 3.55±0.04     |

Table 1: Performance of S2IGAN compared to other methods on CUB (Bird) and Oxford-102 (Flower). * indicates results taken from the original paper; '–' indicates that the result was not reported.

3.1 Objective Results

We compare our results with several state-of-the-art T2IG methods, including StackGAN-v2 [zhang2018stackgan++], MirrorGAN [qiao2019mirrorgan] and SEGAN [tan2019semantics]. StackGAN-v2 is a strong baseline for the T2IG task and provides the effective stacked structure for the following methods. Both MirrorGAN and SEGAN are based on the stacked structure. MirrorGAN utilizes word-level [xu2018attngan] and sentence-level attention mechanisms, and a “text-to-image-to-text” structure for T2IG, and SEGAN also uses word-level attention with extra proposed attention regularization and a siamese structure. In order to allow for a direct comparison on the S2IG task to StackGAN-v2, we reimplemented StackGAN-v2 and replaced the text embedding with our speech embedding. Moreover, we compare our results to the recently released speech-based model by [li2020direct].

The results are shown in Table 1. First, our method outperforms [li2020direct] on all evaluation metrics and datasets. Compared with the StackGAN-v2 that takes our speech embedding as input, S2IGAN also achieves a higher mAP and a lower FID on both datasets. These results indicate that our method is effective in generating high-quality and semantically consistent images from spoken descriptions. The comparison of S2IGAN with three state-of-the-art T2IG methods shows that S2IGAN is competitive, and thus establishes a solid baseline for the S2IG task.

Speech input is generally considered more difficult to deal with than text because of its high variability, its long duration, and the lack of pauses between words; S2IG is therefore more challenging than T2IG. However, the comparison of the performance of StackGAN-v2 on the S2IG and T2IG tasks shows that StackGAN-v2 generated better images using the speech embeddings learned by our SEN. Moreover, the StackGAN-v2 based on our learned speech embeddings outperforms [li2020direct] on almost all evaluation metrics and datasets, except for a slightly higher FID on the CUB dataset. Note that [li2020direct] uses the original StackGAN-v2 as its generator, which means that the only difference between [li2020direct] and the speech-based StackGAN-v2 in Table 1 is the speech embedding method. These results confirm that our learned speech embeddings are competitive with both text input and the speech embeddings of [li2020direct], showing the effectiveness of our SEN module.

3.1.1 Subjective Results

The subjective visual results are shown in Figure 2. As can be seen, the images synthesized by our S2IGAN (d) are photo-realistic and convincing. Comparing the images generated by (d) S2IGAN and (c) StackGAN-v2 conditioned on speech embeddings, we can see that the images generated by S2IGAN are clearer and sharper, showing the effectiveness of S2IGAN in synthesizing visually high-quality images. The comparison of StackGAN-v2 conditioned on (b) text and (c) speech features embedded by the proposed SEN shows that our learned speech embeddings are competitive with the text features embedded by StackGAN-v2, showing the effectiveness of SEN. More results are shown on the project website.

To further illustrate S2IGAN’s ability to catch subtle semantic differences in the speech descriptions, we generated images conditioned on speech descriptions in which color keywords were changed. As Figure 3 shows, the visual semantics of the generated birds, specifically, the colors of the belly and the wings, are consistent with the corresponding semantic information in the spoken descriptions. These visualization results indicate that SEN successfully learned the semantic information in the speech signal, and that our RDG is capable of capturing these semantics and generating discriminative images that are semantically aligned with the input speech.

Figure 2: Examples of images generated by different methods.
Figure 3: Generated examples by S2IGAN. The generated images are based on speech descriptions with different color keywords.

3.2 Component analysis

An extensive ablation study investigated the effectiveness of the key components of S2IGAN. Specifically, the effects of the densely-stacked structure of DG, of RS, and of SEN were investigated by removing each of these components in turn. Removing any component resulted in a clear decrease in generation performance, showing the effectiveness of each component. Details can be found on the project website.

4 Discussion and Conclusion

This paper introduced the novel task of speech-to-image generation (S2IG) and presented a novel generative model, called S2IGAN, which tackles S2IG in two steps. First, semantically discriminative speech embeddings are learned by a speech embedding network. Second, high-quality images are generated on the basis of these speech embeddings. The results of extensive experiments show that S2IGAN achieves state-of-the-art performance and that the learned speech embeddings capture the semantic information in the speech signal.

The current work is based on synthesized speech, which makes the current S2IG baseline an upper-bound baseline. Future work will focus on several directions. First, we will investigate this task with natural rather than synthesized speech. Second, it will be highly interesting to test the proposed methodology on a truly unwritten language rather than the well-resourced English language. Third, we will further improve our methods in terms of efficiency and accuracy, for example, by making end-to-end training more effective and efficient and by applying attention mechanisms to our generator to further improve the quality of the generated images. An interesting avenue for future research would be to automatically discover speech units from the speech signal based on corresponding visual information [harwath2019towards] in order to segment the speech signal. This would allow us to use segment- and word-level attention mechanisms, which have been shown to improve performance on the text-to-image generation task [xu2018attngan], to improve the performance of speech-to-image generation.