FA-GAN: Feature-Aware GAN for Text to Image Synthesis

by Eunyeong Jeon, et al.

Text-to-image synthesis aims to generate a photo-realistic image from a given natural language description. Previous works have made significant progress with Generative Adversarial Networks (GANs). Nonetheless, it is still hard to generate intact objects or clear textures (Fig 1). To address this issue, we propose the Feature-Aware Generative Adversarial Network (FA-GAN) to synthesize high-quality images by integrating two techniques: a self-supervised discriminator and a feature-aware loss. First, we design a self-supervised discriminator with an auxiliary decoder so that the discriminator can extract better representations. Second, we introduce a feature-aware loss that provides the generator more direct supervision by employing the feature representation from the self-supervised discriminator. Experiments on the MS-COCO dataset show that our proposed method significantly advances the state-of-the-art FID score from 28.92 to 24.58.








1 Introduction

Generative Adversarial Networks (GANs) [c1] have led to remarkable success in image generation under various types of conditions [c2, c3, c4, c6, c7, c8]. Research on text-to-image synthesis is in the limelight because natural language is far more expressive than random noise or class labels. Nevertheless, it is challenging to synthesize high-quality images while satisfying the diverse constraints of text descriptions. For example, generating ‘a few people on skis standing on a mountain top’ is more difficult than generating ‘people’.

Most existing works [c9, c10, c11, c12, c13] have achieved remarkable progress by proposing effective GAN structures. StackGAN [c9] uses a stacked structure of multiple GANs to decompose the hard problem of generating high-resolution images into tractable subproblems. Subsequent studies [c10, c11, c12, c13] have refined the architecture of StackGAN [c9]. AttnGAN [c10] adopts cross-modal attention mechanisms for fine-grained generation. DM-GAN [c11] leverages dynamic memory modules to supplement the generation procedure. Obj-GAN [c12] and OP-GAN [c13] use an additional input, a pre-generated scene layout, to concentrate on creating objects. However, training multiple GANs simultaneously is not easy. Recently, DF-GAN [c14] showed that a single pair of generator and discriminator can produce realistic images using deep fusion blocks.

Figure 1: Text-to-image synthesis examples from the proposed FA-GAN and the baseline models AttnGAN and DF-GAN. The baseline models struggle to generate intact objects and fine details.
Figure 2: The architecture of the proposed FA-GAN. (a) The self-supervised discriminator determines whether the input sentence and image match; it has an auxiliary decoder, which is trained to reconstruct the real images. (b) The generator generates an image from the noise vector z and the sentence vector s. The feature-aware loss maximizes the feature similarity between the fake image and the real image.

To encourage text-image consistency, AttnGAN [c10] suggests the DAMSM loss and MirrorGAN [mirrorgan] designs a text-to-image-to-text cycle-consistency loss. These methods compute semantic similarity using a pre-trained network, which slows down training. To improve on this, DF-GAN [c14] proposes MA-GP, a regularization method applied to the discriminator using the real images that requires no extra network. Thanks to these efforts, previous models can capture what to draw. However, they still struggle to generate intact objects or clear textures (Fig 1).

To circumvent this problem, we focus on a method that directly affects the generator by effectively exploiting the training data. To this end, we propose the Feature-Aware Generative Adversarial Network (FA-GAN) to produce more realistic images. Our method consists of two techniques: a self-supervised discriminator and a feature-aware loss. The self-supervised discriminator, equipped with an extra decoder, extracts better feature representations through auto-encoding training. The feature-aware loss, which utilizes this representation, provides more direct supervision to the generator by indicating which features the generated image should have; it acts as a regularization loss. The method can be easily applied to other existing models with small modifications. Extensive experiments and ablation studies on the MS-COCO [c16] dataset demonstrate the superiority of FA-GAN. For quantitative evaluation, we use the Fréchet Inception Distance (FID) [c17]. Our proposed method advances the state-of-the-art FID from 28.92 to 24.58.

Overall, our contributions are as follows:

  • We propose FA-GAN for text-to-image synthesis, a method that gives useful feedback to the generator to produce high-fidelity images.

  • Our proposed method is more effective than other similar regularization losses which use real data and generated data.

  • FA-GAN outperforms the state-of-the-art models in terms of FID.

2 Proposed Method

2.1 Model Architecture

We use the one-stage GAN proposed in [c14]. The overall architecture is illustrated in Fig 2. It is composed of a pre-trained text encoder from [c10], a generator G, and a discriminator D. The pre-trained text encoder is a bi-directional Long Short-Term Memory that extracts a semantic vector from the text; the last hidden states are employed as the sentence vector s. The generator generates an image from the given sentence vector s and a noise vector z sampled from the Gaussian distribution N(0, 1). With the addition of the text condition, the role of the discriminator differs from the traditional discriminator, which distinguishes only whether the input image is real or fake. The discriminator has two inputs, a sentence vector s and an image x, and it determines whether the image and the text match. There are three cases to consider while training D: (1) match, when the input is a real image with its matched sentence; (2) mismatch, when the input is a real image with a mismatched sentence; (3) mismatch, when the input is a fake image, regardless of the sentence. We use a hinge loss [c18] to stabilize the GAN training procedure and the MA-GP loss from [c14]. The adversarial losses are defined as follows:

\[
\mathcal{L}^{D}_{adv} = \mathbb{E}_{x \sim \mathbb{P}_r}\big[\max(0,\, 1 - D(x, s))\big] + \tfrac{1}{2}\,\mathbb{E}_{x \sim \mathbb{P}_r}\big[\max(0,\, 1 + D(x, \hat{s}))\big] + \tfrac{1}{2}\,\mathbb{E}_{z \sim \mathcal{N}(0,1)}\big[\max(0,\, 1 + D(G(z, s), s))\big],
\]
\[
\mathcal{L}^{G}_{adv} = -\,\mathbb{E}_{z \sim \mathcal{N}(0,1)}\big[D(G(z, s), s)\big],
\]

where (x, s) is a matched image and sentence pair from the real distribution P_r, and (x, ŝ) is a mismatched pair.
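As a numerical illustration, the hinge objectives above can be sketched over batches of discriminator logits. This is a minimal sketch in NumPy, not the authors' implementation; the function names are ours, and the 1/2 weighting of the two mismatch terms follows the formulation above.

```python
import numpy as np

def d_adv_loss(real_logits, mismatch_logits, fake_logits):
    """Hinge adversarial loss for D: matched real pairs are pushed above +1;
    mismatched and fake pairs are pushed below -1, each term weighted by 1/2."""
    real_term = np.mean(np.maximum(0.0, 1.0 - real_logits))
    mismatch_term = 0.5 * np.mean(np.maximum(0.0, 1.0 + mismatch_logits))
    fake_term = 0.5 * np.mean(np.maximum(0.0, 1.0 + fake_logits))
    return real_term + mismatch_term + fake_term

def g_adv_loss(fake_logits):
    """Hinge adversarial loss for G: raise D's score on generated images."""
    return -np.mean(fake_logits)

# A discriminator whose margins are already satisfied incurs zero hinge loss.
perfect = d_adv_loss(np.array([2.0, 3.0]), np.array([-2.0]), np.array([-1.5]))
```

Note that the generator term is linear in the logits, so its gradient does not vanish even when the discriminator is confident, which is one reason hinge losses stabilize GAN training.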

2.2 Self-supervised Discriminator

Inspired by [c15], we design a self-supervised discriminator. We attach a single decoder D_dec to the intermediate layers of D and denote the layers of D before the decoder as D_enc. The decoder D_dec has four convolution layers, and its architecture is shown in Fig 2. D_dec reconstructs the input image from the output of D_enc, and it is optimized only on the real samples during the training of D. We employ a perceptual loss [c19] as the reconstruction loss:

\[
\mathcal{L}_{rec} = \mathcal{L}_{per}\big(D_{dec}(D_{enc}(x)),\, x\big),
\]

where L_per is the perceptual loss function. Such auto-encoding training drives D_enc to learn a better feature representation of the inputs. To take advantage of this representation, we utilize it to compute the feature-aware loss.
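The reconstruction loss above compares decoder output and input in a feature space rather than pixel space. The sketch below illustrates this shape of loss; the block-average pooling here is only a hypothetical stand-in for the pre-trained feature network that a real perceptual loss [c19] would use.

```python
import numpy as np

def avg_pool_features(img, block=2):
    """Stand-in feature extractor: non-overlapping block averages.
    A real perceptual loss uses activations of a pre-trained network."""
    h, w = img.shape
    return img.reshape(h // block, block, w // block, block).mean(axis=(1, 3))

def reconstruction_loss(reconstructed, real, feat_fn=avg_pool_features):
    """L_rec: feature-space L1 distance between the decoder's reconstruction
    of a *real* image and the real image itself (fake samples are excluded)."""
    return np.mean(np.abs(feat_fn(reconstructed) - feat_fn(real)))

real = np.arange(16.0).reshape(4, 4)
loss_zero = reconstruction_loss(real, real)  # perfect reconstruction
```

Because the decoder is trained only on real samples, this loss shapes D_enc into an informative encoder without ever rewarding the discriminator for memorizing fakes.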

2.3 Feature-Aware Loss

The self-supervised discriminator is a prerequisite for the feature-aware loss. Because it contains an auxiliary decoder, it internally forms an autoencoder, and autoencoders are designed to encode the input into a meaningful representation. Thanks to this autoencoder structure, we can regard the features of the real image, D_enc(x), as guidelines that tell the generator which features the generated image should have. Consequently, we propose the feature-aware loss, a regularization loss for synthesizing an image that retains the features of a real image. Specifically, it enforces G to generate an image that maximizes the similarity of the feature representations between the generated image and the corresponding real image x with the same textual description s. In this paper, we employ the L1 distance to calculate the similarity and convert the maximization problem into a minimization problem. The proposed feature-aware loss is formulated as follows:

\[
\mathcal{L}_{fa} = \mathbb{E}\big[\,\lVert D_{enc}(x) - D_{enc}(G(z, s)) \rVert_{1}\,\big].
\]
In Sec 3.4, we compare our method with the perceptual loss. The perceptual loss matches features obtained from a VGG-16 network [c20] pre-trained for image classification. The main difference between our method and the perceptual loss is which features are used for regularization.
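The feature-aware loss itself is just an L1 distance in the encoder's feature space. The sketch below uses a toy stand-in for D_enc (the name and statistics-based encoder are ours, for illustration only); in FA-GAN, D_enc is the learned discriminator trunk, and this loss updates the generator only.

```python
import numpy as np

def feature_aware_loss(real_img, fake_img, d_enc):
    """L_fa: L1 distance between encoder features of the real image and the
    generated image for the same text. d_enc is any feature-extracting
    callable; gradients would flow to the generator only."""
    return np.mean(np.abs(d_enc(real_img) - d_enc(fake_img)))

def toy_encoder(img):
    # Hypothetical stand-in for D_enc: global mean and variance per image.
    return np.array([img.mean(), img.var()])

real = np.ones((8, 8))
fake = np.ones((8, 8))
identical = feature_aware_loss(real, fake, toy_encoder)  # 0.0 for identical images
```

The key design choice, relative to a perceptual loss, is that the feature space is supplied by the self-supervised discriminator itself rather than by an external classification network.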

2.4 Total Objective

Finally, the total objective functions of our FA-GAN are defined as:

\[
\mathcal{L}_{D} = \mathcal{L}^{D}_{adv} + \mathcal{L}_{rec}, \qquad \mathcal{L}_{G} = \mathcal{L}^{G}_{adv} + \mathcal{L}_{fa}.
\]
Our method can easily be applied to other models, adding only the small number of parameters in the decoder D_dec.

3 Experiments

We conduct extensive experiments to evaluate the performance of FA-GAN. We provide quantitative and qualitative evaluations of FA-GAN against state-of-the-art models on the MS-COCO dataset. In addition, we conduct ablation studies to investigate the effectiveness of the proposed method. Furthermore, we compare the feature-aware loss with other similar regularization losses. The results demonstrate the superiority of our proposed method.

3.1 Evaluation Details

Inception Score (IS) [c21] and Fréchet Inception Distance (FID) are widely used metrics for evaluating GANs. In contrast to IS, which evaluates only the distribution of generated images, FID compares the distribution of generated images with the distribution of real images, which makes FID much more robust than IS. In other words, if a model always generates the same image, its FID will be high (the lower the FID, the better), whereas IS cannot penalize this case (the higher the IS, the better). [c12, c13, c14] found that IS is not an appropriate metric for evaluating text-to-image synthesis models, since some models tend to generate the same image whenever the text contains the same word; such models are poor generators, yet their IS can still be high. Thus, we use FID to evaluate our models. Following prior works [c11, c14], we generate 30,000 images from randomly selected sentences in the test set and compute the FID score.
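Concretely, FID is the Fréchet distance between Gaussian fits of real and generated feature distributions: ||μ_r − μ_g||² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^{1/2}). The sketch below assumes diagonal covariances so the matrix square root reduces to an element-wise one; a full FID pipeline estimates dense covariances from Inception-v3 activations.

```python
import numpy as np

def fid_diagonal(mu_r, var_r, mu_g, var_g):
    """Fréchet distance between two Gaussians with diagonal covariances:
    ||mu_r - mu_g||^2 + sum(var_r + var_g - 2*sqrt(var_r * var_g)).
    Real FID uses full covariance matrices of Inception-v3 features."""
    mean_term = np.sum((mu_r - mu_g) ** 2)
    cov_term = np.sum(var_r + var_g - 2.0 * np.sqrt(var_r * var_g))
    return mean_term + cov_term

mu = np.array([0.0, 1.0])
var = np.array([1.0, 2.0])
same = fid_diagonal(mu, var, mu, var)  # identical distributions score 0
```

This makes the robustness argument above explicit: a collapsed generator that always emits one image has near-zero feature variance, so the covariance term blows up even if the mean features look plausible.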

Methods FID
AttnGAN 35.49
DM-GAN 32.64
DF-GAN 28.92
FA-GAN (Ours) 24.58
Table 1: Comparison with state-of-the-art text-to-image synthesis models. FA-GAN significantly outperforms the state-of-the-art. The best score is indicated in bold.
Methods FID
baseline 27.67
baseline + SSD 37.32
baseline + FA loss 28.25
baseline + SSD + FA loss (Ours) 24.58
Table 2: Ablation study on our proposed method. SSD indicates the self-supervised discriminator and FA indicates the feature-aware loss without the self-supervised discriminator.

3.2 Quantitative Evaluation

We compare our model with state-of-the-art text-to-image synthesis models on the MS-COCO dataset. As shown in Table 1, the proposed FA-GAN significantly outperforms the other methods. Compared with DF-GAN [c14], FA-GAN improves the FID from 28.92 to 24.58. These results show that our proposed FA-GAN can generate higher-quality images.

Figure 3: Generated images from similar sentences by baseline + perceptual loss and baseline + feature-aware loss (Ours). Our method generates more varied and clearer images.
Figure 4: Examples for the baseline models and FA-GAN on MS-COCO test set. FA-GAN generates more realistic images than the baseline models AttnGAN, DM-GAN, DF-GAN.

3.3 Ablation Study

We perform ablation studies to verify the effectiveness of our proposed method. The components are the self-supervised discriminator (SSD) and the feature-aware loss (FA); we denote the variant that uses the feature-aware loss without the SSD as FA. We compare four configurations: (1) FA-GAN w/o SSD and FA, (2) FA-GAN w/o FA, (3) FA-GAN w/o SSD, and (4) FA-GAN. We set (1) as our baseline model, which has a slightly lower FID score than DF-GAN [c14]. The results are reported in Table 2. They show that (2) and (3) do not guarantee improvement over the baseline. We found that in (2), the SSD without regularization accelerates the convergence of the discriminator and thus destabilizes GAN training, while (3) suffers from the side effects of regularization explained in Sec 3.4. However, our proposed method (4), which integrates both SSD and FA, significantly improves performance. We speculate that this is because regularizing with a meaningful representation provides more useful signals to the generator.

Methods FID
baseline 27.67
baseline + Perceptual loss 37.84
baseline + FA loss 28.25
baseline + SSD + FA loss (Ours) 24.58
Table 3: Experiments on the superiority of the feature-aware loss. Feature-aware loss is an effective regularization loss.

3.4 Comparison with other regularization methods

To verify the superiority of our proposed feature-aware loss, we compare it with other similar regularization losses that use feature representations. Perceptual loss uses the representation from a VGG network pre-trained for image classification, and FA is the variant from Sec 3.3. The performances are reported in Table 3. Regularization losses are not always helpful for GANs, because they can hinder diversity, which causes the FID to increase. These side effects depend on whether the regularized features contain meaningful information. We found that the other regularization losses suffer from such side effects, whereas our method has fewer side effects and greater benefits. Fig 3 shows that our model can generate more varied and clearer images. These results demonstrate that the feature-aware loss is a beneficial regularization and that our method effectively exploits the training data. We conjecture that this is because the self-supervised discriminator extracts more meaningful high-level semantic features than the alternatives.

3.5 Qualitative Results

Fig 4 shows several examples of images synthesized by our proposed FA-GAN and the baseline models. We observe that our model better preserves object shapes and generates clearer details than the baselines, which demonstrates the effectiveness of our proposed feature-aware loss.

4 Conclusion

In this paper, we propose FA-GAN for text-to-image synthesis, a method that gives the generator more useful signals. We design a self-supervised discriminator that is trained in an auto-encoding manner to extract meaningful features. The feature-aware loss enables the model to generate images that have the same features as real images. Experimental results show the effectiveness of our method: our model generates more realistic and clearer images, and FA-GAN significantly advances the state-of-the-art FID from 28.92 to 24.58. The proposed method can easily be applied to other existing methods with only small modifications.

5 Acknowledgement

This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grants funded by the Korea government (MSIT) (No. B0101-15-0266, Development of High Performance Visual BigData Discovery Platform for Large-Scale Realtime Data Analysis; No. 2017-0-00897, Development of Object Detection and Recognition for Intelligent Vehicles).