Generative Adversarial Networks (GANs) [c1] have achieved remarkable success in image generation with various types of conditions [c2, c3, c4, c6, c7, c8]. Research on text-to-image synthesis is in the limelight because of the expressiveness of natural language, unlike random noise or class labels. Nevertheless, it remains challenging to synthesize high-quality images while satisfying the diverse constraints of text descriptions. For example, generating ‘a few people on skis standing on a mountain top’ is more difficult than generating ‘people’.
Most existing works [c9, c10, c11, c12, c13] have achieved remarkable progress by proposing effective GAN structures. StackGAN [c9] uses a stacked structure of multiple GANs to decompose the hard problem of generating high-resolution images into tractable subproblems. Subsequent studies [c10, c11, c12, c13] have refined the StackGAN [c9] architecture. AttnGAN [c10] adopts cross-modal attention mechanisms for fine-grained generation. DM-GAN [c11] leverages dynamic memory modules to supplement the generation procedure. Obj-GAN [c12] and OP-GAN [c13] use an additional input, a pre-generated scene layout, to concentrate on creating objects. However, it is not easy to train multiple GANs at once. Recently, DF-GAN [c14] showed that a single pair of a generator and a discriminator can produce realistic images using deep fusion blocks.
To encourage text-image consistency, AttnGAN [c10] proposes the DAMSM loss and MirrorGAN [mirrorgan] designs a text-to-image-to-text cycle-consistency loss. These methods compute semantic similarity using a pre-trained network, which slows down the training process. To improve on this, DF-GAN [c14] proposes MA-GP, a regularization method on the discriminator that uses real images and does not require an extra network. Thanks to these efforts, previous models can capture what to draw. However, they still struggle to generate intact objects and clear textures (Fig 1).
To circumvent this problem, we focus on directly guiding the generator by effectively exploiting the training data. To this end, we propose the Feature-Aware Generative Adversarial Network (FA-GAN) to produce more realistic images. Our method consists of two techniques: a self-supervised discriminator and a feature-aware loss. The self-supervised discriminator, equipped with an extra decoder, extracts better feature representations through auto-encoding training. The feature-aware loss, which utilizes these representations, provides more direct supervision to the generator by indicating the features that the generated image should have; it is a kind of regularization loss. The method can easily be applied to other existing models with small modifications. Extensive experiments and ablation studies on the MS-COCO [c16] dataset demonstrate the superiority of FA-GAN. For quantitative evaluation, we use the Fréchet Inception Distance (FID) [c17]. Our proposed method advances the state-of-the-art FID from 28.92 to 24.58.
Overall, our contributions are as follows:
We propose FA-GAN for text-to-image synthesis, a method that gives useful feedback to the generator to produce high-fidelity images.
Our proposed method is more effective than other similar regularization losses that use real and generated data.
FA-GAN outperforms the state-of-the-art models in terms of FID.
2 Proposed Method
2.1 Model Architecture
We use the one-stage GAN proposed in [c14]. The overall architecture is illustrated in Fig 2. It is composed of a pre-trained text encoder from [c10], a generator $G$, and a discriminator $D$. The pre-trained text encoder is a bi-directional Long Short-Term Memory network that extracts semantic vectors from the text. The last hidden states are employed as the sentence vector $s$. The generator $G$ generates an image from the given sentence vector $s$ and a noise vector $z$ sampled from a Gaussian distribution. With the addition of the text condition, the role of the discriminator differs from that of the traditional discriminator, which distinguishes whether the input image is real or fake. The discriminator $D$ takes two inputs, a sentence vector $s$ and an image $x$, and determines whether the image and the text match. There are three cases to consider while training $D$: (1) match, a real image with its matched sentence; (2) mismatch, a real image with a mismatched sentence; (3) mismatch, a fake image regardless of the sentence. We use a hinge loss [c18] to stabilize the GAN training procedure and the MA-GP loss from [c14]. The adversarial losses are defined as follows:

$$\mathcal{L}_{adv}^{D} = \mathbb{E}_{(x,s)\sim p_{data}}\big[\max(0,\,1 - D(x,s))\big] + \tfrac{1}{2}\,\mathbb{E}_{(x,\hat{s})}\big[\max(0,\,1 + D(x,\hat{s}))\big] + \tfrac{1}{2}\,\mathbb{E}_{z\sim p_z}\big[\max(0,\,1 + D(G(z,s),s))\big] + \mathcal{L}_{MA\text{-}GP},$$

$$\mathcal{L}_{adv}^{G} = -\mathbb{E}_{z\sim p_z}\big[D(G(z,s),s)\big],$$

where $(x, s)$ is a matched image and sentence pair from the real distribution $p_{data}$, and $(x, \hat{s})$ is a mismatched image and sentence pair.
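The three discriminator cases and the hinge formulation described above can be sketched in NumPy. The 1/2 weights on the two mismatch terms follow the DF-GAN convention and are an assumption here; the MA-GP term is omitted for brevity:

```python
import numpy as np

def d_hinge_loss(d_real_match, d_real_mismatch, d_fake):
    """Conditional hinge loss for D over the three training cases.

    Each argument is a batch of raw scores D(image, sentence):
      d_real_match    -- real image, matched sentence    (case 1)
      d_real_mismatch -- real image, mismatched sentence (case 2)
      d_fake          -- generated image, any sentence   (case 3)
    """
    loss_match = np.mean(np.maximum(0.0, 1.0 - d_real_match))
    # The two "mismatch" cases are each weighted by 1/2 (assumed convention).
    loss_mismatch = 0.5 * np.mean(np.maximum(0.0, 1.0 + d_real_mismatch))
    loss_fake = 0.5 * np.mean(np.maximum(0.0, 1.0 + d_fake))
    return loss_match + loss_mismatch + loss_fake

def g_adv_loss(d_fake):
    """Generator adversarial loss: push D's score on generated pairs up."""
    return -np.mean(d_fake)
```

Note that scores beyond the hinge margins (e.g. a matched pair already scored above 1) contribute nothing to the loss, which is what stabilizes training.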
2.2 Self-supervised Discriminator
Inspired by [c15], we design a self-supervised discriminator. We attach a single decoder $Dec$ to the intermediate layers of $D$ and denote the layers of $D$ before the decoder as $D_f$. The decoder has four convolution layers; its architecture is shown in Fig 2. $Dec$ reconstructs the input image from the output of $D_f$, and it is optimized only on real samples during the training of $D$. We employ a perceptual loss [c19] for the reconstruction loss:

$$\mathcal{L}_{rec} = \mathbb{E}_{(x,s)\sim p_{data}}\big[\mathcal{L}_{per}(Dec(D_f(x)),\, x)\big],$$

where $\mathcal{L}_{per}$ is the perceptual loss function. Such auto-encoding training makes $D_f$ learn better feature representations of the inputs. To take advantage of these representations, we utilize them to compute the feature-aware loss.
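A minimal runnable sketch of this auto-encoding objective follows, with toy linear stand-ins for $D_f$ and the decoder; the perceptual loss is replaced by pixel-wise MSE to keep the example self-contained. These stand-ins are assumptions, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: the real model uses convolutional layers.
W_f = rng.normal(size=(64, 16))    # "D_f": image (64-dim) -> features (16-dim)
W_dec = rng.normal(size=(16, 64))  # "Dec": features -> reconstructed image

def d_f(x):
    return np.tanh(x @ W_f)

def decoder(feats):
    return feats @ W_dec

def recon_loss(real_images, percept=lambda a, b: np.mean((a - b) ** 2)):
    """Auto-encoding objective for the self-supervised discriminator.
    Computed on REAL samples only; `percept` stands in for the
    perceptual loss (pixel-wise MSE here for self-containment)."""
    feats = d_f(real_images)
    return percept(decoder(feats), real_images)
```

The key design point is that the decoder's gradient also flows through $D_f$, so the reconstruction task shapes the discriminator's internal features rather than training a separate network.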
2.3 Feature-Aware Loss
The self-supervised discriminator is a prerequisite for the feature-aware loss. It internally forms an autoencoder, since it contains an auxiliary decoder. Autoencoders are designed to encode the input into a meaningful representation. Thanks to the autoencoder structure in the self-supervised discriminator, we can regard the features of the real image $x$ as guidelines that let the generator know which features the generated image should have. Consequently, we propose a feature-aware loss, a regularization loss for synthesizing an image that retains the features of a real image. Specifically, it is designed to enforce $G$ to generate an image that maximizes the similarity between the feature representations of the generated image and those of the corresponding real image $x$ with the same textual description $s$. We measure this similarity with a distance between the feature representations, converting the maximization problem into a minimization problem. The proposed feature-aware loss is formulated as follows:

$$\mathcal{L}_{fa} = \mathbb{E}_{z\sim p_z}\big[\,\|D_f(G(z,s)) - D_f(x)\|\,\big].$$
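The feature-aware regularization described above can be sketched as follows, where the two inputs would be $D_f$ features of the generated image and of the real image sharing the same caption. The squared L2 distance used here is one concrete choice of distance, an assumption for illustration:

```python
import numpy as np

def feature_aware_loss(feats_fake, feats_real):
    """Pull the discriminator features of the generated image toward the
    features of the real image with the same caption.

    feats_fake: D_f(G(z, s)), features of the generated image
    feats_real: D_f(x),       features of the matching real image
    (squared L2 distance is an assumed, illustrative choice)
    """
    return np.mean((feats_fake - feats_real) ** 2)
```

In training, the gradient of this loss would flow back through $D_f$ into the generator, while the real-image features act as a fixed target.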
In Sec 3.4, we compare our method with perceptual loss. Perceptual loss reconstructs the features obtained from the pre-trained VGG-16 [c20] network. The main difference between our method and perceptual loss is which features to use for the regularization.
2.4 Total Objective
Finally, the total objective functions of our FA-GAN are defined as:

$$\mathcal{L}_{D} = \mathcal{L}_{adv}^{D} + \mathcal{L}_{rec}, \qquad \mathcal{L}_{G} = \mathcal{L}_{adv}^{G} + \lambda\,\mathcal{L}_{fa},$$

where $\lambda$ balances the feature-aware loss against the adversarial loss.
Our method can be easily applied to other models by adding only the small number of parameters of the decoder $Dec$.
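As a minimal sketch, the totals combine additively; the balancing coefficient and its default value here are placeholders, not values from the paper:

```python
def total_d_loss(adv_d, recon):
    # Discriminator objective: adversarial term plus auto-encoding term.
    return adv_d + recon

def total_g_loss(adv_g, fa, lambda_fa=1.0):
    # Generator objective: adversarial term plus feature-aware regularization,
    # weighted by a balancing coefficient (placeholder value).
    return adv_g + lambda_fa * fa
```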
3 Experiments
We conduct extensive experiments to evaluate the performance of FA-GAN. We provide quantitative and qualitative evaluations of FA-GAN against state-of-the-art models on the MS-COCO dataset. In addition, we conduct ablation studies to investigate the effectiveness of the proposed method. Furthermore, we compare the feature-aware loss with other similar regularization losses. The results demonstrate the superiority of our proposed method.
3.1 Evaluation Details
Inception Score (IS) [c21] and Fréchet Inception Distance (FID) are widely used metrics for evaluating GANs. In contrast to IS, which evaluates only the distribution of generated images, FID compares the distribution of generated images with the distribution of real images, which makes FID much more robust than IS. In particular, if a model always generates the same image, its FID will be high (the lower the FID, the better), whereas IS cannot penalize this case. Indeed, [c12, c13, c14] found that IS is not an appropriate metric for evaluating text-to-image synthesis models: some models tend to generate the same image whenever the text contains the same word, which is undesirable for a generative model, yet IS can still be high (the higher the IS, the better). Thus, we use FID to evaluate our models. Following prior works [c11, c14], we generate 30,000 images from randomly selected sentences in the test set and compute the FID score.
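To make the metric concrete, here is a sketch of the Fréchet distance between two Gaussians under the simplifying assumption of diagonal covariances (the full FID uses a matrix square root of the covariance product; in practice the Gaussians are fitted to Inception features of real and generated images):

```python
import numpy as np

def fid_diagonal(mu1, var1, mu2, var2):
    """Frechet distance between N(mu1, diag(var1)) and N(mu2, diag(var2)).

    Diagonal-covariance simplification (assumption):
    FID = ||mu1 - mu2||^2 + sum(var1 + var2 - 2*sqrt(var1*var2))
    """
    mu1, var1, mu2, var2 = map(np.asarray, (mu1, var1, mu2, var2))
    mean_term = np.sum((mu1 - mu2) ** 2)
    cov_term = np.sum(var1 + var2 - 2.0 * np.sqrt(var1 * var2))
    return mean_term + cov_term
```

A collapsed generator that always emits the same image has near-zero feature variance, so the covariance term grows and FID rises, which is exactly the failure mode IS cannot detect.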
| Architecture | FID ↓ |
| --- | --- |
| baseline + SSD | 37.32 |
| baseline + FA loss | 28.25 |
| baseline + SSD + FA loss (Ours) | 24.58 |
3.2 Quantitative Evaluation
We compare our model with state-of-the-art text-to-image synthesis models on the MS-COCO dataset. As shown in Table 1, the proposed FA-GAN significantly outperforms the other methods. Compared with DF-GAN [c14], FA-GAN improves the FID from 28.92 to 24.58. These results show that our FA-GAN generates images of better quality.
3.3 Ablation Study
We perform ablation studies to verify the effectiveness of each component of our proposed method. The components are the self-supervised discriminator (SSD) and the feature-aware loss (FA). We denote the variant that uses only the feature-aware loss, without SSD, as FA. We compare four configurations: (1) FA-GAN w/o SSD and FA, (2) FA-GAN w/o FA, (3) FA-GAN w/o SSD, and (4) FA-GAN. We set (1) as our baseline model, which has a slightly lower FID score than DF-GAN [c14]. The results are reported in Table 2. They show that (2) and (3) do not guarantee an improvement over the baseline. We found that in (2), SSD without the regularization accelerates the convergence of the discriminator, making GAN training unstable, while (3) suffers from the side effects of regularization explained in Sec 3.4. However, our full method (4), which integrates both SSD and FA, significantly improves performance. We speculate that this is because regularizing with meaningful representations provides more useful signals to the generator.
| Architecture | FID ↓ |
| --- | --- |
| baseline + Perceptual loss | 37.84 |
| baseline + FA loss | 28.25 |
| baseline + SSD + FA loss (Ours) | 24.58 |
3.4 Comparison with Other Regularization Methods
To verify the superiority of our proposed feature-aware loss, we compare it with other similar regularization losses that use feature representations. Perceptual loss uses the representation of a VGG network pre-trained for image classification, and FA is the variant from Sec 3.3. The performances are reported in Table 3. A regularization loss is not always helpful for GANs because it can hinder diversity, which causes the FID to increase. These side effects depend on the features, which should contain meaningful information. We found that the other regularization losses suffer from such side effects, whereas our method has fewer side effects and more benefits. Fig 3 shows that our model generates more varied and clearer images. The results demonstrate that the feature-aware loss is a beneficial regularization and that our method effectively exploits the training data. We conjecture that this is because the self-supervised discriminator extracts more meaningful high-level semantic features than the alternatives.
3.5 Qualitative Results
Fig 4 shows several examples of images synthesized by our proposed FA-GAN and the baseline models. We observe that our model retains object shapes and generates clearer details compared to the baselines, which demonstrates the effectiveness of our proposed feature-aware loss.
4 Conclusion
In this paper, we propose FA-GAN for text-to-image synthesis, a method that gives the generator more useful signals. We design a self-supervised discriminator that is trained in an auto-encoding manner to extract meaningful features. The feature-aware loss enables the model to generate images that have the same features as real images. Experimental results show the effectiveness of our method: our model generates more realistic and clearer images, and FA-GAN significantly advances the state-of-the-art FID from 28.92 to 24.58. The proposed method can be easily applied to other existing methods with only small modifications.
This work was supported by an Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. B0101-15-0266, Development of High Performance Visual BigData Discovery Platform for Large-Scale Realtime Data Analysis; No. 2017-0-00897, Development of Object Detection and Recognition for Intelligent Vehicles).