Recent studies on generative adversarial networks (GAN) [Goodfellow et al. (2014), Mao et al. (2017), Chen et al. (2016), Radford et al. (2016), Gregor et al. (2015)] have achieved impressive success in generating images. However, since most images generated by a GAN do not exactly satisfy the users' expectations, auxiliary information in various forms, such as base images and texts, is used as the main cue for controlling the generated images. In this line of research, creating images under specific conditions, such as transferring the style of an image [Choi et al. (2018), Kim et al. (2017), Zhu et al. (2017), Ma et al. (2017), Chen and Koltun (2017)] or generating an image based on a text description [Zhang et al. (2017b), Zhang et al. (2017a), Reed et al. (2016b), Mansimov et al. (2016), Reed et al. (2016c)], has been actively studied.
Among these, text-to-image generation is meaningful in that the generated image can be fine-tuned through the guidance of a text description. Although GANs can be applied to diverse text-to-image generation tasks, most applications focus on controlling the shape and texture of the foreground, and relatively little attention has been paid to the background. For example, Reed et al. (2016d) created images from text containing information on the appearance of the object to generate. The method can generate a target image in a given location or with a specified pose; however, it cannot control the background. Some works Dong et al. (2017); Ma et al. (2017) have considered the background. Dong et al. (2017) considered multi-modal conditions based on both image and text and can change the base image according to the text description. However, the method has the restriction that an object similar to the generated one should be present in the base image; thus, it can be considered a style transfer problem. Ma et al. (2017) also solved a multi-modal style transfer problem from a reference person image to a target pose: they kept the background and changed the reference person's pose to the target pose.
In this paper, we define a novel conditional GAN problem which generates a new image by synthesizing the background of an original base image with a new object, described by a text description, at a specific location. Different from the existing works Reed et al. (2016d); Dong et al. (2017), we aim to draw a target object on a base image that does not contain similar objects. To the best of our knowledge, our research is the first attempt to synthesize a target image by combining the background of an original image with a text-described foreground object. As shown in Fig. 1, our approach differs from studies that create a random image at a desired location Reed et al. (2016d) or change the foreground style Dong et al. (2017) in that we want to independently apply separate foreground and background conditions for image synthesis. This problem is not trivial because the generated foreground object and the background from the base image should be smoothly mixed with a plausible pose and layout.
To tackle this image synthesis problem, we introduce a new architecture of multi-conditional GAN (MC-GAN) using a synthesis block, which acts like a pixel-wise gating function controlling the amount of information from the base background image with the help of the text description of a foreground object. With the help of this synthesis block, MC-GAN can generate a natural image of an object described by the text in a specified location with the desired background. To show the effectiveness of our method, we trained MC-GAN on the Caltech-200 bird dataset Wah et al. (2011) and the Oxford-102 flower dataset Nilsback and Zisserman (2008, 2007) and compared the performance with that of a baseline model Dong et al. (2017).
Our main contributions can be summarized as follows: (1) We define a novel multi-modal conditional synthesis problem using a base image, text, and location. (2) To handle complex multi-modal conditions in GAN, we suggest a new architecture, MC-GAN, using synthesis blocks. (3) The proposed architecture is shown to generate plausible natural scenes, as shown in Figs. 1 and 2, by training on publicly available data, regardless of whether the base image contains an object similar to the one to be created.
2 Related Work
Among the diverse variants of GAN Zhu et al. (2017); Choi et al. (2018); Reed et al. (2016b); Wang et al. (2018), we can categorize the studies into three large groups: 1) the style transfer problem, 2) the text-to-image problem, and 3) the multi-modal conditional problem.
Style Transfer: The style transfer problem uses an image as input and converts the foreground to a different style. Zhu et al. (2017); Choi et al. (2018) and Kim et al. (2017) transferred images to a different domain style, such as a smiling face to an angry face, or a handbag to shoes. In addition to these applications, some works created a real image from a segmentation label map Wang et al. (2018); Chen and Koltun (2017), or used a map of part locations in combination with an original image of a person to generate images of the person in different poses Ma et al. (2017).
Text-to-Image: The text-to-image problem uses a text description as input to generate an image. It has a great advantage over other methods in that it can easily generate an image with the attributes that a user really wants, because text can express detailed high-level information on the appearance of an object. The raw text is usually embedded according to the method in Reed et al. (2016a). Reed et al. (2016b) proposed a novel text-to-image generation model, and Zhang et al. (2017a, b) later improved the image quality by stacking multiple GANs.
Multi-modal Conditional Image Generation: A multi-modal conditional problem is to create images satisfying multiple input conditions in different modalities, such as a pair of (image, location) or (image, text). Reed et al. (2016d) provided a desired object position by a bounding box, or a set of object part locations by points, in an empty image in combination with the text description to generate an object image (see Fig. 1). Dong et al. (2017) used both an image and a text as input to a GAN for image generation (see also Fig. 1). They intended to keep the image parts irrelevant to the text and to change the style of the object contained in the base image based on the text description. Although Reed et al. (2016d) is similar to our work in terms of using location information, our method generates an object with an appropriate pose by automatically understanding the semantic information of the background image. Compared to our method, using part locations as in Reed et al. (2016d), which requires a user to select them, is somewhat unnatural and time-consuming. However, the bounding box condition in Reed et al. (2016d) is similar to our problem, so it can be said that our study partially includes the problem defined in Reed et al. (2016d). The method in Dong et al. (2017) also uses image and text conditions together. However, our study does not have the restriction that the same kind of object to generate must be present in the base image. In other words, ours does not change the style of an already existing object but synthesizes a new object with a slightly but properly changed background (see the last rows of Figs. 1 and 2).
Fig. 3 represents the overall structure of the proposed MC-GAN. The generator of MC-GAN first encodes the input text sentence into a text embedding using the method in Reed et al. (2016a). As in Zhang et al. (2017a) and Zhang et al. (2017b), the text embedding is concatenated with a noise vector, and fully connected (FC) layers are applied to constitute a seed feature map. Then, a series of synthesis blocks takes the seed feature map and, in combination with the image features from the background image, generates an output image and a segmentation mask. The synthesis blocks are used to prevent overlapping between the generated object and the background. In Section 3.1, we describe the characteristics of the proposed synthesis block in more detail, and in Section 3.2 we explain the model structure and the training strategy.
3.1 Synthesis Block
Fig. 4 (a) describes the framework of the proposed synthesis block. In the synthesis block, the background (BG) feature is extracted from the given image without a non-linear function (i.e., using only convolution and batch normalization (BN) Ioffe and Szegedy (2015)), and the foreground (FG) feature is the feature map from the previous layer. As shown in the figure, the BG feature is controlled by multiplying it with an activated switch feature map, a fraction of the FG feature map. The sizes of the BG and FG feature maps are the same, and the depth of the FG feature map is doubled as it passes through the convolution layers. A half of the doubled feature map, denoted as the switch in the figure, is used as input to the sigmoid function, while the other half is forwarded to generate a larger FG feature map for the next synthesis block. The switch determines what amount of BG information should be retained for the next synthesis block. After the switch feature map is activated by the sigmoid function, it is multiplied element-wise with the BG feature map to suppress background information where the object is to be generated. Finally, the spatial dimension of the feature map is doubled by the upsampling layer after element-wise addition of the suppressed BG feature map and the FG feature map. Because MC-GAN has a background reconstruction loss comparing the background of the created image with that of the base image, the switch has the effect of suppressing the base image in the object area and mimicking the base image in the background. Therefore, a visualized switch map is the opposite concept to the segmentation mask.
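The gating arithmetic of the synthesis block can be sketched as follows; this is a minimal NumPy illustration in which the convolution and BN layers are omitted and the tensor shapes are our own illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def synthesis_block(fg, bg):
    """Gating arithmetic of one synthesis block (convolutions and BN omitted).

    fg: FG feature map of shape (2c, h, w), depth already doubled by the
        preceding convolutions; the first c channels act as the switch.
    bg: BG feature map of shape (c, h, w) from the image-feature path.
    Returns the fused (c, 2h, 2w) feature map for the next block.
    """
    c = bg.shape[0]
    switch, fg_half = fg[:c], fg[c:]        # split the doubled FG map in half
    gate = sigmoid(switch)                  # pixel-wise gate in (0, 1)
    fused = gate * bg + fg_half             # suppress BG where the object goes
    return fused.repeat(2, axis=1).repeat(2, axis=2)  # nearest-neighbor upsample
```

With an all-zero switch the gate is uniformly 0.5, so the background is half-suppressed everywhere; in the trained network the switch instead learns to suppress the background only inside the object region.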
Fig. 4 (b) and (c) show a couple of output examples. From left to right, they are 1) a cropped background image from a specific location, 2) the generated image, 3) the generated mask, and 4) the switch feature map from the final synthesis block. A close look at the switch feature map in Fig. 4 (b) shows that it does not change the original background much, because the object naturally goes with the background. Thus, the background region is highly activated in the switch while the object region is suppressed (see the last column). On the other hand, in Fig. 4 (c), because a picture of a bird standing in mid-air would be unnatural, the generator adds a branch to the figure. In this case, since the original background should not overlap with the newly generated branch, the corresponding area of the switch map is deactivated in the new branch area.
3.2 Network Design and Loss Function
MC-GAN encodes the input text sentence using the method in Reed et al. (2016a). However, the vector generated by this method lies on a high-dimensional manifold, while the number of available data is relatively small. Zhang et al. (2017a, b) pointed out this problem, proposed the conditioning augmentation method, and used fully connected layers to make an initial seed feature map from the text embedding and a noise vector. Here, we follow this method of creating the initial seed feature map as in Zhang et al. (2017a, b). A cropped region from the base image, as well as the text sentence, is input to MC-GAN. Starting from the seed feature map produced by the fully connected layers, the resolution is doubled with each pass through an upsampling layer. The upsampling layer uses the nearest-neighbor method to double the resolution, and a convolution is applied with BN and ReLU activation to improve the quality of the image. The number of channels is halved in each block of the upstream. Conversely, the path creating the image feature (downstream) does not use any non-linear function; each step consists of a convolution layer with BN. At each step of the downstream, the spatial resolution is halved using stride 2 and the number of channels is doubled. By combining the upstream and downstream, the final feature map is obtained, which is converted into 4 channels (3 for RGB and 1 for the segmentation mask).
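The opposing channel/resolution schedules of the two streams can be tabulated as below; the seed dimensions used here (8x8 spatial size, 512 channels) are illustrative assumptions, not values taken from the paper:

```python
def stream_schedule(seed_hw=8, seed_ch=512, n_blocks=4):
    """Illustrative shape schedule. The upstream halves channels and doubles
    resolution per synthesis block; the downstream mirrors it, so the BG and
    FG feature maps have matching sizes at every block. seed_hw and seed_ch
    are assumed values for illustration only.
    Returns (upstream, downstream) lists of (channels, height, width)."""
    upstream, downstream = [], []
    ch, hw = seed_ch, seed_hw
    for _ in range(n_blocks + 1):
        upstream.append((ch, hw, hw))
        downstream.append((ch, hw, hw))
        ch //= 2
        hw *= 2
    downstream.reverse()  # downstream runs from the full-resolution image inward
    return upstream, downstream
```

At every synthesis block the upstream entry matches the downstream entry at the same spatial size, which is what allows the element-wise gating and addition described above.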
The discriminator takes a tuple of image, segmentation mask, and text as input. Convolutions followed by BN and Leaky ReLU downsample the image and the mask separately into an image code and an image-mask code. The image-mask code is concatenated with the replicated text code, which is obtained by the conditioning augmentation technique Zhang et al. (2017a, b) using the text embedding. We apply a convolution layer to the associated image-mask-text code, then perform BN and Leaky ReLU activation to reduce the dimension. The image code, image-mask code, and image-mask-text code are trained by the method proposed in Mao et al. (2017). In our case, the discriminator learns the following four types of input tuples: 1) a real image with its matching mask and text; 2) a real image and matching mask with a mismatched text; 3) a real image and matching text with a mismatched mask; and 4) a generated image and mask with the conditioning text.
Here, the subscripts indicate whether each element of the tuple matches or not (e.g., the tuple $(x, \hat{m}, t)$ means that the image matches the text but the segmentation mask is mismatched). Using the four types of tuples, the discriminator loss function for the image-mask-text output $D_{imt}$ becomes

$$\begin{aligned}\mathcal{L}_{D_{imt}} ={}& \mathbb{E}_{(x,m,t)\sim p_{data}}\big[(D_{imt}(x,m,t)-1)^2\big] + \mathbb{E}_{(x,m,\hat{t})}\big[D_{imt}(x,m,\hat{t})^2\big] \\ &+ \mathbb{E}_{(x,\hat{m},t)}\big[D_{imt}(x,\hat{m},t)^2\big] + \mathbb{E}_{(\tilde{x},\tilde{m},t)\sim p_{G}}\big[D_{imt}(\tilde{x},\tilde{m},t)^2\big].\end{aligned}$$
Here, $p_{data}$ and $p_{G}$ denote the distributions of real and generated data, respectively; $(\tilde{x},\tilde{m}) = G(x_b, t, z)$ means the output of the generator using the base image $x_b$, text $t$, and noise $z$; and $\hat{t}$ and $\hat{m}$ denote a mismatched text and mask. The first term enforces the discriminator to output 1 for the true input (type 1), the second term tries to distinguish mismatched texts (type 2), the third term distinguishes false masks (type 3), and the last term distinguishes the fake image and mask from the real ones. Likewise, the loss functions for the image output $D_i$ and the image-mask output $D_{im}$ become

$$\mathcal{L}_{D_i} = \mathbb{E}_{x\sim p_{data}}\big[(D_i(x)-1)^2\big] + \mathbb{E}_{\tilde{x}\sim p_{G}}\big[D_i(\tilde{x})^2\big],$$

$$\mathcal{L}_{D_{im}} = \mathbb{E}_{(x,m)\sim p_{data}}\big[(D_{im}(x,m)-1)^2\big] + \mathbb{E}_{(x,\hat{m})}\big[D_{im}(x,\hat{m})^2\big] + \mathbb{E}_{(\tilde{x},\tilde{m})\sim p_{G}}\big[D_{im}(\tilde{x},\tilde{m})^2\big].$$
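Under the least-squares formulation of Mao et al. (2017), the four-term discriminator objective described above can be sketched as follows; the function name is ours and the inputs are precomputed discriminator scores, so this is a simplified illustration rather than the exact implementation:

```python
import numpy as np

def discriminator_loss(d_real, d_mis_text, d_mis_mask, d_fake):
    """Least-squares discriminator loss over the four tuple types:
    real/matching tuples are pushed toward 1, while mismatched-text,
    mismatched-mask, and fake tuples are pushed toward 0."""
    return (np.mean((d_real - 1.0) ** 2)
            + np.mean(d_mis_text ** 2)
            + np.mean(d_mis_mask ** 2)
            + np.mean(d_fake ** 2))
```

A perfect discriminator (scoring 1 on matching real tuples and 0 on the other three types) attains zero loss under this objective.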
In the training of the generator, regularization terms for the conditioning augmentation and the background reconstruction are added to the general GAN loss term as follows:

$$\mathcal{L}_{G} = \mathbb{E}_{x_b,t,z}\big[(D(G(x_b,t,z))-1)^2\big] + \lambda_1 D_{KL}\big(\mathcal{N}(\mu(t),\Sigma(t)) \,\|\, \mathcal{N}(0,I)\big) + \lambda_2 \big\|\big(1-\epsilon(\tilde{m})\big) \odot (\tilde{x} - x_b)\big\|_1.$$

The last term, the background reconstruction loss, affects the feature extraction of the base image and drives the activation of the switch, which determines which parts should be taken for synthesis. The operator $\epsilon(\cdot)$ denotes the morphological erosion of the mask for smoothing, and $\odot$ is element-wise multiplication. The areas of the fake image and the base image excluding the object part are compared and trained with the $L_1$ loss.
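A minimal sketch of the background reconstruction term follows, assuming an L1 penalty and a naive window-minimum erosion; both the erosion window size and the function names are our own illustrative choices:

```python
import numpy as np

def erode(mask, k=1):
    """Naive morphological erosion of a binary mask with a (2k+1)x(2k+1)
    window: a pixel survives only if its whole neighborhood is foreground."""
    h, w = mask.shape
    padded = np.pad(mask, k, constant_values=0)
    out = np.ones_like(mask)
    for dy in range(2 * k + 1):
        for dx in range(2 * k + 1):
            out = np.minimum(out, padded[dy:dy + h, dx:dx + w])
    return out

def bg_reconstruction_loss(fake_img, base_img, fake_mask):
    """Mean absolute difference between the fake (C, H, W) and base images,
    restricted to pixels outside the (eroded) generated object mask (H, W)."""
    keep = 1.0 - erode(fake_mask)          # 1 on background pixels
    return np.mean(np.abs(keep[None] * (fake_img - base_img)))
```

Because the object region is masked out, the generator is only penalized for altering the background, which is what gives the switch its BG-preserving behavior.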
4.1 Dataset and training details
We validated the proposed algorithm using the publicly available Caltech-200 bird Wah et al. (2011) and Oxford-102 flower Nilsback and Zisserman (2008) datasets. For comparison, in addition to several ablation methods, we tested the recent work of Dong et al. (2017), which uses image- and text-based multi-modal conditions. Reed et al. (2016d) also used multi-modal conditions for generating images, but they did not consider an image condition, and both Dong et al. (2017) and Reed et al. (2016d) are based on the method in Reed et al. (2016b). Thus, only the work of Dong et al. (2017) was compared.
The Caltech-200 bird dataset consists of 200 categories of bird images (150 categories for training and 50 categories for testing) and provides ground-truth segmentation masks for all 11,788 bird images. For the text attributes, we used the captions from Reed et al. (2016a), which contain ten captions for each image. The captions describe the attributes of a bird, such as its appearance and colors. For the background images, we cropped image patches excluding birds from the Caltech-200 bird dataset by using the segmentation masks. Separate sets of background images were used for training and testing.
The Oxford-102 flower dataset includes 102 categories of flower images and is divided into a training set with 82 categories and a test set with 20 categories. To obtain the ground-truth segmentation masks, we used the segmentation method of Nilsback and Zisserman (2007). For the background images, we crawled 1,352 images (1,217 for training and 135 for testing) from the web with the keywords 'flower leaf' or 'flower foliage'. The captions of the flower images, which describe the shape and colors of the flowers, were taken from Reed et al. (2016a).
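The background-patch extraction described above could be implemented along these lines; the brute-force scan and the function name are our own sketch, not the paper's procedure:

```python
import numpy as np

def crop_background_patch(img, mask, size, stride=1):
    """Scan img (C, H, W) for a size x size window whose pixels are all
    background according to mask (H, W; nonzero = object). Returns the
    first object-free patch found, or None if no such patch exists."""
    h, w = mask.shape
    for y in range(0, h - size + 1, stride):
        for x in range(0, w - size + 1, stride):
            if not mask[y:y + size, x:x + size].any():
                return img[:, y:y + size, x:x + size]
    return None
```

In practice one would sample such patches at random positions rather than taking the first hit, but the object-exclusion test via the segmentation mask is the essential step.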
In the training, an initial learning rate of 0.0002 and Adam optimization Kingma and Ba (2015) were used, and the network was trained with a fixed batch size for a fixed number of epochs on each dataset. The regularization weight for the conditioning augmentation was shared across both datasets, while the weight for the background reconstruction was set to 15 for the bird dataset and 30 for the flower dataset. We also used image augmentation techniques, including random flipping, zooming, and cropping. Four synthesis blocks were used to generate an image from the seed feature map.
4.2 Comparison with the Baseline Method
We compared our method with the baseline Dong et al. (2017), which is also an image-text multi-conditional GAN, on our new synthesis problem. Originally, Dong et al. (2017) reduced the learning rate by half every 100 epochs and trained for 600 epochs. In this experiment, we decreased the learning rate every 200 epochs and trained the network for the same total number of epochs for a fair comparison. Fig. 5 shows some examples of images generated by the proposed MC-GAN and the baseline work of Dong et al. (2017). From the figures, we confirmed that the results from Dong et al. (2017) either did not preserve the background information or generated only background images without the target object.
4.3 Comparison using the Synthesis Problem in Dong et al. (2017)
Here, we show that the synthesis block solves the multi-modal conditional problem more reliably than the baseline Dong et al. (2017). The semantic synthesis problem of Dong et al. (2017) aims to keep the features of input images that are irrelevant to the target text description and to transfer the relevant parts of the input image to ones that match the target text description. We used the same text embedding method without the segmentation mask, and we reduced the learning rate by half every 100 epochs until 600 epochs, under the same conditions as Dong et al. (2017). The proposed structure of MC-GAN is applied to the generator and discriminator networks, but only the image-text pair loss is used for the discriminator, as in Dong et al. (2017).
Fig. 6 shows some examples from the baseline method and ours. Dong et al. (2017) worked well when the background image was not complicated (columns 3 and 6), but when the background was complex (column 5), it failed to generate a plausible object. Even when the object was generated well, the irrelevant parts of the image also changed considerably. On the other hand, our generator using the proposed synthesis block stably maintained the shape of the object and the texture of the background even with a complex background, and the background parts irrelevant to the text description rarely changed. Although the segmentation mask was not used in training, image features were provided to each synthesis block to keep the background and the shape of the object. Therefore, the background parts irrelevant to the target text were maintained, and the shape was not distorted even when the color of the object changed according to the text description.
4.4 Interpolation and Variety
To generate images appropriate for various sentences, the latent manifold of the text embedding should be trained continuously. We generated continuously changing images by linearly interpolating between two different text embeddings from the sentences shown at the bottom of Fig. 8. The figure shows example images whose colors change smoothly (mainly from orange to blue and from gray to yellow) following the interpolated text embedding under the same noise and image conditions.
As another experiment, we generated images by linearly interpolating between two noise vectors, the all-zero and all-one vectors, under the same text and image conditions to demonstrate our model's variety and stability. Fig. 9 shows some resulting images. Although it depends on the text and image conditions, roughly half of the samples were usually visually successful, as can be seen in Fig. 9.
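Both interpolation experiments reduce to the same linear interpolation between two condition vectors; a trivial sketch (variable names are ours):

```python
import numpy as np

def interpolate(a, b, steps):
    """Linearly interpolate from vector a to vector b in `steps` points,
    endpoints included; here a and b stand for two text embeddings or two
    noise vectors fed to the generator under otherwise fixed conditions."""
    ts = np.linspace(0.0, 1.0, steps)
    return np.stack([(1.0 - t) * a + t * b for t in ts])
```

Feeding each interpolated vector to the generator with the other conditions held fixed yields the smoothly changing image sequences shown in the figures.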
4.5 The effect of switch
The switch in the synthesis block prevents the background and the foreground from overlapping each other. We compared the images generated by changing the switch values (the outputs of the sigmoid function) in a fixed trained MC-GAN model under the same base image, text, and noise conditions. Fig. 10 shows some results. If we turn off all the switches (by zeroing their values) to prevent background information from being added, the original background disappears and only the generated object and a changed background are present. If all the switch values are set to 0.5 (half on), the background is reconstructed, but the object and the background slightly overlap, and the image becomes blurred compared to that of the original MC-GAN. Finally, when all the switches are turned on, the original background information is added without suppression; in this case, the object and the background overlap each other and the object is not properly visualized. Through this experiment, we verified that the switch in the synthesis block analyzes the current image and text conditions and adjusts the image features flexibly to assist the proper synthesis of an image.
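This experiment amounts to overriding the learned gate with a constant; in the sketch below (our own naming), `override` values of 0.0, 0.5, and 1.0 reproduce the off, half-on, and fully-on settings:

```python
import numpy as np

def gated_fusion(switch, fg, bg, override=None):
    """Fuse FG and BG feature maps through the sigmoid-activated switch.
    If override is given, every gate value is forced to that constant,
    mimicking the switch-manipulation experiment on a trained model."""
    if override is not None:
        gate = np.full_like(bg, override)
    else:
        gate = 1.0 / (1.0 + np.exp(-switch))
    return gate * bg + fg
```

With `override=0.0` no background information passes through (background disappears), while `override=1.0` adds the background unsuppressed so that it overlaps the generated object.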
4.6 High Resolution Image Generation by Stacked GAN
Based on the proposed MC-GAN, we additionally introduce a model that generates high-resolution images by adding a StackGAN-style two-stage generator. In MC-StackGAN, two GANs were stacked for generating images as in Zhang et al. (2017a, b). The structure of the first GAN is the same as our original method, and the second GAN takes the final feature map from the first GAN and a text embedding code from conditioning augmentation without a noise vector, as in Zhang et al. (2017a, b). The first GAN's final feature map is concatenated with the replicated text embedding code. We applied a convolution to the associated feature map with batch normalization and ReLU. Finally, we used one more synthesis block and an upsampling layer to generate images. MC-StackGAN generates objects more stably than MC-GAN, but tends to transform the original base image more.
In this paper, we introduced a new method of GAN to generate an image given a text attribute and a base image. Different from the existing text-to-image synthesis algorithms only considering the foreground object, the proposed method aims to generate the proper target image, as well as preserving the semantics of the given background image. To solve the problem, we newly proposed a MC-GAN structure and a synthesis block which is a core component enabling a photo-realistic synthesis by smoothly mixing foreground and background information. Using the proposed method, we confirmed that our model can generate diverse forms of a target object according to the text attribute while preserving the information of the given background image. We also confirmed that our model can generate the object even when the background image does not include the same kind of object as the target, which is difficult for existing works.
This work was supported by the Green Car Development project through the Korean MTIE (10063267) and the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (2017M3C4A7077582).
- Chen and Koltun (2017) Qifeng Chen and Vladlen Koltun. Photographic image synthesis with cascaded refinement networks. In The IEEE International Conference on Computer Vision (ICCV), 2017.
- Chen et al. (2016) Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.
- Choi et al. (2018) Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
- Dong et al. (2017) Hao Dong, Simiao Yu, Chao Wu, and Yike Guo. Semantic image synthesis via adversarial learning. In The IEEE International Conference on Computer Vision (ICCV), 2017.
- Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
- Gregor et al. (2015) Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra. Draw: A recurrent neural network for image generation. In International Conference on Machine Learning (ICML), 2015.
- Ioffe and Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), 2015.
- Kim et al. (2017) Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jungkwon Lee, and Jiwon Kim. Learning to discover cross-domain relations with generative adversarial networks. In International Conference on Machine Learning (ICML), 2017.
- Kingma and Ba (2015) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
- Ma et al. (2017) Liqian Ma, Xu Jia, Qianru Sun, Bernt Schiele, Tinne Tuytelaars, and Luc Van Gool. Pose guided person image generation. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 405–415, 2017.
- Mansimov et al. (2016) Elman Mansimov, Emilio Parisotto, Jimmy Ba, and Ruslan Salakhutdinov. Generating images from captions with attention. In ICLR, 2016.
- Mao et al. (2017) Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2813–2821. IEEE, 2017.
- Nilsback and Zisserman (2008) M-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing, Dec 2008.
- Nilsback and Zisserman (2007) Maria-Elena Nilsback and Andrew Zisserman. Delving into the whorl of flower segmentation. In BMVC, pages 1–10, 2007.
- Radford et al. (2016) Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In International Conference on Learning Representations (ICLR), 2016.
- Reed et al. (2016a) Scott Reed, Zeynep Akata, Honglak Lee, and Bernt Schiele. Learning deep representations of fine-grained visual descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 49–58, 2016a.
- Reed et al. (2016b) Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text-to-image synthesis. In Proceedings of The 33rd International Conference on Machine Learning, 2016b.
- Reed et al. (2016c) Scott Reed, Aäron van den Oord, Nal Kalchbrenner, Victor Bapst, Matt Botvinick, and Nando de Freitas. Generating interpretable images with controllable structure. Technical report, 2016c.
- Reed et al. (2016d) Scott E Reed, Zeynep Akata, Santosh Mohan, Samuel Tenka, Bernt Schiele, and Honglak Lee. Learning what and where to draw. In Advances in Neural Information Processing Systems, pages 217–225, 2016d.
- Wah et al. (2011) C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
- Wang et al. (2018) Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
- Zhang et al. (2017a) Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaolei Huang, Xiaogang Wang, and Dimitris Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In The IEEE International Conference on Computer Vision (ICCV), pages 5907–5915, 2017a.
- Zhang et al. (2017b) Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris Metaxas. Stackgan++: Realistic image synthesis with stacked generative adversarial networks. arXiv: 1710.10916, 2017b.
- Zhu et al. (2017) Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In The IEEE International Conference on Computer Vision (ICCV), 2017.