Text-to-image synthesis is a fundamental and novel research domain in computer vision, first proposed by Reed et al. in 2016. It can be seen as the reverse of image captioning: it aims to generate natural images from input sentences. Like image captioning, text-to-image synthesis helps to mine the relationship between text and images and to explore the visual-semantic mechanisms of the human brain. It also has great application potential in art creation, computer-aided design, image search and so on.
Classical methods for text-to-image synthesis mostly share a similar framework. They use a pretrained text-encoder to encode the input description as a semantic vector, then train a conditional GAN as the image-decoder to generate natural images from the concatenation of the semantic vector and a noise vector drawn from a normal distribution. Although such a framework can synthesize high-quality natural images, it separates the training of the text-encoder from that of the image-decoder. As a result, the quality of the semantic vector produced by the text-encoder bounds the best quality the image-decoder can reach. To tackle this issue, we build a fully-trained GAN (FTGAN) for text-to-image synthesis, which trains the text-encoder and the image-decoder at the same time.
Text-to-face synthesis is a sub-domain of text-to-image synthesis that aims to synthesize face images from human descriptions. As in text-to-image synthesis, there are two main targets: (1) to generate high-quality images; (2) to generate images that conform to the input descriptions. Compared with general text-to-image synthesis, this task is more directly relevant to public safety. Drawing a picture of a suspect based only on the descriptions of eyewitnesses is difficult and time-consuming, and it requires professional skills and rich experience. With a well-trained text-to-face model, however, an ordinary person could quickly generate photo-realistic faces of suspects directly from eyewitness descriptions.
For text-to-image synthesis, the common datasets are CUB, Oxford102 and COCO. Since text-to-face is a sub-domain of text-to-image synthesis, state-of-the-art text-to-image networks can also be applied to text-to-face synthesis. However, little research has focused on text-to-face synthesis, because no standard text-to-face dataset is available. To the best of our knowledge, some work addresses text-to-face-sketch synthesis and attribute-vector-to-sketch-to-natural-face synthesis. For generating natural faces from descriptions, however, there is only a GitHub repository named T2F (https://github.com/akanimax/T2F), which builds a network based on ProGAN and StackGAN and uses a dataset called Face2text for training and testing. Its synthesized results are of poor quality. To tackle this issue, we build a dataset, SCU-Text2face, based on CelebA, which contains 1000 images. Each face image in SCU-Text2face has five descriptions given by different people. This dataset helps to establish a baseline for the text-to-face synthesis task.
The main contribution of our work is threefold. (i) A fully-trained generative adversarial network, FTGAN, is proposed for synthesizing images from text descriptions. Experimental results show that FTGAN outperforms previous state-of-the-art GAN models. (ii) A text-to-face dataset, SCU-Text2face, is built for the text-to-face synthesis task. (iii) A baseline for text-to-face synthesis is built on FTGAN. To the best of our knowledge, this is the first research focusing on generating natural faces from text descriptions.
2 Related Work
Two main domains are related to text-to-face synthesis: (1) text-to-image synthesis; (2) face generation. Though little research focuses on text-to-face synthesis itself, it can benefit greatly from the development of these two domains.
2.1 Text-to-image Synthesis
Although there are many kinds of networks for text-to-image synthesis, they are mostly based on an encoder-decoder framework and conditional GAN. The encoder-decoder framework includes a text-encoder and an image-decoder. The text-encoder turns input descriptions into semantic vectors, and the image-decoder turns the encoded semantic vectors into natural images. There are two main targets for text-to-image synthesis: to generate high-quality images and to generate images matching the given descriptions. All developments in text-to-image synthesis are driven by these two targets.
Early research on text-to-image synthesis mainly focused on improving the quality of generated images. The task was first presented in 2016, when Reed et al. developed two end-to-end networks based on conditional GAN to accomplish it. Reed et al. used a pretrained Char-CNN-RNN network for text encoding and built a network similar to DCGAN as the image decoder to generate natural images from the encoded vector. Many researchers then made progress based on this work. One of the most influential contributions is by Zhang et al., who proposed a two-stage network, StackGAN, which generates high-quality images and markedly improved the Inception Score. This design was also inherited by later research [37, 32, 38, 20].
Once networks were capable of generating realistic images, researchers progressively turned to the other target: improving the consistency between the input text and the generated images.
Reed et al. proposed a network that generates images based on a bounding box generated first, which helped to produce more accurate output images.
Hong et al. also designed a GAN based on a similar idea.
On the other hand, Sharma et al. used dialogue to assist in understanding the description, which helps to synthesize images more relevant to the input text.
Dong et al. proposed an approach that generates new images from an input image and a description, so that the new images match the input description.
They also proposed a new training method called Image-Text-Image (I2T2I), which integrates text-to-image and image-to-text (image captioning) synthesis to improve the performance of text-to-image synthesis.
Attention mechanisms have already achieved great breakthroughs in text-related and image-related tasks [33, 31, 35, 27], and they are now also used in GANs for text-to-image generation. Xu et al. built AttnGAN, the first to develop an attention mechanism that enables GANs to generate fine-grained, high-resolution images from natural language descriptions.
Qiao et al. proposed a text-to-image-to-text network called MirrorGAN, which applies a global-local collaborative attention model.
Since there is no established criterion for how well generated images match the input descriptions, Zhang et al. proposed a visual-semantic similarity measure to complement existing evaluation metrics.
This body of work shows a trend: researchers are increasingly focusing on the consistency between generated images and input sentences.
2.2 Face Synthesis
Since GAN was proposed by Goodfellow et al. in 2014, image synthesis has been a hot topic in deep learning. Because there are two large-scale public datasets, CelebA and LFW, face synthesis is also a popular research domain. Most state-of-the-art networks demonstrate their superiority on face synthesis, including networks based on GAN and networks based on conditional GAN (such as DCGAN, CycleGAN, ProGAN, BigGAN, StyleGAN, StarGAN and so on). With the development of these networks, the quality of generated face images keeps improving. Some networks can now even generate 1024×1024 face images, much larger than the original image resolution of the face dataset. These models aim to learn a mapping from a noise vector drawn from a normal distribution to natural face images, but they offer no control over which precise face image is generated.
To tackle this issue, conditional GAN has given rise to many interesting face synthesis applications, such as translating edges to natural face images, exchanging the attributes of two face images, generating a frontal face from a profile face, generating a full face from the eye region only, synthesizing natural face images from face attributes via sketches, face inpainting and so on. These networks control the synthesized face images by adding a condition vector, and can therefore generate face images that meet the needs of different situations. Text-to-face synthesis is similar to these tasks, using the input description as the control condition.
3 Fully-trained Generative Adversarial Network
In this section, we elaborate on the framework and details of FTGAN. We first compare the framework of FTGAN with previous text-to-image networks, and then give a comprehensive description of the network design of FTGAN.
3.1 Fully-trained text-to-image framework
The framework of text-to-image synthesis can be divided into two parts: a text-encoder and an image-decoder. The text-encoder is responsible for encoding the input sentences into semantic vectors; the Char-CNN-RNN used in Reed's work can be seen as a text-encoder. The image-decoder generates natural images from the semantic vectors produced by the text-encoder, and is often similar to networks like DCGAN. Current GAN-based models for text-to-image generation [22, 23, 36, 37, 32] typically split the training of the text-encoder and the image-decoder: they train the text-encoder first, and then use the pre-trained text-encoder to train the image-decoder. Unlike most previous networks, AttnGAN designed a DAMSM network to perform text encoding and calculate the attention maps, instead of using the Char-CNN-RNN to encode the input sentences. Our work is mainly based on this network. In this section, we propose a novel framework for text-to-image synthesis, which trains the text-encoder and the image-decoder at the same time.
As shown in Figure 2, common networks for text-to-image synthesis are based on an encoder-decoder framework. The encoder takes a sentence as input and encodes it into a semantic vector. The decoder then turns this semantic vector into a natural image. These two parts are of equal importance to the text-to-image task. However, most previous research splits this framework into two networks and trains them separately. Reed et al. first proposed a network for the text-to-image task: they used a pre-trained network called Char-CNN-RNN to compute the semantic vector of the input text, and then used a CNN similar to DCGAN to generate an image from this semantic vector. When training this network, they actually trained only the CNN, separating the training of the encoder from that of the decoder (the previous framework in Figure 2). Later research is mostly based on this framework and tries to improve the CNN.
However, as the foundation of the image-decoder, the pre-trained text-encoder directly determines the upper limit of the image-decoder. There are two main requirements for text-to-image synthesis: the generated images should be of high quality, and they should conform to the meaning of the input text. Using a pre-trained text-encoder and training only the image-decoder can generate high-quality images to some extent. However, we cannot be sure that the generated images are what we want, because the input of the image-decoder is the semantic vector, which is entirely determined by the pre-trained text-encoder. To generate higher-quality images that better match the input text, we should train the text-encoder and the image-decoder at the same time.
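The difference between the two training regimes can be illustrated with a deliberately tiny sketch. It is not the FTGAN training code: the scalar "encoder" and "decoder" weights, the squared-error loss and the name `train_step` are all assumptions made for illustration. The only point is that with a frozen encoder the gradient never reaches the encoder weight, whereas joint training updates both parts.

```python
def train_step(w_enc, w_dec, x, target, lr=0.1, train_encoder=True):
    """One gradient step on loss = (w_dec * (w_enc * x) - target) ** 2.

    w_enc and w_dec are toy scalar stand-ins (assumptions) for the
    text-encoder and image-decoder parameters.
    """
    semantic = w_enc * x          # toy "text-encoder" output
    output = w_dec * semantic     # toy "image-decoder" output
    error = output - target
    grad_dec = 2.0 * error * semantic
    grad_enc = 2.0 * error * w_dec * x
    new_dec = w_dec - lr * grad_dec
    # A frozen (pre-trained) encoder keeps its weight; joint training updates it.
    new_enc = w_enc - lr * grad_enc if train_encoder else w_enc
    return new_enc, new_dec


frozen = train_step(0.5, 1.0, x=1.0, target=2.0, train_encoder=False)
joint = train_step(0.5, 1.0, x=1.0, target=2.0, train_encoder=True)
print(frozen)  # encoder weight unchanged
print(joint)   # encoder weight moved toward reducing the loss
```

In a real framework the same switch is usually expressed by including or excluding the encoder parameters from the optimizer.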
3.2 Fully-trained Generative Adversarial Networks Design
In this section, we describe the network details of the proposed fully-trained generative adversarial network (FTGAN), shown in Figure 3. The network is based on conditional GAN and includes one generator and 3 discriminators (for the 64×64, 128×128 and 256×256 scales). The generator is an encoder-decoder network, which is the main part of text-to-image synthesis. It can be divided into two parts: the text-encoder and the image-decoder. In previous text-to-image networks, the image-decoder is the most important part. From StackGAN to AttnGAN and MirrorGAN, the multi-stage image-decoder has proved its superiority in generating high-quality images, and we follow this idea here. In the following, we describe the main parts of the proposed FTGAN separately.
The text-encoder is constructed from a bi-directional Long Short-Term Memory (BiLSTM) network that extracts semantic vectors from the input descriptions. In the BiLSTM, each word corresponds to two hidden states, one for each direction. We concatenate these two hidden states to represent the semantic meaning of each word. Through the text-encoder, the input sentence is encoded as a matrix in which each column is the feature vector of one word; its size is the dimension of the word vector by the number of words. On the one hand, this embedding matrix is used to calculate the attention maps, which are inputs to the last two stages of the image-decoder and help to guide the image generation process. On the other hand, the last hidden states of the BiLSTM are concatenated to form the global sentence vector, denoted by C. This semantic vector is concatenated with a noise vector (drawn from a normal distribution) to form a new vector, which is the input of the image-decoder. To improve the stability of training, we pre-train the text-encoder first. However, unlike previous networks, the parameters of the text-encoder are also updated when training the image-decoder.
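As a minimal sketch of how the BiLSTM outputs are combined, the snippet below concatenates the per-word hidden states of the two directions and builds the global sentence vector from the final state of each direction. The function name `encode_outputs` and the list-based representation are assumptions for illustration, and the LSTM itself is not implemented. Note that the word features are stored one per row here, whereas the text describes them as columns of a matrix.

```python
def encode_outputs(forward_states, backward_states):
    """Combine BiLSTM hidden states (both lists are indexed by word position).

    Returns (word_features, sentence_vector):
      word_features[t]  = forward state at word t + backward state at word t
      sentence_vector   = final forward state + final backward state
    """
    if len(forward_states) != len(backward_states):
        raise ValueError("both directions must cover the same words")
    word_features = [f + b for f, b in zip(forward_states, backward_states)]
    # The forward pass ends at the last word; the backward pass ends at the first.
    sentence_vector = forward_states[-1] + backward_states[0]
    return word_features, sentence_vector


forward = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
backward = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
words, sentence = encode_outputs(forward, backward)
print(words[1])   # [3.0, 4.0, 0.3, 0.4]
print(sentence)   # [5.0, 6.0, 0.1, 0.2]
```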
The image-decoder is a 3-stage Convolutional Neural Network (CNN) that maps semantic vectors to natural images. The first stage takes as input the vector C generated by the text-encoder, concatenated with a noise vector (drawn from a normal distribution), and reshapes it into 4×4 feature maps (the dark yellow block in Figure 3). Through 4 upsample blocks (blue blocks), the 4×4 feature maps are enlarged to 64×64. Each upsample block is a deconvolution layer that doubles the scale of the feature map. Following the noise processing in StyleGAN, the input noise vector Z is not only combined with the semantic vector C; after a fully connected layer and reshape operations, it is also added, with a separate weight for each, to the first 3 deconvolution layers (at the 8×8, 16×16 and 32×32 scales).
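The input construction and the resolution schedule of the first stage can be sketched as follows. This is only an illustration under stated assumptions: `decoder_input`, `stage_one_resolutions` and the noise dimension are hypothetical names and values, and no convolution is performed.

```python
import random


def decoder_input(sentence_vector, noise_dim, rng):
    """Concatenate the sentence vector C with standard-normal noise Z."""
    noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return list(sentence_vector) + noise


def stage_one_resolutions(start=4, num_upsample_blocks=4):
    """Feature-map sizes when every upsample block doubles the scale."""
    sizes = [start]
    for _ in range(num_upsample_blocks):
        sizes.append(sizes[-1] * 2)
    return sizes


rng = random.Random(0)
vector = decoder_input([0.2, -0.1, 0.7], noise_dim=5, rng=rng)
print(len(vector))              # 8 = 3 sentence features + 5 noise values
print(stage_one_resolutions())  # [4, 8, 16, 32, 64]
```

The intermediate sizes 8, 16 and 32 are the scales at which the text says the weighted noise is added.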
The second and third stages are similar; each consists of an upsample block. Unlike the upsample blocks in the first stage, the upsample block in each of these two stages is followed by a fine-tune block (light yellow blocks), which further tunes the feature maps after upsampling. The fine-tune block is constructed from a convolutional layer with a 3×3 kernel. The second stage takes the 64×64 feature maps and the attention maps as input and generates 128×128 images. The third stage is similar to the second; the only difference is that the feature map scale goes from 128×128 to 256×256. The attention maps are calculated in the same way as in AttnGAN, with one attention map for every word of the input sentence. After the two upsample blocks of these stages, 256×256 images are generated, which are used to calculate a generator loss.
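A rough sketch of word-level attention in the spirit of AttnGAN is given below: for every image region, dot-product scores against all word features are normalized with a softmax over the words, and the attention map of one word is its weight across all regions. This is a simplification under stated assumptions. AttnGAN first projects word features into the image feature space, whereas here both are assumed to already share a dimension, and all function names are hypothetical.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def word_attention(region_features, word_features):
    """Return (attention, contexts).

    attention[j][i]: weight of word i for image region j (each row sums to 1)
    contexts[j]:     attention-weighted sum of word features for region j
    """
    num_words = len(word_features)
    dim = len(word_features[0])
    attention, contexts = [], []
    for region in region_features:
        scores = [sum(r * w for r, w in zip(region, word)) for word in word_features]
        weights = softmax(scores)
        context = [
            sum(weights[i] * word_features[i][d] for i in range(num_words))
            for d in range(dim)
        ]
        attention.append(weights)
        contexts.append(context)
    return attention, contexts


def attention_map_for_word(attention, word_index):
    """One value per image region: how strongly that region attends to the word."""
    return [row[word_index] for row in attention]


regions = [[1.0, 0.0], [0.0, 1.0]]
words = [[10.0, 0.0], [0.0, 10.0]]
attention, contexts = word_attention(regions, words)
first_word_map = attention_map_for_word(attention, 0)
```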
The discriminators in FTGAN are similar to each other, following previous networks [23, 32]. D0, D1 and D2 each take the sentence embedding C and the corresponding generated images (64×64, 128×128 and 256×256, respectively) as inputs. The input images are first downsampled to 4×4 feature maps by several downsample blocks (the number depends on the resolution of the input images). Each downsample block contains a convolution layer, a batch normalization layer and a leaky ReLU layer. The sentence embedding vector C is then reshaped and repeated to match the shape of the image feature maps. The image feature maps and the sentence feature maps are concatenated. After several convolution layers, we obtain the final output of the discriminator. For a pair of ground-truth image and semantic vector, the discriminator should output true; for a pair of generated image and semantic vector, it should output false.
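The shape bookkeeping of the discriminators can be sketched as follows. The sketch assumes that every downsample block halves the spatial resolution (the text does not state the stride), and the helper names are hypothetical. It shows how many blocks each discriminator would need to reach 4×4 and how the sentence vector is repeated spatially before channel-wise concatenation.

```python
def num_downsample_blocks(image_size, target_size=4):
    """Blocks needed to go from image_size to target_size, halving each time."""
    ratio = image_size // target_size
    if ratio < 1 or image_size % target_size != 0 or ratio & (ratio - 1) != 0:
        raise ValueError("image_size must be target_size times a power of two")
    return ratio.bit_length() - 1


def tile_sentence_vector(sentence_vector, size=4):
    """Repeat each entry over a size x size grid: result[channel][row][col]."""
    return [[[value] * size for _ in range(size)] for value in sentence_vector]


def concat_channels(image_features, sentence_features):
    """Channel-wise concatenation of two [channel][row][col] feature maps."""
    return image_features + sentence_features


print([num_downsample_blocks(s) for s in (64, 128, 256)])  # [4, 5, 6]

image_features = [[[0.0] * 4 for _ in range(4)] for _ in range(3)]  # 3 channels
sentence_features = tile_sentence_vector([1.0, 2.0])                # 2 channels
joint_features = concat_channels(image_features, sentence_features)
print(len(joint_features))  # 5 channels, each 4 x 4
```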
The loss function is also an important part of text-to-image synthesis. The loss functions of FTGAN include a generator loss and a discriminator loss. The total generator loss is divided into two parts: the original generator loss and the DAMSM loss. The generator loss is similar to the generator loss of a common conditional GAN, with a conditional part and an unconditional part, but it is calculated at 3 scales (64, 128 and 256). The DAMSM loss is calculated by a pre-trained DAMSM. The image-decoder generates images at every stage at different scales: 64×64, 128×128 and 256×256. The output images of each of the 3 stages are used to calculate a generator loss, defined in terms of the generated images of that stage, the corresponding discriminator, and the semantic vector of the input sentences.
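For reference, a per-stage generator loss with an unconditional and a conditional part can be written in the form used by AttnGAN, which the image-decoder follows. This is a sketch rather than a confirmed definition: the symbols $\hat{x}_i$ (images generated at stage $i$), $p_{G_i}$ (their distribution), $D_i$ (the discriminator for that stage) and $C$ (the semantic vector of the input sentence) are notation assumed here.

```latex
\mathcal{L}_{G_i} =
  \underbrace{-\tfrac{1}{2}\,\mathbb{E}_{\hat{x}_i \sim p_{G_i}}
      \left[\log D_i(\hat{x}_i)\right]}_{\text{unconditional part}}
  \;\underbrace{-\,\tfrac{1}{2}\,\mathbb{E}_{\hat{x}_i \sim p_{G_i}}
      \left[\log D_i(\hat{x}_i, C)\right]}_{\text{conditional part}},
  \qquad i \in \{0, 1, 2\}
```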
To improve the similarity between input sentences and output images, the DAMSM loss from AttnGAN is also used in the generator to guide the training process. Therefore, the total generator loss combines the per-stage generator losses with the DAMSM loss.
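Assuming the same combination as in AttnGAN, the total generator loss can be sketched as the sum of the per-stage losses $\mathcal{L}_{G_i}$ plus a weighted DAMSM term. The weight $\lambda$ is a hypothetical symbol for the balancing hyperparameter.

```latex
\mathcal{L}_{G} = \sum_{i=0}^{2} \mathcal{L}_{G_i} + \lambda\, \mathcal{L}_{\mathrm{DAMSM}}
```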
The discriminator losses are also similar to the common discriminator loss, with a conditional part and an unconditional part. Because AttnGAN has done a great job on the image-decoder of the text-to-image network, we mainly follow this network for the image-decoder. However, the core idea of FTGAN is to train the text-encoder and the image-decoder at the same time, which helps to mine deeper relations between text and images and ultimately to generate images of higher quality and higher semantic similarity. Each discriminator loss is computed on the generated images and on the ground-truth image of the input description. The 3 discriminators are optimized independently.
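A discriminator loss with unconditional and conditional parts can be sketched in the form used by AttnGAN. The notation is assumed here: $x_i$ is the ground-truth image at the scale of stage $i$, $\hat{x}_i$ the image generated at that stage, $D_i$ the corresponding discriminator and $C$ the semantic vector of the input sentence.

```latex
\mathcal{L}_{D_i} =
  -\tfrac{1}{2}\,\mathbb{E}_{x_i \sim p_{\mathrm{data}_i}}\left[\log D_i(x_i)\right]
  -\tfrac{1}{2}\,\mathbb{E}_{\hat{x}_i \sim p_{G_i}}\left[\log\bigl(1 - D_i(\hat{x}_i)\bigr)\right]
  -\tfrac{1}{2}\,\mathbb{E}_{x_i \sim p_{\mathrm{data}_i}}\left[\log D_i(x_i, C)\right]
  -\tfrac{1}{2}\,\mathbb{E}_{\hat{x}_i \sim p_{G_i}}\left[\log\bigl(1 - D_i(\hat{x}_i, C)\bigr)\right]
```

The first two terms form the unconditional part (real versus generated), and the last two form the conditional part (whether image and sentence match).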
In summary, we propose a novel framework for the text-to-image task, which trains a single network from input sentences to output images, combining the text-encoder with the image-decoder. The fully-trained mechanism enables the network to update the parameters of both the text-encoder and the image-decoder at the same time, which helps to improve the consistency between input sentences and generated images as well as the quality of the final synthesized images.
4 Experiments
In this section, extensive experiments are carried out to evaluate the proposed FTGAN. We first demonstrate the superiority of FTGAN by comparing it with previous state-of-the-art GAN models for text-to-image synthesis [36, 37, 22, 23] on the public dataset CUB. We then further evaluate FTGAN on SCU-Text2face, comparing it with AttnGAN and building a baseline for the text-to-face synthesis task.
The proposed network is trained on a single 1080Ti GPU. In all our experiments, the hyperparameter values are set empirically.
4.1 SCU-Text2face Dataset Construction
In the public safety domain, the text-to-face task has great potential. However, because of the lack of datasets, little research focuses on this task. To the best of our knowledge, only one project on GitHub and one conference paper address it. The T2F project on GitHub designed a network based on ProGAN and StackGAN, using the Face2text dataset for training and testing, but its results are not satisfactory (as shown in Figure 4). As for the conference paper, we could only find an abstract. Therefore, there is still no satisfactory baseline for the text-to-face task.
The Face2text dataset was originally used for image captioning. Like CUB and COCO, it can also be used for text-to-face synthesis. However, it contains only 400 images, and the descriptions of those images are not very formal (as shown in Figure 5). Following the public datasets CUB and COCO, we build a dataset called SCU-Text2face for text-to-face synthesis based on the public face dataset CelebA. SCU-Text2face contains 1000 face images, each with 5 descriptions from 5 different people. To build a standard text-to-face dataset, we first selected 1000 images from CelebA, all belonging to different people. To keep the dataset balanced, the face images in SCU-Text2face cover people of different ages, sexes and skin tones. For normalization, all face images are cropped and resized to 256×256. Figure 6 shows some examples from SCU-Text2face.
4.2 Experiments on CUB
To demonstrate the superiority of FTGAN, we first evaluate the proposed network on the public dataset CUB. CUB is one of the most popular datasets in text-to-image synthesis (the other two are Oxford102 and COCO) and includes 200 bird species. Oxford102 is a dataset of flowers, which is similar to CUB but contains fewer images. COCO is much larger than CUB and Oxford102, so experiments on it are very time-consuming. We therefore choose the CUB dataset to evaluate the proposed FTGAN. Following the preprocessing in previous research [36, 37, 22, 23], we divide the dataset into a training set (180 species) and a test set (20 species), containing 8855 and 2933 images respectively. We first evaluate the results of FTGAN qualitatively. Some examples of images generated by FTGAN are shown in Figure 7. From these images, we find that FTGAN is able to generate high-quality images with different bird postures and backgrounds. However, it is hard to show the superiority of FTGAN over previous work visually, so we still need an objective criterion to evaluate the quality of the images generated by FTGAN.
Inception Score is a widely accepted criterion for the text-to-image synthesis task. To examine the images generated by FTGAN quantitatively, we use Inception Score to evaluate the results, as shown in Table 1. Following the settings in previous work [22, 23, 36, 37, 32], we generate 10 images for each sample, so the total number of generated test images is 29330, on which we calculate the Inception Score.
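As a reminder of what the metric measures, the standard Inception Score is the exponential of the average KL divergence between each image's class distribution and the marginal class distribution. The sketch below computes it from class probabilities that are assumed to have been produced already by a classifier; the function name and the toy inputs are illustrative, and no Inception network is run here.

```python
import math


def inception_score(class_probabilities):
    """class_probabilities[n][k]: probability of class k for generated image n."""
    num_images = len(class_probabilities)
    num_classes = len(class_probabilities[0])
    # Marginal class distribution p(y), averaged over all generated images.
    marginal = [
        sum(probs[k] for probs in class_probabilities) / num_images
        for k in range(num_classes)
    ]
    total_kl = 0.0
    for probs in class_probabilities:
        for p, m in zip(probs, marginal):
            if p > 0.0:  # terms with p = 0 contribute nothing to the KL sum
                total_kl += p * math.log(p / m)
    return math.exp(total_kl / num_images)


uninformative = inception_score([[0.5, 0.5], [0.5, 0.5]])   # no confidence, no diversity
confident_diverse = inception_score([[1.0, 0.0], [0.0, 1.0]])
num_test_images = 2933 * 10  # 10 generated images per test sample
```

Confident and diverse predictions raise the score, up to the number of classes in the ideal case.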
|Method|Inception Score|
|---|---|
|GAN-INT-CLS|2.88 ± .04|
|GAWWN|3.62 ± .07|
|StackGAN|3.70 ± .04|
|StackGAN-v2|3.82 ± .06|
|HDGAN|4.15 ± .05|
|AttnGAN|4.36 ± .03|
|MirrorGAN|4.56 ± .05|
|Our FTGAN|4.63 ± .05|
From Table 1 we find that the proposed FTGAN outperforms the state-of-the-art network MirrorGAN, which is also based on AttnGAN. MirrorGAN adds a global sentence attention to aid the word attention in AttnGAN and uses the regenerated captions to calculate a stream loss that replaces the DAMSM loss proposed by AttnGAN, which makes it far more complex than the image-decoder of FTGAN. Among published work, FTGAN achieves a new state-of-the-art Inception Score on the CUB dataset. These experiments on CUB demonstrate the effectiveness of the proposed FTGAN.
4.3 Comparison with previous methods on SCU-Text2face
Since there is no public baseline for the text-to-face task, in this section we set a baseline for this task using FTGAN. The test set of SCU-Text2face contains 200 face images. For each face sample, we generate 10 face images, then evaluate them qualitatively and quantitatively.
As shown in Figure 1, FTGAN can generate photo-realistic face images whose quality is close to that of the ground-truth. Moreover, the generated images basically match their descriptions. For example, all five face images meet the description of "brown hair" and "slender eyebrows" in the sentence.
Figure 8 shows two examples of generated face images (right) with their input descriptions (left) and the attention maps from the generation process (middle). There is one attention map for each word in the input sentence. The attention maps serve as inputs to the second and third stages of the image-decoder, showing where the network focuses for each word when generating images. The generated attention maps basically match the areas a human would focus on when reacting to those words. We find that the generated face images are highly consistent with their input sentences. For example, the "blond hair" in the first row and the "black hair" and "mouth closed" in the second row are all present in the generated face images.
To demonstrate the superiority of FTGAN, we compare the face images generated by FTGAN with the results of AttnGAN (as shown in Figure 9). For each input text, the first row is generated by AttnGAN and the second row by FTGAN. As shown in the red boxes, the face images generated by AttnGAN are less diverse than those of FTGAN. Intuitively, FTGAN is also capable of generating higher-quality images than AttnGAN.
Generally speaking, text-to-image synthesis uses Inception Score as its criterion. To evaluate results on CUB, a pre-trained Inception-V3 network fine-tuned on CUB is usually used to calculate the Inception Score. However, no such pre-trained Inception-V3 model is available for face datasets. We therefore turn to the FID score, another common criterion for evaluating image synthesis, which can be seen as an improved version of Inception Score. In addition, to evaluate the other target of text-to-face synthesis, we adopt two criteria from previous work. Because the final target of text-to-face synthesis is to generate faces similar to their ground-truth based only on the input text, it is natural to judge whether the generated face shows the same person as the ground-truth. We use FaceNet to extract the feature vectors of faces, and then calculate the average face semantic distance (FSD) and the average face semantic similarity (FSS) between the generated faces and the ground-truth. FSD and FSS are defined in formula 4 and formula 5.
In these formulas, a pre-trained FaceNet model extracts a semantic vector from each input face, which is either one of the generated faces or the ground-truth of the synthesized face image, and the cosine similarity of two such vectors is computed. A higher FSS score and a lower FSD score mean that the generated face images are more similar to the ground-truth.
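A minimal sketch of how such scores might be computed from embeddings is shown below. It makes several assumptions: the embeddings are taken as already extracted (FaceNet is not called), FSD is taken to be the mean Euclidean distance and FSS the mean cosine similarity over pairs of generated and ground-truth faces, and all function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fsd_and_fss(generated_embeddings, ground_truth_embeddings):
    """Average distance (FSD) and average cosine similarity (FSS) over pairs.

    The Euclidean distance is an assumption; the two lists must be aligned so
    that entry k of each belongs to the same description.
    """
    pairs = list(zip(generated_embeddings, ground_truth_embeddings))
    fsd = sum(euclidean_distance(g, t) for g, t in pairs) / len(pairs)
    fss = sum(cosine_similarity(g, t) for g, t in pairs) / len(pairs)
    return fsd, fss


identical = fsd_and_fss([[1.0, 0.0]], [[1.0, 0.0]])
orthogonal = fsd_and_fss([[1.0, 0.0]], [[0.0, 1.0]])
```

Identical embeddings give the best possible values (distance 0, similarity 1), which matches the reading that lower FSD and higher FSS are better.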
The final results are shown in Table 2. The FSD value of FTGAN is lower than that of AttnGAN and, correspondingly, the FSS value of FTGAN is higher, which means that FTGAN generates face images that are more similar to the ground-truth than AttnGAN does. Because of the limits of the dataset, the face similarity of both networks is not very high, only about 59%. However, we find that the generated face images are highly consistent with the input text. In our analysis, the main reason is that the descriptions in SCU-Text2face are not detailed enough. Each description contains only a few attributes (3-5), which strongly constrains the face similarity between generated face images and their ground-truth. If comprehensive descriptions were available for every face image, we believe that the face similarity between the synthesized faces and the ground-truth would improve markedly.
5 Conclusion
This paper proposed a novel text-to-image network, FTGAN, which trains the text-encoder and the image-decoder at the same time. In experiments on the public dataset CUB, FTGAN shows its superiority over the latest state-of-the-art network, achieving an Inception Score of 4.63. Although FTGAN improves the quality of generated images compared with previous text-to-image synthesis networks, we found that this framework is not very stable during training. In the future, we will try to tackle this problem.
In addition, to fill the gap in the text-to-face domain, we build a dataset, SCU-Text2face, for text-to-face synthesis based on faces in CelebA. Every face image in SCU-Text2face has 5 descriptions. Based on SCU-Text2face, we set a baseline for the text-to-face synthesis task using FTGAN. We use the FID score to evaluate the image quality of synthesized faces. To evaluate the similarity between generated faces and the input text, we use the similarity between generated faces and the ground-truth as a proxy. Experiments show that FTGAN achieves higher-quality images and faces more similar to the ground-truth. Unlike image synthesis on CUB, Oxford102 and COCO, text-to-face generation is more precise and constrained. Therefore, to further improve the quality of generated results, more prior information about faces could be added to the text-to-face synthesis network. The task of text-to-face synthesis has great potential in the public safety domain. We hope our work can be a good start for this task.
J. Bao, D. Chen, F. Wen, H. Li, and G. Hua.
Towards open-set identity preserving face synthesis.
computer vision and pattern recognition, pages 6713--6722, 2018.
-  A. Brock, J. Donahue, and K. Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv: Learning, 2018.
-  X. Chen, L. Qing, X. He, J. Su, and Y. Peng. From eyes to face synthesis: a new approach for human-centered smart surveillance. IEEE Access, 6:14567--14575, 2018.
Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo.
Stargan: Unified generative adversarial networks for multi-domain image-to-image translation.In computer vision and pattern recognition, pages 8789--8797, 2018.
-  X. Di and V. M. Patel. Face synthesis from visual attributes via sketch using conditional vaes and gans. arXiv: Computer Vision and Pattern Recognition, 2018.
-  H. Dong, S. Yu, C. Wu, and Y. Guo. Semantic image synthesis via adversarial learning. In international conference on computer vision, pages 5707--5715, 2017.
-  H. Dong, J. Zhang, D. G. Mcilwraith, and Y. Guo. I2t2i: Learning text to image synthesis with textual data augmentation. In international conference on image processing, pages 2015--2019, 2017.
-  A. Gatt, M. Tanti, A. Muscat, P. Paggio, R. A. Farrugia, C. Borg, K. P. Camilleri, M. Rosner, and L. V. Der Plas. Face2text: Collecting an annotated image description corpus for the generation of rich face descriptions. In language resources and evaluation, 2018.
-  I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio. Generative adversarial nets. In international conference on neural information processing systems, pages 2672--2680, 2014.
-  M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In neural information processing systems, pages 6626--6637, 2017.
-  S. Hong, D. Yang, J. Choi, and H. Lee. Inferring semantic layout for hierarchical text-to-image synthesis. In computer vision and pattern recognition, pages 7986--7994, 2018.
-  G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.
-  R. Huang, S. Zhang, T. Li, and R. He. Beyond face rotation: Global and local perception gan for photorealistic and identity preserving frontal view synthesis. In international conference on computer vision, pages 2458--2467, 2017.
-  T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of gans for improved quality, stability, and variation. In international conference on learning representations, 2018.
-  T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. CoRR, abs/1812.04948, 2018.
-  T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft coco: Common objects in context. In european conference on computer vision, pages 740--755, 2014.
-  Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In international conference on computer vision, pages 3730--3738, 2015.
-  M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv: Learning, 2014.
-  M. E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In Conference on Computer Vision, Graphics & Image Processing, 2009.
-  T. Qiao, J. Zhang, D. Xu, and D. Tao. Mirrorgan: Learning text-to-image generation by redescription, 2019.
-  A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In international conference on learning representations, 2016.
-  S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. In international conference on machine learning, pages 1060--1069, 2016.
-  S. E. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee. Learning what and where to draw. In neural information processing systems, pages 217--225, 2016.
-  T. Salimans, I. J. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In neural information processing systems, pages 2234--2242, 2016.
-  F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In computer vision and pattern recognition, pages 815--823, 2015.
-  S. Sharma, D. Suhubdy, V. Michalski, S. E. Kahou, and Y. Bengio. Chatpainter: Improving text to image generation using dialogue. arXiv: Computer Vision and Pattern Recognition, 2018.
-  A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In neural information processing systems, pages 5998--6008, 2017.
-  C. Wah, S. Branson, P. Welinder, P. Perona, and S. J. Belongie. The caltech-ucsd birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
-  T. Wang, M. Liu, J. Zhu, A. J. Tao, J. Kautz, and B. Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In computer vision and pattern recognition, pages 8798--8807, 2018.
-  Y. Wang, L. Chang, Y. Cheng, L. Jin, Z. Cheng, X. Deng, and F. Duan. Text2sketch: Learning face sketch from facial attribute text. In international conference on image processing, pages 669--673, Oct 2018.
-  K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In international conference on machine learning, pages 2048--2057, 2015.
-  T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In computer vision and pattern recognition, pages 1316--1324, 2018.
-  Z. Yang, X. He, J. Gao, L. Deng, and A. J. Smola. Stacked attention networks for image question answering. In computer vision and pattern recognition, pages 21--29, 2016.
-  J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang. Generative image inpainting with contextual attention. In computer vision and pattern recognition, pages 5505--5514, 2018.
-  H. Zhang, I. J. Goodfellow, D. N. Metaxas, and A. Odena. Self-attention generative adversarial networks. arXiv: Machine Learning, 2018.
-  H. Zhang, T. Xu, and H. Li. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In international conference on computer vision, pages 5908--5916, 2017.
-  H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas. Stackgan++: Realistic image synthesis with stacked generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1--1, 2018.
-  Z. Zhang, Y. Xie, and L. Yang. Photographic text-to-image synthesis with a hierarchically-nested adversarial network. In computer vision and pattern recognition, pages 6199--6208, 2018.
-  J. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In international conference on computer vision, pages 2242--2251, 2017.