Generative adversarial networks (GANs) goodfellow2014generative and their variants have received massive attention in the machine learning and computer vision communities due to their impressive performance in various tasks, such as categorical image generation miyato2018cgans , text-to-image synthesis reed2016generative ; zhang2017stackgan , image-to-image translation isola2017image ; zhu2017unpaired ; choi2018stargan , and semantic manipulation park2019semantic . The goal of GANs, and of conditional GANs (cGANs) in particular, is to learn a generator that mimics the underlying distribution represented by a finite set of training data. Considerable progress has been made to improve the robustness of GANs.
However, when the training data does not represent the underlying distribution well, i.e., the empirical training distribution deviates from the underlying distribution, GANs trained from under-represented training data mimic the training distribution, but not the underlying one. This situation occurs because data collection is labor intensive and it is difficult to be thorough. Additionally, some modes of the underlying distribution could be missing in the training data due to insufficient quantity and in particular, diversity. Specifically, we consider that the underlying distribution is under-represented by the training data at the category level, i.e., some categories have no training data examples.
Training a GAN conditioned on category labels requires collecting training examples for each category. If some categories are not available in the training data, then it appears infeasible to learn to generate their representations without any additional information. For instance, in the task of hair recoloring (or hair color transfer), if we want to train an image-to-image translation model that recolors hair with rare colors such as purple, it is necessary to collect images with those hair colors. However, it is impractical to collect all possible dyed hair colors for arbitrary recoloring. As another example, if the training data consists only of red roses, the GAN's discriminator would reject roses of any other color, and the generator would therefore fail to produce roses of colors other than red. At the same time, we want to ensure that GANs will not generate a rose with an unnatural color. To the best of our knowledge, no previous work has investigated improving the diversity of the training distribution to better mimic the underlying distribution.
To this end, we propose Knowledge-Guided Generative Adversarial Networks (KG-GANs), a novel GAN framework that incorporates domain knowledge into GANs to enrich and expand the generated distribution, hence increasing the diversity of the generated data at the category level. Our key idea is to leverage domain knowledge as a learning source in addition to data, guiding the generator to explore different regions of the image manifold. By doing so, the generated distribution goes beyond the training distribution and better mimics the underlying one. Note that domain knowledge serves GANs as a guide not only to explore diversity but also to avoid exploring regions that are known to be impossible, such as generating gray roses.
Our framework consists of two parts: (1) constructing the domain knowledge for the task at hand, and (2) training two generators $G_1$ and $G_2$ that are conditioned on available and unavailable categories, respectively. We formalize domain knowledge as a constraint function $f$ that explicitly measures whether an image has the desired characteristics of a particular category. On the one hand, the constraint function is task-specific and guides the learning of $G_2$. On the other hand, we share weights between $G_1$ and $G_2$ to transfer the knowledge learned from available categories to unavailable ones.
We validate KG-GAN on two tasks: fine-grained image generation and hair recoloring. For fine-grained image generation, such as flower images with different species, we aim to train a category-conditional generator that is capable of generating both seen and unseen categories. The domain knowledge we use here is a semantic embedding representation that describes the semantic relationships among fine-grained categories, such as the textual features from descriptions of each category’s appearance. The constraint function used is a deep regression network that predicts the semantic embedding vector of the underlying category of an image.
For hair recoloring, given training data that consists of three hair colors, our method trains an image-to-image translation model that is capable of recoloring face images with arbitrary hair colors. We leverage the domain knowledge that hair color is characterized by the dominant color of the hair region (not including eyebrows and beard). In this case, hair segmentation plays a key role in implementing the constraint function that performs hair color estimation. We jointly train the segmentation network, $G_1$, and $G_2$ in an unsupervised manner. We additionally leverage the observation that hair recoloring can be safely modeled as a spatially invariant linear transformation applied only to the hair region. We therefore propose a new generator architecture that outputs transformations rather than images. This significantly improves the segmentation accuracy and hence the recoloring quality.
Our main contributions are summarized as follows: (1) We tackle the problem that the training data does not represent the underlying distribution well. (2) We propose a novel generative adversarial framework that incorporates domain knowledge into GAN methods. (3) We demonstrate the effectiveness of our KG-GAN framework on fine-grained image generation and hair recoloring tasks. We are able to enrich the diversity of the generated distribution by generating categories not available in the original training data.
2 Related work
Since a comprehensive review of the related works on GANs is beyond the scope of the paper, we only review representative works most related to ours.
Generative Adversarial Networks. Generative adversarial networks (GANs) goodfellow2014generative introduce an adversarial learning framework that jointly learns a discriminator and a generator to mimic a training distribution. Conditional GANs (cGANs) extend GANs by conditioning on additional information such as category label mirza2014conditional ; miyato2018cgans , text reed2016generative , or image isola2017image ; zhu2017unpaired ; choi2018stargan . SN-GAN miyato2018cgans ; miyato2018spectral proposes a projection discriminator and a spectral normalization method to improve the robustness of training. CycleGAN zhu2017unpaired employs a cycle-consistency loss to regularize the generator for unpaired image-to-image translation. StarGAN choi2018stargan proposes an attribute-classification-based method that adopts a single generator for multi-domain image-to-image translation. Hu et al. hu2018deep introduce a general framework that incorporates domain knowledge into deep generative models. Its primary purpose is to improve quality rather than diversity, whereas improving diversity is our goal.
Diversity. Creative adversarial network (CAN) elgammal2017can augments GAN with two style-based losses to make its generator go beyond the training distribution and thus generate diversified art images. Imaginative adversarial network (IAN) hamdi2019ian proposes a two-stage generator that goes beyond a source domain (human face) and towards a target domain (animal face). Both works aim to go beyond the training data. However, they target art image generation and do not possess a well-defined underlying distribution. Mode-Seeking GAN mao2019mode proposes a mode seeking regularization method to alleviate the mode collapse problem in cGANs, which happens when the generated distribution cannot well represent the training distribution. Our problem appears similar but is ultimately different. We tackle the problem that the training distribution under-represents the underlying distribution.
Zero-shot Learning. We refer readers to xian2017zero for a comprehensive introduction and evaluation of representative zero-shot learning methods. The crucial difference between these methods and ours is that they focus on image classification while we focus on image generation. Recently, some zero-shot methods xian2018feature ; felix2018multi ; xian2019f propose to learn feature generation of unseen categories for training zero-shot classifiers. In contrast, this work aims to learn image generation of unseen categories.
3 KG-GAN
This section presents our proposed KG-GAN, which incorporates domain knowledge into the GAN framework. We first provide an overview of KG-GAN. Then, we show how to apply KG-GAN to two representative tasks: fine-grained image generation and hair recoloring.
We consider a set of training data under-represented at the category level, i.e., all training samples belong to the set of seen categories, denoted as $S$ (e.g., black, brown, blond hair color categories), while another set of unseen categories, denoted as $U$ (e.g., any other hair color categories), has no training samples. Our goal is to learn categorical image generation for both $S$ and $U$. To generate new data in $S$, KG-GAN applies an existing GAN-based method to train a category-conditioned generator $G_1$ by minimizing the GAN loss $\mathcal{L}_{GAN}$ over $G_1$. To generate unseen categories $U$, KG-GAN trains another generator $G_2$ from the domain knowledge, which is expressed by a constraint function $f$ that explicitly measures whether an image has the desired characteristics of a particular category.
KG-GAN consists of two parts: (1) constructing the domain knowledge for the task at hand, and (2) training two generators $G_1$ and $G_2$ that condition on available and unavailable categories, respectively. KG-GAN shares the parameters between $G_1$ and $G_2$ to couple them together and to transfer knowledge learned from $G_1$ to $G_2$. Based on the constraint function $f$, KG-GAN adds a knowledge loss, denoted as $\mathcal{L}_K$, to train $G_2$. The general objective function of KG-GAN combines the GAN loss on $G_1$ with the knowledge loss $\mathcal{L}_K$ on $G_2$.
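As a sketch, one way to write such an objective is given here. The weight $\lambda$ and the exact form are assumptions made for illustration; $\mathcal{L}_{GAN}$ and $\mathcal{L}_K$ stand for the GAN loss and the knowledge loss described in the text, and $G_1$ and $G_2$ share parameters.

```latex
\min_{G_1, G_2} \; \mathcal{L}_{GAN}(G_1) \; + \; \lambda \, \mathcal{L}_{K}(G_2)
```

Under this reading, the first term is driven by data from the seen categories only, while the second term is driven by the constraint function and needs no real images of the unseen categories.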
3.1 Fine-grained Image Generation
Given a fine-grained image dataset in which some categories are unseen, our aim is to use KG-GAN to generate unseen categories in addition to the seen categories. Figure 1 shows an overview of KG-GAN for fine-grained image generation. Our generators take a random noise $z$ and a category variable $y$ as inputs and generate an output image $x$. In particular, $x_1 = G_1(z, y_1)$ and $x_2 = G_2(z, y_2)$, where $y_1$ and $y_2$ belong to the set of seen and unseen categories, respectively.
We leverage the domain knowledge that fine-grained categories are characterized by a semantic embedding representation, which describes the semantic relationships among categories. In other words, we assume that each category $y$ is associated with a semantic embedding vector $v_y$. For example, we can acquire such a feature representation from the textual descriptions of each category. We use the semantic embedding in two places: to modify the GAN architecture and to define the constraint function.
SN-GAN. Here we briefly review SN-GAN (please refer to miyato2018cgans ; miyato2018spectral for more details), which is adopted as the GAN part of KG-GAN. SN-GAN employs a projection-based discriminator $D$ and adopts spectral normalization for discriminator regularization. The objective functions for training $G_1$ and $D$ use a hinge version of the adversarial loss. The category variable in SN-GAN is a one-hot vector indicating the target category. We propose to replace the one-hot vector with the semantic embedding vector $v_y$. By doing so, we directly encode the domain knowledge into the GAN training. The loss functions of the modified SN-GAN are the hinge losses of SN-GAN with the category condition given by the semantic embedding.
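A minimal sketch of the hinge version of the adversarial loss, written on raw discriminator scores. The function names `d_hinge_loss` and `g_hinge_loss` are hypothetical, and the scores are assumed to come from a conditional discriminator that has already received the category embedding.

```python
import numpy as np


def d_hinge_loss(d_real, d_fake):
    """Discriminator hinge loss from scores on real and generated images."""
    real_term = np.maximum(0.0, 1.0 - d_real).mean()  # penalize real scores below +1
    fake_term = np.maximum(0.0, 1.0 + d_fake).mean()  # penalize fake scores above -1
    return float(real_term + fake_term)


def g_hinge_loss(d_fake):
    """Generator hinge loss: raise the discriminator score of generated images."""
    return float(-d_fake.mean())


# Toy scores for two real and two generated images.
d_real = np.array([2.0, 0.5])
d_fake = np.array([-2.0, 0.0])
print(d_hinge_loss(d_real, d_fake), g_hinge_loss(d_fake))
```

Only the scores enter the loss, so replacing a one-hot condition with a semantic embedding changes what the discriminator sees but not the form of the loss.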
Semantic Embedding Loss. We define the constraint function $f$ as predicting the semantic embedding vector of the underlying category of an image. To achieve that, we implement $f$ by training an embedding regression network $E$ from the training data. Once trained, we fix its parameters and add it to the training of $G_1$ and $G_2$. In particular, we propose a semantic embedding loss $\mathcal{L}_{se}$ to play the role of the knowledge loss in KG-GAN. This loss requires the predicted embedding of fake images to be close to the semantic embedding of the target categories.
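The idea of pulling predicted embeddings toward target embeddings can be sketched as follows. The squared Euclidean distance is an assumption chosen for illustration, and `semantic_embedding_loss` is a hypothetical helper; the predicted embeddings are assumed to come from a frozen regression network applied to generated images.

```python
import numpy as np


def semantic_embedding_loss(pred_emb, target_emb):
    """Mean squared Euclidean distance between predicted and target embeddings.

    pred_emb, target_emb: arrays of shape (batch, embedding_dim).
    """
    diff = np.asarray(pred_emb, dtype=float) - np.asarray(target_emb, dtype=float)
    return float(np.mean(np.sum(diff ** 2, axis=1)))


# Two generated images: the first matches its target, the second is off by one unit.
pred = np.array([[1.0, 0.0], [0.0, 1.0]])
target = np.array([[1.0, 0.0], [0.0, 0.0]])
print(semantic_embedding_loss(pred, target))
```

Because the targets are embeddings rather than class indices, the same loss applies to unseen categories as long as their embeddings are available.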
Total Loss. The total loss is a weighted combination of the SN-GAN loss and $\mathcal{L}_{se}$, with one loss function for training $D$ and another for training $G_1$ and $G_2$.
3.2 Hair Recoloring
We are given a set of face images categorized by hair color, where the categories form a discrete set of representative colors (for example, as in the CelebA dataset liu2015faceattributes ). Our goal is to use KG-GAN to achieve hair recoloring with arbitrary colors. Figure 2 shows an overview. Given an input face image $x$, hair recoloring aims to transfer the image's hair color into a target color $c$. Here, KG-GAN trains $G_1(x, y)$ and $G_2(x, c)$, where $y$ is the target category represented by a one-hot vector, and $c$ is the target color represented by a 3D RGB vector.
To train $G_1$, we adopt StarGAN choi2018stargan as the GAN part of our KG-GAN. For $G_2$, we leverage two pieces of domain knowledge about hair color: (1) hair color is characterized by the dominant color of the hair region (the upper part of the head), and (2) if the hair region can be identified, its recoloring process can be simplified to a simple color transfer. We use the former for defining the constraint function, while the latter is used for designing a specialized generator architecture.
StarGAN. Here we briefly review StarGAN. Please refer to choi2018stargan for more details. StarGAN learns a single generator to perform multi-domain image-to-image translation. In our case, we can train a StarGAN model on the CelebA Face Dataset to translate images into one of the three categories (i.e., domains): black, brown, and blond. The StarGAN discriminator performs domain classification and discrimination between real and fake images, i.e., $D: x \rightarrow \{D_{src}(x), D_{cls}(x)\}$.
The total loss in StarGAN comprises three losses: an adversarial loss $\mathcal{L}_{adv}$, a domain classification loss $\mathcal{L}_{cls}$, and a cycle-consistency loss $\mathcal{L}_{cyc}$. The adversarial loss is a WGAN loss arjovsky2017wasserstein with a gradient penalty term gulrajani2017improved . The domain classification loss is responsible for translation quality. The cycle-consistency loss regularizes the generator. The loss functions for training $D$ and $G_1$ combine these terms as in choi2018stargan .
Hair Color Estimation. Hair color is characterized by the dominant color of the hair region. Therefore, we define the constraint function $f$ as explicitly extracting the hair color from an image. To this end, we propose a segmentation-based color estimation network that consists of two steps: (1) performing hair segmentation to obtain the hair probability map $m$ of the input image $x$, and (2) taking a weighted average of $x$, weighted by $w(m)$, to obtain the hair color, which is expressed by
$$f(x) = \frac{\sum_i w(m_i)\, x_i}{\sum_i w(m_i)},$$
where $x_i$ and $m_i$ are the $i$-th pixel value of $x$ and $m$, respectively. $w$ is a weighting function that turns the segmentation probabilities into binary weights; it is defined through the indicator function $I$.
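The two-step estimate can be sketched as a thresholded weighted average. The fixed `threshold` value and the function name `estimate_hair_color` are assumptions made for illustration; only the general idea of binary weights from an indicator function is taken from the text.

```python
import numpy as np


def estimate_hair_color(image, hair_prob, threshold=0.5):
    """Average the colors of pixels whose hair probability exceeds a threshold.

    image: (H, W, 3) float RGB array; hair_prob: (H, W) array of probabilities.
    Returns a length-3 RGB vector, or None if no pixel passes the threshold.
    """
    weights = (hair_prob > threshold).astype(float)  # indicator function as binary weights
    total = weights.sum()
    if total == 0:
        return None
    return (image * weights[..., None]).sum(axis=(0, 1)) / total


# 2x2 toy image: the left column is "hair" (one red pixel, one blue pixel).
image = np.array([[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
                  [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]])
hair_prob = np.array([[0.9, 0.1],
                      [0.8, 0.2]])
print(estimate_hair_color(image, hair_prob))
```

Binary weights keep low-confidence pixels such as skin or background from tinting the estimate, which a plain probability-weighted mean would allow.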
Hair Segmentation. The first step of $f$ requires hair segmentation. Instead of training the hair segmentation sub-network from another set of labeled training data, we propose training it in an unsupervised manner by jointly training $M$, $G_1$, and $G_2$ together. Following recent image-to-image translation methods zhang2018generative ; zhao2018modular that adopt a mask mechanism in their generator architecture, we add a mask network $M$ to the StarGAN generator and share its parameters with the segmentation sub-network in $f$. On the one hand, the modified StarGAN generator performs image translation by $G_1(x, y) = m \odot F(x, y) + (1 - m) \odot x$, where $m$ is a mask image, $\odot$ is pixel-wise multiplication, and $F$ is a convolutional neural network that generates the foreground image. On the other hand, hair segmentation is learned from training $G_1$ and is utilized in the color estimation network for training $G_2$.
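A minimal sketch of the mask mechanism, assuming the common blending form in which the mask selects the generated foreground and its complement keeps the input image. The helper name `blend_with_mask` is hypothetical.

```python
import numpy as np


def blend_with_mask(x, foreground, mask):
    """Compose an output image from a generated foreground and the input image.

    x, foreground: (H, W, 3) arrays; mask: (H, W) array with values in [0, 1].
    """
    m = mask[..., None]  # broadcast the mask over the color channels
    return m * foreground + (1.0 - m) * x


x = np.zeros((2, 2, 3))           # input image (all black)
foreground = np.ones((2, 2, 3))   # generated foreground (all white)
mask = np.array([[1.0, 0.0],
                 [0.5, 0.0]])
print(blend_with_mask(x, foreground, mask)[:, :, 0])
```

Where the mask is zero the input pixel passes through unchanged, which is why a well-localized mask leaves the face and background untouched.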
Generator Architecture. Since hair segmentation plays an essential role in the constraint function, its accuracy directly influences the quality of $G_2$. However, the masks from the modified StarGAN generator are not guaranteed to localize the hair region. This is because an inaccurate hair mask could still yield a high-quality result as long as the generated foreground image is of high quality.
To further improve the accuracy of unsupervised segmentation, we leverage domain knowledge: if we can identify the hair region, we can simplify the recoloring process to a simple color transfer. Specifically, we assume the recoloring process is a spatially invariant linear transformation. Such an assumption greatly restricts the foreground generation from a highly nonlinear mapping to a linear one. By doing so, the segmentation network is forced to be more accurate; otherwise, a false-positive region (such as eyebrows) could be transformed into an unrealistic color and then appear in the output image. The linear transformation, parameterized by a $3 \times 3$ matrix $T$, takes a pixel color $x_i$ as input and outputs a new color $x_i'$ by $x_i' = T x_i$. Such a transformation can be equivalently expressed by a $1 \times 1$ convolution applied to the image, written as $x * T$. Finally, based on the mask mechanism and the linear transformation, our transform generators $G_1$ and $G_2$ are respectively defined as $G_1(x, y) = m \odot (x * T_1(x, y)) + (1 - m) \odot x$ and $G_2(x, c) = m \odot (x * T_2(x, c)) + (1 - m) \odot x$, where $T_1$ and $T_2$ are convolutional neural networks that generate $1 \times 1$ convolutional filters.
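The equivalence between a per-pixel linear color transform and a 1x1 convolution can be checked numerically. This is a sketch under the assumption of a 3x3 matrix acting on RGB vectors; both function names are hypothetical.

```python
import numpy as np


def recolor_per_pixel(image, T):
    """Apply the 3x3 matrix T to every RGB pixel with an explicit loop."""
    out = np.empty_like(image)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = T @ image[i, j]
    return out


def recolor_conv1x1(image, T):
    """Same transform as a 1x1 convolution: a matrix product over the channel axis."""
    return np.einsum('oc,hwc->hwo', T, image)


rng = np.random.default_rng(0)
image = rng.uniform(size=(4, 5, 3))   # random float RGB image
T = rng.uniform(size=(3, 3))          # one spatially invariant color transform
print(np.allclose(recolor_per_pixel(image, T), recolor_conv1x1(image, T)))
```

Because the same matrix is applied at every location, the generator only has to predict nine numbers per image, which leaves the mask as the sole mechanism for deciding where the color changes.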
Color Loss. Based on Equation 5, we propose a color loss $\mathcal{L}_{color}$ as the knowledge loss in KG-GAN. $\mathcal{L}_{color}$ encourages the predicted color of the foreground image, $f(G_2(x, c))$, to be close to the target color $c$, which is uniformly sampled from the RGB color space.
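A sketch of how such a color loss could be computed for one image. The L1 distance, the unit RGB cube, and the helper names `sample_target_color` and `color_loss` are assumptions made for illustration; the estimated color is assumed to come from the hair color estimation step.

```python
import numpy as np


def sample_target_color(rng):
    """Draw a target color uniformly from the unit RGB cube."""
    return rng.uniform(0.0, 1.0, size=3)


def color_loss(estimated_color, target_color):
    """Mean absolute difference between estimated and target RGB colors."""
    diff = np.asarray(estimated_color, dtype=float) - np.asarray(target_color, dtype=float)
    return float(np.abs(diff).mean())


rng = np.random.default_rng(0)
target = sample_target_color(rng)
print(color_loss([1.0, 0.0, 0.0], [0.0, 0.0, 0.0]), target.shape)
```

Sampling targets from the whole color space is what exposes the generator to colors that never occur in the training categories.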
Cycle Loss. In addition to the cycle-consistency loss $\mathcal{L}_{cyc}$ in StarGAN for regularizing $G_1$, we propose another cycle-consistency loss that considers both $G_1$ and $G_2$ to further improve $G_2$ with the aid of $G_1$. In particular, $x$ is first transferred through $G_2$ to a target color $c$ and is then transferred back to its original category through $G_1$.
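The round trip through both generators can be sketched with the generators passed in as plain callables. The L1 reconstruction error and the name `cross_cycle_loss` are assumptions; the toy lambdas below are stand-ins and not real generators.

```python
import numpy as np


def cross_cycle_loss(x, y_orig, c, g1, g2):
    """Recolor x to color c with g2, map back to category y_orig with g1,
    and measure the mean absolute reconstruction error."""
    recolored = g2(x, c)
    reconstructed = g1(recolored, y_orig)
    return float(np.abs(reconstructed - x).mean())


# Toy stand-ins: g2 brightens the image and g1 undoes it exactly.
x = np.full((2, 2, 3), 0.25)
g2 = lambda img, color: img + 0.5
g1 = lambda img, label: img - 0.5
print(cross_cycle_loss(x, y_orig=0, c=np.array([1.0, 0.0, 1.0]), g1=g1, g2=g2))
```

A small loss here means the recoloring step preserved enough of the image for the data-driven generator to recover the original.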
Total loss. The total loss is a weighted combination of the StarGAN loss, the color loss, and the proposed cycle loss, with one loss function for training $D$ and another for training $G_1$ and $G_2$.
4 Experiments
Figure 3: Generated images of unseen categories. Columns: real images, SN-GAN, One-hot KG-GAN, KG-GAN w/o $\mathcal{L}_{se}$, KG-GAN.
4.1 Fine-grained Image Generation
Experimental Settings. We use the Oxford Flowers dataset nilsback2008automated , which contains flower images from categories. Each image is annotated with visual descriptions. Following reed2016learning , we randomly split the images into seen and unseen categories. To extract the semantic embedding vector of each category, we first extract sentence features from each visual description using the fastText library bojanowski2017enriching . Then we average over the features within each category to obtain the per-category feature vector as the semantic embedding. We resize the images to as the image size in our experiments. For the SN-GAN part of the model, we use its default hyper-parameters and training configurations. In particular, we train for iterations. For the knowledge part, we use in our experiments.
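The per-category averaging step can be sketched as follows. Sentence features are assumed to be precomputed (for example, by a sentence embedding model); the function name `category_embeddings` and the toy labels are hypothetical.

```python
from collections import defaultdict

import numpy as np


def category_embeddings(sentence_features, labels):
    """Average sentence features within each category.

    sentence_features: iterable of equal-length feature vectors.
    labels: iterable of category labels, one per feature vector.
    Returns a dict mapping each label to its mean feature vector.
    """
    grouped = defaultdict(list)
    for feat, label in zip(sentence_features, labels):
        grouped[label].append(np.asarray(feat, dtype=float))
    return {label: np.mean(feats, axis=0) for label, feats in grouped.items()}


features = [[1.0, 1.0], [3.0, 3.0], [0.0, 2.0]]
labels = ["rose", "rose", "lily"]
print(category_embeddings(features, labels))
```

Averaging needs only the descriptions and not the images, so an embedding can be built for a category whose images are withheld from training.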
Comparing Methods. We compare with SN-GAN trained on the full Oxford Flowers dataset, which potentially represents a performance upper bound of our method. We additionally evaluate two ablations of KG-GAN: (1) One-hot KG-GAN: $y$ is a one-hot vector that represents the target category. (2) KG-GAN w/o $\mathcal{L}_{se}$: our method without $\mathcal{L}_{se}$.
Results. To evaluate the quality of the generated images, we compute the FID scores heusel2017gans in a per-category manner as in miyato2018cgans . Then, we average over the FID scores of the set of the seen and the unseen categories, respectively. Table 1 shows the seen and unseen FID scores. We can see from the table that in terms of the category condition, semantic embedding gives better FID scores than one-hot representation. Our full method achieves the best FID scores. For a visual comparison, we show the generated images of two representative unseen categories in Figure 3. As we can see, our full model faithfully generates flowers that have the right color. However, there is room for improvement in terms of the shapes and the structures. This is because color is the major information provided by the visual descriptions of the Oxford Flowers dataset. Also, shapes or structures are more complex to describe than color. For the generated images of the remaining unseen categories, please refer to the supplementary material.
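A sketch of the per-category evaluation, assuming each category is summarized by the mean and covariance of its image features. `frechet_distance` follows the standard Fréchet distance between Gaussians, and `mean_fid` averages per-category scores over a chosen group; both names and the toy scores are hypothetical.

```python
import numpy as np


def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between two Gaussians given means and covariances."""
    diff = mu1 - mu2
    # The trace of sqrt(sigma1 @ sigma2) equals the sum of the square roots of
    # its eigenvalues; real part and clipping remove small numerical noise.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_sqrt)


def mean_fid(per_category_fid, categories):
    """Average per-category scores over a group such as seen or unseen."""
    return float(np.mean([per_category_fid[c] for c in categories]))


scores = {"a": 10.0, "b": 20.0, "c": 40.0}  # toy per-category scores
print(mean_fid(scores, ["a", "b"]), mean_fid(scores, ["c"]))
```

Reporting the two group averages separately keeps a strong result on seen categories from masking a weak one on unseen categories.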
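Table 1: Seen and unseen FID scores. Columns: Method, Training data, Condition, Seen FID, Unseen FID.
Table 3: Ablation study on hair segmentation accuracy. Columns: Mask, Generator output, Color loss, Cycle loss, L1, MSE, MS-SSIM.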
4.2 Hair Recoloring
Experimental Settings. We use the CelebA dataset liu2015faceattributes , which contains face images of celebrities. Each image is annotated with binary attributes. We use the three hair attributes (black, brown, and blond) in our experiments. We randomly select K images as the test set and use the remaining images as the training set. We center crop the original CelebA images to and resize them to , which is the image size in our experiments. For the StarGAN part of the model, we use its default hyper-parameters and training configurations except that we set . For the knowledge part, We use and in our experiments.
Evaluation Metrics. We evaluate the realism and the recoloring quality of the generated images. For realism, we use the FID score heusel2017gans , which is computed from the last convolutional layer features of an Inception-V3 network fine-tuned on the CelebA dataset. For recoloring quality, we measure the segmentation accuracy of the mask network $M$. To achieve this, we utilize the hair segmentation ground truth, provided by Borza et al. borza2018deep , of images from the CelebA dataset. In particular, we measure three metrics: L1 error, mean squared error (MSE), and MS-SSIM wang2003multiscale .
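Two of the three segmentation metrics are simple enough to sketch directly (MS-SSIM is omitted here). The function name `mask_errors` is hypothetical, and masks are assumed to be arrays with values in [0, 1].

```python
import numpy as np


def mask_errors(pred_mask, gt_mask):
    """Return (L1 error, MSE) between a predicted mask and its ground truth."""
    pred = np.asarray(pred_mask, dtype=float)
    gt = np.asarray(gt_mask, dtype=float)
    l1 = float(np.abs(pred - gt).mean())
    mse = float(((pred - gt) ** 2).mean())
    return l1, mse


pred = np.array([[1.0, 0.5],
                 [0.0, 0.0]])
gt = np.array([[1.0, 1.0],
               [0.0, 0.0]])
print(mask_errors(pred, gt))
```

L1 penalizes uncertain pixels linearly while MSE penalizes them quadratically, so a soft but roughly correct mask scores relatively better under MSE.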
Realism. We first evaluate the realism of the generated images. Since we do not have any real data as a reference for $G_2$, we cannot use FID to evaluate the generated images of $G_2$. Therefore, we examine the results of $G_2$ qualitatively. Figure 4 shows the recoloring results and the corresponding hair segmentation maps. Our $G_2$ successfully recolors hair into various colors while maintaining realism (please refer to the supplementary material for more qualitative results).
We still compute FID for $G_1$ to see whether adding $G_2$ has any impact on $G_1$. Following miyato2018cgans , we compute FID between real and generated images for each category. In particular, we randomly sample K images from the training set as the real images and use $G_1$ to generate K fake images of a particular category from the test set. Table 2 shows the FID scores. We can see that: (1) adding the mask mechanism improves realism, and (2) the FID scores of $G_1$ in KG-GAN are worse than those of StarGAN + $M$. The reason is that our transform generator is more sensitive to the segmentation accuracy than StarGAN's image generator.
Ablation study. Next, we conduct an ablation study to justify the contribution of each of our proposed components. Note that the mask network and the color loss are necessary components to enable our $G_2$. Thus we ablate the generator architecture and the cycle loss. In particular, we evaluate the recoloring quality quantitatively by measuring the segmentation accuracy. As we can see in Table 3, our transform generator significantly outperforms StarGAN's image generator. We note that the cycle loss greatly improves the realism of $G_2$. However, training with the cycle loss gives slightly worse segmentation results than training without it. Figure 5 shows a qualitative comparison of our ablated variants. As we can see, adopting our transform generator improves the hair segmentation while adding the cycle loss improves realism.
Limitations. Our mask network may recognize eyebrows or beard as hair because their colors are often similar to the hair color. Such false positives do not hurt $G_1$ but hurt $G_2$, because recoloring eyebrows or beard with a bright color is often unrealistic. One possible solution could be teaching $M$ to distinguish hair from eyebrows and beard by collecting an additional category of data whose hair color is uncommon for both eyebrows and beard.
5 Conclusion
We presented KG-GAN, the first framework that incorporates domain knowledge into the GAN framework for improving diversity. We applied KG-GAN to two tasks to demonstrate its effectiveness. For fine-grained image generation, our results show that when the semantic embedding only provides coarse knowledge about a particular aspect of flowers, the generation of the other aspects mainly borrows from seen classes, meaning that there is still much room for improvement. For hair recoloring, unsupervised segmentation plays an essential role in the knowledge. The transform generator improves segmentation accuracy, while cycle-consistency further improves realism. KG-GAN takes the first step towards increasing diversity with knowledge. We hope that our work will inspire future research along the direction of knowledge-guided image generation.
References
-  Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, 2014.
-  Takeru Miyato and Masanori Koyama. cGANs with projection discriminator. In ICLR, 2018.
-  Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In ICML, 2016.
-  Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.
-  Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
-  Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.
-  Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, 2018.
-  Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. arXiv preprint arXiv:1903.07291, 2019.
-  Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
-  Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In ICLR, 2018.
-  Zhiting Hu, Zichao Yang, Ruslan R Salakhutdinov, Lianhui Qin, Xiaodan Liang, Haoye Dong, and Eric P Xing. Deep generative models with learnable knowledge constraints. In Advances in Neural Information Processing Systems, 2018.
-  Ahmed Elgammal, Bingchen Liu, Mohamed Elhoseiny, and Marian Mazzone. CAN: Creative adversarial networks, generating "art" by learning about styles and deviating from style norms. arXiv preprint arXiv:1706.07068, 2017.
-  Abdullah Hamdi and Bernard Ghanem. IAN: Combining generative adversarial networks for imaginative face generation. arXiv preprint arXiv:1904.07916, 2019.
-  Qi Mao, Hsin-Ying Lee, Hung-Yu Tseng, Siwei Ma, and Ming-Hsuan Yang. Mode seeking generative adversarial networks for diverse image synthesis. In CVPR, 2019.
-  Yongqin Xian, Bernt Schiele, and Zeynep Akata. Zero-shot learning - the good, the bad and the ugly. In CVPR, 2017.
-  Yongqin Xian, Tobias Lorenz, Bernt Schiele, and Zeynep Akata. Feature generating networks for zero-shot learning. In CVPR, 2018.
-  Rafael Felix, Vijay BG Kumar, Ian Reid, and Gustavo Carneiro. Multi-modal cycle-consistent generalized zero-shot learning. In ECCV, 2018.
-  Yongqin Xian, Saurabh Sharma, Bernt Schiele, and Zeynep Akata. f-VAEGAN-D2: A feature generating framework for any-shot learning. In CVPR, 2019.
-  Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In ICCV, 2015.
-  Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In ICML, 2017.
-  Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, 2017.
-  Gang Zhang, Meina Kan, Shiguang Shan, and Xilin Chen. Generative adversarial network with spatial attention for face attribute editing. In ECCV, 2018.
-  Bo Zhao, Bo Chang, Zequn Jie, and Leonid Sigal. Modular generative adversarial networks. In ECCV, 2018.
-  Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In ICVGIP, 2008.
-  Scott Reed, Zeynep Akata, Honglak Lee, and Bernt Schiele. Learning deep representations of fine-grained visual descriptions. In CVPR, 2016.
-  Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 2017.
-  Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, 2017.
-  Diana Borza, Tudor Ileni, and Adrian Darabant. A deep learning approach to hair segmentation and color extraction from facial images. In International Conference on Advanced Concepts for Intelligent Vision Systems, 2018.
-  Zhou Wang, Eero P Simoncelli, and Alan C Bovik. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003.