Generative deep learning is a rapidly growing field in which recent works have shown remarkable success across different domains. In particular, the computer vision community has witnessed dramatic improvements in a large variety of tasks, ranging from image synthesis [27, 12, 3] to image-to-image translation [10, 4, 19]. The latter task poses the problem of translating images from one domain to another, and includes style transfer [11, 29, 13], inpainting [24, 16, 25, 26], attribute transfer [14, 6, 4], and others.
The objective of attribute transfer is to synthesize new, realistic-looking images for a pre-defined target domain. For instance, Fig. 1 row 6 shows a non-smiling man with a mustache wearing eyeglasses (these are the given attributes, a.k.a. domains), and the output results show how these attributes have been changed one at a time according to our target attribute domain. We refer to a domain as a set of images sharing the same attributes. Such attributes are meaningful semantic features of an image, such as a mustache or eyeglasses. Like other image-to-image translation tasks, attribute transfer methods have also achieved impressive progress by implementing different variants of GANs [6, 4, 13], leading to state-of-the-art results in the field. Nevertheless, these attribute transfer approaches are mostly based on the global manipulation of the GAN latent space. As a result, in order to produce good transfer results, the aforementioned methods require additional inverse generator paths (which tends to make them less stable) and can be quite cumbersome.
Image inpainting or completion refers to the task of inferring locally missing or damaged parts of an image. It has been applied to many different applications such as photo editing, restoration of damaged paintings, image-based rendering, and computational photography. The main challenge of image inpainting is to synthesize realistic pixels for the missing regions that are coherent with the existing ones. Image inpainting techniques are mostly separated into two groups according to their basic approach. The first group uses local methods [2, 9] based on low-level feature information, such as color or texture, to attempt to solve the problem. The second group relies on recognizing patterns in images, e.g. with deep convolutional neural networks (CNNs), to predict pixels for the missing regions. CNN-based models deal with both local and global features and can, in combination with generative adversarial networks (GANs), produce realistic inpainted outputs. The introduction of GANs has inspired recent works [20, 24, 16, 10, 25] that formulate the inpainting task as a conditional image generation problem, using a generator for inpainting and a discriminator for evaluating the result.
The contribution of this paper is a novel attribute transfer approach that alters given natural images in such a way that the output image meets the pre-defined visual attributes. To do so, our proposed architecture integrates an inpainting block. In particular, we take advantage of the fact that most facial attributes are induced by local structures (e.g. the relative position between eyes and ears). Hence, it is sufficient to change only parts of the face, while the remaining parts can be used to force the generator towards realistic outputs. Note that the hole/mask (in terms of inpainting) is generated by our method to apply this “trick”. The proposed ATI-GAN model integrates inpainting for local attribute transfer in a single end-to-end architecture with three main building blocks: first, an inpainting network that takes masked images as input and outputs realistically restored images; second, a network that takes these inpainted images and encoded attributes (e.g. a one-hot vector) as input and learns how to separate attribute information from the rest of the image representation; finally, a third network, which acts as a discriminator, judges whether the overall result looks realistic.
Evaluations of our model on the CelebA dataset of faces demonstrate the capacity of ATI-GAN to produce high-quality outputs. Quantitative and qualitative results show superior inpainting and attribute transfer performance.
2 Related Work
In computer vision, deep learning approaches have contributed heavily to many semantic image understanding tasks. In this section, we briefly review publications related to our work in each of the different sub-fields. In particular, since our proposal is based on GANs and image translation, more specifically on inpainting and attribute transfer, we review the seminal works in those directions.
Figure 2: Overview of the proposed architecture. The discriminator takes real and fake images as inputs and learns to distinguish between them at global and patch levels; it also learns to classify them into their corresponding domain. The generator takes both the output of the reconstructor and the target domain label as inputs and generates the domain-transformed image. Then, the transformed image and the original domain label are fed back into the generator (creating a loop). After that, both generator outputs (from the two iterations) are passed to the discriminator one at a time.
2.1 Generative Adversarial Networks
Generative adversarial networks have arisen as a reliable framework for deep generative models. They have shown remarkable results in various computer vision tasks such as image generation [27, 12], style transfer [11, 29], inpainting [20, 24, 16, 10, 25] and attribute transfer [14, 6, 4]. The vanilla GAN model consists of two networks, a generator $G$ and a discriminator $D$. Its training procedure can be seen as a minimax game between $G$, which learns how to generate samples resembling real data, and $D$, which learns to discriminate between real and fake data. Throughout this process, $G$ indirectly learns how to model the input image distribution by taking samples $z$ from a fixed prior distribution (e.g. Gaussian) and forcing the generated samples $G(z)$ to match the input images $x$. The objective loss function is defined as

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\big(1 - D(G(z))\big)\right].$$
GAN-based conditional approaches have shown rapid progress and have become an essential ingredient of recent research. The intuition behind this kind of GAN is to insert class information into the model in order to generate samples that are conditioned on the class. In this work, we take advantage of this property and encode the attribute characteristics as conditional information that is fed into the model.
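As a concrete illustration of the adversarial objective in Sec. 2.1, the following numpy sketch evaluates the discriminator loss and the commonly used non-saturating generator loss from raw discriminator outputs; the function and variable names are our own, not the paper's.

```python
import numpy as np

def gan_losses(d_real, d_fake, eps=1e-8):
    """Vanilla GAN losses given discriminator outputs in (0, 1).

    d_real: D(x) on real samples; d_fake: D(G(z)) on generated samples.
    Returns (discriminator_loss, generator_loss); the generator term uses
    the non-saturating form -log D(G(z)).
    """
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    # D maximizes log D(x) + log(1 - D(G(z))): minimize the negative mean.
    d_loss = -np.mean(np.log(d_real + eps) + np.log(1.0 - d_fake + eps))
    # G maximizes log D(G(z)).
    g_loss = -np.mean(np.log(d_fake + eps))
    return d_loss, g_loss
```

A confident discriminator (scores near 1 on real data, near 0 on fakes) yields a small discriminator loss and a large generator loss, which is what drives the minimax game.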
2.2 Image Inpainting
Classical inpainting methods are often based on either local or non-local information to rebuild the patches. Local methods [2, 9] attempt to solve the problem using only context information, such as color or texture, i.e. matching, copying and merging background patches into holes by propagating the information from hole boundaries. These approaches need very little training or prior knowledge and provide good results, especially in background inpainting tasks. However, they do not perform well for large patches because of their inability to generate novel image content. More powerful methods are global, content-based and semantic inpainting approaches. Even though these techniques require more expensive dense patch computations, they can handle larger patches successfully. In many models, CNN-based approaches have become the de facto implementation due to their capability to learn to recognize patterns in images and use them to fill holes.
GANs for image translation [20, 24, 16, 10] have emerged as a promising paradigm for inpainting tasks. Nowadays, they are already able to produce realistic synthetic outputs at high image resolution [25, 26]. In order to reach this point, though, GAN techniques have been evolving quite intensively over the past few years. Initially, the inpainting task was formulated as a conditional image generation problem, consisting of one generator and one discriminator. However, more recent works [10, 25] have introduced the concept of global and local discriminators. Furthermore, apart from modifying the topology of the discriminator, recent methods [10, 16, 20, 25] have adopted new losses such as the Wasserstein loss or the Wasserstein loss with gradient penalty. Inspired by all of these works, our approach also leverages the conditional adversarial framework with global and local discriminators together with the Wasserstein loss with gradient penalty.
2.3 Attribute Transfer
Small and unbalanced datasets can cause severe problems when training a machine learning model. Recently, numerous works have turned their attention towards transferring visual attributes, such as color, texture [11, 29], facial features [14, 6, 4] and more, for data augmentation. However, although most approaches correctly synthesize new attributes belonging to the target domain, it is still very challenging to generalize attributes between different applications, since methods are usually designed to transfer a specific attribute type (e.g. facial expressions, facial attributes, or colors).
GAN-based approaches for image translation have been actively studied. One of the first proposals capable of learning consistent image domain transforms employed pairs of images to train models that convert from the original to the target domain (e.g. segmentation labels to the original image). Unfortunately, this system requires that source and target images exist as pairs in the training dataset in order to learn the transformation between domains. Several works [29, 4] have tried to address this drawback. They suggested using the virtual result in the target domain: if the virtual result is inverted again, the inverted result must match the original image. In these works, the framework can flexibly control the image translation into different target domains.
In the following section, we describe the ATI-GAN approach, which addresses image-to-image translation for facial attribute transfer. We explain the training of the reconstructor, generator and discriminator in detail, showing that our model trains in an introspective manner, such that it can estimate the difference between the generated (fake) samples and the real samples, and update itself to produce more realistic samples.
3.1 Model Architecture
The network architecture of our proposal is depicted in Fig. 2. It is separated into an inpainting network (reconstructor, Fig. 1(a)), a generative network (generator, Fig. 1(c)) and a discriminative network (discriminator, Fig. 1(b)). By combining these blocks in a sequential manner, the ensemble model is able to perform end-to-end attribute transfer.
Reconstructor. Given an image and its masked counterpart, the reconstructor takes the masked image and tries to fill the large missing region with plausible content so that the output looks realistic (see Fig. 3). To achieve this objective, the reconstructor is backed up by the discriminator (as in a vanilla GAN setting), which assesses the reconstructed images. We can formulate the inpainting training process as a minimization problem.
On the one hand, the reconstruction loss $\mathcal{L}_{rec}$ constrains the reconstructed image by minimizing the absolute differences between the estimated values and the existing target values. It is defined as

$$\mathcal{L}_{rec} = \lVert x_{ct} - \hat{x}_{ct} \rVert_1 + \lVert x_{p} - \hat{x}_{p} \rVert_1,$$

where the subindices $ct$ and $p$ refer to contour and patch respectively, $x$ denotes the real image and $\hat{x}$ the reconstructed one. We can observe these concepts in Fig. 3(a), where a detailed visual example is depicted. We apply the $\ell_1$ distance norm separately to the contour and to the patch. We treat them differently because the patch loss does not have to be strictly 0: synthetic patches which are not exactly equal to the original are also a valid solution, as long as they look realistic and fit with the contour.
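The contour/patch split can be sketched in numpy as follows; the equal weighting of the two terms and the function names are our assumptions, not the paper's exact formulation.

```python
import numpy as np

def reconstruction_loss(real, restored, mask):
    """L1 reconstruction loss split into a contour and a patch term.

    mask is True inside the inpainted patch and False on the contour
    (the untouched surroundings). The patch term tolerates deviations
    from the original as long as training keeps the result plausible.
    """
    real = np.asarray(real, dtype=float)
    restored = np.asarray(restored, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    diff = np.abs(real - restored)
    l_contour = diff[~mask].mean()  # contour must match the original closely
    l_patch = diff[mask].mean()     # patch may differ, but is still penalized
    return l_contour + l_patch
```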
On the other hand, the adversarial loss $\mathcal{L}_{adv}$ penalizes unrealistic images, and the classification loss (see Eq. 9) penalizes incorrect image-domain transformations. For these two cases the ideal solution converges to 0 loss. Further details are discussed in the following subsections.
In terms of topology, we adopt the coarse network architecture introduced in prior work. Since the size of the receptive fields is a decisive factor in inpainting tasks, we use dilated convolutions to guarantee a sufficiently large size. Additionally, we use mirror padding for all convolution layers and exponential linear units (ELUs) as activation functions.
Generator. The role of the generator $G$ is to learn the mappings among multiple attribute domains. To achieve this goal, we use the modified output from the reconstructor (see Fig. 3(b)) as input. Then we train $G$ to translate this input into an output image conditioned on the target domain code label (see Fig. 5). As before, the adversarial and classification losses also play important roles in the generator loss definition. Moreover, we have an extra term called the cycle consistency loss. Previous works [14, 29, 4] have shown how this term helps to create strong paths between the latent space and the outcomes; in our case, it guarantees that the translated images preserve the content of their input images while changing only the domain-related part of the inputs through the latent space. This term is defined as

$$\mathcal{L}_{cyc} = \mathbb{E}\left[\lVert x - G(G(x, c'), c) \rVert_1\right],$$

where $c$ and $c'$ denote the original and target domain labels.
Finally, joining all terms together, we have the following formula for the generator loss:

$$\mathcal{L}_{G} = \mathcal{L}_{adv} + \lambda_{cls}\,\mathcal{L}_{cls}^{f} + \lambda_{cyc}\,\mathcal{L}_{cyc}.$$
Note that the generator performs the entire cyclic translation for every sample, forcing the domain code to be crucial for moving among domains: first translating an original image into an image in the target domain, and then recovering the original image from the translated one.
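The cycle term and the weighted generator objective described above can be sketched in numpy as below; the default weight values are illustrative placeholders, not the paper's exact settings.

```python
import numpy as np

def cycle_consistency_loss(original, recovered):
    """L1 cycle loss: translating to the target domain and back
    should recover the original image."""
    original = np.asarray(original, dtype=float)
    recovered = np.asarray(recovered, dtype=float)
    return float(np.mean(np.abs(original - recovered)))

def generator_loss(l_adv, l_cls, l_cyc, lam_cls=1.0, lam_cyc=10.0):
    """Weighted sum of adversarial, classification and cycle terms.
    lam_cls / lam_cyc are illustrative defaults."""
    return l_adv + lam_cls * l_cls + lam_cyc * l_cyc
```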
We build on this base and apply several topology changes to adapt it to our system. The final generator architecture consists of three convolutional layers, two of them with a stride of two for down-sampling, six residual blocks, where each block adds dilated layers with different dilation values, and two transposed convolutional layers with a stride of two for up-sampling. We use instance normalization in the network.
Discriminator. Our discriminator behaves slightly differently from a vanilla GAN discriminator. It takes samples of real and generated data and tries to classify them correctly as real or fake. Additionally, a second in-built classifier tries to determine the domain of each sample. As a result, the discriminator needs one adversarial loss that judges the appearance of the images and one classification loss that classifies the attributes.
The inner structure of our discriminator also differs from standard discriminators. It is split into two fully convolutional topologies, a global one and a patch-based one, plus one final convolutional layer that combines both discriminators' outputs (see Fig. 6). All layers have a stride of two for down-sampling, followed by LeakyReLUs as activation functions.
While the global discriminator assesses the semantic consistency of the whole image, the patch discriminator deals with the reconstructed, initially masked part to enforce local consistency. As a consequence, every image is evaluated by two independent loss functions, which together form the joint adversarial loss

$$\mathcal{L}_{adv} = \mathcal{L}_{adv}^{global} + \mathcal{L}_{adv}^{patch}.$$
Over the last years, formulations of adversarial loss functions have continuously been changing and improving. GANs based on the Earth-Mover distance loss were one of the first attempts to clearly outperform the vanilla GAN. Consequently, several approaches to image-to-image translation [18, 11, 29] and generative inpainting networks [10, 16, 20] relied on DCGAN for their adversarial supervision. However, more recent research has shown that adding a gradient penalty instead of weight clipping to enforce the Lipschitz constraint has beneficial effects for image generation. As a result, a second wave of publications [4, 25] proposed to use WGAN-GP. Following this approach, we write the global adversarial loss as

$$\mathcal{L}_{adv}^{global} = \mathbb{E}_{\hat{x}}\left[D(\hat{x})\right] - \mathbb{E}_{x}\left[D(x)\right] + \lambda_{gp}\,\mathbb{E}_{\tilde{x}}\left[\big(\lVert \nabla_{\tilde{x}} D(\tilde{x}) \rVert_2 - 1\big)^2\right],$$
where the interpolated sample $\tilde{x}$ is drawn uniformly along a straight line between a pair of real and generated images. The patch loss $\mathcal{L}_{adv}^{patch}$ is analogous to $\mathcal{L}_{adv}^{global}$ after replacing the full images with the corresponding patches. Note that Eq. 8 corresponds to the case when the reconstructor or the generator is updated; for updating the discriminator, the generated input has to be replaced accordingly by the real one.
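A minimal numpy sketch of the WGAN-GP critic objective follows; the gradients at the interpolated points are passed in directly (in a real framework they would come from automatic differentiation), and all names here are our own.

```python
import numpy as np

def wgan_gp_critic_loss(d_real, d_fake, grad_interp, lam=10.0):
    """WGAN-GP critic loss: Wasserstein estimate plus gradient penalty.

    d_real / d_fake: critic scores on real and generated samples.
    grad_interp: critic gradients at points sampled uniformly on straight
    lines between real/fake pairs, shape (batch, dims).
    """
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    grad_interp = np.asarray(grad_interp, dtype=float)
    wasserstein = np.mean(d_fake) - np.mean(d_real)
    grad_norms = np.sqrt((grad_interp ** 2).sum(axis=1))
    # Penalize deviations of the gradient norm from 1 (Lipschitz constraint).
    penalty = lam * np.mean((grad_norms - 1.0) ** 2)
    return wasserstein + penalty
```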
As mentioned above, our discriminator relies on a second loss function which accounts for domain classification. It computes the binary cross-entropy between the output domain labels from the discriminator and the labels of either the translated or the real input. Therefore, we can write the classification loss for training the reconstructor or the generator as

$$\mathcal{L}_{cls}^{f} = \mathbb{E}\left[-\log D_{cls}\big(c' \mid G(\hat{x}, c')\big)\right],$$
where the superscript $f$ stands for fake input. Note that if we break Eq. 9 down, we can see that the input to the discriminator is the result of the generator $G$ when given a reconstructed input image $\hat{x}$ and a target domain label $c'$. In the case of training the reconstructor, the discriminator is fed directly with $\hat{x}$, this time conditioned on the original domain label $c$. On the other hand, the discriminator classification loss is written as

$$\mathcal{L}_{cls}^{r} = \mathbb{E}\left[-\log D_{cls}(c \mid x)\right],$$
where the superscript $r$ stands for real input. Note that the procedure is almost the same as for the reconstructor, but now the input to the discriminator is the real image $x$, and by extension we call the loss $\mathcal{L}_{cls}^{r}$.
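The domain classification term described above is a standard binary cross-entropy over attribute labels; a minimal numpy sketch with our own naming:

```python
import numpy as np

def domain_classification_loss(pred_probs, target_labels, eps=1e-8):
    """Binary cross-entropy between predicted per-attribute probabilities
    and multi-hot target domain labels (one entry per attribute)."""
    p = np.clip(np.asarray(pred_probs, dtype=float), eps, 1.0 - eps)
    t = np.asarray(target_labels, dtype=float)
    return float(-np.mean(t * np.log(p) + (1.0 - t) * np.log(1.0 - p)))
```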
The key to good global training lies heavily in the discriminator. On the one hand, it indirectly forces the generator, by penalizing the gradients flowing through it, to produce correct image-domain transformations, and it learns to output the correct label when fed with real data. On the other hand, it also needs to learn to classify between true and generated data and penalize accordingly. Therefore, the discriminator's optimization problem is formulated as

$$\min_{D} \; \mathcal{L}_{D} = -\mathcal{L}_{adv} + \lambda_{cls}\,\mathcal{L}_{cls}^{r}.$$
In this section, we present results for a series of experiments evaluating the proposed method, both quantitatively and qualitatively. We first give a detailed introduction of the experimental setup. Then, we discuss the inpainting outcomes and finally, we review the attribute transfer results.
4.1 Experimental Settings
We train ATI-GAN on the CelebFaces Attributes (CelebA) dataset. It consists of 202,599 celebrity face images with variations in facial attributes. We randomly select 2,000 images for testing and use all remaining images as the training set. For training, we crop and resize the initial 178x218 pixel images to 128x128 pixels and mask them with 52x52 patches. These masked regions are always centered around the tip of the nose, occluding in most cases a large portion of the face (see Figure 7). All experiments presented in this paper were conducted on a single NVIDIA GeForce GTX 1080 GPU, without applying any post-processing.
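The masking step described above can be sketched in numpy as follows; the zero fill value inside the hole is our assumption, as the paper does not state it.

```python
import numpy as np

def center_mask(image, patch=52):
    """Occlude a centered square patch, mirroring the paper's CelebA
    setup of 128x128 inputs with a 52x52 central hole around the nose."""
    img = np.asarray(image, dtype=float).copy()
    h, w = img.shape[:2]
    top, left = (h - patch) // 2, (w - patch) // 2
    mask = np.zeros((h, w), dtype=bool)
    mask[top:top + patch, left:left + patch] = True
    img[mask] = 0.0  # assumed fill value for the hole
    return img, mask
```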
Since our model is divided into three distinguishable parts (reconstruction/inpainting, generative and discriminative), three independent Adam optimizers are used during training. We set the batch size to 16 and run the experiments for 200,000 iterations. We start using the output of the reconstructor as input for the generator after iteration 50K; in this way, we ensure that the gradient updates in the generative model are reliable. We update the generator once after every five discriminator updates, as in [8, 4]. The learning rate is 0.0001 for the first 10 epochs and is then linearly decreased to 0 over the next 10 epochs. The loss weights are set to 10, 5 and 1, respectively. The training procedure as a whole is described in Algorithm 1.
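The update ratio and learning-rate decay just described can be sketched in plain Python; the epoch counts follow the text, while the helper names are ours.

```python
def linear_lr(epoch, base_lr=1e-4, constant_epochs=10, decay_epochs=10):
    """Constant learning rate for the first epochs, then a linear decay
    to zero over the following ones, as described in the text."""
    if epoch < constant_epochs:
        return base_lr
    progress = (epoch - constant_epochs) / decay_epochs
    return max(0.0, base_lr * (1.0 - progress))

def is_generator_step(iteration, n_critic=5):
    """Update the generator once per n_critic discriminator updates."""
    return iteration % n_critic == 0
```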
4.3 Image Inpainting
The image inpainting problem has a number of different formulations. The definition of interest here is: given that most of the pixels of a face are unobserved because they are masked, our objective is to restore them in a natural way, so that we end up with a plausible and realistic face. In order to achieve good inpainting results, our synthesized faces must fit into these masks/holes taking into account both the reconstruction quality of the face and its blending with the rest of the image. Note that this pixel transformation is conditioned on the desired attribute transfer too.
It is important to notice that there is no perfect numerical metric for semantic inpainting due to the existence of an infinite number of possible solutions. Note that image inpainting algorithms do not try to reconstruct the ground-truth image, but to fill the masked area with content that looks realistic; as a result, the ground-truth image is only one of many possibilities. Following classical inpainting approaches, we employ the peak signal-to-noise ratio (PSNR) in our evaluation study. However, this metric might oversimplify the comparison since it directly measures differences in pixel values. Therefore, it is usually combined with a second metric called structural similarity (SSIM), which offers a more elaborate and reliable measurement.
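For reference, PSNR can be computed as in the numpy sketch below; SSIM is considerably more involved and is typically taken from a library such as scikit-image.

```python
import numpy as np

def psnr(reference, restored, max_val=255.0):
    """Peak signal-to-noise ratio in dB; higher means the restored image
    is numerically closer to the reference."""
    reference = np.asarray(reference, dtype=float)
    restored = np.asarray(restored, dtype=float)
    mse = np.mean((reference - restored) ** 2)
    if mse == 0.0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)
```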
Our inpainting goal is always evaluated under the same conditions, i.e. regenerating missing facial attributes, so our mask is a central square patch in the image. This is the standard crop procedure for CelebA, since most of the information lies in the center of the image. Table 1 shows the comparison results for the PSNR and SSIM metrics, where similar works have also reported scores based on square centered crops.
We are inclined to think that the improvement of the metrics (especially PSNR) comes from a good equilibrium between our reconstructor and our discriminator. While the reconstructor learns to produce the coarse features of faces (natural-looking structures) via the reconstruction loss, the discriminator pushes it towards finer details through the global and patch adversarial losses.
Finally, these results demonstrate that our approach is able to utilize the end-to-end model architecture to propagate informative gradients, which eventually leads to a significant performance gain. Nevertheless, note that we do not aim at outperforming state-of-the-art image inpainting techniques; rather, we use inpainting as a crucial part of our attribute transfer system.
4.4 Attribute Transfer
In this subsection, we focus on attribute manipulation. We validate that faces change according to the specified target attribute. This phenomenon is known as attribute transfer or morphing. In particular, we focus on the following set of attributes: eyeglasses, mustache, smiling and young.
Figure 8 shows the transformed images (with the target attribute), from which we can qualitatively judge the attribute transfer results. We can observe that ATI-GAN clearly generates natural-looking faces containing the target attributes, providing very competitive results on test data. This is possible because of the inherent properties of the end-to-end system, which takes advantage of the inpainting structure (among others) presented in this work.
|Smiling → Not-Smiling||Not-Smiling → Smiling|
|Eyeglasses → Not-Eyeglasses||Not-Eyeglasses → Eyeglasses|
|Young → Old||Old → Young|
|Mustache → Not-Mustache||Not-Mustache → Mustache|
Additionally, we have performed a user study to assess the attribute transfer task. It consists of a survey in which users label an image with 0 when the attribute is not recognized and 1 otherwise. For each attribute transfer, we conduct the test on a subset of the testing data. Note that the generated images contain a single attribute translation from the aforementioned list.
According to Table 2, a large part of our translations achieves a successful attribute transformation. More interesting, however, is to analyze the meaning behind the percentages. By inspecting pairs of transformations, for instance Eyeglasses → Not-Eyeglasses and vice versa, we notice that there is no symmetry in transferring attributes. This mainly happens because ATI-GAN is not an invertible model; therefore, moving from domain A to B involves one path (a certain set of operations) and moving from B to A another path (another set of operations). A second plausible explanation for this asymmetry is the nature of the dataset used in the experiments. It is known that CelebA suffers from unbalanced attributes, meaning that not all attributes are equally present. As a result, the ability to transfer might be conditioned on the number of samples containing the involved feature. For example, Not-Mustache → Mustache has a much lower success rate for women than for men, because there are no examples of women wearing mustaches.
In this paper, we introduce a novel image-to-image translation model capable of applying an accurate local attribute transformation. Previous attribute transfer works were mostly based only on the manipulation of GAN latent space. However, we propose a completely different approach utilizing inpainting as a part of our embedded system. Our method takes advantage of the fact that attributes are induced by local structures. Therefore, it is sufficient to change only parts of the image, while the remaining parts can be used to force the generator into realistic outputs. We show how ATI-GAN can synthesize high quality human face images. We do believe the method is generalisable to other objects and domains being able to produce synthetic data containing certain attributes on demand. We see many interesting avenues of future work including exploring multi-attribute transfer.
-  M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
-  C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman. Patchmatch: A randomized correspondence algorithm for structural image editing. In ACM Transactions on Graphics (ToG), volume 28, page 24. ACM, 2009.
-  A. Brock, J. Donahue, and K. Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
-  Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8789–8797, 2018.
-  L. Chongxuan, T. Xu, J. Zhu, and B. Zhang. Triple generative adversarial nets. In Advances in neural information processing systems, pages 4088–4098, 2017.
-  A. Creswell, Y. Mohamied, B. Sengupta, and A. A. Bharath. Adversarial information factorization. arXiv preprint arXiv:1711.05175, 2017.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
-  I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pages 5767–5777, 2017.
-  Y. Hu, D. Zhang, J. Ye, X. Li, and X. He. Fast and accurate matrix completion via truncated nuclear norm regularization. IEEE transactions on pattern analysis and machine intelligence, 35(9):2117–2130, 2013.
-  S. Iizuka, E. Simo-Serra, and H. Ishikawa. Globally and locally consistent image completion. ACM Transactions on Graphics (ToG), 36(4):107, 2017.
-  P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017.
-  T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
-  T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.
-  T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim. Learning to discover cross-domain relations with generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1857–1865. JMLR. org, 2017.
-  A. Li, J. Qi, R. Zhang, X. Ma, and K. Ramamohanarao. Generative image inpainting with submanifold alignment. arXiv preprint arXiv:1908.00211, 2019.
-  Y. Li, S. Liu, J. Yang, and M.-H. Yang. Generative face completion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3911–3919, 2017.
-  Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pages 3730–3738, 2015.
-  M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
-  S. Mo, M. Cho, and J. Shin. Instance-aware image-to-image translation. In International Conference on Learning Representations, 2019.
-  D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2536–2544, 2016.
-  A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
-  P. Vitoria, J. Sintes, and C. Ballester. Semantic image inpainting through improved wasserstein generative adversarial networks. arXiv preprint arXiv:1812.01071, 2018.
-  Y. Wang, X. Tao, X. Qi, X. Shen, and J. Jia. Image inpainting via generative multi-column convolutional neural networks. In Advances in Neural Information Processing Systems, pages 331–340, 2018.
-  R. A. Yeh, C. Chen, T. Yian Lim, A. G. Schwing, M. Hasegawa-Johnson, and M. N. Do. Semantic image inpainting with deep generative models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5485–5493, 2017.
-  J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang. Generative image inpainting with contextual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5505–5514, 2018.
-  J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang. Free-form image inpainting with gated convolution. In Proceedings of the IEEE International Conference on Computer Vision, pages 4471–4480, 2019.
-  H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 5907–5915, 2017.
-  R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In European conference on computer vision, pages 649–666. Springer, 2016.
-  J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232, 2017.