CT-GAN: Conditional Transformation Generative Adversarial Network for Image Attribute Modification

07/12/2018 · by Sangpil Kim, et al.

We propose a novel, fully-convolutional conditional generative model capable of learning image transformations using a light-weight network suited for real-time applications. We introduce the conditional transformation unit (CTU) designed to produce specified attribute modifications and an adaptive discriminator used to stabilize the learning procedure. We show that the network is capable of accurately modeling several discrete modifications simultaneously and can produce seamless continuous attribute modification via piece-wise interpolation. We also propose a task-divided decoder that incorporates a refinement map, designed to improve the network's coarse pixel estimation, along with RGB color balance parameters. We exceed state-of-the-art results on synthetic face and chair datasets and demonstrate the model's robustness using real hand pose datasets. Moreover, the proposed fully-convolutional model requires significantly fewer weights than conventional alternatives and is shown to provide an effective framework for producing a diverse range of real-time image attribute modifications.







1 Introduction

Generative models have been shown to provide effective frameworks for representing complex, structured datasets and generating realistic samples from underlying data distributions [1]. This concept has also been extended to form conditional models capable of sampling from conditional distributions in order to allow certain properties of the generated data to be controlled or selected [2]. These generative models are designed to sample from broad classes of the data distribution, however, and are not suitable for inference tasks which require identity preservation of the input data. Models have also been proposed which incorporate encoding components to overcome this by learning to map input data to an associated latent space representation within a generative framework [3]. The resulting inference models allow for the defining structure/features of inputs to be preserved while specified target properties are adjusted through conditioning [4]. Conventional conditional models have largely relied on rather simple methods, such as concatenation, for implementing this conditioning process; however,  [5] have shown that utilizing the conditioning information in a less trivial, more methodical manner has the potential to significantly improve the performance of conditional generative models. In this work, we provide a general framework for effectively performing inference with conditional generative models by strategically controlling the interaction between conditioning information and latent representations within a generative inference model.

In this framework, a conditional transformation unit (CTU) is introduced to provide a means for navigating the underlying manifold structure of the latent space. The CTU is realized in the form of a collection of convolutional layers which are designed to approximate the latent space operators defined by mapping encoded inputs to the encoded representations of specified targets (see Figure 1). This is enforced by introducing a consistency loss term to guide the CTU mappings during training. In addition, a conditional discriminator unit (CDU), also realized as a collection of convolutional layers, is included in the network’s discriminator. This CDU is designed to improve the network’s ability to identify and eliminate transformation-specific artifacts in the network’s predictions.

The network has also been equipped with RGB balance parameters consisting of three values designed to give the network the ability to quickly adjust the global color balance of the images it produces to better align with that of the true data distribution. In this way, the network is easily able to remove unnatural hues and focus on estimating local pixel values by adjusting the three RGB parameters rather than correcting each pixel individually. In addition, we introduce a novel estimation strategy for efficiently learning shape and color properties simultaneously; a task-divided decoder is designed to produce a coarse pixel-value map along with a refinement map in order to split the network’s overall task into distinct, dedicated network components.

Summary of contributions:

  1. We introduce the conditional transformation unit, with a family of modular filter weights, to learn high-level mappings within a low-dimensional latent space. In addition, we present a consistency loss term which is used to guide the transformations learned during training.

  2. We propose a novel framework for color inference which separates the generative process into three distinct network components dedicated to learning (i) coarse pixel value estimates, (ii) pixel refinement scaling factors, and (iii) the global RGB color balance of the dataset.

  3. We introduce the conditional discriminator unit designed to improve adversarial training by identifying and eliminating transformation-specific artifacts present in generated images.

Each contribution proposed above has been shown to provide a significant improvement to the network’s overall performance through a series of ablation studies. The resulting latent transformation neural network (LTNN) is placed through a series of comparative studies on a diverse range of experiments where it is seen to outperform existing state-of-the-art models for (i) simultaneous multi-view reconstruction of real hand depth images in real-time, (ii) view synthesis and attribute modification of real and synthetic faces, and (iii) the synthesis of rotated views of rigid objects. Moreover, the CTU conditioning framework allows for additional conditioning information, or target views, to be added to the training procedure ad infinitum without any increase to the network’s inference speed.

Figure 1: The conditional transformation unit constructs a collection of mappings in the latent space which produce high-level attribute changes in the decoded outputs. Conditioning information is used to select the appropriate convolutional weights for the specified transformation; the selected CTU mapping transforms the encoding of the original input image into an approximation of the encoding of the attribute-modified target image.

2 Related Work

[6] have proposed a supervised, conditional generative model trained to generate images of chairs, tables, and cars with specified attributes which are controlled by transformation and view parameters passed to the network. The range of objects which can be synthesized using this framework is strictly limited to the pre-defined models used for training; the network can generate different views of these models but cannot generalize to unseen objects to perform inference tasks. Conditional generative models have been widely used for geometric prediction [7, 8]. These models are reliant on additional data, however, such as depth information or mesh models, to perform their target tasks and cannot be trained using images alone. Other works have introduced a clamping strategy to enforce a specific organizational structure in the latent space [9, 10]; these networks require extremely detailed labels for supervision, such as the graphics code parameters used to create each example, and are therefore very difficult to implement for more general tasks (e.g. training with real images). [11] have proposed the appearance flow network (AFN), designed specifically for the prediction of rotated viewpoints of objects from images. This framework also relies on geometric concepts unique to rotation and is not generalizable to other inference tasks. The conditional variational autoencoder (CVAE) incorporates conditioning information into the standard variational autoencoder (VAE) framework [12] and is capable of synthesizing specified attribute changes in an identity-preserving manner [13, 4]. CVAE-GAN [14] further adds adversarial training to the CVAE framework in order to improve the quality of generated predictions. [15] have introduced the conditional adversarial autoencoder (CAAE), designed to model age progression/regression in human faces. This is achieved by concatenating conditioning information (i.e. age) with the input’s latent representation before proceeding to the decoding process. The framework also includes an adaptive discriminator with conditional information passed using a resize/concatenate procedure. [16] have proposed Pix2Pix, a general-purpose image-to-image translation network capable of synthesizing views from a single image. The IterGAN model introduced by [17] is also designed to synthesize novel views from a single image, with a specific emphasis on the synthesis of rotated views of objects in small, iterative steps.

To the best of our knowledge, all existing conditional generative models designed for inference use fixed hidden layers and concatenate conditioning information directly with latent representations. In contrast to these existing methods, the proposed model incorporates conditioning information by defining dedicated, transformation-specific convolutional layers at the latent level. This conditioning framework allows the network to synthesize multiple transformed views from a single input while retaining a fully-convolutional structure which avoids the dense connections used in existing inference-based conditional models. Most significantly, the proposed LTNN framework is shown to outperform state-of-the-art models on a diverse range of view synthesis tasks while requiring substantially fewer FLOPs for inference than other conditional generative models (see Tables 1 & 2).

3 Latent Transformation Neural Network

In this section, we introduce the methods used to define the proposed LTNN model. We first give a brief overview of the LTNN network structure. We then detail how conditional transformation unit mappings are defined and trained to operate on the latent space, followed by a description of the conditional discriminator unit implementation and the network loss function used to guide the training process. Lastly, we describe the task-division framework used for the decoding process.

The basic workflow of the proposed model is as follows:

  1. Encode the input image to a latent representation.

  2. Use the conditioning information to select the conditional convolutional filter weights of the CTU.

  3. Map the latent representation, via the selected CTU mapping, to an approximation of the encoded latent representation of the specified target image.

  4. Decode the transformed representation to obtain a coarse pixel value map and a refinement map.

  5. Scale the channels of the pixel value map by the RGB balance parameters and take the Hadamard product with the refinement map to obtain the final prediction.

  6. Pass real images as well as generated images to the discriminator, using the conditioning information to select the discriminator’s conditional (CDU) filter weights.

  7. Compute the loss and update the weights using backpropagation and ADAM optimization.
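The steps above can be sketched end-to-end with toy stand-ins for each component. All shapes, names, and the linear "encoder"/"decoder" below are illustrative assumptions rather than the paper's actual architecture; the discriminator pass and weight updates (steps 6–7) are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT, K = 8, 4   # toy latent size and number of target transformations

encoder_w = rng.standard_normal((LATENT, LATENT))      # stand-in encoder
ctu_banks = rng.standard_normal((K, LATENT, LATENT))   # one CTU weight set per condition
decoder_w = rng.standard_normal((2 * LATENT, LATENT))  # emits value + refinement maps
rgb_gain = 1.0                                         # stand-in for the RGB balance parameters

def forward(x, k):
    """Steps 1-5: encode, select the k-th CTU mapping, decode, assemble."""
    z = encoder_w @ x                        # 1. encode input to latent z
    z_k = ctu_banks[k] @ z                   # 2-3. conditioning selects the CTU weights
    out = decoder_w @ z_k                    # 4. decode to coarse values + refinement
    value, refine = out[:LATENT], out[LATENT:]
    refine = 1.0 / (1.0 + np.exp(-refine))   # sigmoid refinement scaling factors
    return rgb_gain * value * refine         # 5. rescale and take the Hadamard product

pred = forward(rng.standard_normal(LATENT), k=2)
print(pred.shape)  # (8,)
```

Because the conditioning index only selects which weight bank is applied, adding further target transformations grows the weight banks without changing the cost of a single forward pass.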

Provide: A labeled dataset with target transformations indexed by a fixed set of conditions; encoder and decoder weights; RGB balance parameters; conditional transformation unit weights; a discriminator with standard weights and conditionally selected CDU weights; and loss function hyperparameters corresponding to the smoothness, reconstruction, adversarial, and consistency loss terms, respectively. The specific loss function components are defined in detail in Equations 1 - 5 in Section 3.2.

LTNN Training Procedure:

 1: procedure Train
 2:     Sample an input image and its transformed targets from the training set
 3:     Compute the encoding of the original input image
 4:     for each transformation in the conditioning set do
 5:         Compute the true encoding of the specified target image
 6:         Compute the approximate encoding of the target with the corresponding CTU mapping
 7:         Decode to obtain the RGB value and refinement maps
 8:         Assemble the final network prediction for the target
 9:     Update the encoder, decoder, RGB balance, and CTU weights for each transformation
10:     Update the discriminator and CDU weights for each transformation

3.1 Conditional transformation unit

Generative models have frequently been designed to explicitly disentangle the latent space in order to enable high-level attribute modification through linear, latent space interpolation. This linear latent structure is imposed by design decisions, however, and may not be the most natural way for a network to internalize features of the data distribution. Several approaches have been proposed which include nonlinear layers for processing conditioning information at the latent space level. In these conventional conditional generative frameworks, conditioning information is introduced by combining features extracted from the input with features extracted from the conditioning information (often using dense connection layers); these features are typically combined using standard vector concatenation, although some have opted to use channel concatenation [15, 14]. Six of these conventional conditional network designs are illustrated in Figure 2 along with the proposed LTNN network design for incorporating conditioning information.

Rather than directly concatenating conditioning information, we propose using a conditional transformation unit (CTU), consisting of a collection of distinct convolutional mappings in the network’s latent space; conditioning information is then used to select which collection of weights, i.e. which CTU mapping, should be used in the convolutional layer to perform a specified transformation. For viewpoint estimation, an independent CTU is maintained per viewpoint. Each CTU mapping maintains its own collection of convolutional filter weights and uses Swish activations [18]. The filter weights and Swish parameters of each CTU mapping are selectively updated by controlling the gradient flow based on the conditioning information provided. The CTU mappings are trained to transform the encoded, latent space representation of the network’s input in a manner which produces high-level view or attribute changes upon decoding. This is accomplished by introducing a consistency term into the loss function which is minimized precisely when the CTU mappings behave as depicted in Figure 1. In this way, different angles of view, light directions, and deformations, for example, can be synthesized from a single input image.
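As a concrete illustration, a CTU layer can be viewed as a bank of convolutional filters indexed by the conditioning label, followed by a Swish activation. The sketch below is a minimal single-channel NumPy version; the class name, filter shapes, and the naive convolution loop are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def swish(x, beta=1.0):
    """Swish activation: x * sigmoid(beta * x)."""
    return x * (1.0 / (1.0 + np.exp(-beta * x)))

class CTU:
    """Conditional transformation unit (sketch): one 3x3 filter per
    transformation; the conditioning index selects which filter is applied.
    A real CTU holds full multi-channel convolutional layers."""
    def __init__(self, n_conditions, rng):
        self.filters = rng.standard_normal((n_conditions, 3, 3))

    def __call__(self, z, k):
        f = self.filters[k]                 # conditioning selects the weights
        H, W = z.shape
        out = np.zeros((H - 2, W - 2))
        for i in range(H - 2):              # naive "valid" cross-correlation
            for j in range(W - 2):
                out[i, j] = np.sum(z[i:i+3, j:j+3] * f)
        return swish(out)

rng = np.random.default_rng(1)
ctu = CTU(n_conditions=9, rng=rng)
z = rng.standard_normal((8, 8))             # toy latent feature map
print(ctu(z, k=3).shape)  # (6, 6)
```

During training, only the filter bank selected by the conditioning label receives gradients, which matches the selective-update behavior described above.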

Figure 2: Selected methods for incorporating conditioning information; the proposed LTNN method is illustrated on the left, and six conventional alternatives are shown to the right.

3.2 Discriminator and loss function

The discriminator used in the adversarial training process is also passed conditioning information which specifies the transformation the model has attempted to make. The conditional discriminator unit (CDU), consisting of convolutional layers with modular weights similar to the CTU, is trained to specifically identify unrealistic artifacts which are being produced by the corresponding conditional transformation unit mappings. For viewpoint estimation, an independent CDU is maintained per viewpoint.

The proposed model uses the adversarial loss as the primary loss component. The discriminator, $D$, is trained using the adversarial loss term defined below in Equation 1. Additional loss terms corresponding to structural reconstruction, smoothness [19], and a notion of consistency are also used for training the encoder/decoder:

$$\mathcal{L}_{adv} \;=\; \log D_{\psi_k}(y_k) \,+\, \log\big(1 - D_{\psi_k}(\widehat{y}_k)\big) \tag{1}$$

$$\mathcal{L}_{recon} \;=\; \|\widehat{y}_k - y_k\|_1 \tag{2}$$

$$\mathcal{L}_{smooth} \;=\; \sum_{s} \|S_s\,\widehat{y}_k - \widehat{y}_k\|_1 \tag{3}$$

$$\mathcal{L}_{consist} \;=\; \|\Phi_k(E(x)) - E(y_k)\|_2^2 \tag{4}$$

where $y_k$ is the modified target image corresponding to an input $x$, $\psi_k$ are the weights of the CDU mapping corresponding to the $k$-th transformation, $\Phi_k$ is the CTU mapping for the $k$-th transformation, $\widehat{y}_k$ is the network prediction, and $S_s$ is the two-dimensional, discrete shift operator. The final loss function for the encoder and decoder components is given by:

$$\mathcal{L} \;=\; \lambda_{smooth}\,\mathcal{L}_{smooth} \,+\, \lambda_{recon}\,\mathcal{L}_{recon} \,+\, \lambda_{adv}\,\mathcal{L}_{adv} \,+\, \lambda_{consist}\,\mathcal{L}_{consist} \tag{5}$$

with hyperparameters selected to balance the relative contributions of each term. The consistency loss is designed to guide the CTU mappings toward approximations of the latent space mappings which connect the latent representations of input images and target images, as depicted in Figure 1. In particular, the consistency term enforces the condition that the transformed encoding, $\Phi_k(E(x))$, approximates the encoding of the target image, $E(y_k)$, during the training process.
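The structure of the encoder/decoder loss can be sketched in NumPy. The single-channel tensors, the L1 reconstruction, and the weight values below are illustrative assumptions, not the paper's actual hyperparameters.

```python
import numpy as np

def loss_terms(pred, target, z_transformed, z_target, d_fake):
    """Simplified stand-ins for the four encoder/decoder loss components."""
    recon = np.mean(np.abs(pred - target))               # reconstruction
    # smoothness: penalize differences between neighboring pixels,
    # i.e. between the prediction and its unit-shifted copies
    smooth = (np.mean(np.abs(np.diff(pred, axis=0)))
              + np.mean(np.abs(np.diff(pred, axis=1))))
    adv = -np.mean(np.log(d_fake + 1e-8))                # generator-side adversarial term
    consist = np.mean((z_transformed - z_target) ** 2)   # CTU consistency
    return recon, smooth, adv, consist

rng = np.random.default_rng(2)
pred, target = rng.random((16, 16)), rng.random((16, 16))
z_t, z_y = rng.standard_normal(8), rng.standard_normal(8)
d_fake = rng.random(4) * 0.9 + 0.05   # mock discriminator outputs in (0, 1)

recon, smooth, adv, consist = loss_terms(pred, target, z_t, z_y, d_fake)
lam = dict(smooth=1e-4, recon=1.0, adv=1e-2, consist=1.0)  # illustrative weights
total = (lam["recon"] * recon + lam["smooth"] * smooth
         + lam["adv"] * adv + lam["consist"] * consist)
print(total)
```

The consistency term is the piece that couples the CTU to the encoder: it is minimized exactly when the transformed encoding of the input lands on the encoding of the target.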

3.3 Task separation / Task-divided decoder

The decoding process has been divided into three tasks: estimating the refinement map, the pixel values, and the RGB color balance of the dataset. We have found that this decoupled estimation framework helps the network converge to better minima and produce sharp, realistic outputs. The decoding process begins with a series of convolutional layers followed by bilinear interpolation to upsample the low-resolution latent information. The last component of the decoder’s upsampling process consists of two distinct transpose convolutional layers used for task separation; one layer is allocated for predicting the refinement map while the other is trained to predict pixel values. The refinement map layer incorporates a Sigmoid activation function which outputs scaling factors intended to refine the coarse pixel value estimations. RGB balance parameters, consisting of three trainable variables, are used as weights for balancing the color channels of the pixel value map. The Hadamard product of the refinement map and the RGB-rescaled value map serves as the network’s final output:

$$\widehat{y}_k \;=\; r_k \,\odot\, \big(w_{rgb} \cdot v_k\big)$$

where $v_k$ is the coarse pixel value map, $r_k$ is the refinement map, and $w_{rgb}$ denotes the RGB balance parameters applied channel-wise.
In this way, the network has the capacity to mask values which lie outside of the target object (i.e. by setting refinement map values to zero) which allows the value map to focus on the object itself during the training process. Experimental results show that the refinement maps learn to produce masks which closely resemble the target objects’ shapes and have sharp drop-offs along the boundaries.
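This masking behavior can be demonstrated numerically. The tiny 4×4 "image" and the hand-set logits below are purely illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Large negative refinement logits outside the object drive the sigmoid
# toward zero, masking the background; large positive logits inside the
# object pass the coarse pixel estimates through nearly unchanged.
refine_logits = np.full((4, 4), -10.0)   # "background" region
refine_logits[1:3, 1:3] = 10.0           # "object" region
value_map = np.full((4, 4), 0.8)         # coarse pixel estimates everywhere
rgb_weight = 1.0                         # stand-in for one RGB balance parameter

output = rgb_weight * value_map * sigmoid(refine_logits)
print(np.round(output, 3))   # background entries ~0.0, object entries ~0.8
```

The value map is thus free to take arbitrary values in the masked region, which is what lets it concentrate on the object itself during training.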

Figure 3: Comparison of CVAE-GAN (top) with proposed LTNN model (bottom) using the noisy NYU hand dataset [20]. The input depth-map hand pose image is shown to the far left, followed by the network predictions for 9 synthesized view points. The views synthesized using LTNN are seen to be sharper and also yield higher accuracy for pose estimation (see Figure 5).
Figure 4: Qualitative evaluation for multi-view reconstruction of hand depth maps using the NYU dataset.

4 Experiments and results

To show the generality of our method, we have conducted a series of diverse experiments: (i) hand pose estimation using a synthetic training set and real NYU hand depth image data [20] for testing, (ii) synthesis of rotated views of rigid objects using the real ALOI dataset [21] and synthetic 3D chair dataset [22], (iii) synthesis of rotated views using a real face dataset [23], and (iv) the modification of a diverse range of attributes on a synthetic face dataset [24]. For each experiment, we have trained the models using 80% of the datasets. Since ground truth target depth images were not available for the real hand dataset, an indirect metric has been used to quantitatively evaluate the model as described in Section 4.1. Ground truth data was available for all other experiments, and models were evaluated directly using the mean pixel-wise error and the structural similarity index measure (SSIM) [7, 25] (the masked pixel-wise error of [17] was used in place of the standard pixel-wise error for the ALOI experiment).

To evaluate the proposed framework with existing works, two comparison groups have been formed: conditional inference models (CVAE-GAN, CVAE, and CAAE) with comparable encoder/decoder structures for comparison on experiments with non-rigid objects, and view synthesis models (MV3D [8], IterGAN, Pix2Pix, AFN [11], and TVSN [7]) for comparison on experiments with rigid objects. Additional experiments have been performed to compare the proposed CTU conditioning method with other conventional concatenation methods (see Figure 2); results are shown in Figure 5.

Figure 5: Quantitative evaluation for multi-view hand synthesis. (a) Evaluation with state-of-the-art methods using the real NYU dataset. (b) LTNN ablation results and comparison with alternative conditioning frameworks using the synthetic hand dataset. Our models: conditional transformation unit (CTU), conditional discriminator unit (CDU), task-divided decoder and RGB balance parameters (TD), and LTNN consisting of all previous components along with consistency loss. Alternative concatenation methods: channel-wise concatenation (CH Concat), fully connected concatenation (FC Concat), and reshape concatenation (RE Concat).
Figure 6: Quantitative comparison of model performances for experiment on the real face dataset.
Figure 7: Qualitative evaluation for multi-view reconstruction of real face using the stereo face dataset [23].

4.1 Experiment on non-rigid objects

Hand pose experiment: Since ground truth predictions for the real NYU hand dataset were not available, the LTNN model has been trained using a synthetic dataset generated using 3D mesh hand models. The NYU dataset does, however, provide ground truth coordinates for the input hand pose; using this we were able to indirectly evaluate the performance of the model by assessing the accuracy of a hand pose estimation method using the network’s multi-view predictions as input. More specifically, the LTNN model was trained to generate 9 different views which were then fed into the pose estimation network from [26] (also trained using the synthetic dataset).

A comparison of the quantitative hand pose estimation results is provided in Figure 5 where the proposed LTNN framework is seen to provide a substantial improvement over existing methods; qualitative results are also available in Figure 3. With regard to real-time applications, the proposed model runs at 114 fps without batching and at 1975 fps when applied to a mini-batch of size 128 (using a single TITAN Xp GPU and an Intel i7-6850K CPU).

Real face experiment: The stereo face database [23], consisting of images of 100 individuals from 10 different viewpoints, was used for experiments with real faces; these faces were segmented using the method of [27] and then cropped and centered to form the final dataset. The LTNN model was trained to synthesize images of input faces corresponding to three consecutive horizontal rotations. As shown in Figure 6, our method significantly outperforms the CVAE-GAN, CAAE, and IterGAN models in both the pixel-wise error and SSIM metrics.

4.2 Experiment on rigid objects

Real object experiment: The ALOI dataset [21], consisting of images of 1000 real objects viewed from 72 rotated angles (covering one full rotation), has been used for experiments on real objects. As shown in Table 1 and in Figure 8, our method outperforms other state-of-the-art methods with respect to the pixel-wise error metric and achieves comparable SSIM scores.

Figure 8: Unseen objects qualitative evaluation for ALOI dataset.

Of note is the fact that the LTNN framework is capable of effectively performing the specified rigid-object transformations using only a single image as input, whereas most state-of-the-art view synthesis methods require additional information which is not practical to obtain for real datasets. For example, MV3D requires depth information, and TVSN requires 3D models to render visibility maps for training, neither of which is available in the ALOI dataset.

Model     Error (seen)   Error (unseen)   SSIM (seen)    SSIM (unseen)
Ours      .138 ± .046    .221 ± .064      .927 ± .012    .871 ± .031
IterGAN   .147 ± .055    .231 ± .094      .918 ± .019    .875 ± .025
Pix2Pix   .210 ± .092    .256 ± .100      .914 ± .021    .864 ± .041
Table 1: Results for the experiment on the ALOI dataset (mean ± standard deviation).
Figure 9: Generated views for chair dataset. A single, gray-scale image of the chair at the far left (shown in box) is provided to the network as input.

3D chair experiment: We have tested our model’s ability to perform view estimation on the 3D chair dataset and compared the results with the other state-of-the-art methods. The proposed model outperforms existing models specifically designed for the task of multi-view prediction and requires the fewest FLOPs for inference compared with all other methods (see Table 2).

Model     SSIM   Error   Params (train)   Params (infer)   GFLOPs / Image
Ours      .912   .217    65,438 K         16,961 K          2.183
IterGAN   .865   .351    59,951 K         57,182 K         12.120
AFN       .891   .240    70,319 K         70,319 K          2.671
TVSN      .894   .230    60,224 K         57,327 K          2.860
MV3D      .895   .248    69,657 K         69,657 K          3.056
Table 2: Results for 3D chair view synthesis. The proposed method uses significantly fewer parameters during inference, requires the fewest FLOPs, and yields the fastest inference times. FLOP calculations correspond to inference for a single image with resolution 256×256×3.
          Elevation      Azimuth        Light Direction   Age
Model     SSIM   Error   SSIM   Error   SSIM   Error      SSIM   Error
Ours      .923   .107    .923   .108    .941   .093       .925   .102
CVAE-GAN  .864   .158    .863   .180    .824   .209       .848   .166
CVAE      .799   .166    .812   .157    .806   .209       .795   .173
CAAE      .777   .175    .521   .338    .856   .270       .751   .207
AAE       .748   .184    .520   .335    .850   .271       .737   .209
Table 3: Results for simultaneous colorization and attribute modification on the synthetic face dataset.

4.3 Diverse attribute exploration with synthetic face data

To evaluate the proposed framework’s performance on a more diverse range of attribute modification tasks, a synthetic face dataset and five conditional generative models with comparable encoder/decoder structures to the LTNN model have been selected for comparison. These models have been trained to synthesize discrete changes in elevation, azimuth, light direction, and age from a single greyscale image; results are shown in Table 3. Near-continuous attribute modification is also possible within the proposed framework, and distinct CTU mappings can be composed with one another to synthesize multiple modifications simultaneously.

5 Conclusion

In this work, we have introduced an effective, general framework for incorporating conditioning information into inference-based generative models. We have proposed a modular approach to incorporating conditioning information using CTUs and a consistency loss term, defined an efficient task-divided decoder setup for deconstructing the data generation process into manageable subtasks, and shown that a context-aware discriminator can be used to improve the performance of the adversarial training process. The performance of this framework has been assessed on a diverse range of tasks and shown to outperform state-of-the-art methods.