Pose-variant 3D Facial Attribute Generation

07/24/2019 · by Feng-Ju Chang et al. · University of Southern California

We address the challenging problem of generating facial attributes using a single image in an unconstrained pose. In contrast to prior works that largely consider generation on 2D near-frontal images, we propose a GAN-based framework to generate attributes directly on a dense 3D representation given by UV texture and position maps, resulting in photorealistic, geometrically-consistent and identity-preserving outputs. Starting from a self-occluded UV texture map obtained by applying an off-the-shelf 3D reconstruction method, we propose two novel components. First, a texture completion generative adversarial network (TC-GAN) completes the partial UV texture map. Second, a 3D attribute generation GAN (3DA-GAN) synthesizes the target attribute while obtaining an appearance consistent with 3D face geometry and preserving identity. Extensive experiments on CelebA, LFW and IJB-A show that our method achieves consistently better attribute generation accuracy than prior methods, a higher degree of qualitative photorealism, and better preservation of face identity.


1 Introduction

Faces are of unique interest in computer vision, whether for recognition, visualization or animation, owing to the diversity with which their images are manifested. This is partly due to the variety of attributes associated with faces and partly due to extrinsic variations like head pose. Thus, generating photorealistic images of faces that address both of those aspects is a problem of fundamental interest that also enables downstream applications, such as augmentation of under-represented classes in face recognition.

Figure 1: Facial attribute generation under head pose variations, comparing our method to StarGAN [4] and CycleGAN [40]. Traditional frameworks generate artifacts due to pose variations. By introducing a 3D UV representation, the proposed TC-GAN and 3DA-GAN generate photorealistic face attributes on pose-variant faces.

In recent years, conditional generative models such as Variational Auto-Encoders (VAEs) [19] or Generative Adversarial Networks (GANs) [11] have achieved impressive results [34, 4, 14, 28]. However, they have largely focused on frontal faces. In contrast, we consider the problem of generating 3D-consistent attributes on possibly pose-variant faces. As a motivating example, consider the problem of adding sunglasses to a face image. For a frontal input and a desired frontal output, this involves inpainting sunglass texture limited to the region around the eyes. For an input face observed under a largely profile view, and the more general task of generating an identity-preserving, sunglass-augmented face under arbitrary pose, a more complex transformation is needed since (i) both attribute-related and unrelated regions must be handled and (ii) the attribute must be consistent with 3D face geometry. Technically, this requires working with a higher-dimensional output space and generating an image conditioned on both head pose and attribute code. In Figure 1, we show how our proposed framework achieves these abilities, surpassing conventional ones such as StarGAN [4] and CycleGAN [40].

A first attempt would be to frontalize the pose-variant face input. Despite good visual quality, appearance-based face frontalization methods [30, 37, 17, 29] may suffer from a lack of identity preservation. Geometric modeling methods [13, 6] faithfully inherit the visible appearance but must hallucinate the appearance that is invisible due to self-occlusion, motivating extensions such as UV-GAN [6].

Further, we note that both texture completion and attribute generation are correlated with 3D shape: the hallucinated appearance should lie within the shape area and the generated attribute should comply with the shape. This motivates our framework that utilizes both 3D shape and texture, distinguishing our work from traditional methods that deal only with appearance and from UV-GAN, which uses only the texture map.

Specifically, we disentangle the task into two main stages: (1) We apply an off-the-shelf 3D shape regressor, PRNet [9], with a rendering layer to directly estimate the 3D shape and weak perspective matrix from a single input, and utilize this information to render the partial (self-occluded) texture. (2) A two-step GAN, consisting of a texture completion GAN (TC-GAN) that utilizes the above 3D shape and partial texture to complete the texture map, and a 3D attribute generation GAN (3DA-GAN) that generates target attributes on the completed 3D texture representation. In stage (1), we apply the UV representation [12, 9] for both the 3D point cloud and the texture, termed the UV position map and the UV texture map, respectively. The UV representation not only provides dense shape information but also builds a one-to-one correspondence from the point cloud to the texture.

In stage (2), TC-GAN and 3DA-GAN use both the UV position map and the UV texture map as input to inject 3D shape insights into both the completed texture and the generated attribute. Extensive experiments show the effectiveness of our method, which generates geometrically accurate and photorealistic attributes under large pose variation while preserving identity.

Our contributions are summarized as the following:


  • We are the first to achieve 3D facial attribute generation under unconstrained head poses, including profile views. Our method works in the pose-invariant 3D UV space, whereas most prior methods work in the 2D image space.

  • We propose a novel two-stage GAN for UV-space texture completion (TC-GAN) and texture attribute generation (3DA-GAN). The stacked structure effectively handles pose variation, performs face frontalization, and can generate attributes that are rendered at different pose angles.

  • We propose a two-phase training protocol to guide the network to focus only on the area related to the attribute, which significantly improves identity-preservation.

  • Extensive experiments on several public benchmarks demonstrate consistently better results in face frontalization, attribute generation accuracy, image visual quality, and close-to-original identity preservation.

2 Related Work

Face Frontalization: Early works [13, 10] apply a 3D Morphable Model and search for dense point correspondences to complete the invisible face region. [42] proposes a high-fidelity pose and expression normalization approach based on 3DMM. Sagonas et al. [5] formulate frontalization as a low-rank optimization problem. Yang et al. [18] formulate frontalization as a recurrent object rotation problem. Yim et al. [35] propose a concatenated network structure to rotate faces with an image-level reconstruction constraint. Cole et al. [8] propose using identity perception features to reconstruct normalized faces. Recently, GAN-based generative models [30, 37, 17, 29, 1, 6] have achieved high visual quality and preserve identity to a large extent. Our method belongs to the GAN-based family but works on 3D UV position and texture maps rather than 2D images.

Attribute Generation: Pixel-level graphical editing accounts for a large part of attribute generation work. However, we focus on holistic image-level attribute generation and thus only discuss the closely related works. Li et al. [21] apply an attribute perception loss to guide attribute synthesis. Upchurch et al. [31] propose target-attribute-guided feature-level interpolation for synthesis. Shen and Liu [28] introduce residual maps to add or remove specific attributes. GAN-based methods [25, 39, 32, 14, 4, 20, 26, 38, 33] aim at connecting the latent attribute code space and the image space of target attributes, e.g., by swapping attribute-related latent codes [39, 32], disentangling the attribute for an invariant representation [20], or imposing an attention network to guide the attribute generation in a specific area [38]. Xiao et al. [33] work on paired images for attribute transfer. Given low-resolution or occluded face images, both [23] and [3] attempt to generate high-resolution images that satisfy user-given attributes. Our work lies among the GAN-based methods. To our knowledge, no prior work synthesizes attributes based on a 3D representation; ours is the first. Moreover, our newly proposed two-phase training and masked reconstruction loss enable the network to focus only on the attribute-related region, and thus strongly preserve identity.

Figure 2: Illustration of the image coordinate space and the UV space. (a) Input image. (b) 3D dense point cloud. (c) UV position map transferred from the 3D point cloud. (d) UV texture map, partially visible due to pose variation (best viewed in color).
Figure 3: The proposed framework of pose-variant 3D facial attribute generation. By 3D dense shape reconstruction, a pose-variant face input is transformed into the UV position map and an incomplete UV texture map (with black holes) due to self-occlusion. Then, a texture completion GAN (TC-GAN) inpaints the black holes to produce a completed UV texture map. Further, a 3D attribute generation GAN (3DA-GAN) generates the target attributes on the UV texture map, which is rendered back to 2D images with varying head poses.

3 The Proposed Approach

In this section, we first introduce a dense 3D representation, the UV space, that supports appearance generation. Then, rendering is conducted to generate the visible appearance from the original input. Further, a texture completion GAN is presented to obtain a fully visible texture map. Finally, a 3D attribute generation GAN is proposed to work on the 3D UV position and texture representation, generating target attributes under pose variation.
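To make the data flow concrete, the following is a minimal inference sketch of this pipeline; the module and function names (reconstructor, tc_gan, ada_gan, renderer) are hypothetical stand-ins, not the released implementation.

import numpy as np

def generate_attribute(image, attribute_code, reconstructor, tc_gan, ada_gan, renderer):
    # Stage 1: off-the-shelf 3D regression (e.g., PRNet) gives a UV position map
    # and per-vertex visibility; the renderer samples the visible texture.
    uv_position, visibility = reconstructor(image)
    partial_texture = renderer.sample_texture(image, uv_position, visibility)
    # Stage 2a: TC-GAN completes the self-occluded texture, using the partial
    # texture, its horizontal flip, and the UV position map.
    full_texture = tc_gan(partial_texture, np.flip(partial_texture, axis=1), uv_position)
    # Stage 2b: 3DA-GAN edits the completed texture according to the attribute code.
    edited_texture = ada_gan(full_texture, uv_position, attribute_code)
    # Render back to a 2D image at any desired head pose.
    return renderer.render(edited_texture, uv_position)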

3.1 UV Position and Texture Maps

To faithfully render the visible appearance, we seek a dense 3D reconstruction of shape and texture. The 3D Morphable Model [2] sets up a parametric representation by decomposing both shape and texture into linear subspaces. It reduces the space dimension but also drops the high-frequency information that is essential for rendering and generation tasks. On the other hand, directly operating on the raw shape and texture is computationally heavy. Following [12, 9], we introduce a spherical UV space that homographically maps the 3D coordinates to a 2D UV coordinate space.

Assume a 3D point cloud with N vertices, where each vertex consists of its three-dimensional coordinates in 3D space. The UV coordinates of each vertex are defined as:

(1)

Eq. 1 establishes a unique mapping from the dense point cloud to the UV maps. By quantizing the UV space with different granularity, one can control the density of the UV space relative to the image resolution. In this work, we quantize the UV maps to a fixed resolution that preserves a dense set of vertices. As shown in Fig. 2, the UV position map is defined on the UV space, where each entry is the corresponding three-dimensional coordinate. We apply PRNet [9] to estimate the 3D shape and then exploit Eq. 1 to obtain the UV position map. The UV texture map is also defined on the UV space, where each entry is the corresponding coordinate's RGB color.
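The body of Eq. 1 is not reproduced above; for reference, a typical spherical unwrapping of this kind (an illustrative assumption, not necessarily the paper's exact constants) maps a vertex (x_i, y_i, z_i) to UV coordinates as

u_i = \frac{1}{2} + \frac{\operatorname{atan2}(x_i, z_i)}{2\pi}, \qquad
v_i = \frac{1}{2} - \frac{1}{\pi}\arcsin\!\left(\frac{y_i}{\sqrt{x_i^2 + y_i^2 + z_i^2}}\right),

followed by quantization of (u_i, v_i) to the chosen UV resolution.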

UV texture map rendering: The UV texture map of a pose-variant face is partially visible, as shown in Fig. 2 (d). The invisible region corresponds to the self-occluded region resulting from pose variation. In the original coordinate space, we run a z-buffering algorithm [41] to label the visibility of each 3D vertex: along each viewing ray, only the vertex closest to the camera is visible while all others are invisible. We collect these labels in a binary visibility map, where an entry of one means visible and zero means invisible.

The rendering is a look-up operation that associates the color at each visible image coordinate with the corresponding UV coordinate. We formulate the process in Eq. 2.

(2)

where the UV coordinate is determined by Eq. 1 and ⊙ denotes element-wise multiplication with the visibility map.
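As an illustration of this look-up, the NumPy sketch below scatters visible pixel colors into the UV texture map; the array layouts and the uv_size default are assumptions made for the example only.

import numpy as np

def render_uv_texture(image, uv_coords, image_coords, visibility, uv_size=256):
    # image:        (H, W, 3) input image.
    # uv_coords:    (N, 2) integer (u, v) per vertex, obtained from Eq. 1.
    # image_coords: (N, 2) integer (row, col) projections of the vertices under
    #               the estimated weak perspective matrix.
    # visibility:   (N,) binary labels from z-buffering (1 = visible).
    texture = np.zeros((uv_size, uv_size, 3), dtype=image.dtype)
    colors = image[image_coords[:, 0], image_coords[:, 1]]      # look up pixel colors
    colors = colors * visibility[:, None]                       # mask occluded vertices
    texture[uv_coords[:, 1], uv_coords[:, 0]] = colors          # scatter into UV space
    return texture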

3.2 UV Texture Map Completion

The incomplete UV texture map obtained from the rendering is insufficient for attribute generation. We seek a texture completion that not only recovers photorealistic appearance but also preserves identity. UV-GAN [6] proposes a similar framework that completes the UV texture map with an adversarial network. However, it considers only the texture information. We argue that for a 3D UV representation, completing the appearance should consider both the texture and the shape. For example, combining the original and flipped input provides a good initialization for appearance prediction, but it only applies a symmetry constraint, which is not sufficient to preserve the shape information. Thus, we take the UV position map, the incomplete UV texture map, and its flipped version as input.

Reconstruction module: To prepare the UV texture ground truth, we start with near-frontal face images in which all pixels are visible. Then, we perturb the head pose of each original image with random angles. Note that all the pose-variant images share the same frontal ground truth, namely the original image. By the rendering in Eq. 2, we obtain the incomplete texture map for each input. Since ground truth is provided, we propose a supervised reconstruction loss to guide the completion.

(3)

Here, the generator consists of an encoder and a decoder, and its inputs are the partial texture map, the flipped partial texture map, and the UV position map; the target is the complete ground-truth texture of the input. Relying merely on reconstruction leads to blurry results, so we introduce adversarial learning to improve the generation quality.
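A plausible form of Eq. 3 consistent with this description (the symbols G_tc, \tilde{T}, \tilde{T}^{flip}, P and T_gt are our own notation, and the choice of an L1 penalty is an assumption) is

\mathcal{L}_{rec} = \big\lVert G_{tc}(\tilde{T}, \tilde{T}^{flip}, P) - T_{gt} \big\rVert_1 .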

Discriminator module: Given the ground-truth images as the positive sample set and the generated samples as the negative sample set, we train a discriminator D with the following objective.

(4)

Generator module: Following the adversarial training, the generator aims to fool D and thus pushes the objective in the opposite direction.

(5)

Smoothness term: To remove artifacts, we apply a total variation loss to locally constrain the smoothness of the output.

(6)

The total variation loss is computed over the gradients of the output, normalized by the number of entries of the texture map. To preserve identity, a common practice is to introduce a face recognition engine and require the recognition feature of the generated image to be close to the ground-truth feature. In practice, we find the reconstruction constraint of Eq. 3 sufficient to preserve identity: a major part of the facial area is visible and already largely determines the identity, and the symmetry and reconstruction constraints preserve it well. Thus, the overall loss for TC-GAN is summarized as:

(7)

The balancing weights are set empirically.
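For concreteness, a standard instantiation of Eqs. 4-7 that matches the description above (the notation and the specific weighted sum are assumptions) reads

\mathcal{L}_{D} = -\,\mathbb{E}_{T \in \mathcal{P}}\big[\log D(T)\big] - \mathbb{E}_{\hat{T} \in \mathcal{N}}\big[\log\big(1 - D(\hat{T})\big)\big], \qquad
\mathcal{L}_{adv} = -\,\mathbb{E}_{\hat{T} \in \mathcal{N}}\big[\log D(\hat{T})\big],

\mathcal{L}_{tv} = \frac{1}{K}\,\big\lVert \nabla \hat{T} \big\rVert_1, \qquad
\mathcal{L}_{TC} = \lambda_1 \mathcal{L}_{rec} + \lambda_2 \mathcal{L}_{adv} + \lambda_3 \mathcal{L}_{tv},

where \hat{T} = G_{tc}(\tilde{T}, \tilde{T}^{flip}, P) is the completed texture and K is the number of texture-map entries.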

3.3 3D Face Attribute Generation

Figure 4: The architecture and loss design of 3DA-GAN.

Unlike traditional image-based attribute generation, we adopt the 3D UV representation, i.e., the UV position map and the completed UV texture map, as input. We believe that introducing 3D geometric information leads to better attribute synthesis; for example, with 3D shape information, sunglasses are generated as a surface that follows the face geometry. We formulate target attribute generation as a conditional GAN framework, as shown in Fig. 4, by inserting the attribute code into the data flow. We manually select 5 out of the 40 attributes defined in CelebA [22] that do not indicate face identity. The attribute code is thus a five-dimensional binary vector, where each element stands for one attribute: one means with the attribute and zero without. The attribute code is convolved with two blocks and then concatenated to the third block of the encoder of the generator.

We investigate the CycleGAN [40] and StarGAN [4] network structures and find that CycleGAN provides more stable training and better accuracy, as shown in the experiments section. Thus, we start from the CycleGAN loss design.
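The attribute injection described above can be sketched in PyTorch as follows; the channel widths, kernel sizes, and the six-channel texture-plus-position input are illustrative assumptions rather than the exact architecture.

import torch
import torch.nn as nn

class AttributeConditionedEncoder(nn.Module):
    # Sketch of attribute-code injection: the code is tiled spatially, passed
    # through two conv blocks, and concatenated to the third encoder block.
    def __init__(self, num_attrs=5):
        super().__init__()
        self.img_block1 = nn.Sequential(nn.Conv2d(6, 64, 7, 1, 3), nn.ReLU())
        self.img_block2 = nn.Sequential(nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU())
        self.attr_block1 = nn.Sequential(nn.Conv2d(num_attrs, 64, 7, 1, 3), nn.ReLU())
        self.attr_block2 = nn.Sequential(nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU())
        self.block3 = nn.Sequential(nn.Conv2d(256, 256, 4, 2, 1), nn.ReLU())

    def forward(self, uv_texture, uv_position, attr_code):
        x = torch.cat([uv_texture, uv_position], dim=1)              # (B, 6, H, W)
        a = attr_code[:, :, None, None].expand(-1, -1, x.size(2), x.size(3))
        x = self.img_block2(self.img_block1(x))
        a = self.attr_block2(self.attr_block1(a))
        return self.block3(torch.cat([x, a], dim=1))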

Identity loss: In the conditional GAN setting, if the input attribute code is the original ground-truth code, we expect the output to reconstruct the ground-truth input, which we term the identity loss:

(8)

Quality Discriminator: We introduce a quality discriminator in charge of image quality, leaving the correctness of the attribute generation to an independent discriminator. The positive sample set consists of the ground-truth UV texture maps and the negative sample set consists of the generated ones. To update the quality discriminator, we apply the following loss.

(9)

The quality loss from the discriminator is fed back to the generator, resulting in the quality adversarial loss.

(10)

Cycle Consistency: Following CycleGAN's setting, we simultaneously train an inverse generation module to convert the generated UV texture back to the original input, and expect the converted-back UV texture to be similar to the original input.

(11)
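Written out with assumed notation (G for the generator taking the texture T, position map P and attribute code a; G' for the inverse generator; D_q for the quality discriminator; a_gt for the original code), the four CycleGAN-style losses of Eqs. 8-11 would take the standard forms

\mathcal{L}_{id} = \big\lVert G(T, P, a_{gt}) - T \big\rVert_1, \qquad
\mathcal{L}_{D_q} = -\,\mathbb{E}\big[\log D_q(T)\big] - \mathbb{E}\big[\log\big(1 - D_q(G(T, P, a))\big)\big],

\mathcal{L}_{adv_q} = -\,\mathbb{E}\big[\log D_q(G(T, P, a))\big], \qquad
\mathcal{L}_{cyc} = \big\lVert G'(G(T, P, a), P, a_{gt}) - T \big\rVert_1 .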

Besides the CycleGAN losses, we propose two new losses that specifically deal with attribute generation.

Figure 5: Manually defined attribute-related masks based on the reference UV texture map. (a) Reference (constructed from our generated UV position map and the mean face texture provided by the Basel Face Model), (b) eyeglasses mask, (c) lipstick and smile mask, (d) 5 o'clock shadow mask, and (e) bangs mask.

Masked Reconstruction Module: We manually define the non-attribute areas, shown in Fig. 5, on the reference UV texture map. The attributes are divided into several mask types or their combinations; e.g., lipstick and smile share the same mask in Fig. 5 (c). Together with the fully visible mask (all entries set to one), these masks cover all the categories. The reconstruction objective is as follows.

(12)

The mask is determined by the target attribute code.
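A plausible form of Eq. 12 under the notation above (with M_a denoting the non-attribute mask selected by the target code a; the L1 penalty is an assumption) is

\mathcal{L}_{mask} = \big\lVert M_{a} \odot \big( G(T, P, a) - T \big) \big\rVert_1 ,

so that the reconstruction penalty is applied only outside the attribute-related region.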

Target Attribute Discriminator: Separate from the quality discriminator, we set up an independent discriminator to evaluate whether the specific one-bit attribute is correctly generated. The positive sample set consists of ground-truth samples with the specific attribute. The negative sample set consists of samples produced by the generator. The target attribute discriminator is updated as:

(13)

Accordingly, the adversarial loss to update the generator is:

(14)

In TC-GAN, we found that a reconstruction loss, rather than a recognition perception loss, is sufficient to preserve identity; the same holds for attribute generation. As shown in Fig. 5, the attribute-related area is a small portion of the entire facial area, so reconstructing the large remaining portion already strongly constrains the identity. The overall training is divided into two phases. Phase one accepts the original attribute code and is expected to output the reconstructed UV texture. Phase two accepts the target attribute code and generates the texture with the target attribute.

(15)

The hyper-parameters for phase one and phase two are set empirically.
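The attribute adversarial terms of Eqs. 13-14 would follow the usual pattern, with D_a the target attribute discriminator and P_a the set of real samples carrying the attribute, and Eq. 15 would be a per-phase weighted sum; the grouping below is our reading of the two-phase protocol and the weights are placeholders, not the paper's values.

\mathcal{L}_{D_a} = -\,\mathbb{E}_{T \in \mathcal{P}_a}\big[\log D_a(T)\big] - \mathbb{E}\big[\log\big(1 - D_a(G(T, P, a))\big)\big], \qquad
\mathcal{L}_{adv_a} = -\,\mathbb{E}\big[\log D_a(G(T, P, a))\big],

\mathcal{L}_{phase1} = \mu_1 \mathcal{L}_{id} + \mu_2 \mathcal{L}_{adv_q} + \mu_3 \mathcal{L}_{cyc}, \qquad
\mathcal{L}_{phase2} = \nu_1 \mathcal{L}_{mask} + \nu_2 \mathcal{L}_{adv_q} + \nu_3 \mathcal{L}_{adv_a} + \nu_4 \mathcal{L}_{cyc} .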

4 Implementation Details

To prepare the TC-GAN training data, we collect near-frontal images from 4DFE and 300W-LP (58,848 from 4DFE and 2,735 from 300W-LP) and augment them with uniformly distributed yaw angles from left profile to right profile. The near-frontal images are converted to the UV representation and serve as the ground truth. The augmented pose-variant images are converted to UV position and incomplete texture maps, serving as input. Mixing the two training sets enhances the model's generalization ability. We apply an hourglass [24] structure as the TC-GAN backbone; for structural details please refer to the supplementary material.

We find that, inside the structure, skip links are important to preserve high-frequency information, especially from the lower layers. We train the network using the Adam optimizer with a batch size of 120; it converges within 10 epochs. We further fine-tune it on the CelebA training set for another 8 epochs.

Figure 6: Visualization of TC-GAN and other face frontalization methods on LFW [16]. A near-frontal image is randomly selected from LFW and shown as "Ground truth". We render the ground truth at multiple head poses, with a black background, to serve as the inputs.
method yaw-15 yaw-30 yaw-45 yaw-60 yaw-75
Hassner et al.[13] 30.85 53.80 174.12 208.79 203.71
DR-GAN [30] 82.39 84.88 90.82 98.68 110.11
Ours 8.06 13.17 20.29 27.39 38.92
Table 1: FID score comparison on the LFW dataset. We randomly select one image out of each verification pair and render it at yaw angles of 15°, 30°, 45°, 60°, and 75°. FID is calculated between the frontalized images and the non-selected original images.

We similarly prepare training data for 3DA-GAN, picking 48K near-frontal images from CelebA for each attribute and converting them to the UV representation. Images without the target attribute serve as input. For the attribute discriminator, images with the target attribute are positive samples while generated UV texture maps are negative samples. For the quality discriminator, real UV textures are positive samples and generated ones are negative samples. We randomly select one bit as the target attribute and leave all others unperturbed. The training procedure has two phases: (1) reconstruction, where the input is paired with its original attribute code; and (2) attribute-perturbed generation, where one attribute bit at a time is set to one and the input is paired with the perturbed code. The two-phase training pushes the generation to focus on the attribute-related area while keeping the non-attribute area intact.

We use the Adam optimizer with a batch size of 16. Training converges in around 15 epochs across different target attributes.
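A compact sketch of one such training step is given below; the module names, the single discriminator update shown (the attribute discriminator is updated analogously), and the loss weights are assumptions for illustration, not the reported configuration.

import torch
import torch.nn.functional as F

def train_step(generator, inverse_generator, d_quality, d_attr, opt_g, opt_d,
               uv_texture, uv_position, attr_code, mask, phase_one):
    if phase_one:
        target_code = attr_code                          # phase 1: reconstruct the input
    else:
        target_code = attr_code.clone()                  # phase 2: flip one attribute bit
        rows = torch.arange(target_code.size(0))
        cols = torch.randint(target_code.size(1), (target_code.size(0),))
        target_code[rows, cols] = 1.0

    fake = generator(uv_texture, uv_position, target_code)

    # Quality discriminator update: real textures positive, generated negative.
    d_real = d_quality(uv_texture)
    d_fake = d_quality(fake.detach())
    loss_d = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Generator update: adversarial + (masked) reconstruction + cycle consistency.
    d_fake = d_quality(fake)
    a_fake = d_attr(fake)
    loss_adv = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    loss_attr = F.binary_cross_entropy_with_logits(a_fake, torch.ones_like(a_fake))
    recon = F.l1_loss(fake, uv_texture) if phase_one \
            else F.l1_loss(mask * fake, mask * uv_texture)
    cycled = inverse_generator(fake, uv_position, attr_code)
    loss_cyc = F.l1_loss(cycled, uv_texture)
    loss_g = loss_adv + loss_attr + 10.0 * recon + 10.0 * loss_cyc   # assumed weights
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_g.item(), loss_d.item()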

Test F1-score (higher better) FID-score (lower better)
real (yaw ≤ 45°) real-a (yaw > 45°) real (yaw ≤ 45°) real-a (yaw > 45°)
Model Train SG LS SM BA SG LS SM BA SG LS SM BA SG LS SM BA
FaderNet [20] real 98.97 - - - 96.72 - - - 52.1 - - - 79.9 - - -
AttGAN [14] real 97.80 - - 86.86 91.89 - - 86.15 87.6 - - 135.5 99.0 - - 172.6
StarGAN [4]* real 97.15 84.26 87.40 89.56 96.38 77.54 77.11 86.33 85.7 78.9 92.3 82.3 139.8 135.9 150.6 144.0
real-a 97.35 78.87 83.40 89.33 98.07 75.43 79.01 86.77 72.7 68.9 58.9 59.5 114.0 85.1 82.8 105.3
Ours 98.88 84.70 87.87 94.86 98.23 82.04 83.32 93.67 38.2 34.1 33.0 21.8 36.3 35.4 30.6 19.4
CycleGAN [40]* real 97.66 84.41 86.33 70.96 90.49 74.45 76.48 69.01 30.1 25.1 32.3 28.7 40.9 49.2 43.3 36.8
real-a 98.93 91.34 84.25 82.43 97.31 69.27 75.51 80.70 33.9 12.5 12.7 9.1 19.8 31.0 17.1 11.5
(ResNet) Ours 99.37 94.69 94.56 93.35 99.10 93.04 91.49 91.64 18.5 12.6 13.0 10.3 29.7 10.9 11.0 8.9

Table 2: Quantitative comparison on attribute generation by F1 score and FID score [15] on the CelebA testing set. The target generated attribute is evaluated by an off-line attribute classifier for F1 score (precision and recall). Visual quality is indicated by the FID score between the images with the generated target attribute and the ground-truth images with the same attribute. "real" means the original CelebA training set. "real-a" means the original plus pose-augmented images. "Ours" means training with our proposed loss and UV texture data. *: we apply the network structure and re-train the models. SG: Sunglass, LS: Wearing Lipstick, SM: Smiling, BA: Bangs.

Test F1-score (higher better) FID-score (lower better)
real (yaw ≤ 45°) real-a (yaw > 45°) real (yaw ≤ 45°) real-a (yaw > 45°)
Model Loss SG LS SM BA SG LS SM BA SG LS SM BA SG LS SM BA
CycleGAN w/o Eq. 12, 14 97.97 87.92 84.62 83.65 97.93 86.21 81.11 82.21 20.2 10.6 7.8 14.1 43.8 20.4 27.2 18.3
w/o Eq. 12 99.28 92.95 93.17 94.86 98.87 90.79 89.50 93.82 17.6 17.5 13.8 11.9 26.6 18.1 15.0 11.4
(ResNet) w/o Eq. 14 97.82 83.28 81.81 86.56 97.54 82.35 78.43 85.86 29.1 19.0 18.1 10.5 39.3 18.4 17.7 10.4
Full 99.37 94.69 94.56 93.35 99.10 93.04 91.49 91.64 18.5 12.6 13.0 10.3 29.7 10.9 11.0 8.9
Table 3: Ablation study without the masked reconstruction loss (Eq. 12) and/or without the attribute loss (Eq. 14). We take the CycleGAN loss (w/o Eq. 12, 14), i.e., the quality adversarial loss, identity loss and cycle consistency loss, as the starting point, since the ablation of the CycleGAN loss itself is fully studied in [40]. F1 and FID scores are reported. We use the CycleGAN ResNet structure as it achieves the best results across the experiments. SG: Sunglass, LS: Wearing Lipstick, SM: Smiling, BA: Bangs.
Method SG LS SH SM BA Avg.
Original - - - - - 91.38
FaderNet [20] 79.05 - - - - 79.05
AttGAN [14] 87.94 - - - 82.20 85.07
StarGAN [4]* 75.28 78.11 81.11 78.80 81.31 79.03
CycleGAN [40]* 89.79 88.09 88.48 90.00 89.20 89.11
Ours 90.40 87.11 89.76 90.68 90.06 89.60
Table 4: Identity-preservation evaluation on the IJB-A dataset under the verification protocol, reporting TAR@FAR=0.01. *: models we retrain on our training data. SG: Sunglass, LS: Wearing Lipstick, SH: 5 o'clock shadow, SM: Smiling, BA: Bangs.

5 Experiments

In this section, we evaluate our framework for the tasks of UV texture map completion and the 3D attribute generation. Regarding the training, for texture completion, we generate the UV space representation of 300W-LP [41] and 4DFE [36] to form our training set.

The evaluation for texture completion is conducted on LFW [16], with both visualization and FID score, for a fair comparison to other methods. For attribute generation, we generate the UV space representation of CelebA [22] and provide the rendered pose-augmented images for both training and testing.

5.1 Datasets

300W-LP: It is generated from the 300W [27] face database by 3DDFA [41], which fits a 3D morphable model and reconstructs the face appearance under varying head poses. It consists of 122,430 images of 3,837 subjects overall. For each subject, the images cover uniformly distributed head poses.

CelebA: It contains about 203K images, each annotated with 40 attributes. The distribution of this dataset in terms of yaw angle is highly long-tailed towards near-frontal views, which motivates augmenting it for more pose-variant attribute generation.

4DFE: It is a high-resolution 3D dynamic facial expression database. It contains 606 3D facial expression sequences captured from 101 subjects, with a total of approximately 60,600 frame models. Each 3D model of a sequence has a resolution of approximately 35,000 vertices, and a high-resolution texture video accompanies each sequence.

5.2 UV Texture Map Completion

In our framework, we first apply 3D dense shape reconstruction and rendering to obtain a partially visible UV texture map. Then we apply our TC-GAN to obtain the completed UV texture map and render it back to an image-level appearance.

Frontalization Visual Comparison: Since our framework provides a way to conduct face frontalization, we visually compare our method with several state-of-the-art frontalization methods in Fig. 6. The traditional geometric method [13] fails to complete the holes caused by self-occlusion when the head pose is large. DR-GAN [30] works fairly well when the head pose is small, but when the head pose is close to profile, it fails to preserve the face identity, whereas our method consistently preserves the identity across different head poses. Our method also consistently preserves the skin color where DR-GAN cannot.

Quantitative Comparison: The Fréchet Inception Distance (FID) [15] is reported in Table 1 to quantitatively indicate the photorealism of the generated images compared to the original real images; the closer to the real images, the lower the FID score. In Table 1, our method achieves a significantly lower FID score than the other methods.

Figure 7: Pose-variant qualitative results of our 3DA-GAN compared to StarGAN [4] and CycleGAN [40] trained on our prepared data.
Figure 8: Visual results of applying our method to augment face images from CelebA [22] testing set, in attributes and yaw angles.

5.3 3D Attribute Generation

We manually select 5 out of the 40 attributes defined in CelebA that do not indicate face identity and correlate only with the facial area: Sunglasses (SG), Wearing Lipstick (LS), 5 o'clock Shadow (SH), Smiling (SM), and Bangs (BA). We strictly follow the CelebA training, validation and testing split protocol. SH is not shown in Table 2 and Table 3 due to the space limit; please refer to the supplementary material for the complete results.

Traditional attribute generation methods, e.g., FaderNet [20] and AttGAN [14], are trained on 2D images.

For a fair comparison, we apply the StarGAN and CycleGAN network structures, trained separately on real images, real plus pose-augmented images, and our UV texture and position maps. For the real data in CelebA, we observe a strong head pose bias towards near-frontal poses. We calculate the pose from the reconstructed 3D shape vertices and use it to split the testing data into a yaw ≤ 45° subset and a yaw > 45° subset. As the large-pose testing data are very few for some attributes, e.g., lipstick, we augment data from near-frontal images so that the volume of the augmented large-pose set matches that of the near-frontal subset.

Attribute Generation Accuracy: We apply an off-line attribute classifier, trained on the CelebA training set, to evaluate the attribute generation performance; its average precision on the CelebA testing set is close to state-of-the-art performance. We report the F1 score since precision and recall individually vary with the threshold setting.

We apply 3DA-GAN to the negative samples (those without the target attribute) to generate images with the target attribute, which serve as positive samples. Further, FID [15] is computed to evaluate the photorealism of the attribute-augmented images.
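As a sketch of this protocol, the snippet below scores generated images with an off-line per-attribute classifier and computes the F1 score (the classifier interface, the 0.5 threshold, and the label convention are assumptions); FID would be computed separately between the generated set and real images carrying the same attribute.

import numpy as np
from sklearn.metrics import f1_score

def evaluate_attribute_f1(classifier, images, labels, target_bit, threshold=0.5):
    # classifier(img) is assumed to return per-attribute probabilities;
    # generated images are labeled 1 (they should carry the target attribute),
    # and real images without the attribute can be included with label 0.
    probs = np.asarray([classifier(img)[target_bit] for img in images])
    preds = (probs >= threshold).astype(int)
    return f1_score(np.asarray(labels), preds)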

In Table 2, we compare against several state-of-the-art methods: FaderNet [20], AttGAN [14], StarGAN [4] and CycleGAN [40]. The last two are retrained on the original CelebA real data, on real plus pose-augmented data ("real-a"), and on our UV texture and position data. For "Ours", we apply our proposed loss instead of the StarGAN or CycleGAN loss. The numbers in Table 2 clearly show that our proposed 3DA-GAN consistently achieves higher F1 scores than the state-of-the-art methods. Moreover, our method also achieves consistently lower FID scores. With the CycleGAN (ResNet) model, our FID score is close to that of the model trained on "real-a", tying on some attributes and slightly better on others. However, our method achieves much higher F1 scores (precision and recall), notably on "SM" and "BA", compared to CycleGAN trained on "real-a" across both yaw subsets.

Identity Preserving Property: We apply a state-of-the-art face recognition engine, ArcFace [7], to provide the identity features. For each verification pair, we randomly select one image without the target attribute, apply our method to generate the target attribute, and evaluate the similarity between the generated target-attribute image and the non-selected image. We run the experiments independently for the 5 attributes. In Table 4, "Original" denotes the original verification accuracy without any attribute generation, which serves as the upper bound for all methods. Compared to the other methods, our 3DA-GAN achieves higher verification accuracy in almost all cases, being only slightly worse on lipstick. Our method achieves an average accuracy of 89.60, close to the upper bound of 91.38, indicating that the proposed attribute generation preserves identity information to a large extent.
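A minimal sketch of this verification measurement is given below, assuming an embedding function (e.g., an ArcFace model) and cosine similarity; the thresholding at the target FAR and the variable names are our own simplifications.

import numpy as np

def tar_at_far(embed, pairs, labels, far=0.01):
    # pairs: list of (edited_image, reference_image); labels: 1 genuine, 0 impostor.
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = np.array([cosine(embed(x), embed(y)) for x, y in pairs])
    labels = np.asarray(labels)
    impostor = np.sort(scores[labels == 0])[::-1]          # descending impostor scores
    threshold = impostor[int(far * len(impostor))]         # score giving the target FAR
    return float(np.mean(scores[labels == 1] >= threshold))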

Visualization: We show a pose-variant face attribute generation example in Fig. 7, compared to StarGAN and CycleGAN. The 2D image-based methods suffer from the pose variation; e.g., for both StarGAN and CycleGAN on sunglasses, the left-eye region is not correctly generated. For smile, StarGAN fails to generate the attribute while CycleGAN shows unpleasant artifacts in the mouth area. In contrast, our method shows not only correct attribute generation but also pleasing visual quality. It is worth noting that "lipstick" and "shadow" are in fact correlated with gender or identity: for lipstick, the dataset is naturally biased towards females, and for shadow, the training images are quite similar to another attribute, "beard", which causes a similar appearance in the generated results.

As further shown in Figure 8, given an unconstrained face image, our method can generate the target attribute under varying head poses. It offers strong potential for high-quality face editing of multiple attributes and can serve as face augmentation for face recognition along both the head pose and attribute axes.

5.4 Ablation Study

We investigate the contribution of each component proposed in our framework. In Table 3, we start with the default CycleGAN loss, i.e., without our proposed masked reconstruction loss (Eq. 12) and attribute adversarial loss (Eq. 14). For the CycleGAN loss itself, i.e., the generative adversarial loss (a.k.a. quality adversarial loss), identity loss and cycle consistency loss, we believe these components' effects are clearly discussed in [40]. Thus, we focus on the two newly proposed losses, Eq. 12 and Eq. 14. Overall, removing either or both of the two new components degrades performance across F1 and FID scores to a certain degree. Moreover, removing the attribute adversarial loss is more critical, as accuracy drops significantly more than when removing the masked reconstruction loss.

6 Conclusion

We propose a two-stage framework, consisting of a Texture Completion GAN (TC-GAN) and a 3D Attribute GAN (3DA-GAN), to tackle the pose-variant facial attribute generation problem. TC-GAN inpaints the appearance missing due to self-occlusion and provides a normalized UV texture. 3DA-GAN works in the UV texture space to generate target attributes while maximally preserving subject identity. Extensive experiments show that, compared to several state-of-the-art attribute generation methods, our method achieves consistently better attribute generation accuracy, visual quality closer to the original images, and higher identity-preserving verification accuracy. The high generation quality also offers potential for face editing and face image augmentation along the pose and attribute axes.

References

  • [1] J. Bao, D. Chen, F. Wen, H. Li, and G. Hua. Towards open-set identity preserving face synthesis. In CVPR, 2018.
  • [2] V. Blanz and T. Vetter. A morphable model for the synthesis of 3d faces. In SIGGRAPH, 1999.
  • [3] Z. Chen, S. Nie, T. Wu, and C. G. Healey. High resolution face completion with multiple controllable attributes via fully end-to-end progressive generative adversarial networks. arXiv preprint arXiv:1801.07632, 2018.
  • [4] Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, 2018.
  • [5] C. Sagonas, Y. Panagakis, S. Zafeiriou, and M. Pantic. Robust statistical face frontalization. In ICCV, 2015.
  • [6] J. Deng, S. Cheng, N. Xue, Y. Zhou, and S. Zafeiriou. Uv-gan: Adversarial facial uv map completion for pose-invariant face recognition. In CVPR, 2018.
  • [7] J. Deng, J. Guo, X. Niannan, and S. Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In CVPR, 2019.
  • [8] F. Cole, D. Belanger, D. Krishnan, A. Sarna, I. Mosseri, and W. T. Freeman. Synthesizing normalized faces from facial identity features. In CVPR, 2017.
  • [9] Y. Feng, F. Wu, X. Shao, Y. Wang, and X. Zhou. Joint 3d face reconstruction and dense alignment with position map regression network. In ECCV, 2018.
  • [10] C. Ferrari, G. Lisanti, S. Berretti, and A. Bimbo. Effective 3d based frontalization for unconstrained face recognition. In ICPR, 2016.
  • [11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
  • [12] R. A. Güler, G. Trigeorgis, E. Antonakos, P. Snape, S. Zafeiriou, and I. Kokkinos. Densereg: Fully convolutional dense shape regression in-the-wild. In CVPR, 2017.
  • [13] T. Hassner, S. Harel, E. Paz, and R. Enbar. Effective face frontalization in unconstrained images. In CVPR, 2015.
  • [14] Z. He, W. Zuo, M. Kan, S. Shan, and X. Chen. Attgan: Facial attribute editing by only changing what you want. In arXiv:1711.10678, 2018.
  • [15] M. Heusel, H. Ramsauer, T. Unterthiner, and B. Nessler. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NIPS, 2017.
  • [16] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical report, University of Massachusetts, Amherst, 2007.
  • [17] R. Huang, S. Zhang, T. Li, and R. He. Beyond face rotation: Global and local perception gan for photorealistic and identity preserving frontal view synthesis. In ICCV, 2017.
  • [18] J. Yang, S. Reed, M.-H. Yang, and H. Lee. Weakly-supervised disentangling with recurrent transformations for 3d view synthesis. In NIPS, 2015.
  • [19] D. Kingma and M. Welling. Auto-encoding variational bayes. In ICLR, 2014.
  • [20] G. Lample, N. Zeghidour, N. Usunier, A. Bordes, L. DENOYER, et al. Fader networks: Manipulating images by sliding attributes. In NIPS, 2017.
  • [21] M. Li, W. Zuo, and D. Zhang. Convolutional network for attribute-driven and identity-preserving human face generation. In arXiv:1608.06434, 2016.
  • [22] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In ICCV, 2015.
  • [23] Y. Lu, Y.-W. Tai, and C.-K. Tang. Attribute-guided face generation using conditional cyclegan. In Proceedings of the European Conference on Computer Vision (ECCV), pages 282–297, 2018.
  • [24] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.
  • [25] G. Perarnau, J. van de Weijer, B. Raducanu, and J. M. Álvarez. Invertible conditional gans for image editing. In NIPS Workshops, 2016.
  • [26] A. Pumarola, A. Agudo, A. Martinez, A. Sanfeliu, and F. Moreno-Noguer. Ganimation: Anatomically-aware facial animation from a single image. In ECCV, 2018.
  • [27] C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic. 300 faces in-the-wild challenge: The first facial landmark localization challenge. In ICCVW, 2013.
  • [28] W. Shen and R. Liu. Learning residual images for face attribute manipulation. In CVPR, 2017.
  • [29] Y. Shen, P. Luo, J. Yan, X. Wang, and X. Tang. Faceid-gan: Learning a symmetry three-player gan for identity-preserving face synthesis. In CVPR, 2018.
  • [30] L. Tran, X. Yin, and X. Liu. Disentangled representation learning gan for pose-invariant face recognition. In CVPR, 2017.
  • [31] P. Upchurch, J. Gardner, G. Pleiss, R. Pless, N. Snavely, K. Bala, and K. Weinberger. Deep feature interpolation for image content changes. In CVPR, 2017.
  • [32] T. Xiao, J. Hong, and J. Ma. Dna-gan: Learning disentangled representations from multi-attribute images. In ICLR Workshops, 2018.
  • [33] T. Xiao, J. Hong, and J. Ma. Elegant: Exchanging latent encodings with gan for transferring multiple face attributes. In Proceedings of the European Conference on Computer Vision (ECCV), pages 168–184, 2018.
  • [34] X. Yan, J. Yang, K. Sohn, and H. Lee. Attribute2image: Conditional image generation from visual attributes. In ECCV, 2016.
  • [35] J. Yim, H. Jung, B. Yoo, C. Choi, D. Park, and J. Kim. Rotating your face using multi-task deep neural network. In CVPR, 2015.
  • [36] L. Yin, X. Chen, Y. Sun, T. Worm, and M. Reale. A high-resolution 3d dynamic facial expression database. In International Conference on Automatic Face and Gesture Recognition, 2008.
  • [37] X. Yin, X. Yu, K. Sohn, X. Liu, and M. Chandraker. Towards large-pose face frontalization in the wild. In ICCV, 2017.
  • [38] G. Zhang, M. Kan, S. Shan, and X. Chen. Generative adversarial network with spatial attention for face attribute editing. In Proceedings of the European Conference on Computer Vision (ECCV), pages 417–432, 2018.
  • [39] S. Zhou, T. Xiao, Y. Yang, D. Feng, and Q. He. Genegan: Learning object transfiguration and attribute subspace from unpaired data. In BMVC, 2017.
  • [40] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.
  • [41] X. Zhu, Z. Lei, X. Liu, H. Shi, and S. Z. Li. Face alignment across large poses: A 3D solution. In CVPR, 2016.
  • [42] X. Zhu, Z. Lei, J. Yan, D. Yi, and S. Z. Li. High-fidelity pose and expression normalization for face recognition in the wild. In CVPR, 2015.

Appendix A 3D Shape Alignment and UV Maps Rendering

In this section, we explain how we prepare our ground-truth 3D point cloud with respect to the reference BFM 3D shape. We first trim the original BFM shape to one that focuses on the facial area and consists of 38K vertices, which serves as the BFM reference shape thereafter. Given an image, we obtain its 3D shape from the dataset or estimate it by [9]. Since the number and definition of 3D vertices differ, the untrimmed shape needs to be aligned to the reference trimmed BFM shape. A diagram of this alignment is shown in Fig. 9.

The 4DFE 3D point cloud and reference BFM are deformed to match the detected 2D landmarks. Then we refine the alignment via a 3D-ICP like procedure to obtain the aligned shape.

Given the aligned shape, our goal is to obtain a dense 3D shape representation, i.e., a UV position map, so that the high-frequency information can be preserved. To this end, a set of reference UV coordinates is introduced, as illustrated in the lower part of Fig. 10. By extrapolating 3D points based on these reference coordinates and the aligned pose-variant shape, we can obtain a very high-resolution UV position map. Note that these reference UV coordinates are shared by all images, so every pixel corresponds to the same facial point; this is essential for defining the attribute-related masks (Fig. 5 of the main paper). It enables attribute generation in an invariant UV space, where arbitrary head pose variation is allowed for the input.
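One way to realize this step, as a rough sketch only (the fixed reference UV layout, the chosen resolution, and the use of scipy interpolation are assumptions), is to interpolate each coordinate channel of the aligned vertices over the shared UV grid:

import numpy as np
from scipy.interpolate import griddata

def build_uv_position_map(ref_uv_coords, aligned_vertices, uv_size=512):
    # ref_uv_coords:    (N, 2) reference UV coordinates in [0, 1], shared by all images.
    # aligned_vertices: (N, 3) aligned 3D vertex coordinates for the current image.
    grid_u, grid_v = np.meshgrid(np.linspace(0, 1, uv_size),
                                 np.linspace(0, 1, uv_size))
    channels = [griddata(ref_uv_coords, aligned_vertices[:, c],
                         (grid_u, grid_v), method='linear') for c in range(3)]
    uv_position = np.stack(channels, axis=-1)      # (uv_size, uv_size, 3)
    return np.nan_to_num(uv_position)              # fill points outside the hull with 0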

Figure 9: Align a ground truth shape or an estimated shape from the existing 3D reconstruction method to the trimmed BFM shape. The example image is from 4DFE dataset and the landmarks can be obtained by any off-the-shelf image based landmark detector.
Figure 10: Given an input image, the conversion from the aligned BFM shape to the fixed UV coordinates, and the UV texture map rendering based on the vertex visibility and the input image. ⊙ denotes element-wise multiplication.

Appendix B Identity Preserving Evaluation for TC-GAN

We have already shown in Section 5.2 of the main paper that our TC-GAN achieves better frontalization quality, with the lowest FID score, compared to [13] and DR-GAN [30]. Here, we go a step further and evaluate the verification accuracy on the LFW dataset by applying all methods to the non-frontal images, defined as those with large yaw angles, and replacing the original images with the frontalized ones. Again, the state-of-the-art face recognition engine ArcFace [7] is used to provide the identity features. In Table 5, the accuracy based on TC-GAN drops the least compared to the original performance, which indicates that our method preserves identity better than the state-of-the-art methods.

method Verification Accuracy
Original
Hassner et al.[13]
DR-GAN [30]
Ours
Table 5: Verification accuracy comparison on the LFW dataset. We apply our TC-GAN and other face frontalization methods to the LFW images with large yaw angles and replace the original images with the frontalized ones.

Appendix C More Attribute Generation and Pose-variant Attribute Augmentation Results

In addition to Figure 7 and Figure 8 of the main paper, we show more attribute generation results against StarGAN and CycleGAN in Fig. 11, 12, 13, 14, and 15. As can be seen, our method generates higher-quality, more geometrically consistent attributes under large head pose variations.

In addition, more attribute and pose augmentation results are shown in Fig. 16, 17, 18, 19, and 20. Our method has good potential to benefit face recognition systems by enriching training data diversity while maximally preserving the original identity information.

Figure 11: Pose-variant qualitative results of our 3DA-GAN compared to StarGAN [4] and CycleGAN [40] trained on our prepared data.
Figure 12: Pose-variant qualitative results of our 3DA-GAN compared to StarGAN [4] and CycleGAN [40] trained on our prepared data.
Figure 13: Pose-variant qualitative results of our 3DA-GAN compared to StarGAN [4] and CycleGAN [40] trained on our prepared data.
Figure 14: Pose-variant qualitative results of our 3DA-GAN compared to StarGAN [4] and CycleGAN [40] trained on our prepared data.
Figure 15: Pose-variant qualitative results of our 3DA-GAN compared to StarGAN [4] and CycleGAN [40] trained on our prepared data.
Figure 16: Visual results of applying our method to augment face images from CelebA [22] dataset, in attributes and yaw angles.
Figure 17: Visual results of applying our method to augment face images from CelebA [22] dataset, in attributes and yaw angles.
Figure 18: Visual results of applying our method to augment face images from CelebA [22] dataset, in attributes and yaw angles.
Figure 19: Visual results of applying our method to augment face images from CelebA [22] dataset, in attributes and yaw angles.
Figure 20: Visual results of applying our method to augment face images from CelebA [22] dataset, in attributes and yaw angles.

Appendix D Including 5 o'Clock Shadow (SH) Results

In Table 2 of the main paper, we showed the attribute classification accuracy and visual quality for 4 attributes, Sunglasses, Lipstick, Smiling, and Bangs, while omitting 5 o'clock shadow due to the space limit. We therefore include the results for 5 o'clock shadow here, splitting the original table into two: one for F1 score in Table 6, and the other for FID score in Table 7. The same trend holds for SH in both Table 6 and Table 7: our method is consistently better than StarGAN and CycleGAN in attribute generation accuracy, and achieves consistently lower FID scores in image quality, indicating visual effects more similar to the original input.

Appendix E Ablation Study

Similarly, we include the quantitative ablation study for SH in Table 8 and Table 9. It shows the same trend as for the other attributes in both F1 and FID scores. More interestingly, we visualize the images generated by the ablative models to further illustrate the effect of the proposed losses. Figures 21, 22, and 23 show that for "w/o Eq. 12", the masked reconstruction loss, some generations fail and some introduce artifacts. For "w/o Eq. 14", the attribute adversarial loss, the generation mostly fails. For "w/o Eq. 11", the cycle consistency loss, more artifacts appear than in the full results. For "w/o Eq. 8", the identity loss, a certain level of artifacts also appears compared to the full result.

Test real (yaw ≤ 45°) real-a (yaw > 45°)
Train method SG LS SH SM BA SG LS SH SM BA
StarGAN [4]* real 97.15 84.26 88.75 87.40 89.56 96.38 77.54 82.07 77.11 86.33
real-a 97.35 78.87 89.63 83.40 89.33 98.07 75.43 88.64 79.01 86.77
Ours 98.88 84.70 91.12 87.87 94.86 98.23 82.04 90.06 83.32 93.67
CycleGAN [40]* real 97.66 84.41 84.49 86.33 70.96 90.49 74.45 79.21 76.48 69.01
real-a 98.93 91.34 85.17 84.25 82.43 97.31 69.27 84.98 75.51 80.70
Ours 99.37 94.69 91.80 94.56 93.35 99.10 93.04 90.90 91.49 91.64
Table 6: Quantitative comparison on attribute generation by F1 score on the CelebA testing set. The target generated attribute is evaluated by an off-line attribute classifier for F1 score (precision and recall); higher is better. "real" means the original CelebA training set. "real-a" means the original plus pose-augmented images. "Ours" means training with our proposed loss and UV texture data. *: we apply the network structure and re-train the models. SG: Sunglass, LS: Wearing Lipstick, SH: 5 o'clock shadow, SM: Smiling, BA: Bangs.
Test real (yaw ≤ 45°) real-a (yaw > 45°)
Train method SG LS SH SM BA SG LS SH SM BA
StarGAN [4]* real 85.68 78.86 96.97 92.28 82.28 139.77 135.93 172.84 150.58 144.02
real-a 72.73 68.91 42.36 58.92 59.53 114.02 85.14 89.02 82.82 105.34
Ours 38.22 34.05 26.19 33.02 21.79 36.31 35.43 30.05 30.58 19.39
CycleGAN [40]* real 30.10 25.06 28.73 32.32 28.69 40.88 49.21 42.56 43.31 36.78
real-a 33.89 12.46 6.57 12.74 9.05 19.83 31.04 8.81 17.06 11.54
Ours 18.54 12.56 7.47 13.03 10.28 29.65 10.92 6.81 10.97 8.94
Table 7: Quantitative comparison on attribute generation by FID score [15] on the CelebA testing set. Visual quality is indicated by the FID score between the images with the generated target attribute and the ground-truth images with the same attribute; lower is better. "real" means the original CelebA training set. "real-a" means the original plus pose-augmented images. "Ours" means training with our proposed loss and UV texture data. *: we apply the network structure and re-train the models. SG: Sunglass, LS: Wearing Lipstick, SH: 5 o'clock shadow, SM: Smiling, BA: Bangs.
Test F1-score (higher better)
real (yaw ≤ 45°) real-a (yaw > 45°)
Model Loss SG LS SH SM BA SG LS SH SM BA
CycleGAN w/o Eq. 12,14 97.97 87.92 85.05 84.62 83.65 97.93 86.21 84.40 81.11 82.21
w/o Eq. 12 99.28 92.95 90.10 93.17 94.86 98.87 90.79 89.15 89.50 93.82
(ResNet) w/o Eq. 14 97.82 83.28 82.25 81.81 86.56 97.54 82.35 82.58 78.43 85.86
Full 99.37 94.69 91.80 94.56 93.35 99.10 93.04 90.90 91.49 91.64
Table 8: Ablation study without the masked reconstruction loss (Eq. 12) and/or without the attribute loss (Eq. 14). F1 scores are reported. We use the CycleGAN ResNet structure as it achieves the best results across the experiments. SG: Sunglass, LS: Wearing Lipstick, SH: 5 o'clock shadow, SM: Smiling, BA: Bangs.
Test FID-score (lower better)
real (yaw ≤ 45°) real-a (yaw > 45°)
Model Loss SG LS SH SM BA SG LS SH SM BA
CycleGAN w/o Eq. 12,14 20.2 10.6 13.9 7.8 14.1 43.8 20.4 20.4 27.2 18.3
w/o Eq. 12 17.6 17.5 7.0 13.8 11.9 26.6 18.1 11.4 15.0 11.4
(ResNet) w/o Eq. 14 29.1 19.0 7.6 18.1 10.5 39.3 18.4 11.3 17.7 10.4
Full 18.5 12.6 7.5 13.0 10.3 29.7 10.9 6.8 11.0 8.9
Table 9: Ablation study without the masked reconstruction loss (Eq. 12) and/or without the attribute loss (Eq. 14). FID scores are reported. We use the CycleGAN ResNet structure as it achieves the best results across the experiments. SG: Sunglass, LS: Wearing Lipstick, SH: 5 o'clock shadow, SM: Smiling, BA: Bangs.
Figure 21: The effect of the masked reconstruction loss on sunglasses, smile, and lipstick generation. From left to right: input images from the CelebA dataset, results using the full losses, and results without the masked reconstruction loss (Eq. 12). The masked reconstruction loss helps generate attributes in a specific region while preserving the non-attribute parts.
Figure 22: The effect of the adversarial attribute loss on smile and bangs generation. From left to right: input images from the CelebA dataset, results using the full losses, and results without the adversarial attribute loss (Eq. 14). The adversarial attribute loss helps enhance the intensity of the generated attributes.
Figure 23: The effect of the cycle consistency loss and identity loss on sunglasses generation. From left to right: input images from the CelebA dataset, results using the full losses, results without the cycle consistency loss (Eq. 11), and results without the identity loss (Eq. 8). The cycle consistency loss and identity loss help preserve the non-attribute regions. The identity loss also makes the generated attribute regions look more natural.

Appendix F Network Architectures

The network architectures of StarGAN and CycleGAN used in our experiments are shown in Tables 10, 11, 12, 13 and 14. We use instance normalization for the generator network in all layers except the output layer. For the quality and attribute discriminator networks, we use Leaky ReLU with a negative slope of 0.01 in StarGAN and 0.02 in CycleGAN. The annotations in the tables are defined as follows: C: the number of output channels, K: kernel size, S: stride size, P: padding size, IN: instance normalization; the number of attributes to be generated and the height and width of the input image also appear in some entries.

Type Layer
Downsampling Conv-(C64, K7x7, S1, P3), IN, ReLU
Downsampling Conv-(C128, K4x4, S2, P1), IN, ReLU
Downsampling Conv-(C256, K4x4, S2, P1), IN, ReLU
Residual Block Conv-(C256, K3x3, S1, P1), IN, ReLU
Residual Block Conv-(C256, K3x3, S1, P1), IN, ReLU
Residual Block Conv-(C256, K3x3, S1, P1), IN, ReLU
Residual Block Conv-(C256, K3x3, S1, P1), IN, ReLU
Residual Block Conv-(C256, K3x3, S1, P1), IN, ReLU
Residual Block Conv-(C256, K3x3, S1, P1), IN, ReLU
Upsampling Deconv-(C128, K4x4, S2, P1), IN, ReLU
Upsampling Deconv-(C64, K4x4, S2, P1), IN, ReLU
Upsampling Deconv-(C3, K7x7, S1, P3), Tanh
Table 10: StarGAN Generator network architecture
Type Layer
Input Conv-(C64, K4x4, S2, P1), Leaky ReLU
Hidden Conv-(C128, K4x4, S2, P1), Leaky ReLU
Hidden Conv-(C256, K4x4, S2, P1), Leaky ReLU
Hidden Conv-(C512, K4x4, S2, P1), Leaky ReLU
Hidden Conv-(C1024, K4x4, S2, P1), Leaky ReLU
Hidden Conv-(C2048, K4x4, S2, P1), Leaky ReLU
Output Conv-(C1, K3x3, S1, P1)
Output Conv-(C, , S1, P0)
Table 11: StarGAN Quality and Attribute discriminator network architecture
Type Layer
Input ReflectionPad2d(3)
Input Conv-(C64, K7x7, S1, P0), IN, ReLU
Downsampling Conv-(C128, K3x3, S2, P1), IN, ReLU
Downsampling Conv-(C256, K3x3, S2, P1), IN, ReLU
Residual Block Conv-(C256, K3x3, S1, P0), IN, ReLU
Residual Block Conv-(C256, K3x3, S1, P0), IN, ReLU
Residual Block Conv-(C256, K3x3, S1, P0), IN, ReLU
Residual Block Conv-(C256, K3x3, S1, P0), IN, ReLU
Residual Block Conv-(C256, K3x3, S1, P0), IN, ReLU
Residual Block Conv-(C256, K3x3, S1, P0), IN, ReLU
Residual Block Conv-(C256, K3x3, S1, P0), IN, ReLU
Residual Block Conv-(C256, K3x3, S1, P0), IN, ReLU
Residual Block Conv-(C256, K3x3, S1, P0), IN, ReLU
Upsampling Deconv-(C128, K3x3, S2, P1), IN, ReLU
Upsampling Deconv-(C64, K3x3, S2, P1), IN, ReLU
Upsampling ReflectionPad2d(3)
Upsampling Deconv-(C3, K7x7, S1, P0), Tanh
Table 12: CycleGAN Generator network architecture
Type Layer
Input Conv-(C64, K4x4, S2, P1), Leaky ReLU
Hidden Conv-(C128, K4x4, S2, P1), Leaky ReLU
Hidden Conv-(C256, K4x4, S2, P1), Leaky ReLU
Hidden Conv-(C512, K4x4, S1, P1), Leaky ReLU
Output Conv-(C1, K4x4, S1, P1)
Table 13: CycleGAN quality discriminator network architecture
Type Layer
Input Conv-(C64, K4x4, S2, P1), Leaky ReLU
Hidden Conv-(C128, K4x4, S2, P1), Leaky ReLU
Hidden Conv-(C256, K4x4, S2, P1), Leaky ReLU
Hidden Conv-(C512, K4x4, S2, P1), Leaky ReLU
Hidden Conv-(C1024, K4x4, S2, P1), Leaky ReLU
Hidden Conv-(C2048, K4x4, S2, P1), Leaky ReLU
Output Conv-(C, , S1, P0)
Table 14: The Attribute discriminator network architecture we used with CycleGAN