SeCGAN: Parallel Conditional Generative Adversarial Networks for Face Editing via Semantic Consistency

by Jiaze Sun, et al.

Semantically guided conditional Generative Adversarial Networks (cGANs) have become a popular approach for face editing in recent years. However, most existing methods introduce semantic masks as direct conditional inputs to the generator and often require the target masks to perform the corresponding translation in the RGB space. We propose SeCGAN, a novel label-guided cGAN for editing face images utilising semantic information without the need to specify target semantic masks. During training, SeCGAN has two branches of generators and discriminators operating in parallel, with one trained to translate RGB images and the other for semantic masks. To bridge the two branches in a mutually beneficial manner, we introduce a semantic consistency loss which constrains both branches to have consistent semantic outputs. Whilst both branches are required during training, the RGB branch is our primary network and the semantic branch is not needed for inference. Our results on CelebA and CelebA-HQ demonstrate that our approach is able to generate facial images with more accurate attributes, outperforming competitive baselines in terms of Target Attribute Recognition Rate whilst maintaining quality metrics such as self-supervised Fréchet Inception Distance and Inception Score.




1 Introduction

Figure 1: Existing approaches usually incorporate high-level information within the RGB domain. We propose to translate semantic masks in parallel to the RGB images without altering the existing architecture in the RGB domain (zoom in for a better view).

With the advent of deep learning [LeCun1989Backpropagation, He2016Deep, Schroff2015FaceNet], there is growing demand for annotated training data. This is challenging for many practical tasks, as collecting large-scale annotated data demands considerable expertise, time, and resources. It becomes even more difficult for face-related tasks due to growing privacy concerns. Target-label-conditioned Generative Adversarial Networks (cGANs) for RGB face attribute manipulation have been widely used to generate labelled photo-realistic synthetic data. Generating such examples is an important and challenging research problem [choi2018stargan, He2019AttGAN, Liu2019STGAN, bhattarai2020inducing, Sun2020MatchGAN]. In recent years, attempts have been made to encode target domain information in order to regularise cGANs and improve the accuracy and quality of generated images. The majority of them focused on engineering representations of target labels, and some have managed to improve performance. STGAN [Liu2019STGAN] proposed to condition the generator using the difference between binary vector representations of the target and source labels instead of the target only [choi2018stargan, mirza2014conditionalgan]. Similarly, word embeddings and graph-induced representations [bhattarai2020inducing], 3D morphable model (3DMM) parameters for faces [gecer2018semi], facial semantic masks [Gu2019Mask, Lee2020MaskGAN], and facial key points [Qian2019Make, kossaifi2018gagan] are also employed to encode target information to provide better guidance for cGANs. A recent study on cGAN conditioning [bhattarai2020inducing] observes that higher-order encodings such as graph-induced representations are becoming more successful in representing target information compared to forms such as binary vectors. Similarly, incorporating higher-level information such as semantic segmentation [Lee2020MaskGAN], 3DMM parameters [gecer2018semi], and sparse and dense geometric key points [Qian2019Make] is more effective than relying only on low-level RGB images [choi2018stargan, He2019AttGAN]. However, a commonality between these methods is that the generation output stays within the RGB domain and the higher-level information is either absorbed through direct input or used to supplement the RGB data.

We hypothesise that by distributing high- and low-level information to separate network branches and training them in a cooperative manner, the overall model would be able to utilise different modalities more effectively than by simply absorbing the high-level information via a direct input. Unlike previous methods which take higher-level information simply to augment or fuse with the RGB data [Gu2019Mask, Tang2020XingGAN], we propose translating higher-level facial descriptions and the RGB images in parallel and imposing a higher-level consistency constraint in the target domain (shown in Figure 1). To this end, we propose to train a semantic mask translation network in parallel to the RGB network as shown in Figure 2. We choose semantic masks as the higher-order information as they are denser and more relevant to attribute manipulation than other choices such as landmarks. However, the proposed framework can be easily extended to other such descriptions of the face and implemented on top of existing frameworks without altering their architecture or overriding their existing capacity. Unlike previous work employing semantic masks [Gu2019Mask, Lee2020MaskGAN], which requires target mask annotation, we only have access to the source mask and thus our work is orthogonal to these methods. In addition, our framework also benefits from adopting a hierarchical structure, which has been shown to retain target attributes better by preventing the modification of unwanted attributes [yi2019apdrawinggan, Wu2020Cascade]. Compared to existing hierarchical cGANs for facial expression manipulation [Wu2020Cascade], whose hierarchy is based on spatial location, our hierarchy is at a semantic level, which is more refined than bounding boxes.

We would like to emphasise that our main goal is to utilise semantic information to generate images with greater accuracy whilst maintaining image quality. In summary, our contributions are as follows:

  • We present a cGAN for face editing, employing a parallel hierarchy which performs translation at both raw-pixel and semantic level. To our knowledge, this is the first work exploring image-to-image translation purely between semantic masks.

  • Our extensive quantitative evaluations demonstrate that our method achieves superior Target Attribute Recognition Rate (TARR) whilst being able to maintain quality metrics including self-supervised Fréchet Inception Distance (ssFID) and Inception Score (IS).

  • Our qualitative evaluations show that our method is able to synthesise more distinct and accurate attributes whilst avoiding unnecessary and irrelevant edits.

2 Related work

Image-to-image translation with cGANs. Label-conditioned GANs [mirza2014conditionalgan] are widely used in image-to-image translation tasks. They take a label vector as an additional input to help guide the translation process thus providing better control over the output images. Pix2pix [phillip2016image] and CycleGAN [zhu2017cyclegan] were amongst the first such frameworks, but they must be trained separately for each pair of domains. StarGAN [choi2018stargan] and AttGAN [He2019AttGAN] both overcame this challenge and are able to perform multi-domain translations using a single generator conditioned on the target label. STGAN [Liu2019STGAN] improved AttGAN by conditioning the generator using the difference between the target and source labels and incorporating selective transfer units. [bhattarai2020inducing] further improved STGAN by transforming the attribute difference vector with a graph neural network before using it to condition the generator. These methods all rely purely on the RGB domain without any higher-level geometric information.

Geometry-guided generation. Geometric information has also been incorporated into various GAN architectures to guide the generation process. [Gu2019Mask, Lee2020MaskGAN] proposed using source and target semantic segmentation masks as direct inputs to the generator for face style transfer and component editing. [Qian2019Make] used facial landmarks to manipulate facial expressions and head poses. [men2020controllable] uses both pose and semantic information to synthesise whole body images. [Jo2019SCFEGAN] edits facial images using input sketches as reference. [gecer2018semi] uses 3DMM to generate identity and pose before adding realism using a GAN. Most work in this area requires a ground-truth or user-specified target semantic mask as a reference for generation. In our work, the only guidance is the information provided by the source label and semantic masks and we use a GAN to generate the target mask.

Hierarchical GANs and mutual learning. Most frameworks mentioned above feature a single GAN in their architecture, but recent work has started to feature multiple GANs operating in different pathways, with each operating on a different scale or modality. SinGAN [Shaham2019SinGAN] can generate high-resolution images using a pyramid of GANs, with generators at higher resolutions taking information from those at lower resolutions. [yi2019apdrawinggan] translates facial images to artistic drawings by employing a global GAN and multiple local GANs that focus on individual facial components before fusing them together. [Wu2020Cascade] learns to manipulate facial expressions via a similar global-local hierarchy. [Chen2019Learning] adds realism to synthetic 3D images via two adversarial games, one in the RGB domain and the other in the semantic and depth domain. [Tang2020XingGAN] performs pose editing by splitting the generator into an RGB branch and a pose branch which share latent information with each other. Inspired by [Chen2019Learning] and [Tang2020XingGAN], we propose to incorporate two branches of GANs performing translations in separate domains, RGB and semantic, whilst sharing information in a mutual learning manner [Zhang2018Deep].

Figure 2: Overall pipeline of our method. The input image $x$ is translated to $\hat{x}$ by $G_{rgb}$ in the RGB branch. Meanwhile, $x$ is parsed by the segmentation network $P$ to obtain the mask $m$, which is translated to $\hat{m}$ by $G_{seg}$ in the semantic branch. The translated RGB image $\hat{x}$ is further parsed by $P$. Inconsistency between the two semantic outputs is measured by the semantic consistency loss and minimised by back-propagating gradients into $G_{rgb}$ and $G_{seg}$. For clarity, the reconstruction loss is not shown in this figure.

3 Methodology

In this section, we present in detail our approach for face editing via semantic consistency. Given an input face image $x$ with source attribute label $a_s$ and a target attribute label $a_t$, we aim to synthesise an image with the desired characteristics specified by $a_t$.

Our approach is to incorporate semantic masks as a parallel modality to RGB images in the training process. Unlike previous work [Gu2019Mask, Lee2020MaskGAN], which requires target segmentation masks to be specified as an input to the network, we have no access to any ground-truth target mask information. Instead, we propose to generate the target mask by translating the source mask using a second generator $G_{seg}$, trained in parallel with the standard generator $G_{rgb}$ used for translating RGB images. The outputs of both generators are then compared against each other at a semantic level and this information is back-propagated to both generators to minimise any semantic inconsistency between the two outputs. Whilst $G_{rgb}$ is our primary generator and will be kept during inference, $G_{seg}$ only provides auxiliary guidance during training and is not necessary for inference.

Parsing network. We used an existing semantic segmentation network [Yu2018BiSeNet] pre-trained on CelebAMask-HQ [Lee2020MaskGAN] to parse both source and translated RGB images into semantic regions. The network is pre-trained separately for different resolutions and classifies each pixel in the RGB domain into one of the following 12 segments: skin, eyebrows, eyes, eyeglasses, ears, earrings, nose, mouth, lips, neck, hair, and others. The last segment “others” is in fact the union of background, necklace, hat, and clothing which are irrelevant to facial attribute manipulation.

3.1 Parallel GANs

The overall pipeline of our framework is shown in Figure 2. During training, two branches of generators ($G_{rgb}$, $G_{seg}$) and discriminators ($D_{rgb}$, $D_{seg}$) operate in parallel, with $G_{rgb}$ and $D_{rgb}$ forming the RGB branch and $G_{seg}$ and $D_{seg}$ forming the semantic branch. The framework of either branch can be substituted by that of a single GAN such as StarGAN [choi2018stargan], AttGAN [He2019AttGAN], or STGAN [Liu2019STGAN].

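RGB branch. Given an input RGB image $x$, source and target attribute labels $a_s$ and $a_t$, $G_{rgb}$ synthesises an output RGB image $\hat{x} = G_{rgb}(x, d)$, where $d = a_t - a_s$ following STGAN [Liu2019STGAN]. The discriminator $D_{rgb}$ takes an image as input and has two output heads, $D^{adv}_{rgb}$ and $D^{cls}_{rgb}$, which are predictions of realism and attribute vector respectively. We train the network to produce realistic images using the WGAN-GP loss [Gulrajani2017Improved]

$$\mathcal{L}^{adv}_{D_{rgb}} = -\mathbb{E}_{x}\big[D^{adv}_{rgb}(x)\big] + \mathbb{E}_{x,d}\big[D^{adv}_{rgb}(G_{rgb}(x, d))\big] + \lambda_{gp}\,\mathbb{E}_{\tilde{x}}\big[(\|\nabla_{\tilde{x}} D^{adv}_{rgb}(\tilde{x})\|_2 - 1)^2\big], \tag{1}$$

$$\mathcal{L}^{adv}_{G_{rgb}} = -\mathbb{E}_{x,d}\big[D^{adv}_{rgb}(G_{rgb}(x, d))\big], \tag{2}$$

where $\tilde{x}$ is uniformly sampled along straight lines between $x$ and $G_{rgb}(x, d)$, and $\lambda_{gp}$ weights the gradient penalty. To generate the desired target attributes, we minimise the attribute classification loss

$$\mathcal{L}^{cls}_{D_{rgb}} = -\mathbb{E}_{x,a_s}\big[a_s \cdot \log D^{cls}_{rgb}(x) + (\mathbf{1} - a_s) \cdot \log(\mathbf{1} - D^{cls}_{rgb}(x))\big], \tag{3}$$

$$\mathcal{L}^{cls}_{G_{rgb}} = -\mathbb{E}_{x,a_t}\big[a_t \cdot \log D^{cls}_{rgb}(\hat{x}) + (\mathbf{1} - a_t) \cdot \log(\mathbf{1} - D^{cls}_{rgb}(\hat{x}))\big], \tag{4}$$

where “$\cdot$” denotes the dot product, $\log$ is element-wise, and $\mathbf{1}$ is the all-one vector. Finally, to ensure that the translated images are consistent with the input images, we impose the reconstruction loss

$$\mathcal{L}^{rec}_{G_{rgb}} = \mathbb{E}_{x}\big[\|x - G_{rgb}(x, \mathbf{0})\|_1\big], \tag{5}$$

where $\mathbf{0}$ is the zero vector and $\|\cdot\|_1$ is the L1 norm.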
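As a rough illustration of the RGB-branch terms described above, the following sketch evaluates the attribute difference vector, the binary cross-entropy form of the attribute classification loss, the L1 reconstruction loss, and the interpolation used for the gradient penalty on plain Python lists. The function names and the toy values are assumptions made for this example only; they are not taken from the authors' implementation, which operates on PyTorch tensors.

```python
import math

def attribute_difference(target, source):
    """Conditioning vector: target label minus source label (as in STGAN)."""
    return [t - s for t, s in zip(target, source)]

def attribute_classification_loss(label, prediction, eps=1e-12):
    """-[a . log(p) + (1 - a) . log(1 - p)] with element-wise log."""
    total = 0.0
    for a, p in zip(label, prediction):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log finite
        total -= a * math.log(p) + (1 - a) * math.log(1 - p)
    return total

def l1_reconstruction_loss(image, reconstruction):
    """L1 norm of the difference between an image and its reconstruction."""
    return sum(abs(u - v) for u, v in zip(image, reconstruction))

def interpolate(real, fake, alpha):
    """Point on the straight line between a real and a generated sample."""
    return [alpha * r + (1 - alpha) * f for r, f in zip(real, fake)]

# Hypothetical 3-attribute labels: switch attribute 0 off and attribute 1 on.
d = attribute_difference([0, 1, 0], [1, 0, 0])
```

Passing the zero vector as the condition corresponds to asking the generator to leave the attributes unchanged, which is why the reconstruction loss compares the input with its own translation under a zero difference vector.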

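Semantic branch. The semantic branch operates in a manner similar to the RGB branch; the main difference is that it only translates semantic segmentation masks. Given an input RGB image $x$, we generate its soft semantic mask $P(x)$ using the aforementioned fixed parsing network $P$. Here, the soft mask is further represented in a one-hot fashion, $m$, which is of shape [number of segments $\times$ height $\times$ width], where each pixel has exactly one channel equal to 1 and all others 0. Similar to the RGB branch, given $m$ and the attribute difference vector $d$, the semantic generator $G_{seg}$ synthesises an output semantic mask $\hat{m} = G_{seg}(m, d)$ which has the same shape as the input mask. Rather than using a Tanh layer as the final activation before the output (which is the case for the RGB branch), we use a Softmax layer so that each output pixel is a probability distribution over the 12 segments and gradients are able to flow through during back-propagation. On the other hand, the discriminator $D_{seg}$ functions largely the same way as its RGB counterpart, taking a semantic mask as input and outputting predictions for the realism score $D^{adv}_{seg}$ and attribute vector $D^{cls}_{seg}$. As for the loss functions, we also train the semantic branch using the WGAN-GP adversarial loss

$$\mathcal{L}^{adv}_{D_{seg}} = -\mathbb{E}_{m}\big[D^{adv}_{seg}(m)\big] + \mathbb{E}_{m,d}\big[D^{adv}_{seg}(G_{seg}(m, d))\big] + \lambda_{gp}\,\mathbb{E}_{\tilde{m}}\big[(\|\nabla_{\tilde{m}} D^{adv}_{seg}(\tilde{m})\|_2 - 1)^2\big], \tag{6}$$

$$\mathcal{L}^{adv}_{G_{seg}} = -\mathbb{E}_{m,d}\big[D^{adv}_{seg}(G_{seg}(m, d))\big], \tag{7}$$

the attribute classification loss

$$\mathcal{L}^{cls}_{D_{seg}} = -\mathbb{E}_{m,a_s}\big[a_s \cdot \log D^{cls}_{seg}(m) + (\mathbf{1} - a_s) \cdot \log(\mathbf{1} - D^{cls}_{seg}(m))\big], \tag{8}$$

$$\mathcal{L}^{cls}_{G_{seg}} = -\mathbb{E}_{m,a_t}\big[a_t \cdot \log D^{cls}_{seg}(\hat{m}) + (\mathbf{1} - a_t) \cdot \log(\mathbf{1} - D^{cls}_{seg}(\hat{m}))\big], \tag{9}$$

and the reconstruction loss

$$\mathcal{L}^{rec}_{G_{seg}} = \mathbb{E}_{m}\Big[-\frac{1}{HW}\sum_{i=1}^{HW} m_i \cdot \log G_{seg}(m, \mathbf{0})_i\Big], \tag{10}$$

where the subscript $i$ denotes the $i$-th pixel, and $H$ and $W$ are the image height and width respectively. Unlike the RGB branch which uses the L1 loss for reconstruction, we use the cross entropy loss here as it is more suitable for optimising probability distributions.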
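To make the mask representation concrete, the sketch below applies a per-pixel softmax, converts a soft mask into its one-hot form, and computes the pixel-averaged cross entropy used for reconstruction in the semantic branch. It treats a mask as a list of per-pixel distributions rather than a [segments, height, width] tensor, and all function names are hypothetical choices for this illustration.

```python
import math

def softmax(logits):
    """Turn per-pixel logits into a probability distribution over segments."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(probs):
    """Set the most likely segment to 1 and all others to 0."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return [1.0 if i == best else 0.0 for i in range(len(probs))]

def pixel_cross_entropy(target_one_hot, predicted_probs, eps=1e-12):
    """Cross entropy between a one-hot target and a predicted distribution."""
    return -sum(t * math.log(max(p, eps))
                for t, p in zip(target_one_hot, predicted_probs))

def mask_cross_entropy(target_mask, predicted_mask):
    """Average the per-pixel cross entropy over all pixels of a mask."""
    losses = [pixel_cross_entropy(t, p)
              for t, p in zip(target_mask, predicted_mask)]
    return sum(losses) / len(losses)
```

Because the softmax output stays a smooth function of the logits, the cross entropy remains differentiable with respect to the generator output, whereas the one-hot conversion is only applied to the target side.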

3.2 Semantic consistency

Up to this point, the two branches of GANs are still independent from each other. To link both networks so that they can mutually benefit from each other during training, we introduce the semantic consistency loss as an additional term to the overall objective. Unlike knowledge distillation frameworks which usually involve a one-way information transfer from a large (teacher) network to a small (student) one, our method is more akin to a mutual learning objective [Zhang2018Deep] as complementary information can be transferred both ways between the RGB generator $G_{rgb}$ and the semantic generator $G_{seg}$ during training. While $G_{rgb}$ focuses on translations at the raw pixel level, it can easily neglect higher-order information such as the geometry of facial components, but this can be compensated by the $G_{seg}$ network. On the other hand, $G_{seg}$ operating purely at the semantic level might lead to drastic changes in the translated images which might not be desirable in the RGB domain, and thus can be held in check by the RGB branch.

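More specifically, we impose the semantic consistency loss on the soft semantic outputs of the two generators, namely $P(G_{rgb}(x, d))$ and $G_{seg}(m, d)$, where $P$ is the parsing network, $m$ is the one-hot source mask, and $d$ is the attribute difference vector. Since these are essentially classification probabilities, we choose to measure the discrepancy between them using the cross entropy loss. To update $G_{rgb}$, we minimise the following loss

$$\mathcal{L}^{sc}_{G_{rgb}} = \mathbb{E}_{x,d}\Big[-\frac{1}{HW}\sum_{i=1}^{HW} \bar{m}^{seg}_i \cdot \log P(G_{rgb}(x, d))_i\Big], \tag{11}$$

where $\bar{m}^{seg}$ is the one-hot representation of $G_{seg}(m, d)$ and the sum runs over all $HW$ pixels. Similarly, to update $G_{seg}$, we minimise

$$\mathcal{L}^{sc}_{G_{seg}} = \mathbb{E}_{x,d}\Big[-\frac{1}{HW}\sum_{i=1}^{HW} \bar{m}^{rgb}_i \cdot \log G_{seg}(m, d)_i\Big], \tag{12}$$

where $\bar{m}^{rgb}$ is the one-hot representation of $P(G_{rgb}(x, d))$. This way, both $G_{rgb}$ and $G_{seg}$ learn to generate outputs with consistent semantic structures.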
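The two-way nature of the semantic consistency loss can be sketched as follows: each branch is scored against the hardened (one-hot) output of the other branch. Masks are again lists of per-pixel distributions, and the helper names `harden`, `cross_entropy`, and `semantic_consistency_losses` are assumptions introduced for this example rather than names from the original code.

```python
import math

def harden(mask):
    """One-hot (argmax) version of a soft mask given as per-pixel distributions."""
    hard = []
    for probs in mask:
        best = probs.index(max(probs))
        hard.append([1.0 if i == best else 0.0 for i in range(len(probs))])
    return hard

def cross_entropy(target_hard, predicted_soft, eps=1e-12):
    """Pixel-averaged cross entropy between a one-hot target and a soft mask."""
    total = 0.0
    for t_px, p_px in zip(target_hard, predicted_soft):
        total -= sum(t * math.log(max(p, eps)) for t, p in zip(t_px, p_px))
    return total / len(target_hard)

def semantic_consistency_losses(parsed_rgb_output, seg_output):
    """Return (loss for the RGB generator, loss for the semantic generator)."""
    # The RGB generator is pulled towards the hardened semantic-branch output.
    loss_rgb = cross_entropy(harden(seg_output), parsed_rgb_output)
    # The semantic generator is pulled towards the hardened parsed RGB output.
    loss_seg = cross_entropy(harden(parsed_rgb_output), seg_output)
    return loss_rgb, loss_seg

# Toy two-pixel, two-segment masks (hypothetical values).
parsed = [[0.9, 0.1], [0.2, 0.8]]
translated = [[0.6, 0.4], [0.75, 0.25]]
loss_rgb, loss_seg = semantic_consistency_losses(parsed, translated)
```

In this toy case the two branches disagree on the second pixel, so both losses are dominated by that pixel; using a hardened target means each loss only passes gradient to the generator being updated.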

3.3 Optimisation

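Overall loss functions. Summarising all the loss terms we have introduced so far, we have

$$\mathcal{L}_{D_{rgb}} = \mathcal{L}^{adv}_{D_{rgb}} + \lambda_{cls}\,\mathcal{L}^{cls}_{D_{rgb}}, \tag{13}$$

$$\mathcal{L}_{G_{rgb}} = \mathcal{L}^{adv}_{G_{rgb}} + \lambda_{cls}\,\mathcal{L}^{cls}_{G_{rgb}} + \lambda_{rec}\,\mathcal{L}^{rec}_{G_{rgb}} + \lambda_{sc}\,\mathcal{L}^{sc}_{G_{rgb}}, \tag{14}$$

$$\mathcal{L}_{D_{seg}} = \mathcal{L}^{adv}_{D_{seg}} + \lambda_{cls}\,\mathcal{L}^{cls}_{D_{seg}}, \tag{15}$$

$$\mathcal{L}_{G_{seg}} = \mathcal{L}^{adv}_{G_{seg}} + \lambda_{cls}\,\mathcal{L}^{cls}_{G_{seg}} + \lambda_{rec}\,\mathcal{L}^{rec}_{G_{seg}} + \lambda_{sc}\,\mathcal{L}^{sc}_{G_{seg}}, \tag{16}$$

where $\lambda_{cls}$, $\lambda_{rec}$, and $\lambda_{sc}$ weight the classification, reconstruction, and semantic consistency terms respectively.
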
Training. In a fashion similar to [Zhang2018Deep], both the RGB and semantic branches receive the same mini-batch of examples during each iteration, and the semantic consistency loss is used to update the parameters of one generator based on the semantic output of the other. The details of the training procedure are summarised in Algorithm 1.

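1:  Input: Labelled dataset of RGB images and their corresponding attribute labels.
2:  Initialise: Initial weights $\theta_{D_{rgb}}$, $\theta_{D_{seg}}$, $\theta_{G_{rgb}}$, and $\theta_{G_{seg}}$, parsing network $P$ with pre-trained weights, learning rates $\eta_D$ and $\eta_G$, step count $k = 0$, number of discriminator updates per generator update $n$.
3:  repeat
4:     $k \leftarrow k + 1$.
5:     Randomly sample a mini-batch, and compute empirical loss terms (13) and (15).
6:     Compute the gradients and update: $\theta_{D_{rgb}} \leftarrow \theta_{D_{rgb}} - \eta_D \nabla_{\theta_{D_{rgb}}} \mathcal{L}_{D_{rgb}}$, $\theta_{D_{seg}} \leftarrow \theta_{D_{seg}} - \eta_D \nabla_{\theta_{D_{seg}}} \mathcal{L}_{D_{seg}}$.
7:     if $k \bmod n = 0$ then
8:        With the same mini-batch from line 5, compute empirical loss terms (14) and (16).
9:        Compute the gradients and update: $\theta_{G_{rgb}} \leftarrow \theta_{G_{rgb}} - \eta_G \nabla_{\theta_{G_{rgb}}} \mathcal{L}_{G_{rgb}}$, $\theta_{G_{seg}} \leftarrow \theta_{G_{seg}} - \eta_G \nabla_{\theta_{G_{seg}}} \mathcal{L}_{G_{seg}}$.
10:     end if
11:  until Convergence.
12:  Output: Optimal $G_{rgb}$.
Algorithm 1 SeCGAN.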
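The alternating schedule in Algorithm 1 can be sketched as a small driver loop. This is only a skeleton under stated assumptions: `update_discriminators` and `update_generators` are hypothetical callables standing in for the gradient steps on both branches, and the usage below merely counts how often each is invoked.

```python
def train(batches, update_discriminators, update_generators, n_critic):
    """Update both discriminators every step and both generators every n_critic steps."""
    k = 0
    for batch in batches:
        k += 1
        # Both discriminators are updated on every mini-batch.
        update_discriminators(batch)
        if k % n_critic == 0:
            # Both generators reuse the same mini-batch as the discriminators.
            update_generators(batch)
    return k

# Usage with counters standing in for real optimisation steps.
calls = {"discriminators": 0, "generators": 0}

def count_d(batch):
    calls["discriminators"] += 1

def count_g(batch):
    calls["generators"] += 1

steps = train(range(10), count_d, count_g, n_critic=5)
```

With ten mini-batches and five discriminator updates per generator update, the discriminators are updated ten times and the generators twice.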

4 Experiments

4.1 Datasets

We tested our method on two datasets widely used for face editing, the CelebFaces Attributes Dataset (CelebA) [liu2015faceattributes] and the CelebA-HQ dataset [karras2018progressive, Lee2020MaskGAN].

CelebA. This dataset consists of 202,599 facial images, which were centre-cropped and then resized to the desired training resolution in our experiments. Each example has 40 attribute annotations, from which we selected 13 attributes for our experiments, namely bald, bangs, black hair, blonde hair, brown hair, bushy eyebrows, eyeglasses, gender, mouth slightly open, moustache, beard, pale skin, and age. We allocated 182,000 images for training, 637 for validation, and 19,962 for testing, following [He2019AttGAN].

CelebA-HQ. This dataset is a subset of CelebA with 30,000 examples in total, but each image is re-created at a higher resolution by [karras2018progressive]. Each image also has 40 attribute annotations from which we selected the same 13 as before. We split the dataset into 28,000 for training, 500 for validation, and 1,500 for testing, again following [He2019AttGAN].

4.2 Baselines

We adopted two baseline models for our experiments, StarGAN [choi2018stargan] and AttGAN [He2019AttGAN], both of which perform attribute label conditioned image-to-image translations.

StarGAN. StarGAN is a cGAN which unifies multi-domain image-to-image translation using a single generator. Whilst the generator in StarGAN does not strictly have an encoder-decoder architecture, it has six residual blocks serving as bottleneck layers sandwiched between down-sampling layers on the input side and up-sampling layers on the output side. The discriminator is based on PatchGAN [phillip2016image], a fully convolutional network which learns and assesses features at the scale of local image patches. In addition, StarGAN uses a cycle-consistency loss [zhu2017cyclegan] given by $\mathbb{E}\big[\|x - G(G(x, a_t), a_s)\|_1\big]$, which translates the input to the target label $a_t$ and back to the source label $a_s$. Since we are using the attribute difference vector as the conditional input in this work, there is no need to perform two translations to reconstruct a given input image. Therefore, we replace it with the reconstruction loss introduced before in equations (5) and (10) for our experiments.

AttGAN. AttGAN is a cGAN with a similar functionality and pipeline to StarGAN. Unlike StarGAN, however, AttGAN features an encoder-decoder architecture which includes skip connections between the encoder and decoder as well as injection layers where the attribute labels are concatenated with latent representations at various scales. Also unlike StarGAN, the discriminator of AttGAN employs two fully-connected layers following a cascade of convolutional ones. As with our StarGAN baseline, we use the attribute difference vector as the conditional input to the generator. Compared to the 8M trainable parameters in StarGAN’s generator, AttGAN’s generator has significantly more at 43M and serves as a higher-capacity use case in our tests.

4.3 Implementation details

We implemented our method in PyTorch based on the official implementation of StarGAN and the PyTorch version of AttGAN. We follow the same training procedures and loss-weight values as the original implementations. For StarGAN, we trained the network for 200K iterations with the generator being updated once every 5 discriminator updates, whereas for AttGAN the training lasted roughly 1.1M iterations with the same generator update frequency. We used the Adam optimiser [Kingma2015Adam] for both architectures. The initial learning rates were kept constant during the first half of the training (first 100K iterations for StarGAN and first 100 epochs for AttGAN) and then decayed, linearly to 0 for StarGAN and exponentially for AttGAN. The batch size was set to 16 for StarGAN and 32 for AttGAN. On an NVIDIA RTX 2080 Ti GPU, training our framework takes about 1 day on StarGAN and 6 days on AttGAN.
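The StarGAN-style schedule described above (constant for the first half of training, then a linear decay to zero) can be written as a small function. The function name and the example values are placeholders chosen for illustration; they are not the settings used in the experiments.

```python
def linear_decay_lr(initial_lr, step, total_steps):
    """Keep the rate constant for the first half, then decay linearly to 0."""
    half = total_steps // 2
    if step <= half:
        return initial_lr
    remaining = total_steps - step
    return initial_lr * remaining / (total_steps - half)

# Hypothetical run of 200 steps with a unit initial rate.
schedule = [linear_decay_lr(1.0, step, 200) for step in (50, 100, 150, 200)]
```

The same shape applies to a 200K-iteration run by substituting the real step counts and initial rate.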

4.4 Evaluation metrics

TARR. As face attribute editing is our main task, we measure the accuracy of translated attributes using the Target Attribute Recognition Rate (TARR) which has also been used in existing work [He2019AttGAN, Liu2019STGAN, bhattarai2020inducing]. To compute this, we translate images in the test set 13 times, each time by reversing one of the 13 attributes, and finally measure the percentage of images that can result in the desired classification. To this end, we follow [Liu2019STGAN, bhattarai2020inducing] and use an external attribute classifier pre-trained on the training set of CelebA with 94.5% mean accuracy on the test set. We report the accuracy on each individual attribute as well as the overall mean.
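A minimal sketch of how TARR could be tallied is given below, assuming the external classifier's binary predictions on the translated images are already available. The dictionary layout and the name `target_attribute_recognition_rate` are assumptions for this example.

```python
def target_attribute_recognition_rate(predicted, target):
    """Per-attribute and mean percentage of translated images classified as intended.

    predicted/target map an attribute name to a list of 0/1 values, one per
    translated image in which that attribute was reversed.
    """
    per_attribute = {}
    for name, wanted in target.items():
        hits = sum(int(p == t) for p, t in zip(predicted[name], wanted))
        per_attribute[name] = 100.0 * hits / len(wanted)
    mean = sum(per_attribute.values()) / len(per_attribute)
    return per_attribute, mean

# Toy example with two attributes (hypothetical classifier outputs).
predicted = {"bangs": [1, 1, 0, 1], "eyeglasses": [0, 1]}
target = {"bangs": [1, 1, 1, 1], "eyeglasses": [0, 0]}
rates, mean_rate = target_attribute_recognition_rate(predicted, target)
```

Here the mean is taken over attributes, matching the reporting of per-attribute accuracy alongside an overall mean.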

ssFID and IS. To ensure that our method maintains image quality on par with the baseline, we also employ the self-supervised Fréchet Inception Distance (ssFID) [morozov2021on] and Inception Score (IS) [salimans2016improved]. The ssFID is a variant of FID [heusel2017gans], which measures the similarity between the distribution of real examples and that of fake examples by comparing their embeddings in the Inception-v3 network (pre-trained on ImageNet). However, as such classification-pretrained embeddings could be misleading in other tasks and non-transferable to non-ImageNet domains, ssFID instead uses the self-supervised image representation model SwAV [Caron2020Unsupervised] to provide a more reasonable and universal assessment of image quality. The IS measures the diversity and meaningfulness of the generated images, but it does not use the distribution of the real images for comparison. Higher image quality corresponds to a lower ssFID and higher IS. To compute the ssFID and IS, we first translate each image in the test set into 13 fake images by reversing each of the 13 attributes, resulting in a pool of fake images that is 13 times the size of the real pool. Then, we select images from the fake pool that correspond to a specific attribute reversal and compute the ssFID between this subset of fake images and the real images, and repeat this for each of the 13 attributes and take the average. As for IS, we divide the entire fake pool into 10 subsets and compute the average IS of them.
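For reference, the Inception Score is the exponential of the average KL divergence between each image's class distribution and the marginal class distribution, and the split-and-average protocol described above can be sketched as follows. The sketch assumes the classifier's softmax outputs are already given as lists of probabilities; the function names are illustrative.

```python
import math

def inception_score(probabilities, eps=1e-12):
    """exp of the mean KL divergence between p(y|x) and the marginal p(y)."""
    count = len(probabilities)
    classes = len(probabilities[0])
    marginal = [sum(p[c] for p in probabilities) / count for c in range(classes)]
    kl_sum = 0.0
    for p in probabilities:
        kl_sum += sum(pc * (math.log(pc + eps) - math.log(marginal[c] + eps))
                      for c, pc in enumerate(p))
    return math.exp(kl_sum / count)

def mean_inception_score(probabilities, splits=10):
    """Average the score over equally sized consecutive subsets of the pool."""
    size = len(probabilities) // splits
    scores = [inception_score(probabilities[i * size:(i + 1) * size])
              for i in range(splits)]
    return sum(scores) / len(scores)
```

Identical predictions for every image give a score of 1, while confident predictions spread evenly over two classes give a score close to 2, which is why the score is read as a joint measure of meaningfulness and diversity.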

4.5 Ablation studies

Figure 3: TARR by the semantic consistency loss weight $\lambda_{sc}$ on CelebA.

To maximise the potential of our method and especially TARR, which is our main focus, we tested various values for the semantic consistency loss weight $\lambda_{sc}$ and obtained TARR for each setup. The results are shown in Figure 3. As one can observe, with a very low $\lambda_{sc}$, TARR stays roughly the same as the baseline, which is expected as the two branches are unable to exert enough influence over each other. TARR attains a maximum value as $\lambda_{sc}$ increases to a suitable level (0.01 in this case) before starting to decrease as $\lambda_{sc}$ increases even further. This is also reasonable since a very large weight would make the consistency constraint so strong that it disrupts the balance between various loss terms both across and within each of the two branches.

To demonstrate the advantage of incorporating a parallel semantic generator over a direct fusion of RGB images and semantic masks, we also evaluated an additional baseline, “Concat”, which takes as input the image concatenated with its semantic mask along the channel dimension. This way, the semantic information is made available to the generator without further architectural modifications or additional losses. We implemented this baseline on both StarGAN and AttGAN and the results are shown in Table 1. In terms of TARR, “Concat” is unable to outperform the baseline and even underperforms on AttGAN. This suggests that a single network struggles to handle multiple modalities effectively. Whilst “Concat” seems to slightly reduce ssFID, this reduction is small and possibly due to the input mask constraining the translation output to resemble the input domain more closely, which could be counterproductive in face editing as reflected by TARR.
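The “Concat” input can be pictured as simple channel stacking: a 3-channel RGB image and a 12-channel one-hot mask of the same spatial size become a single 15-channel input. The sketch below uses nested lists in channel-first layout with a hypothetical 4 by 4 image; `concat_channels` is a name chosen for this example.

```python
def concat_channels(image, mask):
    """Concatenate two channel-first arrays (nested lists) along the channel axis."""
    height, width = len(image[0]), len(image[0][0])
    for channel in mask:
        if len(channel) != height or len(channel[0]) != width:
            raise ValueError("image and mask must share the same spatial size")
    return image + mask

rgb = [[[0.0] * 4 for _ in range(4)] for _ in range(3)]    # 3 x 4 x 4
mask = [[[0.0] * 4 for _ in range(4)] for _ in range(12)]  # 12 x 4 x 4
fused = concat_channels(rgb, mask)
```

Only the number of input channels of the first generator layer has to change to accept such an input, which is what makes this baseline a fusion without further architectural changes.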

4.6 Quantitative results


Dataset     Resolution   Backbone   Method          TARR (%)   ssFID   IS
CelebA                   StarGAN    Baseline        78.72      1.58    3.08
                                    Concat          80.50      1.47    3.04
                                    SeCGAN (ours)   82.79      1.61    3.07
                         AttGAN     Baseline        82.85      1.22    2.94
                                    Concat          79.74      1.20    3.06
                                    SeCGAN (ours)   84.81      1.27    3.05
CelebA-HQ                StarGAN    Baseline        72.75      3.22    2.74
                                    SeCGAN (ours)   75.78      3.51    2.80
                                    Baseline        58.16      4.97    3.11
                                    SeCGAN (ours)   60.17      4.91    3.12


Table 1: Baselines vs our method in terms of TARR, ssFID, and IS. The TARR values shown here are the averaged value across attributes. Our method outperforms baselines consistently in terms of TARR and is closely on par with them in terms of ssFID and IS.
Figure 4: TARR grouped by individual attributes (zoom in for a better view at the annotations).

Table 1 summarises the performance of the baselines and our method across two datasets, two architectures, and two resolutions. In terms of TARR, our method consistently outperforms the baseline across all settings. In addition, we also provide a breakdown of TARR values by each individual attribute in Figure 4. We can also observe that our method outperforms the baseline in most of the attributes tested. This shows that our method can indeed guide the RGB-generator towards generating more accurate attributes thanks to the higher-order information provided by the semantic branch. As for image quality metrics, our method is closely on par with the baseline despite not gaining a significant advantage in every single case. In terms of IS, our method is able to outperform the baseline in most of the settings, demonstrating the greater diversity and meaningfulness of our generated images. In terms of ssFID, our method is roughly the same as the baseline which shows that the image quality is well maintained and not adversely impacted by the gain in other metrics.

4.7 Qualitative results

Figure 5: Qualitative comparison between the baseline and our method (zoom in for a better view). The first column shows the original images. The second and third columns are images translated by the baseline and our method respectively. The fourth and fifth columns are heatmaps of the absolute difference between the original and translated images (the brighter, the higher the difference). The sixth and seventh columns are the original and translated semantic masks (by $G_{seg}$ in SeCGAN) respectively. (More examples in the Appendix.)

Figure 5 shows visual comparisons between synthetic images generated by StarGAN (baseline) and SeCGAN (StarGAN backbone) respectively. By comparing the synthetic RGB images, one can observe that our method produces more distinct attributes, particularly ones that correlate strongly with shapes of facial components such as “add bangs” and “to bald”. As for other attributes which are only weakly correlated with shapes, the improvement is subtle but in fact observable via heatmaps of the absolute difference between translated and original images. More specifically, our method is better at focusing on the regions that are relevant to the attribute in question as shown by the higher intensity of these regions in the heatmaps. For instance, our method is better at focusing on the hair region for “hair to blonde”, skin for “skin to pale”, the area above the lips for “add moustache”, the nasolabial folds and eyebags for “to old”, the eyebrows for “to bushy eyebrows”, and so on.

On the other hand, our method also fares better in avoiding unnecessary editing of unrelated regions. For instance, in “hair to brown” and “hair to black”, the baseline is unable to locate the hair region in the input image correctly and instead applies colour changes to the hat region, whereas our method is able to learn to ignore the hat region and focus on the hair. This can also be observed by comparing the background areas in examples including “hair to blonde”, “to bald”, “add eyeglasses”, and “to bushy eyebrows”. In addition, we also visualised the input semantic mask and the one translated by the semantic generator $G_{seg}$. It can be observed that our method is able to correctly translate the semantic masks into the desired shapes for shape-related attributes, such as “add bangs”, “to bald”, and “add eyeglasses”, whilst keeping the masks relatively intact for attributes that are only weakly correlated with shapes, such as “hair to blonde” and “add moustache”.

5 Conclusion

We present SeCGAN, a novel cGAN for face editing harnessing the benefits of higher-level semantic information and mutual learning. In training, SeCGAN features two GANs for translating RGB images and semantic masks respectively whilst maintaining the semantic consistency of outputs. Experiments show that our method is able to outperform baselines in terms of TARR whilst maintaining ssFID and IS. This improvement is maintained even in higher-resolution settings with less training data.

Limitations. One limitation of our method is that it relies on a segmentation network which requires pre-training, though this could be done at a relatively small scale as we did on CelebA-HQ. Our method also has twice the baseline’s complexity during training, but during inference its complexity is the same as the baseline.

Societal impact. We acknowledge that our model is generative and thus can be used to create deep-fakes for disinformation, and we caution that appropriate detection algorithms should be implemented to guard against malicious use of deep-fakes. In fact, our model could be utilised as a data generator to train such detection algorithms.

6 Acknowledgement

We thank the Huawei Consumer Business Group, Croucher Foundation, and EPSRC Programme Grant ‘FACER2VM’ (EP/N007743/1) for supporting this work.


Appendix A

In this section, we include the architectural details of our method and additional qualitative examples.

A.1 Architecture details

In this subsection, we list all the layers of the generator and discriminator in the semantic branch. We present our approach implemented using both StarGAN (see Table 2) and AttGAN (see Table 3) as the backbone. The notation is as follows: Conv($c$, $k$, $s$) denotes a convolutional layer with $c$ output channels, kernel size $k$, and stride $s$ (similarly for ConvT, which is a transposed convolutional layer); FC($c$) is a fully-connected layer with output dimension $c$; Res refers to a residual block; IN and BN refer to instance and batch normalisation respectively; LReLU is leaky ReLU; $n$ is the dimension of the attribute vector; $h$ is the height and width of the input images. The only differences separating $G_{seg}$ and $D_{seg}$ from their RGB counterparts $G_{rgb}$ and $D_{rgb}$ are the number of channels of the inputs (of both generator and discriminator), the number of channels of the generator output, and the final activation function of the generator.




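Generator $G_{seg}$
Input shape                   Operations
($n$ + 12, $h$, $h$)          Conv(64, 7, 1), IN, ReLU
(64, $h$, $h$)                Conv(128, 4, 2), IN, ReLU
(128, $h$/2, $h$/2)           Conv(256, 4, 2), IN, ReLU
(256, $h$/4, $h$/4)           {Res, Conv(256, 3, 1), IN, ReLU} $\times$ 6
(256, $h$/4, $h$/4)           ConvT(128, 4, 2), IN, ReLU
(128, $h$/2, $h$/2)           ConvT(64, 4, 2), IN, ReLU
(64, $h$, $h$)                Conv(12, 7, 1, 3), Softmax

Discriminator $D_{seg}$
Input shape                   Operations
(12, $h$, $h$)                Conv(64, 4, 2), LReLU
(64, $h$/2, $h$/2)            Conv(128, 4, 2), LReLU
(128, $h$/4, $h$/4)           Conv(256, 4, 2), LReLU
(256, $h$/8, $h$/8)           Conv(512, 4, 2), LReLU
(512, $h$/16, $h$/16)         Conv(1024, 4, 2), LReLU
(1024, $h$/32, $h$/32)        Conv(2048, 4, 2), LReLU
(2048, $h$/64, $h$/64)        $D^{adv}_{seg}$: Conv(1, 3, 1)
                              $D^{cls}_{seg}$: Conv($n$, $h$/64, 1), Sigmoid

Table 2: Semantic branch architecture (backbone StarGAN). The attribute label is concatenated with the input in the first layer, thus adding $n$ to the number of input channels.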
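The effect of the stride-2 layers on spatial size can be checked with the standard output-size formulas for convolutions and transposed convolutions. The padding values used below (1 for kernel size 4, 3 for kernel size 7) and the input size of 128 are assumptions for this sketch, since the tables list only channels, kernel size, and stride for most layers.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1

def conv_transpose_out(size, kernel, stride, padding):
    """Spatial output size of a transposed convolution (no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel

def trace_discriminator(size, layers=6):
    """Sizes after each Conv(., 4, 2) layer, assuming padding 1."""
    sizes = [size]
    for _ in range(layers):
        sizes.append(conv_out(sizes[-1], kernel=4, stride=2, padding=1))
    return sizes

sizes = trace_discriminator(128)
```

Under these assumptions each Conv(·, 4, 2) halves the spatial size and each ConvT(·, 4, 2) doubles it, so six such convolutions reduce the input size by a factor of 64.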




Generator $G_{seg}$
Input shape                        Operations
(12, $h$, $h$)                     Conv(64, 4, 2), BN, LReLU
(64, $h$/2, $h$/2)                 Conv(128, 4, 2), BN, LReLU
(128, $h$/4, $h$/4)                Conv(256, 4, 2), BN, LReLU
(256, $h$/8, $h$/8)                Conv(512, 4, 2), BN, LReLU
(512, $h$/16, $h$/16)**            Conv(1024, 4, 2), BN, LReLU
(1024 + $n$, $h$/32, $h$/32)       ConvT(1024, 4, 2), BN, ReLU
(1536 + $n$, $h$/16, $h$/16)*      ConvT(512, 4, 2), BN, ReLU
(512, $h$/8, $h$/8)                ConvT(256, 4, 2), BN, ReLU
(256, $h$/4, $h$/4)                ConvT(128, 4, 2), BN, ReLU
(128, $h$/2, $h$/2)                ConvT(12, 4, 2), Softmax

Discriminator $D_{seg}$
Input shape                        Operations
(12, $h$, $h$)                     Conv(64, 4, 2), LReLU
(64, $h$/2, $h$/2)                 Conv(128, 4, 2), LReLU
(128, $h$/4, $h$/4)                Conv(256, 4, 2), LReLU
(256, $h$/8, $h$/8)                Conv(512, 4, 2), LReLU
(512, $h$/16, $h$/16)              Conv(1024, 4, 2), LReLU
(1024, $h$/32, $h$/32)             $D^{adv}_{seg}$: FC(1024), ReLU, FC(1)
                                   $D^{cls}_{seg}$: FC(1024), ReLU, FC($n$), Sigmoid

Table 3: Semantic branch architecture (backbone AttGAN). Unlike StarGAN, the attribute label is injected at the first ConvT layer (in the middle of the generator) rather than at the beginning. The input marked with * has more channels than the output of the previous ConvT layer since our implementation includes a shortcut which concatenates the current input with a previous output (marked with **). The attribute label is also injected at this layer, which adds an additional $n$ to the number of channels.

A.2 Additional qualitative results

We also include additional qualitative results of SeCGAN trained on CelebA, implemented using StarGAN (see Figure 6) and AttGAN (see Figure 7) as backbones respectively. In these figures, each translated image represents an attribute reversal. In the “gender” column, for example, the gender of the original image is reversed. This also applies to the three hair colours, but negating one hair colour does not mean the target hair colour must be one of the other two. For instance, reversing “blonde hair” on a person who already has blonde hair would simply result in a neutral hair colour rather than brown or black.

Figure 6: Qualitative results of our method (StarGAN backbone) (zoom in for a better view).
Figure 7: Qualitative results of our method (AttGAN backbone) (zoom in for a better view).