
PortraitGAN for Flexible Portrait Manipulation

by Jiali Duan, et al.

Previous methods have dealt with discrete manipulation of facial attributes drawn from canonical expressions, such as smiling, sad, angry, or surprised; they do not scale and operate in a single modality. In this paper, we propose a novel framework that supports continuous edits and multi-modality portrait manipulation using adversarial learning. Specifically, we adapt cycle-consistency to the conditional setting by leveraging additional facial landmark information. This has two effects: first, cycle mapping induces bidirectional manipulation and identity preservation; second, paired samples from different modalities can thus be utilized. To ensure high-quality synthesis, we adopt a texture loss that enforces texture consistency and multi-level adversarial supervision that facilitates gradient flow. Quantitative and qualitative experiments show the effectiveness of our framework in performing flexible, multi-modality portrait manipulation with photo-realistic effects.


1 Introduction

Our digital age has witnessed a soaring demand for flexible, high-quality portrait manipulation, not only from smartphone apps but also from the photography industry, e-commerce promotion, movie production, etc. Portrait manipulation has also been extensively studied [34, 5, 8, 18, 1, 33] in the computer vision and computer graphics communities. Previous methods are dedicated to applying makeup [23, 6], performing style transfer [9, 14, 24, 12], age progression [42], and expression manipulation [1, 39], to name a few. However, these methods are tailored to a specific task and cannot be transferred to continuous, general multi-modality portrait manipulation.

Figure 1: Left column: input from the original domain; Middle column: facial landmark of targeted domain; Right column: generated output in the target domain. Note that inverse manipulation is simultaneously obtained.

Recently, generative adversarial networks have demonstrated compelling results in synthesis and image translation [15, 38, 4, 35, 44, 13], among which [44, 40] proposed cycle-consistency for unpaired image translation. In this paper, we extend this idea to a conditional setting by leveraging additional facial landmark information, which is capable of capturing intricate expression changes. This simple and straightforward modification brings two benefits. First, cycle mapping can effectively prevent many-to-one mapping [44, 45], also known as mode collapse. In the context of face/pose manipulation, cycle-consistency also induces identity preservation and bidirectional manipulation, whereas the previous method [1] assumes a neutral face to begin with, and [26, 29] are unidirectional and thus manipulate within the same domain. Second, face images of different textures or styles are considered different modalities, and current landmark detectors do not work on such stylized images. With our design, we can pair samples from multiple domains and translate between each pair of them, thus enabling landmark extraction indirectly on stylized portraits. Our framework can also be extended to makeup/makeup removal, aging manipulation, etc. once corresponding data is collected. Considering the lack of ground-truth data for many face manipulation tasks, we leverage the result of [14] to generate pseudo-targets for learning simultaneous expression and modality manipulation, but these can be replaced with any desired target domains.

However, two main challenges remain for high-quality portrait manipulation. First, we propose to learn a single generator as in StarGAN [7], but StarGAN deals with discrete manipulation and fails on high-resolution images, leaving irremovable artifacts. To synthesize images of photo-realistic quality (512x512), we propose multi-level adversarial supervision inspired by [37, 41], where synthesized images at different resolutions are propagated and combined before being fed into multi-level discriminators. Second, to avoid texture inconsistency and artifacts during translation between different domains, we integrate the Gram matrix [9] into our model as a measure of texture distance, since it is differentiable and can be trained end-to-end using back propagation. Fig. 1 shows the result of our model.

Extensive evaluations show, both quantitatively and qualitatively, that our method is comparable or superior to state-of-the-art generative models in performing high-quality portrait manipulation (see Section 4.2). Our model is bidirectional, which circumvents the need to start from a neutral face or a fixed domain. This feature also ensures stable training and identity preservation, and scales easily to other desired domain manipulations. In the following section, we review related work and point out the differences. Details of PortraitGAN are elaborated in Section 3. We evaluate our approach in Section 4 and conclude the paper in Section 5.

2 Related Work

Face editing

Face editing or manipulation has been widely studied in computer vision and graphics, including face morphing [3], expression edits [32, 21], age progression [16], and facial reenactment [2, 34, 1]. However, these models are designed for a particular task and thus rely heavily on domain knowledge and certain assumptions. For example, [1] assumes neutral and frontal faces to begin with, while [34] assumes the availability of target videos with variation in both poses and expressions. Our model differs from them in that it is a data-driven approach that does not require domain knowledge and is designed to handle general face manipulations.

Image translation

Our work can be categorized as image translation with generative adversarial networks [13, 43, 4, 11, 22, 40, 37], whose goal is to learn a mapping $G: X \rightarrow Y$ that induces a distribution indistinguishable from that of the target domain $Y$, through adversarial training between a generator $G$ and a discriminator $D$. For example, Isola et al. [13] take an image as the condition for general image-to-image translation trained on paired samples. Later, Zhu et al. [44] extend [13] by introducing a cycle-consistency loss to obviate the need for matched training pairs. In addition, it alleviates many-to-one mapping, also known as mode collapse, when training generative adversarial networks. Inspired by this, we integrate this loss into our model for identity preservation between different domains.

Another seminal work that inspired our design is StarGAN [7], where target facial attributes are encoded into a one-hot vector. In StarGAN, each attribute is treated as a different domain, and an auxiliary classifier used to distinguish these attributes is essential for supervising the training process. Different from StarGAN, our goal is to perform continuous edits in the pixel space that cannot be enumerated with discrete labels. This implies a smooth and continuous latent space where each point encodes a meaningful axis of variation in the data. We treat different style modalities as domains in this paper and use the two words interchangeably. In this sense, applications such as beautification/de-beautification, aging/de-aging, and adding/removing a beard can also be included in our general framework. We compare our approach against CycleGAN [44] and StarGAN [7] in Section 4 and describe our design in more detail in Section 3.

Pose image generation

We are aware of works that use pose as a condition for person image generation in the task of person re-identification [36, 20, 31, 29]. For example, [26] concatenates one-hot pose feature maps in a channel-wise fashion to control pose generation, similar to [30], where keypoints and segmentation masks of birds are used to manipulate their locations and poses. To synthesize more plausible human poses, Siarohin et al. [31] develop deformable skip connections and compute a set of affine transformations to approximate joint deformations. These works share some similarity with ours, as both facial landmarks and human skeletons can be seen as a form of pose representation. However, all of those works deal with manipulation in the original domain and do not preserve identity. Moreover, the generated results in those works are low-resolution, whereas our model can successfully generate 512x512 images with photo-realistic quality.

Style transfer

Neural style transfer was first proposed by Gatys et al. [9]. The idea is to preserve content from the original image and mimic the “style” of the reference image. We adopt the Gram matrix in our model to enforce texture consistency and replace L-BFGS iteration with back propagation for end-to-end training. Also, considering the lack of groundtruth data for many face manipulation tasks, we apply a fast neural style transfer algorithm [14] to generate pseudo-targets for multi-modality manipulation. Note that our model is easily extensible to any desired target domain with the current design unchanged.

Figure 2: Overview of training pipeline: In the forward cycle, original image $I_A$ is first translated to $I_B$ given target emotion $L_B$ and modality $c$, and then mapped back to $I_A$ given condition pair ($L_A$, $c'$) encoding the original image. The backward cycle follows a similar manner, starting from $I_B$ but with opposite condition encodings, using the same generator $G$. Identity preservation and texture consistency are explicitly modeled in our loss design.

3 Proposed Method

3.1 Overall Framework

Problem formulation

Given domains $X_1, X_2, \dots, X_n$ of different modalities, our goal is to learn a single general mapping function

$$G: X_i \rightarrow X_j, \quad \forall i, j \in \{1, 2, \dots, n\} \qquad (1)$$

that transforms $I_A$ from domain $A$ to $I_B$ in domain $B$ with continuous shape edits (Figure 1). Equation 1 also implies that $G$ is bidirectional given desired conditions. We use facial landmark $L_j$ to denote facial expression in domain $j$. Facial expressions are represented as a vector of $N$ 2D keypoints, where each point $u_i = (x_i, y_i)$ is the $i$th pixel location in $L_j$. We use attribute vector $c$ to represent the target domain. Formally, our input/output are tuples of the form $(I_A, L_B, c)$ / $(I_B, L_A, c')$.
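A minimal sketch of how a set of 2D keypoints could be turned into a single-channel landmark map. The function name `rasterize_landmarks` and the choice to mark each keypoint as a single pixel are assumptions made for illustration; the text later refers to landmark boundaries with a width, which this sketch does not draw.

```python
import numpy as np

def rasterize_landmarks(points, height, width):
    """Draw N (x, y) keypoints into a 1 x H x W binary map (hypothetical helper)."""
    canvas = np.zeros((1, height, width), dtype=np.float32)
    # Round sub-pixel coordinates to the nearest pixel.
    pts = np.rint(np.asarray(points, dtype=np.float64)).astype(int)
    # Keep only keypoints that fall inside the image.
    inside = (
        (pts[:, 0] >= 0) & (pts[:, 0] < width)
        & (pts[:, 1] >= 0) & (pts[:, 1] < height)
    )
    xs, ys = pts[inside, 0], pts[inside, 1]
    canvas[0, ys, xs] = 1.0  # row index is y, column index is x
    return canvas

# Toy example: three keypoints, the last one outside an 8x8 image.
keypoints = [(2.2, 1.0), (5.0, 6.7), (40.0, 3.0)]
landmark_map = rasterize_landmarks(keypoints, height=8, width=8)
```

Out-of-bounds keypoints are dropped rather than clipped, so a landmark that leaves the frame does not pile up on the image border.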

Model architecture

The overall pipeline of our approach is straightforward and is shown in Figure 2. It consists of three main components: (1) A generator $G$, which renders an input face in domain $A$ as the same person in another domain $B$ given conditional facial landmarks. $G$ is bidirectional and is reused in both the forward and the backward cycle: it first maps $I_A \rightarrow I_B$ given $(L_B, c)$ and then maps back $I_B \rightarrow I_A$ given the conditional pair $(L_A, c')$. (2) A set of discriminators $D_i$ at different levels of resolution that distinguish generated samples from real ones. Instead of mapping the input to a single scalar that signifies “real” or “fake”, we adopt PatchGAN [44], which uses a fully convolutional network that outputs a matrix where each element $M_{i,j}$ represents the probability that the overlapping patch $(i, j)$ is real. If we trace back to the original image, each output has a $70 \times 70$ receptive field. (3) A loss function that takes into account identity preservation and texture consistency between different domains. In the following subsections, we elaborate on each module individually and then combine them to construct PortraitGAN.
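The receptive field of a patch discriminator can be checked with a short calculation. The sketch below assumes the layer layout commonly used for a 3-layer PatchGAN (three 4x4 stride-2 convolutions followed by two 4x4 stride-1 convolutions); this layout is an assumption, not a detail given in the text.

```python
def receptive_field(layers):
    """Receptive field of one output unit.

    layers: list of (kernel_size, stride) tuples, ordered input -> output.
    """
    rf = 1
    # Walk backwards from the output, growing the field layer by layer.
    for kernel, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf

# Assumed 3-layer PatchGAN layout: (kernel, stride) per convolution.
patchgan_layers = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]
print(receptive_field(patchgan_layers))  # 70
```

Under this assumed layout, each element of the discriminator output judges a 70x70 patch of the input, and neighbouring elements see overlapping patches.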


3.2 Base Model

To begin with, we consider manipulation of emotions within the same domain, i.e. $I_A$ and $I_B$ are of the same texture and style but have different face shapes, denoted by facial landmarks $L_A$ and $L_B$. In this scenario, it is sufficient to incorporate only the forward cycle, and the conditional vector $c$ is not needed. The adversarial loss conditioned on facial landmarks follows Equation 2.


A face verification loss is desired to enforce identity consistency between the input face and the generated face. However, in our experiments, we find an $\ell_1$ loss to be enough, and it is better than an $\ell_2$ loss as it alleviates blurry output and acts as an additional regularization [13].


The overall loss is a combination of the adversarial loss and the $\ell_1$ loss, weighted by $\lambda$.


3.3 Multi-level Adversarial Supervision

Manipulation at the landmark level requires high-resolution synthesis, which is challenging for generative adversarial networks. This is because training the whole system consists of optimizing two individual networks, where each update to either component can change the entire equilibrium. We first describe how we improve the training process and then propose a novel multi-level adversarial supervision that expedites training convergence.

Here we use two major strategies for improving generation quality and training stability. The first is to provide additional constraints on the training process. For example, the facial landmark here can be seen as a constraint on generation, much like the one-hot vector in cGAN [28]. We also adopt a feature matching loss in our framework that minimizes the distance between the learned feature representations of real samples and fake ones.


In Equation 5, the real face is randomly chosen from a pool that queues authentic samples, similar to the experience replay strategy used in reinforcement learning. The discriminator acts as a feature extraction function that “passes” its strong feature representation to the relatively weak generator. Our generator has an encoder-decoder structure with residual blocks [10] in the middle.
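As a rough sketch of the feature matching idea, the snippet below compares batch-averaged features of real and generated samples, with the real batch drawn at random from a pool. The squared L2 distance between batch means is an assumed form of the loss, and the random arrays stand in for discriminator features; neither is taken from the text.

```python
import numpy as np

def feature_matching_loss(real_feats, fake_feats):
    """Squared L2 distance between batch-mean feature vectors (assumed form)."""
    diff = real_feats.mean(axis=0) - fake_feats.mean(axis=0)
    return float(np.sum(diff ** 2))

rng = np.random.default_rng(0)

# Hypothetical pool of features from authentic samples (50 samples, 16 dims).
pool = rng.normal(size=(50, 16))

# Draw a random real batch from the pool, in the spirit of experience replay.
batch_idx = rng.choice(len(pool), size=8, replace=False)
real_feats = pool[batch_idx]

# Stand-in for features of generated samples, shifted away from the real ones.
fake_feats = rng.normal(loc=0.5, size=(8, 16))

loss = feature_matching_loss(real_feats, fake_feats)
```

The loss is zero when the two batches have identical mean features, so minimizing it pulls the generator's feature statistics toward those of real samples.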

Our second strategy is to provide fine-grained guidance by incorporating multi-level adversarial supervision. Cascaded upsampling layers in $G$ are connected with auxiliary convolutional branches to provide images at different scales, indexed by $i = 1, \dots, m$, where $m$ is the number of upsampling blocks. Images generated at intermediate stage $i$, together with correspondingly downsampled images from the last stage, are fed into discriminator $D_i$, which is trained to distinguish real samples from generated ones by minimizing the following loss,


where the real sample is drawn from the real distribution and the generated sample from the model distribution at scale $i$, over all possible values of $i$. The auxiliary branches at different stages of generation provide more gradient signals for training the whole network, hence the name multi-level adversarial supervision. Compared to [7, 41], discriminators responsible for different levels are optimized as a whole rather than individually for each level, leading to a faster training process. The increased discriminative ability of $D$ in turn provides further guidance when training $G$, and the two are alternately optimized until convergence (Equation 7).
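The sketch below illustrates the two ingredients of multi-level supervision under stated assumptions: an image pyramid built by 2x average pooling, and a discriminator loss summed over levels so that all levels are optimized together. The least-squares form follows the loss named in the implementation details; the pooling scheme, the function names, and the toy scores are illustrative assumptions.

```python
import numpy as np

def downsample2x(img):
    """Average-pool a C x H x W image by a factor of 2 (H and W must be even)."""
    c, h, w = img.shape
    return img.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def pyramid(img, levels):
    """Return [full resolution, 1/2 resolution, ...] with `levels` entries."""
    out = [img]
    for _ in range(levels - 1):
        out.append(downsample2x(out[-1]))
    return out

def lsgan_d_loss(real_scores, fake_scores):
    """Least-squares discriminator loss: real -> 1, fake -> 0."""
    return 0.5 * np.mean((real_scores - 1.0) ** 2) + 0.5 * np.mean(fake_scores ** 2)

def multi_level_d_loss(score_pairs):
    """Sum the per-level losses so all levels are optimized as a whole."""
    return float(sum(lsgan_d_loss(r, f) for r, f in score_pairs))

# Two-level pyramid of a toy 3 x 8 x 8 image.
levels = pyramid(np.ones((3, 8, 8)), levels=2)

# Toy (real, fake) score pairs for two discriminator levels.
perfect = [(np.ones(4), np.zeros(4))] * 2
confused = [(np.full(4, 0.5), np.full(4, 0.5))] * 2
```

Here `multi_level_d_loss(perfect)` is zero, while the `confused` discriminators, which output 0.5 for everything, incur a loss at every level.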

3.4 Texture consistency

When translating between different modalities at high resolution, texture differences become easy to observe. To enforce texture consistency, we adapt the style loss proposed in [9], replacing L-BFGS iterative optimization with end-to-end back propagation. Formally, let $F^l_i$ be the vectorized $i$th feature map extracted from image $I$ by neural network $\phi$ at layer $l$. The Gram matrix $\mathcal{G}^l \in \mathbb{R}^{N_l \times N_l}$ is defined as

$$\mathcal{G}^l_{ij} = \sum_k F^l_{ik} F^l_{jk}$$

where $N_l$ is the number of feature maps at layer $l$ and $F^l_{ik}$ is the $k$th element in the $i$th feature vector. The Gram matrix can be seen as a measure of the correlation between feature maps $i$ and $j$; its size depends only on the number of feature maps, not on the size of $I$. For images $I_A$ and $I_B$, the texture loss at layer $l$ is


We observe a clear improvement in texture quality for modality manipulation, as evaluated in Section 4.2. We use a pretrained VGG19 in our experiments, with its parameters frozen during updates.
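A small NumPy sketch of the Gram-matrix computation and a texture distance built on it. The random arrays stand in for VGG19 activations, and the mean-squared normalization in `texture_loss` is an assumption; only the Gram matrix itself follows the definition in the text.

```python
import numpy as np

def gram_matrix(features):
    """features: C x H x W activation -> C x C Gram matrix."""
    c = features.shape[0]
    flat = features.reshape(c, -1)  # row i is the vectorized i-th feature map
    return flat @ flat.T            # entry (i, j) = sum_k F_ik * F_jk

def texture_loss(feat_a, feat_b):
    """Mean squared difference between Gram matrices (normalization assumed)."""
    diff = gram_matrix(feat_a) - gram_matrix(feat_b)
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
# Stand-ins for activations of two images with different spatial sizes.
feat_small = rng.normal(size=(4, 8, 8))
feat_large = rng.normal(size=(4, 16, 16))

# Both Gram matrices are 4 x 4, so images of different sizes can be compared.
loss = texture_loss(feat_small, feat_large)
```

Because every operation is a matrix product or an elementwise difference, the same computation is differentiable in an autograd framework, which is what allows it to be trained with back propagation rather than L-BFGS.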

3.5 Going Beyond: Bidirectional Transfer

Bringing all the pieces together, we are now ready to extend our Base Model in Section 3.2 to PortraitGAN by incorporating bidirectional mapping and the conditional vector $c$, which represents the target domain. Equation 2 now becomes


The forward and backward cycles encourage a one-to-one mapping between different modalities and thus help preserve identity,


where $c$ and $c'$ encode different modalities. Therefore, only one set of generator/discriminator is used for bidirectional manipulation. We find that both the forward and the backward cycle are essential for translation between domains, which is consistent with the observation in [44]. The other loss terms can be written in a similar fashion, and below is our full objective,


where $\lambda_{cyc}$, $\lambda_{FM}$, $\lambda_{id}$, $\lambda_{tex}$ control the weights of the cycle-consistency loss, feature matching loss, identity loss, and texture loss, respectively.
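To make the weighting concrete, the sketch below combines already-computed loss values into a single scalar. It assumes the adversarial term enters with unit weight and uses the four weights reported in the implementation details (2, 10, 5, 10); the loss values in `example` are made-up numbers for illustration only.

```python
# Weights for cycle-consistency, feature matching, identity and texture losses,
# as reported in the implementation details.
WEIGHTS = {"cyc": 2.0, "fm": 10.0, "id": 5.0, "tex": 10.0}

def total_generator_loss(losses, weights=WEIGHTS):
    """Adversarial loss (assumed unit weight) plus the weighted auxiliary losses."""
    total = losses["gan"]
    for name, weight in weights.items():
        total += weight * losses[name]
    return total

# Made-up per-term values, only to show how the weights scale each term.
example = {"gan": 0.5, "cyc": 0.1, "fm": 0.02, "id": 0.04, "tex": 0.01}
objective = total_generator_loss(example)
```

With these weights, a texture or feature matching term contributes ten times its raw value, so small raw values for those terms can still dominate the gradient signal.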

4 Experimental Evaluation

Implementation Details

Each training step takes as input a tuple of four images $I_A$, $L_A$, $I_B$, $L_B$ randomly chosen from the possible modalities of the same identity. The attribute conditional vector, represented as a one-hot vector, is replicated spatially before channel-wise concatenation with the corresponding image and facial landmarks. Our generator uses 4 stride-2 convolution layers, followed by 9 residual blocks and 4 stride-2 transpose convolutions, while the auxiliary branch uses one-channel convolution for fusion of channels. We use two 3-layer PatchGAN [44] discriminators for multi-level adversarial supervision and the Least Square loss [27] for stable training. We set $\lambda_{cyc}$, $\lambda_{FM}$, $\lambda_{id}$, $\lambda_{tex}$ to 2, 10, 5, 10 for evaluation. Training PortraitGAN takes about 50 hours on a single Nvidia 1080 GPU.
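The spatial replication and channel-wise concatenation described above can be sketched as follows. The function name `build_generator_input`, the channel order (image, landmark map, attribute planes), and the toy sizes are assumptions for illustration.

```python
import numpy as np

def build_generator_input(image, landmark_map, domain_index, num_domains):
    """Stack image (3 x H x W), landmark map (1 x H x W) and a tiled one-hot vector."""
    _, h, w = image.shape
    one_hot = np.zeros(num_domains, dtype=image.dtype)
    one_hot[domain_index] = 1.0
    # Replicate each entry of the one-hot vector over the whole H x W plane.
    tiled = np.broadcast_to(one_hot[:, None, None], (num_domains, h, w))
    return np.concatenate([image, landmark_map, tiled], axis=0)

image = np.zeros((3, 4, 4), dtype=np.float32)
landmark_map = np.zeros((1, 4, 4), dtype=np.float32)
x = build_generator_input(image, landmark_map, domain_index=1, num_domains=2)
# x has 3 + 1 + 2 = 6 channels; the last plane is all ones for domain 1.
```

Switching the target modality only changes which attribute plane is filled with ones, which is why bidirectional transfer needs just a change of the conditional vector.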


Training and validation: The Radboud Faces Database [19] contains 4,824 images of 67 participants, each performing 8 canonical emotional expressions: anger, disgust, fear, happiness, sadness, surprise, contempt, and neutral. The iCV Multi-Emotion Facial Expression Dataset [25] is designed for micro-emotion recognition (5184x3456 resolution) and includes 31,250 facial expressions covering 50 different emotions. Testing: We collect 20 high-resolution videos from YouTube (abbreviated as the HRY Dataset) of people giving speeches or addresses. For the above datasets, we use dlib [17] for facial landmark extraction and a neural style transfer algorithm [14] for generating portraits of multiple modalities. Note that during testing, groundtruths are used only for evaluation purposes.

4.1 Quantitative Evaluation

In this section, we evaluate our model quantitatively using both evaluation metrics and subjective user study.

Evaluation metrics

CycleGAN [44] can only translate between two domains. StarGAN [7] extends CycleGAN with a single generator but requires an additional classifier for supervision and therefore only works on discrete domains. In comparison, our model is the only one that also enables continuous editing. We randomly choose 368 images with different identities and expressions from the HRY dataset to evaluate manipulation from the natural modality to a single stylized modality. For a fair comparison, we retrain 512x512 versions of CycleGAN and StarGAN with the domain dimension set to two. For PortraitGAN, we keep the extracted landmarks unchanged during evaluation.

| Method | MSE | SSIM | Inference time (s) |
| --- | --- | --- | --- |
| CycleGAN | 0.028 | 0.473 | 0.365 |
| StarGAN | 0.029 | 0.483 | 0.277 |
| PortraitGAN | 0.025 | 0.517 | 0.290 |

Table 1: Quantitative evaluation on modality manipulation task. For MSE, the lower the better; for SSIM, the higher the better.

From Table 1, CycleGAN and StarGAN are close to each other in terms of MSE and SSIM. Our model achieves the best MSE and SSIM scores while maintaining fast inference, slightly slower than StarGAN and faster than CycleGAN. Note that these metrics are only indicative, since mean squared error may not correspond faithfully to human perception [44].
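For reference, the two metrics can be computed along the following lines. This is a simplified sketch: SSIM is normally computed over local windows and averaged, whereas `global_ssim` below treats the whole image as one window, and the images are random stand-ins scaled to [0, 1]. It is not the evaluation code behind Table 1.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two images of the same shape."""
    return float(np.mean((a - b) ** 2))

def global_ssim(a, b, data_range=1.0):
    """Single-window SSIM (a simplification of the usual windowed SSIM)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = np.mean((a - mu_a) * (b - mu_b))
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return float(numerator / denominator)

rng = np.random.default_rng(0)
target = rng.random((16, 16))
# A noisy copy of the target, clipped back to the valid range.
noisy = np.clip(target + rng.normal(scale=0.1, size=target.shape), 0.0, 1.0)

error = mse(target, noisy)
similarity = global_ssim(target, noisy)
```

Lower MSE and higher SSIM both indicate a closer match to the target, which is the direction of improvement stated in the table caption.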

| Method | 1st round (%) | 2nd round (%) | Average (%) |
| --- | --- | --- | --- |
| StarGAN | 29.5 | 31.5 | 30.5 |
| CycleGAN | 32.5 | 31.5 | 32.0 |
| Ours | 38.0 | 37.0 | 37.5 |

Table 2: Subjective ranking for different models based on perceptual evaluation of modality manipulation performance.

Subjective user study

As pointed out in [13], traditional metrics can be biased when evaluating GANs; we therefore adopt the same evaluation protocol as in [13, 44, 37, 7] for a human subjective study on natural to single stylized modality manipulation. Due to limited resources, we collect responses from two users based on their preferences among the images displayed in each group, in terms of perceptual realism and how well the identity of the original figure is preserved. Each group consists of one input photo and three randomly shuffled manipulated images generated by CycleGAN [44], StarGAN [7] and our proposed PortraitGAN with landmarks unchanged (see Section 4.2 for more details). There are 100 images in total, and each user is asked to rank the three methods on each image twice. Our method obtains the best score among the three methods, as shown in Table 2.
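The averages in Table 2 can be recomputed from the two rounds with a few lines. The numbers are copied from the table; the variable names are illustrative.

```python
# Percentages per method for the 1st and 2nd round, copied from Table 2.
rounds = {
    "StarGAN": (29.5, 31.5),
    "CycleGAN": (32.5, 31.5),
    "Ours": (38.0, 37.0),
}

# Average of the two rounds for each method.
averages = {method: sum(scores) / len(scores) for method, scores in rounds.items()}

# Column totals: the percentages in each round add up to 100.
round_totals = [sum(scores[i] for scores in rounds.values()) for i in range(2)]

best = max(averages, key=averages.get)
```

The recomputed averages match the last column of the table, and each round's percentages sum to 100, so the three scores in a round partition the same set of responses.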

4.2 Qualitative Evaluation

In this section, we conduct an ablation study and validate the effectiveness of our design for continuous editing. We also compare against state-of-the-art generative models on the tasks of continuous shape editing and simultaneous shape and modality manipulation. Our framework can be easily adapted to any desired modality once corresponding data is acquired. Finally, we show some manipulation cases produced with our interactive interface.

Ablation study

Each component is crucial for the proper performance of the system: the cycle-consistency and identity losses for identity preservation between modalities, the adversarial and feature matching losses for high-resolution generation, and the texture loss for texture consistency. Removing any of these elements would degrade our network. For example, Figure 3 shows the effect of multi-level adversarial supervision. As can be seen, the result generated with this component displays better perceptual quality, with more high-frequency details. Texture quality would be compromised without the texture loss (Figure 9). Last but not least, bidirectional cycle-consistency eliminates the need for the classifier used in [7] for multi-domain manipulation.

Figure 3: Effect of multi-level adversarial supervision. Left/Right: without/with multi-level adversarial supervision.

Continuous shape editing

Figure 4 shows interpolated expressions of our model on Rafd, which go beyond its original 8 canonical expressions. Note that CycleGAN cannot transfer within the same domain. On the iCV dataset, we train StarGAN on 50 discrete micro-emotions, but it collapsed (Figure 6). This is because StarGAN requires a strong classification loss for supervision, which is hard to obtain on the iCV dataset. Our model, on the other hand, operates in a continuous space that captures subtle variations of face shapes. Another intriguing observation is that the boundary width of the landmark does not have an obvious influence on the output (Figure 5). More results are available in Figure 9 (columns 1-3).
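One simple way to obtain intermediate expressions is to interpolate between two sets of landmark keypoints and feed each intermediate set to the generator as the condition. The sketch below uses linear interpolation, which is an assumption (the text does not state the interpolation scheme), and the two keypoint sets are toy values.

```python
import numpy as np

def interpolate_landmarks(src, dst, steps):
    """Linearly interpolate between two N x 2 keypoint arrays (endpoints included)."""
    src = np.asarray(src, dtype=np.float64)
    dst = np.asarray(dst, dtype=np.float64)
    alphas = np.linspace(0.0, 1.0, steps)
    return [(1.0 - a) * src + a * dst for a in alphas]

# Toy example: two mouth-corner keypoints moving from neutral to raised.
neutral = [(10.0, 20.0), (30.0, 20.0)]
smile = [(8.0, 16.0), (32.0, 16.0)]
frames = interpolate_landmarks(neutral, smile, steps=5)
# frames[0] equals the neutral shape, frames[-1] the smiling shape.
```

Each intermediate frame is a valid landmark condition that does not correspond to any discrete expression label, which is what a label-conditioned model cannot express.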

Figure 4: Expression interpolation using our proposed model.
Figure 5: Photo to photo manipulation results on Rafd (2nd row) and iCV dataset (3rd row).
Figure 6: StarGAN collapsed directly when manipulating 50 micro-emotions discretely.
Figure 7: Outputs from our interactive interface where landmarks are manipulated manually. First two columns show inputs and automatically detected landmarks; 4-6th column are outputs of manual manipulation. 1st row conducts face-slimming, 2nd row closes the eyes and mouth simultaneously.

Simultaneous shape and modality manipulation

Simultaneous shape and modality manipulation on the HRY dataset is shown in Figure 9 (columns 4-8). On close inspection, our model is capable of hallucinating teeth (1st row) and capturing details such as earrings (5th row). If the landmark is fixed, our model acts as a modality transfer model, except that it can achieve bidirectional modality transfer with a minor change of the attribute conditional vector $c$.

To compare our approach with CycleGAN [44] and StarGAN [7], we use the following pipeline: Given an image pair {$I_A$, $I_B$} from domains $A$ and $B$, CycleGAN translates $I_A$ to an output that has the content of $I_A$ and the modality of $I_B$. This can be achieved with our approach by keeping the landmark unchanged. Similarly, we treat modalities as visual attributes and train StarGAN accordingly. Figure 8 shows comparison results of our approach with CycleGAN and StarGAN. As can be seen, our results are much sharper and more visually appealing than those of StarGAN and CycleGAN. This is because PortraitGAN leverages the texture loss to generate more coherent textures.

Figure 8: Comparison of StarGAN [7], CycleGAN [44] and PortraitGAN. As can be seen, PortraitGAN exhibits perceptually better texture quality. Zoom in to see differences in texture: compared to StarGAN and CycleGAN, our model generates more coherent textures.

Interactive user interface

Compared to discrete conditional labels, facial landmarks give full freedom for continuous shape edits. To test the limits of our model, we develop an online interactive editing interface, where users can manipulate facial landmarks manually and evaluate the model directly. This proves to be more challenging than landmark interpolation, as these edits may go far beyond the normal expressions in the training set. Figure 7 shows some interesting results. As can be seen, our model can successfully perform simultaneous face-slimming and modality manipulation from input of the original modality. In addition, our model supports bidirectional manipulation among modalities, owing to the design of bidirectional mapping.

5 Conclusion

Simultaneous shape and multi-modality portrait manipulation at high resolution is not an easy task. In this paper, our proposed PortraitGAN pushes the limit of cycle consistency by incorporating additional facial landmarks and an attribute vector as conditions. For bidirectional mapping, we use only one generator, similar to [7], but with a different training scheme. This enables us to perform multi-modality manipulations simultaneously and in a continuous manner. We validate our approach with expression interpolation and different style modalities. For better image quality, we adopt multi-level adversarial supervision to provide stronger guidance during training, where generated images at different scales are combined and propagated to discriminators at the corresponding scales. We also leverage a texture loss to enforce texture consistency among modalities. However, due to the lack of data in many face manipulation tasks, modality manipulation beyond style transfer is not presented. Nonetheless, our proposed framework is a step towards interactive manipulation and could be extended to manipulation across more modalities once corresponding data is obtained, which we leave as future work.

Figure 9: More results for continuous shape edits and simultaneous shape and modality manipulation results by PortraitGAN.