Generating realistic face images is an active research topic in image synthesis. Recent techniques such as StyleGAN (Karras et al., 2019, 2020) allow random generation of realistic face images. However, many of them lack direct control of facial geometry or appearance. Adding conditions to the generation process is necessary to generate specific face images of interest, since face images are results of multiple factors, including geometry, appearance, head pose, viewpoint, etc.
Sketch-based conditions have been commonly used in existing image-to-image translation techniques (Isola et al., 2017; Wang et al., 2018), since sketching provides an easy and expressive means to depict desired geometry, including shape contours and salient surface edges. Multiple sketch-to-image techniques (Chen and Hays, 2018; Sangkloy et al., 2017; Li et al., 2020; Chen et al., 2020) have been proposed to generate realistic face images from edge maps or even rough sketches. However, since it is difficult to infer appearance information from sketches alone, these techniques offer limited control over appearance during the generation process. On the other hand, various techniques (Kim et al., 2020; Kolkin et al., 2019; Yoo et al., 2019) generate realistic faces whose appearance is as similar as possible to that of given reference images. However, such techniques focus on global appearance transfer and often fail to preserve the geometric details of source images. Recent image disentangling works (Park et al., 2020; Kim et al., 2020) offer promising frameworks for decoupled control of multiple attributes such as geometry and appearance. However, disentangling techniques designed for general image types do not provide optimal results for face generation: they treat images holistically and do not provide the detailed, localized control that is essential for face image editing.
In this work, we present DeepFaceEditing, a novel face generation and editing framework that allows disentangled control of geometry and appearance (Fig. 1). The key enabler is a structured disentanglement framework specifically designed for face images. Observing that faces have a fixed structure of facial components, we adopt a local-to-global framework, consisting of two modules: Local Disentanglement (LD) and Global Fusion (GF). With paired images and sketches, the LD module disentangles facial components into two independent representations: geometry and appearance. The former encodes the geometry information including shapes of face and facial components as well as salient geometric details such as wrinkles. This can be specified explicitly as a sketch or implicitly as a geometry reference image. In contrast, the appearance representation encodes the information related to color, material, lighting conditions, etc. The GF module is trained to fuse local representations to obtain globally consistent face images. Unlike many other forms of disentanglement, our disentanglement makes the resulting representations easy to manipulate, and in particular, the geometry is represented using sketches, which can be edited intuitively for both overall shapes and details.
To ensure reliable disentanglement, our geometry space is designed to be a shared space of both sketches and images. In this way, we can not only make use of the geometry information in a geometry reference image but also develop a sketch-based interface for intuitive geometry manipulation. Figure 1 shows representative editing examples achieved with our interface, which supports decoupled control of appearance and geometry. We will also show the applications of our technique to face generation and face image style transfer.
We perform extensive qualitative and quantitative experiments, which show that disentangled control of geometry and appearance allows flexible face image generation and editing, and our approach outperforms state-of-the-art methods for various applications.
The main contributions of our work are summarized as follows:
We present a structured disentanglement framework for face image representation and synthesis, which ensures geometry and appearance are well decoupled and can be manipulated separately.
To achieve this, our network architecture involves local disentanglement (LD) modules for individual facial components to make learning more efficient, and a global fusion (GF) module for effective fusion of feature maps generated by LD modules. The LD modules are designed to embed both images and sketches of local regions to a shared space to ensure appropriate disentanglement, and allow sketches to be used as an explicit representation for the geometry, which is essential for detailed editing.
We present a novel interactive system that supports real-time editing of portrait images, which allows detailed editing of the geometry through an intuitive sketch-based interface, as well as modifying the appearance by providing an appearance reference image.
2. Related Work
Our work is closely related to several topics, including neural face image synthesis, neural face image editing, neural image disentanglement, and style transfer.
2.1. Neural Face Image Synthesis
In recent years, generative models such as Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) have been widely used in face generation. They are capable of generating realistic face images from random samples of a Gaussian distribution (Karras et al., 2019). However, these methods are not specifically designed to control the details or appearance of faces explicitly. Several solutions (Li et al., 2016; Richardson et al., 2020) have been proposed to control image synthesis with specific conditions by using conditional GANs (Mirza and Osindero, 2014). Also inspired by conditional GANs, Lee et al. (2020a) and Gu et al. (2019) take a semantic mask as input and synthesize a new face based on a reference portrait. Instead of directly feeding a semantic layout as input, Park et al. (2019) propose spatially-adaptive normalization to progressively inject semantic information and achieve better visual quality for scene images.
Such methods support controllable image editing by editing semantic masks, but lack precise control of details within label maps. Another user-friendly interactive mode of geometry control is sketching. Compared with label maps, sketches can be easily drawn by users and artists, and thus have often been adopted for depicting the shape of desired content (Chen et al., 2009). Based on Pix2Pix (Isola et al., 2017) for general image-to-image translation, SketchyGAN (Chen and Hays, 2018) and LinesToFacePhoto (Li et al., 2019) synthesize realistic images by converting sketches to distance maps. On the basis of pix2pixHD (Wang et al., 2018), an extension of Pix2Pix for generating high-resolution images, Li et al. (2020) further introduce DeepFacePencil, a novel sketch-based face image synthesis framework that is robust to hand-drawn sketches. To produce high-quality faces from rough or incomplete sketches, Chen et al. (2020) present DeepFaceDrawing, which takes a local-to-global approach and leverages manifold projection to enhance the generation quality and robustness from freehand sketches. Besides 2D image generation, Han et al. (2017) develop a sketching system for 3D face and caricature modeling. The above works use sketches to depict target geometry, and have little or no control of appearance during editing. However, appearance plays a central role in realistic image generation. Sangkloy et al. (2017) propose to use additional color strokes to indicate preferred colors for objects. While they achieve impressive results, the synthesized images still exhibit various artifacts such as color leaking. For generating high-quality face images, it is essential to represent the complex appearance, which is difficult to achieve by only using color strokes. Painting color strokes in addition to the geometry sketches also requires extra effort, and the consistency between different color strokes must be maintained for visual realism. It is thus not straightforward to extend the method by Sangkloy et al. (2017) to generate realistic face images from freehand drawings and through interactive editing.
Sketches can be used to depict not only shape contours (similar to label maps) but also fine-level details like wrinkles on faces. However, as discussed previously, sketches themselves cannot support appearance control. To address this issue, Zhang et al. (2020) and Lee et al. (2020c) propose exemplar-based image translation methods to generate photo-realistic images by learning dense cross-domain correspondence. These methods could be applied to different datasets including face edge maps but the face edge maps used in their work are not freehand sketches. Liu et al. (2020) further present an exemplar-based image synthesis method with freehand sketches. The line-sketch generator is adopted to produce multiple sketches to improve the model’s robustness and thus handle freehand sketches. However, these methods are not specifically designed for photo-realistic face image synthesis considering the face structures. Robust high-quality face synthesis methods with detailed control using freehand sketches and appearance references are still missing.
2.2. Neural Face Editing
Face image editing is a long-standing topic that enables flexible human-computer interaction, and its results have improved significantly since the emergence of deep neural networks. Here, we mainly review the neural-network-based approaches. These approaches can be further grouped into three categories according to the intermediary they provide to the user for editing, namely direct editing, semantic-mask-guided editing and sketch-based editing. For direct editing, Brock et al. (2016) propose an introspective adversarial network (IAN), which turns rough painting strokes into user-desired photo-realistic images by directly manipulating images. For semantic-mask-guided editing, pioneering works (Gu et al., 2019; Lee et al., 2020a; Zhu et al., 2020) incorporate semantic masks with conditional GANs (Mirza and Osindero, 2014) for interactive editing of face images, as discussed in the previous subsection. For the sketch-based methods, FaceShop (Portenier et al., 2018) introduces a face image editing system to synthesize realistic images according to user inputs: masks, simple sketches, and colored strokes. DeepFillv2 (Yu et al., 2019a) introduces gated convolution to learn dynamic features and presents a patch-based GAN discriminator to generate high-quality inpaintings in a free-form manner. It is also applied to face editing under the guidance of input sketches. Based on U-Net (Ronneberger et al., 2017), Jo and Park (2019) present an extra style loss to generate more robust and higher quality results, requiring minimal effort from users. While these methods can generate high-quality edited faces with guiding colored strokes and simple sketches within local areas, our method further supports editing of faces in both local and global manners. In addition, since these methods treat face editing as an image completion problem with sketch guidance, they can only generate images based on the appearance of input images (i.e., they do not support using the appearance of other images). Meanwhile, as demonstrated by Jo and Park (2019), non-photorealistic artifacts could be generated from sketches and color strokes alone. Recently, Yang et al. (2020) present another face editing method based only on sketches. With a sketch refinement strategy, their method is robust to human-drawn sketches but lacks control of face appearance.
2.3. Neural Image Disentanglement
Learning a disentangled representation aims at modeling the factorization of data variation. Previous works have introduced disentanglement into image-to-image translation to facilitate multi-domain translation or to allow the manipulation of certain image attributes while retaining others. For example, Liu et al. (2018) disentangle images into domain-invariant representations to generate realistic results across multiple domains by changing the domain labels. Gonzalez-Garcia et al. (2018) decouple images into a shared part and an exclusive part to achieve multi-modal image translation. Furthermore, several works (e.g., Yu et al., 2019b; Lee et al., 2020b) have extended the disentangled representation to provide domain-invariant and domain-specific representations to perform multi-domain and multi-modal translations simultaneously. However, these methods focus on multi-domain translation and their disentanglement mainly targets holistic attributes, whereas our method focuses on disentangled representations that respect facial component structure and support detailed editing.
Another group of methods apply disentangled representations of latent code to control predefined attributes of faces. For instance, the method by Zhang et al. (2019) is trained on labeled input data pairs by swapping designated parts of embeddings to control specific attributes. Deng et al. (2020) imitate the 3D rendering process and introduce contrastive learning to learn a disentangled latent space. Many other works (e.g., (Härkönen et al., 2020; Shen et al., 2020; Tewari et al., 2020; Abdal et al., 2020)) have tried to analyze and disentangle the latent code of some pretrained GAN space (Karras et al., 2019) also with labeled data of specific attributes. Although these works successfully disentangle the latent space, they could only control a limited number of predefined attributes such as gender, expression, and age, due to the use of labeled data in the training stage. In contrast, our method provides users with more freedom to edit faces by changing sketches and to swap the face appearance by changing their appearance code.
Several research works have also proposed to disentangle specific elements and manipulate images. Nguyen-Phuoc et al. (2019) and Schwarz et al. (2020) utilize 3D representations and disentangle images into pose, shape, and appearance. However, these methods can only generate images randomly and do not support editing with detailed control, except for changing image viewpoints. Wang et al. (2019) explicitly disentangle face deformations and appearance with two parallel networks to achieve expression editing. While their work is limited to only changing facial expressions, the method in (Park et al., 2020) also encodes images into geometry and appearance components with the idea of swapping and co-occurrent patch discriminating statistics. Compared with this work, our method leverages a sketch as a constraint instead of the patch discriminator for precise facial detail editing. MichiGAN (Tan et al., 2020) is specifically designed for photo-realistic hair image generation of portraits conditioned on decoupled attributes. In our work, we aim to disentangle geometry and appearance of face images for intuitive and detailed control of face generation. This benefits further applications such as freehand face editing and sketching.
2.4. Style Transfer
Style transfer focuses on changing the style of an image while preserving its key content. One category of style transfer methods (e.g., (Zhu et al., 2017a, b; Huang et al., 2018; Liu et al., 2019)) requires a collection of target images and treats this problem as image-to-image translation. Another category of methods achieves style transfer with an arbitrary style image. Starting from the pioneering work (Gatys et al., 2016), the follow-up works (e.g., (Li and Wand, 2016; Mechrez et al., 2018; Kolkin et al., 2019)) take various content and style representations to generate attractive stylized images by iterative optimization. DST (Kim et al., 2020) further develops a novel geometry-aware style transfer method by deforming a content image to match the geometry of a style image. Another path of improvement is to eliminate the time-consuming optimization process. These works (e.g., (Liao et al., 2017; Huang and Belongie, 2017; Li et al., 2017)) find dense correspondence or manipulate features in pre-trained networks to synthesize high-quality stylized images in real time. Although the above works generate interesting artistic results, these are not particularly suitable for photo-realistic image style transfer. To handle this problem, WCT (Yoo et al., 2019) employs wavelet operations and progressive stylization to synthesize high resolution stylized photo-realistic images within a few seconds. In our work, we disentangle the geometry and appearance of face images to represent content and style information, respectively. Our method not only supports photo-realistic synthesis of stylized faces but also allows detailed editing with sketches.
In this section, we formalize the structure of our proposed face image generation architecture in detail. Inspired by Huang et al. (2017) and DeepFaceDrawing (Chen et al., 2020), we leverage a local-to-global framework to generate high-quality face images. Specifically, we decompose a face image into five components (“left-eye”, “right-eye”, “nose”, “mouth”, and “background”) and process them using individual network modules. After component-level generation, we fuse the image patches into globally consistent results. Therefore, our architecture, called DeepFaceEditing, comprises 5 Local Disentanglement (Sec. 3.1) modules responsible for disentangling the geometry and appearance for each component, and a Global Fusion (Sec. 3.2) module responsible for fusing component features and generating high-quality results with global consistency. During training, we adopt a swapping scheme (Park et al., 2020) with a cycle-consistency constraint to enhance the robustness and generalization ability of our framework.
3.1. Local Disentanglement
By incorporating this module, we aim to extract both the geometry and appearance features for each face component and generate local image patches from these features. To achieve this, we design a Geometry Encoder and an Appearance Encoder for obtaining geometry features and appearance features, respectively. Furthermore, we design an Image Synthesis Generator to combine the geometry features and appearance features from a pair of image and sketch or two images (one providing the geometry features while the other providing appearance features) to obtain the translated component-level image patches.
Sketches depict the contours of real images, and are inherently suitable for geometry information extraction. Thus, for sketch inputs, we can directly extract pure geometry information using an auto-encoder network. The main challenge here is to extract geometry features from real images. Given a real facial component image as input, an intuitive approach to extracting its geometry features is to first use a pre-trained image-to-sketch translation network to translate the real image into the sketch domain and then send the generated sketch into the geometry encoder for sketches. While this approach is promising, we show that there exists a unified way to extract the geometry information from sketches and real images.
Specifically, we achieve this by training two auto-encoders, one for sketches and the other for images, and aligning the latent distribution of the image space to that of the sketch space, ensuring that only the geometry information is encoded. We first train a network consisting of an encoder $E_{sketch}$ and a decoder $D_{sketch}$ to generate an intermediate feature of sketches, as shown in Fig. 3 (top). To retain essential spatial information, the latent space in the bottleneck layer is not a vector but a low-resolution feature map of dimension $H \times W \times C$, where $H$, $W$ and $C$ are the height, width and channel number of the geometry latent feature map. Both the input and output of this network are sketches. Note that the sketches here can be either edge maps extracted from images or hand-drawn sketches. For hand-drawn sketches, especially incomplete ones during the sketching process, we apply sketch manifold projection (Chen et al., 2020) during preprocessing to improve robustness.
Let $I$ denote a real image and $S$ denote its corresponding sketch. We extract the geometry feature of $S$ through the pre-trained sketch encoder $E_{sketch}$, formulated as $F^{S}_{geo} = E_{sketch}(S)$. Then we train an encoder $E_{image}$ to map the corresponding image $I$ into the latent geometry space of sketches, denoted as $F^{I}_{geo} = E_{image}(I)$. To encourage $F^{I}_{geo}$ and $F^{S}_{geo}$ to follow the same distribution, we impose constraints on each layer of the pre-trained sketch decoder $D_{sketch}$ when we feed $F^{I}_{geo}$ and $F^{S}_{geo}$ into it. We also add a loss between the two corresponding decoder outputs.
Appearance is another important aspect of a facial image. The mapping between facial geometry and real face images is clearly one-to-many, but by specifying the appearance, such ambiguities can be resolved. We employ another encoder, $E_{appearance}$, to extract the appearance feature. The appearance encoder leverages global average pooling (i.e., for each feature channel, taking the average over all the spatial locations in the feature map) to eliminate spatial information and extract an appearance feature that is independent of the geometry feature. As appearance features are extracted for individual local regions, dropping spatial information does not cause significant loss of useful information. Results of the disentangled interpolation experiments on face appearance and geometry (see Fig. 16) demonstrate that $E_{appearance}$ can learn a continuous and smooth facial appearance space efficiently.
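The effect of global average pooling on spatial information can be illustrated with a minimal NumPy sketch. The function name `appearance_vector` and the toy feature-map shape are hypothetical and chosen only for illustration; the point is that two feature maps that differ only by a spatial rearrangement yield the same pooled vector.

```python
import numpy as np

def appearance_vector(feature_map: np.ndarray) -> np.ndarray:
    """Global average pooling: (C, H, W) feature map -> (C,) vector."""
    return feature_map.mean(axis=(1, 2))

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 4, 4))   # toy map: 8 channels on a 4x4 grid
mirrored = features[:, :, ::-1]         # same values, mirrored spatial layout

vec = appearance_vector(features)
vec_mirrored = appearance_vector(mirrored)

# The pooled vectors agree although the spatial layouts differ.
print(np.allclose(vec, vec_mirrored))
```

Because only per-channel averages survive, such a vector can carry color and tone statistics of a local region but not the position of its contours.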
Image Synthesis Generator
Given independent geometry and appearance representations, an image synthesis generator takes them as inputs and generates reconstruction or swapping results. To control the appearance of generated face images, we adopt Adaptive Instance Normalization (AdaIN) (Huang and Belongie, 2017) in our facial image synthesis generator. Specifically, this generator comprises 4 residual blocks and 4 up-sampling layers, and the appearance features are injected into them. Then, we obtain an embedding feature which has the same resolution as the input image but with 64 channels. Finally, an image consistent with the input geometry and appearance features is predicted via a single convolution layer.
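As an illustration of how AdaIN lets an appearance feature modulate geometry features, the following NumPy sketch normalizes each channel of a feature map and then applies a per-channel scale and shift. How the scale and shift are derived from the appearance feature is not specified here; passing them in directly as `scale` and `shift` is an assumption made for this example, as are the function name and the toy shapes.

```python
import numpy as np

def adain(content: np.ndarray, scale: np.ndarray, shift: np.ndarray,
          eps: float = 1e-5) -> np.ndarray:
    """Adaptive instance normalization on a (C, H, W) feature map.

    scale, shift: (C,) per-channel statistics, assumed here to be
    predicted from the appearance feature.
    """
    mean = content.mean(axis=(1, 2), keepdims=True)
    std = content.std(axis=(1, 2), keepdims=True)
    normalized = (content - mean) / (std + eps)   # zero mean, unit std per channel
    return scale[:, None, None] * normalized + shift[:, None, None]

rng = np.random.default_rng(0)
content = rng.normal(size=(4, 6, 6))         # toy geometry-driven feature map
scale = np.array([0.5, 1.0, 1.5, 2.0])       # hypothetical appearance statistics
shift = np.array([0.0, 1.0, -1.0, 2.0])

out = adain(content, scale, shift)
```

After the operation, each channel of `out` has (up to `eps`) the mean and standard deviation given by `shift` and `scale`, while the spatial pattern within the channel is inherited from `content`.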
Unlike sketch-to-image tasks, our framework takes real facial images as inputs and disentangles appearance features from geometry features with the assistance of sketches. Furthermore, a network that provides controllable sketch-based face editing would be more appealing to users. Therefore, we randomly feed the geometry features generated by either the sketch encoder $E_{sketch}$ or the image encoder $E_{image}$ into the image synthesis generator during the training stage. During inference, either geometry reference images or sketches can provide the geometry information used to generate photo-realistic facial images with a specific appearance. With appearance and geometry features disentangled, we can generate face images whose geometry and appearance come from different sources, as shown in Fig. 4.
3.2. Global Fusion
This module generates the final face images from local image features encoded by the Local Disentanglement modules. To translate local image features into a complete and natural-looking face image, one possible way is to directly combine the local image patches generated by applying one or more convolution layers to the component-level feature maps (as in the patch generation process shown in Fig. 2). However, this straightforward approach is prone to artifacts on the boundaries between components. So instead of directly combining component-level image patches, our method combines the intermediate feature maps generated by the Local Disentanglement modules before sending them into an image generation network. In this way, our network aggregates more information flow and is capable of generating high-quality images.
More concretely, our Global Fusion module comprises three units: an encoder, residual blocks and a decoder, like DeepFaceDrawing (Chen et al., 2020) and pix2pixHD (Wang et al., 2018). Given the feature map of the background component, we replace certain patches of it with the corresponding generated features of other components, in the order of “mouth”, “nose”, “left-eye”, and “right-eye” to reduce the impact of overlapping between components. Then we feed the combined feature map into the Global Fusion module to generate a new face with desired appearance and geometry.
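The ordered patch replacement can be sketched as follows. The component boxes, the 16 × 16 feature grid and the function name `fuse_features` are hypothetical values chosen for illustration and do not reflect the actual component layout; only the paste order follows the description above.

```python
import numpy as np

# Hypothetical component boxes (top, left, height, width) on a 16x16 feature grid.
COMPONENT_BOXES = {
    "mouth": (10, 4, 5, 8),
    "nose": (5, 5, 7, 6),
    "left-eye": (3, 1, 4, 6),
    "right-eye": (3, 9, 4, 6),
}
# Later components overwrite earlier ones where their boxes overlap.
PASTE_ORDER = ["mouth", "nose", "left-eye", "right-eye"]

def fuse_features(background, components, boxes=COMPONENT_BOXES):
    """Paste component feature maps (C, h, w) into a copy of the background (C, H, W)."""
    fused = background.copy()
    for name in PASTE_ORDER:
        top, left, h, w = boxes[name]
        fused[:, top:top + h, left:left + w] = components[name]
    return fused

background = np.zeros((4, 16, 16))
components = {
    name: np.full((4, COMPONENT_BOXES[name][2], COMPONENT_BOXES[name][3]), float(k + 1))
    for k, name in enumerate(PASTE_ORDER)   # mouth=1, nose=2, left-eye=3, right-eye=4
}
fused = fuse_features(background, components)
```

With these toy boxes, the nose features overwrite the mouth features where the two boxes overlap, and the eye features overwrite the nose features, which mirrors the stated paste order.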
3.3. Training Process
In this part, we introduce the training process of the aforementioned modules at length. We train the entire framework in a step-by-step manner. Specifically, we first train the Local Disentanglement modules, and then train the Global Fusion module with parameters of the Local Disentanglement modules fixed.
To train our network, we need a large-scale dataset of sketch-image pairs. At the same time, the sketches in the sketch-image pairs are required to be salient, natural and similar to hand-drawn sketches. Traditional edge extraction methods, such as HED (Xie and Tu, 2015) and Canny (Canny, 1986), often fail to produce ideal edge maps. Therefore, we follow DeepFaceDrawing (Chen et al., 2020) and use the Photocopy filter in Photoshop followed by sketch simplification to build this dataset. We use FFHQ (Karras et al., 2019) as our training data. In this way, we generate 32.2k sketch-image pairs as our dataset and randomly select 29.9k pairs for training and 2.3k pairs for testing. The resolution of both images and sketches is set to 512 × 512.
The training process of each Local Disentanglement module consists of three steps. First, as described in Sec. 3.1, the sketch encoder $E_{sketch}$ and decoder $D_{sketch}$ are trained to learn a geometry latent space for sketches using an L1 reconstruction loss. Once $E_{sketch}$ is trained, we can represent the geometry feature of a sketch $S$ as $F^{S}_{geo} = E_{sketch}(S)$. Then, we train the network $E_{image}$, which takes a real image $I$ as input and predicts a geometry feature $F^{I}_{geo} = E_{image}(I)$ following the same distribution as the learned geometry space. The loss function is defined as follows:
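One form of this loss that is consistent with the layer-wise constraint described in Sec. 3.1 is sketched below. The notation is an assumption: $I$ is a real image, $S$ its sketch, $E_{image}$ and $E_{sketch}$ the two geometry encoders, and $D^{i}_{sketch}$ the output of the $i$-th layer of the sketch decoder, with $D^{0}_{sketch}$ the identity on the input feature map. The use of the $L_1$ norm at every layer is likewise an assumption.

```latex
\mathcal{L}_{image} \;=\; \sum_{i=0}^{n}
  \left\| D^{i}_{sketch}\!\left(E_{image}(I)\right)
        - D^{i}_{sketch}\!\left(E_{sketch}(S)\right) \right\|_{1}
```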
where $n$ is the number of layers of the decoder $D_{sketch}$. The index 0 corresponds to the input feature map, the index $n$ corresponds to the output image, and the other indices correspond to intermediate feature maps. Note that when optimizing the parameters of the image geometry encoder $E_{image}$, we fix the weights of $E_{sketch}$ and $D_{sketch}$. Finally, we train the appearance encoder $E_{appearance}$ and the image synthesis generator $G$ with the weights of $E_{sketch}$ and $E_{image}$ fixed. During this last step of training, we randomly feed the geometry feature of sketches, $F^{S}_{geo}$, or of real images, $F^{I}_{geo}$, into $G$. In the following, we denote both $F^{S}_{geo}$ and $F^{I}_{geo}$ as $F_{geo}$, without distinguishing their sources. We further introduce a swapping strategy and adopt a cycle-consistency loss to disentangle a real facial image into appearance and geometry. To ensure the photorealism of generated images, we also adopt the multi-scale discriminator (Wang et al., 2018) and an adversarial loss.
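More concretely, as shown in Fig. 2, given two component images $I_1$ (which could be either a real image or a sketch) and $I_2$ (which should be a real image) in the training set, we can extract geometry features $F^{1}_{geo}$ and $F^{2}_{geo}$ from $I_1$ and $I_2$ by passing them through the pre-trained sketch encoder $E_{sketch}$ or image encoder $E_{image}$. The appearance feature $F^{2}_{app}$ of $I_2$ is extracted by using the appearance encoder $E_{appearance}$. By swapping the geometry feature of $I_2$ with that of $I_1$, we generate image $I'_1$ using the geometry feature of image $I_1$ and the appearance feature of image $I_2$, as $I'_1 = G(F^{1}_{geo}, F^{2}_{app})$, where $G$ is the image synthesis generator. With the geometry of $I_2$ and the appearance of $I'_1$, we can establish a cyclic reconstruction of image $I_2$, as $I'_2 = G(F^{2}_{geo}, E_{appearance}(I'_1))$. We also introduce a self-reconstruction loss: when both the geometry encoder and the appearance encoder take the same image (such as $I_2$) as input, we can reconstruct it using its geometry and appearance features. The self-reconstruction of $I_2$ can be formulated as $\hat{I}_2 = G(F^{2}_{geo}, F^{2}_{app})$. Overall, we adopt the following loss terms to train our Local Disentanglement modules: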
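The data flow of swapping, cyclic reconstruction and self-reconstruction can be traced with a toy example. The three functions below are hand-written stand-ins, not the trained networks: "geometry" is taken to be the per-channel normalized layout and "appearance" the per-channel mean and standard deviation, an idealized disentanglement assumed purely for illustration.

```python
import numpy as np

def encode_geometry(image):
    """Toy stand-in for a geometry encoder: per-channel normalized layout."""
    mean = image.mean(axis=(1, 2), keepdims=True)
    std = image.std(axis=(1, 2), keepdims=True)
    return (image - mean) / std

def encode_appearance(image):
    """Toy stand-in for the appearance encoder: per-channel mean and std."""
    return image.mean(axis=(1, 2)), image.std(axis=(1, 2))

def generate(geometry, appearance):
    """Toy stand-in for the generator: re-apply appearance statistics."""
    mean, std = appearance
    return geometry * std[:, None, None] + mean[:, None, None]

rng = np.random.default_rng(0)
image_1 = rng.uniform(size=(3, 8, 8))   # provides geometry
image_2 = rng.uniform(size=(3, 8, 8))   # provides appearance

# Swapping: geometry of image_1 combined with appearance of image_2.
swapped_1 = generate(encode_geometry(image_1), encode_appearance(image_2))
# Cyclic reconstruction: geometry of image_2 with appearance of the swapped image.
cyclic_2 = generate(encode_geometry(image_2), encode_appearance(swapped_1))
# Self-reconstruction: geometry and appearance from the same image.
self_rec_2 = generate(encode_geometry(image_2), encode_appearance(image_2))
```

In this idealized setting the swapped image keeps the geometry of `image_1`, and both `cyclic_2` and `self_rec_2` recover `image_2` exactly. For the learned networks these identities do not hold by construction, which is why they are encouraged through the loss terms introduced next.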
Self-reconstruction loss: When the geometry and appearance come from the same image, i.e., the two input images $I_1$ and $I_2$ are identical, the self-consistency of our framework requires that we can reconstruct the input after passing it through our framework. The self-reconstruction loss contains three terms: 1) a perceptual loss (Johnson et al., 2016), which measures the visual similarity between the generated images and input images using a pre-trained VGG-19 model; 2) a feature matching loss (Wang et al., 2018) of discriminators, which aims to stabilize the training process; 3) a Lab color loss (Tan et al., 2020), which controls the color tone by converting images to the CIE-LAB color space and calculating the chromatic distance in the $a$ and $b$ channels.
The self-reconstruction loss is the weighted sum of these three terms, with the weights set empirically in our experiments.
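To make the Lab color term concrete, the following NumPy sketch converts sRGB images to CIE-LAB with the standard D65 formulas and compares only the $a$ and $b$ channels. The use of a mean absolute difference as the chromatic distance, as well as the function names, are assumptions for illustration.

```python
import numpy as np

# Linear sRGB -> CIE XYZ matrix and D65 reference white (standard constants).
_RGB_TO_XYZ = np.array([[0.4124564, 0.3575761, 0.1804375],
                        [0.2126729, 0.7151522, 0.0721750],
                        [0.0193339, 0.1191920, 0.9503041]])
_D65_WHITE = np.array([0.95047, 1.0, 1.08883])

def rgb_to_lab(rgb: np.ndarray) -> np.ndarray:
    """Convert an (H, W, 3) sRGB image with values in [0, 1] to CIE-LAB."""
    linear = np.where(rgb > 0.04045, ((rgb + 0.055) / 1.055) ** 2.4, rgb / 12.92)
    xyz = linear @ _RGB_TO_XYZ.T / _D65_WHITE
    delta = 6.0 / 29.0
    f = np.where(xyz > delta ** 3, np.cbrt(xyz), xyz / (3 * delta ** 2) + 4.0 / 29.0)
    lightness = 116.0 * f[..., 1] - 16.0
    a = 500.0 * (f[..., 0] - f[..., 1])
    b = 200.0 * (f[..., 1] - f[..., 2])
    return np.stack([lightness, a, b], axis=-1)

def lab_chroma_loss(generated: np.ndarray, target: np.ndarray) -> float:
    """Mean absolute difference over the a and b channels only (assumed distance)."""
    lab_generated = rgb_to_lab(generated)
    lab_target = rgb_to_lab(target)
    return float(np.abs(lab_generated[..., 1:] - lab_target[..., 1:]).mean())

dark_gray = np.full((2, 2, 3), 0.2)
light_gray = np.full((2, 2, 3), 0.8)
red = np.zeros((2, 2, 3))
red[..., 0] = 1.0

gray_vs_gray = lab_chroma_loss(dark_gray, light_gray)   # differs in lightness only
red_vs_gray = lab_chroma_loss(red, light_gray)          # differs in chroma
```

Since the lightness channel is excluded, two gray images of different brightness incur almost no penalty, whereas a change of hue does. This matches the stated purpose of the term, namely controlling the color tone.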
Figure 5. Comparisons with state-of-the-art methods for face style transfer. In each row, (a) is a geometry reference image and (b) is an appearance reference image. (c)-(f) are the results of the existing methods and our results are shown in (g). WCT better preserves the geometry of (a), but does not reconstruct the appearance well. As shown in the third example, DST cannot capture the appearance of the hair region. Our method achieves visually the best results. Original images courtesy of ImagineCup, Dan Milner, Willie Williams, Drama League, John Benson, ShashiBellamkonda and Sebastiaan ter Burg.
Cyclic swapping loss: To disentangle the geometry and appearance features more thoroughly, we adopt swapping to generate face images from geometry and appearance features that come from different sources, i.e., $I_1 \neq I_2$. The cyclic swapping loss contains two terms, a geometry loss and a cycle-consistency loss, which we introduce next.
Our network aims at a complete disentanglement of the geometry and appearance features of a facial image. When generating the swapped image $I'_1$ by replacing the appearance of $I_1$ with that of $I_2$, the structure of $I_1$ should be maintained. We therefore introduce a geometry loss that keeps the geometry of the generated image unchanged by comparing it with that of the input image $I_1$.
To ensure that the appearance of the swapped image $I'_1$ is the same as that of $I_2$, we introduce a cycle-consistency loss. With the geometry of $I_2$ and the appearance of the swapped image $I'_1$, the generated image $I'_2$ should cyclically reconstruct image $I_2$. We reuse the above formulation of the reconstruction loss, with the same hyper-parameters, to enforce this cycle-consistency constraint. Finally, the cyclic swapping loss is a weighted combination of the geometry loss and the cycle-consistency loss, with the weights set empirically in our experiments.
Adversarial loss: We also adopt the multi-scale discriminator to encourage the distribution of generated images to match the distribution of real images. The generator and the discriminator are balanced with a weight that is kept fixed in all our experiments.
Our final objective is simply the sum of the above three losses (each term has already been weighted within them), and minimizing it optimizes the three networks trained in this step: the appearance encoder, the image synthesis generator, and the discriminator.
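Collecting the three loss groups described above, the overall structure of the objective can be sketched as follows. The symbol names and the weights $\lambda$ are placeholders introduced for this summary; their empirically chosen values are not restated here, and the adversarial weight is taken to be absorbed into $\mathcal{L}_{adv}$.

```latex
\begin{align}
\mathcal{L}_{self} &= \lambda_{p}\,\mathcal{L}_{perceptual}
                    + \lambda_{fm}\,\mathcal{L}_{FM}
                    + \lambda_{lab}\,\mathcal{L}_{Lab} \\
\mathcal{L}_{cyc\text{-}swap} &= \lambda_{geo}\,\mathcal{L}_{geo}
                    + \lambda_{cyc}\,\mathcal{L}_{cyc} \\
\mathcal{L} &= \mathcal{L}_{self} + \mathcal{L}_{cyc\text{-}swap} + \mathcal{L}_{adv}
\end{align}
```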
Global Fusion Training
With pre-trained Local Disentanglement modules, we are able to train the Global Fusion module which fuses the features encoded by Local Disentanglement modules together and generates the final results. Similar to the previous stage, we use the adversarial loss, feature matching loss, and perceptual loss for the Global Fusion module. Note that we do not need the swapping strategy in this stage since it does not involve any disentanglement.
In this section, we show our experimental setup as well as discuss the results of our experiments. We have done extensive experiments from three aspects, namely comparison with state-of-the-art methods, ablation study and user study. The results show the effectiveness of our proposed method and its superiority to the existing and alternative approaches.
First, we make extensive comparisons with the state-of-the-art methods. Since the source of the geometry feature for image generation can be either real facial images or sketches, we compare our proposed method with two main branches of image-to-image translation methods: 1) style-transfer-based methods like WCT (Yoo et al., 2019), STROTSS (Kolkin et al., 2019), DST (Kim et al., 2020) and StyleGAN2 (Karras et al., 2020), where we take real images as inputs and evaluate the ability of face style transfer; 2) sketch-to-image translation methods like pSp (Richardson et al., 2020), Liu et al. (2020) and Zhang et al. (2020), where we take sketches as inputs and evaluate the ability of sketch-based face image editing. We also compare with some image editing methods based on sketch inputs. Then, we conduct ablation studies to justify the necessity of each component in our framework. Finally, a perception study is conducted to test whether the images generated by our method look more appealing to users than those of alternative methods, indicating the visual quality of our controllable face generation results. As shown in our demo video, our interactive editing interface achieves real-time performance. All the above experiments are carried out on a PC with an Intel i7-7700 CPU, 32GB RAM, and two Nvidia RTX 2080Ti GPUs. DeepFaceEditing is implemented in PyTorch (Paszke et al., 2019) and Jittor (Hu et al., 2020). We use Adam (Kingma and Ba, 2015) with a learning rate of 0.0002. We use a batch size of 2, the maximum that fits in GPU memory at 512 × 512 resolution. We will release the code and data to facilitate future research. Please find the details of the network architectures in the supplemental material.
[Figure 6 panel labels: (d) Liu et al.; (e) Zhang et al.]
4.1. Results and Evaluations
Face Style Transfer
In our framework, we can extract geometry features from either images or sketches. Hence, we can disentangle the geometry and appearance of an image, and generate new images by swapping geometry and/or appearance. This allows us to compare our method with state-of-the-art style transfer methods, including WCT (Yoo et al., 2019), STROTSS (Kolkin et al., 2019), DST (Kim et al., 2020), and StyleGAN2 (Karras et al., 2020). We use the official implementations for all the above methods. By default, resolution is adopted, except that StyleGAN2 uses input and output images of . The comparisons between all the methods are shown in Fig. 5. Our method generates high-quality photo-realistic results and combines the geometry and appearance of different images without obvious artifacts, visually outperforming the compared style-transfer methods. STROTSS (Kolkin et al., 2019) can capture the overall style of the appearance images, but it does not produce spatially consistent transfer results, leading to salient artifacts. DST (Kim et al., 2020) generates high-quality results, but they fail to capture the appearance of the reference images, possibly due to its strong geometry constraint. For StyleGAN2 (Karras et al., 2020), the identities of the generated faces are changed, since the projected latent code cannot reconstruct the images accurately and the mixing operation can affect the geometry of the generated images. STROTSS (Kolkin et al., 2019), DST (Kim et al., 2020) and StyleGAN2 (Karras et al., 2020) all require an optimization process to generate style transfer images or projected latent codes, and thus cannot run in real time. Unlike the above three algorithms, WCT (Yoo et al., 2019) is a fast feed-forward method that performs style transfer without any optimization. Although WCT generates realistic face images, its style transfer results show a mixed appearance of the two reference images, without disentangling geometry and appearance.
Sketch-based Face Image Synthesis
As our method can directly extract the geometry feature from a sketch and combine it with the appearance feature extracted from a reference image, it can be used for sketch-to-image translation. As shown in Fig. 6, we compare our method with the existing solutions for this task, including pSp (Richardson et al., 2020), Liu et al. (2020), and Zhang et al. (2020). For a fair comparison, we use their official source code but re-train their networks on our training dataset. pSp combines an encoder with the StyleGAN decoder and can be applied to sketch-to-image translation. However, since the sketch geometry is encoded into the latent code, the faces generated by pSp often do not faithfully respect the input sketches. The style mixing operation adopted in pSp also affects the geometry of the synthesized faces undesirably. Liu et al. (2020) design a general network to translate a sketch to an image given a reference image. The geometry of their synthesized faces is maintained well, while the appearance deviates from the reference images (see the color of the mouth). Zhang et al. (2020) propose a cross-domain correspondence network that finds a dense mapping between an input image and a reference image, and then uses a warped exemplar to generate results, which, however, may lack control of details. There are obvious artifacts in the results of Zhang et al. (2020), e.g., the fuzzy nose and the blurred eyes shown in the second and third columns of Fig. 6 (e). In contrast, our method combines the geometry and appearance features consistently and generates realistic and detailed results (e.g., the resulting eyebrows, mouth and gradually changing hair). More results generated by our method, with diverse geometry and appearance, are shown in the supplementary material.
As for face image editing, we compare our method with the state-of-the-art technique SC-FEGAN (2019). Taking sketches as inputs, our method is able to edit face images based on modifications to sketches, both globally and locally. In contrast, SC-FEGAN (2019) is not able to synthesize a face image entirely from an input sketch since it is based on image completion. For a fair comparison, we thus only edit partial sketches and set paired rectangular masks for SC-FEGAN. Since the authors have not released their training code, we use their official editing GUI with the pre-trained model for comparison. As shown in Fig. 8, our method produces more realistic and visually consistent results than SC-FEGAN (see the second row for an edited nose). In addition, SC-FEGAN requires users to specify a pre-defined mask for the area of interest, which constrains the user experience. Pre-defining such areas is sometimes not easy, especially for detailed lines. For example, in Fig. 7, we draw lines around the chin to generate wrinkles, resulting in lighting and shadow variations. In such cases, it is difficult for users to pre-define the areas of variation.
With disentangled geometry and appearance, we can also edit the global and/or local appearance without changing the geometry of the face. As shown in Fig. 9, given a sketch and appearance image as inputs, our method can generate a face retaining both global sketch wrinkles and appearance colors. With the fixed sketch, i.e. fixed geometry feature, we give two new reference images replacing the mouth and/or two eyes cropped from other images respectively. Our model is able to robustly fuse the newly added component-level image patches with the original face component images and generate photo-realistic edited face images.
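The component-level appearance editing described above can be pictured as replacing individual entries in a per-component set of appearance codes while the geometry codes stay fixed. The sketch below is only an illustration under that assumption: the component names in `COMPONENTS` and the helper `replace_appearance` are hypothetical, and short lists of floats stand in for real appearance features.

```python
from typing import Dict, Iterable, List

Code = List[float]

# Assumed component names, used only for illustration.
COMPONENTS = ("left_eye", "right_eye", "nose", "mouth", "background")


def replace_appearance(base: Dict[str, Code], donor: Dict[str, Code],
                       parts: Iterable[str]) -> Dict[str, Code]:
    """Return a copy of `base` whose listed components take the donor's codes."""
    edited = {name: list(code) for name, code in base.items()}
    for part in parts:
        if part not in base or part not in donor:
            raise KeyError(f"unknown component: {part}")
        edited[part] = list(donor[part])
    return edited
```

For example, `replace_appearance(base, donor, ["mouth"])` keeps every appearance code of `base` except the mouth; the edited set would then be passed, together with the unchanged geometry codes, to the fusion network.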
4.2. Ablation Study
We conduct an ablation study on the test set to show the impact of individual key components of our system. Since our method is based on a local-to-global framework, we first show the results without the fusion of local parts. As our swapping strategy contains appearance swapping and cyclic reconstruction, we perform the ablation in two ways: one removes both swapping and the cycle consistency loss, and the other removes the cycle consistency loss but keeps swapping. From Fig. 10, it is evident that without the local-to-global strategy, the results exhibit artifacts in local details. As shown in the last column, even for non-frontal faces, the local-to-global strategy improves the quality of local details. The results in the fourth row to the last row illustrate that when swapping and the cycle consistency loss are utilized, the quality of the synthesized results improves. On the other hand, our baseline model without swapping and the cycle consistency loss only successfully performs reconstruction during training, where the geometry reference and appearance reference are matched. When the appearance or geometry features are changed during testing, the color translation seems to behave like a pixel-by-pixel copy, resulting in obviously inappropriate colorization of pixels. These results confirm that this network does poorly in separating the appearance and geometry. With the swapping strategy added, the baseline without the cycle consistency loss avoids such wrong colors, but does not properly combine the two features. To further enhance the information flow, we add the cycle consistency loss to decouple the features from the generated results. This constraint effectively improves the quality of image synthesis.
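To make the difference between plain reconstruction, swapping, and cyclic reconstruction concrete, here is a toy sketch in plain Python. It is not the paper's networks: the encoders and generator are hypothetical stand-ins (an "image" is a list of positive numbers, its geometry is the peak-normalized profile and its appearance is the peak value), and the exact form of the cycle term (re-extracting the appearance from the swapped output and pairing it with the geometry of the appearance source) is an assumption.

```python
from typing import Callable, List, Tuple

Vec = List[float]


def l1(a: Vec, b: Vec) -> float:
    """Mean absolute difference between two toy images."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def swap_and_cycle_losses(
    x1: Vec, x2: Vec,
    enc_geo: Callable[[Vec], Vec],
    enc_app: Callable[[Vec], float],
    gen: Callable[[Vec, float], Vec],
) -> Tuple[float, float, Vec]:
    # Plain reconstruction: geometry and appearance come from the same image.
    rec = gen(enc_geo(x1), enc_app(x1))
    loss_rec = l1(rec, x1)
    # Swap: geometry of x1 with appearance of x2 (no ground truth exists).
    swapped = gen(enc_geo(x1), enc_app(x2))
    # Cycle: appearance re-extracted from the swapped output, combined with
    # the geometry of x2, should reproduce x2.
    cyc = gen(enc_geo(x2), enc_app(swapped))
    loss_cyc = l1(cyc, x2)
    return loss_rec, loss_cyc, swapped


# Toy stand-ins for the geometry encoder, appearance encoder and generator.
def toy_enc_geo(x: Vec) -> Vec:
    peak = max(x)
    return [v / peak for v in x]


def toy_enc_app(x: Vec) -> float:
    return max(x)


def toy_gen(g: Vec, a: float) -> Vec:
    return [v * a for v in g]
```

With these perfectly disentangled toy functions both losses vanish. A model trained only with the reconstruction term never sees mismatched geometry and appearance, which is the situation the ablation above probes.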
4.3. Perception Study
For the tasks of style transfer and sketch-to-image translation, the alternative methods tend to generate reasonably realistic face images (see Figs. 5, 6), and the visual quality differences are subtle. It is therefore difficult for existing general image quality measures such as FID (Heusel et al., 2017) and LPIPS (Zhang et al., 2018) to distinguish between their qualities. To evaluate the visual quality and the faithfulness (i.e., the similarity to the geometry and appearance images) of synthesized faces, we conducted a perception study.
The evaluation was done via two online questionnaires. The first perception study aimed to evaluate the effect of style transfer. For each example, we showed two input images (the geometry reference image and the appearance reference image) and five synthesized images (from WCT (Yoo et al., 2019), STROTSS (Kolkin et al., 2019), DST (Kim et al., 2020), StyleGAN2 (Karras et al., 2020) and ours), placed side by side in a random order to avoid bias. Each participant was asked to evaluate 20 examples according to three criteria: the maintenance of the geometry reference, the similarity to the appearance reference and the realism of the generated images, each on a five-point Likert scale (1 = strongly negative to 5 = strongly positive). In total, 40 participants took part in this study, giving 40 (participants) × 20 (questions) = 800 subjective evaluations for each method. We performed one-way ANOVA tests on the five methods in the aspects of ‘Geometry’, ‘Appearance’ and ‘Realism’, corresponding to the three criteria respectively. The statistics of the evaluation results are plotted in Fig. 11(a). We found significant effects of our method for all three criteria: geometry (), appearance () and realism ().
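For readers unfamiliar with the test, a one-way ANOVA compares the variance between group means with the variance within groups. The sketch below computes the F statistic and its degrees of freedom in plain Python; the function name is hypothetical and the numbers in the usage note are made-up toy values, not data from the study.

```python
from typing import List, Sequence, Tuple


def one_way_anova_f(groups: Sequence[Sequence[float]]) -> Tuple[float, int, int]:
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or any(len(g) == 0 for g in groups) or n_total <= k:
        raise ValueError("need at least two non-empty groups and n_total > k")
    means: List[float] = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares: spread of ratings around their group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between, df_within = k - 1, n_total - k
    if ss_within == 0.0:
        return float("inf"), df_between, df_within
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```

For the toy groups `[[1, 2, 3], [2, 3, 4], [5, 6, 7]]` the between-group sum of squares is 26 and the within-group sum is 6, so F = (26 / 2) / (6 / 6) = 13 with (2, 6) degrees of freedom. If the 800 ratings per method were pooled as independent observations across five methods, the degrees of freedom would be (4, 3995); whether the study pooled ratings this way is an assumption.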
The second perception study was conducted to evaluate the quality of sketch-to-image translation. Similar to the first one, for each example we showed the user two input images (the sketch image for geometry and the appearance image) and three images synthesized by Liu et al. (2020), Zhang et al. (2020) and our method, placed side by side in a random order. Each participant was asked to evaluate 20 examples according to three criteria: the visual quality of the synthesized images, the similarity to the input sketch and the faithfulness in appearance, each on a five-point Likert scale (1 = strongly negative to 5 = strongly positive). In total, we got 40 (participants) × 20 (questions) = 800 subjective evaluations for each method. The second column of Fig. 11(b) shows the statistics of these three methods. We also performed ANOVA tests on the three aspects, and obtained the values for geometry (), appearance () and realism (). Our method achieved a significant improvement over the other two methods.
Thanks to the disentanglement of geometry and appearance, our method can be adapted to many applications. In this section, we describe three applications: 1) Sketch editing interface; 2) Hand-drawn sketch to face image generation; 3) Disentangled morphing.
5.1. Sketch Editing Interface
First, we design a real-time sketch-based face editing user interface (Fig. 12), enabling detailed face editing via sketching. The control panel (Fig. 12 (c)) consists of some necessary tools such as image opening tools, editing tools, etc. We also offer a sketch extraction function, which lets users translate a photo to a sketch for further editing. Our system generates synthesized results in real time, based on modifications of sketches. A list of reference appearance images is placed at the bottom of the interface (Fig. 12 (d)), and users can upload their own face images to the list. By selecting an appearance image, users can control the appearance of the generated results.
5.2. Hand-drawn Sketch to Image Generation
The editing module in our system enables users to draw faces via coarse or fine sketches. Users with little experience in drawing can produce realistic faces even from simple sketches (see the first sketch image in Fig. 13, which only draws the outline of a face). Users who are expert at drawing can edit the sketches with a brush or an eraser in our system to generate a more specific face closer to their conception. As shown in Fig. 13, a progressive sketch sequence is drawn by the user, and we compare the face sequences synthesized by DeepFaceDrawing (Chen et al., 2020), DeepFaceDrawing with style transfer and with the swapping autoencoder from a reference image, and our method based on the geometry of the sketches and the appearance of the same reference. It is evident that the results of DeepFaceDrawing (the 2nd row) show faces with diverse appearances, e.g., varied hair and skin colors. Such sudden and uncontrollable changes are not desired by a user who is making incremental changes to the sketch. Even combined with a style transfer method (Kolkin et al., 2019) using the left image as a reference, the results of DeepFaceDrawing show some artifacts in local areas like the mouth or eyes. Besides, our method is more efficient than that of Kolkin et al. (2019), which takes around 1 minute to generate a result. When DeepFaceDrawing is combined with the swapping autoencoder (Park et al., 2020), the results show a slight color difference from the reference image and are also affected by the generation of DeepFaceDrawing (4th column). Even with the appearance swapping, differences between editing frames also exist (especially on hair), while our results are more robust and faithful to incremental changes. A similar comparison is shown in Fig. 14, where we use full hand-drawn sketches instead of an editing sequence. DeepFaceDrawing (Chen et al., 2020) generates plausible face images (2nd row), but without appearance control. Applying a style transfer technique (Kolkin et al., 2019) or the swapping autoencoder (Park et al., 2020) to the results of DeepFaceDrawing can partially address this problem. However, applying two steps in succession may accumulate errors. The results of Kolkin et al. (2019) (3rd row) have some artifacts on the mouth and hair. The swapping autoencoder (Park et al., 2020) (4th row) can generate attractive results, but the resulting faces may not retain the geometry of the sketches (third and last columns). As shown in Fig. 14 (bottom row) and Fig. 15, in our system, users can control both the geometry with hand-drawn sketches and the appearance with diverse references.
5.3. Disentangled Morphing
Our method learns two disentangled latent spaces for geometry and appearance respectively, and enables a novel application of generating face images with reference to given geometry or appearance patterns. We demonstrate via disentangled face image morphing that a clear disentanglement between geometry and appearance has been achieved by our method. As shown in Fig. 16, we achieve controllable interpolation along two dimensions (Appearance and Geometry). Given two input images , our method can extract the geometry code and the appearance code from both images. Then, we conduct linear interpolation in the geometry and/or appearance latent spaces, using the corresponding latent codes from image and . The results demonstrate that the interpolated faces are controllable and that every interpolated face has high fidelity.
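The two-dimensional interpolation can be sketched as follows. This is a minimal illustration that assumes each image contributes one flat geometry code and one flat appearance code; the names `lerp` and `morphing_grid` are hypothetical, and in the actual system each interpolated pair of codes would still have to be decoded into an image by the generator.

```python
from typing import List, Tuple

Code = List[float]


def lerp(a: Code, b: Code, t: float) -> Code:
    """Linear interpolation between two codes: (1 - t) * a + t * b."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def morphing_grid(geo1: Code, geo2: Code, app1: Code, app2: Code,
                  steps: int) -> List[List[Tuple[Code, Code]]]:
    """Grid of (geometry, appearance) code pairs.

    Rows sweep the appearance from image 1 to image 2 and columns sweep
    the geometry, so the two factors are interpolated independently.
    """
    if steps < 2:
        raise ValueError("steps must be at least 2")
    ts = [i / (steps - 1) for i in range(steps)]
    return [[(lerp(geo1, geo2, tg), lerp(app1, app2, ta)) for tg in ts]
            for ta in ts]
```

Fixing a row of the grid changes only the geometry code, and fixing a column changes only the appearance code, which mirrors the two axes of the morphing figure.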
6. Conclusion, Limitations and Future Work
This work presented a structured disentanglement framework for face generation and editing, carried out in a local-to-global manner. Our key observation is that geometry and appearance features of face images can be effectively disentangled, and sketches serve as an ideal intermediate representation for geometry features. Therefore, sketches can impose a strong constraint during disentanglement. The component-level geometry and appearance disentanglement is achieved by Local Disentanglement modules, which are trained using a swapping strategy, while the Global Fusion module performs coherent local-to-global image generation from feature maps of facial image patches. Through extensive experiments, we show that our approach can generate more realistic results than existing methods. We also adapted our system for novel applications such as sketch editing, hand-drawn sketch-based face image generation and disentangled face morphing.
One limitation of this work is that we only disentangle the geometry and appearance, while other attributes such as head pose have not been considered in our current implementation. As shown in Figs. 17 and 18, when there is a large rotation of the subject’s head or there is substantial occlusion, there may be some color bias in the generated results. Lighting is another challenging problem for most existing face synthesis methods and is not explicitly disentangled in our framework, so it is difficult to finely control complicated lighting conditions by sketches or reference images. As future work, it would be useful to explore the disentanglement of other attributes such as head pose and lighting to make the method more general. Besides, sketches have semantic ambiguity, and in some extreme cases it is hard even for humans to distinguish the accurate boundary between neck, hair and background. This ambiguity may sometimes cause artifacts on the outer boundary of the face foreground, leading to blurred hair. As future work, semantic masks may be combined with sketches to generate more attractive results. Furthermore, while we allow detailed editing of geometry through sketching, our current approach uses an appearance reference image to control the appearance of the generated face image. This could be improved using other, more flexible forms of input, such as strokes with color, as in Sangkloy et al. (2017). This is a promising research direction but it remains challenging, because the face appearance contains not only the color but also the material attributes, and it is very difficult to depict the complex materials of the face with strokes. The consistency between the color strokes should also be maintained. These issues will be explored in future work.
Acknowledgements. This work was supported by National Natural Science Foundation of China (No. 61872440 and No. 62061136007), Science and Technology Service Network Initiative of the Chinese Academy of Sciences (No. KFJ-STS-ZDTP-070, No. KFJ-STS-QYZD-129 and No. KFJ-STS-QYZD-2021-11-001), Royal Society Newton Advanced Fellowship (No. NAFR2192151), Youth Innovation Promotion Association CAS, and Beijing Program for International S&T Cooperation Project (No. Z191100001619003). Hongbo Fu was supported by HKSAR RGC General Research Fund (No. 11212119) and City University of Hong Kong (SCM ACIM Collaborative Research Fellowship).
References
- StyleFlow: attribute-conditioned exploration of StyleGAN-generated images using conditional continuous normalizing flows. arXiv preprint.
- Neural photo editing with introspective adversarial networks. In ICLR.
- A computational approach to edge detection. PAMI, pp. 679–698.
- DeepFaceDrawing: deep generation of face images from sketches. ACM Trans. Graph. 39 (4), pp. 72:1–72:16.
- Sketch2Photo: internet image montage. ACM Trans. Graph. 28 (5), pp. 1–10.
- SketchyGAN: towards diverse and realistic sketch to image synthesis. In CVPR, pp. 9416–9425.
- Disentangled and controllable face image generation via 3D imitative-contrastive learning. In CVPR, pp. 5153–5162.
- Image style transfer using convolutional neural networks. In CVPR, pp. 2414–2423.
- Image-to-image translation for cross-domain disentanglement. In NeurIPS, pp. 1294–1305.
- Generative adversarial networks. arXiv preprint arXiv:1406.2661.
- Mask-guided portrait editing with conditional GANs. In CVPR, pp. 3436–3445.
- DeepSketch2Face: a deep learning based sketching system for 3D face and caricature modeling. ACM Trans. Graph. 36 (4), pp. 126:1–126:12.
- GANSpace: discovering interpretable GAN controls. In NeurIPS, pp. 9841–9850.
- GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, pp. 6626–6637.
- Jittor: a novel deep learning framework with meta-operators and unified graph execution.
- Beyond face rotation: global and local perception GAN for photorealistic and identity preserving frontal view synthesis. In ICCV, pp. 2458–2467.
- Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, pp. 1510–1519.
- Multimodal unsupervised image-to-image translation. In ECCV, pp. 179–196.
- Image-to-image translation with conditional adversarial networks. In CVPR, pp. 5967–5976.
- SC-FEGAN: face editing generative adversarial network with user’s sketch and color. In ICCV, pp. 1745–1753.
- Perceptual losses for real-time style transfer and super-resolution. In ECCV, pp. 694–711.
- A style-based generator architecture for generative adversarial networks. In CVPR, pp. 4401–4410.
- Analyzing and improving the image quality of StyleGAN. In CVPR, pp. 8107–8116.
- Deformable style transfer. In ECCV, pp. 246–261.
- Adam: a method for stochastic optimization. In ICLR.
- Style transfer by relaxed optimal transport and self-similarity. In CVPR, pp. 10051–10060.
- MaskGAN: towards diverse and interactive facial image manipulation. In CVPR, pp. 5548–5557.
- DRIT++: diverse image-to-image translation via disentangled representations. pp. 2402–2417.
- Reference-based sketch image colorization using augmented-self reference and dense semantic correspondence. In CVPR, pp. 5800–5809.
- Combining Markov random fields and convolutional neural networks for image synthesis. In CVPR, pp. 2479–2486.
- Convolutional network for attribute-driven and identity-preserving human face generation. arXiv preprint.
- Universal style transfer via feature transforms. In NeurIPS, pp. 386–396.
- LinesToFacePhoto: face photo generation from lines with conditional self-attention generative adversarial networks. In ACM Multimedia, pp. 2323–2331.
- DeepFacePencil: creating face images from freehand sketches. In ACM Multimedia, pp. 991–999.
- Visual attribute transfer through deep image analogy. ACM Trans. Graph. 36 (4), pp. 120:1–120:15.
- A unified feature disentangler for multi-domain image translation and manipulation. In NeurIPS, pp. 2595–2604.
- Self-supervised sketch-to-image synthesis. arXiv preprint.
- Few-shot unsupervised image-to-image translation. In ICCV, pp. 10550–10559.
- The contextual loss for image transformation with non-aligned data. In ECCV, pp. 800–815.
- Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
- HoloGAN: unsupervised learning of 3D representations from natural images. In ICCV, pp. 7587–7596.
- Semantic image synthesis with spatially-adaptive normalization. In CVPR, pp. 2337–2346.
- Swapping autoencoder for deep image manipulation. In NeurIPS, pp. 7198–7211.
- PyTorch: an imperative style, high-performance deep learning library. In NeurIPS, pp. 8024–8035.
- FaceShop: deep sketch-based face image editing. ACM Trans. Graph. 37 (4), pp. 99:1–99:13.
- Encoding in style: a StyleGAN encoder for image-to-image translation. arXiv preprint.
- U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
- Scribbler: controlling deep image synthesis with sketch and color. In CVPR, pp. 6836–6845.
- GRAF: generative radiance fields for 3D-aware image synthesis. In NeurIPS, pp. 20154–20166.
- Interpreting the latent space of GANs for semantic face editing. In CVPR, pp. 9240–9249.
- MichiGAN: multi-input-conditioned hair image generation for portrait editing. ACM Trans. Graph. 39 (4), pp. 95:1–95:13.
- StyleRig: rigging StyleGAN for 3D control over portrait images. In CVPR, pp. 6141–6150.
- DFT-Net: disentanglement of face deformation and texture synthesis for expression editing. In International Conference on Image Processing (ICIP), pp. 3881–3885.
- High-resolution image synthesis and semantic manipulation with conditional GANs. In CVPR, pp. 8798–8807.
- Holistically-nested edge detection. In ICCV, pp. 1395–1403.
- Deep plastic surgery: robust and controllable image editing with human-drawn sketches. In ECCV, pp. 601–617.
- Photorealistic style transfer via wavelet transforms. In ICCV, pp. 9035–9044.
- Free-form image inpainting with gated convolution. In ICCV, pp. 4470–4479.
- Multi-mapping image-to-image translation via learning disentanglement. In NeurIPS, pp. 2990–2999.
- Multi-attribute transfer via disentangled representation. In AAAI, pp. 9195–9202.
- Cross-domain correspondence learning for exemplar-based image translation. In CVPR, pp. 5142–5152.
- The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, pp. 586–595.
- Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, pp. 2242–2251.
- Toward multimodal image-to-image translation. In NeurIPS, pp. 465–476.
- SEAN: image synthesis with semantic region-adaptive normalization. In CVPR, pp. 5103–5112.