Human hair is so delicate, variable, and expressive that it constantly plays a unique role in depicting the subject in a face image. Given its diversity and flexibility, the ability to manipulate hair in a portrait photo with a full spectrum of user control has long been coveted as a way to facilitate truly personalized portrait editing, from extending a particular hair wisp or altering the structure of a local region to entirely transferring the hair color and appearance.
However, unlike most other parts of the human face, hair has a peculiar geometry and material that make it extremely difficult to analyze, represent, and generate. The complexity of its visual perception comes from multiple factors that the user may intend to edit or preserve. Roughly decomposing the hair region into its mask shape and color appearance reveals a few key aspects that are crucial to visual quality. Shape-wise, the intricate mask boundaries and varying opacity make precise hair shape control and seamless background blending challenging. Appearance-wise, the anisotropic nature of hair fibers tends to entangle the strand flows with the material properties: the former render the major local structures, while the latter determine the color palette and inter-strand variations in a globally consistent manner. Therefore, to achieve controllable hair manipulation, the system must not only generate photo-realistic hair images, but also bridge various user controls and use them to condition the hair generation in ways that respect the nature of each visual factor.
Fortunately, the latest advances in deep neural networks, or more specifically, generative adversarial networks (GANs), have driven significant progress on human face synthesis. Recent successes, such as ProgressiveGAN and StyleGAN, already make it possible to produce highly realistic human faces, including hair, from a random latent code. Yet conditioned generation, the heart of interactive face editing, is thought to be more challenging and remains less explored. Despite recent efforts on conditioned face generation with semantic maps or sketches, we are still far from reaching high-quality hair editing that is both fully controllable and user-friendly.
In this work, we present MichiGAN (Multi-Input-Conditioned Hair Image GAN), a novel conditional hair image generation method for interactive portrait manipulation. To achieve this, we explicitly disentangle the information of hair into a quartet of attributes (shape, structure, appearance, and background) and design deliberate representations and conditioning mechanisms to enable orthogonal control within the image generation pipeline.
To this end, we propose three distinct condition modules for these attributes according to their particular perceptual characteristics and scales. Specifically, the spatially-variant attributes of shape and structure are represented as fuzzy semantic masks and weighted orientation maps, respectively, which modulate the network with conditional normalization structures [22, 45]. More globally, appearance is encoded through our mask-transformed feature extracting network, which acts as the latent code input to the very beginning of the network and bootstraps the generator with appearance styles guided by arbitrary references. In addition, a background encoder is placed parallel to the generation branch, which keeps the background intact by progressively injecting features into the generator in a mask-aware way without loss of generation capability.
By integrating all these condition modules, we have an end-to-end image generation network that can provide complete control over every major hair attribute. Tailored for such a complicated network, we further develop a data preparation and augmentation pipeline for disentangling hair into compatible training data pairs, together with a combination of novel loss functions to achieve smooth and robust training. By extensively evaluating MichiGAN on a wide range of in-the-wild portrait images and comparing it with state-of-the-art conditioned image generation methods and alternative network designs, we demonstrate the superiority of the proposed method on both result quality and controllability.
Moreover, we build an interactive portrait hair editing system based on MichiGAN. It enables straightforward manipulation of hair by projecting intuitive and high-level user inputs, such as painted masks, guiding strokes, or reference photos, to well-defined condition representations. With this hair editing interface, users can modify the hair shape through a mask painter, alter or inpaint local hair structure with sparse stroke annotations, replace hair appearance using another hair reference, or even edit multiple attributes at once.
In summary, the main contributions of this work are:
An explicit disentanglement of hair visual attributes, and a set of condition modules that implement the effective condition mechanism for each attribute with respect to its particular visual characteristics;
An end-to-end conditional hair generation network that provides complete and orthogonal control over all attributes individually or jointly;
An interactive hair editing system that enables straightforward and flexible hair manipulation through intuitive user inputs.
The training code, pre-trained models, and the interactive system are made publicly available to facilitate future research (GitHub).
2 Related Work
Portrait image manipulation is a long-standing and extensively studied area in both the vision and graphics communities. Early approaches tackle specific application-driven problems with conventional image editing techniques, such as face swapping, face shape beautification, and portrait enhancement. More recently, the popularization of image-based 3D face reconstruction has enabled many 3D-aware portrait manipulation methods, covering expression editing, relighting, aging, camera manipulation, geometry reshaping, identity replacement [13, 30], and performance reenactment. However, the limited fitting accuracy and model quality of the face proxy tend to affect the realism of the results. Thanks to the recent progress on deep neural networks, high-fidelity portrait manipulation has been made feasible in multiple scenarios, including sketch-based generation [32, 25], attribute editing [49, 16], view synthesis, and relighting.
Hair Modeling and Editing
As a critical yet challenging component of the human face, hair is also of great interest to researchers. Most hair acquisition pipelines focus on capturing hair geometry from various image inputs, including multi-view [44, 38, 20, 63] and single-view [9, 21, 8, 66, 48, 34] images. Utilizing the coarse hair geometry approximated from single-view images, a few hair editing methods have been proposed for hair transfer, animation, relighting, and morphing. However, the lack of in-depth understanding of, and manipulation approaches for, many of the key visual factors greatly compromises the result quality and editing flexibility. The highly realistic portrait synthesis results from recent GAN-based image generation methods have shown great potential in bridging this quality gap. However, conditioned image generation is generally thought to be much more difficult than generation from random codes. Although some recent works [57, 32, 25, 46] have achieved progress on hair generation conditioned on limited types of inputs, these methods are unfortunately neither intuitively controllable nor universally applicable. In this paper, we propose a set of disentangled hair attributes to cover the full spectrum of visual hair factors, and for each attribute, we design a condition module that can effectively control the generator with user-friendly inputs. Such completely conditioned and controllable hair generation has not been achieved before.
GAN-based Image Generation
Thanks to the powerful modeling capacity of deep neural networks, a variety of generative models have been proposed to synthesize images in an unconditional setting. Typical methods include generative adversarial networks (GANs) [17, 40, 39] and variational auto-encoders (VAEs) [14, 52]. Compared to VAEs, GANs have proven more popular and more capable of modeling fine-grained details. Starting from a random latent code, GANs can generate fake images with the same distribution as natural images in a target domain. The well-known ProgressiveGAN first leveraged a progressive generative network structure that can generate highly realistic facial images, including hair. By incorporating the latent and noise information into both the shallow and deep layers, StyleGAN further improved the generation quality significantly. Our work is built on GANs but aims at the conditional image generation task, which is much more challenging because multiple elaborate conditions are needed to control the desired hair synthesis. Utilizing these pre-trained GAN networks, many recent works [1, 2] achieve convincing image editing results (including portraits) by manipulating the embedded features in the latent spaces of these networks.
Conditional Image Generation
Many recent works leverage conditional GANs (cGANs) for image generation, conditioned on different types of inputs, such as category labels [41, 43, 5], text [47, 62, 61], or images [24, 55, 45, 25, 32]. Among these various forms, image-conditioned generation, i.e., image-to-image translation, lies at the heart of interactive image editing. Pix2Pix proposes a unified framework for different applications, such as transforming a semantic segmentation mask or sketch into a photo-realistic image. It is further improved by Pix2PixHD with multi-scale generators for high-resolution image generation. SPADE proposes spatially-adaptive normalization for better converting a segmentation mask into a photo-realistic image. A few works [51, 60, 19] propose disentangled image generation, which shares similar motivations with ours. Recently, some works have made efforts on interactive facial image editing using conditional GANs. For example, MaskGAN focuses on facial image generation based on fine-grained semantic masks. SC-FEGAN extends the conditional inputs to sketches and color, which allows users to control the shape and color of the face. Although great progress has been made in conditional image generation, we are still far from reaching high-quality hair editing that is both fully controllable and user-friendly, due to the complexity of hair. In this paper, we focus on hair image generation conditioned on multiple attributes that can enable user-friendly interactive hair editing.
Given an input portrait image , our MichiGAN aims at performing conditional hair editing to the hair region according to various user control inputs, while keeping the background unchanged. The output is denoted as .
As illustrated in Fig. 1, the proposed pipeline consists of a backbone generation network and three condition modules, which modulate the image generation network in various ways that will be introduced later. Thanks to these three condition modules, the network provides orthogonal controls over four different hair attributes: shape, structure, appearance, and background; represented as semantic hair mask , dense orientation map , hair appearance reference , and the background region of the input image , respectively. By combining all condition inputs, we can formulate the generation process as a deterministic function as:
These conditions, except the background, can either come from new sources in the interactive system or simply be extracted from the target/reference images. Therefore, users are not limited to editing one attribute while fixing all others; they can also change several attributes at the same time.
We will elaborate on the technical method and the entire system in the next three sections, which are organized as follows. We first introduce the representations of our conditions, structures, and mechanisms of their corresponding condition modules (Sec. 4). Then, we present the backbone generation network of MichiGAN that integrates all condition modules, and discuss its training strategies (Sec. 5). Finally, our interactive hair editing system, built upon MichiGAN, is introduced together with various interaction methods that enable user-friendly control over all attributes (Sec. 6).
4 Condition Modules
In this section, we introduce our condition modules for controlling specific semantic attributes of hair during editing. In light of different natures of target attributes, their corresponding condition modules should vary as well. Here we propose three distinct modules to handle four types of main attributes, and also introduce the condition representation, component network design, and integration to the backbone network for each of them.
4.1 Shape & Structure
We represent the hair shape as the 2D binary mask of its occupied image region, similar to image generation methods conditioned on semantic label maps.
However, unlike other types of objects, the visual translucency and complexity of hair boundaries inevitably make a binary mask a coarse approximation of the shape. To robustly generate realistic shape details and natural blending effects from rough masks, whether estimated with semantic segmentation networks or interactively painted by users, we relax the strictness of the condition by adopting fuzzy mask conditioning, which allows a certain level of flexibility within the boundary area. To achieve this, we dilate or erode the mask by a random width for each data pair during training, so as to decouple the boundary details from the accurate shape contours to some extent.
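The random dilation and erosion described above can be sketched as follows. This is a minimal NumPy illustration rather than the authors' implementation: the function names, the 3x3 structuring element, and the `max_width` parameter are assumptions, since the text does not state the range of random widths.

```python
import numpy as np

def _dilate_once(mask: np.ndarray) -> np.ndarray:
    """One step of binary dilation with a 3x3 square structuring element."""
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    out = np.zeros_like(mask, dtype=bool)
    h, w = mask.shape
    for dy in range(3):
        for dx in range(3):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def dilate(mask: np.ndarray, width: int) -> np.ndarray:
    """Grow the mask by `width` pixels."""
    out = mask.astype(bool)
    for _ in range(width):
        out = _dilate_once(out)
    return out

def erode(mask: np.ndarray, width: int) -> np.ndarray:
    """Shrink the mask by `width` pixels (dilation of the complement).
    Pixels outside the image are treated as hair, so borders are not eroded."""
    return ~dilate(~mask.astype(bool), width)

def fuzzy_mask(mask: np.ndarray, max_width: int, rng: np.random.Generator) -> np.ndarray:
    """Randomly dilate or erode a binary mask by up to `max_width` pixels.
    `max_width` is a hypothetical hyperparameter, not a value from the paper."""
    width = int(rng.integers(-max_width, max_width + 1))
    if width > 0:
        return dilate(mask, width)
    if width < 0:
        return erode(mask, -width)
    return mask.astype(bool)
```

Drawing a fresh width for every training pair means the generator never sees a mask whose contour coincides exactly with the true hair boundary, which is what encourages it to synthesize boundary detail on its own.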
Regarding the structure of the target object, a mask shape, despite its generality, is usually too sparse and ambiguous to condition the generation, which is particularly true for hair. Fortunately, hair is a special object with such strong anisotropy and homogeneity that its internal structure can always be depicted by the strand flows uniformly distributed within the region. Therefore, we use a dense orientation map to represent the hair structure, which defines the 2D orientation angle for each pixel.
We use oriented filters, which have been widely used in hair reconstruction and rendering, to estimate confident pixel-wise orientation maps. Given a bank of oriented filter kernels (we use Gabor filters in our implementation), with , we estimate the orientation label map of image , and its associated confidence , at pixel position as:
where represents the convolution operator.
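As an illustration of this kind of filter-bank estimation, the sketch below builds a small bank of real Gabor kernels, convolves the image with each, and takes the strongest absolute response per pixel as the orientation label and its magnitude as the confidence. The argmax/max rule, the kernel parameters, and all function names are assumptions made for the example; the exact formulation is the one given by the equation above.

```python
import numpy as np

def gabor_kernel(theta, size=17, sigma_u=1.8, sigma_v=2.4, lam=4.0):
    """Real Gabor kernel that responds to strands running along angle `theta`.
    All parameter values here are illustrative assumptions."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    u = -x * np.sin(theta) + y * np.cos(theta)  # coordinate across the strand
    v = x * np.cos(theta) + y * np.sin(theta)   # coordinate along the strand
    k = np.exp(-0.5 * (u**2 / sigma_u**2 + v**2 / sigma_v**2)) * np.cos(2 * np.pi * u / lam)
    return k - k.mean()  # remove the DC component

def convolve_same(image, kernel):
    """Circular 2D convolution via FFT, with the kernel centred on each pixel."""
    h, w = image.shape
    kh, kw = kernel.shape
    padded = np.zeros((h, w))
    padded[:kh, :kw] = kernel
    padded = np.roll(padded, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    return np.real(np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(padded)))

def estimate_orientation(image, num_angles=32):
    """Return per-pixel orientation (radians in [0, pi)) and confidence."""
    thetas = np.arange(num_angles) * np.pi / num_angles
    responses = np.stack([np.abs(convolve_same(image, gabor_kernel(t))) for t in thetas])
    labels = responses.argmax(axis=0)      # index of the best-matching kernel
    confidence = responses.max(axis=0)     # strength of that response
    return thetas[labels], confidence
```

Because the kernel is real-valued, the response is strongest at strand centres and can vanish between them, which is one reason a confidence value is kept alongside each orientation.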
Continuous rotation representation is non-trivial for neural networks. The orientation label map contains discrete angle values in , which are actually non-continuous since orientation is cyclic, i.e., orientation equals . Thus, direct use of the orientation label map introduces ambiguity. We convert the label map to a two-channel continuous orientation map for each pixel as follows:
and do the backward conversion as:
We use the orientation map in two ways during both network training and inference to achieve structure control:
Condition input. We use a dense orientation map as the input to the shape & structure condition module. However, the raw output from oriented filters may be inaccurate due to noise and dark regions. Therefore, we calculate the final dense orientation map by applying one pass of Gaussian smoothing to , weighted locally by confidence .
Training supervision. To explicitly enforce supervision on the structure of the result, we also propose a novel structural loss during training. Basically, we formulate the orientation estimation steps as one differentiable layer in the network following Eq. (2,3) after output , and measure per-pixel L1 orientation distance in the hair region weighted by confidence . The detailed formulation is discussed in Sec. 5.2.
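A common way to obtain a continuous two-channel representation of a cyclic angle with period pi is the double-angle encoding (cos 2θ, sin 2θ). The sketch below uses that encoding, together with a confidence-weighted Gaussian smoothing implemented as a normalized convolution. Both choices, and all names and parameter values, are assumptions for illustration and may differ from the exact conversion and smoothing used in the paper.

```python
import numpy as np

def label_to_map(theta):
    """Angles in [0, pi) -> two-channel map (cos 2θ, sin 2θ), continuous across 0 and pi."""
    return np.stack([np.cos(2 * theta), np.sin(2 * theta)], axis=0)

def map_to_label(omap):
    """Backward conversion: two-channel map -> angle in [0, pi)."""
    return (0.5 * np.arctan2(omap[1], omap[0])) % np.pi

def gaussian_blur(img, sigma):
    """Separable Gaussian blur with edge padding."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    k /= k.sum()
    blur1d = lambda v: np.convolve(np.pad(v, radius, mode="edge"), k, mode="valid")
    out = np.apply_along_axis(blur1d, 0, img)
    return np.apply_along_axis(blur1d, 1, out)

def smooth_orientation(theta, confidence, sigma=2.0, eps=1e-8):
    """Confidence-weighted smoothing: low-confidence pixels borrow from neighbours."""
    omap = label_to_map(theta)
    num = np.stack([gaussian_blur(c * confidence, sigma) for c in omap])
    den = gaussian_blur(confidence, sigma) + eps
    return num / den
```

Smoothing in the two-channel space rather than on the raw angles avoids averaging, say, an angle just above 0 with one just below pi into a spurious value near pi/2.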
4.2 Appearance
Unlike shape or structure, hair appearance describes the globally consistent color style that is invariant to specific shape or structure. It includes multiple factors such as the intrinsic albedo color, environment-related shading variations, and even the granularity of wisp styles. Thanks to its global consistency, appearance can be compactly represented. Given the complexity of natural hair styles, instead of generating the appearance features from scratch, we treat hair appearance as a style transfer problem, which extracts the style from a reference image and transfers it to the target.
To force disentanglement of hair appearance, we design the appearance condition module such that it operates globally and is not susceptible to local shape and structure. Furthermore, it should be mask-invariant to be able to generalize to any hair mask at test time.
The detailed architecture of our appearance condition network is shown in Fig. 1 (b). Its design rests on the following key ideas, which are critical to the result quality:
To ensure that the encoded appearance features focus only on the hair region and are not influenced by background information, we adopt partial convolution  with the hair mask of the reference image;
In order to capture the global hair appearance and discard spatial-variant information as much as possible, we add an instance-wise average pooling layer to the output of the last partial convolution, within the reference hair region ;
To align the appearance code with the target hair shape, and uniformly distribute the extracted appearance feature, we spatially duplicate the reference feature to every pixel within the target shape to produce the appearance map .
Therefore, the final appearance map is calculated as:
where is the output of the last partial convolution, denotes element-wise multiplication, and denotes the duplication operator.
The output of our mask-transformed feature extracting network, the appearance map , is used as the latent code input to the very beginning of the backbone generation network to bootstrap the generator with specific appearance styles instead of random noises.
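The pooling and duplication steps can be illustrated in a few lines of NumPy. The sketch assumes the feature map and the reference mask share one resolution and that the masks are binary; the function name and array layout are hypothetical.

```python
import numpy as np

def appearance_map(features, ref_mask, target_mask, eps=1e-8):
    """Masked average pooling followed by spatial duplication.

    features:    (C, H, W) output of the last partial convolution
    ref_mask:    (H, W) hair mask of the reference image
    target_mask: (H2, W2) target hair mask
    returns:     (C, H2, W2) appearance map, zero outside the target hair region
    """
    ref = ref_mask.astype(float)
    # Average the features over the reference hair region only.
    code = (features * ref).sum(axis=(1, 2)) / (ref.sum() + eps)
    # Copy the resulting vector to every pixel inside the target shape.
    return code[:, None, None] * target_mask.astype(float)[None]
```

Because the pooled code no longer carries any spatial layout from the reference, the same code can be spread over a target mask of any shape or resolution.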
4.3 Background
Finally, we need to keep all non-hair pixels (namely the background) untouched, so that we can restrict the modification to the target hair region only. This is not as trivial as it appears, since a) forcing the GAN to reconstruct the same background content wastes much of the network capacity and greatly limits the foreground generation flexibility; and b) generating only the foreground region and then blending it with the original background as a post-processing step introduces severe blending artifacts.
To pass the necessary background information to the generation network gently, so that it can easily reconstruct the background pixels without being distracted by the additional constraints, we design our background condition module as an encoder network placed parallel to the last few layers of the backbone network, as shown in Fig. 1 (c). The encoder produces multi-level feature maps that can be easily reused by the generator to reconstruct the background. The encoded feature after each convolution layer is designed to have the same dimension as the input to the corresponding layer in the backbone, so that it can be merged with the input feature to feed the generator progressively.
We perform feature merging in a mask-guided way that keeps the original generator features in the foreground hair region but replaces the background with the output of the encoder. The input feature map to the th last layer in the backbone is blended between the output of the th last layer in the backbone , and the output of the th layer in the condition module as:
The input to the background encoder is the original background region. We mask out the foreground region by blending the original input with random noise pattern :
Here represents the hair mask of the input image . indicates the dilated version of the target mask , in order to get rid of remaining hair information in the background due to inaccurate segmentation or stray hair strands. This also implicitly asks the network to refine the thin area to achieve better boundary blending quality. is the background image inpainter  that fills the hole mask with image content from , in case the edit mask is smaller than the original hair mask .
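The two blending operations in this section can be sketched as simple mask-weighted sums. This is an assumption-laden illustration: it presumes the masks have already been resized to the resolution of the tensors they blend, it omits the background inpainting step, and the function names are hypothetical.

```python
import numpy as np

def merge_features(gen_feat, bg_feat, hair_mask):
    """Keep generator features inside the hair region, encoder features outside.
    gen_feat, bg_feat: (C, H, W); hair_mask: (H, W) with values in [0, 1]."""
    m = hair_mask.astype(float)[None]
    return m * gen_feat + (1.0 - m) * bg_feat

def background_input(image, dilated_mask, rng):
    """Hide the (dilated) hair region of the input behind random noise.
    image: (3, H, W); dilated_mask: (H, W)."""
    m = dilated_mask.astype(float)[None]
    noise = rng.standard_normal(image.shape)
    return (1.0 - m) * image + m * noise
```

Filling the masked region with noise rather than a constant gives the encoder no usable signal there, so leftover hair pixels cannot leak into the generated result.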
5 Backbone Generator
After introducing all our condition modules, we now present the backbone generation network that integrates all condition inputs to produce the final results, and its training strategy.
Given the conditional information obtained through the aforementioned condition modules, the generation network, which serves as the backbone data flow of the system, aims at producing the desired portrait image output that meets all the conditions. According to their particular perceptual characteristics and scales, the outputs of the three condition modules are treated in different ways.
As shown in Fig. 1 (d), our generator is a sequential concatenation of six up-sampling SPADE residual blocks (ResBlk) and a convolutional layer that outputs the final generation result. As a brief recap of the three types of condition inputs:
The appearance module inputs a feature map to the very first SPADE ResBlk, replacing randomly sampled latent code in traditional GANs;
The shape & structure module feeds every SPADE ResBlk with stacked hair mask and orientation map through the conditional input to modulate the layer activations;
The background module progressively blends background features into the outputs of the last SPADE ResBlks.
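To make the second item concrete: in spatially-adaptive normalization, activations are first normalized and then scaled and shifted by per-pixel parameters predicted from the condition map. The sketch below is a deliberately simplified stand-in, assuming a single sample, per-channel normalization over the spatial dimensions, and a per-pixel linear map (equivalent to a 1x1 convolution) in place of the small convolutional networks that SPADE learns. The weight matrices `w_gamma` and `w_beta` are hypothetical.

```python
import numpy as np

def spade_modulate(x, cond, w_gamma, w_beta, eps=1e-5):
    """Simplified spatially-adaptive modulation of one activation tensor.

    x:       (C, H, W) layer activations
    cond:    (K, H, W) condition map, e.g. hair mask stacked with a
             two-channel orientation map (K = 3), resized to H x W
    w_gamma: (C, K) linear map predicting the per-pixel scale
    w_beta:  (C, K) linear map predicting the per-pixel shift
    """
    mean = x.mean(axis=(1, 2), keepdims=True)
    std = x.std(axis=(1, 2), keepdims=True)
    normalized = (x - mean) / (std + eps)
    gamma = np.einsum("ck,khw->chw", w_gamma, cond)  # spatially varying scale
    beta = np.einsum("ck,khw->chw", w_beta, cond)    # spatially varying shift
    return gamma * normalized + beta
```

Since the scale and shift vary per pixel, the mask and orientation information is re-injected at every block instead of being washed out by normalization.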
5.2 Training Strategy
Considering that we do not have the ground truth images when the conditional inputs come from different sources, training the proposed MichiGAN is essentially an unsupervised learning problem. However, most existing unsupervised learning algorithms [67, 23, 37] usually perform worse than supervised learning in terms of the visual quality of the generated results. More importantly, these unsupervised learning methods target image-wise translation between different domains, making them less feasible for our specific editing tasks with pixel-wise constraints. Motivated by prior work that faces a similar problem in image colorization, we propose to train our MichiGAN in a pseudo-supervised way. Specifically, we feed the four conditional inputs extracted from the same source image into MichiGAN and force it to recover the original image with explicit supervision during training, and generalize it to arbitrary combinations of conditions from various sources at inference. Thanks to our hair attribute disentanglement and the powerful learning capacity, MichiGAN gains this generalization ability by leveraging information from all attributes during the supervised training, since ignoring any of them will incur larger losses.
To preprocess the training data, for each training portrait image , we generate its hair mask with a semantic segmentation network and its orientation map with the dense orientation estimation method presented in Sec. 4.1. Eventually, we have training pairs with the ground-truth portrait image , and the network output that is reconstructed from the conditions.
We adopt the following loss terms to train the network:
Chromatic loss. The generated result should be as close as possible to the ground-truth, but strong pixel reconstruction supervision could also harm the generalization ability of the network. Therefore, instead of measuring full-color reconstruction error, we convert the images into the CIELAB color space and only measure the chromatic distance in a and b channels, to relax the supervision but still penalize color bias:
Structural loss. We propose an additional structure loss to enforce the structural supervision, leveraging our differentiable orientation estimation layer. As described in Sec. 4.1, the orientation estimation layer outputs both orientation and confidence , and the raw orientation output may not be accurate at every pixel. Therefore we formulate the loss as the weighted sum of orientation differences:
Feature matching loss. To achieve more robust GAN training, we also adopt the discriminator feature matching loss.
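The chromatic and structural terms can be illustrated with forward-only NumPy code. Several details are assumptions: the sRGB-to-CIELAB conversion uses the standard sRGB matrix with the white point taken as its row sums, the chromatic term is a mean rather than a sum, and the structural term uses a cyclic angle distance normalized by the total weight. In actual training these would be written in an automatic-differentiation framework.

```python
import numpy as np

# Linear sRGB -> XYZ matrix; the white point is derived from its row sums
# so that neutral grays map exactly to a = b = 0.
RGB2XYZ = np.array([[0.4124564, 0.3575761, 0.1804375],
                    [0.2126729, 0.7151522, 0.0721750],
                    [0.0193339, 0.1191920, 0.9503041]])
WHITE = RGB2XYZ.sum(axis=1)

def rgb_to_lab(rgb):
    """sRGB values in [0, 1], shape (..., 3) -> CIELAB, shape (..., 3)."""
    lin = np.where(rgb <= 0.04045, rgb / 12.92, ((rgb + 0.055) / 1.055) ** 2.4)
    xyz = lin @ RGB2XYZ.T / WHITE
    d = 6.0 / 29.0
    f = np.where(xyz > d**3, np.cbrt(xyz), xyz / (3 * d**2) + 4.0 / 29.0)
    L = 116 * f[..., 1] - 16
    a = 500 * (f[..., 0] - f[..., 1])
    b = 200 * (f[..., 1] - f[..., 2])
    return np.stack([L, a, b], axis=-1)

def chromatic_loss(output_rgb, target_rgb):
    """Mean L1 distance over the a and b channels only (lightness is ignored)."""
    ab_out = rgb_to_lab(output_rgb)[..., 1:]
    ab_gt = rgb_to_lab(target_rgb)[..., 1:]
    return np.abs(ab_out - ab_gt).mean()

def structural_loss(theta_out, theta_gt, confidence, hair_mask, eps=1e-8):
    """Confidence-weighted orientation distance inside the hair region.
    Angles are in [0, pi); the distance wraps around because orientation is cyclic."""
    diff = np.abs(theta_out - theta_gt)
    diff = np.minimum(diff, np.pi - diff)
    weight = confidence * hair_mask
    return (weight * diff).sum() / (weight.sum() + eps)
```

Two gray images of different brightness incur almost no chromatic loss under this definition, which reflects the relaxed supervision the text describes: color bias is penalized, while lightness is left to the other terms.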
In summary, the overall training objective can be formulated as:
6 Interactive Editing System
Based on the proposed MichiGAN, we also build an interactive portrait hair editing system. Fig. 2 shows one screenshot of the user interface of the editing system. Within this system, users can control the shape (top left), structure (top middle), and the appearance (bottom left) separately or jointly, and see the result at the top right.
To facilitate user control over multiple approaches, we further propose two types of interaction modes, namely reference mode and painting mode.
One direct motivation for a user to manipulate the hair in a portrait is to try hair attributes from another reference photo on the target one. We therefore support this intuitive interaction approach by allowing the user to provide reference portraits for different attributes, individually or jointly, and by extracting network-understandable condition inputs from them to guide the generation.
Specifically, users are allowed to specify up to three references for shape (), structure (), and appearance (), respectively. The system then calculates hair mask from , orientation map from , and uses to extract the appearance features. These inputs are used to drive the generation of results.
When the target mask and orientation map come from different sources, the shape of the orientation may not fully cover the entire target shape. Therefore, we train an orientation inpainting network similar to the background one (Sec. 4.3) to complete the missing holes in the orientation map.
It is also natural to let a user perform local and detailed editing on the image with a much higher degree of flexibility than the reference mode. To this end, we provide a stroke-based interaction interface to enable local modification of both shape and structure, and a color palette tool to control the hair appearance. Their underlying methods are introduced below:
Shape painter. Users can change the binary hair mask with shape painter to augment or remove certain mask regions at will, similar to the interface of . The smooth boundaries of these edited regions may not match true hair boundary quality, but our fuzzy mask augmentation (Sec. 4.1) could still nicely capture these details.
Structure painter. Structure interaction is far more challenging than shape interaction, since orientation is dense and not intuitively understandable. To address this, we propose a guided structure editing method to enable user-friendly manipulation. Given the current orientation map and a set of user guidance strokes , within a certain local region around the strokes, we synthesize new orientation information, which should be compatible with both the stroke guidance and the outside regions. The hole mask is generated by dilating the thin mask of user strokes with a certain radius, and the guidance orientation along each stroke is calculated from the known painting order of the line segments that form the stroke path. We compose all the information into a partial orientation map as:
where represents a random noise pattern to hide the original orientation. We use an orientation inpainting network to synthesize new orientation inside the hole region , which is trained on paired data of ground-truth orientation maps and synthesized random user strokes traced on them.
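One plausible way to assemble such a partial orientation map is sketched below: pixels outside the hole keep the current orientation, stroke pixels take the guidance orientation, and the rest of the hole is filled with noise. The blending formula and the function names are assumptions for illustration, and the maps are taken to be two-channel arrays of shape (2, H, W).

```python
import numpy as np

def stroke_angle(p0, p1):
    """Orientation in [0, pi) of the segment p0 -> p1, with points given as (x, y).
    The painting direction does not matter: opposite directions give the same angle."""
    return np.arctan2(p1[1] - p0[1], p1[0] - p0[0]) % np.pi

def compose_partial_orientation(omap, stroke_map, stroke_mask, hole_mask, rng):
    """Combine current orientation, stroke guidance, and noise.

    omap:        (2, H, W) current orientation map
    stroke_map:  (2, H, W) guidance orientation, valid on stroke pixels
    stroke_mask: (H, W) thin mask of the user strokes (a subset of hole_mask)
    hole_mask:   (H, W) dilated stroke mask marking the region to re-synthesize
    """
    s = stroke_mask.astype(float)[None]
    h = hole_mask.astype(float)[None]
    noise = rng.standard_normal(omap.shape)
    return (1.0 - h) * omap + (h - s) * noise + s * stroke_map
```

The inpainting network then only has to fill the noisy band between the strokes and the untouched surroundings, which keeps the edit local.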
Finding a good appearance reference can be a headache for users. In painting mode, we design a palette-like appearance picker tool to retrieve, cluster, and navigate references by specifying a single RGB hair color. This is enabled by embedding all our reference portraits (selected from our test dataset) into the color space by calculating the average hair color. Given an arbitrary target color, we perform a KNN search around it to find the closest references for the user to test. Once the user picks a reference, the system instantly updates the candidate list to contain references clustered around the selected one.
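The retrieval step amounts to a nearest-neighbor lookup in RGB space. A brute-force sketch, with hypothetical function names and Euclidean distance assumed as the metric, could look like this:

```python
import numpy as np

def average_hair_color(image, mask):
    """Mean RGB color over the hair region. image: (H, W, 3); mask: (H, W)."""
    m = mask.astype(float)
    return (image * m[..., None]).sum(axis=(0, 1)) / max(m.sum(), 1.0)

def nearest_references(ref_colors, target_color, k):
    """Indices of the k references whose average hair color is closest to the target.
    ref_colors: (N, 3) array of per-reference average colors."""
    dist = np.linalg.norm(ref_colors - target_color, axis=1)
    return np.argsort(dist, kind="stable")[:k]
```

Updating the candidate list after a pick is the same query with the chosen reference's average color as the new target.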
7.1 Implementation Details
To generate data pairs for supervised training, we use the large-scale portrait dataset of Flickr-Faces-HQ (FFHQ) . The whole dataset is divided into two parts: images for training and images for testing. The resolution of images is resized to .
We illustrate the structures of MichiGAN in Fig. 1. Specifically, the appearance module uses five consecutive downsampling partial convolutions followed by an instance average pooling to get the appearance vector. All these partial convolutions are masked by the appearance reference, with the same kernel size of and increasing feature channels of . Each partial convolution is followed by an instance normalization and a leaky ReLU activation. A mask transform block is then used to spatially duplicate the appearance feature vector to an appearance map (size ), guided by the target hair mask. The shape and structure modules follow the same modulation networks as in SPADE to denormalize each SPADE ResBlk in the backbone with the target hair mask and orientation (after inpainting). The background module generates the background input via inpainting and noise-filling, and produces scales of background features using a network that consists of four convolution layers with feature channels of . These features are blended into the backbone guided by the background mask. Finally, the backbone generator accepts the appearance map as input, and goes through six upsampling SPADE ResBlks and one final convolution to generate the result. The feature channels of these layers are . The inpainting networks for both orientation and background follow the structure proposed by .
We train all condition modules and the backbone generator jointly. The Adam optimizer is used with a batch size of and a total epoch number of . The learning rates for the generator and discriminator are set to and , respectively. The structural loss weight is , while all other loss weights, , , , and are simply set to (Sec. 5.2). To obtain the dilated target mask (Sec. 4.3), is randomly expanded by of the image width.
7.2 Comparison with SOTA
Semantic Image Generation.
We compare MichiGAN with two semantic image synthesis baselines: pix2pixHD and SPADE. Pix2pixHD only supports generation based on semantic masks. SPADE, the current state-of-the-art GAN-based conditional image synthesis framework, allows user control over both the semantic mask and the reference style image. We train both pix2pixHD and SPADE models using the implementations provided by their authors, on exactly the same data we used to train MichiGAN. For a fair comparison, the hair mask of the original image is converted to the input semantic mask, while the reference image is set to be the original input. Besides, since pix2pixHD and SPADE cannot keep the background unchanged, we blend their generated hair with the original background using the soft hair mask. In this reconstruction setting, the original image can be regarded as the ground truth of the desired result, since all the input conditions are derived from it.
Quantitatively, MichiGAN outperforms the competing methods by a large margin, as shown in Tab. 1. Our method achieves an FID score of , which is less than half of those of pix2pixHD and SPADE. This indicates that the portrait images edited by our method are much more realistic than those produced by the two baselines.
Qualitatively, all methods produce plausible hair results, while ours have the best visual quality. A set of visual comparisons is shown in Fig. 4. Taking a closer look at the details of the synthesized hair, we find that the results generated by the pix2pixHD and SPADE models are coarse-grained and have some apparent artifacts, while our results are more fine-grained, with rich and delicate texture details.
In terms of controllability, MichiGAN performs much better than pix2pixHD and SPADE. Pix2pixHD can only generate hair that looks adaptive to the input hair mask but with uncontrollable color and structure. SPADE can also synthesize hair adaptive to the semantic mask. Compared to pix2pixHD, it has additional ability to transfer the style of the reference image to the synthesized hair. However, the transferred results look apparently different from the desired ones, even using the original image as the reference, which is unsatisfactory. Besides, SPADE can not control the structure of the synthesized hair either. Compared to pix2pixHD and SPADE, MichiGAN can not only produce hair adaptive to the semantic mask, but also precisely control the different attributes of the hair, including appearance and structure, so that we can produce desired hair that looks most similar to the references.
In addition, the hair produced by pix2pixHD and SPADE is not compatible with the background of the original image. Despite that the results are softly blended with the backgrounds using high-quality hair masks, we can still easily see the stitching artifacts around the hair boundaries, especially for some complicated backgrounds, which introduces difficulties when applying pix2pixHD and SPADE to interactive editing. In contrast, MichiGAN does not suffer from this problem, since we take the background of the original image as one of the input conditions to directly generate the entire image with both the desired hair and the fixed background.
Conditional Face Generation.
MaskGAN uses fine-grained facial semantic masks as the representation to manipulate face images and is able to transfer styles between different subjects, which is similar to SPADE. We therefore compare with MaskGAN in the same mask-conditioned generation experiments, as shown in Fig. 4 and Tab. 1. Both the qualitative and the quantitative results show that, although MaskGAN can synthesize visually plausible hair, our method still achieves better visual quality while being able to preserve or control the structural details.
SC-FEGAN adopts free-form sketches for portrait generation in the target hole regions to control both structure and appearance, which is comparable to our stroke-based hair generation. For a fair comparison, we adopt the hair mask as the target hole region and use the same set of user strokes painted with our interactive system. The strokes are converted to the sketch input to SC-FEGAN, together with the corresponding hair color samples. A set of visual results is shown in Fig. 5, which contains comparisons on two target images with different sketches and appearance references. As these results demonstrate, although SC-FEGAN allows control over hair structure and color to some extent, its generated results are less realistic than ours. This may be partially caused by the gap between synthetic sketches generated by an edge detector and real user strokes. Our method avoids such a gap by using the dense orientation map as a unified representation for the hair structure.
7.3 Qualitative Results
Our interactive hair editing system based on MichiGAN supports flexible manipulation of hair attributes, including shape, structure, and appearance, individually or jointly. These attributes can be specified in two ways, either by giving a reference image or by interactively modifying the original ones with painting tools, which correspond to the two modes of the system: reference-based editing and painting-based editing.
We first show our reference-based hair editing results on in-the-wild portrait images. In this mode, the user first loads an original image into our system and can then adjust an attribute simply by providing another reference image. For example, once a reference image is given for the structure attribute, our system automatically extracts its orientation map and generates an image with a new hair structure based on this orientation condition. The same applies to appearance and shape. Figure 6 shows the results with each attribute adjusted individually, while Figure 7 shows the results with multiple attributes adjusted jointly. It can be observed that MichiGAN disentangles these visual hair attributes well, without interference among them, and that all of them are naturally embedded in the newly generated image.
We next show painting-based hair editing results, where the attributes are modified by the user with an interactive painting interface (Fig. 2). Compared to the reference-based mode, the painting-based mode is more natural and flexible for local and detailed edits. Two examples are given in Fig. 9. The user first uses a brush tool to modify the mask image, which changes the shape of the hair accordingly. Several strokes are then added to indicate the orientation of the hair, and a new hair structure is generated to follow these orientations. Finally, the user changes the appearance of the hair by specifying some fancy colors. Please refer to the accompanying video for a demonstration. Since our interface is user-friendly and the generated result is immediately fed back to the user, editing hair with our system is quite convenient. The user study shows that an amateur spends 15-30 seconds on average to edit one example.
Hairstyle Transfer Validation.
We also validate the realism of our results by transferring the hairstyle from one image to another of the same subject. As shown in Fig. 8, we collect two images of the same person with different hairstyles, using one image as the source of hair conditions and the other as the target to generate a hairstyle transfer result. Our method achieves realistic results with both appearance and structure similar to the ground truth photo.
7.4 User Study
To further evaluate how realistic our results appear to human observers, we conduct a simple user study on Amazon Mechanical Turk. We prepare two image sets, each has images. One contains real portrait photos, and the other contains results generated by our method with condition inputs from randomly selected test images. During the study, we mix both sets, show one randomly chosen image at a time to a user, and ask the user to subjectively judge whether the image is real or fake. Each image has been evaluated by users. Each user spends seconds on each image on average. The final fooling rates of our results and real images are and , respectively.
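The fooling rate above can be read as the fraction of shown images of a given kind that users judged to be real. A minimal sketch of that bookkeeping follows, assuming each judgment is stored as a (kind, verdict) pair; the data layout, the labels, and the `fooling_rate` name are hypothetical, and the toy numbers are invented for illustration only.

```python
def fooling_rate(judgments, kind):
    """Fraction of images of the given kind that users labeled 'real'.

    judgments: iterable of (kind, verdict) pairs, where kind is
    'generated' or 'photo' and verdict is 'real' or 'fake'.
    """
    verdicts = [v for k, v in judgments if k == kind]
    if not verdicts:
        raise ValueError(f"no judgments for kind {kind!r}")
    return sum(v == "real" for v in verdicts) / len(verdicts)


# Toy data, invented purely for illustration (not the study's results).
toy = [("generated", "real"), ("generated", "fake"),
       ("photo", "real"), ("photo", "real")]
print(fooling_rate(toy, "generated"))  # 0.5
print(fooling_rate(toy, "photo"))      # 1.0
```

Reporting the rate for real photos alongside the rate for generated images gives a reference point, since users also misjudge some real photos as fake.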
7.5 Ablation Study
To achieve plausible results, we carefully design the proposed framework with several key ingredients, ranging from the network architecture to the newly introduced objective functions. Given the space limitation and training cost, we conduct only three key ablation experiments, and the image resolution is set to .
Partial Convolution and Instance-wise Average Pooling.
When adopting the reference image as the appearance guidance, we want the appearance condition module to absorb appearance information only from the hair region and not from the background. This is non-trivial because most existing operators in modern neural networks do not naturally support it. For example, a normal convolutional layer processes the pixels within a regular grid for every possible sliding window. Since the hair region is often irregularly shaped, directly applying these normal operators in the appearance module inevitably introduces background information into the encoded appearance features. Based on this observation, we replace all the normal convolutional layers and average pooling layers with partial convolutional layers and instance-wise average pooling, respectively.
To show the necessity of this replacement, we conduct an ablation experiment that adopts normal convolutional and global average pooling layers, denoted as “Baseline-NCGA”. As Fig. 10 shows, the hair generated by the baseline model is visibly contaminated by the background color and becomes very unrealistic. Specifically, in these two cases, the target hair is polluted by the white color of the background sky and the green color of the background tree, respectively. By contrast, our method adopts only the reference hair color in the final generated results. As shown in Tab. 1, our method also achieves a better FID score than Baseline-NCGA.
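The background leakage discussed above can be illustrated on a toy single-channel feature map by contrasting global average pooling with an average restricted to the hair mask. This is only a sketch of the pooling idea under simplifying assumptions (one channel, a binary mask, nested lists); it is not the appearance module itself, and the function names are hypothetical.

```python
def global_average(features):
    """Mean over every pixel, so background values leak into the result."""
    values = [v for row in features for v in row]
    return sum(values) / len(values)


def masked_average(features, hair_mask):
    """Mean over hair pixels only (hair_mask entries are 0 or 1)."""
    total, count = 0.0, 0
    for feat_row, mask_row in zip(features, hair_mask):
        for value, is_hair in zip(feat_row, mask_row):
            if is_hair:
                total += value
                count += 1
    if count == 0:
        raise ValueError("hair mask is empty")
    return total / count


# Toy feature map: hair responses are 2.0, a bright background is 10.0.
features = [[2.0, 2.0, 10.0],
            [2.0, 10.0, 10.0]]
hair_mask = [[1, 1, 0],
             [1, 0, 0]]
print(global_average(features))             # 6.0
print(masked_average(features, hair_mask))  # 2.0
```

The global average is pulled toward the background value, whereas the masked average recovers the hair response exactly, which mirrors the color pollution seen with Baseline-NCGA.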
Necessity of Structural Loss.
To enable structural hair control, besides using the dense orientation map as the condition input, we also employ a differentiable orientation layer in the network and add an extra structural loss as training supervision. This ablation experiment examines whether the structural loss is necessary. To verify it, we still feed the orientation map into the network but drop the structural loss term from the training objective. As shown in Tab. 1, our method performs better than this baseline in terms of the FID score. In Fig. 11, we provide two typical examples where both the appearance and orientation guidance are given by a reference image. Although the baseline can adopt the appearance of the reference image, it fails to provide any structure control. As a result, the hair orientation in its results is disordered, which makes the overall result less realistic. Two conclusions can be drawn from this experiment: 1) hair structure control is not only a feature but also a key component of realistic hair generation; 2) without explicit structural supervision, it is difficult for the generative network to produce sensible hair orientation, let alone keep the reference orientation.
Mask-Guided Feature Blending.
Although the main goal of our method is foreground hair editing, keeping the original background unchanged and blending it with the edited hair without artifacts is also crucial. To address this problem, we adopt a background condition module and fuse the background feature with the foreground feature in a mask-guided way. To show the necessity of the mask guidance, we further compare our method with a naive baseline that directly blends the foreground and background features without the mask guidance. Two representative examples are provided in Fig. 12. This naive blending introduces background features into the foreground region and cannot preserve the reference appearance and orientation well. The better FID score shown in Tab. 1 further verifies the necessity of our design.
To conclude, we present MichiGAN, a conditional hair generation network for portrait manipulation applications. Unlike existing conditioned face generation methods, our method provides user control over every major hair visual factor, which is achieved by explicitly disentangling hair into four orthogonal attributes: shape, structure, appearance, and background. For each of them, we modulate the image generation pipeline with a specially designed condition module that processes user inputs with respect to the unique nature of the corresponding hair attribute. All these condition modules are integrated with the backbone generator to form the final end-to-end network, which allows fully-conditioned hair generation from various user inputs. Based on it, we also build an interactive portrait hair editing system that enables straightforward manipulation of hair by projecting intuitive user inputs to well-defined condition representations.
Despite the superior performance of MichiGAN, several limitations remain to be explored in the future (Fig. 13):
- The target shape mask, either painted or taken from a reference, should be aligned reasonably well with the face. A mismatched mask can result in an unnatural hair shape. It would be of great help if the hair shape could be warped automatically according to different target face poses. We plan to investigate this topic in future research.
- The appearance module generally works well on hair images with a globally consistent appearance. However, spatially-varying hair appearance might be smeared out in the results.
- Even with our guided orientation inpainting, the guidance stroke should still be naturally compatible with the shape. Extreme stroke paths may cause unexpected structural issues, since they can hardly be found in the synthetic training pairs.
- We randomly dilate/erode hair shape masks and shrink the background masks during training to enforce a certain ability to synthesize natural hair boundaries from binary masks. However, the results can still be less satisfactory when hair shapes are dramatically changed, since we do not explicitly handle boundary matting. In the future, we are interested in combining MichiGAN with the latest learning-based alpha-matting methods to further increase the quality.
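The random mask dilation and erosion mentioned in the last limitation can be sketched as follows. This is an illustrative sketch only: the 3x3 structuring element, the step range `max_steps`, and all function names are assumptions made here, not the training code or its actual parameters.

```python
import random


def dilate(mask):
    """One step of binary dilation with a 3x3 square structuring element."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # A pixel becomes 1 if any in-bounds neighbor (or itself) is 1.
            out[y][x] = int(any(
                mask[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))))
    return out


def erode(mask):
    """One step of binary erosion, via duality with dilation.

    Pixels outside the image are treated as foreground, so the image
    border by itself never erodes the mask.
    """
    inverted = [[1 - v for v in row] for row in mask]
    return [[1 - v for v in row] for row in dilate(inverted)]


def random_perturb(mask, max_steps, rng=random):
    """Randomly dilate or erode the mask by 0..max_steps steps."""
    op = rng.choice([dilate, erode])
    for _ in range(rng.randint(0, max_steps)):
        mask = op(mask)
    return mask
```

Passing a seeded `random.Random` instance as `rng` makes the perturbation reproducible, which is convenient when debugging a data pipeline.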
We would like to thank the reviewers for their constructive feedback, Mingming He and Jian Ren for helpful discussions, Nvidia Research for making the Flickr-Faces-HQ (FFHQ) dataset, and all Flickr users for sharing their photos under the Creative Commons License. This work was partly supported by Exploration Fund Project of University of Science and Technology of China, YD3480002001 and Hong Kong ECS grant #21209119, Hong Kong UGC.
- (2019) Image2StyleGAN: how to embed images into the stylegan latent space? In ICCV 2019, pp. 4431–4440.
- (2019) Semantic photo manipulation with a generative image prior. ACM Trans. Graph. 38 (4), pp. 59:1–59:11.
- (2008) Face swapping: automatically replacing faces in photographs. ACM Trans. Graph. 27 (3), pp. 39.
- (1999) A morphable model for the synthesis of 3d faces. In SIGGRAPH 1999, pp. 187–194.
- (2019) Large scale gan training for high fidelity natural image synthesis. In ICLR 2019.
- (2014) FaceWarehouse: a 3d facial expression database for visual computing. IEEE Trans. Vis. Comput. Graph. 20 (3), pp. 413–425.
- (2015) High-quality hair modeling from a single portrait photo. ACM Trans. Graph. 34 (6), pp. 204:1–204:10.
- (2016) AutoHair: fully automatic hair modeling from a single image. ACM Trans. Graph. 35 (4), pp. 116:1–116:12.
- (2013) Dynamic hair manipulation in images and videos. ACM Trans. Graph. 32 (4), pp. 75:1–75:8.
- (2012) Single-view hair modeling for portrait manipulation. ACM Trans. Graph. 31 (4), pp. 116:1–116:8.
- (2017) Coherent online video style transfer. In ICCV 2017, pp. 1114–1123.
- (2017) StyleBank: an explicit representation for neural image style transfer. In CVPR 2017, pp. 2770–2779.
- (2011) Video face replacement. ACM Trans. Graph. 30 (6), pp. 130.
- (2016) Tutorial on variational autoencoders. CoRR abs/1606.05908.
- (2016) Perspective-aware manipulation of portrait photos. ACM Trans. Graph. 35 (4), pp. 128:1–128:10.
- (2019) 3D guided fine-grained face manipulation. In CVPR 2019, pp. 9821–9830.
- (2014) Generative adversarial nets. In NeurIPS 2014, pp. 2672–2680.
- (2018) Deep exemplar-based colorization. ACM Trans. Graph. 37 (4), pp. 47:1–47:16.
- (2018) Learning hierarchical semantic image manipulation through structured representations. In NeurIPS 2018, pp. 2713–2723.
- (2014) Robust hair capture using simulated examples. ACM Trans. Graph. 33 (4), pp. 126:1–126:10.
- (2015) Single-view hair modeling using a hairstyle database. ACM Trans. Graph. 34 (4), pp. 125:1–125:9.
- (2017) Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV 2017, pp. 1510–1519.
- (2018) Multimodal unsupervised image-to-image translation. In ECCV 2018, Vol. 11207, pp. 179–196.
- (2017) Image-to-image translation with conditional adversarial networks. In CVPR 2017, pp. 5967–5976.
- (2019) SC-fegan: face editing generative adversarial network with user’s sketch and color. In ICCV 2019, pp. 1745–1753.
- (2010) Personal photo enhancement using example images. ACM Trans. Graph. 29 (2), pp. 12:1–12:15.
- (2018) Progressive growing of gans for improved quality, stability, and variation. In ICLR 2018.
- (2019) A style-based generator architecture for generative adversarial networks. In CVPR 2019, pp. 4401–4410.
- (2014) Illumination-aware age progression. In CVPR 2014, pp. 3334–3341.
- (2016) Transfiguring portraits. ACM Trans. Graph. 35 (4), pp. 94:1–94:8.
- (2015) Adam: a method for stochastic optimization. In ICLR 2015.
- (2019) MaskGAN: towards diverse and interactive facial image manipulation. CoRR abs/1907.11922.
- (2008) Data-driven enhancement of facial attractiveness. ACM Trans. Graph. 27 (3), pp. 38.
- (2018) Video to fully automatic 3d hair model. ACM Trans. Graph. 37 (6), pp. 206:1–206:14.
- (2017) Visual attribute transfer through deep image analogy. ACM Trans. Graph. 36 (4), pp. 120:1–120:15.
- (2018) Image inpainting for irregular holes using partial convolutions. In ECCV 2018, Vol. 11215, pp. 89–105.
- (2019) Few-shot unsupervised image-to-image translation. In ICCV 2019, pp. 10550–10559.
- (2013) Structure-aware hair capture. ACM Trans. Graph. 32 (4), pp. 76:1–76:12.
- (2017) Least squares generative adversarial networks. In ICCV 2017, pp. 2813–2821.
- (2017) Unrolled generative adversarial networks. In ICLR 2017.
- (2014) Conditional generative adversarial nets. CoRR abs/1411.1784.
- (2019) Deep face normalization. ACM Trans. Graph. 38 (6), pp. 183:1–183:16.
- (2017) Conditional image synthesis with auxiliary classifier gans. In ICML 2017, Vol. 70, pp. 2642–2651.
- (2008) Hair photobooth: geometric and photometric acquisition of real hairstyles. ACM Trans. Graph. 27 (3), pp. 30.
- (2019) Semantic image synthesis with spatially-adaptive normalization. In CVPR 2019, pp. 2337–2346.
- (2019) Two-phase hair image synthesis by self-enhancing generative model. Comput. Graph. Forum 38 (7), pp. 403–412.
- (2016) Generative adversarial text to image synthesis. In ICML 2016, Vol. 48, pp. 1060–1069.
- (2018) 3D hair synthesis using volumetric variational autoencoders. ACM Trans. Graph. 37 (6), pp. 208:1–208:12.
- (2017) Neural face editing with intrinsic image disentangling. In CVPR 2017, pp. 5444–5453.
- (2015) Very deep convolutional networks for large-scale image recognition. In ICLR 2015.
- (2019) FineGAN: unsupervised hierarchical disentanglement for fine-grained object generation and discovery. In CVPR 2019, pp. 6490–6499.
- (2016) Ladder variational autoencoders. In NeurIPS 2016, pp. 3738–3746.
- (2019) Single image portrait relighting. ACM Trans. Graph. 38 (4), pp. 79:1–79:12.
- (2016) Face2Face: real-time face capture and reenactment of rgb videos. In CVPR 2016, pp. 2387–2395.
- (2018) High-resolution image synthesis and semantic manipulation with conditional gans. In CVPR 2018, pp. 8798–8807.
- (2009) Face relighting from a single image under arbitrary unknown lighting conditions. IEEE Trans. Pattern Anal. Mach. Intell. 31 (11), pp. 1968–1984.
- (2018) Real-time hair rendering using sequential adversarial networks. In ECCV 2018, pp. 105–122.
- (2013) Hair interpolation for portrait morphing. Comput. Graph. Forum 32 (7), pp. 79–84.
- (2011) Expression flow for 3d-aware face component transfer. ACM Trans. Graph. 30 (4), pp. 60.
- (2018) 3D-aware scene manipulation via inverse graphics. In NeurIPS 2018, pp. 1891–1902.
- (2019) Semantics disentangling for text-to-image generation. In CVPR 2019, pp. 2327–2336.
- (2017) StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV 2017, pp. 5908–5916.
- (2017) A data-driven approach to four-view image-based hair modeling. ACM Trans. Graph. 36 (4), pp. 156:1–156:11.
- (2017) Energy-based generative adversarial networks. In ICLR 2017.
- (2019) On the continuity of rotation representations in neural networks. In CVPR 2019, pp. 5745–5753.
- (2018) HairNet: single-view hair reconstruction using convolutional neural networks. In ECCV 2018, pp. 249–265.
- (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV 2017, pp. 2242–2251.