HairCLIP: Design Your Hair by Text and Reference Image

12/09/2021
by Tianyi Wei, et al.

Hair editing is an interesting and challenging problem in computer vision and graphics. Many existing methods require well-drawn sketches or masks as conditional inputs for editing; however, these interactions are neither straightforward nor efficient. In order to free users from the tedious interaction process, this paper proposes a new hair editing interaction mode, which enables manipulating hair attributes individually or jointly based on the texts or reference images provided by users. For this purpose, we encode the image and text conditions in a shared embedding space and propose a unified hair editing framework by leveraging the powerful image-text representation capability of the Contrastive Language-Image Pre-Training (CLIP) model. With carefully designed network structures and loss functions, our framework can perform high-quality hair editing in a disentangled manner. Extensive experiments demonstrate the superiority of our approach in terms of manipulation accuracy, visual realism of the editing results, and irrelevant attribute preservation. The project repository is available at https://github.com/wty-ustc/HairCLIP.


1 Introduction

Human hair, as a critical yet challenging component of the face, has long attracted the interest of researchers. In recent years, with the development of deep learning, many conditional GAN-based hair editing methods [tan2020michigan, xiao2021sketchhairsalon, Lee2020MaskGANTD] can produce satisfactory editing results. Most of these methods use well-drawn sketches [jo2019sc, tan2020michigan, xiao2021sketchhairsalon] or masks [Lee2020MaskGANTD, tan2020michigan] as the input of image-to-image translation networks to produce the manipulated results.

However, we think that these interaction types are not intuitive or user-friendly enough. For example, in order to edit the hairstyle in one image, users often need to spend several minutes drawing a good sketch, which greatly limits the large-scale, automated use of these methods. We therefore wonder: "Can we provide another, more intuitive and convenient interaction mode, just like human communication behaviors?" Language (or "text") naturally meets this requirement.

Benefiting from the development of cross-modal vision and language representations [tan2019lxmert, lu2019vilbert, su2019vl], text-guided image manipulation has become possible. Recently, StyleCLIP [patashnik2021styleclip] achieved impressive image manipulation results by leveraging the powerful image-text representation capability of CLIP [Radford2021LearningTV]. CLIP consists of an image encoder and a text encoder; through joint training on 400 million image-text pairs, they can measure the semantic similarity between an input image and a text description. Based on this observation, StyleCLIP proposes to use them as loss supervision to make the manipulated results match the text condition.

Although StyleCLIP inherently supports text-description-based hair editing, it is not exactly suitable for our task. It suffers from the following drawbacks: 1) for each specific hair editing description, it needs to train a separate mapper, which is not flexible in real applications; 2) the lack of a tailored network structure and loss design makes the method poorly disentangled with respect to hairstyle, hair color, and other unrelated attributes; 3) in practical applications, some hairstyles or colors are difficult to describe in text. In that case, users may prefer to use reference images, but StyleCLIP does not support reference-image-based hair editing.

To overcome the aforementioned limitations, we propose a hair editing framework that simultaneously supports different texts or reference images as the hairstyle/color conditions within one model. Generally, we follow StyleCLIP and utilize the StyleGAN [Karras2020AnalyzingAI] pre-trained on a large-scale face dataset as our generator; the key is then to learn a mapper network that maps the input conditions into the corresponding latent code changes. Different from StyleCLIP, however, we explore the potential of CLIP beyond measuring image-text similarity, along with several new designs: 1) Shared condition embedding. To unify the text and image conditions into the same domain, we leverage the text encoder and image encoder of CLIP to extract their embeddings as the conditions for the mapper network. 2) Disentangled information injection. We explicitly separate hairstyle and hair color information and feed them into different sub hair mappers corresponding to their semantic levels, which helps our method achieve disentangled hair editing. 3) Modulation module. We design a conditional modulation module that gives the input conditions direct control over the latent codes, which improves the manipulation ability of our method.

Since our goal is to achieve hair editing based on the text or reference image condition while keeping other irrelevant attributes unchanged, three types of losses are introduced: 1) a text manipulation loss guarantees the similarity between the editing result and the given text description; 2) an image manipulation loss guides hairstyle or hair color transfer from the reference image to the target image; 3) an attribute preservation loss keeps irrelevant attributes (e.g., identity and background) unchanged before and after editing.

Quantitative and qualitative comparisons and a user study demonstrate the superiority of our method in terms of manipulation accuracy, manipulation fidelity, and irrelevant attribute preservation. Some example editing results are shown in Figure 1. We also conduct extensive ablation analysis to justify the designs of our network structure and loss functions.

To summarize, our contributions are three-fold as below:

  • We push the frontier of interactive hair editing by unifying text and reference image conditions within one framework. It supports a wide range of text and image conditions in one single model without the need to train many independent models, which has not been achieved before.

  • In order to perform various hairstyle and hair color manipulations in a disentangled manner, we propose new network structure designs and loss functions tailored for our task.

  • Extensive experiments and analysis are conducted to show the better manipulation quality of our method and the necessity of each new design.

2 Related Work

Generative Adversarial Networks. Since being proposed by Goodfellow et al. [Goodfellow2014GenerativeAN], GANs have made great progress in terms of loss functions [Ansari2020ACF, Arjovsky2017WassersteinGA], network structure design [Schnfeld2020AUB, Gulrajani2017ImprovedTO, tan2021diverse], and training strategies [Tao2020AlleviationOG, Guo2020OnPC]. As a representative GAN in the field of image synthesis, StyleGAN [Karras2019ASG, Karras2020AnalyzingAI] can synthesize very high-fidelity human faces with realistic facial details and hair. As a typical unconditional GAN, however, StyleGAN by itself struggles to achieve controllable image synthesis. Fortunately, its latent space demonstrates promising disentanglement properties [Collins2020EditingIS, Shen2020InterpretingTL, Goetschalckx2019GANalyzeTV, Jahanian2020OnT], and many works utilize StyleGAN to perform image manipulation tasks [wang2021cross, Nitzan2020FaceID, Wei2021ASB, Alaluf2021OnlyAM, patashnik2021styleclip]. In this paper, we convert the unconditional StyleGAN into our conditional hair editing network with the help of CLIP's powerful image-text representation capability. Moreover, we unify the text and reference image conditions in one framework and achieve disentangled editing effects.

Image-based Hair Manipulation. As an important part of the human face, hair has attracted many works dedicated to hair modeling [Hu2014RobustHC, Chai2016AutoHairFA, Chai2015HighqualityHM] and synthesis [Wei2018RealTimeHR, Jo2019SCFEGANFE, Lee2020MaskGANTD, Zhu2020SEANIS]. Some works [Lee2020MaskGANTD, Zhu2020SEANIS] use masks, which explicitly decouple facial attributes including hair, as the conditional input for image-to-image translation networks to accomplish hair manipulation. Several other works [tan2020michigan, xiao2021sketchhairsalon] use sketches as input to depict the structure and shape of the desired hairstyle. However, such interactions are still relatively costly for users. To enable easier interaction, MichiGAN [tan2020michigan] supports hair transfer by extracting the orientation map of one hairstyle reference image as well as the appearance from another hair color reference image. However, MichiGAN easily fails for arbitrary shape changes during hair transfer. Recently, LOHO [Saha2021LOHOLO] performs a two-stage optimization in the latent and noise spaces of StyleGAN2 [Karras2020AnalyzingAI] to complete the hair transfer for a given reference image. However, the area optimized by this method is limited to the foreground, which requires blending the reconstructed foreground with the original background and often brings obvious artifacts. Besides, it is very time-consuming, e.g., taking several minutes to optimize one image.

Text-based Hair Manipulation. Along with the booming development of cross-modal vision and language representations [tan2019lxmert, lu2019vilbert, su2019vl], especially the powerful CLIP [Radford2021LearningTV], many recent efforts [chen2018language, jiang2021talk, xia2021tedigan, patashnik2021styleclip] have started to study text-based manipulation. However, no existing method is specifically tailored for hair editing. Among these works, the most relevant to ours are StyleCLIP [patashnik2021styleclip] and TediGAN [xia2021tedigan]. StyleCLIP needs to train a separate mapper network for each specific hair editing description, which is not flexible for real applications. TediGAN proposes two approaches: TediGAN-A encodes text and image separately into the latent space of StyleGAN and completes manipulation with style mixing, which is poorly decoupled and struggles to complete hair editing; TediGAN-B accomplishes the manipulation via optimization, using CLIP to provide a text-image similarity loss, but the lack of knowledge learned from a large dataset makes the process unstable and time-consuming.

Different from existing works, this paper presents the first unified framework that supports text and image conditions simultaneously. This provides a more intuitive and convenient interaction mode and enables diverse text and image conditions within one single model. Besides, benefiting from the new designs tailored for this task, our method also shows much better hair manipulation quality.

3 Proposed Method

Figure 2: Overview of our framework. Here we show an example with a hairstyle description text and a hair color reference image as conditional inputs. Our framework accomplishes the corresponding hair editing according to the given reference images and texts, where images and texts are encoded by CLIP's image encoder and text encoder into 512-dimensional vectors that serve as conditional inputs for the hair mapper. Only the three sub hair mappers are trainable: $M_c$ and $M_m$ take the hairstyle conditional input $e_s$, and $M_f$ takes the hair color conditional input $e_c$.

3.1 Overview

Imagine we are in a barbershop: if someone wants to restyle their hair, the common interaction is to name the desired hairstyle or provide the hairstylist with a corresponding picture. Inspired by this, we believe that empowering AI algorithms with such an intuitive and efficient interaction mode is truly needed. Thanks to the great image synthesis quality of StyleGAN [Karras2019ASG, Karras2020AnalyzingAI] and the excellent image/text representation ability of CLIP [Radford2021LearningTV], we are able to design a unified hair editing framework to achieve this goal. Before diving into the framework details, we briefly introduce StyleGAN and CLIP.

StyleGAN [Karras2019ASG, Karras2020AnalyzingAI] can synthesize high-resolution, high-fidelity realistic images from noise with a progressive upsampling network. Its synthesis process involves multiple latent spaces. $Z$ is the original noise space of StyleGAN. A randomly sampled noise vector $z$ is transformed into the latent space $W$ by a sequence of fully connected layers. Several studies [Collins2020EditingIS, Shen2020InterpretingTL, Goetschalckx2019GANalyzeTV, Jahanian2020OnT] have demonstrated that StyleGAN spontaneously learns to encode rich semantics within its $W$ space during training, and thus exhibits good semantic decoupling properties. In addition, some recent StyleGAN inversion works [Abdal2019Image2StyleGANHT, richardson2021encoding, Wei2021ASB] extend the $W$ space to the $W+$ space for better reconstruction. For a StyleGAN with 18 layers, a $W+$ code is defined by the concatenation of 18 different 512-dimensional vectors, one per layer.

CLIP [Radford2021LearningTV] is a multi-modal model pretrained on 400 million image-text pairs collected from the Internet. It consists of one image encoder and one text encoder, which encode an image and a text into 512-dimensional embedding vectors, respectively. It adopts the typical contrastive learning framework, which minimizes the cosine distance between the encoded vectors of correct image-text pairs and maximizes the cosine distance of incorrect pairs. Benefiting from large-scale pretraining, CLIP can reliably measure the semantic similarity between an image and a text via one shared image-text embedding space.
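To make the shared embedding space concrete, the following sketch uses the publicly released CLIP package to encode an image and a text prompt into 512-dimensional vectors and measure their cosine similarity; the file name and the prompt are placeholders.

```python
# Minimal sketch: measuring image-text similarity with the public CLIP package.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("face.jpg")).unsqueeze(0).to(device)    # placeholder path
text = clip.tokenize(["a person with an afro hairstyle"]).to(device)  # placeholder prompt

with torch.no_grad():
    img_emb = model.encode_image(image)   # (1, 512)
    txt_emb = model.encode_text(text)     # (1, 512)

similarity = torch.cosine_similarity(img_emb, txt_emb).item()
print(f"CLIP image-text cosine similarity: {similarity:.3f}")
```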

3.2 HairCLIP

Inspired by the pioneering work StyleCLIP [patashnik2021styleclip], we utilize the powerful synthesis ability of the pretrained StyleGAN and aim to learn an extra mapper network to achieve the hair editing function. More specifically, given a real image to edit, we first use the StyleGAN inversion method e4e [Tov2021DesigningAE] to obtain its latent code $w$ in the $W+$ space, then use the mapper network to predict the latent code change $\Delta w$ based on $w$ and the editing conditions (including the hairstyle condition $e_s$ and the hair color condition $e_c$). Finally, the modified latent code $w + \Delta w$ is fed back into the pretrained StyleGAN to obtain the target editing result. The overall pipeline is illustrated in Figure 2, and each component is elaborated below.
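The sketch below illustrates this editing pipeline under the assumption that a pretrained e4e encoder, a rosinality-style StyleGAN2 generator (taking a list of latents and returning an image and a latent), and a trained hair mapper are available as callables; the names are placeholders rather than the official API of the released code.

```python
# Sketch of the inference pipeline: invert, predict the latent change, regenerate.
import torch

@torch.no_grad()
def edit_hair(image, e4e_encoder, generator, hair_mapper, e_style, e_color):
    """image: (1, 3, H, W) tensor; e_style / e_color: (1, 512) CLIP embeddings or None."""
    w = e4e_encoder(image)                       # invert to W+: (1, 18, 512)
    delta_w = hair_mapper(w, e_style, e_color)   # predicted latent code change
    edited, _ = generator([w + delta_w], input_is_latent=True, randomize_noise=False)
    return edited                                # edited portrait
```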

Shared Condition Embedding. To unify the conditions from the text and image domains under one framework, we naturally choose to represent them by embedding them in the joint latent space of CLIP. For the user-supplied hairstyle text prompt and hair color text prompt, we use CLIP's text encoder to encode them into 512-dimensional conditional embeddings, denoted as $e_{ts}$ and $e_{tc}$, respectively. Similarly, the hairstyle reference image and hair color reference image are encoded by the image encoder of CLIP and denoted as $e_{is}$ and $e_{ic}$, respectively. Because CLIP is well trained on large-scale image-text pairs, all four embeddings reside in the shared latent space and can thus be fed into one mapper network and flexibly switched.

Disentangled Information Injection. As demonstrated in many works [xia2021tedigan, Karras2019ASG], different layers of StyleGAN correspond to different semantic levels of information in the generated images, with earlier layers corresponding to higher semantic levels. Following StyleCLIP [patashnik2021styleclip], we adopt three sub hair mappers $M_c$, $M_m$, and $M_f$ with the same network structure, which are responsible for predicting the latent code change $\Delta w$ corresponding to different parts (coarse, medium, and fine) of the latent code $w$. More specifically, $M_c$, $M_m$, and $M_f$ correspond to the high, middle, and low semantic levels, respectively.

Noticing this semantic layering phenomenon in StyleGAN, we propose disentangled information injection, which aims to improve the decoupling ability of the network for hairstyle and hair color editing. In detail, we use the hairstyle condition embedding $e_s$ (either $e_{ts}$ or $e_{is}$) from CLIP as the conditional input for $M_c$ and $M_m$, and the hair color condition embedding $e_c$ (either $e_{tc}$ or $e_{ic}$) as the conditional input for $M_f$. This is based on the empirical observation that hairstyle often corresponds to middle- and high-level semantic information in StyleGAN, while hair color corresponds to low-level semantic information. Therefore, the hair mapper $M$ can be formulated as:

\Delta w = M(w, e_s, e_c) = \big( M_c(w_c, e_s),\ M_m(w_m, e_s),\ M_f(w_f, e_c) \big), \tag{1}

where $w_c$, $w_m$, and $w_f$ denote the coarse, medium, and fine parts of $w$.
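A minimal sketch of this routing is given below. The 4/4/10 split of the 18-layer $W+$ code into coarse/medium/fine parts and the additive conditioning inside SubMapper are simplifying assumptions; the paper's sub-mappers instead inject the condition through the modulation module of Eq. (2).

```python
# Sketch of disentangled information injection (Eq. 1): hairstyle conditions the
# coarse and medium sub-mappers, hair color conditions the fine sub-mapper.
import torch
import torch.nn as nn

class SubMapper(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.cond_proj = nn.Linear(512, dim)
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.LeakyReLU(0.2),
                                 nn.Linear(dim, dim))

    def forward(self, w_part, cond):
        # inject the CLIP condition into every layer of this part of W+
        return self.net(w_part + self.cond_proj(cond).unsqueeze(1))

class HairMapper(nn.Module):
    def __init__(self):
        super().__init__()
        self.coarse, self.medium, self.fine = SubMapper(), SubMapper(), SubMapper()

    def forward(self, w, e_style, e_color):
        w_c, w_m, w_f = w[:, :4], w[:, 4:8], w[:, 8:]        # (B, layers, 512) each
        return torch.cat([self.coarse(w_c, e_style),          # hairstyle -> coarse
                          self.medium(w_m, e_style),          # hairstyle -> medium
                          self.fine(w_f, e_color)], dim=1)    # hair color -> fine
```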

Modulation Module. As shown in Figure 2, each sub hair mapper network follows a simple design and consists of five blocks, where each block consists of one fully connected (fc) layer, one newly designed modulation module, and one non-linear activation layer (leaky ReLU). Rather than simply concatenating the condition embedding with the input latent code, the modulation module uses the condition embedding $e$ to modulate the intermediate output $x$ of the preceding fc layer. Mathematically, it follows the formulation below:

x_{out} = f_\gamma(e)\,\frac{x - \mu(x)}{\sigma(x)} + f_\beta(e), \tag{2}

where $\mu(x)$ and $\sigma(x)$ denote the mean and standard deviation of $x$, respectively, and $f_\gamma$ and $f_\beta$ are implemented with simple fully connected networks (two fc layers with one intermediate LayerNorm and leaky ReLU layer). This design is motivated by recent conditional image translation works [park2019semantic, tan2021efficient, huang2017arbitrary]. During testing, if no conditional input is provided for hairstyle or hair color, all modulation modules in the corresponding sub hair mapper are implemented as identity functions; we denote this case as an empty $e_s$ or $e_c$. In this way, we flexibly support users to edit only the hairstyle, only the hair color, or both.
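The following sketch shows one way to implement the conditional modulation module and one sub hair mapper block, assuming the latent code is processed as a (batch, layers, 512) tensor; the hidden sizes and the normalization axis are assumptions consistent with Eq. (2) and the block description above.

```python
# Sketch of the conditional modulation module (Eq. 2) and one fc-modulation-LeakyReLU block.
import torch
import torch.nn as nn

class ModulationModule(nn.Module):
    def __init__(self, dim=512, cond_dim=512):
        super().__init__()
        self.f_gamma = nn.Sequential(nn.Linear(cond_dim, dim), nn.LayerNorm(dim),
                                     nn.LeakyReLU(0.2), nn.Linear(dim, dim))
        self.f_beta = nn.Sequential(nn.Linear(cond_dim, dim), nn.LayerNorm(dim),
                                    nn.LeakyReLU(0.2), nn.Linear(dim, dim))

    def forward(self, x, cond):
        if cond is None:                              # no condition -> identity function
            return x
        x_norm = (x - x.mean(dim=-1, keepdim=True)) / (x.std(dim=-1, keepdim=True) + 1e-8)
        gamma = self.f_gamma(cond).unsqueeze(1)       # (B, 1, dim)
        beta = self.f_beta(cond).unsqueeze(1)
        return gamma * x_norm + beta

class SubMapperBlock(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.fc = nn.Linear(dim, dim)
        self.mod = ModulationModule(dim)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x, cond):
        return self.act(self.mod(self.fc(x), cond))   # five such blocks form a sub-mapper
```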

3.3 Loss Functions

Our goal is to manipulate the hair in a decoupled manner based on the conditional input, while requiring that other irrelevant attributes (e.g., background, identity) be well preserved. Therefore, we specifically design three types of loss functions to train the mapper networks: a text manipulation loss, an image manipulation loss, and an attribute preservation loss.

Text Manipulation Loss. In order to perform the corresponding hair manipulation based on the text prompt of the hairstyle or color, we design the text manipulation loss with the help of CLIP as follows:

L_{text} = L_{ts} + L_{tc}. \tag{3}

For the hairstyle text manipulation loss $L_{ts}$, we measure the cosine distance between the manipulated image and the given text in CLIP's latent space:

L_{ts} = 1 - \cos\big(E_I(x'),\ e_{ts}\big), \tag{4}

where $\cos(\cdot,\cdot)$ denotes cosine similarity, $E_I$ represents the image encoder of CLIP, $G$ represents the pretrained StyleGAN generator, $e_{ts}$ denotes the embedding of the given hairstyle description text encoded by the text encoder of CLIP, and $x' = G(w + \Delta w)$ is the manipulated image. Similarly, the color text manipulation loss $L_{tc}$ is defined as follows:

L_{tc} = 1 - \cos\big(E_I(x'),\ e_{tc}\big), \tag{5}

where $e_{tc}$ denotes the embedding of the given color description text encoded by the text encoder of CLIP, and $x'$ is defined as above.
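A possible implementation of this CLIP-based text loss is sketched below; the resizing and normalization of the generated image before calling CLIP's image encoder are assumptions about the preprocessing, and mixed-precision casting may additionally be needed depending on how the CLIP model is loaded.

```python
# Sketch of the text manipulation loss: 1 - cos(E_I(x'), e_text).
import torch
import torch.nn.functional as F

CLIP_MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073]).view(1, 3, 1, 1)
CLIP_STD = torch.tensor([0.26862954, 0.26130258, 0.27577711]).view(1, 3, 1, 1)

def clip_image_embed(clip_model, image):
    """image: (B, 3, H, W) generator output in [-1, 1] -> (B, 512) CLIP embedding."""
    x = (image + 1.0) / 2.0                                                  # to [0, 1]
    x = F.interpolate(x, size=(224, 224), mode="bilinear", align_corners=False)
    x = (x - CLIP_MEAN.to(x.device)) / CLIP_STD.to(x.device)
    return clip_model.encode_image(x)

def text_manipulation_loss(clip_model, edited_image, e_text):
    """1 minus cosine similarity between the edited image and the text embedding."""
    img_emb = clip_image_embed(clip_model, edited_image)
    return (1.0 - F.cosine_similarity(img_emb, e_text, dim=-1)).mean()
```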

Image Manipulation Loss. Given a reference image, we want the manipulated image to possess the same hairstyle as the reference image. However, characterizing the similarity between two hairstyles is challenging. Exploiting the powerful potential of CLIP again, we encode the hair regions of the two images separately with CLIP's image encoder and measure their similarity in CLIP's latent space:

L_{is} = 1 - \cos\big(E_I(x' \odot H_{x'}),\ E_I(x_{ref} \odot H_{x_{ref}})\big), \tag{6}

where $x' = G(w + \Delta w)$ is the manipulated image, $P$ denotes the pre-trained facial parsing network [FaceParsing], $H_x$ represents the mask of the hair region of an image $x$ predicted by $P$, and $x_{ref}$ denotes the given reference image. Thanks to this supervision, our method can yield plausible editing results even when the reference image and the input image are seriously misaligned, which is currently unavailable in other hairstyle transfer methods. For reference-image-based hair color manipulation, we calculate the average color difference of the hair regions between the reference image and the manipulated image as the loss:

L_{ic} = \big\| \bar{c}(x', H_{x'}) - \bar{c}(x_{ref}, H_{x_{ref}}) \big\|_1, \tag{7}

where $\bar{c}(x, H)$ denotes the average color of image $x$ over the hair region $H$, computed as the sum of the masked pixels divided by the mask area. In summary, the image manipulation loss is defined as:

L_{img} = \lambda_{is} L_{is} + \lambda_{ic} L_{ic}, \tag{8}

where $\lambda_{is}$ and $\lambda_{ic}$ are balancing weights.
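A sketch of both image-conditioned terms is given below, assuming a face parser exposed as hair_mask_fn that returns a soft hair mask in [0, 1] and reusing the clip_image_embed helper from the previous sketch; all names are placeholders.

```python
# Sketch of the image manipulation losses of Eqs. (6)-(7).
import torch
import torch.nn.functional as F

def hairstyle_image_loss(clip_model, hair_mask_fn, edited, reference):
    """1 minus cosine similarity between CLIP embeddings of the two hair regions."""
    edited_hair = edited * hair_mask_fn(edited)          # keep only hair pixels
    ref_hair = reference * hair_mask_fn(reference)
    e1 = clip_image_embed(clip_model, edited_hair)
    e2 = clip_image_embed(clip_model, ref_hair)
    return (1.0 - F.cosine_similarity(e1, e2, dim=-1)).mean()

def hair_color_image_loss(hair_mask_fn, edited, reference):
    """L1 distance between the average hair colors of the two images."""
    def avg_hair_color(img):
        mask = hair_mask_fn(img)                         # (B, 1, H, W)
        return (img * mask).sum(dim=(2, 3)) / (mask.sum(dim=(2, 3)) + 1e-8)
    return F.l1_loss(avg_hair_color(edited), avg_hair_color(reference))
```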

Attribute Preservation Loss. To ensure identity consistency before and after hair editing, the identity loss is applied as follows:

L_{id} = 1 - \cos\big(R(x'),\ R(x)\big), \tag{9}

where $R$ is a pretrained ArcFace [Deng2019ArcFaceAA] network for face recognition and $x = G(w)$ denotes the reconstructed real image. In addition, we design $L_{mc}$ in the same way as $L_{ic}$ in order to maintain the hair color when only manipulating the hairstyle:

L_{mc} = \big\| \bar{c}(x', H_{x'}) - \bar{c}(x, H_{x}) \big\|_1, \tag{10}

where $x'$ is the manipulated image and $x$ is the reconstructed real image, as defined above. Empirically, we find that the hairstyle is well preserved when only changing the color, so we do not add a corresponding hairstyle preservation loss.

Moreover, we introduce a background loss with the help of the facial parsing network [FaceParsing]:

L_{bg} = \big\| (x' - x) \odot B_x \big\|_2, \tag{11}

where $B_x = 1 - H_x$ represents the mask of the non-hair region of $x$. In this way, we largely ensure that the irrelevant attribute regions remain unchanged. For the same purpose, the norm of the manipulation step in the latent space is also penalized:

L_{norm} = \big\| \Delta w \big\|_2. \tag{12}

The overall attribute preservation loss is defined as:

L_{ap} = \lambda_{id} L_{id} + \lambda_{mc} L_{mc} + \lambda_{bg} L_{bg} + \lambda_{norm} L_{norm}, \tag{13}

where $\lambda_{id}$, $\lambda_{mc}$, $\lambda_{bg}$, and $\lambda_{norm}$ are balancing weights.
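The preservation terms can be sketched as follows, assuming an ArcFace-style embedding network arcface_embed and the hair_mask_fn parser from the earlier sketches; whether the background mask comes from the original image alone or from both images is an assumption.

```python
# Sketch of the attribute preservation terms of Eqs. (9)-(12).
import torch
import torch.nn.functional as F

def identity_loss(arcface_embed, edited, original):
    """1 minus cosine similarity of face-recognition embeddings."""
    return (1.0 - F.cosine_similarity(arcface_embed(edited),
                                      arcface_embed(original), dim=-1)).mean()

def background_loss(hair_mask_fn, edited, original):
    """Penalize pixel changes outside the hair region of the original image."""
    bg_mask = 1.0 - hair_mask_fn(original)
    return F.mse_loss(edited * bg_mask, original * bg_mask)

def latent_norm_loss(delta_w):
    """L2 norm of the predicted manipulation step in latent space."""
    return delta_w.norm(p=2, dim=-1).mean()
```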

Finally, the overall loss function is defined as:

L = \lambda_{text} L_{text} + \lambda_{img} L_{img} + \lambda_{ap} L_{ap}, \tag{14}

where $\lambda_{text}$, $\lambda_{img}$, and $\lambda_{ap}$ are balancing weights.

4 Experiments

Figure 3 (image grid): columns are Input Image, Ours, StyleCLIP [patashnik2021styleclip], and TediGAN [xia2021tedigan]; the ten text descriptions are afro hairstyle, green hair, bobcut hairstyle, blond hair, bowlcut hairstyle, braid brown, mohawk hairstyle, crewcut yellow, purple hair, and perm gray.

Figure 3: Visual comparison with StyleCLIP [patashnik2021styleclip] and TediGAN [xia2021tedigan]. The corresponding simplified text descriptions (editing hairstyle, hair color, or both) are listed on the leftmost side of each row, and all input images are inversions of the real images. Our approach demonstrates better visual photorealism and irrelevant attribute preservation while completing the specified hair editing.

Implementation Details. We train and evaluate our hair mapper on the CelebA-HQ dataset [karras2018progressive]. Since we use e4e [Tov2021DesigningAE] as our inversion encoder, we follow its split of the training and test sets. The StyleGAN2 [Karras2020AnalyzingAI] pre-trained on the FFHQ dataset [Karras2019ASG] is used as our generator. For the text input, we collected a set of hairstyle text descriptions and hair color text descriptions. The CelebA-HQ dataset is used to provide reference images of hairstyles and hair colors, and we also generated several edited images using our text-guided hair editing method to augment the diversity of the reference image set. During training, the hair mapper is randomly tasked to edit only the hairstyle, only the hair color, or both, depending on the provided conditional input, and each conditional input is randomly set as a text or a reference image, as sketched below. The hair mapper is optimized with Adam [kingma2015adam]; the base learning rate, batch size, number of training iterations, and momentum parameters follow our default configuration. For all compared methods, we use the official training code or pre-trained models.
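A sketch of this condition sampling is shown below; the pools of prompts and reference images and the two encoder callables are placeholders for whatever the training loop provides.

```python
# Sketch of per-iteration condition sampling: edit hairstyle only, color only, or both,
# with each active condition drawn from either a text prompt or a reference image.
import random

def sample_conditions(style_texts, color_texts, ref_images, encode_text, encode_image):
    task = random.choice(["style", "color", "both"])
    e_style = e_color = None
    if task in ("style", "both"):
        if random.random() < 0.5:
            e_style = encode_text(random.choice(style_texts))
        else:
            e_style = encode_image(random.choice(ref_images))
    if task in ("color", "both"):
        if random.random() < 0.5:
            e_color = encode_text(random.choice(color_texts))
        else:
            e_color = encode_image(random.choice(ref_images))
    return e_style, e_color
```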

To quantitatively evaluate irrelevant attribute preservation, four metrics are used: IDS denotes the identity similarity before and after editing, calculated by CurricularFace [Huang2020CurricularFaceAC]; PSNR and SSIM are calculated over the intersection of the non-hair regions before and after editing; ACD denotes the average color difference of the hair region.
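For illustration, the two mask-aware metrics can be computed as in the sketch below, assuming float images in [0, 1] and boolean hair masks; the exact mask handling used for the reported numbers is not specified here, so treat this as one plausible realization.

```python
# Sketch: PSNR over the intersection of non-hair regions, and ACD over the hair region.
import numpy as np

def masked_psnr(img_a, img_b, hair_a, hair_b):
    """img_*: (H, W, 3) in [0, 1]; hair_*: (H, W) boolean hair masks."""
    keep = (~hair_a) & (~hair_b)                 # intersection of non-hair regions
    mse = np.mean((img_a[keep] - img_b[keep]) ** 2)
    return 10.0 * np.log10(1.0 / (mse + 1e-12))

def average_color_difference(img_a, img_b, hair_a, hair_b):
    """Mean absolute difference between the average hair colors of the two images."""
    mean_a = img_a[hair_a].mean(axis=0)          # (3,) average hair color
    mean_b = img_b[hair_b].mean(axis=0)
    return float(np.abs(mean_a - mean_b).mean())
```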

4.1 Quantitative and Qualitative Comparison

Comparison to Text-Driven Image Manipulation Methods. We compare our approach with the current state-of-the-art text-driven image manipulation methods TediGAN [xia2021tedigan] and StyleCLIP [patashnik2021styleclip] on ten text descriptions. The optimization iteration number of TediGAN is set according to their official recommendation. The visual comparison is shown in Figure 3. TediGAN fails on all hairstyle-related editing tasks; only the hair color editing is barely successful, and the results are still unsatisfactory. This phenomenon is consistent with the findings in the StyleCLIP paper: the optimization method using the CLIP similarity loss is very unstable due to the lack of knowledge learned from a large dataset.

StyleCLIP trains a separate mapper for each description and thus demonstrates stronger manipulation ability on the task of editing only the hairstyle, but this excessive manipulation ability harms image realism (see afro hairstyle). Thanks to our shared condition embedding, our method finds a balance between the degree of manipulation and realism by learning jointly over many hair editing descriptions. On the task of editing both hairstyle and hair color, our method exhibits better manipulation ability. This is due to the proposed disentangled information injection and modulation module, whereas StyleCLIP packs both pieces of information into one description, making it poorly decoupled and unable to perform hairstyle and hair color editing at the same time. In addition, benefiting from the attribute preservation loss, our method exhibits better retention of irrelevant attributes (see mohawk hairstyle, purple hair).

In Table 1, we give the average quantitative comparison results in terms of irrelevant attribute preservation on these ten text descriptions. The quantitative results lead to the same conclusions as the visual comparison. We do not compare the FID [Heusel2017GANsTB] used in TediGAN here since it cannot reflect the manipulation capability; more quantitative results and analysis in terms of the FID metric are given in the supplementary material.

Methods IDS PSNR SSIM
Ours 0.83 27.8 0.92
StyleCLIP [patashnik2021styleclip] 0.79 23.2 0.87
TediGAN [xia2021tedigan] 0.17 24.1 0.79
Table 1: Quantitative comparison regarding the preservation of irrelevant attributes. Our approach exhibits the best irrelevant attributes preservation ability.

Comparison to Hair Transfer Methods. Given a hairstyle reference image and a hair color reference image, the purpose of hair transfer is to transfer the corresponding hairstyle and hair color attributes to the input image. We compare our method with the current state-of-the-art LOHO [Saha2021LOHOLO] and MichiGAN [tan2020michigan] in Figure 4. Both of these methods perform hairstyle transfer by direct replication in the spatial domain to generate more accurate details of the hair structure, although they suffer from obvious artifacts in the boundary areas in some cases (see the results in the first row). Moreover, as shown in the last two rows, they are sensitive to the pose of the hairstyle reference image and cannot produce plausible hairstyle transfer when the hairstyle and pose are not well aligned between the hairstyle reference image and the input image. Unlike these two approaches, we measure hairstyle similarity in the latent space of CLIP during training and use the CLIP embedding of the hair region of the reference image as the conditional input. As a result, our method provides a solution for unaligned hairstyle transfer and shows its superiority over existing methods.

Figure 4 (image grid): columns are Input, HRI, CRI, Ours, LOHO, and MichiGAN.

Figure 4: Comparison of our approach with LOHO [Saha2021LOHOLO] and MichiGAN [tan2020michigan] on hair transfer. HRI means hairstyle reference image and CRI means hair color reference image.

User Study. To further evaluate the manipulation ability and the visual realism of the edited results of different methods on the two types of hair editing tasks, we recruited a group of participants for a user study. For the text-driven image manipulation methods, we showed each participant groups of results from the three methods, randomly drawn to cover two instances of each of the ten hair editing descriptions. For the hair transfer methods, participants were likewise shown groups of results, half of which were aligned hairstyle transfer cases and the other half non-aligned. Participants were asked to rank the three methods in each group with respect to manipulation accuracy and visual realism, where 1 represents the best and 3 represents the worst. The average ranking values are listed in Table 2, where our method outperforms the competing approaches on both metrics.

Text-Driven Methods Hair Transfer Methods
Metrics Ours StyleCLIP TediGAN Ours LOHO MichiGAN
Acc. 1.39 1.66 2.95 1.79 2.26 1.95
Real. 1.42 1.63 2.95 1.09 2.48 2.43
Table 2: User study on text-driven image manipulation methods and hair transfer methods. Acc. denotes the manipulation accuracy for given conditional inputs and Real. denotes the visual realism of the manipulated image. The numbers in the table are average rankings, the lower the better.

4.2 Ablation Analysis

To verify the effectiveness of our proposed network structure and loss functions, we alternately ablate each key component and retrain the corresponding variant of our method, keeping everything except the selected component unchanged.

Importance of Attribute Preservation Loss. To verify the role of each component of the attribute preservation loss, we randomly selected a set of images for qualitative and quantitative ablation studies on the task of editing only the hairstyle. Consistent conclusions can be drawn from Table 3 and Figure 5: $L_{id}$, $L_{bg}$, and $L_{norm}$ all contribute to the preservation of irrelevant attributes, and $L_{mc}$ helps keep the hair color unchanged when only editing the hairstyle.

Methods IDS PSNR SSIM ACD
Ours 0.85 27.0 0.91 0.02
w/o $L_{bg}$ 0.82 19.9 0.82 0.02
w/o $L_{id}$ 0.25 22.8 0.80 0.03
w/o $L_{mc}$ 0.82 26.6 0.90 0.09
w/o $L_{norm}$ 0.75 24.9 0.87 0.03
Table 3: Quantitative ablation experiments on the attribute preservation loss.

Superiority of Network Structure Design. We compare our model with three variants: (a) replace the modulation module with a vanilla LayerNorm layer and concatenate the conditional inputs with the latent code before feeding them into the network; (b) swap the conditional inputs, i.e., feed the hair color embedding to the coarse and medium sub hair mappers and the hairstyle embedding to the fine sub hair mapper; (c) replace the conditional input of the medium sub hair mapper with the hair color embedding and leave the rest unchanged. As shown in Figure 6, only our full model completes both hairstyle and hair color manipulation. The unsatisfactory result of (a) proves that our modulation module better fuses the condition information into the latent space and improves the manipulation capability. (b) and (c) confirm the correctness of our disentangled, semantic-matching-based information injection.

Figure 5: The effect of the attribute preservation loss. The text description is "slicked back hairstyle". Columns: Input, Ours, and the variants without each attribute preservation loss term.
Figure 6: Visual comparison of the results generated by our method and variants of our model; columns show the Input Image, Ours, (a), (b), and (c). The text description is "perm hairstyle and red hair". (a) concatenates the conditional inputs with the latent code; (b) feeds the hair color embedding to the coarse and medium sub hair mappers and the hairstyle embedding to the fine sub hair mapper; (c) feeds the hair color embedding to the medium sub hair mapper and leaves the rest unchanged.

Hair Interpolation.

Given two edited latent codes $w_1$ and $w_2$, we can achieve fine-grained hair editing by interpolation. In detail, we combine the two latent codes by linear weighting to generate the intermediate latent code $w_{inter} = (1 - \alpha) w_1 + \alpha w_2$, and then generate the image corresponding to $w_{inter}$. By gradually increasing the blending parameter $\alpha$ from 0 to 1, we can control hair editing at a fine-grained level, as shown in Figure 7.
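A sketch of this interpolation is given below, reusing the generator interface assumed in the earlier pipeline sketch; w1 and w2 are two edited W+ codes of shape (1, 18, 512).

```python
# Sketch of fine-grained hair editing by linear interpolation of edited latent codes.
import torch

@torch.no_grad()
def interpolate_hair(generator, w1, w2, steps=5):
    frames = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        w_mid = (1.0 - alpha) * w1 + alpha * w2                  # linear blending
        image, _ = generator([w_mid], input_is_latent=True, randomize_noise=False)
        frames.append(image)
    return frames
```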

Generalization Ability. In Figure 8, we demonstrate the generalization ability of our method to unseen text descriptions. Thanks to our strategy of shared condition embedding, our method possesses some extrapolation ability after training with only a limited number of hair editing descriptions, and yields reasonable editing results for texts that never appear in the training descriptions.

Cross-Modal Conditional Inputs. Our method supports conditional inputs from the image and text domains individually or jointly, which is not feasible with existing hair editing methods; results are shown in Figure 1. More results are given in the supplementary materials.

Figure 7: Hair interpolation results. By gradually increasing the blending parameter $\alpha$ from 0 to 1, we can control hair editing at a fine-grained level, such as changing from yellow hair to pink hair, or from a ringlets hairstyle to a shingle bob hairstyle.
Figure 8: Generalization ability to unseen descriptions (columns: Input Image, curly short, mushroom, violet, silver). Despite never being trained on the descriptions "curly short hairstyle", "mushroom hairstyle", "violet hair", and "silver hair", our method can still yield plausible manipulation results.

5 Limitations and Negative Impact

Since our editing is done in the latent space of a pretrained StyleGAN, we cannot complete the editing for some rare hairstyle descriptions or reference images that are outside the domain of StyleGAN. This limitation could be alleviated by adding corresponding images to the pre-training data of StyleGAN. In the hairstyle transfer task, we use the embedding of the hairstyle reference image in the latent space of CLIP as the conditional input for our hair mapper, which sometimes loses fine-grained structural information and thus cannot achieve a perfect transfer of the hairstyle's structural details. In addition, the hair-edited images produced by our method could be used to spread malicious information; this risk can be mitigated by existing state-of-the-art GAN-generated image detectors [wang2020cnn].

6 Conclusions

In this paper, we propose a new hair editing interaction mode that unifies conditional inputs from the text and image domains in a single framework. In our framework, users can individually or jointly provide textual descriptions and reference images to complete the hair editing. This multi-modal interaction greatly increases the flexibility of hair editing and reduces the interaction cost for users. By fully exploiting the potential of CLIP together with tailored network structure designs and loss functions, our framework supports high-quality hair editing in a decoupled manner. Extensive qualitative and quantitative comparisons and a user study demonstrate the superiority of our method over competing methods in terms of manipulation capability, irrelevant attribute preservation, and image realism.

References

Appendix A Quantitative Results

We give the detailed quantitative results in terms of FID [Heusel2017GANsTB] in Table 4 for each hair description. Although TediGAN [xia2021tedigan] performs best in terms of FID, it hardly performs any hair editing task well (as demonstrated by the qualitative results). We argue that FID may not be a suitable metric for evaluating manipulation ability; a similar conclusion was drawn in e4e [Tov2021DesigningAE].

Appendix B More Qualitative Results

In Figures 9, 10, and 11, we give more visual comparison results with other state-of-the-art methods, as well as results for cross-modal conditional inputs.

Figure 9 (image grid): columns are Input Image, Ours, StyleCLIP [patashnik2021styleclip], and TediGAN [xia2021tedigan]; rows correspond to afro hairstyle, bobcut hairstyle, bowlcut hairstyle, mohawk hairstyle, purple hair, green hair, blond hair, braid brown, crewcut yellow, and perm gray.

Figure 9: Visual comparison with StyleCLIP [patashnik2021styleclip] and TediGAN [xia2021tedigan]. The corresponding simplified text descriptions (editing hairstyle, hair color, or both) are listed on the leftmost side of each row, and all input images are inversions of the real images. Our approach demonstrates better visual photorealism and irrelevant attribute preservation while completing the specified hair editing.
Figure 10 (image grid): columns are Input, Hairstyle Ref, Color Ref, Ours, LOHO [Saha2021LOHOLO], and MichiGAN [tan2020michigan].

Figure 10: Comparison of our approach with LOHO [Saha2021LOHOLO] and MichiGAN [tan2020michigan] on hair transfer. Even for extreme examples like the third row in this figure, our method can yield plausible hairstyle transfer results.
Figure 11: Our single framework supports hairstyle and hair color editing individually or jointly, and conditional inputs can come from either image or text domain. “Style” in the text description is the abbreviation for hairstyle.