Barbershop: GAN-based Image Compositing using Segmentation Masks

06/02/2021 ∙ by Peihao Zhu, et al. ∙ King Abdullah University of Science and Technology ∙ Miami University

Seamlessly blending features from multiple images is extremely challenging because of complex relationships in lighting, geometry, and partial occlusion which cause coupling between different parts of the image. Even though recent work on GANs enables synthesis of realistic hair or faces, it remains difficult to combine them into a single, coherent, and plausible image rather than a disjointed set of image patches. We present a novel solution to image blending, particularly for the problem of hairstyle transfer, based on GAN-inversion. We propose a novel latent space for image blending which is better at preserving detail and encoding spatial information, and propose a new GAN-embedding algorithm which is able to slightly modify images to conform to a common segmentation mask. Our novel representation enables the transfer of the visual properties from multiple reference images including specific details such as moles and wrinkles, and because we do image blending in a latent-space we are able to synthesize images that are coherent. Our approach avoids blending artifacts present in other approaches and finds a globally consistent image. Our results demonstrate a significant improvement over the current state of the art in a user study, with users preferring our blending solution over 95 percent of the time.



1. Introduction

Due to the rapid improvement of generative adversarial networks (GANs), GAN-based image editing has recently become widely used, both in desktop applications for professional users and in social media photo editing tools for casual users. Of particular interest are tools to edit photographs of human faces. In this paper, we propose new tools for image editing that mix elements from multiple example images in order to make a composite image. Our focus is on the task of hair editing.

Despite the recent success of face editing based on latent space manipulation (abdal2019image2stylegan; abdal2020image2stylegan++; zhu2020improved), most editing tasks operate on an image by changing global attributes such as pose, expression, gender, or age. Another approach to image editing is to select features from reference images and mix them together to form a single, composite image. Examples of composite image editing that have seen recent progress are the problems of hair transfer and face swapping. These tasks are extremely difficult for a variety of reasons. Chief among them is the fact that the visual properties of different parts of an image are not independent of each other. The appearance of hair, for example, is heavily influenced by ambient and reflected light as well as transmitted colors from the underlying face, clothing, and background. The pose of a head influences the appearance of the nose, eyes, and mouth, and the geometry of a person's head and shoulders influences shadows and the structure of hair. Other challenges include disocclusion of the background, which happens when the hair region shrinks with respect to the background. Disocclusion of the face region can expose new parts of the face, such as the ears, forehead, or jawline. The shape of the hair is influenced by pose and also by the camera's intrinsic parameters, and so the pose might have to change to adapt to the hair.

Failure to account for the global consistency of an image will lead to noticeable artifacts: the different regions of the image will appear disjointed even if each part is synthesized with a high level of realism. In order for the composite image to seem plausible, our aim is to make a single coherent composite image that balances the fidelity of each region to the corresponding reference image while also synthesizing an overall convincing and highly realistic image.

Previous methods of hair transfer based on GANs either use a complex pipeline of conditional GAN generators (Tan_2020), with each conditioning module specialized to represent, process, and convert reference inputs with different visual attributes, or make use of latent space optimization with carefully designed losses and gradient orthogonalization (saha2021loho) to explicitly disentangle hair attributes. While both of these methods show very promising initial results, we found that they could be greatly improved. For example, both of them need pretrained inpainting networks to fill holes left over by misaligned hair masks, which may lead to blurry artifacts and unnatural boundaries. We believe that better results can be achieved without an auxiliary inpainting network to fill the holes, as transitions between regions have higher quality if they are synthesized by a single GAN. Also, these previous methods do not make use of a semantic alignment step to merge semantic regions from different reference images in latent space, e.g. to align a hair region and a face region from different images.

In this work, we propose Barbershop, a novel optimization method for photo-realistic hairstyle transfer, face swapping, and other composite image editing tasks applied to faces. Our approach uses GAN-inversion to generate high-fidelity reconstructions of reference images. We suggest a novel latent space which provides coarse control of the spatial locations of features via a structure tensor F, as well as fine control of global style attributes via an appearance code S. This latent space allows a trade-off between a latent code's capacity to maintain the spatial locations of features such as wrinkles and moles while also supporting latent code manipulation. We edit the codes to align reference images to target feature locations. This alignment step is a key extension to existing GAN-embedding algorithms: it embeds images while at the same time slightly altering them to conform to a different segmentation mask. Then we find a blended latent code by mixing reference images in latent space, rather than compositing images in the spatial domain. The result is a latent code of an image. By blending in the new spatially aware latent space we avoid many of the artifacts of other image compositing approaches.

Our proposed approach is demonstrated in Fig. 1. We are able to transfer only the shape of a subject's hair (Fig. 1b); we influence the shape by altering the hair region in a segmentation mask. We can also transfer both the shape and the coarse structure (Fig. 1c). We use the term structure to refer to information captured by earlier GAN layers, such as the geometry of the hair strands. For example, the structure encodes the difference between straight, curly, or wavy hair. We can further transfer shape, structure, and detailed appearance (Fig. 1(d,f)). We use the term appearance to describe information encoded in later GAN layers, including hair color, texture, and lighting. Our approach also supports using different reference images for the structure and for the appearance code, as shown in Fig. 1(g,h).

Our main contributions are:

  • A novel latent space, called FS space, for representing images. The new space is better at preserving details and is more capable of encoding spatial information.

  • A new GAN-embedding algorithm for aligned embedding. Similar to previous work, the algorithm can embed an image to be similar to an input image. In addition, the image is slightly modified to conform to a new segmentation mask.

  • A novel image compositing algorithm that can blend multiple images encoded in our new latent space to yield high quality results.

  • We achieve a significant improvement in hair transfer, with our approach being preferred over existing state of the art approaches by over 95% of participants in a user study.

2. Related Work

GAN-based Image Generation.

Since their advent, GANs (goodfellow2014generative; radford2015unsupervised) have driven a surge in high quality image generation research. Several state-of-the-art GAN networks demonstrate significant improvements in the visual quality and diversity of generated samples. Recent GANs such as ProGAN (karras2017progressive), StyleGAN (STYLEGAN2018), and StyleGAN2 (Karras2019stylegan2) produce highly detailed, high-fidelity images that are almost indistinguishable from real images. Especially in the domain of human faces, these architectures deliver unmatched quality and can be applied to downstream tasks such as image manipulation (shen2020interfacegan; abdal2019image2stylegan). StyleGAN-ADA (Karras2020ada) showed that a GAN can be trained on limited data without compromising its generative ability. High quality image generation is also attributable to the availability of high quality datasets such as FFHQ (STYLEGAN2018), AFHQ (choi2020starganv2), and LSUN (yu15lsun) objects, which provide both the quality and the diversity needed to train GANs and have further enabled realistic applications. BigGAN (brock2018large) can produce high quality samples on complex datasets like ImageNet (imagenet_cvpr09). Other notable generative modeling methods, including Variational Autoencoders (VAEs) (VAE2013), PixelCNNs (Salimans2017PixeCNN), Normalizing Flows (chen2018neural), and Transformer-based VAEs (esser2020taming), have their own unique advantages. In this work, however, we focus on StyleGAN2 trained on the FFHQ dataset because it is considered the state of the art for face image generation.

Embedding Images into the GAN Latent Space.

In order to edit real images, a given image needs to be projected into the GAN latent space. There are broadly two different ways to project/embed images into the latent space of a GAN. The first is the optimization based approach. Particularly for StyleGAN, I2S (abdal2019image2stylegan) demonstrated high quality embeddings into the extended W space, called W+ space, for real image editing. Several follow-up works (zhu2020domain; tewari2020pie) show that the embeddings can be improved by including new regularizers in the optimization. An improved version of Image2StyleGAN (II2S) (zhu2020improved) demonstrated that suitable latent-space regularization leads to better embedding and editing quality. Research on these optimization based approaches with StyleGAN has also led to commercial software such as Adobe Photoshop's Neural Filters (Adobe). The second approach is to use encoder based methods that train an encoder on the latent space. Some notable works (tov2021designing; richardson2020encoding) produce high quality image embeddings that can be manipulated. In this work, we propose several technical extensions to build on previous work in image embedding.

Latent Space Manipulation for Image Editing.

GAN interpretability and GAN-based image manipulation have recently been of great interest to the GAN research community. There are broadly two spaces where semantic manipulation of images is possible: the latent space and the activation space. Notable works in latent space manipulation try to understand the nature of the GAN latent space in order to extract meaningful directions for edits. For instance, GANSpace (harkonen2020ganspace) extracts linear directions from the StyleGAN latent space (W space) in an unsupervised fashion using Principal Component Analysis (PCA). Another notable work, StyleRig (tewari2020stylerig), learns a mapping between a riggable face model and the StyleGAN latent space. Studying the non-linear nature of the StyleGAN latent space, StyleFlow (abdal2020styleflow) uses normalizing flows to model the latent space of StyleGAN and produce various sequential edits. Another approach, StyleCLIP (patashnik2021styleclip), uses text to manipulate the latent space. A second set of papers focuses on the layer activations (Bau:Ganpaint:2019; bau2020units) to produce fine-grained local edits to an image generated by StyleGAN. Among them are TileGAN (fruhstuck2019tilegan), Image2StyleGAN++ (abdal2020image2stylegan++), and EditStyle (collins2020editing), which manipulate the activation maps directly to achieve a desired edit. The recently developed StyleSpace (wu2020stylespace) studies the style parameters of the channels to produce fine-grained edits. StylemapGAN (kim2021stylemapgan), on the other hand, converts the latent codes into spatial maps that are interpretable and can be used for local editing of an image.

Conditional GANs.

One of the main research areas enabling high quality image manipulation is work on conditional GANs (CGANs) (mirza2014conditional). One way to incorporate a user's input for manipulation of images is to condition the generation on another image. Such networks can be trained in either a paired (park2019SPADE; Zhu_2020) or unpaired fashion (Zhu_2017; zhu2017multimodal) using cycle-consistency losses. One important class of CGANs uses images as conditioning information. Methods such as pix2pix (pix2pix2017), BicycleGAN (zhu2017multimodal), pix2pixHD (wang2018pix2pixHD), SPADE (park2019SPADE), MaskGAN (fedus2018maskgan), controllable person image synthesis (Men_2020), SEAN (Zhu_2020), and SofGAN (chen2020free) are able to produce high quality images given the condition. For instance, these networks can take a segmentation mask as input and generate images consistent with manipulations of the segmentation mask. Particularly for faces, StarGAN 1&2 (Choi_2018; choi2020starganv2) are able to modify multiple attributes. Other notable works, FaceShop (Portenier_2018), Deep Plastic Surgery (Yang_2020), interactive hair and beard synthesis (Olszewski_2020), and SC-FEGAN (Jo_2019), can modify images using strokes or scribbles on the semantic regions. For hairstyle and appearance editing, we identified two notable relevant works. MichiGAN (Tan_2020) demonstrated high quality hair editing using an inpainting network and mask-conditioned SPADE modules to draw new consistent hair. LOHO (saha2021loho) decomposes the hair into perceptual structure, appearance, and style attributes and uses latent space optimization to infill missing hair structure details in latent space using the StyleGAN2 generator. We compare with both these works quantitatively and qualitatively in Sec. 4.2.

Figure 2. The FS latent space. The first eight blocks of the W+ code are replaced by the output of the eighth style block to form a structure tensor F, and the remaining parts are used as an appearance code S.

3. Method

3.1. Overview

We create composite images by selecting semantic regions (such as hair or facial features) from reference images and seamlessly blending them together. To this end, we employ automatic segmentation of the reference images and make use of a target semantic segmentation mask. To perform our most important example edit, hairstyle transfer, one can copy the hairstyle from one image and use another image for all other semantic categories. More generally, a set of reference images, one per semantic category, is aligned to the target mask and then blended to form a novel image. The output of our approach is a composite image in which each region of a given semantic category has the style of the corresponding reference image. See Fig. 3 for an overview.

Figure 3. An overview of the method; (a) reference images for the face (top) and hair (bottom) features, (b) reconstructed images using our latent space, (c) a target mask, (d) alignment in latent space, (e) a close-up view of the face (top) and hair (bottom) after alignment, (f) close-up views after details are transferred, (g) an entire image with details transferred, (h) the structure tensor is transferred into the blended image, and (i) the appearance code is optimized.

Our approach to image blending is based on StyleGAN (Karras2019style; Karras2019stylegan2; Karras2020ada) and on StyleGAN embedding algorithms that find latent codes for given photographs, e.g. (abdal2019image2stylegan). In particular, we build on the StyleGAN2 architecture (Karras2019stylegan2) and extend the II2S (zhu2020improved) embedding algorithm. The II2S algorithm uses the inputs of the 18 affine style blocks of StyleGAN2 as a single W+ latent code. The W+ latent code allows the input of each block to vary separately, but II2S is biased towards latent codes that have a higher probability according to the StyleGAN2 training set. Our approach to image blending finds a latent code for the blended image, which has the benefit of avoiding many of the traditional artifacts of image blending, particularly at the boundaries of the blended regions. However, there is a potential for latent codes to smooth or elide unusual features of reference images.

In order to increase the capacity of our embedding and capture image details, we embed images using a latent code comprised of a structure tensor F, which replaces the output of style block eight of the StyleGAN2 image synthesis network, and an appearance code S, which is used as input to the remaining style blocks. This proposed extension of traditional GAN embedding, which we call FS space, provides more degrees of freedom to capture individual facial details such as moles. However, it also requires a careful design of latent code manipulations, because it is easier to create artifacts.
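
To make the FS representation concrete, the following is a minimal sketch of how such a code could be assembled, assuming a hypothetical StyleGAN2 wrapper that exposes a `constant_input` and a list of per-resolution `style_blocks`. The split point (style block eight) and tensor sizes follow the description above, but the interface is illustrative only, not an existing API.

```python
import torch

def split_into_fs(w_plus, generator, split_block=8):
    """Assemble an (F, S) code from an 18x512 W+ code (illustrative sketch)."""
    # Run the first `split_block` style blocks driven by the corresponding
    # rows of the W+ code; the resulting spatial activation tensor is F.
    x = generator.constant_input(w_plus.shape[0])
    for block, w in zip(generator.style_blocks[:split_block],
                        w_plus.unbind(dim=1)):
        x = block(x, w)
    structure = x                          # e.g. (batch, 512, 32, 32)

    # The remaining rows of the W+ code are kept as the appearance code S.
    appearance = w_plus[:, split_block:, :]
    return structure, appearance
```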

Our approach includes the following major steps:

  • Reference images are segmented and a target segmentation is generated automatically, or optionally the target segmentation is manually edited.

  • Individual reference images are aligned to the target segmentation and latent codes are found for the aligned images.

  • A combined structure tensor is formed by copying, for each semantic category, the corresponding region of that reference's aligned structure tensor.

  • Blending weights for the appearance codes are found so that the blended appearance code is a mixture of the appearances of the aligned images. The mixture weights are found using a novel masked-appearance loss function. A high-level sketch of the full pipeline is given below.
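
As a road map, the sketch below strings these steps together. Every helper (`segment`, `build_target_mask`, `reconstruct_fs`, `align_to_mask`, `blend_structure`, `blend_appearance`) is a placeholder for the corresponding step in Secs. 3.2 to 3.5, with simplified signatures, rather than an actual API.

```python
def barbershop_composite(references, categories, generator):
    """High-level sketch of the pipeline (all helpers are hypothetical)."""
    # 1. Segment the references and build the target mask (Sec. 3.2).
    masks = [segment(img) for img in references]
    target_mask = build_target_mask(masks, categories)

    # 2. Reconstruct each reference in FS space and align it to the
    #    target mask (Sec. 3.3).
    aligned = []
    for img in references:
        F_rec, S_rec = reconstruct_fs(img, generator)           # Sec. 3.3.1
        F_al, S_al = align_to_mask(F_rec, S_rec, img,
                                   target_mask, generator)      # Sec. 3.3.2
        aligned.append((F_al, S_al))

    # 3. Combine the aligned structure tensors region by region (Sec. 3.4).
    F_blend = blend_structure([f for f, _ in aligned], target_mask, categories)

    # 4. Solve for convex appearance-blending weights (Sec. 3.5).
    S_blend = blend_appearance([s for _, s in aligned], F_blend,
                               target_mask, categories, generator)

    # Synthesize the composite from the blended FS code.
    return generator(F_blend, S_blend)
```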

3.2. Initial Segmentation

The first step is to select reference images, (automatically) segment them, and select regions in the reference images that should be copied to the target image. Let the segmentation of each reference image be produced by a segmentation network such as BiSeNET (Yu2018). The aim is to form a composite image consistent with a target segmentation mask, so that at locations assigned to a given semantic category, the visual properties are transferred from the corresponding reference image. The target mask is created automatically; however, one can also edit the segmentation mask manually to achieve more control over the shapes of each semantic region of the output. In this exposition we assume that masks are created automatically. Each pixel of the target mask is then set to a category whose reference segmentation also shows that category at the same pixel. If multiple categories satisfy this condition, the largest label is used. This would happen, for example, if a pixel is covered by skin (label 1) in the reference image corresponding to the label skin, but also covered by hair (label 13) in the reference image corresponding to hair; the label for hair is then chosen. If no category satisfies the condition, that portion of the target mask is in-painted using a heuristic method. The process of automatically creating a mask is illustrated in Fig. 4.
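
The automatic rule can be written down compactly. The sketch below assumes integer label maps of equal size and that `categories[k]` is the label requested from reference k; pixels that no reference can supply are marked for the heuristic in-painting step. Variable names are ours, not the paper's.

```python
import numpy as np

def build_target_mask(ref_masks, categories):
    """Automatic target-mask construction (sketch).

    ref_masks:  list of HxW integer label maps, one per reference image.
    categories: categories[k] is the semantic label to take from reference k.
    Unresolved pixels are set to -1 so a heuristic in-painting step can fill them.
    """
    target = np.full(ref_masks[0].shape, -1, dtype=np.int64)
    # Visit labels in increasing order so that the largest valid label wins,
    # e.g. hair (13) overrides skin (1) where both references provide a label.
    for mask, label in sorted(zip(ref_masks, categories), key=lambda t: t[1]):
        target[mask == label] = label
    return target
```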

Figure 4. Generating the target mask. In this example, 19 semantic regions are relabeled to form four semantic categories including background. The label used in the target mask is the largest label that also appears at that pixel in its corresponding reference segmentation.

3.3. Embedding

Before blending images, we first align each image to the target mask. This is important because the appearance of many features such as hair, nose, eyes, and ears depends on the pose of the head as a whole, which introduces a dependency between them. Our approach to aligning the reference images has two parts:

  1. Reconstruction: A latent code is found that reconstructs the input image.

  2. Alignment: A nearby latent code is found that minimizes the cross-entropy between the segmentation of the generated image and the target mask.

3.3.1. Reconstruction

Given an image, we aim to find a latent code whose output from the StyleGAN2 image synthesis network reconstructs that image. Our approach to finding a reconstruction code is to initialize it using II2S (zhu2020improved), which finds a latent code in the W+ latent space of StyleGAN2. The challenge of any reconstruction algorithm is to find a meaningful trade-off between reconstruction quality and suitability for editing or image compositing. The W latent space of StyleGAN2 has only 512 components and is not expressive enough to include specific facial details such as moles, wrinkles, or eyelashes. While this latent space is expressive enough to capture generic details, such as wrinkles, it cannot encode specific wrinkles at specific locations determined by a reference image. Using W+ space instead of W space improves the expressiveness of the latent space, but it is still not expressive enough to capture specific facial details. One possible approach is noise embedding, which yields embedded images with almost perfect reconstruction but leads to strong overfitting, which manifests itself as image artifacts in downstream editing and compositing tasks. Our idea is to embed into a new latent space, called FS space, that provides better control than W+ space without the problems of noise embedding. With this added capacity, we need to carefully design our compositing operation so that image artifacts do not manifest themselves. The difference between reconstruction in W+ space vs. FS space is shown in Fig. 5, illustrating that key identifying features of a person (such as a facial mole) or important characteristics of a subject's expression (hairstyle, furrows in the brow) are captured in the new latent space.

Figure 5. Reconstruction results in different spaces; (top row) in W+ space, the structure of the subject's curly hair on the left of the image is lost, and a wisp of hair on her forehead as well as her necklace is removed, but they are preserved in FS space; (middle row) the hair and brow-furrow details are important to the expression of the subject; they are not preserved in W+ space but they are in FS space; (bottom row) the ground-truth image has freckles; without noise optimization these are not captured in W+ space but they are preserved in FS space.

We capture specific facial details by using a spatially correlated signal as part of our latent code. We use the output of one of the style blocks of the generator as a spatially correlated structure tensor F, which replaces the corresponding blocks of the W+ latent code. The choice of a particular style block is a design decision; however, each choice results in a different-sized latent code, and in order to keep the exposition concise our discussion will use style block eight.

The resulting FS latent code has more capacity than W+ latent codes, and we use gradient descent initialized by a W+ code in order to reconstruct each reference image. We form an initial structure tensor by running the first eight style blocks with the W+ code, and the remaining 10 blocks of the W+ code are used to initialize the appearance code. We then set the reconstruction code to the nearest local minimum of

(1)
where
(2)

The added term in the loss function (2) encourages solutions in which the structure tensor remains similar to the activations of a W+ code, so that the result remains close to the valid region of the StyleGAN2 latent space.
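
A minimal sketch of this reconstruction step is given below, assuming an LPIPS image term (via the `lpips` package) plus an L2 penalty that keeps the structure tensor near its W+-derived initialization. The weight `lam`, the optimizer settings, and the `generator(F, S)` call are illustrative assumptions, not the published loss.

```python
import torch
import lpips  # perceptual similarity package (assumed available)

def optimize_fs_reconstruction(image, generator, F_init, S_init,
                               steps=400, lam=1e-3, lr=0.01):
    """Refine an FS code so that generator(F, S) reproduces `image` (sketch)."""
    percep = lpips.LPIPS(net='vgg').to(image.device)
    F = F_init.clone().requires_grad_(True)
    S = S_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([F, S], lr=lr)

    for _ in range(steps):
        opt.zero_grad()
        recon = generator(F, S)                 # synthesize from the FS code
        loss = percep(recon, image).mean()      # match the reference image
        # Keep F close to the activations produced by the initial W+ code.
        loss = loss + lam * (F - F_init).pow(2).mean()
        loss.backward()
        opt.step()
    return F.detach(), S.detach()
```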

3.3.2. Alignment

We now have each reference image encoded as a latent code consisting of a structure tensor F and an appearance code S. While this code captures the appearance of the reference image, its details will not be aligned to the target segmentation. Therefore, we find latent codes that match the target segmentation and which are nearby the reconstruction codes. However, directly optimizing the FS code is challenging because the details of F are spatially correlated. Instead, we first search for a W+ latent code for the aligned image, and then we transfer details from the reconstruction into the aligned code where it is safe to do so.

We build on the idea that we can retrieve an image from the GAN latent space that conforms to a segmentation mask by using the cross-entropy loss between the given segmentation mask and a segmentation mask derived from the GAN output with a pre-trained segmentation network. However, here we deal with a specialized version of this problem: we would like to retrieve a latent representation that conforms to a segmentation mask and that is similar to a given reference image in a given region. Simply initializing with the input representation and then optimizing for a segmentation loss does not work; the image would not be similar enough to the input image. We therefore experimented with combining the style loss with other reconstruction losses to preserve the content of the reference images, and found that using only the style loss produces the best results.

In order to preserve the style between an aligned image and the original image, we use a masked style loss. The masked loss described in LOHO (saha2021loho) uses a static mask in order to compute the gram matrix of feature activations only within a specific region, whereas each step of gradient descent in our method produces a new latent code, which leads to a new generated image and segmentation. Therefore the mask used at each step is dynamic. Following (saha2021loho), we base the loss on the gram matrix of feature activations, formed from the activations of a layer of the VGG network. In addition, we define a mask using the indicator function, so that it marks the region of an image that belongs to a given semantic category. Then the style loss is the magnitude of the difference between the gram matrices of the image generated by a latent code and the target image, evaluated only within that semantic region of each image

(3)

where the summation is over the same four layers of VGG-16 used in LOHO (saha2021loho). The formulation describes the masking of an image by setting all pixels outside the semantic region to 0.

In order to find an aligned latent code, we use gradient descent to minimize a loss function which combines the cross-entropy of the segmented image with the style loss, where XEnt is the multiclass cross-entropy function. We rely on early stopping to keep the latent code near the initial reconstruction code, and the style-loss weight is set to the value recommended by (saha2021loho).
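
The alignment objective can be sketched as below: a segmentation cross-entropy on the generated image plus a masked gram-matrix style loss against the reference. The chosen VGG-16 layer indices, the weight `lam_style`, and the `generator`/`segmenter` interfaces are our assumptions rather than the authors' exact settings.

```python
import torch
import torch.nn.functional as nnF
from torchvision.models import vgg16

VGG_FEATURES = vgg16(pretrained=True).features.eval()
STYLE_LAYERS = {3, 8, 15, 22}  # assumed ReLU layers; the paper follows LOHO's choice

def gram_matrix(feats):
    b, c, h, w = feats.shape
    a = feats.reshape(b, c, h * w)
    return a @ a.transpose(1, 2) / (c * h * w)

def masked_style_loss(img_a, img_b, mask_a, mask_b):
    """Gram-matrix difference after zeroing pixels outside each image's mask."""
    loss, xa, xb = 0.0, img_a * mask_a, img_b * mask_b
    for i, layer in enumerate(VGG_FEATURES):
        xa, xb = layer(xa), layer(xb)
        if i in STYLE_LAYERS:
            loss = loss + (gram_matrix(xa) - gram_matrix(xb)).pow(2).sum()
        if i == max(STYLE_LAYERS):
            break
    return loss

def alignment_loss(w_align, generator, segmenter, target_mask, ref_img,
                   region, lam_style=1e4):
    img = generator(w_align)                        # image from the candidate W+ code
    logits = segmenter(img)                         # per-pixel class scores
    xent = nnF.cross_entropy(logits, target_mask)   # conform to the target mask (long HxW labels)

    # Dynamic mask: the region with label `region` in the *current* generated
    # image; the reference uses its own segmentation for the same label.
    gen_mask = (logits.argmax(1, keepdim=True) == region).float()
    ref_mask = (segmenter(ref_img).argmax(1, keepdim=True) == region).float()
    return xent + lam_style * masked_style_loss(img, ref_img, gen_mask, ref_mask)
```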

In order to transfer the structure and appearance from the reconstruction of each image into its aligned code, we use binary masks to define safe regions in which to copy details,

(4)
(5)

where the masks are defined using the indicator function and are resized using bicubic resampling to match the dimensions of the relevant activation layer. The first mask marks the region where it is safe to copy structure from the reconstruction code, because the semantic classes of the target and reference image are the same there. The second mask marks the region where we must fall back to the structure produced by the aligned W+ code, which has less capacity to reconstruct detailed features. We use the structure tensor

(6)

where the fallback structure is the output of style block eight of the generator applied to the aligned W+ code. We now have an aligned latent representation for each reference image. Next, we composite the final image by blending the structure tensors and appearance codes as described in the next two subsections.
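
This fallback rule can be sketched as a masked copy; `F_rec` and `F_align` are the structure tensors from the reconstruction and from the aligned W+ code respectively, and the segmentations are HxW label maps (names and shapes are our assumptions).

```python
import torch
import torch.nn.functional as nnF

def transfer_structure(F_rec, F_align, ref_seg, target_seg, region):
    """Copy reconstruction details only where both segmentations agree (sketch)."""
    # Pixels labelled `region` in BOTH the reference and the target mask are
    # safe to copy; everywhere else we fall back to the aligned structure.
    safe = ((ref_seg == region) & (target_seg == region)).float()[None, None]
    safe = nnF.interpolate(safe, size=F_rec.shape[-2:], mode='bicubic',
                           align_corners=False).clamp(0.0, 1.0)
    return safe * F_rec + (1.0 - safe) * F_align
```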

3.4. Structure Blending

In order to create a blended image, we combine the aligned structure tensors, using the regions of the target mask as weights to mix them, so

(7)

The coarse structure of each reference image can be composited simply by combining the regions of each structure tensor; however, mixing the appearance codes requires more care.
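
Region-wise mixing of the aligned structure tensors then looks like the sketch below, with the target-mask regions resampled to the tensor resolution; as before, names and shapes are illustrative.

```python
import torch
import torch.nn.functional as nnF

def blend_structure(aligned_F, target_seg, categories):
    """Combine aligned structure tensors region by region (sketch)."""
    blended = torch.zeros_like(aligned_F[0])
    for F_k, label in zip(aligned_F, categories):
        region = (target_seg == label).float()[None, None]
        region = nnF.interpolate(region, size=F_k.shape[-2:], mode='bicubic',
                                 align_corners=False).clamp(0.0, 1.0)
        blended = blended + region * F_k
    return blended
```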

3.5. Appearance Blending

Our approach to image blending is to find a single appearance code which is a mixture of the different reference appearance codes. To find it, we optimize a masked version of the LPIPS distance function as a loss. Following (Zhang2018), the distance between two images is

(8)

where the features are the activations of a layer of a convnet (VGG) applied to a generated image, normalized across the channel dimension; the spatial dimensions give the shape of the feature tensor at that layer; a per-channel weight vector scales the features; and the operator ⊙ indicates elementwise multiplication.

A masked version of the loss uses the masks to blend the contributions.

(9)

where each mask has been resampled to match the dimensions of the corresponding layer.

Given the different reference appearance codes, we aim to find a set of blending weights. The weights are constrained to be non-negative and to sum to one at each element. The blended code satisfies

(10)

so that each element of the blended appearance code is a convex combination of the aligned reference codes.

We find the blending weights using projected gradient descent (landweber1951iteration). We initialize the weights so that the blended image would be a copy of one of the reference images, and solve for values that minimize the masked loss (9) subject to the constraints that the weights are non-negative and sum to one.
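
A sketch of this search is given below: a masked LPIPS loss compares the blend to each aligned reference inside that reference's region of the target mask, while a clamp-and-renormalize step after each gradient update stands in for the exact projection onto the simplex. The `lpips` package, learning rate, and per-element weight granularity are assumptions.

```python
import torch
import lpips  # perceptual similarity package (assumed available)

def blend_appearance(S_aligned, F_blend, aligned_images, region_masks,
                     generator, steps=600, lr=0.01):
    """Solve for convex appearance-blending weights (sketch)."""
    device = S_aligned[0].device
    percep = lpips.LPIPS(net='vgg').to(device)
    K = len(S_aligned)

    # One weight tensor per reference; start as a copy of reference 0.
    u = torch.zeros(K, *S_aligned[0].shape, device=device)
    u[0] = 1.0
    u.requires_grad_(True)
    opt = torch.optim.Adam([u], lr=lr)

    for _ in range(steps):
        opt.zero_grad()
        S_blend = sum(u[k] * S_aligned[k] for k in range(K))
        img = generator(F_blend, S_blend)      # composite with blended appearance
        # Masked LPIPS: compare to each aligned reference only inside its region.
        loss = sum(percep(img * m, ref * m).mean()
                   for ref, m in zip(aligned_images, region_masks))
        loss.backward()
        opt.step()
        with torch.no_grad():
            u.clamp_(min=0)                                   # non-negative
            u /= u.sum(dim=0, keepdim=True).clamp(min=1e-8)   # sum to one
    return sum(u[k].detach() * S_aligned[k] for k in range(K))
```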

3.6. Mixing Shape, Structure, And Appearance

We have presented an approach to create composite images using a set of reference images in which we transfer the shape of a region, the structure tensor information, and also the appearance information. The LOHO (saha2021loho) approach demonstrated that different reference images can be used for each attribute (shape, structure, and appearance), and our approach is capable of doing the same. We simply use an additional set of images for the appearance information, and we set the appearance code using the last 10 blocks of the code that reconstructs the appearance reference instead of using the latent code that reconstructs the structure reference. The additional images do not need to be aligned to the target mask. We show examples of mixing shape, structure, and appearance in Fig. 1(g,h). The larger structures of the hair (locks of hair, curls) are transferred from the structure reference, and the hair color and micro-textures are transferred from the appearance image.
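
In code, using a separate appearance reference only changes which reconstruction supplies the appearance code; a brief sketch, reusing the hypothetical helpers from the pipeline sketch above:

```python
# Structure comes from the blended, aligned structure tensors; the appearance
# code S is taken from the reconstruction of a *different* reference image,
# which does not need to be aligned to the target mask.
_, S_appearance = reconstruct_fs(appearance_reference, generator)
composite = generator(F_blend, S_appearance)
```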

Figure 6. Hair style gallery showing different hairstyles applied to a person by varying the hair structure and appearance. Reference images for the hair appearance are shown at the top of each column; reference images for the hair structure and the target segmentation masks are shown to the left of each row. Note that in the last two rows, the hair shape is different from the hair shape of the structure reference images.
Figure 7. Face swapping results achieved by our method. Each example shows three smaller insets on the left: a reference image (top left) from which the components of the face are transferred, an identity image (middle left), and the target segmentation mask (bottom left). We also vary the appearance of the facial components by changing the appearance reference image (bottom right); first row: examples of eye and eyebrow transfer by varying the appearance reference images; second row: examples of eye, eyebrow, nose, mouth, and teeth transfer by varying the appearance reference images while keeping the complexion the same as the identity image; third row: examples of eye, eyebrow, nose, mouth, teeth, and complexion transfer by varying the appearance reference images.

4. Results

In this section, we show a quantitative and qualitative evaluation of our method. We implemented our algorithm using PyTorch and a single NVIDIA TITAN Xp graphics card. The process of finding an II2S embedding takes 2 minutes per image on average, and the optimization in (1) takes 1 minute per image. The resulting codes are saved and reused when creating composite images. For each composite image, we solve the alignment objective of Sec. 3.3.2 and then (9), generating a composite image in an average time of two minutes.

4.1. Dataset

We use a set of 120 high-resolution images from (zhu2020improved). From these images, 198 pairs were selected for the hairstyle transfer experiments based on the variety of appearances and hair shapes. Images are segmented and the target segmentation masks are generated automatically.

4.2. Competing methods

We evaluate our method by comparing the following three algorithms: MichiGAN (Tan_2020), LOHO (saha2021loho), and our proposed method.

The authors of LOHO and MichiGAN provide public implementations, which we used in our comparison. However, MichiGAN uses a proprietary inpainting module that the authors could not share; they supported our comparison by providing inpainting results for selected images on request. LOHO also uses a pretrained inpainting network. Based on our analysis, both methods can be improved by using a different inpainting network than the ones proposed in the original papers. We therefore replaced both inpainting networks with the current state-of-the-art CoModGAN (zhao2021comodgan), trained on the same dataset as LOHO. All hyperparameters and configuration options were kept at their default values.

Our approach reconstructs images using a fixed number of gradient descent iterations for each step. To solve for the reconstruction code in equation (1) we used 400 iterations, to solve for the aligned code using the loss of Sec. 3.3.2 we stopped after 100 iterations, and to solve for the blending weights using (9) we stopped after 600 iterations. Source code for our method will be made public after an eventual publication of the paper.

4.3. Comparison

4.3.1. User Study

We conducted a user study using Amazon's Mechanical Turk to evaluate the hairstyle transfer task. For this task we use the 19-category segmentation from CelebAMask-HQ. A hairstyle image was used as the reference for the hair category, and an identity image was used for all other semantic categories. We generated composite images using our complete approach and compared the results to LOHO (saha2021loho) and to MichiGAN (Tan_2020). Users were presented with each pair of images in a random order (ours on the left and the other method on the right, or vice versa). The reference images were also shown at 10% of the size of the synthesized images. The user interface allowed participants to zoom in and inspect details of the images, and our instructions encouraged them to do so. Each user was asked to indicate which image combined the face of one image and the hair of another with the highest quality and fewest artifacts. On average, users spent 90 seconds comparing images before making a selection. We asked 396 participants to compare ours to LOHO: our approach was selected 378 times (95%) and LOHO 18 times (5%). We asked another 396 participants to compare against MichiGAN, and the results were 381 (96%) for ours vs. 14 (4%) for MichiGAN. The results in both cases are statistically significant.

Figure 8. Comparison of our framework with two state of the art methods: LOHO and MichiGAN. Our results show improved transitions between hair and other regions, fewer disocclusion artifacts, and more consistent handling of global aspects such as lighting.

4.3.2. Reconstruction Quality

In this work, we measure the reconstruction quality of an embedding using various established metrics: RMSE, PSNR, SSIM, VGG perceptual similarity (simonyan2014very), LPIPS perceptual similarity, and the FID (heusel2017gans) score between the input and embedded images. The results are shown in Table 1.

RMSE PSNR SSIM VGG LPIPS FID
Baseline 0.07 23.53 0.83 0.76 0.20 43.99
LOHO 0.10 22.28 0.83 0.71 0.18 56.31
MichiGAN 0.06 26.51 0.88 0.48 0.12 26.82
Ours 0.03 29.91 0.90 0.38 0.06 21.21
Table 1. A comparison of our method to different algorithms using established metrics. Our method achieves the best scores in all metrics.

4.4. Ablation Study

We present a qualitative ablation study of the proposed approach for hairstyle transfer. Fig. 9 provides a visual comparison of the results of hairstyle transfer. A baseline version of our approach does not include the FS latent space and does not do image alignment; it is shown in Fig. 9 (left column). It does solve for interpolated blending weights to minimize the masked loss function from equation (9); however, a mixture of unaligned latent codes does not always result in a plausible image. This is apparent when comparing the face reference image to the synthesized images, which do not faithfully capture the identity of the original subject and in some cases fail to even reconstruct facial features such as eyes when they are partially occluded. The second column of Fig. 9 includes alignment, but it does not use FS space. Without the additional capacity, the reconstructed images are biased towards a generic face image, with more symmetry and less expression, character, and identifying details than the reference images. The subject in row one has an asymmetric expression which is captured by the structure tensor in FS space but is nearly lost using only the W+ latent code. Details including the complexion of subject two, and his waxed mustache, are lost without the FS embedding. Overall, the qualitative examples show that each successive modification to the proposed approach resulted in higher-quality composite images.

Figure 9. A qualitative ablation study. We compare a baseline version that blends latent codes without image alignment (left column), a version that uses alignment but uses W+ rather than FS latent codes (center column), and our complete approach (right column). The reference images for the face, hairstyle, and the target mask are shown top-to-bottom on the left of each row. Each modification improves the fidelity of the composite image.

4.5. Qualitative Results

In this subsection, we discuss various qualitative results that can be achieved using our method.

In Fig. 6 we demonstrate that our framework can generate a large variety of edits. Starting from an initial photograph, a user can manually manipulate a semantic segmentation mask to change semantic regions, copy segmented regions from reference images, copy structure information for semantic regions from reference images, and copy appearance information from reference images. In the figure, we show many results where the shape of the hair, the structure of the hair, and the appearance of the hair are copied from three different reference images. Together with the source image, this means that information from up to four images contributes to one final blended result.

In Fig. 7 we demonstrate that our framework can handle edits to semantic regions other than hair. We show how individual facial features such as eyes and eyebrows can be transferred from other reference images, how all facial regions can be copied, and how all facial regions as well as the appearance can be transferred from other source images. We attain high quality results for such edits, and we note that they are generally easier to perform than hair transfer.

In Fig. 8 we show selected examples to illustrate why our method is strongly preferred over the state of the art by users in the user study. While previous methods give good results on this very challenging problem, we can still achieve significant improvements in multiple respects. First, one can carefully inspect the transition regions between the hair and either the background or the face to see that previous work often creates hard transitions, too similar to copying and pasting regions directly. Our method makes better use of the knowledge encoded in the GAN latent space to find semantic transitions between images. Second, other methods can easily create artifacts due to misalignment in reference images; this manifests itself, for example, in features such as hair structure being cut off unnaturally at the hair boundary. Third, our method achieves a better overall integration of global aspects such as lighting. A mismatch in lighting also contributes to lower quality transitions between hair regions and other regions in other methods. Conversely, other methods have some advantages over ours: previous work is better at preserving some background pixels by design, though this inherently lowers the quality of the transition regions. We focus only on hair editing for the comparison because it is by far the most challenging task, due to the possible disocclusion of background and face regions, the more challenging semantic blending of boundaries, and the need for consistency with global aspects such as lighting. Overall, we believe that we propose a significant improvement to the state of the art, as supported by our user study. We also submit all images used in the user study as supplementary materials to enable reviewers to inspect the quality of our results.

4.6. Limitations

Our method has multiple limitations. First, even though we increased the capacity of the latent space, it is difficult to reconstruct underrepresented features such as the jewelry indicated in Fig. 10(2,4). Second, occlusions can produce confusing results; for example, thin wisps of hair which partially reveal the underlying face are difficult to capture, as in Fig. 10(3,5). Third, details such as the hair structure in Fig. 10(7) are difficult to preserve when aligning embeddings, and when the reference and target segmentation masks do not overlap perfectly the method may fall back to a smoother structure. Finally, while our method is tolerant of some errors in the segmentation mask input, large geometric distortions cannot be compensated for; in Fig. 10(2,7) we show two such examples.

These limitations could be addressed in future work by filtering out unmatched segmentation regions as was done by LOHO (saha2021loho), or by geometrically aligning the segmentation masks before attempting to transfer the hair shape, using regularization to keep the segmentation masks plausible and avoid issues such as those in Fig. 10(1,7). The details of the structure tensor could be warped to match the target segmentation to avoid issues such as Fig. 10(6). Issues of thin or transparent occlusions are more challenging and may require more capacity or less regularization when finding embeddings.

Figure 10. Failure modes of our approach; (1) misaligned segmentation masks lead to implausible images; (2, 4) the GAN fails to reconstruct the face, replacing lips with teeth or removing jewelry; (3, 5) overlapping translucent or thin wisps of hair and face pose a challenge; (6) a region of the target mask that is not covered by hair in the hair reference image is synthesized with a different structure; (7) combining images taken from different perspectives can produce anatomically unlikely results; the original shape of the head is indicated in yellow.

5. Conclusions

We introduced Barbershop, a novel framework for GAN-based image editing. A user of our framework can interact with images by manipulating segmentation masks and copying content from different reference images. We presented several important novel components. First, we proposed a new latent space that combines the commonly used style code with a structure tensor. The use of the structure tensor makes the latent code more spatially aware and enables us to preserve more facial details during editing. Second, we proposed a new GAN-embedding algorithm for aligned embedding. Similar to previous work, the algorithm can embed an image to be similar to an input image; in addition, the image can be slightly modified to conform to a new segmentation mask. Third, we proposed a novel image compositing algorithm that can blend multiple images encoded in our new latent space to yield a high quality result. Our results show significant improvements over the current state of the art. In a user study, our results are preferred over 95 percent of the time.

References