FaceShop: Deep Sketch-based Face Image Editing

04/24/2018 ∙ by Tiziano Portenier, et al. ∙ Universität Bern and University of Maryland

We present a novel system for sketch-based face image editing, enabling users to edit images in an intuitive manner by sketching a few strokes on a region of interest. Our interface features tools to express a desired image manipulation by providing both geometry and color constraints as user-drawn strokes. As an alternative to direct user input, our proposed system naturally supports a copy-paste mode, which allows users to edit a given image region by using parts of another exemplar image, without the need for hand-drawn sketching at all. The proposed interface runs in real time and facilitates an interactive and iterative workflow to quickly express the intended edits. Our system is based on a novel sketch domain and a convolutional neural network trained end-to-end to automatically learn to render image regions corresponding to the input strokes. To achieve high quality and semantically consistent results, we train our neural network on two simultaneous tasks, namely image completion and image translation. To the best of our knowledge, we are the first to combine these two tasks in a unified framework for interactive image editing. Our results show that the proposed sketch domain, network architecture, and training procedure generalize well to real user input and enable high quality synthesis results without additional post-processing. Moreover, the proposed fully-convolutional model does not rely on any multi-scale or cascaded network architecture to synthesize high-resolution images and can be trained end-to-end to produce images of arbitrary size.


1. Introduction

Interactive image editing is an important field in both computer graphics and computer vision. Social media platforms register rapidly growing volumes of imagery content, which creates an increasing demand for image editing applications that enable untrained users to enhance and manipulate photos. For example, Instagram reports more than 40 billion uploaded photos, and this number is growing by 95 million per day [Instagram, 2018]. However, there is a lack of tools that offer more complex editing operations to inexperienced users, such as changing the facial expression in an image. In this work, we propose a sketch-based editing framework that enables a user to edit a given image by sketching a few strokes on top of it. We show three examples produced by our system in Figure 1 and more results in Section 4.

Image editing techniques can be divided into global and local editing operations. Recently, numerous deep learning techniques have been proposed for global editing applications such as general image enhancement [Yan et al., 2016; Gharbi et al., 2017], grayscale-to-color transformation [Iizuka et al., 2016; Zhang et al., 2017], and geometric, illumination, and style transformations [Selim et al., 2016; Gatys et al., 2015; Kemelmacher-Shlizerman, 2016; Liao et al., 2017]. Many of these techniques for global image editing build on image-to-image translation networks that learn to translate images from a source domain into a target domain [Isola et al., 2017; Sangkloy et al., 2017; Zhu et al., 2017; Chen and Koltun, 2017; Wang et al., 2017]. The resulting synthesis quality of such translation networks has reached astonishing levels, but they lack generalizability to imperfect input and do not naturally support local editing operations. While our work is technically related to these approaches, we focus on local image editing operations and propose a source domain that generalizes well to hand-drawn user input.

Local image editing techniques manipulate only a limited spatial region of an input image, for example in the case of adding or removing objects in a scene, or changing their pose or shape. Techniques that are particularly successful in this direction solve an image completion problem to synthesize a missing region in an image, given the context as input. Recently, successful deep learning methods have been proposed to solve the image completion problem [Pathak et al., 2016; Li et al., 2017; Dolhansky and Ferrer, 2017; Yeh et al., 2017; Iizuka et al., 2017]. A major issue with these approaches is that the synthesized contents are completely determined by the image context, and the user has no means to control the result. In this work, we also formulate local image editing operations as an image completion problem. However, we combine this approach with the aforementioned image translation approaches to enable fine-grained user control of the synthesis process.

An intuitive interface is crucial to enable users to express their intended edits efficiently, and similarly, interactivity is important to support an iterative workflow. Sketching has proven to be an attractive way to quickly produce a visual representation of a desired geometric concept, for example in image retrieval applications. Based on this observation, we propose a sketch-based interface for local image editing that allows users to constrain the shape and, optionally, the color of the synthesized result.

At the core of our framework is a novel formulation of conditional image completion: given a spatial image region to be manipulated, the task is to replace this region using the sketch-based user input and the spatial context surrounding the region. To solve this task we propose a novel sketch domain and training procedure for convolutional neural networks (CNNs). More precisely, we train CNNs on two tasks jointly: image completion and image translation. In addition, our sketch domain naturally supports a smart copy-paste mode that allows a user to copy content from a source image and blend it onto a target image. We show that this mode is very robust to illumination, texture, and even pose inconsistencies in the input. Our proposed networks can be trained end-to-end on arbitrary image datasets without any supervision or other prior knowledge about the data distribution. Moreover, our architecture produces high-quality, high-resolution images without the need for ad hoc multi-scale approaches or complicated hyper-parameter schedules.

We evaluate our approach on the task of face image editing and compare our system to existing image completion techniques. We show that our unified approach is beneficial because it enables fine-grained user control, and training is fully unsupervised and robust to illumination, texture, and pose inconsistencies. In summary, we make the following contributions:

  • the first end-to-end trained system that enables contextually coherent, high-quality, and high-resolution local image editing by combining image completion and image translation,

  • a network architecture and training procedure that results in a very stable training behavior,

  • a technique to generate training sketches from images that enjoys a high generalizability to real user input and enables sketch-based editing with color constraints and smart copy-paste in a unified framework,

  • globally consistent, seamless local edits without any further post-processing, and

  • an intuitive sketch-based, interactive interface that naturally enables an efficient and iterative workflow.

2. Related Work

Interactive image editing approaches have a long history in computer graphics, and providing a comprehensive survey exceeds the scope of this work. In this section, we discuss existing techniques that are highly related to our work. We group related work into two main categories: image completion and image translation. Image completion (also known as image inpainting) is the task of completing missing regions in a target image given the context. We tackle the problem of sketch-based local image editing by solving a conditional variant of the image completion problem. On the other hand, image translation considers the problem of transforming entire images from a source to a target domain, such as transforming a photo into an artistic painting, or transforming a semantic label map into a photo-realistic image. In our work, we consider transforming a sketch-based user input into a photographic image patch that replaces the region to be edited in the target image.

2.1. Image Completion

In their groundbreaking work, Bertalmio et al. [2000] were inspired by techniques used by professional image restorers, and their procedure attempts to inpaint missing regions by continuing isophote lines. Shen and Chan [2002] built on this work and connected it to variational models. Patch-based procedures are a popular alternative: given a target image with a missing region to be filled in, the idea is to use image patches occurring in the context surrounding the missing region, or in a dedicated source image, to replace the missing region. Pioneering work by Efros and Leung [2001] leverages this strategy for texture synthesis, and various researchers later extended it to image completion [Drori et al., 2003; Bertalmio et al., 2003; Criminisi et al., 2004]. PatchMatch [Barnes et al., 2009] accelerates the patch search algorithm to enable image completion at interactive frame rates, and Image Melding [Darabi et al., 2012] further improves the quality of the patch search algorithm. A major issue with patch-based approaches is that they work by copying and pasting pixel values from existing examples. Therefore, they rely on suitable example patches in the context and they cannot invent new objects or textures.

Figure 2. System overview. Left: a web-based interface allows users to specify a masking region, an edge sketch, and color strokes. Right: the core of our system is a conditional image completion network that takes the user input (mask, sketch, and color strokes), the original image, and a noise layer, and reconstructs a full image. The reconstructed image is blended with the original by filling the masked region in the original with the reconstructed output, and returned.

Recently, methods based on convolutional neural networks (CNNs) have been proposed to overcome this issue. Training neural networks for image completion enables systems that can “invent” new content for missing regions, based on the data that they have seen during training. Using a pixel-wise reconstruction loss for training such networks leads to contextually consistent, but blurry results, because the problem is highly ambiguous. Context Encoders [Pathak et al., 2016] mitigate this problem by training Generative Adversarial Networks (GANs) [Goodfellow et al., 2014]. GANs consist of an auxiliary discriminator network, which is trained to distinguish synthesized from ground truth data, and a synthesis network, which is trained to fool the discriminator by generating high-quality synthetic data. This approach proved to be very successful for image completion tasks, and several similar systems have been proposed along this line. In particular, Dolhansky et al. [2017] apply the same concept on the task of eye inpainting to replace closed eyes in face images, given an exemplar image of the same identity with open eyes. Yeh et al. [2017] proposed an iterative optimization technique to complete images using adversarial training. Further improvements of synthesis quality have been achieved by using multiple discriminator networks for different scales [Iizuka et al., 2017; Li et al., 2017].

Most image completion approaches discussed above have in common that they condition the synthesis process only on the context of the image to be completed. The user has no further control over the contents that are synthesized, and the completion result is largely determined by the training data. The only exception is the eye inpainting technique by Dolhansky et al. [2017]. However, their system can only complete missing eye regions. In our sketch-based image editing system we also leverage a GAN loss to train CNNs for image completion. In contrast to the techniques discussed above, however, we formulate a conditional image completion problem, and the user can guide the geometry and color of the synthesized result using sketching. Finally, it must be mentioned that Poisson image editing [Pérez et al., 2003] can also be considered a technique for conditional image completion. Thus, we compare it to our sketch-domain approach and show that our approach is more robust to illumination, texture, and shape inconsistencies.

2.2. Image-to-Image Translation

Several techniques have been proposed in the past to translate the semantic and geometric content of a given image into a target domain with a different style [Gatys et al., 2015; Selim et al., 2016; Kemelmacher-Shlizerman, 2016]. For example, Gatys et al. [2015] use a deep network to turn a source photo into a painting in the style of a famous artist, given an exemplar painting. The synthesis is an optimization procedure constrained to match the style of the given exemplar painting by matching deep feature correlations, while content is preserved by matching deep features of the source image. Recently, image-to-image translation networks have been proposed that translate an image from a source domain into a target domain using a single feedforward pass [Isola et al., 2017; Sangkloy et al., 2017; Zhu et al., 2017]. For example, the network by Isola et al. translates images from a source domain, such as edge maps or semantic label maps, into photos. The idea is to train a conditional variant of GANs [Mirza and Osindero, 2014] that encourages the synthesized result to correspond to auxiliary input information. Chen and Koltun [2017] propose a cascaded multi-scale network architecture to improve the visual quality and increase the spatial resolution of the translated images. In our work, we also train conditional GANs in an image translation fashion to achieve sketch-based image editing, but we formulate it as a completion problem.

3. Deep Sketch-based Face Image Editing

In this section we introduce our sketch-based image editing system, see Figure 2 for a visualization. The system features an intuitive and interactive user interface to provide sketch and color information for image editing. The user inputs are then fed to a deep CNN that is trained on the task of conditional image completion using a GAN loss. Image completion proceeds at interactive rates, which allows a user to change and adapt the input in an efficient and iterative manner. The core component of our system is a CNN that incorporates both image context and sketch-based user input to synthesize high-quality images, unifying the concept of image completion and image translation.

We discuss our training data and propose a suitable sketch domain in Section 3.1. Using appropriate training data is crucial for our system to produce convincing results with real user input. In Section 3.2 we describe a network architecture that enables the synthesis of high-quality and high-resolution images. Then, in Section 3.3 we discuss a training procedure that allows us to train our model end-to-end with fixed hyper-parameters in a very stable manner. Finally, in Section 3.4 we discuss in detail our sketch-based user interface.

3.1. Training Data

In order to train our model on the task of conditional image completion, we generate training data by removing rectangular masks of random position and size from the training images (Section 3.1.1). To obtain a system that is sensitive to sketch-based user input, we provide additional information to the network for the missing region. The design of an appropriate sketch domain (Section 3.1.2) for this additional information is crucial to achieve convincing results with hand-drawn user input. In addition, our system also includes color data (Section 3.1.3).

3.1.1. Rectangular Masking

We first experimented with axis-aligned rectangular editing regions during training, similar to previous image completion methods [Pathak et al., 2016; Iizuka et al., 2017]. However, we observed that the model overfits to axis-aligned masks and the editing results exhibit distracting visible seams if the region to be edited is not axis-aligned. Hence, we rotate the rectangular editing regions during training by a random angle to teach the network to handle all possible orientations. In Section 4.1 we show that this extension effectively causes the model to produce seamless editing results for arbitrarily shaped editing regions.
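To make the masking procedure concrete, the following minimal Python sketch generates a randomly rotated rectangular mask. The image resolution, size range, and the use of OpenCV are illustrative assumptions rather than the exact implementation used for training.

import numpy as np
import cv2

def random_rotated_mask(height=512, width=512, min_frac=0.2, max_frac=0.5, rng=None):
    """Binary mask (1 = region to edit) for a rectangle of random size,
    position, and rotation angle. All size ranges are illustrative."""
    rng = rng or np.random.default_rng()
    h = int(rng.uniform(min_frac, max_frac) * height)
    w = int(rng.uniform(min_frac, max_frac) * width)
    cx = rng.uniform(w / 2, width - w / 2)
    cy = rng.uniform(h / 2, height - h / 2)
    angle = rng.uniform(0.0, 360.0)  # arbitrary orientation, not only axis-aligned
    box = cv2.boxPoints(((cx, cy), (w, h), angle)).astype(np.int32)
    mask = np.zeros((height, width), dtype=np.uint8)
    cv2.fillConvexPoly(mask, box, 1)
    return mask

# masked training input: remove the region from the ground-truth image, e.g.
# masked = image * (1 - mask[..., None])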

3.1.2. Sketch Domain

On the one hand, it is beneficial to transform training imagery automatically into the sketch domain to quickly produce a large amount of training data. On the other hand, the trained system should generalize to real user input that may deviate significantly from automatically generated data. Recent work in image translation has shown that automatically generated edge maps produce high-quality results when translating from edges to photos [Isola et al., 2017; Sangkloy et al., 2017]. However, these models tend to overfit to the edge map domain seen during training, and it has been shown that the output quality of these models decreases significantly when using hand-drawn sketches instead of edge maps [Isola et al., 2017]. To mitigate this problem we propose an automatic edge map processing procedure that introduces additional ambiguity between the input sketch domain and the ground truth translated result, and we show in Section 4.1 that this approach increases the network’s ability to generalize to hand-drawn input after training.

Figure 3. Sketch domain. We extract edge maps using HED [Xie and Tu, 2015], and fit splines using AutoTrace [Weber, 2018]. We then remove small edges, smooth the control points, and rasterize the curves.

Figure 3 shows an example of our proposed sketch domain transform. We first extract edge maps using the HED edge detector [Xie and Tu, 2015], followed by spline fitting using AutoTrace [Weber, 2018]. After removing small edges with a bounding box area below a certain threshold, we smooth the control points of the splines. This step is crucial, since the smoothed curves may now deviate from the actual image edges. The introduced ambiguity encourages the model not to translate input strokes directly into image edges, but to interpret them as rough guides for the actual image edges while staying on the manifold of realistic images. After these preprocessing steps we rasterize the curves, which completes the transformation into the proposed sketch domain.
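The transform can be approximated with the Python sketch below. It assumes an HED edge map has already been computed (HED itself is not included), and it substitutes OpenCV contour tracing plus Gaussian smoothing of the control points for the AutoTrace spline fitting; all thresholds are illustrative assumptions, not the exact values used for our training data.

import numpy as np
import cv2
from scipy.ndimage import gaussian_filter1d

def sketchify(edge_map, min_bbox_area=100, smooth_sigma=3.0, thickness=1):
    """Approximate sketch-domain transform.
    edge_map: HxW uint8 edge probability map (e.g. from HED).
    Trace edge contours (a stand-in for spline fitting), drop small curves,
    smooth the control points, and rasterize the result."""
    binary = (edge_map > 128).astype(np.uint8)
    contours, _ = cv2.findContours(binary, cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE)  # OpenCV 4
    sketch = np.zeros_like(binary)
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        if w * h < min_bbox_area:
            continue  # remove small edges
        pts = c.reshape(-1, 2).astype(np.float32)
        # smooth the control points so strokes deviate slightly from true edges
        pts[:, 0] = gaussian_filter1d(pts[:, 0], smooth_sigma, mode="wrap")
        pts[:, 1] = gaussian_filter1d(pts[:, 1], smooth_sigma, mode="wrap")
        cv2.polylines(sketch, [pts.astype(np.int32)], isClosed=False,
                      color=1, thickness=thickness)
    return sketch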

3.1.3. Color Domain

In order to enable a user to constrain the color of the edited result, we provide additional color information for the missing region to the networks during training. An intuitive way for a user to provide color constraints is to draw a few strokes of constant color and arbitrary thickness on top of the image. For this purpose we propose a technique to automatically transform a training image into a color map that is suitable for generating a random color-stroke representation for training; see Figure 4 for an example. Our approach to producing random color strokes has some similarities to the method proposed by Sangkloy et al. [2017]. However, our novel color map transformation introduces further ambiguities to make the system more robust to real user input. Moreover, in the case of face images we also leverage semantic information to further improve the result.

We first downsample the input image and apply a median filter with kernel size 3. Next, we apply bilateral filtering repeatedly (40 iterations). This results in a color map with largely constant colors in low-frequency regions, which mimics the intended user input. In the case of face images we increase the ambiguity in the color map using semantic information. We first compute label maps for semantic regions using the face parsing network by Li et al. [2017]. Next, we compute median colors for the hair, lips, teeth, and eyebrow regions and replace the corresponding regions in the color map with these median colors. This later allows a user to quickly change, e.g., the lip color with a single, constantly colored stroke.

Figure 4. Color domain: we pre-process the colors of the input using downsampling and filtering, and generate random color strokes to resemble user input.
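As a rough illustration of this color-map construction, the Python sketch below downsamples, median-filters, and repeatedly bilateral-filters an image. The downsampling size and the bilateral-filter sigmas are illustrative assumptions, and the semantic median replacement for hair, lips, teeth, and eyebrows is omitted.

import cv2

def color_map(image_bgr, down_size=128, bilateral_iters=40,
              sigma_color=25, sigma_space=5):
    """Flatten a uint8 BGR image into largely constant color regions, mimicking
    rough user color strokes. Sizes and sigmas are assumptions."""
    small = cv2.resize(image_bgr, (down_size, down_size),
                       interpolation=cv2.INTER_AREA)
    small = cv2.medianBlur(small, 3)            # kernel size 3, as described above
    for _ in range(bilateral_iters):            # repeated edge-aware smoothing
        small = cv2.bilateralFilter(small, d=5,
                                    sigmaColor=sigma_color,
                                    sigmaSpace=sigma_space)
    # (the semantic-region median replacement step is omitted in this sketch)
    return small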

During training, we generate a random number of color strokes by randomly sampling thickness and start and end points of the strokes and interpolate between these points with additional random jittering orthogonal to the interpolation line. We color the random strokes using the color map value of the start position. If the color map value at the current stroke point position deviates more than a certain threshold from the initial value, we terminate the current stroke early. This technique results in random color strokes of constant color that may deviate from but correlate with the actual color in the input image.
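A possible implementation of this stroke sampling is sketched below; the stroke count, thickness range, jitter magnitude, and color deviation threshold are illustrative assumptions.

import numpy as np

def random_color_strokes(cmap, n_strokes=(1, 6), max_thickness=8,
                         color_tol=30.0, rng=None):
    """Generate random constant-color strokes from a smoothed color map `cmap`
    (HxWx3). Parameters are illustrative, not the exact training values."""
    rng = rng or np.random.default_rng()
    h, w, _ = cmap.shape
    strokes = np.zeros_like(cmap)
    for _ in range(rng.integers(*n_strokes)):
        p0 = rng.uniform([0, 0], [w - 1, h - 1])            # start point (x, y)
        p1 = rng.uniform([0, 0], [w - 1, h - 1])            # end point (x, y)
        thickness = int(rng.integers(1, max_thickness + 1))
        color = cmap[int(p0[1]), int(p0[0])].astype(np.float32)  # color at start
        n = int(np.hypot(*(p1 - p0))) + 1
        direction = (p1 - p0) / max(n, 1)
        normal = np.array([-direction[1], direction[0]])
        p = p0.copy()
        for _ in range(n):
            q = p + normal * rng.normal(0.0, 1.0)            # jitter orthogonal to line
            x = int(np.clip(q[0], 0, w - 1))
            y = int(np.clip(q[1], 0, h - 1))
            # terminate early if the underlying color drifts too far from the start
            if np.linalg.norm(cmap[y, x].astype(np.float32) - color) > color_tol:
                break
            strokes[max(0, y - thickness // 2): y + thickness // 2 + 1,
                    max(0, x - thickness // 2): x + thickness // 2 + 1] = color
            p += direction
    return strokes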

In our experiments we observed that the model sometimes has difficulties producing correct iris colors. For example, the network tends to produce two differently colored eyes under some circumstances. This observation is consistent with the findings by Dolhansky and Ferrer [2017], and we propose to mitigate this issue using iris detection. We first detect pupils using the method by Timm and Barth [2011]. After normalizing the image size given the eye bounding box, we compute median iris colors in a fixed-size circle centered at the pupil position. This approach yields accurate iris colors in most cases, and we replace the color map values at the iris positions with this color. During training, we draw a fixed-size circle of 10 pixels radius at the pupil positions, as opposed to the stroke-based approach used for all other color constraints. Our system should produce meaningful results even without additional color constraints, and only incorporate color information to guide the editing if it is available. Therefore, we provide color information during training only with a certain probability for each image, and the model has to decide on an appropriate color for the remaining samples.

In addition to the sketch and color constraints we also feed per-pixel noise as additional input to the network, which enables the model to produce highly detailed textures by resolving further ambiguities, given an appropriate network architecture. Figure 5 summarizes the training data that we use as input to the conditional completion network.

Figure 5. The input to the conditional completion network includes the original image (with the masked region removed), edge sketch, color strokes, mask, and a noise layer.
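For concreteness, a minimal sketch of how the network input could be assembled from these components is given below. The channel ordering and the HxWxC layout are assumptions made for illustration.

import numpy as np

def build_network_input(image, sketch, color_strokes, mask, rng=None):
    """Stack the conditional completion network input as channels:
    masked RGB image (3) + sketch (1) + color strokes (3) + mask (1) + noise (1)."""
    rng = rng or np.random.default_rng()
    image = image.astype(np.float32) / 255.0
    color_strokes = color_strokes.astype(np.float32) / 255.0
    mask = mask.astype(np.float32)[..., None]          # 1 inside the edit region
    sketch = sketch.astype(np.float32)[..., None]
    masked_image = image * (1.0 - mask)                # remove the region to be edited
    noise = rng.standard_normal(mask.shape).astype(np.float32)  # per-pixel noise channel
    return np.concatenate([masked_image, sketch, color_strokes, mask, noise],
                          axis=-1)                     # HxWx9 tensor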

3.1.4. High-resolution Dataset

To create our high-resolution face image dataset, we start with the in-the-wild images from the celebA dataset [Liu et al., 2015]. We first remove all images whose annotated face bounding box is smaller than a minimum size. We roughly align the remaining 21k images by rotating, scaling, and translating them based on the eye positions. Images that are smaller than the target resolution are padded with the image average color. Finally, we center crop all images to the target resolution. We use 20k images for training and the remaining 1k images for testing.
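A minimal sketch of such an eye-based alignment step is shown below. The canonical eye positions, output size, and the similarity-transform formulation are assumptions for illustration, not the exact preprocessing used to build the dataset.

import numpy as np
import cv2

def align_face(image, left_eye, right_eye, out_size=512, eye_y=0.4, eye_dist=0.25):
    """Rotate, scale, and translate a face image so the eyes land on canonical
    positions, padding missing areas with the average image color."""
    left_eye = np.float32(left_eye)
    right_eye = np.float32(right_eye)
    # desired eye locations in the output image (assumed canonical positions)
    dst_l = np.float32([out_size * (0.5 - eye_dist / 2), out_size * eye_y])
    dst_r = np.float32([out_size * (0.5 + eye_dist / 2), out_size * eye_y])
    v_src, v_dst = right_eye - left_eye, dst_r - dst_l
    scale = np.linalg.norm(v_dst) / np.linalg.norm(v_src)
    angle = np.arctan2(v_dst[1], v_dst[0]) - np.arctan2(v_src[1], v_src[0])
    cos, sin = scale * np.cos(angle), scale * np.sin(angle)
    M = np.float32([[cos, -sin, 0.0], [sin, cos, 0.0]])
    M[:, 2] = dst_l - M[:, :2] @ left_eye        # map the left eye exactly onto dst_l
    mean_color = tuple(float(c) for c in image.reshape(-1, image.shape[2]).mean(axis=0))
    return cv2.warpAffine(image, M, (out_size, out_size), flags=cv2.INTER_LINEAR,
                          borderMode=cv2.BORDER_CONSTANT, borderValue=mean_color)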

3.2. Network Architecture

Inspired by recent work on deep image completion [Iizuka et al., 2017], our conditional completion network relies on an encoder-decoder architecture and two auxiliary discriminator networks. We tried out several architectural choices, using both a smaller cropped celebA dataset and our high-resolution dataset. Our final architecture, described in detail below, trains very stably and produces high-quality synthesis results for high-resolution images.

Training a neural network with a GAN loss involves the simultaneous training of two networks: the conditional image completion network, also called the generator in GAN terminology, and the auxiliary discriminator. The discriminator takes as input either real data or fake data produced by the generator network. It is trained to distinguish real from fake data, while the generator is trained to produce data that fools the discriminator. After training, the discriminator network is not used anymore; it only serves as a loss function for the generator network.

In our image editing context the discriminator tries to distinguish edited photos from genuine, unmodified images. To encourage the conditional completion network not to ignore the additional sketch and color constraints, we provide this information also to the discriminator network as an additional source for distinguishing real from fake data. Because of the construction of the training data, real data always correlates with the sketch and color constraints. Therefore, training with this conditional GAN loss [Mirza and Osindero, 2014] forces the conditional completion network to output data that correlates with the input.

3.2.1. Conditional Completion Network

Figure 6. Conditional completion network architecture.

Figure 6 shows a visualization of our proposed architecture for the conditional completion network. The input to our network is a tensor consisting of an input RGB image with a random region to be edited removed, a binary sketch image that describes the shape constraints of the intended edit, a potentially empty RGB image that describes the color constraints of the edit, a binary mask that indicates the region to be edited, and a one-dimensional per-pixel noise channel. Figure 5 shows an example training input to the conditional completion network. The output of the network is an RGB image of the same resolution as the input, and we replace the image context outside the mask with the input image before feeding the result to the loss function. This guarantees that the editing network is not constrained in image regions outside the editing region, so the system enables local edits without changing other parts of the image.
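The context-restoration step before the loss can be written compactly. The snippet below is a minimal sketch assuming NCHW tensors and a mask that is 1 inside the edit region.

import torch

def restore_context(output, original, mask):
    """Replace everything outside the edit region with the original image, so the
    loss only constrains the network inside the mask."""
    return mask * output + (1.0 - mask) * original

# example shapes (batch of 4, 512x512 RGB):
# output, original: (4, 3, 512, 512); mask: (4, 1, 512, 512), broadcast over channels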

Figure 7. Discriminator architecture consisting of a local (top) and global (bottom) network.

We build on the encoder-decoder architecture proposed by Iizuka et al. [2017], and add more layers and channels to reach the target image resolution. Our final fully-convolutional high-resolution model downsamples the input image three times using strided convolutions, followed by intermediate layers of dilated convolutions [Yu and Koltun, 2016], before the activations are upsampled again to the input resolution using transposed convolutions. After each upsampling layer we add skip connections to the last preceding layer with the same spatial resolution by concatenating the feature channels. These skip connections help to stabilize training on the one hand, and they allow the network to synthesize more realistic textures on the other hand. We observe that without these skip connections the network ignores the additional noise channel input and produces lower quality textures, especially for hair.
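The PyTorch sketch below illustrates this encoder-decoder layout with strided downsampling, dilated bottleneck convolutions, transposed-convolution upsampling, and concatenation skip connections. Channel counts, kernel sizes, and depth are reduced, illustrative choices (the full model has 23 layers), and the normalization discussed next is omitted here.

import torch
import torch.nn as nn

class CompletionNetSketch(nn.Module):
    """Reduced sketch of the fully-convolutional encoder-decoder: three strided
    downsampling steps, dilated convolutions at the bottleneck, transposed-conv
    upsampling with concatenation skip connections."""

    def __init__(self, in_ch=9, base=64):
        super().__init__()
        act = nn.LeakyReLU(0.2, inplace=True)
        self.enc0 = nn.Sequential(nn.Conv2d(in_ch, base, 5, 1, 2), act)
        self.enc1 = nn.Sequential(nn.Conv2d(base, base * 2, 3, 2, 1), act)       # /2
        self.enc2 = nn.Sequential(nn.Conv2d(base * 2, base * 4, 3, 2, 1), act)   # /4
        self.enc3 = nn.Sequential(nn.Conv2d(base * 4, base * 8, 3, 2, 1), act)   # /8
        self.dilated = nn.Sequential(
            nn.Conv2d(base * 8, base * 8, 3, 1, padding=2, dilation=2), act,
            nn.Conv2d(base * 8, base * 8, 3, 1, padding=4, dilation=4), act,
            nn.Conv2d(base * 8, base * 8, 3, 1, padding=8, dilation=8), act)
        self.up1 = nn.Sequential(nn.ConvTranspose2d(base * 8, base * 4, 4, 2, 1), act)
        self.dec1 = nn.Sequential(nn.Conv2d(base * 8, base * 4, 3, 1, 1), act)
        self.up2 = nn.Sequential(nn.ConvTranspose2d(base * 4, base * 2, 4, 2, 1), act)
        self.dec2 = nn.Sequential(nn.Conv2d(base * 4, base * 2, 3, 1, 1), act)
        self.up3 = nn.Sequential(nn.ConvTranspose2d(base * 2, base, 4, 2, 1), act)
        self.dec3 = nn.Sequential(nn.Conv2d(base * 2, base, 3, 1, 1), act)
        self.out = nn.Conv2d(base, 3, 3, 1, 1)  # linear output activation

    def forward(self, x):
        e0 = self.enc0(x)
        e1 = self.enc1(e0)
        e2 = self.enc2(e1)
        e3 = self.enc3(e2)
        b = self.dilated(e3)
        d1 = self.dec1(torch.cat([self.up1(b), e2], dim=1))   # skip connection
        d2 = self.dec2(torch.cat([self.up2(d1), e1], dim=1))  # skip connection
        d3 = self.dec3(torch.cat([self.up3(d2), e0], dim=1))  # skip connection
        return self.out(d3)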

Similar to Karras et al. [2017], we implement a variant of local response normalization (LRN) after convolutional layers, defined as

$$b^{j}_{x,y} = \frac{a^{j}_{x,y}}{\sqrt{\tfrac{1}{N}\sum_{k=0}^{N-1}\big(a^{k}_{x,y}\big)^{2} + \epsilon}}, \qquad (1)$$

where $a^{j}_{x,y}$ is the activation in feature map $j$ at spatial position $(x, y)$, $N$ is the number of feature maps output by the convolutional layer, and $\epsilon = 10^{-8}$. We find this per-layer normalization in the editing network crucial for stable training; however, we also observe that this heavy constraint limits the capacity of the network dramatically and prevents the model from producing detailed textures. Our proposed solution is to apply LRN only after the first 14 layers, before upsampling the data, which leads to both stable training and high-quality texture synthesis. We use the leaky ReLU activation function after each layer except for the output layer, which uses a linear activation function. In total, the proposed editing network consists of 23 convolutional layers with up to 512 feature channels.
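A minimal PyTorch module for this normalization could look as follows; it follows the pixelwise formulation of Karras et al. [2017], and the epsilon value is taken from that work and should be treated as an assumption here.

import torch
import torch.nn as nn

class PixelwiseNorm(nn.Module):
    """Normalization of Eq. (1): each pixel's feature vector is divided by the
    root mean square over its N feature maps, with a small epsilon for stability."""

    def __init__(self, eps=1e-8):
        super().__init__()
        self.eps = eps

    def forward(self, a):                       # a: (N, C, H, W)
        rms = torch.sqrt(a.pow(2).mean(dim=1, keepdim=True) + self.eps)
        return a / rms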

3.2.2. Discriminator Networks

Figure 7 shows a visualization of our proposed architecture for the discriminator networks. Similar to Iizuka et al. [2017], we use a global and a local discriminator. The input to the global network is a tensor consisting of a fake sample, i.e., the edited RGB image synthesized by the conditional completion network, the sketch and color constraints, and the binary mask indicating the edited region. For real samples we use a genuine, unedited image and a random mask. The local discriminator uses the same input tensor but only looks at a cropped region centered around the region to be edited. The outputs of both discriminator networks are 512-dimensional feature vectors that are concatenated and fed into a single fully-connected layer that outputs a single scalar value. This way the contribution of both discriminators is weighted optimally by learning the weights of the fully-connected layer, and there is no need for an additional hyper-parameter.

Both discriminators are fully-convolutional with alternating convolution and strided convolution layers until the activations are downsampled to a single pixel with 512 feature channels. We use leaky ReLU activation functions everywhere in the discriminators except for the fully-connected output layer that merges the outputs of both discriminators using a linear activation function. We do not apply any form of normalization in the discriminator networks. The global discriminator has 17 convolutional layers and the local discriminator consists of 16 layers.
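A reduced sketch of this two-branch discriminator is given below. Layer counts, channel widths, and the crop size are illustrative assumptions; only the overall structure (two fully-convolutional branches without normalization, merged by a single linear layer) follows the description above.

import torch
import torch.nn as nn

def conv_stack(in_ch, widths):
    """Alternating stride-1 / stride-2 convolutions with LeakyReLU, no normalization."""
    layers, prev = [], in_ch
    for w in widths:
        layers += [nn.Conv2d(prev, w, 3, 1, 1), nn.LeakyReLU(0.2, inplace=True),
                   nn.Conv2d(w, w, 3, 2, 1), nn.LeakyReLU(0.2, inplace=True)]
        prev = w
    return nn.Sequential(*layers)

class TwoScaleDiscriminator(nn.Module):
    """Global + local discriminator pair: both branches see the
    (image, sketch, color, mask) stack (8 channels), the local branch only a crop
    around the edited region; their 512-d outputs are merged by one linear layer."""

    def __init__(self, in_ch=8):
        super().__init__()
        # global branch: 512 -> 1 spatial resolution (9 stride-2 steps)
        self.global_net = conv_stack(in_ch, [32, 64, 64, 128, 128, 256, 256, 512, 512])
        # local branch: operates on a 256-pixel crop (8 stride-2 steps)
        self.local_net = conv_stack(in_ch, [64, 64, 128, 128, 256, 256, 512, 512])
        self.fc = nn.Linear(1024, 1)   # learned weighting of the two branches

    def forward(self, full, crop):
        g = self.global_net(full).flatten(1)      # (N, 512)
        l = self.local_net(crop).flatten(1)       # (N, 512)
        return self.fc(torch.cat([g, l], dim=1))  # single scalar per sample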

3.3. Training Procedure

We experimented with several loss functions to achieve stable training behavior and high-quality synthesis results. Previous work has shown that a combination of pixel-wise reconstruction loss and GAN loss results in high-quality image synthesis in both image completion and image translation applications [Pathak et al., 2016; Iizuka et al., 2017; Isola et al., 2017; Wang et al., 2017]. The pixel-wise reconstruction loss stabilizes training and encourages the model to maintain global consistency on lower frequencies. We use a reconstruction loss restricted to the area to be edited, that is

$$\mathcal{L}_{\text{rec}} = \frac{1}{N}\sum_{i=1}^{N}\big\lvert \hat{y}_{i} - y_{i} \big\rvert, \qquad (2)$$

where $\hat{y}_{i}$ is the output of the editing network at pixel $i$ after restoring the editing context with the ground truth image $y$, and $N$ is the number of pixels.

For the GAN loss we tried three different approaches until we found a setting that works reasonably well for our system. We first experimented with the original GAN loss formulation by Goodfellow et al. [2014], that is

$$\min_{G}\max_{D}\; \mathbb{E}_{y \sim p_{\text{data}}}\big[\log D(y)\big] + \mathbb{E}_{x}\big[\log\big(1 - D(G(x))\big)\big], \qquad (3)$$

where $D$ is the discriminator and $G$ the generator network. However, we find that this GAN loss results in very unstable training and the networks often diverge, even for smaller networks on lower resolution data. Next we evaluated the BEGAN loss function [Berthelot et al., 2017] by replacing our discriminator networks with autoencoders. While this results in significantly more stable training behavior, the BEGAN setting converges very slowly and tends to produce blurrier synthesis results compared to the original GAN formulation. Moreover, the BEGAN discriminators occupy significantly more memory due to the autoencoder architecture and therefore limit the capacity of our model for high-resolution images, given the memory limitations of current GPUs.

We achieve by far the best results on both high- and low-resolution data using the WGAN-GP loss function [Gulrajani et al., 2017], defined as

$$\mathcal{L}_{\text{GAN}} = \mathbb{E}_{x}\big[D(G(x))\big] - \mathbb{E}_{y}\big[D(y)\big] + \lambda\, \mathbb{E}_{\hat{y}}\Big[\big(\lVert \nabla_{\hat{y}} D(\hat{y}) \rVert_{2} - 1\big)^{2}\Big], \qquad (4)$$

where $\lambda$ is a weighting factor for the gradient penalty and $\hat{y}$ is a data point uniformly sampled along the straight line between $G(x)$ and $y$. Similar to Karras et al. [2017], we add an additional term $\epsilon_{\text{drift}}\, \mathbb{E}_{y}\big[D(y)^{2}\big]$ to keep the discriminator output close to zero. Our overall loss function therefore becomes

$$\mathcal{L} = \mathcal{L}_{\text{GAN}} + \alpha\, \mathcal{L}_{\text{rec}}, \qquad (5)$$

with fixed weighting factors $\lambda$ and $\alpha$. We use a fixed learning rate without any decay schedule and train using the ADAM optimizer [Kingma and Ba, 2014] with standard parameters. With this hyper-parameter setting we are able to train our high-resolution model end-to-end without any parameter adjustments during training, unlike Iizuka et al. [2017]. We attribute this stability to the WGAN-GP loss and our network architecture, mainly the local response normalization and the skip connections in the conditional completion network. Training the full model takes two weeks on a Titan XP GPU with a batch size of 4.
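The loss terms above can be sketched in PyTorch as follows. The gradient-penalty weight, drift epsilon, and reconstruction weight are common defaults used as placeholders rather than the exact values, and the reconstruction term is written as a masked L1 penalty, which is an assumption of this sketch.

import torch

def gradient_penalty(disc, real_full, fake_full, real_crop, fake_crop, lam=10.0):
    """WGAN-GP penalty on samples interpolated between real and fake inputs;
    lam=10 is the usual default from Gulrajani et al. [2017]."""
    n = real_full.size(0)
    eps = torch.rand(n, 1, 1, 1, device=real_full.device)
    mix_full = (eps * real_full + (1 - eps) * fake_full).requires_grad_(True)
    mix_crop = (eps * real_crop + (1 - eps) * fake_crop).requires_grad_(True)
    score = disc(mix_full, mix_crop)
    grads = torch.autograd.grad(score.sum(), [mix_full, mix_crop], create_graph=True)
    norm = torch.cat([g.flatten(1) for g in grads], dim=1).norm(2, dim=1)
    return lam * ((norm - 1.0) ** 2).mean()

def discriminator_loss(disc, real, fake, real_crop, fake_crop, eps_drift=1e-3):
    # fake inputs should be detached from the generator graph when training D
    d_real, d_fake = disc(real, real_crop), disc(fake, fake_crop)
    drift = eps_drift * (d_real ** 2).mean()            # keeps D output near zero
    gp = gradient_penalty(disc, real, fake, real_crop, fake_crop)
    return d_fake.mean() - d_real.mean() + gp + drift

def generator_loss(disc, fake, fake_crop, output, target, mask, alpha=1.0):
    """Adversarial term plus masked pixel-wise reconstruction; alpha is an
    assumed weighting."""
    adv = -disc(fake, fake_crop).mean()
    rec = ((output - target).abs() * mask).sum() / mask.sum().clamp(min=1)
    return adv + alpha * rec

Both networks would then be updated with ADAM, as described above.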

3.4. User Interface

Our web-based user interface (Figure 2, left) features several tools to edit an image using sketching. The interface consists of two main canvases: the left canvas shows the original input image, and a user can sketch on it to edit. The right canvas shows the edited result and is updated immediately after drawing each stroke. We provide a mask tool to draw an arbitrarily shaped region to be edited. A user can indicate the shape of the edited contents by drawing and erasing strokes using the pen tool. A color brush tool lets the user draw colored strokes of variable thickness to indicate color constraints. For eye colors the interface provides a dedicated iris color tool for drawing colored circles of 10 pixels radius to indicate iris color and position. A forward pass through the conditional completion network takes 0.06 seconds on the GPU, which enables a real-time user experience.

4. Results

We next present ablation studies to demonstrate the benefits of various components of our system (Section 4.1), followed by image editing results and comparisons to related techniques (Sections 4.2, 4.3).

4.1. Ablation Studies

We first demonstrate the effect of our automatically constructed sketch domain (Section 3.1.2). The key observation is that, while both raw HED edge maps [Xie and Tu, 2015] and our sketch domain lead to very similar conditional image completion results, our sketch domain generalizes better to free-hand sketches provided by users. Figure 8 compares conditional image completion based on automatically constructed (using the input image itself) HED edge maps and our sketch domain, showing that both lead to high quality results. In contrast, Figure 9 illustrates results from free-hand user sketches. Synthesized results with HED (middle column) are blurrier and show some artifacts. Because the system is trained on highly accurate HED edge maps, artifacts in the user input translate into artifacts in the output. Our sketch domain leaves some ambiguity between image structures and sketch strokes, which enables the system to avoid artifacts when the user input is inaccurate.

Figure 8. Editing results on training-like data for HED edges and our proposed sketch domain. Both lead to high quality conditional image completion.
Figure 9. Editing results on real user input for a system trained using HED edges (middle) and our proposed sketch domain (right). Using HED edges, the system translates artifacts in the user input into artifacts in the output. Our sketch domain leaves some ambiguity between image structures and sketch edges, allowing the system to suppress inaccuracies in the user input.

Figure 10 illustrates the benefit of using randomly rotated masks during training. Training only with axis-aligned masks suffers from overfitting, and artifacts occur with non-axis-aligned and arbitrarily shaped masks, as typically provided by users. On the other hand, our approach avoids these issues.

Figure 10. Comparison of training with axis-aligned masks (middle) and randomly rotated masks (right). Randomly rotating masks avoids overfitting and generalizes to arbitrarily shaped masks.

We demonstrate the need for including the mask in the input to the discriminators (see Figure 7) in Figure 11. Without providing the mask to the discriminators, subtle artifacts occur around the mask boundaries, while our approach eliminates them.

Figure 11. Subtle artifacts appear if the mask is not provided as an input to the discriminators (middle). Our approach includes the mask in the discriminator input and avoids these issues (right).

Finally, Figure 12 highlights the importance of using skip connections in the conditional generator (Figure 6) and a noise layer in its input. Without these provisions, the conditional completion network produces unrealistic textures (second image from the left). In contrast, with our approach we obtain a high quality output (third and fourth image). We show two results with different noise patterns to emphasize the influence of the noise on the texture details, visualized by the difference image on the right.

Figure 12. Including a noise layer is important to produce realistic textures. Without it, synthesized textures exhibit artifacts (second image from left). We show results with two noise patterns (third and fourth image), and highlight the influence of the noise on texture details using a difference image (right).

4.2. Sketch-based Face Image Editing

Figure 13. Results using sketching with our system. The examples shown here include editing of nose shape, facial expression (open or close mouth), hairstyle, eyes (open or close eyes, gaze direction, glasses), and face shape. Synthesized face regions exhibit plausible shading and texture detail.

Figure 13 shows various results obtained with our system using sketching. They include editing operations to change the nose shape, facial expression (open or close mouth), hairstyle, eyes (open or close eyes, gaze direction, glasses), and face shape. These examples demonstrate that by providing simple sketches, users can obtain predictable, high quality results with our system. Edges in the user sketches intuitively determine the output, while the system is robust to small imperfections in the sketches, largely avoiding visible artifacts in the output. Synthesized face regions exhibit plausible shading and texture detail. In addition, these results underline the flexibility of our system, which is able to realize a variety of intended user edits without being restricted to specific operations (like opening eyes [Dolhansky and Ferrer, 2017]).

Figure 14. Results using sketching and coloring. The unmodified input images are shown in Figure 13 in rows three and six, leftmost column.

Figure 14 highlights the color editing functionality of our system, where we show examples of changing eye, makeup, and lip color. This shows that the system is robust to imperfect user input consisting of only rough color strokes, largely avoiding artifacts.

4.3. Smart Copy-Paste

Figure 15 shows results of our smart copy-paste functionality, where we copy the masked sketch domain of a source image into a target image. This allows users to edit images in an example-based manner, without the need to draw free hand sketches. In addition, we compare our results to Poisson image editing [Pérez et al., 2003]. The comparison reveals that Poisson image editing often produces artifacts due to mismatches in shading (second, fourth, sixth column) or face geometry (third column) between the source and target images. Our approach is more robust and avoids artifacts, although significant misalignment between source and target still leads to unnatural outputs (third column).

Figure 15. Our system supports example-based editing via copy-and-paste. We show input images (first row), copy-paste input (second row), and results from Poisson image editing (third row) and our approach (last row). Note that Poisson image editing copies gradients, while we copy the sketch domain of the source region. While our results are mostly free of artifacts, Poisson image editing often suffers from inconsistencies between source and target images.

Our copy-paste approach is generic and can be applied to any image region, subsuming specialized approaches like the “open eyes” functionality in Adobe Photoshop Elements or the eye inpainting technique by Dolhansky and Ferrer [2017]. We compare our approach to these techniques in Figure 16. While Adobe’s technique may lead to color artifacts, and Dolhansky and Ferrer [2017] is limited to lower resolution images, our approach leads to crisp, high quality results.

Figure 16. Our generic copy-paste approach can be used for operations such as opening eyes. We compare to two specialized techniques, Adobe’s “open eyes” functionality and eye inpainting [Dolhansky and Ferrer, 2017]. Our approach produces high quality results, avoiding color artifacts and providing crisper results compared to Dolhansky and Ferrer [2017].

5. Conclusions

In this paper we presented a novel system for sketch-based image editing based on generative adversarial networks (GANs). We introduced a conditional completion network that synthesizes realistic image regions based on user input consisting of masks, simple edge sketches, and rough color strokes. In addition, our approach supports example-based editing by copying and pasting source regions into target images. We trained our system on high-resolution imagery based on the celebA face dataset and showed a variety of successful and realistic editing results. Key to our method is a careful design of the training data, the architectures of our conditional generator and discriminator networks, and the loss functions. In particular, we described a suitable automatic construction of sketch and color stroke training data. In the future, we will train our system on more diverse and even higher resolution image datasets. We are also exploring additional user interaction tools that may enable even more intuitive editing workflows. Finally, we will explore how to leverage deep learning for sketch-based editing of video data.

References