Smart, Deep Copy-Paste

03/15/2019 ∙ by Tiziano Portenier, et al. ∙ 79

In this work, we propose a novel system for smart copy-paste, enabling the synthesis of high-quality results given a masked source image content and a target image context as input. Our system naturally resolves both shading and geometric inconsistencies between source and target image, producing a merged result in which the content from the source image is seamlessly pasted into the target context. Our framework is based on a novel training image transformation procedure that allows a deep convolutional neural network to be trained end-to-end to automatically learn a representation that is suitable for copy-pasting. Our training procedure works with any image dataset without additional information such as labels, and we demonstrate the effectiveness of our system on two popular datasets: high-resolution face images and the more complex Cityscapes dataset. Our technique outperforms the current state of the art on face images, and we show promising results on the Cityscapes dataset, demonstrating that our system generalizes to much higher resolution than the training data.

1 Introduction

Image manipulation techniques constitute a major part of computer graphics and computer vision research. The rapidly growing imagery content on the web, accelerated by social media trends and the high-quality image acquisition systems of modern smartphones, increases the demand for flexible, high-quality, and easy-to-use image editing applications. Recently, several tools have been proposed that leverage deep learning techniques to let inexperienced users manipulate and enhance images, for example [1, 2, 3, 4]. However, most applications target rather specific image editing operations and lack more general-purpose image manipulation. On the other hand, image editing applications targeted at experienced users allow images to be enhanced by directly manipulating individual pixel values. While these systems allow complex editing operations, expertise is required to use such tools efficiently, and reasonably complex manipulations are time-consuming even for experienced users.

In this work, we propose an image editing system that enables a user to manipulate images by using imagery content from another source image. We make use of recent advances in deep learning to implement a smart copy-paste system that produces realistic, high-quality results by seamlessly merging a source image patch into a target image. This approach makes complex image manipulations possible, provided that there is another exemplar image that contains the desired features. In Section 4 we show several examples where an object is copied from a source image and seamlessly pasted into a target image.

Conceptually, our work can be seen as a form of image completion, and it is technically related to approaches that leverage deep learning to solve the image completion problem [5, 6, 7, 8]. While these approaches try to inpaint a missing image region solely based on the image context to synthesize a plausible image patch, in our work the completion task is conditioned not just on the target context, but also on another source image from which the desired content is selected. Such a conditional image completion approach is also addressed by the work of Dolhansky and Ferrer [9]. However, their system focuses on the very specific task of copy-pasting eyes on face images only. Probably the most closely related previous work on this task is FaceShop [10]. Their system features a copy-paste mode that allows facial features to be cut from a source image and pasted seamlessly into a target image. However, unlike their system, our approach is not restricted to face images, as we will show in this work. Moreover, our technique enables copy-pasting of features like textures and low-contrast content, which is not possible with their system.

The core component of our system is a deep convolutional neural network (CNN) that is trained end-to-end to perform copy-pasting. Our framework can be trained on any image dataset without the need of additional information such as labels or paired images. We show that our approach produces high-quality copy-paste results for a wide range of difficult examples where previous methods fail. We evaluate our technique on two completely different image datasets and demonstrate its effectiveness on a wide range of examples. In summary, we make the following contributions:

  • a novel technique to generate training data that is suitable to train a CNN on the task of copy-pasting objects from one image to another, given an arbitrary image dataset without additional information as input,

  • an end-to-end trained system that synthesizes high-resolution, high-quality, and coherent copy-paste results,

  • a framework that outperforms the previous state of the art technique on face images, and

  • a demonstration of its effectiveness beyond faces on a diverse image dataset.

2 Related Work

In this section, we discuss previous work that is most related to our technique. A comprehensive survey of image editing techniques would exceed the scope of this work, therefore we will focus on approaches that are highly related either technically or conceptually. The following paragraphs are divided into two categories: deep image completion techniques and traditional copy-paste approaches. Image completion (sometimes named deep image inpainting) considers the problem of filling a missing region in an input image, either by using additional inputs or solely based on the region context. Recent advances in deep learning inspired many recent techniques to solve this task by leveraging deep neural networks. Traditional copy-paste approaches (also known as image harmonization techniques) try to solve the task of seamlessly blending a region in a source image into a target image. We focus on this task in our work, borrowing techniques from deep image completion approaches to produce photo-realistic copy-paste results.

2.1 Traditional Copy-Paste Techniques

Poisson image editing [11] is a technique that performs copy-paste in the image gradient domain. This technique works very well in examples where the shading mismatch between source and target is moderate, but it causes distracting blending artifacts if the discrepancy is too severe. Moreover, it struggles with geometric inconsistencies, since it can neither ignore mismatching features nor hallucinate missing content, as we will show in Section 4. Another traditional technique extends Poisson image editing by finding optimal seams that reduce artifacts using graph-cut optimization [12], before applying Poisson blending to blend the source and target image [13]. While their system mitigates blending artifacts by reducing the shading mismatch, it shares the problems of Poisson image editing in terms of geometric inconsistencies.

2.2 Deep Image Completion

Many recent techniques train CNNs on the task of image completion. Pathak et al. [5] successfully tackle the problem by training a Generative Adversarial Network (GAN) [14] in combination with a pixel-wise reconstruction loss. GANs are trained by leveraging an auxiliary discriminator network that acts as a loss function. The image completion network tries to fool the discriminator network, which itself tries to discriminate genuine from synthesized images. The technique has further been improved by using two discriminator networks for two different scales [7, 6], which leads to greater synthesis quality. One obvious drawback of these techniques is that the user has no control over the completion process, since it is entirely determined by the image context.

Recently, researchers have proposed to leverage additional input information to guide the image completion process. Dolhansky and Ferrer [9] developed a system that is able to inpaint missing eye regions into face images by incorporating another input image that provides exemplar eyes. They demonstrate its effectiveness on the application of opening eyes in a portrait image that features closed eyes. While their results are promising, the application is limited to eye regions only. Portenier et al. [10] proposed a system that enables copy-pasting of facial parts such as nose, mouth, or eyes by leveraging a sketch domain as copy-paste space and training a CNN to translate the sketch to a photo-realistic image. Their system produces exciting results; however, they demonstrate it only on face images. Moreover, their system cannot copy-paste features that are lost by the transformation to the ad hoc sketch domain, e.g., textural features or low-contrast images. In contrast to these techniques, our system is applicable beyond face images and enables copy-pasting of faint features such as textures. Another closely related technique is the deep image harmonization network by Tsai et al. [15]. They follow a similar approach to train a CNN to do copy-paste; however, similar to the work by Yang et al. [16], their technique requires semantic label information to train their network, unlike our approach, which works with any image dataset without additional information. Moreover, their system is trained to handle only discrepancies between source and target due to shading, whereas our system can also handle geometric inconsistencies such as background clutter, missing information, or pose mismatches. Other techniques relax the copy-paste problem by training the CNN to choose the best matching region from a source image to do the inpainting [17]. While this approach is more flexible, automatically selecting a region in a source image limits the user's ability to control the result, leading to undesired copy-paste results in practice.

3 Smart, Deep Copy-Paste

Figure 1: System overview. The main component consists of a deep CNN that is trained on the task of copy-pasting. At runtime, the input to our network is a target image, a masked region from a source image, and a binary mask. The system then computes residuals that are added to the source image, before the source and target images are composited using the provided mask to form the final result.

In this section we introduce our smart copy-paste framework, which is implemented using a deep CNN. Figure 1 shows an overview of the proposed system. The main component consists of a CNN that is trained using a combination of conditional GAN loss and reconstruction loss. At runtime, the input to our network consists of a target image, a masked region from a source image and a binary mask. The system then computes residuals that are added to the source image, before the source and target images are composited using the provided mask to form the final result.

In Section 3.1 we discuss our approach to produce suitable training data, given an arbitrary image dataset as input. This procedure is crucial to render realistic results without the need of additional labels. Next, we motivate and describe the network architecture that we use to implement our copy-paste system (Section 3.2). Finally, we propose a training procedure in Section 3.3 that allows the training of our model end-to-end using suitable training data.

3.1 Training Data

The task of copying content from one image and seamlessly pasting it into another image is difficult to train a network for. Ideal training data consists of paired source and target images, where one region in the source image semantically fits into another region in the target image. In addition, a binary mask that indicates the respective regions in both images is required. Finally, a ground truth image is needed for each training pair in order to provide a meaningful loss signal during training. While the generation of suitable image pairs could be achieved by manually selecting and labeling appropriate images, the generation of ground truth images to compile a large-scale dataset suitable to train a deep CNN is hardly possible in practice. To overcome these issues, we propose a novel training data generation procedure that works with any image dataset without additional labeling.

The core idea of our approach is to automatically produce image pairs procedurally, using a single image as input. Given a training image, we cut out a random region, called content, transform it using a transformation, and paste it back into the image. During training, the network tries to recover the original image, i.e., it tries to undo the transformation on the content region. Besides the obvious advantage of requiring only a single image instead of a suitable image pair, this approach also enables training with completely random masks, and it provides a ground truth image for each training sample, namely the original image itself. However, finding a suitable transformation is crucial: if the transformation is too simple and close to the identity, the task for the network becomes trivial, and it will overfit to the training data and fail on real-world copy-paste examples, where source and target typically come from two different images. On the other hand, if the transformation is too strong, the task becomes too difficult and the network will degenerate to unconditional inpainting, effectively ignoring the input content.
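The procedure described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes NumPy images with values in [0, 1], and the names `make_training_sample` and the dictionary keys are hypothetical. The transformation is passed in as a callable so that any perturbation can be plugged in.

```python
import numpy as np

def make_training_sample(image, mask, transform):
    """Build one self-supervised copy-paste sample from a single image.

    image:     float array (H, W, 3); it also serves as the ground truth
    mask:      float array (H, W, 1); 1 inside the content region, 0 elsewhere
    transform: callable that perturbs a whole image (hypothetical placeholder)
    """
    transformed = transform(image)            # perturb the image before cutting
    context = image * (1.0 - mask)            # original image with the region removed
    content = transformed * mask              # perturbed content, cut out by the mask
    pasted = context + content                # naive paste that the network must repair
    return {
        "ground_truth": image,
        "context": context,
        "content": content,
        "pasted": pasted,
        "mask": mask,
    }

# Usage with a toy transformation (an assumption, chosen only for illustration).
rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))
mask = np.zeros((8, 8, 1))
mask[2:6, 3:7] = 1.0
sample = make_training_sample(image, mask, lambda im: np.clip(0.5 * im + 0.1, 0.0, 1.0))
```

With the identity as transformation the pasted image equals the ground truth, which is the trivial case the text warns about.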

We therefore need to design the transformation carefully, ideally simulating real-world copy-paste image pairs. When considering copy-paste examples that we would like our system to solve, we observe that several discrepancies may occur between source and target. There might exist color mismatches between source and target, based on inconsistencies in shading or background colors. Moreover, the content may feature inconsistent geometry, for example the pose of the object to paste can be slightly wrong. In addition, roughly cutting an object from a source image may include unrelated background clutter that should be ignored during copy-paste. These observations inspire our design of the transformation, which consists of both shading and geometric transformations. See Figure 2 for an example.

3.1.1 Shading Adjustments

To mimic shading and other color mismatches, we first define a color transformation that randomly changes brightness, contrast, hue, and saturation of image , i.e.,

(1)

where the saturation adjustment converts the image from RGB to HSV, scales the S-channel by a random factor, and converts the image back to RGB. Similarly, the hue adjustment adds a bias to the H-channel. The contrast adjustment scales the zero-centered pixel values of each channel independently:

(2)

where the mean pixel value is computed separately for each color channel. Finally, the brightness adjustment adds a single bias to all color channels. See Figure 2 for an example of the color transformation.
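The four adjustments can be sketched as follows. This is a rough illustration under stated assumptions: images are NumPy arrays in [0, 1], the order in which the adjustments are composed and the sampling ranges (`strength`) are guesses because the corresponding equation and values are not available here, and all function names are hypothetical. The HSV conversion uses Python's standard `colorsys` module per pixel, which is slow but keeps the sketch dependency-free.

```python
import colorsys
import numpy as np

def adjust_saturation_hue(img, sat_scale, hue_shift):
    """Scale the S channel and shift the H channel in HSV space, pixel by pixel."""
    out = np.empty_like(img)
    height, width, _ = img.shape
    for y in range(height):
        for x in range(width):
            r, g, b = (float(c) for c in img[y, x])
            h, s, v = colorsys.rgb_to_hsv(r, g, b)
            h = (h + hue_shift) % 1.0                 # hue is cyclic in [0, 1)
            s = min(max(s * sat_scale, 0.0), 1.0)     # keep saturation valid
            out[y, x] = colorsys.hsv_to_rgb(h, s, v)
    return out

def adjust_contrast(img, scales):
    """Scale zero-centered pixel values, independently per color channel."""
    mean = img.mean(axis=(0, 1), keepdims=True)       # per-channel mean
    return (img - mean) * np.asarray(scales) + mean

def adjust_brightness(img, bias):
    """Add one bias to all color channels."""
    return img + bias

def color_transform(img, rng, strength=0.2):
    """Random global color transformation; ranges and order are assumptions."""
    sat = 1.0 + rng.uniform(-strength, strength)
    hue = 0.1 * rng.uniform(-strength, strength)
    con = 1.0 + rng.uniform(-strength, strength, size=3)
    bri = rng.uniform(-strength, strength)
    out = adjust_saturation_hue(img, sat, hue)
    out = adjust_contrast(out, con)
    out = adjust_brightness(out, bri)
    return np.clip(out, 0.0, 1.0)
```

Because every step is a global operation on the whole image, a network can undo it with a global scale and bias, which motivates the locally varying variant discussed next in the text.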

Figure 2: Example of our proposed transformation. From left to right: original image, color transformation, shading transformation, geometric transformation, and the complete final transformation.

Global transformations such as this color transformation can easily be undone by a CNN that applies appropriate scaling and bias to the signal. In practice, more complicated mismatches between source and target image occur. The content patch may fit locally with the target image in one region, but another region may feature inconsistencies such that global adjustments do not resolve the issue. To simulate such locally varying color mismatches, we propose a more sophisticated procedure that produces locally varying shading adjustments. This step is crucial, as we will show in Section 4. Given an input image, we first compute two independent color transformations of it. Next, we compute a random mixing mask that is used to fuse the two transformed images to form the final image that features locally varying shading adjustments:

(3)

To create the mixing mask, we first sample a low-resolution single-channel image with salt-and-pepper noise. Next, we use bilinear upsampling to scale the noise image to the resolution of the input image. See Figure 2 for an example of the shading transformation.
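A possible reading of this mixing step is sketched below. It assumes NumPy images in [0, 1]; the low resolution of the noise image (`low_res`) is a placeholder because the original value is not available here, the blend is written as a per-pixel convex combination, and the function names are hypothetical.

```python
import numpy as np

def bilinear_upsample(low, out_h, out_w):
    """Bilinearly resize a 2D array to (out_h, out_w), aligning the corners."""
    in_h, in_w = low.shape
    ys = np.linspace(0.0, in_h - 1, out_h)
    xs = np.linspace(0.0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]                      # vertical interpolation weights
    wx = (xs - x0)[None, :]                      # horizontal interpolation weights
    top = low[y0][:, x0] * (1.0 - wx) + low[y0][:, x1] * wx
    bottom = low[y1][:, x0] * (1.0 - wx) + low[y1][:, x1] * wx
    return top * (1.0 - wy) + bottom * wy

def random_mixing_mask(height, width, rng, low_res=4):
    """Salt-and-pepper noise at low resolution, upsampled to a smooth mask."""
    noise = rng.integers(0, 2, size=(low_res, low_res)).astype(np.float64)
    return bilinear_upsample(noise, height, width)[..., None]

def mix(img_a, img_b, mask):
    """Blend two differently color-transformed versions of the same image."""
    return mask * img_a + (1.0 - mask) * img_b
```

Feeding `mix` two independent color transformations of the same image yields shading that drifts smoothly across the image, so no single global scale and bias can undo it.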

3.1.2 Geometric Adjustments

So far we considered how to handle color mismatches. In real-world copy-paste examples, geometric inconsistencies between source and target images also occur. To address such geometric mismatches, we propose to apply random homography transformations to the training image. For images of planar objects, homographies encode all possible pose transformations. This no longer holds for more general, non-planar objects, but we find that if the homography is sufficiently close to the identity, the error is negligible. In addition to pose mismatches, applying homographies makes our system robust to other mismatches. Since we apply the transformation before we cut and paste a region during training, we introduce two other types of inconsistencies that also occur in practice. First, some content might get lost by applying the homography, and the system is encouraged to learn to invent missing content, based on the pasted content and the context. In addition, image features might appear twice after applying the homography, once in the context region and again in the pasted content. The network needs to learn to ignore such content in order to complete the task. This makes our system robust to both missing content and clutter, as we will show in Section 4.

To apply a random homography, we first add random offsets to the four image corner pixel coordinates. Next, we compute the respective homography by solving the linear system consisting of the point correspondences using least squares. This homography defines our proposed geometric transformation. See Figure 2 for an example of the geometric transformation.
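The homography estimation described above can be sketched with a standard direct linear formulation in which the bottom-right matrix entry is fixed to 1. The uniform offset distribution, the parameter `max_offset`, and the function names are assumptions made for illustration; only the corner perturbation and the least-squares solve come from the text.

```python
import numpy as np

def homography_from_points(src, dst):
    """Solve for the 3x3 homography mapping src points to dst points (h33 = 1)."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(np.asarray(src, float), np.asarray(dst, float)):
        rows.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        rhs.append(v)
    h, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return np.append(h, 1.0).reshape(3, 3)

def random_homography(width, height, max_offset, rng):
    """Perturb the four image corners and fit the corresponding homography."""
    corners = np.array(
        [[0, 0], [width - 1, 0], [width - 1, height - 1], [0, height - 1]], float
    )
    moved = corners + rng.uniform(-max_offset, max_offset, size=(4, 2))
    return homography_from_points(corners, moved), corners, moved

def apply_homography(H, points):
    """Map 2D points through H using homogeneous coordinates."""
    pts = np.hstack([points, np.ones((len(points), 1))]) @ H.T
    return pts[:, :2] / pts[:, 2:3]
```

The offset magnitude controls how far the homography departs from the identity, which is the parameter varied in the ablation study of Section 4.1.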

Similar to [10], we use randomly rotated rectangular masks with random size and position. Each training sample consists of a ground truth image, a target image, a source image, and a copy-paste mask.

3.1.3 Datasets

To evaluate the effectiveness of our system, we train it on two different datasets. First, we use the high-resolution face dataset from [10], which allows a qualitative comparison to FaceShop. This dataset consists of 21k face images of resolution. Next, we train our system on the Cityscapes dataset [18] to demonstrate its effectiveness on non-face images. This dataset consists of 5k street view images captured by a camera mounted on top of a car driving through different cities. The images feature a resolution of pixels, and we downsample the images for training to a resolution of pixels.

3.2 Network Architecture

In this section, we motivate and describe the network architecture that enables our system to render high-quality copy-paste results. In FaceShop [10], the authors propose a copy-paste system that leverages a sketch domain as copy-paste space to produce realistic renderings on face images. The main idea is to map a source image to the sketch domain, copy-paste a source region in this domain to a target image, and let a conditional completion network produce the final result. This ad hoc sketch domain comes with an obvious drawback: features like textures or low-contrast content are lost by mapping an image to the sketch domain. Hence, the system cannot be applied to copy-paste such image features. Moreover, the proposed sketch domain may not work reasonably well on arbitrary image datasets, and the authors only demonstrate its effectiveness on face images. Our core idea is therefore to learn a copy-paste space that is better suited for the task than the ad hoc sketch domain. A promising idea is to train a separate encoder network that maps the source image to a dedicated copy-paste space; the source content is then cut out in this copy-paste domain and fed into another copy-paste network that computes the final result. The proposed encoder network can be trained jointly with the copy-paste network to learn an optimal representation for the copy-paste task. We experimented with such an explicit copy-paste space, but our experiments showed that we achieve better results by using a single copy-paste network without a dedicated copy-paste space encoder. Our single copy-paste network implicitly learns an internal representation that is better suited to the copy-paste task, by leveraging both the source content and the target context to find a suitable representation that is invariant to the transformations introduced in Section 3.1 and at the same time keeps as much information as possible from the source image. 
Based on how well the copied content fits with the target context, our network maps the input to a more or less abstract representation. Inspired by [10], we use an encoder-decoder architecture for our copy-paste network. In addition, we leverage an auxiliary discriminator network that acts as a loss function that is learned simultaneously to the copy-paste network. The input to both networks are images of resolution .

Copy-Paste Network

The encoder part of the copy-paste network gradually downsamples the input tensor to a low spatial resolution, increasing the number of feature maps to 512. Each downsampling operation is implemented using a strided convolution layer with stride 2, followed by a non-strided convolution layer. Each convolutional layer is followed by a leaky ReLU nonlinearity and a local response normalization (LRN) layer [19], defined as:

(4)

where is the activation of feature map at coordinate , is the number of feature maps, and prevents division by zero.
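Since the equation itself is not reproduced here, the following is only a plausible sketch of such a normalization, assuming the pixel-wise form in which the feature vector at every spatial coordinate is divided by the root of its mean squared activation. The symbol names match the description above only loosely, and the default `eps` is an assumption.

```python
import numpy as np

def local_response_norm(activations, eps=1e-8):
    """Normalize the feature vector at each spatial coordinate.

    activations: float array (N, H, W) holding N feature maps
    eps:         small constant that prevents division by zero (assumed value)
    """
    mean_sq = np.mean(activations ** 2, axis=0, keepdims=True)   # average over feature maps
    return activations / np.sqrt(mean_sq + eps)
```

After this step the mean squared activation across feature maps is approximately one at every pixel, which keeps signal magnitudes bounded without any learned parameters.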

The intermediate layers of the copy-paste network are implemented using seven dilated convolution layers with stride 1 and increasing dilation rate up to 16 in order to increase the receptive field. Each layer is again followed by a leaky ReLU nonlinearity and an LRN layer.

The decoder part of the copy-paste network is a mirrored version of the encoder part, gradually upsampling the activations back to the input resolution and decreasing the number of feature maps to three RGB channels. Each upsampling step is implemented using a transposed convolution layer, followed by a non-strided convolution layer. In addition, we add a noise addition layer after each transposed convolution layer, before evaluating the nonlinearity. This injection of stochastic information helps the network to fill in missing information, as we will show in Section 4. Inspired by [20], we sample a single-channel image with per-pixel Gaussian noise for each layer, and broadcast it to the number of feature maps in the corresponding layer. The noise image is then added to each feature map, scaled by a learned per-channel variable. We find that this method produces better results than feeding a single noise image as input to the network.
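The noise addition layer can be illustrated with a short NumPy sketch. It assumes activations stored as (channels, height, width) and a per-channel scale vector standing in for the learned variable; the function name and array layout are hypothetical.

```python
import numpy as np

def add_noise(features, scale, rng):
    """Add one shared Gaussian noise image to every feature map.

    features: float array (C, H, W)
    scale:    float array (C,), stand-in for the learned per-channel variable
    """
    _, height, width = features.shape
    noise = rng.standard_normal((1, height, width))   # single-channel noise image
    return features + scale[:, None, None] * noise    # broadcast over all C feature maps
```

A scale of zero for a channel switches the noise off for that channel, so the network can decide per feature map how much stochastic detail it uses.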

In a U-Net [21] manner, we additionally use skip connections between all corresponding downsampling and upsampling layers by concatenating the respective feature channels, similar to [10]. All convolution layers use kernel size and the transposed convolution layers feature kernel size .

Discriminator Network

We borrow the discriminator architecture from [10]. It consists of a global branch that consumes the entire copy-paste result and a local branch that focuses on the pasted region. Both branches are merged using a single linear dense layer that outputs a scalar value. For a detailed network architecture we refer to [10].

3.3 Training Procedure

Next, we explain the training inputs and outputs and the loss function in detail. Finally, we propose important training details and the choice of hyperparameters that allow stable training.

Figure 3: Example of the proposed training input to the copy-paste network.

The training input to our copy-paste network is a tensor composed of the target context, the transformed source content, and the copy-paste mask, where the image is a random crop of a training image and the mask is a copy-paste mask of random position, size, and rotation, as described in Section 3.1. See Figure 3 for an example. Since a lot of information is already present in the input and the network only has to make the pasted content fit the context, we let it learn only the residuals instead of the final image. We find that this approach produces better results and that the network does not ignore the input content even with moderately strong transformations, as opposed to learning the final image directly. In addition, we replace the image context with the genuine input context before feeding the final image to the loss function. This encourages the network to focus on the content only, since we do not want the system to change the context. The input to the discriminator network consists of an RGB image, either a genuine image or a generated image. We use a conditional GAN, i.e., we concatenate the copy-paste input as additional information to the discriminator input.
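The residual formulation and the context replacement can be written compactly. The sketch below assumes NumPy images of shape (H, W, 3) and a mask of shape (H, W, 1); the 7-channel input layout (3 + 3 + 1) is an assumption, since the tensor dimensions are not available here, and the function names are hypothetical. The `residual` argument stands in for the network output.

```python
import numpy as np

def network_input(target, source, mask):
    """Stack target context, source content, and mask along the channel axis."""
    context = target * (1.0 - mask)     # target image with the paste region removed
    content = source * mask             # source content restricted to the paste region
    return np.concatenate([context, content, mask], axis=-1)

def composite(target, source, mask, residual):
    """Add predicted residuals to the source, then keep the genuine target context."""
    corrected = source + residual
    return mask * corrected + (1.0 - mask) * target
```

Because pixels outside the mask are copied from the target, any loss computed on the composite only rewards changes inside the pasted region.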

As training loss function, we use a combination of WGAN-GP [22] and a pixel-wise reconstruction loss, with an additional regularization term to minimize the norm of the logits [19]. The WGAN-GP loss is defined as

(5)

where is a uniformly sampled image along the straight line between and , and we set . As reconstruction loss we use the pixel-wise distance, i.e.,

(6)

where is the number of image pixels. Our final training objective is therefore defined as

(7)

In all our experiments, we set and . We use ADAM optimizer [23] with and , and we use a constant learning rate of for both and . For both datasets, we use batch size of 5 and train the networks for 250k iterations, which takes approximately two weeks on a Titan XP GPU. For the face image dataset, we set . Since the Cityscapes dataset is harder to train, we found that setting leads to better results on this dataset.
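To make the gradient-penalty term of the WGAN-GP loss more concrete, the toy sketch below evaluates it for a critic whose input gradient is known in closed form. Everything here is an assumption for illustration: the critic D(x) = tanh(x · w) is a stand-in for the real discriminator, inputs are flattened vectors, conditioning is ignored, and `weight` is left as a free parameter because the original penalty weight is not available here.

```python
import numpy as np

def toy_critic_gradient(x, w):
    """Gradient of the toy critic D(x) = tanh(x . w) with respect to x."""
    s = np.tanh(x @ w)                              # critic output per sample, shape (B,)
    return (1.0 - s ** 2)[:, None] * w[None, :]     # chain rule, shape (B, D)

def gradient_penalty(real, fake, w, rng, weight):
    """Penalize critic gradients whose norm deviates from 1 on interpolated samples."""
    t = rng.uniform(0.0, 1.0, size=(real.shape[0], 1))
    x_hat = t * real + (1.0 - t) * fake             # uniform sample on the line real-fake
    grad_norm = np.linalg.norm(toy_critic_gradient(x_hat, w), axis=1)
    return weight * np.mean((grad_norm - 1.0) ** 2)
```

In an actual training loop the gradient would come from automatic differentiation of the discriminator rather than from a closed-form expression.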

4 Results

In this section, we first show ablation studies that demonstrate the effectiveness of different design decisions in our framework (Section 4.1). Next, we show qualitative copy-paste results on the face images dataset with a comparison to the state of the art technique (Section 4.2). In Section 4.3 we finally show results on a dataset that is more complex than face images.

4.1 Ablation Studies

Figure 4: Effect of different values for the offset parameter of the geometric transformation. When choosing a value that is too small (second column), the network fails on real-world copy-paste examples and introduces visual artifacts (see zoomed crop). These artifacts do not occur on training data, which is a sign of overfitting. If the offset is too large (third column), the task becomes too difficult and the network partially ignores the source content, rendering a differently shaped nose in this example. Only the right amount of geometric transformation produces decent results that both resemble the source content and do not feature overfitting artifacts (last column).

We first show the effect of different choices for the parameter that controls the strength of the geometric transformation (Section 3.1.2). Choosing the right amount of geometric transformation is crucial for our system to produce high-quality copy-paste results, as demonstrated in Figure 4. If the offset parameter is too small, the network overfits to the training data, which is noticeable as visual artifacts when the source content comes from a different image than the target context. If the offset is set too high, the copy-paste network tends to ignore the source content for difficult examples and degenerates to a form of unconditional inpainting. Only with a reasonable value for the offset parameter does the system synthesize high-quality results that resemble the input content without introducing visual artifacts. The same argument holds for the strength of the shading transformation. Inappropriate choices for the color transformation parameters either cause the network to become totally invariant to colors, making it impossible to copy-paste color features, or render the network unable to resolve examples with strong shading mismatches between source and target.

Figure 5: Effect of locally varying shading adjustments. Using simple, global shading adjustments (second column) causes the network to fail on examples with locally varying shading mismatches. Our proposed locally varying shading adjustments lead to a copy-paste network that effectively solves such examples.

In Figure 5 we demonstrate the effect of the local shading adjustments proposed in Section 3.1.1. Using simple global shading adjustments, the copy-paste network fails to seamlessly blend source and target image in examples where the shading mismatch between source and target changes locally. When training the network with our proposed locally varying shading adjustments, the network is able to resolve such mismatches.

Figure 6: Effect of noise addition layers. Feeding a single noise image to the input layer (second column) leads to less realistic synthesis of missing features, such as the strand of hair in this example. The use of noise addition layers (last column) produces more realistic results in these cases.

Finally, we highlight the effectiveness of the proposed noise addition layers in the decoder part of the copy-paste network. In Figure 6, we compare the use of noise addition layers to feeding a single noise image to the input layer, as proposed by [10]. Noise addition layers enable the network to produce more realistic features in cases where content needs to be “invented” by the network based on the context. In contrast, feeding a single noise channel to the input layer that needs to be propagated through the entire network makes it more difficult to render realistic features.

4.2 Face Images Dataset

Figure 7: Results on the face images dataset. Our system enables copy-pasting of various facial features like eyes, nose, and mouth, but also of face accessories such as glasses. Moreover, the framework can also be used to replace the entire face and hairstyle.

In Figure 7 we show various results on face images, produced by our copy-paste network. All examples feature a resolution of and show the final output of our framework, without any post-processing. The examples include copy-pasting of facial features like nose, mouth, eyes, or glasses. Moreover, we show examples of replacing the entire face or hairstyle, using source content patches that are significantly larger than the patches used for training.

Figure 8: Qualitative comparison of our copy-paste framework to FaceShop [10] and Poisson image editing [11]. Each row shows a different example. Our framework produces high-quality results, whereas Poisson image editing suffers from both shading and geometric inconsistencies. FaceShop clearly features a higher level of abstraction, so its outcome is often less similar to the actual input features compared to our approach.

Next, we compare our copy-paste network to the copy-paste mode in FaceShop [10], a state-of-the-art technique for copy-pasting facial features. In addition, we compare against Poisson image editing [11]. Figure 8 shows our approach compared to the two baselines. In the first example, Poisson image editing does a good job on the right eye, but introduces subtle blending artifacts on the left eye due to the locally varying shading mismatch. FaceShop struggles with this example: the strong eye makeup produces clutter edges in the sketch domain, and the network fails to interpret them. Our approach manages to transfer both the eye and eyebrow geometry as well as the texture features. The second example shows rather successful results for all approaches; only Poisson image editing has difficulty resolving the geometric mismatch. Interestingly, FaceShop produces a rather different nose shape than the input, probably due to the highly abstract sketch domain. In the third example, FaceShop falls back to unconditional image completion, since the edges of the input nose are so faint that the sketch domain ends up void. The last example shows a complete failure case for Poisson image editing. Interestingly, FaceShop also produces a result that is quite different from the input and somewhat blurry. This may be a sign of overfitting, since FaceShop was not trained on such large masks.

4.3 Cityscapes Dataset

Figure 9: Results on the Cityscapes images dataset. For each example (column), we show a zoomed crop at the top right corner that shows a region that is particularly difficult due to background clutter or shading mismatches.

Figure 9 shows various results on the Cityscapes dataset. Since our copy-paste network is fully convolutional, we can apply it to arbitrarily sized images after training. Note that all results feature a resolution of up to two megapixels, even though the network was trained solely on much smaller crops. Pasted regions feature an extent of up to 700 pixels (first column), significantly larger than the crops that our network consumed during training, demonstrating the generalization capability to higher resolutions, given appropriate training data. The results demonstrate that our approach also works on a more complex dataset, where objects and features are much more diverse than in the case of face images. It must be emphasized that the overall synthesis quality is worse compared to the face images dataset. We attribute this to the fact that copy-pasting on this dataset is more difficult than on face images, mainly due to background clutter when providing inaccurate masks. Moreover, the dataset features both heavy motion blur and tonemapping artifacts, limiting the ability of the network to synthesize higher-quality results.

5 Conclusions

In this work we propose a novel smart copy-paste framework that enables the synthesis of high-quality copy-paste results. The key ingredient of our system is a deep convolutional neural network trained end-to-end on the task of copy-pasting. Our key contribution is a novel, carefully designed training data generation procedure that works on any image dataset without additional label information. We demonstrate the effectiveness of our system on two high-resolution datasets, outperforming the state of the art in face image copy-pasting on many examples. Moreover, we show the application beyond faces and process images featuring up to two megapixels resolution, copying content as big as 700 pixels into a target image. In future work, we will leverage higher quality image datasets to produce even better results with our approach.

References