Deep Single Image Manipulation

07/02/2020 ∙ by Yael Vinker, et al. ∙ 0

Image manipulation has attracted much research over the years due to the popularity and commercial importance of the task. In recent years, deep neural network methods have been proposed for many image manipulation tasks. A major issue with deep methods is the need to train on large amounts of data from the same distribution as the target image, whereas collecting datasets encompassing the entire long tail of images is impossible. In this paper, we demonstrate that simply training a conditional adversarial generator on the single target image is sufficient for performing complex image manipulations. We find that the key to enabling single image training is extensive augmentation of the input image, and we provide a novel augmentation method. Our network learns to map from a primitive representation of the image (e.g. edges) to the image itself. At manipulation time, our generator allows for making general image changes by modifying the primitive input representation and mapping it through the network. We extensively evaluate our method and find that it provides remarkable performance.




1 Introduction

Images capture a scene at a specific point in time. Viewers of images often wish the scene had been different, e.g. that objects were arranged differently. Due to the popularity of this task, it has been the focus of much research and of many companies and products, e.g. Instagram and Photoshop. Deep learning methods have significantly boosted the performance of image manipulation methods for which large training datasets can be obtained, e.g. super-resolution or facial inpainting. User-captured photographs, however, follow a long-tailed distribution. Some classes of photographs are very common, e.g. faces or cars. On the other hand, a large proportion of photographs capture a rare object class or configuration. Training deep learning methods that capture the entire long tail of images can be very hard, particularly for generative models, which are slow and tricky to train. Training models on just the target image is emerging as an alternative to training deep models on large image datasets. Although this is counter-intuitive, as deep learning methods typically require many training samples, single-image methods have recently demonstrated some promising results.

In this paper, we introduce a novel method for training deep conditional generative models from a single image. The training image is first represented with a primitive representation, which can be unsupervised (an edge map, unsupervised segmentation), supervised (segmentation map, landmarks) or a combination of both. We use a standard adversarial conditional image mapping network to learn to map between the primitive representation and the image. In order to extend the training set (which simply consists of a single image), we perform extensive augmentations. The choice of augmentation method makes a significant difference to the method’s performance. We introduce a novel augmentation method based on thin-plate-spline (TPS) and show that it is key to the success of our method. After training, we are able to perform challenging image manipulation tasks by modifying the primitive representation. Our method is evaluated extensively and displays remarkable results.

Our contributions in this paper:

  1. A general purpose approach for training conditional generators from a single image.

  2. A TPS-based augmentation for conditional image generation, and a demonstration of its importance for single image training.

  3. A novel primitive representation allowing concurrent low and high-level image editing.

  4. Extensive evaluations showing remarkable visual performance, and the introduction of a novel protocol enabling quantitative evaluation.

Figure 1: Results produced by our model. The model was trained on a single training pair (first and second columns). The third column shows the inputs to the trained model at inference time. First row: (left) lifting the nose, (right) flipping the eyebrows. Second row: (left) adding a wheel, (right) conversion to a sports car. Third row: modifying the shape of the starfish.

2 Related Work

Classical image manipulation methods: Image manipulation has attracted research for decades from the image processing, computational photography and graphics communities. It would not be possible to survey the scope of this corpus of work in this paper. We refer the reader to the book by Szeliski (2010) for an extensive survey, and to the Photoshop software for a practical collection of image processing methods. A few notable image manipulation techniques include Poisson Image Editing Pérez et al. (2003), Seam Carving Avidan and Shamir (2007) and ShiftMap Pritch et al. (2007). Learning a high-resolution parametric function between a primitive image representation and a photo-realistic image was very challenging for pre-deep learning methods.

Deep conditional generative models: Image-to-image translation is the task of mapping an image from a source domain to a target domain, while preserving the semantic and geometric content of the input image. Over the last decade, with the advent of deep neural network models and increasing dataset sizes, significant progress was made in this field. Currently, the most popular methods for training image-to-image translation models use Generative Adversarial Networks (GANs) Goodfellow et al. (2014), which are used in two main scenarios: i) unsupervised image translation between domains Zhu et al. (2017a), Kim et al. (2017), Liu et al. (2017), Choi et al. (2018); ii) serving as a perceptual image loss function Isola et al. (2017), Wang et al. (2017), Zhu et al. (2017b). Existing state-of-the-art methods for image-to-image translation require a significant number of labeled image pairs.

Single image generators: Limited work has been done on training image generators from a single image due to the difficulty of the task. Notable examples include Deep Image Prior Ulyanov et al. (2018), retargeting Shocher et al. (2018a) and super-resolution Shocher et al. (2018b) from a single image. Deep image prior is mainly applicable to image restoration tasks rather than novel image manipulations, and also requires training a deep network for every new manipulation. The seminal work of Shaham et al. (2019), SinGAN, presented a more general approach for single image generative model training, which can be used for unconditional and some conditional image generation tasks. SinGAN is typically more successful on texture images than on larger objects.

3 Learning Conditional Generators from a Single Image

We propose a conditional generative adversarial network (cGAN) for learning to map from a primitive image representation (e.g. edges, segmentation) to the image. The approach has several objectives: i) single image training; ii) fidelity: the output should reflect the primitive image representation; iii) appearance: the output image should appear to come from the same distribution as the training image. In this section, we present a novel augmentation method that allows standard cGAN architectures to achieve these objectives. The particular type of augmentation used is critical, as it provides a prior over the data. Having a good prior is important in this case, as the training set consists of merely a single image.

3.1 Model:

Our model design follows standard practice for cGAN models (particularly pix2pixHD Wang et al. (2017)). Let us denote our training image pair $(x, y)$, where $x$ is the input image and $y$ is the corresponding image primitive. We learn a mapping network $G$, which takes in the input image primitive $y$ and outputs the estimated image $G(y)$. We use the VGG perceptual loss $\ell_{perc}$ Johnson et al. (2016), which extracts features from the predicted and actual images and computes the difference between them:

$$\ell_{rec}(G; x, y) = \ell_{perc}(G(y), x)$$

Conditional GAN loss: Following standard practice, we add an adversarial loss which measures the ability of a discriminator to differentiate between the (input, generated image) pair $(y, G(y))$ and the (input, true image) pair $(y, x)$. The discriminator $D$ is implemented using a deep classifier and is trained adversarially to the mapping network. The adversarial objective, which the discriminator maximizes and the mapping network minimizes, is:

$$\ell_{adv}(G, D; x, y) = \log D(y, x) + \log\left(1 - D(y, G(y))\right)$$

The combined loss of the mapping network is the sum of the reconstruction and adversarial mapping loss, weighted by a constant $\alpha$:

$$\ell_{total}(G, D; x, y) = \ell_{rec}(G; x, y) + \alpha \cdot \ell_{adv}(G, D; x, y)$$
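As a rough illustration of how the two terms are combined, the following sketch computes a reconstruction term and an adversarial term from toy numbers. It makes several assumptions that are not stated in the paper: the perceptual distance is taken to be a mean absolute difference over lists of feature values (a stand-in for real VGG activations), the adversarial term uses the standard cGAN form log D(y, x) + log(1 - D(y, G(y))), and the names `perceptual_distance`, `adversarial_term`, `total_loss` and the default `alpha` are hypothetical.

```python
import math


def perceptual_distance(feats_pred, feats_true):
    """Sum over layers of the mean absolute feature difference.

    Each argument is a list of layers; each layer is a flat list of floats
    standing in for VGG activations (assumption for illustration).
    """
    total = 0.0
    for layer_pred, layer_true in zip(feats_pred, feats_true):
        diffs = [abs(p - t) for p, t in zip(layer_pred, layer_true)]
        total += sum(diffs) / len(diffs)
    return total


def adversarial_term(d_real, d_fake, eps=1e-12):
    """Standard cGAN objective from discriminator probabilities.

    d_real: D's probability that the (primitive, true image) pair is real.
    d_fake: D's probability that the (primitive, generated image) pair is real.
    """
    return math.log(d_real + eps) + math.log(1.0 - d_fake + eps)


def total_loss(feats_pred, feats_true, d_real, d_fake, alpha=1.0):
    """Reconstruction term plus alpha times the adversarial term."""
    rec = perceptual_distance(feats_pred, feats_true)
    return rec + alpha * adversarial_term(d_real, d_fake)
```

In a real training loop the two networks would be updated in alternation, with the discriminator ascending the adversarial term and the mapping network descending the combined loss.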


3.2 Augmentations:

Although the formulation above works well for image translation when trained on large datasets, it overfits on a single image. As a consequence, the model does not generalize to new primitive inputs. In order to generalize to new primitive images, the size of the training dataset needs to be artificially increased so as to cover the range of expected primitives (please see Sec. 5 for more analysis). Conditional generative models typically use simple crop and shift augmentations. This simple augmentation strategy, however, will not generalize to primitive images with non-trivial changes (even simple rotations). Instead, we propose a novel augmentation strategy. We model the image as a grid and shift each grid point by a uniformly distributed random distance. This forms the shifted grid $t$. We use a thin-plate-spline (TPS) to smooth the transformation into a more realistic warp $f$. The TPS optimization problem is given by:

$$\min_f \sum_{i,j} \left\| t_{i,j} - f(i,j) \right\|^2 + \lambda \iint \left( f_{xx}^2 + 2 f_{xy}^2 + f_{yy}^2 \right) dx\, dy$$

where $i, j$ are grid-cell locations, $f_{xx}, f_{xy}, f_{yy}$ denote the second order partial derivatives of $f$, and $\lambda$ is a regularization constant determining the smoothness of the warp. This optimization can be performed very efficiently, e.g. Donato and Belongie (2002). The resulting transformation is then used to transform the original image and its primitive for a particular training iteration. A different, randomly sampled TPS warp is used for every training iteration. We additionally use random rotations to further increase the range of image primitives seen in training.
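The augmentation described above can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses the classical closed-form TPS solution with kernel U(r) = r^2 log r, control points on a regular grid in normalized [0, 1] coordinates, and hypothetical names and defaults (`fit_tps`, `apply_tps`, `random_tps`, `grid_size`, `max_shift`). The parameter `lam` plays the role of the regularization constant only up to a factor that depends on the kernel normalization. Resampling the image and its primitive at the warped coordinates is omitted.

```python
import numpy as np


def tps_kernel(r2):
    """TPS radial basis U(r) = r^2 log r, given squared distances r2 (U(0) = 0)."""
    out = np.zeros_like(r2)
    nz = r2 > 0
    out[nz] = 0.5 * r2[nz] * np.log(r2[nz])
    return out


def fit_tps(src, dst, lam=0.0):
    """Fit a 2-D TPS sending control points src (n, 2) towards dst (n, 2).

    lam = 0 interpolates the control points exactly; larger lam gives a
    smoother warp that only approximates them.
    """
    n = src.shape[0]
    d2 = ((src[:, None, :] - src[None, :, :]) ** 2).sum(-1)
    K = tps_kernel(d2) + lam * np.eye(n)
    P = np.hstack([np.ones((n, 1)), src])   # affine part: 1, x, y
    A = np.zeros((n + 3, n + 3))
    A[:n, :n] = K
    A[:n, n:] = P
    A[n:, :n] = P.T
    b = np.zeros((n + 3, 2))
    b[:n] = dst
    params = np.linalg.solve(A, b)
    return params[:n], params[n:]           # kernel weights, affine coefficients


def apply_tps(points, src, weights, affine):
    """Evaluate the fitted warp at arbitrary points (m, 2)."""
    d2 = ((points[:, None, :] - src[None, :, :]) ** 2).sum(-1)
    P = np.hstack([np.ones((points.shape[0], 1)), points])
    return tps_kernel(d2) @ weights + P @ affine


def random_tps(grid_size=3, max_shift=0.1, lam=0.0, rng=None):
    """Shift each grid point by a uniform random offset and fit a TPS to it."""
    rng = np.random.default_rng() if rng is None else rng
    lin = np.linspace(0.0, 1.0, grid_size)
    gx, gy = np.meshgrid(lin, lin)
    src = np.stack([gx.ravel(), gy.ravel()], axis=1)
    dst = src + rng.uniform(-max_shift, max_shift, size=src.shape)
    weights, affine = fit_tps(src, dst, lam)
    return src, dst, weights, affine
```

To warp an image pair, one would evaluate `apply_tps` on the normalized pixel coordinates and resample both the image and its primitive at the resulting locations, drawing a fresh `random_tps` for every training iteration.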

3.3 Primitive images:

In order to be able to edit the image, we condition our generator on a representation of the image that we call the image primitive. An image primitive must satisfy two requirements: it must specify the required output image precisely, and it must be easy for a human editor to manipulate. These two objectives are in conflict. Although the most precise representation of the edited image is the edited image itself, this level of manipulation is impossible for a human editor to achieve; in fact, simplifying this representation is the very motivation for this work. Two standard image primitives used by previous conditional generators are the edge representation of the image and the semantic instance/segmentation map of the image. Segmentation maps are easier to manipulate and provide information on the high-level properties of the image, but give less guidance on the fine details. Edge maps provide the opposite trade-off. To achieve the best of both worlds, we propose a novel primitive which combines the edge and segmentation maps (we dub this representation the "super primitive"). The advantages of this primitive representation are shown in Sec. 5.
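One simple way to build such a combined primitive is to draw the edge pixels on top of a color-coded segmentation map. The sketch below assumes an RGB segmentation image and a binary edge mask of the same height and width; the function name `make_super_primitive` and the choice of black as the edge color are illustrative assumptions, not details given in the paper.

```python
import numpy as np


def make_super_primitive(seg_rgb, edge_mask, edge_color=(0, 0, 0)):
    """Overlay a binary edge mask on a color-coded segmentation map.

    seg_rgb:   (H, W, 3) uint8 array, one color per segment.
    edge_mask: (H, W) array, nonzero where an edge is present.
    Returns a new (H, W, 3) array; the inputs are left unmodified.
    """
    if seg_rgb.shape[:2] != edge_mask.shape:
        raise ValueError("segmentation map and edge mask must have the same size")
    sp = seg_rgb.copy()
    sp[edge_mask.astype(bool)] = edge_color  # edge pixels override segment colors
    return sp
```

Editing then amounts to moving segment regions for high-level changes and redrawing edge strokes for fine details, on the same canvas.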

Figure 2: Results on three different image primitives (the leftmost column shows the source image; each subsequent column demonstrates the result of our model when trained on the specified primitive). We manipulated the image primitives by adding a right eye, changing the point of view and shortening the beak. Our results are presented next to each manipulated primitive. Our novel SP primitive performed best on both high-level changes (e.g. the eye) and low-level changes (e.g. the background).
Figure 3: Results of our SP on challenging image manipulation tasks. (left) The edge-image pair used to train our method. (center) Switching the positions of the two rightmost cars. (right) Removing the leftmost car and inpainting the background. In both cases our method was able to synthesize very attractive output images.

3.4 Implementation details:

We implemented the conditional GAN using the pix2pixHD architecture. We kept the same hyperparameters as in the official repository, except for the number of iterations. For each iteration, we randomly sample a new TPS warp and transform both the input primitive and the output image. We also augment both images with a random rotation. We then proceed to train the network in the usual way.
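The important detail in this step is that the primitive and the image receive the same random transformation, so that they stay pixel-aligned. The sketch below illustrates this for the random rotation only; the TPS warp would be applied in the same paired fashion. The nearest-neighbor resampling, the `max_angle` default and the function names are assumptions made for the example, not settings taken from the paper or the pix2pixHD repository.

```python
import numpy as np


def rotate_nearest(img, angle_deg):
    """Rotate an (H, W) or (H, W, C) array about its center.

    Uses inverse mapping with nearest-neighbor sampling, which keeps label
    values in segmentation-style primitives intact. Pixels that map outside
    the source image are set to zero.
    """
    h, w = img.shape[:2]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    theta = np.deg2rad(angle_deg)
    ys, xs = np.mgrid[0:h, 0:w]
    # For every output pixel, find the source pixel it is copied from.
    src_x = np.cos(theta) * (xs - cx) + np.sin(theta) * (ys - cy) + cx
    src_y = -np.sin(theta) * (xs - cx) + np.cos(theta) * (ys - cy) + cy
    sx = np.rint(src_x).astype(int)
    sy = np.rint(src_y).astype(int)
    valid = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out = np.zeros_like(img)
    out[valid] = img[sy[valid], sx[valid]]
    return out


def augment_pair(primitive, image, rng, max_angle=30.0):
    """Apply one shared random rotation to the primitive and the image."""
    angle = rng.uniform(-max_angle, max_angle)
    return rotate_nearest(primitive, angle), rotate_nearest(image, angle)
```

Calling `augment_pair` once per training iteration with a persistent `np.random.default_rng()` yields a fresh, aligned pair each time.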

Figure 4: Edges-to-image results. The first two columns show the edges and images used for training. The third column shows the edges used as input at inference time. We can see that pix2pix cannot generate the correct shoe as it has no style information. BicycleGAN has style guidance but cannot reproduce the correct details. Our approach generates images of high quality and fidelity.
Figure 5: Paint-to-image results. The two leftmost columns show the input paint, which was created manually, and the input image. The third column shows the modified paint image used as input to the trained models. The SinGAN result was generated using the authors' best practice. Our method generates a novel tree corresponding to the segmentation map with high fidelity.

4 Experiments

In this section, we evaluate our method both qualitatively and quantitatively.

Figure 6: Several sample results from the Cityscapes dataset. We train each model on the segmentation-image pair on the left. We then use the models to predict the image, given the segmentation maps (second column from left). Our method is shown to perform very well on this task, generating novel configurations of people not seen in the training image.
Metric   pix2pixHD   Ours
LPIPS    0.342       0.216
SIFID    0.292       0.127
Table 1: Results for the Cityscapes dataset. We report the average over the 16 videos.

4.1 Qualitative evaluation

Example results by our method are displayed in Fig. 1. Our method is able to generate very high resolution images from single image training. In the top row we are able to perform fine changes to the facial images from edge primitives e.g. raising the nose and flipping the eyebrows. In the second row, we show complex shape transformations by using segmentation primitives. Our method was able to add a third wheel to the car and convert its shape into a sports car. This shows the power of the segmentation primitive, enabling major changes to the shape using simple operations. On the bottom row, we show that our method can perform free-form changes, completely changing the shape of an image while preserving fine texture.

In Fig. 4, we present edges-to-image translation results on two different shoes. We can see that pix2pixHD trained on the entire edges2shoes dataset is unable to capture the correct identity of the shoes, as there are multiple possibilities for the appearance of the shoe given the edge image. BicycleGAN is able to take as input both the edge map and guidance for the appearance (style) of the required shoe. Although it is able to capture the general colors of the required shoe, it is unable to capture the fine details of the shoes (e.g. shoe laces and buckles). We believe that this loss of information is a general disadvantage of training on large datasets: a general mapping function becomes less specialized and therefore less accurate on individual images.

In Fig. 5, we present results on a paint-to-image task. Our method was trained to map from a rough paint image to an image of a tree, while SinGAN was trained using the authors’ best practice. We can see that SinGAN outputs an image which is more similar to the paint than a photorealistic image. Our method is able to change the shape of the tree to correspond to the paint while keeping the appearance of the tree and background as in the training image.

4.2 Quantitative evaluation

We evaluate our method quantitatively using reference-based and reference-free metrics. As single image generators have mostly operated on unconditional generation, there are currently no established datasets and metrics for conducting such an evaluation.

We propose a novel evaluation procedure for conditional single image evaluation. Our method utilizes video to provide ground truth for computing the metrics. We introduce a new evaluation dataset spanning different scenes and primitives. A single frame from each video is designated for training, where the network is trained to map the primitive image to the designated training frame. The trained network is then used to map from primitive to image for all the other frames in the video. The difference between the generated and the ground-truth frames is computed using LPIPS Zhang et al. (2018), a standard high-quality deep perceptual metric. We use all video segments from the Cityscapes dataset Cordts et al. (2016) provided by the code in vid2vid Wang et al. (2018) and Few-shot-vid2vid Wang et al. (2019). These sequences are labelled aachen-000000 to aachen-000015 leftImg8bit. For each sequence, we train on frame 000000 and test using frames 000001 to 000029. We use the segmentation maps provided as image primitives. We also use the first five videos in the public release of the Oxford-BBC Lip Reading Sentences 2 (LRS2) dataset, which contains videos of different people speaking. We extract their edges using a Canny edge detector Canny (1986). Our evaluation set thus contains both Cityscapes and LRS2 frames.
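The Cityscapes part of this protocol can be written down as a small sketch: build the train/test split per sequence, then average a per-frame score. The sequence naming and frame ranges follow the text above; `metric_fn` is a hypothetical callback that would generate a frame from its primitive with the model trained on that sequence and compare it to the ground truth (e.g. with LPIPS), and the function names are illustrative.

```python
from statistics import mean


def cityscapes_split(num_sequences=16, frames_per_sequence=30):
    """Map each sequence id to its training frame and its test frames."""
    split = {}
    for s in range(num_sequences):
        seq_id = f"aachen-{s:06d}"
        split[seq_id] = {
            "train": 0,                                    # frame 000000
            "test": list(range(1, frames_per_sequence)),   # frames 000001..000029
        }
    return split


def evaluate(split, metric_fn):
    """Average metric_fn(seq_id, frame_idx) per sequence, then over sequences.

    metric_fn is a hypothetical user-supplied function returning a float
    score for one generated test frame.
    """
    per_sequence = {
        seq_id: mean([metric_fn(seq_id, f) for f in frames["test"]])
        for seq_id, frames in split.items()
    }
    return mean(list(per_sequence.values())), per_sequence
```

With these defaults the split contains 16 training frames and 16 x 29 test frames; averaging per sequence first and then across sequences is one reasonable choice when all sequences have the same length.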

A visual evaluation of our method on a few frames from the Cityscapes dataset can be seen in Fig. 6. We compare our method to the results of a pix2pixHD model that was trained on a single image without using the TPS augmentation (it still uses the standard augmentations used by generative models, i.e. random cropping and horizontal flips). We can observe that our method is able to synthesize very different scene setups from those seen in training, including different numbers and positions of people. Our method performs significantly better in terms of fidelity and quality than single-image pix2pixHD, indicating that our proposed TPS augmentation is critical for single image conditional generation.

Metric   Method      Seq1   Seq2   Seq3   Seq4   Seq5
LPIPS    pix2pixHD   0.44   0.47   0.41   0.53   0.46
LPIPS    Ours        0.12   0.21   0.1    0.22   0.14
SIFID    pix2pixHD   0.51   0.49   0.5    0.26   0.44
SIFID    Ours        0.07   0.12   0.04   0.12   0.06
Table 2: Results of pix2pixHD and our method on LRS2 videos (both trained on a single image)

A quantitative evaluation of the Cityscapes dataset is provided in Tab. 1 and of the LRS2 dataset is provided in Tab. 2. We present LPIPS and also SIFID, a reference-free method that measures the perceptual similarity of the distribution of patches between the source and target images. Results are presented for each of the LRS2 sequences and also for the average of Cityscapes videos (for space reasons). We can observe that our method significantly outperforms single-image pix2pixHD in all comparisons.
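For a single-number comparison on LRS2 analogous to the Cityscapes averages, the per-sequence values of Tab. 2 can be averaged. The snippet below only performs that arithmetic on the numbers printed in the table; the averages are derived here for convenience and are not reported in the paper.

```python
# Per-sequence scores copied from Tab. 2 (Seq1..Seq5); lower is better.
table2 = {
    "LPIPS": {
        "pix2pixHD": [0.44, 0.47, 0.41, 0.53, 0.46],
        "Ours": [0.12, 0.21, 0.1, 0.22, 0.14],
    },
    "SIFID": {
        "pix2pixHD": [0.51, 0.49, 0.5, 0.26, 0.44],
        "Ours": [0.07, 0.12, 0.04, 0.12, 0.06],
    },
}

# Mean over the five sequences for every (metric, method) combination.
averages = {
    metric: {method: sum(vals) / len(vals) for method, vals in methods.items()}
    for metric, methods in table2.items()
}

for metric, methods in averages.items():
    for method, value in methods.items():
        print(f"{metric:6s} {method:10s} {value:.3f}")
```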

Figure 7: An analysis of the benefits of TPS. We show the kNN distance between patches in the test and train frames with and without TPS augmentations (top-right). We can see that TPS augmentation decreases the kNN distance; in some image regions the decrease is drastic, suggesting that the patches there can be obtained by deformations of training patches. The kNN-TPS distance appears to be correlated with the regions where the prediction error of our method is large. This analysis suggests that by artificially increasing the diversity of patches, single-image methods can generalize better to novel images.

Figure 8: Evaluation of the ability of our network to interpolate across empty space regions. The two leftmost columns show the training image pair. We gradually increase the distance between the eyes and nose of the cat and feed the test images to the network; the corresponding output of each test image is shown in the second row. Our method generates attractive interpolations for moderate changes, while the performance deteriorates for larger interpolations.

5 Analysing the Generalization of Our Method

In this section, we analyse the remarkable results of our method.

TPS improves generalization: Let us consider the train and test edge-image pairs presented in Fig. 7. We passed each edge map through an ImageNet-trained ResNet50 network and computed the activations at the end of the third residual block. For each pixel in the activation grid of the test image, we computed the nearest-neighbor (1NN) distance to the most similar activation of the train image. We then applied 50 TPS augmentations to the training image, and repeated the 1NN computation with the training set now containing the activations of the original training image and its 50 augmentations. Let us compare the 1NN distances presented in Fig. 7 with and without TPS augmentations. Naturally, the 1NN distance decreased for the TPS-augmented training set due to its larger size. More interestingly, we can see that several face regions which prior to the augmentations did not have similar patches in the input now have a much lower distance (although more significant changes might not be describable by TPS). In Fig. 7, we present the results of our method when trained on the training edge-image pair (shown in the leftmost column) and evaluated on the test edge. We can see that the prediction error (the difference between ResNet50 activations of the predicted and the true test image) appears to be strongly related to the 1NN distance with TPS augmentations. This gives some evidence for the hypothesis that the network recalls input-output pairs seen in training. It also gives an explanation for the effectiveness of TPS training, namely that it increases the range of input-output pairs and thus helps generalization to novel images.
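The 1NN computation in this analysis can be sketched as follows. The feature arrays are assumed to be activation grids flattened to shape (number of locations, channels); extracting them from ResNet50 is omitted, and the function name `nn_distance` and the use of Euclidean distance are assumptions for illustration.

```python
import numpy as np


def nn_distance(test_feats, train_feats):
    """Distance from each test feature to its nearest training feature.

    test_feats:  (m, c) array, one row per activation-grid location.
    train_feats: (n, c) array pooled over the training image and, if
                 desired, its augmented copies.
    Returns an (m,) array of Euclidean 1NN distances.
    """
    diff = test_feats[:, None, :] - train_feats[None, :, :]
    return np.linalg.norm(diff, axis=-1).min(axis=1)
```

Because the augmented training set contains the original activations as a subset, the 1NN distance can only stay the same or decrease when augmentations are added, which matches the observation above; the informative part is where in the image the decrease is large.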

The significance of image primitives: The choice of input primitive images is important for the performance of our method. The standard primitives typically used by conditional image generators are edge and segmentation maps. Segmentation maps capture high-level aspects of the image, while edge maps better capture its low-level details. To obtain the best of both worlds, we proposed a new primitive which we denote the "super primitive" (SP) representation. SP contains the edge map overlaid on the segmentation map. It provides both low- and high-level details of the scene. Fig. 2 evaluates the settings where the target bird is represented using edges, segmentation and SP. We can observe that the edge representation is unable to capture the eye, presumably as it cannot capture its semantic meaning. The segmentation is unable to capture the details in the new background regions, creating a smearing effect. SP is able to capture the eye as well as the low-level textures of the background region, showing its strong representational power. In Fig. 3 we present more manipulation results using SP. In the center column, we present image reorganization results, where the positions of the rightmost cars were switched. As the objects were not of the same size, some empty image regions were filled using small changes to the edges. A more extreme result can be seen in the rightmost column, where the car on the left was removed. This created a large empty image region. By filling in the missing details using edges, our method was able to successfully complete the background. We believe this result illustrates the power of our method and of the new primitive image representation.

Interpolating empty regions: In Fig. 8 we stress test our method's ability to handle regions with little guidance. In this example, the nose of the cat was shifted progressively downwards, forcing the network to interpolate the missing space. We observe that the network generates attractive synthetic images for moderate empty regions. However, as the empty region gets larger, the network looks for similar regions to fill the newly created void. These regions will often be areas which exhibit low amounts of detail in the primitive representation. In our case, we can notice that for larger shifts the empty space becomes greener, until eventually the network inpaints a background patch. We conclude that at a certain point, the network fails to learn the spatial relationship among objects in the image (i.e. that the background cannot be placed on the cat's face) and satisfies the given constraint using neighboring information (as was analysed above).

6 Conclusion

We proposed a novel method for training conditional generators from a single training image based on thin-plate-spline augmentations. We demonstrated that our method is able to perform complex image manipulation at high resolution. Single image methods have significant potential: they preserve fine image details to a level not typically achieved by previous methods trained on large datasets. One limitation of single-image methods (including ours) is the requirement for training a separate network for every image, which can be expensive over a large dataset. Speeding up training of single-image generators is an important and promising direction for future work.


  • [1] S. Avidan and A. Shamir (2007) Seam carving for content-aware image resizing. In SIGGRAPH.
  • [2] J. Canny (1986) A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • [3] Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo (2018) StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. In CVPR.
  • [4] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223.
  • [5] G. Donato and S. Belongie (2002) Approximate thin plate spline mappings. In European Conference on Computer Vision, pp. 21–31.
  • [6] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In NIPS, pp. 2672–2680.
  • [7] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In CVPR.
  • [8] J. Johnson, A. Alahi, and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. In ECCV.
  • [9] T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim (2017) Learning to discover cross-domain relations with generative adversarial networks. In ICML.
  • [10] M. Liu, T. Breuel, and J. Kautz (2017) Unsupervised image-to-image translation networks. In NIPS.
  • [11] P. Pérez, M. Gangnet, and A. Blake (2003) Poisson image editing. In SIGGRAPH.
  • [12] Y. Pritch, E. Kav-Venaki, and S. Peleg (2007) Shift-map image editing. In ICCV.
  • [13] T. R. Shaham, T. Dekel, and T. Michaeli (2019) SinGAN: learning a generative model from a single natural image. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4570–4580.
  • [14] A. Shocher, S. Bagon, P. Isola, and M. Irani (2018a) Internal distribution matching for natural image retargeting. CoRR abs/1812.00231.
  • [15] A. Shocher, N. Cohen, and M. Irani (2018b) "Zero-shot" super-resolution using deep internal learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3118–3126.
  • [16] R. Szeliski (2010) Computer vision: algorithms and applications. Springer Science & Business Media.
  • [17] D. Ulyanov, A. Vedaldi, and V. Lempitsky (2018) Deep image prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9446–9454.
  • [18] T. Wang, M. Liu, A. Tao, G. Liu, J. Kautz, and B. Catanzaro (2019) Few-shot video-to-video synthesis. In Advances in Neural Information Processing Systems (NeurIPS).
  • [19] T. Wang, M. Liu, J. Zhu, G. Liu, A. Tao, J. Kautz, and B. Catanzaro (2018) Video-to-video synthesis. In Advances in Neural Information Processing Systems (NeurIPS).
  • [20] T. Wang, M. Liu, J. Zhu, A. Tao, J. Kautz, and B. Catanzaro (2017) High-resolution image synthesis and semantic manipulation with conditional GANs. CoRR abs/1711.11585.
  • [21] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. arXiv preprint arXiv:1801.03924.
  • [22] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017a) Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV.
  • [23] J. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman (2017b) Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, pp. 465–476.