Image Shape Manipulation from a Single Augmented Training Sample

09/13/2021 ∙ by Yael Vinker, et al. ∙ Hebrew University of Jerusalem

In this paper, we present DeepSIM, a generative model for conditional image manipulation based on a single image. We find that extensive augmentation is key for enabling single image training, and incorporate the use of thin-plate-spline (TPS) as an effective augmentation. Our network learns to map between a primitive representation of the image to the image itself. The choice of a primitive representation has an impact on the ease and expressiveness of the manipulations and can be automatic (e.g. edges), manual (e.g. segmentation) or hybrid such as edges on top of segmentations. At manipulation time, our generator allows for making complex image changes by modifying the primitive input representation and mapping it through the network. Our method is shown to achieve remarkable performance on image manipulation tasks.




1 Introduction

*Equal contribution

Deep neural networks have significantly boosted performance on image manipulation tasks for which large training datasets can be obtained, such as mapping facial landmarks to facial images. In practice, however, there are many settings in which the image to be manipulated is unique, and a training set consisting of many similar input-output samples is unavailable. Moreover, in some cases using a large dataset might even lead to unwelcome outputs that do not preserve the specific characteristics of the desired image. Training generative models on just a single image is an exciting recent research direction, which may hold the potential to extend the scope of neural-network-based image manipulation methods to unique images. In this paper, we introduce DeepSIM, a simple-to-implement yet highly effective method for training deep conditional generative models from a single image pair. Our method is capable of solving various image manipulation tasks including: (i) shape warping (Fig. 2), (ii) object rearrangement (Fig. 5), (iii) object removal (Fig. 5), (iv) object addition (Fig. 2), and (v) creation of painted and photorealistic animated clips (Fig. 8 and videos on our project page).

Given a single target image, a primitive representation is first created for the training image. This can be unsupervised (e.g. edge map, unsupervised segmentation), supervised (e.g. segmentation map, sketch, drawing), or a combination of both. We use a standard conditional image mapping network to learn to map between the primitive representation and the image. Once training is complete, a user can explicitly design and choose the changes they want to apply to the target image by manipulating the simple primitive (serving as a simpler manipulation domain). The modified primitive is fed to the network, which transforms it into the real image domain with the desired manipulation. This process is illustrated in Fig. 1.

Several papers have explored the topic of what and how much can be learned from a single image. Two recent seminal works, SinGAN [30] and InGAN [31], propose to extend this beyond the scope of texture synthesis [6, 18, 23, 43]. SinGAN tackles the problem of single image manipulation in an unconditional manner, allowing unsupervised generation tasks. InGAN, on the other hand, proposes a conditional model for applying various geometric transformations to the image. Our paper extends this body of work by exploring the case of supervised image-to-image translation, allowing the modification of specific image details such as the shape or location of image parts. We find that the augmentation strategy is key for making DeepSIM work effectively. Breaking from the standard practice in the image translation community of using a simple crop-and-flip augmentation, we found that using a thin-plate-spline (TPS) [13] augmentation method is essential for training conditional generative models based on a single image-pair input. The success of TPS is due to its exploration of possible image manipulations, extending the training distribution to include the manipulated input. Our model successfully learns the internal statistics of the target image, allowing both professional and amateur designers to explore their ideas while preserving the semantic and geometric attributes of the target image and producing high fidelity results.

Our contributions in this paper:

  • A general purpose approach for training conditional generators supervised by merely a single image-pair.

  • Recognizing that image augmentation is key for this task, and demonstrating the remarkable performance of thin-plate-spline (TPS) augmentation, which was not previously used for single image manipulation.

  • Achieving outstanding visual performance on a range of image manipulation applications.

Figure 3: Fashion design examples. On the left is the training image pair, in the middle are the manipulated primitives and on the right are the manipulated outputs. Left to right: dress length, strapless, wrap around the neck.
Figure 4: Natural looking manipulations. Left: image primitives, the top is the training primitive while the bottom is the manipulated one. Middle: training images. Right: manipulated outputs - changing the orientation of the bird’s wing, changing the posture of the squirrel.
Figure 5: Results on challenging manipulations. Top right corners: primitive images. Left: original image used to train our model. Center: switching the positions of the two rightmost cars. Right: removing the leftmost car and inpainting the background. See the SM for many more results.

2 Related Work

Classical image manipulation methods: Image manipulation has attracted research for decades from the image processing, computational photography and graphics communities. It would not be possible to survey the scope of this corpus of work in this paper. We refer the reader to the book by [33] for an extensive survey, and to the Photoshop software for a practical collection of image processing methods. A few notable image manipulation techniques include: Poisson Image Editing [28], Seam Carving [3], PatchMatch [4], ShiftMap [29], and Image Analogies [16]. Spline based methods include: Field Morphing [5] and Image Warping by RDBF [1]. Learning a high-resolution parametric function between a primitive image representation and a photo-realistic image was very challenging for pre-deep learning methods.

Deep conditional generative models: Image-to-image translation maps images from a source domain to a target domain, while preserving the semantic and geometric content of the input images. Most image-to-image translation methods use Generative Adversarial Networks (GANs) [14] that are used in two main scenarios: i) unsupervised image translation between domains [45, 21, 25, 9], and ii) serving as a perceptual image loss function [17, 39, 27, 46]. Existing methods for image-to-image translation require many labeled image pairs. Several methods [8, 12, 44] are carefully designed for image manipulation, but they require large datasets which are mainly available for faces or interiors and cannot be applied to the long-tail of images.

Non-standard augmentations: Conditional generation models typically use crop and flip augmentations. Classification models also use chromatic and noise augmentation. Recently, methods have been devised for learning augmentations for classification tasks, e.g. AutoAugment [11]. [26] learned warping fields for augmenting classification networks. Thin-plate-spline transformations have been used in the medical domain, e.g. [34], but they are used for training on large datasets rather than a single sample. [42] learned augmentations for training segmentation networks from a single annotated 3D medical scan (using a technique similar to [20]); however, they require a large unlabeled dataset of similar scans, which is not available in our setting. TPS has also been used as a way of parametrizing warps for learning dense correspondences between images, e.g. [15] and [22].

Learning from a single image: Although most deep learning works use large datasets, seminal works showed that single image training is effective in some settings. [2] showed that a single image can be used to learn deep features. Limited work has been done on training image generators from a single image: Deep Image Prior [36], retargeting [31], and super-resolution [32]. Recently, the seminal work SinGAN [30] presented a general approach for training an unconditional generative model on a single image. However, its ability for conditional manipulation is very limited. TuiGAN [24], on the other hand, proposed a conditional unsupervised image-to-image method based on a single image pair. However, their method requires retraining the network for every new pair. Our method, in contrast, uses a single aligned image pair for training a single generator that can be used for multiple manipulations without retraining. It is able to effect significantly more elaborate changes to images, including to large objects in the scene.

3 DeepSIM: Learning Conditional Generators from a Single Image

Our method learns a conditional generative adversarial network (cGAN) using just a single image pair consisting of the main image and its primitive representation. To account for the limited training set, we augment the data by using thin-plate-spline (TPS) warps on the training pair. The proposed approach has several objectives: i) single image training ii) fidelity - the output should reflect the primitive representation iii) appearance - the output image should appear to come from the same distribution as the training image. We will next describe each component of our method:

3.1 Model:

Our model design follows standard practice for cGAN models (particularly Pix2PixHD [39]). Let us denote our training image pair $(x, y)$, where $x \in \mathbb{R}^{d_x \times d_y \times 3}$ is the input image ($d_x$ and $d_y$ are the number of rows and columns) and $y \in \mathbb{R}^{d_x \times d_y \times d_p}$ is the corresponding image primitive ($d_p$ is the number of channels in the image primitive). We learn a generator network $G$, which learns to map the input image primitive $y$ to the generated image $G(y)$. The fidelity of the result is measured using the VGG perceptual loss [19], $\ell_{perc}$, which compares the differences between two images using a set of activations extracted from each image using a VGG network pre-trained on the ImageNet dataset (we follow the implementation in [39]). We therefore write the reconstruction loss $\ell_{rec}$:

$$\ell_{rec}(x, y; G) = \ell_{perc}(G(y), x)$$

Conditional GAN loss: Following standard practice, we add an adversarial loss which measures the ability of a discriminator to differentiate between the (primitive, generated image) pair $(y, G(y))$ and the (primitive, true image) pair $(y, x)$. The conditional discriminator $D$ is implemented using a deep classifier which maps a pair of primitive and corresponding image into the probability of the two being a ground truth primitive-image pair. $D$ is trained adversarially against $G$. The loss of the discriminator ($\ell_{adv}$) is:

$$\ell_{adv}(x, y; G, D) = \log D(y, x) + \log\left(1 - D(y, G(y))\right)$$

The combined loss is the sum of the reconstruction and adversarial losses, weighted by a constant $\alpha$:

$$\ell_{total}(x, y; G, D) = \ell_{rec}(x, y; G) + \alpha \, \ell_{adv}(x, y; G, D)$$
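
The way the reconstruction and adversarial terms combine can be sketched numerically. The snippet below is a minimal illustration, assuming the standard conditional-GAN log-likelihood form for the adversarial term. The function names, the scalar `rec` that stands in for the VGG perceptual loss, and the default value of `alpha` are hypothetical and are not taken from the authors' implementation.

```python
import math


def adversarial_loss(d_real: float, d_fake: float, eps: float = 1e-12) -> float:
    """Adversarial term for one sample (assumed standard cGAN form).

    d_real: discriminator probability for the (primitive, true image) pair.
    d_fake: discriminator probability for the (primitive, generated image) pair.
    The discriminator tries to increase this value; the generator tries to decrease it.
    """
    return math.log(d_real + eps) + math.log(1.0 - d_fake + eps)


def total_loss(rec: float, d_real: float, d_fake: float, alpha: float = 1.0) -> float:
    """Combined loss: reconstruction term plus alpha times the adversarial term.

    rec is a placeholder scalar for the perceptual reconstruction loss, which
    would need a pre-trained VGG network and is not computed here.
    """
    return rec + alpha * adversarial_loss(d_real, d_fake)
```

A discriminator that cannot tell the two pairs apart outputs 0.5 for both, which gives an adversarial term of 2·log(0.5); a confident, correct discriminator pushes the term towards 0.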

Figure 6: TPS Visualisation. A random TPS warp of the primitive-image pair. Also see SM.
Figure 7: Results on three different image primitives. The leftmost column shows the source image, then each column demonstrates the result of our model when trained on the specified primitive. We manipulated the image primitives, adding a right eye, changing the point of view and shortening the beak. Our results are presented next to each manipulated primitive. The combined primitive performed best on both high-level changes (e.g. the eye) and low-level changes (e.g. the background).

3.2 Augmentations:

When large datasets exist, finding the generator $G$ and conditional discriminator $D$ that optimize the combined loss $\ell_{total}$ under the empirical data distribution can result in a strong generator $G$. However, as we only have a single image pair $(x, y)$, this formulation severely overfits, and the generator is unable to generalize to new primitive inputs. In order to generalize to new primitive images, the size of the training dataset needs to be artificially increased so as to cover the range of expected primitives. Conditional generative models typically use simple crop-and-flip augmentations. We will later show (Sec. 4) that this simple augmentation strategy does not generalize to primitive images with non-trivial changes.
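
For reference, the crop-and-flip baseline can be sketched as follows. This is a minimal NumPy illustration rather than the Pix2PixHD implementation; the function name and argument layout are assumptions. The key point is that the same crop window and the same flip decision are applied to both the image and its primitive, so the pair stays aligned.

```python
import numpy as np


def crop_and_flip(image, primitive, crop_h, crop_w, rng):
    """Apply one random crop and an optional horizontal flip to an aligned pair.

    image, primitive: arrays of shape (H, W, C_image) and (H, W, C_primitive).
    rng: a numpy.random.Generator, e.g. np.random.default_rng(0).
    """
    h, w = image.shape[:2]
    # Draw the top-left corner so that the crop window stays inside the image.
    top = int(rng.integers(0, h - crop_h + 1))
    left = int(rng.integers(0, w - crop_w + 1))
    img = image[top:top + crop_h, left:left + crop_w]
    prim = primitive[top:top + crop_h, left:left + crop_w]
    # Flip both arrays together with probability 0.5.
    if rng.random() < 0.5:
        img = img[:, ::-1]
        prim = prim[:, ::-1]
    return img, prim
```

Crops and flips only rearrange existing pixels rigidly, which is consistent with the observation that they do not cover primitives whose shapes have been edited.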

We incorporate the thin-plate-spline (TPS) as an additional augmentation in order to extend our single image dataset. For each TPS augmentation, an equispaced grid of control points $t_i$ is placed on the image; we then shift the control points by a random (uniformly distributed) number of pixels in the horizontal and vertical directions. This shift creates a non-smooth warp, which we denote by $t_i \mapsto t'_i$. To prevent the appearance of degenerate transformations in our training images, the shifting amount is restricted to at most a fixed fraction of the minimum between the image width and height. We calculate the smooth TPS interpolating function $f$ by minimizing:

$$\min_f \sum_i \left\| t'_i - f(t_i) \right\|^2 + \lambda \iint \left( f_{xx}^2 + 2 f_{xy}^2 + f_{yy}^2 \right) dx \, dy$$

where $f_{xx}$, $f_{xy}$, $f_{yy}$ denote the second order partial derivatives of $f$, which form the smoothness measure, regularised by $\lambda$. The optimization over the warp can be performed very efficiently, e.g. [13]. We denote the distribution of random TPS warps that can be generated using the above procedure as $\mathcal{F}$. The above is illustrated in Fig. 6.
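
The control-point procedure described above can be sketched with the textbook closed-form TPS solve. This is an illustrative NumPy version, not the authors' code and not the approximate method of [13]. All function names are hypothetical, and `grid` and `max_frac` are required arguments precisely because their values are not given here. The parameter `lam` plays the role of the regularisation weight, up to a constant factor that depends on how the kernel is normalised.

```python
import numpy as np


def tps_kernel(r2):
    """TPS radial kernel U(r) = r^2 log r, written in terms of r^2, with U(0) = 0."""
    out = np.zeros_like(r2)
    mask = r2 > 0
    out[mask] = 0.5 * r2[mask] * np.log(r2[mask])
    return out


def random_control_shift(height, width, grid, max_frac, rng):
    """Equispaced grid of control points plus uniformly shifted targets.

    Returns (src, dst), each of shape (grid * grid, 2) in (x, y) order. Shifts are
    bounded by max_frac * min(height, width) in each direction.
    """
    ys = np.linspace(0, height - 1, grid)
    xs = np.linspace(0, width - 1, grid)
    src = np.array([(x, y) for y in ys for x in xs], dtype=float)
    max_shift = max_frac * min(height, width)
    dst = src + rng.uniform(-max_shift, max_shift, size=src.shape)
    return src, dst


def fit_tps(src, dst, lam=0.0):
    """Solve the TPS linear system so that f(src_i) is close to dst_i.

    With lam = 0 the spline interpolates the targets exactly; larger lam gives a
    smoother warp. Returns (w, a): kernel weights (n, 2) and affine part (3, 2).
    """
    n = src.shape[0]
    d2 = ((src[:, None, :] - src[None, :, :]) ** 2).sum(-1)
    K = tps_kernel(d2) + lam * np.eye(n)
    P = np.hstack([np.ones((n, 1)), src])
    A = np.zeros((n + 3, n + 3))
    A[:n, :n] = K
    A[:n, n:] = P
    A[n:, :n] = P.T
    b = np.zeros((n + 3, 2))
    b[:n] = dst
    sol = np.linalg.solve(A, b)
    return sol[:n], sol[n:]


def apply_tps(points, src, w, a):
    """Evaluate the fitted warp f at arbitrary (x, y) points of shape (m, 2)."""
    d2 = ((points[:, None, :] - src[None, :, :]) ** 2).sum(-1)
    P = np.hstack([np.ones((points.shape[0], 1)), points])
    return tps_kernel(d2) @ w + P @ a
```

The sketch only computes the point mapping. Warping an actual image additionally requires evaluating the map on the pixel grid and resampling, with the same map applied to the image and to its primitive.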

3.3 Optimization:

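During training, we sample random TPS warps. Each random warp $f$ transforms both the input primitive $y$ and the image $x$ to create a new training pair $(f(x), f(y))$, where $f(x)$ and $f(y)$ denote the image and the primitive resampled under the warp $f$. We optimize the generator $G$ and discriminator $D$ adversarially to minimize the expectation of the combined loss $\ell_{total}$ under the empirical distribution of random TPS warps:

$$G^* = \arg\min_G \max_D \; \mathbb{E}_{f} \left[ \ell_{total}(f(x), f(y); G, D) \right]$$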
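
The requirement that one warp is applied to both members of the pair can be sketched as below. This is a hedged illustration: `warp_pair`, `sample_training_pairs` and `random_translation` are hypothetical names, the warp is represented as a backward coordinate map, and a random translation stands in for a sampled TPS warp. Nearest-neighbour sampling is an assumption chosen so that label-valued primitives such as segmentation maps are not blended.

```python
import numpy as np


def warp_pair(image, primitive, coord_map):
    """Resample image and primitive with the SAME backward coordinate map.

    coord_map takes an (N, 2) array of output (x, y) pixel coordinates and returns
    the (N, 2) source coordinates to sample from. Sampling is nearest-neighbour
    and coordinates are clamped to the image borders.
    """
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    out_xy = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    src_xy = coord_map(out_xy)
    sx = np.clip(np.rint(src_xy[:, 0]), 0, w - 1).astype(int)
    sy = np.clip(np.rint(src_xy[:, 1]), 0, h - 1).astype(int)
    warped_image = image[sy, sx].reshape(image.shape)
    warped_primitive = primitive[sy, sx].reshape(primitive.shape)
    return warped_image, warped_primitive


def random_translation(rng, max_shift=3.0):
    """Stand-in warp sampler (NOT a TPS): a random global translation."""
    offset = rng.uniform(-max_shift, max_shift, size=2)
    return lambda xy: xy + offset


def sample_training_pairs(image, primitive, sample_warp, n_pairs, rng):
    """Yield n_pairs augmented (image, primitive) pairs, one fresh warp per pair."""
    for _ in range(n_pairs):
        yield warp_pair(image, primitive, sample_warp(rng))
```

In a training loop, each yielded pair would be passed to the generator and discriminator update in place of the original pair; a TPS sampler would replace `random_translation`.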


We used the Pix2PixHD architecture with the official hyperparameters (except using


3.4 Primitive images:

To edit the image, we condition our generator on a representation of the image that we denote the image primitive. The required properties of the image primitive are the ability to precisely specify the required output image and ease of manipulation by an image editor. These two objectives are in conflict: although the most precise representation of the edited image is the edited image itself, this level of manipulation is very challenging for a human editor to achieve. In fact, simplifying this representation is the very motivation for this work. Two standard image primitives used by previous conditional generators are the edge representation of the image and the semantic instance/segmentation map of the image. Segmentation maps provide information on the high-level properties of the image, but give less guidance on the fine details. Edge maps provide the opposite trade-off. To achieve the best of both worlds, we use the combination of the two primitive representations. The advantages of the combined representation are shown in Sec. 5. Our editing procedure is illustrated in the SM.
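
One plausible way to build such a combined primitive is to draw the edge map on top of a colour-coded segmentation map. The sketch below assumes this overlay form; both function names are hypothetical, and the gradient-threshold edge detector is a deliberately simple stand-in rather than the edge detector used by the authors.

```python
import numpy as np


def gradient_edges(gray, thresh):
    """Stand-in edge detector: threshold the finite-difference gradient magnitude."""
    gy, gx = np.gradient(gray.astype(float))
    return np.hypot(gx, gy) > thresh


def combine_primitives(seg_rgb, edges, edge_color=(0, 0, 0)):
    """Overlay a binary edge map (H, W) on a colour-coded segmentation (H, W, 3).

    Returns a new array; the input segmentation map is left unchanged.
    """
    combined = seg_rgb.copy()
    combined[edges.astype(bool)] = edge_color
    return combined
```

With this representation, an editor can move or reshape segments for high-level changes and add or remove edge strokes for fine details, and both kinds of edit reach the generator through a single input.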

Figure 8: Single Image Animation. Top: translating an animation into a video clip. Bottom: translating a video clip into a painted animation. Left: single training pairs. Middle: subsequent frames. Right: generated outputs. The video clips are available on our project page.
Figure 9: Image manipulation comparison. Columns, left to right: Training Pair, Input, SinGAN, TuiGAN, Ours. The leftmost column shows the training pair consisting of a painted image that was created manually and a target image. The manipulated image is given as input. We can see that SinGAN preserves some details while failing to capture the shape. TuiGAN, on the other hand, correctly captures the shape but does not preserve the details of the image. Our method is able to capture both the shape and the details of the manipulation with high fidelity.
Figure 10: Edges-to-image comparison. Columns, left to right: Training Image Pair, Input, Pix2PixHD, BicycleGAN, Ours. The Training Image Pair columns show the training edges and images. The Input column shows the edges used as input at inference time. Pix2PixHD-MI cannot generate the correct shoe as there is not enough guidance. BicycleGAN has sufficient guidance but cannot reproduce the correct details. Our results are of high quality and fidelity.

4 Experiments

4.1 Qualitative evaluation

We present many results of our method in the main paper and SM. In Fig. 2, our method generates very high resolution results from single image training. In the top row, we perform fine changes to the facial images from edge primitives, e.g. raising the nose and flipping the eyebrows. In the second row, on the left, we used the combined primitive (edges and segmentation) to modify the dog’s hat and make its face longer. On the right, we show complex shape transformations using segmentation primitives: our method added a third wheel to the car and converted its shape into a sports car. This shows the power of the segmentation primitive, enabling major changes to the shape using simple operations. See Fig. 3 and Fig. 4 for more examples.

Method           S1             S2             S3             S4             S5
                 LPIPS  SIFID   LPIPS  SIFID   LPIPS  SIFID   LPIPS  SIFID   LPIPS  SIFID
Pix2PixHD-SIA    0.44   0.51    0.47   0.49    0.41   0.5     0.53   0.26    0.46   0.44
Ours - no VGG    0.14   0.05    0.26   0.11    0.11   0.07    0.28   0.14    0.19   0.08
Ours             0.12   0.07    0.21   0.12    0.1    0.04    0.22   0.12    0.14   0.06

Table 1: Quantitative comparison on LRS2 frames. Results of Pix2PixHD-SIA (crop-and-flip) and our method (TPS) on LRS2 videos (both trained on a single pair). For each sequence, the left column is LPIPS and the right column is SIFID.
Figure 11: Visually comparing the effect of TPS augmentations. Our method with TPS outputs an image much more similar to the ground truth than just crop-and-flip augmentation (further results in SM).
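
As a reading aid for Table 1, the per-sequence scores can be averaged per method. The averages below are computed here from the table entries and are not figures reported by the authors; the dictionary layout and the function name are illustrative.

```python
# Per-sequence (LPIPS, SIFID) pairs for S1..S5, copied from Table 1.
table1 = {
    "Pix2PixHD-SIA": [(0.44, 0.51), (0.47, 0.49), (0.41, 0.5), (0.53, 0.26), (0.46, 0.44)],
    "Ours - no VGG": [(0.14, 0.05), (0.26, 0.11), (0.11, 0.07), (0.28, 0.14), (0.19, 0.08)],
    "Ours": [(0.12, 0.07), (0.21, 0.12), (0.1, 0.04), (0.22, 0.12), (0.14, 0.06)],
}


def mean_scores(rows):
    """Return (mean LPIPS, mean SIFID) over the sequences; lower is better for both."""
    n = len(rows)
    return (sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n)


averages = {method: mean_scores(rows) for method, rows in table1.items()}
```

Averaged over the five sequences, this gives roughly 0.462 LPIPS and 0.440 SIFID for Pix2PixHD-SIA, against 0.158 and 0.082 for the full method.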

In Fig. 9, we compare the results of different single-image methods on a paint-to-image task. Our method was trained to map from a rough paint image to an image of a tree, while SinGAN and TuiGAN were trained using the authors’ best practice. We can see that SinGAN outputs an image which is more similar to the paint than a photorealistic image and fails to capture the new shape of the tree. We note that although SinGAN allows for some conditional generation tasks, this is not its main objective, which explains the underwhelming results. TuiGAN, on the other hand, does a better job in capturing the shape but fails to capture the fine details and texture. Our method is able to change the shape of the tree to correspond to the paint while keeping the appearance of the tree and background as in the training image. Unlike TuiGAN, we learn a single generator for all future manipulations of the primitive without the need to retrain for each manipulation.

In Fig. 10, we compare to two models that were trained on a large dataset. We can see that Pix2PixHD-MI (Pix2PixHD trained on the entire edges2shoes dataset, where "MI" stands for "Multi Image") is unable to capture the correct identity of the shoes, as there are multiple possibilities for the appearance of the shoe given the edge image. BicycleGAN is able to take as input both the edge map and guidance for the appearance (style) of the required shoe. Although it is able to capture the general colors of the required shoe, it is unable to capture the fine details of the shoes (e.g. shoe laces and buckles). This is a general disadvantage of training on large datasets, as a general mapping function becomes less specialized and therefore less accurate on individual images.

Single Image Animation: The idea of generating short clip art videos from only a single image was demonstrated in [30] in an unsupervised fashion. We show that our model can be used to create artistic short video clips in a supervised fashion from a single image pair. This application makes it possible to "breathe life" into a single static image by creating a short animated clip in the primitive domain and feeding it frame-by-frame to the trained model to obtain a photorealistic animated clip. In contrast to SinGAN, which performs a random walk in the latent space, we allow for fine-grained control over the animation "story". In addition, our model can also be used in the opposite direction, that is, translating short video clips into painted animations based on a single frame and a corresponding stylized image. This application may be useful for animators and designers. An example may be seen in Fig. 8. We note that since our work does not focus on video generation, we do not have any temporal consistency optimization as was done by [35]. We strongly encourage the reader to view the videos on our project page.

4.2 Quantitative evaluation

As previous single image generators have mostly operated on unconditional generation, there are no established suitable evaluation benchmarks. We propose a new video-based benchmark for conditional single image generation spanning a range of scenes. A single frame from each video is designated for training, where the network is trained to map the primitive image to the designated training frame. The trained network is then used to map from primitive to image for all the other video frames, and we compute the prediction error using LPIPS [41] and fidelity using SIFID [30].
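
The benchmark protocol can be outlined as a small evaluation loop. This is a hedged sketch: `evaluate_video`, `train_fn` and `metric_fn` are hypothetical names, and the metric is passed in as a callable because LPIPS and SIFID both depend on pre-trained networks that are not implemented here.

```python
import numpy as np


def evaluate_video(frames, primitives, train_index, train_fn, metric_fn):
    """Train on one (primitive, frame) pair and score all remaining frames.

    frames, primitives: equal-length sequences of arrays, one entry per video frame.
    train_fn(primitive, frame) -> generator, a callable mapping primitive to image.
    metric_fn(generated, target) -> float, e.g. a perceptual distance.
    Returns the mean score over every frame except the training frame.
    """
    generator = train_fn(primitives[train_index], frames[train_index])
    scores = [
        metric_fn(generator(primitives[i]), frames[i])
        for i in range(len(frames))
        if i != train_index
    ]
    return float(np.mean(scores))
```

Running the same loop once with a perceptual-distance metric and once with a fidelity metric would give one pair of numbers per video, matching the per-sequence layout of the reported tables.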

A visual evaluation on a frame from the LRS2 dataset can be seen in Fig. 11. Our method is compared against Pix2PixHD-SIA, where "SIA" stands for "Single Image Augmented", i.e. a Pix2PixHD model that was trained on a single image using random crop-and-flip warps but not TPS. Our method significantly outperforms Pix2PixHD-SIA in fidelity and quality, indicating that our TPS augmentation is critical for single image conditional generation. Quantitative evaluations on Cityscapes and LRS2 are provided in Tab. 2 and Tab. 1, respectively. We report LPIPS and SIFID for each of the LRS2 sequences and for the average of the Cityscapes videos. Our method significantly outperformed Pix2PixHD-SIA in all comparisons. More technical details may be found in the SM. SinGAN cannot perform this task and did not obtain meaningful results. While TuiGAN can in theory perform this task, it would require retraining a model for each frame, which is impractical.

User Study We conducted a user study, following the protocol of Pix2Pix and SinGAN. We sequentially presented 30 images: 10 real, 10 manipulated images, and 10 of side-by-side pairs of real and manipulated images. The participants were asked to classify each as “Real” or “Generated by AI”. In the case of pairs, we asked participants to determine if the ‘left’ or ‘right’ image was real. Each image was presented for second, as in previous protocols. The study consisted of participants. ( males, females). The confusion rate on the unpaired images was , while on the paired images it was . This shows that our manipulated images are very realistic.

Metric   Pix2PixHD-SIA     DeepSIM (Ours)   DeepSIM (Ours)
         Seg, Crop+Flip    Seg, TPS         Seg+Edge, TPS
LPIPS    0.342             0.216            0.134
SIFID    0.292             0.127            0.104

Table 2: Results for the Cityscapes dataset - we report the average over the 16 videos. The results show the importance of the TPS augmentation and the combined primitive.

5 Analysis

Input primitives: As segmentations capture high-level aspects of the image while edge maps better capture its low-level details, we analyze the primitive that combines both. This choice is uncommon. For example, Pix2PixHD proposed combining instance and semantic segmentation maps, but this does not provide low-level details. Fig. 7 compares the three primitives. The edge representation is unable to capture the eye, presumably as it cannot capture its semantic meaning. The segmentation is unable to capture the details in the new background regions, creating a smearing effect. The combined primitive is able to capture the eye as well as the low-level textures of the background region. In Fig. 5 we present more manipulation results using the combined primitive. In the center column, we switched the positions of the two rightmost cars. As the objects were not of the same size, some empty image regions were filled using small changes to the edges. A more extreme result can be seen in the rightmost column, where the car on the left was removed, creating a large empty image region. By filling in the missing details using edges, our method was able to successfully complete the background (see SM for an ablation).

Runtime: Our runtime is a function of the neural architecture and the number of iterations. When running all experiments on the same hardware (NVIDIA RTX-2080 Ti), a 256x256 image e.g. the ”face” image (Fig. 2) takes SinGAN minutes to train, and minutes for TuiGAN while DeepSIM (ours) takes minutes. As was discussed previously, TuiGAN requires a new training process for each new manipulation whereas our DeepSIM does not.

Is the cGAN loss necessary? We evaluated removing the cGAN loss, keeping just the VGG perceptual loss on the Cars image (see SM). For such high-res images the cGAN was a better perceptual loss. At lower resolutions, the VGG results were reasonable but still blurrier than the cGAN loss.



Figure 12: Failure modes. Left: generating unseen objects - eyes of the dog. Center: background duplication - sea behind the turtle. Right: empty space interpolation - nose of the cat.

Can methods trained on large datasets generalize to rare images? We present examples where this is not the case. Fig. 10 showed that BicycleGAN did not generalize as well as Pix2PixHD-MI for new (in-distribution) shoes. We show that in the more extreme case, where the image lies further from the source distribution used for training, current methods fail completely. See SM for further analysis.

Augmentation in deep single image methods: Although we are the first to propose single-image training for manipulation using extensive non-linear augmentations, we see SinGAN as implicitly being an augmentation-based unconditional generation approach. In its first level it learns an unconditional low-res image generator, while later stages can be seen as an upscaling network. Critically, it relies on a set of “augmented” input low-res images generated by the first stage GAN. Some other methods, e.g. Deep Image Prior, do not use any form of augmentation.

Failure modes: We highlight three main failure modes of DeepSIM (Fig. 12): i) generating unseen objects - when the manipulation requires generating objects unseen in training, the network can do so incorrectly. ii) background duplication - when adding an object onto new background regions, the network can erroneously copy some background regions that originally surrounded the object. iii) interpolation in empty regions - as no guidance is given in empty image regions, the network hallucinates details, sometimes incorrectly. See SM for further analysis.

6 Conclusions

We proposed a method for training conditional generators from a single training image based on TPS augmentations. Our method is able to perform complex image manipulation at high resolution. Single image methods have significant potential: they preserve fine image details to a level not typically achieved by previous methods trained on large datasets. One limitation of single-image methods (including ours) is the requirement for training a separate network for every image. Speeding up training of single-image generators is a promising direction for future work.

Acknowledgements: We thank Jonathan Reich for creating the primitives and the animation examples, and Prof. Shmuel Peleg for insightful comments and advice.


  • [1] N. Arad, N. Dyn, D. Reisfeld, and Y. Yeshurun (1994-03)

    Image warping by radial basis functions: applications to facial expressions

    CVGIP: Graph. Models Image Process. 56 (2), pp. 161–172. External Links: ISSN 1049-9652, Link, Document Cited by: §2.
  • [2] Y. M. Asano, C. Rupprecht, and A. Vedaldi (2019) A critical analysis of self-supervision, or what we can learn from a single image. arXiv preprint arXiv:1904.13132. Cited by: §2.
  • [3] S. Avidan and A. Shamir (2007) Seam carving for content-aware image resizing. In SIGGRAPH, Cited by: §2.
  • [4] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman PatchMatch: a randomized correspondence algorithm for structural image editing. Cited by: §2.
  • [5] T. Beier and S. Neely (1992-07) Feature-based image metamorphosis. SIGGRAPH Comput. Graph. 26 (2), pp. 35–42. External Links: ISSN 0097-8930, Link, Document Cited by: §2.
  • [6] U. Bergmann, N. Jetchev, and R. Vollgraf (2017) Learning texture manifolds with the periodic spatial GAN. CoRR abs/1705.06566. External Links: Link, 1705.06566 Cited by: §1.
  • [7] J. Canny (1986) A computational approach to edge-detection. Ieee transactions on pattern analysis and machine intelligence. Cited by: Appendix H.
  • [8] W. Chen and J. Hays (2018) SketchyGAN: towards diverse and realistic sketch to image synthesis. External Links: 1801.02753 Cited by: §2.
  • [9] Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo (2018) StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, Cited by: §2.
  • [10] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016)

    The cityscapes dataset for semantic urban scene understanding


    Proceedings of the IEEE conference on computer vision and pattern recognition

    pp. 3213–3223. Cited by: Appendix H.
  • [11] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le (2018) Autoaugment: learning augmentation policies from data. arXiv preprint arXiv:1805.09501. Cited by: §2.
  • [12] T. Dekel, C. Gan, D. Krishnan, C. Liu, and W. T. Freeman (2017) Smart, sparse contours to represent and edit images. arXiv preprint arXiv:1712.08232. Cited by: §2.
  • [13] G. Donato and S. Belongie (2002) Approximate thin plate spline mappings. In European conference on computer vision, pp. 21–31. Cited by: §1, §3.2.
  • [14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In NIPS, pp. 2672–2680. Cited by: §2.
  • [15] X. Han, Z. Wu, Z. Wu, R. Yu, and L. S. Davis (2018) Viton: an image-based virtual try-on network. In CVPR, Cited by: §2.
  • [16] A. Hertzmann, C. E. Jacobs, N. Oliver, B. Curless, and D. H. Salesin (2001) Image analogies. SIGGRAPH. Cited by: §I.5, Appendix I, §2.
  • [17] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In CVPR, Cited by: §2.
  • [18] N. Jetchev, U. Bergmann, and R. Vollgraf (2017) Texture synthesis with spatial generative adversarial networks. External Links: 1611.08207 Cited by: §1.
  • [19] J. Johnson, A. Alahi, and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. In ECCV, Cited by: §3.1.
  • [20] A. Kanazawa, D. W. Jacobs, and M. Chandraker (2016) Warpnet: weakly supervised matching for single-view reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3253–3261. Cited by: §2.
  • [21] T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim (2017) Learning to discover cross-domain relations with generative adversarial networks. In ICML, Cited by: §2.
  • [22] J. Lee, E. Kim, Y. Lee, D. Kim, J. Chang, and J. Choo (2020)

    Reference-based sketch image colorization using augmented-self reference and dense semantic correspondence

    In CVPR, Cited by: §2.
  • [23] C. Li and M. Wand (2016) Precomputed real-time texture synthesis with markovian generative adversarial networks. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III, B. Leibe, J. Matas, N. Sebe, and M. Welling (Eds.), Lecture Notes in Computer Science, Vol. 9907, pp. 702–716. External Links: Link, Document Cited by: §1.
  • [24] J. Lin, Y. Pang, Y. Xia, Z. Chen, and J. Luo (2020) TuiGAN: learning versatile image-to-image translation with two unpaired images. pp. 18–35. Cited by: §2.
  • [25] M. Liu, T. Breuel, and J. Kautz (2017) Unsupervised image-to-image translation networks. In NIPS, Cited by: §2.
  • [26] S. Mounsaveng, D. Vazquez, I. B. Ayed, and M. Pedersoli (2019) Adversarial learning of general transformations for data augmentation. arXiv preprint arXiv:1909.09801. Cited by: §2.
  • [27] T. Park, M. Liu, T. Wang, and J. Zhu (2019) Semantic image synthesis with spatially-adaptive normalization. In CVPR, Cited by: §2.
  • [28] P. Pérez, M. Gangnet, and A. Blake (2003) Poisson image editing. In SIGGRAPH, Cited by: §2.
  • [29] Y. Pritch, E. Kav-Venaki, and S. Peleg (2007) Shift-map image editing. In ICCV, Cited by: §2.
  • [30] T. R. Shaham, T. Dekel, and T. Michaeli (2019) Singan: learning a generative model from a single natural image. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4570–4580. Cited by: §1, §2, §4.1, §4.2.
  • [31] A. Shocher, S. Bagon, P. Isola, and M. Irani (2018) Internal distribution matching for natural image retargeting. CoRR abs/1812.00231. External Links: Link, 1812.00231 Cited by: §1, §2.
  • [32] A. Shocher, N. Cohen, and M. Irani (2018) “Zero-shot” super-resolution using deep internal learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3118–3126. Cited by: §2.
  • [33] R. Szeliski (2010) Computer vision: algorithms and applications. Springer Science & Business Media. Cited by: §2.
  • [34] Z. Tang, K. Chen, M. Pan, M. Wang, and Z. Song (2019) An augmentation strategy for medical image processing based on statistical shape model and 3d thin plate spline for deep learning. IEEE Access 7, pp. 133111–133121. Cited by: §2.
  • [35] O. Texler, D. Futschik, M. Kučera, O. Jamriška, Š. Sochorová, M. Chai, S. Tulyakov, and D. Sýkora (2020-07) Interactive video stylization using few-shot patch-based training. ACM Trans. Graph. 39 (4). External Links: ISSN 0730-0301, Link, Document Cited by: §4.1.
  • [36] D. Ulyanov, A. Vedaldi, and V. Lempitsky (2018) Deep image prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9446–9454. Cited by: §2.
  • [37] T. Wang, M. Liu, A. Tao, G. Liu, J. Kautz, and B. Catanzaro (2019) Few-shot video-to-video synthesis. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: Appendix H.
  • [38] T. Wang, M. Liu, J. Zhu, G. Liu, A. Tao, J. Kautz, and B. Catanzaro (2018) Video-to-video synthesis. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: Appendix H.
  • [39] T. Wang, M. Liu, J. Zhu, A. Tao, J. Kautz, and B. Catanzaro (2017) High-resolution image synthesis and semantic manipulation with conditional gans. CoRR abs/1711.11585. External Links: Link, 1711.11585 Cited by: §2, §3.1.
  • [40] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo (2019) CutMix: regularization strategy to train strong classifiers with localizable features. In International Conference on Computer Vision (ICCV), Cited by: Appendix A.
  • [41] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. arXiv preprint arXiv:1801.03924. Cited by: §4.2.
  • [42] A. Zhao, G. Balakrishnan, F. Durand, J. V. Guttag, and A. V. Dalca (2019) Data augmentation using learned transformations for one-shot medical image segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8543–8553. Cited by: §2.
  • [43] Y. Zhou, Z. Zhu, X. Bai, D. Lischinski, D. Cohen-Or, and H. Huang (2018) Non-stationary texture synthesis by adversarial expansion. External Links: 1805.04487 Cited by: §1.
  • [44] J. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros (2018) Generative visual manipulation on the natural image manifold. External Links: 1609.03552 Cited by: §2.
  • [45] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, Cited by: §2.
  • [46] J. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman (2017) Toward multimodal image-to-image translation. In Advances in neural information processing systems, pp. 465–476. Cited by: §2.


Appendix A A study of different augmentations

In this section we analyse the effect of different augmentations under the proposed framework. We trained our method using different combinations of augmentation methods, i.e. crop, flip, shear, rotation, CutMix-based [40] (randomly swapping patches within the same single training image) and TPS. Furthermore, to improve robustness to manual editing of the edges, we incorporate edge augmentation in the primitive by using randomly sampled values for the Canny edge detector (these control the scale of the edges: larger values result in coarser-scale edges, smaller values in finer-scale edges).
We used the same "dataset" for all the experiments. The dataset follows the video-based evaluation method presented in Sec. 4.2 of the main paper. All images are of size and the primitives are a combination of edges and segmentations (extracted using face-parsing.PyTorch). For cutmix-like augmentations we sampled patches of random size in . We shear by . For rotations we uniformly sample degrees. The full breakdown and results are presented in Tab. 3. As can be seen from Tab. 3 and Fig. 13, TPS has a significant role in the success of our method. Additionally, we can see that the augmentation improves the reconstruction of fine details such as the teeth.
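The random choice of augmentations described above can be sketched as follows. This is only an illustration under stated assumptions: the numeric ranges are hypothetical placeholders (the actual values are not given in this text), the sampled Canny value is assumed to be the detector's Gaussian smoothing sigma, and `sample_augmentation` is a made-up helper name.

```python
import random

# All ranges below are hypothetical placeholders, not the paper's settings.
TPS_PROBABILITY = 0.99          # portion of samples warped with TPS (cf. the "tps" column of Tab. 3)
CANNY_SIGMA_RANGE = (0.5, 3.0)  # assumed: the sampled Canny value is the smoothing sigma
MAX_ROTATION_DEG = 10.0         # assumed rotation range
MAX_SHEAR = 0.1                 # assumed shear range


def sample_augmentation(rng):
    """Draw one random augmentation configuration for a single training step."""
    return {
        "flip": rng.random() < 0.5,
        "rotation_deg": rng.uniform(-MAX_ROTATION_DEG, MAX_ROTATION_DEG),
        "shear": rng.uniform(-MAX_SHEAR, MAX_SHEAR),
        "use_tps": rng.random() < TPS_PROBABILITY,
        # Larger sigma -> coarser-scale edges, smaller sigma -> finer-scale edges.
        "canny_sigma": rng.uniform(*CANNY_SIGMA_RANGE),
    }


rng = random.Random(0)
configs = [sample_augmentation(rng) for _ in range(1000)]
tps_fraction = sum(c["use_tps"] for c in configs) / len(configs)
```

Each configuration would then be applied identically to the image and to its primitive, with the edge part of the primitive recomputed using the sampled Canny value.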

crop flip shear rotation cutmix tps canny SIFID LPIPS
0.25 0.15
0.3 0.15 0.19
0.6 0.14 0.18
0.99 0.10 0.05
0.8 0.10 0.07
0.9 0.10 0.07
0.99 0.10 0.03
0.8 0.10 0.04
0.99 0.09 0.05
0.9 0.10 0.04
0.99 0.10 0.05
0.99 0.10 0.02
0.99 0.10 0.01
Table 3: Types of augmentations. The "TPS" column indicates the portion for which TPS was applied.
 30% TPS 60% TPS 99% TPS 99% TPS + Canny GT
Figure 13: Effects of augmentations

Appendix B Empty space interpolation

In this section we stress-test our method’s ability to handle regions with little guidance. In this example, the nose of the cat was shifted progressively downwards, forcing the network to interpolate the missing space. We observe that the network synthesizes attractive images for moderate empty regions. However, as the empty region gets larger, the network looks for similar regions to fill the newly created void. These regions will often be areas which exhibit low amounts of detail in the primitive representation. In our case, for larger shifts the empty space becomes greener until eventually the network inpaints a background patch. We conclude that at a certain point the network fails to capture the spatial relationship among objects in the image (i.e. that the background cannot be placed on the cat’s face) and satisfies the given constraint using neighboring information (as analysed above).

Figure 14: Evaluation of the ability of our network to interpolate across empty space regions. The two leftmost columns show the training image pair. We gradually increase the distance between the eyes and nose of the cat and feed the test images to the network; the corresponding output of each test image is shown in the second row. Our method generates attractive interpolations for moderate changes, but the performance deteriorates for larger interpolations.

Appendix C TPS generalization improvement

Let us consider the train and test edge-image pairs presented in Fig. 15. We passed each edge map through an ImageNet-trained ResNet50 network and computed the activations at the end of the third residual block. For each pixel in the activation grid of the test image, we computed the nearest-neighbor (1NN) distance to the most similar activation of the train image. We then applied 50 TPS augmentations to the training image and repeated the 1NN computation, with the training set now containing the activations of the original training image and its 50 augmentations. Let us compare the 1NN distances presented in Fig. 15 with and without TPS augmentations. Naturally, the 1NN distance decreased for the TPS-augmented training set due to its larger size. More interestingly, we can see that several face regions which prior to the augmentations did not have similar patches in the input now have a much lower distance (while more significant changes might not be possible to describe by TPS). In Fig. 15, we present the results of our method when trained on the training edge-image pair (shown in the leftmost column) and evaluated on the test edge. We can see that the prediction error (the difference between ResNet50 activations of the predicted and the true test image) appears to be strongly related to the 1NN distance with TPS augmentations. This gives some evidence for the hypothesis that the network recalls input-output pairs seen in training. It also offers an explanation for the effectiveness of TPS training, namely that it increases the range of input-output pairs seen in training, thus improving generalization to novel images.
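The 1NN analysis above can be sketched in NumPy as follows. This is an illustration only: random arrays stand in for the ResNet50 activation grids, and the helper names `flatten_grid` and `nearest_neighbor_distance`, the grid size, and the use of the Euclidean distance are assumptions rather than details taken from the paper.

```python
import numpy as np


def flatten_grid(activations):
    """Turn a (C, H, W) activation grid into (H*W, C) per-location feature vectors."""
    channels = activations.shape[0]
    return activations.reshape(channels, -1).T


def nearest_neighbor_distance(test_feats, train_feats):
    """For each test vector, return the Euclidean distance to its closest train vector."""
    # ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2, computed for all pairs at once.
    sq_dists = (
        (test_feats ** 2).sum(axis=1, keepdims=True)
        - 2.0 * test_feats @ train_feats.T
        + (train_feats ** 2).sum(axis=1)
    )
    return np.sqrt(np.maximum(sq_dists, 0.0).min(axis=1))


# Random stand-ins for activations of the train image, 50 augmented copies, and the test image.
rng = np.random.default_rng(0)
train = flatten_grid(rng.normal(size=(8, 4, 4)))
augmented = [flatten_grid(rng.normal(size=(8, 4, 4))) for _ in range(50)]
test = flatten_grid(rng.normal(size=(8, 4, 4)))

dist_plain = nearest_neighbor_distance(test, train)
dist_aug = nearest_neighbor_distance(test, np.concatenate([train] + augmented))
```

Because the augmented set contains the original activations, `dist_aug` can never exceed `dist_plain`; the informative part of the analysis is where in the image the drop is large.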

Figure 15: An analysis of the benefits of TPS. We show the kNN distance between patches in the test and train frames with and without TPS augmentations (top-right). We can see that TPS augmentation decreases the kNN distance; in some image regions the decrease is drastic, suggesting that the patches there can be obtained by deformations of training patches. The kNN-TPS distance appears to be correlated with the regions where the prediction error of our method is large. This analysis suggests that by artificially increasing the diversity of patches, single-image methods can generalize better to novel images.

Appendix D An ablation of the loss objective

We compare the results of our method, DeepSIM, using the original cGAN loss as in the base Pix2PixHD architecture vs. non-adversarial losses: the simple L1 loss and the perceptual loss based on the difference of VGG activations. In Fig. 16 we can see that on this image both non-adversarial losses fail. Note that at lower resolutions non-adversarial losses do succeed, but they do not generate results of sharpness comparable to that of the cGAN loss. Additionally, we performed the experiment with the cGAN loss but without the VGG perceptual loss; the results are presented below. It can be seen that without the VGG loss, the results are reduced in quality and contain grainy artifacts.

 Training Perceptual Loss L1 Loss Ours w/o VGG Ours w/ VGG
Figure 16: Ablation of the loss objective

Appendix E Ablation of the combined primitive for the cars image

We present an ablation of the combined primitive representation (edges+segmentation) for the Cars image. In Fig. 17, we present results for a manipulation on the Cars image using the edges-only, segmentation-only and combined primitives. We can see that the combined primitive generates attractive, artifact-free results.
In Fig. 18 we present a qualitative comparison between different primitives on two frames from the LRS2 dataset. Although all primitives generate surprisingly good results given the training on just a single image, the combined primitive generates cleaner outputs with fewer artifacts.

Training Pair Input Output
Figure 17: Ablation of the combined primitive (edges+segmentation). (top) edges-only (center) segmentation-only (bottom) combined. We can see that edges-only creates wrong associations between objects, segmentation-only fails to generate the fine details correctly (e.g. building), whereas the combined primitive achieves strong results.
  Training Pair Input Output Ground Truth
Figure 18: Ablation of the combined primitive (edges+segmentation). First row - The edges primitive is missing the chin. Second row - The segmentation primitive is missing the left eye. Third row - The combined primitive successfully recovered both the chin and the eye consistently with the ground truth.

Appendix F Out-of-distribution images using pretrained model

We manually labelled the semantic and instance segmentation maps of the Cars image and passed them to a Pix2PixHD model pre-trained by its authors on the Cityscapes dataset (containing street scenes of cars, roads and buildings). We see in Fig. 19 that Pix2PixHD pre-trained on a large dataset does not generalize well to out-of-distribution inputs, whereas our single-image method does.

 Semantic Seg. Input Seg. Output Ground
Figure 19: Out-of-distribution results. Running the full Pix2PixHD model pre-trained on Cityscapes on a primitive representation of an out-of-distribution image. The network was not able to generalize well and generated unsatisfactory results.

Appendix G Video Frames

A visual evaluation on a few frames from the Cityscapes dataset can be seen in Fig. 20. We compare our method to the results of Pix2PixHD-SIA, where "SIA" stands for "Single Image Augmented", i.e. a Pix2PixHD model that was trained on a single image using random crop-and-flip warps but not TPS. We can observe that our method is able to synthesize very different scene setups from those seen in training, including different numbers and positions of people. Our method significantly outperforms Pix2PixHD-SIA in terms of fidelity and quality, indicating that our proposed TPS augmentation is critical for single image conditional generation.

Appendix H Qualitative comparison details

Below we provide the technical details used for our new video-based benchmark for conditional single image generation spanning a range of scenes. For the qualitative comparisons we use all video segments from the Cityscapes dataset [10] provided by the code in vid2vid [38] and Few-shot-vid2vid [37]. These sequences are labelled aachen-000000 to aachen-000015 leftImg8bit. For each sequence, we train on frame 000000 and test using frames 000001 to 000029. We use the segmentation maps provided as image primitives. We also use the first videos in the public release of the Oxford-BBC Lip Reading Sentences 2 (LRS2) dataset, containing videos of different speakers. We extract their edges using a Canny edge detector [7]. In total, our evaluation set contains Cityscapes frames and LRS2 frames.
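The Cityscapes part of the protocol above (train on the first frame of each sequence, test on the remaining frames) can be written down as a small sketch. The function name `cityscapes_split` and the identifier format are illustrative assumptions; the actual file names in the vid2vid release may differ.

```python
def cityscapes_split(num_sequences=16, frames_per_sequence=30):
    """Map each sequence name to (train_frame, list_of_test_frames).

    Defaults follow the text: sequences aachen-000000 .. aachen-000015,
    frame 000000 for training and frames 000001 .. 000029 for testing.
    """
    split = {}
    for seq in range(num_sequences):
        name = f"aachen-{seq:06d}"  # hypothetical identifier format
        frames = [f"{i:06d}" for i in range(frames_per_sequence)]
        split[name] = (frames[0], frames[1:])
    return split


split = cityscapes_split()
num_test_frames = sum(len(test_frames) for _, test_frames in split.values())
```

A separate model is trained for every entry of `split`, so each reported score is an average over independently trained single-image models.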

Figure 20: Several sample results from the Cityscapes dataset. We train each model on the segmentation-image pair on the left. We then use the models to predict the image, given the segmentation maps (second column from left). Our method is shown to perform very well on this task, generating novel configurations of people not seen in the training image.

Appendix I Additional Results

We present additional results of our method, DeepSIM, on a range of manipulations on different images. The results fall into four categories: (1) Manipulations. (2) Removals. (3) Additions. (4) Single Image Animation. In (5) we provide a visual comparison to Image Analogies [16], a classic method in the field of image-to-image translation.

I.1 Manipulations

Training Pair Input Output
Splitting the Starfish
Changing the Shape of the Tail
Joining the Hamburger Halves
Changing the Shape of the Lake
Moving the Tree
Making the Beak Longer
Making the Dress Longer
Changing the Shoulder Area
Changing the Shoulder Area
Changing the Cut at the Bottom

I.2 Removals

Training Pair Input Output
Removing the Fork
Removing Arms
Removing the Teeth of the Top Llama
Training Pair Input Output Input Output
Removing the Handcuffs (left), Removing the Right Hand (right)

I.3 Additions

Training Pair Input Output
Adding More Lakes
Adding Stems
Adding the Left Paw
Adding an Arm

I.4 Single Image Animation

As described in the paper, after training we can use DeepSIM to animate a single image: we create a short animated clip in the primitive domain and feed it frame-by-frame to the trained model to obtain a photorealistic animated clip. In addition, DeepSIM can also be used in the opposite direction. The following figures showcase a few frames from each clip. We strongly encourage the reader to view the videos on our project page.

 Training Pair






I.5 Comparison to Image Analogies

Image Analogies by Hertzmann et al. [16] is based on finding multi-scale patch-level analogies between a pair of images in order to apply a wide variety of "image filter" effects to a given image. Below is a comparison of our method to theirs using the "path" example from the "texture-by-numbers" application shown in [16]. In this comparison we use the combined primitive (i.e. edges and the high-level drawing) to allow editing of fine details. For the manipulated image, since the original image did not contain any edges, we added an "edge-like" layer on top of the original result. To make our method robust to these hand-drawn edges, we apply binary skeletonization to the manipulated edges so that they resemble the Canny edges we trained on.
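To illustrate what binary skeletonization does to a thick hand-drawn stroke, here is a small pure-Python sketch of Zhang-Suen thinning, one classic skeletonization algorithm. Whether this exact algorithm was used for the comparison is an assumption; the function name `zhang_suen_thinning` and the toy stroke are illustrative only.

```python
def zhang_suen_thinning(pixels):
    """Thin a set of (row, col) foreground pixels to a roughly 1-pixel-wide skeleton."""
    pixels = set(pixels)

    def neighbours(r, c):
        # P2..P9: the 8 neighbours, clockwise, starting from the pixel directly above.
        coords = [(r - 1, c), (r - 1, c + 1), (r, c + 1), (r + 1, c + 1),
                  (r + 1, c), (r + 1, c - 1), (r, c - 1), (r - 1, c - 1)]
        return [1 if p in pixels else 0 for p in coords]

    changed = True
    while changed:
        changed = False
        for step in (0, 1):
            to_delete = []
            for (r, c) in pixels:
                n = neighbours(r, c)
                p2, p4, p6, p8 = n[0], n[2], n[4], n[6]
                b = sum(n)  # number of foreground neighbours
                # Number of 0 -> 1 transitions around the neighbourhood.
                a = sum(n[i] == 0 and n[(i + 1) % 8] == 1 for i in range(8))
                if step == 0:
                    ok = p2 * p4 * p6 == 0 and p4 * p6 * p8 == 0
                else:
                    ok = p2 * p4 * p8 == 0 and p2 * p6 * p8 == 0
                if 2 <= b <= 6 and a == 1 and ok:
                    to_delete.append((r, c))
            # Pixels are removed only after the whole sub-iteration has been evaluated.
            if to_delete:
                pixels.difference_update(to_delete)
                changed = True
    return pixels


# A 3-pixel-thick horizontal stroke, standing in for a hand-drawn edge.
thick_stroke = {(r, c) for r in range(1, 4) for c in range(1, 8)}
skeleton = zhang_suen_thinning(thick_stroke)
```

The thinned stroke keeps only its centre line, which is closer in appearance to the thin responses of an edge detector than the original brush stroke.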

Training Pair Input Output

Image Analogies


Appendix J A Step-By-Step Demonstration of Editing the Primitive

Performing complex manipulations with our method is quite easy. In this figure we present a step-by-step example of editing a primitive representation using "Paint". It simply requires sampling the required color and painting over the primitive image. One may also "borrow" edges from other areas of the image to fill in empty spaces.

 (1) Original Image (2) Paint Heart (3) Copy Edges of Trees (4) Rearrange Edges of Trees

Appendix K TPS Examples

We present several examples of original and TPS-augmented images and primitives. We can see that TPS introduces complex deformations to the samples, allowing much more expressive edits than when using simple "flip and crop" augmentations.

 Original TPS 1 TPS 2 TPS 3
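For readers who want to experiment with deformations of this kind, the following NumPy sketch fits a thin-plate spline to randomly jittered control points and evaluates the resulting coordinate mapping. It is a generic TPS formulation, not the authors' implementation: the 3x3 control grid, the jitter magnitude of 0.1 and the function names `fit_tps` and `apply_tps` are assumptions made for illustration.

```python
import numpy as np


def _tps_kernel(r2):
    """Radial basis U(r) = r^2 log(r^2), with U(0) defined as 0; r2 holds squared distances."""
    out = np.zeros_like(r2)
    mask = r2 > 0
    out[mask] = r2[mask] * np.log(r2[mask])
    return out


def _pairwise_sq_dists(a, b):
    diff = a[:, None, :] - b[None, :, :]
    return (diff ** 2).sum(axis=-1)


def fit_tps(src, dst):
    """Solve for TPS parameters that map control points src (N, 2) onto dst (N, 2)."""
    n = src.shape[0]
    K = _tps_kernel(_pairwise_sq_dists(src, src))
    P = np.hstack([np.ones((n, 1)), src])
    A = np.zeros((n + 3, n + 3))
    A[:n, :n] = K
    A[:n, n:] = P
    A[n:, :n] = P.T
    rhs = np.zeros((n + 3, 2))
    rhs[:n] = dst
    # First n rows: radial weights; last 3 rows: affine part.
    return np.linalg.solve(A, rhs)


def apply_tps(params, src, pts):
    """Map arbitrary points pts (M, 2) through the fitted spline."""
    U = _tps_kernel(_pairwise_sq_dists(pts, src))
    P = np.hstack([np.ones((pts.shape[0], 1)), pts])
    return U @ params[:-3] + P @ params[-3:]


# 3x3 grid of control points in normalized [0, 1]^2 coordinates, randomly jittered.
rng = np.random.default_rng(0)
gx, gy = np.meshgrid(np.linspace(0, 1, 3), np.linspace(0, 1, 3))
src = np.stack([gx.ravel(), gy.ravel()], axis=1)
dst = src + rng.uniform(-0.1, 0.1, size=src.shape)

params = fit_tps(src, dst)
warped_controls = apply_tps(params, src, src)
```

To warp an actual training pair, `apply_tps` would be evaluated on a dense pixel grid and the result used as sampling coordinates for both the image and its primitive, so that the two stay aligned.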