1 Introduction
Deep neural networks have significantly boosted performance on image manipulation tasks for which large training datasets can be obtained, such as mapping facial landmarks to facial images. In practice, however, there are many settings in which the image to be manipulated is unique, and a training set consisting of many similar input-output samples is unavailable. Moreover, in some cases using a large dataset might even lead to unwelcome outputs that do not preserve the specific characteristics of the desired image. Training generative models on just a single image is an exciting recent research direction, which may hold the potential to extend the scope of neural-network-based image manipulation methods to unique images. In this paper, we introduce DeepSIM, a simple-to-implement yet highly effective method for training deep conditional generative models from a single image pair. Our method is capable of solving various image manipulation tasks including: (i) shape warping (Fig. 2); (ii) object rearrangement (Fig. 5); (iii) object removal (Fig. 5); (iv) object addition (Fig. 2); (v) creation of painted and photorealistic animated clips (Fig. 8 and videos on our project page).
Given a single target image, a primitive representation is first created for the training image. This can either be unsupervised (e.g. edge map, unsupervised segmentation), supervised (e.g. segmentation map, sketch, drawing), or a combination of both. We use a standard conditional image mapping network to learn to map between the primitive representation and the image. Once training is complete, a user can explicitly design and choose the changes they want to apply to the target image by manipulating the simple primitive (which serves as a simpler manipulation domain). The modified primitive is fed to the network, which transforms it into the real image domain with the desired manipulation. This process is illustrated in Fig. 1.
Several papers have explored what and how much can be learned from a single image. Two recent seminal works, SinGAN and InGAN, propose to extend this beyond the scope of texture synthesis [6, 18, 23, 43]. SinGAN tackles the problem of single image manipulation in an unconditional manner, allowing unsupervised generation tasks. InGAN, on the other hand, proposes a conditional model for applying various geometric transformations to the image. Our paper extends this body of work by exploring the case of supervised image-to-image translation, allowing the modification of specific image details such as the shape or location of image parts. We find that the augmentation strategy is key to making DeepSIM work effectively. Breaking from the standard practice in the image translation community of using simple crop-and-flip augmentation, we found that a thin-plate-spline (TPS) augmentation method is essential for training conditional generative models from a single image-pair input. The success of TPS is due to its exploration of possible image manipulations, extending the training distribution to include the manipulated input. Our model successfully learns the internal statistics of the target image, allowing both professional and amateur designers to explore their ideas while preserving the semantic and geometric attributes of the target image and producing high-fidelity results.
Our contributions in this paper:
A general purpose approach for training conditional generators supervised by merely a single image-pair.
Recognizing that image augmentation is key to this task, and demonstrating the remarkable performance of thin-plate-spline (TPS) augmentation, which had not previously been used for single image manipulation.
Achieving outstanding visual performance on a range of image manipulation applications.
2 Related Work
Classical image manipulation methods: Image manipulation has attracted research for decades from the image processing, computational photography and graphics communities. It would not be possible to survey the full scope of this corpus of work in this paper; we refer the reader to the book by  for an extensive survey, and to the Photoshop software for a practical collection of image processing methods. A few notable image manipulation techniques include: Poisson Image Editing, Seam Carving, PatchMatch, ShiftMap, and Image Analogies. Spline-based methods include: Field Morphing and Image Warping by RBF. Learning a high-resolution parametric function between a primitive image representation and a photo-realistic image was very challenging for pre-deep-learning methods.
Deep conditional generative models:
Image-to-image translation maps images from a source domain to a target domain, while preserving the semantic and geometric content of the input images. Most image-to-image translation methods use Generative Adversarial Networks (GANs), which appear in two main scenarios: i) unsupervised image translation between domains [45, 21, 25, 9]; ii) serving as a perceptual image loss function [17, 39, 27, 46]. Existing methods for image-to-image translation require many labeled image pairs. Several methods [8, 12, 44] are carefully designed for image manipulation; however, they require large datasets which are mainly available for faces or interiors and cannot be applied to the long tail of images.
Non-standard augmentations: Conditional generation models typically use crop and flip augmentations. Classification models also use chromatic and noise augmentation. Recently, methods have been devised for learning augmentations for classification tasks, e.g. AutoAugment.  learned warping fields for augmenting classification networks. Thin-plate-spline transformations have been used in the medical domain, e.g. , but they are used for training on large datasets rather than a single sample.  learned augmentations for training segmentation networks from a single annotated 3D medical scan (using a technique similar to ); however, they require a large unlabeled dataset of similar scans, which is not available in our setting. TPS has also been used as a way of parametrizing warps for learning dense correspondences between images, e.g.  and .
Learning from a single image: Although most deep learning works use large datasets, seminal works showed that single image training is effective in some settings.  showed that a single image can be used to learn deep features. Limited work has been done on training image generators from a single image: Deep Image Prior, retargeting, and super-resolution. Recently, the seminal work SinGAN presented a general approach for training a single-image unconditional generative model; however, its ability for conditional manipulation is very limited. TuiGAN, on the other hand, proposed a conditional unsupervised image-to-image method based on a single image pair; however, their method requires retraining the network for every new pair. Our method, in contrast, uses a single aligned image pair to train a single generator that can be used for multiple manipulations without retraining, and it is able to effect significantly more elaborate changes to images, including to large objects in the scene.
3 DeepSIM: Learning Conditional Generators from a Single Image
Our method learns a conditional generative adversarial network (cGAN) using just a single image pair consisting of the main image and its primitive representation. To account for the limited training set, we augment the data by using thin-plate-spline (TPS) warps on the training pair. The proposed approach has several objectives: i) single image training; ii) fidelity: the output should reflect the primitive representation; iii) appearance: the output image should appear to come from the same distribution as the training image. We will next describe each component of our method.
Our model design follows standard practice for cGAN models (particularly Pix2PixHD). Let us denote our training image pair $(x, y)$, where $x \in \mathbb{R}^{H \times W \times 3}$ is the input image ($H$ and $W$ are the numbers of rows and columns) and $y \in \mathbb{R}^{H \times W \times C}$ is the corresponding image primitive ($C$ is the number of channels in the image primitive). We learn a generator network $G$, which learns to map the input image primitive $y$ to the generated image $G(y)$. The fidelity of the result is measured using the VGG perceptual loss $\ell_{VGG}$, which compares the differences between two images using a set of activations extracted from each image by a VGG network pre-trained on the ImageNet dataset (we follow the implementation in ). We therefore write the reconstruction loss $L_{rec}$:

$$L_{rec}(G) = \ell_{VGG}(G(y), x)$$

Conditional GAN loss: Following standard practice, we add an adversarial loss which measures the ability of a discriminator to differentiate between the (primitive, generated image) pair $(y, G(y))$ and the (primitive, true image) pair $(y, x)$. The conditional discriminator $D$ is trained adversarially against $G$. The loss of the discriminator ($L_D$) is:

$$L_D(G, D) = -\log D(y, x) - \log\big(1 - D(y, G(y))\big)$$

The combined loss is the sum of the reconstruction and adversarial losses, weighted by a constant $\alpha$:

$$\mathcal{L}(G, D) = L_{rec}(G) - \alpha \cdot L_D(G, D)$$
When large datasets exist, finding the generator $G$ and conditional discriminator $D$ that optimize $\mathcal{L}$ under the empirical data distribution can result in a strong generator. However, as we only have a single image pair $(x, y)$, this formulation severely overfits, with the negative consequence of not being able to generalize to new primitive inputs. In order to generalize to new primitive images, the size of the training dataset needs to be artificially increased so as to cover the range of expected primitives. Conditional generative models typically use simple crop-and-flip augmentations. We will later show (Sec. 4) that this simple augmentation strategy does not generalize to primitive images with non-trivial changes.
We incorporate thin-plate-spline (TPS) warps as an additional augmentation in order to extend our single-image dataset. For each TPS augmentation, an equispaced grid of control points $\{p_i\}$ is placed on the image; we then shift the control points by a random (uniformly distributed) number of pixels in the horizontal and vertical directions. This shift creates a non-smooth warp, which we denote by $\tilde{T}$. To prevent the appearance of degenerate transformations in our training images, the shifting amount is restricted to at most a fixed fraction of the minimum of the image width and height. We calculate the smooth TPS interpolating function $T$ by minimizing:

$$T = \arg\min_{f} \sum_{i} \big\| f(p_i) - \tilde{T}(p_i) \big\|^2 + \lambda \iint \left( f_{xx}^2 + 2 f_{xy}^2 + f_{yy}^2 \right) dx\, dy$$

where $f_{xx}, f_{xy}, f_{yy}$ denote the second-order partial derivatives of $f$, which form the smoothness measure, regularized by $\lambda$. The optimization over the warp can be performed very efficiently, e.g. . We denote the distribution of random TPS warps that can be generated using the above procedure as $\mathcal{T}$. The above is illustrated in Fig. 6.

During training, we sample random TPS warps $T \sim \mathcal{T}$. Each random warp transforms both the input primitive and the image to create a new training pair $(T(y), T(x))$. We optimize the generator and discriminator adversarially to minimize the expectation of the loss under the empirical distribution of random TPS warps:

$$\min_G \max_D \; \mathbb{E}_{T \sim \mathcal{T}} \left[ \mathcal{L}\big(G, D;\, T(x), T(y)\big) \right]$$
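The TPS augmentation step can be sketched as follows. This is a minimal illustrative sketch rather than the paper's implementation: the grid size, shift fraction, and smoothing constant are assumed values, SciPy's `RBFInterpolator` with the `thin_plate_spline` kernel stands in for the closed-form TPS solver, and the warp is applied by simple nearest-neighbor backward sampling.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

def random_tps_warp(h, w, grid=3, max_shift_frac=0.1, smoothing=0.01, rng=None):
    """Sample a smooth TPS warp: place an equispaced grid of control points,
    jitter them by a bounded uniform shift, and interpolate a dense flow field."""
    rng = np.random.default_rng(rng)
    ys = np.linspace(0, h - 1, grid)
    xs = np.linspace(0, w - 1, grid)
    src = np.array([[y, x] for y in ys for x in xs], dtype=float)
    max_shift = max_shift_frac * min(h, w)
    dst = src + rng.uniform(-max_shift, max_shift, src.shape)
    # TPS interpolation of the sparse control-point shifts to every pixel;
    # `smoothing` plays the role of the regularizer lambda above.
    interp = RBFInterpolator(src, dst - src,
                             kernel="thin_plate_spline", smoothing=smoothing)
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    pts = np.stack([yy.ravel(), xx.ravel()], axis=1).astype(float)
    return interp(pts).reshape(h, w, 2)  # per-pixel (dy, dx) displacement

def warp_image(img, flow):
    """Backward-warp an H x W x C image with nearest-neighbor sampling."""
    h, w = img.shape[:2]
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    sy = np.clip(np.round(yy + flow[..., 0]).astype(int), 0, h - 1)
    sx = np.clip(np.round(xx + flow[..., 1]).astype(int), 0, w - 1)
    return img[sy, sx]
```

At each training step one would draw such a warp and apply the same flow to both the image and its primitive before computing the losses, so the pair stays aligned.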
We used the Pix2PixHD architecture with the official hyperparameters (except for the number of training iterations).
3.4 Primitive images:
To edit the image, we condition our generator on a representation of the image that we call the image primitive. The required properties of the image primitive are: the ability to precisely specify the required output image, and ease of manipulation by an image editor. These two objectives are in conflict: although the most precise representation of the edited image is the edited image itself, this level of manipulation is very challenging for a human editor; in fact, simplifying this representation is the very motivation for this work. Two standard image primitives used by previous conditional generators are the edge representation of the image and the semantic instance/segmentation map of the image. Segmentation maps provide information on the high-level properties of the image, but give less guidance on the fine details. Edge maps provide the opposite trade-off. To achieve the best of both worlds, we use a combination of the two primitive representations. The advantages of the combined representation are shown in Sec. 5. Our editing procedure is illustrated in the SM.
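A combined primitive of the kind described above can be sketched as follows. This is an illustrative stand-in, not the paper's code: a plain gradient-magnitude threshold replaces the Canny detector, and the threshold value is an assumption.

```python
import numpy as np

def make_primitive(img_gray, seg_map, thresh=0.2):
    """Stack a crude edge map (gradient-magnitude threshold standing in for
    Canny) with a one-hot segmentation map into one primitive tensor."""
    gy, gx = np.gradient(img_gray.astype(float))
    mag = np.hypot(gx, gy)
    edges = (mag > thresh * mag.max()).astype(float)        # H x W, binary
    n_classes = int(seg_map.max()) + 1
    onehot = np.eye(n_classes)[seg_map]                     # H x W x n_classes
    # Channels: 1 edge channel + one channel per segmentation class.
    return np.concatenate([edges[..., None], onehot], axis=-1)
```

An editor then manipulates this stacked representation (moving edges, repainting segment labels) and feeds it to the trained generator.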
4 Experiments

[Figure: edges-to-image comparison. Columns: training image pair, input edges at inference time, Pix2PixHD, BicycleGAN, ours.] Pix2PixHD-MI cannot generate the correct shoe as there is not enough guidance. BicycleGAN has sufficient guidance but cannot reproduce the correct details. Our results are of high quality and fidelity.
4.1 Qualitative evaluation
We present many results of our method in the main paper and SM. In Fig. 2, our method generates very high-resolution results from single image training. In the top row, we make fine changes to the facial image from edge primitives, e.g. raising the nose and flipping the eyebrows. In the second row, on the left, we used the combined primitive (edges and segmentation) to modify the dog's hat and make its face longer. On the right, we show complex shape transformations using segmentation primitives: our method added a third wheel to the car and converted its shape into a sports car. This shows the power of the segmentation primitive, enabling major changes to the shape using simple operations. See Fig. 3 and Fig. 4 for more examples.
In Fig. 9, we compare the results of different single-image methods on a paint-to-image task. Our method was trained to map from a rough paint image to an image of a tree, while SinGAN and TuiGAN were trained using the authors' best practice. We can see that SinGAN outputs an image which is more similar to the paint than to a photorealistic image and fails to capture the new shape of the tree. We note that although SinGAN allows for some conditional generation tasks, this is not its main objective, explaining the underwhelming results. TuiGAN, on the other hand, does a better job of capturing the shape but fails to capture the fine details and texture. Our method is able to change the shape of the tree to correspond to the paint while keeping the appearance of the tree and background as in the training image. Differently from TuiGAN, we learn a single generator for all future manipulations of the primitive without the need to retrain for each manipulation.
In Fig. 10, we compare to two models that were trained on a large dataset. We can see that Pix2PixHD-MI (Pix2PixHD that was trained on the entire edge2shoes dataset, where ”MI” is an acronym for ”Multi Image”) is unable to capture the correct identity of the shoes as there are multiple possibilities for the appearance of the shoe given the edge image. BicycleGAN is able to take as input both the edge map and guidance for the appearance (style) of the required shoe. Although it is able to capture the general colors of the required shoe, it is unable to capture the fine details of the shoes (e.g. shoe laces and buckles). This is a general disadvantage of training on large datasets, as a general mapping function becomes less specialized and therefore less accurate on individual images.
Single Image Animation: The idea of generating short clip-art videos from only a single image was demonstrated in  in an unsupervised fashion; we show that our model can be used to create artistic short video clips in a supervised fashion from a single image pair. This application allows "breathing life" into a single static image by creating a short animated clip in the primitive domain and feeding it frame-by-frame to the trained model to obtain a photorealistic animated clip. In contrast to SinGAN, which performs a random walk in the latent space, we allow fine-grained control over the animation "story". In addition, our model can also be used in the opposite direction, that is, translating short video clips into painted animations based on a single frame and a corresponding stylized image. This application may be useful for animators and designers. An example may be seen in Fig. 8. We note that since our work does not focus on video generation, we do not apply any temporal consistency optimization as was done by . We strongly encourage the reader to view the videos on our project page.
4.2 Quantitative evaluation
As previous single image generators have mostly operated on unconditional generation, there are no established suitable evaluation benchmarks. We propose a new video-based benchmark for conditional single image generation spanning a range of scenes. A single frame from each video is designated for training, and the network is trained to map the primitive image to this designated training frame. The trained network is then used to map the primitives of all the other video frames to images; we compute the prediction error using LPIPS and the fidelity using SIFID.
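The benchmark protocol above can be sketched as follows; `train_fn` and `metric` are placeholder callables (in the paper they would be the single-image training procedure and LPIPS/SIFID), so the snippet only illustrates the train-on-one-frame, score-on-the-rest structure.

```python
import numpy as np

def single_frame_benchmark(frames, primitives, train_idx, train_fn, metric):
    """Train a model on one designated (primitive, frame) pair, then average
    the metric over the model's predictions for every other frame."""
    model = train_fn(primitives[train_idx], frames[train_idx])
    scores = [metric(model(p), f)
              for i, (p, f) in enumerate(zip(primitives, frames))
              if i != train_idx]
    return float(np.mean(scores))
```

Any per-image distance can be plugged in as `metric`; lower scores mean the single-image model generalized better to the unseen frames of the video.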
A visual evaluation on a frame from the LRS2 dataset can be seen in Fig. 11. Our method is compared against Pix2PixHD-SIA, where "SIA" stands for "Single Image Augmented", i.e. a Pix2PixHD model trained on a single image using random crop-and-flip augmentations but not TPS. Our method significantly outperforms Pix2PixHD-SIA in fidelity and quality, indicating that our TPS augmentation is critical for single image conditional generation. Quantitative evaluations on Cityscapes and LRS2 are provided in Tab. 2 and Tab. 1. We report LPIPS and SIFID for each of the LRS2 sequences and for the average of the Cityscapes videos. Our method significantly outperformed Pix2PixHD-SIA in all comparisons. More technical details may be found in the SM. SinGAN cannot perform this task and did not obtain meaningful results. While TuiGAN can in theory perform this task, it would require retraining a model for each frame, which is impractical.
User Study: We conducted a user study following the protocol of Pix2Pix and SinGAN. We sequentially presented 30 images: 10 real images, 10 manipulated images, and 10 side-by-side pairs of real and manipulated images. Participants were asked to classify each image as "Real" or "Generated by AI"; in the case of pairs, they were asked to determine whether the left or the right image was real. Each image was presented for a fixed duration, as in previous protocols. The high confusion rates on both the unpaired and the paired images show that our manipulated images are very realistic.
[Figure 7 columns: Seg, Crop+Flip | Seg, TPS | Seg+Edge, TPS]
Input primitives: As segmentations capture high-level aspects of the image while edge maps better capture its low-level details, we analyze the primitive that combines both. This choice is uncommon; Pix2PixHD, for example, proposed combining instance and semantic segmentation maps, which however does not provide low-level details. Fig. 7 compares the three primitives. The edge representation is unable to capture the eye, presumably as it cannot capture its semantic meaning. The segmentation is unable to capture the details in the new background regions, creating a smearing effect. The combined primitive is able to capture the eye as well as the low-level textures of the background region. In Fig. 5 we present more manipulation results using the combined primitive. In the center column, we switched the positions of the rightmost cars. As the objects were not of the same size, some empty image regions were filled in using small changes to the edges. A more extreme result can be seen in the rightmost column: the car on the left was removed, creating a large empty image region. By filling in the missing details using edges, our method was able to successfully complete the background (see SM for an ablation).
Runtime: Our runtime is a function of the neural architecture and the number of iterations. Running all experiments on the same hardware (NVIDIA RTX 2080 Ti), we measured the training times of SinGAN, TuiGAN, and DeepSIM (ours) on a 256x256 image, e.g. the "face" image (Fig. 2). As discussed previously, TuiGAN additionally requires a new training process for each new manipulation, whereas our DeepSIM does not.
Is the cGAN loss necessary? We evaluated removing the cGAN loss, keeping just the VGG perceptual loss, on the Cars image (see SM). For such high-resolution images, the cGAN was the better perceptual loss. At lower resolutions, the VGG results were reasonable but still blurrier than those obtained with the cGAN loss.
Can methods trained on large datasets generalize to rare images? We present examples where this is not the case. Fig. 10 showed that BicycleGAN did not generalize as well as Pix2PixHD-MI for new (in-distribution) shoes. We show that in the more extreme case, where the image lies further from the source distribution used for training, current methods fail completely. See SM for further analysis.
Augmentation in deep single image methods: Although we are the first to propose single-image training for manipulation using extensive non-linear augmentations, we see SinGAN as implicitly being an augmentation-based unconditional generation approach. Its first level learns an unconditional low-resolution image generator, while the later stages can be seen as an upscaling network. Critically, it relies on a set of "augmented" input low-resolution images generated by the first-stage GAN. Some other methods, e.g. Deep Image Prior, do not use any form of augmentation.
Failure modes: We highlight three main failure modes of DeepSIM (Fig. 12): i) generating unseen objects - when the manipulation requires generating objects unseen in training, the network can do so incorrectly. ii) background duplication - when adding an object onto new background regions, the network can erroneously copy some background regions that originally surrounded the object. iii) interpolation in empty regions - as no guidance is given in empty image regions, the network hallucinates details, sometimes incorrectly. See SM for further analysis.
We proposed a method for training conditional generators from a single training image based on TPS augmentations. Our method is able to perform complex image manipulation at high resolution. Single image methods have significant potential: they preserve image fine details to a level not typically achieved by previous methods trained on large datasets. One limitation of single-image methods (including ours) is the requirement to train a separate network for every image. Speeding up the training of single-image generators is a promising direction for future work.
Acknowledgements: We thank Jonathan Reich for creating the primitives and the animation examples, and Prof. Shmuel Peleg for insightful comments and advice.
References

- Image warping by radial basis functions: applications to facial expressions. CVGIP: Graphical Models and Image Processing 56(2), pp. 161–172.
- A critical analysis of self-supervision, or what we can learn from a single image. arXiv:1904.13132, 2019.
- Seam carving for content-aware image resizing. SIGGRAPH, 2007.
- PatchMatch: a randomized correspondence algorithm for structural image editing.
- Feature-based image metamorphosis. SIGGRAPH Computer Graphics 26(2), pp. 35–42, 1992.
- Learning texture manifolds with the periodic spatial GAN. CoRR abs/1705.06566, 2017.
- A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1986.
- SketchyGAN: towards diverse and realistic sketch to image synthesis. 2018.
- StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. CVPR, 2018.
- The Cityscapes dataset for semantic urban scene understanding. pp. 3213–3223.
- AutoAugment: learning augmentation policies from data. arXiv:1805.09501, 2018.
- Smart, sparse contours to represent and edit images. arXiv:1712.08232, 2017.
- Approximate thin plate spline mappings. ECCV, pp. 21–31, 2002.
- Generative adversarial nets. NIPS, pp. 2672–2680, 2014.
- VITON: an image-based virtual try-on network. CVPR, 2018.
- Image analogies. SIGGRAPH, 2001.
- Image-to-image translation with conditional adversarial networks. CVPR, 2017.
- Texture synthesis with spatial generative adversarial networks. 2016.
- Perceptual losses for real-time style transfer and super-resolution. ECCV, 2016.
- WarpNet: weakly supervised matching for single-view reconstruction. CVPR, pp. 3253–3261, 2016.
- Learning to discover cross-domain relations with generative adversarial networks. ICML, 2017.
- Reference-based sketch image colorization using augmented-self reference and dense semantic correspondence. CVPR.
- Precomputed real-time texture synthesis with Markovian generative adversarial networks. ECCV 2016, LNCS 9907, Part III, pp. 702–716.
- TuiGAN: learning versatile image-to-image translation with two unpaired images. pp. 18–35, 2020.
- Unsupervised image-to-image translation networks. NIPS, 2017.
- Adversarial learning of general transformations for data augmentation. arXiv:1909.09801, 2019.
- Semantic image synthesis with spatially-adaptive normalization. CVPR, 2019.
- Poisson image editing. SIGGRAPH, 2003.
- Shift-map image editing. ICCV, 2007.
- SinGAN: learning a generative model from a single natural image. ICCV, pp. 4570–4580, 2019.
- Internal distribution matching for natural image retargeting. CoRR abs/1812.00231, 2018.
- "Zero-shot" super-resolution using deep internal learning. CVPR, pp. 3118–3126, 2018.
- Computer vision: algorithms and applications. Springer Science & Business Media, 2010.
- An augmentation strategy for medical image processing based on statistical shape model and 3D thin plate spline for deep learning. IEEE Access 7, pp. 133111–133121, 2019.
- Interactive video stylization using few-shot patch-based training. ACM Transactions on Graphics 39(4), 2020.
- Deep image prior. CVPR, pp. 9446–9454, 2018.
- Few-shot video-to-video synthesis. NeurIPS, 2019.
- Video-to-video synthesis. NeurIPS, 2018.
- High-resolution image synthesis and semantic manipulation with conditional GANs. CoRR abs/1711.11585, 2017.
- CutMix: regularization strategy to train strong classifiers with localizable features. ICCV, 2019.
- The unreasonable effectiveness of deep features as a perceptual metric. arXiv:1801.03924, 2018.
- Data augmentation using learned transformations for one-shot medical image segmentation. CVPR, pp. 8543–8553, 2019.
- Non-stationary texture synthesis by adversarial expansion. 2018.
- Generative visual manipulation on the natural image manifold. 2018.
- Unpaired image-to-image translation using cycle-consistent adversarial networks. ICCV, 2017.
- Toward multimodal image-to-image translation. NeurIPS, pp. 465–476, 2017.
Appendix A A study of different augmentations
In this section we analyze the effect of different augmentations under the proposed framework. We trained our method using different combinations of augmentation methods, i.e. crop, flip, shear, rotation, cutmix-based (randomly swapping patches within the same single training image), and TPS. Furthermore, in order to improve robustness to manual editing of the edges, we incorporate edge augmentation in the primitive by using randomly sampled threshold values for the Canny edge detector (controlling the scale of the edges: larger values result in coarser-scale edges while smaller values result in finer-scale edges).
We used the same "dataset" for all the experiments. The dataset follows the video-based evaluation method presented in Sec. 4.2 of the main paper. All images are of the same size, and the primitives are a combination of edges and segmentations (extracted using face-parsing.PyTorch). For cutmix-like augmentations we sampled patches of random size; shear and rotation parameters were sampled uniformly at random. The full breakdown and results are presented in Tab. 3. As can be seen from Tab. 3 and Fig. 13, TPS has a significant role in the success of our method. Additionally, we can see that the Canny edge augmentation improves the reconstruction of fine details such as the teeth.
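The cutmix-style single-image augmentation mentioned above (swapping two random patches within the one training image) can be sketched as follows; the patch-size range is an assumed value, not the one used in the experiments.

```python
import numpy as np

def cutmix_single(img, min_frac=0.1, max_frac=0.4, rng=None):
    """Swap two randomly chosen same-size patches within a single image."""
    rng = np.random.default_rng(rng)
    h, w = img.shape[:2]
    ph = int(rng.uniform(min_frac, max_frac) * h)   # patch height
    pw = int(rng.uniform(min_frac, max_frac) * w)   # patch width
    y1, x1 = rng.integers(0, h - ph), rng.integers(0, w - pw)
    y2, x2 = rng.integers(0, h - ph), rng.integers(0, w - pw)
    out = img.copy()
    out[y1:y1 + ph, x1:x1 + pw] = img[y2:y2 + ph, x2:x2 + pw]
    out[y2:y2 + ph, x2:x2 + pw] = img[y1:y1 + ph, x1:x1 + pw]
    return out
```

As with TPS, the same swap would be applied to the image and its primitive so that the training pair stays aligned.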
[Figure 13 columns: 30% TPS | 60% TPS | 99% TPS | 99% TPS + Canny | GT]
Appendix B Empty space interpolation
In this section we stress-test our method's ability to handle regions with little guidance. In this example, the nose of the cat was shifted progressively downwards, forcing the network to interpolate the missing space. We observe that the network synthesizes attractive images for moderate empty regions; however, as the empty region gets larger, the network looks for similar regions to fill the newly created void. These regions will often be areas which exhibit low amounts of detail in the primitive representation. In our case we can notice that for larger shifts, the empty space becomes greener until eventually the network inpaints a background patch. We conclude that at a certain point, the network fails to learn the spatial relationship among objects in the image (i.e. that the background cannot be placed on the cat's face) and satisfies the given constraint using neighboring information (as was analysed above).
Appendix C TPS generalization improvement
Let us consider the train and test edge-image pairs presented in Fig. 15. We input each edge map to an ImageNet-trained ResNet50 network and computed the activations at the end of the third residual block. For each pixel in the activation grid of the test image, we computed the nearest-neighbor (1NN) distance to the most similar activation of the train image. We then performed 50 TPS augmentations of the training image, and repeated the 1NN computation with the training set now containing the activations of the original training image and its 50 augmentations. Let us compare the 1NN distances presented in Fig. 15 with and without TPS augmentations. Naturally, the 1NN distance decreased for the TPS-augmented training set due to its larger size. More interestingly, we can see that several face regions which prior to the augmentations did not have similar patches in the input now have much lower distance (while more significant changes might not be describable by TPS). In Fig. 15, we present the results of our method when trained on the training edge-image pair (shown in the leftmost column) and evaluated on the test edge. We can see that the prediction error (the difference between ResNet50 activations of the predicted and the true test image) appears to be strongly related to the 1NN distance with TPS augmentations. This gives some evidence to the hypothesis that the network recalls input-output pairs seen in training. It also gives an explanation for the effectiveness of TPS training, namely increasing the range of input-output pairs and thus generalizing to novel images.
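The per-pixel 1NN distance used above can be computed as follows; this sketch operates on precomputed activation grids (extracting ResNet50 features is outside its scope) and uses a brute-force pairwise distance computation.

```python
import numpy as np

def per_pixel_1nn(test_feat, train_feats):
    """For each spatial position of test_feat (H x W x C), return the
    distance to the nearest feature vector across a list of train
    activation grids (each H' x W' x C)."""
    t = test_feat.reshape(-1, test_feat.shape[-1])
    bank = np.concatenate([f.reshape(-1, f.shape[-1]) for f in train_feats],
                          axis=0)
    # Pairwise squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    d2 = ((t ** 2).sum(1)[:, None] + (bank ** 2).sum(1)[None, :]
          - 2.0 * t @ bank.T)
    return np.sqrt(np.maximum(d2, 0.0).min(axis=1)).reshape(test_feat.shape[:2])
```

Adding TPS-augmented copies of the training image to `train_feats` enlarges the bank, which is what lowers the 1NN distances in the analysis above.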
Appendix D An ablation of the loss objective
We compare the results of our method, DeepSIM, using the original cGAN loss from the base Pix2PixHD architecture against non-adversarial losses: the simple L1 loss and the perceptual loss based on the difference of VGG activations. In Fig. 16 we can see that both non-adversarial losses fail on this image. Note that at lower resolutions the non-adversarial losses do succeed, but they do not generate results of sharpness comparable to the cGAN loss. Additionally, we ran the experiment with the cGAN loss but without the VGG perceptual loss; the results are presented below. It can be seen that without the VGG loss, the results are of reduced quality and contain grainy artifacts.
|Training||Perceptual Loss||L1 Loss||Ours w/o VGG||Ours w/ VGG|
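For reference, the compared objectives can be sketched as follows. This is an illustrative numpy sketch, not the Pix2PixHD code: a hypothetical average-pooling "feature extractor" stands in for the pretrained VGG, and the adversarial term is written in the common non-saturating form.

```python
import numpy as np

def l1_loss(pred, target):
    # Plain pixel-space reconstruction loss.
    return np.abs(pred - target).mean()

def phi(img, k=4):
    """Hypothetical stand-in feature extractor (k x k average pooling);
    the real perceptual loss uses pretrained VGG activations instead."""
    h, w = (img.shape[0] // k) * k, (img.shape[1] // k) * k
    return img[:h, :w].reshape(h // k, k, w // k, k, -1).mean(axis=(1, 3))

def perceptual_loss(pred, target):
    # Reconstruction loss measured in feature space rather than pixel space.
    return np.abs(phi(pred) - phi(target)).mean()

def cgan_g_loss(d_fake_logits):
    """Non-saturating generator loss -log sigmoid(D(G(x))); in the full
    objective the adversarial term is combined with the VGG perceptual term."""
    return np.mean(np.log1p(np.exp(-d_fake_logits)))
```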
Appendix E Ablation of the combined primitive for the cars image
We present an ablation of the combined primitive representation (edges+segmentation) for the Cars image. In Fig. 17, we present results for a manipulation of the Cars image using the edge-only, segmentation-only and combined primitives. We can see that the combined primitive generates attractive, artifact-free results.
In Fig. 18 we present a qualitative comparison between different primitives on two frames from the LRS2 dataset. Although all primitives generate surprisingly good results given training on just a single image, the combined primitive generates cleaner outputs with fewer artifacts.
|Training Pair||Input||Output||Ground Truth|
Appendix F Out-of-distribution images using pretrained model
We manually labelled the semantic and instance segmentation maps of the Cars image and passed them to a Pix2PixHD model pre-trained by the authors on the Cityscapes dataset (containing street scenes of cars, roads and buildings). We see in Fig. 19 that Pix2PixHD pre-trained on a large dataset does not generalize well to out-of-distribution inputs, whereas our single-image method does.
|Semantic Seg.||Input Seg.||Output||Ground|
Appendix G Video Frames
A visual evaluation on a few frames from the Cityscapes dataset can be seen in Fig. 20. We compare our method to the results of Pix2PixHD-SIA, where "SIA" stands for "Single Image Augmented", i.e. a Pix2PixHD model that was trained on a single image using random crop-and-flip augmentations but not TPS. We can observe that our method is able to synthesize very different scene setups from those seen in training, including different numbers and positions of people. Our method significantly outperforms Pix2PixHD-SIA in terms of fidelity and quality, indicating that our proposed TPS augmentation is critical for single-image conditional generation.
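The "SIA" crop-and-flip augmentation can be sketched as follows. This is our own illustration; the crop fraction and the 0.5 flip probability are assumed values, not taken from the paper.

```python
import numpy as np

def crop_and_flip(image, primitive, crop_frac=0.8, rng=None):
    """Apply the same random crop and horizontal flip to an image and its
    primitive; crop_frac and the flip probability are assumed values."""
    rng = rng if rng is not None else np.random.default_rng()
    h, w = image.shape[:2]
    ch, cw = int(h * crop_frac), int(w * crop_frac)
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    img = image[top:top + ch, left:left + cw]
    prim = primitive[top:top + ch, left:left + cw]
    if rng.random() < 0.5:
        # Flip image and primitive together so they stay aligned.
        img, prim = img[:, ::-1], prim[:, ::-1]
    return img, prim
```

The key design point shared with our TPS augmentation is that the same random transform must be applied to the image and its primitive so the pair remains pixel-aligned.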
Appendix H Qualitative comparison details
Below we provide the technical details of our new video-based benchmark for conditional single-image generation spanning a range of scenes. For the qualitative comparisons we use all video segments from the Cityscapes dataset  provided by the code of vid2vid  and Few-shot-vid2vid . These sequences are labelled aachen-000000 to aachen-000015 leftImg8bit. For each sequence, we train on frame 000000 and test on frames 000001 to 000029. We use the provided segmentation maps as image primitives. We also use the first videos in the public release of the Oxford-BBC Lip Reading Sentences 2 (LRS2) dataset, containing videos of different speakers, and extract their edges using a Canny edge detector. In total, our evaluation set contains Cityscapes frames and LRS2 frames.
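The Cityscapes split described above can be enumerated programmatically; the helper name and exact label formatting below are ours, following the aachen-XXXXXX numbering in the text.

```python
def benchmark_split(n_sequences=16, n_test_frames=29):
    """Enumerate (sequence, train_frame, test_frames) triples: for every
    sequence we train on frame 000000 and test on frames 000001-000029."""
    split = []
    for seq in range(n_sequences):
        train_frame = "000000"
        test_frames = [f"{i:06d}" for i in range(1, n_test_frames + 1)]
        split.append((f"aachen-{seq:06d}", train_frame, test_frames))
    return split
```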
Appendix I Additional Results
We present additional results of our method, DeepSIM, on a range of manipulations of different images. The manipulations fall into four categories: (1) Manipulations. (2) Removals. (3) Additions. (4) Single Image Animation. In (5) we provide a visual comparison to Image Analogies , a classic method in the field of image-to-image translation.
|Splitting the Starfish|
|Changing the Shape of the Tail|
|Joining the Hamburger Halves|
|Changing the Shape of the Lake|
|Moving the Tree|
|Making the Beak Longer|
|Making the Dress Longer|
|Changing the Shoulder Area|
|Changing the Shoulder Area|
|Changing the Cut at the Bottom|
|Removing the Fork|
|Removing the Teeth of the Top Lama|
|Removing the Handcuffs (left), Removing the Right Hand (right)|
|Adding More Lakes|
|Adding the Left Paw|
|Adding an Arm|
I.4 Single Image Animation
As described in the paper, after training we can use DeepSIM to create a short animated clip in the primitive domain; feeding it frame-by-frame to the trained model, we obtain a photorealistic animated clip. In addition, DeepSIM can also be used in the opposite direction. The following figures showcase a few frames from each clip. We strongly encourage the reader to view the videos on our project page.
I.5 Comparison to Image Analogies
Image Analogies by Hertzmann et al.  is based on finding multi-scale patch-level analogies between a pair of images in order to apply a wide variety of "image filter" effects to a given image. Below is a comparison of our method to theirs using the "path" example from the "texture-by-numbers" application shown in . In this comparison we incorporate the combined primitive (i.e. edges and the high-level drawing) to allow fine-detail editing. For the manipulated image, since the original image did not contain any edges, we added an "edge-like" layer on top of the original result. To ensure our method is robust to these hand-drawn edges, we apply binary skeletonization to the manipulated edges so that they resemble the Canny edges we trained on.
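The skeletonization step could use an off-the-shelf routine such as `skimage.morphology.skeletonize`; below is a dependency-free sketch of the classic Zhang-Suen thinning algorithm, one possible implementation rather than necessarily the exact one we used.

```python
import numpy as np

def zhang_suen_thin(img):
    """Thin a binary image (0/1) to a roughly 1-pixel-wide skeleton using
    the Zhang-Suen algorithm."""
    img = np.pad(img.astype(np.uint8), 1)
    changed = True
    while changed:
        changed = False
        for step in (0, 1):
            to_delete = []
            for y in range(1, img.shape[0] - 1):
                for x in range(1, img.shape[1] - 1):
                    if img[y, x] == 0:
                        continue
                    # Neighbours p2..p9, clockwise from the pixel above.
                    p = [img[y-1, x], img[y-1, x+1], img[y, x+1], img[y+1, x+1],
                         img[y+1, x], img[y+1, x-1], img[y, x-1], img[y-1, x-1]]
                    b = sum(p)  # number of object neighbours
                    # a = number of 0->1 transitions in the circular sequence.
                    a = sum((p[i] == 0) and (p[(i + 1) % 8] == 1) for i in range(8))
                    if step == 0:
                        cond = p[0]*p[2]*p[4] == 0 and p[2]*p[4]*p[6] == 0
                    else:
                        cond = p[0]*p[2]*p[6] == 0 and p[0]*p[4]*p[6] == 0
                    if 2 <= b <= 6 and a == 1 and cond:
                        to_delete.append((y, x))
            # Delete simultaneously so the pass is order-independent.
            for y, x in to_delete:
                img[y, x] = 0
                changed = True
    return img[1:-1, 1:-1]
```

Applied to thick hand-drawn strokes, this reduces them to thin curves similar in width to Canny edge maps.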
Appendix J A Step-By-Step Demonstration of Editing the Primitive
Performing complex manipulations with our method is quite easy. In this figure we present a step-by-step example of editing a primitive representation using "Paint". It simply requires sampling the required color and painting over the primitive image. One may also "borrow" edges from other areas of the image to fill in empty spaces.
|(1) Original Image||(2) Paint Heart||(3) Copy Edges of Trees||(4) Rearrange Edges of Trees|
Appendix K TPS Examples
We present several examples of original and TPS-augmented images and primitives. We can see that TPS introduces complex deformations to the samples, allowing much more expressive edits than simple "flip and crop" augmentations.
|Original||TPS 1||TPS 2||TPS 3|
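A TPS warp can be sketched directly in numpy: fit the thin-plate spline that maps output control points back to source points, then backward-warp the image. This is an illustrative sketch; the control-point grid and displacement magnitudes are assumptions, not the paper's training hyperparameters.

```python
import numpy as np

def tps_coeffs(src, dst):
    """Solve for TPS coefficients mapping 2D points src -> dst."""
    n = src.shape[0]
    d2 = ((src[:, None] - src[None]) ** 2).sum(-1)
    K = np.where(d2 > 0, d2 * np.log(np.maximum(d2, 1e-12)), 0.0)  # U(r) = r^2 log r^2
    P = np.hstack([np.ones((n, 1)), src])
    A = np.zeros((n + 3, n + 3))
    A[:n, :n], A[:n, n:], A[n:, :n] = K, P, P.T
    b = np.vstack([dst, np.zeros((3, 2))])
    return np.linalg.solve(A, b)

def tps_warp(img, src, dst):
    """Backward-warp img with the TPS sending control points src -> dst,
    sampling with nearest neighbour."""
    coef = tps_coeffs(dst, src)  # fit the inverse map: output coords -> input coords
    h, w = img.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs.ravel(), ys.ravel()], 1).astype(float)
    d2 = ((pts[:, None] - dst[None]) ** 2).sum(-1)
    U = np.where(d2 > 0, d2 * np.log(np.maximum(d2, 1e-12)), 0.0)
    mapped = U @ coef[:-3] + np.hstack([np.ones((len(pts), 1)), pts]) @ coef[-3:]
    xi = np.clip(np.rint(mapped[:, 0]), 0, w - 1).astype(int)
    yi = np.clip(np.rint(mapped[:, 1]), 0, h - 1).astype(int)
    return img[yi, xi].reshape(img.shape)
```

Randomly displacing the control points of a coarse grid and warping the image and primitive with the same spline yields the kind of smooth, complex deformations shown above.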