Learning image distributions is a key computer vision task with many applications such as novel image synthesis, image priors and image translation. Many methods have been devised for learning image distributions, either by directly learning the image probability density function or by aligning the distribution of images with a simpler parametric distribution. Current methods normally require large image collections to learn faithful models of image distributions. This limits their applicability to images that are similar to many other images (such as facial images), excluding the long tail of unique images. Even when large image collections exist, learning image generators remains very difficult for complex images: learning models for large image collections such as ImageNet or Places365 is still the subject of much research.
A complementary approach to learning image generators from large collections is learning a model for each image. Although the sample size is far smaller, a single image has a simpler distribution which may be easier to learn. Very recently, SinGAN, a general-purpose single-image generative adversarial network model, was proposed by Shaham et al. This pioneering work utilizes the multiple patch pyramids present in a single image to train a cascade of simple conditional adversarial networks. It was shown to be able to perform a set of conditional and unconditional image synthesis tasks. Despite the amazing progress achieved by this work, SinGAN has several drawbacks, mainly due to the use of generative adversarial (GAN) training. The stability issues inherent in GAN training motivated the architectural choice of using a cascade of generators rather than a single end-to-end network; the cascade is more brittle and less convenient. Similarly to other PatchGAN-based methods, it can only deal with texture-like image details rather than large objects.
In recent years, an alternative set of techniques to generative adversarial networks has been developed. These techniques, sometimes named "non-adversarial" methods, attempt to perform image generative modeling without the use of adversarial training. Non-adversarial methods have achieved strong performance on unconditional image generation (VAE, GLO, GLANN, VQ-VAE2), as well as on unsupervised image translation tasks. In this work, we extend non-adversarial training to single-image unconditional image generation tasks as well as a set of conditional generation tasks.
To contextualize SinGAN within the process of non-adversarial learning, we offer an insightful interpretation for the operation of SinGAN. Our main insight is that SinGAN is a combination of a super-resolution network and GAN-based image augmentations of the single input image. All these networks are learned in a stage-wise rather than end-to-end manner. The cascaded training is due to the difficulty of GAN training. As insufficient variation is present within a single image, augmentation of the training image is necessary.
We propose an alternative non-adversarial approach, AugurOne, which takes the form of an upsampling network. AugurOne learns to upsample a downscaled version of the input image by reconstructing it at high resolution (much like single image super-resolution). As a single image does not present sufficient variation, we augment the training image with carefully crafted affine and non-affine augmentations. AugurOne is trained end-to-end with a non-adversarial perceptual reconstruction loss. In cases that require synthesis of novel image samples, we add a front-end variational autoencoder which learns a compact latent space, allowing novel image generation by latent interpolation. Novel images of arbitrary size are synthesized by interpolating between concatenations of different augmentations of the input image. Our method enjoys fast and stable training and achieves very strong results on novel image synthesis, most remarkably for image animation with large object movements. This is enabled by our encoder, which allows control over the novel synthesized images. Our method also achieves very compelling results on conditional generation tasks, e.g. paint-to-image and edges-to-image.
2 Previous Work
Generative Modeling: Image generation has attracted research for several decades. Some successful early approaches used Gaussian or mixture-of-Gaussians models (GMM). Due to the limited expressive power of GMMs, such methods achieved limited image resolution and quality and mainly focused on modeling image patches rather than large images. Over the last decade, with the advent of deep neural network models and increasing dataset sizes, significant progress was made on image generation models. Early deep models include Restricted Boltzmann Machines (RBMs). Variational Autoencoders, first introduced by Kingma and Welling, made a significant breakthrough as a principled model for mapping complex empirical distributions (e.g. images) to simple parametric distributions (e.g. Gaussian). Although VAEs are relatively simple to train and have solid theoretical foundations, the images they generate are not as sharp as those generated by other state-of-the-art methods. Auto-regressive and flow-based models have also been proposed as a promising direction for image generative models.
Adversarial Generative Models: Currently, the most popular paradigm for training image generation models is Generative Adversarial Networks (GANs). GANs were first introduced by Goodfellow et al. and are currently used in computer vision for three main purposes: i) unconditional image generator training, ii) unsupervised image translation between domains, and iii) serving as a perceptual image loss function. GANs are able to synthesize very sharp images but suffer from some notable drawbacks, particularly very sensitive training and mode dropping (which hurts generation diversity). Overcoming these limitations has been the focus of much research over the last several years. One mitigation is changing the loss function to prevent saturation (e.g. Wasserstein GAN). Another is using different types of discriminator regularization (where the aim is typically Lipschitzness). Regularization methods include clipping, gradient regularization and spectral normalization.
Non-Adversarial Methods: An alternative direction, motivated by the limitations of GAN methods, is the development of non-adversarial methods for image generation. Some notable methods include GLO and IMLE. Hoshen et al. combine GLO and IMLE into a new method, GLANN, which is able to synthesize sharp images from a parametric distribution. It was able to outperform GANs on a low-resolution benchmark. VQ-VAE2 also consists of a two-step approach, a combination of a vector-quantization VAE and an auto-regressive PixelCNN model, and was able to achieve very high-resolution image generation competitive with state-of-the-art GANs. Non-adversarial methods have also been successfully introduced for supervised image mapping (Chen and Koltun) and unsupervised disentanglement (LORD). In this work, we present a non-adversarial alternative for training unconditional image generation from a single image.
Single image generators: Limited work has been done on training image generators from a single image, due to the difficulty of the task. Deep Image Prior is a notable work which shows that training a deep network on a single image can form an effective image prior; however, this work cannot perform unconditional image generation. Previous work also trained image inpainting and super-resolution models from a single image; however, these works are limited to a single application and perform conditional generation. Our work draws much inspiration from the seminal work of Shaham et al. and presents several novelties which are demonstrated to be advantageous. Our method is non-adversarial and therefore enjoys fast, stable and end-to-end training. It uses augmentations that are explainable (as opposed to the more opaque GAN), giving control over the learning process. Our method is also able to deal with larger-scale objects, leading to attractive animations from a single image, as well as unsupervised domain translation from paint to photo-realistic images. It is also very effective for conditional generation, e.g. mapping edges to an image given a single training pair.
3 Analysis of SinGAN
In this section, we analyze SinGAN from the perspective of non-adversarial learning.
SinGAN consists of a cascade of upscaling networks. Each network takes as an input an upscaled low-res image, combined with noise and learns to predict the residuals between the upscaled low-res image and the high-res image. The generators are trained sequentially on the output of the previous (low-resolution) generator.
Additionally, a noise to low-res image GAN is learned at the lowest resolution level. This unconditional generator learns to take in a low-resolution image, where each pixel was sampled from a random normal distribution.
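The cascade described above can be summarized compactly in the notation of the SinGAN paper. The symbols are not defined in the text and are introduced here only for illustration: $N$ is the coarsest scale, $z_n$ is the noise at scale $n$, $\uparrow^r$ denotes upsampling by a factor $r$, and $\psi_n$ is the convolutional body of the generator at scale $n$.

```latex
\tilde{x}_N = G_N(z_N), \qquad
\tilde{x}_n = (\tilde{x}_{n+1})\uparrow^r
  + \psi_n\!\big( z_n + (\tilde{x}_{n+1})\uparrow^r \big), \quad n < N .
```

The second expression makes the residual structure explicit: each generator only predicts the detail that is missing from the upsampled output of the coarser scale.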
Two losses are used: i) a reconstruction loss, which is a standard super-resolution loss; ii) an adversarial loss at every level.
[Figure panels: Sample 1, Sample 2, Sample 3]
[Figure panels: Training Image, Input, With noise, No noise]
In order to construct a non-adversarial alternative to SinGAN, we reinterpret this method as consisting of two sets of familiar operations. The first is a set of super-resolution networks trained using a standard combination of PatchGAN and reconstruction losses as a perceptual loss. This is a standard conditional generation approach, also used for super-resolution. On its own it suffers from two drawbacks: a single image is not sufficient for training due to overfitting, and it does not supply the means for generating novel images. SinGAN therefore combines it with a second component, a low-resolution GAN, which serves two purposes: it allows for generating novel images and it provides augmentation for the single training image, allowing for training deeper networks. Additionally, we hypothesize that the noise in the upscaling generators is typically not a critical component for training SinGANs.
To validate our analysis of SinGAN, we performed a supporting experiment. We removed the noise from the training of all the conditional generators, i.e. from every generator except the unconditional one at the lowest resolution. Random samples are shown in Fig. 2. The random generation is of similar quality to that obtained with noise.
To conclude, training a single image generator with the capabilities of SinGAN has several requirements: i) a perceptual loss for evaluating the upscaling ability of generators; ii) a method for augmenting the single input image; iii) a principled method for generating novel images. A conditional single image generator has only the first two requirements. In Sec. 4, we present a novel non-adversarial method, AugurOne, for training single image generators.
In this section, we propose a principled non-adversarial method for training single image generators. Our method is trained end-to-end and is fast and robust.
4.1 End-to-end image upscaling
One of the conclusions of Sec. 3 is that the basic component in a single-image generator training is a high-quality upscaling network. Such a network is already sufficient for single-image conditional tasks such as harmonization, Edges2Image or Paint2Image. There were two main challenges solved by adversarial methods: the small sample size (just a single image) and an effective perceptual loss.
We propose a non-adversarial solution for training an upscaling network. Instead of using a cascade of generators, we simply train a single multi-scale generator network. The generator architecture is identical to that of SinGAN; however, it is trained end-to-end. The input is an image at the lowest resolution, and the expected outputs are images at every scale of the pyramid.
The loss function consists of the sum of reconstruction errors between the predicted and actual images across the entire image pyramid.
Conditional image generation networks are quite sensitive to the loss function used to evaluate reconstruction quality. Similarly to previous non-adversarial works, we use the VGG perceptual loss, which extracts features from the predicted and actual images and computes the difference between them. This loss has been found to correlate with human perceptual similarity. A sketch of our method can be seen in Fig. 3.
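The multi-scale loss described above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: images are plain nested lists, `flatten_features` is a placeholder standing in for pre-trained VGG activations, and the mean absolute difference is one possible choice of feature distance, not necessarily the one used by AugurOne.

```python
from typing import Callable, List, Sequence

Image = List[List[float]]  # grayscale image stored as rows of pixel values


def flatten_features(img: Image) -> List[float]:
    """Placeholder feature extractor that simply flattens the pixels.

    It stands in for the pre-trained VGG activations used by a real
    perceptual loss; plugging in a real network is left to the reader.
    """
    return [p for row in img for p in row]


def perceptual_distance(
    pred: Image,
    target: Image,
    features: Callable[[Image], List[float]] = flatten_features,
) -> float:
    """Mean absolute difference between the feature vectors of two images."""
    f_pred, f_target = features(pred), features(target)
    if len(f_pred) != len(f_target):
        raise ValueError("images must have matching feature sizes")
    return sum(abs(a - b) for a, b in zip(f_pred, f_target)) / len(f_pred)


def pyramid_loss(
    preds: Sequence[Image],
    targets: Sequence[Image],
    features: Callable[[Image], List[float]] = flatten_features,
) -> float:
    """Sum of per-scale reconstruction errors over the whole image pyramid."""
    if len(preds) != len(targets):
        raise ValueError("one prediction is needed per pyramid level")
    return sum(perceptual_distance(p, t, features) for p, t in zip(preds, targets))


# Two-level toy pyramid: a 1x1 image and a 2x2 image.
predicted = [[[0.0]], [[0.0, 0.0], [0.0, 0.0]]]
actual = [[[1.0]], [[1.0, 1.0], [0.0, 0.0]]]
total = pyramid_loss(predicted, actual)  # 1.0 at the coarse level + 0.5 at the fine level
```

Because every scale contributes to a single scalar, one backward pass updates the whole multi-scale generator, which is what allows end-to-end training instead of stage-wise training.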
4.2 Augmenting the input images
In Sec. 4.1, we proposed an upscaling network for conditional generation. By itself it is quite similar to previous conditional generation methods. The unique challenge here is the ability to generalize from a single image. We address this challenge with extensive augmentations. Similarly to most other deep conditional generation works, we use crops and horizontal flips. However, as a single image is significantly less data than is used in most other works, we add a non-linear augmentation, thin-plate-spline (TPS). Our TPS implementation proceeds in the following stages: i) it first constructs a target equi-spaced grid; ii) it randomly perturbs each grid point, with magnitude determined by a scale factor; iii) it fits a smooth TPS transformation with linear and radial terms which approximates the randomly warped target grid while preserving smoothness. The relative importance of the two objectives, fidelity to the warped grid and smoothness, is controlled by a weighting parameter.
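A standard way to write such a regularized TPS objective, following Bookstein and Donato and Belongie, is given below. The symbols are assumptions introduced for illustration: $t_i$ are the equi-spaced target grid points, $t'_i$ their randomly perturbed positions, $f$ the fitted warp, and $\lambda$ the weight balancing fidelity against smoothness. The exact form used by AugurOne may differ.

```latex
\min_{f}\; \sum_{i} \big\| t'_i - f(t_i) \big\|^2
\;+\; \lambda \iint \Big( f_{xx}^2 + 2 f_{xy}^2 + f_{yy}^2 \Big)\, dx\, dy,
\qquad
f(p) = a_0 + A\,p + \sum_{i} w_i\, U\big(\|p - t_i\|\big), \quad U(r) = r^2 \log r .
```

The first term pulls the warp towards the perturbed grid, while the second (the bending energy, applied to each output coordinate) penalizes non-smooth deformations. The affine part $a_0 + A\,p$ and the radial part $\sum_i w_i U(\cdot)$ correspond to the linear and radial terms mentioned above.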
This optimization can be performed very efficiently (e.g. Donato and Belongie). The resulting transformation is then used to warp the original image for a training iteration. A different TPS warp is used for every training iteration.
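The first two TPS stages (building the control grid and perturbing it) can be sketched as follows. This is only an illustration under stated assumptions: the grid lives on the unit square, the perturbation is uniform in `[-scale, scale]` per coordinate, and the names `make_grid` and `jitter_grid` are hypothetical. Fitting the smooth warp itself is left to a TPS solver.

```python
import random
from typing import List, Tuple

Point = Tuple[float, float]


def make_grid(n: int) -> List[Point]:
    """Return n x n equi-spaced control points covering the unit square."""
    if n < 2:
        raise ValueError("the grid needs at least 2 points per side")
    step = 1.0 / (n - 1)
    return [(col * step, row * step) for row in range(n) for col in range(n)]


def jitter_grid(grid: List[Point], scale: float, rng: random.Random) -> List[Point]:
    """Randomly displace every control point by at most `scale` per axis.

    The uniform distribution is an assumption; the text only states that the
    magnitude of the perturbation is set by a scale factor.
    """
    return [
        (x + rng.uniform(-scale, scale), y + rng.uniform(-scale, scale))
        for x, y in grid
    ]


# A fresh warp is drawn at every training iteration by advancing the generator.
rng = random.Random(0)
target_grid = make_grid(3)
warped_a = jitter_grid(target_grid, scale=0.1, rng=rng)
warped_b = jitter_grid(target_grid, scale=0.1, rng=rng)
```

The pair (`target_grid`, `warped_a`) is exactly the set of correspondences that the regularized TPS fit has to approximate.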
4.3 Learning a compact latent space
The upsampling network proposed above can effectively upsample low-resolution images. This will be shown in Sec. 5 to be effective for conditional generation tasks. In this section, we propose a method for non-adversarial unconditional single image generation.
To perform unconditional generation, our method learns a compact latent space for the augmentations of the single input image. We propose to combine the upscaling network proposed in the previous sections with a variational autoencoder (VAE). The variational autoencoder consists of an encoder and a decoder. The encoder takes in a low-resolution input image, which may be the original image or one of its augmentations, in both cases downsampled to the lowest resolution. The output of the encoder is a latent code of small spatial dimensionality and a larger number of channels.
We add normally distributed per-pixel random noise to the latent code and pass it through the decoder network.
We train the upscaling network and the VAE end-to-end, with the addition of a KL-divergence loss and a perceptual reconstruction loss. The full optimization loss combines all of these terms.
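The noise injection and the KL term can be made concrete with a small sketch. It assumes a latent code flattened into a list of means, a fixed (not learned) noise standard deviation `sigma`, and a standard normal prior; the function names are hypothetical and the closed form below is the usual KL divergence between two Gaussians rather than a formula taken from this paper.

```python
import math
import random
from typing import List


def add_latent_noise(code: List[float], sigma: float, rng: random.Random) -> List[float]:
    """Add independent Gaussian noise with fixed std `sigma` to each latent entry."""
    return [mean + sigma * rng.gauss(0.0, 1.0) for mean in code]


def kl_to_standard_normal(code: List[float], sigma: float) -> float:
    """KL( N(code, sigma^2 I) || N(0, I) ), summed over all latent entries."""
    if sigma <= 0.0:
        raise ValueError("sigma must be positive")
    var = sigma * sigma
    return 0.5 * sum(mean * mean + var - 1.0 - math.log(var) for mean in code)


rng = random.Random(0)
latent = [1.0, -1.0]
noisy_latent = add_latent_noise(latent, sigma=1.0, rng=rng)  # input to the decoder
kl_value = kl_to_standard_normal(latent, sigma=1.0)  # 0.5 * (1 + 1) = 1.0
```

With a constant `sigma`, the KL term reduces to a penalty on the squared magnitude of the code (plus a constant), which keeps the codes of different augmentations close together and makes interpolation between them meaningful.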
4.4 Synthesizing novel images with latent interpolations
The objective for adding the VAE front end is to be able to perform image manipulations in latent space. Once the encoder, decoder and upsampling networks are trained, we can encode every input image into its latent code. For two different augmentations of the single input image, we obtain two latent codes. We can generate a novel image by interpolating between the two codes and generating a high-res image from the interpolated code.
The interpolation weight is a scalar. Furthermore, we can use the same idea for generating animations. To generate a short video clip from a single image (e.g. Fig. 7), we encode two augmentations as described before and sample the interpolation weight at regular intervals.
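The interpolation step can be sketched as below. The assumptions are that codes are flattened into lists, that the interpolation is linear with a weight `alpha` in [0, 1], and that `alpha = 0` returns the first code; the function names are hypothetical. Each returned code would then be passed through the decoder and the upscaling network to obtain a frame.

```python
from typing import List


def lerp_codes(z_a: List[float], z_b: List[float], alpha: float) -> List[float]:
    """Linear interpolation: alpha = 0 gives z_a, alpha = 1 gives z_b."""
    if len(z_a) != len(z_b):
        raise ValueError("latent codes must have the same size")
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(z_a, z_b)]


def animation_codes(z_a: List[float], z_b: List[float], num_frames: int) -> List[List[float]]:
    """Latent codes for a clip, with alpha sampled at regular intervals."""
    if num_frames < 2:
        raise ValueError("an animation needs at least two frames")
    return [lerp_codes(z_a, z_b, i / (num_frames - 1)) for i in range(num_frames)]


frames = animation_codes([0.0, 2.0], [1.0, 4.0], num_frames=3)
# frames == [[0.0, 2.0], [0.5, 3.0], [1.0, 4.0]]
```

A single novel image corresponds to one call to `lerp_codes` with a randomly chosen `alpha`, while an animation sweeps `alpha` from one augmentation to the other.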
4.5 Implementation details
The VAE encoder is composed of convolutional blocks, each followed by batch-norm and LeakyReLU. The VAE decoder is the inverse of the encoder, composed of deconvolutions and ReLU activations. At the end of the decoder, we put a Tanh activation. We do not learn the noise level but rather use a constant value. The upscaling generator is a composition of blocks (their number depends on the resolution); each block consists of convolutions, batch-norm, ReLU, another convolution, followed by an upsampling layer. The hyper-parameters follow the original implementation of SinGAN. In all of our experiments we trained with the Adam optimizer and a cosine annealing schedule.
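For readers unfamiliar with cosine annealing, the schedule can be sketched in a few lines. The learning rate value is not available in this text, so `base_lr` below is a placeholder chosen only for illustration, as are `total_steps` and `min_lr`; the formula is the standard cosine schedule without restarts.

```python
import math


def cosine_annealed_lr(step: int, total_steps: int, base_lr: float, min_lr: float = 0.0) -> float:
    """Learning rate at `step`, decaying from base_lr to min_lr along a half cosine."""
    if total_steps <= 0 or not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    cos_term = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return min_lr + (base_lr - min_lr) * cos_term


# Placeholder values, not the ones used in the experiments.
base_lr = 1e-3
schedule = [cosine_annealed_lr(s, total_steps=100, base_lr=base_lr) for s in range(101)]
```

The rate starts at `base_lr`, reaches half of it at the midpoint, and decays smoothly to `min_lr` at the final step.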
In this section, we evaluate several applications highlighting the capabilities of our approach.
5.1 Domain translation: Paint2Image, Edges2Image
We investigate the performance of our method on domain translation tasks. For the Paint2Image task, we train an upscaling network without the VAE front-end (as no latent manipulation is necessary). The input to the network is the training image after being downscaled to low resolution and color-quantized. At inference time, we feed a downscaled painting into the upscaling network. The upscaled output can be observed in Fig. 6. Our results are compared with SinGAN. Our method performs very well on images containing large objects, as can be seen in "Dog", "Ostrich" and "Face". SinGAN was unable to deal with larger objects, leaving them very similar to the original paint image. Our results are comparable with SinGAN on small objects such as "Birds". In Tab. 1 the methods are compared quantitatively in terms of Single Image FID. AugurOne outperforms SinGAN on this task for most images.
Similarly, in the Edges2Image task, the input is a low-resolution edge image corresponding to the training image. The edge image can be obtained using a manual sketch or an automatic edge detector (we have experimented with both, and have found both options to work well). At test time a new sketch is provided to the network, and the output is a novel photo corresponding to the sketch. Examples can be seen in Fig. 5, and many more can be seen in the appendix.
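The color quantization applied to the Paint2Image training input can be illustrated with a very simple quantizer. The text does not say which quantization scheme is used, so the uniform per-channel rounding below is purely an assumption, as are the function name and the choice of pixel values in [0, 1].

```python
from typing import List

Image = List[List[float]]  # single-channel image with values in [0, 1]


def quantize(img: Image, levels: int) -> Image:
    """Snap every pixel to the nearest of `levels` evenly spaced values."""
    if levels < 2:
        raise ValueError("at least two quantization levels are required")
    top = levels - 1
    return [[round(p * top) / top for p in row] for row in img]


painted_like = quantize([[0.0, 0.2, 0.6, 1.0]], levels=3)
# painted_like == [[0.0, 0.0, 0.5, 1.0]]
```

Training on such flattened colors makes the low-resolution input resemble a painting, so that a real painting supplied at inference time falls close to the training distribution.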
5.2 Animating still images
Our method trains an encoder jointly with the generator. The learned encoder allows us to interpolate between two images and therefore generate compelling animations with large motions. To fully appreciate our results, we strongly urge the reader to view the videos on the project page. Some examples of the results of our method can be observed in Fig. 7. We can observe from the "Balloons" animation that our method generates more directed motions than SinGAN. The zoom-in column allows a better view of the motion synthesized by our method. Balloons are relatively small objects and the image is texture-like, which is the setting SinGAN was designed for. We therefore also evaluate the two methods on face animation, where the face is a large, complex object. In this case, our method is able to generate smooth and sharp animations, whereas SinGAN generates disfigured faces, demonstrating the superiority of our method on large structured objects. Our method is also evaluated on "Ostrich" and "Starry Night" and is shown to obtain pleasing animations.
5.3 Novel Images
Our method can be extended to generate novel images of arbitrary size by the following procedure: i) concatenate an arbitrary number of different random augmentations of the original image to form a pair of images of arbitrary size; ii) encode each of the two images into a latent code; iii) interpolate between the two latent codes and project the interpolated latent code through the generator to form a high-resolution image.
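Step i) of this procedure can be sketched as follows. The sketch assumes images stored as nested lists, concatenation along the horizontal axis, and a horizontal flip as the only augmentation; these choices and the helper names are illustrative assumptions, not a description of the actual implementation.

```python
from typing import List

Image = List[List[int]]  # image stored as rows of pixel values


def hflip(img: Image) -> Image:
    """Horizontal flip, one of the augmentations mentioned in the text."""
    return [list(reversed(row)) for row in img]


def hconcat(images: List[Image]) -> Image:
    """Concatenate images of equal height side by side."""
    if not images:
        raise ValueError("at least one image is required")
    height = len(images[0])
    if any(len(im) != height for im in images):
        raise ValueError("all images must have the same height")
    return [sum((im[r] for im in images), []) for r in range(height)]


tile = [[1, 2], [3, 4]]
# Two wide images built from different augmentation sequences of the same tile.
wide_a = hconcat([tile, hflip(tile)])
wide_b = hconcat([hflip(tile), tile])
```

The two wide images would then be encoded, their latent codes interpolated, and the result decoded and upscaled, exactly as for images of the original size.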
Several examples of our results can be observed in Fig. 8. We can see from the "Dog" examples that our method is able to generate compelling novel images of the dog. SinGAN fails on this image, as it struggles to deal with large complex objects. We can also see that simple techniques such as linear interpolation do not generate compelling novel images. We also present an example of generating "Birds" images of arbitrary size. Our method performs comparably to SinGAN on this task. As a failure mode of our method, we note that highly textured scenes are sometimes imperfectly captured by the latent space of the VAE, and our outputs can be blurrier than those of SinGAN in such cases.
5.4 Image Harmonization
Our method is evaluated on image harmonization in Fig. 9. The task of image harmonization places an external object on top of the single image that the generator was trained on. The task is to seamlessly blend the external object with the background image. We train our upscaling network without the VAE front-end (as there is no need for latent interpolation for this task). Similarly to SinGAN, we perform harmonization using injection at an intermediate block of the network. It can be observed that our method generates harmonious images.
The motivation for using a single image: Generative models are typically trained on large image collections and have achieved amazing results, in particular on face image generation. In this work we tackled the challenging task of training generative models from a single image. Our motivation for training models from a single image is that, for the long tail of images, we will not have image collections that are very close to them. Such collections only exist for objects of wide interest (e.g. faces) or for very diverse datasets that require models to generalize to many different types of images, an important but difficult task. In contrast, single image generation makes no assumptions on the availability of large collections. Another motivation is the cost of obtaining copyrights to a large number of images, a cost that may be prohibitive. Training on a single image can instead be performed by the image owner at no additional cost.
Pre-trained perceptual loss: In this work, we relied strongly on the availability of pre-trained deep perceptual losses. It is sometimes argued that this amounts to using significant supervision, as such perceptual losses are trained on very large supervised datasets (in our case, on the ImageNet dataset). We argue that this does not suffer from any of the disadvantages that we highlighted for generative models trained on large datasets, as ImageNet-pretrained models are easily available at no cost and require no extra supervision for new tasks such as single-image generator training.
Non-adversarial Learning: Generative adversarial networks have dominated image generation and translation over the last few years. Although they have many well documented advantages, they have disadvantages in terms of stability and mode collapse. We suspect that GANs have been used for applications that do not require them. This is further motivated by previous work on non-adversarial learning e.g. image generation, unsupervised domain translation. In this work, we showed that our non-adversarial approach has yielded a simpler method which could be trained end-to-end. Furthermore, the modeling flexibility has enabled us to include an encoder and be able to deal with whole objects rather than texture only. This enables us to generate object-level single image animations. One advantage of GAN methods over our method is their good performance as a perceptual loss for textures, which we observed to be better than non-adversarial perceptual losses.
Super-resolution: Our method can be trained to perform super-resolution in a very similar manner to ZSSR. Indeed, our upsampling network with a single stage is not very different from the network used by ZSSR. Similarly to ZSSR, we can obtain more faithful reconstruction than GAN methods, at a small cost in perceptual quality. This places our method at a different point on the perception-distortion curve than SinGAN.
We analysed a recent method for training image generators from a single image. Our analysis simplified SinGAN into three parts: upsampling, augmentation and latent manipulation. This led to the development of a novel non-adversarial approach for single image generative modeling. Our approach was shown to handle operations on large objects better than previous methods, generating compelling single-image animations with large motion and performing effective translation between domains.
-  (2017) Wasserstein gan. In ICLR, Cited by: §2.
-  (2018) Optimizing the latent space of generative networks. In ICML, Cited by: §1, §2.
-  (1989) Principal warps: thin-plate splines and the decomposition of deformations. IEEE Transactions on pattern analysis and machine intelligence 11 (6), pp. 567–585. Cited by: §4.2.
-  (2017) Photographic image synthesis with cascaded refinement networks. ICCV. Cited by: §2.
-  (2019) Bidirectional one-shot unsupervised domain mapping. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1784–1792. Cited by: §1.
-  (2014) NICE: non-linear independent components estimation. arXiv preprint arXiv:1410.8516. Cited by: §2.
-  (2002) Approximate thin plate spline mappings. In European conference on computer vision, pp. 21–31. Cited by: §4.2.
-  (2019) Demystifying inter-class disentanglement. arXiv preprint arXiv:1906.11796. Cited by: §2.
-  (2014) Generative adversarial nets. In NIPS, pp. 2672–2680. Cited by: §2.
-  (2017) Improved training of wasserstein gans. In NIPS, Cited by: §2.
-  (2019) Non-adversarial image synthesis with generative latent nearest neighbors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5811–5819. Cited by: §1, §2.
-  (2018) NAM: non-adversarial unsupervised domain mapping. In ECCV, Cited by: §1.
-  (2017) Image-to-image translation with conditional adversarial networks. In CVPR, Cited by: §2.
-  (2016) Perceptual losses for real-time style transfer and super-resolution. In ECCV, Cited by: §4.1.
-  (2017) Learning to discover cross-domain relations with generative adversarial networks. In ICML, Cited by: §2.
-  (2018) Glow: generative flow with invertible 1x1 convolutions. arXiv preprint arXiv:1807.03039. Cited by: §2.
-  (2014) Auto-encoding variational bayes. In ICLR, Cited by: §1, §2.
-  (2018) Implicit maximum likelihood estimation. arXiv preprint arXiv:1809.09087. Cited by: §2.
-  (2018) Which training methods for gans do actually converge? In International Conference on Machine Learning (ICML), Cited by: §2.
-  (2018) Spectral normalization for generative adversarial networks. In ICLR, Cited by: §2.
-  (2016) Wavenet: a generative model for raw audio. arXiv preprint arXiv:1609.03499. Cited by: §2.
-  (2016) . arXiv preprint arXiv:1601.06759. Cited by: §2.
-  (2016) Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, Cited by: §2.
-  (2019) Generating diverse high-fidelity images with vq-vae-2. In Advances in Neural Information Processing Systems, pp. 14837–14847. Cited by: §1, §2.
-  (2019) Singan: learning a generative model from a single natural image. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4570–4580. Cited by: §1, §2, §5.1.
-  (2018) “Zero-shot” super-resolution using deep internal learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3118–3126. Cited by: §2, §6.
-  (2018) Deep image prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9446–9454. Cited by: §2.
-  (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, Cited by: §2.
-  (2011) From learning models of natural image patches to whole image restoration. In ICCV, Cited by: §2.
8.1 Additional Edges2Image Results
We present additional examples of our Edges2Image results. They were obtained by: i) taking an input hi-res image; ii) computing its edges using the Canny edge detector; iii) training our upscaling network on the pair formed by the downscaled edge image and the full hi-res image, so that the network learns to predict the hi-res image from the low-res edges; iv) augmenting the single image and edge pair with crops, flips and, most importantly, thin-plate-spline (TPS) transformations. Inference was performed by drawing a new edge image and feeding it into our trained network (after downscaling). Results are presented in Figs. 10-13. Our method is able to perform significant image manipulations, resulting in very high-quality outputs.
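One practical detail of augmenting an image and edge pair is that both must receive exactly the same geometric transformation, otherwise the edges no longer line up with the photo. The sketch below illustrates this for random crops only; the function name, the nested-list image representation and the square crop are assumptions made for illustration.

```python
import random
from typing import List, Tuple

Image = List[List[int]]  # image stored as rows of pixel values


def paired_random_crop(
    image: Image, edges: Image, size: int, rng: random.Random
) -> Tuple[Image, Image]:
    """Crop the same randomly placed size x size window from both inputs."""
    height, width = len(image), len(image[0])
    if len(edges) != height or len(edges[0]) != width:
        raise ValueError("image and edge map must have the same shape")
    if not 0 < size <= min(height, width):
        raise ValueError("crop size must fit inside the image")
    top = rng.randint(0, height - size)
    left = rng.randint(0, width - size)

    def crop(im: Image) -> Image:
        return [row[left:left + size] for row in im[top:top + size]]

    return crop(image), crop(edges)


# Toy pair in which the "edge map" is a known function of the image,
# so that alignment after cropping can be checked directly.
photo = [[r * 10 + c for c in range(4)] for r in range(4)]
edge_map = [[-v for v in row] for row in photo]
crop_photo, crop_edges = paired_random_crop(photo, edge_map, size=2, rng=random.Random(0))
```

The same principle applies to flips and TPS warps: the random parameters are drawn once and applied to both members of the pair.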
[Fig. 10. Panels: Training Edge, Training Image, Input Edge, Output Image. Edits shown: reducing the mouth size and cutting the chin; reducing the size of the eyes and stretching the mouth; lifting the nose.]
[Fig. 11. Panels: Training Edge, Training Image, Input Edge, Output Image. Edits shown: adding a large balloon; adding distant mountains; lowering hand of Kuala; moving the tree.]
[Fig. 12. Panels: Training Image, Input Edge, Output Image. Edits shown: changing buckle, narrowing shoe; lowering the buckle; shortening the upper part; increasing height of heel.]
[Fig. 13. Panels: Training Image, Input Edge, Output Image. Edits shown: adding a leg; transforming into a snake.]
8.2 Additional Paint2Image Results
We present more examples of our Paint2Image results. They were obtained by standard training of our upsampling network. Input low-res images were quantized during training (as done in SinGAN). Results are shown in Figs. 14-16. Our method is able to obtain extremely high-quality results, generalizing from paint to high-quality image outputs.
[Fig. 14. Panels: Training Image, Input Paint, Output Image. Edits shown: stretching the cheeks; lifting and stretching the nose; shrinking the chin and the mouth.]
[Fig. 15. Panels: Training Image, Input Paint, Output Image. Edits shown: changing the position of the head and the neck, increasing the forehead; changing the neck's position and shape; changing point of view.]
[Fig. 16. Panels: Training Image, Input Paint, Output Image. Edits shown: stretching the nose to the right; stretching the nose to the middle; shrinking the nose.]