Generative Adversarial Nets (GANs) have made a dramatic leap in modeling high dimensional distributions of visual data. In particular, unconditional GANs have shown remarkable success in generating realistic, high quality samples when trained on class specific datasets (e.g., faces, bedrooms). However, capturing the distribution of highly diverse datasets with multiple object classes (e.g., ImageNet) is still considered a major challenge and often requires conditioning the generation on another input signal or training the model for a specific task (e.g., super-resolution, inpainting, retargeting).
Here, we take the use of GANs into a new realm – unconditional generation learned from a single natural image. Specifically, we show that the internal statistics of patches within a single natural image typically carry enough information for learning a powerful generative model. SinGAN, our new single image generative model, allows us to deal with general natural images that contain complex structures and textures, without the need to rely on the existence of a database of images from the same class. This is achieved by a pyramid of fully convolutional light-weight GANs, each responsible for capturing the distribution of patches at a different scale. Once trained, SinGAN can produce diverse high quality image samples (of arbitrary dimensions), which semantically resemble the training image, yet contain new object configurations and structures (Fig. 1).
Modeling the internal distribution of patches within a single natural image has long been recognized as a powerful prior in many computer vision tasks. Classical examples include denoising, deblurring, super resolution, dehazing [2, 14], and image editing [47, 37, 20, 9]. Motivated by these works, here we show how SinGAN can be used within a simple unified framework to solve a variety of image manipulation tasks, including paint-to-image, editing, harmonization, super-resolution, and animation from a single image. In all these cases, our model produces high quality results that preserve the internal patch statistics of the training image (see Fig. 2). All tasks are achieved with the same generative network, which requires no additional information or further training beyond the original training image.
1.1 Related Work
Single image deep models
Several recent works proposed to “overfit” a deep model to a single training example [49, 58, 45, 44, 7]. However, these methods are either designed for specific tasks (e.g., super resolution, texture expansion), or condition the generation on an additional input signal (e.g., mapping images to images) and cannot be used to draw random samples. In contrast, our framework is generic and purely generative (i.e., it maps noise to image samples). Unconditional single image GANs have been explored only in the context of texture generation [3, 26, 31]. These models do not generate meaningful samples when trained on non-texture images (Fig. 3). Our method, on the other hand, is not restricted to texture and can handle general natural images (e.g., Fig. 1).
Generative models for image manipulation
The power of adversarial learning has been demonstrated by recent GAN-based methods in many different image manipulation tasks [59, 10, 60, 8, 51, 54, 42]. Examples include interactive image editing [59, 10], sketch2image [8, 43], and other image-to-image translation tasks [60, 50, 52]. However, all these methods are trained on class specific datasets, and here too, often condition the generation on another input signal. We are not interested in capturing common features among images of the same class, but rather consider a different source of training data – all the overlapping patches at multiple scales of a single natural image. We show that a powerful generative model can be learned from this data, and can be used in a number of image manipulation tasks.
Our goal is to learn an unconditional generative model that captures the internal statistics of a single training image $x$. This task is conceptually similar to the conventional GAN setting, except that here the training samples are patches of a single image, rather than whole image samples from a database.
We opt to go beyond texture generation, and to deal with more general natural images. This requires capturing the statistics of complex image structures at many different scales. For example, we want to capture global properties such as the arrangement and shape of large objects in the image (e.g., sky at the top, ground at the bottom), as well as fine details and texture information. To achieve that, our generative framework, illustrated in Fig. 4, consists of a hierarchy of patch-GANs (Markovian discriminator) [31, 25], where each is responsible for capturing the patch distribution at a different scale of $x$. While similar multi-scale architectures have been explored in conventional GAN settings (e.g., [27, 50, 28, 12, 23]), we are the first to explore it for internal learning from a single image.
2.1 Multi-scale architecture
Our model consists of a pyramid of generators, $\{G_0, \ldots, G_N\}$, trained against an image pyramid of $x$: $\{x_0, \ldots, x_N\}$, where $x_n$ is a downsampled version of $x$ by a factor $r^n$, for some $r > 1$. Each generator $G_n$ is responsible for producing realistic image samples w.r.t. the patch distribution in the corresponding image $x_n$. This is achieved through adversarial training, where $G_n$ learns to fool an associated discriminator $D_n$, which attempts to distinguish patches in the generated samples from patches in $x_n$.
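The geometry of such an image pyramid can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `pyramid_shapes`, the rounding rule, and the example values are assumptions; scale 0 is taken to be the finest scale and the last scale the coarsest, with each scale shrunk by one more power of the downsampling factor `r`.

```python
def pyramid_shapes(height, width, r, num_scales):
    """Return [(h_0, w_0), ..., (h_N, w_N)] with N = num_scales.

    Scale n is the input size divided by r**n, so n = 0 is the finest
    scale and n = N is the coarsest one.
    """
    if r <= 1:
        raise ValueError("downsampling factor r must be greater than 1")
    shapes = []
    for n in range(num_scales + 1):
        factor = r ** n
        # Rounding to whole pixels is an assumption of this sketch.
        h = max(1, round(height / factor))
        w = max(1, round(width / factor))
        shapes.append((h, w))
    return shapes


# Illustrative values only; they are not the settings used in the paper.
shapes = pyramid_shapes(height=200, width=320, r=2.0, num_scales=3)
print(shapes)  # [(200, 320), (100, 160), (50, 80), (25, 40)]
```

One generator and one discriminator would then be trained per entry of this list, starting from the last (coarsest) entry.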
The generation of an image sample starts at the coarsest scale and sequentially passes through all generators up to the finest scale, with noise injected at every scale. All the generators have the same receptive field and thus capture structures of decreasing size as we go up the generation process. At the coarsest scale, the generation is purely generative, i.e., the coarsest generator $G_N$ maps spatial white Gaussian noise $z_N$ to an image sample $\tilde{x}_N$,
$$\tilde{x}_N = G_N(z_N).$$
The effective receptive field at this level is typically of the image’s height, hence $G_N$ generates the general layout of the image and the objects’ global structure. Each of the generators $G_n$ at finer scales ($n < N$) adds details that were not generated by the previous scales. Thus, in addition to spatial noise $z_n$, each generator $G_n$ accepts an upsampled version of the image from the coarser scale, $(\tilde{x}_{n+1})\uparrow^r$, i.e.,
$$\tilde{x}_n = G_n\left(z_n, (\tilde{x}_{n+1})\uparrow^r\right), \quad n < N.$$
All the generators have a similar architecture, as depicted in Fig. 5. Specifically, the noise $z_n$ is added to the upsampled coarser image $(\tilde{x}_{n+1})\uparrow^r$, prior to being fed into a sequence of convolutional layers. This ensures that the GAN does not disregard the noise, as often happens in conditional schemes involving randomness [60, 36, 61]. The role of the convolutional layers is to generate the missing details in $(\tilde{x}_{n+1})\uparrow^r$ (residual learning [21, 55]). Namely, $G_n$ performs the operation
$$\tilde{x}_n = (\tilde{x}_{n+1})\uparrow^r + \psi_n\left(z_n + (\tilde{x}_{n+1})\uparrow^r\right),$$
where $\psi_n$ is a fully convolutional net with 5 conv-blocks of the form Conv()-BatchNorm-LeakyReLU. We start with kernels per block at the coarsest scale and increase this number by a factor of every scales. Because the generators are fully convolutional, we can generate images of arbitrary size and aspect ratio at test time (by changing the dimensions of the noise maps). Unlike retargeting methods, our generated images are random, and optimized to preserve the patch statistics, not necessarily the salient objects.
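The coarse-to-fine sampling procedure described above can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: signals are 1-D lists instead of images, upsampling is nearest-neighbour by an integer factor, and the entries of `psis` are arbitrary callables standing in for the trained convolutional nets. All function names are hypothetical.

```python
import random


def upsample(signal, factor=2):
    """Nearest-neighbour upsampling of a 1-D signal by an integer factor."""
    return [v for v in signal for _ in range(factor)]


def residual_generator(psi, noise, coarse_up):
    """One finer-scale step: coarse_up + psi(noise + coarse_up)."""
    mixed = [z + c for z, c in zip(noise, coarse_up)]
    return [c + d for c, d in zip(coarse_up, psi(mixed))]


def sample(psis, sigmas, coarsest_len, rng, factor=2):
    """Generate one sample, coarsest scale first.

    psis[n] and sigmas[n] belong to scale n; index 0 is the finest scale
    and the last index is the coarsest one.
    """
    top = len(psis) - 1
    # Coarsest scale: noise in, signal out (there is no coarser input).
    z = [rng.gauss(0.0, sigmas[top]) for _ in range(coarsest_len)]
    x = psis[top](z)
    for n in range(top - 1, -1, -1):
        up = upsample(x, factor)
        z = [rng.gauss(0.0, sigmas[n]) for _ in up]
        x = residual_generator(psis[n], z, up)
    return x


# Toy stand-in for a trained network (hypothetical, for illustration only).
def smooth(v):
    return [0.1 * t for t in v]


out = sample([smooth, smooth, smooth], [0.1, 0.2, 1.0], coarsest_len=4,
             rng=random.Random(0))
print(len(out))  # 4 * 2 * 2 = 16 values at the finest scale
```

Because each step only needs a noise map of the right size, changing `coarsest_len` changes the output size without touching the networks, which mirrors the arbitrary-size generation mentioned in the text.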
We train our multi-scale architecture sequentially, from the coarsest scale to the finest one. Once each GAN is trained, it is kept fixed. Our training loss for the $n$th GAN comprises an adversarial term and a reconstruction term,
$$\min_{G_n} \max_{D_n} \; \mathcal{L}_{adv}(G_n, D_n) + \alpha \mathcal{L}_{rec}(G_n).$$
The adversarial loss $\mathcal{L}_{adv}$ penalizes the distance between the distribution of patches in the training image at scale $n$, $x_n$, and the distribution of patches in generated samples $\tilde{x}_n$. The reconstruction loss $\mathcal{L}_{rec}$ ensures the existence of a specific set of noise maps that can produce $x_n$, an important feature for image manipulation (Sec. 4). We next describe $\mathcal{L}_{adv}$ and $\mathcal{L}_{rec}$ in detail. See Supplementary Materials (SM) for optimization details.
Each of the generators $G_n$ is coupled with a Markovian discriminator $D_n$ that classifies each of the overlapping patches of its input as real or fake [31, 25]. We use the WGAN-GP loss, which we found to increase training stability, where the final discrimination score is the average over the patch discrimination map. As opposed to single-image GANs for textures (e.g., [31, 26, 3]), here we define the loss over the whole image rather than over random crops (a batch of size 1). This allows the net to learn boundary conditions (see SM), which is an important feature in our setting. The architecture of $D_n$ is the same as the net $\psi_n$ within $G_n$, so that its patch size (the net’s receptive field) is .
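We want to ensure that there exists a specific set of input noise maps, which generates the original image $x$. We specifically choose $\{z_N^{rec}, z_{N-1}^{rec}, \ldots, z_0^{rec}\} = \{z^*, 0, \ldots, 0\}$, where $z^*$ is some fixed noise map (drawn once and kept fixed during training). Denote by $\tilde{x}_n^{rec}$ the generated image at the $n$th scale when using these noise maps. Then for $n < N$,
$$\mathcal{L}_{rec} = \left\| G_n\left(0, (\tilde{x}_{n+1}^{rec})\uparrow^r\right) - x_n \right\|^2,$$
and for $n = N$, we use $\mathcal{L}_{rec} = \left\| G_N(z^*) - x_N \right\|^2$.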
The reconstructed image $\tilde{x}_n^{rec}$ has another role during training, which is to determine the standard deviation $\sigma_n$ of the noise $z_n$ in each scale. Specifically, we take $\sigma_n$ to be proportional to the root mean squared error (RMSE) between the upsampled coarser reconstruction $(\tilde{x}_{n+1}^{rec})\uparrow^r$ and $x_n$, which gives an indication of the amount of details that need to be added at that scale.
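The rule for setting the noise level can be illustrated with a short sketch. It assumes images are given as flat lists of pixel intensities; the proportionality constant `scale` and both function names are hypothetical, since the text only states that the standard deviation is proportional to the RMSE.

```python
import math


def rmse(a, b):
    """Root mean squared error between two equally sized pixel lists."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)) / len(a))


def noise_std(upsampled_rec, target, scale=0.1):
    """Noise std for one scale: proportional to the RMSE between the
    upsampled coarser reconstruction and the training image at this scale.

    `scale` is a placeholder proportionality constant, not a value taken
    from the paper.
    """
    return scale * rmse(upsampled_rec, target)


# A reconstruction that already matches the target needs no extra detail.
print(noise_std([0.2, 0.4, 0.6], [0.2, 0.4, 0.6]))  # 0.0
```

A large RMSE means that upsampling alone leaves many details missing at that scale, so more noise is injected there.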
We tested our method both qualitatively and quantitatively on a variety of images spanning a large range of scenes including urban and nature scenery as well as artistic and texture images. The images that we used are taken from the Berkeley Segmentation Database (BSD), Places and the Web. We always set the minimal dimension at the coarsest scale to px, and choose the number of scales s.t. the scaling factor is as close as possible to . For all the results (unless mentioned otherwise), we resized the training image to maximal dimension px.
Qualitative examples of our generated random image samples are shown in Fig. 1 and Fig. 6, and many more examples are included in the SM. For each example, we show a number of random samples with the same aspect ratio as the original image, and with decreased and expanded dimensions in each axis. As can be seen, in all these cases, the generated samples depict new realistic structures and configurations of objects, while preserving the visual content of the training image. Our model successfully preserves the global structure of objects, e.g., air balloons, volcano (Fig. 1) or pyramids (Fig. 6), as well as fine texture information. Because the network has a limited receptive field (smaller than the entire image), it can generate new combinations of patches that do not exist in the training image. Furthermore, we observe that in many cases reflections and shadows are realistically synthesized, as can be seen in the bottom two rows of Fig. 6 (see also the first example in Fig. 8). Figure 7 illustrates training and generation of a high resolution image. Here as well, structures at all scales are nicely generated, from the global arrangement of sky, clouds and mountains, to the fine textures of the snow.
Effect of scales at test time
Our multi-scale architecture allows control over the amount of variability between samples, by choosing the scale from which to start the generation at test time. To start at scale $n$, we fix the noise maps up to this scale to be the reconstruction noise maps $\{z_N^{rec}, \ldots, z_{n+1}^{rec}\}$, and use random draws only for $\{z_n, \ldots, z_0\}$. The effect is illustrated in Fig. 8. As can be seen, starting the generation at the coarsest scale ($n = N$) results in large variability in the global structure. In certain cases with a large salient object, like the Zebra image, this may lead to unrealistic samples. However, starting the generation from finer scales makes it possible to keep the global structure intact, while altering only finer image features (e.g., the Zebra’s stripes).
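The choice of noise maps for a given starting scale can be sketched as below. The sketch assumes that scales are indexed from 0 (finest) to the last index (coarsest), that each noise map is a flat list, and that the fixed reconstruction maps are already available; the function name and arguments are hypothetical.

```python
import random


def noise_for_start_scale(rec_noise, start, sigmas, rng):
    """Pick the noise map for every scale when generation starts at `start`.

    rec_noise[n] is the fixed reconstruction noise map of scale n, with the
    last index being the coarsest scale. Scales coarser than `start` reuse
    the fixed maps; scales start, start - 1, ..., 0 get fresh random draws.
    """
    top = len(rec_noise) - 1
    if not 0 <= start <= top:
        raise ValueError("start scale out of range")
    maps = []
    for n in range(top + 1):
        if n > start:
            maps.append(list(rec_noise[n]))
        else:
            maps.append([rng.gauss(0.0, sigmas[n]) for _ in rec_noise[n]])
    return maps


rec = [[0.0] * 8, [0.0] * 4, [1.0, -1.0]]  # toy maps, coarsest last
maps = noise_for_start_scale(rec, start=1, sigmas=[1.0, 1.0, 1.0],
                             rng=random.Random(0))
print(maps[2])  # [1.0, -1.0]: the coarsest scale keeps its fixed map
```

Setting `start` to the coarsest index randomizes every scale, which corresponds to the high-variability case; lowering it freezes more of the global structure.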
Effect of scales during training
Figure 9 shows the effect of training with fewer scales. With a small number of scales, the effective receptive field at the coarsest level is smaller, allowing the model to capture only fine textures. As the number of scales increases, structures of larger support emerge, and the global object arrangement is better preserved.
3.1 Quantitative Evaluation
To quantify the realism of our generated images and how well they capture the internal statistics of the training image, we use two metrics: (i) an Amazon Mechanical Turk (AMT) “Real/Fake” user study, and (ii) a new single-image version of the Fréchet Inception Distance.
AMT perceptual study
We followed the protocol of [25, 56] and performed perceptual experiments in 2 settings. (i) Paired (real vs. fake): Workers were presented with a sequence of 50 trials, in each of which a fake image (generated by SinGAN) was presented against its real training image for 1 second. Workers were asked to pick the fake image. (ii) Unpaired (either real or fake): Workers were presented with a single image for 1 second, and were asked if it was fake. In total, 50 real images and a disjoint set of 50 fake images were presented in random order to each worker.
We repeated these two protocols for two types of generation processes: starting the generation from the coarsest ($N$th) scale, and starting from scale $N-1$ (as in Fig. 8). These enable us to assess the realism of our results at two different variability levels. To quantify the diversity of the generated images, for each training example we calculated the standard deviation (std) of the intensity values of each pixel over 100 generated images, averaged it over all pixels, and normalized by the std of the intensity values of the training image (see SM for a detailed explanation).
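The diversity measure described above can be written out as a short sketch. It assumes grayscale images stored as flat lists of intensities of equal length and uses the population standard deviation; both choices, like the function name, are assumptions of this illustration rather than details confirmed by the text.

```python
import statistics


def diversity(samples, training):
    """Per-pixel std over generated samples, averaged over pixels and
    normalized by the std of the training image intensities."""
    n_pixels = len(training)
    if any(len(img) != n_pixels for img in samples):
        raise ValueError("all images must have the same number of pixels")
    per_pixel_std = [
        statistics.pstdev([img[i] for img in samples])
        for i in range(n_pixels)
    ]
    return statistics.mean(per_pixel_std) / statistics.pstdev(training)


# Two toy "generated images" with two pixels each.
score = diversity([[0.0, 1.0], [2.0, 1.0]], training=[0.0, 2.0])
print(score)  # 0.5
```

A score of 0 means that all generated samples are identical; larger values indicate more pixel-level variation relative to the contrast of the training image.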
The real images were randomly picked from the “Places” database, from the subcategories Mountains, Hills, Desert, Sky. In each of the 4 tests, we had 50 different participants. In all tests, the first 10 trials were a tutorial that included feedback. The results are reported in Table 1.
(global structure of the original image remains fixed). For each of these cases, we report the confusion rates for a paired study (real-vs.-fake image pairs are shown), and unpaired study (either fake or real image is shown). Standard deviations were estimated by bootstrap.
As expected, the confusion rates are consistently larger in the unpaired case, where there is no reference for comparison. In addition, it is clear that the confusion rate decreases with the diversity of the generated images. However, even when large structures are changed, our generated images were hard to distinguish from the real images (a score of 50% would mean perfect confusion between real and fake). The full set of test images is included in the SM.
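The bootstrap estimate of the standard deviation of a confusion rate can be sketched as follows. The text does not say which unit was resampled (workers or individual trials); this illustration resamples individual binary outcomes, and the function name, number of resamples, and seed are assumptions.

```python
import random
import statistics


def bootstrap_std(outcomes, n_resamples=1000, seed=0):
    """Bootstrap estimate of the std of a confusion rate.

    outcomes: list of 0/1 values, where 1 means the worker was fooled.
    Each resample draws len(outcomes) values with replacement and records
    the resulting confusion rate.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    rates = []
    for _ in range(n_resamples):
        resample = [outcomes[rng.randrange(n)] for _ in range(n)]
        rates.append(sum(resample) / n)
    return statistics.pstdev(rates)


# Toy data: 100 trials, half of which fooled the worker.
toy = [1] * 50 + [0] * 50
print(round(bootstrap_std(toy), 3))
```

For a rate near 50% over 100 trials, the estimate should land close to the binomial value of about 0.05.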
Single Image Fréchet Inception Distance
We next quantify how well SinGAN captures the internal statistics of $x$. A common metric for GAN evaluation is the Fréchet Inception Distance (FID), which measures the deviation between the distribution of deep features of generated images and that of real images. In our setting, however, we only have a single real image, and are rather interested in its internal patch statistics. We thus propose the Single Image FID (SIFID) metric. Instead of using the activation vector after the last pooling layer in the Inception Network (a single vector per image), we use the internal distribution of deep features at the output of the convolutional layer just before the second pooling layer (one vector per location in the map). Our SIFID is the FID between the statistics of those features in the real image and in the generated sample.
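To make the per-location statistic concrete, the sketch below computes a Fréchet distance between two sets of feature vectors under a simplifying assumption: each set is modeled as a Gaussian with diagonal covariance, so the distance reduces to a sum over channels. The actual FID uses full covariance matrices and Inception features; the toy features and function names here are hypothetical.

```python
import statistics


def channel_stats(features):
    """Per-channel mean and population std of a list of feature vectors
    (one vector per spatial location)."""
    channels = list(zip(*features))
    means = [statistics.mean(c) for c in channels]
    stds = [statistics.pstdev(c) for c in channels]
    return means, stds


def frechet_distance_diag(feats_a, feats_b):
    """Squared Frechet distance between two Gaussians with diagonal
    covariance: sum over channels of (mu_a - mu_b)^2 + (sd_a - sd_b)^2."""
    mu_a, sd_a = channel_stats(feats_a)
    mu_b, sd_b = channel_stats(feats_b)
    return sum(
        (ma - mb) ** 2 + (sa - sb) ** 2
        for ma, mb, sa, sb in zip(mu_a, mu_b, sd_a, sd_b)
    )


real = [[0.0], [2.0]]  # mean 1, std 1
fake = [[3.0], [3.0]]  # mean 3, std 0
print(frechet_distance_diag(real, fake))  # 5.0
```

Identical feature statistics give a distance of 0, and the value grows as either the means or the spreads of the two feature sets drift apart.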
As can be seen in Table 2, the average SIFID is lower for generation from scale $N-1$ than for generation from scale $N$, which aligns with the user study results. To better understand these numbers, we also report the correlation between the SIFID scores and the confusion rates for the fake images. Note that there is a significant anti-correlation between the two, implying that a small SIFID is typically a good indicator for a large confusion rate. The correlation is stronger for the paired tests, since SIFID is a paired measure (it operates on the pair of real image and generated sample).
|1st Scale||SIFID||Survey||SIFID/AMT Correlation|
We explore the use of SinGAN for a number of image manipulation tasks. To do so, we use our model after training, with no architectural changes or further tuning, and follow the same approach for all applications. The idea is to utilize the fact that at inference, SinGAN can only produce images with the same patch distribution as the training image. Thus, manipulation can be done by injecting (a possibly downsampled version of) an image into the generation pyramid at some scale $n < N$, and feeding it forward through the generators so as to match its patch distribution to that of the training image. Different injection scales lead to different effects. We consider the following applications (see SM for more results and the injection scale effect).
Super-Resolution
Increase the resolution of an input image by a factor $s$. We train our model on the low-resolution (LR) image, with a reconstruction loss weight of and a pyramid scale factor of $r = \sqrt[k]{s}$ for some $k \in \mathbb{N}$. Since small structures tend to recur across scales of natural scenes, at test time we upsample the LR image by a factor of $r$ and inject it (together with noise) to the last generator, $G_0$. We repeat this $k$ times to obtain the final high-res output. An example result is shown in Fig. 10. As can be seen, the visual quality of our reconstruction exceeds that of state-of-the-art internal methods [49, 45] as well as of external methods that aim for PSNR maximization. Interestingly, it is comparable to the externally trained SRGAN method, despite having been exposed to only a single image. Following [4], we compare these 5 methods in Table 3 on the BSD100 dataset in terms of distortion (RMSE) and perceptual quality (NIQE), which are two fundamentally conflicting requirements. As can be seen, SinGAN excels in perceptual quality; its NIQE score is only slightly inferior to SRGAN, and its RMSE is slightly better.
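The repeated-upsampling schedule can be sketched numerically. The sketch assumes that the per-pass factor is the k-th root of the total factor, so that k passes through the finest generator reach the target resolution exactly; the names `s`, `k`, and `sr_schedule`, the rounding, and the example values are assumptions of this illustration.

```python
def sr_schedule(height, width, s, k):
    """Return the per-pass factor and the image size after each of k passes.

    s: total super-resolution factor, k: number of upsample-and-refine
    passes. The per-pass factor is taken to be s ** (1 / k).
    """
    if s <= 1 or k < 1:
        raise ValueError("need s > 1 and k >= 1")
    r = s ** (1.0 / k)
    sizes = []
    for i in range(1, k + 1):
        factor = r ** i
        sizes.append((round(height * factor), round(width * factor)))
    return r, sizes


# Illustrative values only: a 4x enlargement reached in two passes of 2x.
r, sizes = sr_schedule(height=100, width=150, s=4, k=2)
print(r, sizes)  # 2.0 [(200, 300), (400, 600)]
```

In each pass the current image would be upsampled to the next size in the list and fed, together with noise, through the finest-scale generator.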
|External methods||Internal methods|
Paint-to-Image
Transfer a clipart into a photo-realistic image. This is done by feeding the clipart image into one of the coarse scales (typically $N-1$ or $N-2$). Examples are shown in Fig. 2 and Fig. 11. As can be seen, the global structure of the painting is preserved, while texture and high frequency information matching the original image are realistically generated. Our method outperforms style transfer methods [38, 16] in terms of visual quality (Fig. 11).
Harmonization
Produce a composite in which a pasted object is realistically blended with a background image. We train SinGAN on the background image, and inject the naively pasted composite at test time. As can be seen in Fig. 2 and Fig. 13, our model tailors the pasted object’s texture to match the background, and often preserves its structure better than [34]. Scales 2, 3, 4 typically lead to a good balance between preserving the object’s structure and transferring the background’s texture.
Editing
Produce a seamless composite in which image regions have been copied and pasted in other locations. Here, again, we inject the composite into one of the coarse scales. As shown in Fig. 2 and Fig. 12, SinGAN re-generates fine textures and seamlessly stitches the pasted parts, producing nicer results than Photoshop’s Content-Aware-Move.
Single Image Animation
Create a short video clip with realistic object motion, from a single input image. Natural images often contain repetitions, which reveal different “snapshots” in time of the same dynamic object (e.g., an image of a flock of birds reveals all wing postures of a single bird). Using SinGAN, we can travel along the manifold of all appearances of the object in the image, thus synthesizing motion from a single image. We found that for many types of images, a realistic effect is achieved by a random walk in $z$-space, starting with $z^{rec}$ for the first frame at all generation scales. Results are available on https://youtu.be/xk8bWLZk4DU.
We introduced SinGAN, a new unconditional generative scheme that is learned from a single natural image. We demonstrated its ability to go beyond textures and to generate diverse realistic samples for natural complex images. Internal learning is inherently limited in terms of semantic diversity compared to externally trained generation methods. For example, if the training image contains a single dog, our model will not generate samples of different dog breeds. Nevertheless, as demonstrated by our experiments, SinGAN can provide a very powerful tool for a wide range of image manipulation tasks.
We thank Idan Kligvasser for the assistance and valuable insights. This research was supported by the Israel Science Foundation (grant no. 852/17).
-  S. Avidan and A. Shamir. Seam carving for content-aware image resizing. In ACM Transactions on graphics (TOG), volume 26, page 10. ACM, 2007.
-  Y. Bahat and M. Irani. Blind dehazing using internal patch recurrence. In 2016 IEEE International Conference on Computational Photography (ICCP), pages 1–9. IEEE, 2016.
-  U. Bergmann, N. Jetchev, and R. Vollgraf. Learning texture manifolds with the periodic spatial GAN. arXiv preprint arXiv:1705.06566, 2017.
-  Y. Blau, R. Mechrez, R. Timofte, T. Michaeli, and L. Zelnik-Manor. The 2018 pirm challenge on perceptual image super-resolution. In European Conference on Computer Vision Workshops, pages 334–355. Springer, 2018.
-  Y. Blau and T. Michaeli. The perception-distortion tradeoff. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6228–6237, 2018.
-  A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
-  C. Chan, S. Ginosar, T. Zhou, and A. A. Efros. Everybody dance now. arXiv preprint arXiv:1808.07371, 2018.
-  W. Chen and J. Hays. Sketchygan: towards diverse and realistic sketch to image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9416–9425, 2018.
-  T. S. Cho, M. Butman, S. Avidan, and W. T. Freeman. The patch transform and its applications to image editing. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2008.
-  T. Dekel, C. Gan, D. Krishnan, C. Liu, and W. T. Freeman. Sparse, smart contours to represent and edit images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3511–3520, 2018.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
-  E. L. Denton, S. Chintala, R. Fergus, et al. Deep generative image models using a laplacian pyramid of adversarial networks. In Advances in neural information processing systems, pages 1486–1494, 2015.
-  B. Efron. Bootstrap methods: another look at the jackknife. In Breakthroughs in statistics, pages 569–593. Springer, 1992.
-  G. Freedman and R. Fattal. Image and video upscaling from local self-examples. ACM Transactions on Graphics (TOG), 30(2):12, 2011.
-  L. Gatys, A. S. Ecker, and M. Bethge. Texture synthesis using convolutional neural networks. In Advances in neural information processing systems, pages 262–270, 2015.
-  L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2414–2423, 2016.
-  D. Glasner, S. Bagon, and M. Irani. Super-resolution from a single image. In 2009 IEEE 12th International Conference on Computer Vision (ICCV), pages 349–356. IEEE, 2009.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
-  I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5767–5777, 2017.
-  K. He and J. Sun. Statistics of patch offsets for image completion. In European Conference on Computer Vision, pages 16–29. Springer, 2012.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.
-  X. Huang, Y. Li, O. Poursaeed, J. Hopcroft, and S. Belongie. Stacked generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5077–5086, 2017.
-  S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
-  P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint, 2017.
-  N. Jetchev, U. Bergmann, and R. Vollgraf. Texture synthesis with spatial generative adversarial networks. Workshop on Adversarial Training, NIPS, 2016.
-  T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
-  T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2018.
-  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4681–4690, 2017.
-  C. Li and M. Wand. Precomputed real-time texture synthesis with markovian generative adversarial networks. In European Conference on Computer Vision, pages 702–716. Springer, 2016.
-  B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 136–144, 2017.
-  Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pages 3730–3738, 2015.
-  F. Luan, S. Paris, E. Shechtman, and K. Bala. Deep painterly harmonization. arXiv preprint arXiv:1804.03189, 2018.
-  D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In null, page 416. IEEE, 2001.
-  M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440, 2015.
-  R. Mechrez, E. Shechtman, and L. Zelnik-Manor. Saliency driven image manipulation. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1368–1376. IEEE, 2018.
-  R. Mechrez, I. Talmi, and L. Zelnik-Manor. The contextual loss for image transformation with non-aligned data. In Proceedings of the European Conference on Computer Vision (ECCV), pages 768–783, 2018.
-  T. Michaeli and M. Irani. Blind deblurring using internal patch recurrence. In European Conference on Computer Vision, pages 783–798. Springer, 2014.
-  A. Mittal, R. Soundararajan, and A. C. Bovik. Making a “completely blind” image quality analyzer. IEEE Signal Processing Letters, 20(3):209–212, 2013.
-  D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2536–2544, 2016.
-  G. Perarnau, J. van de Weijer, B. Raducanu, and J. M. Álvarez. Invertible conditional GANs for image editing. arXiv preprint arXiv:1611.06355, 2016.
-  P. Sangkloy, J. Lu, C. Fang, F. Yu, and J. Hays. Scribbler: Controlling deep image synthesis with sketch and color. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5400–5409, 2017.
-  A. Shocher, S. Bagon, P. Isola, and M. Irani. Internal distribution matching for natural image retargeting. arXiv preprint arXiv:1812.00231, 2018.
-  A. Shocher, N. Cohen, and M. Irani. “Zero-Shot” Super-Resolution using Deep Internal Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3118–3126, 2018.
-  N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from rgbd images. In European Conference on Computer Vision, pages 746–760. Springer, 2012.
-  D. Simakov, Y. Caspi, E. Shechtman, and M. Irani. Summarizing visual data using bidirectional similarity. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8. IEEE, 2008.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
-  D. Ulyanov, A. Vedaldi, and V. Lempitsky. Deep image prior. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
-  T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. arXiv preprint arXiv:1711.11585, 2017.
-  X. Wang and A. Gupta. Generative image modeling using style and structure adversarial networks. 2016.
-  W. Xian, P. Sangkloy, V. Agrawal, A. Raj, J. Lu, C. Fang, F. Yu, and J. Hays. Texturegan: Controlling deep image synthesis with texture patches. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
-  X. Xu, L. Wan, X. Liu, T.-T. Wong, L. Wang, and C.-S. Leung. Animating animal motion from still. ACM Transactions on Graphics (TOG), 27(5):117, 2008.
-  J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang. Generative image inpainting with contextual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5505–5514, 2018.
-  K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang. Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. IEEE Transactions on Image Processing, 26(7):3142–3155, 2017.
-  R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In European conference on computer vision, pages 649–666. Springer, 2016.
-  B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In Advances in neural information processing systems, pages 487–495, 2014.
-  Y. Zhou, Z. Zhu, X. Bai, D. Lischinski, D. Cohen-Or, and H. Huang. Non-stationary texture synthesis by adversarial expansion. arXiv preprint arXiv:1805.04487, 2018.
-  J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision (ECCV), pages 597–613. Springer, 2016.
-  J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision, 2017.
-  J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman. Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, pages 465–476, 2017.
-  M. Zontak and M. Irani. Internal statistics of a single natural image. In CVPR 2011, pages 977–984. IEEE, 2011.
-  M. Zontak, I. Mosseri, and M. Irani. Separating signal from noise using patch recurrence across scales. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1195–1202, 2013.