A Survey on Adversarial Image Synthesis

06/30/2021 ∙ by William Roy, et al. ∙ 0

Generative Adversarial Networks (GANs) have been extremely successful in various application domains. Adversarial image synthesis has drawn increasing attention and made tremendous progress in recent years because of its wide range of applications in many computer vision and image processing problems. Among the many applications of GAN, image synthesis is the most well-studied one, and research in this area has already demonstrated the great potential of using GAN in image synthesis. In this paper, we provide a taxonomy of methods used in image synthesis, review different models for text-to-image synthesis and image-to-image translation, and discuss some evaluation metrics as well as possible future research directions in image synthesis with GAN.




I Introduction

With recent advances in deep learning, machine learning algorithms have evolved to such an extent that they can compete with and even defeat humans in some tasks, such as image classification on ImageNet. In particular, state-of-the-art generative adversarial networks (GANs) are able to generate high-fidelity natural images of diverse categories. It has been demonstrated that, given proper training, GANs are able to synthesize semantically meaningful data from standard data distributions. The GAN was introduced by Goodfellow et al. in 2014; it performs better than other generative models in producing synthetic images and has since become an active research area in computer vision. The standard GAN contains two neural networks, a generator and a discriminator: the generator attempts to create realistic samples that deceive the discriminator, while the discriminator strives to distinguish the real samples from the fake ones. The training procedure continues until the generator wins the adversarial game. Then, the discriminator decides whether a random sample is fake or real.

From the above basic definition, we see that converting an image from one source domain to another target domain can cover many problems in computer vision and computer graphics. Specifically, GAN models have been broadly applied in image synthesis [46, 86, 28, 53, 64, 69], image segmentation [59, 14, 31], style transfer [84, 23, 62, 67, 40, 73, 54, 70], image inpainting [47, 85, 50, 36, 82], 3D pose estimation [11, 32, 77], image composition [65, 72, 71, 75, 66, 74], image/video colorization [79, 51, 15, 76, 57, 30], image super-resolution [63, 80], and domain adaptation [43, 3, 35, 68]. We will analyze and discuss these related applications in detail in the following section.

In this paper, we provide an empirical comparative study of GAN models for synthetic image generation. We show how GANs can be trained efficiently to learn hidden discriminative features. For a fair comparison between the tested approaches, we used a common framework in Python and TensorFlow to train the models with 4 NVIDIA GeForce GTX 1080 Ti GPUs. The outline of our review contains the definition of the problems, a summary of the main GAN methods, and detailed coverage of the specific solutions. We reserve a detailed section for all the synthetic image generation benchmarks that are related to GANs. Moreover, this survey discusses the advantages and disadvantages of current GAN models along with their mathematical foundations and theoretical analysis. Finally, an accompanying web page, structured according to our taxonomy, serves as a living repository of papers that address GANs for image synthesis problems.

II Related Work

In this section, we first list some of the well-known works that have benefited from GANs, and then we focus on the various applications, including medical imaging and image-to-image translation. A review of the literature shows that few survey papers on GAN architectures and performance are available. The generative model [27, 56, 44] assumes that data is created by a particular distribution that is defined by a small number of parameters (e.g., a Gaussian distribution) or by non-parametric variants (each instance has its own contribution to the distribution), and it approximates that underlying distribution with particular algorithms. This approach enables the generative model to generate data, including unseen data, rather than only discriminate between data (classification) [6] [60] [55]. In this paper, we gather a wide range of GAN models and discuss them in detail. To avoid interruptions in the flow of our exposition, we first present the original GAN definition and then illustrate its variations in the next subsections.

A generator $G$ parameterized by $\theta$ receives random noise $z$ as input and outputs a sample $G(z; \theta)$. Hence, the output can be regarded as a sample drawn from the generator distribution: $G(z; \theta) \sim p_g$. Moreover, a massive amount of training data $x$ is drawn from $p_{data}$, and the objective of $G$ is to approximate $p_{data}$ using $p_g$. Representative GAN models are presented in [12, 1, 39, 48, 41, 13, 81, 2, 19, 42], as well as the intuition behind them. Both models basically aim to construct a replica for generating the desired samples from the latent variable $z$, but their specific approaches are different. Produced samples that are not close to the decision boundary can deliver more gradient when updating the generator, which addresses the vanishing gradient problem of GAN training. It is worth mentioning that there are three key challenges with GAN models.

The generator $G$ takes random noise $z$ sampled from the model's prior distribution and generates a fake image $G(z)$ that fits the distribution of real data as closely as possible. Then, the discriminator $D$ randomly takes a real sample $x$ from the dataset and the fake sample $G(z)$ as input and outputs a probability between 0 and 1, indicating whether the input is a real or fake image. In other words, $D$ wants to discriminate the generated fake sample, while $G$ intends to create samples that confuse $D$. The original GAN proposed by [12] can be considered an unconditional GAN. It adopts the multilayer perceptron (MLP) [45] to construct a structured probabilistic model taking latent noise variables $z$ and observed real data $x$ as inputs. Because the convolutional neural network (CNN) [26] has been demonstrated to be more effective than the MLP in representing image features, the study in [48] proposed deep convolutional generative adversarial networks (DCGANs) to learn a better representation of images and improve on the original GAN performance.
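The adversarial game between the generator and the discriminator can be illustrated with the minimax value function of the original GAN [12], V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]. The following is a minimal sketch only, assuming the discriminator already outputs probabilities; the helper names `discriminator_loss` and `generator_loss` and the sample probabilities are hypothetical and not part of the surveyed work.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Negative of V(D, G): D wants D(x) near 1 on real and D(G(z)) near 0 on fake."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return -(real_term + fake_term)

def generator_loss(d_fake):
    """Minimax generator term: G lowers log(1 - D(G(z))) by pushing D(G(z)) towards 1."""
    return sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)

# Hypothetical discriminator outputs (probability that a sample is real).
confident_d = discriminator_loss(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])
confused_d = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
```

In this toy setting, a discriminator that separates real from fake samples well obtains a lower loss than one that outputs 0.5 everywhere, and the generator term decreases as the discriminator is fooled more often.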

Although GAN is very effective in image synthesis, its training process is very unstable and requires a lot of tricks to get a good result. Besides its instability in training, GAN also suffers from the mode collapse problem, as discussed by Goodfellow et al. In the original GAN formulation, the discriminator does not need to consider the variety of synthetic samples, but only focuses on telling whether each sample is realistic or not, which makes it possible for the generator to concentrate on generating a few samples that are good enough to fool the discriminator. For example, although the MNIST dataset contains images of digits from 0 to 9, in an extreme case, a generator only needs to learn to generate one of the ten digits perfectly to completely fool the discriminator, and then the generator stops trying to generate the other nine digits. The absence of the other nine digits is an example of inter-class mode collapse. An example of intra-class mode collapse is that, although there are many writing styles for each of the digits, the generator only learns to generate one perfect sample for each digit to successfully fool the discriminator.
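Inter-class mode collapse of the kind described above can be checked, in a rough way, by counting how many classes appear among generated samples. The sketch below assumes that each generated image has already been labeled by some pretrained digit classifier; the function names and the label lists are hypothetical. It does not detect intra-class collapse, which would require a diversity measure within each class.

```python
from collections import Counter

def mode_coverage(predicted_labels, num_classes):
    """Fraction of classes that appear at least once among generated samples."""
    seen = {label for label in predicted_labels if 0 <= label < num_classes}
    return len(seen) / num_classes

def class_histogram(predicted_labels):
    """Number of generated samples assigned to each class."""
    return Counter(predicted_labels)

# Hypothetical classifier labels for 100 generated MNIST-like images.
collapsed = [7] * 100                     # generator only produces the digit 7
healthy = [i % 10 for i in range(100)]    # generator covers all ten digits

collapsed_coverage = mode_coverage(collapsed, num_classes=10)
healthy_coverage = mode_coverage(healthy, num_classes=10)
```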

II-A GANs for Image Synthesis

The first application of GAN is image synthesis, whose purpose is to generate images conditioned on certain input. In vanilla image synthesis, the generator takes no additional input from the user, so the output depends only on the random noise $z$. To let users control what images are generated, we must put additional constraints on the input to the generator, e.g., text descriptions or even images. In this section, we briefly discuss unconditional GANs, while paying more attention to text-to-image synthesis and image-to-image synthesis. GAN models that perform unconditional image synthesis are those that require no additional information from users, only the sampled random vector $z$. Such methods focus more on improving the quality of generated images without additional input, or on studying the theory of GANs so as to overcome deficiencies such as training instability and mode collapse. Even though the output images of these methods appear less visually impressive than those of many conditional GAN applications, these models provide many insights that help improve the performance of other GAN applications.

Moving from MLPs to convolutional neural networks (CNNs) is a natural fit for image data. Previous experiments have shown that it is extremely difficult to train $G$ and $D$ using CNNs, mostly due to five reasons: non-convergence, diminished gradients, imbalance between the generator and the discriminator, mode collapse, and hyperparameter selection. One solution is to use a Laplacian pyramid of adversarial networks. In this model, a real image is converted into a multi-scale pyramid, and a convolutional GAN is trained to produce multi-scale, multi-level feature maps, where the final feature map is derived by combining all of them. The Laplacian pyramid is a linear invertible image representation consisting of band-pass images and a low-frequency residual.
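The invertibility of the Laplacian pyramid mentioned above can be illustrated with a small, dependency-free sketch. It assumes a simplified pyramid in which 2x2 average pooling and nearest-neighbour upsampling stand in for the blur-and-resample filters normally used; the function names and the toy image are hypothetical, and no adversarial training is involved.

```python
def downsample(img):
    """Halve resolution by averaging each 2x2 block (stand-in for blur + subsample)."""
    h, w = len(img), len(img[0])
    return [[(img[2 * i][2 * j] + img[2 * i][2 * j + 1]
              + img[2 * i + 1][2 * j] + img[2 * i + 1][2 * j + 1]) / 4.0
             for j in range(w // 2)] for i in range(h // 2)]

def upsample(img):
    """Double resolution by nearest-neighbour repetition."""
    return [[img[i // 2][j // 2] for j in range(2 * len(img[0]))]
            for i in range(2 * len(img))]

def subtract(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def build_pyramid(img, levels):
    """Return (band-pass images ordered fine to coarse, low-frequency residual)."""
    bands, current = [], img
    for _ in range(levels):
        smaller = downsample(current)
        bands.append(subtract(current, upsample(smaller)))
        current = smaller
    return bands, current

def reconstruct(bands, residual):
    """Invert the pyramid: upsample and add back each band, coarse to fine."""
    current = residual
    for band in reversed(bands):
        current = add(upsample(current), band)
    return current

# Toy 8x8 "image" with a deterministic pattern.
image = [[float((3 * r + 5 * c) % 7) for c in range(8)] for r in range(8)]
bands, residual = build_pyramid(image, levels=2)
restored = reconstruct(bands, residual)
```

Because each band stores exactly what is lost by downsampling and upsampling, adding the bands back to the upsampled residual recovers the original image. In the adversarial setting, a generator at each level would produce such a band instead of it being computed from a real image.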

The model proceeds level by level, and at the final level it produces a residual image by expanding a noise vector $z$. Radford et al. introduced a deep convolutional GAN that enables smooth training for both G and D. This model uses strided and fractionally-strided convolution layers, which allow the spatial down-sampling and up-sampling operators to be learned during training. The role of these operators is to manage the changes in sample distributions and rates. For 3D synthetic data generation, Wu et al. presented an architecture that uses an auto-encoder and long-range context information to directly reconstruct 3D objects from 2D input images. However, this work suffers from high computational cost. Guibas et al. proposed a new two-stage model that uses a dual network for generating synthetic medical images. Although the model has a lightweight network, its results are limited and the network is trained on a small dataset.
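To make the role of strided and fractionally-strided convolutions concrete, the sketch below computes the spatial output size of each layer type using the standard size formulas (no dilation). The kernel size 4, stride 2, and padding 1 are hypothetical settings chosen for illustration, not values taken from the surveyed papers.

```python
def conv_output_size(n, kernel, stride, padding):
    """Spatial size after a strided convolution (learned down-sampling)."""
    return (n + 2 * padding - kernel) // stride + 1

def transposed_conv_output_size(n, kernel, stride, padding, output_padding=0):
    """Spatial size after a fractionally-strided (transposed) convolution."""
    return (n - 1) * stride - 2 * padding + kernel + output_padding

# Hypothetical generator: four up-sampling layers starting from a 4x4 feature map.
sizes_up = [4]
for _ in range(4):
    sizes_up.append(transposed_conv_output_size(sizes_up[-1], kernel=4, stride=2, padding=1))

# Hypothetical discriminator: four down-sampling layers starting from a 64x64 image.
sizes_down = [64]
for _ in range(4):
    sizes_down.append(conv_output_size(sizes_down[-1], kernel=4, stride=2, padding=1))
```

With these settings each generator layer doubles the resolution and each discriminator layer halves it, so the two networks mirror each other without any fixed pooling or interpolation step.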

Although learning settings may differ, most of these image synthesis techniques tend to learn a deterministic one-to-one mapping and only generate single-modal output. However, in practice, the two-domain image synthesis is inherently ambiguous, as one input image may correspond to multiple possible outputs, namely, multimodal outputs. Multimodal image synthesis translates the input image from one domain to a distribution of potential outputs in the target domain while remaining faithful to the input. These diverse outputs represent different color or style texture themes (i.e., multimodal) but still preserve the similar semantic content as the input source image.

Zhou et al. introduced a normalization technique for conditional GANs that restricts the search space of the weights to a low-dimensional manifold. In another work, the authors proposed a conditional adversarial network for energy management systems. Their method is shown to converge faster in terms of the number of epochs, but the authors did not discuss the model complexity. Odena et al. proposed a novel GAN classifier (ACGAN) whose architecture is similar to InfoGAN. In this model, the condition variable $c$ is not fed to the discriminator, and an external classifier is applied to predict the probability over the class labels. The loss function is optimized to improve the class prediction. Class conditioning is applied in the hidden space to steer the generation procedure towards the target class. The generator in BAGAN is adjusted with the encoder module, which enables it to learn in the hidden space. The structure of BAGAN is similar to InfoGAN and ACGAN. However, BAGAN produces only a single output, whereas InfoGAN and ACGAN have two outputs.

Karacan et al. presented a deep conditional GAN model that draws its strength from the semantic layout and scene attributes integrated as conditioning variables. This approach is able to produce realistic images under different situations, with clear object edges. Moreover, SPADE [46] proposes a spatially-adaptive normalization layer to further improve the quality of the synthesized images. However, SPADE uses only one style code to control the entire style of an image and inserts style information only at the beginning of the network. Zhang et al. [78] proposed an exemplar-based image synthesis framework that translates images by establishing dense semantic correspondence between cross-domain images. However, the semantic matching process may lead to a prohibitive memory footprint when estimating a high-resolution correspondence. SEAN [86] designs a semantic region-adaptive normalization layer to alleviate the two shortcomings of SPADE.

Another method, BicycleGAN, combines the cVAE-GAN [17, 24, 25] and cLR-GAN [5, 7, 9] models to generate diverse and realistic outputs. In detail, it utilizes cVAE-GAN [17, 24, 25] to encode the ground-truth target image $B$ into a latent space, and the generator uses the input source image $A$ conditioned on a randomly sampled latent code $z$ to produce the translated image $\hat{B}$. The process can be denoted as $B \to z \to \hat{B}$. Then, it exploits the cLR-GAN model [5, 7, 9], which generates a translated output image $\hat{B}$ from the input $A$ and a random latent code $z$ and then tries to reconstruct the latent code $\hat{z}$ from $\hat{B}$. Similarly, the training process can be denoted as $z \to \hat{B} \to \hat{z}$. By combining the two objectives into a hybrid model, BicycleGAN can generate diverse and realistic outputs.

Another approach trains a CNN-based regressor to regress an overly smoothed image from an incomplete input, and then performs pixel-wise nearest-neighbor queries to match pixels using multi-scale descriptors, generating multiple high-quality, high-frequency outputs in a controllable manner. A further method first employs cross-domain autoencoders to obtain disentangled components of source and target images, where exclusive representations are learned by a Gradient Reversal Layer (GRL) and shared representations are learned by the $L_1$ loss with noise added to the features of the two domains. The multi-modal translated results are obtained by using random noise as the exclusive representation.

In progressive GANs [20], the model expands the architecture of the standard network, an idea drawn from progressive neural networks. This model achieves high performance because it can receive additional leverage via lateral connections to previously learned features. This architecture is widely used for extracting complex features. For training, the model starts with low-resolution images and progressively grows until it reaches the desired resolution. It is worth mentioning that, during this growing process, all the variables remain trainable. This progressive training strategy helps stabilize learning, and several state-of-the-art GANs have adopted it to improve their overall performance. Heljakka et al. adapted progressive GANs to an autoencoder network for image reconstruction. The authors claim that this model has promising results in image synthesis and inference. However, the model is only evaluated on the CelebA dataset, and its efficiency is not evaluated.

Taigman et al. [52] present a domain transfer network (DTN) for unpaired cross-domain image generation that assumes a constant latent space between two domains, which can generate images in the style of the target domain while preserving their identity. DualGAN [61] follows the idea of dual learning in neural machine translation. Another method maps a pair of source and target images into the same latent code in a shared latent space; it uses a VAE-GAN network to model each image domain and achieves cross-domain translation through a cycle-consistency constraint and a weight-sharing constraint. Their model can effectively handle translation with holistic changes or large shape changes. A coarse-to-fine approach first sketches a coarse result at low resolution using an architecture similar to CycleGAN [84], and then deploys a stacked structure to refine the result with more details and higher resolution through multiple refinement processes.
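The progressive training strategy discussed above can be sketched as a resolution schedule plus a smooth transition between resolutions. The fade-in blend below follows the idea from the progressive growing paper [20] but is an assumption here, since this survey does not spell it out; the function names, the blending weight `alpha`, and the toy arrays are all hypothetical.

```python
def resolution_schedule(start, final):
    """Resolutions visited during training, doubling at each growth step."""
    resolutions = [start]
    while resolutions[-1] < final:
        resolutions.append(resolutions[-1] * 2)
    return resolutions

def upsample(img):
    """Double resolution by nearest-neighbour repetition."""
    return [[img[i // 2][j // 2] for j in range(2 * len(img[0]))]
            for i in range(2 * len(img))]

def fade_in(low_res_output, high_res_output, alpha):
    """Blend the upsampled old output with the new layer's output.

    alpha = 0 uses only the old (low-resolution) path,
    alpha = 1 uses only the newly added (high-resolution) path.
    """
    up = upsample(low_res_output)
    return [[(1.0 - alpha) * u + alpha * h for u, h in zip(row_u, row_h)]
            for row_u, row_h in zip(up, high_res_output)]

schedule = resolution_schedule(start=4, final=64)

# Toy outputs of the old 2x2 path and the new 4x4 path.
low = [[1.0, 2.0], [3.0, 4.0]]
high = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
halfway = fade_in(low, high, alpha=0.5)
```

Ramping `alpha` from 0 to 1 lets the newly added layers take over gradually, which is one way to keep all variables trainable while the network grows.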

While the cycle-consistency constraint removes the dependence on supervised paired data, it tends to force the model to generate a translated image that contains all the information of the input image so that the input image can be reconstructed. Therefore, one line of work builds its generator on CycleGAN [84] but includes residual blocks at multiple layers of both the decoder and the encoder to learn both higher and lower spatial resolution features. Recently, Katzir et al. [21] mitigated shape translation in a cascaded, deep-to-shallow fashion, in which they exploited the deep features extracted from a pretrained VGG-19 and translated them at the feature level. They showed that the deep features obtained by descending the deep layers of a pretrained network can represent larger receptive fields in image space and encode high-level semantic content.
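The cycle-consistency constraint referred to above penalizes the difference between an input and its reconstruction after a round trip through both translators. The following is a minimal sketch that uses an L1 distance and toy list-based "translators" in place of real generators; every function name and value is hypothetical.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equally sized vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def cycle_consistency_loss(x, forward, backward):
    """L1 distance between x and its reconstruction backward(forward(x))."""
    return l1_distance(backward(forward(x)), x)

# Toy stand-ins for the two generators (source -> target and target -> source).
def to_target(x):
    return [2.0 * v + 1.0 for v in x]

def to_source_exact(y):
    return [(v - 1.0) / 2.0 for v in y]   # exact inverse of to_target

def to_source_lossy(y):
    return [0.0 for _ in y]               # discards all input information

sample = [0.5, -1.0, 2.0]
exact_loss = cycle_consistency_loss(sample, to_target, to_source_exact)
lossy_loss = cycle_consistency_loss(sample, to_target, to_source_lossy)
```

The loss is zero only when the backward translator can recover the input, which is why this constraint pushes the translated image to retain all the information of the input image.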

We have introduced disentangled representations [5, 16, 22] for multi-modal outputs in supervised image synthesis. They disentangle an image into two kinds of features: domain-invariant content features and domain-specific style features. Certainly, such representations are also beneficial to unsupervised multi-modal settings. INIT is based on the backbone of MUNIT [18] and DRIT [29]. Specifically, it first disentangles content and style on objects and on the entire image using object coordinates. Then, it generates style codes from objects, background, and entire images to build a style code bank. Through a swapping or content-association strategy across multi-granularity areas and cyclic reconstruction, INIT generates translated images, and different global styles yield more diverse results.

Chang et al. [4] argue that the shared domain-invariant content space in [18, 29] could limit the ability to represent content, since it ignores the relationship between content and style. They present DSMAP, which leverages two extra domain-specific mappings to remap the content features from the shared domain-invariant content space to two independent domain-specific content spaces for the two domains.

Recently, Lin et al. [33] argue that multi-modal translated images yield worse detection accuracy than single-output methods. They therefore propose Multimodal AugGAN, a structure-consistent image synthesis network that generates diverse outputs with fewer artifacts. It translates each existing detection training image from its original domain to diverse results, each of which possesses a different degree of transformation in the target domain.

One method tackles the multi-domain image synthesis problem with a globally shared variational autoencoder and domain-specific component banks. Each bank consists of an encoder and a decoder for one domain. The authors adopt a weight-sharing constraint in the last few layers of the encoder tuples and the first few layers of the decoder tuples to map arbitrary image pairs into a shared latent space and then remap the latent code to images. For each domain, they also define discriminator tuples to identify the translated images. ModularGAN consists of four kinds of modules: an encoder module that encodes an input image into an intermediate feature map, multiple transformer modules that each modify a certain attribute of the feature map, a reconstruction module that reconstructs the image from an intermediate feature map, and multiple discriminator modules that determine whether an image is real or fake and predict the attributes of the input image. In the test phase, ModularGAN dynamically combines different transformers according to the translation task to sequentially manipulate any number of attributes in arbitrary order. Given multiple domains, CollaGAN translates source images to the target image using a single generator via a collaborative mapping whose input is the complementary set of images from the other domains. CollaGAN is trained with a multiple cycle consistency loss, a discriminator loss (which contains a classification loss), and a structural similarity index loss.

Rather than introducing an auxiliary domain classifier, Lin et al. [34] propose introducing an additional auxiliary domain and constructing a multipath consistency loss for multi-domain image synthesis. Their work is motivated by an important property, namely, that the direct translation (i.e., one-hop translation) from brown hair to blonde should ideally be identical to the indirect translation (i.e., two-hop translation) from brown to black to blonde. Their multipath consistency loss evaluates the differences between direct two-domain translation and indirect multiple-domain translations that pass through the auxiliary domain. The method regularizes the training of each task and obtains better performance. Different from other face synthesis models, which can only generate a discrete number of expressions, GANimation can synthesize continuous anatomical facial movements by controlling the magnitude of activation of each Action Unit (AU). In detail, an image and a vector representing the activation of the AUs are input to the generator, and then attention and color masks are regressed respectively. The desired image is synthesized by combining the two masks.
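The multipath consistency idea described above compares a one-hop translation with a two-hop translation through an auxiliary domain. The sketch below is a toy illustration that assumes an L1 distance and models each hair-color domain as a scalar offset; the translator functions, the `OFFSETS` table, and the numeric values are hypothetical and do not reproduce the method of [34].

```python
def l1_distance(a, b):
    """Mean absolute difference between two equally sized vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def multipath_consistency_loss(x, translate, source, target, auxiliary):
    """Difference between the direct and the two-hop translation of x."""
    direct = translate(x, source, target)
    two_hop = translate(translate(x, source, auxiliary), auxiliary, target)
    return l1_distance(direct, two_hop)

# Toy domains: each domain shifts the "image" by a fixed offset.
OFFSETS = {"brown": 0.0, "black": -1.0, "blonde": 2.0}

def consistent_translate(x, src, dst):
    return [v - OFFSETS[src] + OFFSETS[dst] for v in x]

def drifting_translate(x, src, dst):
    # Adds a small artifact on every hop, so two hops drift away from one hop.
    return [v - OFFSETS[src] + OFFSETS[dst] + 0.1 for v in x]

image = [0.25, 0.5, 0.75]
consistent_loss = multipath_consistency_loss(
    image, consistent_translate, "brown", "blonde", "black")
drifting_loss = multipath_consistency_loss(
    image, drifting_translate, "brown", "blonde", "black")
```

A translator that accumulates artifacts on each hop is penalized, while one whose direct and indirect paths agree incurs no loss, which is the regularizing effect the paragraph above describes.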

Image synthesis refers to creating new images from forms of description. The most common applications in image synthesis are primarily realistic-looking image synthesis, image manipulation, filling in missing pixels and art creations. For realistic-looking image synthesis, the related image synthesis works tend to generate photos of real-world scenes given different forms of input data. A typical task involves translating a semantic segmentation mask [46, 86, 28, 53] into real-world images, that is, semantic synthesis. Person image synthesis, including virtual try-on [58, 37, 49, 38, 10, 8] and skeleton-to-person translation [78, 83], learns to translate an image of a person to another image of the same person with a new outfit as well as diverse poses by manipulating the target clothes and pose.

III Conclusion and Discussion

This paper reviewed the existing GAN variants for synthetic image generation based on architecture, performance, and training stability. We also reviewed the architectures, loss functions, and datasets that are generally used in current GAN-related research on synthetic image generation. In particular, it is difficult, yet important, for image synthesis tasks to explicitly define the loss. For instance, to perform style transfer, it is difficult to set a loss function that evaluates how well an image matches a certain style. Each input image in synthetic image generation may have several legitimate outputs; however, these outputs may not cover all the conditions. For synthetic image generation, several recent supervised and unsupervised methods have been reviewed, and their strengths and weaknesses have been discussed.

Although we have conducted several experimental evaluations, research on GANs for synthetic image generation still lacks a thorough study of domain adaptation and transfer learning. In addition, the computer vision community would benefit from an extension of this practical study that compares, in addition to accuracy, the training and testing time of these models. Moreover, we think that the effect of normalization models on the learning capabilities of CNNs should also be thoroughly explored. At the time of this writing, there are a few published works on using GANs for video, time series generation, and natural language processing. Future research should be directed towards investigating the use of GANs in those fields as well as others.


  • [1] M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein gan. arXiv preprint arXiv:1701.07875. Cited by: §II.
  • [2] D. Berthelot, T. Schumm, and L. Metz (2017) BEGAN: boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717. Cited by: §II.
  • [3] J. Cao, O. Katzir, P. Jiang, D. Lischinski, D. Cohen-Or, C. Tu, and Y. Li (2018) DiDA: disentangled synthesis for domain adaptation. External Links: 1805.08019 Cited by: §I.
  • [4] H. Chang, Z. Wang, and Y. Chuang (2020) Domain-specific mappings for generative adversarial style transfer. External Links: 2008.02198 Cited by: §II-A.
  • [5] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel (2016) Infogan: interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pp. 2172–2180. Cited by: §II-A, §II-A.
  • [6] H. Chu, C. Yeh, and Y. Frank Wang (2018) Deep generative models for weakly-supervised multi-label classification. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 400–415. Cited by: §II.
  • [7] J. Donahue, P. Krähenbühl, and T. Darrell (2016) Adversarial feature learning. arXiv preprint arXiv:1605.09782. Cited by: §II-A.
  • [8] H. Dong, X. Liang, X. Shen, B. Wang, H. Lai, J. Zhu, Z. Hu, and J. Yin (2019-10) Towards multi-pose guided virtual try-on network. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Cited by: §II-A.
  • [9] V. Dumoulin, I. Belghazi, B. Poole, A. Lamb, M. Arjovsky, O. Mastropietro, and A. Courville (2016) Adversarially learned inference. arXiv preprint arXiv:1606.00704. Cited by: §II-A.
  • [10] P. Esser, E. Sutter, and B. Ommer (2018-06) A variational u-net for conditional appearance and shape generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §II-A.
  • [11] H. Fish Tung, A. W. Harley, W. Seto, and K. Fragkiadaki (2017-10) Adversarial inverse graphics networks: learning 2d-to-3d lifting and image-to-image translation from unpaired supervision. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Cited by: §I.
  • [12] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems (NIPS), pp. 2672–2680. Cited by: §II, §II.
  • [13] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pp. 5767–5777. Cited by: §II.
  • [14] X. Guo, Z. Wang, Q. Yang, W. Lv, X. Liu, Q. Wu, and J. Huang (2020) GAN-based virtual-to-real image translation for urban scene semantic segmentation. Neurocomputing 394, pp. 127–135. External Links: ISSN 0925-2312 Cited by: §I.
  • [15] M. He, D. Chen, J. Liao, P. V. Sander, and L. Yuan (2018) Deep exemplar-based colorization. ACM Transactions on Graphics (TOG) 37 (4), pp. 1–16. Cited by: §I.
  • [16] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner (2016) Beta-vae: learning basic visual concepts with a constrained variational framework. Cited by: §II-A.
  • [17] G. E. Hinton and R. R. Salakhutdinov (2006) Reducing the dimensionality of data with neural networks. science 313 (5786), pp. 504–507. Cited by: §II-A.
  • [18] X. Huang, M. Liu, S. Belongie, and J. Kautz (2018) Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 172–189. Cited by: §II-A, §II-A.
  • [19] A. Jolicoeur-Martineau (2018) The relativistic discriminator: a key element missing from standard gan. arXiv preprint arXiv:1807.00734. Cited by: §II.
  • [20] T. Karras, T. Aila, S. Laine, and J. Lehtinen (2017) Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196. Cited by: §II-A.
  • [21] O. Katzir, D. Lischinski, and D. Cohen-Or (2019) Cross-domain cascaded deep feature translation. arXiv, pp. arXiv–1906. Cited by: §II-A.
  • [22] H. Kim and A. Mnih (2018) Disentangling by factorising. arXiv preprint arXiv:1802.05983. Cited by: §II-A.
  • [23] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim (2017) Learning to discover cross-domain relations with generative adversarial networks. arXiv preprint arXiv:1703.05192. Cited by: §I.
  • [24] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §II-A.
  • [25] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther (2016) Autoencoding beyond pixels using a learned similarity metric. In International conference on machine learning, pp. 1558–1566. Cited by: §II-A.
  • [26] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel (1989) Backpropagation applied to handwritten zip code recognition. Neural computation 1 (4), pp. 541–551. Cited by: §II.
  • [27] Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang (2006) A tutorial on energy-based learning. Predicting structured data 1 (0). Cited by: §II.
  • [28] C. Lee, Z. Liu, L. Wu, and P. Luo (2020) Maskgan: towards diverse and interactive facial image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5549–5558. Cited by: §I, §II-A.
  • [29] H. Lee, H. Tseng, J. Huang, M. Singh, and M. Yang (2018) Diverse image-to-image translation via disentangled representations. In Proceedings of the European conference on computer vision (ECCV), pp. 35–51. Cited by: §II-A, §II-A.
  • [30] J. Lee, E. Kim, Y. Lee, D. Kim, J. Chang, and J. Choo (2020-06) Reference-based sketch image colorization using augmented-self reference and dense semantic correspondence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §I.
  • [31] R. Li, W. Cao, Q. Jiao, S. Wu, and H. Wong (2020) Simplified unsupervised image translation for semantic segmentation adaptation. Pattern Recognition 105, pp. 107343. External Links: ISSN 0031-3203 Cited by: §I.
  • [32] S. Li, S. Gunel, M. Ostrek, P. Ramdya, P. Fua, and H. Rhodin (2020-06) Deformation-aware unpaired image translation for pose estimation on laboratory animals. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §I.
  • [33] C. Lin, Y. Wu, P. Hsu, and S. Lai (2020) Multimodal structure-consistent image-to-image translation.. In AAAI, pp. 11490–11498. Cited by: §II-A.
  • [34] J. Lin, Y. Xia, Y. Wang, T. Qin, and Z. Chen (2019) Image-to-image translation with multi-path consistency regularization. arXiv preprint arXiv:1905.12498. Cited by: §II-A.
  • [35] A. H. Liu, Y. Liu, Y. Yeh, and Y. F. Wang (2018) A unified feature disentangler for multi-domain image translation and manipulation. In Advances in neural information processing systems, pp. 2590–2599. Cited by: §I.
  • [36] M. Liu, X. Huang, A. Mallya, T. Karras, T. Aila, J. Lehtinen, and J. Kautz (2019-10) Few-shot unsupervised image-to-image translation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Cited by: §I.
  • [37] L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars, and L. Van Gool (2017) Pose guided person image generation. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 406–416. Cited by: §II-A.
  • [38] L. Ma, Q. Sun, S. Georgoulis, L. Van Gool, B. Schiele, and M. Fritz (2018-06) Disentangled person image generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §II-A.
  • [39] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley (2017) Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2794–2802. Cited by: §II.
  • [40] Y. A. Mejjati, C. Richardt, J. Tompkin, D. Cosker, and K. I. Kim (2018) Unsupervised attention-guided image-to-image translation. In Advances in Neural Information Processing Systems, pp. 3693–3703. Cited by: §I.
  • [41] M. Mirza and S. Osindero (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. Cited by: §II.
  • [42] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018) Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957. Cited by: §II.
  • [43] Z. Murez, S. Kolouri, D. Kriegman, R. Ramamoorthi, and K. Kim (2018-06) Image to image translation for domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §I.
  • [44] A. Oussidi and A. Elhassouny (2018) Deep generative models: survey. In 2018 International Conference on Intelligent Systems and Computer Vision (ISCV), pp. 1–8. Cited by: §II.
  • [45] S. K. Pal and S. Mitra (1992) Multilayer perceptron, fuzzy sets, classification. Cited by: §II.
  • [46] T. Park, M. Liu, T. Wang, and J. Zhu (2019-06) Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §I, §II-A, §II-A.
  • [47] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros (2016-06) Context encoders: feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §I.
  • [48] A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. Cited by: §II, §II.
  • [49] A. Siarohin, E. Sangineto, S. Lathuiliere, and N. Sebe (2018) Deformable GANs for pose-based human image generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3408–3416. Cited by: §II-A.
  • [50] Y. Song, C. Yang, Z. Lin, X. Liu, Q. Huang, H. Li, and C.-C. J. Kuo (2018-09) Contextual-based image inpainting: infer, match, and translate. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §I.
  • [51] P. L. Suárez, A. D. Sappa, and B. X. Vintimilla (2017) Infrared image colorization based on a triplet DCGAN architecture. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 18–23. Cited by: §I.
  • [52] Y. Taigman, A. Polyak, and L. Wolf (2016) Unsupervised cross-domain image generation. arXiv preprint arXiv:1611.02200. Cited by: §II-A.
  • [53] H. Tang, D. Xu, Y. Yan, P. H.S. Torr, and N. Sebe (2020-06) Local class-specific and global image-level generative adversarial networks for semantic-guided scene generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §I, §II-A.
  • [54] M. Tomei, M. Cornia, L. Baraldi, and R. Cucchiara (2019) Art2real: unfolding the reality of artworks via semantically-aware image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5849–5859. Cited by: §I.
  • [55] M. Tschannen, E. Agustsson, and M. Lucic (2018) Deep generative models for distribution-preserving lossy compression. In Advances in Neural Information Processing Systems, pp. 5929–5940. Cited by: §II.
  • [56] J. Xu, H. Li, and S. Zhou (2015) An overview of deep generative models. IETE Technical Review 32 (2), pp. 131–139. Cited by: §II.
  • [57] Z. Xu, T. Wang, F. Fang, Y. Sheng, and G. Zhang (2020) Stylization-based architecture for fast deep exemplar colorization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9363–9372. Cited by: §I.
  • [58] Y. Yan, J. Xu, B. Ni, W. Zhang, and X. Yang (2017) Skeleton-aided articulated motion generation. In Proceedings of the 25th ACM International Conference on Multimedia, MM ’17, New York, NY, USA, pp. 199–207. External Links: ISBN 9781450349062 Cited by: §II-A.
  • [59] Q. Yang, N. Li, Z. Zhao, X. Fan, E. I. Chang, and Y. Xu (2018) MRI cross-modality neuroimage-to-neuroimage translation. arXiv preprint arXiv:1801.06940. Cited by: §I.
  • [60] R. A. Yeh, C. Chen, T. Yian Lim, A. G. Schwing, M. Hasegawa-Johnson, and M. N. Do (2017) Semantic image inpainting with deep generative models. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5485–5493. Cited by: §II.
  • [61] Z. Yi, H. Zhang, P. Tan, and M. Gong (2017) DualGAN: unsupervised dual learning for image-to-image translation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2849–2857. Cited by: §II-A.
  • [62] Z. Yi, H. Zhang, P. Tan, and M. Gong (2017) DualGAN: unsupervised dual learning for image-to-image translation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2849–2857. Cited by: §I.
  • [63] Y. Yuan, S. Liu, J. Zhang, Y. Zhang, C. Dong, and L. Lin (2018-06) Unsupervised image super-resolution using cycle-in-cycle generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Cited by: §I.
  • [64] F. Zhan, S. Lu, and C. Xue (2018) Verisimilar image synthesis for accurate detection and recognition of texts in scenes. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 249–266. Cited by: §I.
  • [65] F. Zhan, S. Lu, C. Zhang, F. Ma, and X. Xie (2020) Adversarial image composition with auxiliary illumination. In Proceedings of the Asian Conference on Computer Vision, Cited by: §I.
  • [66] F. Zhan, S. Lu, C. Zhang, F. Ma, and X. Xie (2020) Towards realistic 3D embedding via view alignment. arXiv preprint arXiv:2007.07066. Cited by: §I.
  • [67] F. Zhan and S. Lu (2019) ESIR: end-to-end scene text recognition via iterative image rectification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2059–2068. Cited by: §I.
  • [68] F. Zhan, C. Xue, and S. Lu (2019) GA-DAN: geometry-aware domain adaptation network for scene text detection and recognition. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9105–9115. Cited by: §I.
  • [69] F. Zhan, Y. Yu, K. Cui, G. Zhang, S. Lu, J. Pan, C. Zhang, F. Ma, X. Xie, and C. Miao (2021) Unbalanced feature transport for exemplar-based image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §I.
  • [70] F. Zhan, Y. Yu, R. Wu, K. Cui, A. Xiao, S. Lu, and L. Shao (2021) Bi-level feature alignment for semantic image translation & manipulation. arXiv preprint. Cited by: §I.
  • [71] F. Zhan, Y. Yu, R. Wu, C. Zhang, S. Lu, L. Shao, F. Ma, and X. Xie (2021) GMLight: lighting estimation via geometric distribution approximation. arXiv preprint arXiv:2102.10244. Cited by: §I.
  • [72] F. Zhan, C. Zhang, Y. Yu, Y. Chang, S. Lu, F. Ma, and X. Xie (2020) EMLight: lighting estimation via spherical distribution approximation. arXiv preprint arXiv:2012.11116. Cited by: §I.
  • [73] F. Zhan and C. Zhang (2020) Spatial-aware GAN for unsupervised person re-identification. In Proceedings of the International Conference on Pattern Recognition. Cited by: §I.
  • [74] F. Zhan, H. Zhu, and S. Lu (2019) Scene text synthesis for efficient and effective deep network training. arXiv preprint arXiv:1901.09193. Cited by: §I.
  • [75] F. Zhan, H. Zhu, and S. Lu (2019) Spatial fusion GAN for image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3653–3662. Cited by: §I.
  • [76] B. Zhang, M. He, J. Liao, P. V. Sander, L. Yuan, A. Bermak, and D. Chen (2019) Deep exemplar-based video colorization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8052–8061. Cited by: §I.
  • [77] C. Zhang, F. Zhan, and Y. Chang (2021) Deep monocular 3D human pose estimation via cascaded dimension-lifting. arXiv preprint arXiv:2104.03520. Cited by: §I.
  • [78] P. Zhang, B. Zhang, D. Chen, L. Yuan, and F. Wen (2020) Cross-domain correspondence learning for exemplar-based image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5143–5153. Cited by: §II-A, §II-A.
  • [79] R. Zhang, J. Zhu, P. Isola, X. Geng, A. S. Lin, T. Yu, and A. A. Efros (2017) Real-time user-guided image colorization with learned deep priors. arXiv preprint arXiv:1705.02999. Cited by: §I.
  • [80] Y. Zhang, S. Liu, C. Dong, X. Zhang, and Y. Yuan (2019) Multiple cycle-in-cycle generative adversarial networks for unsupervised image super-resolution. IEEE Transactions on Image Processing 29, pp. 1101–1112. Cited by: §I.
  • [81] J. Zhao, M. Mathieu, and Y. LeCun (2016) Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126. Cited by: §II.
  • [82] L. Zhao, Q. Mo, S. Lin, Z. Wang, Z. Zuo, H. Chen, W. Xing, and D. Lu (2020-06) UCTGAN: diverse image inpainting based on unsupervised cross-space translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §I.
  • [83] X. Zhou, B. Zhang, T. Zhang, P. Zhang, J. Bao, D. Chen, Z. Zhang, and F. Wen (2020) Full-resolution correspondence learning for image translation. arXiv preprint arXiv:2012.02047. Cited by: §II-A.
  • [84] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232. Cited by: §I, §II-A, §II-A.
  • [85] J. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman (2017) Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, pp. 465–476. Cited by: §I.
  • [86] P. Zhu, R. Abdal, Y. Qin, and P. Wonka (2020-06) SEAN: image synthesis with semantic region-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §I, §II-A, §II-A.