Domain mapping, or image-to-image translation, which aims to translate an image from one domain to another, has been intensively investigated over the past few years. Let $X$ denote a random variable representing source domain images and $Y$ represent target domain images. Depending on whether we have access to paired samples $\{(x_i, y_i)\}$, domain mapping can be studied in a supervised or unsupervised manner. While several works have produced high-quality translations by focusing on supervised domain mapping with constraints provided by cross-domain image pairs [43, 24, 56, 55], progress in unsupervised domain mapping has been relatively slow. Unfortunately, obtaining paired training examples is expensive and even infeasible in some situations. For example, if we want to learn translators between Monet’s paintings and photographs, how can we collect sufficient well-defined (Monet’s painting, photograph) pairs for model training? By contrast, collecting unpaired sets is often convenient since a practically unlimited number of images is available online. From this viewpoint, unsupervised domain mapping has great potential for real-world applications in the long term.
In unsupervised domain mapping, from a probabilistic modeling perspective, our goal is to model the joint distribution $P_{XY}$ given samples drawn from the marginal distributions $P_X$ and $P_Y$ in the individual domains. Since infinitely many joint distributions share the same two marginals, it is difficult to guarantee that an individual input and its output are paired up in a meaningful way without additional assumptions or constraints.
To address this problem, recent approaches have exploited the cycle-consistency assumption, i.e., a mapping $G_{XY}$ and its inverse mapping $G_{YX}$ should be bijections [62, 26, 58]. Specifically, when feeding an example $x$ into the composed networks $G_{YX} \circ G_{XY}$, the output should be a reconstruction of $x$, and vice versa for $y$, i.e., $G_{YX}(G_{XY}(x)) \approx x$ and $G_{XY}(G_{YX}(y)) \approx y$. Further, DistanceGAN [5] showed that maintaining the distances between images within domains allows one-sided unsupervised domain mapping rather than simultaneously learning both $G_{XY}$ and $G_{YX}$.
Existing constraints overlook a special property of images: simple geometric transformations (global geometric transformations without shape deformation) do not change an image’s semantic structure. Here, semantic structure refers to the information that distinguishes different object/stuff classes, which humans can easily perceive regardless of trivial geometric transformations such as rotation. Based on this property, we develop a geometry-consistency constraint, which reduces the search space of possible solutions while still keeping the correct set of solutions under consideration, and results in a geometry-consistent generative adversarial network (GcGAN) for unsupervised domain mapping.
Our geometry-consistency constraint is motivated by the fact that a given geometric transformation $f(\cdot)$ between the input images should be preserved by the related translators $G_{XY}$ and $G_{\tilde{X}\tilde{Y}}$, if $\tilde{\mathcal{X}}$ and $\tilde{\mathcal{Y}}$ are the domains obtained by applying $f(\cdot)$ to the examples of $\mathcal{X}$ and $\mathcal{Y}$, respectively. Mathematically, given a random example $x$ from the source domain $\mathcal{X}$ and a predefined geometric transformation function $f(\cdot)$, geometry consistency can be expressed as $f(G_{XY}(x)) \approx G_{\tilde{X}\tilde{Y}}(f(x))$ and $f^{-1}(G_{\tilde{X}\tilde{Y}}(f(x))) \approx G_{XY}(x)$, where $f^{-1}(\cdot)$ is the inverse function of $f(\cdot)$. Because it is unlikely that $G_{XY}$ and $G_{\tilde{X}\tilde{Y}}$ always fail in the same location, the two translators co-regularize each other through the geometry-consistency constraint and thus correct each other’s failures in local regions of their respective translations (see Figure 1 for an illustrative example). Our geometry-consistency constraint allows one-sided unsupervised domain mapping, i.e., $G_{XY}$ can be trained independently of $G_{YX}$. In this paper, we employ two simple but representative geometric transformations as examples, i.e., vertical flipping (vf) and 90-degree clockwise rotation (rot), to illustrate geometry consistency. Quantitative and qualitative comparisons with the baseline (GAN alone) and state-of-the-art methods including CycleGAN [62] and DistanceGAN [5] demonstrate the effectiveness of our model in generating realistic images.
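The two example transformations and their inverses can be sketched in a few lines. This is a minimal illustration, assuming images are stored as height x width x channel NumPy arrays; the function names `vf`, `vf_inv`, `rot`, and `rot_inv` are hypothetical and not taken from any released code.

```python
import numpy as np

def vf(img):
    """Vertical flip: reverse the row (height) axis of an H x W x C image."""
    return img[::-1, :, :]

def vf_inv(img):
    """A vertical flip is its own inverse."""
    return img[::-1, :, :]

def rot(img):
    """Rotate an H x W x C image by 90 degrees clockwise."""
    return np.rot90(img, k=-1, axes=(0, 1))

def rot_inv(img):
    """Inverse of rot: rotate by 90 degrees counterclockwise."""
    return np.rot90(img, k=1, axes=(0, 1))

# Toy 2 x 3 single-channel image (assumed data, for illustration only).
x = np.arange(6).reshape(2, 3, 1)

# Applying a transformation and then its inverse recovers the input.
assert np.array_equal(vf_inv(vf(x)), x)
assert np.array_equal(rot_inv(rot(x)), x)
```

Both transformations only permute pixel locations, which is why they leave the semantic content of an image untouched.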
2 Related Work
Generative adversarial networks (GANs) train two networks, i.e., a generator and a discriminator, in a staged zero-sum game fashion to generate images from inputs. Many applications and computer vision tasks have recently been developed based on deep convolutional GANs (DCGANs), such as image inpainting, text-to-image synthesis, style transfer, and domain adaptation [7, 59, 43, 45, 29, 57, 9, 49, 21, 50, 60, 25, 47]. The key component enabling GANs is the adversarial constraint, which forces the generated images to be indistinguishable from real images. Our formulation also benefits from an adversarial constraint to learn translators between two individual domains.
Many well-known computer vision tasks, such as scene parsing and image colorization, follow similar settings to domain mapping or image-to-image translation. Specific to recent adversarial domain mapping, this problem has been studied in a supervised or unsupervised manner with respect to paired or unpaired inputs.
There is a large body of literature [43, 29, 24, 56, 53, 55, 23, 34, 4, 10] on supervised domain mapping. One representative example is conditional GAN [24], which trains the discriminator to distinguish $(x, y)$ from $(x, G_{XY}(x))$ instead of $y$ from $G_{XY}(x)$, where $(x, y)$ is a meaningful pair across domains. Further, Wang et al. [56] showed that conditional GANs can be used to generate high-resolution images with a novel feature matching loss, as well as multi-scale generator and discriminator architectures. While there has been significant progress in supervised domain mapping, many real-world applications cannot provide aligned images across domains because data preparation is expensive. Thus, different constraints and frameworks have been proposed for image-to-image translation in the absence of training pairs, i.e., unsupervised domain mapping.
In unsupervised domain mapping, only unaligned examples in individual domains are provided, making the task more practical but more difficult. Unpaired domain mapping has a long history, and some successes with adversarial networks have recently been presented [37, 62, 5, 36, 39, 35, 6, 11]. For example, Liu and Tuzel [37] introduced coupled GAN (CoGAN) to learn cross-domain representations by enforcing a weight-sharing constraint. Subsequently, CycleGAN [62], DiscoGAN [26], and DualGAN [58] enforced that the translators $G_{XY}$ and $G_{YX}$ should be bijections. Thus, jointly learning $G_{XY}$ and $G_{YX}$ by enforcing cycle consistency can help to produce convincing mappings. Since then, many constraints and assumptions have been proposed to improve cycle consistency [8, 17, 22, 30, 32, 11, 2, 63, 18, 41, 36, 33, 1]. Recently, Benaim and Wolf [5] reported that maintaining the distances between samples within domains allows one-sided unsupervised domain mapping. GcGAN is also a one-sided framework, coupled with our geometry-consistency constraint, and produces translations that are competitive with, and sometimes better than, those of the two-sided CycleGAN in various applications.
Let $\mathcal{X}$ and $\mathcal{Y}$ be two domains with unpaired training examples $\{x_i\}_{i=1}^{N}$ and $\{y_j\}_{j=1}^{M}$, where $x_i$ and $y_j$ are drawn from the marginal distributions $P_X$ and $P_Y$, and $X$ and $Y$ are two random variables associated with $\mathcal{X}$ and $\mathcal{Y}$, respectively. In this paper, we study style transfer without undesirable semantic distortions in unsupervised domain mapping, and have two goals. First, we need to learn a mapping $G_{XY}$ such that $G_{XY}(X)$ has the same distribution as $Y$, i.e., $P_{G_{XY}(X)} \approx P_Y$. Second, the learned mapping function should only change the image style without distorting the semantic structure.
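While many works have modeled the invertibility between $G_{XY}$ and $G_{YX}$ for convincing mappings since the success of CycleGAN, here we propose to enforce geometry consistency as a constraint that allows one-sided domain mapping, i.e., learning $G_{XY}$ without simultaneously learning $G_{YX}$. Let $f(\cdot)$ be a predefined geometric transformation. We can obtain two extra domains $\tilde{\mathcal{X}}$ and $\tilde{\mathcal{Y}}$ with examples $\{\tilde{x}_i\}_{i=1}^{N}$ and $\{\tilde{y}_j\}_{j=1}^{M}$ by applying $f(\cdot)$ to $\mathcal{X}$ and $\mathcal{Y}$, respectively. We learn an additional image-to-image translator $G_{\tilde{X}\tilde{Y}}: \tilde{\mathcal{X}} \rightarrow \tilde{\mathcal{Y}}$ while learning $G_{XY}: \mathcal{X} \rightarrow \mathcal{Y}$, and introduce our geometry-consistency constraint based on the predefined transformation such that the two networks can regularize each other. Our framework enforces that $G_{XY}(x)$ and $G_{\tilde{X}\tilde{Y}}(\tilde{x})$ are related by the same geometric transformation as $x$ and $\tilde{x}$, i.e., $f(G_{XY}(x)) \approx G_{\tilde{X}\tilde{Y}}(\tilde{x})$, where $\tilde{x} = f(x)$. We denote the two adversarial discriminators as $D_Y$ and $D_{\tilde{Y}}$ with respect to domains $\mathcal{Y}$ and $\tilde{\mathcal{Y}}$, respectively.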
4 Proposed Method
We present our geometry-consistency constraint and GcGAN beginning with a review of the cycle-consistency constraint and the distance constraint. An illustration of the main differences between these constraints is shown in Figure 2.
4.1 Unsupervised Constraints
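Cycle-consistency constraint. Following the cycle-consistency assumption [26, 62, 58], after passing through the translators $G_{XY}$ and $G_{YX}$, the examples $x$ and $y$ in domains $\mathcal{X}$ and $\mathcal{Y}$ should be recovered as the original images, i.e., $G_{YX}(G_{XY}(x)) \approx x$ and $G_{XY}(G_{YX}(y)) \approx y$. Cycle consistency is implemented by a bidirectional reconstruction process that requires $G_{XY}$ and $G_{YX}$ to be jointly learned, as shown in Figure 2 (CycleGAN). The cycle-consistency loss takes the form:

$$\mathcal{L}_{cyc}(G_{XY}, G_{YX}, X, Y) = \mathbb{E}_{x \sim P_X}\big[\|G_{YX}(G_{XY}(x)) - x\|_1\big] + \mathbb{E}_{y \sim P_Y}\big[\|G_{XY}(G_{YX}(y)) - y\|_1\big].$$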
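As a rough illustration of the bidirectional reconstruction described above, the sketch below evaluates a cycle-consistency loss for toy translators. It assumes the translators are plain Python callables on NumPy arrays and uses a per-pixel mean absolute error as the L1 term; the names `l1`, `cycle_loss`, `g_xy`, and `g_yx` are hypothetical.

```python
import numpy as np

def l1(a, b):
    """Mean absolute difference (a normalized L1 distance) between two arrays."""
    return float(np.mean(np.abs(a - b)))

def cycle_loss(g_xy, g_yx, xs, ys):
    """Forward cycle x -> y -> x plus backward cycle y -> x -> y."""
    forward = np.mean([l1(g_yx(g_xy(x)), x) for x in xs])
    backward = np.mean([l1(g_xy(g_yx(y)), y) for y in ys])
    return float(forward + backward)

# Toy data and toy translators (assumptions for illustration only).
xs = [np.arange(4.0).reshape(2, 2)]
ys = [np.ones((2, 2))]
g_xy = lambda x: 2.0 * x + 1.0
g_yx_inverse = lambda y: (y - 1.0) / 2.0   # exact inverse of g_xy
g_yx_identity = lambda y: y                # not an inverse of g_xy

loss_bijective = cycle_loss(g_xy, g_yx_inverse, xs, ys)
loss_broken = cycle_loss(g_xy, g_yx_identity, xs, ys)
```

The loss vanishes only when the two translators invert each other, which is why both directions have to be trained together under this constraint.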
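Distance constraint. The assumption behind the distance constraint is that the distance between two examples $x_i$ and $x_j$ in domain $\mathcal{X}$ should be preserved after mapping to domain $\mathcal{Y}$, i.e., $d(x_i, x_j) \approx a \cdot d(G_{XY}(x_i), G_{XY}(x_j)) + b$, where $d(\cdot)$ is a predefined function that measures the distance between two examples, and $a$ and $b$ are the linear coefficient and bias. In DistanceGAN [5], the distance consistency loss is the expectation of the absolute differences between distances:

$$\mathcal{L}_{dis}(G_{XY}, X, Y) = \mathbb{E}_{x_i, x_j \sim P_X}\big[|\phi(x_i, x_j) - \psi(x_i, x_j)|\big],$$

$$\phi(x_i, x_j) = \frac{1}{\sigma_X}\big(\|x_i - x_j\|_1 - \mu_X\big), \qquad \psi(x_i, x_j) = \frac{1}{\sigma_Y}\big(\|G_{XY}(x_i) - G_{XY}(x_j)\|_1 - \mu_Y\big),$$

where $\mu_X$, $\mu_Y$ ($\sigma_X$, $\sigma_Y$) are the means (standard deviations) of the distances of all possible pairs $(x_i, x_j)$ within domain $\mathcal{X}$ and $(y_i, y_j)$ within domain $\mathcal{Y}$, respectively, and are precomputed. Distance preservation makes one-sided unsupervised domain mapping possible.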
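The normalization by precomputed means and standard deviations can be made concrete with a small sketch. It assumes an L1 distance between samples and computes the statistics over all pairs of the given toy sets; `pair_distances` and `distance_loss` are hypothetical names, not DistanceGAN's code.

```python
import numpy as np
from itertools import combinations

def pair_distances(samples):
    """L1 distance for every unordered pair of samples."""
    return np.array([np.abs(a - b).sum() for a, b in combinations(samples, 2)])

def distance_loss(g_xy, xs, ys):
    """Mean absolute gap between normalized source and translated distances."""
    d_x = pair_distances(xs)
    d_y = pair_distances(ys)          # target statistics, precomputed in practice
    mu_x, sigma_x = d_x.mean(), d_x.std()
    mu_y, sigma_y = d_y.mean(), d_y.std()
    d_g = pair_distances([g_xy(x) for x in xs])
    phi = (d_x - mu_x) / sigma_x      # normalized distances in the source domain
    psi = (d_g - mu_y) / sigma_y      # normalized distances after translation
    return float(np.mean(np.abs(phi - psi)))

# Toy one-pixel "images" (assumed data); the target domain is a scaled copy.
xs = [np.array([0.0]), np.array([1.0]), np.array([3.0])]
ys = [2.0 * x for x in xs]

loss_scaled = distance_loss(lambda x: 2.0 * x, xs, ys)   # preserves distance ordering
loss_squashed = distance_loss(lambda x: x ** 2, xs, ys)  # distorts relative distances
```

A translator that rescales all pairwise distances uniformly incurs essentially no loss, because the normalization absorbs the linear coefficient and bias.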
4.2 Geometry-consistent Generative Adversarial Networks
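Adversarial constraint. Taking $G_{XY}$ as an example, an adversarial loss [19] enforces $G_{XY}$ and $D_Y$ to simultaneously optimize each other in a minimax game, i.e., $\min_{G_{XY}} \max_{D_Y} \mathcal{L}_{gan}(G_{XY}, D_Y, X, Y)$. In other words, $D_Y$ aims to distinguish real examples $\{y\}$ from translated samples $\{G_{XY}(x)\}$. By contrast, $G_{XY}$ aims to fool $D_Y$ so that $D_Y$ labels a fake example $y' = G_{XY}(x)$ as a sample satisfying $y' \sim P_Y$. The objective can be expressed as:

$$\mathcal{L}_{gan}(G_{XY}, D_Y, X, Y) = \mathbb{E}_{y \sim P_Y}\big[\log D_Y(y)\big] + \mathbb{E}_{x \sim P_X}\big[\log\big(1 - D_Y(G_{XY}(x))\big)\big].$$

In the transformed domains $\tilde{\mathcal{X}}$ and $\tilde{\mathcal{Y}}$, we employ the adversarial loss $\mathcal{L}_{gan}(G_{\tilde{X}\tilde{Y}}, D_{\tilde{Y}}, \tilde{X}, \tilde{Y})$, which has the same form as $\mathcal{L}_{gan}(G_{XY}, D_Y, X, Y)$.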
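The value of the minimax objective can be illustrated numerically. The sketch below assumes the standard log-likelihood form of the adversarial loss and a discriminator that returns a probability in [0, 1]; the function and variable names are hypothetical, and a small `eps` is added only to keep the logarithm finite.

```python
import numpy as np

def gan_loss(d_y, g_xy, xs, ys, eps=1e-12):
    """Log-likelihood adversarial objective: D maximizes it, G minimizes it."""
    real_term = np.mean([np.log(d_y(y) + eps) for y in ys])
    fake_term = np.mean([np.log(1.0 - d_y(g_xy(x)) + eps) for x in xs])
    return float(real_term + fake_term)

# Toy data (assumed): dark source images, bright target images.
xs = [np.zeros((2, 2))]
ys = [np.ones((2, 2))]
identity = lambda x: x                       # a translator that changes nothing

d_fooled = lambda img: 0.5                   # cannot tell real from fake
d_perfect = lambda img: 1.0 if img.mean() > 0.5 else 0.0

loss_fooled = gan_loss(d_fooled, identity, xs, ys)
loss_perfect = gan_loss(d_perfect, identity, xs, ys)
```

A fully fooled discriminator gives 2 log 0.5, while a perfect one drives the objective to its maximum of zero, which is the quantity the generator tries to pull down.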
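Geometry-consistency constraint. As shown in Figure 2 (GcGAN), given a predefined geometric transformation function $f(\cdot)$, we feed the images $x$ and $\tilde{x} = f(x)$ into the translators $G_{XY}$ and $G_{\tilde{X}\tilde{Y}}$, respectively. Following our geometry-consistency constraint, the outputs $y' = G_{XY}(x)$ and $\tilde{y}' = G_{\tilde{X}\tilde{Y}}(\tilde{x})$ should also satisfy $\tilde{y}' \approx f(y')$, like $x$ and $\tilde{x}$. Considering both $f(\cdot)$ and the inverse geometric transformation function $f^{-1}(\cdot)$, our complete geometry-consistency loss has the following form:

$$\mathcal{L}_{geo}(G_{XY}, G_{\tilde{X}\tilde{Y}}, X, Y) = \mathbb{E}_{x \sim P_X}\big[\|G_{XY}(x) - f^{-1}(G_{\tilde{X}\tilde{Y}}(f(x)))\|_1\big] + \mathbb{E}_{x \sim P_X}\big[\|G_{\tilde{X}\tilde{Y}}(f(x)) - f(G_{XY}(x))\|_1\big].$$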
This geometry-consistency loss can be seen as a reconstruction loss that relies on the predefined geometric transformation function .
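The following sketch shows how such a loss behaves for a 90-degree clockwise rotation. It is a toy illustration under stated assumptions: translators are plain callables on H x W x C NumPy arrays, the L1 term is a per-pixel mean, and all names (`geometry_loss`, `invert`, `add_ramp`) are hypothetical.

```python
import numpy as np

def rot(img):
    """90-degree clockwise rotation of an H x W x C image."""
    return np.rot90(img, k=-1, axes=(0, 1))

def rot_inv(img):
    """Inverse of rot (90 degrees counterclockwise)."""
    return np.rot90(img, k=1, axes=(0, 1))

def geometry_loss(g, g_tilde, xs, f, f_inv):
    """Compare g(x) with f_inv(g_tilde(f(x))) and g_tilde(f(x)) with f(g(x))."""
    total = 0.0
    for x in xs:
        y = g(x)                 # translation of the original image
        y_tilde = g_tilde(f(x))  # translation of the transformed image
        total += np.mean(np.abs(y - f_inv(y_tilde)))
        total += np.mean(np.abs(y_tilde - f(y)))
    return float(total / len(xs))

# A pixel-wise translator commutes with rotation, so it is geometry-consistent.
invert = lambda img: 1.0 - img

# A translator that adds a left-to-right ramp depends on image orientation.
def add_ramp(img):
    ramp = np.linspace(0.0, 1.0, img.shape[1])[None, :, None]
    return img + ramp

rng = np.random.default_rng(0)
xs = [rng.random((4, 4, 1)) for _ in range(3)]

loss_consistent = geometry_loss(invert, invert, xs, rot, rot_inv)
loss_inconsistent = geometry_loss(add_ramp, add_ramp, xs, rot, rot_inv)
```

Passing the same callable for both translators mirrors the parameter-sharing setting, where a single network processes both the original and the transformed images.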
In this paper, we take only two common geometric transformations as examples, namely vertical flipping (vf) and 90-degree clockwise rotation (rot), to demonstrate the effectiveness of our geometry-consistency constraint. Note that $G_{XY}$ and $G_{\tilde{X}\tilde{Y}}$ have the same architecture and share all parameters.
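Full objective. By combining our geometry-consistency constraint with the standard adversarial constraint, one-sided unsupervised domain mapping becomes possible. The full objective for our GcGAN is:

$$\mathcal{L}_{GcGAN}(G_{XY}, G_{\tilde{X}\tilde{Y}}, D_Y, D_{\tilde{Y}}, X, Y) = \mathcal{L}_{gan}(G_{XY}, D_Y, X, Y) + \mathcal{L}_{gan}(G_{\tilde{X}\tilde{Y}}, D_{\tilde{Y}}, \tilde{X}, \tilde{Y}) + \lambda \mathcal{L}_{geo}(G_{XY}, G_{\tilde{X}\tilde{Y}}, X, Y),$$

where $\lambda$ (fixed to a single value in all the experiments) is a trade-off hyperparameter that weights the contributions of $\mathcal{L}_{gan}$ and $\mathcal{L}_{geo}$ during model training. Because we did not make a great effort to choose $\lambda$, tuning it more carefully may give better results on specific translation tasks.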
Network architecture. The full framework of our GcGAN is illustrated in Figure 2. Our experimental settings, network architectures, and learning strategies follow CycleGAN. We employ the same discriminator and generator as CycleGAN, depending on the specific task. Specifically, the generator is a standard encoder-decoder, where the encoder contains two convolutional layers with stride 2 followed by several residual blocks (6 or 9 blocks, depending on the input resolution), and the decoder contains two deconvolutional layers, also with stride 2. The discriminator distinguishes images at the patch level following PatchGANs [24, 31]. Like CycleGAN, we also use an identity mapping loss in all of our experiments (except SVHN → MNIST), including our baseline (GAN alone). As for other details, we use LeakyReLU as the nonlinearity for the discriminators and instance normalization [54] to normalize convolutional feature maps.
Learning and inference. We use the Adam solver [27] with coefficients of (0.5, 0.999), which are used to compute running averages of gradients and their squares. The learning rate is fixed for the initial 100 epochs and linearly decays to zero over the next 100 epochs. Following CycleGAN, the negative log-likelihood objective is replaced with a more stable and effective least-squares loss [40] for $\mathcal{L}_{gan}$. The discriminator is updated with random samples from a history of generated images stored in an image buffer [51] of size 50. The generator and discriminator are optimized alternately. In the inference phase, we feed an image only into the learned generator to obtain a translated image.
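The schedule and the image history described above can be sketched as follows. This is a minimal sketch under assumptions: `base_lr` is left as a parameter rather than a specific value, epochs are 0-indexed, and the 50/50 rule for returning a stored image versus the newest one is an assumed policy; the names `learning_rate` and `ImageBuffer` are hypothetical.

```python
import random

def learning_rate(epoch, base_lr, n_fixed=100, n_decay=100):
    """Constant for the first n_fixed epochs, then linear decay to zero."""
    if epoch < n_fixed:
        return base_lr
    return base_lr * max(0.0, 1.0 - (epoch - n_fixed) / n_decay)

class ImageBuffer:
    """History of generated images used to update the discriminator."""

    def __init__(self, size=50, rng=None):
        self.size = size
        self.images = []
        self.rng = rng if rng is not None else random.Random()

    def query(self, image):
        # Fill the buffer first and pass new images straight through.
        if len(self.images) < self.size:
            self.images.append(image)
            return image
        # Assumed policy: half of the time, swap the new image with a stored one.
        if self.rng.random() < 0.5:
            idx = self.rng.randrange(self.size)
            old = self.images[idx]
            self.images[idx] = image
            return old
        return image

# Usage with integers standing in for generated images.
buf = ImageBuffer(size=2, rng=random.Random(0))
served = [buf.query(i) for i in range(10)]
```

Showing the discriminator older fakes as well as fresh ones keeps it from adapting only to the generator's most recent outputs.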
|method|image → parsing: pixel acc|image → parsing: class acc|image → parsing: mean IoU|parsing → image: pixel acc|parsing → image: class acc|parsing → image: mean IoU|
|---|---|---|---|---|---|---|
|BiGAN/ALI [15, 16]|0.41|0.13|0.07|0.19|0.06|0.02|
|CycleGAN (Cycle) [62]|0.58|0.22|0.16|0.52|0.17|0.11|
|GAN alone (baseline)|0.514|0.160|0.104|0.437|0.161|0.098|
|Ablation Studies (Robustness & Compatibility)|||||||
|GcGAN-rot + Cycle|0.587|0.246|0.182|0.557|0.201|0.132|
We apply our GcGAN to a wide range of applications and make both quantitative and qualitative comparisons with the baseline (GAN alone) and previous state-of-the-art methods including DistanceGAN and CycleGAN. We also study different ablations (based on rot) to analyze our geometry-consistency constraint. Since adversarial networks are not always stable, independent runs can yield slightly different scores. The scores in the quantitative analysis are therefore averaged over three independent experiments.
5.1 Quantitative Analysis
The results demonstrate that our geometry-consistency constraint not only partially filters out candidate solutions with mode collapse or semantic distortions, and thus produces more sensible translations, but is also compatible with other unsupervised constraints such as cycle consistency [62] and distance preservation [5].
Cityscapes. Cityscapes [12] contains 3475 image-label pairs, with 2975 used for training and 500 for validation (used for testing in this paper). For a fair comparison with CycleGAN, the translators are trained in an unaligned fashion. We evaluate our domain mappers using FCN scores and scene parsing metrics following previous works [38, 12, 62]. Specifically, for parsing → image, we assume that a high-quality translated image should yield a semantic segmentation comparable to that of a real image when fed into a scene parser. Thus, we employ the pretrained FCN-8s [38] provided by pix2pix [24] to predict semantic labels for the 500 translated images. The label maps are then resized to the original resolution and compared against the ground-truth labels using standard scene parsing metrics, including pixel accuracy, class accuracy, and mean IoU. For image → parsing, since the fake labels are in RGB format, we simply convert them into class-level labels using a nearest-neighbor search strategy. In particular, we have 19 (category labels) + 1 (ignored label) = 20 categories for Cityscapes, each with a corresponding color value (RGB). For a pixel $i$ in a translated parsing, we compute the distances between the 20 ground-truth color values and the color value of pixel $i$. The label of pixel $i$ is the one with the smallest distance. Then, the aforementioned metrics are used to evaluate our mapping on the 19 category labels.
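The nearest-neighbor conversion from predicted colors to class labels can be sketched as below. The Euclidean distance in RGB space is an assumption (the text only says "distances"), the three-color palette is a made-up stand-in for the 20 Cityscapes colors, and `colors_to_labels` is a hypothetical name.

```python
import numpy as np

def colors_to_labels(rgb, palette):
    """Map each pixel of an H x W x 3 image to the index of the closest palette color."""
    pixels = rgb[:, :, None, :].astype(np.float64)      # H x W x 1 x 3
    colors = palette[None, None, :, :].astype(np.float64)  # 1 x 1 x K x 3
    dist = np.sqrt(((pixels - colors) ** 2).sum(axis=-1))  # H x W x K
    return dist.argmin(axis=-1)

# Hypothetical 3-class palette (not the real Cityscapes colors).
palette = np.array([[0, 0, 0], [255, 0, 0], [0, 0, 255]], dtype=np.uint8)

# A noisy "translated parsing": colors are close to, but not exactly, palette colors.
fake_parsing = np.array(
    [[[10, 5, 0], [250, 10, 3]],
     [[0, 0, 200], [100, 0, 0]]],
    dtype=np.uint8,
)

labels = colors_to_labels(fake_parsing, palette)
```

Casting to floating point before subtracting avoids the wrap-around that unsigned 8-bit arithmetic would otherwise introduce.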
The parsing scores for both the image → parsing and parsing → image tasks are presented in Table 1. Our GcGAN outperforms the baseline (GAN alone) by a large margin. We take the average of pixel accuracy, class accuracy, and mean IoU as the final score for analysis, i.e., score = (pixel acc + class acc + mean IoU) / 3. For image → parsing, GcGAN yields a slightly higher score than CycleGAN. For parsing → image, GcGAN obtains a convincing improvement over the state-of-the-art approach DistanceGAN.
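For reference, the three parsing metrics and their average can be computed from a confusion matrix as in the sketch below. It assumes integer label maps with values in [0, n_classes) and omits the ignored label for brevity; the helper names are hypothetical.

```python
import numpy as np

def confusion_matrix(gt, pred, n_classes):
    """Rows index ground-truth classes, columns index predicted classes."""
    idx = gt.ravel() * n_classes + pred.ravel()
    return np.bincount(idx, minlength=n_classes ** 2).reshape(n_classes, n_classes)

def parsing_scores(conf):
    """Return (pixel accuracy, class accuracy, mean IoU)."""
    tp = np.diag(conf).astype(np.float64)
    gt_count = conf.sum(axis=1)
    pred_count = conf.sum(axis=0)
    pixel_acc = tp.sum() / conf.sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        class_acc = np.nanmean(tp / gt_count)                    # skip absent classes
        mean_iou = np.nanmean(tp / (gt_count + pred_count - tp))
    return float(pixel_acc), float(class_acc), float(mean_iou)

def final_score(pixel_acc, class_acc, mean_iou):
    """Average of the three metrics, used as a single summary number."""
    return (pixel_acc + class_acc + mean_iou) / 3.0

# Toy example with two classes and four pixels.
gt = np.array([0, 0, 1, 1])
pred = np.array([0, 1, 1, 1])
pixel_acc, class_acc, mean_iou = parsing_scores(confusion_matrix(gt, pred, 2))

# Summary score for the CycleGAN image -> parsing row of Table 1.
cyclegan_score = final_score(0.58, 0.22, 0.16)
```

Classes that never occur in the ground truth produce undefined ratios, so they are skipped by `nanmean` rather than counted as zero.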
We next perform ablation studies to investigate the robustness and compatibility of GcGAN, including GcGAN-rot-Separate, GcGAN-Mix, and GcGAN-rot + Cycle. The scores are reported in Table 1. Specifically, GcGAN-rot-Separate shows that the generator employed in GcGAN is sufficient to handle both style transfers (without shape deformation) $\mathcal{X} \rightarrow \mathcal{Y}$ and $\tilde{\mathcal{X}} \rightarrow \tilde{\mathcal{Y}}$. GcGAN-Mix demonstrates that preserving one geometric transformation can filter out most of the candidate solutions with mode collapse or undesired shape deformation, but preserving more transformations does not filter out more. For GcGAN-rot + Cycle, we set the trade-off parameter for $\mathcal{L}_{cyc}$ to the value published in CycleGAN. The consistent improvement supports the claim that our geometry-consistency constraint is compatible with the widely used cycle-consistency constraint.
|method|class acc (%)|
|---|---|
|DistanceGAN (Dist.) [5]|26.8|
|CycleGAN (Cycle) [62]|26.1|
|Ablation Studies (Compatibility)||
|Cycle + Dist. [5]|18.0|
|GcGAN-rot + Dist.|34.0|
|GcGAN-rot + Cycle|33.8|
|GcGAN-rot + Dist. + Cycle|33.2|
SVHN → MNIST. We then apply our approach to the SVHN → MNIST translation task. The translation models are trained on the 73257 and 60000 training images contained in the SVHN and MNIST training sets, respectively. The experimental settings follow DistanceGAN [5], including the default trade-off parameters for $\mathcal{L}_{cyc}$ and $\mathcal{L}_{dis}$, and the network architectures for the generators and the discriminators. We compare our GcGAN with both DistanceGAN and CycleGAN on this translation task. To obtain quantitative results, we feed the translated images into a classifier pretrained on the MNIST training split, as done in DistanceGAN [5]. Note that the experimental settings for domain mapping (GcGAN) and domain adaptation are totally different, so the resulting classification accuracies are not directly comparable. Domain adaptation methods have access to the source domain digit labels, while image translation does not.
Classification accuracies are reported in Table 2. Both GcGAN-rot and GcGAN-vf outperform DistanceGAN and CycleGAN by a large margin. The ablations show that adding our geometry-consistency constraint to current unsupervised domain mapping frameworks improves on the original frameworks to varying degrees. Note that the distance-preservation constraint does not appear to be compatible with the cycle-consistency constraint, whereas our geometry-consistency constraint improves both.
Google Maps. We obtain 2194 pairs of images in and around New York City from Google Maps and split them into training and test sets with 1096 and 1098 pairs, respectively. We train Map ↔ Aerial photo translators using the training set in an unsupervised (unpaired) manner by ignoring the pair information. For Aerial photo → Map, we make comparisons with CycleGAN using average RMSE and pixel accuracy (%). Given a pixel $i$ with the ground-truth RGB value $(r_i, g_i, b_i)$ and the predicted RGB value $(r_i', g_i', b_i')$, if $\max(|r_i - r_i'|, |g_i - g_i'|, |b_i - b_i'|) < \delta$, we consider this an accurate prediction. Since maps only contain a limited number of different RGB values, it is reasonable to compute pixel accuracy using this strategy (two values of the threshold $\delta$ are used in this paper). For Map → Aerial photo, we only show some qualitative results in Figure 3.
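The two map metrics can be sketched as follows. This is an illustration under assumptions: a prediction counts as accurate when the largest per-channel absolute difference is below the threshold, RMSE is taken over all pixels and channels, and the threshold values used here are arbitrary examples; `rmse` and `pixel_accuracy` are hypothetical names.

```python
import numpy as np

def rmse(gt, pred):
    """Root mean squared error over all pixels and color channels."""
    diff = gt.astype(np.float64) - pred.astype(np.float64)
    return float(np.sqrt(np.mean(diff ** 2)))

def pixel_accuracy(gt, pred, delta):
    """Fraction of pixels whose largest per-channel error is below delta."""
    diff = np.abs(gt.astype(np.int64) - pred.astype(np.int64))
    return float(np.mean(diff.max(axis=-1) < delta))

# Toy 1 x 2 RGB "maps" (assumed data).
gt = np.zeros((1, 2, 3), dtype=np.uint8)
pred = np.array([[[3, 0, 0], [0, 12, 0]]], dtype=np.uint8)

acc_strict = pixel_accuracy(gt, pred, delta=5)    # only the first pixel passes
acc_loose = pixel_accuracy(gt, pred, delta=13)    # both pixels pass
error = rmse(gt, pred)
```

Converting to a signed integer type before subtracting matters here, since differences of unsigned 8-bit values would wrap around instead of going negative.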
|method|RMSE|pixel acc (%), smaller δ|pixel acc (%), larger δ|
|---|---|---|---|
|GAN alone (baseline)|33.27|19.3|42.0|
|Ablation Studies (Robustness & Compatibility)||||
|GcGAN-rot + Cycle|28.21|40.6|63.5|
From the scores presented in Table 3, it can be seen that GcGAN produces translations superior to those of the baseline (GAN alone). In particular, GcGAN improves over the baseline in pixel accuracy, demonstrating that the fake maps obtained by our GcGAN contain more details. In addition, our one-sided GcGANs achieve scores that are competitive with, or even slightly better than, those of the two-sided CycleGAN.
5.2 Qualitative Evaluation
The qualitative results are shown in Figure 3, Figure 4, and Figure 5. While GAN alone suffers from mode collapse, our geometry-consistency constraint provides an effective remedy and thus helps to generate visually more convincing translations in various applications. The following applications are all trained with the rot geometric transformation.
Horse → Zebra. We apply GcGAN to the widely studied object transfiguration task, i.e., Horse → Zebra. The images are randomly sampled from ImageNet [13] using the keywords wild horse and zebra. The numbers of training images are 939 and 1177 for horse and zebra, respectively. We find that training GcGAN without parameter sharing produces better translations for this task.
Synthetic ↔ Real. We employ the 2975 training images from Cityscapes as the real-world scenes, and randomly select 3060 images from SYNTHIA-CVPR16 [46], which is a virtual urban scene benchmark, as the synthetic images.
Summer ↔ Winter. The images used for the season translation tasks are provided by CycleGAN. The training set sizes for Summer and Winter are 1273 and 854, respectively.
Photo → Artistic Painting. We translate natural images to artistic paintings with different art styles, including Monet, Cezanne, Van Gogh, and Ukiyo-e. We also apply GcGAN to the translation task of Monet’s paintings → photographs. We use the photos and paintings (Monet: 1074, Cezanne: 584, Van Gogh: 401, Ukiyo-e: 1433, and Photographs: 6853) collected by CycleGAN for training.
Day ↔ Night. We randomly extract 4500 training images for both Day and Night from the 91 webcam sequences captured by Laffont et al. [28].
In this paper, we propose to enforce geometry consistency as a constraint, which can be viewed as a predefined geometric transformation preserving the geometry of a scene for unsupervised domain mapping. The geometry-consistency constraint makes the translation networks on the original images and transformed images co-regularize each other, which not only provides an effective remedy to the mode collapse problem suffered by standard GANs, but also reduces the semantic distortions in the translation. We evaluate our model, i.e., the geometry-consistent generative adversarial network (GcGAN), both qualitatively and quantitatively in various applications. Our experimental results demonstrate that GcGAN achieves competitive and sometimes even better translations than the state-of-the-art methods including DistanceGAN and CycleGAN. Finally, our geometry-consistency constraint is compatible with other well-studied unsupervised constraints.
- [1] A. Almahairi, S. Rajeswar, A. Sordoni, P. Bachman, and A. Courville. Augmented cyclegan: Learning many-to-many mappings from unpaired data. In ICML, 2018.
- [2] A. Anoosheh, E. Agustsson, R. Timofte, and L. Van Gool. Combogan: Unrestrained scalability for image domain translation. In CVPRW, 2018.
- [3] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
- [4] S. Azadi, M. Fisher, V. Kim, Z. Wang, E. Shechtman, and T. Darrell. Multi-content gan for few-shot font style transfer. In CVPR, 2018.
- [5] S. Benaim and L. Wolf. One-sided unsupervised domain mapping. In NIPS, 2017.
- [6] S. Benaim and L. Wolf. One-shot unsupervised cross domain translation. In NIPS, 2018.
- [7] K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan. Domain separation networks. In NIPS, 2016.
- [8] H. Chang, J. Lu, F. Yu, and A. Finkelstein. Pairedcyclegan: Asymmetric style transfer for applying and removing makeup. In CVPR, 2018.
- [9] D. Chen, L. Yuan, J. Liao, N. Yu, and G. Hua. Stereoscopic neural style transfer. In CVPR, 2018.
- [10] Q. Chen and V. Koltun. Photographic image synthesis with cascaded refinement networks. In ICCV, 2017.
- [11] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, 2018.
- [12] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
- [13] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
- [14] E. L. Denton, S. Chintala, R. Fergus, et al. Deep generative image models using a laplacian pyramid of adversarial networks. In NIPS, 2015.
- [15] J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
- [16] V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
- [17] A. Gokaslan, V. Ramanujan, D. Ritchie, K. I. Kim, and J. Tompkin. Improving shape deformation in unsupervised image-to-image translation. In ECCV, 2018.
- [18] A. Gonzalez-Garcia, J. van de Weijer, and Y. Bengio. Image-to-image translation for cross-domain disentanglement. In NIPS, 2018.
- [19] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
- [20] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
- [21] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell. Cycada: Cycle-consistent adversarial domain adaptation. In ICML, 2018.
- [22] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz. Multimodal unsupervised image-to-image translation. In ECCV, 2018.
- [23] T. Isokane, F. Okura, A. Ide, Y. Matsushita, and Y. Yagi. Probabilistic plant modeling via multi-view image-to-image translation. In CVPR, 2018.
- [24] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
- [25] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
- [26] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim. Learning to discover cross-domain relations with generative adversarial networks. In ICML, 2017.
- [27] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [28] P.-Y. Laffont, Z. Ren, X. Tao, C. Qian, and J. Hays. Transient attributes for high-level understanding and editing of outdoor scenes. ACM TOG, 33(4):149, 2014.
- [29] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
- [30] H.-Y. Lee, H.-Y. Tseng, J.-B. Huang, M. Singh, and M.-H. Yang. Diverse image-to-image translation via disentangled representations. In ECCV, 2018.
- [31] C. Li and M. Wand. Precomputed real-time texture synthesis with markovian generative adversarial networks. In ECCV, 2016.
- [32] M. Li, H. Huang, L. Ma, W. Liu, T. Zhang, and Y. Jiang. Unsupervised image-to-image translation with stacked cycle-consistent adversarial networks. In ECCV, 2018.
- [33] X. Liang, H. Zhang, and E. P. Xing. Generative semantic manipulation with contrasting gan. In NIPS, 2017.
- [34] J. Lin, Y. Xia, T. Qin, Z. Chen, and T.-Y. Liu. Conditional image-to-image translation. In CVPR, 2018.
- [35] A. Liu, Y.-C. Liu, Y.-Y. Yeh, and Y.-C. F. Wang. A unified feature disentangler for multi-domain image translation and manipulation. In NIPS, 2018.
- [36] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In NIPS, 2017.
- [37] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In NIPS, 2016.
- [38] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
- [39] S. Ma, J. Fu, C. W. Chen, and T. Mei. Da-gan: Instance-level image translation by deep attention generative adversarial networks. In CVPR, 2018.
- [40] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley. Least squares generative adversarial networks. In ICCV, 2017.
- [41] Y. A. Mejjati, C. Richardt, J. Tompkin, D. Cosker, and K. I. Kim. Unsupervised attention-guided image to image translation. In NIPS, 2018.
- [42] A. v. d. Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu. Conditional image generation with pixelcnn decoders. In NIPS, 2016.
- [43] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.
- [44] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
- [45] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. In ICML, 2016.
- [46] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In CVPR, 2016.
- [47] A. Royer, K. Bousmalis, S. Gouws, F. Bertsch, I. Moressi, F. Cole, and K. Murphy. Xgan: Unsupervised image-to-image translation for many-to-many mappings. In ICLR, 2018.
- [48] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In NIPS, 2016.
- [49] F. Shen, S. Yan, and G. Zeng. Neural style transfer via meta networks. In CVPR, 2018.
- [50] L. Sheng, Z. Lin, J. Shao, and X. Wang. Avatar-net: Multi-scale zero-shot style transfer by feature decoration. In CVPR, 2018.
- [51] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. In CVPR, 2017.
- [52] Y. Taigman, A. Polyak, and L. Wolf. Unsupervised cross-domain image generation, 2016.
- [53] D. Ulyanov, V. Lebedev, A. Vedaldi, and V. Lempitsky. Texture networks: feed-forward synthesis of textures and stylized images. In ICML, 2016.
- [54] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
- [55] C. Wang, H. Zheng, Z. Yu, Z. Zheng, Z. Gu, and B. Zheng. Discriminative region proposal adversarial networks for high-quality image-to-image translation. In ECCV, 2018.
- [56] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. arXiv preprint arXiv:1711.11585, 2017.
- [57] Y. Wang, J. van de Weijer, and L. Herranz. Mix and match networks: encoder-decoder alignment for zero-pair image translation. In CVPR, 2018.
- [58] Z. Yi, H. Zhang, P. Tan, and M. Gong. Dualgan: Unsupervised dual learning for image-to-image translation. In CVPR, 2017.
- [59] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.
- [60] Y. Zhang, Y. Zhang, and W. Cai. Separating style and content for generalized style transfer. In CVPR, 2018.
- [61] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Scene parsing through ade20k dataset. In CVPR, 2017.
- [62] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In CVPR, 2017.
- [63] J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman. Toward multimodal image-to-image translation. In NIPS, 2017.
The generator and discriminator architectures described above (except for SVHN → MNIST) are shown in Tab. 4. For convenience, we use the following abbreviations: C = feature channels, K = kernel size, S = stride, Deconv/Conv = deconvolutional/convolutional layer, and ResBlk = residual block.
Generator
|Index|Layer|C|K|S|
|---|---|---|---|---|
|1|Conv + ReLU||||
|2|Conv + ReLU|128|3|2|
|3|Conv + ReLU|256|3|2|
|4|ResBlk + ReLU|256|3|1|
|5|ResBlk + ReLU|256|3|1|
|6|ResBlk + ReLU|256|3|1|
|7|ResBlk + ReLU|256|3|1|
|8|ResBlk + ReLU|256|3|1|
|9|ResBlk + ReLU|256|3|1|
|10|ResBlk + ReLU|256|3|1|
|11|ResBlk + ReLU|256|3|1|
|12|ResBlk + ReLU|256|3|1|
|13|Deconv + ReLU|128|3|2|
|14|Deconv + ReLU|64|3|2|

Discriminator
|Index|Layer|C|K|S|
|---|---|---|---|---|
|1|Conv + LeakyReLU|64|4|2|
|2|Conv + LeakyReLU|128|4|2|
|3|Conv + LeakyReLU|256|4|2|
|4|Conv + LeakyReLU|512|4|1|
The network architecture for SVHN → MNIST is reported in Tab. 5.
Generator
|Index|Layer|C|K|S|
|---|---|---|---|---|
|1|Conv + LeakyReLU|64|4|2|
|2|Conv + LeakyReLU|128|4|2|
|3|Conv + LeakyReLU|128|3|1|
|4|Conv + LeakyReLU|128|3|1|
|5|Deconv + LeakyReLU|64|4|2|
|6|Deconv + LeakyReLU|1|4|2|

Discriminator
|Index|Layer|C|K|S|
|---|---|---|---|---|
|1|Conv + LeakyReLU|64|4|2|
|2|Conv + LeakyReLU|128|4|2|
|3|Conv + LeakyReLU|256|4|2|
|4|Conv + LeakyReLU|512|4|1|