1 Introduction
Domain mapping, or image-to-image translation, which aims to translate an image from one domain to another, has been intensively investigated over the past few years. Let $X$ denote a random variable representing source-domain images and $Y$ represent target-domain images. Depending on whether we have access to paired samples $(x, y)$, domain mapping can be studied in a supervised or unsupervised manner. While several works have produced high-quality translations by focusing on supervised domain mapping with constraints provided by cross-domain image pairs [43, 24, 56, 55], progress in unsupervised domain mapping has been relatively slow. Unfortunately, obtaining paired training examples is expensive and even infeasible in some situations. For example, if we want to learn translators between Monet’s paintings and photographs, how can we collect sufficient well-defined (Monet’s painting, photograph) pairs for model training? By contrast, collecting unpaired sets is often convenient since a practically unlimited number of images is available online. From this viewpoint, unsupervised domain mapping has great potential for real-world applications in the long term.

In unsupervised domain mapping, from a probabilistic modeling perspective, our goal is to model the joint distribution $P_{XY}$ given samples drawn from the marginal distributions $P_X$ and $P_Y$ in individual domains. Since the two marginal distributions can be inferred from an infinite set of possible joint distributions, it is difficult to guarantee that an individual input $x$ and the output $G_{XY}(x)$ are paired up in a meaningful way without additional assumptions or constraints.

To address this problem, recent approaches have exploited the cycle-consistency assumption, i.e., a mapping $G_{XY}$ and its inverse mapping $G_{YX}$ should be bijections [62, 26, 58]. Specifically, when feeding an example $x$ into the networks $G_{YX} \circ G_{XY}$, the output should be a reconstruction of $x$, and vice versa for $y$, i.e., $G_{YX}(G_{XY}(x)) \approx x$ and $G_{XY}(G_{YX}(y)) \approx y$. Further, DistanceGAN [5] showed that maintaining the distances between images within domains allows one-sided unsupervised domain mapping rather than simultaneously learning both $G_{XY}$ and $G_{YX}$.
Existing constraints overlook a special property of images: simple geometric transformations (global geometric transformations without shape deformation) do not change an image’s semantic structure. Here, semantic structure refers to the information that distinguishes different object/stuff classes, which can be easily perceived by humans regardless of trivial geometric transformations such as rotation. Based on this property, we develop a geometry-consistency constraint, which helps reduce the search space of possible solutions while still keeping the correct set of solutions under consideration, and results in a geometry-consistent generative adversarial network (GcGAN) for unsupervised domain mapping.
Our geometry-consistency constraint is motivated by the fact that a given geometric transformation $f(\cdot)$ between the input images should be preserved by the related translators $G_{XY}$ and $G_{\tilde{X}\tilde{Y}}$, if $\tilde{\mathcal{X}}$ and $\tilde{\mathcal{Y}}$ are the domains obtained by applying $f(\cdot)$ on the examples of $\mathcal{X}$ and $\mathcal{Y}$, respectively. Mathematically, given a random example $x$ from the source domain $\mathcal{X}$ and a predefined geometric transformation function $f(\cdot)$, geometry consistency can be expressed as $f(G_{XY}(x)) \approx G_{\tilde{X}\tilde{Y}}(f(x))$ and $f^{-1}(G_{\tilde{X}\tilde{Y}}(f(x))) \approx G_{XY}(x)$, where $f^{-1}(\cdot)$ is the inverse function of $f(\cdot)$. Because it is unlikely that $G_{XY}$ and $G_{\tilde{X}\tilde{Y}}$ always fail in the same location, the two translators co-regularize each other through the geometry-consistency constraint and thus correct each other’s failures in local regions of their respective translations (see Figure 1 for an illustrative example). Our geometry-consistency constraint allows one-sided unsupervised domain mapping, i.e., $G_{XY}$ can be trained independently from $G_{YX}$. In this paper, we employ two simple but representative geometric transformations as examples, i.e., vertical flipping (vf) and 90-degree clockwise rotation (rot), to illustrate geometry consistency. Quantitative and qualitative comparisons with the baseline (GAN alone) and the state-of-the-art methods including CycleGAN [62] and DistanceGAN [5] demonstrate the effectiveness of our model in generating realistic images.
2 Related Work
Generative Adversarial Networks. Generative adversarial networks (GANs) [19, 42, 14, 44, 48, 3] learn two networks, i.e., a generator and a discriminator, in a staged zero-sum game fashion to generate images from inputs. Many applications and computer vision tasks have recently been developed based on deep convolutional GANs (DCGANs), such as image inpainting, text-to-image synthesis, style transfer, and domain adaptation [7, 59, 43, 45, 29, 57, 9, 49, 21, 50, 60, 25, 47]. The key component enabling GANs is the adversarial constraint, which enforces the generated images to be indistinguishable from real images. Our formulation also benefits from an adversarial constraint to learn translators between two individual domains.

Domain Mapping. Many well-known computer vision tasks, such as scene parsing and image colorization, follow similar settings to domain mapping or image-to-image translation. Specific to recent adversarial domain mapping, this problem has been studied in a supervised or unsupervised manner with respect to paired or unpaired inputs.
There is a large body of literature [43, 29, 24, 56, 53, 55, 23, 34, 4, 10] on supervised domain mapping. One representative example is conditional GAN [24], which trains the discriminator to distinguish $(x, y)$ from $(x, G_{XY}(x))$ instead of $y$ from $G_{XY}(x)$, where $(x, y)$ is a meaningful pair across domains. Further, Wang et al. [56] showed that conditional GANs can be used to generate high-resolution images with a novel feature matching loss, as well as multi-scale generator and discriminator architectures. While there has been significant progress in supervised domain mapping, many real-world applications cannot provide aligned images across domains because data preparation is expensive. Thus, different constraints and frameworks have been proposed for image-to-image translation in the absence of training pairs, i.e., unsupervised domain mapping.
In unsupervised domain mapping, only unaligned examples in individual domains are provided, making the task more practical but more difficult. Unpaired domain mapping has a long history, and some successes with adversarial networks have recently been presented [37, 62, 5, 36, 39, 35, 6, 11]. For example, Liu and Tuzel [37] introduced coupled GAN (CoGAN) to learn cross-domain representations by enforcing a weight-sharing constraint. Subsequently, CycleGAN [62], DiscoGAN [26], and DualGAN [58] enforced that the translators $G_{XY}$ and $G_{YX}$ should be bijections. Thus, jointly learning $G_{XY}$ and $G_{YX}$ by enforcing cycle consistency can help to produce convincing mappings. Since then, many constraints and assumptions have been proposed to improve cycle consistency [8, 17, 22, 30, 32, 11, 2, 63, 18, 41, 36, 33, 1]. Recently, Benaim and Wolf [5] reported that maintaining the distances between samples within domains allows one-sided unsupervised domain mapping. GcGAN is also a one-sided framework, coupled with our geometry-consistency constraint, and produces competitive and even better translations than the two-sided CycleGAN in various applications.
3 Preliminaries
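Let $\mathcal{X}$ and $\mathcal{Y}$ be two domains with unpaired training examples $\{x_i\}_{i=1}^{N}$ and $\{y_j\}_{j=1}^{M}$, where $x_i$ and $y_j$ are drawn from the marginal distributions $P_X$ and $P_Y$, and $X$ and $Y$ are two random variables associated with $\mathcal{X}$ and $\mathcal{Y}$, respectively. In this paper, we study style transfer without undesirable semantic distortions in unsupervised domain mapping, and have two goals. First, we need to learn a mapping $G_{XY}$ such that $G_{XY}(X)$ has the same distribution as $Y$, i.e., $P_{G_{XY}(X)} \approx P_Y$. Second, the learned mapping function should only change the image style without distorting the semantic structures.
While many works have modeled the invertibility between $G_{XY}$ and $G_{YX}$ for convincing mappings since the success of CycleGAN, here we propose to enforce geometry consistency as a constraint that allows one-sided domain mapping, i.e., learning $G_{XY}$ without simultaneously learning $G_{YX}$. Let $f(\cdot)$ be a predefined geometric transformation. We can obtain two extra domains $\tilde{\mathcal{X}}$ and $\tilde{\mathcal{Y}}$ with examples $\{\tilde{x}_i\}_{i=1}^{N}$ and $\{\tilde{y}_j\}_{j=1}^{M}$ by applying $f(\cdot)$ on $\mathcal{X}$ and $\mathcal{Y}$, respectively. We learn an additional image-to-image translator $G_{\tilde{X}\tilde{Y}}: \tilde{\mathcal{X}} \rightarrow \tilde{\mathcal{Y}}$ while learning $G_{XY}: \mathcal{X} \rightarrow \mathcal{Y}$, and introduce our geometry-consistency constraint based on the predefined transformation such that the two networks can regularize each other. Our framework enforces that $G_{XY}(x)$ and $G_{\tilde{X}\tilde{Y}}(\tilde{x})$ keep the same geometric relation as the one between $x$ and $\tilde{x}$, i.e., $f(G_{XY}(x)) \approx G_{\tilde{X}\tilde{Y}}(\tilde{x})$, where $\tilde{x} = f(x)$. We denote the two adversarial discriminators as $D_Y$ and $D_{\tilde{Y}}$ with respect to domains $\mathcal{Y}$ and $\tilde{\mathcal{Y}}$, respectively.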
4 Proposed Method
We present our geometry-consistency constraint and GcGAN, beginning with a review of the cycle-consistency constraint and the distance constraint. An illustration of the main differences between these constraints is shown in Figure 2.
4.1 Unsupervised Constraints
Cycle-consistency constraint. Following the cycle-consistency assumption [26, 62, 58], through the translators $G_{YX} \circ G_{XY}$ and $G_{XY} \circ G_{YX}$, the examples $x$ and $y$ in domains $\mathcal{X}$ and $\mathcal{Y}$ should recover the original images, i.e., $G_{YX}(G_{XY}(x)) \approx x$ and $G_{XY}(G_{YX}(y)) \approx y$. Cycle consistency is implemented by a bidirectional reconstruction process that requires $G_{XY}$ and $G_{YX}$ to be jointly learned, as shown in Figure 2 (CycleGAN). The cycle-consistency loss takes the form:
(1) 
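A form consistent with the description above is sketched below. The notation is assumed: $G_{XY}$ and $G_{YX}$ denote the forward and inverse translators, and $P_X$ and $P_Y$ the marginal distributions of the two domains. The $\ell_1$ norm is also an assumption, chosen because it is the common choice for reconstruction terms of this kind; this section does not state which norm is used.

```latex
\mathcal{L}_{cyc}(G_{XY}, G_{YX}, X, Y)
  = \mathbb{E}_{x \sim P_X}\big[\lVert G_{YX}(G_{XY}(x)) - x \rVert_1\big]
  + \mathbb{E}_{y \sim P_Y}\big[\lVert G_{XY}(G_{YX}(y)) - y \rVert_1\big]
```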
Distance constraint. The assumption behind the distance constraint is that the distance between two examples $x_i$ and $x_j$ in domain $\mathcal{X}$ should be preserved after mapping to domain $\mathcal{Y}$, i.e., $d(x_i, x_j) \approx a \cdot d(G_{XY}(x_i), G_{XY}(x_j)) + b$, where $d(\cdot)$ is a predefined function to measure the distance between two examples and $a$ and $b$ are the linear coefficient and bias. In DistanceGAN [5], the distance consistency loss is the expectation of the absolute differences between distances:
(2) 
where $\mu_X$, $\mu_Y$ ($\sigma_X$, $\sigma_Y$) are the means (standard deviations) of distances of all the possible pairs of $(x_i, x_j)$ within domain $\mathcal{X}$ and $(y_i, y_j)$ within domain $\mathcal{Y}$, respectively, and are precomputed. Distance preservation makes one-sided unsupervised domain mapping possible.
4.2 Geometry-consistent Generative Adversarial Networks
Adversarial constraint. Taking $G_{XY}$ as an example, an adversarial loss [19] enforces $G_{XY}$ and $D_Y$ to simultaneously optimize each other in a minimax game, i.e., $\min_{G_{XY}} \max_{D_Y} \mathcal{L}_{gan}(G_{XY}, D_Y, X, Y)$. In other words, $D_Y$ aims to distinguish real examples $\{y\}$ from translated samples $\{G_{XY}(x)\}$. By contrast, $G_{XY}$ aims to fool $D_Y$ so that $D_Y$ labels a fake example $y' = G_{XY}(x)$ as a sample satisfying $y' \sim P_Y$. The objective can be expressed as:
(3) 
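In its standard form [19], and under assumed notation ($G_{XY}$ the translator, $D_Y$ the discriminator on the target domain, $P_X$ and $P_Y$ the marginal distributions), the adversarial objective described above can be written as follows. This is a sketch of the usual GAN loss rather than a verbatim restoration of the equation.

```latex
\mathcal{L}_{gan}(G_{XY}, D_Y, X, Y)
  = \mathbb{E}_{y \sim P_Y}\big[\log D_Y(y)\big]
  + \mathbb{E}_{x \sim P_X}\big[\log\big(1 - D_Y(G_{XY}(x))\big)\big]
```

The translator tries to minimize this quantity while the discriminator tries to maximize it.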
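In the transformed domains $\tilde{\mathcal{X}}$ and $\tilde{\mathcal{Y}}$, we employ an adversarial loss for $G_{\tilde{X}\tilde{Y}}$ and $D_{\tilde{Y}}$ that has the same form as the one for $G_{XY}$ and $D_Y$.
Geometry-consistency constraint. As shown in Figure 2 (GcGAN), given a predefined geometric transformation function $f(\cdot)$, we feed the images $x$ and $\tilde{x} = f(x)$ into the translators $G_{XY}$ and $G_{\tilde{X}\tilde{Y}}$, respectively. Following our geometry-consistency constraint, the outputs $y' = G_{XY}(x)$ and $\tilde{y}' = G_{\tilde{X}\tilde{Y}}(\tilde{x})$ should also satisfy $\tilde{y}' \approx f(y')$, like $x$ and $\tilde{x}$. Considering both $f(\cdot)$ and the inverse geometric transformation function $f^{-1}(\cdot)$, our complete geometry-consistency loss has the following form: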
(4) 
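Written out under assumed notation, with $G_{XY}$ and $G_{\tilde{X}\tilde{Y}}$ the two translators, $f$ the predefined transformation, and $f^{-1}$ its inverse, a loss matching the two consistency conditions above would look like the sketch below. The $\ell_1$ norm is an assumption made for illustration; the text only says that the loss behaves like a reconstruction loss.

```latex
\mathcal{L}_{geo}(G_{XY}, G_{\tilde{X}\tilde{Y}}, X, Y)
  = \mathbb{E}_{x \sim P_X}\big[\lVert G_{XY}(x) - f^{-1}(G_{\tilde{X}\tilde{Y}}(f(x))) \rVert_1\big]
  + \mathbb{E}_{x \sim P_X}\big[\lVert G_{\tilde{X}\tilde{Y}}(f(x)) - f(G_{XY}(x)) \rVert_1\big]
```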
This geometry-consistency loss can be seen as a reconstruction loss that relies on the predefined geometric transformation function $f(\cdot)$.
In this paper, we take only two common geometric transformations as examples, namely vertical flipping (vf) and 90-degree clockwise rotation (rot), to demonstrate the effectiveness of our geometry-consistency constraint. Note that $G_{XY}$ and $G_{\tilde{X}\tilde{Y}}$ have the same architecture and share all the parameters.
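To make the constraint concrete, the following NumPy sketch implements the two transformations and a geometry-consistency term for arbitrary translator functions. It is an illustration only: the function names, the toy "translators", and the use of a mean absolute error are assumptions made here and are not taken from the authors' implementation.

```python
import numpy as np

def vf(img):
    """Vertically flip an (H, W, C) image; the flip is its own inverse."""
    return img[::-1, :, :]

def rot(img):
    """Rotate an (H, W, C) image by 90 degrees clockwise."""
    return np.rot90(img, k=-1, axes=(0, 1))

def rot_inv(img):
    """Inverse of rot: rotate by 90 degrees counter-clockwise."""
    return np.rot90(img, k=1, axes=(0, 1))

def geometry_consistency_loss(g, g_tilde, x, f, f_inv):
    """Mean absolute disagreement between the two translation paths.

    g translates the original image, g_tilde translates the transformed
    image; the two outputs are compared in both coordinate frames.
    """
    y = g(x)                  # translation of the original image
    y_tilde = g_tilde(f(x))   # translation of the transformed image
    term_original = np.mean(np.abs(y - f_inv(y_tilde)))
    term_transformed = np.mean(np.abs(y_tilde - f(y)))
    return term_original + term_transformed

# Hypothetical toy translators used only to exercise the loss.
def invert_colors(img):
    """Pixel-wise mapping: commutes with any flip or rotation."""
    return 1.0 - img

def add_row_gradient(img):
    """Position-dependent mapping: does not commute with a vertical flip."""
    ramp = np.linspace(0.0, 1.0, img.shape[0]).reshape(-1, 1, 1)
    return img + ramp

rng = np.random.default_rng(0)
x = rng.random((4, 4, 3))
loss_consistent = geometry_consistency_loss(invert_colors, invert_colors, x, rot, rot_inv)
loss_inconsistent = geometry_consistency_loss(add_row_gradient, add_row_gradient, x, vf, vf)
```

A pixel-wise translator incurs zero loss because it commutes with the transformation, whereas a translator whose output depends on absolute position is penalized. In the sketch, passing the same function as both `g` and `g_tilde` mirrors the parameter sharing mentioned above.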
Full objective. By combining our geometry-consistency constraint with the standard adversarial constraint, an effective one-sided unsupervised domain mapping can be learned. The full objective for our GcGAN is:
(5) 
where $\lambda$, which takes the same value in all the experiments, is a trade-off hyperparameter that weights the contributions of the adversarial and geometry-consistency losses during model training. We made no great effort to choose $\lambda$, so tuning it more carefully may give better results on specific translation tasks.

Network architecture. The full framework of our GcGAN is illustrated in Figure 2. Our experimental settings, network architectures, and learning strategies follow CycleGAN. We employ the same discriminator and generator as CycleGAN depending on the specific tasks. Specifically, the generator is a standard encoder-decoder, where the encoder contains two convolutional layers with stride 2 and several residual blocks [20] (6 or 9 blocks, depending on the input resolution), and the decoder contains two deconvolutional layers also with stride 2. The discriminator distinguishes images at the patch level following PatchGANs [24, 31]. Like CycleGAN, we also use an identity mapping loss [52] in all of our experiments (except SVHN → MNIST), including our baseline (GAN alone). For other details, we use LeakyReLU as the nonlinearity for the discriminators and instance normalization [54] to normalize convolutional feature maps.

Learning and inference. We use the Adam solver [27] with coefficients of (0.5, 0.999), which are used to compute running averages of gradients and their squares. The learning rate is fixed in the initial 100 epochs, and linearly decays to zero over the next 100 epochs. Following CycleGAN, the negative log likelihood objective is replaced with a more stable and effective least-squares loss
[40] for the adversarial loss. The discriminator is updated with random samples from a history of generated images stored in an image buffer [51] of size 50. The generator and discriminator are optimized alternately. In the inference phase, we feed an image only into the learned generator $G_{XY}$ to obtain a translated image.

Table 1. Parsing scores on Cityscapes.
method                  image → parsing                      parsing → image
                        pixel acc   class acc   mean IoU     pixel acc   class acc   mean IoU
Benchmark Performance
CoGAN [37]              0.45        0.11        0.08         0.40        0.10        0.06
BiGAN/ALI [15, 16]      0.41        0.13        0.07         0.19        0.06        0.02
SimGAN [51]             0.47        0.11        0.07         0.20        0.10        0.04
CycleGAN (Cycle) [62]   0.58        0.22        0.16         0.52        0.17        0.11
DistanceGAN [5]         -           -           -            0.53        0.19        0.11
GAN alone (baseline)    0.514       0.160       0.104        0.437       0.161       0.098
GcGAN-rot               0.574       0.234       0.170        0.551       0.197       0.129
GcGAN-vf                0.576       0.232       0.171        0.548       0.196       0.127
Ablation Studies (Robustness & Compatibility)
GcGAN-rot-Separate      0.575       0.233       0.170        0.545       0.196       0.124
GcGAN-Mix               0.573       0.229       0.168        0.545       0.197       0.128
GcGAN-rot + Cycle       0.587       0.246       0.182        0.557       0.201       0.132
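The learning-rate schedule described in the learning and inference paragraph of Section 4.2 (constant for the first 100 epochs, then a linear decay to zero over the next 100) can be sketched as a small function. The base learning rate is left as a parameter because its value is not given here, and the function name and 0-indexed epoch convention are assumptions of this sketch.

```python
def learning_rate(epoch, base_lr, fixed_epochs=100, decay_epochs=100):
    """Return the learning rate for a 0-indexed epoch.

    The rate stays at base_lr for the first `fixed_epochs` epochs and then
    decreases linearly, reaching zero after `decay_epochs` further epochs.
    """
    if epoch < fixed_epochs:
        return base_lr
    remaining = fixed_epochs + decay_epochs - epoch
    return base_lr * max(remaining, 0) / decay_epochs

# base_lr=1.0 is a placeholder so the values read as fractions of the base rate.
schedule = [learning_rate(e, base_lr=1.0) for e in (0, 99, 150, 200)]
```

With a placeholder base rate of 1.0, the schedule yields the full rate up to epoch 99, half the rate at epoch 150, and zero at epoch 200.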
5 Experiments
We apply our GcGAN to a wide range of applications and make both quantitative and qualitative comparisons with the baseline (GAN alone) and previous state-of-the-art methods including DistanceGAN and CycleGAN. We also study different ablations (based on rot) to analyze our geometry-consistency constraint. Since adversarial networks are not always stable, every independent experiment could result in slightly different scores. The scores in the quantitative analysis are averaged over three independent experiments.
5.1 Quantitative Analysis
The results demonstrate that our geometry-consistency constraint not only partially filters out candidate solutions that exhibit mode collapse or semantic distortions, and thus produces more sensible translations, but is also compatible with other unsupervised constraints such as cycle consistency [62] and distance preservation [5].
Cityscapes. Cityscapes [12] contains 3975 image-label pairs, with 2975 used for training and 500 for validation (test in this paper). For a fair comparison with CycleGAN, the translators are trained on resized images in an unaligned fashion. We evaluate our domain mappers using FCN scores and scene parsing metrics following previous works [38, 12, 62]. Specifically, for parsing → image, we assume that a high-quality translated image should yield a semantic segmentation comparable to that of a real image when fed into a scene parser. Thus, we employ the pretrained FCN-8s [38] provided by pix2pix [24] to predict semantic labels for the 500 translated images. The label maps are then resized to the original resolution and compared against the ground-truth labels using standard scene parsing metrics including pixel accuracy, class accuracy, and mean IoU [38]. For image → parsing, since the fake labels are in RGB format, we simply convert them into class-level labels using a nearest-neighbor search strategy. In particular, we have 19 (category labels) + 1 (ignored label) categories for Cityscapes, each with a corresponding color value (RGB). For a pixel $i$ in a translated parsing, we compute the distances between the 20 ground-truth color values and the color value of pixel $i$. The label of pixel $i$ is the one with the smallest distance. Then, the aforementioned metrics are used to evaluate our mapping on the 19 category labels.
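The nearest-neighbor label conversion and the three parsing metrics can be illustrated with a short NumPy sketch. The three-color palette, the toy images, the Euclidean RGB distance, and the convention of averaging only over classes present in the ground truth are all assumptions made for this example; the actual evaluation uses the 20 Cityscapes colors and may differ in such details.

```python
import numpy as np

def rgb_to_labels(pred_rgb, palette):
    """Assign each pixel the index of its nearest palette color.

    pred_rgb: (H, W, 3) array; palette: (K, 3) array of reference colors.
    Distances are Euclidean in RGB space (an assumption of this sketch).
    """
    diff = pred_rgb[:, :, None, :].astype(np.float64) - palette[None, None, :, :].astype(np.float64)
    dist = np.sqrt((diff ** 2).sum(axis=-1))  # shape (H, W, K)
    return dist.argmin(axis=-1)

def parsing_scores(pred, gt, num_classes):
    """Return (pixel accuracy, class accuracy, mean IoU) for integer label maps."""
    conf = np.bincount(gt.ravel() * num_classes + pred.ravel(),
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(conf).astype(np.float64)  # correctly labeled pixels per class
    gt_count = conf.sum(axis=1)            # ground-truth pixels per class
    pred_count = conf.sum(axis=0)          # predicted pixels per class
    present = gt_count > 0                 # average only over classes that occur
    pixel_acc = tp.sum() / conf.sum()
    class_acc = (tp[present] / gt_count[present]).mean()
    union = gt_count + pred_count - tp
    mean_iou = (tp[present] / union[present]).mean()
    return pixel_acc, class_acc, mean_iou

# Hypothetical 3-color palette and a 2x2 "translated parsing" with noisy colors.
palette = np.array([[255, 0, 0], [0, 255, 0], [0, 0, 255]])
pred_rgb = np.array([[[250, 10, 5], [20, 240, 30]],
                     [[10, 20, 200], [200, 30, 40]]])
gt = np.array([[0, 1], [2, 2]])

labels = rgb_to_labels(pred_rgb, palette)
pixel_acc, class_acc, mean_iou = parsing_scores(labels, gt, num_classes=3)
```

In the toy case one of the four pixels is mislabeled, so the pixel accuracy is 0.75, while class accuracy and mean IoU weight every class equally regardless of its pixel count.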
The parsing scores for both the image → parsing and parsing → image tasks are presented in Table 1. Our GcGAN outperforms the baseline (GAN alone) by a large margin. We take the average of pixel accuracy, class accuracy, and mean IoU as the final score for analysis [61], i.e., score = (pixel acc + class acc + mean IoU) / 3. For image → parsing, GcGAN yields a slightly higher score than CycleGAN. For parsing → image, GcGAN obtains a convincing improvement over the state-of-the-art approach DistanceGAN.
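The averaged score can be recomputed directly from the Table 1 entries. The snippet below is a small worked example; the dictionary layout and names are chosen here for illustration, and the values are copied from Table 1.

```python
# (pixel acc, class acc, mean IoU) triples copied from Table 1.
table1 = {
    "image -> parsing": {
        "CycleGAN": (0.58, 0.22, 0.16),
        "GAN alone": (0.514, 0.160, 0.104),
        "GcGAN-rot": (0.574, 0.234, 0.170),
        "GcGAN-vf": (0.576, 0.232, 0.171),
    },
    "parsing -> image": {
        "CycleGAN": (0.52, 0.17, 0.11),
        "DistanceGAN": (0.53, 0.19, 0.11),
        "GAN alone": (0.437, 0.161, 0.098),
        "GcGAN-rot": (0.551, 0.197, 0.129),
        "GcGAN-vf": (0.548, 0.196, 0.127),
    },
}

def final_score(triple):
    """Average of pixel accuracy, class accuracy and mean IoU."""
    return sum(triple) / 3.0

# Final scores in percent, rounded to one decimal place.
scores = {
    task: {method: round(100.0 * final_score(v), 1) for method, v in rows.items()}
    for task, rows in table1.items()
}
```

Under this averaging, GcGAN-rot reaches about 32.6 against 32.0 for CycleGAN on image → parsing, and about 29.2 against 27.7 for DistanceGAN on parsing → image.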
We next perform ablation studies to investigate the robustness and compatibility of GcGAN, including GcGAN-rot-Separate, GcGAN-Mix, and GcGAN-rot + Cycle. The scores are reported in Table 1. Specifically, GcGAN-rot-Separate shows that the generator employed in GcGAN is sufficient to handle both style transfers (without shape deformation), i.e., the one between the original domains and the one between the transformed domains. GcGAN-Mix demonstrates that preserving a single geometric transformation can filter out most of the candidate solutions that exhibit mode collapse or undesired shape deformation, but preserving additional transformations does not filter out more. For GcGAN-rot + Cycle, we set the trade-off parameter for the cycle-consistency loss to the value published in CycleGAN. The consistent improvement supports the claim that our geometry-consistency constraint is compatible with the widely used cycle-consistency constraint.
Table 2. Classification accuracy on SVHN → MNIST.
method                        class acc (%)
Benchmark Performance
DistanceGAN (Dist.) [5]       26.8
CycleGAN (Cycle) [62]         26.1
Self-Distance [5]             25.2
GcGAN-rot                     32.5
GcGAN-vf                      33.3
Ablation Studies (Compatibility)
Cycle + Dist. [5]             18.0
GcGAN-rot + Dist.             34.0
GcGAN-rot + Cycle             33.8
GcGAN-rot + Dist. + Cycle     33.2
SVHN → MNIST. We then apply our approach to the SVHN → MNIST translation task. The translation models are trained on the 73257 and 60000 training images contained in the SVHN and MNIST training sets, respectively. The experimental settings follow DistanceGAN [5], including the default trade-off parameters for the cycle-consistency and distance losses, and the network architectures for the generators and the discriminators. We compare our GcGAN with both DistanceGAN and CycleGAN in this translation task. To obtain quantitative results, we feed the translated images into a classifier pretrained on the MNIST training split, as done in [5]. Note that the experimental settings for domain mapping (GcGAN) and domain adaptation are entirely different, and so are the resulting classification accuracies: domain adaptation methods have access to the source-domain digit labels, while image translation does not.

Classification accuracies are reported in Table 2. Both GcGAN-rot and GcGAN-vf outperform DistanceGAN and CycleGAN by a large margin (about 6 percentage points, see Table 2). The ablations show that adding our geometry-consistency constraint to current unsupervised domain mapping frameworks yields improvements of varying size over the original ones. Note that the distance-preservation constraint does not appear to be compatible with the cycle-consistency constraint, whereas our geometry-consistency constraint improves both.
Google Maps. We obtain 2194 pairs of images in and around New York City from Google Maps [24], and split them into training and test sets with 1096 and 1098 pairs, respectively. We train translators between maps and aerial photos on the training set in an unsupervised (unpaired) manner by ignoring the pair information. For Aerial photo → Map, we make comparisons with CycleGAN using average RMSE and pixel accuracy (%). Given a pixel $i$ with the ground-truth RGB value $(r_i, g_i, b_i)$ and the predicted RGB value $(r'_i, g'_i, b'_i)$, we consider the prediction accurate if the predicted value lies within a threshold $\delta$ of the ground truth. Since maps only contain a limited number of different RGB values, it is reasonable to compute pixel accuracy using this strategy (two threshold values are used in this paper). For Map → Aerial photo, we only show some qualitative results in Figure 3.
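The two map metrics can be sketched as follows. The exact acceptance criterion and threshold values are not recoverable from the text above, so this example assumes that a pixel counts as correct when its largest per-channel absolute error is below a threshold `delta`; that criterion, the threshold values used in the toy call, and the choice to average the RMSE over all pixels and channels are assumptions of this sketch.

```python
import numpy as np

def rmse(pred, gt):
    """Root-mean-square error over all pixels and channels."""
    diff = pred.astype(np.float64) - gt.astype(np.float64)
    return float(np.sqrt(np.mean(diff ** 2)))

def pixel_accuracy(pred, gt, delta):
    """Percentage of pixels whose largest per-channel absolute error is below delta.

    Inputs are cast to float first so that uint8 subtraction cannot wrap around.
    """
    diff = np.abs(pred.astype(np.float64) - gt.astype(np.float64))
    correct = diff.max(axis=-1) < delta
    return 100.0 * float(correct.mean())

# Toy 1x2 "map" images; the thresholds 5 and 25 are illustrative values only.
gt = np.array([[[200, 200, 200], [100, 150, 50]]], dtype=np.uint8)
pred = np.array([[[202, 199, 204], [100, 150, 70]]], dtype=np.uint8)

error = rmse(pred, gt)
acc_tight = pixel_accuracy(pred, gt, delta=5)
acc_loose = pixel_accuracy(pred, gt, delta=25)
```

Because a looser threshold can only accept more pixels, accuracy is non-decreasing in `delta`, which is consistent with the second accuracy column of Table 3 being higher than the first for every method.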
Table 3. Quantitative scores on Aerial photo → Map.
method                  RMSE     acc (%), 1st threshold   acc (%), 2nd threshold
Benchmark Performance
CycleGAN [62]           28.15    41.8                     63.7
GAN alone (baseline)    33.27    19.3                     42.0
GcGAN-rot               28.31    41.2                     63.1
GcGAN-vf                28.50    37.3                     58.9
Ablation Studies (Robustness & Compatibility)
GcGAN-rot-Separate      30.25    40.7                     60.8
GcGAN-Mix               27.98    42.8                     64.6
GcGAN-rot + Cycle       28.21    40.6                     63.5
From the scores presented in Table 3, it can be seen that GcGAN produces superior translations to the baseline (GAN alone). In particular, GcGAN-vf and GcGAN-rot improve on the baseline by 18.0 and 21.9 percentage points, respectively, in the first pixel-accuracy column of Table 3, demonstrating that the fake maps obtained by our GcGAN contain more details. In addition, our one-sided GcGANs achieve competitive and even slightly better scores compared with the two-sided CycleGAN.
5.2 Qualitative Evaluation
The qualitative results are shown in Figure 3, Figure 4, and Figure 5. While GAN alone suffers from mode collapse, our geometry-consistency constraint provides an effective remedy and thus helps to generate visually more convincing translations in various applications. The following applications are trained with the rot geometric transformation.
Horse Zebra. We apply GcGAN to the widely studied object transfiguration task on horse and zebra images. The images are randomly sampled from ImageNet [13] using the keywords wild horse and zebra. The numbers of training images are 939 and 1177 for horse and zebra, respectively. We find that training GcGAN without parameter sharing produces preferable translations for this task.

Synthetic Real. We employ the 2975 training images from Cityscapes as the real-world scenes, and randomly select 3060 images from SYNTHIA-CVPR16 [46], a virtual urban scene benchmark, as the synthetic images.
Summer Winter. The images used for the season translation tasks are provided by CycleGAN. The training set sizes for Summer and Winter are 1273 and 854.
Photo Artistic Painting. We translate natural images to artistic paintings with different art styles, including Monet, Cezanne, Van Gogh, and Ukiyo-e. We also apply GcGAN to the translation task of Monet’s paintings → photographs. We use the photos and paintings (Monet: 1074, Cezanne: 584, Van Gogh: 401, Ukiyo-e: 1433, and Photographs: 6853) collected by CycleGAN for training.
Day Night. We randomly extract 4500 training images for both Day and Night from the 91 webcam sequences captured by [28].
6 Conclusion
In this paper, we propose to enforce geometry consistency as a constraint for unsupervised domain mapping, based on the observation that a predefined geometric transformation preserves the geometry of a scene. The geometry-consistency constraint makes the translation networks on the original images and the transformed images co-regularize each other, which not only provides an effective remedy to the mode collapse problem suffered by standard GANs, but also reduces the semantic distortions in the translation. We evaluate our model, i.e., the geometry-consistent generative adversarial network (GcGAN), both qualitatively and quantitatively in various applications. Our experimental results demonstrate that GcGAN achieves competitive and sometimes even better translations than the state-of-the-art methods including DistanceGAN and CycleGAN. Finally, our geometry-consistency constraint is compatible with other well-studied unsupervised constraints.
References
 [1] A. Almahairi, S. Rajeswar, A. Sordoni, P. Bachman, and A. Courville. Augmented cyclegan: Learning many-to-many mappings from unpaired data. In ICML, 2018.
 [2] A. Anoosheh, E. Agustsson, R. Timofte, and L. Van Gool. Combogan: Unrestrained scalability for image domain translation. In CVPRW, 2018.
 [3] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
 [4] S. Azadi, M. Fisher, V. Kim, Z. Wang, E. Shechtman, and T. Darrell. Multi-content gan for few-shot font style transfer. In CVPR, 2018.
 [5] S. Benaim and L. Wolf. One-sided unsupervised domain mapping. In NIPS, 2017.
 [6] S. Benaim and L. Wolf. One-shot unsupervised cross domain translation. In NIPS, 2018.
 [7] K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan. Domain separation networks. In NIPS, 2016.
 [8] H. Chang, J. Lu, F. Yu, and A. Finkelstein. Pairedcyclegan: Asymmetric style transfer for applying and removing makeup. In CVPR, 2018.
 [9] D. Chen, L. Yuan, J. Liao, N. Yu, and G. Hua. Stereoscopic neural style transfer. In CVPR, 2018.
 [10] Q. Chen and V. Koltun. Photographic image synthesis with cascaded refinement networks. In ICCV, 2017.
 [11] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, 2018.
 [12] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
 [13] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
 [14] E. L. Denton, S. Chintala, R. Fergus, et al. Deep generative image models using a laplacian pyramid of adversarial networks. In NIPS, 2015.
 [15] J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
 [16] V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
 [17] A. Gokaslan, V. Ramanujan, D. Ritchie, K. I. Kim, and J. Tompkin. Improving shape deformation in unsupervised image-to-image translation. In ECCV, 2018.
 [18] A. Gonzalez-Garcia, J. van de Weijer, and Y. Bengio. Image-to-image translation for cross-domain disentanglement. In NIPS, 2018.
 [19] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
 [20] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [21] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell. Cycada: Cycle-consistent adversarial domain adaptation. In ICML, 2018.
 [22] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz. Multimodal unsupervised image-to-image translation. In ECCV, 2018.
 [23] T. Isokane, F. Okura, A. Ide, Y. Matsushita, and Y. Yagi. Probabilistic plant modeling via multi-view image-to-image translation. In CVPR, 2018.
 [24] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
 [25] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
 [26] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim. Learning to discover cross-domain relations with generative adversarial networks. In ICML, 2017.
 [27] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [28] P.-Y. Laffont, Z. Ren, X. Tao, C. Qian, and J. Hays. Transient attributes for high-level understanding and editing of outdoor scenes. ACM TOG, 33(4):149, 2014.
 [29] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
 [30] H.-Y. Lee, H.-Y. Tseng, J.-B. Huang, M. Singh, and M.-H. Yang. Diverse image-to-image translation via disentangled representations. In ECCV, 2018.
 [31] C. Li and M. Wand. Precomputed real-time texture synthesis with markovian generative adversarial networks. In ECCV, 2016.
 [32] M. Li, H. Huang, L. Ma, W. Liu, T. Zhang, and Y. Jiang. Unsupervised image-to-image translation with stacked cycle-consistent adversarial networks. In ECCV, 2018.
 [33] X. Liang, H. Zhang, and E. P. Xing. Generative semantic manipulation with contrasting gan. In NIPS, 2017.
 [34] J. Lin, Y. Xia, T. Qin, Z. Chen, and T.-Y. Liu. Conditional image-to-image translation. In CVPR, 2018.
 [35] A. Liu, Y.-C. Liu, Y.-Y. Yeh, and Y.-C. F. Wang. A unified feature disentangler for multi-domain image translation and manipulation. In NIPS, 2018.
 [36] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In NIPS, 2017.
 [37] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In NIPS, 2016.
 [38] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
 [39] S. Ma, J. Fu, C. W. Chen, and T. Mei. Dagan: Instance-level image translation by deep attention generative adversarial networks. In CVPR, 2018.
 [40] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley. Least squares generative adversarial networks. In ICCV, 2017.
 [41] Y. A. Mejjati, C. Richardt, J. Tompkin, D. Cosker, and K. I. Kim. Unsupervised attention-guided image to image translation. In NIPS, 2018.
 [42] A. v. d. Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu. Conditional image generation with pixelcnn decoders. In NIPS, 2016.
 [43] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.
 [44] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
 [45] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. In ICML, 2016.
 [46] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In CVPR, 2016.
 [47] A. Royer, K. Bousmalis, S. Gouws, F. Bertsch, I. Moressi, F. Cole, and K. Murphy. Xgan: Unsupervised image-to-image translation for many-to-many mappings. In ICLR, 2018.
 [48] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In NIPS, 2016.
 [49] F. Shen, S. Yan, and G. Zeng. Neural style transfer via meta networks. In CVPR, 2018.
 [50] L. Sheng, Z. Lin, J. Shao, and X. Wang. Avatar-net: Multi-scale zero-shot style transfer by feature decoration. In CVPR, 2018.
 [51] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. In CVPR, 2017.
 [52] Y. Taigman, A. Polyak, and L. Wolf. Unsupervised cross-domain image generation, 2016.
 [53] D. Ulyanov, V. Lebedev, A. Vedaldi, and V. Lempitsky. Texture networks: feed-forward synthesis of textures and stylized images. In ICML, 2016.
 [54] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
 [55] C. Wang, H. Zheng, Z. Yu, Z. Zheng, Z. Gu, and B. Zheng. Discriminative region proposal adversarial networks for high-quality image-to-image translation. In ECCV, 2018.
 [56] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. arXiv preprint arXiv:1711.11585, 2017.
 [57] Y. Wang, J. van de Weijer, and L. Herranz. Mix and match networks: encoder-decoder alignment for zero-pair image translation. In CVPR, 2018.
 [58] Z. Yi, H. Zhang, P. Tan, and M. Gong. Dualgan: Unsupervised dual learning for image-to-image translation. In CVPR, 2017.
 [59] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.
 [60] Y. Zhang, Y. Zhang, and W. Cai. Separating style and content for generalized style transfer. In CVPR, 2018.
 [61] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Scene parsing through ade20k dataset. In CVPR, 2017.
 [62] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In CVPR, 2017.
 [63] J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman. Toward multimodal image-to-image translation. In NIPS, 2017.
Network Architecture
The generator and discriminator (except for SVHN → MNIST) presented before are shown in Tab. 4. For convenience, we use the following abbreviations: C = Feature channel, K = Kernel size, S = Stride size, Deconv/Conv = Deconvolutional/Convolutional layer, and ResBlk = A residual block.

Table 4. Generator and discriminator architectures (all tasks except SVHN → MNIST).
Generator
Index  Layer               C     K   S
1      Conv + ReLU         64    7   1
2      Conv + ReLU         128   3   2
3      Conv + ReLU         256   3   2
4      ResBlk + ReLU       256   3   1
5      ResBlk + ReLU       256   3   1
6      ResBlk + ReLU       256   3   1
7      ResBlk + ReLU       256   3   1
8      ResBlk + ReLU       256   3   1
9      ResBlk + ReLU       256   3   1
10     ResBlk + ReLU       256   3   1
11     ResBlk + ReLU       256   3   1
12     ResBlk + ReLU       256   3   1
13     Deconv + ReLU       128   3   2
14     Deconv + ReLU       64    3   2
15     Conv                3     7   1
16     Tanh                -     -   -
Discriminator
Index  Layer               C     K   S
1      Conv + LeakyReLU    64    4   2
2      Conv + LeakyReLU    128   4   2
3      Conv + LeakyReLU    256   4   2
4      Conv + LeakyReLU    512   4   1
5      Conv                512   4   1
The network architecture for SVHN → MNIST is reported in Tab. 5.

Table 5. Generator and discriminator architectures for SVHN → MNIST.
Generator
Index  Layer                 C     K   S
1      Conv + LeakyReLU      64    4   2
2      Conv + LeakyReLU      128   4   2
3      Conv + LeakyReLU      128   3   1
4      Conv + LeakyReLU      128   3   1
5      Deconv + LeakyReLU    64    4   2
6      Deconv + LeakyReLU    1     4   2
7      Tanh                  -     -   -
Discriminator
Index  Layer                 C     K   S
1      Conv + LeakyReLU      64    4   2
2      Conv + LeakyReLU      128   4   2
3      Conv + LeakyReLU      256   4   2
4      Conv + LeakyReLU      512   4   1
5      Conv                  512   4   1