[Figure 1: (a) Real-world (Cityscapes); (b) Virtual-world (GTA-V).]
Recently, very promising visual perception performance on a variety of tasks (e.g. classification and detection) has been achieved by deep learning models [14, 19, 20, 34], driven by large-scale annotated datasets. However, more fine-grained tasks (e.g. semantic segmentation) still leave much room for improvement due to insufficient pixel-wise annotations in diverse scenes. High-quality annotations are often prohibitively difficult to obtain given the enormous human effort required; e.g., the Cityscapes dataset reports that manually labeling a single image takes more than 90 minutes. Moreover, models learned on limited and biased datasets often do not generalize well to datasets from other domains, as demonstrated in prior domain adaption works.
An alternative solution to alleviate this data issue is to seek an automatic data generation approach. Rather than relying on expensive manual annotation of real-world data, recent progress in Computer Graphics [23, 35, 36] makes it possible to automatically or semi-automatically capture both images and their corresponding semantic labeling from video games, e.g., Grand Theft Auto V (GTA V), a realistic open-world game set in a fictionalized Los Angeles. In the virtual world, we can easily collect diverse labeled data several orders of magnitude larger than real-world human annotations, essentially without limit.
However, utilizing virtual-world knowledge to facilitate real-world perception tasks is not trivial due to the common and severe domain shift problem. Images collected from the virtual world often follow a distribution inconsistent with that of real-world images because of limitations in rendering and object simulation techniques, as shown in Figure 1. It is thus desirable to bridge the gap between virtual-world and real-world data in order to exploit the shared semantic knowledge for perception. Previous domain adaption approaches can be summarized along two lines: minimizing the difference between the source and target feature distributions [12, 15, 16, 17, 18, 39], or explicitly pushing the two data distributions close to each other via adversarial learning [24, 29, 30, 37, 45, 47] or feature combining [10, 11, 22, 25, 40]. On the one hand, feature-based adaption methods require supervision for each specific task in both source and target domains, which limits their applicability. On the other hand, despite the promising adaption performance achieved by Generative Adversarial Networks (GANs), where a discriminator is trained to distinguish fake images from real images and a generator is optimized to generate realistic images that deceive the discriminator, existing models can only transfer holistic color and texture of the source images to target images while disregarding the key characteristics of each semantic region (e.g. road vs. car), yielding blurry and distorted results. The loss of fine-grained details in generated images severely hinders their capability to facilitate downstream vision perception tasks.
In this work, we propose a novel Semantic-aware Grad-GAN (SG-GAN) that aims at transferring personalized styles (e.g. color, texture) of distinct semantic regions in virtual-world images to approximate the real-world distributions. Our SG-GAN, as one kind of image-based adaption approach, is able to not only preserve key semantic and structure information in the source domain but also enforce each semantic region to be close to its corresponding real-world distribution.
Beyond the traditional adversarial objective used in prior GANs, we propose two main contributions to achieve the above goals. First, a new gradient-sensitive objective is introduced for optimizing the generator, which emphasizes semantic boundary consistency between virtual images and adapted images. It regularizes the generator to render distinct color/texture for each semantic region in order to keep semantic boundaries, which alleviates the common blurriness issue. Second, previous works often learn a whole-image discriminator for validating the fidelity of all regions, which makes the color/texture of all pixels in the original images easily collapse into a monotonous pattern. We argue that the appearance distribution of each semantic region should be treated differently and deliberately. For example, road regions in the real world often appear with the coarse texture of asphalt concrete, while vehicle regions are usually smooth and reflective. In contrast to the standard discriminator, which ultimately examines a global feature map, we employ a new semantic-aware discriminator that evaluates image adaption quality in a semantic-wise manner. The semantic-aware discriminator learns distinct discriminative parameters for examining regions with respect to each semantic label. This distinguishes SG-GAN from existing GANs as a controllable architecture that personalizes texture rendering for different semantic regions and produces adapted images with finer details.
Extensive qualitative and quantitative experiments on adapting GTA-V virtual images demonstrate that our SG-GAN can successfully generate realistic images without changing semantic information. To further demonstrate the quality of the adapted images, we use them to train semantic segmentation models and evaluate on the public Cityscapes dataset. The substantial performance improvement over using the original virtual data speaks well for the superiority of our SG-GAN in semantic-aware virtual-to-real scene adaption.
2 Related work
Real-world vs. virtual-world data acquisition: Fine-grained semantic segmentation of urban scenes takes a huge amount of human effort, which results in much less data than for image classification datasets, a phenomenon referred to as the “curse of dataset annotation”. For example, the CamVid dataset provides 700 road scene images with an annotation speed of 60 minutes/image. The Cityscapes dataset releases 5000 road scene annotations and reports an annotation speed of more than 90 minutes/image. In contrast, collecting urban scene data from video games such as GTA V has attracted a lot of interest [23, 35, 36] as a way to automatically obtain large amounts of data. Specifically, Richter et al. inject a wrapper between GTA V and the GPU to collect rendered data and develop an interactive interface, extracting 24966 images with annotations within 49 hours. Richter et al. further develop real-time rendering pipelines enabling video-rate data and groundtruth collection, and release a dataset of 254064 fully annotated video frames. However, despite its diversity, virtual-world scene data often looks unrealistic (e.g. flawed lighting and shadowing) due to imperfect texture rendering. Directly utilizing such unrealistic data would hurt real-world visual tasks due to the discrepant data distributions.
Domain adaption can be approached either by adapting scene images or by adapting hidden feature representations guided by the targets. Image-based adaption can also be referred to as image-to-image translation, i.e., translating images from a source domain to a target domain, which can be summarized into the following two directions.
First, style transfer approaches combine the content of one image with the style of another by matching Gram matrices on deep feature maps, at the expense of some loss of content information. Second, a generative model can be trained through adversarial learning for image translation. Isola et al. use conditional GANs to learn a mapping function from the source domain to the target domain, with a requirement of paired training data, which is impractical for some tasks. To remove the requirement of paired training data, extra regularization can be applied, including a self-regularization term, cycle structure [24, 45, 47], or weight sharing [29, 30]. There are also approaches making use of both feature matching and adversarial learning [5, 44]. However, in urban scene adaption, despite being able to generate relatively realistic images, existing approaches often modify semantic information, e.g., the sky may be adapted into tree structure, or a road lamp may be rendered out of nothing.
In contrast to image-based adaption that translates images to the target domain, hidden-feature-representation-based adaption aims at adapting learned models to the target domain [12, 15, 17, 18, 26, 27, 28, 39]. By sharing weights or incorporating adversarial discriminative settings, these feature-based adaption methods help mitigate the performance degradation caused by domain shift. However, feature-based adaption methods require a different objective or architecture for each vision task, and are thus not as widely applicable as image-based adaption.
Image synthesis: Apart from domain adaption, there exist other approaches that generate realistic images from text [33, 46], segmentation groundtruth, or low-resolution images. For instance, the cascaded refinement network synthesizes realistic images from semantic segmentation input, preserving semantic information as our approach does. However, since semantic segmentation is by nature much harder to obtain than raw images, image translation approaches have more potential for large-scale data generation.
3 Semantic-aware Grad-GAN
The goal of the proposed SG-GAN is to perform virtual-to-real domain adaption while preserving the key semantic characteristics of distinct contents. Capitalizing on Generative Adversarial Networks (GANs), SG-GAN presents two improvements over the traditional GAN model, i.e., a new soft gradient-sensitive objective over the generators and a novel semantic-aware discriminator.
3.1 Semantic-aware cycle objective
Our SG-GAN builds on the cycle-structured GAN objective, which has shown advantages in training stability and generation quality [24, 45, 47]. Specifically, let us denote the unpaired images from the virtual-world domain and the real-world domain as $X$ and $Y$, respectively. Our SG-GAN learns two symmetric mappings $G: X \rightarrow Y$ and $F: Y \rightarrow X$, along with two corresponding semantic-aware discriminators $D^s_Y$ and $D^s_X$, in an adversarial way. $G$ and $F$ map images between the virtual-world and real-world domains. $D^s_Y$'s target is to distinguish between real-world images $y$ and fake real-world images $G(x)$, and vice versa for $D^s_X$. The details of the semantic-aware discriminators will be introduced in Section 3.2. Figure 2 illustrates the relationship of $X$, $Y$, $G$, $F$, $D^s_X$, and $D^s_Y$.
3.1.1 Adversarial loss
Our objective function is constructed from the standard adversarial loss. Two sets of adversarial losses are applied to the $(G, D^s_Y)$ and $(F, D^s_X)$ pairs. Specifically, the adversarial loss for optimizing $(G, D^s_Y)$ is defined as:

$$\mathcal{L}_{adv}(G, D^s_Y, X, Y) = \mathbb{E}_{y \sim p_{data}(y)}\big[\log D^s_Y(y)\big] + \mathbb{E}_{x \sim p_{data}(x)}\big[\log\big(1 - D^s_Y(G(x))\big)\big]. \tag{1}$$

Note that this is a min-max problem, as $D^s_Y$ aims to maximize $\mathcal{L}_{adv}$ while $G$ aims to minimize it. The objective can be formulated as:

$$\min_{G} \max_{D^s_Y} \mathcal{L}_{adv}(G, D^s_Y, X, Y). \tag{2}$$

The formulation is similar for the generator $F$ and the semantic-aware discriminator $D^s_X$, whose adversarial loss is denoted as $\mathcal{L}_{adv}(F, D^s_X, Y, X)$.
3.1.2 Cycle consistency loss
Another part of our objective function is the cycle consistency loss, which helps reduce the space of possible mappings $G$ and $F$. The cycle consistency loss requires that, after going through both $G$ and $F$, an image should be mapped back as close as possible to itself, i.e., $F(G(x)) \approx x$ and $G(F(y)) \approx y$. In this work, we define the cycle consistency loss as:

$$\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}\big[\|F(G(x)) - x\|_1\big] + \mathbb{E}_{y \sim p_{data}(y)}\big[\|G(F(y)) - y\|_1\big]. \tag{3}$$
The cycle consistency loss can be seen as a regularization on the positions of image elements: the mapping functions are trained such that moving the positions of image components is discouraged. However, as position is only one aspect of semantic information, the cycle consistency loss by itself cannot guarantee that semantic information is well preserved. For complex adaption such as urban scene adaption, a model with only the cycle consistency loss often fails by wrongly mapping a region with one semantic label to another label, e.g., a sky region may be wrongly adapted into a tree region, as shown in Figure 4. This limitation of the cycle structure has also been discussed in prior work.
3.1.3 Soft gradient-sensitive objective
In order to keep semantic information from being changed by the mapping functions, we introduce a novel soft gradient-sensitive loss, which uses the image's semantic information at the gradient level. We first introduce the gradient-sensitive loss, and then show how to soften it.
The motivation of the gradient-sensitive loss is that, no matter how the texture of each semantic class changes, there should be some distinguishable visual differences at the boundaries between semantic classes. Visual differences between adjacent pixels can be captured by convolving gradient filters with the image. A typical choice of gradient filter is the Sobel filter, as defined in Equation 4.
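For reference, the standard 3×3 Sobel kernels for horizontal and vertical gradients, which are presumably what Equation 4 defines, are:

$$C_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix}, \qquad C_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ +1 & +2 & +1 \end{bmatrix}.$$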
Since our focus is on visual differences at semantic boundaries, we need a 0-1 mask that is non-zero only on semantic boundaries. Such a mask can be obtained by convolving a gradient filter with the semantic labeling, since the labeling only has differing adjacent values at semantic boundaries. The semantic labeling can be obtained from human annotation, segmentation models, or Computer Graphics tools [23, 35, 36]. By multiplying the convolved semantic labeling and the convolved image element-wise, attention is paid only to visual differences at semantic boundaries.
More specifically, for an input image $x$ and its corresponding semantic labeling $s_x$, since we desire $x$ and $G(x)$ to share the same semantic information, the gradient-sensitive loss for image $x$ can be defined as Equation 5:

$$\ell_{grad}(x, s_x, G) = \Big\| \big( |C_i * G(x)| - |C_i * x| \big) \odot \mathrm{sgn}\big(|C_s * s_x|\big) \Big\|_1, \tag{5}$$

in which $C_i$ and $C_s$ are gradient filters for the image and the semantic labeling, $*$ stands for convolution, $\odot$ stands for element-wise multiplication, $|\cdot|$ represents absolute value, $\|\cdot\|_1$ is the L1-norm, and $\mathrm{sgn}(\cdot)$ is the sign function.
In practice, we may also hold the belief that $x$ and $G(x)$ share similar texture within semantic classes. Since texture information can also be extracted from the image gradient, a soft gradient-sensitive loss for image $x$ can be defined as Equation 6 to represent such belief:

$$\ell_{s\text{-}grad}(x, s_x, G) = \Big\| \big( |C_i * G(x)| - |C_i * x| \big) \odot \Big( \mathrm{sgn}\big(|C_s * s_x|\big) + \alpha \big(1 - \mathrm{sgn}(|C_s * s_x|)\big) \Big) \Big\|_1, \tag{6}$$

in which $\alpha$ controls how much belief we place on texture similarity.
Given the soft gradient-sensitive loss for a single image, the final objective for the soft gradient-sensitive loss can be defined as Equation 7:

$$\mathcal{L}_{grad}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}\big[\ell_{s\text{-}grad}(x, s_x, G)\big] + \mathbb{E}_{y \sim p_{data}(y)}\big[\ell_{s\text{-}grad}(y, s_y, F)\big], \tag{7}$$

in which $s_x$ is the semantic labeling of $x$ and $s_y$ is the semantic labeling of $y$.
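To make the formulation concrete, below is a minimal NumPy sketch of the soft gradient-sensitive loss for a single grayscale image. It assumes Sobel filters for both $C_i$ and $C_s$, combines horizontal and vertical responses, and averages over pixels instead of summing; it illustrates Equations 5-7 rather than reproducing the exact implementation.

```python
import numpy as np
from scipy.signal import convolve2d

# Sobel kernels (Equation 4), used here for both the image filter C_i and,
# as a simplifying assumption, the semantic-labeling filter C_s.
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float32)
SOBEL_Y = SOBEL_X.T


def grad_mag(arr):
    """Combined absolute response of horizontal and vertical Sobel filters."""
    gx = convolve2d(arr, SOBEL_X, mode="same", boundary="symm")
    gy = convolve2d(arr, SOBEL_Y, mode="same", boundary="symm")
    return np.abs(gx) + np.abs(gy)


def soft_grad_loss(x, g_x, seg, alpha=0.1):
    """Soft gradient-sensitive loss for one image (sketch of Equations 5-6).

    x     : source image, float array of shape (H, W)
    g_x   : adapted image G(x), float array of shape (H, W)
    seg   : integer semantic labeling of x, array of shape (H, W)
    alpha : belief in texture similarity inside semantic regions
    """
    # 0-1 mask that is non-zero only on semantic boundaries: sgn(|C_s * s_x|).
    boundary = np.sign(grad_mag(seg.astype(np.float32)))
    # Soft mask: weight 1 on boundaries, alpha inside semantic regions.
    soft_mask = boundary + alpha * (1.0 - boundary)
    # Difference of gradient magnitudes between x and G(x), re-weighted and
    # averaged (mean over pixels in place of the L1 sum).
    return np.mean(np.abs(grad_mag(g_x) - grad_mag(x)) * soft_mask)
```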
3.1.4 Full objective function
Our full objective function is a combination of the adversarial loss, the cycle consistency loss, and the soft gradient-sensitive loss, as in Equation 8:

$$\mathcal{L}(G, F, D^s_X, D^s_Y) = \mathcal{L}_{adv}(G, D^s_Y, X, Y) + \mathcal{L}_{adv}(F, D^s_X, Y, X) + \lambda_1 \mathcal{L}_{cyc}(G, F) + \lambda_2 \mathcal{L}_{grad}(G, F), \tag{8}$$

where $\lambda_1$ and $\lambda_2$ control the relative importance of the cycle consistency loss and the soft gradient-sensitive loss, compared with the adversarial loss.
Our optimization target can then be represented as:

$$G^*, F^* = \arg\min_{G, F} \max_{D^s_X, D^s_Y} \mathcal{L}(G, F, D^s_X, D^s_Y). \tag{9}$$
3.2 Semantic-aware discriminator
The introduction of the soft gradient-sensitive loss contributes to smoother textures and clearer semantic boundaries (Figure 5). However, the scene adaption also needs to retain higher-level semantic consistency for each specific semantic region. A typical example: after virtual-to-real adaption the tone of the whole image goes dark, since real-world images are not as luminous as virtual-world ones; however, we may only want the roads to become darker without changing much of the sky, or even want the sky to become lighter. The reason for such inappropriate holistic scene adaption is that the traditional discriminator only judges realism image-wise, regardless of texture differences in a semantic-aware manner. To make the discriminator semantic-aware, we introduce the semantic-aware discriminators $D^s_X$ and $D^s_Y$. The idea is to create a separate channel for each semantic class in the discriminator. In practice, this can be achieved by changing the number of filters in the last layer of the standard discriminator to the number of semantic classes, and then applying semantic masks to the filters so that each of them focuses on a different semantic class.
More specifically, the last ($n$-th) layer's feature map of a standard discriminator is typically a tensor $F_n$ with shape $(W, H, 1)$, where $W$ stands for width and $H$ for height. $F_n$ is then compared with an all-one or all-zero tensor to calculate the adversarial objective. In contrast, the semantic-aware discriminator we propose changes $F_n$ into a tensor with shape $(W, H, C)$, where $C$ is the number of semantic classes. We then convert the image's semantic labeling to one-hot encoding and resize it to $W \times H$, which results in a mask $M$ with the same shape $(W, H, C)$. By multiplying $F_n$ and $M$ element-wise, each filter within $F_n$ focuses only on one particular semantic class. Finally, by summing along the last dimension, a tensor with shape $(W, H, 1)$ is obtained, and the adversarial objective can be calculated in the same way as for the standard discriminator. Figure 3 gives an illustration of the proposed semantic-aware discriminator.
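The masking step can be made concrete with a small NumPy sketch; the shapes, the nearest-neighbor label resizing, and the toy values below are illustrative assumptions rather than the exact network code.

```python
import numpy as np

def semantic_aware_logits(feat, seg, num_classes):
    """Collapse a (W, H, C) discriminator feature map to (W, H, 1)
    by letting each channel attend to one semantic class (sketch).

    feat        : last-layer discriminator features, shape (W, H, C)
    seg         : semantic labeling resized to (W, H), integer class ids
    num_classes : number of (clustered) semantic classes C
    """
    W, H, C = feat.shape
    assert C == num_classes
    # One-hot mask M with the same shape (W, H, C) as the feature map.
    mask = np.eye(num_classes, dtype=feat.dtype)[seg]          # (W, H, C)
    # Element-wise product: channel c is kept only on pixels of class c.
    masked = feat * mask
    # Sum over the class dimension -> per-pixel realism score, shape (W, H, 1).
    return masked.sum(axis=-1, keepdims=True)

# Toy usage: a 4x4 feature map with 8 clustered categories.
rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 4, 8)).astype(np.float32)
seg = rng.integers(0, 8, size=(4, 4))
print(semantic_aware_logits(feat, seg, num_classes=8).shape)   # (4, 4, 1)
```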
Table 1: AMT A/B test results between Method A and Method B (percentage of workers preferring A vs. B).

| Method A \ Method B | CycleGAN | DualGAN | SimGAN | BiGAN | SG-GAN-2K |
| --- | --- | --- | --- | --- | --- |
| SG-GAN-2K | 79.2% - 20.8% | 93.4% - 6.6% | 97.2% - 2.8% | 99.8% - 0.2% | — |
| SG-GAN-25K | 83.4% - 16.6% | 94.0% - 6.0% | 98.4% - 1.6% | 99.8% - 0.2% | 53.8% - 46.2% |
[Figure 5 panels: “Adapted image with …” vs. “Adapted image without …”.]
[Figure 6 panels: (a) Input; (b) Diff between (e) and (f); (c) Diff between (a) and (e); (d) Diff between (a) and (f); (e) Adapted image with …; (f) Adapted image without …; (g) 4X zoomed (e); (h) 4X zoomed (f).]
4 Experiments
4.1 Implementation details
Dataset. We randomly sample 2000 images each from the GTA-V dataset and the Cityscapes training set as training images for the virtual-world domain $X$ and the real-world domain $Y$. Another 500 images each from the GTA-V dataset and the Cityscapes training set are sampled for visual comparison and validation. The Cityscapes validation set is not used for validating adaption approaches here, since it will later be used to evaluate semantic segmentation scores in Section 4.4. We train SG-GAN on this dataset and term the model SG-GAN-2K. The same dataset is used for training all baselines in Section 4.2, making them comparable with SG-GAN-2K. To study the effect of the number of virtual-world images, we further expand the virtual-world training images to all 24966 images of the GTA-V dataset, yielding a dataset with 24966 virtual images and 2000 real images. A variant of SG-GAN trained on this expanded dataset is termed SG-GAN-25K.
Network architecture. We train on images of limited resolution due to the GPU memory limitation. For the generator, we adapt the architecture from Isola et al., a U-Net structure with skip connections between low-level and high-level layers. For the semantic-aware discriminator, we use a variant of PatchGAN [21, 47], a fully convolutional network consisting of multiple (leaky-ReLU, instance norm, convolution) layers that helps the discriminator identify realism patch-wise.
Training details. To stabilize training, we use a history of refined images for training the semantic-aware discriminators $D^s_X$ and $D^s_Y$. Moreover, we apply a least-squares objective instead of the log-likelihood objective for the adversarial loss, which is shown to help stabilize training and generate higher-quality images, as proposed by Mao et al. For the parameters in Equation 8, the values of $\lambda_1$ and $\lambda_2$ are kept fixed, while $\alpha$ is set to one value for the first three epochs and then changed afterwards. For the gradient filters in Equation 6, we use the Sobel filter (Equation 4) for the image filter $C_i$ and the filters in Equation 10 for the semantic labeling filter $C_s$, which avoids artifacts on image borders caused by reflect padding. For the number of semantic classes in the semantic-aware discriminator, we cluster the 30 classes into 8 categories to avoid sparse classes. The learning rate is set to 0.0002 and we use a batch size of 1. We implement SG-GAN with the TensorFlow framework and train it on a single Nvidia GTX 1080.
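For illustration only, one plausible clustering (an assumption on our part, modeled on the standard Cityscapes category hierarchy; the exact 8-category mapping used here is not reproduced) could look like the following, shown for a subset of classes:

```python
# Hypothetical class-to-category clustering (subset), following the official
# Cityscapes category grouping; the paper's exact mapping may differ.
CLASS_TO_CATEGORY = {
    "road": "flat", "sidewalk": "flat",
    "building": "construction", "wall": "construction", "fence": "construction",
    "guard rail": "construction", "bridge": "construction", "tunnel": "construction",
    "pole": "object", "traffic light": "object", "traffic sign": "object",
    "vegetation": "nature", "terrain": "nature",
    "sky": "sky",
    "person": "human", "rider": "human",
    "car": "vehicle", "truck": "vehicle", "bus": "vehicle", "train": "vehicle",
    "motorcycle": "vehicle", "bicycle": "vehicle",
    "unlabeled": "void",
}
```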
Testing. Semantic information is only needed at training time; at test time SG-GAN only requires images without semantic information. Since the generators and discriminators are fully convolutional, SG-GAN can handle high-resolution images at test time. Testing takes 1.3 seconds per image on a single Nvidia GTX 1080.
4.2 Comparison with state-of-the-art methods
We compare our SG-GAN with current state-of-the-art baselines for unpaired virtual-to-real scene adaption to demonstrate its superiority.
4.2.1 Baselines
SimGAN introduces a self-regularization term and a local adversarial loss to train a refiner for image adaption. In our experiments we use channel-wise mean values as the self-regularization term, and use the architecture proposed in the original paper.
DualGAN uses a U-Net structure for its generators, identical to SG-GAN's. It uses the same PatchGAN structure as CycleGAN, but unlike CycleGAN it follows the loss formulation and training procedure proposed in Wasserstein GAN.
BiGAN [8, 9] learns the inverse mapping of standard GANs. While standard GANs learn a generator that maps a random noise vector $z$ to an image $x$, i.e., $G: z \rightarrow x$, BiGAN [8, 9] also aims at inferring the latent $z$ from the image $x$. By treating $z$ as an image from the other domain, BiGAN can also be used for unpaired scene adaption. For the implementation of BiGAN we use publicly released code.
4.2.2 Qualitative and quantitative evaluation
Figure 4 visually compares SG-GAN-2K with other state-of-the-art methods. In general, SG-GAN generates better visual results, with clearer boundaries, consistent semantic classes, smoother texture, etc. Moreover, SG-GAN-2K shows its ability for personalized adaption: e.g., it retains the red color of a vehicle's headlights, while the red color of the sunset sky is changed to a sunny yellow closer to real-world images.
To further evaluate our approach quantitatively, we conduct A/B tests on Amazon Mechanical Turk (AMT) by comparing SG-GAN-2K and the baseline approaches pairwise. We use 500 virtual-world images as input and present pairs of adapted images generated by different methods to workers. For each image pair, we ask workers which image is more realistic and record their answers. A total of 123 workers participated in our A/B tests, and the results are shown in Table 1. According to the statistics, SG-GAN outperforms all other approaches by a large margin. We attribute this superiority to the clearer boundaries and smoother textures achieved by the soft gradient-sensitive loss, and to the personalized texture rendering enabled by the semantic-aware discriminator.
4.3 Ablation studies
Effectiveness of the soft gradient-sensitive objective. To demonstrate the effectiveness of the soft gradient-sensitive loss, we train a variant of SG-GAN without it and compare it with SG-GAN-25K. Figure 5 shows an example, inspecting details through a 4X zoom. Compared with SG-GAN-25K, the variant without the soft gradient-sensitive loss has coarse semantic boundaries and rough textures, which demonstrates that the soft gradient-sensitive loss helps generate adapted images with clearer semantic boundaries and smoother textures.
Effectiveness of the semantic-aware discriminator. We train a variant of SG-GAN with a standard discriminator in place of the semantic-aware discriminator and compare it with SG-GAN-25K. As shown in Figure 6, comparing (g) and (h), the variant without the semantic-aware discriminator lacks details, e.g., the color of the traffic light, and generates coarser textures, e.g., the sky. The difference maps, i.e., (b), (c), (d) in Figure 6, further reveal that the semantic-aware discriminator leads to personalized texture rendering for each distinct semantic region.
The effect of virtual training set size. Figure 4 compares variants of SG-GAN trained with different numbers of virtual-world images. Generally, SG-GAN-25K generates clearer details than SG-GAN-2K for some images. Further A/B tests between them in Table 1 show SG-GAN-25K is slightly better than SG-GAN-2K thanks to the additional training data. Both the qualitative and quantitative comparisons indicate that more data helps, but the improvement may only be notable when the difference in dataset size is on the order of magnitudes.
Discussion. While SG-GAN generates realistic results for almost all tested images, in very rare cases the adapted image is unsatisfactory, as shown in Figure 7. Our model learns the existence of sunlight, but it is unaware that the image is taken in a tunnel, where sunlight would be abnormal. We attribute this rare failure to the lack of diversity of the real-world dataset compared with the virtual-world dataset, so such a case can be seen as an outlier.
More real-world images could help alleviate such unsatisfactory cases, but SG-GAN is restricted by the limited number of fine-grained real-world annotations available for training; e.g., the Cityscapes dataset only contains a fine-grained training set of 2975 images. However, we foresee a possibility of solving this data insufficiency by using coarse annotations labeled by humans or produced by semantic segmentation models. In our implementation of the semantic-aware discriminator, semantic masks are actually clustered to avoid sparse classes, e.g., the semantic classes “building”, “wall”, “fence”, “guard rail”, “bridge” and “tunnel” are clustered into a single mask indicating “construction”. Given such clustering, annotation granularity may not be a vital factor for our model. Thus, investigating the trade-off between annotation granularity and dataset size would be a possible next step.
4.4 Application on semantic segmentation
To further demonstrate the scene adaption quality of SG-GAN, we conduct comparisons on the downstream semantic segmentation task on the Cityscapes validation set by adapting from the GTA-V dataset, similar to prior work. The idea is to train a semantic segmentation model solely on adapted virtual-world data, i.e., the 24966 images of the GTA-V dataset, and evaluate the model's performance on real-world data, i.e., the Cityscapes validation set. For the semantic segmentation model, we use the architecture proposed by Wu et al. and exactly follow its training procedure, which shows impressive results on the Cityscapes dataset. Table 2 shows the results. The baseline method trains the semantic segmentation model directly on pairs of original virtual-world images and groundtruth.
We first compare SG-GAN with CycleGAN. The substantially higher semantic segmentation performance of SG-GAN shows its ability to yield adapted images closer to the real-world data distribution. Figure 8 illustrates the visual comparison between SG-GAN and the baseline to further show how SG-GAN improves segmentation. We further compare our approach with a hidden-feature-representation-based adaption method proposed by Hoffman et al., which SG-GAN outperforms by a large margin. These evaluations on semantic segmentation again confirm SG-GAN's ability to produce high-quality adapted images, benefiting from preserving consistent semantic information and rendering personalized textures closer to the real world via the soft gradient-sensitive objective and the semantic-aware discriminator.
[Figure 7 panels: (a) Input; (b) Adapted.]
Table 2: Semantic segmentation performance on the Cityscapes validation set.

| Method | Pixel acc. | Class acc. | Class IOU |
| --- | --- | --- | --- |
| Hoffman et al. | – | – | 27.10 |
[Figure 8 panels: (a) Real-world image; (b) Groundtruth; (c) Baseline; (d) SG-GAN-25K.]
5 Conclusion
In this work, we propose a novel SG-GAN for virtual-to-real urban scene adaption that retains critical semantic information. SG-GAN employs a new soft gradient-sensitive loss to enforce clear semantic boundaries and smooth adapted textures, and a semantic-aware discriminator to personalize texture rendering. We conduct extensive experiments to compare SG-GAN with other state-of-the-art domain adaption approaches both qualitatively and quantitatively, all of which demonstrate the superiority of SG-GAN. Further experiments on downstream semantic segmentation confirm the effectiveness of SG-GAN in virtual-to-real urban scene adaption. In the future, we plan to apply our model to the Playing-for-Benchmarks dataset, which has an order of magnitude more annotated virtual-world data, to further boost adaption performance.
-  M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
-  M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
-  A. Bansal, Y. Sheikh, and D. Ramanan. Pixelnn: Example-based image synthesis. arXiv preprint arXiv:1708.05349, 2017.
-  G. J. Brostow, J. Fauqueur, and R. Cipolla. Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters, 2009.
-  M. Cha, Y. Gwon, and H. Kung. Adversarial nets with perceptual losses for text-to-image synthesis. arXiv preprint arXiv:1708.09321, 2017.
-  Q. Chen and V. Koltun. Photographic image synthesis with cascaded refinement networks. In ICCV, 2017.
-  M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
-  J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
-  V. Dumoulin, I. Belghazi, B. Poole, A. Lamb, M. Arjovsky, O. Mastropietro, and A. Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
-  L. Gatys, A. Ecker, and M. Bethge. A neural algorithm of artistic style. Nature Communications, 2015.
-  L. A. Gatys, M. Bethge, A. Hertzmann, and E. Shechtman. Preserving color in neural artistic style transfer. arXiv preprint arXiv:1606.05897, 2016.
-  T. Gebru, J. Hoffman, and L. Fei-Fei. Fine-grained recognition in the wild: A multi-task domain adaptation approach. In ICCV, 2017.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
-  J. Hoffman, S. Guadarrama, E. S. Tzeng, R. Hu, J. Donahue, R. Girshick, T. Darrell, and K. Saenko. Lsda: Large scale detection through adaptation. In NIPS, 2014.
-  J. Hoffman, D. Pathak, T. Darrell, and K. Saenko. Detector discovery in the wild: Joint multiple instance and representation learning. In CVPR, 2015.
-  J. Hoffman, D. Pathak, E. Tzeng, J. Long, S. Guadarrama, T. Darrell, and K. Saenko. Large scale visual recognition through adaptation using joint representation and multiple instance learning. JMLR, 2016.
-  J. Hoffman, D. Wang, F. Yu, and T. Darrell. Fcns in the wild: Pixel-level adversarial and constraint-based adaptation. arXiv preprint arXiv:1612.02649, 2016.
-  J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507, 2017.
-  G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
-  P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004, 2016.
-  J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
-  M. Johnson-Roberson, C. Barto, R. Mehta, S. N. Sridhar, K. Rosaen, and R. Vasudevan. Driving in the matrix: Can virtual worlds replace human-generated annotations for real world tasks? In ICRA, 2017.
-  T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim. Learning to discover cross-domain relations with generative adversarial networks. arXiv preprint arXiv:1703.05192, 2017.
-  Y. Li, N. Wang, J. Liu, and X. Hou. Demystifying neural style transfer. arXiv preprint arXiv:1701.01036, 2017.
-  X. Liang, Z. Hu, H. Zhang, C. Gan, and E. P. Xing. Recurrent topic-transition gan for visual paragraph generation. arXiv preprint arXiv:1703.07022, 2017.
-  X. Liang, L. Lee, W. Dai, and E. P. Xing. Dual motion gan for future-flow embedded video prediction. In ICCV, 2017.
-  X. Liang, H. Zhang, and E. P. Xing. Generative semantic manipulation with contrasting gan. arXiv preprint arXiv:1708.00315, 2017.
-  M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. arXiv preprint arXiv:1703.00848, 2017.
-  M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In NIPS, 2016.
-  X. Mao, Q. Li, H. Xie, R. Y. Lau, and Z. Wang. Multi-class generative adversarial networks with the l2 loss function. arXiv preprint arXiv:1611.04076, 2016.
-  J. Quionero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence. Dataset shift in machine learning. The MIT Press, 2009.
-  S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. In ICML, 2016.
-  S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, 2015.
-  S. R. Richter, Z. Hayder, and V. Koltun. Playing for benchmarks. In ICCV, 2017.
-  S. R. Richter, V. Vineet, S. Roth, and V. Koltun. Playing for data: Ground truth from computer games. In ECCV, 2016.
-  A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. In CVPR, 2017.
-  I. Sobel. An isotropic 3×3 image gradient operator. Machine vision for three-dimensional scenes, 1990.
-  E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko. Adversarial discriminative domain adaptation. In CVPR, 2017.
-  D. Ulyanov, V. Lebedev, A. Vedaldi, and V. Lempitsky. Texture networks: feed-forward synthesis of textures and stylized images. In ICML, 2016.
-  D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
-  Z. Wu, C. Shen, and A. v. d. Hengel. Wider or deeper: Revisiting the resnet model for visual recognition. arXiv preprint arXiv:1611.10080, 2016.
-  J. Xie, M. Kiefel, M.-T. Sun, and A. Geiger. Semantic instance annotation of street scenes by 3d to 2d label transfer. In CVPR, 2016.
-  W. Xiong, W. Luo, L. Ma, W. Liu, and J. Luo. Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. arXiv preprint arXiv:1709.07592, 2017.
-  Z. Yi, H. Zhang, P. Tan, and M. Gong. Dualgan: Unsupervised dual learning for image-to-image translation. In ICCV, 2017.
-  H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.
-  J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.