Pixel and Feature Level Based Domain Adaptation for Object Detection in Autonomous Driving

09/30/2018 · Yuhu Shan et al.

Annotating large scale datasets to train modern convolutional neural networks is prohibitively expensive and time-consuming for many real tasks. One alternative is to train the model on labeled synthetic datasets and apply it to real scenes. However, this straightforward method often fails to generalize well, mainly due to the domain bias between the synthetic and real datasets. Many unsupervised domain adaptation (UDA) methods have been introduced to address this problem, but most of them focus only on the simple classification task. In this paper, we present a novel UDA model to solve the more complex object detection problem in the context of autonomous driving. Our model integrates both pixel level and feature level based transformations to fulfill the cross domain detection task, and can be further trained end-to-end to pursue better performance. We employ the objectives of the generative adversarial network and the cycle consistency loss for image translation in the pixel space. To address the potential semantic inconsistency problem, we propose region proposal based feature adversarial training to preserve the semantics of our target objects as well as to further minimize the domain shift. Extensive experiments are conducted on several different datasets, and the results demonstrate the robustness and superiority of our method.




I Introduction

Object detection aims to assign each object in an image a bounding box along with a class label, e.g., person, bicycle, motorcycle or car. It plays an important role in modern autonomous driving systems, since it is crucial to detect other traffic participants, as shown in Fig. 1. Although the performance of object detection algorithms has been greatly improved since the introduction of AlexNet [1] in 2012, it is still far from satisfactory in practical applications, mainly due to limited data and expensive labeling costs. Supervised learning algorithms based on deep neural networks require large amounts of finely labeled data, which is extremely difficult to acquire in real cases. For example, it takes almost ninety minutes to annotate one image from the Cityscapes dataset [2] for driving scene understanding, and even though it is already one of the largest driving scene datasets, it contains only 2975 finely labeled training images. One promising way to address this problem is to train models on synthetic datasets. Fortunately, with the great progress achieved in graphics and simulation infrastructure, many large scale datasets with high quality annotations have been produced to simulate different real scenes. However, models trained purely on rendered images cannot generalize to real images, mainly due to the domain bias or domain shift [3][4]. Over the past several years, researchers have proposed various unsupervised domain adaptation (UDA) methods [5] to solve this problem. However, most of them only target the simple classification task and are not suited to more complex vision tasks such as object detection or semantic segmentation.

In this paper, we present a new UDA model with adaptation in both the pixel and feature spaces to deal with the more complex object detection problem in the context of autonomous driving. In the UDA setting, we want to generalize models trained on source data with ground truth labels to target data without any annotations. Perhaps the works most similar to ours are [6][7]. [6] trains its image generation model based on a generative adversarial network (GAN), which is combined with the task model for object classification and pose estimation. [7] tries to solve the cross domain semantic segmentation problem based on CycleGAN [8] and traditional segmentation networks. To overcome the semantic inconsistency problem in CycleGAN, the authors use the source image label map as additional supervision to preserve semantics in the image translation stage. We also employ a similar CycleGAN structure for pixel level based adaptation in this work. Instead of using additional semantic maps, we propose region proposal based feature adversarial learning to keep the semantics of the target objects during the end-to-end training process. Meanwhile, the domain bias can also be further minimized in the feature space, which brings better performance. Qualitative and quantitative results on several datasets show the robustness of our method.

Fig. 1: Detection task in the autonomous driving system.

II Literature Review

II-A Object Detection

Object detection [9][10], as a fundamental problem in computer vision, has achieved great progress since 2012 with the development of deep neural networks. Based on AlexNet, many different convolutional neural networks (CNNs), such as VGGNet [11], GoogLeNet [12], ResNet [13] and DenseNet [14], have been proposed to learn more powerful deep features from data. Object detection algorithms also benefit from these architectures, since better features are helpful for other vision tasks as well. Apart from the different network architectures, recent CNN-based object detectors can be mainly divided into two categories: single-stage detectors (YOLO [15][16] and SSD [17][18]) and two-stage detectors (Faster R-CNN [19] and R-FCN [20], etc.). Single-stage detectors directly predict object labels and bounding box coordinates within an image based on default anchor boxes, which is fast but not accurate enough. In contrast, two-stage detectors first generate a large number of region proposals from the CNN features, and then recognize the proposals with a heavier head network. They therefore achieve better performance than single-stage detectors, but at slower detection speed.

Fig. 2: Overall architecture of the model. On the left, we show the pixel level based image translation, in which the source image is first converted to the target domain before it is used to train the detection network. On the right, the detection network is enhanced with a domain classifier for feature level based adversarial training.

II-B Unsupervised Domain Adaptation

Unsupervised domain adaptation tries to solve the learning problem in a target domain without labels by leveraging training data in a different but related source domain with ground truth labels. Although it has been studied in computer vision for a long time, learning to perform UDA is still an open research problem. Pan et al. [21] propose Transfer Component Analysis (TCA), a kernel method based on Maximum Mean Discrepancy (MMD) [22], to learn better feature representations across domains. Building on TCA, [23] provides Joint Distribution Adaptation (JDA), which jointly adapts both the marginal and conditional distributions and is robust to substantial distribution differences. Recently, with the advent of deep learning, many works also try to learn deep domain invariant features within neural networks. For example, Long et al. [24][25] propose to embed hidden network features in a reproducing kernel Hilbert space and explicitly measure the difference between the two domains with MMD and its variants. Sun et al. [26] try to minimize domain shift by matching the second order statistics of the feature distributions between the source and target domains. Rather than explicitly modeling a term that measures the domain discrepancy, another stream of work utilizes adversarial training to implicitly find domain invariant feature representations. Ganin et al. [27][28] add a domain classifier to the deep neural network to classify which domain the inputs belong to, and conduct adversarial training by reversing the gradients from the domain classification loss before they flow into the shared CNN feature layers. Rather than using shared feature layers, Tzeng et al. [29] propose to learn indistinguishable features for the target domain data by training a separate network with objectives similar to the traditional GAN model [30]; this network is then combined with the classifier trained on the source domain for recognition tasks. [31] is another related work, which argues that each domain should have its own specific features and that only part of the features are shared between the different domains. They therefore explicitly model private and shared domain representations and perform adversarial training only on the shared ones.
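The gradient-reversal trick described above can be sketched in a few lines. The toy class below (the name `GradientReversal` and the scaling factor `lam` are illustrative, not taken from any of the cited papers) shows only the essential sign flip, outside of any real autograd framework:

```python
import numpy as np

class GradientReversal:
    """Toy gradient-reversal layer in the spirit of Ganin et al.:
    identity in the forward pass, negated (and scaled) gradient
    in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam  # scaling factor applied to the reversed gradient

    def forward(self, x):
        return x  # features pass through unchanged

    def backward(self, grad_output):
        # The gradient coming from the domain classifier is flipped
        # before it reaches the shared feature extractor, so the
        # extractor is pushed to *maximize* domain confusion.
        return -self.lam * grad_output

grl = GradientReversal(lam=0.5)
feat = np.array([1.0, -2.0, 3.0])
grad = np.array([0.1, 0.2, -0.3])
out = grl.forward(feat)   # identical to feat
rev = grl.backward(grad)  # sign-flipped, scaled by lam
```

In a real framework the same behavior is implemented as a custom autograd operation placed between the shared backbone and the domain classifier.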

Although many methods have been proposed to solve UDA, most of them focus only on the classification problem. Very few works address more complex tasks such as object detection or semantic segmentation. As far as we know, [32] is the first work to deal with the domain adaptation problem for object detection. The authors conduct adversarial training on features from the convolutional and fully connected layers, along with regular training for the detection task. A domain label consistency check is then used as a regularization to help the network learn better domain invariant features. Although good detection results are achieved in that paper, we argue that conducting adaptation in both the feature space and the pixel space is a better alternative, since adapting features at higher levels can fail to model low level image details and is normally not human interpretable. Our work is therefore also closely related to image-to-image translation, as discussed in the following section.

Ii-C Image-to-Image Translation

So far, many works have attempted to convert images into another style. [33][34][35][36][37] conduct image translation under the assumption of available paired training datasets, which is not appropriate for unsupervised domain adaptation. Several other works try to solve the problem in an unpaired setting. [38][39] share part of the network weights to learn the joint distribution of multi-modal images. PixelDA [6] proposes a novel GAN based architecture to convert images across domains, but this method needs prior knowledge about which parts of the image contribute to the content similarity loss. Neural style transfer [40][41][42] is another kind of method that converts one image into another style while preserving its own content, by optimizing the pixel values during backpropagation to match the Gram matrix statistics of pretrained deep network features. One shortcoming of style transfer is that it only targets translation between two specific images, not at the dataset level. The recently proposed CycleGAN model [8] is a promising method for unpaired image translation, and has produced compelling results such as converting aerial photos to Google maps or Monet paintings into photos. The authors propose a cycle consistency loss to regularize the generative model and hence preserve the structural information of the transferred image. However, this method can only guarantee that an area occupied by an object before the translation will also be occupied after the generation process; with the cycle consistency loss alone, the semantics of the pixels are not guaranteed to be consistent. [7] and [43] propose to use semantic label maps as additional signals to regularize the generative models of CycleGAN so that they produce the same segmentation results. However, this requires training additional segmentation networks, which can slow down the whole training process. It is actually not necessary to pay the same attention to all pixels for some specific tasks like object detection: we care more about objects such as cars or pedestrians along the road than about, for example, the sky. Therefore, we propose to conduct adversarial training on the features extracted from the region proposals, to keep the semantics of significant objects as well as further reduce the domain shift in the feature space.

III Method

We tackle the problem of object detection in the context of unsupervised domain adaptation, in which we assume available source images with ground truth labels and target images without any labels. Our goal is to train a model on the source data that also performs well on the target dataset. The whole framework is shown in Fig. 2. Our model can be divided into two parts: a pixel level domain adaptation part (PDA), mainly based on CycleGAN, and an object detection part, which is enhanced with several additional neural layers for feature level based domain adaptation (FDA). The two parts are integrated together and can be trained end-to-end to pursue better performance on the target task. Source images are first converted into the target image style. The transferred image is then sent into the second part to train the network for object detection, as well as for domain classification together with a sampled image from the target domain.

III-A Pixel Level Based Domain Adaptation

We first introduce the pixel level based domain adaptation module. As shown in Fig. 2, two symmetric generative models $G_{ST}$ and $G_{TS}$ are employed to generate images in the two domains $S$ and $T$ separately, and two discriminators $D_T$ and $D_S$ are trained to distinguish real sampled images from fake generated ones. The whole training process is a min-max game, since each generator always wants to generate images that cannot be distinguished from real ones, while the corresponding discriminator is simultaneously trained to be good at classifying real and fake instances. The objective for generator $G_{ST}$ and discriminator $D_T$ can be formulated as equation (1):

$$\mathcal{L}_{GAN}(G_{ST}, D_T) = \mathbb{E}_{t \sim T}\big[\log D_T(t)\big] + \mathbb{E}_{s \sim S}\big[\log\left(1 - D_T(G_{ST}(s))\right)\big] \quad (1)$$
A similar objective $\mathcal{L}_{GAN}(G_{TS}, D_S)$ can be formulated for $G_{TS}$ and $D_S$, and is omitted here. This kind of GAN objective can, in theory, learn the mapping functions $G_{ST}$ and $G_{TS}$ to produce images that appear identically sampled from the data distributions of $T$ and $S$. However, it also faces the problems of mode collapse [30] and of losing the structural information of the source image. To address these problems, a cycle consistency loss is adopted here to force an image generated by $G_{ST}$ to be reconstructed identically to the original after it is sent through generator $G_{TS}$, and vice versa. The whole cycle consistency loss is formulated in equation (2):
$$\mathcal{L}_{cyc}(G_{ST}, G_{TS}) = \mathbb{E}_{s \sim S}\big[\lVert G_{TS}(G_{ST}(s)) - s \rVert_1\big] + \mathbb{E}_{t \sim T}\big[\lVert G_{ST}(G_{TS}(t)) - t \rVert_1\big] \quad (2)$$
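As a concrete illustration, the cycle consistency idea can be sketched in a few lines of numpy. The toy "generators" below are simple invertible scalar functions standing in for $G_{ST}$ and $G_{TS}$ (purely illustrative, not the actual networks), so the cycle loss evaluates to zero:

```python
import numpy as np

def cycle_consistency_loss(x_s, x_t, g_st, g_ts):
    """L1 cycle loss: source -> target -> source should reconstruct
    the source image, and target -> source -> target likewise."""
    s_rec = g_ts(g_st(x_s))
    t_rec = g_st(g_ts(x_t))
    return np.abs(s_rec - x_s).mean() + np.abs(t_rec - x_t).mean()

# Toy 'generators' that are exact inverses of each other,
# so the cycle reconstruction is perfect.
g_st = lambda x: 2.0 * x + 1.0
g_ts = lambda x: (x - 1.0) / 2.0

x_s = np.random.rand(4, 4)  # stand-in for a source image
x_t = np.random.rand(4, 4)  # stand-in for a target image
loss = cycle_consistency_loss(x_s, x_t, g_st, g_ts)  # numerically ~0
```

With real generators the two mappings are only approximate inverses, and the L1 penalty on the reconstruction error is what discourages mode collapse and preserves structure.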
III-B Object Detection With Domain Adversarial Training

Our detection network is based on the well-known Faster R-CNN model, in which a region proposal network (RPN) is trained to generate region proposals and a Fast R-CNN [10] module performs region classification and bounding box regression. In order to solve the semantic inconsistency problem as well as to further minimize the domain bias, a small fully convolutional network is added to the detection module for domain adversarial training; its inputs are the features extracted by each individual proposal from the final convolutional layers of our backbone network. Gradients generated by the detection loss and the domain classification loss flow into the shared convolution layers and the PDA module to make the model better suited to the object detection task. Assuming there are $C$ categories in our detection task, the region classification layer outputs a $(C+1)$-dimensional probability distribution $p = (p_0, \dots, p_C)$ for each region proposal (RP), with the one additional category for background. Four real values $t^u = (t_x, t_y, t_w, t_h)$ are predicted for each possible class $u$ by the bounding box regression layer to approach the ground truth regression target $v$. The full objective for our object detection module can be formulated as equation (3):

$$\mathcal{L}_{det}(p, u, t^u, v) = \mathcal{L}_{cls}(p, u) + [u \geq 1]\,\mathcal{L}_{reg}(t^u, v) \quad (3)$$
in which we adopt the cross entropy loss for $\mathcal{L}_{cls}$ and the smooth L1 loss [10] for the bounding box regression term $\mathcal{L}_{reg}$. As shown in Fig. 2, the domain classifier learns to classify translated source images as label 0 and target images as label 1, and the gradient is reversed before it flows into the shared convolutional layers. The objective of the domain adversarial training module is shown in equation (4):
$$\mathcal{L}_{dmn} = -\sum_{i}\big[d_i \log p_i + (1 - d_i)\log(1 - p_i)\big] \quad (4)$$
in which $p_i$ denotes the domain classification output on the $i$-th RP in the image and $d_i$ is its domain label. Based on the above equations, we can formulate our full training objective as equation (5), in which $\lambda_1, \lambda_2, \lambda_3$ are weights to balance the different losses.
$$\mathcal{L} = \mathcal{L}_{det} + \lambda_1\big(\mathcal{L}_{GAN}(G_{ST}, D_T) + \mathcal{L}_{GAN}(G_{TS}, D_S)\big) + \lambda_2 \mathcal{L}_{cyc} + \lambda_3 \mathcal{L}_{dmn} \quad (5)$$
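A minimal numpy sketch of how the individual loss terms might be computed and combined into a single weighted objective; the helper names and the placeholder loss weights are illustrative, not the paper's implementation:

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 from Fast R-CNN: quadratic near zero, linear elsewhere,
    which makes the regression loss less sensitive to outliers."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x * x, ax - 0.5)

def cross_entropy(p, label):
    """Classification loss on a single probability vector."""
    return -np.log(p[label])

def total_loss(l_det, l_gan, l_cyc, l_dmn, lam=(1.0, 1.0, 1.0)):
    """Weighted sum of the training objectives; lam holds placeholder
    weights, not the values used in the paper."""
    l1, l2, l3 = lam
    return l_det + l1 * l_gan + l2 * l_cyc + l3 * l_dmn

# Toy values for one region proposal:
p = np.array([0.1, 0.7, 0.2])                   # class probabilities
l_cls = cross_entropy(p, 1)                     # -log(0.7)
l_reg = smooth_l1(np.array([0.5, -2.0])).sum()  # 0.125 + 1.5
l_det = l_cls + l_reg
l_total = total_loss(l_det, l_gan=0.3, l_cyc=0.2, l_dmn=0.1)
```

In training, the gradient of the domain term is reversed before it reaches the shared layers, while the other terms backpropagate normally.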
IV Implementation

IV-A Dataset

To test our method's validity and to compare with the current state-of-the-art (SOTA) work [32], we choose the Cityscapes, Foggy-Cityscapes [44], Kitti [45] and Sim10k [46] datasets for our experiments. The Cityscapes dataset has 2975 training images and 500 validation images; eight classes of common traffic participants are annotated with instance labels. Kitti is another well-known dataset for benchmarking different vision tasks in autonomous driving, such as depth estimation, stereo matching, scene flow or optical flow estimation, and object detection. It has 7481 labeled training images with bounding boxes for the categories 'car', 'pedestrian' and 'cyclist'. Foggy-Cityscapes and Sim10k are both synthetic datasets that simulate driving scenes. In particular, Foggy-Cityscapes images are rendered from the real Cityscapes dataset to simulate foggy weather conditions, while the Sim10k dataset has 10,000 training images collected from the computer game Grand Theft Auto (GTA) and annotated automatically through access to the original game engine. In Sim10k, only 'car' objects are annotated with bounding boxes for the detection task, so we only calculate and compare the 'car' detection results for the experiments on Sim10k in this paper. Four example images randomly sampled from the four datasets are shown in Fig. 3 to illustrate the domain bias.

(a) Cityscapes
(b) Foggy-Cityscapes
(c) Sim10k
(d) Kitti
Fig. 3: Four example images randomly sampled from the four datasets to show the domain bias.

IV-B Network Architecture

We use the U-Net structure [47] with skip connections for the two generators in the pixel adaptation module, and PatchGAN [36] for the two discriminators. Instance normalization is adopted, since it is more effective, as stated in the original CycleGAN paper. For the detection network, we use VGG16 [11] as the backbone and another small fully convolutional network for domain adaptation.

IV-C Implementation Details

Constrained by GPU memory, we scale the height of each image to 512 in the training stage, and then crop image patches of size 512x512 for the pixel level based adaptation. Least-squares GAN [48] objectives are used to replace the log likelihood objectives for the adversarial losses, due to their capability of stabilizing the training process and generating higher quality images. The generated target-style image, along with its ground truth labels, is then sent into the following network for detection training as well as domain adversarial training. Inputs for the domain classifier are the cropped conv5 features of VGG16, based on the region proposals generated by the RPN module. In practice, we choose to first pretrain the detection and image translation networks independently and then conduct end-to-end training based on these two pretrained models, mainly because most of the generated images are quite noisy at the start of training the pixel level adaptation module. We train the PDA module with the Adam optimizer and an initial learning rate of 0.0002. After 30 epochs, the learning rate linearly decays to zero over another 30 epochs. The FDA module is trained together with the object detection network using standard SGD with an initial learning rate of 0.001. After 6 epochs, we reduce the learning rate to 0.0001 and train the network for another 3 epochs. Gradients from the domain classifier are reversed before they flow into the shared CNN layers during backpropagation. For the end-to-end training, all the above initial learning rates are scaled down by a factor of ten, and we then finetune the PDA and FDA modules for another 10 epochs. The loss weights are kept fixed across all the experiments, unless otherwise mentioned. A Tesla M40 GPU with 24G memory is used for network training.
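The least-squares GAN objectives mentioned above can be sketched as follows. This is a generic LSGAN sketch under the usual 0/1 target labels, not the paper's exact code:

```python
import numpy as np

def lsgan_d_loss(d_real, d_fake):
    """Least-squares discriminator loss: push real outputs toward 1
    and fake outputs toward 0 with a quadratic penalty, which gives
    smoother gradients than the saturating log loss."""
    return 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)

def lsgan_g_loss(d_fake):
    """The generator wants the discriminator to score its images as real (1)."""
    return 0.5 * np.mean((d_fake - 1.0) ** 2)

# Toy discriminator outputs on a batch of two real and two fake patches:
d_real = np.array([0.9, 0.8])
d_fake = np.array([0.1, 0.2])
d_loss = lsgan_d_loss(d_real, d_fake)
g_loss = lsgan_g_loss(d_fake)
```

Because the quadratic penalty does not saturate for confidently misclassified samples, the generator keeps receiving useful gradients even when the discriminator is far ahead, which is the stabilizing effect the text refers to.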

Faster R-CNN 30.1
SOTA [32] 38.97
Our method (PDA) 37.8
Our method (FDA) 33.8
Our method (PDA+FDA) 39.6
Oracle 48.3
TABLE I: Detection results (car mAP, %) evaluated on Cityscapes with Sim10k as the source dataset

V Results

We show experimental results on the dataset pairs Sim10k to Cityscapes, Sim10k to Kitti, and Cityscapes to Foggy-Cityscapes separately. For each dataset pair, we conduct three experiments covering our PDA module, our FDA module, and the final end-to-end training. To test the validity of the PDA module, we train only the image translation network and use the translated images to train a pure detection network. For the FDA training, we directly use the source images as inputs for detection training, and a randomly sampled image from the target domain is combined with the source image for the domain adversarial training. Finally, we integrate the two modules and train the whole network end-to-end. All results are evaluated and compared using the commonly used metric of mean Average Precision (mAP) over all detection classes.
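For reference, a simplified sketch of how AP (the per-class quantity that is averaged into mAP) can be computed from ranked detections. Real detection benchmarks typically use interpolated precision, which this sketch omits:

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """AP as the area under the precision-recall curve, computed from
    detections sorted by confidence. `is_tp` marks whether each
    detection matched a ground truth box (e.g. IoU >= 0.5), and
    `num_gt` is the number of ground truth objects of this class."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    precision = cum_tp / (np.arange(len(tp)) + 1)
    recall = cum_tp / num_gt
    # Accumulate precision at each point where recall increases.
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_r)
        prev_r = r
    return ap

# Perfect detector: every detection is a true positive -> AP = 1.0
ap = average_precision([0.9, 0.8, 0.7], [1, 1, 1], num_gt=3)
```

mAP is then simply the mean of this quantity over all annotated classes (for Sim10k experiments, only 'car').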

V-A Sim10k to Cityscapes

Table I shows our results with the different adaptation modules. Our baseline is trained purely on the source Sim10k dataset and evaluated on Cityscapes; it reaches an mAP of 30.12% with Faster R-CNN using a VGG16 backbone, while the oracle result is obtained by training directly on the original Cityscapes dataset. Compared with the baseline and the current SOTA, our method achieves gains of +9.48% and +0.63%, respectively. Specifically, our feature level based domain adversarial training improves the baseline to 33.8%, while the PDA module brings +7.78% gains. When we integrate the two modules, our result is even slightly better than the current SOTA. Fig. 4 shows several image translation results generated by the PDA module. When the two modules (PDA and FDA) are integrated, better adaptation results are achieved, as shown in Fig. 5: parts of the target cars (e.g. the roof) that merged into the background with PDA-only training retain more details and characteristics after the combined PDA and FDA training.

V-B Sim10k to Kitti

Since no other current work reports detection results under this specific setting, we only compare our results with the baseline. All results are evaluated on the Kitti training split. As shown in Table II, the baseline Faster R-CNN network has an accuracy of 52.67%, which is improved to 55.3% and 58.4% by our proposed FDA and PDA modules independently. After end-to-end training, we obtain the highest accuracy of 59.3%. Translated images from Sim10k to Kitti are shown in Fig. 6.

V-C Cityscapes to Foggy-Cityscapes

To compare with the current SOTA work, we also show experimental results for domain adaptation from Cityscapes to its foggy version in Table III. All training settings for the baseline and oracle results are the same as in Section V-A. We achieve +10.1% performance gains over the baseline result and +1.8% over the SOTA. Both of our proposed modules can largely improve detection performance under foggy conditions. The image adaptation results are shown qualitatively in Fig. 7.

Faster R-CNN 52.7
Our method (PDA) 58.4
Our method (FDA) 55.3
Our method (PDA+FDA) 59.3
TABLE II: Detection results (car mAP, %) evaluated on Kitti with Sim10k as the source dataset
Faster R-CNN 18.8
SOTA [32] 27.6
Our method (PDA) 27.1
Our method (FDA) 23.6
Our method (PDA+FDA)
Oracle 35.0
TABLE III: Detection results (mAP, %) evaluated on Foggy-Cityscapes with Cityscapes as the source dataset

V-D Discussion

From the above three experiments, we can see that our proposed methods achieve much better performance than the baseline. Even compared with the current SOTA, our model still slightly outperforms their results, which further confirms the importance of performing domain adaptation in both the feature and pixel spaces. Our PDA results also show that image translation based purely on the cycle consistency loss cannot guarantee that the pixels' semantics remain consistent, as shown in Fig. 4(b), where the sky is mapped into trees to fit the domain characteristics of the Cityscapes dataset. Training the two proposed modules together helps to keep the details of the target objects (e.g. the cars in Fig. 5) and hence improves the performance of our target vision tasks.

Fig. 4: Results of image translation from Sim10k to Cityscapes. (a) and (c) are two randomly sampled Sim10k images; (b) and (d) are the corresponding translated images in the domain of Cityscapes.
Fig. 5: Comparison between the images generated by PDA and PDA+FDA. (a) is the sampled real image from Sim10k; (b) is the transformed image with PDA only; (c) is the transformed image with PDA and FDA trained together; (d) shows cropped cars for a closer comparison of the adaptation details under the different methods.
Fig. 6: Results of image translation from Sim10k to Kitti. (a) and (c) are two randomly sampled Sim10k images; (b) and (d) are the corresponding translated images in the domain of Kitti.
Fig. 7: Results of image translation from Cityscapes to Foggy-Cityscapes. (a) and (c) are two randomly sampled Cityscapes images; (b) and (d) are the corresponding translated images in the domain of Foggy-Cityscapes.

VI Conclusion and Future Work

In this paper, a new unsupervised domain adaptation method is proposed to solve the object detection problem in the field of autonomous driving. Extensive experiments are conducted to verify the capability of our model. We achieve better performance than the current SOTA work by conducting adaptation in both the pixel and feature spaces. Despite its validity, one shortcoming of our method is that it can only address the transformation between two specific modalities at a time. One interesting research direction may be to design multi-modal domain adaptation networks to equip our model with the capability of dealing with, for example, different weather conditions (rain, snow, fog) or season changes. We also plan to tackle more complex cross domain vision tasks, such as instance segmentation or depth estimation, based on our proposed methods, which will be our future work.


  • [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, pp. 1097–1105, 2012.
  • [2] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3213–3223, 2016.
  • [3] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira, “Analysis of representations for domain adaptation,” in Advances in neural information processing systems, pp. 137–144, 2007.
  • [4] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan, “A theory of learning from different domains,” Machine learning, vol. 79, no. 1-2, pp. 151–175, 2010.
  • [5] R. Gopalan, R. Li, and R. Chellappa, “Domain adaptation for object recognition: An unsupervised approach,” in Computer Vision (ICCV), 2011 IEEE International Conference on, pp. 999–1006.   IEEE, 2011.
  • [6] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan, “Unsupervised pixel-level domain adaptation with generative adversarial networks,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, no. 2, p. 7, 2017.
  • [7] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell, “Cycada: Cycle-consistent adversarial domain adaptation,” arXiv preprint arXiv:1711.03213, 2017.
  • [8] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” arXiv preprint arXiv:1703.10593, 2017.
  • [9] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587, 2014.
  • [10] R. Girshick, “Fast r-cnn,” arXiv preprint arXiv:1504.08083, 2015.
  • [11] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [12] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich et al., “Going deeper with convolutions,” in CVPR, 2015.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
  • [14] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, vol. 1, no. 2, p. 3, 2017.
  • [15] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788, 2016.
  • [16] J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,” arXiv preprint arXiv:1804.02767, 2018.
  • [17] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in European conference on computer vision, pp. 21–37.   Springer, 2016.
  • [18] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, “Dssd: Deconvolutional single shot detector,” arXiv preprint arXiv:1701.06659, 2017.
  • [19] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, pp. 91–99, 2015.
  • [20] J. Dai, Y. Li, K. He, and J. Sun, “R-fcn: Object detection via region-based fully convolutional networks,” in Advances in neural information processing systems, pp. 379–387, 2016.
  • [21] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, “Domain adaptation via transfer component analysis,” IEEE Transactions on Neural Networks, vol. 22, no. 2, pp. 199–210, 2011.
  • [22] K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J. Smola, “Integrating structured biological data by kernel maximum mean discrepancy,” Bioinformatics, vol. 22, no. 14, pp. e49–e57, 2006.
  • [23] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu, “Transfer feature learning with joint distribution adaptation,” in Proceedings of the IEEE international conference on computer vision, pp. 2200–2207, 2013.
  • [24] M. Long, Y. Cao, J. Wang, and M. I. Jordan, “Learning transferable features with deep adaptation networks,” arXiv preprint arXiv:1502.02791, 2015.
  • [25] M. Long, H. Zhu, J. Wang, and M. I. Jordan, “Unsupervised domain adaptation with residual transfer networks,” in Advances in Neural Information Processing Systems, pp. 136–144, 2016.
  • [26] B. Sun, J. Feng, and K. Saenko, “Return of frustratingly easy domain adaptation.” in AAAI, vol. 6, no. 7, p. 8, 2016.
  • [27] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-adversarial training of neural networks,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 2096–2030, 2016.
  • [28] Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation by backpropagation,” arXiv preprint arXiv:1409.7495, 2014.
  • [29] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell, “Adversarial discriminative domain adaptation,” in Computer Vision and Pattern Recognition (CVPR), vol. 1, no. 2, p. 4, 2017.
  • [30] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, pp. 2672–2680, 2014.
  • [31] K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan, “Domain separation networks,” in Advances in Neural Information Processing Systems, pp. 343–351, 2016.
  • [32] Y. Chen, W. Li, C. Sakaridis, D. Dai, and L. Van Gool, “Domain adaptive faster r-cnn for object detection in the wild,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3339–3348, 2018.
  • [33] N. Srivastava and R. R. Salakhutdinov, “Multimodal learning with deep boltzmann machines,” in Advances in neural information processing systems, pp. 2222–2230, 2012.
  • [34] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, “Multimodal deep learning,” in Proceedings of the 28th international conference on machine learning (ICML-11), pp. 689–696, 2011.
  • [35] J. Yang, J. Wright, T. S. Huang, and Y. Ma, “Image super-resolution via sparse representation,” IEEE transactions on image processing, vol. 19, no. 11, pp. 2861–2873, 2010.
  • [36] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” arXiv preprint, 2017.
  • [37] J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman, “Toward multimodal image-to-image translation,” in Advances in Neural Information Processing Systems, pp. 465–476, 2017.
  • [38] M.-Y. Liu and O. Tuzel, “Coupled generative adversarial networks,” in Advances in neural information processing systems, pp. 469–477, 2016.
  • [39] Y. Aytar, L. Castrejon, C. Vondrick, H. Pirsiavash, and A. Torralba, “Cross-modal scene networks,” IEEE transactions on pattern analysis and machine intelligence, 2017.
  • [40] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in European Conference on Computer Vision, pp. 694–711.   Springer, 2016.
  • [41] H. Zhang and K. Dana, “Multi-style generative network for real-time transfer,” arXiv preprint arXiv:1703.06953, 2017.
  • [42] L. A. Gatys, A. S. Ecker, and M. Bethge, “Image style transfer using convolutional neural networks,” in Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pp. 2414–2423.   IEEE, 2016.
  • [43] P. Li, X. Liang, D. Jia, and E. P. Xing, “Semantic-aware grad-gan for virtual-to-real urban scene adaption,” arXiv preprint arXiv:1801.01726, 2018.
  • [44] C. Sakaridis, D. Dai, and L. Van Gool, “Semantic foggy scene understanding with synthetic data,” arXiv preprint arXiv:1708.07819, 2017.
  • [45] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.
  • [46] M. Johnson-Roberson, C. Barto, R. Mehta, S. N. Sridhar, K. Rosaen, and R. Vasudevan, “Driving in the matrix: Can virtual worlds replace human-generated annotations for real world tasks?” in Robotics and Automation (ICRA), 2017 IEEE International Conference on, pp. 746–753.   IEEE, 2017.
  • [47] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention, pp. 234–241.   Springer, 2015.
  • [48] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley, “Least squares generative adversarial networks,” in 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2813–2821.   IEEE, 2017.