Active Adversarial Domain Adaptation

04/16/2019 ∙ by Jong-Chyi Su, et al. ∙ 16

We propose an active learning approach for transferring representations across domains. Our approach, active adversarial domain adaptation (AADA), explores a duality between two related problems: adversarial domain alignment and importance sampling for adapting models across domains. The former uses a domain discriminative model to align domains, while the latter utilizes it to weigh samples to account for distribution shifts. Specifically, our importance weight promotes samples with large uncertainty in classification and diversity from labeled examples, and thus serves as a sample selection scheme for active learning. We show that these two views can be unified in one framework for domain adaptation and transfer learning when the source domain has many labeled examples while the target domain does not. AADA provides significant improvements over fine-tuning based approaches and other sampling methods when the two domains are closely related. Results on challenging domain adaptation tasks, e.g., object detection, demonstrate that the advantage over baseline approaches is retained even after hundreds of examples have been actively annotated.







1 Introduction

The assumption that training and test data are drawn from the same distribution may not hold in practical applications of machine learning and computer vision. Consequently, a predictor trained on the source domain may perform poorly when evaluated on a target domain different from the source. This covariate shift problem is common in practice, e.g., the seasonal distribution of natural species may change in a camera trap dataset, or the image resolution can change from one dataset to another.

Many domain adaptation (DA) methods have been proposed to address this issue [7, 32, 59, 57, 33, 10, 58]. Covariate shift assumes that the marginal distribution of the data changes from the source $p_s(x)$ to the target $p_t(x)$, while the conditional label distribution $p(y|x)$ remains the same. Domain adaptation methods operate by minimizing the difference between the marginal distributions of the source and target domains after projecting the data through an embedding $G_f$, e.g., a deep network, while at the same time remaining predictive of the label distribution in the source domain. By matching the marginals, the covariate shift is reduced, thus improving the generalization of the model on the target domain compared to an “unadapted” model.

Figure 1: Source and target domain data are shown in blue and red. Circles and crosses represent class labels, while question marks are unlabeled data. We employ adversarial training to align features across source and target domains, and use the discriminator predictions to compute the importance weight for sample selection in active learning.

While domain adaptation provides a good starting point, the performance of unsupervised DA methods often falls far behind that of their supervised counterparts [56, 4]. In such cases, some labeled data from the target domain can bring performance benefits. However, obtaining ground-truth annotation can be laborious, and naïvely collecting annotated data can be inefficient. In this work, we aim to answer the following questions: 1) how to effectively select data to label from the target domain, and 2) how to perform adaptation given these labeled data from the target domain.

To this end, we propose Active Adversarial Domain Adaptation (AADA), which exploits the relation between domain adaptation and active learning to answer those questions. When there exists labeled data from the target domain besides the source, the problem changes to multiple-source DA [35, 65]. Addressing our second question, we propose domain adversarial learning [10] between the union of labeled data from both source and target domains and the unlabeled target data when the amount of labeled target data is small. However, after several rounds of active selection accumulate many labeled data from the target domain, adversarial adaptation becomes counterproductive and simple transfer learning approaches (e.g., fine-tuning) serve the purpose.

Inspired by importance weighted empirical risk minimization [53, 54], we address our first question by proposing a sample selection criterion composed of two cues: a diversity cue and an uncertainty cue. The diversity cue comes from the importance weight $w(x) = p_t(x)/p_s(x)$, which can be estimated efficiently from the domain discriminator thanks to domain adversarial learning [12]. This allows one to sample unlabeled target data that are different from the labeled ones. The uncertainty cue is a lower bound to the empirical risk, which in our case takes the form of the entropy of the classification distribution. This promotes unlabeled data with low confidence for the next round of annotation. The overall framework of AADA is illustrated in Figure 1.

In experiments, we first validate the effectiveness of our approach on the digit classification task from SVHN to MNIST in Section 4, showing significant improvements over other baselines in domain adaptation, transfer learning, and active learning. We then conduct experiments on object recognition on the Office [46] and VisDA [39] datasets, which exhibit larger domain shifts, in Section 5. Last, we extend our method to the object detection task, adapting from the KITTI dataset [11] to the Cityscapes dataset [6]. The proposed AADA outperforms the fine-tuning baseline by 6% when only 50 labeled images from the target domain are available.

Finally, we summarize our contributions as follows:

  • An active learning framework that integrates domain adversarial learning and active learning for continuous semi-supervised domain adaptation.

  • Improved classification performance with domain adversarial learning, while the discriminator prediction yields better importance weight for sampling.

  • A connection between our sampling method and importance weight with domain adversarial training.

  • Reduced labeling cost on target domain on object classification and detection tasks.

2 Related Work

2.1 Domain Adaptation

Domain adaptation (DA) aims to make a model invariant to the domain shift between source and target data. For example, [7] uses unlabeled data to measure the inconsistency between source and target domain classifiers. Deep domain adaptation has been successful in recent years. The key idea is to measure the domain discrepancy at a certain layer of a deep network using a domain discriminator [10, 3] or a maximum mean discrepancy kernel [32, 59, 57, 33], and train CNNs to reduce the discrepancy. In addition, approaches that borrow techniques from semi-supervised learning, such as entropy minimization [13, 27], have been proposed to enhance classification performance [33, 64]. DA has also been applied to more complicated vision tasks beyond classification, such as object detection [4, 19] and semantic segmentation [16, 15, 56]. Annotation for these tasks is much more costly, so how to select images to label becomes more crucial.

2.2 Active Learning

Active learning aims to maximize the performance with limited annotation budget [5, 50]. Thus, the challenge is to quantify the informativeness of unlabeled data [23] so that they are maximally useful when annotated. Many sampling strategies based on uncertainty [48, 28], diversity [17, 8], representativeness [62], reducing expected error [44, 61] and maximizing expected label changes [9, 60, 21] are studied and applied to vision tasks such as classification [41, 20], object detection [22], image segmentation [55, 60, 34], and human pose estimation [31]. Among these, uncertainty sampling is simple and computationally efficient, making it a popular strategy in real-world applications.

Learning-based active learning methods [25, 18] have been proposed recently, formulating the query procedure as a regression problem and learning selection strategies from previous outcomes. Deep active learning methods [49, 52] have been studied for image classification and named-entity recognition. [66, 36] propose to use generative models to synthesize data for training, but the performance is largely dependent on the quality of the synthetic data, limiting their generality.

2.3 Active Learning for Domain Adaptation

Different from the aforementioned methods, we aim to unify active learning and domain adaptation. In this regard, the most relevant work is ALDA [47, 42], which demonstrated its effectiveness on sentiment and landmine classification tasks. ALDA trains three models: a source classifier, a domain adaptive classifier, and a domain separator. It first selects unlabeled target samples using the domain separator, then decides whether to acquire each label from the source classifier (without cost) or from the oracle (with cost). The domain adaptive classifier is then updated with the obtained labeled data.

Besides using deep learning, the proposed AADA is different from ALDA in several ways. First, our discriminator not only helps the sample selection, but also trains the recognition model adversarially to reduce the domain gap. Moreover, we combine diversity in the form of discriminator prediction and uncertainty in the form of entropy. To the best of our knowledge, we are the first to jointly tackle domain adaptation and active learning using deep networks on vision tasks.

3 Proposed Algorithm

In this section, we introduce our active adversarial domain adaptation (AADA). We begin with the background of domain adversarial neural networks in Section 3.1, then motivate our sampling strategy via importance weighting in Section 3.2. The algorithm summary and its theoretical background under the semi-supervised domain adaptation setting are provided in Section 3.3.

3.1 Domain Adaptation

In this section, we introduce the learning objective of our domain adaptation model. For simplicity, we describe the model for the image classification task. We denote $\mathcal{X}$ as the input space and $\mathcal{Y}$ as the label space. The source data and (unlabeled) target data are drawn from the distributions $p_s(x, y)$ and $p_t(x)$, respectively. We adopt the domain adversarial neural network (DANN) [10], which is composed of three components: a feature extractor $G_f$ for the input $x$, a class predictor $G_y$ that predicts the class label $y$, and a discriminator $G_d$ that classifies the domain label $d$. We use $d = 1$ for the source domain and $d = 0$ for the target domain. The objective function of the discriminator is defined as:

$$\mathcal{L}_D = -\mathbb{E}_{x \sim p_s}\big[\log G_d(G_f(x))\big] - \mathbb{E}_{x \sim p_t}\big[\log\big(1 - G_d(G_f(x))\big)\big], \qquad (1)$$

where $G_f$, $G_y$, $G_d$ are parameterized by $\theta_f$, $\theta_y$, $\theta_d$, respectively. To perform domain alignment, features generated by $G_f$ should be able to fool the discriminator $G_d$, and hence we adopt an adversarial loss to form a min-max game:

$$\min_{\theta_f, \theta_y} \max_{\theta_d} \; \mathbb{E}_{(x, y) \sim p_s}\big[\mathcal{L}_{cls}(G_y(G_f(x)), y)\big] - \lambda \mathcal{L}_D, \qquad (2)$$

where $\mathcal{L}_{cls}$ is the cross-entropy loss for classification, $y$ is the class label, and $\lambda$ is the weight between the two losses.
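As a minimal NumPy sketch of these two losses (the function names and the λ value are illustrative, not from the paper; `d_src` and `d_tgt` stand for discriminator outputs on source and target features):

```python
import numpy as np

def discriminator_loss(d_src, d_tgt):
    """Binary cross-entropy for the discriminator G_d:
    source features are labeled 1, target features 0."""
    eps = 1e-12  # numerical safety for log(0)
    return -(np.mean(np.log(d_src + eps)) + np.mean(np.log(1.0 - d_tgt + eps)))

def adversarial_objective(cls_loss, d_loss, lam=0.1):
    """Objective minimized by G_f and G_y: classify source data well
    while making the discriminator's loss large (fooling G_d)."""
    return cls_loss - lam * d_loss
```

An undecided discriminator that outputs 0.5 everywhere attains the loss `2 log 2`, the same fixed point as in the GAN objective.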

Figure 2: Our proposed AADA algorithm. We start from the unsupervised domain adaptation setting with labeled source data $L_s$ and unlabeled target data $U_t$, and train the model with the domain adversarial loss. In each following round, we first select samples from the unlabeled target domain using the importance weight and obtain their annotations. We then re-train the model with the labeled data $L_s \cup L_t$ and the unlabeled data $U_t$.

3.2 Sample Selection

Given an unsupervised domain adaptation setting where labeled data is available only from the source domain, the goal of our sample selection is to find the most informative data from the unlabeled target domain. We motivate the sample selection criteria from the idea of importance weighted empirical risk minimization (IWERM) [53], whose learning objective is defined as follows:

$$\min_{\theta} \; \mathbb{E}_{(x, y) \sim p_s}\left[\frac{p_t(x)}{p_s(x)} \, \mathcal{L}(f_{\theta}(x), y)\right], \qquad (3)$$

where $w(x) = p_t(x)/p_s(x)$ is the importance of each labeled example in the source domain. The formulation indicates which data is more important during optimization: first, data with higher empirical risk $\mathcal{L}(f_{\theta}(x), y)$, and second, data with higher importance, i.e., larger density in the target distribution $p_t$ but lower density in the source distribution $p_s$.
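The IWERM objective can be sketched in a few lines (a toy NumPy version assuming the per-example losses and the two densities are given as arrays; in practice the density ratio must be estimated):

```python
import numpy as np

def iwerm_risk(losses, p_t, p_s):
    # Importance-weighted empirical risk over labeled source examples:
    # each per-example loss is re-weighted by w(x) = p_t(x) / p_s(x).
    w = p_t / p_s
    return float(np.mean(w * losses))
```

Examples that are twice as dense under the target distribution contribute twice as much to the risk.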

Unfortunately, applying this intuition to derive a sample selection strategy is non-trivial. This is because the target data is mostly unlabeled, so the empirical risk cannot be computed before annotation. Another problem is that importance estimation for high-dimensional data is difficult [54]. We take advantage of the domain discriminator to resolve the second issue. Note that, with adversarial training, the optimal discriminator [12] is obtained at

$$G_d^*(G_f(x)) = \frac{p_s(x)}{p_s(x) + p_t(x)}, \qquad (4)$$

so that the importance weight can be written as $w(x) = \frac{p_t(x)}{p_s(x)} = \frac{1 - G_d^*(G_f(x))}{G_d^*(G_f(x))}$. Next, assuming cross-entropy as the empirical risk, we resolve the first issue by measuring the entropy of the classifier prediction on unlabeled data, which is a lower bound to the cross-entropy by Gibbs' inequality. Finally, our sample selection criterion for unlabeled target data is written as follows:

$$s(x) = \frac{1 - G_d^*(G_f(x))}{G_d^*(G_f(x))} \, H\big(G_y(G_f(x))\big). \qquad (5)$$

The two components in this measure are interpreted as follows: 1) the diversity cue $\frac{1 - G_d^*(G_f(x))}{G_d^*(G_f(x))}$, and 2) the uncertainty cue $H(G_y(G_f(x)))$. The diversity cue lets us select unlabeled target data that is less similar to the labeled data in the source domain, while the uncertainty cue suggests data which the model cannot predict confidently.

3.3 Active Adversarial Domain Adaptation

Based on the two objectives of domain adaptation and sample selection, we explain how these two components collaborate for active learning in domain adaptation.

Input: labeled source $L_s$; unlabeled target $U_t$;
      labeled target $L_t \leftarrow \emptyset$; budget per round $b$
Model: $\mathcal{M} = \{\theta_f, \theta_y, \theta_d\}$; feature extractor $G_f$;
      class predictor $G_y$; discriminator $G_d$
Train $\mathcal{M}$ with $(L_s, U_t)$
for round 1 to MaxRound do
     Compute $s(x)$ for all $x \in U_t$ via (5)
     Select a set $X_b$ of $b$ images from $U_t$ according to $s(x)$
     Get labels for $X_b$ from oracle; move $X_b$ from $U_t$ to $L_t$
     Train $\mathcal{M}$ with $(L_s \cup L_t, U_t)$
Algorithm 1 AADA
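The outer loop of Algorithm 1 can be sketched as plain Python, with `train`, `score`, and `oracle` as stand-ins for DANN training, the criterion s(x) in (5), and human annotation (these callables are placeholders, not the paper's implementation):

```python
def aada_loop(L_s, U_t, oracle, train, score, budget, max_rounds):
    """Sketch of the AADA loop: alternate adversarial training
    and importance-weighted sample selection."""
    L_t = []                                    # labeled target, empty at first
    model = train(L_s, L_t, U_t)                # round 0: unsupervised DANN
    for _ in range(max_rounds):
        # rank unlabeled target data by score and take the top-b
        batch = sorted(U_t, key=lambda x: score(model, x), reverse=True)[:budget]
        L_t += [(x, oracle(x)) for x in batch]  # annotate selected samples
        U_t = [x for x in U_t if x not in batch]
        model = train(L_s, L_t, U_t)            # re-train with labeled + unlabeled
    return model, L_t, U_t
```

Each round moves exactly `budget` samples from the unlabeled pool into the labeled target set before re-training.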

Collaborative Roles. For domain adaptation, the goal is to learn domain-invariant features via (2) that better serves as a starting point for the next sample selection step. During the adversarial learning process, a discriminator is learned to separate source and target data, and thus we are able to utilize its output prediction as an indication for selection via the importance weight in (5). By iteratively performing adversarial learning and active learning, the proposed method gradually selects informative samples for annotations guided by the domain discriminator, and then these selected samples are used for supervised training to minimize the domain gap, in a collaborative manner.

One may still obtain a discriminator without adversarial learning, as it can easily be trained to separate samples from two different domains. However, learning a discriminator in this way can be problematic for active learning. First, this discriminator may give uniformly high scores to most target samples, so it lacks the capability of selecting informative ones. Moreover, the classifier and this discriminator may focus on different properties if they are not learned jointly; in that case, the informative samples that the discriminator selects are not necessarily beneficial for updating the classifier. We provide more evidence for the necessity of adversarial training in Section 4.3.

Active Learning Process. Our overall active learning framework is illustrated in Figure 2. We start the AADA algorithm by learning a DANN model in the unsupervised domain adaptation setting described in Section 3.1, and then use the learned discriminator to perform the initial round of sample selection from all unlabeled target samples based on (5). After the samples are selected, we acquire their ground-truth labels.

For the following rounds, after annotating the selected samples, we have a small set of labeled target data $L_t$, a set of labeled source data $L_s$, and the remaining unlabeled target data $U_t$. The learning setting thus differs from the initial stage, as we now have two labeled domains, $L_s$ and $L_t$. To accommodate labeled data from both domains, we revisit an analysis of multi-source domain adaptation [1, 2] whose generalization bound is given as:


$$\epsilon_{\alpha}(h) \le \hat{\epsilon}_{\alpha}(h) + \sqrt{\frac{\alpha^2}{\beta} + \frac{(1-\alpha)^2}{1-\beta}} \sqrt{\frac{2d\log(2(m+1)) + 2\log(8/\delta)}{m}}, \qquad (6)$$

with $\epsilon_{\alpha}(h) = \alpha\,\epsilon_t(h) + (1-\alpha)\,\epsilon_s(h)$, where $m$ is the number of labeled examples, $d$ is the VC-dimension of the hypothesis class, and $h$ is a hypothesis (i.e., a classifier). The weight $\alpha \in [0, 1]$ balances the errors on labeled source and labeled target, while $\beta$ is the proportion of labeled target examples among all labeled examples. Assuming zero error on the labeled examples (i.e., $\hat{\epsilon}_{\alpha}(h) = 0$), the bound is tightest when $\alpha = \beta$, since the factor $\sqrt{\alpha^2/\beta + (1-\alpha)^2/(1-\beta)}$ is minimized there.

This leads us to train a new DANN model that adapts from all labeled data $L_s \cup L_t$ to the unlabeled data $U_t$, with uniform sampling of individual examples from the labeled set to ensure the tightest bound. Therefore, we sample labeled source and target examples uniformly when forming batches during training, unless otherwise stated. We then select candidates from the remaining unlabeled target set based on the new discriminator and new classifier, following the same importance sampling strategy, for the next round of training. The overall algorithm is shown in Algorithm 1.
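The uniform batch sampling over the union of labeled sets can be sketched as follows (a toy version; the function name and fixed seed are illustrative):

```python
import random

def sample_labeled_batch(L_s, L_t, batch_size, rng=random.Random(0)):
    """Uniformly sample a batch of individual examples from the union
    of labeled source and labeled target, matching the tightest-bound
    regime where the weight alpha equals the labeled-target proportion beta."""
    pool = list(L_s) + list(L_t)
    return rng.sample(pool, batch_size)
```

Because sampling is uniform over the pooled examples, each domain contributes to a batch in proportion to its share of the labeled data.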

4 Experiments on Digit Classification

As discussed above, our proposed method aims to address two questions: 1) how to select images to label from $U_t$ to yield the largest performance gain, and 2) how to train a classifier given $L_s$, $L_t$, and $U_t$. Our experiments therefore explore both components. In this section, we first perform detailed experiments in a mix-and-match way on the digit classification task from SVHN [37] to MNIST [26]. Specifically, we explore the following training schemes:

1) Adversarial Training:

we train the classifier via (2), using $L_s \cup L_t$ as labeled data and $U_t$ as unlabeled data.

2) Joint Training:

we train the classifier in a supervised way using $L_s \cup L_t$. Note that we still train a discriminator for sample selection, but without adversarial training.

3) Fine-tuning:

we train a classifier using $L_s$ and then fine-tune it on $L_t$, both in a supervised way. The discriminator is trained in the same manner as in Joint Training.

4) Target Only:

we train our classifier with $L_t$ only.
The sampling strategies we explore are:

1) Importance Weight:

we select samples based on the proposed importance weight in (5).

2) K-means Clustering:

we perform k-means clustering on the image features $G_f(x)$, where the number of clusters is set to the budget $b$ in each round. For each cluster, we select the sample closest to its center.

3) K-center (Core-set) [49]:

we use greedy k-center clustering to select $b$ images from $U_t$ such that the largest distance between unlabeled data and labeled data is minimized. We use the L2 distance between image features $G_f(x)$.

4) Diversity [8]:

for each unlabeled sample in $U_t$, we compute its average distance to all samples in the labeled set. We then rank the unlabeled samples by average distance in descending order and select the top $b$ samples. The L2 distance is applied on the features $G_f(x)$.

5) Best-versus-Second Best (BvSB) [20]:

we use the difference between the highest and the second-highest class prediction as the uncertainty measure; samples with the smallest difference are selected first.

6) Random Selection:

we select samples uniformly at random from all the unlabeled target data $U_t$.
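Two of the baselines above can be sketched compactly in NumPy (a sketch of BvSB and greedy k-center on feature arrays; the function names are illustrative):

```python
import numpy as np

def bvsb_margin(probs):
    """Best-versus-second-best margin per sample; a smaller margin
    means the classifier is less certain."""
    top2 = np.sort(probs, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

def kcenter_greedy(unlabeled, labeled, b):
    """Greedy k-center: repeatedly pick the unlabeled feature farthest
    (L2) from the current labeled/selected set."""
    dists = np.linalg.norm(
        unlabeled[:, None, :] - labeled[None, :, :], axis=2).min(axis=1)
    picks = []
    for _ in range(b):
        i = int(np.argmax(dists))
        picks.append(i)
        # distances shrink once the new pick joins the covered set
        dists = np.minimum(dists, np.linalg.norm(unlabeled - unlabeled[i], axis=1))
    return picks
```

BvSB ranks samples in ascending margin order, while k-center greedily covers the feature space.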

Our AADA uses importance weight for sample selection and adversarial training as the training scheme. Note that the other sampling methods do not strictly compete with AADA, as they can be combined with our method. For example, BvSB can be used as an alternative uncertainty measure in place of the entropy in (5).

Experimental Setting

As is common in the active learning literature [34, 55], we simulate oracle annotations using the ground-truth labels in all our experiments. We consider an adaptation task from SVHN to MNIST, where the former and latter are initially the labeled source and unlabeled target, respectively. SVHN contains 73,257 RGB images and MNIST consists of 60,000 grayscale images, both covering the digit classes 0 to 9. Besides differing in color statistics, the images from the two datasets exhibit different local deformations, making the adaptation task challenging. For this task, we use a variant of the LeNet architecture [15] and add an entropy minimization loss for regularization [33] during training. For each round, we train the model for 60 epochs using the Adam optimizer [24] with a step-decayed learning rate schedule (20 epochs per step). The batch size is 128. We set the budget to 10 in each round and perform 30 rounds, eventually selecting 300 images in total from the target domain. We run our experiments with five different random seeds and report the averaged accuracy after each round. We use PyTorch [38] for our implementation.

4.1 Comparison of Sampling Methods

We start by comparing different sampling methods combined with adversarial training. As shown in Figure 3(a), importance weight almost always outperforms its active sampling counterparts: it reaches with 160 samples (after 16 rounds) an accuracy that the random selection baseline requires twice as many annotations to match. Moreover, our proposed method consistently improves as more samples are selected and annotated, whereas the other baselines produce unstable performance. One reason for this observation is that the class distribution of the samples selected in each round is not uniform. If the selected targets are heavily biased towards a few classes, the "mode collapse" issue caused by adversarial training gives high test accuracy on those classes but low accuracy on the others, lowering the overall accuracy. Sampling with the importance weight makes the result more stable after each round. For reference, AADA performs similarly to random selection with 1,000 labeled target examples; the performance saturates at around 5,000 labeled target examples, approaching the accuracy obtained with all 73,257 target labels.

(a) Different sampling strategies with adversarial training.
(b) Different sampling cues with adversarial training.
(c) Different training schemes with random sampling.
(d) Different training schemes with importance weight.
Figure 3: Ablation studies on digit classification (SVHN → MNIST). Each data point is the mean accuracy over five runs, and the error bar shows the standard deviation. We show that: (a) sampling using importance weight performs the best when using adversarial training; (b) combining the diversity and uncertainty cues performs better for selecting samples; (c) fine-tuning is the best training scheme when random sampling is used; (d) when using importance weight for sampling, adversarial training is the best when there are fewer than 250 labeled target examples. Overall, our AADA, which uses adversarial training and importance weight, provides the best performance when few labeled target examples are available.

4.2 Comparison of Different Cues

We perform an ablation study of the two components of the proposed importance weight (5). The diversity cue, i.e., $\frac{1 - G_d^*(G_f(x))}{G_d^*(G_f(x))}$, uses the predictions of the discriminator $G_d$, while the uncertainty cue uses the predictions of the classifier $G_y$. As shown in Figure 3(b), using the diversity cue alone outperforms using the uncertainty cue alone, while combining the two yields the best performance. However, the benefit of each cue may depend on the characteristics of the dataset, as discussed later.

4.3 Comparison of Training Schemes

We compare different training schemes and show the effectiveness of combining adversarial training with importance weight. First, we provide a study of the four training schemes in Figure 3(c), all using random sampling. In this case, we find that adversarial training suffers from the mode collapse problem and fine-tuning is the best option. Fine-tuning is also the most effective and widely used method of transfer learning, as found in the deep learning literature [63, 51].

However, once the imbalanced sampling problem is effectively addressed using the proposed importance weight, we can benefit from adversarial training. Figure 3(d) demonstrates the effectiveness of combining adversarial training with importance weight: it outperforms all settings in Figure 3(c). Moreover, our AADA method is especially effective when very few labeled target examples are available; conversely, as more labeled targets accumulate, fine-tuning becomes the better option because the benefit of leveraging information from the source domain decreases (as explained in Section 1). In our experiment, fine-tuning performs better than adversarial training once more than 250 labeled target examples have been selected using importance weight.

Comparison with ALDA [47].

For the baseline using joint training and importance weight, we train the classifier and the feature extractor with $L_s \cup L_t$, and train the discriminator to separate labeled and unlabeled data. The two objectives are trained jointly but not adversarially. This can be seen as an extension of ALDA [47] to a deep learning framework, despite some differences, namely 1) the use of joint training instead of updating a perceptron, and 2) selecting samples using our proposed importance weight instead of the margins to the linear classifier.

Interestingly, this baseline (shown in Figure 3(d)) is worse than the one using joint training and random sampling (shown in Figure 3(c)). This is mainly due to a lack of diversity: without the adversarial loss, the importance weight can become overly confident and thus fails to provide sufficiently diverse samples. This problem also affects the original ALDA [47] method. Again, as shown in Figure 3(d), our AADA outperforms this baseline on average over the first 25 rounds, showing that adversarial training not only helps adapt the model but also collaborates with the importance weight for sampling.

5 More Experimental Results

In this section, we conduct experiments on object recognition and object detection datasets. Here we focus on comparing different sampling methods and refer the readers to supplementary material for complete comparisons.

5.1 Object Recognition

We validate our idea on the Office domain adaptation dataset [46]. It consists of 31 classes and three domains: amazon (A), webcam (W), and dslr (D). Specifically, we select dslr (D) as the source domain and amazon (A) as the target one. We further split the target domain, using the first portion of images as $U_t$ and the rest as the test set for evaluating all methods. We utilize a ResNet-18 [14] model (before the first fc layer) pre-trained on ImageNet as the feature extractor $G_f$. On top of it, $G_y$ has one fully connected layer, while $G_d$ has three fully connected layers with 256-256-2 channels. We train our model with SGD for 30 epochs. The batch size is 32. The budget per round is set to 50, and we perform 20 rounds in total. We start the first round with random selection for all methods as a warm-up.

Figure 4 compares different sampling baselines, all with adversarial training. Our AADA method performs competitively with BvSB and outperforms all other methods, suggesting that the uncertainty cue is more useful on this dataset. More specifically, AADA consistently outperforms random selection from round 10 to round 20, and reaches with 800 labeled targets an accuracy that random selection requires 200 more annotations to match. Note that BvSB here is a variant of our method, as it also uses our adversarial training scheme, only with a different uncertainty measure.

Figure 4: Object classification results (Office D → A). We compare different sampling methods with adversarial training. BvSB and AADA perform the best, with 81.3% and 80.7% mean accuracy over 20 rounds, respectively.

5.2 Object Detection

Now we focus on the object detection task, adapting from KITTI [11] to Cityscapes [6]. We use the same setting as [4], which considers only the car category and resizes images to 500 pixels on the shorter edge while keeping the aspect ratio. After discarding images without cars, we obtain 6,221 and 2,824 training images from KITTI and Cityscapes respectively, and we hold out 500 images from Cityscapes for testing. Mean average precision at 0.5 IoU (mAP@0.5) is our evaluation metric for this task [4, 19]. We adopt Faster-RCNN [43] with the ResNet-50 architecture combined with FPN [29] as the feature extractor, and perform image-level adaptation as proposed in [4]. We select a fixed number of images in each round and assume that the cost of labelling each image is the same.

We report our quantitative results in Table 1. Our baselines include adversarial training with other sampling methods, as well as different training schemes with random sampling. Note that BvSB is not included here because, in the single-category detection scenario, it provides a measurement similar to entropy. Overall, using adversarial training and importance weight (AADA) gives the best performance. Specifically, the mAP that AADA achieves with 100 labeled target images requires about twice as many annotations for the other baselines. We further illustrate images selected by AADA in two consecutive rounds in Figure 5. As can be seen, we are able to select diverse images with different semantic layouts.

Training       Sampling       Number of Labeled Target
                              10    20    30    50    100   200
Adversarial    Imp. weight    49.4  53.3  54.6  57.4  60.4  62.3
Adversarial    K-means        49.1  51.7  53.8  56.8  59.2  60.9
Adversarial    Entropy        48.9  50.9  52.3  54.3  58.1  61.0
Adversarial    Random         47.4  49.8  51.6  55.2  58.6  61.7
Joint          Imp. weight    48.5  52.1  53.5  56.2  58.6  60.5
Joint          Random         45.5  48.8  51.8  54.9  59.0  61.6
Fine-tuning    Random         41.0  46.0  48.7  51.4  56.0  59.8
Target only    Random         29.0  38.5  42.1  48.3  53.3  58.8
Table 1: Object detection results (KITTI → Cityscapes, mAP@0.5). Our AADA method (first row) outperforms all other baselines, including adversarial training with other sample selection methods, as well as other training schemes with random sampling.
Figure 5: Top 10 images selected in the third and the fourth rounds from the target domain (Cityscapes) using AADA. The ground-truth bounding boxes of cars are shown in yellow. Images selected in the third round have more cars and the semantic layouts are different w.r.t. that of the fourth round, showing that diverse samples are selected by AADA.

5.3 VisDA-18 Challenge

We investigate the VisDA-18 domain adaptation challenge [39, 40] as a special case. The source domain is composed of 78,222 synthetic images across 12 object categories rendered from 3D CAD models, while the target domain contains 5,534 real images. We consider the 12-way classification problem following the setting in [40], and use the ImageNet pre-trained ResNet-18 [14] model as the feature extractor. As noted in [40], without ImageNet pre-training the accuracy is very low and unsupervised domain adaptation methods do not work. However, ImageNet images are closer to the target domain, which raises the question of whether images from the source domain still help in this scenario.

Figure 6: VisDA-18 results (synthetic → real). Here we use fine-tuning as the training scheme and compare different sampling strategies. Using importance weight for sampling performs equally well as the BvSB and k-center baselines, and outperforms the k-means and random baselines. The mean accuracies over 20 rounds are 79.8% and 80.1% for the importance weight and BvSB methods, respectively.

Our initial trial with adversarial training shows improvement when there is no labeled target data. However, after obtaining a few labeled targets, adversarial training does not bring further improvement (see supplementary material). We argue that 1) the domain gap (from synthetic to real images) in this dataset is large, so the benefit of aligning image features from target to source is smaller than that of adding annotated target images, and 2) due to the use of an ImageNet pre-trained model, the target domain (images from MS-COCO [30]) is actually closer to the pre-training domain (images from ImageNet [45]) than to the source domain (synthetic images).

Based on the above observations, we use fine-tuning as our training scheme on VisDA-18 and compare different sampling strategies in Figure 6. We perform 20 rounds in total. Using importance weight for sampling performs on par with BvSB and k-center, and outperforms the k-means and random selection baselines.

6 Conclusion

We propose AADA, a unified framework for domain adaptation and active learning via adversarial training. When few labeled target examples are available, the domain adversarial model helps improve classification; meanwhile, the discriminator can be utilized to obtain the importance weight for active sample selection in the target domain. We conduct extensive ablation studies and analyses, and show improvements over baselines with different training and sampling schemes on object recognition and detection tasks.


  • [1] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine learning, 79(1-2):151–175, 2010.
  • [2] J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. Wortman. Learning bounds for domain adaptation. In Advances in neural information processing systems, pages 129–136, 2008.
  • [3] K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan. Domain separation networks. In Advances in Neural Information Processing Systems, pages 343–351, 2016.
  • [4] Y. Chen, W. Li, C. Sakaridis, D. Dai, and L. Van Gool. Domain adaptive faster r-cnn for object detection in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3339–3348, 2018.
  • [5] D. A. Cohn, Z. Ghahramani, and M. I. Jordan. Active learning with statistical models. In Advances in neural information processing systems, pages 705–712, 1995.
  • [6] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [7] H. Daumé III, A. Kumar, and A. Saha. Frustratingly easy semi-supervised domain adaptation. In Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing, pages 53–59. Association for Computational Linguistics, 2010.
  • [8] S. Dutt Jain and K. Grauman. Active image segmentation propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2864–2873, 2016.
  • [9] A. Freytag, E. Rodner, and J. Denzler. Selecting influential examples: Active learning with expected model output changes. In European Conference on Computer Vision, pages 562–577. Springer, 2014.
  • [10] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
  • [11] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
  • [12] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [13] Y. Grandvalet and Y. Bengio. Semi-supervised learning by entropy minimization. In NIPS, 2005.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [15] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell. Cycada: Cycle-consistent adversarial domain adaptation. arXiv preprint arXiv:1711.03213, 2017.
  • [16] J. Hoffman, D. Wang, F. Yu, and T. Darrell. Fcns in the wild: Pixel-level adversarial and constraint-based adaptation. arXiv preprint arXiv:1612.02649, 2016.
  • [17] S. C. Hoi, R. Jin, J. Zhu, and M. R. Lyu. Semisupervised svm batch mode active learning with applications to image retrieval. ACM Transactions on Information Systems (TOIS), 27(3):16, 2009.
  • [18] W.-N. Hsu and H.-T. Lin. Active learning by learning. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
  • [19] N. Inoue, R. Furuta, T. Yamasaki, and K. Aizawa. Cross-domain weakly-supervised object detection through progressive domain adaptation. arXiv preprint arXiv:1803.11365, 2018.
  • [20] A. J. Joshi, F. Porikli, and N. Papanikolopoulos. Multi-class active learning for image classification. In CVPR, 2009.
  • [21] C. Kading, A. Freytag, E. Rodner, P. Bodesheim, and J. Denzler. Active learning and discovery of object categories in the presence of unnameable instances. In CVPR, 2015.
  • [22] C.-C. Kao, T.-Y. Lee, P. Sen, and M.-Y. Liu. Localization-aware active learning for object detection. arXiv preprint arXiv:1801.05124, 2018.
  • [23] A. Kapoor, K. Grauman, R. Urtasun, and T. Darrell. Active learning with gaussian processes for object categorization. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1–8. IEEE, 2007.
  • [24] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [25] K. Konyushkova, R. Sznitman, and P. Fua. Learning active learning from data. In Advances in Neural Information Processing Systems, pages 4225–4235, 2017.
  • [26] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [27] D.-H. Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, volume 3, page 2, 2013.
  • [28] D. D. Lewis and J. Catlett. Heterogeneous uncertainty sampling for supervised learning. In Machine Learning Proceedings 1994, pages 148–156. Elsevier, 1994.
  • [29] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature Pyramid Networks for Object Detection. In CVPR, 2017.
  • [30] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • [31] B. Liu and V. Ferrari. Active learning for human pose estimation. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 4373–4382. IEEE, 2017.
  • [32] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu. Transfer feature learning with joint distribution adaptation. In ICCV, 2013.
  • [33] M. Long, H. Zhu, J. Wang, and M. I. Jordan. Unsupervised domain adaptation with residual transfer networks. In NIPS, 2016.
  • [34] W. Luo, A. Schwing, and R. Urtasun. Latent structured active learning. In Advances in Neural Information Processing Systems, pages 728–736, 2013.
  • [35] Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation with multiple sources. In NIPS, 2009.
  • [36] C. Mayer and R. Timofte. Adversarial sampling for active learning. arXiv preprint arXiv:1808.06671, 2018.
  • [37] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, 2011.
  • [38] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.
  • [39] X. Peng, B. Usman, N. Kaushik, D. Wang, J. Hoffman, K. Saenko, X. Roynard, J.-E. Deschaud, F. Goulette, T. L. Hayes, et al. Visda: A synthetic-to-real benchmark for visual domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018.
  • [40] X. Peng, B. Usman, K. Saito, N. Kaushik, J. Hoffman, and K. Saenko. Syn2real: A new benchmark for synthetic-to-real visual domain adaptation. arXiv preprint arXiv:1806.09755, 2018.
  • [41] G.-J. Qi, X.-S. Hua, Y. Rui, J. Tang, and H.-J. Zhang. Two-dimensional active learning for image classification. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.
  • [42] P. Rai, A. Saha, H. Daumé III, and S. Venkatasubramanian. Domain adaptation meets active learning. In Proceedings of the NAACL HLT 2010 Workshop on Active Learning for Natural Language Processing, pages 27–32. Association for Computational Linguistics, 2010.
  • [43] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In NIPS, 2015.
  • [44] N. Roy and A. McCallum. Toward optimal active learning through sampling estimation of error reduction. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 441–448. Morgan Kaufmann Publishers Inc., 2001.
  • [45] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
  • [46] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. In European conference on computer vision, pages 213–226. Springer, 2010.
  • [47] A. Saha, P. Rai, H. Daumé, S. Venkatasubramanian, and S. L. DuVall. Active supervised domain adaptation. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 97–112. Springer, 2011.
  • [48] T. Scheffer, C. Decomain, and S. Wrobel. Active hidden markov models for information extraction. In International Symposium on Intelligent Data Analysis, pages 309–318. Springer, 2001.
  • [49] O. Sener and S. Savarese. Active learning for convolutional neural networks: A core-set approach. In ICLR, 2018.
  • [50] B. Settles. Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 6(1):1–114, 2012.
  • [51] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. Cnn features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2014.
  • [52] Y. Shen, H. Yun, Z. C. Lipton, Y. Kronrod, and A. Anandkumar. Deep active learning for named entity recognition. arXiv preprint arXiv:1707.05928, 2017.
  • [53] M. Sugiyama, M. Krauledat, and K.-R. Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8(May):985–1005, 2007.
  • [54] M. Sugiyama, S. Nakajima, H. Kashima, P. V. Buenau, and M. Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in neural information processing systems, pages 1433–1440, 2008.
  • [55] Q. Sun, A. Laddha, and D. Batra. Active learning for structured probabilistic models with histogram approximation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3612–3621, 2015.
  • [56] Y.-H. Tsai, W.-C. Hung, S. Schulter, K. Sohn, M.-H. Yang, and M. Chandraker. Learning to adapt structured output space for semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [57] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko. Simultaneous deep transfer across domains and tasks. In ICCV, 2015.
  • [58] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), volume 1, page 4, 2017.
  • [59] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.
  • [60] A. Vezhnevets, V. Ferrari, and J. M. Buhmann. Weakly supervised structured output learning for semantic segmentation. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 845–852. IEEE, 2012.
  • [61] S. Vijayanarasimhan and A. Kapoor. Visual recognition and detection under bounded computational resources. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 1006–1013. IEEE, 2010.
  • [62] Z. Xu, K. Yu, V. Tresp, X. Xu, and J. Wang. Representative sampling for text classification using support vector machines. In European Conference on Information Retrieval, pages 393–407. Springer, 2003.
  • [63] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In Advances in neural information processing systems, pages 3320–3328, 2014.
  • [64] W. Zhang, W. Ouyang, W. Li, and D. Xu. Collaborative and adversarial network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3801–3809, 2018.
  • [65] H. Zhao, S. Zhang, G. Wu, J. M. Moura, J. P. Costeira, and G. J. Gordon. Adversarial multiple source domain adaptation. In NIPS, 2018.
  • [66] J.-J. Zhu and J. Bento. Generative adversarial active learning. arXiv preprint arXiv:1702.07956, 2017.

Appendix A Supplementary Material

In this supplementary material, we include 1) a comparison between AADA and ALDA [47] on digit classification, 2) performance on object detection after more selection rounds, 3) a comparison of different training schemes on the Office dataset [46], and 4) a comparison of adversarial training and fine-tuning on the VisDA dataset [39, 40].

a.1 Comparison to ALDA [47]

Although joint training with importance sampling is one way to extend the ALDA [47] method (as compared in Section 4.3 of the main paper), here we consider the original algorithm of online ALDA (O-ALDA) on digit classification. We first extract features from our domain adversarial model, then train a perceptron classifier, a source classifier, and a domain separator separately. There are two main differences: 1) the algorithm runs online, i.e., selecting one sample at a time and then updating the classifier, and 2) if a selected image is similar to the source domain (as determined by the domain separator), we use the pseudo-label from the source classifier at no cost, hence the number of selected images may be larger than the actual budget. The results are shown in Table A1; our method outperforms O-ALDA by 10-15%.

Method        | Number of Labeled Target
              |    0    100    200    300    500   1000
Ours          | 76.5   94.1   95.1   95.6   96.9   97.5
O-ALDA [47]   | 76.5   79.0   81.4   82.7   84.1   87.7

Table A1: Comparison of AADA and O-ALDA [47] on digit classification (SVHN → MNIST).
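The O-ALDA loop described above can be sketched as follows. This is a schematic implementation under our reading of the algorithm (the classifier update is omitted and all names are illustrative): samples are processed one at a time, source-like samples get a free pseudo-label from the source classifier, and only target-like samples spend annotation budget, which is why the number of labeled images can exceed the budget.

```python
def online_alda_round(x_pool, domain_separator, source_classifier,
                      oracle, budget):
    # Process the unlabeled pool one sample at a time (online setting).
    labeled = []
    spent = 0
    for x in x_pool:
        if spent >= budget:
            break
        if domain_separator(x) > 0.5:
            # Source-like sample: take the source classifier's
            # pseudo-label for free (no budget charged).
            labeled.append((x, source_classifier(x)))
        else:
            # Target-like sample: query the human oracle and pay for it.
            labeled.append((x, oracle(x)))
            spent += 1
    return labeled, spent
```

Because pseudo-labels are free, the returned `labeled` set can be larger than `budget`, matching the caveat noted in the text.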

a.2 More Object Detection Results

Here we show results on object detection after more sample selection rounds, extending Table 1 in the main paper. We perform 9 rounds in total, with per-round budgets of {10, 10, 10, 20, 50, 100, 100, 200, 500}. We plot the x-axis in log scale for better illustration in Figure A1. Our AADA improves over the other baselines, including other sampling strategies with adversarial training and random sampling with different training schemes, with up to 1000 labeled targets available.

Figure A1: Object detection result (KITTI → Cityscapes) after 9 rounds. The x-axis is shown in log scale. The left-most points represent the initial round where no labeled target is available. Our AADA outperforms all other baselines with up to 1000 labeled targets available.
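The budget schedule above implies the cumulative labeled-target counts that serve as the x-axis positions in Figure A1, which can be checked with a few lines:

```python
# Per-round annotation budgets from the text; the running totals give
# the x-axis positions (log scale) used in Figure A1.
budgets = [10, 10, 10, 20, 50, 100, 100, 200, 500]

cumulative = []
total = 0
for b in budgets:
    total += b
    cumulative.append(total)

print(cumulative)  # [10, 20, 30, 50, 100, 200, 300, 500, 1000]
```

The final total of 1000 matches the "up to 1000 labeled targets" reported in the text.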

a.3 Comparison of Training Schemes on Office

In this section, we compare adversarial training with other training schemes on the Office dataset [46] in Figure A2, extending Section 5.1 of the main paper. With random selection, adversarial training is better than the other baselines, including fine-tuning, joint training, and training on target data only. When using the importance weight for sampling, adversarial training outperforms the fine-tuning baseline. In addition, sampling with the proposed importance weight improves performance over random selection when either adversarial training or fine-tuning is used. Overall, adversarial training with the importance weight (AADA) performs best compared to the other combinations of training schemes and sampling strategies.

Figure A2: Comparing different training schemes on the Office dataset (D → A). Adversarial training with the importance weight for sampling (AADA) outperforms the other baselines with different training schemes.

a.4 Comparison of Training Schemes on VisDA

As described in Section 5.3 of the main paper, VisDA [39, 40] is a special case where the target domain is closer to the ImageNet images used for pre-training, and thus we use the fine-tuning strategy. In Figure A3, we further provide results of using adversarial training when few labeled targets are available. To show more fine-grained results, we sample 10 images per round and perform 10 rounds of selection. In the unsupervised domain adaptation setting, i.e., when no labeled target is available, adversarial training on the unlabeled target data improves the test accuracy on the target domain from 57.0% to 62.5%, compared to the model trained only on labeled source data without adaptation. However, after adding labeled targets, the accuracy of the model using adversarial training decreases, as shown by the blue and red curves in Figure A3, regardless of which sampling strategy is used. In contrast, the accuracy of the model using fine-tuning increases as the number of labeled targets grows, showing that VisDA is better suited to fine-tuning due to its dataset properties. Nevertheless, fine-tuning with our proposed importance weight still performs better than random sampling.

Figure A3: Comparing different training schemes on the VisDA dataset. With adversarial training, accuracy does not improve as more labeled targets are added, since the target domain in VisDA is closer to the ImageNet images used for pre-training. However, accuracy improves with fine-tuning, where using the importance weight for sampling is better than random sampling.
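The adversarial scheme compared throughout is DANN-style training with a gradient reversal layer (GRL). As a reminder of the mechanics, here is a schematic NumPy sketch (not the authors' implementation): the GRL is the identity in the forward pass, and in the backward pass it negates and scales the discriminator's gradient, so the feature extractor is updated to confuse the domain discriminator, whereas plain fine-tuning drops the domain term entirely.

```python
import numpy as np

def grl_forward(features):
    # Gradient reversal layer: identity on the forward pass.
    return features

def grl_backward(upstream_grad, lambd=1.0):
    # Backward pass: negate and scale the domain-discriminator gradient
    # before it reaches the feature extractor, pushing features toward
    # domain confusion.
    return -lambd * np.asarray(upstream_grad)

def feature_update(cls_grad, dom_grad, lr=0.1, lambd=1.0):
    # One schematic SGD step on the feature extractor: descend the
    # classification gradient while ascending the domain gradient.
    # Fine-tuning corresponds to setting lambd = 0.
    return -lr * (np.asarray(cls_grad) + grl_backward(dom_grad, lambd))
```

With `lambd = 0` the update reduces to ordinary fine-tuning, which is the regime the VisDA results above favor once labeled targets are added.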