1 Introduction
In general machine learning tasks, we usually assume that the datasets on which a hypothesis is trained and tested come from the same distribution. However, this assumption is unrealistic in many practical scenarios. For example, appearance shifts caused by illumination, seasonal, or weather changes pose significant challenges for computer-vision-based systems. A vision system trained on one dataset but deployed on another may suffer a rapid performance drop. More severely, training a high-performance vision system requires a large amount of labeled data, and obtaining such labels may be expensive. One approach to deal with this issue is
Domain Adaptation (DA), which aims to improve learning performance on a target domain by leveraging unlabeled data in the target domain together with labeled data from a different but related domain (the source domain). Previous works have theoretically analyzed the learning guarantees of DA [2, 12] and have reported empirical applications in natural language processing
[8] and computer vision [19]. Most recent DA advancements rest on the basic covariate shift assumption: the marginal distributions of the source and target domains change while the conditional distribution (the predictive relation) is preserved during adaptation. However, some recent works have revealed that this assumption may not hold, in which case one may still need some labeled data from the target domain in order to successfully transfer information from one domain to another. Specifically, [25] discussed the conditional shift problem, showing that it exists and can hinder the adaptation process. They proved that the risk on the target domain is controlled by the source risk, the divergence between the marginal distributions, and the disagreement between the two labeling functions:
ε_T(h) ≤ ε_S(h) + d(D_S, D_T) + min{ E_{D_S}[|f_S − f_T|], E_{D_T}[|f_S − f_T|] }. (1)
Here ε_T, ε_S, and f refer to the target risk, source risk, and labeling function, respectively. In a typical unsupervised DA setting, it is not possible to measure the third term in Eq. 1. One possible way to measure it is to query some labels from the target domain so that the learner can learn the conditional relations there. However, label annotation is usually expensive. Notice that if labels are sampled from a target set of size m, the convergence rate of the disagreement term would generally be O(1/√m) [17], a slow convergence behaviour that is far from sufficient to minimize the last term.
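As a quick sanity check on this rate (a hedged illustration with made-up numbers, not part of the original analysis): estimating the disagreement term from m queried target labels is a Monte Carlo mean estimate, so its error should shrink roughly as 1/√m.

```python
import numpy as np

# Hypothetical setup: the per-instance disagreement |f_S - f_T| is a
# Bernoulli(0.3) variable, so the true disagreement term equals 0.3.
rng = np.random.default_rng(0)
p_true = 0.3

def mean_abs_error(m, trials=500):
    # Average |empirical mean - true mean| over repeated draws of m labels.
    samples = rng.binomial(1, p_true, size=(trials, m))
    return np.abs(samples.mean(axis=1) - p_true).mean()

err_small = mean_abs_error(100)     # error ~ O(1/sqrt(100))
err_large = mean_abs_error(10000)   # error ~ O(1/sqrt(10000)), roughly 10x smaller
print(err_small, err_large)
```

A 100-fold increase in labeled target samples only shrinks the estimation error by about a factor of ten, which is why passively labeling the target set is a costly way to control the last term.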
To alleviate this difficulty, one can use the Active Learning (AL) technique for DA so that the learner reduces the cost of acquiring labels by requesting them from an oracle. AL queries only the labels of the most informative examples and has been shown, in some optimal cases, to achieve exponentially lower label complexity (number of queried labels) than passive learning [3]. From this perspective, we aim to go beyond generic sampling with limited label information in the target domain (a.k.a. the semi-supervised domain adaptation approach). Most previous active learning approaches were rooted in uncertainty-based criteria. [5] pointed out that focusing only on uncertainty may lead to sampling bias. To overcome this bias, we also need to consider diversity in the query process. Recently, [15, 14] proposed adversarial training techniques that query the most informative features via a critic function, which can overcome the sampling bias problem.
To address the aforementioned issues, we propose a three-stage discriminative active domain adaptation algorithm, which actively queries the most informative instances in the target domain to minimize the labeling disagreement term under a small query budget.
In the first stage, we adopt a Wasserstein-distance-based adversarial training technique for unsupervised DA, training a critic function to learn domain-invariant features. The critic can also be used to discriminate target-domain features for active querying. In the second stage, we derive a sample-efficient and straightforward active query strategy, based on the network structure, that samples the most informative target instances by jointly controlling uncertainty and diversity. Finally, in the third stage, we deploy a reweighting technique based on prediction uncertainty to determine the importance of the queried samples when retraining the network.
We then conducted extensive experiments on four benchmark datasets. The empirical results show that our proposed algorithm improves classification accuracy with a small query budget. When the query budget is small, the proposed approach outperforms its random-selection counterpart (reported in Table 5), which confirms the effectiveness of our algorithm.
2 Related Works
Domain Adaptation
A large number of efforts have been devoted to DA [20]. As stated before, many of the previous advancements [2, 6, 18, 13] were based on the assumption that the conditional relations remain unchanged during adaptation. Some recent works propose to tackle the conditional shift problem. [10] adopted Conditional Generative Adversarial Nets to extract the cross-covariance between the source and target feature representations, and also measured conditional entropy as an uncertainty measure to control transferability. [22] proposed a Bayesian neural network with entropy and variance uncertainty measures to jointly match the marginal distribution P(X) and the conditional distribution P(Y|X).
Active Learning
AL has been widely investigated in both theory and applications. Recently, [15] proposed a variational-autoencoder-based adversarial approach to distinguish informative unlabeled features from labeled ones, and [7] proposed discriminative active learning. [14] extended this line of work and adopted a critic network for querying diverse features. The approaches above usually assume that the labeled and unlabeled data come from the same distribution. Few works have implemented active learning to enhance domain adaptation across two or more distributions.

Active Learning for Domain Adaptation
[11] proposed a two-direction AL algorithm for DA: query the most informative features from the target domain and remove the most dissimilar features from the source domain. [21] proposed an active transfer technique for the model shift problem; assuming the shifts are smooth, they implemented a conditional distribution matching algorithm and an offset algorithm to model the source and target tasks by comparing Gaussian distributions.
[24] proposed a distribution correction algorithm over kernel embeddings to handle target shift. The latter two methods rely on the assumption that there exists an affine transformation of the conditional distribution from source to target. [16] proposed an active learning method using divergence and an importance sampling technique to query target instances. However, the importance-sampling query strategy they adopted relies on an assumption about the relation between the source and target densities that may not hold in many DA settings.

3 Problem Setup
Notations and Basic Definitions
We consider a classification task and denote by X and Y the input and output spaces. A learning algorithm is provided with a labeled source dataset S = {(x_i^s, y_i^s)}_{i=1}^{n_s} drawn from D_S and an unlabeled target sample T = {x_j^t}_{j=1}^{n_t} drawn from D_T^X, where D_S is the joint distribution on X × Y and D_T^X is the marginal target distribution on X. The expected source and target risks of a hypothesis h over D_S (respectively, D_T) are the probabilities that h errs on the entire distribution: ε_S(h) = E_{(x,y)∼D_S}[ℓ(h(x), y)] and ε_T(h) = E_{(x,y)∼D_T}[ℓ(h(x), y)], where ℓ is the loss function. The goal of DA is to build a classifier h trained on the source domain with a low target risk ε_T(h).

3.1 Optimal Transport and Wasserstein Distance
Optimal Transport (OT) theory and the Wasserstein distance have recently been widely investigated in machine learning [1], especially in the domain adaptation area [4]. Following [12], define c(x, x′) as the cost of transporting one unit of mass from x to x′; the Wasserstein distance can then be computed by

W_p(P, Q)^p = inf_{γ ∈ Π(P, Q)} ∫ c(x, x′)^p dγ(x, x′),

where Π(P, Q) is the set of joint probability measures on X × X with marginals P and Q, referring to all possible coupling functions. Throughout this paper, we use the Wasserstein-1 distance only (p = 1). According to the Kantorovich–Rubinstein theorem, letting f be a 1-Lipschitz continuous function, we have

W_1(P, Q) = sup_{‖f‖_L ≤ 1} E_{x∼P}[f(x)] − E_{x∼Q}[f(x)]. (2)
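To make the duality in Eq. 2 concrete, here is a small hedged illustration with made-up 1-D samples: any particular 1-Lipschitz witness f (for example f(x) = x on the real line) gives a lower bound E_P[f] − E_Q[f] ≤ W_1(P, Q), while for equal-size 1-D samples the exact empirical W_1 is the mean absolute difference of the sorted samples.

```python
import numpy as np

p = np.array([0.0, 1.0, 2.0])   # empirical sample from P
q = np.array([1.0, 2.0, 5.0])   # empirical sample from Q

# Exact empirical W1 in 1-D (equal sample sizes): sort and average |differences|.
w1 = np.mean(np.abs(np.sort(p) - np.sort(q)))

# Kantorovich-Rubinstein lower bound from the 1-Lipschitz witness f(x) = x.
kr_bound = abs(p.mean() - q.mean())

print(w1, kr_bound)
```

In this example Q stochastically dominates P, so the linear witness is already optimal and the bound is tight (both equal 5/3); in general a critic network searches over 1-Lipschitz functions to tighten the bound.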
3.2 Conditional Shift and Error Bound
From a probabilistic perspective, the general learning process of most previous DA approaches is to learn the joint distribution of the target domain through the source-domain joint distribution. Note that P(X, Y) = P(Y|X)P(X); to guarantee a successful transfer from the source domain to the target domain, the underlying assumption is P_S(Y|X) = P_T(Y|X). Recently, [22] showed that this condition does not necessarily hold.
In the conditional shift situation, P_S(Y|X) ≠ P_T(Y|X). [25] theoretically showed that this problem arises in many situations and that, if we only minimize the source error together with the domain distance, the target error might increase, which hinders the adaptation process. Their analysis was based on a divergence that is hard to compute in deep-learning-based methods. To be coherent with our proposed work, we restate it using the Wasserstein distance in the following Theorem 1.

Theorem 1.
Let D_S and D_T be the source and target distributions with corresponding labeling functions f_S and f_T. If the hypothesis h is 1-Lipschitz and the loss is the L1 loss, then we have

ε_T(h) ≤ ε_S(h) + 2 W_1(D_S^X, D_T^X) + min{ E_{D_S}[|f_S − f_T|], E_{D_T}[|f_S − f_T|] }. (3)
The proof builds on Lemma 1 of [13] and parallels the proof of Theorem 3 of [25]. Due to the space limit, we only sketch the idea: the Wasserstein term follows from bounding the difference of the risks under the two marginals via the dual form in Eq. 2, and the disagreement term follows from a triangle inequality over the labeling functions.
This theorem shows that the target-domain error is controlled by the source-domain error, the Wasserstein distance between source and target, and the conditional distributions on both domains. The third term is not measurable in the unsupervised domain adaptation setting. If the conditional distribution changes during adaptation, the target error may diverge [25]. One direct approach to reduce the disagreement between f_S and f_T is to partially acquire the labeling function f_T, i.e., the labels in the target domain.
Besides, the Wasserstein distance between the source and target distributions (the second term in Eq. 3) is the total transportation cost between the two domains. Denoting by D_S and D_T the corresponding distributions of the labeled and unlabeled datasets, the Wasserstein distance is

W_1(D_S, D_T) = inf_{γ ∈ Π(D_S, D_T)} E_{(x^s, x^t)∼γ}[c(x^s, x^t)].
Intuitively, if we can query some instances in the target domain T and move them from the target into the source domain S, we reduce the total transportation cost between the two domains, i.e., the Wasserstein distance between them.
Based on this, minimizing the RHS of Eq. 3 is equivalent to training a learner that: (i) minimizes the source error; (ii) trains a critic to estimate the empirical Wasserstein distance between the source and target domains and, adversarially with the critic, finds a feature extractor that approximately minimizes the total transportation cost between them; and (iii) queries labeling information in the target domain so as to minimize the disagreement between the source and target labeling functions, the third term of Eq. 3.

To this end, we argue that if the learner can actively query labeling information in the target domain, it can partially recover the conditional information there. With this minority of labeled target instances in hand, it can learn to jointly minimize the error on both the source and target domains. Furthermore, querying labels is expensive; to reduce the annotation cost, we expect the learner to query informative instances using an active learning strategy. Moreover, if the queried target instances are informative enough, they will better represent the target domain, and the learner will generalize better on it. Taking all of the above into consideration, we now formally propose the discriminative active domain adaptation method.
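The claim that moving queried instances from the target set into the source set lowers the transport cost can be checked numerically. The sketch below, with made-up 1-D feature values, computes the empirical W_1 via the CDF formulation before and after moving the most distant target point across.

```python
import numpy as np

def wasserstein_1d(u, v):
    """Empirical 1-D Wasserstein-1 distance: integral of |CDF_u - CDF_v|."""
    u, v = np.sort(u), np.sort(v)
    xs = np.sort(np.concatenate([u, v]))          # all CDF breakpoints
    cu = np.searchsorted(u, xs, side="right") / len(u)
    cv = np.searchsorted(v, xs, side="right") / len(v)
    return float(np.sum(np.abs(cu[:-1] - cv[:-1]) * np.diff(xs)))

source = np.array([0.0, 0.0, 0.0, 0.0])
target = np.array([3.0, 3.0, 3.0, 10.0])

w_before = wasserstein_1d(source, target)         # 4.75

# "Query" the target point with the highest transport cost (x = 10)
# and move it into the source set, as the active strategy does.
new_source = np.append(source, 10.0)
new_target = np.array([3.0, 3.0, 3.0])

w_after = wasserstein_1d(new_source, new_target)  # 3.8
print(w_before, w_after)
```

Moving the hardest-to-transport instance across reduces the distance from 4.75 to 3.8; the labeling disagreement term shrinks for the same queried point, so both terms of Eq. 3 benefit from one query.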
4 Active Discriminative Domain Adaptation
Our learning process consists of three main stages, which we introduce in detail below.
4.1 Stage 1: Domain Adversarial Training via Optimal Transport
In the first stage, we adopt the Wasserstein Distance Guided Representation Learning [13] method for adversarial training. The network receives a pair of instances from the source and target domains. Denote by F and C the feature extractor and classifier, parameterized by θ_f and θ_c, respectively. The feature extractor is trained to learn invariant features, and the classifier is expected to learn the conditional prediction relations so as to predict instances from both the source and target domains correctly. For the classification loss, we employ the standard cross-entropy loss: L_cls = −(1/n_s) Σ_i Σ_k 1[y_i^s = k] log C(F(x_i^s))_k.
Then follows the domain critic network D, parameterized by θ_d. It estimates the empirical Wasserstein distance between the source and target domains through a pair of batched instances x^s and x^t:

L_wd = (1/n_s) Σ_{i=1}^{n_s} D(F(x_i^s)) − (1/n_t) Σ_{j=1}^{n_t} D(F(x_j^t)). (4)

The feature extractor is then trained to minimize the estimated Wasserstein distance in an adversarial manner with the critic D. The goal of the first-stage training is thus

min_{θ_f, θ_c} { L_cls + λ max_{θ_d} [ L_wd − ρ L_grad ] }, (5)

where λ is a trade-off coefficient and L_grad is the gradient penalty term suggested by [9]. The source and target features (marginal distributions) can be aligned via this adversarial training process. Based on the aligned marginal distributions, we can then implement the active strategy to query the most informative target instances.
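A minimal PyTorch sketch of the stage-1 objective in Eq. 5, assuming toy MLP networks and random data (all shapes and hyperparameters here are illustrative, not the paper's settings): the inner loop maximizes the critic's Wasserstein estimate with the gradient penalty of [9], and the outer step updates the feature extractor and classifier.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
feat = nn.Sequential(nn.Linear(10, 16), nn.ReLU())                     # extractor F
clf = nn.Linear(16, 3)                                                 # classifier C
critic = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 1)) # critic D

opt_fc = torch.optim.Adam(list(feat.parameters()) + list(clf.parameters()), lr=1e-3)
opt_d = torch.optim.Adam(critic.parameters(), lr=1e-3)

xs, ys = torch.randn(32, 10), torch.randint(0, 3, (32,))   # source batch
xt = torch.randn(32, 10) + 1.0                             # shifted target batch

def grad_penalty(hs, ht):
    # Penalize critic gradients away from norm 1 on interpolated features.
    eps = torch.rand(hs.size(0), 1)
    h = (eps * hs + (1 - eps) * ht).requires_grad_(True)
    g = torch.autograd.grad(critic(h).sum(), h, create_graph=True)[0]
    return ((g.norm(2, dim=1) - 1) ** 2).mean()

# Inner maximization: train the critic to estimate W1 (Eq. 4).
for _ in range(5):
    hs, ht = feat(xs).detach(), feat(xt).detach()
    l_wd = critic(hs).mean() - critic(ht).mean()
    d_loss = -(l_wd - 10.0 * grad_penalty(hs, ht))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Outer minimization: classification loss plus the estimated W1.
hs, ht = feat(xs), feat(xt)
loss = F.cross_entropy(clf(hs), ys) + 0.1 * (critic(hs).mean() - critic(ht).mean())
opt_fc.zero_grad(); loss.backward(); opt_fc.step()
print(float(loss))
```

In practice the two updates alternate over many mini-batches; the critic is the same network later reused in stage 2 to score the diversity of target instances.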
4.2 Stage 2: Active Query with Wasserstein Critic
In the second stage, we want the active learner to find the most informative features among the unlabeled target instances so that it can leverage the labeling information of the target domain. The informative features, intuitively, are the ones most different from what the learner already knows. Likewise, the hardest instances to adapt are those the current classifier predicts with the least confidence, i.e., the most uncertain ones. As pointed out by previous work [5], focusing only on uncertainty leads to sampling bias. To reduce this bias, the active learner should also seek diverse target samples. We therefore look for the most informative target samples that exhibit both uncertainty and diversity.
Prediction Uncertainty
The conditional prediction is learned by the classification network. To measure uncertainty, we borrow the idea of [10] and adopt an entropy measure to quantify the classifier's uncertainty. The uncertainty entropy measure over an instance x is

H(x) = −Σ_{k=1}^{K} C(F(x))_k log C(F(x))_k, (6)

where H is the information entropy measure and C(F(x)) is the output of the classification network.
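Eq. 6 is the standard Shannon entropy of the predicted class distribution; a minimal numpy sketch (the small constant is only there to avoid log(0)):

```python
import numpy as np

def prediction_entropy(probs, eps=1e-12):
    """Shannon entropy of a predicted class distribution (Eq. 6)."""
    probs = np.asarray(probs, dtype=float)
    return float(-np.sum(probs * np.log(probs + eps)))

uniform = prediction_entropy([0.5, 0.5])      # maximally uncertain: log 2
confident = prediction_entropy([0.99, 0.01])  # nearly certain: close to 0
print(uniform, confident)
```

The entropy is largest when the classifier hedges evenly between classes, so ranking target instances by H(x) surfaces exactly the predictions the current classifier is least sure about.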
Diversity by Critic Function
If some instances are, in terms of a distribution distance measure, very far from the known labeled ones, then they contain the most informative and diverse features relative to those labeled ones. Recall that in the first stage we match the marginal distributions of the source and target domains to obtain a domain-invariant feature space under the Wasserstein distance. Among the target instances, the one with the highest critic score is the one with the highest transportation cost.
[15, 14] showed that such a critic term indicates diversity in the query process. We can therefore leverage the trained Wasserstein critic network to evaluate and find the most informative (diverse) target features on the invariant feature space, that is, measure the diversity of target instances via the critic score. Consider the critic output D(F(x^t)) of a target instance x^t: if it is large, then x^t is far, in Wasserstein distance, from the source images, and if it is small, then x^t is near the source images.
Based on the above, to find the most informative (uncertain and diverse) instances in the target domain, we query by controlling two terms:

- the uncertainty score defined by Eq. 6, which indicates how uncertain the classifier is when predicting a label for a target instance;

- the critic score given by the Wasserstein critic function, which indicates the diversity of an unlabeled target instance compared with the labeled source ones.
We then have the following objective:

x* = argmax_{x ∈ T} [ H(x) + μ D(F(x)) ], (7)

where μ is a coefficient regularizing the Wasserstein critic term. For a query budget b over the target set, the query process is: find the top-b instances by solving Eq. 7 and query their labels from the oracle. Denote the queried set by Q. We then unite this small batch of instances with the source domain and remove them from the target domain, updating the datasets as S ← S ∪ Q and T ← T \ Q. We illustrate the general query workflow in Fig. 1.
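The selection rule in Eq. 7 reduces to ranking target instances by H(x) + μ·D(F(x)) and taking the top-b. A small sketch with made-up entropy and critic scores:

```python
import numpy as np

def select_queries(entropy_scores, critic_scores, mu, budget):
    """Return indices of the top-`budget` target instances under Eq. 7."""
    scores = np.asarray(entropy_scores) + mu * np.asarray(critic_scores)
    return np.argsort(scores)[::-1][:budget]

H = np.array([0.10, 0.90, 0.50, 0.30])   # uncertainty per target instance (Eq. 6)
D = np.array([0.00, 0.20, 1.00, 0.60])   # critic (diversity) score per instance

queried = select_queries(H, D, mu=1.0, budget=2)
print(queried)  # instances 2 and 1: combined scores 1.5 and 1.1
```

Note that instance 2 is selected first even though instance 1 is the most uncertain: its high critic score marks it as far from the source distribution, which is exactly the uncertainty-plus-diversity trade-off μ controls.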
4.3 Stage 3: DA training with new dataset
The goal of our proposed method is to leverage the most informative target instances to reinforce the adaptation process. General adversarial training methods for domain adaptation usually assign every instance the same importance weight. To feed the uncertainty information back to the classifier, we give higher weights to instances with higher uncertainty scores during the supervised classification step.
Given the set of queried instances Q, we reweight the importance of each class based on the uncertainty scores. Denote by w the weight vector over all classes. For each class k, the weight is computed by

w_k = (1/n_k) Σ_{x ∈ Q : y(x) = k} H(x), (8)

where n_k is the number of queried instances with label k and H is the uncertainty score defined in Eq. 6.
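Under one plausible reading of Eq. 8 (an assumption on our part: the class weight is the average uncertainty of the queried instances in that class), the reweighting can be sketched as:

```python
import numpy as np

def class_weights(labels, uncertainties, num_classes):
    """Per-class weight = mean uncertainty of queried instances in that class
    (a hedged reading of Eq. 8, not necessarily the paper's exact formula)."""
    labels = np.asarray(labels)
    u = np.asarray(uncertainties, dtype=float)
    w = np.zeros(num_classes)
    for k in range(num_classes):
        mask = labels == k
        if mask.any():
            w[k] = u[mask].mean()
    return w

# Three queried instances: two of class 0, one of class 1.
w = class_weights(labels=[0, 0, 1], uncertainties=[0.2, 0.4, 0.9], num_classes=2)
print(w)  # class 0 -> 0.3, class 1 -> 0.9
```

Classes whose queried examples were harder for the classifier receive larger weights, so the retraining step spends more of its capacity on exactly the conditional relations the classifier got wrong.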
For a batch of queried instances, the weighted cross-entropy loss is then L_w = −(1/|Q|) Σ_{(x, y) ∈ Q} w_y log C(F(x))_y.
The objective function for the third stage is

min_{θ_f, θ_c} { L_cls + L_w + λ max_{θ_d} [ L_wd − ρ L_grad ] }, (9)

where x^s and x^t are sampled from the updated source and target datasets, L_cls is the classification loss on the original source set, and L_w is the weighted loss on the query set. Finally, we summarize our Active Discriminative Domain Adaptation (AcDA) algorithm in Algorithm 1.
5 Experiments and Results
We evaluate the proposed algorithm on four benchmark datasets and compare it with several other approaches: Wasserstein Guided Domain Adaptation (WDGRL [13]), Domain Adversarial Neural Networks (DANN [6]), Adversarial Discriminative Domain Adaptation (ADDA [18]), and Conditional Adversarial Domain Adaptation (CDAN [10]). To show the benefit of the active query method, we also compare the results with a random selection process under the same query budget. All experiments are implemented in PyTorch.
5.1 Datasets and Implementations
We test our proposed algorithm on four benchmark datasets.
Digits Datasets
We test our algorithm on the digits datasets with the settings USPS (U) ↔ MNIST (M) and MNIST → MNIST-M (MM). For USPS, we resize the images to match the MNIST resolution. We train the networks on the standard training splits and evaluate on the standard testing splits of MNIST, MNIST-M, and USPS.
Table 1: Classification accuracy (%) on the digits tasks (M → MM, M → U, U → M, avg.) for LeNet-5, DANN, WDGRL, ADDA, Rand., and AcDA.
Table 2: Classification accuracy (%) on Office-31 (A → W, A → D, D → A, W → A, avg.) for ResNet-50, DAN, DANN, WDGRL, Rand., and AcDA.
Table 3: Classification accuracy (%) on Office-Home (Ar → Cl, Ar → Pr, Ar → Rw, Cl → Ar, Cl → Pr, Cl → Rw, Pr → Ar, Pr → Cl, Pr → Rw, Rw → Ar, Rw → Cl, Rw → Pr, avg.) for ResNet-50, DANN, WDGRL, CDAN, Rand., and AcDA.
Table 4: Classification accuracy (%) on ImageCLEF (C → I, C → P, I → P, I → C, P → C, P → I, avg.) for ResNet-50, DANN, WDGRL, CDAN, Rand., and AcDA.
Office-31 dataset
is a standard benchmark for domain adaptation evaluation. It contains three different domains: Amazon (A), DSLR (D), and WebCam (W), with 31 categories in each domain. We report the average results in Table 2.
Office-Home dataset
is more challenging than Office-31; it contains four different domains: Art (Ar), Clipart (Cl), Product (Pr), and Real World (Rw), with 65 categories in each domain. We report the average results in Table 3.
ImageCLEF 2014 dataset
contains three domains, Caltech-256 (C), ILSVRC 2012 (I), and Pascal VOC 2012 (P), with 12 shared categories. We report the average results in Table 4.
For the digits datasets, we do not apply any data augmentation. For the Office-31, Office-Home, and ImageCLEF datasets, we apply the following preprocessing pipeline: for the training set, we first resize each image, then randomly crop it to the network input size and apply the same random flipping strategy as [23]; for the testing set, we resize the images and center-crop them to the input size.
CNN Architecture and Implementations

For the digits experiments, we adopt LeNet-5 as the feature extractor and train it from scratch. For the three real-world datasets, we use an ImageNet-pretrained ResNet-50 as the feature extractor. We train the networks with mini-batches and use the Adam optimizer. For stable training, we decay the learning rate as training progresses, and we set the trade-off coefficients empirically. To avoid over-training, we also adopt an early-stopping technique.

5.2 Results and Analysis
We illustrate the t-SNE visualization comparing the non-adaptation setting with our proposed approach AcDA; we can observe that our proposed method achieves good alignment. We report the average results of our algorithm and the baselines, using our data preprocessing pipeline, on the Digits, Office-31, Office-Home, and ImageCLEF datasets in Tables 1, 2, 3, and 4, respectively. To show the effectiveness of the active query strategy, for a given budget we also implemented a random selection method to query labels for comparison. These implementations are denoted by Rand. and AcDA in each table. In Table 5, we also compare the performance under different budgets.
Value of Target Labels
From the test results on the four benchmark datasets, we observe that even randomly selecting some instances in the target domain to label benefits classification performance there. Our method is rooted in WDGRL; comparing the accuracy of random selection against WDGRL, we observe improvements on the Digits, Office-31, Office-Home, and ImageCLEF datasets, which confirms the usefulness of target label information for adaptation. Moreover, for each adaptation task on every dataset, the proposed AcDA algorithm outperforms the random selection method in almost all tasks. This also confirms that active querying can outperform random selection.
Effectiveness of Active Query
We compared the performance of active querying and random selection under different query budgets; the averages on the different datasets are reported in Table 5. We observe that accuracy increases as the query budget increases. Comparing active querying and random selection at the same budget, the active query method outperforms random querying at the smaller budgets. That is, with a smaller query budget, the active strategy performs better than random selection, which confirms its effectiveness. At the largest budget, we do not observe distinguishable differences. One interpretation is that as the query budget increases, more target instances are labeled, and the most informative ones are covered with high probability anyway. When the query budget is relatively small, the active strategy can pinpoint the most informative instances rather than selecting instances uniformly at random.
Table 5: Average accuracy (%) of Rand. and AcDA under different query budgets on the Digits, Office-Home, and ImageCLEF datasets.
6 Conclusion
We proposed a three-stage discriminative active algorithm to improve domain adaptation performance. The first stage adopts general domain adversarial training. In the second stage, we proposed an end-to-end query strategy combining uncertainty and diversity criteria to find the most informative features in the target domain. Finally, in the third stage, we deployed a reweighting technique based on prediction uncertainty to determine the importance of the queried samples when retraining the network. The empirical results confirm the effectiveness of our active domain adaptation algorithm, especially when the query budget is small.
References
 [1] (2017) Wasserstein gan. arXiv preprint arXiv:1701.07875. Cited by: §3.1.
 [2] (2010) A theory of learning from different domains. Machine learning 79 (1–2), pp. 151–175. Cited by: §1, §2.
 [3] (1994) Improving generalization with active learning. Machine learning 15 (2), pp. 201–221. Cited by: §1.
 [4] (2016) Optimal transport for domain adaptation. IEEE transactions on pattern analysis and machine intelligence 39 (9), pp. 1853–1865. Cited by: §3.1.
 [5] (2011) Two faces of active learning. Theoretical computer science 412 (19), pp. 1767–1781. Cited by: §1, §4.2.
 [6] (2016) Domainadversarial training of neural networks. The Journal of Machine Learning Research 17 (1), pp. 2096–2030. Cited by: §2, §5.
 [7] (2019) Discriminative active learning. arXiv preprint arXiv:1907.06347. Cited by: §2.
 [8] (2011) Domain adaptation for largescale sentiment classification: a deep learning approach. In Proceedings of the 28th international conference on machine learning (ICML11), pp. 513–520. Cited by: §1.
 [9] (2017) Improved training of wasserstein gans. In Advances in neural information processing systems, pp. 5767–5777. Cited by: §4.1.
 [10] (2018) Conditional adversarial domain adaptation. In Advances in Neural Information Processing Systems, pp. 1640–1650. Cited by: §2, §4.2, §5.
 [11] (2012) Active learning for domain adaptation in the supervised classification of remote sensing images. IEEE Transactions on Geoscience and Remote Sensing 50 (11), pp. 4468–4483. Cited by: §2.
 [12] (2017) Theoretical analysis of domain adaptation with optimal transport. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 737–753. Cited by: §1, §3.1.
 [13] (2018) Wasserstein distance guided representation learning for domain adaptation. In AAAI Conference on Artificial Intelligence. Cited by: §2, §3.2, §4.1, §5.
 [14] (2019) Deep active learning: unified and principled method for query and training. arXiv preprint arXiv:1911.09162. Cited by: §1, §2, §4.2.
 [15] (2019) Variational adversarial active learning. arXiv preprint arXiv:1904.00370. Cited by: §1, §2, §4.2.
 [16] (2019) Active adversarial domain adaptation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 1–4. Cited by: §2.
 [17] (2018) Foundations of machine learning (second edition). MIT Press, Cambridge, Massachusetts. Cited by: §1.
 [18] (2017) Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7167–7176. Cited by: §2, §5.
 [19] (2018) Visual domain adaptation with manifold embedded distribution alignment. In 2018 ACM Multimedia Conference on Multimedia Conference, pp. 402–410. Cited by: §1.
 [20] (2018) Deep visual domain adaptation: a survey. Neurocomputing 312, pp. 135–153. Cited by: §2.
 [21] (2014) Active transfer learning under model shift. In Proceedings of the 31st International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 32, Beijing, China, pp. 1305–1313. Cited by: §2.
 [22] (2019) Bayesian uncertainty matching for unsupervised domain adaptation. arXiv preprint arXiv:1906.09693. Cited by: §2, §3.2.
 [23] (2019) Universal domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2720–2729. Cited by: §5.1.
 [24] (2013) Domain adaptation under target and conditional shift. In International Conference on Machine Learning, pp. 819–827. Cited by: §2.
 [25] (2019) On learning invariant representations for domain adaptation. In International Conference on Machine Learning, pp. 7523–7532. Cited by: §1, §3.2, §3.2, §3.2.