Unsupervised Domain Adaptation without Source Data by Casting a BAIT

10/23/2020 ∙ by Shiqi Yang, et al. ∙ Universitat Autònoma de Barcelona

Unsupervised domain adaptation (UDA) aims to transfer the knowledge learned from a labeled source domain to an unlabeled target domain. Existing UDA methods require access to the source-domain data during adaptation to the target domain, which may not be feasible in some real-world situations. In this paper, we address Source-free Unsupervised Domain Adaptation (SFUDA), where the model has no access to any source data during the adaptation period. We propose a novel framework named BAIT to tackle SFUDA. Specifically, we first train the model on the source domain. With the source-specific classifier head (referred to as the anchor classifier) fixed, we further introduce a new learnable classifier head (referred to as the bait classifier), which is initialized by the anchor classifier. When adapting the source model to the target domain, the source data are no longer accessible, and the bait classifier aims to push the target features towards the right side of the decision boundary of the anchor classifier, thus achieving feature alignment. Experimental results show that the proposed BAIT achieves state-of-the-art performance compared with existing normal UDA methods and several SFUDA methods.





Despite their great success, deep neural networks typically demand a huge amount of labeled data for training. However, collecting labeled data is often laborious and expensive. It would, therefore, be ideal if the knowledge obtained on label-rich datasets could be transferred to unlabeled data. For example, after training on synthetic images, it would be beneficial to transfer the obtained knowledge to the domain of real-world images. However, deep networks are weak at generalizing to unseen domains, even when the differences between the datasets are only subtle oquab2014learning. In real-world situations, a typical factor impairing the model's generalization ability is the distribution shift between data from different domains.

Domain adaptation methods aim to reduce the domain shift between the source and target domains. Usually, the data from the target domain are unlabeled, in which case the task is referred to as Unsupervised Domain Adaptation (UDA). Early works gong2012geodesic; pan2009survey learn domain-invariant features to link the target domain to the source domain. Along with the growing popularity of deep learning, many works leverage its powerful representation learning ability for domain adaptation ganin2016domain; long2017deep; oquab2014learning; yosinski2014transferable. Those methods typically minimize the distribution discrepancy between the two domains long2018transferable; long2015learning; long2016unsupervised, or deploy adversarial training using a discriminator pei2018multi; tzeng2015simultaneous; tzeng2017adversarial.

Figure 1: Illustration of the proposed method. After training on source data, we adapt the model to the target domain. During adaptation, we no longer have access to the source data; we fix the anchor classifier and use it to initialize a second classifier, called the bait. The main idea behind this classifier is shown on the right, where we achieve adaptation by alternately enforcing the bait classifier to disagree with the fixed anchor classifier (allowing us to identify potentially wrongly classified samples) and enforcing the feature extractor to make the two classifiers reach consensus (leading to more compact clusters in the feature space).

However, a crucial requirement of these methods is access to the source domain data during adaptation to the target domain. This is infeasible in several real-world situations, for example when deploying domain adaptation algorithms on mobile devices where the computation capacity is limited, or in situations where data-privacy rules limit access to the source domain. Because of its importance, the Source-free Unsupervised Domain Adaptation (SFUDA) setting, where the model is first trained on the source domain and then no longer has access to the source data when adapting to the target domain, has started to gain traction recently kundu2020universal; liang2020we; kim2020domain.

In this paper, we also investigate the SFUDA setting. Compared to UDA methods, it is much more challenging to align the source and target distributions in the SFUDA setting. To address this challenge, we propose a method named BAIT, as shown in Fig. 1. First, we freeze the classifier head (called the anchor classifier), which is trained only with source data. Instead of adapting the classifier to the target domain, we aim to align the features of the target domain with the fixed classifier. We then use the anchor classifier to initialize another classifier, which is deployed as a bait to drive the target features towards the right side of the decision boundary of the anchor classifier. Besides, we also investigate how BatchNorm ioffe2015batch influences performance under the SFUDA setting. As a general method, our approach can also be applied on top of any compatible UDA method for SFUDA. In the experiments, we show for two UDA methods, entropy minimization and Batch Nuclear-Norm Maximization cui2020towards, that when combined with BAIT and trained without source data, they match or outperform the originally reported results obtained with source data. Moreover, we show that our proposed BAIT surpasses all existing UDA methods, even though these methods have source data at hand at all times, as well as several SFUDA methods.

Related Works

Domain adaptation aims to reduce the shift between the source and target domains. Moment matching methods align feature distributions by minimizing the feature distribution discrepancy, including methods such as DAN long2015learning and DDC tzeng2014deep, which deploy Maximum Mean Discrepancy. CORAL sun2016return matches the second-order statistics of the source to the target domain by recoloring whitened source data with statistics from the target domain. Inspired by adversarial learning, DANN ganin2016domain formulates domain adaptation as an adversarial two-player game: the domain discriminator aims to distinguish the source from the target domain, and the feature generator aims to confuse the domain discriminator. CDAN long2018conditional trains deep networks conditioned on several sources of information. DIRT-T shu2018dirt performs domain adversarial training with an added term that penalizes violations of the cluster assumption. Both ADR saito2017adversarial and MCD saito2018maximum optimize two classifiers to be consistent across domains.

Domain adaptation has also been tackled from other perspectives. RCA cicek2019unsupervised proposes a multi-classification discriminator. DAMN bermudez2020domain introduces a framework where each domain undergoes a different sequence of operations. AFN Xu_2019_ICCV shows that the erratic discrimination of target features stems from much smaller norms than those found in source features.

All these methods, however, require access to source data during adaptation. Recently, USFDA kundu2020universal explored the source-free setting, but it focuses only on the Universal Domain Adaptation task you2019universal, and the proposed method is complex, involving the generation of simulated labeled negative samples. The most relevant works for closed-set UDA are SHOT liang2020we and SFDA kim2020domain, but SHOT needs to generate extra pseudo labels, while SFDA does not achieve satisfactory performance. Unlike them, our proposed method simply introduces an additional bait classifier to achieve feature alignment, thus alleviating the performance deterioration during adaptation under the SFUDA setting.


We are given a labeled source domain dataset, in which every sample comes with its corresponding label, and an unlabeled target domain dataset. Unlike the normal UDA setting, SFUDA uses only the model trained on the source data and the unlabeled target domain data during adaptation, without any further access to the source data. This paper aims to provide an effective and general training framework for SFUDA. In the following sections, we first introduce a baseline setting for SFUDA, and then present our proposed method.

Baseline setting

Usually UDA methods eliminate the domain shift by aligning the feature distribution between the source and target domains. This is especially challenging under the SFUDA setting, where we never have simultaneous access to source and target data.

Figure 2: Illustration of the training process. The top shows that the source-trained model fails on the target domain due to domain shift. The bottom illustrates our adaptation process. Bottom (a): mining the potential wandering features by checking whether the prediction entropy is above the threshold. Bottom (b): training the bait classifier $C_2$ by making it disagree with the anchor classifier $C_1$. Bottom (c): training the feature extractor with the adaptation loss and making $C_1$ and $C_2$ reach consensus.

We decompose the neural network into two parts: a feature extractor $f$ and a classifier head. A successful alignment means that the features produced by the feature extractor for both domains lie on the same, correct side of the decision boundary determined by the classifier head, i.e., they are classified correctly by the classifier head. Therefore, we propose to freeze the source-trained classifier head. This implicitly allows us to store the relevant information from the source domain without actually accessing the source data.

We first train the baseline model on the labeled source data with the standard cross-entropy loss, then fix the classifier head. Next, we train the model only on the unlabeled target domain data. The baseline model could be any compatible existing UDA method. According to the cluster assumption chapelle2005semi, minimizing the entropy of the predictions will push the target features far away from the decision boundary. In our paper, we choose the pipeline with the BNM loss cui2020towards as the baseline, where minimizing the BNM loss is equivalent to decreasing the prediction entropy and increasing the prediction diversity.
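To make the difference between the two adaptation losses concrete, the following is a minimal sketch, not the authors' implementation. It assumes the BNM loss is the negative nuclear norm of the batch prediction matrix divided by the batch size, and it is restricted to two classes so that the singular values can be computed in closed form without a linear-algebra library. All function names are hypothetical.

```python
import math

def nuclear_norm_two_classes(probs):
    """Nuclear norm of a B x 2 prediction matrix P.

    The singular values of P are the square roots of the eigenvalues of
    the 2 x 2 Gram matrix P^T P, which have a closed form.
    """
    a = sum(p[0] * p[0] for p in probs)  # (P^T P)[0][0]
    b = sum(p[0] * p[1] for p in probs)  # (P^T P)[0][1]
    d = sum(p[1] * p[1] for p in probs)  # (P^T P)[1][1]
    mean = (a + d) / 2.0
    gap = math.sqrt(((a - d) / 2.0) ** 2 + b * b)
    eig_hi = mean + gap
    eig_lo = max(mean - gap, 0.0)  # clamp tiny negative rounding errors
    return math.sqrt(eig_hi) + math.sqrt(eig_lo)

def bnm_loss(probs):
    """Assumed BNM loss: negative nuclear norm over batch size."""
    return -nuclear_norm_two_classes(probs) / len(probs)

def mean_entropy(probs):
    """Average prediction entropy over the batch."""
    return sum(-sum(q * math.log(q) for q in p if q > 0.0) for p in probs) / len(probs)

confident_diverse = [[1.0, 0.0], [0.0, 1.0]]    # confident, both classes used
confident_collapsed = [[1.0, 0.0], [1.0, 0.0]]  # confident, one class only
uncertain = [[0.5, 0.5], [0.5, 0.5]]            # maximally uncertain

for name, batch in [("diverse", confident_diverse),
                    ("collapsed", confident_collapsed),
                    ("uncertain", uncertain)]:
    print(name, round(mean_entropy(batch), 4), round(bnm_loss(batch), 4))
```

In this toy case the entropy cannot tell the diverse batch from the collapsed one (both have zero entropy), whereas the assumed BNM loss is lowest for the batch that is both confident and diverse.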

Source-free Domain Adaptation with BAIT

The baseline method, which minimizes the prediction entropy, can partly ensure that the target features match the source classifier, but there may still exist some difficult target samples that are misclassified by the source classifier. Due to the domain shift, the cluster of target features generated by the source-trained feature extractor may cross the decision boundary. If we naively minimize the entropy loss on the feature extractor (as we do in the baseline model), some features may head to the right side of the decision boundary, but those target features located on the wrong side will move in the wrong direction. Since entropy minimization ensures that the decision boundary does not pass through data-dense regions, this results in misclassification if no other restrictions are imposed.

To tackle this challenge, we propose a general framework named BAIT, which introduces an extra classifier/decision boundary (hereafter we denote the fixed anchor classifier as $C_1$, the extra classifier as $C_2$, and the feature extractor as $f$), as shown on the right of Fig. 1. Both classifiers $C_1$ and $C_2$ will correctly predict some target features but misclassify others. Optimizing the feature extractor to reach consensus on both $C_1$ and $C_2$ will therefore push the target features towards the inner side of both decision boundaries. Hereafter we call features that are close to the decision boundaries, and thus likely to cross them during adaptation, wandering features. Typical wandering features lie near other features that all belong to the same class, but on the wrong side of the decision boundary.

After training the model on the source data, we get a feature extractor $f$ and an anchor classifier head $C_1$. We fix $C_1$ in the subsequent training periods. $C_2$ is initialized by $C_1$ before the adaptation. To adapt the model, we propose a 3-step training policy which alternately trains the bait classifier $C_2$ and the feature extractor $f$. Note that after training on the source domain, we never get access to the source domain data during adaptation. In order to better illustrate our method, hereafter we treat the weights of the classifier head as class prototypes (the classifier head contains only a single fully connected layer, with l2 normalization on its weights). The proposed algorithm has three steps:

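Step 1: Wandering features mining. During adaptation, we first try to find the potentially wandering features. Specifically, we split the current batch into two sets, the wandering features $\mathcal{W}$ and the non-wandering features $\bar{\mathcal{W}}$ (as shown in Fig. 2 (a)), according to their prediction entropy:

$$\mathcal{W} = \{x \mid H(p_1(x)) > \tau\}, \qquad \bar{\mathcal{W}} = \{x \mid H(p_1(x)) \le \tau\} \qquad (1)$$

where $x$ ranges over the samples of the current batch, $H$ denotes the entropy, and $p_1(x) = \sigma(C_1(f(x)))$ is the prediction of the source (anchor) classifier $C_1$ on the features produced by the extractor $f$ ($\sigma$ represents the softmax operation). The threshold $\tau$ is estimated as a percentile of the entropy of $p_1(x)$ in the batch, set to 50% (i.e. the median).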
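The entropy-based split of Step 1 can be sketched as follows. This is an illustrative sketch in plain Python, assuming the anchor classifier's logits for one mini-batch are already available as a list of lists; `split_batch` and the example logits are hypothetical and not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy of a probability vector (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def median(values):
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return 0.5 * (ordered[mid - 1] + ordered[mid])

def split_batch(anchor_logits):
    """Return (wandering, non_wandering) index lists for one mini-batch.

    The threshold is the median (50th percentile) of the anchor
    classifier's prediction entropies within the batch.
    """
    entropies = [entropy(softmax(z)) for z in anchor_logits]
    tau = median(entropies)
    wandering = [i for i, h in enumerate(entropies) if h > tau]
    non_wandering = [i for i, h in enumerate(entropies) if h <= tau]
    return wandering, non_wandering

# Hypothetical anchor-classifier logits for a batch of four samples.
batch_logits = [
    [4.0, 0.0, 0.0],  # confident
    [0.2, 0.1, 0.0],  # close to uniform
    [0.0, 3.0, 0.0],  # confident
    [1.0, 0.5, 0.0],  # fairly uncertain
]
wandering, non_wandering = split_batch(batch_logits)
print(wandering, non_wandering)
```

With the median as threshold, the two low-confidence samples (indices 1 and 3) end up in the wandering set and the two confident ones in the non-wandering set.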

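Require: unlabeled target data
Require: network (feature extractor $f$ and anchor classifier $C_1$) trained with source data
Initialize the bait classifier $C_2$ from $C_1$
while not done do
    Sample a batch from the unlabeled target data
    Calculate the wandering set $\mathcal{W}$ and the non-wandering set $\bar{\mathcal{W}}$ from Eq. 1
    Update $C_2$ by minimizing the casting loss $L_{cast}(C_2)$ (Step 2)
    Update $f$ by minimizing the bite loss $L_{bite}(f)$ (Step 3)
end while
Algorithm 1 Unsupervised domain adaptation with BAIT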

Step 2: Casting the bait. Here we train only the bait classifier $C_2$ and freeze the feature extractor $f$. $C_2$ is initialized from the anchor classifier $C_1$ before the adaptation. As illustrated in Fig. 2, the domain shift results in some wandering features, which may be on the wrong side of the decision boundary. The purpose of this stage is to use the bait classifier/prototype to find the wandering features. We achieve this by maximizing the discrepancy with respect to $C_1$ on the wandering set $\mathcal{W}$ and minimizing it on the non-wandering set $\bar{\mathcal{W}}$ (see Fig. 2 (b)):

$$L_{cast}(C_2) = \sum_{x \in \bar{\mathcal{W}}} D_{SKL}(p_1(x), p_2(x)) - \sum_{x \in \mathcal{W}} D_{SKL}(p_1(x), p_2(x))$$

where $p_1(x)$ and $p_2(x)$ are the predictions of $C_1$ and $C_2$, and $D_{SKL}$ is the symmetric KL divergence. As shown in Fig. 2 (b), given that $C_2$ is initialized from $C_1$, increasing the disagreement between the two classifiers will drive the prototypes of $C_2$ towards the wandering features. Intuitively, (the prototypes of) $C_2$, initialized from $C_1$, can be regarded as a bait to approach the wandering features. Minimizing $L_{cast}$ aims to increase the disagreement between $C_1$ and $C_2$ on the wandering features, or in other words to make the decision boundaries of $C_2$ stay away from those of $C_1$. Note that maximizing the discrepancy does not necessarily mean that the two classifiers have different predictions for all data.

Our motivation is to push the wandering features towards the right side of the decision boundary. However, if we directly maximize the discrepancy for all target data, i.e., remove the batch splitting of Step 1, there may exist a trivial solution: the prototype of $C_2$ moves to a position far away from both the target features and the prototype of $C_1$. After the adaptation, some features that were previously classified correctly may then end up on the wrong side of the decision boundary, leading to misclassification. This is the reason why we need Step 1.

With Step 1, we use only the half of the features with higher prediction entropy to create disagreement between $C_1$ and $C_2$, and keep the agreement for the remaining target data. The philosophy behind Step 1 is that target features with lower prediction entropy are more likely to be inside the decision boundary, i.e., classified correctly, while those with higher prediction entropy are likely to be near the boundary and are thus wandering features. Training in Step 2 with this batch splitting prevents the bait classifier from going too far.
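As an illustration of Step 2, the sketch below evaluates the casting loss for given predictions of the two classifiers. It is a plain-Python sketch rather than the authors' code: the symmetric KL divergence is assumed to be the average of the two KL directions, and the prediction values and helper names are hypothetical.

```python
import math

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions; eps guards log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def symmetric_kl(p, q):
    # Assumed convention: average of the two KL directions.
    return 0.5 * (kl(p, q) + kl(q, p))

def cast_loss(p1, p2, wandering, non_wandering):
    """Agreement term on non-wandering samples minus the
    disagreement term on wandering samples."""
    keep = sum(symmetric_kl(p1[i], p2[i]) for i in non_wandering)
    push = sum(symmetric_kl(p1[i], p2[i]) for i in wandering)
    return keep - push

# Anchor predictions: sample 0 is confident, sample 1 is near the boundary.
p1 = [[0.9, 0.1], [0.55, 0.45]]
wandering, non_wandering = [1], [0]

p2_init = [[0.9, 0.1], [0.55, 0.45]]  # bait right after initialization
p2_good = [[0.9, 0.1], [0.2, 0.8]]    # bait disagrees only on the wandering sample
p2_bad = [[0.2, 0.8], [0.55, 0.45]]   # bait disagrees on the confident sample

loss_init = cast_loss(p1, p2_init, wandering, non_wandering)
loss_good = cast_loss(p1, p2_good, wandering, non_wandering)
loss_bad = cast_loss(p1, p2_bad, wandering, non_wandering)
print(loss_init, loss_good, loss_bad)
```

The loss is zero when the bait equals the anchor, negative when the bait moves only on the wandering sample, and positive when it drifts on a confident sample, which is why the split keeps the bait from going too far.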

Step 3: Features bite the bait. In this stage, only the feature extractor is trained, by minimizing the disagreement between the two classifiers, with the aim of making the two classifiers reach consensus.

Specifically, we update the feature extractor by minimizing the discrepancy between the predictions of both classifiers together with the adaptation loss:

$$L_{bite}(f) = \sum_{x \in \bar{\mathcal{T}}} D_{SKL}(p_1(x), p_2(x)) + \alpha \left( L_{ada}(f, C_1, \mathcal{T}) + L_{ada}(f, C_2, \mathcal{T}) \right)$$

where $L_{ada}$ is the adaptation loss, which can simply be entropy minimization. Instead, we adopt batch nuclear-norm maximization (BNM) cui2020towards as the adaptation loss; minimizing the BNM loss means decreasing the prediction entropy and increasing the prediction diversity. $\alpha$ is the hyperparameter that balances the different objectives.

As shown in Fig. 2 (c), minimizing the disagreement will push the features towards the prototypes of the same class from the two classifiers. Metaphorically, in this stage the wandering features, and also the other features, bite the bait, meaning they are pulled into the region of consensus for $C_1$ and $C_2$.
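The objective of Step 3 can be sketched in the same spirit. The sketch below is an assumption-laden illustration, not the authors' implementation: it uses the simpler entropy-minimization variant of the adaptation loss (averaged over the batch) instead of BNM, sums the consensus term over every sample of the batch, and assumes the symmetric KL divergence averages the two KL directions. All names and numbers are hypothetical.

```python
import math

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions; eps guards log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def symmetric_kl(p, q):
    # Assumed convention: average of the two KL directions.
    return 0.5 * (kl(p, q) + kl(q, p))

def mean_entropy(preds):
    """Entropy-minimization adaptation loss, averaged over the batch."""
    return sum(-sum(q * math.log(q) for q in p if q > 0.0) for p in preds) / len(preds)

def bite_loss(p1, p2, alpha=1.0):
    """Consensus term plus alpha times the adaptation loss of both heads."""
    consensus = sum(symmetric_kl(a, b) for a, b in zip(p1, p2))
    adaptation = mean_entropy(p1) + mean_entropy(p2)
    return consensus + alpha * adaptation

# Predictions of the anchor (p1) and bait (p2) for one sample,
# before and after a hypothetical update of the feature extractor.
before = bite_loss([[0.6, 0.4]], [[0.3, 0.7]])
after = bite_loss([[0.95, 0.05]], [[0.93, 0.07]])
print(round(before, 4), round(after, 4))
```

The loss drops when the feature moves to a position where both heads agree and are confident, which is the behavior the feature extractor is trained towards in this step.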

BN statistics. As shown in several recent works li2018adaptive; chang2019domain; wang2019transferable, the means and variances inside batch norm layers ioffe2015batch differ between domains. In our proposed method, where we fix the source-trained classifier and have no access to source data during adaptation, we propose two BN-layer-specific policies:

(1) Fixing the scale and shift factors during adaptation. Since we adapt the source-trained model to the target domain with the source-trained classifier fixed, it is reasonable to keep the same shift and scale factors as in the source domain.

(2) Re-initializing the mean and variance statistics before adaptation. BN layers estimate their statistics with a moving average that blends the stored running estimate with the statistics of the current mini-batch. If the statistics are not re-initialized, the BN layers will also take the source statistics into account, which may hinder training, since we no longer use source data and do not care about the performance on the source domain.

In the experiments, we will show how these two policies affect the performance.
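The effect of stale source statistics on the moving average can be illustrated with a small sketch. It assumes the common update rule `running = (1 - momentum) * running + momentum * batch_stat` with momentum 0.1, and a reset value of 0 for the running mean; both are assumptions that mirror common framework defaults rather than values stated in the paper, and the numbers are hypothetical.

```python
def update_running_stat(running, batch_stats, momentum=0.1):
    """Apply the exponential moving-average update once per mini-batch."""
    for batch_stat in batch_stats:
        running = (1.0 - momentum) * running + momentum * batch_stat
    return running

def initial_weight(num_updates, momentum=0.1):
    """Weight that the initial value still carries after num_updates."""
    return (1.0 - momentum) ** num_updates

source_mean = 5.0   # hypothetical running mean left over from source training
target_mean = 1.0   # hypothetical per-batch mean on the target domain
num_updates = 10

kept = update_running_stat(source_mean, [target_mean] * num_updates)
reset = update_running_stat(0.0, [target_mean] * num_updates)

print(round(initial_weight(num_updates), 4))  # share of the initial value left
print(round(kept, 4), round(reset, 4))
```

After ten target batches the initial value still carries roughly a third of the weight, so in this example a running mean inherited from the source stays further from the target statistic than a re-initialized one.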

Our work has similarities with MCD saito2018maximum and ADR saito2017adversarial, which also use the discrepancy between two classifiers to achieve feature alignment. Simultaneous access to both the source and target data at all times is a crucial requirement of these methods (they report a large drop in performance if the source data are not used). In our work, by contrast, with the anchor classifier (trained only on source data) fixed, we deploy a bait classifier to make the target features match the anchor classifier, since we aim for all target features to be correctly classified by the anchor classifier. Thereby, we avoid the need for source domain data during the adaptation.

Overall, the whole adaptation process is summarized in Algorithm 1. Note that the 3-step training happens in every mini-batch iteration during adaptation. We also relate BAIT to domain adaptation theory ben2010theory in the supplementary material.


Experiment on Twinning moon dataset

We carry out an experiment on the twinning moon dataset. In this dataset, the source-domain data consist of two intertwining moons, each containing 300 samples. We generate the target-domain data by rotating the source data by a fixed angle, where the rotation angle can be regarded as the domain shift. First we train the model only on the source domain and test it on both domains. As shown in Fig. 3 (a) and (b), due to the domain shift the model performs worse on the target data. Then we adapt the model to the target domain with the anchor and bait classifiers, without access to any source data. As shown in Fig. 3 (c), during adaptation the disagreement between the two classifiers makes them cover different regions. The data whose predictions differ before and after adaptation are the expected wandering data. After adaptation the two decision boundaries almost coincide, as shown in Fig. 3 (d). (Note that here each decision boundary is that of the whole model, since the inputs are data points rather than features.)
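A toy dataset of this kind can be generated with a few lines of plain Python. The sketch below is only an illustration: the moon parametrization and noise level are assumptions modeled on common two-moons generators, and the 30-degree rotation is a placeholder chosen for illustration, not a value taken from the experiments.

```python
import math
import random

def make_moons(n_per_moon=300, noise=0.05, seed=0):
    """Two interleaving half circles with Gaussian noise (labels 0 and 1)."""
    rng = random.Random(seed)
    points, labels = [], []
    for i in range(n_per_moon):
        t = math.pi * i / (n_per_moon - 1)
        upper = (math.cos(t), math.sin(t))              # class 0
        lower = (1.0 - math.cos(t), 0.5 - math.sin(t))  # class 1
        for label, (x, y) in enumerate([upper, lower]):
            points.append((x + rng.gauss(0.0, noise), y + rng.gauss(0.0, noise)))
            labels.append(label)
    return points, labels

def rotate(points, degrees):
    """Rotate 2D points counter-clockwise around the origin."""
    theta = math.radians(degrees)
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

source_points, source_labels = make_moons()
# Placeholder angle; the rotation plays the role of the domain shift.
target_points = rotate(source_points, degrees=30.0)
print(len(source_points), len(target_points))
```

The target labels are the same as the source labels but would be withheld during adaptation; they are only needed to measure accuracy afterwards.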

Figure 3: Toy experiments on the twinning moon 2D dataset. The blue points are target data while the others are source data. After training the model only on the source data, we test it on source (a) and target (b) data. (c) In the middle of adaptation with only target data; the two borderlines denote the two decision boundaries. (d) After adaptation, the two decision boundaries almost coincide.

Experiments on recognition benchmarks

We use three benchmark datasets. Office-31 saenko2010adapting contains 3 domains (Amazon, Webcam, DSLR) with 31 classes and 4,652 images. Office-Home venkateswara2017deep contains 4 domains (Real, Clipart, Art, Product) with 65 classes and a total of 15,500 images. VisDA peng2017visda is a more challenging dataset with a 12-class synthesis-to-real object recognition task; its source domain contains 152k synthetic images while the target domain has 55k real object images.

Our method is a general framework which can be applied to compatible UDA methods under the SFUDA setting. In the experiments we report the results of BAIT based on two different adaptation losses. We denote the baseline with the BNM loss cui2020towards as BNM, and the one with the entropy minimization loss as ENT. Accordingly, we refer to our method as BNM+BAIT and ENT+BAIT. We compare the proposed method under the source-free setting with state-of-the-art methods under the normal setting, where source data are available.

| Method | Source-free | A→D | A→W | D→W | W→D | D→A | W→A | Avg |
|---|---|---|---|---|---|---|---|---|
| GFK gong2012geodesic | ✗ | 74.5 | 72.8 | 95.0 | 98.2 | 63.4 | 61.0 | 77.5 |
| DAN long2015learning | ✗ | 78.6 | 80.5 | 97.1 | 99.6 | 63.6 | 62.8 | 80.4 |
| DANN ganin2016domain | ✗ | 79.7 | 82.0 | 96.9 | 99.1 | 68.2 | 67.4 | 82.2 |
| ADDA tzeng2017adversarial | ✗ | 77.8 | 86.2 | 96.2 | 98.4 | 69.5 | 68.9 | 82.9 |
| MaxSquare chen2019domain | ✗ | 90.0 | 92.4 | 99.1 | 100.0 | 68.1 | 64.2 | 85.6 |
| Simnet pinheiro2018unsupervised | ✗ | 85.3 | 88.6 | 98.2 | 99.7 | 73.4 | 71.8 | 86.2 |
| GTA sankaranarayanan2018generate | ✗ | 87.7 | 89.5 | 97.9 | 99.8 | 72.8 | 71.4 | 86.5 |
| MCD saito2018maximum | ✗ | 92.2 | 88.6 | 98.5 | 100.0 | 69.5 | 69.7 | 86.5 |
| CBST zou2018unsupervised | ✗ | 86.5 | 87.8 | 98.5 | 100.0 | 70.9 | 71.2 | 85.8 |
| CRST zou2019confidence | ✗ | 88.7 | 89.4 | 98.9 | 100.0 | 70.9 | 72.6 | 86.8 |
| MDD zhang2019bridging | ✗ | 90.4 | 90.4 | 98.7 | 99.9 | 75.0 | 73.7 | 88.0 |
| TADA wang2019transferable1 | ✗ | 91.6 | 94.3 | 98.7 | 99.8 | 72.9 | 73.0 | 88.4 |
| MDD+Implicit Alignment jiang2020implicit | ✗ | 92.1 | 90.3 | 98.7 | 99.8 | 75.3 | 74.9 | 88.8 |
| DMRL wu2020dual | ✗ | 93.4 | 90.8 | 99.0 | 100.0 | 73.0 | 71.2 | 87.9 |
| BDG yang2020bi | ✗ | 93.6 | 93.6 | 99.0 | 100.0 | 73.2 | 72.0 | 88.5 |
| MCC jin2019minimum | ✗ | 95.6 | 95.4 | 98.6 | 100.0 | 72.6 | 73.9 | 89.4 |
| ENT cui2020towards | ✗ | 86.0 | 87.9 | 98.4 | 100.0 | 67.0 | 63.7 | 83.8 |
| BNM cui2020towards | ✗ | 90.3 | 91.5 | 98.5 | 100.0 | 70.9 | 71.6 | 87.1 |
| USFDA kundu2020universal | ✓ | - | - | - | - | - | - | 85.4 |
| SHOT liang2020we | ✓ | 93.1 | 90.9 | 98.8 | 99.9 | 74.5 | 74.8 | 88.7 |
| SFDA kim2020domain | ✓ | 92.2 | 91.1 | 98.2 | 99.5 | 71.0 | 71.2 | 87.2 |
| ENT | ✓ | 80.2 | 86.4 | 95.6 | 99.0 | 64.2 | 60.2 | 80.9 |
| ENT+BAIT (ours) | ✓ | 87.0 | 89.7 | 98.4 | 100.0 | 66.2 | 61.6 | 83.8 |
| BNM | ✓ | 83.2 | 89.0 | 97.0 | 99.1 | 69.6 | 68.8 | 84.4 |
| BNM+BAIT (ours) | ✓ | 91.0 | 93.0 | 99.0 | 100.0 | 75.0 | 75.3 | 88.9 |

Table 1: Accuracies (%) on Office-31 for ResNet50-based unsupervised domain adaptation methods. Source-free means source-free setting without access to source data during adaptation.

| Method | Source-free | Ar→Cl | Ar→Pr | Ar→Rw | Cl→Ar | Cl→Pr | Cl→Rw | Pr→Ar | Pr→Cl | Pr→Rw | Rw→Ar | Rw→Cl | Rw→Pr | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DAN long2015learning | ✗ | 43.6 | 57.0 | 67.9 | 45.8 | 56.5 | 60.4 | 44.0 | 43.6 | 67.7 | 63.1 | 51.5 | 74.3 | 56.3 |
| DANN ganin2016domain | ✗ | 45.6 | 59.3 | 70.1 | 47.0 | 58.5 | 60.9 | 46.1 | 43.7 | 68.5 | 63.2 | 51.8 | 76.8 | 57.6 |
| MCD saito2018maximum | ✗ | 48.9 | 68.3 | 74.6 | 61.3 | 67.6 | 68.8 | 57.0 | 47.1 | 75.1 | 69.1 | 52.2 | 79.6 | 64.1 |
| SAFN Xu_2019_ICCV | ✗ | 52.0 | 71.7 | 76.3 | 64.2 | 69.9 | 71.9 | 63.7 | 51.4 | 77.1 | 70.9 | 57.1 | 81.5 | 67.3 |
| Symnets zhang2019domain | ✗ | 47.7 | 72.9 | 78.5 | 64.2 | 71.3 | 74.2 | 64.2 | 48.8 | 79.5 | 74.5 | 52.6 | 82.7 | 67.6 |
| MDD zhang2019bridging | ✗ | 54.9 | 73.7 | 77.8 | 60.0 | 71.4 | 71.8 | 61.2 | 53.6 | 78.1 | 72.5 | 60.2 | 82.3 | 68.1 |
| TADA wang2019transferable1 | ✗ | 53.1 | 72.3 | 77.2 | 59.1 | 71.2 | 72.1 | 59.7 | 53.1 | 78.4 | 72.4 | 60.0 | 82.9 | 67.6 |
| MDD+Implicit Alignment jiang2020implicit | ✗ | 56.0 | 77.9 | 79.2 | 64.4 | 73.1 | 74.4 | 64.2 | 54.2 | 79.9 | 71.2 | 58.1 | 83.1 | 69.5 |
| AADA+CCN yangmind | ✗ | 54.0 | 71.3 | 77.5 | 60.8 | 70.8 | 71.2 | 59.1 | 51.8 | 76.9 | 71.0 | 57.4 | 81.8 | 67.0 |
| BDG yang2020bi | ✗ | 51.5 | 73.4 | 78.7 | 65.3 | 71.5 | 73.7 | 65.1 | 49.7 | 81.1 | 74.6 | 55.1 | 84.8 | 68.7 |
| ENT cui2020towards | ✗ | 43.2 | 68.4 | 78.4 | 61.4 | 69.9 | 71.4 | 58.5 | 44.2 | 78.2 | 71.1 | 47.6 | 81.8 | 64.5 |
| BNM cui2020towards | ✗ | 52.3 | 73.9 | 80.0 | 63.3 | 72.9 | 74.9 | 61.7 | 49.5 | 79.7 | 70.5 | 53.6 | 82.2 | 67.9 |
| SHOT liang2020we | ✓ | 56.9 | 78.1 | 81.0 | 67.9 | 78.4 | 78.1 | 67.0 | 54.6 | 81.8 | 73.4 | 58.1 | 84.5 | 71.6 |
| SFDA kim2020domain | ✓ | 48.4 | 73.4 | 76.9 | 64.3 | 69.8 | 71.7 | 62.7 | 45.3 | 76.6 | 69.8 | 50.5 | 79.0 | 65.7 |
| ENT | ✓ | 50.0 | 59.3 | 75.1 | 55.8 | 67.3 | 67.0 | 55.7 | 48.4 | 73.2 | 67.1 | 55.3 | 78.4 | 62.7 |
| ENT+BAIT (ours) | ✓ | 52.4 | 73.6 | 75.4 | 60.1 | 69.0 | 69.1 | 60.0 | 50.0 | 76.5 | 71.6 | 57.0 | 80.7 | 66.3 |
| BNM | ✓ | 53.3 | 71.4 | 77.0 | 56.2 | 67.8 | 69.6 | 55.7 | 49.4 | 75.9 | 66.3 | 55.1 | 78.5 | 64.7 |
| BNM+BAIT (ours) | ✓ | 56.8 | 78.2 | 81.1 | 68.4 | 77.1 | 75.1 | 66.4 | 56.0 | 81.8 | 74.3 | 59.8 | 83.4 | 71.5 |

Table 2: Accuracies (%) on Office-Home for ResNet50-based unsupervised domain adaptation methods. Source-free means source-free setting without access to source data during adaptation. Underline means second highest result.

Figure 4: Accuracy curves when training on target data. The blue line is from the BNM baseline with LS and BN fixed; the other two are from $C_1$ and $C_2$ of BNM+BAIT with LS and with BN fixed and re-initialized.

For the model details, we use the same network architecture as recent work liang2020we: we adopt the backbone of ResNet-50 he2016deep (for the Office datasets) or ResNet-101 (for VisDA) along with an extra fully connected (fc) layer as the feature extractor, and an fc layer as the classifier head. Here we specify the details of training with BNM+BAIT.

The balancing hyperparameter $\alpha$ is set to 1 in all experiments. We adopt SGD with momentum 0.9, and the batch size is 64. For source domain training, the learning rate is set to 0.001 for the feature extractor and 0.01 for the classifier; for training on the target, all learning rates are set to 0.0001. We train for 20 epochs on the source and for 50 epochs on the target, since the learning rate is quite small. As for the entropy threshold, we also tried using 25% and 75% of the batch as wandering data, but both lead to slightly lower results than 50%; we show results on the twinning moon dataset with different thresholds in the supplementary material.

We use label smoothing when training the model on the source data, which avoids over-confident predictions muller2019does. At each iteration we run Step 3 twice for the Office datasets but only once for VisDA. All experiments are conducted on a single RTX 2080 Ti GPU. All reported results are from the anchor classifier $C_1$.
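For readers unfamiliar with label smoothing, the sketch below shows why it discourages over-confident predictions. It assumes the usual formulation in which the one-hot target is mixed with a uniform distribution, with a smoothing factor of 0.1; the factor and the example probabilities are hypothetical, not values from the paper.

```python
import math

def smoothed_targets(label, num_classes, epsilon=0.1):
    """Mix the one-hot target with a uniform distribution over classes."""
    off = epsilon / num_classes
    return [(1.0 - epsilon) + off if k == label else off for k in range(num_classes)]

def cross_entropy(target_dist, probs):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target_dist, probs) if t > 0.0)

hard = [1.0, 0.0, 0.0]                             # plain one-hot target for class 0
soft = smoothed_targets(label=0, num_classes=3)    # smoothed target for class 0

overconfident = [0.999, 0.0005, 0.0005]
moderate = [0.9, 0.05, 0.05]

print(round(cross_entropy(hard, overconfident), 4), round(cross_entropy(hard, moderate), 4))
print(round(cross_entropy(soft, overconfident), 4), round(cross_entropy(soft, moderate), 4))
```

With the one-hot target, the extremely confident prediction has the lower loss; with the smoothed target, the moderately confident prediction is preferred.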

Quantitative Results. The results on the Office-31, Office-Home and VisDA datasets are shown in Tab. 1-3. In these tables, the top part shows results for the normal setting with access to source data during adaptation, and the bottom part shows results for the source-free setting. On the two Office datasets, the performance of BNM drops significantly when it loses access to source data. However, BNM+BAIT obtains a significant improvement. Surprisingly, it even performs better than BNM with access to the source data on both datasets. This also happens when we combine BAIT with the ENT method. Furthermore, our BNM+BAIT outperforms all normal UDA methods on Office-Home and VisDA. It is only lower than MCC (with access to source data) on Office-31. In addition, our method surpasses other recent SFUDA methods except SHOT on all three datasets. BNM+BAIT is on par with SHOT on Office-31 and Office-Home, but outperforms SHOT on the large VisDA dataset by 2.2%. It is fair to compare BNM+BAIT with SHOT, since SHOT adopts the IM loss, which plays the same role as the BNM loss. But unlike SHOT, our method does not require pseudo label generation.

| Method (Synthesis → Real) | Source-free | plane | bcycl | bus | car | horse | knife | mcycl | person | plant | sktbrd | train | truck | Per-class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DANN ganin2016domain | ✗ | 81.9 | 77.7 | 82.8 | 44.3 | 81.2 | 29.5 | 65.1 | 28.6 | 51.9 | 54.6 | 82.8 | 7.8 | 57.4 |
| DAN long2015learning | ✗ | 87.1 | 63.0 | 76.5 | 42.0 | 90.3 | 42.9 | 85.9 | 53.1 | 49.7 | 36.3 | 85.8 | 20.7 | 61.1 |
| ADR saito2017adversarial | ✗ | 94.2 | 48.5 | 84.0 | 72.9 | 90.1 | 74.2 | 92.6 | 72.5 | 80.8 | 61.8 | 82.2 | 28.8 | 73.5 |
| CDAN long2018conditional | ✗ | 85.2 | 66.9 | 83.0 | 50.8 | 84.2 | 74.9 | 88.1 | 74.5 | 83.4 | 76.0 | 81.9 | 38.0 | 73.9 |
| CDAN+BSP chen2019transferability | ✗ | 92.4 | 61.0 | 81.0 | 57.5 | 89.0 | 80.6 | 90.1 | 77.0 | 84.2 | 77.9 | 82.1 | 38.4 | 75.9 |
| SAFN Xu_2019_ICCV | ✗ | 93.6 | 61.3 | 84.1 | 70.6 | 94.1 | 79.0 | 91.8 | 79.6 | 89.9 | 55.6 | 89.0 | 24.4 | 76.1 |
| SWD lee2019sliced | ✗ | 90.8 | 82.5 | 81.7 | 70.5 | 91.7 | 69.5 | 86.3 | 77.5 | 87.4 | 63.6 | 85.6 | 29.2 | 76.4 |
| MDD zhang2019bridging | ✗ | - | - | - | - | - | - | - | - | - | - | - | - | 74.6 |
| Implicit Alignment jiang2020implicit | ✗ | - | - | - | - | - | - | - | - | - | - | - | - | 75.8 |
| DMRL wu2020dual | ✗ | - | - | - | - | - | - | - | - | - | - | - | - | 75.5 |
| DM-ADA xu2019adversarial | ✗ | - | - | - | - | - | - | - | - | - | - | - | - | 75.6 |
| MCC jin2019minimum | ✗ | 88.7 | 80.3 | 80.5 | 71.5 | 90.1 | 93.2 | 85.0 | 71.6 | 89.4 | 73.8 | 85.0 | 36.9 | 78.8 |
| SHOT liang2020we | ✓ | 92.6 | 81.1 | 80.1 | 58.5 | 89.7 | 86.1 | 81.5 | 77.8 | 89.5 | 84.9 | 84.3 | 49.3 | 79.6 |
| SFDA kim2020domain | ✓ | 86.9 | 81.7 | 84.6 | 63.9 | 93.1 | 91.4 | 86.6 | 71.9 | 84.5 | 58.2 | 74.5 | 42.7 | 76.7 |
| | ✓ | 93.8 | 68.6 | 82.3 | 58.7 | 89.2 | 91.2 | 86.3 | 75.2 | 81.6 | 65.6 | 83.9 | 41.9 | 76.5 |
| BNM+BAIT (ours) | ✓ | 93.1 | 79.0 | 81.6 | 61.7 | 91.5 | 94.7 | 88.3 | 78.0 | 87.2 | 89.8 | 83.5 | 51.9 | 81.7 |
| *BNM+BAIT (ours) | ✓ | 94.0 | 80.2 | 81.0 | 61.0 | 91.8 | 95.0 | 87.5 | 79.2 | 87.3 | 89.0 | 83.1 | 52.4 | 81.8 |

Table 3: Accuracies (%) on VisDA-C for ResNet101-based unsupervised domain adaptation methods. Source-free means source-free setting without access to source data during adaptation. * refers to not fixing BN layers. Underline means second highest result.

| Method | LS | BNf | BNi | Ar→Cl | Ar→Pr | Ar→Rw | Cl→Ar | Cl→Pr | Cl→Rw | Pr→Ar | Pr→Cl | Pr→Rw | Rw→Ar | Rw→Cl | Rw→Pr | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BNM (not fixing) | | | | 52.2 | 69.0 | 75.9 | 54.7 | 67.4 | 68.0 | 53.3 | 46.9 | 74.4 | 64.9 | 50.2 | 77.1 | 62.8 |
| BNM | | | | 53.3 | 71.4 | 77.0 | 56.2 | 67.8 | 69.6 | 55.7 | 49.4 | 75.9 | 66.3 | 55.1 | 78.5 | 64.7 |
| BNM+BAIT* | | | | 55.0 | 72.3 | 78.3 | 58.0 | 69.7 | 71.6 | 57.4 | 49.9 | 77.1 | 67.6 | 56.5 | 80.0 | 66.1 |
| BNM | | | | 53.7 | 72.6 | 77.3 | 58.6 | 68.7 | 70.0 | 57.1 | 50.5 | 76.0 | 68.6 | 57.0 | 80.8 | 65.9 |
| BNM+BAIT* | | | | 55.8 | 73.9 | 78.4 | 61.0 | 70.4 | 71.8 | 59.2 | 52.5 | 77.4 | 70.6 | 59.2 | 81.9 | 67.7 |
| BNM | | | | 53.3 | 72.6 | 77.2 | 58.7 | 68.4 | 70.1 | 57.1 | 50.7 | 76.0 | 69.5 | 56.9 | 80.8 | 65.9 |
| BNM+BAIT* | | | | 56.7 | 75.2 | 79.4 | 62.7 | 72.0 | 72.9 | 61.8 | 53.3 | 78.5 | 72.6 | 59.7 | 82.7 | 68.9 |
| BNM+BAIT | | | | 57.7 | 77.0 | 80.2 | 63.5 | 72.7 | 75.5 | 61.9 | 55.0 | 80.0 | 71.7 | 59.7 | 83.0 | 69.8 |
| BNM | | | | 52.6 | 73.2 | 77.2 | 62.0 | 73.5 | 70.9 | 61.3 | 50.5 | 77.5 | 69.7 | 55.3 | 81.9 | 67.1 |
| BNM+BAIT | | | | 56.9 | 78.1 | 81.1 | 68.2 | 76.4 | 75.1 | 66.5 | 55.6 | 81.9 | 74.0 | 58.9 | 83.5 | 71.3 |
| BNM+BAIT | | | | 56.8 | 78.2 | 81.1 | 68.4 | 77.1 | 75.1 | 66.4 | 56.0 | 81.8 | 74.3 | 59.8 | 83.4 | 71.5 |

Table 4: Ablation study on the Office-Home dataset in the source-free setting. Not fixing refers to not fixing the classifier during adaptation. LS means label smoothing, BNf means fixing the BN layers when training on target data, while BNi means re-initializing the mean and variance of the BN layers. * means not using the wandering feature mining (see Eq. 1).

Figure 5: t-SNE visualization of the target features on task A→D of the Office-31 dataset, as output by the feature extractor. The left is from BNM and the right is from BNM+BAIT.

Ablation Study. We explore a wide variety of configurations for our method by applying the fixed classifier, label smoothing (LS) and the two BN policies (BN). As reported in Tab. 4 on the Office-Home dataset, BNM with a learned classifier (the first row), denoted as BNM (not fixing), obtains the worst result (62.8%) and gains 1.9% when the classifier is frozen (the second row), clearly indicating that fixing the classifier plays an important role in source-free domain adaptation. Both LS and the two BN policies, i.e., fixing the scale/shift factors and re-initializing the BN statistics, further improve performance. Mining the potential wandering features also gives an improvement. Interestingly, only re-initializing the BN layers achieves almost the same performance as simultaneously fixing and re-initializing them; this may imply that the learnable scale/shift factors do not change during training when the BN statistics are re-initialized before the adaptation. In conclusion, with our proposed BAIT the two pipelines gain significant improvements under the source-free setting.

Accuracy Curve. Fig. 4 shows the accuracy curves of both BNM and BNM+BAIT while adapting to the target domain on Office-Home. The starting point is the accuracy after training on the source data. The bait classifier $C_2$ aims to find the wandering features and push them towards the right side of the boundary of $C_1$, leading to faster convergence. In addition, the decision boundaries of $C_2$ should not be too far from those of $C_1$. This is verified in Fig. 4, which shows that the accuracy ascends quickly and the two classifiers have similar performance.

Visualization. We provide the t-SNE visualization of the target features. As shown in Fig. 5, the target features from our BAIT form more compact clusters.


In this paper, we investigate Source-free Unsupervised Domain Adaptation (SFUDA), where the source data are not available during adaptation. With the source-trained classifier fixed during adaptation, we propose a novel general framework named BAIT, which uses a bait classifier to draw the unlabeled target data towards the right side of the decision boundary of the source classifier. The experimental results show that it improves the performance significantly under the SFUDA setting, surpassing existing normal UDA methods, which require source data at all times.