Despite their great success, deep neural networks typically demand a huge amount of labeled data for training. However, collecting labeled data is often laborious and expensive. It would therefore be ideal if the knowledge obtained on label-rich datasets could be transferred to unlabeled data. For example, after training on synthetic images, it would be beneficial to transfer the obtained knowledge to the domain of real-world images. However, deep networks are weak at generalizing to unseen domains, even when the differences between the datasets are only subtle oquab2014learning. In real-world situations, a typical factor impairing the model's generalization ability is the distribution shift between data from different domains.
Domain adaptation methods aim to reduce the domain shift between the source and target domains. Usually, the data from the target domain are unlabeled, in which case the task is referred to as Unsupervised Domain Adaptation (UDA). Early works gong2012geodesic; pan2009survey learn domain-invariant features to link the target domain to the source domain. Along with the growing popularity of deep learning, many works leverage its powerful representation learning ability for domain adaptation ganin2016domain; long2017deep; oquab2014learning; yosinski2014transferable. Those methods typically minimize the distribution discrepancy between the two domains long2018transferable; long2015learning; long2016unsupervised, or deploy adversarial training using a discriminator pei2018multi; tzeng2015simultaneous; tzeng2017adversarial.
However, a crucial limitation of these methods is that they require access to the source domain data while adapting to the target domain. This is infeasible in several real-world situations, for example when deploying domain adaptation algorithms on mobile devices where the computation capacity is limited, or in situations where data-privacy rules limit access to the source domain. Because of its importance, the Source-free Unsupervised Domain Adaptation (SFUDA) setting, where the model is first trained on the source domain and then no longer has access to the source data when adapting to the target domain, has started to gain traction recently kundu2020universal; liang2020we; kim2020domain.
In this paper, we also investigate the SFUDA setting. Compared to UDA, it is much more challenging to align the source and target distributions in the SFUDA setting. To address this challenge, we propose a method named BAIT, as shown in Fig. 1. First, we freeze the classifier head (called the anchor classifier), which is trained only with source data. Instead of adapting the classifier to the target domain, we aim to align the features of the target domain with the fixed classifier. We then use the anchor classifier to initialize another classifier, which is deployed as bait to drive the target features towards the right side of the decision boundary of the anchor classifier. In addition, we investigate how BatchNorm ioffe2015batch influences performance under the SFUDA setting. As a general method, our approach can also be placed on top of any compatible UDA method for SFUDA. In the experiments, we show for two UDA methods, entropy minimization and Batch Nuclear-norm Maximization cui2020towards, that when combined with BAIT and trained without source data, they match or outperform the originally reported results obtained with source data. Moreover, we show that BAIT surpasses existing UDA methods on Office-Home and VisDA, even though these methods have source data at hand at all times, and also surpasses several SFUDA methods.
Domain adaptation aims to reduce the shift between the source and target domains. Moment matching methods align feature distributions by minimizing the feature distribution discrepancy, including methods such as DAN long2015learning and DDC tzeng2014deep, which deploy Maximum Mean Discrepancy. CORAL sun2016return matches the second-order statistics of the source to the target domain by recoloring whitened source data with statistics from the target domain. Inspired by adversarial learning, DANN ganin2016domain formulates domain adaptation as an adversarial two-player game: the domain discriminator aims to distinguish the source from the target domain, and the feature generator aims to confuse the domain discriminator. CDAN long2018conditional trains deep networks conditioned on several sources of information. DIRT-T shu2018dirt performs domain adversarial training with an added term that penalizes violations of the cluster assumption. Both ADR saito2017adversarial and MCD saito2018maximum optimize two classifiers to be consistent across domains.
Domain adaptation has also been tackled from other perspectives. RCA cicek2019unsupervised proposes a multi-classification discriminator. DAMN bermudez2020domain introduces a framework where each domain undergoes a different sequence of operations. AFN Xu_2019_ICCV shows that the erratic discrimination of target features stems from much smaller norms than those found in source features.
All these methods, however, require access to source data during adaptation. Recently, USFDA kundu2020universal explored the source-free setting, but it focuses only on the Universal Domain Adaptation task you2019universal, and the proposed method is complex, involving the generation of simulated labeled negative samples. The most relevant works for closed-set UDA are SHOT liang2020we and SFDA kim2020domain, but SHOT needs to generate extra pseudo labels, while SFDA does not achieve satisfactory performance. Unlike them, our proposed method simply introduces an additional bait classifier to achieve feature alignment, thus alleviating the performance deterioration during adaptation under the SFUDA setting.
We denote the labeled source domain data with samples as , where the is the corresponding label of , and the unlabeled target domain data with samples as . Unlike the normal setting, SFUDA uses only the model trained on the source data and the unlabeled target domain data during adaptation, with no further access to the source data. This paper aims to provide an effective and general training framework for SFUDA. In the following sections, we first introduce a baseline setting for SFUDA and then present our proposed method.
Usually UDA methods eliminate the domain shift by aligning the feature distribution between the source and target domains. This is especially challenging under the SFUDA setting, where we never have simultaneous access to source and target data.
We decompose the neural network into two parts: a feature extractor and a classifier head . A successful alignment of the features means that the features produced by the feature extractor from both domains lie on the same and correct side of the decision boundary (determined by the classifier head ), i.e., they are classified correctly by the classifier head. Therefore, we propose to freeze the source-trained classifier . This implicitly allows us to store the relevant information from the source domain without actually accessing the source data.
We first train the baseline model on the labeled source data with the standard cross-entropy loss, then fix the classifier head . Next, we train the model only on the unlabeled target domain data . The baseline model could be any compatible existing UDA method. According to the cluster assumption chapelle2005semi, minimizing the entropy of the predictions pushes the target features far away from the decision boundary. In our paper, we choose the pipeline with the BNM loss cui2020towards as the baseline, where minimizing the BNM loss amounts to decreasing the prediction entropy and increasing the prediction diversity.
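To make the two candidate adaptation losses concrete, the following is a minimal NumPy sketch rather than the authors' implementation. It assumes the BNM loss is the negative nuclear norm of the batch prediction matrix divided by the batch size, as we understand it from cui2020towards; the function names and the toy prediction matrices are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def entropy_loss(probs, eps=1e-8):
    """Mean Shannon entropy of the rows of `probs` (batch x classes)."""
    return float(-(probs * np.log(probs + eps)).sum(axis=1).mean())

def bnm_loss(probs):
    """Assumed BNM form: negative nuclear norm of the batch, scaled by batch size."""
    return float(-np.linalg.norm(probs, ord="nuc") / probs.shape[0])

# Toy batches (hypothetical): confident and diverse, confident but collapsed, uncertain.
confident = softmax(np.array([[8.0, 0.0, 0.0], [0.0, 8.0, 0.0], [0.0, 0.0, 8.0]]))
collapsed = np.tile(np.array([1.0, 0.0, 0.0]), (3, 1))
uncertain = softmax(np.zeros((3, 3)))

for name, p in [("confident", confident), ("collapsed", collapsed), ("uncertain", uncertain)]:
    print(name, round(entropy_loss(p), 4), round(bnm_loss(p), 4))
```

In this toy example the collapsed batch has near-zero entropy but a higher BNM loss than the confident and diverse batch, which illustrates why the BNM loss is said to reward prediction diversity in addition to low entropy.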
Source-free Domain Adaptation with BAIT
The baseline method, which minimizes the prediction entropy, can partly ensure that the target features match the source classifier, but there may still exist some difficult target samples which are misclassified by the source classifier. Due to the domain shift, the cluster of target features generated by the source-trained feature extractor may cross the decision boundary. If we naively minimize the entropy loss on the feature extractor (as we do in the baseline model), some features may head to the right side of the decision boundary, but those target features which are located on the wrong side will move in the wrong direction. Since entropy minimization ensures that the decision boundary does not pass through data-dense regions, this will result in misclassification if no other restrictions are imposed.
To tackle this challenge, we propose a general framework named BAIT, which introduces an extra classifier/decision boundary (hereafter we denote the fixed anchor classifier as $C_1$, the extra classifier as $C_2$, and the feature extractor as $f$), as shown at the right of Fig. 1. Both classifiers $C_1$ and $C_2$ will correctly predict some target features but misclassify some others. Optimizing the feature extractor so that $C_1$ and $C_2$ reach consensus will therefore push the target features towards the inner side of both decision boundaries. Hereafter we call features that are close to the decision boundaries, and thus likely to cross them during adaptation, wandering features. Typical wandering features are those which lie near features that all belong to the same class, but on the wrong side of the decision boundary.
After training the model on the source data, we get a feature extractor $f$ and an anchor classifier head $C_1$. We fix $C_1$ in the subsequent training periods. $C_2$ is initialized from $C_1$ before the adaptation. In order to train the desired , we propose a 3-step training policy which alternately trains the bait classifier $C_2$ and the feature extractor $f$. Note that after training on the source domain, we never access the source domain data during adaptation. In order to better illustrate our method, hereafter we treat the weights of the classifier head as class prototypes (the classifier head contains only a single fully connected layer, with l2 normalization on its weights). The proposed algorithm has three steps:
Step 1: Wandering features mining. During adaptation, we first try to find the potentially wandering features. Specifically, we split the current batch into two sets $W$ and $\bar{W}$ (we refer to them as wandering and non-wandering features, as shown in Fig. 2 (a)) according to their prediction entropy:
where $p_1(x)$ is the prediction of the source classifier $C_1$ ( represents the softmax operation). The threshold is estimated as a percentile of the prediction entropy within the batch, set to 50% (i.e. the median).
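The splitting rule of Step 1 can be sketched as follows. This is an illustrative NumPy sketch, not the authors' code: the function names are hypothetical, and the use of a strict inequality (entropy above the threshold counts as wandering) is an assumption, since the exact rule is not recoverable from the text above.

```python
import numpy as np

def prediction_entropy(probs, eps=1e-8):
    """Shannon entropy of each row of `probs` (batch x classes)."""
    return -(probs * np.log(probs + eps)).sum(axis=1)

def split_wandering(p1, percentile=50.0):
    """Split a batch by the entropy of the anchor predictions `p1`.

    Returns boolean masks (wandering, non_wandering). Samples whose entropy
    is strictly above the batch percentile are treated as wandering
    (the strictness is an assumption of this sketch).
    """
    h = prediction_entropy(p1)
    tau = np.percentile(h, percentile)  # 50 -> median of the batch
    wandering = h > tau
    return wandering, ~wandering

# Hypothetical anchor predictions for a batch of four target samples.
p1 = np.array([[0.98, 0.01, 0.01],
               [0.40, 0.35, 0.25],
               [0.05, 0.90, 0.05],
               [0.34, 0.33, 0.33]])
w_mask, nw_mask = split_wandering(p1)
print(w_mask)
```

With the median as threshold, the two near-uniform predictions are marked as wandering and the two confident ones as non-wandering.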
Step 2: Casting the bait. Here we train only the bait classifier $C_2$ and freeze the feature extractor $f$. $C_2$ is initialized from the anchor classifier $C_1$ before the adaptation. As illustrated in Fig. 2, the domain shift results in some wandering features, which may be on the wrong side. The purpose of this stage is to use the bait classifier/prototype to find the wandering features. We achieve this by maximizing the discrepancy with respect to $C_1$ on $W$ and minimizing it on $\bar{W}$ (see Fig. 2b):
$$L_{cast}(C_2)=\sum_{x\in\bar{W}} D_{SKL}(p_1(x),p_2(x)) - \sum_{x\in W} D_{SKL}(p_1(x),p_2(x))$$
where $D_{SKL}$ is the symmetric KL divergence. As shown in Fig. 2 (b), given that $C_2$ is initialized from $C_1$, increasing the disagreement between the two classifiers will drive the prototype of $C_2$ to the wandering features. Intuitively, (the prototypes of) $C_2$ initialized from $C_1$ can be regarded as a bait to approach the wandering features. Minimizing $L_{cast}$ aims to increase the disagreement between $C_1$ and $C_2$ on the wandering features, or in other words to make the decision boundaries of $C_2$ stay away from those of $C_1$. Note that maximizing the discrepancy does not necessarily mean that the two classifiers have different predictions for all data.
Our motivation is to push the wandering features towards the right side of the decision boundary. However, if we directly maximize the discrepancy for all target data, i.e., remove the batch splitting of step 1, there may exist a trivial solution: the prototype of $C_2$ moves to a position far away from both the target features and the prototype of $C_1$. After the adaptation, some features which were previously classified correctly may then end up on the wrong side of the decision boundary, leading to misclassification. This is why we need step 1.
With step 1, we use only the half of the features with higher prediction entropy to create disagreement between $C_1$ and $C_2$, and keep the agreement for the remaining target data. The philosophy behind Step 1 is that the target features with lower prediction entropy are more likely to lie inside the decision boundary, i.e., to be classified correctly, while the ones with higher prediction entropy are very likely near the boundary and are thus wandering features. Training in Step 2 with this batch splitting prevents the bait classifier from going too far.
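The casting objective of Step 2 can be sketched in a few lines of NumPy. This is a hedged illustration only: the symmetric KL is assumed here to be the average of the two directed KL divergences (the exact definition is not given above), and the function names and toy predictions are hypothetical.

```python
import numpy as np

def kl(p, q, eps=1e-8):
    """Row-wise KL(p || q) for batches of categorical distributions."""
    return (p * (np.log(p + eps) - np.log(q + eps))).sum(axis=1)

def sym_kl(p, q):
    """Symmetric KL; the 1/2 averaging factor is an assumption of this sketch."""
    return 0.5 * (kl(p, q) + kl(q, p))

def cast_loss(p1, p2, wandering):
    """Discrepancy on non-wandering samples minus discrepancy on wandering ones."""
    d = sym_kl(p1, p2)
    return float(d[~wandering].sum() - d[wandering].sum())

# Anchor predictions p1 and a mask marking the second sample as wandering.
p1 = np.array([[0.9, 0.1], [0.6, 0.4]])
wandering = np.array([False, True])

same = cast_loss(p1, p1, wandering)                                       # bait equals anchor
on_wandering = cast_loss(p1, np.array([[0.9, 0.1], [0.2, 0.8]]), wandering)
on_confident = cast_loss(p1, np.array([[0.5, 0.5], [0.6, 0.4]]), wandering)
print(same, on_wandering, on_confident)
```

The loss is zero when the bait equals the anchor, decreases when the bait disagrees only on wandering samples, and increases when it disagrees on non-wandering samples, so minimizing it with respect to the bait classifier encourages disagreement only where the anchor is uncertain.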
Step 3: Features bite the bait. In this stage, only the feature extractor $f$ is trained, by minimizing the disagreement between the two classifiers, with the aim of making the two classifiers reach consensus.
Specifically, we update the feature extractor by minimizing the discrepancy between the predictions of both classifiers and also minimizing the adaptation loss:
$$L_{bite}(f)=\sum_{x\in\bar{T}} D_{SKL}(p_1(x),p_2(x))+\alpha\left(L_{ada}(f,C_1,T)+L_{ada}(f,C_2,T)\right)$$
where $L_{ada}$ is the adaptation loss, which can simply be entropy minimization. Instead, we adopt batch nuclear-norm maximization (BNM) cui2020towards as the adaptation loss; minimizing the BNM loss means decreasing the prediction entropy and increasing the prediction diversity.
$\alpha$ is the hyperparameter that balances the different objectives.
As shown in Fig. 2 (c), minimizing the disagreement pushes the features towards the prototypes of the same class from the two classifiers. Metaphorically, in this stage the wandering features, and also the other features, bite the bait, meaning that they are pulled into the region of consensus of $C_1$ and $C_2$.
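As an illustration of the biting objective, the sketch below evaluates it on fixed prediction matrices. It is not the authors' implementation: the averaged form of the symmetric KL and the BNM form (negative nuclear norm divided by batch size) are assumptions, the names are hypothetical, and in actual training the gradient would flow only into the feature extractor.

```python
import numpy as np

def kl(p, q, eps=1e-8):
    """Row-wise KL(p || q)."""
    return (p * (np.log(p + eps) - np.log(q + eps))).sum(axis=1)

def sym_kl(p, q):
    """Symmetric KL, assumed here to average the two directions."""
    return 0.5 * (kl(p, q) + kl(q, p))

def bnm_loss(probs):
    """Assumed BNM form: negative nuclear norm of the batch over batch size."""
    return -np.linalg.norm(probs, ord="nuc") / probs.shape[0]

def bite_loss(p1, p2, alpha=1.0, ada=bnm_loss):
    """Consensus term over the batch plus alpha times both adaptation losses."""
    consensus = sym_kl(p1, p2).sum()
    return float(consensus + alpha * (ada(p1) + ada(p2)))

# Hypothetical predictions of the anchor (p1) and bait (p2) classifiers.
sharp = np.array([[0.98, 0.01, 0.01], [0.01, 0.98, 0.01], [0.01, 0.01, 0.98]])
flat = np.full((3, 3), 1.0 / 3.0)
agree = bite_loss(sharp, sharp)
disagree = bite_loss(sharp, flat)
print(agree, disagree)
```

The objective is lowest when both classifiers make the same confident and diverse predictions, which is the consensus region the features are pulled into.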
BN statistics. As shown in several recent works li2018adaptive; chang2019domain; wang2019transferable, differences in the means and variances inside batch norm layers ioffe2015batch exist between domains. Since our method fixes the source-trained classifier and has no access to source data during adaptation, we propose two BN-layer-specific policies:
(1) Fixing the scale and shift factors during adaptation. Since we adapt the source-trained model to the target domain with the source-trained classifier fixed, it is reasonable to keep the same shift and scale factors as in the source domain.
(2) Re-initializing the mean and variance statistics before adaptation. BN layers adopt a moving-average estimate of the statistics: , where is computed from the current mini-batch. If not re-initialized, the BN layers will still carry source statistics, which may hinder training, since we no longer use source data and do not care about the performance on the source domain.
In the experiments, we show how these two policies affect the performance.
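The effect of policy (2) can be illustrated with a small NumPy sketch of BatchNorm-style running statistics. This is a simplified illustration under stated assumptions: the exponential moving-average form with momentum 0.1 and the re-initialization to zero mean and unit variance follow common deep learning framework conventions and are not taken from the text; the class and variable names are hypothetical.

```python
import numpy as np

class RunningStats:
    """BatchNorm-style running mean/variance for a feature vector (sketch)."""

    def __init__(self, num_features, momentum=0.1):
        self.num_features = num_features
        self.momentum = momentum  # assumed convention: new = (1 - m) * old + m * batch
        self.reset()

    def reset(self):
        """Re-initialize the statistics (zero mean, unit variance: an assumption)."""
        self.mean = np.zeros(self.num_features)
        self.var = np.ones(self.num_features)

    def update(self, batch):
        """Blend the current mini-batch statistics into the running estimates."""
        m = self.momentum
        self.mean = (1 - m) * self.mean + m * batch.mean(axis=0)
        self.var = (1 - m) * self.var + m * batch.var(axis=0)

source_batch = np.full((8, 2), 5.0)   # toy source features centered at 5
target_batch = np.zeros((8, 2))       # toy target features centered at 0

kept, reinit = RunningStats(2), RunningStats(2)
for _ in range(50):                   # "source training"
    kept.update(source_batch)
    reinit.update(source_batch)

reinit.reset()                        # policy (2): re-initialize before adaptation
for _ in range(5):                    # a few adaptation iterations on the target
    kept.update(target_batch)
    reinit.update(target_batch)

print(kept.mean, reinit.mean)
```

After a few target iterations, the estimate that was not re-initialized is still dominated by the source mean, while the re-initialized one already reflects the target data.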
Our work has similarities with MCD saito2018maximum and ADR saito2017adversarial, which also use the discrepancy between two classifiers to achieve feature alignment. However, simultaneous access to both the source and target data at all times is a crucial requirement of these methods (they report a large drop in performance if the source data is not used). In contrast, with the anchor classifier (trained only on source data) fixed, we deploy a bait classifier to make the target features match the anchor classifier, since our aim is that all target features are correctly classified by the anchor classifier. We thereby avoid the need for source domain data during adaptation.
The whole adaptation process is illustrated in Algorithm 1. Note that the 3-step training happens in every mini-batch iteration during adaptation. We also link BAIT to domain adaptation theory ben2010theory in the supplementary material.
Experiment on Twinning moon dataset
We carry out an experiment on the twinning moon dataset. For this dataset, the source domain data consist of two inter-twinning moons, which contain 300 samples each. We generate the data in the target domain by rotating the source data by , where the rotation degree can be regarded as the domain shift. First we train the model only on the source domain, and test the model on all domains. As shown in Fig. 3 (a) and (b), due to the domain shift the model performs worse on the target data. Then we adapt the model to the target domain with the anchor and bait classifiers, without access to any source data. As shown in Fig. 3 (c), during adaptation the disagreement between the two classifiers lets them cover different regions. The data which have different predictions before and after adaptation are the expected wandering data. After adaptation the two decision boundaries almost coincide, as shown in Fig. 3 (d). (Note that here the decision boundary is that of the whole model, since the inputs are data instead of features.)
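A toy dataset of this kind can be generated as in the sketch below. It is only an illustration: the moon parametrization, the noise level, the random seed, and in particular the 30-degree rotation are placeholder assumptions (the rotation angle is not recoverable from the text above), and the function names are hypothetical.

```python
import numpy as np

def make_twin_moons(n_per_moon=300, noise=0.05, seed=0):
    """Two interleaving half-circles with `n_per_moon` samples each."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, np.pi, n_per_moon)
    upper = np.stack([np.cos(t), np.sin(t)], axis=1)              # class 0
    lower = np.stack([1.0 - np.cos(t), 0.5 - np.sin(t)], axis=1)  # class 1
    x = np.concatenate([upper, lower])
    x = x + rng.normal(scale=noise, size=x.shape)
    y = np.concatenate([np.zeros(n_per_moon, dtype=int),
                        np.ones(n_per_moon, dtype=int)])
    return x, y

def rotate(x, degrees):
    """Rotate 2-D points counter-clockwise around the origin."""
    theta = np.deg2rad(degrees)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return x @ rot.T

x_src, y_src = make_twin_moons()
x_tgt = rotate(x_src, 30.0)  # placeholder angle acting as the domain shift
print(x_src.shape, x_tgt.shape)
```

The target labels are the same as the source labels but would be withheld during adaptation and used only for evaluation.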
Experiments on recognition benchmarks
We use three benchmark datasets. Office-31 saenko2010adapting contains 3 domains (Amazon, Webcam, DSLR) with 31 classes and 4,652 images. Office-Home venkateswara2017deep contains 4 domains (Real, Clipart, Art, Product) with 65 classes and a total of 15,500 images. VisDA peng2017visda is a more challenging dataset with a 12-class synthesis-to-real object recognition task; its source domain contains 152k synthetic images, while the target domain has 55k real object images.
Our method is a general framework that can be applied to compatible UDA methods under the SFUDA setting. In the experiments, we report the results of BAIT based on two different adaptation losses. We denote the baseline with the BNM loss cui2020towards as BNM, and the one with the entropy minimization loss as ENT. Accordingly, we refer to our method as BNM+BAIT and ENT+BAIT. We compare the proposed method under the source-free setting with state-of-the-art methods under the normal setting, where source data are available.
|MDD+Implicit Alignment jiang2020implicit||92.1||90.3||98.7||99.8||75.3||74.9||88.8|
For the model details, we use the same network architecture as recent work liang2020we: we adopt a ResNet-50 he2016deep backbone (for the Office datasets) or a ResNet-101 backbone (for VisDA) along with an extra fully connected (fc) layer as the feature extractor, and an fc layer as the classifier head. Here we specify the details of training with BNM+BAIT.
$\alpha$ is set to 1 in all experiments. We adopt SGD with momentum 0.9 and a batch size of 64. For source domain training, the learning rate is set to 0.001 for the feature extractor and 0.01 for the classifier; for training on the target, all learning rates are set to 0.0001. We train for 20 epochs on the source and for 50 epochs on the target, since the learning rate is quite small. As for the threshold, we also tried treating 25% and 75% of the batch as wandering data, but both lead to slightly lower results than 50%; we show results on the twinning moon dataset with different thresholds in the supplementary material.
We use label smoothing when training the model on the source data, which avoids over-confident predictions muller2019does. At each iteration we run step 3 twice for the Office datasets but only once for VisDA. All experiments are conducted on a single RTX 2080 Ti. All reported results are from the anchor classifier $C_1$.
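For reference, a label-smoothed cross-entropy loss for source training can be sketched as follows. This is a generic NumPy illustration, not the authors' code: the smoothing factor 0.1 and the variant that spreads the smoothing mass uniformly over all classes are assumptions, since the text does not state them, and the function names are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    z = logits - logits.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def smoothed_cross_entropy(logits, labels, smoothing=0.1):
    """Cross-entropy against targets mixed with a uniform distribution.

    `smoothing` = 0 recovers the standard cross-entropy loss.
    """
    n, k = logits.shape
    targets = np.full((n, k), smoothing / k)
    targets[np.arange(n), labels] += 1.0 - smoothing
    return float(-(targets * log_softmax(logits)).sum(axis=1).mean())

# Very confident, correct logits (hypothetical values).
logits = np.array([[10.0, 0.0, 0.0], [0.0, 10.0, 0.0]])
labels = np.array([0, 1])
plain = smoothed_cross_entropy(logits, labels, smoothing=0.0)
smoothed = smoothed_cross_entropy(logits, labels, smoothing=0.1)
print(plain, smoothed)
```

On these very confident logits the smoothed loss is clearly larger than the plain cross-entropy, which is the mechanism that discourages over-confident source predictions.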
Quantitative Results. The results on the Office-31, Office-Home and VisDA datasets are shown in Tab. 1-3. In these tables, the top part shows results for the normal setting with access to source data during adaptation, and the bottom part shows results for the source-free setting. On the two Office datasets, the performance of BNM drops significantly when it loses access to source data. However, BNM+BAIT obtains a significant improvement. Surprisingly, it even performs better than BNM with access to the source data on both datasets. The same happens when we combine BAIT with the ENT method. Furthermore, our BNM+BAIT outperforms all normal UDA methods on Office-Home and VisDA; it is lower only than MCC (with access to source data) on Office-31. In addition, our method surpasses other recent SFUDA methods except SHOT on all three datasets. BNM+BAIT is on par with SHOT on Office-31 and Office-Home, and outperforms SHOT on the large VisDA dataset by 2.2%. It is fair to compare BNM+BAIT with SHOT, since SHOT adopts the IM loss, which plays the same role as the BNM loss. But unlike SHOT, our method does not require pseudo label generation.
|Method (Synthesis → Real)||Source-free||plane||bcycl||bus||car||horse||knife||mcycl||person||plant||sktbrd||train||truck||Per-class|
|Implicit Alignment jiang2020implicit||-||-||-||-||-||-||-||-||-||-||-||-||75.8|
|BNM (not fixing)||52.2||69.0||75.9||54.7||67.4||68.0||53.3||46.9||74.4||64.9||50.2||77.1||62.8|
Ablation Study. We explore a wide variety of configurations for our method by applying the fixed classifier, label smoothing (LS) and the two BN policies (BN). As reported in Tab. 4 on the Office-Home dataset, BNM with the learned classifier (the first row), denoted as BNM (not fixing), obtains the worst result (62.8%) and gains a 1.9% improvement with the classifier frozen (the second row), clearly indicating that fixing the classifier plays an important role in source-free domain adaptation. Both LS and the two BN policies, i.e., fixing the scale/shift factors and re-initializing the BN statistics, further improve performance. Mining the potential wandering features also gives an improvement. Interestingly, re-initializing the BN statistics alone achieves almost the same performance as both fixing the factors and re-initializing the statistics; this may imply that the learnable scale/shift factors do not change during training when the BN statistics are re-initialized before adaptation. In conclusion, with our proposed BAIT the two pipelines gain significant improvement under the source-free setting.
Accuracy Curve. Fig. 4 shows the accuracy curves of both BNM and BNM+BAIT during adaptation to the target domain on Office-Home. The starting point is the accuracy after training on the source data. Classifier $C_2$ aims to find the wandering features and push them towards the right side of the boundary of $C_1$, leading to faster convergence. In addition, the decision boundaries of $C_2$ should not be too far from those of $C_1$. This is verified in Fig. 4, which shows that the accuracy ascends quickly and that the two classifiers have similar performance.
Visualization. We provide the t-SNE visualization of the target features. As shown in Fig. 5, the target features obtained with BAIT form more compact clusters.
In this paper, we investigate Source-free Unsupervised Domain Adaptation (SFUDA), where the source data are not available during adaptation. With the source-trained classifier fixed during adaptation, we propose a novel general framework named BAIT, which uses a bait classifier to draw the unlabeled target data towards the right side of the decision boundary of the source classifier. The experimental results show that it improves performance significantly under the SFUDA setting, surpassing existing normal UDA methods, which require source data at all times.