code released for our ICML 2020 paper "Do We Really Need to Access the Source Data? Source Hypothesis Transfer for Unsupervised Domain Adaptation"
Unsupervised domain adaptation (UDA) aims to transfer knowledge from a related but different well-labeled source domain to a new unlabeled target domain. Most existing UDA methods require access to the source data, and thus are not applicable when the data are confidential and not shareable due to privacy concerns. This paper aims to tackle a realistic setting in which only a classification model trained over the source data is available, rather than the source data itself. To effectively utilize the source model for adaptation, we propose a novel approach called Source HypOthesis Transfer (SHOT), which learns the feature extraction module for the target domain by fitting the target data features to the frozen source classification module (representing the classification hypothesis). Specifically, SHOT exploits both information maximization and self-supervised learning for the feature extraction module learning to ensure the target features are implicitly aligned with the features of unseen source data via the same hypothesis. Furthermore, we propose a new labeling transfer strategy, which separates the target data into two splits based on the confidence of predictions (labeling information), and then employs semi-supervised learning to improve the accuracy of less-confident predictions in the target domain. We denote labeling transfer as SHOT++ if the predictions are obtained by SHOT. Extensive experiments on both digit classification and object recognition tasks show that SHOT and SHOT++ achieve results surpassing or comparable to the state of the art, demonstrating the effectiveness of our approaches for various visual domain adaptation problems.
Deep neural networks have achieved remarkable success in a variety of applications across different fields but at the expense of laborious large-scale training data annotation. To avoid expensive data labeling, transfer learning [68, 19, 90] is developed to extract the knowledge from one or more source tasks, which is then applied to a target task. As a typical example, unsupervised domain adaptation (UDA) tackles the problem setting where the learning task in the source domain is sufficiently similar to or the same as that in the target domain, but labeled data are only available in the source domain during training. Recently, UDA methods have been widely applied to boost the performance of many tasks like object recognition [58, 92, 83, 19], semantic segmentation [112, 37, 114, 90], sentiment classification [29, 71], object detection [14, 52], and person re-identification [22, 96].
Existing UDA methods mainly follow two paradigms to mitigate the gap between source and target domains. The first paradigm matches the statistical moments of different feature distributions at different orders to minimize the distributional divergence between domains [88, 106, 72]. For example, the widely used Maximum Mean Discrepancy (MMD) measures the distance between weighted sums of all moments from the source and target domains. The second paradigm applies adversarial learning with an additional domain classifier to minimize the Proxy $\mathcal{A}$-distance between the domains. All these methods require access to the source data during learning to adapt the model to the target domain.
To address this UDA setting, we propose a novel approach called Source HypOthesis Transfer (SHOT). SHOT follows common deep UDA methods [26, 58] to utilize an identical network architecture for different domains, consisting of a feature encoding module and a classification module (hypothesis). Like [92, 11], SHOT aims to learn a target-specific feature encoding module to generate target data representations that are well aligned with source data representations, but without accessing the source data or the target data labels. Intuitively, if the learned target data representations are aligned with the source ones, their classification results from the fixed source classifier (hypothesis) would be highly confident for a certain class, i.e., the classification outputs would be close to one-hot vectors. We are then motivated to make SHOT adapt the feature encoding module by fine-tuning the source feature encoding module while freezing the source hypothesis, to maximize the mutual information between intermediate feature representations and outputs of the classifier, since information maximization [85, 38] can encourage the classifier to assign disparate one-hot outputs to different target feature representations.
Though target feature representations are encouraged to fit the source hypothesis via information maximization, some semantically wrong matching between target feature representations and the source hypothesis may still occur, leading to wrong labels assigned to the target data. To alleviate this, we propose to fully exploit the knowledge in the unlabeled target domain by developing two new self-supervised learning schemes. First, considering that pseudo labels generated by the source classifier for the target data may be noisy, we propose to attain per-class prototype representations for the target domain itself and apply the nearest prototype classifier to obtain more accurate pseudo labels as direct supervision. Second, inspired by RotNet, which predicts the absolute rotation of a rotated image, we come up with a relative rotation prediction task to capture the image-specific self-supervision more precisely, i.e., requiring the model to estimate the relative rotation between one original image and its rotated version. The two self-supervisions are used to help discard irrelevant semantic information by exploiting the data distribution of the target domain, thus helping learn feature representations that better fit the source hypothesis. In this way, we obtain a target-specific feature encoding module with the source hypothesis as the shared classifier module across domains.
Since some low-confidence predictions generated with the proposed hypothesis transfer strategy are possibly inaccurate, we further put forward a labeling transfer strategy as a subsequent step, forming a complete two-stage framework called SHOT++ for UDA problems. In particular, we sort the confidence of the adapted predictions after SHOT and discover an adaptive threshold to automatically divide the whole target data into two splits, i.e., an ‘easy’ split with high confidence and a ‘hard’ split with low confidence. Empirically, the predictions of samples in the ‘easy’ split are reliable. Thus, we employ a popular semi-supervised learning algorithm, MixMatch, to enable the reliable labeling information from the ‘easy’ split to flow to the ‘hard’ split in the target domain itself. It is worth noting that such a labeling transfer strategy can also be applied to the original source model, or even a black-box predictor without knowing the network architecture.
Experimental results on multiple benchmark datasets clearly demonstrate that the proposed SHOT and SHOT++ match or outperform the state of the art for three different UDA cases, i.e., closed-set, partial-set, and multi-source problems. The superior results over prior art in a semi-supervised domain adaptation (SSDA) scenario further verify the versatility of the proposed methods. The main contributions of this work are summarized as follows.
We propose a novel framework, Source HypOthesis Transfer (SHOT), for unsupervised domain adaptation with only the source model provided, which is appealing for privacy protection without access to the source data.
SHOT exploits information maximization to learn a target-specific feature encoding module, which provides an implicit perspective on feature alignment.
SHOT further exploits the knowledge in the unlabeled target domain by developing two new kinds of self-supervisions as auxiliary tasks, which further improves the adaptation performance.
We further propose a new labeling transfer strategy by exploiting the confidence of predictions and enforcing the labeling information to flow from ‘easy’ samples to ‘hard’ samples, even allowing adaptation with a black-box source model.
Experiments on several benchmarks demonstrate that our methods yield results comparable to or outperforming the state of the art for three unsupervised domain adaptation scenarios and even semi-supervised domain adaptation.
This paper extends our earlier work in the following aspects. Within the hypothesis transfer framework developed there, we additionally propose one more self-supervision objective to predict the relative rotation, which facilitates the representation learning in the target domain. We also propose a new strategy named labeling transfer that only requires the labeling predictions in the target domain. Different from the earlier work, it even allows adaptation with a black-box source model. Besides, it can be incorporated with the hypothesis transfer framework, yielding better adaptation results. We also expand the experimental evaluation by adding one more dataset for each UDA scenario (e.g., PACS for multi-source UDA) and extending our methods further to semi-supervised domain adaptation. Finally, we provide a more detailed model analysis to evaluate the proposed approaches, including training stability, parameter sensitivity, and a qualitative study.
As a typical example of transfer learning, unsupervised domain adaptation (UDA) aims to exploit the knowledge in a different but related labeled dataset to help learn a discriminative model for the unlabeled dataset. Early UDA methods [105, 87] assume covariate shift with identical conditional distributions across domains and approximate the target empirical risk by estimating the weight of each source instance and re-weighting the source empirical risk. Later, most UDA methods resort to domain-invariant feature transformation [67, 60, 53] or feature space alignment [32, 25, 88] to pursue distribution alignment. However, the transferability of these shallow methods is restricted by task-specific structures.
Recently, deep neural networks have been well explored to learn transferable representations for domain adaptation in various visual applications like object recognition [32, 19, 59] and semantic segmentation [37, 114, 111, 90]. Based on the relationship of label spaces between source and target domains, UDA scenarios can be categorized into four cases, i.e., closed-set, partial-set, open-set, and universal. Among them, closed-set UDA has received the most research attention, where the source and target label spaces are assumed to be identical. Existing deep closed-set UDA methods can be roughly divided into three categories: discrepancy-based, reconstruction-based, and adversarial-based. Discrepancy-based approaches minimize a divergence criterion that measures the distance between the source and target data distributions, and popular choices include maximum mean discrepancy (MMD), high-order central moment discrepancy, contrastive domain discrepancy, and the Wasserstein metric. Reconstruction-based approaches utilize reconstruction as an auxiliary task to pursue shared representations for both domains. In addition, some other reconstruction-based methods [3, 65] further seek domain-specific reconstruction and cycle consistency to improve the adaptation performance. Inspired by generative adversarial nets, adversarial-based approaches determine the distance between different data distributions based on binary classification performance, which in effect corresponds to the Proxy $\mathcal{A}$-distance or $\mathcal{H}$-divergence in the seminal theoretical framework. Unlike marginal distribution alignment with a single binary domain classifier, subsequent methods encourage joint distribution alignment by considering multiple class-wise domain classifiers or a semantic multi-output classifier [17, 43] instead of a feature-conditional domain discriminator. There are also some other studies investigating batch normalization [7, 95] and adversarial dropout [81, 48] within the network architecture to ensure feature invariance. Despite their efficacy, all these methods assume the target user’s access to the source domain, which is impractical since the source data may be private and confidential.
The concept of hypothesis transfer learning (HTL) was first presented by Kuzborskij and Orabona, together with a formal theory. Before it, a number of transfer learning works [101, 63, 91] assumed no explicit access to the source data and were empirically successful. Generally, HTL is an attractive and efficient framework that assumes access to a given number of source hypotheses and a small set of training samples from the target domain. However, like the well-known fine-tuning strategy, HTL always requires at least a small set of labeled data in the target domain, limiting its applicability to the semi-supervised DA scenario. Inspired by HTL, several recent works [16, 54] assume the absence of the source data and utilize the encoded information as source supervision for the UDA problem. In particular, besides target features, one of them requires predictions of target data, and the other requires the per-class mean and variance calculated on source features. Both methods adopt a shallow framework like HTL and are thus restricted to the original feature structure. By contrast, our work fully exploits the end-to-end feature learning module, allowing more flexibility during adaptation. There are also two concurrent deep UDA methods [50, 73] that attempt not to access the source data during the adaptation process. Our approach differs from one of them in that we do not need any additional components like a data generator or classifier within the training algorithm; the other introduces the first federated DA setting, where knowledge is transferred from the decentralized nodes to a new node without any supervision itself, and proposes an adversarial-based solution to protect user privacy, but it may fail to tackle the vanilla UDA setting with only one source domain available.
Self-supervised learning offers an effective way to utilize unlabeled data by generating and predicting labels from these data. The self-supervised task is also known as a pretext task. A typical workflow (https://cutt.ly/DfN3rFU) is to train a model on one or multiple pretext tasks with unlabeled images and then fine-tune the trained model on a variety of practical downstream tasks. In addition, pretext tasks can also be jointly trained with supervised learning tasks on labeled data with shared weights like in [8, 107]. Pretext tasks include [109], relative position prediction, rotation prediction, and solving jigsaw puzzles; on the other hand, contrastive losses [35, 12] and clustering losses [9, 10] focus on the similarity of sample pairs in the representation space, which often provide better performance. Some recent studies [98, 89, 80] explore self-supervision for UDA problems and find it beneficial for accomplishing domain alignment. By contrast, this paper designs two different kinds of self-supervisions for UDA problems.
When the domain shift does not exist, the UDA problem naturally becomes a well-studied semi-supervised learning problem. Many ideas originally proposed for semi-supervised learning can thus also be employed to achieve or complement domain alignment within UDA methods. Pseudo-labeling is a simple heuristic widely used in practice, which produces ‘pseudo-labels’ for unlabeled data using the prediction function itself during the course of training. Among UDA methods, one line of work directly incorporates pseudo-labeling as a regularization term, and another leverages pseudo labels in the adaptation module to achieve multi-modal distribution alignment. Entropy minimization is a popular strategy that encourages the network to make ‘confident’ (low-entropy) predictions for all unlabeled data, which has been exploited in many previous UDA methods [57, 100]. Other favored semi-supervised techniques like tri-training and virtual adversarial training have been used in frameworks [82, 86], respectively. Recently, MixMatch has been directly employed for domain adaptation and obtained promising results in the VisDA-2019 challenge. Different from prior works that treat the whole target domain as an unlabeled dataset, we focus on intra-domain semi-supervised learning where the labeled dataset consists of confident target data samples and the unlabeled dataset consists of the remaining samples.
We aim to address the UDA problem with only a pre-trained source model, without requiring access to the source data. In particular, we consider the $K$-way visual classification task. For a vanilla UDA task, we are given $n_s$ labeled samples $\{x_s^i, y_s^i\}_{i=1}^{n_s}$ from the source domain $\mathcal{D}_s$ where $x_s^i \in \mathcal{X}_s$, $y_s^i \in \mathcal{Y}_s$, and also $n_t$ unlabeled samples $\{x_t^i\}_{i=1}^{n_t}$ from the target domain $\mathcal{D}_t$ where $x_t^i \in \mathcal{X}_t$. The goal of UDA is to predict the labels $\{y_t^i\}_{i=1}^{n_t}$ in the target domain, where $y_t^i \in \mathcal{Y}_t$, and the source task $\mathcal{X}_s \to \mathcal{Y}_s$ is assumed to be the same as the target task $\mathcal{X}_t \to \mathcal{Y}_t$. In this work, we aim to learn a target function $f_t: \mathcal{X}_t \to \mathcal{Y}_t$ and infer $\{y_t^i\}_{i=1}^{n_t}$, with only $\{x_t^i\}_{i=1}^{n_t}$ and the source function $f_s: \mathcal{X}_s \to \mathcal{Y}_s$ available.
We address the above source data-absent UDA problem through the following steps. First, we train the classification model, consisting of a feature encoding module and a hypothesis module, from the source data and then transfer the source model to the target domain without accessing the source data. Then, we present a novel framework, Source HypOthesis Transfer (SHOT), to learn the target-specific feature encoding module using self-supervised learning and semi-supervised learning, with the source hypothesis fixed. Finally, using the predictions for the target domain, we further employ a semi-supervised learning algorithm to enforce labeling information propagation from confidently labeled target samples to the remaining target samples with low confidence. Applying such a labeling transfer strategy to SHOT yields SHOT++. Likewise, applying the labeling transfer strategy to ‘Source-model-only’ yields ‘Source-model-only++’, which can even deal with a black-box source model. In the following, we elaborate on each step in detail.
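We consider learning a deep source classification model $f_s: \mathcal{X}_s \to \mathcal{Y}_s$ by minimizing the following cross-entropy loss,
$$\mathcal{L}_{src}(f_s; \mathcal{X}_s, \mathcal{Y}_s) = -\mathbb{E}_{(x_s, y_s) \in \mathcal{X}_s \times \mathcal{Y}_s} \sum_{k=1}^{K} q_k \log \delta_k(f_s(x_s)), \tag{1}$$
where $\delta_k(a) = \frac{\exp(a_k)}{\sum_i \exp(a_i)}$ denotes the $k$-th element in the soft-max output of a $K$-dimensional vector $a$, and $q$ denotes a one-hot encoding of $y_s$ where $q_k$ is ‘1’ for the correct class and ‘0’ for the rest. To further lift the discriminability of the source model and facilitate the following target data alignment, we adopt the label smoothing technique for model training, as it encourages learned feature representations to form tight and evenly separated clusters, which is useful for adaptation. Therefore, the source objective function is changed to
$$\mathcal{L}^{ls}_{src}(f_s; \mathcal{X}_s, \mathcal{Y}_s) = -\mathbb{E}_{(x_s, y_s) \in \mathcal{X}_s \times \mathcal{Y}_s} \sum_{k=1}^{K} q^{ls}_k \log \delta_k(f_s(x_s)), \tag{2}$$
where $q^{ls}_k = (1 - \alpha) q_k + \alpha / K$ is the smoothed label and $\alpha$ is the smoothing parameter which is empirically set to 0.1.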
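The effect of label smoothing on the per-sample loss can be illustrated with a minimal pure-Python sketch. It assumes the standard smoothing form, in which the target places $(1 - \alpha)$ on the true class plus $\alpha / K$ on every class; the function names `softmax` and `smoothed_cross_entropy` are illustrative and are not taken from the released code.

```python
import math


def softmax(logits):
    """Numerically stable soft-max of a list of K logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def smoothed_cross_entropy(logits, label, alpha=0.1):
    """Cross-entropy of one sample against a label-smoothed target.

    Assumption: the smoothed target is (1 - alpha) * one_hot + alpha / K.
    With alpha = 0 this reduces to the plain cross-entropy loss.
    """
    k = len(logits)
    probs = softmax(logits)
    target = [
        (1.0 - alpha) * (1.0 if i == label else 0.0) + alpha / k
        for i in range(k)
    ]
    return -sum(t * math.log(p) for t, p in zip(target, probs))


if __name__ == "__main__":
    logits = [5.0, 0.0, 0.0]
    print("plain   :", smoothed_cross_entropy(logits, 0, alpha=0.0))
    print("smoothed:", smoothed_cross_entropy(logits, 0, alpha=0.1))
```

For a confidently correct prediction the smoothed loss is larger than the plain one, which is what discourages over-confident source outputs.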
As shown in Fig. 1, the source model parameterized by a deep neural network consists of two modules: the feature encoding module $g_s: \mathcal{X}_s \to \mathbb{R}^d$ and the classifier module $h_s: \mathbb{R}^d \to \mathbb{R}^K$, i.e., $f_s(x) = h_s(g_s(x))$, where $d$ is the dimension of the input feature. Most previous UDA methods align different domains by matching the data distributions in the feature space using MMD or domain adversarial alignment. However, both strategies assume the source and target domains share the same feature encoder and need to access the source data during adaptation. This is not applicable in the UDA setting tackled here. By contrast, Adversarial Discriminative Domain Adaptation (ADDA) relaxes the parameter-sharing constraint with a new adversarial framework that learns different mapping functions for the two domains. Also, Decision-boundary Iterative Refinement Training with a Teacher (DIRT-T) first trains a parameter-sharing UDA framework as initialization and then fine-tunes the whole network by minimizing the cluster assumption violation via entropy minimization and virtual adversarial training. Both methods suggest that learning a domain-specific feature encoding module for the target domain is practicable and even works better than the parameter-sharing mechanism, which has also been proven effective in Domain-Specific Batch Normalization (DSBN).
We therefore develop a new framework termed Source HypOthesis Transfer (SHOT) by learning the domain-specific feature encoding module for the target data while fixing the source classifier module (hypothesis), as the source hypothesis encodes the distribution information of the unseen source data. Namely, SHOT utilizes the same classifier module for different domain-specific feature encoding modules. It aims to learn the optimal target feature encoding module such that the output target features can fit the source feature distribution well and can be accurately classified by the source hypothesis directly. Note that SHOT uses the source data only once to generate the source hypothesis and does not need to access the source data afterwards, unlike prior methods (e.g., ADDA, DIRT-T, and DSBN).
Essentially, we expect to learn the optimal target feature encoder so that the target data distribution matches the source data distribution well. However, feature-level alignment does not work at all since it is impossible to estimate the distribution of the source features without access to the source data. We view the challenging problem from another perspective: if there is no domain gap, what kind of outputs should be generated over the unlabeled target data? We argue the ideal outputs of target features should be similar to those of source features with the classifier shared for both domains. Since we train the source feature encoding module and classifier module via a supervised learning loss, the output of each source feature is fairly similar to one of the one-hot encodings. Therefore, we expect that the output of each target feature through the shared classifier is also similar to one of the one-hot encodings. Such an output alignment requirement is a necessary condition for feature alignment.
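For this purpose, we adopt the information maximization (IM) loss [42, 85, 38] to make the classification outputs of target features individually certain and globally diverse. In practice, we minimize the following $\mathcal{L}_{ent}$ and $\mathcal{L}_{div}$ that together constitute the IM loss ($\mathcal{L}_{im}$):
$$\mathcal{L}_{ent}(f_t; \mathcal{X}_t) = -\mathbb{E}_{x_t \in \mathcal{X}_t} \sum_{k=1}^{K} \delta_k(f_t(x_t)) \log \delta_k(f_t(x_t)), \qquad \mathcal{L}_{div}(f_t; \mathcal{X}_t) = \sum_{k=1}^{K} \hat{p}_k \log \hat{p}_k = D_{KL}\big(\hat{p}, \tfrac{1}{K} \mathbf{1}_K\big) - \log K, \tag{3}$$
where $f_t(x) = h_t(g_t(x))$ is the $K$-dimensional output of each target sample, $\mathbf{1}_K$ is a $K$-dimensional vector with all ones, and $\hat{p} = \mathbb{E}_{x_t \in \mathcal{X}_t}[\delta(f_t(x_t))]$ is the mean output embedding of the whole target domain. The IM loss would work better than conditional entropy minimization widely used in prior UDA methods [94, 79], since IM can circumvent the trivial solution where all unlabeled data have the same one-hot encoding via the fair diversity-promoting objective $\mathcal{L}_{div}$. For convenience, we denote SHOT with the information maximization loss as SHOT-IM.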
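To make the two roles of the IM loss concrete, the following sketch computes both terms for a small batch in pure Python. It assumes the entropy term is the mean per-sample prediction entropy and the diversity term is the negative entropy of the mean prediction; the function name `information_maximization_loss` and the `eps` stabilizer are illustrative choices, not part of the original method description.

```python
import math


def softmax(logits):
    """Numerically stable soft-max of a list of K logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def information_maximization_loss(batch_logits, eps=1e-8):
    """Return (entropy_term, diversity_term) for a batch of logits.

    entropy_term  : mean entropy of each prediction (low = individually certain)
    diversity_term: sum_k p_hat_k * log p_hat_k of the mean prediction
                    (low = globally diverse); both terms are minimized.
    """
    probs = [softmax(row) for row in batch_logits]
    n, k = len(probs), len(probs[0])
    entropy_term = -sum(p * math.log(p + eps) for row in probs for p in row) / n
    mean_pred = [sum(row[j] for row in probs) / n for j in range(k)]
    diversity_term = sum(p * math.log(p + eps) for p in mean_pred)
    return entropy_term, diversity_term


if __name__ == "__main__":
    collapsed = [[10.0, 0.0, 0.0]] * 3                       # every sample -> class 0
    diverse = [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 10.0]]
    print("collapsed:", information_maximization_loss(collapsed))
    print("diverse  :", information_maximization_loss(diverse))
```

Both batches are equally confident, so their entropy terms coincide; only the diversity term separates the collapsed solution from the balanced one, which is why entropy minimization alone admits the trivial solution.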
[Figure 2: panels (a) Source model only and (b) SHOT-IM.]
Fig. 2 shows the t-SNE visualizations of features for a 5-way classification task learned by SHOT-IM and the ‘source model only’ method. Intuitively, the target feature representations are scattered without clear class structure for the ‘source model only’ method in Fig. 2(a), and using the IM loss indeed helps align the target data with the unseen source data. However, in Fig. 2(b), some target data may still be matched to the wrong source hypothesis.
We argue that the harmful effects result from the inaccurate original network outputs. For instance, a target sample from the second class with the normalized network output [0.4, 0.3, 0.1, 0.1, 0.1] may be forced to have an expected output [1.0, 0.0, 0.0, 0.0, 0.0]. To alleviate such effects, we resort to self-supervised learning to exploit the knowledge in the unlabeled target domain to help learn structure-aware representations. Specifically, we develop two new self-supervisions as auxiliary tasks to be jointly trained with the main unsupervised task in Eq. (3) in a similar manner to prior methods [107, 89]. We first exploit self-supervision from the perspective of the loss function and design a novel self-supervised pseudo-labeling strategy. Different from conventional pseudo-labeling, where pseudo labels generated by source hypotheses are still noisy due to domain shift, our self-supervised version considers the structure of the target domain and is able to provide more accurate pseudo labels. The detailed learning procedure is provided in the following.
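We first attain the prototype representation (centroid) for each class in the target domain, similar to weighted k-means clustering,
$$c_k^{(0)} = \frac{\sum_{x_t \in \mathcal{X}_t} \delta_k(\hat{f}_t(x_t)) \, \hat{g}_t(x_t)}{\sum_{x_t \in \mathcal{X}_t} \delta_k(\hat{f}_t(x_t))}, \tag{4}$$
where $\delta_k$ denotes the $k$-th element in the soft-max output and $\hat{f}_t = h_t \circ \hat{g}_t$ denotes the previously learned target hypothesis. These centroids can robustly and more reliably characterize the distribution of different categories within the target domain.
We then obtain new pseudo labels via the nearest centroid classifier:
$$\hat{y}_t = \arg\min_k D_f\big(\hat{g}_t(x_t), c_k^{(0)}\big), \tag{5}$$
where $D_f(a, b)$ measures the distance between $a$ and $b$. We use the cosine distance by default.
Finally, we compute the target centroids based on the new pseudo labels:
$$c_k^{(1)} = \frac{\sum_{x_t \in \mathcal{X}_t} \mathbb{1}(\hat{y}_t = k) \, \hat{g}_t(x_t)}{\sum_{x_t \in \mathcal{X}_t} \mathbb{1}(\hat{y}_t = k)}, \qquad \hat{y}_t = \arg\min_k D_f\big(\hat{g}_t(x_t), c_k^{(1)}\big). \tag{6}$$
We term $\hat{y}_t$ self-supervised pseudo labels since they are generated by the centroids obtained in an unsupervised manner. This solution to pseudo labels behaves like that in Minimum Centroid Shift (MCS), where target-specific centroids and pseudo labels are alternately updated via optimizing the intra-class divergence minimization loss. In contrast, we employ the cross-entropy loss and update the centroids and labels in Eq. (6) for just one round, since it is experimentally verified that updating once gives sufficiently good pseudo labels. We provide the cross-entropy loss of self-supervised pseudo-labeling below,
$$\mathcal{L}^{ssl}_{pl}(g_t; \mathcal{X}_t, \hat{\mathcal{Y}}_t) = -\beta_1 \, \mathbb{E}_{(x_t, \hat{y}_t) \in \mathcal{X}_t \times \hat{\mathcal{Y}}_t} \sum_{k=1}^{K} \mathbb{1}_{[k = \hat{y}_t]} \log \delta_k(f_t(x_t)), \tag{7}$$
where $\beta_1$ is a regularization parameter for the trade-off between $\mathcal{L}^{ssl}_{pl}$ and the main task in Eq. (3).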
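The pseudo-labeling procedure (soft-weighted centroids, nearest-centroid labeling, then one refinement round with hard labels) can be sketched in pure Python as follows. The helper names are hypothetical, features are plain lists of floats, and a small constant is added to denominators as an assumption to avoid division by zero for empty classes.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv + 1e-12)


def weighted_centroids(features, weights):
    """weights[i][k] is the (soft or one-hot) assignment of sample i to class k."""
    k, d = len(weights[0]), len(features[0])
    centroids = []
    for c in range(k):
        total = sum(w[c] for w in weights) + 1e-12
        centroids.append(
            [sum(w[c] * f[j] for f, w in zip(features, weights)) / total
             for j in range(d)]
        )
    return centroids


def nearest_centroid_labels(features, centroids):
    return [
        min(range(len(centroids)), key=lambda c: cosine_distance(f, centroids[c]))
        for f in features
    ]


def self_supervised_pseudo_labels(features, soft_outputs):
    """Soft centroids -> labels -> hard centroids -> refined labels (one round)."""
    labels = nearest_centroid_labels(features, weighted_centroids(features, soft_outputs))
    k = len(soft_outputs[0])
    one_hot = [[1.0 if y == c else 0.0 for c in range(k)] for y in labels]
    return nearest_centroid_labels(features, weighted_centroids(features, one_hot))


if __name__ == "__main__":
    feats = [[1.0, 0.1], [1.0, -0.1], [0.9, 0.2], [0.1, 1.0], [-0.1, 1.0], [0.2, 0.9]]
    # The last sample is wrongly predicted as class 0 by the classifier output.
    outs = [[0.8, 0.2], [0.8, 0.2], [0.8, 0.2], [0.3, 0.7], [0.3, 0.7], [0.6, 0.4]]
    print(self_supervised_pseudo_labels(feats, outs))
```

In this toy example the last sample sits inside the second cluster, so the centroid-based labels correct the classifier's wrong argmax prediction.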
Also, we investigate the image-specific self-supervision in the unlabeled target domain. Rotation prediction in RotNet is a popular criterion in the self-supervised learning field, which aims to recognize which one of four different 2D rotations (i.e., 0°, 90°, 180°, and 270°) has been applied to the input image. However, absolute rotation prediction is ill-suited to some classification tasks. For example, in a main task aiming to distinguish digit ‘6’ from digit ‘9’, it is hard to determine which rotation category ‘9’ belongs to, since ‘9’ could be either a ‘6’ rotated by 180 degrees or a ‘9’ rotated by 0 degrees. To resolve this dilemma, we propose a new self-supervised learning task by predicting the relative rotation of each image pair. As shown in Fig. 3, the relative rotation predictor is represented by $h_c: \mathbb{R}^{2d} \to \mathbb{R}^4$, which takes the concatenated features of an image pair as input and maps them to one of four different rotation degrees.
For an image $x_t$ in the target domain $\mathcal{X}_t$, we first randomly sample an integer $z$ from $\{0, 1, 2, 3\}$, which corresponds to the rotation degree pool [0°, 90°, 180°, 270°]. Then we obtain the transformed image $x_t^z$ by rotating $x_t$ with the associated degree $z \times 90°$. Finally, the probability score of the $k$-th relative rotation degree predicted by $h_c$ is given by
$$p_k(x_t, x_t^z) = \delta_k\big(h_c([g_t(x_t), g_t(x_t^z)])\big), \tag{8}$$
where $\delta_k$ denotes the $k$-th element in the soft-max output vector. Therefore, the self-supervised rotation prediction loss is defined as
$$\mathcal{L}^{ssl}_{rot}(g_t, h_c; \mathcal{X}_t) = -\beta_2 \, \mathbb{E}_{x_t \in \mathcal{X}_t, \, z} \sum_{k=0}^{3} \mathbb{1}_{[k = z]} \log p_k(x_t, x_t^z), \tag{9}$$
where $\beta_2$ is a regularization parameter for the trade-off between $\mathcal{L}^{ssl}_{rot}$ and the main loss, i.e., Eq. (3).
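The construction of a training pair for relative rotation prediction can be sketched as below. This is only an illustration under stated assumptions: images are represented as 2-D lists, the rotation direction (counter-clockwise) is an arbitrary choice, and `rot90` and `make_rotation_pair` are hypothetical helper names. The label is the index of the rotation applied to the second image relative to the first, so it stays well defined even when the absolute orientation of the original is ambiguous.

```python
import random


def rot90(img, times):
    """Rotate a 2-D list counter-clockwise by times * 90 degrees."""
    out = [row[:] for row in img]
    for _ in range(times % 4):
        # Transpose, then reverse the row order: one counter-clockwise turn.
        out = [list(col) for col in zip(*out)][::-1]
    return out


def make_rotation_pair(img, rng):
    """Return (original, rotated, z) where z in {0, 1, 2, 3} is the relative label."""
    z = rng.randrange(4)
    return img, rot90(img, z), z


if __name__ == "__main__":
    rng = random.Random(0)
    image = [[1, 2], [3, 4]]
    original, rotated, z = make_rotation_pair(image, rng)
    print("relative rotation index:", z, "(degrees:", z * 90, ")")
    print("rotated image:", rotated)
```

In a full pipeline, both images of the pair would be passed through the feature encoder, their features concatenated, and a 4-way classifier trained with cross-entropy against `z`.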
We provide an illustrative example of the complete hypothesis transfer framework in Fig. 3. To summarize, given the source model $f_s = h_s \circ g_s$, pseudo labels generated in Eq. (6), and randomly generated rotation labels as above, SHOT freezes the hypothesis from the source via $h_t = h_s$ and learns the feature encoding module $g_t$ with the full optimization objective as
$$\mathcal{L}(g_t, h_c) = \mathcal{L}_{im}(h_t \circ g_t; \mathcal{X}_t) + \mathcal{L}^{ssl}_{pl}(g_t; \mathcal{X}_t, \hat{\mathcal{Y}}_t) + \mathcal{L}^{ssl}_{rot}(g_t, h_c; \mathcal{X}_t), \tag{10}$$
where $\mathcal{L}_{im}$ is the IM loss in Eq. (3), and the pseudo-labeling loss in Eq. (7) and the rotation prediction loss in Eq. (9) each carry their own trade-off parameter.
After we obtain the predictions for all the samples in the target domain via SHOT in Eq. (10), we can measure the confidence scores of these predictions via the entropy function $H(p) = -\sum_{k} p_k \log p_k$, where $p$ is a probability prediction vector. Observing the distribution of confidence scores, we find that there always exist some less confident (high-entropy) predictions that are possibly inaccurate. Fortunately, we can utilize the reliable labeling information from highly confident predictions to improve the accuracy of these less confident ones. To this end, we propose a two-step method to enforce the information propagation from low-entropy predictions to high-entropy ones. In the first step, we divide the target domain into two splits according to the confidence scores and treat these two splits as a labeled subset and an unlabeled subset, respectively. In the second step, we employ a semi-supervised learning algorithm to learn the enhanced predictions for the unlabeled subset.
Regarding the choice of a semi-supervised learning algorithm in the second step, we simply adopt a popular and well-performing approach, MixMatch. The key point lies in how to divide the target domain into two splits. With the average entropy, we first obtain the proportion of the labeled subset in the entire target domain by automatically computing
$$\rho = \frac{1}{n_t} \sum_{i=1}^{n_t} \mathbb{1}\big(e_i < \bar{e}\big),$$
where $\{e_i\}_{i=1}^{n_t}$ denotes the entropy values of all the predictions in the target domain, and $\bar{e} = \frac{1}{n_t} \sum_{i=1}^{n_t} e_i$. Then for each class $k$, we put the indices with entropy values among the top $\rho \, n_t^k$ smallest into the index pool of the labeled split, where
$$n_t^k = \sum_{i=1}^{n_t} \mathbb{1}(\hat{y}_i = k),$$
and $\hat{y}_i$ is the label predicted by SHOT in Eq. (10). In this manner, we get the labeled split, and the remaining samples constitute the unlabeled split. We call this strategy, shown in Fig. 4, labeling transfer, since in this stage we only need the labeling information (predictions), while the feature encoding module is initialized with that learned in Eq. (10). Besides, the classification module is newly initialized from scratch and is no longer frozen. So far, we have developed a two-stage approach, called SHOT++, in which the first stage is SHOT in Eq. (10) and the second stage is the proposed labeling transfer strategy in Fig. 4.
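A minimal sketch of the confidence-based split is given below. It rests on two assumptions that should be read as such: the labeled proportion is taken to be the fraction of samples whose entropy falls below the mean entropy, and the per-class quota is rounded down with `int()`. The function names are illustrative only.

```python
import math


def entropy(p, eps=1e-12):
    """Shannon entropy of a probability vector."""
    return -sum(v * math.log(v + eps) for v in p)


def split_by_confidence(probs):
    """Split sample indices into a confident (labeled) and a remaining (unlabeled) set.

    Assumption: the labeled proportion rho is the fraction of samples whose
    entropy is below the mean entropy; within each predicted class, the
    int(rho * class_size) lowest-entropy samples go to the labeled split.
    """
    ents = [entropy(p) for p in probs]
    preds = [max(range(len(p)), key=p.__getitem__) for p in probs]
    mean_ent = sum(ents) / len(ents)
    rho = sum(e < mean_ent for e in ents) / len(ents)

    labeled = []
    for k in sorted(set(preds)):
        members = sorted((i for i, y in enumerate(preds) if y == k),
                         key=lambda i: ents[i])
        labeled.extend(members[: int(rho * len(members))])
    labeled_set = set(labeled)
    unlabeled = [i for i in range(len(probs)) if i not in labeled_set]
    return sorted(labeled), unlabeled, rho


if __name__ == "__main__":
    predictions = [
        [0.99, 0.01], [0.90, 0.10], [0.60, 0.40], [0.55, 0.45],
        [0.02, 0.98], [0.20, 0.80], [0.45, 0.55], [0.40, 0.60],
    ]
    labeled, unlabeled, rho = split_by_confidence(predictions)
    print("rho =", rho)
    print("labeled split  :", labeled)
    print("unlabeled split:", unlabeled)
```

Selecting per class rather than globally keeps the labeled split from being dominated by classes that happen to be easy, which matters when the labeled split later supervises a semi-supervised learner.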
We also provide an extension of the proposed SHOT approach for multi-source domain adaptation (MSDA). For simplicity, we run SHOT and SHOT-IM on each source-target pair and then sum up the probabilistic scores obtained from each pair. Finally, we get the predictions of samples in the target domain via the argmax operation. As for labeling transfer, we split the target domain into two pieces for each pair and learn independent prediction scores.
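The score aggregation for the multi-source case amounts to summing per-source class probabilities and taking the argmax, as in this small sketch. The function name `aggregate_multi_source` and the nested-list data layout are assumptions made for illustration.

```python
def aggregate_multi_source(score_sets):
    """Combine predictions from several source-adapted models.

    score_sets[m][i] is the class-probability vector that the model adapted
    from source m assigns to target sample i. Scores are summed over sources
    and the predicted label is the argmax of the summed vector.
    """
    n = len(score_sets[0])
    k = len(score_sets[0][0])
    summed = [
        [sum(scores[i][c] for scores in score_sets) for c in range(k)]
        for i in range(n)
    ]
    return [max(range(k), key=row.__getitem__) for row in summed]


if __name__ == "__main__":
    from_source_a = [[0.6, 0.4], [0.3, 0.7]]
    from_source_b = [[0.2, 0.8], [0.4, 0.6]]
    print(aggregate_multi_source([from_source_a, from_source_b]))
```

In the first sample the two sources disagree, and the more confident one determines the final label because the summed scores favor its class.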
We also provide an extension of the proposed SHOT approach for partial-set domain adaptation (PDA). The diversity-promoting term in Eq. (10) encourages the target domain to have a uniform label distribution. Though seemingly reasonable for solving closed-set UDA, it is not suitable for PDA. In reality, the target domain only contains a subset of the classes in the source domain, making the label distribution sparse. Hence, we drop the second term for PDA.
Besides, within the self-supervised pseudo-labeling strategy, we usually need to obtain centroids in the target domain. However, for the PDA task, there are some tiny centroids which should be considered empty, as in k-means clustering. Therefore, SHOT discards tiny centroids with size smaller than a threshold in Eq. (6) for PDA problems.
We further extend the proposed SHOT approach to semi-supervised domain adaptation (SSDA). SSDA differs from UDA in that some labeled data exist in the target domain. Therefore, we adopt the supervised training loss in Eq. (2) for labeled target data and the complete loss in Eq. (10) for unlabeled target data. Besides, we also consider the labeled target data when computing the target-specific centroids. As for labeling transfer, we split the unlabeled target domain into two pieces and then add the labeled data into the labeled split.
Here we discuss some architecture choices for the neural network model to parameterize both the feature encoding module and the hypothesis. First, we need to look back at the expected network outputs for cross-entropy loss in Eq. (1). If , then maximizing means minimizing the distance between and , where is the -th weight vector in the last FC layer. Ideally, all the samples from the -th class would have a feature embedding near to . If unlabeled target samples are given the correct pseudo labels, it is easily understandable that source feature embeddings are similar to target ones via the pseudo-labeling term in Eq. (7). The intuition behind is quite similar to previous studies [60, 97] where a simplified MMD is exploited for multi-modal domain confusion. Since the weight norm matters in the inner distance within the soft-max output, we adopt weight normalization (WN)  to keep the norm of each weight vector the same in the FC classifier layer. Besides, as indicated in prior studies, batch normalization (BN)  can reduce the internal dataset shift since different domains share the same mean (zero) and variance which can be considered as first-order and second-order moments. Based on these considerations, we form the frameworks of SHOT and SHOT++ as shown in Figs. 14.
To verify their versatility, we evaluate our methods in three unsupervised DA scenarios (i.e., closed-set, partial-set, multi-source) and one semi-supervised DA scenario over several popular visual benchmarks, as introduced below.
Digits is a widely used DA benchmark that focuses on digit recognition. We follow the protocol of prior work and utilize three representative subsets: SVHN (S), MNIST (M), and USPS (U). We train our model using the training sets of each domain and report the recognition results on the standard test set of the target domain.
Office is a standard DA benchmark which contains three domains, i.e., Amazon (A), DSLR (D), and Webcam (W), and each domain includes 31 object classes in the office environment. Gong et al. further extract 10 shared categories between Office and Caltech-256 (C) to form a new benchmark named Office-Caltech. Both Office and Office-Caltech are considered small-sized.
Office-Home is a challenging medium-sized benchmark, which consists of four distinct domains, i.e., Artistic images (Ar), Clip Art (Cl), Product images (Pr), and Real-World images (Rw). There are 65 everyday object categories in each domain.
VisDA-C is a challenging large-scale benchmark that mainly focuses on the 12-class synthesis-to-real object recognition task. The source domain contains 152 thousand synthetic (S) images generated by rendering 3D models, while the target domain has 55 thousand real (R) object images sampled from Microsoft COCO.
PACS is a popular benchmark for multi-source domain adaptation. It contains four different domains, i.e., Art painting (A), Cartoon (C), Photo (P), and Sketch (S). There are 7 common categories in each domain.
Baseline methods. For vanilla unsupervised DA in digit recognition, we compare SHOT with ADDA, ADR, CDAN, CyCADA, CAT, SWD and STAR; for object recognition, we compare ours with DANN, DAN, SAFN, BSP, MDD, TransNorm, DSBN, BNM and GVB-GD. For partial-set DA tasks, we compare ours with IWAN, SAN, ETN, DRCN, RTNet, BAUS, and TSCDA. For multi-source UDA, we compare ours with DCTN, MCD, WBN, MSDA-, CMSS, and CMSS. For SSDA, we mainly compare our methods with MME and UODA. Note that results are directly cited from published papers if we follow the same setting. ‘Source-model-only’ (also called ‘src-only’) denotes using the entire model learned from the source domain for target label prediction. ‘Labeled-data-only’ denotes using labeled target data only when learning the feature extractor. SHOT-IM is a special case of SHOT, where both self-supervised losses are ignored by setting their balancing parameters to zero in Eq. (10).
Network architecture. For the digit recognition task, we use the same architectures as CDAN, namely, the classical LeNet-5 network for the tasks between USPS and MNIST and a variant of LeNet for SVHN→MNIST. More network details can be found in Appendix A of the referenced work. For the object recognition task, we employ the pre-trained ResNet-50 or ResNet-101 models as the backbone, like [59, 23, 100, 72]. Following prior work, we replace the original FC layer with a bottleneck layer (256 units) and a task-specific FC classifier layer, as shown in Fig. 1. Precisely, a BN layer is placed after the FC layer inside the bottleneck, and a weight normalization layer is used in the task-specific FC layer.
Network hyper-parameters. We train the whole network through back-propagation, and the newly added layers are trained with a learning rate 10 times that of the pre-trained layers (backbone shown in Fig. 1). Concretely, we adopt mini-batch SGD with momentum 0.9, weight decay 1e and learning rate for the new layers and those layers learned from scratch for all experiments except for VisDA-C. We further adopt the same learning rate scheduler as [26, 59], where is the training progress changing from 0 to 1. Besides, we set the batch size to 64 for all the tasks. We utilize for all experiments except for Digits in Table I and SSDA in Table VIII. Concerning the labeling transfer strategy, only ‘source-model-only++’ for object recognition does not use the learned source model as initialization.
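The 10x learning-rate ratio between newly added layers and the pre-trained backbone can be sketched with per-group optimizer settings. This is only an illustration: the layer names are hypothetical, and the numeric values for `new_lr` and `weight_decay` are placeholders chosen for the example rather than values taken from the text. The list-of-dicts layout mirrors the parameter-group format that optimizers such as `torch.optim.SGD` accept.

```python
def build_param_groups(backbone_params, new_params, new_lr, weight_decay):
    """Give newly added layers a learning rate 10x that of the backbone."""
    common = {"momentum": 0.9, "weight_decay": weight_decay}
    return [
        {"params": list(backbone_params), "lr": new_lr / 10.0, **common},
        {"params": list(new_params), "lr": new_lr, **common},
    ]


def rescale(groups, factor):
    """Apply one scheduler multiplier to all groups, preserving the ratio."""
    return [{**g, "lr": g["lr"] * factor} for g in groups]


# Hypothetical layer names and placeholder hyper-parameter values.
groups = build_param_groups(
    backbone_params=["conv1", "layer1"],
    new_params=["bottleneck", "classifier"],
    new_lr=0.01,
    weight_decay=0.001,
)
halved = rescale(groups, 0.5)
```

Whatever schedule is used, applying the same multiplier to both groups keeps the new layers at ten times the backbone rate throughout training.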
For Digits, we train the best source hypothesis using the test set of the source dataset as validation. For other datasets without train-validation splits, we randomly specify a 0.9/0.1 split in the source dataset and generate the best source hypothesis based on the validation split. The maximum number of epochs for Digits, Office, Office-Home, VisDA-C and Office-Caltech is empirically set as 30, 100, 50, 10, and 100, respectively. For learning in the target domain, we update the pseudo-labels epoch by epoch, and the maximum number of epochs is empirically set as 15. Regarding the second step in Section 3.4, we adopt the same learning setting as that of training SHOT and the default parameters within MixMatch. We utilize only for Digits. Besides, we run our methods three times with different random seeds via PyTorch, and report the mean accuracy. Note that we do not use any target augmentation such as the ten-crop ensemble for evaluation.
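The random 0.9/0.1 source split and the averaging over three seeds can be sketched as below. The function names, the seed values, and the dataset size are assumptions for illustration; `run_fn` stands in for a full training run that returns an accuracy.

```python
import random
import statistics


def split_source(indices, val_fraction=0.1, seed=0):
    """Shuffle sample indices with a fixed seed and hold out a validation part."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def mean_over_seeds(run_fn, seeds):
    """Average the accuracy returned by run_fn(seed) over several seeds."""
    return statistics.mean(run_fn(s) for s in seeds)


# Hypothetical source dataset with 1000 samples.
train_idx, val_idx = split_source(range(1000), val_fraction=0.1, seed=2020)
```

Fixing the seed makes the split reproducible, so the source hypothesis selected on the validation part does not change between runs.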
For digit recognition, we evaluate our methods on three popular closed-set unsupervised domain adaptation tasks, i.e., SVHN→MNIST, USPS→MNIST, and MNIST→USPS. The classification accuracies of our methods and prior work are reported in Table I. SHOT obtains the best mean accuracy for each task and also outperforms prior work in terms of the average accuracy. Compared with the baseline method source-model-only, SHOT-IM always achieves better results, and SHOT performs better than SHOT-IM due to the contribution of self-supervised learning in the target domain. With the labeling transfer strategy, all three methods obtain enhanced classification results, indicating the effectiveness of intra-domain semi-supervised learning. It is also worth noting that SHOT++ even offers superior performance to the target-supervised result in MNIST→USPS. This may be because MNIST is much larger than USPS, which alleviates the domain shift well.
Next, we evaluate our methods on object recognition benchmarks including Office, Office-Home and VisDA-C under the vanilla closed-set DA setting. As shown in Table II, SHOT performs the best for two challenging tasks, D→A and W→A, and obtains an average accuracy of 88.8% that is competitive with two state-of-the-art methods, MDD and BNM. Similar to the observations in Table I, the labeling transfer strategy is beneficial to cross-domain object recognition, and SHOT++ obtains the same mean accuracy as previous state-of-the-art methods, TransNorm and GVB-GD. This may be because SHOT needs a relatively large target domain to learn the target-specific module, while D and W as target domains are not big enough. Generally, SHOT obtains competitive performance even with no direct access to the source domain data.
As expected, on the medium-sized Office-Home dataset, our method SHOT++ significantly outperforms previously published state-of-the-art approaches, advancing the average accuracy from 70.4% in GVB-GD to 73.0% in Table III. Besides, SHOT++ performs the best in 11 out of 12 separate tasks. For the transfer task Re→Ar, SHOT++ gets the second-best result of 74.3%, which is only lower than the best result of 74.6% from GVB-GD. Generally, the hypothesis transfer strategy works well, as seen from the results of SHOT compared with prior methods, and the labeling transfer strategy further lifts the average accuracy by nearly 1 point.
For the large-scale synthesis-to-real VisDA-C dataset, we follow the protocol in prior works [81, 100] and employ the favored backbone, ResNet-101. As shown in Table IV, SHOT++ achieves the best per-class accuracy and obtains the best result on 6 out of 12 classes. Even when ignoring the second stage, namely, labeling transfer, SHOT can still obtain a promising per-class result of 85.5%, higher than the prior state-of-the-art 82.7% in STAR. Carefully comparing SHOT with prior work, we find that SHOT performs well even for the most challenging class ‘truck’. Besides, with the intra-domain semi-supervised learning stage via MixMatch, the per-class results are improved but the accuracy of the hard class ‘truck’ decreases. This may be because large errors in the labeled split affect the final results.
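Since the VisDA-C results are reported as per-class accuracy, a short sketch of how that metric differs from overall accuracy may help. The function names and the toy labels are illustrative assumptions; the per-class figure is taken here as the unweighted mean of the accuracies of the individual classes.

```python
from collections import defaultdict


def overall_accuracy(y_true, y_pred):
    """Fraction of all samples that are classified correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_accuracy(y_true, y_pred):
    """Accuracy computed separately for every ground-truth class."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return {c: hits[c] / totals[c] for c in totals}


def mean_per_class_accuracy(y_true, y_pred):
    """Unweighted mean of the per-class accuracies."""
    accs = per_class_accuracy(y_true, y_pred)
    return sum(accs.values()) / len(accs)


# Toy imbalanced example: the rare class 'truck' is half wrong.
y_true = ["car"] * 8 + ["truck"] * 2
y_pred = ["car"] * 8 + ["car", "truck"]
```

On this toy data the overall accuracy is 0.9 while the mean per-class accuracy is 0.75, which shows why a hard minority class such as ‘truck’ weighs heavily on the reported number.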
Following previous works [113, 21], we further adopt the ResNet-50 backbone to validate the effectiveness of our methods. Results are shown in Table V. With the hypothesis transfer strategy, SHOT beats the state-of-the-art method GVB-GD by 1.4% in terms of per-class accuracy. Benefiting from the labeling transfer strategy, the per-class accuracy further grows from 76.7% (SHOT) to 77.2% (SHOT++) and again ranks the best for VisDA-C with the ResNet-50 backbone.
In Table V, we further fix three balancing parameters to zero in turn and investigate the effectiveness of each component within SHOT in Eq. (10), including the diversity term, the self-supervised pseudo-labeling term, and the rotation prediction term. Firstly, the advantage of SHOT-IM over its variant without the diversity term validates the effectiveness of that term. Combined with the labeling transfer strategy, SHOT-IM++ also obtains a better per-class accuracy than its variant without the diversity term. Secondly, SHOT without the pseudo-labeling term performs worse than SHOT, indicating the effectiveness of the self-supervised pseudo-labeling term in Eq. (7). Thirdly, SHOT without the rotation prediction term performs worse than SHOT, indicating the effectiveness of the self-supervised rotation prediction term in Eq. (9). The latter two conclusions can also be drawn by comparing these two variants with SHOT-IM. Also, the two self-supervised terms do not appear to contribute equally within SHOT. Finally, the benefits of the labeling transfer strategy are also easily validated by comparing the values in the second column with those in the fourth column.
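The role of the diversity term in this ablation can be illustrated with a sketch of an information maximization objective. It assumes the standard formulation (mean prediction entropy plus the negative entropy of the average prediction) rather than reproducing the paper's equations, which are not part of this section; `div_weight` is a hypothetical balancing parameter, and setting it to zero mimics the ablated variant.

```python
import math

EPS = 1e-8


def entropy(p):
    """Shannon entropy of one probability vector."""
    return -sum(v * math.log(v + EPS) for v in p)


def im_loss(probs, div_weight=1.0):
    """Information maximization loss over a batch of soft-max outputs.

    ent: mean entropy of individual predictions (encourages confidence).
    div: negative entropy of the mean prediction (encourages diversity).
    """
    n = len(probs)
    k = len(probs[0])
    ent = sum(entropy(p) for p in probs) / n
    mean_p = [sum(p[j] for p in probs) / n for j in range(k)]
    div = -entropy(mean_p)
    return ent + div_weight * div


# Both batches are equally confident, but the first collapses onto one class.
collapsed = [[0.98, 0.02], [0.98, 0.02]]
diverse = [[0.98, 0.02], [0.02, 0.98]]
```

With the diversity term active, the batch spread over both classes gets a lower loss than the collapsed one; with `div_weight=0` the two batches are indistinguishable, which is the degenerate behavior the diversity term is meant to prevent.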
Results of object recognition for MSDA. For the multi-source UDA setting, we adopt the protocol in prior work on Office-Caltech and PACS. For the two datasets, we specify a target subset and use the other three subsets as three source domains, forming a multi-source UDA task. Likewise, SHOT does not access the source data but is provided with multiple source models instead. The results of ours and previously published state-of-the-art methods are shown in Table VI. SHOT achieves better results than CMSS in 3 out of 4 tasks on Office-Caltech and in all 4 tasks on PACS. With the incorporation of labeling transfer, SHOT++ wins all these transfer tasks on the two datasets. Besides, the gap between SHOT and SHOT-IM is relatively small on Office-Caltech since the predictions learned by SHOT-IM are already good enough.
Results of object recognition for PDA. For the partial-set UDA setting, we follow the protocol in prior work on Office-Home and VisDA-C. In particular, the target domain contains 25 classes (the first 25 in alphabetical order) out of 65 classes for Office-Home, while the first 6 classes in alphabetical order out of 12 classes are included in the target domain for VisDA-C. Results of our methods and previous state-of-the-art PDA methods [51, 15, 56, 76] are shown in Table VII. As explained in Section 3.6, is utilized in all of our methods here. Compared with previous methods, SHOT obtains the best average accuracy for both datasets as before. Besides, SHOT again outperforms SHOT-IM by 2.8% and 6.3% in terms of the average accuracy on the two datasets, and SHOT++ further improves the average accuracy from 79.5% to 79.9% and from 73.6% to 76.2%, respectively. Generally, both the hypothesis transfer strategy and the labeling transfer strategy prove effective for the challenging PDA problem.
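The construction of the partial target domain (keeping the first classes in alphabetical order) can be sketched as below. The function names, the class names, and the file paths are hypothetical, and plain case-sensitive string sorting is an assumption about how the alphabetical order is defined.

```python
def partial_target_classes(class_names, num_kept):
    """Return the first num_kept class names in alphabetical order."""
    return sorted(class_names)[:num_kept]


def filter_samples(samples, kept_classes):
    """Keep only (path, label) pairs whose label is in the kept classes."""
    kept = set(kept_classes)
    return [(path, label) for path, label in samples if label in kept]


# Hypothetical class names and file paths, for illustration only.
classes = ["Desk_Lamp", "Backpack", "Calculator", "Bike"]
samples = [
    ("img_001.jpg", "Bike"),
    ("img_002.jpg", "Desk_Lamp"),
    ("img_003.jpg", "Backpack"),
]

kept = partial_target_classes(classes, num_kept=2)
target_subset = filter_samples(samples, kept)
```

The source model still predicts over the full label set, so the filtered target domain contains only a subset of the classes that the source hypothesis knows about.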
Results of object recognition for SSDA. For the semi-supervised domain adaptation setting, we follow the protocol in prior work on Office-Home under the one-shot setting, where one labeled example per class is available in the target domain. As shown in Table VIII, SHOT outperforms UODA and MME in 10 out of 12 tasks and achieves the best average accuracy. Besides, SHOT is always superior to SHOT-IM, validating the effectiveness of self-supervision over the unlabeled target data. SHOT++ further improves the average accuracy from 65.4% to 66.1%, indicating the effectiveness of the labeling transfer strategy.
Special case. One may wonder whether SHOT works if we cannot train the source model by ourselves. To find the answer, we utilize the most popular off-the-shelf pre-trained ImageNet model, ResNet-50, and consider a special PDA task (ImageNet→Caltech) to evaluate the effectiveness of SHOT with the same basic setting as prior work. As shown in Table IX, SHOT achieves a slightly higher mean accuracy than the prior state-of-the-art ETN even without access to the source data. This shows that the proposed hypothesis transfer strategy is effective even without designing the network architecture of the source model.
Ablation study on network components. As discussed in Section 3.8, we utilize batch normalization (BN) and weight normalization (WN) when training the source model and learning the target feature encoder. We report the ablation study on network components in Fig. 5 to validate their contribution. First, using BN or WN decreases the accuracy on the source training set, which may be useful for generalization, since we find in the second bin that high accuracy on the source test set corresponds to low accuracy on the training set. Second, the higher the accuracy the ‘src-only’ method obtains, the better the results SHOT and its variants achieve. Generally, both BN and WN are beneficial to domain adaptation. Besides, the improvements brought by BN are larger than those brought by WN.
[Figure: (a) rotation prediction; (b) target accuracy]
[Figure: (a) accuracy of Ar→Cl (UDA); (b) accuracy of Ar→Cl (PDA)]
Training stability. We investigate the accuracy and the values of three different objective functions during optimization in Fig. 6 on the UDA task Ar→Cl. The values of two of the objectives quickly decrease and converge after nearly 8 epochs. The value of the rotation prediction loss also keeps decreasing, but slowly. As shown in Fig. 6(d), the accuracy follows a very similar trend, i.e., growing quickly and starting to converge after 6 epochs. Generally, the training procedure of SHOT is stable and effective.
Discussion on loss functions. To analyze the advantages of our proposed self-supervised loss functions in Eq. (7) and Eq. (9), we design one vanilla alternative for each function on the transfer task Ar→Pr. As shown in Fig. 7(a), the proposed relative rotation prediction objective works better than the vanilla variant in terms of the rotation prediction accuracy. Besides, comparisons in terms of the semantic accuracy in Fig. 7(b) indicate that the proposed relative rotation prediction objective is also beneficial to UDA. Compared with SHOT w/ vanilla pseudo-labeling, SHOT always obtains better results throughout training, implying the superiority of the proposed self-supervised pseudo-labeling term.
Parameter sensitivity. To better understand the effects of the balancing parameters, we test the performance sensitivity in the UDA task Ar→Cl on Office-Home and show the results in Fig. 8(a). The accuracies around the chosen values are not sensitive. Besides, we study the sensitivity of the threshold parameter for the PDA task Ar→Cl on Office-Home in Fig. 8(b). The accuracies around the chosen threshold are also not sensitive. Generally, SHOT is not sensitive to its parameters.
[Figures (two, same panel layout): (a) Source-model-only; (b) SHOT-IM; (c) SHOT]
Qualitative study. We randomly select some samples from the source domain, the low-entropy target split, and the high-entropy target split to provide some intuitive insights into the labeling transfer strategy. In particular, we pick two images from each of three representative classes, i.e., ‘backpack’, ‘bike’, and ‘bucket’, for the UDA task Ar→Cl on Office-Home, and show them in Fig. 9. The proposed strategy separates the easy samples from the hard samples in the target domain well. Besides, for the hard samples in the high-entropy target split, the easy samples in the low-entropy target split are more trustworthy than the source samples, which makes the proposed labeling transfer strategy understandable and effective.
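The split into a low-entropy (confident) part and a high-entropy (less confident) part can be sketched as follows. The splitting rule used here, a fixed fraction of the lowest-entropy samples, is an assumption made for illustration, as are the function names and the toy predictions.

```python
import math


def prediction_entropy(p, eps=1e-8):
    """Shannon entropy of one soft-max output."""
    return -sum(v * math.log(v + eps) for v in p)


def split_by_entropy(probs, confident_fraction=0.5):
    """Return (low_entropy_indices, high_entropy_indices).

    Samples are ranked by prediction entropy; the lowest-entropy fraction
    forms the confident split (the fraction is a hypothetical parameter).
    """
    order = sorted(range(len(probs)), key=lambda i: prediction_entropy(probs[i]))
    n_conf = int(len(order) * confident_fraction)
    return order[:n_conf], order[n_conf:]


def pseudo_label(p):
    """Index of the most probable class."""
    return max(range(len(p)), key=lambda j: p[j])


# Toy soft-max outputs for four target samples.
probs = [[0.97, 0.03], [0.55, 0.45], [0.10, 0.90], [0.50, 0.50]]
low, high = split_by_entropy(probs)
labeled_split = [(i, pseudo_label(probs[i])) for i in low]
```

The low-entropy indices, paired with their pseudo-labels, would then play the role of the labeled set in a semi-supervised learner, while the high-entropy indices form the unlabeled set.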
Feature visualization. We provide the t-SNE visualizations (https://lvdmaaten.github.io/tsne/) of the features learned by Source-model-only, SHOT-IM, and SHOT for the UDA task Ar→Cl on Office-Home in Fig. 10 and Fig. 11, respectively. As expected, both SHOT-IM and SHOT help align the target features with the source features in Fig. 10. Looking carefully at the semantic labels in Fig. 11, we find that SHOT outperforms SHOT-IM by semantically aligning features from different domains.
In this paper, we have proposed a generic representation learning framework called source hypothesis transfer (SHOT) for source data-absent unsupervised domain adaptation. SHOT only needs the well-trained source model and makes unsupervised domain adaptation feasible without access to the source data, which may be private or decentralized. Specifically, SHOT learns the optimal target-specific feature learning module to fit the source hypothesis by exploiting information maximization and self-supervised learning. We further present a labeling transfer strategy and apply it to enhance SHOT to SHOT++, which exploits the intra-domain information via a semi-supervised algorithm. Experiments on both digit classification and object recognition verify that SHOT and SHOT++ can achieve results comparable to or even better than the state of the art for three different unsupervised domain adaptation scenarios as well as the semi-supervised domain adaptation problem. In the future, we plan to apply the proposed methods to other visual tasks like semantic segmentation and object detection.
Deep clustering for unsupervised learning of visual features. In Proc. ECCV, pp. 132–149.
Domain Adaptation in Computer Vision Applications, pp. 1–35.
Unsupervised domain adaptation by backpropagation. In Proc. ICML, pp. 1180–1189.
Domain adaptation for large-scale sentiment classification: a deep learning approach. In Proc. ICML, pp. 513–520.