Deep learning methods have achieved remarkable performance in a wide variety of applications under the i.i.d. assumption. However, when training and test data are not drawn from the same distribution, the trained model cannot generalize well to the test data. To deal with this domain shift problem, researchers resort to Unsupervised Domain Adaptation (UDA) (Saenko et al., 2010; Pan et al., 2010; Gong et al., 2012; Tzeng et al., 2014; Ganin and Lempitsky, 2015; Long et al., 2015, 2017a; Zhang et al., 2019). However, recent works (Zhao et al., 2019; Wu et al., 2019; Combes et al., 2020) have shown that UDA does not guarantee good generalization on the target domain.
Table 1: Differences among the related settings (unsupervised domain adaptation, semi-supervised domain adaptation, and unsupervised model adaptation) in terms of the availability of source data, a trained source model, and labeled target data.
Especially when the marginal label distributions are distinct across domains, UDA methods provably hurt target generalization (Zhao et al., 2019). Besides, in many real-world applications it is often feasible to obtain at least a small amount of labeled data from the target domain. Therefore, Semi-Supervised Domain Adaptation (SSDA) (Donahue et al., 2013; Saito et al., 2019; Kim and Kim, 2020), where a large amount of labeled source data and a small amount of labeled target data are available, has received increasing attention.
In addition to utilizing a few labeled target samples, the major progress in SSDA has come from improved methods for aligning source and target representations in order to improve generalization. These methods span distribution alignment, for example via maximum mean discrepancy (MMD) (Tzeng et al., 2014; Long et al., 2015; Yan et al., 2017), domain adversarial training (Ganin and Lempitsky, 2015; Long et al., 2017a; Zhang et al., 2019), and cycle-consistent image transformation (Liu and Tuzel, 2016; Hoffman et al., 2018). However, as revealed in a recent study (Saito et al., 2019), some UDA methods, e.g., DANN (Ganin and Lempitsky, 2015) and CDAN (Long et al., 2017a), show no improvement or even yield worse results than SSDA methods when trained on a few labeled target samples together with source samples. Therefore, recent works focus on better leveraging the labeled and unlabeled target domain via min-max entropy (Saito et al., 2019), meta-learning (Li and Hospedales, 2020) and jointly learning invariant representations and risks (Li et al., 2020b).
Despite its promising performance, SSDA is not always applicable in real-world scenarios, because the source data is not always accessible due to privacy concerns in the source domain (Liang et al., 2020a). For example, many companies only provide learned models instead of their customer data because of data privacy and security issues. Besides, source datasets such as videos or high-resolution images may be so large that it is impractical to transfer them to different platforms (Li et al., 2020a). To overcome the absence of source data, unsupervised model adaptation (UMA) has been investigated in (Liang et al., 2020a; Li et al., 2020a). UMA is more difficult than UDA and inherits its challenge that the generalization ability on the target domain may not improve. Moreover, without source data it is hard to reduce the domain discrepancy, so the features of target samples may lie near the decision boundary, which can lead to mis-classification, as shown in Fig. 1. To tackle these issues, in this paper we focus on a more realistic setting of Semi-supervised Source Hypothesis Transfer (SSHT), which has not been explored. The major differences between SSHT and other related problems are summarized in Table 1.
SSHT is a more challenging task than SSDA because the source data is not accessible. In SSDA, even though the source domain is discrepant from the target domain, the source labels are accurate and help maintain the discriminability of the adapted model. In SSHT, by contrast, the insufficient labeled target data may result in target features lying near the decision boundary, increasing the risk of mis-classification. Besides, the source data are usually imbalanced, so the trained model is prone to categorizing samples of minority categories into majority ones, exhibiting small prediction diversity. Such a biased model trained on source data may not be well improved when transferred to the target domain with only a few labeled samples, leading to poor generalization on the target domain.
To tackle the above issues, we propose Consistency and Diversity Learning (CDL), a simple but effective framework for SSHT that encourages prediction consistency on the unlabeled target data and maintains prediction diversity when adapting the model to the target domain. Given two random augmentations of an unlabeled image, consistency regularization can be achieved via interpolation consistency (Zhang et al., 2017; Verma et al., 2019) or prediction consistency (Berthelot et al., 2019; Sohn et al., 2020). We adopt FixMatch (Sohn et al., 2020), a simple but effective semi-supervised learning method. FixMatch applies strong data augmentation (Cubuk et al., 2020) to produce a wider range of highly perturbed images. Then, regarding the predictions of weakly augmented images as pseudo labels, consistency is achieved by training the model to classify the strongly augmented images into those pseudo labels. Such consistency regularization makes it harder for the model to memorize the few labeled target samples and therefore enhances the generalization ability of the learned model.
To maintain prediction diversity, we integrate Batch Nuclear-norm Maximization (BNM) (Cui et al., 2020b) into our framework. As revealed in (Cui et al., 2020b), for the classification output matrix of a randomly selected batch of data, the prediction discriminability and diversity can be separately measured by the Frobenius norm and the rank of the matrix. Since the nuclear-norm is an upper bound of the Frobenius norm and a convex approximation of the matrix rank, maximizing the batch nuclear-norm improves both discriminability and diversity. We argue that maintaining diversity is necessary, since FixMatch degrades diversity: it uses only the samples whose prediction confidence exceeds a predefined threshold when computing the consistency regularization. Although such a thresholding mechanism helps mitigate the impact of incorrect pseudo labels, it worsens prediction diversity, because samples of majority categories tend to exhibit larger prediction confidence.
We conduct extensive experiments on DomainNet, Office-Home and Office-31. The experimental results show that the proposed CDL significantly outperforms state-of-the-art UMA methods and achieves comparable results against state-of-the-art SSDA methods. Ablation studies are presented to verify the contribution of each key component in our framework.
2. Related Work
2.1. Unsupervised Domain Adaptation
Most deep-neural-network-based Unsupervised Domain Adaptation (UDA) methods succeed without any target supervision and can be mainly categorized into cross-domain discrepancy minimization methods (Tzeng et al., 2014; Long et al., 2015, 2017b) and adversarial adaptation methods (Ganin and Lempitsky, 2015; Long et al., 2017a; Zhang et al., 2019). The popular discrepancy measurement, Maximum Mean Discrepancy (MMD), was first applied to one Fully-Connected (FC) layer of AlexNet in DDC (Tzeng et al., 2014). Deep Adaptation Network (DAN) (Long et al., 2015) further minimizes the sum of MMDs defined on several FC layers and achieves better domain alignment. For better discriminability in the target domain, JAN (Long et al., 2017b) aligns the marginal and conditional distributions jointly based on MMD. Researchers have also proposed other discrepancy measurements such as correlation distance (Sun and Saenko, 2016) and Central Moment Discrepancy (CMD) (Zellinger et al., 2017) for UDA.
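To make the discrepancy-minimization idea concrete, the following is a minimal numpy sketch of a (biased) squared-MMD estimate with an RBF kernel; the function names, batch shapes, and kernel bandwidth are illustrative choices of ours, not taken from any of the cited methods.

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    """RBF kernel matrix k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2)."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

def mmd2(source, target, gamma=0.5):
    """Biased estimate of squared MMD between two feature batches."""
    k_ss = rbf_kernel(source, source, gamma).mean()
    k_tt = rbf_kernel(target, target, gamma).mean()
    k_st = rbf_kernel(source, target, gamma).mean()
    return k_ss + k_tt - 2.0 * k_st

rng = np.random.default_rng(0)
# Matched distributions give a near-zero discrepancy; a mean shift inflates it.
same = mmd2(rng.normal(0, 1, (64, 2)), rng.normal(0, 1, (64, 2)))
shifted = mmd2(rng.normal(0, 1, (64, 2)), rng.normal(2, 1, (64, 2)))
assert shifted > same
```

In DAN-style methods such a term, computed on intermediate features, is added to the task loss so that minimizing it pulls the two feature distributions together.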
Inspired by adversarial learning, (Ganin and Lempitsky, 2015; Long et al., 2017a; Zhang et al., 2019) impose a Gradient Reversal Layer (GRL) to better align domain distributions. In Domain Adversarial Neural Network (DANN) (Ganin and Lempitsky, 2015), the authors introduce a domain discriminator and learn features that are indistinguishable to it. In CDAN (Long et al., 2017a), the authors propose a conditional domain discriminator conditioned on domain-specific feature representations and classifier predictions, and implement discrepancy reduction via adversarial learning. To bridge the gap between theory and algorithms for domain adaptation, (Zhang et al., 2019) presents Margin Disparity Discrepancy (MDD) with rigorous generalization bounds, tailored to distribution comparison with an asymmetric margin loss and to minimax optimization for easier training.
Some UDA methods focus on characteristics of specific layers of deep neural networks for domain adaptation. In (Li et al., 2018), the authors assume that the layer weights of a neural network learn categorical information while the batch normalization statistics learn transferable information, so they propose AdaBN, which modulates all Batch Normalization statistics from the source to the target domain. In AFN (Xu et al., 2019a), the authors reveal that the feature norms of the target domain are much smaller than those of the source domain and propose to adaptively increase the feature norms, which yields significant transfer gains. However, prediction diversity is not explored, so the model tends to push examples close to the decision boundary, resulting in accumulated prediction errors. Batch Nuclear-norm Maximization (BNM) (Cui et al., 2020b), adopted in this paper, maintains both discriminability and diversity, leading to promising results in several transfer learning tasks such as semi-supervised learning, domain adaptation and open domain recognition.
2.2. Semi-supervised Domain Adaptation
Semi-Supervised Domain Adaptation (SSDA) (Ao et al., 2017; Donahue et al., 2013; Yao et al., 2015; Saito et al., 2019; Xu et al., 2019b; Kim and Kim, 2020; Li and Hospedales, 2020; Li et al., 2020b) is an extension of UDA with a few labeled target samples, which achieves much better performance. Exploiting the few target labels allows better domain alignment compared to purely unsupervised approaches. In (Donahue et al., 2013), the authors impose smoothness constraints on the classifier scores over the unlabeled target data, achieving better adaptation with conventional learning methods. In (Yao et al., 2015), the authors learn a subspace that manifests the underlying difference and commonness between the source and target domains, which reduces the data distribution mismatch. In (Ao et al., 2017), the authors estimate the soft label of a given labeled target sample with the source model and interpolate it with the hard label for target model supervision. The work in (Xu et al., 2019b) uses stochastic neighborhood embedding (d-SNE) to transform features into a common latent space for few-shot supervised learning, and uses metric learning to improve feature discrimination on the target domain. In (Saito et al., 2019), the authors point out that the weight vector of each class is an estimated prototype, and that the entropy on target samples represents the similarity between prototypes and target features. Based on this assumption, they first maximize the entropy of unlabeled target samples to move the weight vectors towards the target data, and then update the feature extractor by minimizing the entropy of unlabeled target samples, leading to higher discriminability. Recently, (Kim and Kim, 2020) raises a novel perspective of intra-domain discrepancy and proposes a framework consisting of attraction, perturbation, and exploration schemes to address it.
2.3. Model Adaptation
Domain adaptation usually requires large-scale source data, which is not always practical due to the risk of privacy violation in the source domain. Therefore, Model Adaptation (MA) (Liang et al., 2020a; Li et al., 2020a; Yang et al., 2020; Kundu et al., 2020; Liang et al., 2021b) has been proposed to handle domain adaptation when the source data is unavailable.
In (Liang et al., 2020a), the source data is only exploited to train the source model; the pre-trained model is then fine-tuned to learn source-like target representations. The key assumption in (Liang et al., 2020a) is that the pre-trained model consists of a feature encoding module and a hypothesis (classifier) module. By fixing the classifier module, the fine-tuned encoding module can produce better representations of the target data, since the source hypothesis encodes the distribution information of the unseen source data. In (Li et al., 2020a), the authors propose a collaborative class-conditional generative adversarial network, in which the prediction model is improved through generated target-style data; in turn, the prediction model provides more accurate guidance for the generator, so the two collaborate with each other. Liang et al. (Liang et al., 2021a) develop two types of non-parametric classifiers, with an auxiliary classifier for target data to improve the quality of pseudo labels when guiding the self-training process. In (Liang et al., 2020b), the authors propose an easy-to-hard labeling transfer strategy to improve the accuracy of less-confident predictions in the target domain. Yang et al. (Yang et al., 2020) handle this problem by deploying an additional classifier to align target features with the corresponding class prototypes of the source classifier. (Kundu et al., 2020) proposes a framework which exploits the knowledge of class separability and enhances robustness to out-of-distribution samples. In (Liang et al., 2021b), the model is provided as a black box to prevent generation techniques from leaking individual information. These UMA methods inherit the challenge of UDA that the generalization ability on the target domain may not be improved. Therefore, we propose SSHT to improve the generalization ability on the target domain with just a few labeled target samples.
3.1. Semi-supervised Source Hypothesis Transfer
Common notations and definitions of Semi-supervised Source Hypothesis Transfer (SSHT) are introduced here. Suppose that there are $n_s$ labeled samples $\mathcal{D}_s = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$ in the source domain. Similarly, we have $n_u$ unlabeled samples $\mathcal{D}_u = \{x_i^u\}_{i=1}^{n_u}$ and a small set of $n_l$ labeled samples $\mathcal{D}_l = \{(x_i^t, y_i^t)\}_{i=1}^{n_l}$ in the target domain. $n_u$ is usually much larger than $n_l$, since labeled data are more difficult to obtain.
Due to data privacy, the source data is unavailable in SSHT, but we can leverage the model trained on it. The model consists of a feature extractor and a classifier, whose parameters and weights are available. The goal of SSHT is to adapt the source model to the target domain with only a few labeled target samples and the unlabeled target samples. To address SSHT, we provide a simple but effective framework consisting of consistency learning (CL) and diversity learning (DL) modules. The overall framework is shown in Fig. 2. First, the unlabeled images are augmented with both weak and strong augmentations. We feed the augmented data into the network and use the predictions on the weakly augmented images as supervision for the strongly augmented ones to achieve prediction consistency. We maintain prediction diversity by batch nuclear-norm maximization on the outputs of all unlabeled augmented images. The source model is adapted in an end-to-end manner, and the collaboration between consistency learning and diversity learning moves the decision boundary away from labeled target samples towards unlabeled samples, improving the generalization ability of the adapted model.
3.2. Consistency Learning
The main challenge of model adaptation is the absence of source data, which makes it hard to estimate the distribution discrepancy between the two domains. Model adaptation without any labeled target sample is a complicated problem, since the model retains the decision boundary induced by the source information and is hard to fine-tune. With the assistance of labeled target samples, the source model can learn some discriminative information in the target domain. However, the model may tend to overfit the few labeled target data, resulting in an unreliable decision boundary.
To address the overfitting problem, some methods (Zhang et al., 2017; Verma et al., 2019; Cubuk et al., 2020; Sohn et al., 2020) have been proposed based on data augmentation in a semi-supervised learning manner. Typical consistency-regularization-based methods (Sajjadi et al., 2016; Laine and Aila, 2016) adopt the following loss:

$$\mathcal{L}_{cr} = \mathbb{E}_{x}\, \big\| p\big(y \mid a_1(x); \theta\big) - p\big(y \mid a_2(x); \theta\big) \big\|_2^2,$$

where $x$ is an unlabeled image, $a_1(\cdot)$ and $a_2(\cdot)$ are different random augmentations, and $\theta$ denotes the parameters of the model.
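This consistency objective can be sketched in a few lines of numpy (an illustration of the loss computation only; the logits here stand in for the outputs of a model under two random augmentations, and all names are ours):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def consistency_loss(logits_aug1, logits_aug2):
    """Mean squared L2 distance between the predicted class distributions
    of two randomly augmented views of the same unlabeled batch."""
    p1, p2 = softmax(logits_aug1), softmax(logits_aug2)
    return ((p1 - p2) ** 2).sum(-1).mean()

# Identical predictions incur zero loss; any disagreement is penalized.
logits = np.array([[2.0, 0.5, -1.0]])
assert consistency_loss(logits, logits) == 0.0
assert consistency_loss(logits, logits + np.array([0.0, 1.0, 0.0])) > 0.0
```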
Besides, self-training with pseudo-labeling is also a useful technique for semi-supervised learning. FixMatch (Sohn et al., 2020) combines the two approaches to SSL: consistency regularization and pseudo-labeling. FixMatch applies separate weak and strong augmentations when performing consistency regularization. Specifically, for each unlabeled sample $x$ in the target domain, a weakly augmented view $\alpha(x)$ and a strongly augmented view $\mathcal{A}(x)$ are produced. The weak augmentation $\alpha(\cdot)$ includes image flipping and image translation, while the strong augmentation $\mathcal{A}(\cdot)$ utilizes the technique proposed in (Cubuk et al., 2020). Consistency regularization incorporated with pseudo-labeling is implemented by treating the prediction on a weakly augmented image as a pseudo label and enforcing the prediction on the strongly augmented one towards that pseudo label. However, the pseudo labels may contain wrong labels, resulting in error accumulation. Therefore, to mitigate the impact of incorrect pseudo labels, only samples with highly confident predictions are selected for consistency regularization. The consistency regularization loss on unlabeled images is defined as:

$$\mathcal{L}_{con} = \frac{1}{n_u}\sum_{i=1}^{n_u} \mathbb{1}\big(\max_c q_{i,c} \ge \tau\big)\, H\big(\hat{q}_i,\, p(y \mid \mathcal{A}(x_i); \theta)\big), \qquad q_i = p\big(y \mid \alpha(x_i); \theta\big),$$

where $n_u$ is the number of unlabeled target samples, $\tau$ is the confidence threshold, $\hat{q}_i$ is the one-hot vector of $\arg\max_c q_{i,c}$, and $H(p, q)$ denotes the cross-entropy between two distributions $p$ and $q$. By optimizing the consistency loss $\mathcal{L}_{con}$, the decision boundary is pushed away from the labeled samples; the model thereby becomes insensitive to image perturbations and better at classifying unlabeled samples.
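The thresholded pseudo-label consistency described above can be sketched as follows (a minimal numpy illustration; the logits stand in for the model outputs on the weak and strong views, and the function and variable names are ours):

```python
import numpy as np

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def fixmatch_loss(logits_weak, logits_strong, tau=0.8):
    """Cross-entropy of strong-view predictions against hard pseudo labels
    from the weak view, keeping only samples whose confidence exceeds tau."""
    q = softmax(logits_weak)               # pseudo-label distributions
    mask = q.max(-1) >= tau                # confidence threshold indicator
    pseudo = q.argmax(-1)                  # hard (one-hot) pseudo labels
    log_p = np.log(softmax(logits_strong))
    ce = -log_p[np.arange(len(pseudo)), pseudo]
    return (ce * mask).mean()              # averaged over the whole batch

# Sample 0 is confident (kept); sample 1 is uncertain (masked out).
weak = np.array([[4.0, 0.0, 0.0], [0.2, 0.0, 0.1]])
strong = np.array([[1.0, 0.0, 0.0], [0.0, 3.0, 0.0]])
assert fixmatch_loss(weak, strong) > 0.0
assert fixmatch_loss(weak[1:], strong[1:]) == 0.0
```

Note how the mask removes uncertain samples entirely; this is exactly the mechanism that, as argued later, suppresses minority categories and motivates the diversity term.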
To ensure the discriminability of the model, we adopt the standard cross-entropy loss on the labeled target data. The classification loss is defined as:

$$\mathcal{L}_{cls} = -\frac{1}{n_l}\sum_{i=1}^{n_l} \log p\big(y_i \mid x_i; \theta\big),$$

where $\{(x_i, y_i)\}_{i=1}^{n_l}$ are the labeled target samples.
The loss minimized by FixMatch is simply $\mathcal{L}_{cls} + \lambda_u \mathcal{L}_{con}$, where $\lambda_u$ is a fixed scalar hyper-parameter denoting the relative weight of the unlabeled loss.
3.3. Diversity Learning
Though the selection mechanism is effective in mitigating the impact of incorrect pseudo labels, it worsens prediction diversity. Therefore, we integrate an effective technique to maintain both the discriminability and the diversity of predictions. In domain adaptation, entropy minimization (Grandvalet et al., 2005) is widely adopted to enhance discriminability. However, simply minimizing entropy makes the trained model tend to classify samples near the decision boundary into the majority categories. Such unreliable classifiers misclassify samples of minority categories, exhibiting reduced prediction diversity. Though there are a few labeled target data in SSHT, they are insufficient to restore prediction diversity.
To maintain the discriminability and diversity of predictions, we adopt Batch Nuclear-norm Maximization (BNM) (Cui et al., 2020b). Diversity can be measured by the number of responded categories, i.e., the rank of the prediction matrix. Since the nuclear-norm is a convex approximation of the matrix rank, maximizing the batch nuclear-norm enlarges the rank and thereby increases diversity. BNM is performed on the matrix of classification responses for a batch of unlabeled samples, without any supervision.
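The norm-rank relationships underlying BNM can be checked numerically (an illustrative numpy snippet of ours, not code from the BNM paper): for any matrix $A$, $\|A\|_F \le \|A\|_\star \le \sqrt{\mathrm{rank}(A)}\,\|A\|_F$, which is why the nuclear norm simultaneously upper-bounds the Frobenius norm (discriminability) and serves as the tightest convex surrogate of the rank (diversity).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 5))            # e.g. a batch of 8 responses over 5 classes
sv = np.linalg.svd(A, compute_uv=False)

nuclear = sv.sum()                     # nuclear norm: sum of singular values
frobenius = np.sqrt((sv ** 2).sum())   # Frobenius norm: sqrt of sum of squares
rank = np.linalg.matrix_rank(A)

# ||A||_F <= ||A||_* <= sqrt(rank(A)) * ||A||_F
assert frobenius <= nuclear <= np.sqrt(rank) * frobenius
```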
The loss function of BNM is defined as follows:

$$\mathcal{L}_{bnm} = -\frac{1}{B}\,\big\| G(X) \big\|_\star,$$

where $G(X) \in \mathbb{R}^{B \times C}$ is the output matrix with respect to the input batch $X$, and $B$ is the batch size of randomly selected samples. $\|\cdot\|_\star$ denotes the nuclear-norm, i.e., the sum of all singular values of the matrix. In our setting, each unlabeled image yields two augmented views, $\alpha(x)$ and $\mathcal{A}(x)$. The total loss for diversity learning is then combined as follows:

$$\mathcal{L}_{div} = -\frac{1}{2B}\,\big\| \big[ G(\alpha(X));\, G(\mathcal{A}(X)) \big] \big\|_\star,$$
where $[\,\cdot\,;\,\cdot\,]$ denotes concatenation. Minimizing the diversity loss enforces the model to push the decision boundary into low-density regions without losing diversity. In (Cui et al., 2020b), the authors reveal that the key insight of BNM is to sacrifice a certain level of the prediction hit-rate on majority categories in order to enhance the hit-rate on minority categories; the diversity of predictions is thus retained. To maintain discriminability, we minimize the diversity loss together with the classification loss and consistency loss, so the model tends to produce more diverse and accurate predictions.
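The diversity loss can be sketched as follows (a minimal numpy illustration of ours; the inputs are the softmax prediction matrices of the two augmented views, each of shape batch x classes):

```python
import numpy as np

def diversity_loss(probs_weak, probs_strong):
    """Negative batch nuclear-norm of the concatenated prediction matrices of
    the two augmented views; minimizing this maximizes the nuclear norm."""
    g = np.concatenate([probs_weak, probs_strong], axis=0)      # shape (2B, C)
    b = probs_weak.shape[0]
    return -np.linalg.svd(g, compute_uv=False).sum() / (2 * b)

# A diverse prediction matrix (one class per sample) has a larger nuclear
# norm, hence a lower loss, than a collapsed matrix predicting one class.
diverse = np.eye(4)
collapsed = np.tile(np.array([[1.0, 0.0, 0.0, 0.0]]), (4, 1))
assert diversity_loss(diverse, diverse) < diversity_loss(collapsed, collapsed)
```

In practice the singular values would be computed on the live prediction matrix inside the training graph (e.g. with a differentiable SVD) so that gradients flow back to the network.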
The total loss of the proposed CDL is defined as follows:

$$\mathcal{L} = \mathcal{L}_{cls} + \lambda_{con}\,\mathcal{L}_{con} + \lambda_{div}\,\mathcal{L}_{div},$$

where $\lambda_{con}$ and $\lambda_{div}$ control the trade-off among the classification loss, consistency loss and diversity loss. The classification loss provides accurate supervision for training a model with high discriminability. The consistency regularization loss prevents the model from overfitting the insufficient labeled target data, gaining better discriminability on unlabeled data. The diversity loss maintains both discriminability and diversity. The total loss encourages the trained model to generalize well on the target domain.
In this section, we conduct extensive experiments on typical domain adaptation benchmarks to verify the effectiveness of our method. For different tasks with the same source domain, we train a unique source model with the same source data. The source data is then not used during adaptation. The results of recent state-of-the-art domain adaptation methods are presented for comparison or as references, since most of these methods are not applicable in the absence of source data during the adaptation process.
4.1. Datasets and settings
DomainNet (Peng et al., 2019) is a recent benchmark dataset for large-scale domain adaptation with 345 classes across six domains. Following MME (Saito et al., 2019), we adopt 7 scenarios built by selecting 4 domains (Real, Clipart, Painting, Sketch) and 126 classes for fair comparison. The dataset is a new benchmark for evaluating semi-supervised domain adaptation methods.
Office-Home (Venkateswara et al., 2017) is a typical domain adaptation benchmark dataset, consisting of 15,500 images from 65 categories, mostly from office or home environments. The images are sampled from four distinct domains: Art, Clipart, Product, and Real_World. The methods are evaluated on 12 scenarios in total.
| Method | R→C | R→P | P→C | C→S | S→P | R→S | P→R | MEAN |
|---|---|---|---|---|---|---|---|---|
| S+T (He et al., 2016) | 60.0 | 62.2 | 59.4 | 55.0 | 59.5 | 50.1 | 73.9 | 60.0 |
| DANN (Ganin and Lempitsky, 2015) | 59.8 | 62.8 | 59.6 | 55.4 | 59.9 | 54.9 | 72.2 | 60.7 |
| ADR (Saito et al., 2017) | 60.7 | 61.9 | 60.7 | 54.4 | 59.9 | 51.1 | 74.2 | 60.4 |
| CDAN (Long et al., 2017a) | 69.0 | 67.3 | 68.4 | 57.8 | 65.3 | 59.0 | 78.5 | 66.5 |
| ENT (Grandvalet et al., 2005) | 71.0 | 69.2 | 71.1 | 60.0 | 62.1 | 61.1 | 78.6 | 67.6 |
| MME (Saito et al., 2019) | 72.2 | 69.7 | 71.7 | 61.8 | 66.8 | 61.9 | 78.5 | 68.9 |
| MixMatch (Berthelot et al., 2019) | 72.6 | 68.8 | 68.7 | 62.7 | 67.1 | 65.5 | 78.7 | 69.2 |
| Meta-MME (Li and Hospedales, 2020) | 73.5 | 70.3 | 72.8 | 62.8 | 68.0 | 63.8 | 79.2 | 70.1 |
| BNM (Cui et al., 2020b) | 72.7 | 70.2 | 72.5 | 63.9 | 68.8 | 63.0 | 80.3 | 70.2 |
| GVBG (Cui et al., 2020c) | 73.3 | 68.7 | 72.9 | 65.3 | 66.6 | 68.5 | 79.2 | 70.6 |
| HDA (Cui et al., 2020a) | 73.9 | 69.1 | 73.0 | 66.3 | 67.5 | 69.5 | 79.7 | 71.3 |
| MME+ELP (Huang et al., 2020) | 74.9 | 72.1 | 74.4 | 64.3 | 69.7 | 64.9 | 81.0 | 71.6 |
| APE (Kim and Kim, 2020) | 76.6 | 72.1 | 76.7 | 63.1 | 66.1 | 67.8 | 79.4 | 71.7 |
| TML (Liu et al., 2020) | 75.8 | 74.5 | 75.1 | 64.3 | 69.7 | 64.4 | 82.6 | 72.3 |
| ATDOC (Liang et al., 2021a) | 76.9 | 72.5 | 74.2 | 66.7 | 70.8 | 64.6 | 81.2 | 72.4 |
Office-31 (Saenko et al., 2010) is a standard domain adaptation dataset which contains 4,110 images from 31 categories in three domains: Amazon (A), with images collected from amazon.com, and Webcam (W) and DSLR (D), with images shot by a web camera and a digital SLR camera, respectively. Following TML (Liu et al., 2020), we evaluate the methods on two scenarios, W→A and D→A, for fair comparison.
4.2. Implementation details
| Method | A→C | A→P | A→R | C→A | C→P | C→R | P→A | P→C | P→R | R→A | R→C | R→P | MEAN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S+T (He et al., 2016) | 54.0 | 73.1 | 74.2 | 57.6 | 72.3 | 68.3 | 63.5 | 53.8 | 73.1 | 67.8 | 55.7 | 80.8 | 66.2 |
| DANN (Ganin and Lempitsky, 2015) | 54.7 | 68.3 | 73.8 | 55.1 | 67.5 | 67.1 | 56.6 | 51.8 | 69.2 | 65.2 | 57.3 | 75.5 | 63.5 |
| CDAN (Long et al., 2017a) | 59.2 | 74.1 | 74.1 | 60.5 | 74.5 | 70.7 | 61.4 | 58.1 | 76.8 | 67.1 | 61.4 | 80.7 | 68.2 |
| BNM (Cui et al., 2020b) | 62.2 | 78.6 | 78.9 | 65.0 | 78.0 | 77.8 | 65.2 | 60.4 | 80.3 | 69.0 | 63.4 | 84.2 | 71.9 |
| ENT (Grandvalet et al., 2005) | 61.3 | 79.5 | 79.1 | 64.7 | 79.1 | 76.4 | 63.9 | 60.5 | 79.9 | 70.2 | 62.6 | 85.7 | 71.9 |
| MME (Saito et al., 2019) | 63.6 | 79.0 | 79.7 | 67.2 | 79.3 | 76.6 | 65.5 | 64.6 | 80.1 | 71.3 | 64.6 | 85.5 | 73.0 |
| APE (Kim and Kim, 2020) | 63.9 | 81.1 | 80.2 | 66.6 | 79.9 | 76.8 | 66.1 | 65.2 | 82.0 | 73.4 | 66.4 | 86.2 | 74.0 |
All the experiments are implemented with PyTorch (Paszke et al., 2017). For fair comparisons, we use the same backbones adopted in previous SSDA and UMA methods. For SSDA, ResNet-34 (He et al., 2016) pre-trained on ImageNet (Deng et al., 2009) is widely adopted. Thus, for SSHT, we train the ImageNet-pre-trained ResNet-34 on the source domain to obtain the source model, the same as the UMA methods (Liang et al., 2020a; Li et al., 2020a). Following (Liu et al., 2020), we use VGG-16 (Simonyan and Zisserman, 2014) pre-trained on ImageNet (Deng et al., 2009) for the two scenarios W→A and D→A of Office-31 to evaluate the methods.
All the SSDA and SSHT tasks are in the three-shot setting. For UMA, we use the pre-trained ResNet-50 (He et al., 2016) as the backbone and then train the model on the source domain. Following (Liang et al., 2020a; Li et al., 2020a), we split the labeled source data into a training set and a validation set; the provided model is trained on the training set and validated on the validation set to avoid overfitting to the source data. Methods such as ENT (Grandvalet et al., 2005), MME (Saito et al., 2019) and BNM (Cui et al., 2020b) are implemented with the same hyper-parameters as (Cui et al., 2020b). We use the SGD optimizer with learning rate 0.005, Nesterov momentum 0.9, and weight decay 0.0005. We set the consistency weight $\lambda_{con}$ to 2.5 and the diversity weight $\lambda_{div}$ to 1.0 for all datasets. We set the batch size to 48, 96 and 48 on Office-Home, DomainNet and Office-31, respectively. We train the proposed CDL for 30 epochs in total. The threshold $\tau$ is set to 0.8 for selecting samples with highly confident predictions. More details can be found in our released code.
4.3. Compared methods
| Method | W→A | D→A | MEAN |
|---|---|---|---|
| CDAN (Long et al., 2017a) | 74.4 | 71.4 | 72.9 |
| S+T (He et al., 2016) | 73.2 | 73.3 | 73.3 |
| ADR (Saito et al., 2017) | 73.3 | 74.1 | 73.7 |
| DANN (Ganin and Lempitsky, 2015) | 75.4 | 74.6 | 75.0 |
| ENT (Grandvalet et al., 2005) | 75.4 | 75.1 | 75.3 |
| MME (Saito et al., 2019) | 76.3 | 77.6 | 77.0 |
| TML (Liu et al., 2020) | 76.6 | 77.6 | 77.1 |
| Method | A→C | A→P | A→R | C→A | C→P | C→R | P→A | P→C | P→R | R→A | R→C | R→P | MEAN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S (He et al., 2016) | 44.6 | 67.3 | 74.8 | 52.7 | 62.7 | 64.8 | 53.0 | 40.6 | 73.2 | 65.3 | 45.4 | 78.0 | 60.2 |
| DANN (Ganin and Lempitsky, 2015) | 45.6 | 59.3 | 70.1 | 47.0 | 58.5 | 60.9 | 46.1 | 43.7 | 68.5 | 63.2 | 51.8 | 76.8 | 57.6 |
| DAN (Long et al., 2015) | 43.6 | 57.0 | 67.9 | 45.8 | 56.5 | 60.4 | 44.0 | 43.6 | 67.7 | 63.1 | 51.5 | 74.3 | 56.3 |
| CDAN (Long et al., 2017a) | 50.7 | 70.6 | 76.0 | 57.6 | 70.0 | 70.0 | 57.4 | 50.9 | 77.3 | 70.9 | 56.7 | 81.6 | 65.8 |
| SAFN (Xu et al., 2019a) | 52.0 | 71.7 | 76.3 | 64.2 | 69.9 | 71.9 | 63.7 | 51.4 | 77.1 | 70.9 | 57.1 | 81.5 | 67.3 |
| MDD (Zhang et al., 2019) | 54.9 | 73.7 | 77.8 | 60.0 | 71.4 | 71.8 | 61.2 | 53.6 | 78.1 | 72.5 | 60.2 | 82.3 | 68.1 |
| SHOT (Liang et al., 2020a) | 57.3 | 78.5 | 81.4 | 67.9 | 78.5 | 78.0 | 68.1 | 56.1 | 82.1 | 73.4 | 59.6 | 84.4 | 72.1 |
| ATDOC (Liang et al., 2021a) | 58.3 | 78.8 | 82.3 | 69.4 | 78.2 | 78.2 | 67.1 | 56.0 | 82.7 | 72.0 | 58.2 | 85.5 | 72.2 |
| SHOT++ (Liang et al., 2020b) | 58.1 | 79.5 | 82.4 | 68.6 | 79.9 | 79.3 | 68.6 | 57.2 | 83.0 | 74.3 | 60.4 | 85.1 | 73.0 |
| Method | A→C | A→P | A→R | C→A | C→P | C→R | MEAN |
|---|---|---|---|---|---|---|---|
| ENT (w/ data) (Grandvalet et al., 2005) | 61.3 | 79.5 | 79.1 | 64.7 | 79.1 | 76.4 | 73.4 |
| ENT (w/ model) (Grandvalet et al., 2005) | 58.3 | 78.0 | 78.5 | 63.4 | 77.4 | 75.1 | 71.8 |
| MME (w/ data) (Saito et al., 2019) | 63.6 | 79.0 | 79.7 | 67.2 | 79.3 | 76.6 | 74.2 |
| MME (w/ model) (Saito et al., 2019) | 51.4 | 69.5 | 67.4 | 54.7 | 68.5 | 63.6 | 62.5 |
| BNM (w/ data) (Cui et al., 2020b) | 62.2 | 78.6 | 78.9 | 65.0 | 78.1 | 77.8 | 73.4 |
| BNM (w/ model) (Cui et al., 2020b) | 61.0 | 78.8 | 80.2 | 65.6 | 78.9 | 78.0 | 73.8 |
| CDL (w/ data) | 63.0 | 81.0 | 80.1 | 67.2 | 80.6 | 80.0 | 75.3 |
| CDL (w/ model) | 63.0 | 80.2 | 80.1 | 68.7 | 82.0 | 78.8 | 75.4 |
| Method | A→C | A→P | A→R | C→A | C→P | C→R | P→A | P→C | P→R | R→A | R→C | R→P | MEAN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CDL (w/o CL) | 60.9 | 79.0 | 80.3 | 66.1 | 79.0 | 78.9 | 66.6 | 61.7 | 80.4 | 70.3 | 64.3 | 85.2 | 72.7 |
| CDL (w/o DL) | 54.9 | 77.4 | 75.5 | 62.4 | 76.8 | 73.6 | 62.3 | 57.5 | 76.4 | 68.1 | 59.0 | 83.3 | 68.9 |
SSDA. We compare our method with SSDA methods and with the UDA methods compared in previous works (Saito et al., 2019; Kim and Kim, 2020). DANN (Ganin and Lempitsky, 2015) is a popular method employing a domain classifier to match feature distributions. ADR (Saito et al., 2017) utilizes adversarial dropout regularization to encourage the generator to output more discriminative features for the target domain. CDAN (Long et al., 2017a) performs distribution alignment with a class-conditioned domain discriminator. All the above methods are implemented and evaluated under the SSDA setting. S+T (He et al., 2016) is a vanilla model trained with the labeled source and labeled target data without using unlabeled target data. BNM (Cui et al., 2020b) applies nuclear-norm maximization to each batch of samples to maintain the discriminability and diversity of predictions. ENT (Grandvalet et al., 2005) can be applied to SSDA via entropy minimization. MME (Saito et al., 2019) adopts a minimax game on the entropy of unlabeled data. APE (Kim and Kim, 2020) aligns features via alleviation of the intra-domain discrepancy. MixMatch (Berthelot et al., 2019) is a semi-supervised learning method that can also be applied to SSDA. Meta-MME (Li and Hospedales, 2020) incorporates meta-learning to search for a better initial condition in domain adaptation. MME+ELP (Huang et al., 2020) tackles the lack of discriminability by effective inter-domain and intra-domain semantic information propagation. GVBG (Cui et al., 2020c) proposes a novel gradually vanishing bridge to connect either the source or target domain to an intermediate domain. HDA (Cui et al., 2020a) devises a heuristic framework to conduct domain adaptation. TML (Liu et al., 2020) proposes a novel reinforcement-learning-based selective pseudo-labeling method for SSDA, which employs deep Q-learning to train an agent to select more representative and accurate pseudo-labeled samples for model training. ATDOC (Liang et al., 2021a) develops two types of non-parametric classifiers, with an auxiliary classifier for target data to improve the quality of pseudo labels. For fair comparison, all the methods use the same backbone architecture as our method.
Unsupervised model adaptation. In addition to DANN (Ganin and Lempitsky, 2015), ATDOC (Liang et al., 2021a), and CDAN (Long et al., 2017a), we compare our method with other UDA methods such as DAN (Long et al., 2015), MDD (Zhang et al., 2019), SAFN (Xu et al., 2019a), SHOT (Liang et al., 2020a), and SHOT++ (Liang et al., 2020b). DAN (Long et al., 2015) utilizes a multi-kernel selection method for better mean-embedding matching and adapts multiple layers to learn more transferable features. MDD (Zhang et al., 2019) is a measurement with rigorous generalization bounds, tailored to distribution comparison with an asymmetric margin loss and to minimax optimization for easier training. SAFN (Xu et al., 2019a) proposes feature-norm adaptation to better discriminate the source and target features. SHOT (Liang et al., 2020a) addresses unsupervised model adaptation with self-supervised learning. SHOT++ (Liang et al., 2020b) proposes a labeling transfer strategy to improve the accuracy of less-confident predictions on the basis of SHOT.
Comparison with SSDA methods. The results of our CDL in the SSHT setting are compared with other methods that can access the source data. The comparison results on DomainNet and Office-Home are shown in Table 2 and Table 3, respectively. On DomainNet, our CDL outperforms the state-of-the-art method ATDOC (Liang et al., 2021a) by 0.7% on average. On the task P→R, our CDL outperforms ATDOC by a significant 1.9%, and overall CDL outperforms ATDOC on 6 of the 7 transfer tasks. Compared with the other methods, our method achieves the best results on 3 tasks; although it shows weakness on some tasks such as R→S, it outperforms the other methods on average. As shown in Table 3, our method CDL achieves comparable results against state-of-the-art SSDA methods on Office-Home and shows the best accuracy on 6 of the 12 tasks. We also evaluate our method on Office-31 under the setting of (Liu et al., 2020). The comparison results on Office-31 in Table 4 show that our model-based CDL significantly outperforms the other source-data-based methods in both scenarios, and it outperforms the state-of-the-art TML by 1.0% on average. It is worth noting that the accurately labeled source data are accessible to the SSDA methods, which makes transfer easier than in SSHT. Despite the absence of source data, the superiority of CDL over state-of-the-art SSDA methods validates its effectiveness.
Comparison with UMA methods. The difference between SSHT and UMA is that SSHT has extra labeled data for model adaptation. We compare our CDL on Office-Home with previous methods tailored or applicable for UMA. The results in Table 5 show that our CDL outperforms the state-of-the-art method SHOT++ by 2.7% on average. CDL yields this improvement by effectively learning an invariant representation with a few target supervisions. It is worth noting that CDL outperforms SHOT++ on 11 of the 12 transfer tasks. The superiority of CDL over UMA methods validates that even a few labeled target data can significantly improve performance.
Effectiveness of adaptation. To validate that our method is effective for the SSHT problem, we evaluate it on six closed-set SSDA tasks without source data. The results are shown in Table 6. SSL+CL denotes semi-supervised learning with consistency learning, ENT (w/ data) denotes adaptation from source data, and ENT (w/ model) denotes adaptation from the source model. Our CDL framework aims to address the SSHT problem and can also be applied to SSDA. Comparing SSL+CL with the others shows that adaptation based on data or a model is superior to semi-supervised learning with consistency learning alone in the target domain. BNM can handle the difference between data and model, while the accuracies of the other methods decrease. Our CDL outperforms the others in both SSDA and SSHT.
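The entropy-minimization objective behind the ENT baselines (Grandvalet et al., 2005) is easy to state. As a minimal sketch (the function name is ours):

```python
import torch
import torch.nn.functional as F


def entropy_loss(logits: torch.Tensor) -> torch.Tensor:
    """Mean Shannon entropy of the softmax predictions over a batch.

    Minimizing this on unlabeled data pushes predictions toward
    confident (low-entropy) outputs, as in ENT (Grandvalet et al.,
    2005); the known drawback is that it can collapse diversity by
    assigning many samples to a few dominant classes.
    """
    p = F.softmax(logits, dim=1)
    log_p = F.log_softmax(logits, dim=1)          # numerically stable log
    return -(p * log_p).sum(dim=1).mean()
```

This collapse tendency is precisely what the diversity term in CDL is meant to counteract.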
4.5. Ablation Study
Since our CDL framework comprises a simple combination of consistency learning (CL) and diversity learning (DL), we perform an extensive ablation study to better understand why it performs favorably against state-of-the-art methods in SSDA and UMA. We evaluate two variants of CDL: (1) CDL (w/o CL), which adapts the model without learning the consistency of unlabeled images, optimizing only the classification loss of labeled images and the diversity loss; and (2) CDL (w/o DL), which optimizes only the consistency loss and the classification loss of labeled images during training. The results of the ablation study are shown in Table 7. The two components are designed reasonably: when either component is removed, performance degrades. It is noteworthy that CDL (w/o CL) outperforms the full CDL method on two tasks, showing the effectiveness of maintaining diversity in model adaptation. Combining CL and DL, our CDL obtains a 1.6% improvement on average, which validates its effectiveness.
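The ablation variants correspond to dropping terms from the combined objective. The following is a hypothetical sketch of that structure, with illustrative stand-ins (an MSE consistency term between two augmented views, a BNM diversity term, and illustrative weights), not the paper's exact losses.

```python
import torch
import torch.nn.functional as F


def cdl_loss(logits_lab, labels, logits_weak, logits_strong,
             use_cl=True, use_dl=True, lam_cl=1.0, lam_dl=1.0):
    """Hypothetical sketch of the CDL objective and its ablations.

    - cross-entropy on the few labeled target samples;
    - CL: consistency between two random augmentations of the same
      unlabeled images (here a simple MSE on softmax outputs);
    - DL: diversity via batch nuclear-norm maximization.
    Setting use_cl / use_dl to False gives CDL (w/o CL) / CDL (w/o DL).
    The weights lam_cl and lam_dl are illustrative, not the paper's.
    """
    loss = F.cross_entropy(logits_lab, labels)
    if use_cl:
        p_weak = F.softmax(logits_weak, dim=1).detach()  # weak view as target
        p_strong = F.softmax(logits_strong, dim=1)
        loss = loss + lam_cl * F.mse_loss(p_strong, p_weak)
    if use_dl:
        p_u = F.softmax(logits_strong, dim=1)
        loss = loss - lam_dl * torch.norm(p_u, p="nuc") / p_u.shape[0]
    return loss
```

Under this sketch, removing DL simply drops the subtracted nuclear-norm term, and removing CL drops the consistency penalty, matching the two ablated variants in Table 7.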
4.6. Further remarks
Effectiveness of maintaining diversity. To validate that our method can maintain diversity during model adaptation, we compare it with the variant CDL (w/o DL) and with entropy minimization. We show the diversity ratio on the Office-Home tasks A→C and P→A in Fig. 4. Diversity is measured as the number of predicted categories in a randomly sampled batch, so the diversity ratio is the predicted diversity divided by the ground-truth diversity. As shown in Fig. 4(a), the diversity ratio of CDL is larger than that of the others, while CDL (w/o DL) shows a comparable diversity ratio on task A→C. As shown in Fig. 4(b), CDL (w/o DL) shows a low diversity ratio on the harder task P→A, while our CDL still maintains a large diversity ratio.
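The diversity ratio described above can be computed directly from a batch of predictions; a small sketch (the function name is ours):

```python
import torch


def diversity_ratio(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Number of distinct predicted categories in a batch divided by
    the number of distinct ground-truth categories in that batch, as
    used for the diversity comparison in the text."""
    pred_classes = logits.argmax(dim=1).unique().numel()
    true_classes = labels.unique().numel()
    return pred_classes / true_classes
```

A model that collapses onto a few dominant classes yields a ratio well below 1, which is the failure mode plain entropy minimization tends to exhibit.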
Parameter sensitivity. We evaluate the effects of the two trade-off parameters in the SSHT task, which control the balance among the consistency loss, the diversity loss, and the classification loss. We evaluate several combinations of the two parameters on the tasks A→C and C→A on Office-Home. As shown in Figure 3, an appropriate combination of the two parameters yields good transfer performance in model adaptation. This justifies our motivation of learning an invariant representation by encouraging consistency while maintaining diversity, as a good trade-off among these losses can promote transfer performance.
In this paper, we propose a novel Semi-supervised Source Hypothesis Transfer (SSHT) task to fully utilize a few labeled target data and inherit knowledge from the source model. The insufficient labeled target data may increase the risk of misclassification in the target domain and reduce the prediction diversity. To tackle these issues, we present the Consistency and Diversity Learning (CDL) framework for SSHT. By encouraging consistency regularization between two random augmentations of unlabeled data, the model generalizes well in the target domain. In addition, we integrate Batch Nuclear-norm Maximization (BNM) to enhance diversity. Experimental results on multiple domain adaptation benchmarks show that our method outperforms existing state-of-the-art SSDA methods and unsupervised model adaptation methods.
We also conduct an experiment on VisDA-2017 for UDA and test MME and CDL under the SSHT setting. The results are shown in Table 8.
- Ao et al. (2017) Shuang Ao, Xiang Li, and Charles Ling. 2017. Fast generalized distillation for semi-supervised domain adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31.
- Berthelot et al. (2019) David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin Raffel. 2019. Mixmatch: A holistic approach to semi-supervised learning. arXiv preprint arXiv:1905.02249 (2019).
- Combes et al. (2020) Remi Tachet des Combes, Han Zhao, Yu-Xiang Wang, and Geoff Gordon. 2020. Domain adaptation with conditional distribution matching and generalized label shift. arXiv preprint arXiv:2003.04475 (2020).
- Cubuk et al. (2020) Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. 2020. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops.
- Cui et al. (2020a) Shuhao Cui, Xuan Jin, Shuhui Wang, Yuan He, and Qingming Huang. 2020a. Heuristic Domain Adaptation. In Advances in Neural Information Processing Systems.
- Cui et al. (2020b) Shuhao Cui, Shuhui Wang, Junbao Zhuo, Liang Li, Qingming Huang, and Qi Tian. 2020b. Towards discriminability and diversity: Batch nuclear-norm maximization under label insufficient situations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3941–3950.
- Cui et al. (2020c) Shuhao Cui, Shuhui Wang, Junbao Zhuo, Chi Su, Qingming Huang, and Qi Tian. 2020c. Gradually vanishing bridge for adversarial domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12455–12464.
- Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition. IEEE, 248–255.
- Donahue et al. (2013) Jeff Donahue, Judy Hoffman, Erik Rodner, Kate Saenko, and Trevor Darrell. 2013. Semi-supervised domain adaptation with instance constraints. In Proceedings of the IEEE conference on computer vision and pattern recognition. 668–675.
- Ganin and Lempitsky (2015) Yaroslav Ganin and Victor Lempitsky. 2015. Unsupervised domain adaptation by backpropagation. In International conference on machine learning. PMLR, 1180–1189.
- Gong et al. (2012) Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman. 2012. Geodesic flow kernel for unsupervised domain adaptation. In 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2066–2073.
- Grandvalet et al. (2005) Yves Grandvalet, Yoshua Bengio, et al. 2005. Semi-supervised learning by entropy minimization. In CAP. 281–296.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
- Hoffman et al. (2018) Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei Efros, and Trevor Darrell. 2018. Cycada: Cycle-consistent adversarial domain adaptation. In International conference on machine learning. PMLR, 1989–1998.
- Huang et al. (2020) Zhiyong Huang, Kekai Sheng, Weiming Dong, Xing Mei, Chongyang Ma, Feiyue Huang, Dengwen Zhou, and Changsheng Xu. 2020. Effective Label Propagation for Discriminative Semi-Supervised Domain Adaptation. arXiv preprint arXiv:2012.02621 (2020).
- Kim and Kim (2020) Taekyung Kim and Changick Kim. 2020. Attract, Perturb, and Explore: Learning a Feature Alignment Network for Semi-supervised Domain Adaptation. In European Conference on Computer Vision. Springer, 591–607.
- Kundu et al. (2020) Jogendra Nath Kundu, Naveen Venkat, R Venkatesh Babu, et al. 2020. Universal source-free domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4544–4553.
- Laine and Aila (2016) Samuli Laine and Timo Aila. 2016. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242 (2016).
- Li et al. (2020b) Bo Li, Yezhen Wang, Shanghang Zhang, Dongsheng Li, Trevor Darrell, Kurt Keutzer, and Han Zhao. 2020b. Learning Invariant Representations and Risks for Semi-supervised Domain Adaptation. arXiv preprint arXiv:2010.04647 (2020).
- Li and Hospedales (2020) Da Li and Timothy Hospedales. 2020. Online meta-learning for multi-source and semi-supervised domain adaptation. In European Conference on Computer Vision. Springer, 382–403.
- Li et al. (2020a) Rui Li, Qianfen Jiao, Wenming Cao, Hau-San Wong, and Si Wu. 2020a. Model adaptation: Unsupervised domain adaptation without source data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9641–9650.
- Li et al. (2018) Yanghao Li, Naiyan Wang, Jianping Shi, Xiaodi Hou, and Jiaying Liu. 2018. Adaptive batch normalization for practical domain adaptation. Pattern Recognition 80 (2018), 109–117.
- Liang et al. (2020a) Jian Liang, Dapeng Hu, and Jiashi Feng. 2020a. Do we really need to access the source data? source hypothesis transfer for unsupervised domain adaptation. In International Conference on Machine Learning. PMLR, 6028–6039.
- Liang et al. (2021a) Jian Liang, Dapeng Hu, and Jiashi Feng. 2021a. Domain Adaptation with Auxiliary Target Domain-Oriented Classifier. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Liang et al. (2021b) Jian Liang, Dapeng Hu, Ran He, and Jiashi Feng. 2021b. Distill and Fine-tune: Effective Adaptation from a Black-box Source Model. arXiv preprint arXiv:2104.01539 (2021).
- Liang et al. (2020b) Jian Liang, Dapeng Hu, Yunbo Wang, Ran He, and Jiashi Feng. 2020b. Source Data-absent Unsupervised Domain Adaptation through Hypothesis Transfer and Labeling Transfer. arXiv preprint arXiv:2012.07297 (2020).
- Liu et al. (2020) Bingyu Liu, Yuhong Guo, Jieping Ye, and Weihong Deng. 2020. Selective Pseudo-Labeling with Reinforcement Learning for Semi-Supervised Domain Adaptation. arXiv preprint arXiv:2012.03438 (2020).
- Liu and Tuzel (2016) Ming-Yu Liu and Oncel Tuzel. 2016. Coupled generative adversarial networks. arXiv preprint arXiv:1606.07536 (2016).
- Long et al. (2015) Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. 2015. Learning transferable features with deep adaptation networks. In International conference on machine learning. PMLR, 97–105.
- Long et al. (2017a) Mingsheng Long, Zhangjie Cao, Jianmin Wang, and Michael I Jordan. 2017a. Conditional adversarial domain adaptation. arXiv preprint arXiv:1705.10667 (2017).
- Long et al. (2017b) Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. 2017b. Deep transfer learning with joint adaptation networks. In International conference on machine learning. PMLR, 2208–2217.
- Pan et al. (2010) Sinno Jialin Pan, Ivor W Tsang, James T Kwok, and Qiang Yang. 2010. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks 22, 2 (2010), 199–210.
- Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in pytorch. (2017).
- Peng et al. (2019) Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. 2019. Moment matching for multi-source domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1406–1415.
- Saenko et al. (2010) Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. 2010. Adapting visual category models to new domains. In European conference on computer vision. Springer, 213–226.
- Saito et al. (2019) Kuniaki Saito, Donghyun Kim, Stan Sclaroff, Trevor Darrell, and Kate Saenko. 2019. Semi-supervised domain adaptation via minimax entropy. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 8050–8058.
- Saito et al. (2017) Kuniaki Saito, Yoshitaka Ushiku, Tatsuya Harada, and Kate Saenko. 2017. Adversarial dropout regularization. arXiv preprint arXiv:1711.01575 (2017).
- Sajjadi et al. (2016) Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. 2016. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. arXiv preprint arXiv:1606.04586 (2016).
- Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
- Sohn et al. (2020) Kihyuk Sohn, David Berthelot, Chun-Liang Li, Zizhao Zhang, Nicholas Carlini, Ekin D Cubuk, Alex Kurakin, Han Zhang, and Colin Raffel. 2020. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. arXiv preprint arXiv:2001.07685 (2020).
- Sun and Saenko (2016) Baochen Sun and Kate Saenko. 2016. Deep coral: Correlation alignment for deep domain adaptation. In European Conference on Computer Vision. Springer, 443–450.
- Tzeng et al. (2014) Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. 2014. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474 (2014).
- Venkateswara et al. (2017) Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. 2017. Deep hashing network for unsupervised domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 5018–5027.
- Verma et al. (2019) Vikas Verma, Kenji Kawaguchi, Alex Lamb, Juho Kannala, Yoshua Bengio, and David Lopez-Paz. 2019. Interpolation consistency training for semi-supervised learning. arXiv preprint arXiv:1903.03825 (2019).
- Wu et al. (2019) Yifan Wu, Ezra Winston, Divyansh Kaushik, and Zachary Lipton. 2019. Domain adaptation with asymmetrically-relaxed distribution alignment. In International Conference on Machine Learning. PMLR, 6872–6881.
- Xu et al. (2019a) Ruijia Xu, Guanbin Li, Jihan Yang, and Liang Lin. 2019a. Larger norm more transferable: An adaptive feature norm approach for unsupervised domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1426–1435.
- Xu et al. (2019b) Xiang Xu, Xiong Zhou, Ragav Venkatesan, Gurumurthy Swaminathan, and Orchid Majumder. 2019b. d-sne: Domain adaptation using stochastic neighborhood embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2497–2506.
- Yan et al. (2017) Hongliang Yan, Yukang Ding, Peihua Li, Qilong Wang, Yong Xu, and Wangmeng Zuo. 2017. Mind the class weight bias: Weighted maximum mean discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2272–2281.
- Yang et al. (2020) Shiqi Yang, Yaxing Wang, Joost van de Weijer, and Luis Herranz. 2020. Unsupervised Domain Adaptation without Source Data by Casting a BAIT. arXiv preprint arXiv:2010.12427 (2020).
- Yao et al. (2015) Ting Yao, Yingwei Pan, Chong-Wah Ngo, Houqiang Li, and Tao Mei. 2015. Semi-supervised domain adaptation with subspace learning for visual recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 2142–2150.
- Zellinger et al. (2017) Werner Zellinger, Thomas Grubinger, Edwin Lughofer, Thomas Natschläger, and Susanne Saminger-Platz. 2017. Central moment discrepancy (cmd) for domain-invariant representation learning. arXiv preprint arXiv:1702.08811 (2017).
- Zhang et al. (2017) Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. 2017. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412 (2017).
- Zhang et al. (2019) Yuchen Zhang, Tianle Liu, Mingsheng Long, and Michael Jordan. 2019. Bridging theory and algorithm for domain adaptation. In International Conference on Machine Learning. PMLR, 7404–7413.
- Zhao et al. (2019) Han Zhao, Remi Tachet Des Combes, Kun Zhang, and Geoffrey Gordon. 2019. On learning invariant representations for domain adaptation. In International Conference on Machine Learning. PMLR, 7523–7532.