Domain adaptation (DA) is a representation learning methodology that transfers knowledge from a label-sufficient source domain to a label-scarce target domain. While most early methods focus on unsupervised DA (UDA), several studies on semi-supervised DA (SSDA) have recently been proposed. In SSDA, a small number of labeled target images are given for training, and previous studies have demonstrated the effectiveness of these data. However, existing SSDA approaches use these data only in ordinary supervised losses, overlooking the potential of these few yet informative clues. Based on this observation, we propose a novel method that further exploits the labeled target images for SSDA. Specifically, we utilize the labeled target images to selectively generate pseudo labels for unlabeled target images. In addition, based on the observation that pseudo labels are inevitably noisy, we apply a label noise-robust learning scheme that progressively updates the network and the set of pseudo labels by turns. Extensive experimental results show that our proposed method outperforms previous state-of-the-art SSDA methods.
When encountering an image of a single object, humans can easily recognize its identity regardless of domain characteristics. For example, we can instantly tell that all images in Fig. 1 depict a "bicycle" even though there is an obvious contextual disparity (or domain shift) among them. Meanwhile, deep neural networks trained on a single domain are known to be fragile to domain shift due to their strong dependency on the training data. One simple yet naive solution is to prepare a large amount of training data for each domain, but this incurs tremendous expense. In addition, tagging a label for every image in the target domain is particularly costly and time-consuming as the number of classes grows. To overcome this problem, various representation learning approaches, collectively named domain adaptation (DA), have been proposed in recent years.
The goal of DA is to enhance the performance of classifying images in a label-scarce domain (the target domain) by leveraging knowledge from a label-sufficient domain (the source domain). The majority of early methods [5, 17, 33, 11, 27, 4, 15, 7, 18, 3] are devoted to unsupervised domain adaptation (UDA), which assumes all target images are unlabeled while source images are fully labeled. Recently, a pioneering study on semi-supervised domain adaptation was introduced, which assumes that a few labeled target images are additionally given (e.g., one or three examples per class). In that study, a few-shot feature embedding scheme is incorporated to enhance the effectiveness of the labeled target images. In addition, by means of a minimax entropy-based learning scheme, the method outperforms UDA methods trained with SSDA setups (i.e., with additional supervision on the few labeled target images). One empirical discovery reported there is that training with additional labeled target data can considerably enhance performance even when the quantity of those data is extremely small. This implies that the few labeled target images serve as critical clues for resolving SSDA problems. However, despite their significance, the usage of the labeled target images in existing SSDA methods is limited to embedding them into ordinary supervised losses, such as the cross entropy loss.
In this paper, we propose a new SSDA method that exploits the labeled target images more actively by treating them as 'golden' samples for SSDA. To this end, we employ the few labeled target images to selectively assign pseudo labels to unlabeled target images. Training with pseudo labels requires careful treatment since incorrect pseudo labels may result in performance degradation. Our strategy for dealing with pseudo labels has two major components. First, to acquire pseudo labels with high reliability, we propose to select and utilize a restricted amount of pseudo labels based on an analysis in the feature space. The basis of our reasoning is that deep features that lead to correct pseudo labels are usually clustered with those of labeled target images. Second, based on the observation that pseudo labels are inevitably noisy (i.e., contain incorrect labels), we apply a label noise-robust learning scheme that alternately updates the pseudo labels and the deep network. By means of this alternating scheme, the network and the set of pseudo labels are progressively optimized. The overall pipeline of the proposed SSDA method is illustrated in Fig. 2. Experimental results on the LSDAC, Office-Home, and Office datasets demonstrate that our method outperforms previous state-of-the-art methods.
In this section, we review existing studies that are related to our work. First, we introduce previous domain adaptation methods for image classification. Second, we review learning schemes that are robust to noisy labels and clarify our strategy to apply those methods to SSDA.
Existing domain adaptation methods for image classification can be categorized into unsupervised and semi-supervised approaches. Both consider the case where the source and target domains share the same set of image categories, whereas the quantity of labels in the target domain is much smaller than that in the source domain.
Most early studies focus on UDA, which assumes that all images in the target domain are unlabeled. As a pioneering method for UDA, Ganin and Lempitsky propose an adversarial learning approach to align the feature distributions of the source and target domains. Through adversarial learning, the feature extractor is trained to deceive the domain classifier by making features of the target domain indistinguishable from those of the source domain. The adversarial learning process is implemented by inserting a gradient reversal layer (GRL) between the feature extractor and the domain classifier. This adversarial mechanism is widely adopted by other UDA approaches [17, 27, 4, 15, 18, 3] to align feature spaces. Apart from these feature-level adaptation approaches, several pixel-level adaptation approaches augment the training sets by transferring images across the two domains [33, 11, 7]. A common limitation of UDA methods is that performance degrades severely in adaptation scenarios involving a large domain shift, owing to the harsh UDA setup in which target labels are not given at all.
Recently, to address the domain adaptation problem in a more practical and realistic way, SSDA methods have received great attention. Unlike UDA, SSDA assumes that a few target labels (e.g., one or three examples per class) are additionally given. As a pioneering approach for SSDA, Saito et al. propose a minimax entropy-based method that incorporates a few-shot feature embedding scheme together with minimax entropy-based learning. Their empirical results show that additional supervision on the few labeled target images can substantially increase the performance of domain adaptation methods [25, 26, 5, 17, 8], implying the importance of those data. However, despite their significance, the use of the labeled target images remains restricted to ordinary supervised losses. Unlike these previous methods, we further utilize the labeled target images to select reliable pseudo labels for unlabeled target images. Training deep neural networks with pseudo labels is a self-training mechanism and requires careful treatment, since incorrect pseudo labels can severely degrade performance. To identify pseudo labels with high reliability, we conduct a feature analysis that exploits both the labeled and the unlabeled target images. The details of this process are explained in Sec. III-A.
Training deep neural networks requires large-scale datasets composed of images and corresponding label annotations. However, collecting clean labels for large-scale datasets is costly, and in practice noisy labels often exist. By 'noisy', we mean that the labels may contain incorrect annotations; learning with noisy labels is a challenging issue recently addressed by numerous studies [30, 13, 6, 19, 35, 31]. Existing approaches include embedding label noise-robust loss functions [6, 35], applying the joint optimization framework, and filtering out noisy labels [13, 19]. These methods are verified on image classification datasets that contain intentionally generated noisy labels.
Our motivation for adapting the label noise-robust learning scheme to SSDA derives from the fact that pseudo labels are inevitably noisy. To enhance the performance of the network trained on pseudo labels, we incorporate the joint optimization framework, which has been demonstrated to be robust to noisy labels on large-scale datasets. The key idea of the framework is to progressively update the network and the set of noisy labels by turns, pursuing positive interactions between the two components. The details of our label noise-robust learning scheme are introduced in Sec. III-B. To the best of our knowledge, this is the first attempt to adapt a label noise-robust learning scheme to self-training with pseudo labels.
The goal of semi-supervised domain adaptation is to train a classification model oriented to a target domain by using image sets from both domains. In the source domain, we are given source images and their corresponding labels. In the target domain, unlabeled images and a small number of labeled images are given. In SSDA, the classification model is trained on the labeled source images, the labeled target images, and the unlabeled target images, and tested on the unlabeled target images. The classification model is composed of a feature extractor and a classifier, each with its own weight vector. For an input image, the feature extractor produces a feature vector, and the classifier maps that feature to an output prediction; the model is thus the composition of the two.
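As a toy illustration of this composition, the following sketch uses purely hypothetical one-layer stand-ins for the deep backbone and the classifier head (the function and weight names are our assumptions, not the paper's notation):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def feature_extractor(x, W_f):
    # Hypothetical one-layer extractor standing in for the deep backbone.
    return np.tanh(x @ W_f)

def classifier(f, W_c):
    # Linear classifier producing a probability vector over classes.
    return softmax(f @ W_c)

def model(x, W_f, W_c):
    # The model is the composition: prediction = classifier(feature_extractor(x)).
    return classifier(feature_extractor(x, W_f), W_c)
```

Each prediction is a probability vector over the shared set of classes, which is what the pseudo labeling stage below operates on.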
As illustrated in Fig. 2, our proposed method is composed of three stages. The first stage trains a baseline model for generating pseudo labels; in this paper, we adopt the minimax entropy-based approach to train the baseline models for all experiments. The weights of the trained baseline model's feature extractor and classifier initialize the subsequent stages. The next two stages of our proposed method are explained in the following two subsections.
Using the baseline model acquired in the previous stage, we apply forward passes to the unlabeled target images to obtain their output predictions. We call the output prediction vector a 'soft' pseudo label and its argmax a 'hard' pseudo label for each unlabeled target image. Each entry of the prediction vector is the output probability of the corresponding class, and the vector length equals the number of classes. We empirically found that adopting the entire set of pseudo labels for training is not helpful and even degrades performance. Our speculation is that training data with incorrect pseudo labels degrade accuracy, and thus acquiring pseudo labels with high reliability is a very important issue. Based on this observation, we propose a selective pseudo labeling approach that utilizes a restricted amount of pseudo labels chosen by reliability.
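The two defining equations did not survive extraction; a hedged reconstruction in standard notation (the symbols below are ours, not necessarily the authors') is:

```latex
p_i = C\big(F(x_i^{u})\big) \in \mathbb{R}^{K}
\qquad \text{(soft pseudo label: the output prediction vector)}

\hat{y}_i = \operatorname*{arg\,max}_{k \in \{1,\dots,K\}} \, p_i^{(k)}
\qquad \text{(hard pseudo label of the } i\text{th unlabeled target image)}
```

Here $F$ and $C$ denote the feature extractor and the classifier, $x_i^{u}$ the $i$th unlabeled target image, and $K$ the number of classes.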
The key idea of our selective pseudo labeling approach is illustrated in Fig. 3. As depicted in the figure, deep features that lead to correct pseudo labels lie close to those of labeled target images in the feature space. For each class, we gather the features of the labeled target images of that class and the features of the unlabeled target images whose hard pseudo label is that class (dropping the categorical index for notational convenience). For each unlabeled sample, we define its feature distance as the l1 distance to the labeled target features of the same class, averaged over the labeled examples; the number of labeled target images per class is one (1-shot) or three (3-shot) in our experiments. The feature distance becomes larger when the unlabeled target feature lies far from the labeled target features, and vice versa. Based on our assumption that this distance is inversely proportional to reliability, we sort the unlabeled features of each class in ascending order of distance and assign pseudo labels to the leading fraction of samples, controlled by a hyper-parameter that adjusts the selection ratio (set to 0.2 by default). Through this procedure, we obtain the set of pseudo-labeled target images together with the index set of selected pseudo labels. The overall procedure of our selective pseudo labeling approach is illustrated in Fig. 4.
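The per-class selection described above might be sketched as follows (NumPy; the mean-over-anchors normalization of the l1 distance and all helper names are our assumptions, not the paper's exact formulation):

```python
import numpy as np

def select_pseudo_labels(feat_u, hard_labels, feat_l, labels_l, num_classes, r=0.2):
    """Hypothetical sketch of the selective pseudo labeling stage.
    feat_u: (N, D) features of unlabeled target images,
    hard_labels: (N,) their hard pseudo labels,
    feat_l: (M, D) features of the labeled target images (K per class),
    labels_l: (M,) their ground-truth labels.
    For each class, keep the fraction r of candidates whose mean l1
    distance to that class's labeled features is smallest."""
    selected = []
    for c in range(num_classes):
        anchors = feat_l[labels_l == c]          # K-shot labeled features of class c
        idx = np.where(hard_labels == c)[0]      # samples pseudo-labeled as class c
        if len(idx) == 0 or len(anchors) == 0:
            continue
        # mean l1 distance from each candidate to the class anchors
        d = np.abs(feat_u[idx][:, None, :] - anchors[None, :, :]).sum(-1).mean(1)
        keep = idx[np.argsort(d)[: max(1, int(r * len(idx)))]]
        selected.extend(keep.tolist())
    return np.array(sorted(selected))
```

Sorting is done independently per class, so classes with many confident candidates cannot crowd out classes with few, matching the per-class procedure of Fig. 4.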
In Table I, the reliability of the selected pseudo labels is compared with that of baseline pseudo labels obtained without the selective pseudo labeling approach. Note that the values in Table I are not the final accuracy of the image classifier but the percentage of correct pseudo labels. For the adaptation scenarios in Table I, our selective approach consistently enhances the reliability of the pseudo labels. In particular, its effectiveness becomes prominent for adaptation scenarios with a large domain gap, such as Clipart to Sketch (C to S). This indicates that the proposed selective pseudo labeling approach is effective for challenging scenarios as well.
|Net||Clipart to Sketch (C to S)||Painting to Real (P to R)|
The final stage of our proposed method is to conduct SSDA with the pseudo labels obtained in the previous stage. Although the pseudo labels are carefully determined via the selective approach, they are not completely reliable: the set inevitably contains incorrect labels. Based on this observation, we propose a label noise-robust learning approach motivated by the joint optimization framework for learning with noisy labels.
Given the set of unlabeled target images with pseudo labels, we implement the supervised loss function as follows:
Here the loss is the standard cross entropy, in which the pseudo label is fixed while the output prediction varies as the network is updated. Similarly to the joint optimization framework, the set of pseudo labels is refreshed by forward passes of the updated network with a momentum of 0.9 after every validation phase. By means of this alternating process, the network and the set of pseudo labels are progressively updated until the validation accuracy converges. We call this learning process 'progressive self-training' since the network is progressively optimized along with the pseudo labels.
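The two operations of this alternating loop might be sketched as follows (NumPy; the exact momentum blending form is our assumption, only the 0.9 coefficient is stated by the paper, and the function names are hypothetical):

```python
import numpy as np

def cross_entropy(pred, soft_target, eps=1e-12):
    # Standard cross entropy between (possibly soft) pseudo labels and
    # the network's current output predictions.
    return -(soft_target * np.log(pred + eps)).sum(axis=-1).mean()

def refresh_pseudo_labels(soft_labels, current_preds, momentum=0.9):
    # After every validation phase, blend the stored soft pseudo labels
    # with the updated network's predictions (momentum 0.9).
    return momentum * soft_labels + (1.0 - momentum) * current_preds
```

In the loop, the network is trained against the fixed pseudo labels with `cross_entropy`, then `refresh_pseudo_labels` updates the label set, and the two steps repeat until the validation accuracy converges.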
The overall training is conducted in conjunction with the baseline SSDA method, i.e., the minimax entropy-based approach. Writing separate loss functions for the feature extractor and the classifier, the overall training objectives are given as follows:
In the above equations, the first term is the standard cross entropy loss for labeled source and target images, and the second is the entropy of the predictions on unlabeled target images. The standard Stochastic Gradient Descent (SGD) algorithm is used to train on these loss functions. The hyper-parameter weighting the entropy term is set to 0.1 for all experiments. The overall training procedure is summarized in Algorithm 1.
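Since the objective equations themselves did not survive extraction, the following is only a hedged sketch of their shape under the assumed MME formulation (the feature extractor minimizes the entropy of unlabeled target predictions while the classifier maximizes it, realized with a gradient reversal layer in practice; all names here are our assumptions):

```python
import numpy as np

def entropy(probs, eps=1e-12):
    # Mean Shannon entropy of predictions on unlabeled target images.
    return -(probs * np.log(probs + eps)).sum(axis=-1).mean()

def losses(ce_labeled, ce_pseudo, probs_unlabeled, lam=0.1):
    """Sketch of the two objectives. ce_labeled and ce_pseudo are the
    supervised cross entropy terms on labeled data and on the selected
    pseudo labels; lam = 0.1 in all experiments."""
    H = entropy(probs_unlabeled)
    loss_F = ce_labeled + ce_pseudo + lam * H  # feature extractor minimizes entropy
    loss_C = ce_labeled + ce_pseudo - lam * H  # classifier maximizes entropy
    return loss_F, loss_C
```

Both objectives share the supervised terms; only the sign of the entropy term differs between the two players of the minimax game.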
We used three representative benchmark datasets. LSDAC is a benchmark for large-scale domain adaptation involving 6 domains with 345 classes. For a fair comparison with previous methods, we followed the settings of prior work, which address 7 adaptation scenarios across 4 domains (Real, Clipart, Painting, and Sketch) with 126 classes. Office-Home contains 4 domains (Real, Clipart, Art, and Product) with 65 classes, and we evaluated all 12 possible adaptation scenarios. Office comprises 3 domains (Amazon, Webcam, and DSLR) with 31 classes, and we evaluated 2 scenarios, Webcam to Amazon and DSLR to Amazon. Since the domain disparity between Webcam and DSLR is negligible, we considered the two scenarios that involve large domain shifts and a sufficient amount of training data.
For each adaptation scenario, one or three examples per class are used as labeled target training data, denoted as the '1-shot' and '3-shot' settings, respectively. For a fair comparison, we used the labeled target image sets reported in prior work. The remaining unlabeled target images and all labeled source images were used for training. To verify the effectiveness of the proposed method across various network models, we conducted comparative evaluations on 6 backbone architectures. Specifically, we employed AlexNet, VGG-16, and ResNet-34 as the primary network models; results on the other models are reported in Sec. IV-D.
|Net||Method||R to C||R to P||P to C||C to S||S to P||R to S||P to R||MEAN|
All experiments were implemented in PyTorch using an NVIDIA TITAN X GPU (Pascal architecture). For training the baseline models (i.e., the first stage of our method), we followed the reported setups. The self-training phase using the selected pseudo labels (Eq. (5), (6)) resumes from the baseline models and runs until the validation accuracy converges. Learning rates are re-initialized before resuming training and decayed according to the annealing strategy of prior work. For comparative evaluation, we report quantitative results of the following 6 previous methods. S+T [1, 23] trains a network with supervision on labeled source and target images without using unlabeled target images. DANN, ADR, CDAN, and ENT are unsupervised domain adaptation methods trained with additional supervision on the labeled target images. MME is our baseline method, which is specialized to the SSDA setting.
The quantitative results on the LSDAC dataset are reported in Table II. Across the 7 adaptation scenarios, the proposed method outperforms previous methods in all but one case (P to R with VGG-16). Notably, our method achieves significant performance improvements over the baseline when the domain gap is large (e.g., the S to P and R to S scenarios), implying that it is particularly robust to challenging conditions. Another observation is that the 1-shot accuracies of our method are competitive with, or even higher than, the 3-shot accuracies of previous methods. This indicates that our method needs fewer target labels for the same performance and can therefore serve as an alternative to collecting labeled images in the target domain. This advantage is especially useful for classification tasks with many classes, since annotation expense grows with the number of classes. The evaluation results on the Office-Home and Office datasets are reported in Table III; our method achieves better average accuracies than the other methods.
Overall, the strengths of our method can be summarized in three aspects. First, it outperforms previous methods across various datasets and network architectures, indicating that it can be broadly adopted for SSDA scenarios and is not limited to a certain dataset or network. Second, it achieves considerable performance gains over previous methods, especially on large-scale domain adaptation datasets such as LSDAC. Third, it is particularly robust to challenging adaptation scenarios (e.g., S to P and R to S in LSDAC), implying that it can enhance performance under more difficult conditions involving large domain shifts.
To verify the effectiveness of each module, we conducted ablation studies on two LSDAC adaptation scenarios: C to S, involving a large domain gap, and P to R, with a relatively small one. Table IV reports the accuracy as a function of the pseudo-label selection ratio (Sec. III-A). The results show that the accuracy tends to be maximized when the ratio is around 0.2 and degrades for larger or smaller values, indicating that selecting a moderate amount of pseudo labels is preferable. When the ratio is 1, all pseudo labels are adopted for self-training, which corresponds to training without the selective pseudo labeling stage of Sec. III-A. Comparing this setting with the default ratio validates that the selective stage clearly enhances accuracy, confirming our initial assumption that a restricted number of highly reliable pseudo labels yields better performance than the entire set. On the other hand, adopting too few pseudo labels leads to relatively low accuracy, implying that a moderate number of pseudo labels is desirable for self-training. Based on these analyses, we set the default selection ratio to 0.2 for all experiments.
The second ablation study investigates the effectiveness of the label noise-robust learning approach (Sec. III-B). We compared our method with a vanilla learning approach that uses the hard pseudo labels of Eq. (2) without the progressive updating scheme. The results in Table V confirm that the label noise-robust approach outperforms the vanilla approach, indicating that it effectively prevents incorrect pseudo labels from misleading the network during training.
Lastly, we evaluated three additional backbone architectures to verify the robustness of the proposed method across network models. We adopted ResNet-101 and DenseNet-121 to test deeper networks, and MobileNet-v2 to test a light-weight one. The results in Table VI show that our method surpasses previous SSDA methods, including the baseline, on all three models. This consistency indicates that the proposed method can be broadly applied to SSDA without requiring a particular network architecture. Training a single DA scenario took around 4 to 6 hours until convergence. The computational time for testing depends on the backbone architecture; measurements for the 6 network models are reported in Table VII.
Table IV rows (accuracy in %; columns correspond to the tested pseudo-label selection ratios):
|C to S||36.7||37.6||38.9||38.7||37.4|
|P to R||57.5||59.3||60.2||59.8||59.6|
|C to S||41.7||42.8||43.5||43.1||42.4|
|P to R||60.5||62.1||63.3||62.9||62.0|
|Label noise-robust learning applied||C to S||P to R|
Table VII rows (per-image inference time for the 6 backbone models):
|1.84 ms (544 FPS)||2.41 ms (414 FPS)||1.82 ms (550 FPS)|
|2.46 ms (406 FPS)||1.83 ms (547 FPS)||1.82 ms (550 FPS)|
In this paper, we have introduced a novel semi-supervised domain adaptation method for image classification. The main idea is to exploit the labeled target images to identify reliable pseudo labels for the unlabeled target images. In addition, based on the observation that the set of pseudo labels may contain incorrect labels, a learning approach robust to noisy labels is applied. Experimental results on three representative domain adaptation datasets show that our method outperforms previous methods, especially in challenging adaptation scenarios involving large domain shifts. Across the three primary backbone architectures (AlexNet, VGG-16, ResNet-34), the proposed SSDA method outperforms the previous state of the art by 2.7%, 0.9%, and 2.2% on the LSDAC, Office-Home, and Office datasets, respectively. Although we validated the proposed method on image classification only, we expect it could be extended to other computer vision tasks, such as domain adaptive object detection and semantic segmentation, in the future.
We thank the anonymous reviewers for their valuable comments. This work was supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00524, Development of precise content identification technology based on relationship analysis for maritime vessel/structure).
In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3339–3348.
Unsupervised domain adaptation by backpropagation. In Proc. International Conference on Machine Learning (ICML).
In Proc. Thirty-First AAAI Conference on Artificial Intelligence.
Duplex generative adversarial network for unsupervised domain adaptation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning (ICML), Vol. 3, pp. 2.
A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22 (10), pp. 1345–1359.