Semi-Supervised Domain Adaptation via Selective Pseudo Labeling and Progressive Self-Training

Domain adaptation (DA) is a representation learning methodology that transfers knowledge from a label-sufficient source domain to a label-scarce target domain. While most of early methods are focused on unsupervised DA (UDA), several studies on semi-supervised DA (SSDA) are recently suggested. In SSDA, a small number of labeled target images are given for training, and the effectiveness of those data is demonstrated by the previous studies. However, the previous SSDA approaches solely adopt those data for embedding ordinary supervised losses, overlooking the potential usefulness of the few yet informative clues. Based on this observation, in this paper, we propose a novel method that further exploits the labeled target images for SSDA. Specifically, we utilize labeled target images to selectively generate pseudo labels for unlabeled target images. In addition, based on the observation that pseudo labels are inevitably noisy, we apply a label noise-robust learning scheme, which progressively updates the network and the set of pseudo labels by turns. Extensive experimental results show that our proposed method outperforms other previous state-of-the-art SSDA methods.


I Introduction

When encountering an image of a single object, humans can easily recognize its identity regardless of domain characteristics. For example, we can instantly tell that all images in Fig. 1 depict a “bicycle” even though there is an obvious contextual disparity (or domain shift [20]) among the images. Meanwhile, deep neural networks trained on a single domain are known to be fragile to domain shift due to their strong dependency upon the training data. One simple yet naive solution is to prepare a large amount of training data for each domain, but this incurs tremendous expense as well. In addition, tagging a label for every image in the target domain is particularly costly and time-consuming as the number of classes grows. To overcome this problem, various representation learning approaches named domain adaptation (DA) have been proposed in recent years [34].

Fig. 1: A set of images from the LSDAC dataset [22] illustrating the notion of domain shift. The images are examples from the Sketch, Real, and Painting domains, respectively.
Fig. 2: The overall pipeline of the proposed SSDA method.

The goal of DA is to enhance the performance of classifying images in a label-scarce domain (the target domain) by leveraging knowledge of a label-sufficient domain (the source domain). The majority of early methods [5, 17, 33, 11, 27, 4, 15, 7, 18, 3] are devoted to unsupervised domain adaptation (UDA), which assumes all target images are unlabeled while source images are fully labeled. Recently, a pioneering study [25] on semi-supervised domain adaptation was introduced, which assumes that a few labeled target images are additionally given (e.g., one or three examples per class). In that study, a few-shot feature embedding scheme [1] is incorporated to enhance the effectiveness of the labeled target images. In addition, by means of a minimax entropy-based learning scheme, the method outperforms UDA methods trained with SSDA setups (i.e., with additional supervision on the few labeled target images). One of the empirical discoveries reported in [25] is that training with additional labeled data in the target domain can considerably enhance performance even though the quantity of those data is extremely small. This implies that the few labeled target images serve as critical clues for resolving SSDA problems. However, despite their significance, the usage of the labeled target images in existing SSDA methods is limited to embedding them into ordinary supervised losses, such as the cross entropy loss.

In this paper, we propose a new SSDA method that exploits the labeled target images more actively by treating them as ‘golden’ samples for SSDA. To this end, we employ the few labeled target images to selectively assign pseudo labels to unlabeled target images. Training with pseudo labels [16] requires careful treatment since incorrect pseudo labels may result in performance degradation. Our strategy for dealing with pseudo labels is composed of two major components. First, to acquire pseudo labels with high reliability, we propose to select and utilize a restricted amount of pseudo labels based on an analysis in the feature space. Here, the basis of our reasoning is that deep features that lead to correct pseudo labels are usually clustered with those of labeled target images. Second, based on the observation that pseudo labels are inevitably noisy (i.e., they contain incorrect labels), we propose to apply a label noise-robust learning scheme [30] that alternately updates the pseudo labels and the deep network. By means of this alternating scheme, the network and the set of pseudo labels are progressively optimized. The overall pipeline of the proposed SSDA method is illustrated in Fig. 2. Experimental results on the LSDAC [22], Office-Home [32], and Office [24] datasets demonstrate that our method outperforms previous state-of-the-art methods.

The rest of this paper is organized as follows. In Section II, previous studies that are related to our work are introduced. In Section III, the details of our proposed method are explained. In Section IV, experimental setups and results are reported, and concluding remarks are given in Section V.

II Related Work

In this section, we review existing studies that are related to our work. First, we introduce previous domain adaptation methods for image classification. Second, we review learning schemes that are robust to noisy labels and clarify our strategy to apply those methods to SSDA.

II-A Domain Adaptation for Image Classification

Existing domain adaptation methods for image classification can be categorized into unsupervised and semi-supervised domain adaptation approaches. Both approaches consider the case that source and target domains share the same set of image categories, whereas the quantity of labels in the target domain is much smaller than that in the source domain.

Most early studies focus on UDA, which assumes that all images in the target domain are unlabeled. As a pioneering method for UDA, Ganin and Lempitsky [5] propose an adversarial learning approach for aligning the feature distributions of the source and target domains. Through adversarial learning, the feature extractor is trained to deceive the domain classifier by making features of the target domain indistinguishable from those of the source domain. The adversarial learning process is implemented by inserting a gradient reversal layer (GRL) between the feature extractor and the domain classifier. This adversarial mechanism is widely adopted by other UDA approaches [17, 27, 4, 15, 18, 3] to align feature spaces. Apart from these feature-level adaptation approaches, there are several pixel-level adaptation approaches that augment the scale of the training set by transferring images across the two domains [33, 11, 7]. A common limitation of UDA methods is that adaptation performance is severely degraded in scenarios involving a large domain shift. This is due to the harsh experimental setup of UDA, in which target labels are not given at all.
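The behavior of the gradient reversal layer can be summarized in two rules: it is the identity in the forward pass and multiplies the incoming gradient by a negative constant in the backward pass. The following is a minimal conceptual sketch in plain Python (the `lambd` coefficient name and both function names are ours; real implementations hook into an autograd engine, e.g., a custom `torch.autograd.Function`):

```python
def grl_forward(x):
    # Forward pass: the GRL is the identity, so features reach the
    # domain classifier unchanged.
    return x

def grl_backward(grad_output, lambd=1.0):
    # Backward pass: the gradient flowing back from the domain classifier
    # is multiplied by -lambd before it reaches the feature extractor,
    # pushing the extractor to *confuse* the domain classifier.
    return [-lambd * g for g in grad_output]

features = [0.3, -1.2, 0.8]
assert grl_forward(features) == features                    # identity forward
assert grl_backward([1.0, -2.0], lambd=0.5) == [-0.5, 1.0]  # reversed gradient
```

Because the reversal happens only in the backward pass, the domain classifier is still trained normally to discriminate domains, while the feature extractor receives the opposite signal.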

Recently, to address the domain adaptation problem in a more practical and realistic way, SSDA methods have received great attention. Unlike the UDA schemes, SSDA assumes that a few target labels (e.g., one or three examples per class) are additionally given for domain adaptation. As a pioneering approach for SSDA, Saito et al. [25] propose a minimax entropy-based method, in which the few-shot feature embedding scheme [1] and a minimax entropy-based learning scheme are incorporated. The empirical results in [25] show that additional supervision on the few labeled target images considerably increases the performance of domain adaptation methods [25, 26, 5, 17, 8], implying the importance of those data. However, despite their significance, the use of those data is restricted to embedding ordinary supervised losses. Unlike these previous methods, in this paper we propose to further utilize the labeled target images to select reliable pseudo labels for unlabeled target images. Training deep neural networks with pseudo labels [16] is one of the self-training mechanisms, and it requires careful treatment since incorrect pseudo labels can severely degrade performance. To identify pseudo labels with high reliability, we conduct a feature analysis that exploits both the labeled and the unlabeled target images. The details of this process are explained in Sec. III-A.

Fig. 3: A toy example illustrating the motivation of our selective pseudo labeling approach. Generally, in the feature space, labeled source and target features are well separated by the class boundary, whereas unlabeled target features are not. Thus, assigning pseudo labels to all unlabeled samples may generate numerous incorrect pseudo labels. Our motivation for selecting reliable pseudo labels is the observation that unlabeled target features leading to correct pseudo labels are located close to labeled target features in the feature space. For instance, in the figure, (B) is a correctly classified example whereas (C) is an incorrectly classified example; the feature distance between (A) and (B) is 0.91, while the feature distance between (A) and (C) is 1.64 (ResNet-34). By selectively assigning pseudo labels to unlabeled target images with relatively small feature distances, we can enhance the reliability of the set of pseudo labels. The example images are from the LSDAC dataset [22]. Best viewed in color.

II-B Learning with Noisy Labels

Training deep neural networks requires large-scale datasets composed of images and corresponding label annotations. However, collecting clean labels for large-scale datasets is costly, and in practice noisy labels often exist. By ‘noisy’, we mean that the labels may contain incorrect annotations; learning with noisy labels is a challenging issue that has recently been addressed by numerous studies [30, 13, 6, 19, 35, 31]. Existing approaches include embedding label noise-robust loss functions [6, 35], applying a joint optimization framework [30], and filtering out noisy labels [13, 19]. These methods are verified on image classification datasets that contain intentionally generated noisy labels.

Our motivation for adapting the label noise-robust learning scheme to SSDA derives from the fact that pseudo labels are inevitably noisy. To enhance the performance of the network trained on pseudo labels, we incorporate the joint optimization framework [30], which has been demonstrated to be robust to the noisy labels of large-scale datasets. The key idea of the framework is to progressively update the network and the set of noisy labels by turns, pursuing positive interaction between the two components. A detailed description of our label noise-robust learning scheme, which is motivated by [30], is given in Sec. III-B. To the best of our knowledge, this is the first attempt to adapt a label noise-robust learning scheme to self-training with pseudo labels.

Fig. 4: An overview of the proposed selective pseudo labeling pipeline, explained in Sec. III-A. The procedure is conducted for each class in the target domain; the figure illustrates the example of the ‘airplane’ class in the Clipart domain of the LSDAC dataset [22]. Best viewed in color.

III Proposed Method

The goal of semi-supervised domain adaptation is to train a classification model oriented to a target domain by using image sets from both domains. In the source domain, we are given source images $\mathcal{X}_s$ and the corresponding labels $\mathcal{Y}_s$. In the target domain, unlabeled images $\mathcal{X}_u$ and a small number of labeled images $\mathcal{X}_t$ (with labels $\mathcal{Y}_t$) are given. In SSDA, the classification model is trained on $\mathcal{X}_s$, $\mathcal{X}_t$, and $\mathcal{X}_u$, and tested on $\mathcal{X}_u$. The classification model is composed of a feature extractor $F$ and a classifier $C$, where $\Theta_F$ and $\Theta_C$ are the weight vectors of the feature extractor and the classifier, respectively. For an input image $x$, its feature vector and output prediction encoded by the model are denoted as $f = F(x)$ and $p = C(f)$, respectively; thus $p = C(F(x))$.

As illustrated in Fig. 2, our proposed method is composed of three stages. The first stage is to train a baseline model to generate pseudo labels. In this paper, we adopt the minimax entropy-based approach [25] to train the baseline models for all experiments. The weight vectors of the feature extractor and the classifier of the trained baseline model are denoted as $\Theta_F^{(0)}$ and $\Theta_C^{(0)}$, respectively. The next two stages of our proposed method are explained in the following two subsections.

III-A Selective Pseudo Labeling Approach

By using the baseline model acquired in the previous stage, we apply forward-pass operations to the unlabeled target images. We call $\tilde{p}_i$ a ‘soft’ pseudo label (i.e., an output prediction vector) and $\hat{y}_i$ a ‘hard’ pseudo label for the $i$-th unlabeled target image $x_i^u$, and they are given as follows:

$$\tilde{p}_i = C(F(x_i^u)), \qquad (1)$$
$$\hat{y}_i = \arg\max_{k} \tilde{p}_i^{(k)}, \qquad (2)$$

In the above equations, $\tilde{p}_i^{(k)}$ is the output probability of the $k$-th class and $K$ denotes the number of classes. We empirically found that adopting the entire set of pseudo labels for training is not helpful and even degrades performance. Our speculation is that training data whose pseudo labels are incorrect degrade the accuracy, and thus acquiring pseudo labels with high reliability is a crucial issue. Based on this observation, we propose a selective pseudo labeling approach that utilizes a restricted amount of pseudo labels, chosen according to their reliability.

The key idea of our selective pseudo labeling approach is illustrated in Fig. 3. As depicted in the figure, deep features that lead to correct pseudo labels are located close to those of labeled target images in the feature space. For each class $k$, let $f_j^t$ be the feature of the $j$-th labeled target image whose label is $k$, and $f_i^u$ be the feature of the $i$-th unlabeled target image whose hard pseudo label is $k$ (i.e., $\hat{y}_i = k$). Here, we drop the categorical index for notational convenience. For the $i$-th unlabeled sample, we define its feature distance as follows:

$$d_i = \frac{1}{n} \sum_{j=1}^{n} \left\| f_i^u - f_j^t \right\|_1, \qquad (3)$$

where $\|\cdot\|_1$ denotes the l1-norm and $n$ indicates the number of labeled target images for each class. In our experiments, one or three target images are given per class, i.e., $n = 1$ (1-shot) or $n = 3$ (3-shot). The feature distance $d_i$ becomes larger when the unlabeled target feature lies far from the labeled target features in the feature space, and vice versa. Based on our assumption that $d_i$ is inversely proportional to reliability, we sort the unlabeled features in ascending order of $d_i$; this procedure is conducted independently for each class. After sorting, for each class, we assign pseudo labels to the first $\alpha N_k$ samples, where $N_k$ is the number of unlabeled images whose hard pseudo label is class $k$. Here, $\alpha$ is a hyper-parameter that adjusts the ratio of selected pseudo labels, and we set $\alpha$ to 0.2 by default. Through these procedures, we obtain the pseudo-labeled target image set $\mathcal{D}_p$, whose index set consists of the selected pseudo labels. The overall procedure of our selective pseudo labeling approach is illustrated in Fig. 4.
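The per-class selection step can be sketched as follows: for one class, compute each candidate's mean L1 distance to the labeled target features of that class (Eq. (3)), sort ascending, and keep the first fraction of candidates. This is our reading of the procedure; the function and parameter names (`select_pseudo_labels`, `alpha`) are illustrative:

```python
import numpy as np

def select_pseudo_labels(labeled_feats, candidate_feats, alpha=0.2):
    """Select the most reliable pseudo-labeled candidates of ONE class.

    labeled_feats:   (n, d) features of the n labeled target images of this class
    candidate_feats: (N, d) features of unlabeled images whose hard pseudo
                     label equals this class
    Returns indices (into candidate_feats) of the selected samples.
    """
    # Eq. (3): mean L1 distance to the labeled target features of the class.
    dists = np.abs(candidate_feats[:, None, :] - labeled_feats[None, :, :]).sum(-1).mean(1)
    order = np.argsort(dists)                       # ascending: most reliable first
    keep = max(1, int(np.ceil(alpha * len(order)))) # restricted amount of labels
    return order[:keep]

# Toy example: two labeled features at the origin; candidates at mixed distances.
labeled = np.zeros((2, 4))
cands = np.stack([np.full(4, v) for v in (0.1, 5.0, 0.2, 3.0, 0.05)])
idx = select_pseudo_labels(labeled, cands, alpha=0.4)  # keeps the 2 closest
```

Running the whole pipeline means repeating this per class and taking the union of the selected indices as the pseudo-labeled set.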

In Table I, the reliabilities of the selected pseudo labels are compared with those of baseline pseudo labels obtained without the selective pseudo labeling approach. Note that the numerical values in Table I are not the final accuracy of the image classifier but the percentage of correct pseudo labels. For the various adaptation scenarios in Table I, the proposed selective approach consistently enhances the reliability of pseudo labels. In particular, its effectiveness becomes prominent in adaptation scenarios with a large domain gap, such as Clipart to Sketch (C to S). This indicates that the proposed selective pseudo labeling approach is effective for challenging scenarios as well.

 

Net Clipart to Sketch (C to S) Painting to Real (P to R)
1-shot 3-shot 1-shot 3-shot
AlexNet 35.2 → 61.6 41.0 → 64.8 57.7 → 83.8 60.7 → 85.8
VGG-16 51.2 → 72.5 54.6 → 76.4 72.2 → 88.6 75.0 → 92.3

 

TABLE I: Reliability of pseudo labels in terms of accuracy (%) on the LSDAC dataset [22], before → after applying the selective pseudo labeling approach. Note that each measurement in this table is not the final accuracy, but the correctness of the pseudo labels.

III-B Label Noise-Robust Learning via Progressive Self-Training

The final stage of our proposed method is to conduct SSDA using the pseudo labels obtained in the previous stage. Although the pseudo labels are carefully determined via the selective approach, they are not completely reliable, since pseudo labels are inevitably noisy. Based on this observation, we propose a label noise-robust learning approach, motivated by the joint optimization framework for learning with noisy labels [30].

Given the set of selected unlabeled target images with pseudo labels $\mathcal{D}_p$, we implement the supervised loss function as follows:

$$\mathcal{L}_{pl} = \frac{1}{|\mathcal{D}_p|} \sum_{(x_i^u, \hat{y}_i) \in \mathcal{D}_p} \ell_{ce}(p_i, \hat{y}_i), \qquad (4)$$

where $\ell_{ce}$ is the standard cross entropy loss function. Note that $\hat{y}_i$ is a fixed pseudo label while $p_i$ is a variable output prediction during the network update. In a similar way to [30], the set of pseudo labels is updated by forward-pass operations using the updated network, with a momentum of 0.9, after every validation phase. By means of this alternating learning process, the network and the set of pseudo labels are progressively updated, and the joint updating continues until the validation accuracy converges. We call this learning process ‘progressive self-training’ since the network is progressively optimized along with the pseudo labels.
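One way to read the momentum-based label update (our interpretation of [30]'s update rule; the variable names are ours): after each validation phase, the stored soft pseudo label is blended with the updated network's current prediction, so labels drift gradually rather than flipping abruptly.

```python
import numpy as np

def update_soft_labels(stored, current_pred, momentum=0.9):
    # Keep 90% of the previous soft label and mix in 10% of the updated
    # network's prediction; repeated over training, the labels and the
    # network refine each other by turns.
    return momentum * stored + (1.0 - momentum) * current_pred

old = np.array([[0.7, 0.2, 0.1]])
new_pred = np.array([[0.2, 0.7, 0.1]])
updated = update_soft_labels(old, new_pred)  # -> [[0.65, 0.25, 0.1]]
```

Because the momentum is high, a single noisy prediction cannot overturn a label; only consistently different predictions across many validation phases will move it to a new class.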

The overall training is conducted in conjunction with the baseline SSDA method, i.e., the minimax entropy-based approach [25]. Letting $\mathcal{L}_F$ and $\mathcal{L}_C$ be the loss functions for the feature extractor and the classifier, respectively, the overall training objectives are given as follows:

$$\mathcal{L}_F = \mathcal{L}_{sup} + \mathcal{L}_{pl} + \lambda \mathcal{H}, \qquad (5)$$
$$\mathcal{L}_C = \mathcal{L}_{sup} + \mathcal{L}_{pl} - \lambda \mathcal{H}, \qquad (6)$$
$$\mathcal{L}_{sup} = \mathbb{E}_{(x, y) \in (\mathcal{X}_s, \mathcal{Y}_s) \cup (\mathcal{X}_t, \mathcal{Y}_t)} \, \ell_{ce}(C(F(x)), y), \qquad (7)$$
$$\mathcal{H} = -\mathbb{E}_{x \in \mathcal{X}_u} \sum_{k=1}^{K} p^{(k)} \log p^{(k)}, \qquad (8)$$

In the above equations, $\mathcal{L}_{sup}$ is the standard cross entropy loss for labeled source and target images, and $\mathcal{H}$ indicates the entropy [25] of the predictions for unlabeled target images. The standard stochastic gradient descent (SGD) algorithm is used to optimize these loss functions. The hyper-parameter $\lambda$ is set to 0.1 for all experiments. The overall training procedure is summarized in Algorithm 1.
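The feature extractor and classifier objectives share the supervised and pseudo-label terms and differ only in the sign of the entropy term, following the minimax scheme of [25]. A NumPy sketch of the loss terms (function and argument names are ours; λ = 0.1 as in the paper):

```python
import numpy as np

def cross_entropy(probs, labels):
    # Mean cross entropy over a batch of probability vectors.
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def entropy(probs):
    # Mean Shannon entropy of unlabeled-target predictions.
    return -np.mean(np.sum(probs * np.log(probs + 1e-12), axis=1))

def objectives(p_lab, y_lab, p_pl, y_pl, p_unl, lam=0.1):
    l_sup = cross_entropy(p_lab, y_lab)  # labeled source + target images
    l_pl = cross_entropy(p_pl, y_pl)     # selected pseudo-labeled images
    h = entropy(p_unl)                   # unlabeled target images
    loss_F = l_sup + l_pl + lam * h      # feature extractor minimizes entropy
    loss_C = l_sup + l_pl - lam * h      # classifier (adversarially) maximizes it
    return loss_F, loss_C

probs = np.array([[0.6, 0.4]])
labels = np.array([0])
loss_F, loss_C = objectives(probs, labels, probs, labels, probs)
```

By construction, `loss_F - loss_C` equals twice the weighted entropy term, which is the quantity the two players push in opposite directions.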

IV Experiments

IV-A Datasets

We used three representative benchmark datasets for our experiments. LSDAC [22] is a benchmark dataset for large-scale domain adaptation, which involves 6 domains with 345 classes. For a fair comparison with previous methods, we followed the settings in [25], which address 7 adaptation scenarios among 4 domains (Real, Clipart, Painting, and Sketch) with 126 classes. Office-Home [32] contains 4 domains (Real, Clipart, Art, and Product) with 65 classes, and we conducted evaluations on all 12 possible adaptation scenarios. Office [24] involves 3 domains (Amazon, Webcam, and DSLR) with 31 classes, and we evaluated on 2 scenarios, Webcam to Amazon and DSLR to Amazon. Since the domain disparity between Webcam and DSLR is negligible, we considered the two adaptation scenarios that involve large domain shifts and a sufficient amount of training data.

IV-B Experimental Setups

For each adaptation scenario, one or three examples per class are used as labeled target training data; we denote these two settings as ‘1-shot’ and ‘3-shot’, respectively. For a fair comparison, we used the labeled target image sets reported in [25]. The remaining unlabeled target images and all labeled source images were used for training. To verify the effectiveness of the proposed method across various network models, we conducted comparative evaluations on 6 backbone architectures. Specifically, we employed AlexNet [14], VGG-16 [29], and ResNet-34 [9] as the primary network models; results on models beyond these three architectures are reported in Sec. IV-D.

Input: $\mathcal{X}_s$, $\mathcal{Y}_s$, $\mathcal{X}_t$, $\mathcal{Y}_t$, $\mathcal{X}_u$, $\alpha$
Output: $\Theta_F$, $\Theta_C$
  $\Theta_F \leftarrow \Theta_F^{(0)}$, $\Theta_C \leftarrow \Theta_C^{(0)}$, build the pseudo-labeled set $\mathcal{D}_p$ by selective pseudo labeling (Sec. III-A)
  while not converged do
     update $\Theta_F$ by SGD on $\mathcal{L}_F$ in Eq. (5)
     update $\Theta_C$ by SGD on $\mathcal{L}_C$ in Eq. (6)
     if validation phase then
        update the pseudo labels in $\mathcal{D}_p$ with a momentum of 0.9
     end if
  end while
  return $\Theta_F$, $\Theta_C$
Algorithm 1 Semi-supervised Domain Adaptation with the Label Noise-Robust Learning Approach

 

Net Method R to C R to P P to C C to S S to P R to S P to R MEAN
1-shot 3-shot 1-shot 3-shot 1-shot 3-shot 1-shot 3-shot 1-shot 3-shot 1-shot 3-shot 1-shot 3-shot 1-shot 3-shot
AlexNet S+T 43.3 47.1 42.4 45.0 40.1 44.9 33.6 36.4 35.7 38.4 29.1 33.3 55.8 58.7 40.0 43.4
DANN 43.3 46.1 41.6 43.8 39.1 41.0 35.9 36.5 36.9 38.9 32.5 33.4 53.6 57.3 40.4 42.4
ADR 43.1 46.2 41.4 44.4 39.3 43.6 32.8 36.4 33.1 38.9 29.1 32.4 55.9 57.3 39.2 42.7
CDAN 46.3 46.8 45.7 45.0 38.3 42.3 27.5 29.5 30.2 33.7 28.8 31.3 56.7 58.7 39.1 41.0
ENT 37.0 45.5 35.6 42.6 26.8 40.4 18.9 31.1 15.1 29.6 18.0 29.6 52.2 60.0 29.1 39.8
MME 48.9 55.6 48.0 49.0 46.7 51.7 36.3 39.4 39.4 43.0 33.3 37.9 56.8 60.7 44.2 48.2
Proposed 54.2 58.3 48.8 51.7 49.0 55.1 38.9 43.5 44.7 48.4 37.5 41.2 60.2 63.3 47.6 51.6
VGG-16 S+T 49.0 52.3 55.4 56.7 47.7 51.0 43.9 48.5 50.8 55.1 37.9 45.0 69.0 71.7 50.5 54.3
DANN 43.9 56.8 42.0 57.5 37.3 49.2 46.7 48.2 51.9 55.6 30.2 45.6 65.8 70.1 45.4 54.7
ADR 48.3 50.2 54.6 56.1 47.3 51.5 44.0 49.0 50.7 53.5 38.6 44.7 67.6 70.9 50.2 53.7
CDAN 57.8 58.1 57.8 59.1 51.0 57.4 42.5 47.2 51.2 54.5 42.6 49.3 71.7 74.6 53.5 57.2
ENT 39.6 50.3 43.9 54.6 26.4 47.4 27.0 41.9 29.1 51.0 19.3 39.7 68.2 72.5 36.2 51.1
MME 60.6 64.1 63.3 63.5 57.0 60.7 50.9 55.4 60.5 60.9 50.2 54.8 72.2 75.3 59.2 62.1
Proposed 64.5 68.0 63.7 64.9 60.5 64.4 53.7 57.4 62.5 63.4 52.7 57.5 73.0 74.9 61.5 64.4
ResNet-34 S+T 55.6 60.0 60.6 62.2 56.8 59.4 50.8 55.0 56.0 59.5 46.3 50.1 71.8 73.9 56.9 60.0
DANN 58.2 59.8 61.4 62.8 56.3 59.6 52.8 55.4 57.4 59.9 52.2 54.9 70.3 72.2 58.4 60.7
ADR 57.1 60.7 61.3 61.9 57.0 60.7 51.0 54.4 56.0 59.9 49.0 51.1 72.0 74.2 57.6 60.4
CDAN 65.0 69.0 64.9 67.3 63.7 68.4 53.1 57.8 63.4 65.3 54.5 59.0 73.2 78.5 62.5 66.5
ENT 65.2 71.0 65.9 69.2 65.4 71.1 54.6 60.0 59.7 62.1 52.1 61.1 75.0 78.6 62.6 67.6
MME 70.0 72.2 67.7 69.7 69.0 71.7 56.3 61.8 64.8 66.8 61.0 61.9 76.1 78.5 66.4 68.9
Proposed 72.4 73.9 69.4 71.5 71.6 73.9 61.7 63.3 66.7 69.0 62.5 65.1 78.8 80.4 69.0 71.0

 

TABLE II: Quantitative evaluation results on the LSDAC dataset in terms of accuracy (%).

All experiments in this paper are implemented in PyTorch [21] using an NVIDIA TITAN X GPU (Pascal architecture). For training the baseline models (i.e., the first stage of our method), we followed the setups reported in [25]. The self-training phase using the selected pseudo labels (Eqs. (5) and (6)) resumes from the baseline models and continues until the validation accuracy converges. Learning rates are re-initialized before resuming training and are decayed according to the annealing strategy proposed in [5]. For comparative evaluation, we report the quantitative results of the following 6 previous methods. S+T [1, 23] trains a network with supervision on labeled source and target images without using unlabeled target images. DANN [5], ADR [26], CDAN [17], and ENT [8] are unsupervised domain adaptation methods trained with additional supervision on the labeled target images. MME [25] is our baseline method, which is specialized for the SSDA scheme.
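For reference, the annealing strategy of [5] decays the learning rate as $\eta_p = \eta_0 / (1 + a \cdot p)^b$, where $p \in [0, 1]$ is the training progress; the original paper uses $a = 10$ and $b = 0.75$. A small sketch (variable names and the base learning rate are ours):

```python
def annealed_lr(progress, base_lr=0.01, a=10.0, b=0.75):
    # Learning-rate annealing from Ganin & Lempitsky [5]:
    # progress runs from 0 (start of training) to 1 (end of training).
    return base_lr / (1.0 + a * progress) ** b

# The schedule decreases smoothly over the course of training.
lrs = [annealed_lr(p / 10) for p in range(11)]
```

The schedule starts at `base_lr` and decays monotonically, which keeps early updates large while stabilizing the later self-training iterations.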

IV-C Experimental Results and Analysis

The quantitative evaluation results on the LSDAC dataset are reported in Table II. Across the 7 adaptation scenarios, the proposed method outperforms the other previous methods except in a single case (P to R with VGG-16). It is worth noting that our method achieves significant performance improvements over the baseline method when the domain gap is large (e.g., the S to P and R to S adaptation scenarios), implying that our method is particularly robust to challenging conditions. Another empirical observation is that the 1-shot accuracies of our method are competitive with, or even higher than, the 3-shot accuracies of other previous methods. This indicates that our method requires fewer target labels than other methods to reach the same performance; it can therefore serve as an alternative to collecting additional labeled images in the target domain. This advantage is especially useful for image classification tasks involving a large number of classes, since the expense of annotating labels is proportional to the number of classes. The evaluation results on the Office-Home and Office datasets are reported in Table III. Our method shows better performance than the other methods in terms of average accuracy.

Overall, the strength of our method can be summarized in the following three major aspects. First, our proposed method outperforms other previous methods across various datasets and network architectures, indicating that it can be broadly applied to various SSDA scenarios rather than being limited to a certain dataset or network. Second, our method achieves considerable performance enhancements over the previous methods, especially on large-scale domain adaptation datasets such as the LSDAC dataset. Third, our method is particularly robust to challenging domain adaptation scenarios (e.g., the S to P and R to S adaptation scenarios in the LSDAC dataset), implying that it can enhance performance under difficult adaptation conditions involving large domain shifts.

 

Net Method Office-Home Office
1-shot 3-shot 1-shot 3-shot
AlexNet S+T 44.1 50.0 50.2 61.8
DANN 45.1 50.3 55.8 64.8
ADR 44.5 49.5 50.6 61.3
CDAN 41.2 46.2 49.4 60.8
ENT 38.8 50.9 48.1 65.1
MME 49.2 55.2 56.5 67.6
Proposed 50.3 55.3 59.0 69.8
VGG-16 S+T 57.4 62.9 68.7 73.3
DANN 60.0 63.9 69.8 75.0
ADR 57.4 63.0 69.4 73.7
CDAN 55.8 61.8 65.9 72.9
ENT 51.6 64.8 70.6 75.3
MME 62.7 67.6 73.4 77.0
Proposed 63.9 68.6 76.4 78.1

 

TABLE III: Quantitative evaluation results on Office-Home and Office datasets in terms of accuracy (%). Each measurement is a mean accuracy averaged over all adaptation scenarios in each dataset (12 and 2 scenarios for Office-Home and Office datasets, respectively).

IV-D Ablation Studies and Further Analysis

To verify the effectiveness of each module in our method, we conducted ablation studies on two adaptation scenarios in the LSDAC dataset: C to S, involving a large domain gap, and P to R, with a relatively small domain gap. In Table IV, the accuracies depending on the pseudo label selection ratio ($\alpha$ in Sec. III-A) are reported. The results in Table IV show that the accuracy tends to be maximized when $\alpha$ is around 0.2, whereas the performance is degraded when $\alpha$ is larger or smaller than 0.2. This indicates that selecting a moderate amount of pseudo labels is encouraged. If $\alpha = 1.00$, the entire set of pseudo labels is adopted for self-training; this setup corresponds to the training strategy without the selective pseudo labeling stage in Sec. III-A. Comparing the results of $\alpha = 1.00$ with those of $\alpha = 0.20$ validates that the selective pseudo labeling stage clearly enhances the accuracy. In addition, this result supports our initial assumption that employing a restricted number of highly reliable pseudo labels leads to better performance than adopting the entire set. On the other hand, adopting too few pseudo labels leads to relatively low accuracies. These empirical observations imply that a moderate number of pseudo labels is desirable for self-training; based on these analyses, we set the default value of $\alpha$ to 0.2 for all experiments.

The second ablation study investigates the effectiveness of the label noise-robust learning approach (Sec. III-B). To this end, we compared our method with a vanilla learning approach that uses the hard pseudo labels in Eq. (2) without the progressive updating scheme. The comparative results are presented in Table V, confirming that the label noise-robust learning approach performs better than the vanilla learning approach. This indicates that the proposed learning approach effectively prevents incorrect pseudo labels from misleading the network during training.

Lastly, we conducted comparative evaluations on three additional backbone architectures to verify the robustness of the proposed method across various network models. We adopted ResNet-101 [9] and DenseNet-121 [12] to test deeper network models, and MobileNet-v2 [28] to confirm the performance on a light-weight network model. The evaluation results on the three models are reported in Table VI. Our proposed method surpasses the other previous SSDA methods, including the baseline. This consistency of performance improvements indicates that the proposed method can be broadly applied to SSDA without requiring a particular network architecture. Training a single DA scenario took around 4 to 6 hours until convergence; the computational time for testing depends on the backbone architecture, and measurements for the 6 network models are reported in Table VII.

 

α 0.01 0.05 0.20 0.50 1.00
1-shot
C to S 36.7 37.6 38.9 38.7 37.4
P to R 57.5 59.3 60.2 59.8 59.6
3-shot
C to S 41.7 42.8 43.5 43.1 42.4
P to R 60.5 62.1 63.3 62.9 62.0

 

TABLE IV: Accuracy variations with the selection ratio α in Sec. III-A, using AlexNet.

 

Whether applied C to S P to R
1-shot 3-shot 1-shot 3-shot
Yes 38.9 43.5 60.2 63.3
No 37.7 42.1 58.3 61.9

 

TABLE V: Ablation study on applying the label noise-robust learning approach in Sec. III-B using AlexNet.

 

Method ResNet-101 DenseNet-121 MobileNet-v2
1-shot 3-shot 1-shot 3-shot 1-shot 3-shot
S+T 55.9 59.1 58.6 61.8 51.3 54.6
ENT 62.1 67.0 62.2 69.7 53.7 61.7
MME 66.3 68.4 68.3 70.5 60.9 64.3
Proposed 68.0 69.2 70.4 72.1 63.5 66.3

 

TABLE VI: Further evaluation results on various network architectures. Each measurement is a mean accuracy (%) averaged over the 7 adaptation scenarios in the LSDAC dataset.

 

AlexNet VGG-16 ResNet-34
1.84 ms (544 FPS) 2.41 ms (414 FPS) 1.82 ms (550 FPS)
ResNet-101 DenseNet-121 MobileNet-v2
2.46 ms (406 FPS) 1.83 ms (547 FPS) 1.82 ms (550 FPS)

 

TABLE VII: Computation time required for testing an image.

V Conclusion

In this paper, we have introduced a novel semi-supervised domain adaptation method for image classification. The key idea of our method is to exploit the labeled target images to identify reliable pseudo labels for the unlabeled target images. In addition, based on the observation that the set of pseudo labels may contain incorrect labels, we apply a learning approach that is robust to noisy labels. Experimental results on three representative domain adaptation datasets show that our method outperforms other methods, especially for challenging adaptation scenarios involving large domain shifts. For the three primary backbone architectures (AlexNet[14], VGG-16[29], ResNet-34[9]), our method outperforms the previous state-of-the-art method by 2.7%, 0.9%, and 2.2% on the LSDAC[22], Office-Home[32], and Office[24] datasets, respectively. Although we validated the proposed method on image classification only, we expect that it could be extended to other computer vision tasks such as domain adaptive object detection[2] and semantic segmentation[10] in the future.
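As a rough illustration of the alternating scheme summarized above (update the model, then refresh the selective pseudo-label set, by turns), the following toy sketch uses a nearest-class-mean classifier on 1-D features. The data, the margin-based selection rule, and the classifier itself are illustrative placeholders, not the actual procedure of the paper.

```python
# Toy sketch of progressive self-training: alternate between refitting a
# simple nearest-class-mean "model" and regenerating selective pseudo labels.
# Data, margin threshold, and the model are illustrative placeholders.

def fit_means(samples):
    """Per-class mean of 1-D features: {label: mean}."""
    sums, counts = {}, {}
    for x, y in samples:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def select_pseudo_label(means, x, margin=0.5):
    """Assign x to its nearest class mean; reject ambiguous samples."""
    dists = sorted((abs(x - m), y) for y, m in means.items())
    if dists[1][0] - dists[0][0] < margin:  # nearly equidistant to two classes
        return None
    return dists[0][1]

labeled = [(0.0, "a"), (0.2, "a"), (3.0, "b"), (3.2, "b")]  # few labeled target samples
unlabeled = [0.1, 0.3, 2.9, 1.6]  # 1.6 lies between the two classes

training_set = list(labeled)
for _ in range(3):  # progressive rounds, "by turns"
    means = fit_means(training_set)  # update the model
    pseudo = [(x, y) for x in unlabeled  # refresh the pseudo-label set
              if (y := select_pseudo_label(means, x)) is not None]
    training_set = labeled + pseudo  # retrain with the selected pseudo labels

print(sorted(pseudo))  # the ambiguous sample 1.6 is never pseudo-labeled
```

The point of the margin test is selectivity: a sample whose two nearest class prototypes are almost equidistant contributes no pseudo label, so noisy labels are less likely to enter the next training round.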

Acknowledgement

We thank the anonymous reviewers for their valuable comments. This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00524, Development of precise content identification technology based on relationship analysis for maritime vessel/structure).

References

  • [1] W. Chen, Y. Liu, Z. Kira, Y. F. Wang, and J. Huang (2019) A closer look at few-shot classification. arXiv preprint arXiv:1904.04232. Cited by: §I, §II-A, §IV-B.
  • [2] Y. Chen, W. Li, C. Sakaridis, D. Dai, and L. Van Gool (2018) Domain adaptive faster r-cnn for object detection in the wild. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3339–3348. Cited by: §V.
  • [3] J. Choi, M. Jeong, T. Kim, and C. Kim (2019) Pseudo-labeling curriculum for unsupervised domain adaptation. In Proc. British Machine Vision Conference (BMVC), Cited by: §I, §II-A.
  • [4] G. French, M. Mackiewicz, and M. Fisher (2017) Self ensembling for visual domain adaptation. arXiv preprint arXiv:1706.05208. Cited by: §I, §II-A.
  • [5] Y. Ganin and V. Lempitsky (2015) Unsupervised domain adaptation by backpropagation. In Proc. International Conference on Machine Learning (ICML), Cited by: §I, §II-A, §II-A, §IV-B.
  • [6] A. Ghosh, H. Kumar, and P. Sastry (2017) Robust loss functions under label noise for deep neural networks. In Proc. Thirty-First AAAI Conference on Artificial Intelligence, Cited by: §II-B.
  • [7] R. Gong, W. Li, Y. Chen, and L. Van Gool (2019) DLOW: domain flow for adaptation and generalization. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §I, §II-A.
  • [8] Y. Grandvalet and Y. Bengio (2005) Semi-supervised learning by entropy minimization. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pp. 529–536. Cited by: §II-A, §IV-B.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §IV-B, §IV-D, §V.
  • [10] J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell (2017) CYCADA: cycle-consistent adversarial domain adaptation. arXiv preprint arXiv:1711.03213. Cited by: §V.
  • [11] L. Hu, M. Kan, S. Shan, and X. Chen (2018) Duplex generative adversarial network for unsupervised domain adaptation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §I, §II-A.
  • [12] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4700–4708. Cited by: §IV-D.
  • [13] Y. Kim, J. Yim, J. Yun, and J. Kim (2019) NLNL: negative learning for noisy labels. In Proc. IEEE International Conference on Computer Vision (ICCV), pp. 101–110. Cited by: §II-B.
  • [14] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Proc. Neural Information Processing Systems (NeurIPS), Cited by: §IV-B, §V.
  • [15] V. Kurmi, S. Kumar, and V. Namboodiri (2018) Attending to discriminative certainty for domain adaptation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §I, §II-A.
  • [16] D. Lee (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning (ICML), Vol. 3, pp. 2. Cited by: §I, §II-A.
  • [17] M. Long, Z. Cao, J. Wang, and M. Jordan (2018) Conditional adversarial domain adaptation. In Proc. Neural Information Processing Systems (NeurIPS), Cited by: §I, §II-A, §II-A, §IV-B.
  • [18] X. Ma, T. Zhang, and C. Xu (2019) GCAN: graph convolutional adversarial network for unsupervised domain adaptation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §I, §II-A.
  • [19] C. G. Northcutt, T. Wu, and I. L. Chuang (2017) Learning with confident examples: rank pruning for robust classification with noisy labels. arXiv preprint arXiv:1705.01936. Cited by: §II-B.
  • [20] S. J. Pan and Q. Yang (2009) A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22 (10), pp. 1345–1359. Cited by: §I.
  • [21] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §IV-B.
  • [22] X. Peng, Q. Bai, X. Xia, Z. Huang, K. Saenko, and B. Wang (2019) Moment matching for multi-source domain adaptation. In Proc. IEEE International Conference on Computer Vision (ICCV), pp. 1406–1415. Cited by: Fig. 1, §I, Fig. 3, Fig. 4, TABLE I, §IV-A, §V.
  • [23] R. Ranjan, C. D. Castillo, and R. Chellappa (2017) L2-constrained softmax loss for discriminative face verification. arXiv preprint arXiv:1703.09507. Cited by: §IV-B.
  • [24] K. Saenko, B. Kulis, M. Fritz, and T. Darrell (2010) Adapting visual category models to new domains. In Proc. European Conference on Computer Vision (ECCV), pp. 213–226. Cited by: §I, §IV-A, §V.
  • [25] K. Saito, D. Kim, S. Sclaroff, T. Darrell, and K. Saenko (2019) Semi-supervised domain adaptation via minimax entropy. In Proc. IEEE International Conference on Computer Vision (ICCV), Cited by: §I, §II-A, §III-B, §III, §IV-A, §IV-B, §IV-B.
  • [26] K. Saito, Y. Ushiku, T. Harada, and K. Saenko (2017) Adversarial dropout regularization. arXiv preprint arXiv:1711.01575. Cited by: §II-A, §IV-B.
  • [27] K. Saito, K. Watanabe, Y. Ushiku, and T. Harada (2018) Maximum classifier discrepancy for unsupervised domain adaptation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §I, §II-A.
  • [28] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4510–4520. Cited by: §IV-D.
  • [29] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §IV-B, §V.
  • [30] D. Tanaka, D. Ikami, T. Yamasaki, and K. Aizawa (2018) Joint optimization framework for learning with noisy labels. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5552–5560. Cited by: §I, §II-B, §II-B, §III-B, §III-B.
  • [31] A. Vahdat (2017) Toward robustness against label noise in training deep discriminative neural networks. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pp. 5596–5605. Cited by: §II-B.
  • [32] H. Venkateswara, J. Eusebio, S. Chakraborty, and S. Panchanathan (2017) Deep hashing network for unsupervised domain adaptation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5018–5027. Cited by: §I, §IV-A, §V.
  • [33] R. Volpi, P. Morerio, S. Savarese, and V. Murino (2018) Adversarial feature augmentation for unsupervised domain adaptation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §I, §II-A.
  • [34] M. Wang and W. Deng (2018) Deep visual domain adaptation: a survey. Neurocomputing 312, pp. 135–153. Cited by: §I.
  • [35] Z. Zhang and M. Sabuncu (2018) Generalized cross entropy loss for training deep neural networks with noisy labels. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pp. 8778–8788. Cited by: §II-B.