Combating Domain Shift with Self-Taught Labeling

07/08/2020 ∙ by Jian Liang, et al. ∙ National University of Singapore

We present a novel method to combat domain shift when adapting classification models trained on one domain to new domains with few or no target labels. In the existing literature, a prevailing solution paradigm is to learn domain-invariant feature representations so that a classifier learned on the source features generalizes well to the target features. However, such a classifier is inevitably biased toward the source domain, overlooking the structure of the target data. Instead, we propose Self-Taught Labeling (SeTL), a new regularization approach that finds an auxiliary target-specific classifier for unlabeled data. During adaptation, this classifier teaches the target domain itself by providing unbiased, accurate pseudo labels. In particular, for each target sample, we employ a memory bank to store the feature along with its soft label from the domain-shared classifier. We then develop a non-parametric neighborhood aggregation strategy to generate new pseudo labels as well as confidence weights for unlabeled data. Despite using only the standard classification objective, SeTL significantly outperforms existing domain alignment techniques on a large variety of domain adaptation benchmarks. We expect SeTL to provide a new perspective on addressing domain shift and to inspire future research on domain adaptation and transfer learning.


1 Introduction

Despite making remarkable progress in classification tasks over the past decades, deep neural network models still generalize poorly to new domains (e.g., classifying real-world object images using a classification model trained on simulated object images Peng et al. (2017)), due to the well-known dataset shift Quionero-Candela et al. (2009) or domain shift Tommasi et al. (2016) problem. Hence, much research effort has been devoted to developing domain adaptation (DA) methods Gong et al. (2012); Ganin et al. (2016); Hoffman et al. (2016); Tsai et al. (2018) that make source models more adaptable to new target domains.

Figure 1: The pipeline of our proposed SeTL for UDA. Different from existing methods that mostly rely on feature-level domain alignment, SeTL addresses domain shift by discovering an extra classifier for the target data during adaptation. SeTL introduces a memory module and develops neighborhood aggregation to help build the domain-specific classifier, which is expected to generate unbiased, accurate pseudo labels together with confidence weights for unlabeled data.

In this paper, we mainly focus on unsupervised domain adaptation (UDA) for object recognition, where no labeled data are available in the target domain. Recently, deep domain adaptation approaches have almost dominated this field with promising results Long et al. (2015); Ganin et al. (2016); Long et al. (2018); Lee et al. (2019a); Kang et al. (2019); Cicek and Soatto (2019). They try to learn domain-invariant feature representations that achieve small error on the source domain, expecting that the learned representations, together with the classifier learned on the source domain, generalize to the target domain. Since marginal distribution alignment Ganin and Lempitsky (2015); Long et al. (2015) is not sufficient to guarantee successful domain adaptation Zhao et al. (2019), pseudo labels on the target domain, which provide conditional information, are employed to align class-conditional distributions Long et al. (2018); Cicek and Soatto (2019). However, as shown in Fig. 1, the learned classifier is inevitably biased toward the labeled source data, making the generated pseudo labels on the target domain inaccurate and unreliable.

To tackle this issue, we propose a new approach called Self-Taught Labeling (SeTL) that discovers a target-specific classifier to produce reliable predictions, rather than simply relying on biased ones from the source classifier. Intuitively, with unbiased, accurate pseudo labels for unlabeled target data, one can implicitly and semantically align the data features from different domains through a standard classification loss, getting rid of tedious feature-level domain alignment. Different from the prevailing feature-level alignment and pixel-level transfer Hoffman et al. (2018); Sankaranarayanan et al. (2018), this provides a new perspective on DA problems. Since no labeled data are available in the target domain, SeTL introduces a memory module to store the historical information (i.e., features and classifier predictions) of unlabeled target samples as self-supervision. Through the memory module, SeTL performs neighborhood aggregation to obtain both pseudo labels and their corresponding confidences, which directly promotes message-passing within the neighborhood in the target domain without introducing any extra parameters.

Specifically, for each target sample, SeTL retrieves a few nearest neighbors based on feature similarity and aggregates their associated classifier predictions into the pseudo label for that sample. SeTL uses the pseudo labels and the confidence weights derived from the aggregated prediction as self-teaching supervision over the unlabeled data. This regularizes the source classification loss and helps the feature adaptation. The aggregation strategy works well because it can leverage target samples with high confidence (i.e., source-like samples) in the memory bank to help learn a reliable classifier. SeTL is general and can be applied to various DA tasks.

Despite its simplicity, we find that SeTL achieves results competitive with or better than the state of the art on multiple domain adaptation benchmarks. Besides, SeTL can be seamlessly integrated into existing domain adaptation methods to further boost their transferability. Furthermore, SeTL also works well for semi-supervised learning (SSL), where only a small amount of labeled data is available for model training.

To sum up, we make the following contributions. We present SeTL, a novel approach to combating domain shift that provides an alternative to the prevailing feature-level alignment and pixel-level transfer methods. Despite its simplicity, SeTL fully promotes self-teaching within the target domain via an auxiliary memory module. SeTL performs remarkably well on multiple benchmarks for UDA, semi-supervised DA, and SSL with few annotated data points. We hope SeTL can inspire further work on domain adaptation.

2 Related Work

Since this paper mainly focuses on the UDA problem, we first introduce related existing deep domain adaptation approaches; more comprehensive overviews are provided in Csurka (2017); Kouw and Loog (2019); Wilson and Cook (2020). From another viewpoint, without direct domain alignment, our method can also be considered a regularization approach for transductive learning, so we also discuss related studies on this topic. Finally, we analyze several works that involve a memory mechanism.

2.1 Deep Domain Adaptation

Deep domain adaptation methods leverage deep neural networks to learn more transferable representations by embedding domain adaptation in the deep learning pipeline. Generally, the weights of the deep architecture, containing a feature encoder and a classifier layer, are shared across both domains, and various distribution discrepancy measures Tzeng et al. (2014); Long et al. (2015); Ganin and Lempitsky (2015) are developed to promote domain confusion in the feature space. Maximum mean discrepancy (MMD) Gretton et al. (2007) and the $\mathcal{A}$-distance Ben-David et al. (2010) are the two most favored measures among them. To circumvent the problem that marginal distribution alignment cannot guarantee different domains are semantically aligned, follow-up works Long et al. (2018); Cicek and Soatto (2019) exploit pseudo labels on the target domain to perform conditional distribution alignment. However, the learned classifier still fails to generalize well on the target domain, as it is mainly built on the labeled source data.

Another line of research Long et al. (2016); Rozantsev et al. (2018); Liang et al. (2020) exploits the individual characteristics of each domain by dropping the weight-sharing assumption fully or partially. Shu et al. Shu et al. (2018) propose non-conservative domain adaptation and incrementally refine the previously learned classification boundary to fit the target domain only. With the classifier shared, Tzeng et al. Tzeng et al. (2017) first learn the source feature encoder and then the target feature encoder sequentially, while Bousmalis et al. Bousmalis et al. (2016) jointly learn a domain-shared encoder and domain-specific private encoders. Besides, Chang et al. Chang et al. (2019) share all other model parameters but specialize the batch normalization layers within the feature encoder. Liang et al. Liang et al. (2020) learn a target-specific feature extractor while operating only on the hypotheses induced from the source data. Compared with these methods, SeTL does not introduce any new layers and aims to learn one shared classifier for both domains with a virtual target-specific classifier.

2.2 Regularization for Transductive Learning

Besides the classification objective for labeled data, SSL methods Zhu (2005) generally resort to the cluster assumption or the low-density separation assumption to fully exploit unlabeled data, e.g., Shannon entropy minimization Grandvalet and Bengio (2005). An alternative termed 'Pseudo-Label' is developed in Lee (2013), which progressively treats high-confidence predictions on unlabeled data as true labels under a standard cross-entropy loss. Follow-up works Shi et al. (2018); Deng et al. (2019) incorporate pseudo labels to perform discriminative clustering on the features of unlabeled data. Besides, Miyato et al. Miyato et al. (2018) propose the VAT loss to measure local smoothness of the conditional label distribution around each input data point against local perturbation. In fact, both UDA and SSL belong to transductive learning; the only difference is that in UDA the labeled and unlabeled data are sampled from different distributions. Recent studies Chen et al. (2019a); Cui et al. (2020); Jin et al. (2020) show that regularization terms on unlabeled data, without explicit feature-level domain alignment, achieve promising adaptation results. In particular, the MaxSquare loss is elegantly designed in Chen et al. (2019a) to prevent the training process from being dominated by easy-to-transfer samples in the target domain. Going further, the diversity of conditional predictions is considered through batch nuclear-norm maximization Cui et al. (2020) and class confusion minimization Jin et al. (2020), respectively.

2.3 Transductive Learning with Memory Mechanism

A memory module can be read and written to remember past facts, making information across different mini-batches interactive and enabling more powerful learning for challenging tasks like question answering Sukhbaatar et al. (2015). A recent study Chen et al. (2018) first exploits the memory mechanism in network training for SSL and computes a memory prediction for each training sample via key addressing and value reading. Inspired by instance discrimination Wu et al. (2018), Saito et al. Saito et al. (2020) employ a memory bank and propose an entropy minimization loss to encourage neighborhood clustering in the target domain. Besides, Zhong et al. Zhong et al. (2019) leverage an exemplar memory module that saves up-to-date features of target data and computes an invariance learning loss for unlabeled target data. Among these, Chen et al. (2018) is the most closely related work to ours, but it is proposed for SSL, utilizes only the labeled data for the memory update, and ignores self-learning on the unlabeled data.

3 Methodology

In the UDA task, we are given a labeled source domain $\mathcal{D}_s = \{(x_s^i, y_s^i)\}_{i=1}^{n_s}$ with $K$ categories and an unlabeled target domain $\mathcal{D}_u = \{x_t^i\}_{i=1}^{n_u}$, while in semi-supervised domain adaptation (SSDA), we are additionally given a labeled subset $\mathcal{D}_l$ of the target domain. To be clear, $\mathcal{D}_t = \mathcal{D}_u \cup \mathcal{D}_l$ denotes the entire target domain, and UDA has an empty $\mathcal{D}_l$. This paper focuses on the vanilla closed-set setting, i.e., the two domains share the same $K$ categories. The ultimate goal of both UDA and SSDA is to label the target samples in $\mathcal{D}_u$ via training the model on $\mathcal{D}_s \cup \mathcal{D}_t$.

As shown in Fig. 1, we employ the widely-used architecture Ganin and Lempitsky (2015), which consists of two basic modules: a feature extractor $g(\cdot)$ and a classifier $f(\cdot)$. Based on where alignment is performed, UDA approaches can be roughly categorized into three main cases, i.e., pixel-level Hoffman et al. (2018); Sankaranarayanan et al. (2018), feature-level Ganin and Lempitsky (2015); Tzeng et al. (2017); Long et al. (2018); Li et al. (2020a), and output-level Chen et al. (2019a); Cui et al. (2020); Jin et al. (2020). Pixel-level transfer is time-consuming and output-level regularization is sensitive to inaccurate model predictions, so much DA research has been devoted to feature-level domain alignment. Prior studies Long et al. (2018); Cicek and Soatto (2019); Li et al. (2020a) further show that better feature alignment can be achieved with the aid of noisy output-level predictions.
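For concreteness, below is a minimal PyTorch sketch of this two-module design; the class and attribute names (AdaptNet, g, f) are hypothetical, and the 256-dimensional bottleneck follows the training settings reported in Section 4.1.

```python
import torch
import torch.nn as nn
from torchvision import models

class AdaptNet(nn.Module):
    """Shared network for both domains: a feature extractor g, a bottleneck,
    and a K-way linear classifier f (names hypothetical)."""

    def __init__(self, num_classes, bottleneck_dim=256):
        super().__init__()
        backbone = models.resnet50(pretrained=True)        # ImageNet pre-trained
        self.g = nn.Sequential(*list(backbone.children())[:-1])
        self.bottleneck = nn.Linear(backbone.fc.in_features, bottleneck_dim)
        self.f = nn.Linear(bottleneck_dim, num_classes)    # classifier layer

    def forward(self, x):
        feat = self.bottleneck(torch.flatten(self.g(x), 1))
        return feat, self.f(feat)  # bottleneck feature and class logits
```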

3.1 Preliminaries

To fully utilize the unlabeled data, following classic self-training Zhu (2005), Lee Lee (2013) presents a simple method for training deep neural networks for domain adaptation: it picks the class with the maximum predicted probability as the true label each time the weights are updated. Since the pseudo labels are not equally confident, in this work we readily take the maximum predicted probabilities as weights and incorporate them into the standard cross-entropy loss, forming the following objective to adapt the model with unlabeled data:

$$\mathcal{L}_{pl} = -\,\mathbb{E}_{x_t \in \mathcal{D}_u} \left[ \max_k p_k(x_t) \cdot \log p_{\hat{y}_t}(x_t) \right], \quad \hat{y}_t = \arg\max_k p_k(x_t), \tag{1}$$

where $p(x_t) = f(g(x_t)) \in \mathbb{R}^K$ is the $K$-dimensional prediction. As stated in Lee (2013), this favors a low-density separation between classes and is equivalent to entropy regularization Grandvalet and Bengio (2005) as follows,

$$\mathcal{L}_{ent} = -\,\mathbb{E}_{x_t \in \mathcal{D}_u} \sum_{k=1}^{K} p_k(x_t) \log p_k(x_t), \tag{2}$$

where the Shannon entropy measures the class overlap. However, both regularization approaches Lee (2013); Grandvalet and Bengio (2005), as well as another recent regularization method Chen et al. (2019a), ignore the structure of the unlabeled data and focus only on the instance-wise prediction itself.
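As a reference, here are minimal PyTorch sketches of the two objectives above, assuming logits is the classifier output for a mini-batch of unlabeled target samples; the small constant inside the log is added for numerical stability.

```python
import torch
import torch.nn.functional as F

def pseudo_label_loss(logits):
    # Eq. (1): cross-entropy with the model's own argmax labels, weighted by
    # the maximum predicted probability. Since the label is the argmax, the
    # picked probability p[y_hat] equals the confidence itself.
    p = F.softmax(logits, dim=1)
    conf, y_hat = p.max(dim=1)
    return -(conf * torch.log(conf + 1e-8)).mean()

def entropy_loss(logits):
    # Eq. (2): Shannon entropy of the predictions, averaged over the batch.
    p = F.softmax(logits, dim=1)
    return -(p * torch.log(p + 1e-8)).sum(dim=1).mean()
```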

Considering the prediction diversity among unlabeled data, Jin et al. Jin et al. (2020) propose to minimize the pair-wise class confusion within a mini-batch of training data; in that way, the overlap between any two classes is reduced, and so is the classification ambiguity. Besides, Cui et al. Cui et al. (2020) pursue a lower output matrix rank within a mini-batch to ensure both discriminability and diversity. Both approaches have proven to achieve much better results than vanilla entropy minimization, implying that the structure of the classification output matrix is essential for unlabeled data. Though these output-level regularization methods Chen et al. (2019a); Cui et al. (2020) were originally proposed to make full use of unlabeled data without assuming domain shift, they achieve performance competitive with feature-level alignment methods for domain adaptation.

3.2 Self-Taught Labeling

In this paper, we propose a new regularization approach called self-taught labeling (SeTL) that fully exploits the structure of unlabeled data to obtain reliable pseudo labels under domain shift. Different from Liang et al. (2019), which employs a nearest centroid classifier under the assumption of centroid shift, SeTL aims to learn an extra classifier $f_t$ specific to the target domain. However, learning $f_t$ is quite challenging without labeled target data. Fortunately, according to a prior study Long et al. (2018), there exist some source-like samples whose output predictions are reliable; these can be used to help build the proposed classifier and then teach the remaining samples. To avoid trivial sample selection and alternate training, SeTL employs a memory module that stores both the features and the output predictions of all target samples to obtain more accurate pseudo labels. We describe the three main steps of SeTL as follows.

Memory bank update. To avoid ambiguity in the target predictions, we directly sharpen the output predictions via temperature scaling Guo et al. (2017); Berthelot et al. (2019) with temperature $T$,

$$\tilde{p}_k(x_t) = \frac{p_k(x_t)^{1/T}}{\sum_{j=1}^{K} p_j(x_t)^{1/T}}. \tag{3}$$

As $T \to 0$, the probability collapses to a point mass as in Pseudo-Label Lee (2013). Then, the sharpened prediction $\tilde{p}(x_t)$, along with the L2-normalized feature vector $g(x_t)/\lVert g(x_t) \rVert_2$, is written to the memory module at the sample's index. Here we do not adopt any moving-average strategy for updating.
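Below is a hedged sketch of such a memory module; the container name MemoryBank and the temperature value T=0.5 are illustrative rather than the paper's reported settings, and all tensors are assumed to live on the same device.

```python
import torch
import torch.nn.functional as F

class MemoryBank:
    """Stores one L2-normalized feature and one sharpened soft prediction per
    target sample; entries are overwritten in place by index (no moving average)."""

    def __init__(self, num_samples, feat_dim, num_classes):
        self.feats = torch.zeros(num_samples, feat_dim)
        self.probs = torch.full((num_samples, num_classes), 1.0 / num_classes)

    @torch.no_grad()
    def update(self, idx, feats, logits, T=0.5):
        p = F.softmax(logits, dim=1).pow(1.0 / T)         # Eq. (3): sharpening
        self.probs[idx] = p / p.sum(dim=1, keepdim=True)  # renormalize
        self.feats[idx] = F.normalize(feats, dim=1)       # L2-normalized features
```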

Neighborhood aggregation. With the memory module consisting of features and predictions, we could easily train a classifier that maps features to predictions. However, the memory module is updated every mini-batch, and a training procedure involving extra parameters would be time-consuming. To address this, we present a non-parametric neighborhood aggregation strategy to approximate $f_t$. We first retrieve the $m$ nearest neighbors from the memory module for each sample in the current mini-batch, based on the cosine similarity between their features. Then, we aggregate the corresponding predictions of these nearest neighbors by taking the average,

$$\hat{q}_i = \frac{1}{m} \sum_{j \in \mathcal{N}_i} \tilde{p}_j, \tag{4}$$

where $\mathcal{N}_i$ denotes the index set of the $m$ neighbors in the memory module for the data point $x_i$. In this manner, we obtain a new probability prediction by learning on the entire target data. Note that our strategy considers the global structure, going beyond the within-mini-batch regularization of Cui et al. (2020); Jin et al. (2020).
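A possible implementation of this aggregation step, continuing the MemoryBank sketch above; the neighborhood size m=5 is an illustrative default, and excluding each query's own memory slot is our assumption.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def aggregate(bank, feats, idx, m=5):
    # Eq. (4): average the stored soft predictions of the m nearest neighbors,
    # retrieved by cosine similarity between L2-normalized features.
    sims = F.normalize(feats, dim=1) @ bank.feats.t()  # (batch, num_samples)
    sims[torch.arange(len(idx)), idx] = -1.0           # drop each query's own slot
    nn_idx = sims.topk(m, dim=1).indices               # neighbor index sets N_i
    return bank.probs[nn_idx].mean(dim=1)              # aggregated prediction q_i
```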

Pseudo-labeling. For each unlabeled datum $x_i$, we get the pseudo label by choosing the category index with the maximum aggregated probability, i.e., $\hat{y}_i = \arg\max_k \hat{q}_{i,k}$. Considering that different neighborhoods lie in regions of different densities, it is desirable to assign a larger weight to target data in a neighborhood of higher density. Intuitively, the larger the maximum value $\max_k \hat{q}_{i,k}$ is, the denser the region in which the datum lies. Thus, we directly utilize $\hat{w}_i = \max_k \hat{q}_{i,k}$ as the confidence (weight) of the pseudo label $\hat{y}_i$. Finally, a weighted cross-entropy loss is imposed on the unlabeled target data as below,

$$\mathcal{L}_u = -\,\mathbb{E}_{x_i \in \mathcal{D}_u} \left[ \hat{w}_i \log p_{\hat{y}_i}(x_i) \right]. \tag{5}$$

Concerning the labeled data in $\mathcal{D}_s$ and $\mathcal{D}_l$, we employ the standard cross-entropy loss with label-smoothing regularization Szegedy et al. (2016), denoted as $\mathcal{L}_s$ and $\mathcal{L}_l$, respectively. Integrating these losses together, we obtain the final objective for UDA and SSDA as follows,

$$\mathcal{L} = \mathcal{L}_s + \mathcal{L}_l + \lambda \mathcal{L}_u, \tag{6}$$

where $\lambda$ is a trade-off parameter and $\mathcal{L}_l$ vanishes for UDA. Actually, we can readily incorporate $\mathcal{L}_u$ into other domain alignment methods like CDAN Long et al. (2018) as an additional loss. Besides, for SSL methods like MixMatch Berthelot et al. (2019), we just replace the guessed label with the one-hot encoding of $\hat{y}_i$ in the label guessing step.
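Putting the pieces together, here is a sketch of one UDA training step under Eq. (6) (with $\mathcal{L}_l$ omitted); it reuses the hypothetical MemoryBank and aggregate helpers from the sketches above, and the label-smoothing coefficient 0.1 is an assumed value, since the paper only states that label smoothing is used.

```python
import torch
import torch.nn.functional as F

def setl_step(model, bank, src_x, src_y, tgt_x, tgt_idx, lam, m=5):
    # Source classification loss L_s with label smoothing.
    _, src_logits = model(src_x)
    loss = F.cross_entropy(src_logits, src_y, label_smoothing=0.1)

    # Target branch: aggregate neighbors into q, then apply Eq. (5).
    tgt_feat, tgt_logits = model(tgt_x)
    q = aggregate(bank, tgt_feat.detach(), tgt_idx, m)  # Eq. (4) prediction
    w, y_hat = q.max(dim=1)                             # confidence and pseudo label
    log_p = F.log_softmax(tgt_logits, dim=1)
    loss = loss - lam * (w * log_p[torch.arange(len(y_hat)), y_hat]).mean()

    # Write back the current features and sharpened predictions (Eq. (3)).
    bank.update(tgt_idx, tgt_feat.detach(), tgt_logits.detach())
    return loss
```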

4 Experiments

4.1 Setup

Datasets. We use four benchmark datasets in our experiments, introduced as follows.

Office-31 Saenko et al. (2010) is the most widely-used benchmark in the DA field, which consists of three different domains in 31 categories: Amazon (A) with 2,817 images, Webcam (W) with 795 images, and DSLR (D) with 498 images. There are six transfer tasks for evaluation in total.

Office-Home Venkateswara et al. (2017) is another popular benchmark that consists of images from four different domains: Artistic (A), Clip Art (C), Product (P), and Real-World (R) images, around 15,500 images in total from 65 different categories. All twelve transfer tasks are selected for evaluation.

VisDA-C Peng et al. (2017) is a large-scale benchmark used for the Visual Domain Adaptation Challenge 2017 that consists of two very distinct kinds of images from twelve common object classes, i.e., 152,397 synthetic images and 55,388 real images. We focus on the challenging synthetic-to-real transfer task.

DomainNet-126 is a subset of DomainNet Peng et al. (2019), by far the largest UDA dataset, with six distinct domains and approximately 0.6 million images distributed among 345 categories. Following Saito et al. (2019), we pick four domains (Real (R), Clipart (C), Painting (P), Sketch (S)) and 126 classes for evaluation.

Implementation Details. Code will be available at https://github.com/tim-learn/SeTL.

We utilize all the source and target samples and report the average classification accuracy and standard deviation over 3 random trials. All the methods, including domain alignment methods Long et al. (2018); Chen et al. (2019b), semi-supervised methods Berthelot et al. (2019), and regularization approaches Lee (2013); Grandvalet and Bengio (2005); Chen et al. (2019a); Jin et al. (2020); Cui et al. (2020), are implemented in PyTorch. Note that MixMatch Berthelot et al. (2019) can be considered a strong domain adaptation baseline Rukhovich and Galeev (2019). Besides, we select other state-of-the-art UDA approaches Xu et al. (2019); Zou et al. (2019); Kurmi et al. (2019); Li et al. (2020a); Lee et al. (2019b); Li et al. (2020b) and SSDA approaches Saito et al. (2019) for further comparison. For the trade-off parameter, we adopt a linear ramp-up scheduler from 0 to the final value $\lambda$ for all methods. We adopt mini-batch SGD to learn the feature encoder by fine-tuning from the ImageNet pre-trained model with learning rate 0.001, and the new layers (bottleneck layer and classification layer) from scratch with learning rate 0.01. We follow the suggested training settings in Long et al. (2018), including the learning rate scheduler, momentum (0.9), weight decay, bottleneck size (256), and batch size (36).
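For reference, a sketch of this optimization recipe, reusing the AdaptNet sketch from Section 3; the inverse-decay schedule below is the form commonly adopted with these settings (Long et al., 2018), but its decay constants and the weight-decay value are assumptions, not values reproduced from the paper.

```python
import torch

def lr_at(progress, base_lr, alpha=10.0, beta=0.75):
    # Inverse-decay schedule: eta_p = eta_0 * (1 + alpha * p) ** (-beta),
    # where p in [0, 1] is the fraction of training completed (constants assumed).
    return base_lr * (1.0 + alpha * progress) ** (-beta)

def lambda_at(progress, lam_max):
    # Linear ramp-up of the trade-off weight from 0 to its final value.
    return lam_max * min(progress, 1.0)

model = AdaptNet(num_classes=31)  # hypothetical network sketch from Section 3
# Pre-trained backbone is fine-tuned with a 10x smaller learning rate than
# the newly added bottleneck and classifier layers.
optimizer = torch.optim.SGD(
    [{"params": model.g.parameters(), "lr": 0.001},
     {"params": model.bottleneck.parameters(), "lr": 0.01},
     {"params": model.f.parameters(), "lr": 0.01}],
    lr=0.01, momentum=0.9, weight_decay=1e-3)  # weight-decay value assumed
```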

4.2 Results

Results of UDA. We use the four datasets introduced above for vanilla UDA tasks, with results shown in Tables 1–4. On the small-sized Office-31 dataset, we first study different regularization approaches when integrated with the source classification loss only. It is obvious that both MCC Jin et al. (2020) and BNM Cui et al. (2020) consistently perform better than instance-wise regularization methods like MinEnt Grandvalet and Bengio (2005), which verifies the importance of prediction diversity. SeTL outperforms MCC and BNM in 5 out of 6 tasks, obtaining the best average accuracy. To save space, we select the best-performing counterpart, BNM, for later comparison. When combined with state-of-the-art UDA methods Long et al. (2018); Chen et al. (2019b), the average accuracy of both methods increases accordingly, and SeTL still performs the best. Since Office-31 is relatively small, MixMatch Berthelot et al. (2019) performs worse than CDAN; using pseudo labels provided by SeTL, MixMatch obtains boosted performance. Besides, SeTL achieves performance competitive with state-of-the-art UDA methods like ATM Li et al. (2020a) without any explicit feature-level alignment. SeTL incorporated into the UDA method of Chen et al. (2019b) achieves the best performance on Office-31.

Method | A→D | A→W | D→A | D→W | W→A | W→D | Avg. | Avg.†
ResNet-50 He et al. (2016) | 78.1±0.6 | 72.1±0.4 | 57.3±0.0 | 93.5±0.1 | 61.6±0.1 | 98.4±0.2 | 76.8 | 67.3
Pseudo-Label Lee (2013) | 89.9±0.5 | 89.0±0.2 | 65.7±0.2 | 98.1±0.0 | 66.7±0.2 | 99.7±0.1 | 84.9 | 77.8
MinEnt Grandvalet and Bengio (2005) | 90.5±0.3 | 89.7±0.6 | 67.1±0.0 | 97.6±0.0 | 64.9±0.1 | 100.0±0.0 | 85.0 | 78.0
MaxSquare Chen et al. (2019a) | 91.1±0.2 | 90.4±0.6 | 67.9±0.0 | 97.7±0.0 | 64.0±0.1 | 100.0±0.0 | 85.2 | 78.4
MCC Jin et al. (2020) | 92.3±0.2 | 93.7±0.2 | 74.8±0.0 | 98.5±0.0 | 75.4±0.1 | 100.0±0.0 | 89.1 | 84.1
BNM Cui et al. (2020) | 92.3±0.2 | 94.2±0.1 | 74.8±0.0 | 98.5±0.0 | 75.3±0.1 | 100.0±0.0 | 89.2 | 84.2
SeTL (ours) | 95.1±0.6 | 94.5±0.2 | 75.5±0.0 | 98.9±0.0 | 75.3±0.0 | 99.6±0.0 | 89.8 | 85.1
CDAN Long et al. (2018) | 94.7±0.8 | 94.2±0.1 | 72.9±0.2 | 98.6±0.0 | 71.9±0.2 | 100.0±0.0 | 88.7 | 83.4
 w/ BNM Cui et al. (2020) | 94.8±0.2 | 94.9±0.2 | 76.0±0.0 | 99.0±0.0 | 75.9±0.0 | 100.0±0.0 | 90.1 | 85.4
 w/ SeTL (ours) | 94.9±0.1 | 94.3±0.2 | 77.4±0.2 | 98.1±0.0 | 77.2±0.0 | 99.8±0.0 | 90.3 | 86.0
BSP+CDAN Chen et al. (2019b) | 94.1±0.5 | 95.0±0.3 | 74.0±0.0 | 98.4±0.0 | 75.8±0.2 | 100.0±0.0 | 89.6 | 84.7
 w/ BNM Cui et al. (2020) | 93.4±0.4 | 94.3±0.4 | 77.0±0.2 | 98.9±0.0 | 76.2±0.1 | 100.0±0.0 | 90.0 | 85.2
 w/ SeTL (ours) | 96.7±0.5 | 95.7±0.1 | 76.6±0.0 | 98.9±0.0 | 76.5±0.0 | 100.0±0.0 | 90.7 | 86.4
MixMatch Berthelot et al. (2019) | 89.0±0.2 | 86.0±0.9 | 65.8±2.0 | 96.2±0.0 | 65.6±0.5 | 99.6±0.0 | 83.7 | 76.6
 w/ SeTL (ours) | 92.1±1.4 | 92.3±3.1 | 70.6±0.0 | 98.6±0.0 | 75.5±0.7 | 99.6±0.0 | 88.1 | 82.6
SAFN Xu et al. (2019) | 90.7±0.5 | 90.1±0.8 | 73.0±0.2 | 98.6±0.2 | 70.2±0.3 | 99.8±0.0 | 87.1 | 81.0
CRST Zou et al. (2019) | 88.7±0.8 | 89.4±0.7 | 72.6±0.7 | 98.9±0.4 | 70.9±0.5 | 100.0±0.0 | 86.8 | 80.4
CADA-P Kurmi et al. (2019) | 95.6±0.1 | 97.0±0.2 | 71.5±0.2 | 99.3±0.1 | 73.1±0.3 | 100.0±0.0 | 89.5 | 84.3
ATM Li et al. (2020a) | 96.4±0.2 | 95.7±0.3 | 74.1±0.2 | 99.3±0.1 | 73.5±0.3 | 100.0±0.0 | 89.8 | 84.9

Table 1: Accuracy (%) on Office-31 for UDA (ResNet-50). [† denotes mean values excluding the D→W and W→D tasks]

Method | aero | bike | bus | car | horse | knife | mbike | person | plant | skbrd | train | truck | Mean
ResNet-101 He et al. (2016) | 68.6 | 24.3 | 54.9 | 63.5 | 69.2 | 15.7 | 85.2 | 13.5 | 68.1 | 32.0 | 82.2 | 16.8 | 49.5
BNM Cui et al. (2020) | 91.6 | 70.8 | 76.5 | 65.4 | 90.7 | 77.8 | 90.6 | 76.3 | 91.7 | 68.2 | 88.3 | 44.3 | 77.7
SeTL (ours) | 93.3 | 84.3 | 78.0 | 59.7 | 90.3 | 95.2 | 83.7 | 69.7 | 90.8 | 79.5 | 87.6 | 53.5 | 80.5
CDAN Long et al. (2018) | 93.4 | 55.5 | 79.5 | 71.1 | 88.6 | 87.0 | 93.5 | 78.5 | 88.9 | 68.5 | 88.6 | 36.3 | 77.4
 w/ BSP Chen et al. (2019b) | 93.4 | 56.0 | 79.0 | 69.0 | 89.5 | 87.2 | 92.4 | 79.4 | 89.7 | 74.1 | 88.7 | 32.0 | 77.5
 w/ BNM Cui et al. (2020) | 94.2 | 66.7 | 78.7 | 70.2 | 90.7 | 88.5 | 92.7 | 78.0 | 90.4 | 73.8 | 88.9 | 44.1 | 79.7
 w/ SeTL (ours) | 94.4 | 74.1 | 83.1 | 63.6 | 92.2 | 91.2 | 91.7 | 77.0 | 91.6 | 86.1 | 87.6 | 44.2 | 81.4
MixMatch Berthelot et al. (2019) | 94.3 | 71.3 | 94.2 | 81.6 | 95.2 | 0.6 | 90.6 | 40.7 | 93.8 | 96.2 | 84.7 | 0.5 | 70.3
 w/ SeTL (ours) | 94.8 | 85.1 | 82.2 | 71.1 | 95.8 | 98.2 | 88.1 | 81.7 | 94.1 | 92.5 | 91.7 | 62.4 | 86.5
SAFN Xu et al. (2019) | 93.6 | 61.3 | 84.1 | 70.6 | 94.1 | 79.0 | 91.8 | 79.6 | 89.9 | 55.6 | 89.0 | 24.4 | 76.1
CRST Zou et al. (2019) | 88.0 | 79.2 | 61.0 | 60.0 | 87.5 | 81.4 | 86.3 | 78.8 | 85.6 | 86.6 | 73.9 | 68.8 | 78.1
DTA Lee et al. (2019b) | 93.7 | 82.2 | 85.6 | 83.8 | 93.0 | 81.0 | 90.7 | 82.1 | 95.1 | 78.1 | 86.4 | 32.1 | 81.5

Table 2: Per-class accuracy (%) on VisDA-C validation set using a ResNet-101 backbone.

For VisDA-C and Office-Home, we compare BNM and SeTL with and without domain alignment. As shown in Table 2, SeTL clearly performs better than BNM w.r.t. mean accuracy in both settings. Note that SeTL combined with MixMatch obtains a state-of-the-art mean accuracy of 86.5% on VisDA-C, outperforming recent UDA methods Xu et al. (2019); Zou et al. (2019); Lee et al. (2019b). Taking a closer look at Table 3, we observe similar results on Office-Home, where SeTL beats BNM in terms of mean accuracy. Since VisDA-C only contains 12 classes in total, it is necessary to introduce DomainNet-126 as a new large-scale UDA testbed. Table 4 again validates the effectiveness of the proposed SeTL. Compared with the medium-sized Office-Home, SeTL shows even larger advantages over BNM on large-scale datasets like VisDA-C and DomainNet-126.

Method | A→C | A→P | A→R | C→A | C→P | C→R | P→A | P→C | P→R | R→A | R→C | R→P | Avg.
ResNet-50 He et al. (2016) | 44.9 | 66.3 | 74.1 | 51.9 | 61.7 | 63.7 | 52.6 | 39.1 | 71.3 | 63.9 | 45.8 | 77.1 | 59.4
BNM Cui et al. (2020) | 56.6 | 77.6 | 81.0 | 67.4 | 76.3 | 77.2 | 65.2 | 55.1 | 81.9 | 73.4 | 57.0 | 84.2 | 71.1
SeTL (ours) | 58.4 | 78.9 | 82.4 | 69.1 | 77.6 | 78.1 | 67.1 | 56.3 | 82.7 | 72.0 | 58.3 | 85.5 | 72.2
CDAN Long et al. (2018) | 54.7 | 74.1 | 78.1 | 63.2 | 72.2 | 74.4 | 61.7 | 51.7 | 79.3 | 72.2 | 57.3 | 82.9 | 68.5
 w/ BSP Chen et al. (2019b) | 56.7 | 73.5 | 77.5 | 64.2 | 71.9 | 74.4 | 64.1 | 56.9 | 80.8 | 73.6 | 58.9 | 83.3 | 69.6
 w/ BNM Cui et al. (2020) | 58.1 | 77.2 | 81.1 | 67.5 | 75.3 | 77.2 | 65.5 | 56.8 | 82.6 | 74.1 | 59.9 | 84.6 | 71.7
 w/ SeTL (ours) | 60.1 | 77.9 | 82.3 | 68.6 | 78.2 | 77.8 | 67.9 | 58.3 | 83.0 | 74.5 | 61.6 | 87.1 | 73.1
MixMatch Berthelot et al. (2019) | 52.4 | 74.3 | 80.2 | 64.8 | 74.5 | 75.3 | 61.7 | 51.0 | 80.0 | 72.4 | 56.6 | 83.8 | 68.9
 w/ SeTL (ours) | 58.8 | 77.7 | 82.4 | 67.2 | 78.2 | 79.5 | 64.9 | 53.7 | 83.7 | 71.8 | 61.4 | 85.4 | 72.1
SAFN Xu et al. (2019) | 52.0 | 71.7 | 76.3 | 64.2 | 69.9 | 71.9 | 63.7 | 51.4 | 77.1 | 70.9 | 57.1 | 81.5 | 67.3
CADA-P Kurmi et al. (2019) | 56.9 | 76.4 | 80.7 | 61.3 | 75.2 | 75.2 | 63.2 | 54.5 | 80.7 | 73.9 | 61.5 | 84.1 | 70.2
DCAN Li et al. (2020b) | 54.5 | 75.7 | 81.2 | 67.4 | 74.0 | 76.3 | 67.4 | 52.7 | 80.6 | 74.1 | 59.1 | 83.5 | 70.5

Table 3: Accuracy (%) on Office-Home for UDA (ResNet-50).

Method | C→P | C→R | C→S | P→C | P→R | P→S | R→C | R→P | R→S | S→C | S→P | S→R | Avg.
ResNet-50 He et al. (2016) | 49.0 | 62.4 | 50.3 | 56.7 | 75.3 | 50.2 | 58.3 | 63.5 | 48.9 | 56.9 | 52.0 | 59.5 | 56.9
BNM Cui et al. (2020) | 59.8 | 72.2 | 60.1 | 69.5 | 80.7 | 65.4 | 68.9 | 69.2 | 60.8 | 71.8 | 66.5 | 74.0 | 68.2
SeTL (ours) | 66.3 | 78.9 | 64.9 | 73.5 | 82.3 | 66.5 | 74.2 | 71.9 | 64.5 | 75.8 | 68.6 | 78.9 | 72.2
CDAN Long et al. (2018) | 57.6 | 69.1 | 59.4 | 65.5 | 77.1 | 62.1 | 72.0 | 70.9 | 63.5 | 67.7 | 64.4 | 69.6 | 66.6
 w/ BSP Chen et al. (2019b) | 58.2 | 69.3 | 58.8 | 65.8 | 76.3 | 62.4 | 71.6 | 70.9 | 62.9 | 67.7 | 65.3 | 69.8 | 66.6
 w/ BNM Cui et al. (2020) | 61.4 | 73.8 | 61.5 | 69.6 | 80.6 | 66.6 | 72.5 | 70.8 | 63.9 | 71.6 | 68.0 | 73.7 | 69.5
 w/ SeTL (ours) | 66.2 | 79.0 | 65.3 | 73.0 | 82.2 | 67.3 | 74.6 | 71.8 | 65.9 | 74.9 | 69.5 | 77.5 | 72.3
MixMatch Berthelot et al. (2019) | 58.3 | 72.8 | 60.4 | 69.3 | 79.6 | 66.2 | 70.9 | 71.7 | 62.5 | 72.4 | 67.0 | 75.8 | 68.9
 w/ SeTL (ours) | 66.8 | 79.1 | 67.5 | 73.9 | 82.4 | 67.9 | 75.7 | 73.9 | 68.5 | 76.7 | 71.4 | 79.8 | 73.6

Table 4: Accuracy (%) on DomainNet-126 for UDA (ResNet-50).

Results of SSDA. We follow the settings in MME Saito et al. (2019) and evaluate SSDA methods on two benchmark datasets: Office-Home and DomainNet-126. For each dataset, there exist two SSDA settings, i.e., 1-shot and 3-shot, where each class in the target domain has one or three labeled data points, respectively. As shown in Table 5, SeTL outperforms both BNM and MCC for both settings, and MixMatch also benefits from the incorporation of SeTL. Comparing the results of SeTL under 1-shot and 3-shot, we find the difference between them is relatively small, implying that SeTL can fully exploit the unlabeled data to compensate for the scarcity of labeled data. We can draw similar conclusions on the Office-Home dataset from Table 6. Moreover, compared with prior state-of-the-art SSDA results in Saito et al. (2019), both SeTL and its combination with MixMatch achieve better performance for both datasets under both settings.

Method | C→S | P→C | P→R | R→C | R→P | R→S | S→P | Average   (each cell: 1-shot / 3-shot)
ResNet-34 He et al. (2016) | 54.8/57.9 | 59.2/63.0 | 73.7/75.6 | 61.2/63.9 | 64.5/66.3 | 52.0/56.0 | 60.4/62.2 | 60.8/63.6
MCC Jin et al. (2020) | 56.9/60.5 | 62.8/66.4 | 75.4/77.2 | 65.5/67.8 | 67.0/68.3 | 57.9/59.3 | 63.5/64.9 | 64.1/66.3
BNM Cui et al. (2020) | 58.5/62.7 | 69.2/72.0 | 77.0/79.5 | 69.4/73.5 | 69.4/71.2 | 61.2/65.0 | 63.6/67.0 | 66.9/70.1
SeTL (ours) | 66.0/66.1 | 73.0/74.2 | 81.2/81.3 | 74.8/76.9 | 71.3/72.5 | 65.2/65.0 | 68.7/70.6 | 71.5/72.4
MixMatch Berthelot et al. (2019) | 59.3/62.8 | 66.7/68.6 | 75.2/78.8 | 69.6/72.7 | 67.8/68.8 | 62.5/65.6 | 66.3/67.1 | 66.8/69.2
 w/ SeTL (ours) | 64.4/66.0 | 71.1/72.3 | 80.2/80.9 | 73.9/75.2 | 70.2/71.2 | 65.7/67.3 | 67.7/69.5 | 70.5/71.8
ENT† Saito et al. (2019) | 54.6/60.0 | 65.4/71.1 | 75.0/78.6 | 65.2/71.0 | 65.9/69.2 | 52.1/61.1 | 60.0/59.7 | 62.6/67.6
MME† Saito et al. (2019) | 56.3/61.8 | 69.0/71.7 | 76.1/78.5 | 70.0/72.2 | 67.7/69.7 | 61.0/61.9 | 64.8/66.8 | 66.4/68.9

Table 5: Accuracy (%) on DomainNet-126 for SSDA (ResNet-34). [† denotes results reported in Saito et al. (2019)]

Method | A→C | A→P | A→R | C→A | C→P | C→R | P→A | P→C | P→R | R→A | R→C | R→P | Average   (each cell: 1-shot / 3-shot)
VGG-16 Simonyan and Zisserman (2015) | 38.9/48.1 | 64.8/71.9 | 69.8/72.8 | 50.7/55.5 | 66.2/71.7 | 64.6/69.3 | 50.8/54.1 | 38.5/47.9 | 71.8/73.5 | 61.4/61.9 | 42.5/50.5 | 76.8/79.5 | 58.1/63.1
MCC Jin et al. (2020) | 42.2/49.6 | 69.3/74.6 | 71.8/74.7 | 55.6/56.9 | 69.2/75.8 | 69.8/72.2 | 55.2/56.1 | 41.5/50.0 | 74.1/75.1 | 63.8/63.0 | 43.9/52.6 | 78.4/81.5 | 61.2/65.2
BNM Cui et al. (2020) | 41.0/50.4 | 69.8/77.5 | 74.3/76.5 | 58.3/59.4 | 71.4/76.8 | 70.6/73.5 | 54.3/57.2 | 40.1/52.4 | 76.6/77.5 | 63.8/65.1 | 41.8/53.3 | 79.8/83.7 | 61.8/66.9
SeTL (ours) | 46.6/53.7 | 71.8/78.2 | 76.2/76.6 | 59.6/62.1 | 75.7/78.3 | 72.3/75.5 | 59.9/61.6 | 46.7/52.7 | 77.9/78.2 | 66.4/67.1 | 50.9/55.3 | 81.2/84.2 | 65.4/68.6
MixMatch Berthelot et al. (2019) | 40.8/47.1 | 67.8/74.0 | 72.2/73.9 | 55.6/57.5 | 68.9/75.3 | 68.8/71.0 | 50.2/55.5 | 35.6/47.5 | 73.5/74.7 | 63.8/67.7 | 38.0/44.7 | 79.6/81.7 | 59.6/64.2
 w/ SeTL (ours) | 49.6/53.7 | 74.6/79.0 | 75.1/76.9 | 58.9/61.6 | 74.9/78.4 | 71.9/74.0 | 58.6/62.6 | 47.2/52.0 | 77.1/78.2 | 67.2/68.5 | 53.3/56.2 | 82.6/84.2 | 65.9/68.8
DANN† Ganin and Lempitsky (2015) | 44.4/50.0 | 64.3/69.5 | 68.9/72.3 | 52.3/56.4 | 65.3/69.8 | 64.2/68.7 | 51.3/56.3 | 45.9/52.4 | 72.7/73.6 | 62.7/63.7 | 52.0/56.1 | 75.7/77.9 | 60.0/63.9
MME† Saito et al. (2019) | 45.8/54.9 | 68.6/75.7 | 72.2/75.3 | 57.5/61.1 | 71.3/76.3 | 68.0/72.9 | 56.0/59.2 | 46.2/53.6 | 74.4/76.7 | 65.1/65.7 | 49.1/56.9 | 78.7/82.9 | 62.7/67.6

Table 6: Accuracy (%) on Office-Home for SSDA (VGG-16). [† denotes results reported in Saito et al. (2019)]

Results of SSL. We also evaluate SeTL in the case without domain shift. Here we focus on a special case of SSL where annotated samples are very scarce. For simplicity, we adopt the same three-shot setting from SSDA for the SSL task: we take the labeled target data as the labeled set and the unlabeled target data as the unlabeled set, forming the scarce-labeled SSL task.

Method | Art | Clipart | Product | Real-World | Avg. || Clipart | Painting | Real | Sketch | Avg.   (left block: Office-Home; right block: DomainNet-126)
ResNet-50 He et al. (2016) | 48.6±0.2 | 42.3±0.2 | 69.0±0.1 | 66.5±0.1 | 56.6 || 41.5±0.1 | 46.2±0.0 | 66.4±0.0 | 33.4±0.0 | 46.9
Pseudo-Label Lee (2013) | 48.0±0.3 | 41.4±0.2 | 71.2±0.3 | 65.8±0.1 | 56.6 || 41.0±0.0 | 46.3±0.0 | 72.5±0.0 | 33.2±0.4 | 48.2
MinEnt Grandvalet and Bengio (2005) | 51.8±0.0 | 44.2±0.4 | 72.3±0.1 | 68.9±0.0 | 59.3 || 43.8±0.0 | 48.6±0.0 | 68.8±0.0 | 36.0±0.0 | 49.3
MaxSquare Chen et al. (2019a) | 54.4±0.0 | 43.9±0.6 | 73.0±0.1 | 69.0±0.0 | 60.1 || 43.6±0.4 | 48.7±0.0 | 69.0±0.0 | 36.9±0.0 | 49.6
MCC Jin et al. (2020) | 58.7±0.0 | 47.0±0.8 | 77.5±0.0 | 74.1±0.0 | 64.3 || 45.6±0.0 | 49.5±0.0 | 70.8±0.0 | 38.6±0.0 | 51.1
BNM Cui et al. (2020) | 59.0±0.0 | 46.0±0.1 | 75.7±0.1 | 71.5±0.0 | 63.0 || 44.8±0.1 | 46.5±0.0 | 69.9±0.0 | 35.5±0.1 | 49.2
SeTL (ours) | 59.3±0.2 | 46.8±0.8 | 78.4±0.0 | 76.1±0.1 | 65.2 || 54.6±0.1 | 60.0±0.0 | 75.5±0.1 | 39.4±0.1 | 57.4
MixMatch Berthelot et al. (2019) | 52.1±0.0 | 42.7±0.8 | 72.9±0.3 | 69.0±0.2 | 59.2 || 41.2±0.0 | 39.3±0.1 | 64.5±0.4 | 34.2±0.5 | 44.8
 w/ SeTL (ours) | 57.2±0.2 | 48.4±0.8 | 74.8±0.4 | 74.9±0.5 | 63.8 || 49.5±0.5 | 51.9±0.0 | 73.5±0.0 | 40.3±0.3 | 53.8

Table 7: Accuracy (%) on Office-Home and DomainNet-126 for scarce-labeled SSL (ResNet-50).

As shown in Table 7, SeTL performs the best on both Office-Home and DomainNet-126. For such a scarce-labeled SSL task, MixMatch performs badly, possibly because the labeled data are so scarce that the guessed labels are of low quality, bringing much noise into the subsequent mixup step. Taking full advantage of unlabeled data, SeTL improves the quality of the pseudo labels and significantly boosts the performance of MixMatch when MixMatch's label guessing process is replaced with SeTL. Benefiting from a large amount of unlabeled data, SeTL outperforms BNM and MCC on DomainNet-126 by a larger margin than on Office-Home.

Figure 2: For the A→C task on Office-Home, (a) shows the convergence of SeTL and BNM Cui et al. (2020) w.r.t. the ramp-up of $\lambda$, and (b) shows t-SNE visualizations of features learned by ResNet-50 He et al. (2016), Pseudo-Label Lee (2013), and SeTL (ours). (red: A, blue: C)

4.3 Model Analysis

We study the convergence of SeTL and the ramp-up of $\lambda$, and make comparisons with BNM in Fig. 2(a). Comparing both methods with and without the ramp-up, it is easy to verify the effectiveness of the linear ramp-up: since the pseudo labels or original classifier outputs in the early stage are not reliable enough, progressively increasing the regularization weight is desirable for both SeTL and BNM. Besides, as the number of iterations increases, the accuracy of SeTL grows and eventually converges. Furthermore, we employ t-SNE visualization Maaten and Hinton (2008) in Fig. 2(b) to examine whether features from different domains are well aligned even without explicit domain alignment. Compared with ResNet-50 and Pseudo-Label, the features from both domains learned by SeTL are semantically aligned and more favorable.

Ablation | Office-31 | VisDA-C
SeTL (default: temperature $T$, neighborhood size $m$, trade-off $\lambda$) | 89.8 | 80.5
SeTL w/o confidence weighting | 89.5 (↓0.3) | 79.9 (↓0.6)
SeTL w/o temperature sharpening | 89.4 (↓0.4) | 80.0 (↓0.5)
SeTL w/ a much smaller neighborhood size $m$ | 84.8 (↓5.0) | 79.8 (↓0.7)
SeTL w/ a smaller neighborhood size $m$ | 88.0 (↓1.8) | 80.5 (-)
SeTL w/ a smaller trade-off $\lambda$ | 90.0 (↑0.2) | 78.8 (↓1.7)
SeTL w/ a larger trade-off $\lambda$ | 89.3 (↓0.5) | 81.0 (↑0.5)
Table 8: Ablation study.

We further conduct ablation studies on Office-31 and VisDA-C for UDA and report average accuracies in Table 8. Comparing the results in the first three rows, we find that both the weighting and sharpening strategies are effective. Besides, we study the neighborhood size $m$ and find that a larger value of $m$ brings better performance; in particular, on the small Office-31 dataset, using a very small neighborhood is quite risky and achieves clearly worse results. Regarding the trade-off parameter $\lambda$, we discover that a moderate value is a suitable choice for both datasets. For the large-scale VisDA-C dataset, the learned pseudo labels are more reliable, so a larger value of $\lambda$ is beneficial.

5 Conclusion

We presented SeTL, a new regularization approach to address dataset shift in domain adaptation tasks. Despite its simplicity, extensive experiments demonstrated that SeTL outperforms both domain alignment methods and other regularization methods by consistent margins on UDA, SSDA, and even scarce-labeled SSL tasks. In the future, we would like to extend SeTL to other challenging transfer tasks like universal DA Saito et al. (2020); You et al. (2019) and dense labeling tasks like semantic segmentation Tsai et al. (2018); Chen et al. (2019c).

References

  • [1] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan (2010) A theory of learning from different domains. Mach. Learn. 79 (1-2), pp. 151–175. Cited by: §2.1.
  • [2] D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, and C. A. Raffel (2019) Mixmatch: a holistic approach to semi-supervised learning. In Proc. NeurIPS, Cited by: §3.2, §3.2, §4.1, §4.2, Table 1, Table 2, Table 3, Table 4, Table 5, Table 6, Table 7.
  • [3] K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan (2016) Domain separation networks. In Proc. NeurIPS, Cited by: §2.1.
  • [4] W. Chang, T. You, S. Seo, S. Kwak, and B. Han (2019) Domain-specific batch normalization for unsupervised domain adaptation. In Proc. CVPR, Cited by: §2.1.
  • [5] M. Chen, H. Xue, and D. Cai (2019) Domain adaptation for semantic segmentation with maximum squares loss. In Pro. ICCV, Cited by: §2.2, §3.1, §3.1, §3, §4.1, Table 1, Table 7.
  • [6] X. Chen, S. Wang, M. Long, and J. Wang (2019) Transferability vs. discriminability: batch spectral penalization for adversarial domain adaptation. In Proc. ICML, Cited by: §4.1, §4.2, Table 1, Table 2, Table 3, Table 4.
  • [7] Y. Chen, X. Zhu, and S. Gong (2018) Semi-supervised deep learning with memory. In Proc. ECCV, Cited by: §2.3.
  • [8] Y. Chen, Y. Lin, M. Yang, and J. Huang (2019) Crdoco: pixel-level domain transfer with cross-domain consistency. In Proc. CVPR, Cited by: §5.
  • [9] S. Cicek and S. Soatto (2019) Unsupervised domain adaptation via regularized conditional alignment. In Proc. ICCV, Cited by: §1, §2.1, §3.
  • [10] G. Csurka (2017) A comprehensive survey on domain adaptation for visual applications. In Domain Adaptation in Computer Vision Applications, pp. 1–35. Cited by: §2.
  • [11] S. Cui, S. Wang, J. Zhuo, L. Li, Q. Huang, and Q. Tian (2020) Towards discriminability and diversity: batch nuclear-norm maximization under label insufficient situations. In Proc. CVPR, Cited by: §2.2, §3.1, §3.2, §3, Figure 2, §4.1, §4.2, Table 1, Table 2, Table 3, Table 4, Table 5, Table 6, Table 7.
  • [12] Z. Deng, Y. Luo, and J. Zhu (2019) Cluster alignment with a teacher for unsupervised domain adaptation. In Proc. ICCV, Cited by: §2.2.
  • [13] Y. Ganin and V. Lempitsky (2015) Unsupervised domain adaptation by backpropagation. In Proc. ICML, Cited by: §1, §2.1, §3, Table 6.
  • [14] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016) Domain-adversarial training of neural networks. J. Mach. Learn. Res. 17 (1), pp. 2096–2030. Cited by: §1, §1.
  • [15] B. Gong, Y. Shi, F. Sha, and K. Grauman (2012) Geodesic flow kernel for unsupervised domain adaptation. In Proc. CVPR, Cited by: §1.
  • [16] Y. Grandvalet and Y. Bengio (2005) Semi-supervised learning by entropy minimization. In Proc. NeurIPS, Cited by: §2.2, §3.1, §4.1, §4.2, Table 1, Table 7.
  • [17] A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola (2007) A kernel method for the two-sample-problem. In Proc. NeurIPS, Cited by: §2.1.
  • [18] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. In Proc. ICML, Cited by: §3.2.
  • [19] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proc. CVPR, Cited by: Figure 2, Table 1, Table 2, Table 3, Table 4, Table 5, Table 7.
  • [20] J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell (2018) CyCADA: cycle-consistent adversarial domain adaptation. In Proc. ICML, Cited by: §1, §3.
  • [21] J. Hoffman, D. Wang, F. Yu, and T. Darrell (2016) Fcns in the wild: pixel-level adversarial and constraint-based adaptation. arXiv preprint arXiv:1612.02649. Cited by: §1.
  • [22] Y. Jin, X. Wang, M. Long, and J. Wang (2020) Minimum class confusion for versatile domain adaptation. In Proc. ECCV, Cited by: §2.2, §3.1, §3.2, §3, §4.1, §4.2, Table 1, Table 5, Table 6, Table 7.
  • [23] G. Kang, L. Jiang, Y. Yang, and A. G. Hauptmann (2019) Contrastive adaptation network for unsupervised domain adaptation. In Proc. CVPR, Cited by: §1.
  • [24] W. M. Kouw and M. Loog (2019) A review of domain adaptation without target labels. IEEE Trans. Pattern Anal. Mach. Intell. (), pp. 1–1. Cited by: §2.
  • [25] V. K. Kurmi, S. Kumar, and V. P. Namboodiri (2019) Attending to discriminative certainty for domain adaptation. In Proc. CVPR, Cited by: §4.1, Table 1, Table 3.
  • [26] C. Lee, T. Batra, M. H. Baig, and D. Ulbricht (2019) Sliced wasserstein discrepancy for unsupervised domain adaptation. In Proc. CVPR, Cited by: §1.
  • [27] D. Lee (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, Cited by: §2.2, §3.1, §3.2, Figure 2, §4.1, Table 1, Table 7.
  • [28] S. Lee, D. Kim, N. Kim, and S. Jeong (2019) Drop to adapt: learning discriminative features for unsupervised domain adaptation. In Proc. ICCV, Cited by: §4.1, §4.2, Table 2.
  • [29] J. Li, E. Chen, D. Zhengming, L. Zhu, K. Lu, and H. T. Shen (2020) Maximum density divergence for domain adaptation. IEEE Trans. Pattern Anal. Mach. Intell. (), pp. 1–1. Cited by: §3, §4.1, §4.2, Table 1.
  • [30] S. Li, C. H. Liu, Q. Lin, B. Xie, Z. Ding, G. Huang, and J. Tang (2020) Domain conditioned adaptation network. In Proc. AAAI, Cited by: §4.1, Table 3.
  • [31] J. Liang, R. He, Z. Sun, and T. Tan (2019) Distant supervised centroid shift: a simple and efficient approach to visual domain adaptation. In Proc. CVPR, Cited by: §3.2.
  • [32] J. Liang, D. Hu, and J. Feng (2020) Do we really need to access the source data? source hypothesis transfer for unsupervised domain adaptation. In Proc. ICML, Cited by: §2.1.
  • [33] M. Long, Y. Cao, J. Wang, and M. Jordan (2015) Learning transferable features with deep adaptation networks. In Proc. ICML, Cited by: §1, §2.1.
  • [34] M. Long, Z. Cao, J. Wang, and M. I. Jordan (2018) Conditional adversarial domain adaptation. In Proc. NeurIPS, Cited by: §1, §2.1, §3.2, §3.2, §3, §4.1, §4.2, Table 1, Table 2, Table 3, Table 4.
  • [35] M. Long, H. Zhu, J. Wang, and M. I. Jordan (2016) Unsupervised domain adaptation with residual transfer networks. In Proc. NeurIPS, Cited by: §2.1.
  • [36] L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-sne. J. Mach. Learn. Res. 9 (Nov), pp. 2579–2605. Cited by: §4.3.
  • [37] T. Miyato, S. Maeda, M. Koyama, and S. Ishii (2018) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 41 (8), pp. 1979–1993. Cited by: §2.2.
  • [38] X. Peng, Q. Bai, X. Xia, Z. Huang, K. Saenko, and B. Wang (2019) Moment matching for multi-source domain adaptation. In Proc. ICCV, Cited by: §4.1.
  • [39] X. Peng, B. Usman, N. Kaushik, J. Hoffman, D. Wang, and K. Saenko (2017) Visda: the visual domain adaptation challenge. arXiv preprint arXiv:1710.06924. Cited by: §1, §4.1.
  • [40] J. Quionero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence (2009) Dataset Shift in Machine Learning. Cited by: §1.
  • [41] A. Rozantsev, M. Salzmann, and P. Fua (2018) Beyond sharing weights for deep domain adaptation. IEEE Trans. Pattern Anal. Mach. Intell. 41 (4), pp. 801–814. Cited by: §2.1.
  • [42] D. Rukhovich and D. Galeev (2019) MixMatch domain adaptaion: prize-winning solution for both tracks of visda 2019 challenge. arXiv preprint arXiv:1910.03903. Cited by: §4.1.
  • [43] K. Saenko, B. Kulis, M. Fritz, and T. Darrell (2010) Adapting visual category models to new domains. In Proc. ECCV, Cited by: §4.1.
  • [44] K. Saito, D. Kim, S. Sclaroff, T. Darrell, and K. Saenko (2019) Semi-supervised domain adaptation via minimax entropy. In Proc. ICCV, Cited by: §4.1, §4.1, §4.2, Table 5, Table 6.
  • [45] K. Saito, D. Kim, S. Sclaroff, and K. Saenko (2020) Universal domain adaptation through self supervision. arXiv preprint arXiv:2002.07953. Cited by: §2.3, §5.
  • [46] S. Sankaranarayanan, Y. Balaji, C. D. Castillo, and R. Chellappa (2018) Generate to adapt: aligning domains using generative adversarial networks. In Proc. CVPR, Cited by: §1, §3.
  • [47] W. Shi, Y. Gong, C. Ding, Z. Ma, X. Tao, and N. Zheng (2018) Transductive semi-supervised deep learning using min-max features. In Proc. ECCV, Cited by: §2.2.
  • [48] R. Shu, H. Bui, H. Narui, and S. Ermon (2018) A dirt-t approach to unsupervised domain adaptation. In Proc. ICLR, Cited by: §2.1.
  • [49] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In Proc. ICLR, Cited by: Table 6.
  • [50] S. Sukhbaatar, J. Weston, R. Fergus, et al. (2015) End-to-end memory networks. In Proc. NeurIPS, Cited by: §2.3.
  • [51] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proc. CVPR, Cited by: §3.2.
  • [52] T. Tommasi, M. Lanzi, P. Russo, and B. Caputo (2016) Learning the roots of visual domain shift. In Proc. ECCV, Cited by: §1.
  • [53] Y. Tsai, W. Hung, S. Schulter, K. Sohn, M. Yang, and M. Chandraker (2018) Learning to adapt structured output space for semantic segmentation. In Proc. CVPR, Cited by: §1, §5.
  • [54] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell (2017) Adversarial discriminative domain adaptation. In Proc. CVPR, Cited by: §2.1, §3.
  • [55] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell (2014) Deep domain confusion: maximizing for domain invariance. arXiv preprint arXiv:1412.3474. Cited by: §2.1.
  • [56] H. Venkateswara, J. Eusebio, S. Chakraborty, and S. Panchanathan (2017) Deep hashing network for unsupervised domain adaptation. In Proc. CVPR, Cited by: §4.1.
  • [57] G. Wilson and D. J. Cook (2020) A survey of unsupervised deep domain adaptation. ACM Trans. Intell. Syst. Technol. 11 (5). Cited by: §2.
  • [58] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. In Proc. CVPR, Cited by: §2.3.
  • [59] R. Xu, G. Li, J. Yang, and L. Lin (2019) Larger norm more transferable: an adaptive feature norm approach for unsupervised domain adaptation. In Proc. ICCV, Cited by: §4.1, §4.2, Table 1, Table 2, Table 3.
  • [60] K. You, M. Long, Z. Cao, J. Wang, and M. I. Jordan (2019) Universal domain adaptation. In Proc. CVPR, Cited by: §5.
  • [61] H. Zhao, R. T. Des Combes, K. Zhang, and G. Gordon (2019) On learning invariant representations for domain adaptation. In Proc. ICML, Cited by: §1.
  • [62] Z. Zhong, L. Zheng, Z. Luo, S. Li, and Y. Yang (2019) Invariance matters: exemplar memory for domain adaptive person re-identification. In Proc. CVPR, Cited by: §2.3.
  • [63] X. J. Zhu (2005) Semi-supervised learning literature survey. Technical report University of Wisconsin-Madison Department of Computer Sciences. Cited by: §2.2, §3.1.
  • [64] Y. Zou, Z. Yu, X. Liu, B. Kumar, and J. Wang (2019) Confidence regularized self-training. In Proc. ICCV, Cited by: §4.1, §4.2, Table 1, Table 2.