Transferrable Contrastive Learning for Visual Domain Adaptation

by Yang Chen, et al.

Self-supervised learning (SSL) has recently become a favored approach among feature learning methodologies. It is therefore appealing for domain adaptation approaches to consider incorporating SSL. The intuition is to enforce instance-level feature consistency such that the predictor becomes invariant across domains. However, most existing SSL methods in the regime of domain adaptation are treated as standalone auxiliary components, leaving the signatures of domain adaptation unattended. In fact, the optimal region where the domain gap vanishes and the instance-level constraint that SSL pursues may not coincide at all. From this point, we present a particular paradigm of self-supervised learning tailored for domain adaptation, i.e., Transferrable Contrastive Learning (TCL), which links SSL and the desired cross-domain transferability congruently. We find contrastive learning to be intrinsically a suitable candidate for domain adaptation, as its instance invariance assumption can be conveniently promoted to the cross-domain class-level invariance favored by domain adaptation tasks. Based on particular memory bank constructions and pseudo label strategies, TCL then penalizes cross-domain intra-class domain discrepancy between source and target through a clean and novel contrastive loss. The free lunch is that, thanks to the incorporation of contrastive learning, TCL relies on a moving-averaged key encoder that naturally achieves a temporally ensembled version of pseudo labels for target data, which avoids pseudo label error propagation at no extra cost. TCL therefore efficiently reduces cross-domain gaps. Through extensive experiments on benchmarks (Office-Home, VisDA-2017, Digits-five, PACS and DomainNet) for both single-source and multi-source domain adaptation tasks, TCL has demonstrated state-of-the-art performance.





1. Introduction

Deep Neural Networks (DNN) (He et al., 2016; Li et al., 2021) have shown powerful feature learning capability when trained on large-scale datasets. Conventional supervised learning using DNN frameworks requires enormous amounts of annotated data obtained via expensive and time-consuming manual labeling. This has constrained the scalability of DNN vision models, especially when annotation is unavailable or limited relative to the large number of parameters in the network. Unsupervised Domain Adaptation (UDA) aims to alleviate this problem by making the most of the available labels and the knowledge learned from a rich-resource domain (i.e., a source domain with labeled instances), such that this information can be recycled and applied to understand the scarce-resource domain (i.e., a target domain without label annotations). Because of the distributional shift between the source domain and the target domain, much effort has been made to explicitly reduce domain shift by aligning source and target distributions. A typical example is (Long et al., 2015), where the domain gap is reduced by minimizing the Maximum Mean Discrepancy (MMD) metric between representations of source and target instances (Figure 1(a)).

Inspired by semi-supervised learning methods (Sajjadi et al., 2016; Tarvainen and Valpola, 2017), Self-Supervised Learning (SSL) has become an alternative direction for UDA approaches (Sun et al., 2019; Carlucci et al., 2019; Xiao et al., 2020; French et al., 2018). The basic idea behind these SSL works is to reformulate the UDA task as a semi-supervised learning problem: to perform self-supervised pretext tasks (e.g., rotation prediction) using unlabeled target and labeled source data (Figure 1(b)), and to perform supervised learning on labeled source data. For instance, (Sun et al., 2019; Xiao et al., 2020) apply 2D rotations to training data, and the encoder has to learn useful semantic features to accomplish the rotation prediction task. Although these seminal SSL-based UDA approaches have achieved state-of-the-art performance in comparison to the conventional UDA frameworks in Figure 1(a), existing SSL paradigms unfortunately tend to ignore the rich information and inherent distributional assumptions hidden behind the domain gaps. Since the domain gap assumption does not necessarily respect the various instance-level hypotheses that conventional SSL desires, there is still plenty of room to improve SSL frameworks when they are associated with UDA tasks.

The above considerations motivate us to tailor brand new SSL feature extraction recipes exclusively for UDA. Accordingly, we present Transferrable Contrastive Learning (TCL), a novel UDA algorithm that unifies and fortifies the strengths of both SSL and UDA congruently. Our launching point is to refurbish conventional contrastive learning, a recently prevailing SSL framework, such that our novel UDA distributional assumptions, useful prior knowledge in the UDA field, and long-existing training tricks such as pseudo labeling can all be snugly plugged in with suitable modifications. To this end, we build domain-specific memories: a source memory and a target memory. The delicate design of these memories is of pivotal importance: it validates TCL and bridges SSL with domain adaptation tasks. We define the source memory to track source domain instances’ features along with their class information. We also maintain a target memory that records target domain features along with their “pseudo-labels”. These particular strategies of slicing memories according to their domain and labels (or pseudo-labels) are essential for contrastive learning to be revived in the context of UDA. The constructed domain-specific memories are meant to provide a basis for jointly modeling heterogeneous pretext tasks (e.g., pseudo-labeling and class-level instance discrimination) while the domain gap is simultaneously reduced.

We summarize our contribution succinctly here: we propose a novel self-supervised learning paradigm called TCL. The TCL is a brand new UDA recipe to achieve both inter-class discriminativeness and cross-domain transferability of features through contrastive feature learning. TCL provides a tailored solution that focuses on the effective integration of contrastive learning and UDA via its associated pseudo labeling strategy and memory slicing policies. The TCL framework contrasts with any existing UDA algorithms that combine instance-level invariance rigidly as merely a standalone “icing on the cake” for UDA problems, and TCL discusses how the conventional contrastive learning could be recast exclusively to serve the UDA goals. We demonstrate that TCL achieves state-of-the-art empirical performances on five benchmarks, in regard to both single and multi-source domain adaptation tasks.

2. Related Work

Unsupervised Domain Adaptation methods can be generally categorized into four groups. The first group is the domain discrepancy based approach (Kang et al., 2019; Long et al., 2015, 2016, 2017; Yan et al., 2017; Yao et al., 2015). Motivated by the theoretical bound proposed in (Ben-David et al., 2010, 2006), this direction aims at minimizing the discrepancy between the source and target domains. The adopted discrepancy loss varies across different statistical distance metrics (e.g., Maximum Mean Discrepancy (MMD) (Long et al., 2015), Jensen-Shannon (JS) Divergence (Ganin et al., 2016) and Wasserstein distance (Shen et al., 2018)). The second group resorts to adversarial learning to align two domains on either the feature level (Cui et al., 2020; Ganin and Lempitsky, 2015; Long et al., 2018; Tzeng et al., 2017; Li et al., 2019a, b) or the pixel level (Bousmalis et al., 2017; Ghifary et al., 2016; Xu et al., 2020; Chen et al., 2019a). The third group is the pseudo label based approach (Chen et al., 2019b; French et al., 2018; Gu et al., 2020; Pan et al., 2019b; Jing et al., 2020; Cai et al., 2019). For example, (Chen et al., 2019b; Pan et al., 2019b) utilize pseudo labels to estimate class centers in the target domain, which are then used to match each class center across the source domain and target domain. The fourth direction is the recently emerging self-supervised learning based approach (Sun et al., 2019; Carlucci et al., 2019; Xiao et al., 2020; French et al., 2018). In Jigsaw (Carlucci et al., 2019), the permutation of image crops from each instance is predicted, so that the network improves its semantic learning capability for the UDA task. In addition to single-source UDA, several works also consider the multi-source domain adaptation task (Hoffman et al., 2018; Peng et al., 2019; Wang et al., 2020; Xu et al., 2018; Yang et al., 2020; Zhao et al., 2018; Pan et al., 2019a), which is a more general and practical scenario where annotated training data from multiple source domains are accessible. For example, MDAN (Zhao et al., 2018) utilizes adversarial adaptation to induce invariant features for all pairs of source-target domains. M³SDA (Peng et al., 2019) matches the statistical moments of all source domains and the target domain to reduce the domain gap.

Figure 2. An overview of TCL. In addition to conventional supervised learning over labeled source query sample, pseudo-labeling is performed to supervise the prediction of target query sample via Cross-Entropy (CE) loss. TCL loss then minimizes the cross-domain discrepancy between query and positive keys belonging to the same class, while the assumed negative keys from different classes are stored in domain-specific memories. TCL loss is essentially a cross-domain class-level objective that naturally unifies both class-level instance discrimination and domain alignment objectives.

Contrastive Learning. Among the recent state-of-the-art self-supervised learning algorithms (Bachman et al., 2019; Caron et al., 2020; Chen et al., 2020; Chen et al., 2021; He et al., 2020; Hjelm et al., 2019; Oord et al., 2018; Yao et al., 2021), contrastive learning (Hadsell et al., 2006; Cai et al., 2020; Lin et al., 2021) has offered impressively superior performance, especially when used for pre-training tasks. The key idea of contrastive learning is to achieve instance-level discrimination and invariance by pushing semantically distinct points apart in the feature space while pulling semantically nearby points closer. Inspired by NCE (Gutmann and Hyvärinen, 2010), CPC (Oord et al., 2018) considers a softmax classifier to distinguish between individual instance classes. As NCE theoretically justifies, a large number of instances helps to alleviate the learning problem. MoCo (He et al., 2020) thus further pushes the limit of contrastive training by constructing a momentum-updated memory bank that stores as many past representations as possible. SimCLR (Chen et al., 2020), SwAV (Caron et al., 2020) and MoCo-v3 (Chen et al., 2021) provide alternative training techniques that efficiently improve contrastive learning.

In this work, we seek to establish alternative novel symbiosis to combine contrastive learning with UDA task in a more congruent way. Our proposed TCL loss is inspired by contrastive learning but further goes beyond the instance level discrimination in typical contrastive learning. In comparison to previous UDA methods that also utilize SSL, TCL aims to leverage and fortify the functional components of both class-level instance discrimination and domain alignment across different domains, and to achieve “the whole is greater than the sum of its parts” advantage out of the two.

3. Approach

In this work, we tailor self-supervised learning (e.g., contrastive learning) for UDA, by coupling SSL pretext tasks and the objective of domain alignment via a single framework in a contrastive fashion. An overview of our Transferrable Contrastive Learning (TCL) architecture is depicted in Figure 2.

3.1. Preliminaries

Contrastive Learning. In the context of contrastive learning (e.g., MoCo (He et al., 2020)), each image $x_i$ is considered as an instance, specified by index $i$. A popular paradigm for contrastive learning is to first produce two randomly generated transformations of $x_i$, a query image $x_i^q$ and a key image $x_i^k$, and then treat these two augmentations as coming from the same distribution, also specified by the instance $i$. The motivation of contrastive learning is to find instance-invariant encoders $f_q$ and $f_k$ so that the features $q_i = f_q(x_i^q)$ and $k_i = f_k(x_i^k)$ are as close as possible. Meanwhile, the constructed $f_q$ and $f_k$ need to keep the features $q_i$ and $k_i$ discriminative against instances generated from other (noise) distributions. In this way, the contrastive loss aims to penalize dissimilarity within each positive pair ($q_i$, $k_i$). Given features $k_j$ ($j \neq i$) generated from other distributions, the encoders $f_q$ and $f_k$ are also expected to encourage discriminativeness within each negative pair ($q_i$, $k_j$). A popular formulation to achieve this goal is to cast contrastive learning into a classification problem, e.g., as in InfoNCE (Oord et al., 2018), where the loss separates positive pairs from negative pairs via a softmax function:

$$\mathcal{L}_{NCE} = -\log \frac{\exp(\langle q_i, k_i \rangle / \tau)}{\exp(\langle q_i, k_i \rangle / \tau) + \sum_{j \neq i} \exp(\langle q_i, k_j \rangle / \tau)}. \tag{1}$$

Here $\tau$ is the temperature parameter, and $\langle q, k \rangle$ denotes the inner product between $q$ and $k$. Different approaches also employ efficient policies for generating negative pairs. For example, in MoCo, the key features $k_j$ are sequentially queued into a “memory bank”. While $f_q$ is updated via backpropagation, $f_k$ is only momentum updated according to the changes of $f_q$. In this way, the contrastive loss is able to read a large number of negative keys from the slowly evolving memory bank, so that the number of negative pairs is no longer constrained by the batch size.
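As an illustration of the softmax-style contrastive loss described above, the following is a minimal sketch in plain Python. The function names and the temperature value of 0.07 are assumptions chosen for the example and are not taken from the paper; features are represented as plain lists of floats.

```python
import math

def dot(u, v):
    """Inner product between two feature vectors."""
    return sum(a * b for a, b in zip(u, v))

def info_nce(query, positive_key, negative_keys, tau=0.07):
    """InfoNCE loss for one query: negative log-softmax of the positive logit.

    tau is an assumed temperature, not a value reported in the paper.
    """
    logits = [dot(query, positive_key) / tau]
    logits += [dot(query, k) / tau for k in negative_keys]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]

q = [1.0, 0.0]
# The positive key matches the query direction: the loss is close to zero.
loss_aligned = info_nce(q, [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
# The positive key is orthogonal while a negative key matches: the loss is large.
loss_misaligned = info_nce(q, [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
print(loss_aligned < loss_misaligned)
```

The loss is small when the query is closer to its positive key than to any negative key, which is the behavior a large memory bank of negatives is meant to sharpen.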

Problem Formulation of UDA. Consider a source domain $\mathcal{D}_s = \{(x_i^s, y_i^s)\}_{i=1}^{N_s}$ having $N_s$ labeled samples distributed across $C$ classes, where $y_i^s \in \{1, \dots, C\}$ is the label of the $i$-th image $x_i^s$. Meanwhile, assume there is also a target domain $\mathcal{D}_t = \{x_i^t\}_{i=1}^{N_t}$ with $N_t$ unlabeled samples ranging over the same $C$ classes as in the source domain, but without any annotations. The goal of UDA is to train a network by exploiting labeled data from the source domain $\mathcal{D}_s$ and unlabeled data from $\mathcal{D}_t$, so that the obtained model is also well adapted to the target domain on prediction tasks, even if there is an evident and non-negligible domain gap between the distributions of $\mathcal{D}_s$ and $\mathcal{D}_t$.

Notations. We first generate two augmentations of each source input $x_i^s$, namely $x_i^{sq}$ and $x_i^{sk}$. Analogously, we also generate two augmentations of each target sample $x_i^t$, denoted as $x_i^{tq}$ and $x_i^{tk}$. Following the standard paradigms in contrastive learning, we implement a query feature extractor $F_q$, followed by an additional query projection $H_q$, where $q_i^s = H_q(F_q(x_i^{sq}))$ is the source query vector of the $i$-th instance. Similarly, we deploy another key feature extractor $F_k$, followed by a key projection $H_k$, where $k_i^s = H_k(F_k(x_i^{sk}))$ is the source key vector of the $i$-th instance. In parallel, we require that target query and target key features are generated via exactly the same encoders, i.e., $q_i^t = H_q(F_q(x_i^{tq}))$ and $k_i^t = H_k(F_k(x_i^{tk}))$. We also define a query classifier $G_q(\cdot)$ and a key classifier $G_k(\cdot)$, where the symbol “$\cdot$” corresponds to the relevant input depending on the context. Accordingly, the group of query/key feature extractor, projection and classifier can be naturally treated as the query/key encoder. During training, the parameters of the query encoder ($\theta_q$) are updated via backpropagation according to the loss, whereas the parameters of the key encoder ($\theta_k$) are only momentum updated. We defer the parameter update details to Section 3.2.

3.2. TCL framework

In short, TCL capitalizes on a particular class-level invariance assumption and a memory bank slicing policy to tackle UDA problems. These hypotheses are expected to leverage the benefits from both domain adaptation techniques and self-supervised feature learning in a coupled manner. The intuition here is that we are still able to construct positive and negative pairs across domains based on true/tentative label predictions over source/target data, so that intra-class similarity and domain alignment are learned in a reliable contrastive way. However, since the conventional contrastive learning assumptions purely aim at instance-level invariance, a rigid combination of the two (i.e., UDA and contrastive learning) does not necessarily reduce the degrees of freedom of the problem when searching for the parameter regions that minimize the domain gaps. We fix this issue by introducing TCL.

Pseudo-labeling in Target. To start with, we define our specific pseudo-labeling method for the unlabeled target domain. The motivation behind this is to promote the instance-level invariance intrinsically favored by contrastive learning to the cross-domain class-level invariance desired by UDA. Let the probability vector $p(x_i^{sq}) \in \mathbb{R}^C$ denote the predicted classification distribution across the $C$ classes, given the source query feature of $x_i^{sq}$, and let $p_c(x_i^{sq})$ be its $c$-th entry. We minimize the standard cross-entropy loss between the prediction and the ground truth label distribution for each source sample:

$$\mathcal{L}_{ce}^{s} = -\sum_{c=1}^{C} \mathbb{1}[y_i^s = c] \, \log p_c(x_i^{sq}), \tag{2}$$

where the class prediction $p(x_i^{sq})$ is obtained via the query classifier $G_q$, and $\theta_{G_q}$ is the classifier parameter. Conventionally, the obtained classifier $G_q$ should be able to classify unseen test data from the same distribution as the source domain $\mathcal{D}_s$. In the context of UDA though, it is necessary that $G_q$ is also exposed to target data from $\mathcal{D}_t$, so that the classifier becomes invariant and robust against domain distributional discrepancy.

According to the discussion in the seminal work (Ben-David et al., 2010), the source domain $\mathcal{D}_s$ and the target domain $\mathcal{D}_t$ must admit a well-performing ideal joint hypothesis for the domain adaptation problem itself to be applicable. If this hypothesis performs poorly, we cannot expect to learn any good target classifier by minimizing source error. Under this principle, the classification error from directly applying the source-trained classifier to target data is rigorously upper bounded.

The ideal joint hypothesis therefore reassures us that, provided the query classifier $G_q$ is accurate on source data, the error from applying $G_q$ directly to target data likely remains acceptable to some extent. Otherwise, the target domain is probably too distinct to adapt to, and any adaptation approach will be irrelevant. Based on this hypothesis, we label each target sample $x_i^t$ tentatively with the prediction of the key classifier $G_k$:

$$\hat{y}_i^t = \arg\max_{c} \, p_c(x_i^{tk}), \tag{3}$$

where $p(x_i^{tk})$ is the class distribution predicted by $G_k$ for the target key view $x_i^{tk}$. During each iteration of training, the classifier $G_q$ is updated via backpropagation, while $G_k$ is only momentum updated.

Note that the unique training mechanism of the key classifier $G_k$ is critical for our TCL approach to succeed. Mathematically speaking, the momentum updated $G_k$ is equivalent to a temporally ensembled version of the query classifier $G_q$. This implicit averaging strategy effectively reduces the pseudo label error propagation caused by the domain gap, and efficiently stabilizes the classification on target data. Accordingly, we assign the predicted value $\hat{y}_i^t = \arg\max_c p_c(x_i^{tk})$ to be the class “pseudo label” of sample $x_i^t$. The standard cross-entropy loss on target data is then computed as the target classification error if and only if the prediction confidence exceeds a threshold $\delta$:

$$\mathcal{L}_{ce}^{t} = -\mathbb{1}\big[\max_{c} p_c(x_i^{tk}) > \delta\big] \, \log p_{\hat{y}_i^t}(x_i^{tq}). \tag{4}$$

Here, the indicator function $\mathbb{1}[\cdot] = 1$ if the condition holds, and $0$ otherwise. Please pay attention to Eq. (4), where the pseudo label $\hat{y}_i^t$ is computed from the target key features, but is used to supervise the update of the target query features via the prediction $p(x_i^{tq})$. This echoes our motivation that the classifier $G_k$ is a temporally ensembled version of $G_q$, whose parameter evolution is more reliable and stable than that of a conventional pseudo labeling strategy.
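The thresholded pseudo-labeling step can be sketched as follows. This is only an illustrative reading of the description above: the function names are hypothetical, the threshold of 0.9 is an assumed value (the paper's setting is not restated here), and the classifier outputs are given directly as logits.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pseudo_label_loss(key_logits, query_logits, threshold):
    """Pseudo label from the key branch supervises the query branch.

    Returns (pseudo_label, loss); the loss is 0.0 when the key-branch
    confidence does not exceed the threshold.
    """
    key_probs = softmax(key_logits)
    confidence = max(key_probs)
    pseudo_label = key_probs.index(confidence)
    if confidence <= threshold:
        return pseudo_label, 0.0
    query_probs = softmax(query_logits)
    return pseudo_label, -math.log(query_probs[pseudo_label])

THRESHOLD = 0.9  # assumed value for illustration only

# Confident key prediction: the query branch is trained on the pseudo label.
label, loss = pseudo_label_loss([4.0, 0.0, 0.0], [1.0, 0.5, 0.0], THRESHOLD)
# Unconfident key prediction: the sample is masked out of the target loss.
masked_label, masked_loss = pseudo_label_loss([0.2, 0.1, 0.0], [1.0, 0.5, 0.0], THRESHOLD)
print(label, round(loss, 3), masked_loss)
```

Taking the label from the key branch and the gradient from the query branch is what lets the slowly moving key classifier act as the more stable teacher.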

Key Encoder and Domain-specific Memories. Our memory bank definition is intentionally tailored to serve the UDA task. Recall that during each forward pass, we have obtained the source key features $k_i^s$ and target key features $k_i^t$. We enqueue each pair $(k_i^s, y_i^s)$ of source key feature and label sequentially into the corresponding domain-specific source memory bank ($M$ banks if there are $M$ source domains). Following (Kang et al., 2019), spherical K-means is also adopted here to cluster the target samples, which helps further refine the pseudo label prediction $\hat{y}_i^t$. We then sequentially enqueue each batch of target key features $k_i^t$ and the corresponding pseudo labels $\hat{y}_i^t$ into the target memory bank. Regarding memory bank updates, TCL shares a similar spirit with (He et al., 2020): as training proceeds, the oldest batch of keys in each target/source memory bank is removed, and the current batch of keys is enqueued. In comparison to conventional contrastive learning, however, the main goal of TCL is to efficiently log each feature together with its label/pseudo label in a pairwise manner, which eases our implementation of the TCL loss as follows.
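A domain-specific memory that stores features together with their labels (or pseudo labels) can be sketched with a bounded queue. The class name `LabeledMemoryBank`, its methods, and the tiny capacity are hypothetical choices for illustration; the sketch evicts the oldest entries one by one, which matches removing the oldest batch whenever the capacity is a multiple of the batch size.

```python
from collections import deque

class LabeledMemoryBank:
    """Fixed-size FIFO queue of (key feature, label) pairs for one domain."""

    def __init__(self, capacity):
        # deque with maxlen silently drops the oldest entry when full
        self.entries = deque(maxlen=capacity)

    def enqueue(self, features, labels):
        """Append one batch of key features with labels or pseudo labels."""
        for feature, label in zip(features, labels):
            self.entries.append((feature, label))

    def split_by_class(self, class_id):
        """Return (positive keys, negative keys) with respect to one class."""
        positives = [f for f, y in self.entries if y == class_id]
        negatives = [f for f, y in self.entries if y != class_id]
        return positives, negatives

target_bank = LabeledMemoryBank(capacity=4)
target_bank.enqueue([[1.0, 0.0], [0.0, 1.0]], [0, 1])
target_bank.enqueue([[0.9, 0.1], [0.1, 0.9]], [0, 1])
target_bank.enqueue([[0.8, 0.2], [0.2, 0.8]], [0, 1])  # evicts the first batch
positives, negatives = target_bank.split_by_class(0)
print(len(target_bank.entries), positives)
```

Keeping the label next to each feature is what makes the later class-wise split into positive and negative keys a simple filter over the queue.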

TCL Loss: from Instance-level to Cross-domain Class-level Invariance. Having clarified all of the notations in the previous sections, we eventually arrive at the transferable contrastive learning loss, i.e., the TCL loss, which is our essential adaptation mechanism. The TCL loss is expected to naturally reconcile the heterogeneous self-supervised learning tasks, i.e., to promote instance-level invariance to the cross-domain class-level invariance that is favored by both contrastive learning (He et al., 2020) and the UDA problem itself.

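Formally, in a training batch, we define all source queries $q_i^s$ that belong to some specific class $c$ as positive queries to class $c$. We also consider all the target samples currently enqueued in the target memory bank with pseudo label $c$ as positive keys $k^{t+}$ of class $c$. All of the remaining keys $k^{t-}$ in the target memory bank are treated as negative keys to this specific class $c$. Mathematically, we yield:

$$\mathcal{L}_{s \to t} = -\sum_{c} \sum_{i:\, y_i^s = c} \sum_{k^{t+}} \log \frac{\exp(\langle q_i^s, k^{t+} \rangle / \tau)}{\exp(\langle q_i^s, k^{t+} \rangle / \tau) + \sum_{k^{t-}} \exp(\langle q_i^s, k^{t-} \rangle / \tau)}, \tag{5}$$

where the class category $c$ ranges over all existing classes in the current batch, $\tau$ is the temperature and $\langle \cdot, \cdot \rangle$ is the inner product. Analogously, we also construct a loss $\mathcal{L}_{t \to s}$ that is symmetric to the definition of $\mathcal{L}_{s \to t}$, using target queries $q_i^t$ with pseudo label $\hat{y}_i^t = c$ and the positive/negative keys $k^{s+}$/$k^{s-}$ of class $c$ in the source memory bank:

$$\mathcal{L}_{t \to s} = -\sum_{c} \sum_{i:\, \hat{y}_i^t = c} \sum_{k^{s+}} \log \frac{\exp(\langle q_i^t, k^{s+} \rangle / \tau)}{\exp(\langle q_i^t, k^{s+} \rangle / \tau) + \sum_{k^{s-}} \exp(\langle q_i^t, k^{s-} \rangle / \tau)}. \tag{6}$$

The TCL loss boils down to:

$$\mathcal{L}_{TCL} = \mathcal{L}_{s \to t} + \mathcal{L}_{t \to s}. \tag{7}$$

It is transparent now that TCL is a cross-domain class-level contrastive loss, an effective extension of the instance-invariance assumption that snugly bridges contrastive learning with UDA. In comparison to conventional instance-level contrastive learning as in Eq. (1), the motivation behind Eq. (7) is quite straightforward: the contrastive loss penalizes the incompatibility of each cross-domain positive pair of query and key that most likely fall into the same category, given the predicted pseudo labels on target samples. Eq. (7) also effectively pushes apart cross-domain negative pairs if these samples are believed to come from distinct class categories. Accordingly, such class-level contrastive learning naturally reduces the intra-class feature variance between source and target, while inter-class feature discrimination is further improved.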
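To make the cross-domain class-level construction concrete, the sketch below gives one plausible reading of the loss described above. It is not the authors' implementation: the helper names, the temperature of 0.1, the per-pair averaging, and the representation of a memory bank as a list of (key, label) tuples are all assumptions made for the example.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def directional_loss(queries, query_labels, bank, tau):
    """Average contrastive loss over all (query, same-class key) pairs.

    bank is a list of (key feature, label or pseudo label) tuples taken
    from the memory of the *other* domain.
    """
    total, pairs = 0.0, 0
    for q, c in zip(queries, query_labels):
        positives = [k for k, y in bank if y == c]
        negatives = [k for k, y in bank if y != c]
        neg_sum = sum(math.exp(dot(q, k) / tau) for k in negatives)
        for k_pos in positives:
            pos = math.exp(dot(q, k_pos) / tau)
            total += -math.log(pos / (pos + neg_sum))
            pairs += 1
    return total / pairs if pairs else 0.0

def tcl_loss(src_q, src_y, tgt_q, tgt_pseudo, src_bank, tgt_bank, tau=0.1):
    """Source-to-target term plus the symmetric target-to-source term."""
    return (directional_loss(src_q, src_y, tgt_bank, tau)
            + directional_loss(tgt_q, tgt_pseudo, src_bank, tau))

queries, labels = [[1.0, 0.0], [0.0, 1.0]], [0, 1]
aligned_bank = [([1.0, 0.0], 0), ([0.0, 1.0], 1)]
swapped_bank = [([1.0, 0.0], 1), ([0.0, 1.0], 0)]  # classes mismatched across domains

loss_aligned = tcl_loss(queries, labels, queries, labels, aligned_bank, aligned_bank)
loss_swapped = tcl_loss(queries, labels, queries, labels, swapped_bank, swapped_bank)
print(loss_aligned < loss_swapped)
```

The loss is near zero when same-class features from the two domains already coincide and grows when a class in one domain sits on top of a different class in the other, which is the behavior the class-level objective is meant to penalize.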

Extension to Multi-source Domain Adaptation. An immediate extension of our proposal is to jointly apply Eq. (7) across multiple domains. Consider the scenario where we have a dataset composed of $M$ distinct labeled source domains $\mathcal{D}_{s_1}, \mathcal{D}_{s_2}, \dots, \mathcal{D}_{s_m}, \dots, \mathcal{D}_{s_M}$ distributed across $C$ classes. Our goal becomes to best exploit all of these diverse sources and the associated annotations simultaneously so that we can improve the prediction on unlabeled target data $\mathcal{D}_t$. One extra bonus of the proposed TCL loss is that Eq. (7) can be conveniently plugged into this multi-source scenario with minimal modifications:

$$\mathcal{L}_{TCL}^{M} = \sum_{m=1}^{M} \left( \mathcal{L}_{s_m \to t} + \mathcal{L}_{t \to s_m} \right) + \sum_{m=1}^{M} \sum_{m' \neq m} \mathcal{L}_{s_m \to s_{m'}}. \tag{8}$$

Here, we introduce an extra term $\mathcal{L}_{s_m \to s_{m'}}$ that takes into account the cross source-source domain feature correlations. The loss $\mathcal{L}_{s_m \to s_{m'}}$ basically retains the same form as $\mathcal{L}_{s \to t}$ in Eq. (5), and only differs in that the positive/negative pair construction for $\mathcal{L}_{s_m \to s_{m'}}$ is completely based on the ground truth annotations of cross source-source domain data, instead of the pseudo labels used in $\mathcal{L}_{s \to t}$. Correspondingly, we leverage the cross-domain class-level correlation for each specific class by capitalizing on all of the annotations available from multiple sources. This potentially imposes stronger cross-domain feature invariance for each class, so that the pseudo label prediction on $\mathcal{D}_t$ becomes more reliable.

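Overall Objective. To summarize, for multi-source adaptation, we minimize the overall loss for each training batch:

$$\mathcal{L} = \mathcal{L}_{ce}^{s} + \mathcal{L}_{ce}^{t} + \lambda \, \mathcal{L}_{TCL}^{M}. \tag{9}$$

The hyperparameter $\lambda$ trades off the impact of $\mathcal{L}_{TCL}^{M}$ against the classification losses $\mathcal{L}_{ce}^{s}$ and $\mathcal{L}_{ce}^{t}$. Note that the loss $\mathcal{L}_{TCL}^{M}$ reduces to the plain form of $\mathcal{L}_{TCL}$ if and only if there is only a single labeled source domain, i.e., when $M = 1$. This leads to our overall loss for the single-source adaptation problem:

$$\mathcal{L} = \mathcal{L}_{ce}^{s} + \mathcal{L}_{ce}^{t} + \lambda \, \mathcal{L}_{TCL}. \tag{10}$$

During training, the parameters in the query and key encoders are updated in order to optimize Eq. (9) or Eq. (10), depending on the actual task. Specifically, the parameters $\theta_q$ of the query encoder are updated via traditional backpropagation, whereas the parameters $\theta_k$ of the key encoder are momentum updated as:

$$\theta_k^{(n)} = \alpha \, \theta_k^{(n-1)} + (1 - \alpha) \, \theta_q^{(n)}, \tag{11}$$

where $n$ indicates the training iteration and $\alpha$ is a momentum coefficient.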
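The momentum update of the key encoder is an exponential moving average of the query encoder parameters, which is why the key branch behaves like a temporal ensemble. A minimal sketch follows, with parameters flattened into a list of floats; the function name and the coefficient of 0.9 are assumptions for illustration and not the paper's setting.

```python
def momentum_update(key_params, query_params, alpha):
    """Move each key parameter a small step towards the query parameter."""
    return [alpha * k + (1.0 - alpha) * q for k, q in zip(key_params, query_params)]

ALPHA = 0.9  # assumed momentum coefficient for illustration only

key_params = [0.0]
history = []
for step in range(50):
    query_params = [1.0]  # stand-in for the parameters after a gradient step
    key_params = momentum_update(key_params, query_params, ALPHA)
    history.append(key_params[0])

# After n steps with a constant query value, the key value equals 1 - ALPHA**n.
print(history[0], history[-1])
```

Unrolling the recursion shows that the key parameters are a weighted average of all past query parameters, with weights that decay geometrically with age, so a single noisy query update has only a limited effect on the pseudo labels.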

Method Ar→Cl Ar→Pr Ar→Rw Cl→Ar Cl→Pr Cl→Rw Pr→Ar Pr→Cl Pr→Rw Rw→Ar Rw→Cl Rw→Pr Avg
Source Only 34.9 50.0 58.0 37.4 41.9 46.2 38.5 31.2 60.4 53.9 41.2 59.9 46.1
DAN (Long et al., 2015) 43.6 57.0 67.9 45.8 56.5 60.4 44.0 43.6 67.7 63.1 51.5 74.3 56.3
DANN (Ganin et al., 2016) 45.6 59.3 70.1 47.0 58.5 60.9 46.1 43.7 68.5 63.2 51.8 76.8 57.6
JAN (Long et al., 2017) 45.9 61.2 68.9 50.4 59.7 61.0 45.8 43.4 70.3 63.9 52.4 76.8 58.3
SE (French et al., 2018) 48.8 61.8 72.8 54.1 63.2 65.1 50.6 49.2 72.3 66.1 55.9 78.7 61.5
DWT-MEC (Roy et al., 2019) 50.3 72.1 77.0 59.6 69.3 70.2 58.3 48.1 77.3 69.3 53.6 82.0 65.6
TAT (Liu et al., 2019) 51.6 69.5 75.4 59.4 69.5 68.6 59.5 50.5 76.8 70.9 56.6 81.6 65.8
SAFN (Xu et al., 2019) 52.0 71.7 76.3 64.2 69.9 71.9 63.7 51.4 77.1 70.9 57.1 81.5 67.3
TADA (Wang et al., 2019) 53.1 72.3 77.2 59.1 71.2 72.1 59.7 53.1 78.4 72.4 60.0 82.9 67.6
SymNets (Zhang et al., 2019b) 47.7 72.9 78.5 64.2 71.3 74.2 64.2 48.8 79.5 74.5 52.6 82.7 67.6
MDD (Zhang et al., 2019a) 54.9 73.7 77.8 60.0 71.4 71.8 61.2 53.6 78.1 72.5 60.2 82.3 68.1
SSDA (Xiao et al., 2020) 51.7 69.0 75.4 60.4 70.3 70.7 57.7 53.3 78.6 72.2 59.9 81.7 66.7
SRDC (Tang et al., 2020) 52.3 76.3 81.0 69.5 76.2 78.0 68.7 53.8 81.7 76.3 57.1 85.0 71.3
GVB-GD (Cui et al., 2020) 57.0 74.7 79.8 64.6 74.1 74.6 65.2 55.1 81.0 74.6 59.7 84.3 70.4
RSDA (Gu et al., 2020) 53.2 77.7 81.3 66.4 74.0 76.5 67.9 53.0 82.0 75.8 57.8 85.4 70.9
TCL (ResNet-50) 59.4 78.8 81.6 69.9 76.9 78.9 69.2 58.7 82.4 76.9 62.7 85.6 73.4
Table 1. Performance comparison (classification accuracy, %) with the state of the art on the Office-Home dataset.

Method plane bicycle bus car horse knife mcycl person plant sktbrd train truck Avg
Source Only 72.3 6.1 63.4 91.7 52.7 7.9 80.1 5.6 90.1 18.5 78.1 25.9 49.4
RevGrad (Ganin and Lempitsky, 2015) 81.9 77.7 82.8 44.3 81.2 29.5 65.1 28.6 51.9 54.6 82.8 7.8 57.4
DAN (Long et al., 2015) 68.1 15.4 76.5 87.0 71.1 48.9 82.3 51.5 88.7 33.2 88.9 42.2 62.8
JAN (Long et al., 2017) 75.7 18.7 82.3 86.3 70.2 56.9 80.5 53.8 92.5 32.2 84.5 54.5 65.7
MCD (Saito et al., 2018b) 87.0 60.9 83.7 64.0 88.9 79.6 84.7 76.9 88.6 40.3 83.0 25.8 71.9
ADR (Saito et al., 2018a) 87.8 79.5 83.7 65.3 92.3 61.8 88.9 73.2 87.8 60.0 85.5 32.3 74.8
TPN (Pan et al., 2019b) 93.7 85.1 69.2 81.6 93.5 61.9 89.3 81.4 93.5 81.6 84.5 49.9 80.4
SE (French et al., 2018) 95.9 87.4 85.2 58.6 96.2 95.7 90.6 80.0 94.8 90.8 88.4 47.9 84.3
SE-CC (Pan et al., 2020) 96.3 86.5 82.4 81.3 96.1 97.2 91.2 84.7 94.4 94.1 88.3 53.4 87.2
CAN (Kang et al., 2019) 97.0 87.2 82.5 74.3 97.8 96.2 90.8 80.7 96.6 96.3 87.5 59.9 87.2
CAN (Kang et al., 2019) + ALP (Zhang et al., 2020) 97.5 86.9 83.1 74.2 98.0 97.4 90.5 80.9 96.9 96.5 89.0 60.1 87.6
TCL (ResNet-101) 97.3 91.5 85.9 73.9 96.6 97.1 93.6 85.1 97.0 96.1 89.9 70.9 89.6
Table 2. Performance comparison (classification accuracy, %) with the state of the art on the VisDA-2017 dataset.

4. Experiments

Datasets. We evaluate the domain adaptation approaches on the following five datasets.

A) Office-Home (Venkateswara et al., 2017) is often considered a challenging dataset for visual domain adaptation tasks. It is composed of four distinct domains that share a common set of classes: Artistic (Ar), Clip Art (Cl), Product (Pr) and Real-World (Rw). Following (French et al., 2018; Ganin et al., 2016; Long et al., 2017), we consider single-source adaptation on this data. The adaptation directions range over all ordered pairs of distinct domains, i.e., $4 \times 3 = 12$ transfer directions. For example, direction Ar→Cl means that Ar is the labeled source domain and Cl is the unlabeled target domain.

B) VisDA-2017 (Peng et al., 2017) is a large-scale dataset containing synthetic images (Syn) of 12 classes in the source training set and real images (Real) from MS-COCO as the target domain validation set. We only consider the single adaptation direction Syn→Real, following the state-of-the-art approaches.

C) Digits-five (Xu et al., 2018) is a popular benchmark for multi-source domain adaptation (MSDA), which consists of ten classes of digit images sampled from five different datasets: MNIST (mt) (LeCun et al., 2001), MNIST-M (mm) (Ganin and Lempitsky, 2015), SVHN (sv) (Netzer et al., 2011), USPS (up) (Hull, 1994) and Synthetic Digits (sy) (Ganin and Lempitsky, 2015). These five datasets essentially represent five distinct domains. Following (Peng et al., 2019; Xu et al., 2018), we construct the training/test sets for MNIST, MNIST-M, SVHN and Synthetic Digits by randomly sampling a fixed number of training and test images from each of the corresponding domains. For USPS, we use the complete training and test sets. The adaptation directions on Digits-five are defined as follows. Each individual domain takes its turn as the unlabeled target domain, while the remaining four domains are regarded as multiple labeled source domains. For instance, →sv means that dataset sv is the target, while all of the remaining domains are considered as the labeled source domains. Therefore, there are five adaptation directions in total on Digits-five.

D) PACS (Li et al., 2017) is another popular benchmark for MSDA, which is composed of four domains (Art, Cartoon, Photo and Sketch) that share the same set of categories.

E) DomainNet (Peng et al., 2019) is a more challenging benchmark frequently used for evaluations on MSDA. The DomainNet dataset contains samples from six domains: Clipart (clp), Infograph (inf), Painting (pnt), Quickdraw (qdr), Sketch (skt), and Real (rel), all of which share the same set of classes. We use one domain as the target and the remaining domains as sources, following the setting of (Peng et al., 2019; Yang et al., 2020).
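The adaptation directions described above can be enumerated mechanically. The short sketch below uses only the domain abbreviations given in the text; the variable names and the "->" notation are choices made for this example.

```python
from itertools import permutations

# Single-source setting: every ordered (source, target) pair of Office-Home domains.
office_home = ["Ar", "Cl", "Pr", "Rw"]
single_source = [f"{s}->{t}" for s, t in permutations(office_home, 2)]

# Multi-source setting: each Digits-five domain in turn is the unlabeled target,
# and all remaining domains serve as labeled sources.
digits_five = ["mt", "mm", "sv", "up", "sy"]
multi_source = {t: [d for d in digits_five if d != t] for t in digits_five}

print(len(single_source), single_source[:3])
print(multi_source["sv"])
```

This gives twelve single-source directions for four domains and one leave-one-domain-out direction per target in the multi-source benchmarks.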

Implementation Details. For fair comparison with SOTA approaches across different datasets, we utilize the commonly adopted ImageNet pre-trained ResNet-50/101/18/101 as the feature extractor (i.e., the query and key feature extractors) for Office-Home, VisDA-2017, PACS and DomainNet, respectively. For Digits-five, the feature extractor is composed of three conv layers and two fc layers, the same architecture as used in (Peng et al., 2019). We adopt a task-specific fc layer to parameterize the classifier. The projection is implemented by an fc layer. We finetune the pre-trained layers and train the newly added layers, where the learning rate of the latter is a fixed multiple of that of the former. We use mini-batch SGD with momentum to train the network for all experiments. We follow the same learning rate schedule as in (Ganin and Lempitsky, 2015; Long et al., 2015, 2017) and (Peng et al., 2019) for single-source and multi-source UDA, respectively. Inspired by (Sohn et al., 2020), we combine a standard flip-and-shift augmentation strategy with RandAugment (Cubuk et al., 2020) when generating the augmented views. The temperature parameter in Eq. (5) and Eq. (6) and the momentum coefficient in Eq. (11) are kept fixed throughout training.
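For readers unfamiliar with the single-source schedule cited above, the annealing rule of Ganin and Lempitsky (2015) divides the base rate by $(1 + a p)^{b}$ with $a = 10$ and $b = 0.75$, where $p$ is the training progress running from 0 to 1. The sketch below illustrates it; the base rate of 0.001 and the multiplier of 10 for newly added layers are hypothetical values chosen for the example, not the settings used in this work.

```python
def annealed_lr(base_lr, progress, a=10.0, b=0.75):
    """Learning rate after a fraction `progress` in [0, 1] of training."""
    return base_lr / (1.0 + a * progress) ** b

base_lr = 0.001              # hypothetical rate for the pre-trained layers
new_layer_multiplier = 10.0  # hypothetical ratio for the newly added layers
total_steps = 5

schedule = [annealed_lr(base_lr, step / total_steps) for step in range(total_steps + 1)]
new_layer_schedule = [new_layer_multiplier * lr for lr in schedule]
print(schedule[0], schedule[-1])
```

Both parameter groups follow the same decay curve and differ only by the constant multiplier, so the ratio between their learning rates stays fixed during training.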

| Method | mm | mt | up | sv | sy | Avg (Digits-five) | clp | inf | pnt | qdr | rel | skt | Avg (DomainNet) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Source Only | 63.4±0.7 | 90.5±0.8 | 88.7±0.9 | 63.5±0.9 | 82.4±0.6 | 77.7 | 47.6±0.5 | 13.0±0.4 | 38.1±0.5 | 13.3±0.4 | 51.9±0.9 | 33.7±0.5 | 32.9 |
| MDAN (Zhao et al., 2018) | 69.5±0.3 | 98.0±0.9 | 92.4±0.7 | 69.2±0.6 | 87.4±0.5 | 83.3 | 52.4±0.6 | 21.3±0.8 | 46.9±0.4 | 8.6±0.6 | 54.9±0.6 | 46.5±0.7 | 38.4 |
| DCTN (Xu et al., 2018) | 70.5±1.2 | 96.2±0.8 | 92.8±0.3 | 77.6±0.4 | 86.8±0.8 | 84.8 | 48.6±0.7 | 23.5±0.6 | 48.8±0.6 | 7.2±0.5 | 53.5±0.6 | 47.3±0.5 | 38.2 |
| MSDA (Peng et al., 2019) | 72.8±1.1 | 98.4±0.7 | 96.1±0.8 | 81.3±0.9 | 89.6±0.6 | 87.7 | 58.6±0.5 | 26.0±0.9 | 52.3±0.6 | 6.3±0.6 | 62.7±0.5 | 49.5±0.8 | 42.6 |
| MDDA (Zhao et al., 2020) | 78.6±0.6 | 98.8±0.4 | 93.9±0.5 | 79.3±0.8 | 89.7±0.7 | 88.1 | 59.4±0.6 | 23.8±0.8 | 53.2±0.6 | 12.5±0.6 | 61.8±0.5 | 48.6±0.8 | 43.2 |
| CMSS (Yang et al., 2020) | 75.3±0.6 | 99.0±0.1 | 97.7±0.1 | 88.4±0.5 | 93.7±0.2 | 90.8 | 64.2±0.2 | 28.0±0.2 | 53.6±0.4 | 16.0±0.1 | 63.4±0.2 | 53.8±0.4 | 46.5 |
| LtC-MSDA (Wang et al., 2020) | 85.6±0.8 | 99.0±0.4 | 98.3±0.4 | 83.2±0.6 | 93.0±0.5 | 91.8 | 63.1±0.5 | 28.7±0.7 | 56.1±0.5 | 16.3±0.5 | 66.1±0.6 | 53.8±0.6 | 47.4 |
| TCL | 96.6±0.4 | 99.4±0.2 | 99.3±0.2 | 91.3±0.5 | 97.8±0.3 | 96.9 | 70.9±0.3 | 30.2±0.6 | 61.8±0.4 | 16.5±0.6 | 72.2±0.3 | 59.6±0.4 | 51.9 |

Table 3. Classification accuracy (mean ± std, in %) on the Digits-five (columns mm to sy) and DomainNet (columns clp to skt) datasets.

4.1. Performance Comparison

Single-Source UDA on Office-Home. Table 1 reports the classification accuracy on the twelve transfer directions of the Office-Home dataset. Overall, TCL exhibits clear advantages over the existing state-of-the-art methods across all adaptation directions. Note that TCL is especially effective on the harder transfer tasks, e.g., Pr→Cl and Ar→Cl, where the two domains are substantially different. This supports the view that integrating SSL pretext tasks with the domain alignment objective in a contrastive manner is a legitimate solution for domain adaptation. By merely aligning the data distributions between source and target domains, DAN and JAN perform better than the source only baseline. In comparison, TCL further stresses cross-domain intra-class invariance while effectively preserving inter-class discrimination, and is therefore more robust across adaptation directions. RSDA also benefits from pseudo labeling. Nevertheless, RSDA lacks an effective mechanism for preserving discriminativeness across classes and thus performs worse than TCL. We also highlight SSDA, which uses instance-level rotation prediction as unsupervised learning in addition to the supervised task. SSDA performs drastically worse than TCL, supporting our assumption that instance-level self-supervised learning does not necessarily coincide with the optimal parameter region of the supervised task objective.

Single-Source UDA on VisDA-2017. Table 2 displays the performance of various models on VisDA-2017. Similar to the observations on Office-Home, TCL again clearly outperforms its competitors. In particular, TCL offers a significant performance boost on the classes “bicycle”, “person” and “truck” in comparison to the existing state-of-the-art methods. On closer inspection, the source only model performs extremely poorly on these 3 classes. This indicates a significant distribution gap between source and target domains on these classes, and highlights the effectiveness of TCL. Moreover, TCL achieves better performance than CAN, which also exploits pseudo-labeling for domain adaptation. The reason might be that CAN updates pseudo labels based on iterative learning during each epoch. Instead, TCL stabilizes pseudo-labeling via a temporally ensembled encoder, and it enforces intra-class invariance among cross-batch instances. Although the source only baseline seemingly offers the best adaptation result on the “car” class, we conjecture that the model has coincidentally overfitted to the car class in the presence of domain gaps, at the expense of classes such as bicycle and person. From this point of view, the average score is arguably a more balanced indicator of the effectiveness of each algorithm.
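The temporally ensembled encoder mentioned above can be illustrated with a small sketch. It assumes the momentum update takes the MoCo-style form (He et al., 2020), i.e., key = m * key + (1 - m) * query, and that a pseudo label is the arg-max of the key branch's class probabilities. The momentum value in the usage lines is a placeholder, parameters are plain Python lists rather than network weights, and the function names are hypothetical.

```python
def ema_update(key_params, query_params, m):
    """Return the moving average m * key + (1 - m) * query, element-wise."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


def pseudo_label(probs):
    """Index of the most probable class."""
    return max(range(len(probs)), key=probs.__getitem__)


# The query parameters oscillate from step to step, while the key
# parameters change slowly, which is what stabilizes the pseudo labels.
key = [0.0, 0.0]
for step in range(4):
    query = [1.0, -1.0] if step % 2 == 0 else [-1.0, 1.0]
    key = ema_update(key, query, m=0.9)  # m=0.9 is a placeholder value
    print(step, key)

print(pseudo_label([0.1, 0.7, 0.2]))
```

Because the key parameters are a running average over many past query states, a single noisy update of the query encoder has only a small effect on the labels assigned to target samples.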

| Method | Art | Cartoon | Photo | Sketch | Avg |
|---|---|---|---|---|---|
| Source Only | 74.9±0.8 | 72.1±0.7 | 94.5±0.5 | 64.7±1.5 | 76.6 |
| DANN (Ganin and Lempitsky, 2015) | 81.9±1.1 | 77.5±1.2 | 91.8±1.2 | 74.6±1.0 | 81.5 |
| MDAN (Zhao et al., 2018) | 79.1±0.3 | 76.0±0.7 | 91.4±0.8 | 72.0±0.8 | 79.6 |
| WBN (Mancini et al., 2018) | 89.9±0.2 | 89.7±0.5 | 97.4±0.8 | 58.0±1.5 | 83.8 |
| MCD (Saito et al., 2018b) | 88.7±1.0 | 88.9±1.5 | 96.4±0.4 | 73.9±3.9 | 87.0 |
| MSDA (Peng et al., 2019) | 89.3±0.4 | 89.9±1.0 | 97.3±0.3 | 76.7±2.8 | 88.3 |
| JiGen (Carlucci et al., 2019) | 84.9 | 81.1 | 98.0 | 79.1 | 85.7 |
| CMSS (Yang et al., 2020) | 88.6±0.3 | 90.4±0.8 | 96.9±0.2 | 82.0±0.5 | 89.5 |
| TCL | 93.6±0.2 | 91.7±0.3 | 98.5±0.1 | 87.5±0.8 | 92.8 |

Table 4. Classification accuracy (mean ± std, in %) on the PACS dataset based on ResNet-18.

Multi-Source UDA on Digits-five, PACS and DomainNet. Here we compare with several methods on multi-source domain adaptation task: DANN (Ganin and Lempitsky, 2015), WBN (Mancini et al., 2018), MCD (Saito et al., 2018b), MDAN (Zhao et al., 2018), DCTN (Xu et al., 2018), MSDA (Peng et al., 2019), MDDA (Zhao et al., 2020), JiGen (Carlucci et al., 2019), CMSS (Yang et al., 2020), and LtC-MSDA (Wang et al., 2020). We directly extract the results of the above methods from their publications or from the reports in (Wang et al., 2020; Yang et al., 2020).

Table 3 reports the performance comparisons on the five multi-source adaptation directions of Digits-five. Notably, TCL achieves an averaged accuracy of 96.9% across the 5 directions, surpassing the other baselines by a large margin (around 5.1 points over LtC-MSDA, the best existing approach at 91.8%). In addition, TCL consistently demonstrates distinct advantages in each individual transfer direction, especially on the most challenging directions mm (96.6%) and sv (91.3%). This empirically justifies the advantage of the loss defined in Eq. (8), where the joint contribution of multiple source domains has been effectively incorporated into the framework of contrastive learning.
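As a quick sanity check on the Digits-five averages, the per-direction accuracies listed in Table 3 for TCL and LtC-MSDA can be averaged directly. The snippet below only reproduces arithmetic on the reported numbers; the helper name `mean_accuracy` is illustrative.

```python
# Per-direction accuracies (%) copied from Table 3, Digits-five columns.
tcl_digits = {"mm": 96.6, "mt": 99.4, "up": 99.3, "sv": 91.3, "sy": 97.8}
ltc_digits = {"mm": 85.6, "mt": 99.0, "up": 98.3, "sv": 83.2, "sy": 93.0}


def mean_accuracy(per_direction):
    """Unweighted mean over adaptation directions."""
    return sum(per_direction.values()) / len(per_direction)


tcl_avg = round(mean_accuracy(tcl_digits), 1)
ltc_avg = round(mean_accuracy(ltc_digits), 1)
margin = round(tcl_avg - ltc_avg, 1)

print("TCL avg:", tcl_avg)
print("LtC-MSDA avg:", ltc_avg)
print("margin:", margin)
```

The rounded means agree with the Avg column of Table 3, which confirms that the reported average is an unweighted mean over the five directions.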

As shown in Table 4, TCL achieves a state-of-the-art average accuracy of 92.8% on PACS. On the most challenging Sketch task, we obtain 87.5%, clearly outperforming the other baselines. Note that JiGen is a typical UDA algorithm that utilizes instance-level self-supervised learning to improve feature learning. The superiority of TCL over JiGen again shows the benefit of promoting the instance-level assumption to a hypothesis that is useful for UDA.

Dataset w/o. w/o. TCL
Office-Home 72.0 71.4 73.4
VisDA-2017 88.7 85.5 89.6
Table 5. The effect of pseudo-labeling loss on target () and transferrable contrastive loss () in our TCL. The mean accuracy over 12 tasks on Office-Home and the mean accuracy over classes on VisDA-2017 validation set are reported.

Table 3 also displays the adaptation performance of various algorithms on DomainNet. TCL again outperforms the existing works on most of the adaptation tasks, and achieves the SOTA average accuracy of 51.9%. DomainNet is particularly challenging for two reasons. First, the domain gap in each adaptation direction is significant. Second, the relatively large number of categories makes learning discriminative features much more challenging. Nevertheless, TCL alleviates these issues owing to its mechanism for cross-domain class alignment, and correspondingly demonstrates a strong ability to find transferable features on this challenging dataset.

4.2. Experimental Analysis

Ablation Study. In this section, we investigate the influence of each component in the overall objective defined in Eq. (10). Table 5 examines the impact of two key components of TCL: the pseudo-labeling loss on target and class-specific TCL loss . By removing from , the averaged accuracy of the TCL framework on the Office-Home and VisDA-2017 datasets drops by and by , respectively. This validates the critical role of in the success of TCL, and verifies the importance of cross-domain intra-class alignment. Even if we remove , TCL still performs better than the comparable RSDA and SE (refer to Table 1 and Table 2), two algorithms that rely only on pseudo labels without a contrastive loss. This further supports the advantage of our momentum pseudo-labeling strategy in Eq. (4). In addition, if we remove the loss from Eq. (10), the performance of TCL also degrades by and . This further supports our assumption that reliable pseudo-label information and class information are critical for class-level contrastive learning to succeed when embedded in UDA tasks. Without effective learning from the target data, the contrastive loss would be misled by wrong positive pairs, and performance would worsen.

Sensitivity of Trade-off Parameter . We examine the impact of in Eq. (10), a hyperparameter that trades off between the classification loss (, ) and TCL loss . Figure 3 shows that the TCL model is relatively robust against changes of , although we observe an evident drop when . This is probably attributable to the weaker source domain supervision under a large , which makes the pseudo labels less reliable. The pseudo labels therefore become noisier and tend to offer wrong class information to the and , which further propagates the error. In contrast, if is too small (e.g., ), the influence of is diluted and TCL loses its capability of encouraging cross-domain alignment and inter-class discriminativeness. We observe a similar curve pattern on the multi-source adaptation task, and omit the result owing to limited space.

Figure 3. Effect of hyperparameter on VisDA-2017.
| Dataset | IDL | ICDL | TCL |
|---|---|---|---|
| Office-Home | 70.6 | 72.3 | 73.4 |
| VisDA-2017 | 85.1 | 86.7 | 89.6 |

Table 6. Study on different variants of contrastive learning losses.

Evaluations of Variants of Contrastive Learning Losses. Recall that our proposed TCL loss is a cross-domain class-level contrastive learning objective. To verify the advantage of the TCL framework over the conventional instance invariance assumption in the context of UDA, we include two contrastive learning loss variants for comparison. (1) Instance-level Discrimination Loss (IDL): We replace the class-level invariance loss in Eq. (7) directly by a conventional instance-level contrastive loss on unlabeled target data, i.e., the IDL framework consists of a conventional instance-level InfoNCE (Oord et al., 2018) in addition to and in Eq. (10). (2) Intra-domain Class-level Discrimination Loss (ICDL): An alternative way to perform class-specific contrastive learning is to replace the TCL loss by a class-level contrastive loss penalized separately for the source domain and the target domain. We call this framework ICDL. ICDL only considers intra-source or intra-target class-level discrimination, without any cross-domain intra-class constraints. Detailed loss functions of these variants are given in the supplementary material.

As Table 6 shows, if we replace the TCL loss by the IDL loss, the averaged accuracy on Office-Home and VisDA-2017 drops by 2.8% and 4.5%, respectively, in comparison to the TCL framework. The drop corroborates our hypothesis that a rigid combination of instance-level invariance with the UDA task is sub-optimal, as the two assumptions do not necessarily coincide. Note that the IDL loss only emphasizes instance-level discrimination, and thus introduces noise into the memory bank, since the negative keys can come from the same class as the query instance. The IDL loss thus inevitably deviates from the motivation of UDA, which aims to erase the inter-domain gap.
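The difference between instance-level and class-level positives can be made concrete with a toy example. The sketch below is not the paper's Eq. (7); it assumes one common multi-positive generalization of InfoNCE, in which the numerator sums over all keys marked as positive. The feature vectors, the temperature value and the function names are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def contrastive_loss(query, keys, positive_mask, temperature):
    """-log( sum over positives exp(q.k/t) / sum over all keys exp(q.k/t) )."""
    logits = [dot(query, k) / temperature for k in keys]
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - max_logit) for l in logits]
    pos = sum(e for e, is_pos in zip(exps, positive_mask) if is_pos)
    return -math.log(pos / sum(exps))


query = [1.0, 0.0]
keys = [
    [0.9, 0.436],  # augmented view of the same instance
    [0.8, 0.6],    # different instance, same class (e.g., other domain)
    [0.0, 1.0],    # different class
]

# Instance-level: only the augmented view counts as a positive, so the
# same-class key is pushed away as a negative.
instance_loss = contrastive_loss(query, keys, [True, False, False], 0.1)
# Class-level: every key sharing the (pseudo) label counts as a positive.
class_loss = contrastive_loss(query, keys, [True, True, False], 0.1)

print(instance_loss, class_loss)
```

In the instance-level case the same-class key contributes to the denominator only, so the loss penalizes exactly the similarity that class-level alignment is meant to encourage.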

Table 6 also shows the performance obtained when the TCL loss is replaced by the ICDL objective. As expected, this modification does not improve over the original TCL either. The main reason is that the ICDL objective does not impose any inter-domain invariance within each class, and therefore cannot erase the domain gap as effectively as TCL does. Note that our TCL loss goes beyond the intra-domain class-level instance discrimination performed by ICDL, and effectively unifies class-level instance discrimination with alignment across different domains.

Extending TCL loss to MSDA. As presented in Eq. (8), we extend the TCL framework to deal with MSDA tasks. Eq. (8) takes into account the intra-class feature correlations across source domains. An alternative is to simply combine the multiple source domains into a single source domain and directly apply the single-source TCL loss in Eq. (7). We call this variant “TCL-SourceCombine”. As shown in Figure 4, “TCL-SourceCombine” performs worse than TCL, which indicates that exploring class-level domain alignment across all domains, including among the source domains, offers a further advantage.

Figure 4. Comparison with different ways of extending TCL loss to MSDA on PACS.

5. Conclusions

In this work, we present Transferrable Contrastive Learning (TCL), which explores domain adaptation in a self-supervised manner. Particularly, we aim to establish a brand new symbiosis that unifies and fortifies self-supervised pretext tasks (pseudo-labeling and class-level instance discrimination) and domain discrepancy minimization in a single framework. By promoting instance level invariance to inter-domain class-level invariance, TCL elegantly couples the contrastive learning strategy with the goal of UDA. Our proposed TCL loss is devised to integrate both class-level instance discrimination and domain discrepancy minimization via contrastive loss, which best screens the semantic compatibility between query-key pairs across different domains. Our TCL can yield state-of-the-art results on five benchmarks for both single-source and multi-source domain adaptation tasks, especially when compared with the conventional instance level contrastive learning baselines.

Acknowledgments. This work was supported in part by NSFC No. 61872329 and the Fundamental Research Funds for the Central Universities under contract WK3490000005.


  • Bachman et al. (2019) Philip Bachman, R Devon Hjelm, and William Buchwalter. 2019. Learning representations by maximizing mutual information across views. In NeurIPS.
  • Ben-David et al. (2010) Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. 2010. A theory of learning from different domains. Machine learning (2010).
  • Ben-David et al. (2006) Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. 2006. Analysis of representations for domain adaptation. In NeurIPS.
  • Bousmalis et al. (2017) Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, and Dilip Krishnan. 2017. Unsupervised pixel-level domain adaptation with generative adversarial networks.
  • Cai et al. (2019) Qi Cai, Yingwei Pan, Chong-Wah Ngo, Xinmei Tian, Lingyu Duan, and Ting Yao. 2019. Exploring object relation in mean teacher for cross-domain detection. In CVPR.
  • Cai et al. (2020) Qi Cai, Yu Wang, Yingwei Pan, Ting Yao, and Tao Mei. 2020. Joint contrastive learning with infinite possibilities. In NeurIPS.
  • Carlucci et al. (2019) Fabio M Carlucci, Antonio D’Innocente, Silvia Bucci, Barbara Caputo, and Tatiana Tommasi. 2019. Domain generalization by solving jigsaw puzzles. In CVPR.
  • Caron et al. (2020) Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. 2020. Unsupervised learning of visual features by contrasting cluster assignments. In NeurIPS.
  • Chen et al. (2019b) Chaoqi Chen, Weiping Xie, Wenbing Huang, Yu Rong, Xinghao Ding, Yue Huang, Tingyang Xu, and Junzhou Huang. 2019b. Progressive feature alignment for unsupervised domain adaptation. In CVPR.
  • Chen et al. (2020) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In ICML.
  • Chen et al. (2021) Xinlei Chen, Saining Xie, and Kaiming He. 2021. An Empirical Study of Training Self-Supervised Visual Transformers. arXiv preprint arXiv:2104.02057 (2021).
  • Chen et al. (2019a) Yang Chen, Yingwei Pan, Ting Yao, Xinmei Tian, and Tao Mei. 2019a. Mocycle-gan: Unpaired video-to-video translation. In ACM MM.
  • Cubuk et al. (2020) Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. 2020. Randaugment: Practical automated data augmentation with a reduced search space. In CVPRW.
  • Cui et al. (2020) Shuhao Cui, Shuhui Wang, Junbao Zhuo, Chi Su, Qingming Huang, and Qi Tian. 2020. Gradually Vanishing Bridge for Adversarial Domain Adaptation. In CVPR.
  • French et al. (2018) Geoffrey French, Michal Mackiewicz, and Mark Fisher. 2018. Self-ensembling for domain adaptation. In ICLR.
  • Ganin and Lempitsky (2015) Yaroslav Ganin and Victor Lempitsky. 2015. Unsupervised domain adaptation by backpropagation. In ICML.
  • Ganin et al. (2016) Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. 2016. Domain-adversarial training of neural networks. JMLR (2016).
  • Ghifary et al. (2016) Muhammad Ghifary, W Bastiaan Kleijn, Mengjie Zhang, David Balduzzi, and Wen Li. 2016. Deep reconstruction-classification networks for unsupervised domain adaptation. In ECCV.
  • Gu et al. (2020) Xiang Gu, Jian Sun, and Zongben Xu. 2020. Spherical Space Domain Adaptation With Robust Pseudo-Label Loss. In CVPR.
  • Gutmann and Hyvärinen (2010) Michael Gutmann and Aapo Hyvärinen. 2010. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS.
  • Hadsell et al. (2006) Raia Hadsell, Sumit Chopra, and Yann LeCun. 2006. Dimensionality reduction by learning an invariant mapping. In CVPRW.
  • He et al. (2020) Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In CVPR.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR.
  • Hjelm et al. (2019) R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. 2019. Learning deep representations by mutual information estimation and maximization. In ICLR.
  • Hoffman et al. (2018) Judy Hoffman, Mehryar Mohri, and Ningshan Zhang. 2018. Algorithms and theory for multiple-source adaptation. In NeurIPS.
  • Hull (1994) Jonathan J. Hull. 1994. A database for handwritten text recognition research. IEEE Trans. on PAMI (1994).
  • Jing et al. (2020) Taotao Jing, Haifeng Xia, and Zhengming Ding. 2020. Adaptively-Accumulated Knowledge Transfer for Partial Domain Adaptation. In ACM MM.
  • Kang et al. (2019) Guoliang Kang, Lu Jiang, Yi Yang, and Alexander G Hauptmann. 2019. Contrastive adaptation network for unsupervised domain adaptation. In CVPR.
  • LeCun et al. (2001) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 2001. Gradient-based learning applied to document recognition. Intelligent Signal Processing (2001).
  • Li et al. (2017) Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. 2017. Deeper, broader and artier domain generalization. In ICCV.
  • Li et al. (2019a) Jingjing Li, Erpeng Chen, Zhengming Ding, Lei Zhu, Ke Lu, and Zi Huang. 2019a. Cycle-consistent conditional adversarial transfer networks. In ACM MM.
  • Li et al. (2019b) Shuang Li, Chi Harold Liu, Binhui Xie, Limin Su, Zhengming Ding, and Gao Huang. 2019b. Joint adversarial domain adaptation. In ACM MM.
  • Li et al. (2021) Yehao Li, Ting Yao, Yingwei Pan, and Tao Mei. 2021. Contextual Transformer Networks for Visual Recognition. arXiv preprint arXiv:2107.12292 (2021).
  • Lin et al. (2021) Jingyang Lin, Yingwei Pan, Rongfeng Lai, Xuehang Yang, Hongyang Chao, and Ting Yao. 2021. Core-Text: Improving Scene Text Detection with Contrastive Relational Reasoning. In ICME.
  • Liu et al. (2019) Hong Liu, Mingsheng Long, Jianmin Wang, and Michael Jordan. 2019. Transferable adversarial training: A general approach to adapting deep classifiers. In ICML.
  • Long et al. (2015) Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. 2015. Learning transferable features with deep adaptation networks. In ICML.
  • Long et al. (2018) Mingsheng Long, Zhangjie Cao, Jianmin Wang, and Michael I Jordan. 2018. Conditional adversarial domain adaptation. In NeurIPS.
  • Long et al. (2016) Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. 2016. Unsupervised domain adaptation with residual transfer networks. In NeurIPS.
  • Long et al. (2017) Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. 2017. Deep transfer learning with joint adaptation networks. In ICML.
  • Mancini et al. (2018) Massimiliano Mancini, Lorenzo Porzi, Samuel Rota Bulo, Barbara Caputo, and Elisa Ricci. 2018. Boosting domain adaptation by discovering latent domains. In CVPR.
  • Netzer et al. (2011) Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. 2011. Reading digits in natural images with unsupervised feature learning. In NeurIPS Workshop.
  • Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018).
  • Pan et al. (2019a) Yingwei Pan, Yehao Li, Qi Cai, Yang Chen, and Ting Yao. 2019a. Multi-Source Domain Adaptation and Semi-Supervised Domain Adaptation with Focus on Visual Domain Adaptation Challenge 2019. arXiv preprint arXiv:1910.03548 (2019).
  • Pan et al. (2020) Yingwei Pan, Ting Yao, Yehao Li, Chong-Wah Ngo, and Tao Mei. 2020. Exploring category-agnostic clusters for open-set domain adaptation. In CVPR.
  • Pan et al. (2019b) Yingwei Pan, Ting Yao, Yehao Li, Yu Wang, Chong-Wah Ngo, and Tao Mei. 2019b. Transferrable prototypical networks for unsupervised domain adaptation. In CVPR.
  • Peng et al. (2019) Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. 2019. Moment matching for multi-source domain adaptation. In ICCV.
  • Peng et al. (2017) Xingchao Peng, Ben Usman, Neela Kaushik, Judy Hoffman, Dequan Wang, and Kate Saenko. 2017. Visda: The visual domain adaptation challenge. arXiv preprint arXiv:1710.06924 (2017).
  • Roy et al. (2019) Subhankar Roy, Aliaksandr Siarohin, Enver Sangineto, Samuel Rota Bulo, Nicu Sebe, and Elisa Ricci. 2019. Unsupervised domain adaptation using feature-whitening and consensus loss. In CVPR.
  • Saito et al. (2018a) Kuniaki Saito, Yoshitaka Ushiku, Tatsuya Harada, and Kate Saenko. 2018a. Adversarial dropout regularization. In ICLR.
  • Saito et al. (2018b) Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tatsuya Harada. 2018b. Maximum classifier discrepancy for unsupervised domain adaptation. In CVPR.
  • Sajjadi et al. (2016) Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. 2016. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In NeurIPS.
  • Shen et al. (2018) Jian Shen, Yanru Qu, Weinan Zhang, and Yong Yu. 2018. Wasserstein distance guided representation learning for domain adaptation. In AAAI.
  • Sohn et al. (2020) Kihyuk Sohn, David Berthelot, Chun-Liang Li, Zizhao Zhang, Nicholas Carlini, Ekin D Cubuk, Alex Kurakin, Han Zhang, and Colin Raffel. 2020. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. In NeurIPS.
  • Sun et al. (2019) Yu Sun, Eric Tzeng, Trevor Darrell, and Alexei A Efros. 2019. Unsupervised domain adaptation through self-supervision. arXiv preprint arXiv:1909.11825 (2019).
  • Tang et al. (2020) Hui Tang, Ke Chen, and Kui Jia. 2020. Unsupervised Domain Adaptation via Structurally Regularized Deep Clustering. In CVPR.
  • Tarvainen and Valpola (2017) Antti Tarvainen and Harri Valpola. 2017. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results.
  • Tzeng et al. (2017) Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. 2017. Adversarial discriminative domain adaptation. In CVPR.
  • Venkateswara et al. (2017) Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. 2017. Deep hashing network for unsupervised domain adaptation. In CVPR.
  • Wang et al. (2020) Hang Wang, Minghao Xu, Bingbing Ni, and Wenjun Zhang. 2020. Learning to Combine: Knowledge Aggregation for Multi-Source Domain Adaptation. arXiv preprint arXiv:2007.08801 (2020).
  • Wang et al. (2019) Ximei Wang, Liang Li, Weirui Ye, Mingsheng Long, and Jianmin Wang. 2019. Transferable attention for domain adaptation. In AAAI.
  • Xiao et al. (2020) Liang Xiao, Jiaolong Xu, Dawei Zhao, Zhiyu Wang, Li Wang, Yiming Nie, and Bin Dai. 2020. Self-Supervised Domain Adaptation with Consistency Training. In ICPR.
  • Xu et al. (2020) Minghao Xu, Jian Zhang, Bingbing Ni, Teng Li, Chengjie Wang, Qi Tian, and Wenjun Zhang. 2020. Adversarial domain adaptation with domain mixup. In AAAI.
  • Xu et al. (2018) Ruijia Xu, Ziliang Chen, Wangmeng Zuo, Junjie Yan, and Liang Lin. 2018. Deep cocktail network: Multi-source unsupervised domain adaptation with category shift. In CVPR.
  • Xu et al. (2019) Ruijia Xu, Guanbin Li, Jihan Yang, and Liang Lin. 2019. Larger norm more transferable: An adaptive feature norm approach for unsupervised domain adaptation. In ICCV.
  • Yan et al. (2017) Hongliang Yan, Yukang Ding, Peihua Li, Qilong Wang, Yong Xu, and Wangmeng Zuo. 2017. Mind the class weight bias: Weighted maximum mean discrepancy for unsupervised domain adaptation. In CVPR.
  • Yang et al. (2020) Luyu Yang, Yogesh Balaji, Ser-Nam Lim, and Abhinav Shrivastava. 2020. Curriculum Manager for Source Selection in Multi-Source Domain Adaptation. In ECCV.
  • Yao et al. (2015) Ting Yao, Yingwei Pan, Chong-Wah Ngo, Houqiang Li, and Tao Mei. 2015. Semi-supervised domain adaptation with subspace learning for visual recognition. In CVPR.
  • Yao et al. (2021) Ting Yao, Yiheng Zhang, Zhaofan Qiu, Yingwei Pan, and Tao Mei. 2021. Seco: Exploring sequence supervision for unsupervised representation learning. In AAAI.
  • Zhang et al. (2020) Yabin Zhang, Bin Deng, Kui Jia, and Lei Zhang. 2020. Label Propagation with Augmented Anchors: A Simple Semi-Supervised Learning baseline for Unsupervised Domain Adaptation. In ECCV.
  • Zhang et al. (2019a) Yuchen Zhang, Tianle Liu, Mingsheng Long, and Michael I Jordan. 2019a. Bridging theory and algorithm for domain adaptation. arXiv preprint arXiv:1904.05801 (2019).
  • Zhang et al. (2019b) Yabin Zhang, Hui Tang, Kui Jia, and Mingkui Tan. 2019b. Domain-symmetric networks for adversarial domain adaptation. In CVPR.
  • Zhao et al. (2018) Han Zhao, Shanghang Zhang, Guanhang Wu, José MF Moura, Joao P Costeira, and Geoffrey J Gordon. 2018. Adversarial multiple source domain adaptation. In NeurIPS.
  • Zhao et al. (2020) Sicheng Zhao, Guangzhi Wang, Shanghang Zhang, Yang Gu, Yaxian Li, Zhichao Song, Pengfei Xu, Runbo Hu, Hua Chai, and Kurt Keutzer. 2020. Multi-source Distilling Domain Adaptation. In AAAI.