Deep neural networks (DNNs) have significantly improved the state of the art on many supervised tasks [8, 46, 39, 18]. However, without sufficient training data, DNNs generalize poorly to new tasks or new environments. This is known as the dataset bias or domain-shift problem. Unsupervised domain adaptation (UDA) [30, 12] aims to generalize a model learned from a source domain with rich annotated data to a new target domain without any labeled data. Recently, many approaches have been proposed to learn transferable representations by matching feature distributions across different domains [17, 42].
Motivated by generative adversarial networks [13], [43, 11] introduced a min-max game: a domain discriminator is learned by minimizing the error of distinguishing data samples from the source and the target domains, while a feature generator learns transferable features that are indistinguishable by the domain discriminator. This enforces that the learned features be domain-invariant. Meanwhile, a feature classifier (trained only on source-domain features) ensures that the learned features are class-conditional. Despite promising results, these adversarial methods suffer from inherent algorithmic weaknesses. Specifically, the generator may generate ambiguous features near class boundaries: while the generator manages to fool the discriminator, some target-domain features may still be misclassified. In other words, the model merely aligns the global marginal distribution of the two domains and ignores the class-conditional decision boundaries.
To overcome this issue, recent UDA models further align class-level distributions by taking the decision boundary into consideration. These methods either iteratively refine the decision boundary with empirical data [38, 34] or utilize multiple-view information [23]. Alternatively, the maximum classifier discrepancy (MCD) model [36] conducts a min-max game between a feature generator and two classifiers. Ambiguous target samples that are far from source-domain samples can be detected when the discrepancy between the two classifiers is maximized, as shown in Figure 1(b). Meanwhile, as the generator fools the classifiers, the generated target features may fall into the source feature regions. However, the target samples may not be smooth on the low-dimensional manifold [6, 28], meaning that neighboring samples may not belong to the same class. As a result, some generated target features could be miscategorized, as shown in Figure 1(c).
We propose the Contrastively Smoothed Class Alignment (CoSCA) model to improve the alignment of class-conditional feature distributions between source and target domains, by alternately estimating the underlying label hypothesis of target samples to map them into tighter clusters, and adapting feature representations based on a proposed contrastive loss. Specifically, by aligning ambiguous target samples near the decision boundaries with their neighbors and distancing them from non-neighbors, CoSCA enhances the alignment of each class in a contrastive manner. Figure 1(f) demonstrates an enhanced and smoothed version of the class-conditional alignment. Moreover, as shown in Figure 1(d), Maximum Mean Discrepancy (MMD) is included to better merge the source and target domain feature representations. The overall framework is trained end-to-end in an adversarial manner.
Our main contributions are summarized as follows:
We propose CoSCA, a novel approach that smooths class alignment for maximizing classifier discrepancy with a contrastive loss. CoSCA also provides better global domain alignment via the use of MMD loss.
We validate the proposed approach on several domain adaptation benchmarks, where extensive experiments demonstrate that CoSCA achieves state-of-the-art results.
Unsupervised Domain Adaptation. A practical solution for domain adaptation is to learn domain-invariant features whose distribution is similar across the source and target domains. For example, earlier work designed discriminative features by using clustering techniques and pseudo-labels. DAN [25] and JAN minimized the MMD loss between two domains. Adversarial domain adaptation was proposed to integrate adversarial learning and domain adaptation in a two-player game [11, 43, 42]. Following this idea, most existing adversarial-learning methods reduce feature differences by fooling a domain discriminator [26, 12]. However, these methods did not consider the relationship between target samples and the class-conditional decision boundaries when aligning features.
Class-conditional Alignment. Recent work enforces class-level alignment while aligning global marginal distributions. Adversarial Dropout Regularization (ADR) [34] and Maximum Classifier Discrepancy (MCD) [36] were proposed to train a neural network in an adversarial manner, avoiding generating non-discriminative features lying in the region near the decision boundary. [31, 27] considered class information when measuring domain discrepancy. Co-regularized Domain Adaptation (Co-DA) [23] utilized multi-view information to match the marginal feature distributions corresponding to the class-conditional distributions. Compared with previous work that executed the alignment by optimizing on "hard" metrics [36, 23], we propose to smooth the alignment iteratively, with an explicitly defined loss.
Contrastive Learning. The intuition for contrastive learning is to let the model understand the difference between one set (e.g., data points) and another, instead of only characterizing a single set. This idea has been explored in previous works that model intra-class compactness and inter-class separability (e.g., distinctiveness loss [7], contrastive loss [16], triplet loss [45]) and tangent distance. It has also been extended to consider several assumptions in semi-supervised and unsupervised learning [28, 24], such as the low-density region (or cluster) assumption [28, 33] that the decision boundary should lie in the low-density region, rather than crossing the high-density region. Recently, contrastive learning was applied in UDA [20], in which the intra- and inter-class domain discrepancies were modeled. In comparison, our work is based on the MCD framework, utilizing the low-density assumption and focusing on separating the ambiguous target data points by optimizing the contrastive objective, allowing the decision boundary to sit in the low-density region (i.e., the region of vacancy) and to satisfy the smoothness assumption.
The task of unsupervised domain adaptation seeks to generalize a learned model from a source domain to a target domain, the latter following a different (but related) data distribution from the former. Specifically, the source- and target-domain samples are denoted $\mathcal{S} = \{(x_i^s, y_i^s)\}_{i=1}^{|\mathcal{S}|}$ and $\mathcal{T} = \{x_j^t\}_{j=1}^{|\mathcal{T}|}$, respectively, where $x^s$ and $x^t$ are the input, and $y^s \in \{1, \ldots, K\}$ represents the data labels of $K$ classes in the source domain. The target domain shares the same label types as the source domain, but we possess no labeled examples from the target domain. We are interested in learning a deep network $G$ that reduces domain shift in the data distribution across $\mathcal{S}$ and $\mathcal{T}$, in order to make accurate predictions for $y^t$. We use the notation $(x^s, y^s) \sim \mathcal{S}$ to describe the source-domain samples and labels, and $x^t \sim \mathcal{T}$ for the unlabeled target-domain samples.
Adversarial domain adaptation approaches such as [36, 21] achieve this goal via a two-step procedure: (i) train a feature generator $G$ and the feature classifiers $F_1$ and $F_2$ with the source-domain data, to ensure the generated features are class-conditional; (ii) train $F_1$ and $F_2$ so that the prediction discrepancy between the two classifiers is maximized, and train $G$ to generate features that are distinctively separated. The maximum classifier discrepancy detects the target features that are far from the support of the source domain. As the generator tries to fool the classifiers (i.e., minimizing the discrepancy), these target-domain features are forced to be categorized and aligned with the source-domain features.
However, only measuring the divergence between the outputs of the two classifiers can be considered first-order moment matching, which may be insufficient for adversarial training. Previous work also observed similar issues [2, 41]. We tackle this challenge by adding the Maximum Mean Discrepancy (MMD) loss, which matches the difference via higher-order moments. Also, the class alignment in existing UDA methods takes into account the intra-class domain discrepancy only, which makes it difficult to separate samples within the same class that are close to the decision boundary. Thus, in addition to the discrepancy loss, we also measure both intra- and inter-class discrepancy across domains. Specifically, we propose to minimize the distance among target-domain features that fall into the same class based on decision boundaries, and to separate those features from different categories. During this process, ambiguous target features are simultaneously kept away from the decision boundaries and mapped into the high-density region, achieving better class alignment.
Global Alignment with MMD
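Following [36], we first train a feature generator $G$ and two classifiers $F_1$ and $F_2$ to minimize the softmax cross-entropy loss using the data from the labeled source domain $\mathcal{S}$, defined as:
$$\mathcal{L}(\mathcal{S}) = -\mathbb{E}_{(x^s, y^s) \sim \mathcal{S}} \Big[ \sum_{k=1}^{K} \mathbb{1}_{[k = y^s]} \big( \log p_1(y = k \mid x^s) + \log p_2(y = k \mid x^s) \big) \Big] \qquad (1)$$
where $p_1(y \mid x^s)$ and $p_2(y \mid x^s)$ are the probabilistic output of the two classifiers $F_1$ and $F_2$, respectively.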
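As an illustration of the source-domain objective described above, the following is a minimal NumPy sketch rather than the authors' implementation. It assumes each classifier emits raw logits of shape (batch, classes) and that the two cross-entropy terms are simply summed; the function names are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def source_classification_loss(logits1, logits2, labels):
    """Sum of the two classifiers' mean cross-entropy on a labeled source batch.

    Assumption: the two terms are added with equal weight.
    """
    idx = np.arange(labels.shape[0])
    p1 = softmax(logits1)
    p2 = softmax(logits2)
    ce1 = -np.log(p1[idx, labels]).mean()
    ce2 = -np.log(p2[idx, labels]).mean()
    return ce1 + ce2

# Toy usage: uniform logits over 3 classes give log(3) per classifier.
toy_logits = np.zeros((2, 3))
toy_labels = np.array([0, 2])
toy_loss = source_classification_loss(toy_logits, toy_logits, toy_labels)
```

With uniform predictions the loss equals twice log(3), which is a convenient sanity check before plugging in real classifier outputs.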
In addition to (1), we explicitly minimize the distance between the source and target feature distributions with MMD. The main idea of MMD is to estimate the distance between two distributions as the distance between sample means of the projected embeddings in a Hilbert space. Minimizing MMD is equivalent to minimizing the difference in all orders of moments [14]. In practice, the squared value of MMD is estimated with empirical kernel mean embeddings:
$$\mathcal{L}_{\mathrm{MMD}}(\mathcal{S}, \mathcal{T}) = \Big\| \frac{1}{N_s} \sum_{i=1}^{N_s} \phi\big(G(x_i^s)\big) - \frac{1}{N_t} \sum_{j=1}^{N_t} \phi\big(G(x_j^t)\big) \Big\|^2 \qquad (2)$$
where $\phi(\cdot)$ is the kernel mapping, $x_i^s \in \mathcal{S}$, $x_j^t \in \mathcal{T}$, and $N_s$ and $N_t$ denote the size of a training mini-batch of the data from the source domain $\mathcal{S}$ and the target domain $\mathcal{T}$, respectively; $\|\cdot\|$ denotes the $\ell_2$-norm. With the MMD loss $\mathcal{L}_{\mathrm{MMD}}$, the normalized features in the two domains are encouraged to be identically distributed, leading to better global domain alignment.
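The empirical MMD estimate described above can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the paper's code: it uses a single Gaussian (RBF) kernel with a hypothetical `bandwidth` parameter and the biased estimator, which expands the squared norm of the mean-embedding difference into three kernel averages.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 * bandwidth^2))."""
    sq_dists = (
        (a ** 2).sum(axis=1)[:, None]
        + (b ** 2).sum(axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative rounding errors
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd_squared(source_feats, target_feats, bandwidth=1.0):
    """Biased empirical estimate of squared MMD between two feature batches."""
    k_ss = rbf_kernel(source_feats, source_feats, bandwidth).mean()
    k_tt = rbf_kernel(target_feats, target_feats, bandwidth).mean()
    k_st = rbf_kernel(source_feats, target_feats, bandwidth).mean()
    return k_ss + k_tt - 2.0 * k_st

# Toy usage: identical batches give zero; a shifted batch gives a positive value.
rng = np.random.default_rng(0)
src = rng.normal(size=(64, 8))
mmd_same = mmd_squared(src, src)
mmd_shifted = mmd_squared(src, src + 3.0)
```

In practice the kernel choice and bandwidth matter; a mixture of bandwidths is a common alternative, but that is a design decision outside what the text specifies.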
Contrastively Smoothed Class Alignment
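Discrepancy Loss. The discrepancy loss represents the level of disagreement between the two feature classifiers in prediction for target-domain samples. Specifically, the discrepancy loss between $F_1$ and $F_2$ is defined as:
$$d(p_1, p_2) = \frac{1}{K} \sum_{k=1}^{K} \big| p_{1,k} - p_{2,k} \big| \qquad (3)$$
where $|\cdot|$ denotes the $\ell_1$-norm, and $p_{1,k}$ and $p_{2,k}$ are the probability output of $F_1$ and $F_2$ for the $k$-th class, respectively. Accordingly, we can define the discrepancy loss over the target domain $\mathcal{T}$:
$$\mathcal{L}_{\mathrm{adv}}(\mathcal{T}) = \mathbb{E}_{x^t \sim \mathcal{T}} \big[ d\big(p_1(y \mid x^t), p_2(y \mid x^t)\big) \big] \qquad (4)$$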
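Adversarial training is conducted in the Maximum Classifier Discrepancy (MCD) setup [36]:
$$\min_{F_1, F_2} \; \mathcal{L}(\mathcal{S}) - \lambda \, \mathcal{L}_{\mathrm{adv}}(\mathcal{T}), \qquad \min_{G} \; \mathcal{L}_{\mathrm{adv}}(\mathcal{T}) \qquad (5)$$
where $\mathcal{L}(\mathcal{S})$ is the source classification loss, $\mathcal{L}_{\mathrm{adv}}(\mathcal{T})$ is the discrepancy loss over the target domain, and $\lambda$ is a hyper-parameter. Minimizing the discrepancy between the two classifiers $F_1$ and $F_2$ induces smoothness for the clearly classified target-domain features, while the region in the vacancy among the ambiguous ones remains non-smooth. Moreover, MCD only utilizes the unlabeled target-domain samples, while ignoring the labeled source-domain data when estimating the discrepancy.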
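The classifier discrepancy discussed above can be illustrated with a short NumPy sketch. It assumes the discrepancy is the mean absolute difference between the two classifiers' class probabilities, averaged over classes and over the target batch, as in the MCD formulation; the function names are hypothetical.

```python
import numpy as np

def per_sample_discrepancy(p1, p2):
    """Mean absolute difference over classes, one value per target sample."""
    return np.abs(p1 - p2).mean(axis=1)

def classifier_discrepancy(p1, p2):
    """Batch-level discrepancy: average of the per-sample values."""
    return per_sample_discrepancy(p1, p2).mean()

# Toy usage: the classifiers disagree on the first sample and agree on the second.
probs1 = np.array([[0.9, 0.1], [0.5, 0.5]])
probs2 = np.array([[0.1, 0.9], [0.5, 0.5]])
sample_disc = per_sample_discrepancy(probs1, probs2)
batch_disc = classifier_discrepancy(probs1, probs2)
```

The classifiers are updated to increase this quantity on target data while the generator is updated to decrease it, so samples with a large per-sample value are the ambiguous ones the text refers to.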
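Contrastive Loss. To further optimize $G$ to estimate the underlying label hypothesis of target-domain samples, we propose to measure the intra- and inter-class discrepancy across domains, conditional on class information. By using an indicator defined as $c_{ij} = 1$ if $y_i = y_j$ and $c_{ij} = 0$ otherwise, we define the contrastive loss between $\mathcal{S}$ and $\mathcal{T}$ as:
$$\mathcal{L}_{\mathrm{contras}}(\mathcal{S}, \mathcal{T}) = \sum_{i} \sum_{j} \mathcal{D}\big(G(x_i^s), G(x_j^t), c_{ij}\big), \qquad c_{ij} = \mathbb{1}_{[y_i^s = \hat{y}_j^t]} \qquad (6)$$
where $\mathcal{D}$ is a distance measure (defined below), and $\hat{y}_j^t$ is the predicted target label for $x_j^t$. Specifically, (6) covers two types of class-aware domain discrepancies: (i) intra-class domain discrepancy ($c_{ij} = 1$); and (ii) inter-class domain discrepancy ($c_{ij} = 0$). Note that $y_i^s$ is known, providing some supervision for parameter learning. Similarly, we can define the contrastive loss between $\mathcal{T}$ and $\mathcal{T}$ as:
$$\mathcal{L}_{\mathrm{contras}}(\mathcal{T}, \mathcal{T}) = \sum_{i} \sum_{j} \mathcal{D}\big(G(x_i^t), G(x_j^t), c_{ij}\big), \qquad c_{ij} = \mathbb{1}_{[\hat{y}_i^t = \hat{y}_j^t]} \qquad (7)$$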
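To obtain the indicator $c_{ij}$, the estimated target label $\hat{y}^t$ is required. Specifically, for each data sample $x^t$, a pseudo label is predicted based on the maximum posterior probability of the two classifiers:
$$\hat{y}^t = \arg\max_{k} \; \big[ p_1(y = k \mid x^t) + p_2(y = k \mid x^t) \big] \qquad (8)$$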
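Ideally, based on the indicator, $\mathcal{D}$ should ensure the gathering of features that fall in the same class, while separating those in different categories. Following [28], we utilize contrastive Siamese networks [5], which can learn an invariant mapping to a smooth and coherent feature space and perform well in practice:
$$\mathcal{D}(z_i, z_j, c_{ij}) = \begin{cases} \| z_i - z_j \|^2, & c_{ij} = 1 \\ \max\big(0, \, m - \| z_i - z_j \|\big)^2, & c_{ij} = 0 \end{cases} \qquad (9)$$
where $z = G(x)$ and $m$ is a pre-defined margin. The margin loss constrains the neighboring features to be consistent. Based on the above definitions of source-and-target and target-and-target contrastive losses, the overall objective is obtained:
$$\mathcal{L}_{\mathrm{contras}} = \mathcal{L}_{\mathrm{contras}}(\mathcal{S}, \mathcal{T}) + \mathcal{L}_{\mathrm{contras}}(\mathcal{T}, \mathcal{T}) \qquad (10)$$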
Minimizing the contrastive loss encourages features in the same class to aggregate together while pushing unrelated pairs away from each other. In other words, the semantic feature approximation is enhanced to induce smoothness between data in the feature space.
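The pseudo-labeling, pair indicator, and margin-based contrastive loss described in this subsection can be sketched together in NumPy. This is a hedged illustration, not the authors' implementation: it assumes pseudo labels come from the class with the largest summed posterior of the two classifiers, a Euclidean feature distance, and averaging (rather than summing) over pairs. All function names and the `margin` default are hypothetical.

```python
import numpy as np

def pseudo_labels(p1, p2):
    """Assumed rule: class with the largest summed posterior of both classifiers."""
    return np.argmax(p1 + p2, axis=1)

def pair_indicator(labels_a, labels_b):
    """c[i, j] = 1 if labels_a[i] == labels_b[j], else 0."""
    return (labels_a[:, None] == labels_b[None, :]).astype(float)

def contrastive_loss(feats_a, feats_b, indicator, margin=1.0):
    """Margin-based contrastive loss averaged over all (i, j) feature pairs."""
    diff = feats_a[:, None, :] - feats_b[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    pull = indicator * dist ** 2                                   # same class: shrink distance
    push = (1.0 - indicator) * np.maximum(0.0, margin - dist) ** 2  # different class: enforce margin
    return (pull + push).mean()

# Toy usage: two source points with known labels, two target points with pseudo labels.
src_feats = np.array([[0.0, 0.0], [2.0, 0.0]])
src_labels = np.array([0, 1])
tgt_p1 = np.array([[0.8, 0.2], [0.3, 0.7]])
tgt_p2 = np.array([[0.6, 0.4], [0.4, 0.6]])
tgt_labels = pseudo_labels(tgt_p1, tgt_p2)
c = pair_indicator(src_labels, tgt_labels)

loss_aligned = contrastive_loss(src_feats, np.array([[0.0, 0.0], [2.0, 0.0]]), c)
loss_swapped = contrastive_loss(src_feats, np.array([[2.0, 0.0], [0.0, 0.0]]), c)
```

When target features sit on top of the source features of the same (pseudo) class the loss vanishes; swapping the two target features makes both the pull and push terms active. The same function applies to target-and-target pairs by passing target features and pseudo labels on both sides.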
We need to optimize $G$, $F_1$ and $F_2$ by combining all the aforementioned losses in an adversarial training manner. Specifically, we first train the classifiers $F_1$ and $F_2$ and the generator $G$ to minimize the objective:
We then train the classifiers $F_1$ and $F_2$ while keeping the generator $G$ fixed. The objective is:
Lastly, we train the generator $G$ with the following objective, while keeping both $F_1$ and $F_2$ fixed:
where the weighting coefficients are hyper-parameters that balance the different objectives. These steps are repeated, with the full algorithm summarized in Algorithm 1. In our experiments, the two inner-loop iteration numbers are both set to 2.
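The alternation of the three training steps can be sketched as a simple schedule. This is only an illustration of the control flow: Algorithm 1 is not reproduced here, and the sketch assumes that the two inner loops are the classifier-only update and the generator-only update. The step labels "A", "B" and "C" and the function name are hypothetical.

```python
def alternating_schedule(num_rounds, classifier_steps=2, generator_steps=2):
    """Yield the order of update steps for `num_rounds` outer iterations.

    "A": update generator and both classifiers jointly.
    "B": update both classifiers with the generator fixed.
    "C": update the generator with both classifiers fixed.
    Assumption: the inner loops repeat steps B and C.
    """
    for _ in range(num_rounds):
        yield "A"
        for _ in range(classifier_steps):
            yield "B"
        for _ in range(generator_steps):
            yield "C"

# Toy usage: one outer round with both inner-loop counts set to 2.
one_round = list(alternating_schedule(1))
```

A real training loop would dispatch each label to the corresponding optimizer step on a fresh mini-batch.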
Class-aware sampling. When training with the contrastive loss, it is important to sample a mini-batch of data with all the classes, to allow (10) to be fully trained. We propose to use a class-aware sampling strategy to enable efficient update of the network. Specifically, we randomly select a subset of classes and then sample data from each class. Consequently, in each mini-batch of the data, we are able to estimate the intra/inter-class discrepancy for each selected class.
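A class-aware sampler of the kind described above might look like the following NumPy sketch. It is an assumption-laden illustration: the function name and parameters are hypothetical, and for the target domain the `labels` array would hold pseudo labels rather than ground truth.

```python
import numpy as np

def class_aware_batch(labels, num_classes_per_batch, samples_per_class, rng):
    """Return sample indices covering a random subset of classes, equally represented."""
    classes = rng.choice(np.unique(labels), size=num_classes_per_batch, replace=False)
    batch = []
    for c in classes:
        members = np.flatnonzero(labels == c)
        # Sample with replacement only when a class has too few members.
        replace = members.size < samples_per_class
        batch.append(rng.choice(members, size=samples_per_class, replace=replace))
    return np.concatenate(batch)

# Toy usage: 5 classes with 10 samples each; draw 3 classes with 4 samples per class.
all_labels = np.repeat(np.arange(5), 10)
sampler_rng = np.random.default_rng(0)
batch_idx = class_aware_batch(all_labels, 3, 4, sampler_rng)
batch_classes = all_labels[batch_idx]
```

Because every selected class contributes the same number of samples, both same-class and different-class pairs are present in each mini-batch whenever at least two classes are drawn.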
Dynamic parameterization of the contrastive-loss weight. In our implementation, we adopt a dynamic weight for the contrastive loss, set along a Gaussian curve that rises from 0 to its maximum value over the course of training. This prevents unlabeled target features from gathering in the early stage of training, when the pseudo labels might not be reliable.
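A Gaussian ramp-up of this kind can be sketched as below. The exact curve used in the paper is not recoverable from the text, so this sketch assumes the form exp(-5 (1 - t)^2) that is common in semi-supervised learning; the function name, the constant 5, and the `rampup_steps` and `max_weight` parameters are all assumptions.

```python
import math

def gaussian_rampup(step, rampup_steps, max_weight=1.0):
    """Weight that rises from near 0 to max_weight along exp(-5 * (1 - t)^2)."""
    if rampup_steps <= 0:
        return max_weight
    t = min(max(step / rampup_steps, 0.0), 1.0)  # training progress clipped to [0, 1]
    return max_weight * math.exp(-5.0 * (1.0 - t) ** 2)

# Toy usage: weight at the start, midpoint, and end of a 100-step ramp-up.
w_start = gaussian_rampup(0, 100)
w_mid = gaussian_rampup(50, 100)
w_end = gaussian_rampup(100, 100)
```

Under this assumed curve the weight starts close to zero (not exactly zero), so early and possibly unreliable pseudo labels contribute little to the contrastive term.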
We evaluate the proposed model mainly on image datasets. To compare with MCD [36] as well as the state-of-the-art results in [38, 23], we evaluate on the same datasets used in those studies: the digit datasets (i.e., MNIST, MNISTM, Street View House Numbers (SVHN), and USPS), CIFAR-10, and STL-10. We also conduct experiments on VisDA, a large-scale image dataset. Our model can also be applied to non-visual domain adaptation tasks; to show this flexibility, we also evaluate it on the Amazon Reviews dataset.
For visual domain adaptation tasks, the proposed model is implemented based on VADA [38] and Co-DA [23] to avoid any incidental difference caused by network architecture. However, different from these models, our model does not require a discriminator, and only adopts the architecture for the feature generator and the classifier. We also include instance normalization [38, 44], achieving superior results on several benchmarks. For the VisDA dataset, we implemented our model based on the codebase of self-ensembling domain adaptation (SEDA) [10]. To compare with MCD [36], we re-implemented it using exactly the same architecture as our model.
In addition to the aforementioned baseline models, we also include the results from recently proposed unsupervised domain adaptation models. Note that standard domain adaptation methods (such as Transfer Component Analysis (TCA) [29] and Subspace Alignment (SA) [9]) are not included; these models only work on pre-extracted features, and are often not scalable to large datasets. Instead, we mainly compare our model with methods based on adversarial neural networks.
For the non-visual task, we adopt a one-layer CNN structure from previous work [22]. The feature generator consists of three components: a 300-dimensional word embedding layer using GloVe [32], a one-layer CNN with ReLU, and a max-over-time pooling layer through which the final sentence representation is obtained. The classifiers $F_1$ and $F_2$ can each be decomposed into one dropout layer and one fully connected output layer.
There are four types of digit images (i.e., four domains). MNIST and USPS both contain hand-written gray-scale images, and the domain difference between them is relatively small. MNISTM [12] is a dataset built upon MNIST by adding randomly colored image patches from the BSD500 dataset [1]. SVHN includes colored images of street numbers. All images are rescaled to a common size.
MNIST→SVHN. As gray-scale handwritten digits, images from MNIST have much lower dimensionality than the colored house numbers from SVHN. With such a large domain gap, MCD fails to align the features of the two domains. Figure 3(a) plots the t-SNE embedding of the features learned by MCD. Domains are indicated by different colors, and classes are indicated by different digit numbers. The maximized discrepancy provides too many ambiguous target-domain samples. As a result, the feature generator may not properly align them with the source-domain samples. In comparison, as shown in Figure 3(b), CoSCA utilizes the MMD between the source and the target domain features, thus maintaining a better global domain alignment. With further smoothed class-conditional adaptation, it achieves a test accuracy of 80.7, as shown in Table 1, competitive with state-of-the-art results from [23].
SVHN→MNIST. Classification on the MNIST dataset is easier than on the others. As shown in Table 1, source-only achieves 82.4 on SVHN→MNIST with instance normalization. Therefore, even with the same amount of domain difference, performance on SVHN→MNIST is much better than on MNIST→SVHN across all compared models. Our model achieves a test accuracy of 98.7.
MNIST→MNISTM. Since MNISTM is a colored version of MNIST, there exists a one-to-one matching between the two datasets, i.e., a domain adaptation model would perform well as long as domain-invariant features are properly extracted. CoSCA provides better results than Co-DA, yielding a test accuracy of 98.9.
MNIST→USPS. Evaluation on the MNIST and USPS datasets is also conducted to compare our model with other baselines. Ours achieves a test accuracy of 99.3.
CIFAR-10 and STL-10 Datasets
CIFAR-10 and STL-10 are both 10-class datasets, with each image containing an animal or a type of transportation. Images from each class are much more diverse than the digit datasets, with higher intrinsic dimensionality, which makes it a harder domain adaptation task. There are 9 overlapping classes between these two datasets. CIFAR provides images of size 32×32 and a large training set of 50,000 image samples, while STL contains higher quality images of size 96×96, but with a much smaller training set of 5,000 samples. Following [10, 38, 23], we remove non-overlapping classes from these two datasets and resize the images from STL to 32×32.
Due to the small training set in STL, STL→CIFAR is more difficult than CIFAR→STL. For the latter, the source-only model with no adaptation involved achieves an accuracy of 77.0. With adaptation, the margin-of-improvement is relatively small, while CoSCA provides the best improvement of 4.7 among all the models (Table 1). For STL→CIFAR, our model yields a 12.6 margin-of-improvement and an accuracy of 75.2. Figures 3(c) and 3(d) provide t-SNE plots for MCD and our model, respectively, which show that our model achieves much better alignment for each class.
The VisDA dataset is a large-scale image dataset that evaluates the adaptation from synthetic-object to real-object images. Images from the source domain are synthetic renderings of 3D models from different angles and lighting conditions. There are 152,397 image samples in the source domain, and 55,388 image samples in the target domain. The images are rescaled following prior work. A model architecture with ResNet101 [18] pre-trained on ImageNet is required. There are 12 different object categories in VisDA, shared by the source and the target domains.
Table 2 shows the test accuracy of different models in all object classes. The class-aware methods, namely MCD [36], SEDA [10], and our proposed CoSCA, outperform the source-only model in all categories. In comparison, the methods that are mainly based on distribution matching do not perform well in some of the categories. CoSCA outperforms MCD, showing the effectiveness of the contrastive loss and MMD global alignment. In addition, it performs better than SEDA in most categories, demonstrating its robustness in handling large-scale images.
We also evaluate CoSCA on the Amazon Reviews dataset collected by [3]. It contains reviews from several different domains, with 1000 positive and 1000 negative reviews in each domain.
Table 3 shows the average classification accuracy of different methods. We use the same model architecture and parameter setting for MCD and the source-only model. Results show that the proposed CoSCA outperforms all other methods. Specifically, it improves the test accuracy from 81.96 to 83.17 compared with the state-of-the-art method DAS. MCD achieves 81.35 and is also outperformed by CoSCA.
To further demonstrate the improvement of CoSCA over MCD [36], we conduct ablation studies. Specifically, with the same network architecture and setup, we compare model performance among 1) MCD, 2) MCD with only smooth alignment (MCD+Contras), 3) MCD with only global alignment (MCD+MMD), and 4) CoSCA, to validate the effectiveness of adding the contrastive loss and the MMD loss to MCD. As MCD has already achieved strong performance on some of the benchmark datasets, we mainly choose those tasks on which MCD does not perform very well, in order to better analyze the margin-of-improvement. Therefore, MNIST→SVHN, STL→CIFAR, and Amazon Reviews are selected for this experiment, and the results are provided in Table 4.
Effect of Contrastive Alignment. We compare CoSCA with MCD and several of its variants, to validate the effectiveness of the proposed contrastive alignment. Table 4 provides the test accuracy for every model across the selected benchmark datasets. For MNIST→SVHN, MCD+Contras outperforms MCD by 7.2. For STL→CIFAR and Amazon Reviews, the margin-of-improvement is 4.2 and 1.21, respectively (less significant than for MNIST→SVHN, possibly due to the smaller domain difference). Note that the results of MCD+Contras are still worse than those of CoSCA, demonstrating the effectiveness of the global domain alignment and the framework design of our model.
Effect of MMD. We further investigate how the MMD loss impacts the performance of our proposed CoSCA. Specifically, MCD+MMD achieves a test accuracy of 72.1 for MNIST→SVHN, only lifting the original result of MCD by 3.4. For STL→CIFAR and Amazon Reviews, the margin-of-improvement is 1.0 and 0.38, respectively. While this validates the effectiveness of having global alignment in the MCD framework, the improvement is small. Without a smoothed class-conditional alignment, MCD still encounters misclassified target features during training, leading to a sub-optimal solution. Notice that when comparing CoSCA with MCD+Contras, the improvement is significant for MNIST→SVHN, with validation accuracy and training stability enhanced. This demonstrates the importance of global alignment when there exists a large domain difference.
We have proposed Contrastively Smoothed Class Alignment (CoSCA) for the UDA problem, by explicitly combining intra-class and inter-class domain discrepancy and optimizing class alignment through end-to-end training. Experiments on several benchmarks demonstrate that our model can outperform state-of-the-art baselines. Our experimental analysis shows that CoSCA learns more discriminative target-domain features, and the introduced MMD feature matching improves the global domain alignment. For future work, we want to develop a theoretical interpretation of contrastive learning for domain adaptation, particularly characterizing its effects on the alignment of source and target domain feature distributions.
- (2011) Contour detection and hierarchical image segmentation. PAMI.
- (2017) Generalization and equilibrium in generative adversarial nets (GANs). In ICML.
- (2007) Domain adaptation for sentiment classification. In ACL.
- (2016) Domain separation networks. In NeurIPS.
- (1994) Signature verification using a "siamese" time delay neural network. In NeurIPS.
- (2009) Semi-supervised learning. IEEE Transactions on Neural Networks.
- Contrastive learning for image captioning. In NeurIPS.
- (2014) DeCAF: a deep convolutional activation feature for generic visual recognition. In ICML.
- (2013) Unsupervised visual domain adaptation using subspace alignment. In ICCV.
- (2018) Self-ensembling for domain adaptation. In ICLR.
- Unsupervised domain adaptation by backpropagation. arXiv preprint arXiv:1409.7495.
- (2016) Domain-adversarial training of neural networks. JMLR.
- (2014) Generative adversarial nets. In NeurIPS.
- (2012) A kernel two-sample test. JMLR.
- (2009) Covariate shift by kernel mean matching. MIT Press.
- (2006) Dimensionality reduction by learning an invariant mapping. In CVPR.
- (2017) Associative domain adaptation. In ICCV.
- (2016) Deep residual learning for image recognition. In CVPR.
- (2018) Adaptive semi-supervised learning for cross-domain sentiment classification. In ACL.
- (2019) Contrastive adaptation network for unsupervised domain adaptation. In CVPR.
- (2019) Unsupervised visual domain adaptation: a deep max-margin Gaussian process approach. arXiv preprint arXiv:1902.08727.
- (2014) Convolutional neural networks for sentence classification. In EMNLP.
- (2018) Co-regularized alignment for unsupervised domain adaptation. In NeurIPS.
- (2017) Triple generative adversarial nets. In NeurIPS.
- (2015) Learning transferable features with deep adaptation networks. In ICML.
- (2018) Conditional adversarial domain adaptation. In NeurIPS.
- (2016) Unsupervised domain adaptation with residual transfer networks. In NeurIPS.
- (2017) Smooth neighbors on teacher graphs for semi-supervised learning. In CVPR.
- (2011) Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks.
- A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering.
- (2018) Multi-adversarial domain adaptation. In AAAI.
- GloVe: global vectors for word representation. In EMNLP.
- (2011) The manifold tangent classifier. In NeurIPS.
- (2017) Adversarial dropout regularization. arXiv preprint arXiv:1711.01575.
- (2017) Asymmetric tri-training for unsupervised domain adaptation. In ICML.
- (2018) Maximum classifier discrepancy for unsupervised domain adaptation. In CVPR.
- (2016) Learning transferrable representations for unsupervised domain adaptation. In NeurIPS.
- (2018) A DIRT-T approach to unsupervised domain adaptation. In ICLR.
- (2015) Very deep convolutional networks for large-scale image recognition. In ICLR.
- (2011) Unbiased look at dataset bias. In CVPR.
- (2018) Learning to adapt structured output space for semantic segmentation. In CVPR.
- (2015) Simultaneous deep transfer across domains and tasks. In ICCV.
- (2017) Adversarial discriminative domain adaptation. In CVPR.
- (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022.
- Learning fine-grained image similarity with deep ranking. In CVPR.
- (2014) How transferable are features in deep neural networks? In NeurIPS.
- (2018) Pivot based language modeling for improved neural domain adaptation. In ACL.
- (2013) Contrastive learning using spectral methods. In NeurIPS.