Unsupervised Domain Adaptation Papers and Code
Deep learning methods have shown promise in unsupervised domain adaptation, which aims to leverage a labeled source domain to learn a classifier for the unlabeled target domain with a different distribution. However, such methods typically learn a domain-invariant representation space to match the marginal distributions of the source and target domains, while ignoring their fine-level structures. In this paper, we propose Cluster Alignment with a Teacher (CAT) for unsupervised domain adaptation, which can effectively incorporate the discriminative clustering structures in both domains for better adaptation. Technically, CAT leverages an implicit ensembling teacher model to reliably discover the class-conditional structure in the feature space for the unlabeled target domain. Then CAT forces the features of both the source and the target domains to form discriminative class-conditional clusters and aligns the corresponding clusters across domains. Empirical results demonstrate that CAT achieves state-of-the-art results in several unsupervised domain adaptation scenarios.
Code for "Cluster Alignment with a Teacher for Unsupervised Domain Adaptation"
Deep learning has achieved remarkable performance in a wide variety of computer vision tasks, such as image recognition and object detection. However, classifiers trained on specific datasets cannot always generalize effectively to new datasets owing to the well-known domain shift problem [5, 44]. Enabling models to generalize from a source domain to a target domain is usually referred to as domain adaptation (DA). In many cases, it is expensive or difficult to collect annotations on the target domain. The problem of transferring from a fully labeled source domain to an unlabeled target domain is called unsupervised domain adaptation (UDA). UDA is particularly challenging because the target domain cannot provide explicit information to facilitate the adaptation of classifiers.
Recently, deep models have shown promise in unsupervised domain adaptation by learning expressive features [46, 7, 8, 23, 21, 45, 24, 38, 40, 37]. These deep UDA methods mainly focus on matching the source and target domains via adversarial training [7, 45, 2, 21, 24, 49, 38, 14] or kernelized training [23, 24, 25]. The main hypothesis behind them is that the marginal distributions of the two domains can be aligned in some feature space learned by optimizing a deep network, and thus the classifier trained with source data tends to perform well on the target domain. Theoretical analyses also show that minimizing the divergence between the marginal distributions in the learned feature space helps to reduce the classifier’s error.
However, these methods have shortcomings. In classification, as the classes correspond to different semantics and different characteristics, the marginal distribution of the data naturally has a class-conditional multi-modal structure. Moreover, the modes corresponding to the same class in different domains are not always geometrically similar. Thus, it is not sufficient for existing deep UDA methods to only minimize the discrepancy between the marginal distributions while neglecting their structures, and such methods tend to fail in challenging cases, such as those in Fig. 1. Properly incorporating this fine-grained class-conditional structure has been shown to be beneficial in various tasks. For example, Shi and Sha make the discriminative clustering assumption, which helps to adapt the decision boundaries of the source domain to the target domain discriminatively. (Besides UDA tasks, previous work has also explored class-conditional structures for learning deep models that are robust against adversarial attacks.) However, one limitation of their method is that it adopts a simple linear transformation to learn the feature space, which cannot extract high-order features from raw data (e.g., images) as effectively as deep UDA methods. Another limitation is that it builds a nearest-neighbor-based prediction model, which outputs the prediction for one sample based on all the source data. As a result, its training is not compatible with the stochastic training of deep networks and has high complexity.
In this paper, we present Cluster Alignment with a Teacher (CAT), a new deep UDA model that incorporates the class-conditional structures for more effective adaptation. CAT combines the complementary advantages of deep learning methods and discriminative clustering methods for UDA. Technically, there are three learning objectives in CAT. First, CAT minimizes the supervised classification loss on the labeled source data and builds a teacher classifier, an implicit ensemble of the source classifier, to provide pseudo labels for unlabeled target data. The underlying notion is that the classifier trained on the source domain can perform well on a majority of target samples because of the similarity between the two domains, and that the teacher-student paradigm is not sensitive to false pseudo labels. To exploit the fine-grained class-conditional structures in the feature space and address the aforementioned issues of existing deep UDA methods, CAT also includes two objectives which depend on the pseudo labels provided by the teacher classifier. On the one hand, for discriminative learning in both domains, CAT deploys a class-conditional clustering loss to force the features from the same class to concentrate together and the features from different classes to be separated. On the other hand, for the class-conditional alignment between the two domains, CAT aligns the clusters which correspond to the same class but come from different domains via a conditional feature matching loss. The prediction models used in CAT are a student deep network and its implicit ensemble (i.e., the teacher classifier), thus CAT can address the training issues of Shi and Sha's method and also enjoys more flexible feature learning. Furthermore, CAT is compatible with the marginal distribution alignment methods on tasks where the source data is distributed similarly to the target data. The former can provide a fine-grained class-conditional alignment of domains and the latter can provide a global alignment of them.
We evaluate the proposed CAT through extensive experiments on both synthetic and real-world datasets. Empirical results show that CAT presents striking performance across various tasks. Moreover, CAT can be further combined with the existing deep UDA methods, which globally match the marginal distributions. We found that CAT can bias them successfully to achieve the discriminative alignment between domains, establishing new state-of-the-art baselines on popular benchmarks. In the combined methods, we also propose a confidence-thresholding technique to filter out low-confidence target samples (which are likely to be mapped into incorrect clusters by the marginal distribution alignment methods) to enhance the stability of the training.
To summarize, the contributions of the paper are three-fold:
We consider and exploit the discriminative class-conditional structures of distributions in deep UDA and propose CAT to achieve better alignment between the source domain and the target domain.
The proposed CAT is compatible and applicable to the existing UDA methods which rely on marginal distribution alignment.
We empirically show that CAT is not sensitive to hyper-parameters and can boost the marginal distribution alignment approaches significantly, achieving new state-of-the-art across various settings.
Unsupervised domain adaptation has drawn increasing interest and has been developed mainly in two directions: Maximum Mean Discrepancy (MMD) based approaches and adversarial training based approaches. Tzeng et al. and Long et al. minimize MMD to match the two domains, while the Joint MMD criterion has been proposed to align their joint distributions. Since the development of Generative Adversarial Networks (GANs), adversarial training has been applied to domain adaptation and many fruitful works have emerged. Ganin et al. [8, 7] develop the framework of domain adversarial training, and many works have been proposed to improve it by better aligning the source domain and target domain in the feature space [45, 49, 15] or image space. Zhang et al. extend RevGrad to consider each domain’s characteristics using collaborative games. Image-to-image translation approaches [9, 2, 14, 35, 29, 40] also play an important role in the advancement of domain adaptation and demonstrate impressive performance, especially on semantic segmentation tasks. In addition, Saito et al. propose to align the two domains using the decision boundaries of task-specific classifiers. Associative DA proposes an associative loss to reduce the discrepancy between domains, and SimNet proposes to use a similarity-based classifier in UDA. Though the existing methods match the two domains in different ways, most of them ignore the discriminative information in the alignment procedure, which may lead to failed adaptation because of improper alignment between the source domain and the target domain. In contrast, CAT explicitly discovers the class-conditional structure using a teacher model and constructs a more reasonable matching procedure based on it.
The use of a teacher model is inspired by consistency-based methods in semi-supervised learning (SSL) [17, 43]. Recent attempts to apply SSL techniques in UDA include [6, 42, 47]. CAT differs from these previous works in that CAT exploits the discriminative class-conditional structures in both the alignment and classification procedures, while they focus on improving the classifier for the target domain by implementing the cluster assumption. CAT imposes a much stronger regularization and assists in a better alignment.
In this section, we first introduce the setting and framework of deep UDA and then present Cluster Alignment with a Teacher (CAT). Finally, we discuss CAT.
In a UDA task, we are given a set of labeled source samples and a set of unlabeled target samples. Notably, the two sets of samples are drawn from different distributions, which leads to the domain shift challenge. Therefore, UDA algorithms should learn to adapt the classifier trained on the source domain to the unlabeled target domain. Deep learning techniques have been introduced into UDA [8, 45, 2, 21, 24, 38, 23, 25] and they demonstrate remarkable performance across tasks. Generally, in these methods, the parameterized classifier is constructed as the composition of a feature extractor, which maps samples into a feature space, and a predictor, which outputs predictions based on the extracted features. Learning simultaneously optimizes the classifier w.r.t. the labeled source data and minimizes the distance between the marginal distributions of the two domains in the feature space, resulting in a domain-invariant feature space. Technically, in the source domain, we minimize a supervised loss: the average over the labeled source samples of a pre-defined loss (e.g., cross-entropy loss) between the prediction and the label. Meanwhile, we minimize a discrepancy loss: a distance between the source and target feature distributions that is usually correlated with the distance in the error bound theory of DA. The theory reveals that the expected error on target samples of any classifier drawn from a hypothesis set is bounded [1, 49] in terms of three quantities: the expected error of the classifier on source samples, the distance between the marginal distributions of the two domains, and the disagreement between the labelling functions of the source and target domains, measured in the target domain. Notably, the source error can be made small by optimizing the classifier w.r.t. the labeled source data. The supervised loss and the discrepancy loss minimize the first and second quantities, respectively, to obtain a small target domain classification error.
However, the methods working in the above framework have shortcomings. Theoretically, they do not minimize the disagreement between the labelling functions, which may lead to a large upper bound on the target error and result in unsatisfactory target domain performance. Empirically, the data in classification naturally has a class-conditional multi-modal structure, thus aligning marginal distributions while ignoring the fine-level discriminative structures of domains may hurt the target classification performance (Fig. 1-left). Moreover, these methods may fail in more practical and challenging problems, e.g., when the source and target domains have markedly different class imbalance ratios (Fig. 1-right).
To overcome these issues, we exploit the fine-level structures in the feature space for discriminative learning and match the class-conditional distributions of the source and target domains to reduce the mismatch between the labelling functions of the two domains. Therefore, in the deep UDA scenario, we propose Cluster Alignment with a Teacher (CAT), a new deep UDA model for more effective adaptation. Specifically, for the objectives of discriminative learning and class-conditional alignment between domains, we propose a discriminative clustering loss to force the features of both the source and the target domains to form discriminative clusters, and a cluster-based alignment loss to align the clusters corresponding to the same class in different domains. Given them, we train CAT by minimizing the supervised loss together with the discriminative clustering loss and the cluster alignment loss, where a hyper-parameter sets the relative trade-off. The whole framework is shown in Fig. 2. We build a teacher classifier, an implicit ensemble of the classifier to be optimized, to provide pseudo labels for the unlabeled target data. These pseudo labels are used in both the clustering loss and the alignment loss. We use stochastically sampled mini-batches in the two objectives and the two classifiers make predictions by forward propagation, thus CAT can be trained more efficiently than the method of Shi and Sha. Furthermore, the two losses optimize the feature space directly and are expected to be more effective than the nearest-neighbor-based clustering loss of that method. We elaborate on the two losses in the following sections.
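The optimization problem described above can be written compactly as follows. This is only a sketch in assumed notation: the symbols for the supervised loss, the clustering loss, the alignment loss, the classifier parameters and the trade-off weight are chosen here for illustration, and the way the single weight multiplies both auxiliary losses is an assumption consistent with the text's mention of one trade-off hyper-parameter.

```latex
% Assumed notation: \theta = classifier parameters, \mathcal{L}_y = supervised loss,
% \mathcal{L}_c = discriminative clustering loss, \mathcal{L}_a = cluster alignment loss,
% \alpha = trade-off hyper-parameter.
\min_{\theta} \; \mathcal{L}_y \;+\; \alpha \,\bigl( \mathcal{L}_c + \mathcal{L}_a \bigr)
```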
For better classification and alignment, we propose to discover the class-conditional structures in the feature space in both the source and the target domains, and then shape them to be discriminative clusters. In the source domain, the class-conditional structure is obvious because the data is fully labeled. Nevertheless, in the target domain, we cannot obtain the class-conditional structure easily due to the lack of labels. The semantic similarity between the two domains implies that the classifier trained on the source domain can predict most of target samples correctly. Consequently, using pseudo labels  as the annotations for target data and conducting self-training is a direct approach, but it suffers from the error amplification issue which can be detrimental to the learning procedure. To discover the class-conditional structure of the target features in a reliable way, we introduce a teacher classifier defined as an implicit ensemble of the previous student classifier  to provide pseudo labels for target data.
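One way such a teacher could be realised is a temporal ensemble, i.e., an exponential moving average of the student's past softmax outputs on each target sample. The NumPy sketch below is illustrative only: the class name, method names and the start-up bias correction are assumptions, not the paper's implementation; the decay of 0.6 is the accumulation decay constant reported in the implementation details.

```python
import numpy as np


class TemporalEnsembleTeacher:
    """Hypothetical teacher: a running average of student predictions per target sample."""

    def __init__(self, num_samples, num_classes, decay=0.6):
        self.decay = decay
        self.ensemble = np.zeros((num_samples, num_classes))
        self.updates = np.zeros(num_samples, dtype=int)

    def update(self, indices, student_probs):
        # Blend the newest student predictions into the running average.
        # Assumes `indices` contains no duplicates.
        self.ensemble[indices] = (
            self.decay * self.ensemble[indices]
            + (1.0 - self.decay) * np.asarray(student_probs, dtype=float)
        )
        self.updates[indices] += 1

    def predict(self, indices):
        # Correct the start-up bias of the zero-initialised average, then
        # return pseudo labels and the teacher's confidence in them.
        correction = 1.0 - self.decay ** np.maximum(self.updates[indices], 1)
        probs = self.ensemble[indices] / correction[:, None]
        return probs.argmax(axis=1), probs.max(axis=1)
```

Because each pseudo label averages several past predictions, a single wrong student output changes the teacher's label only if it outweighs the accumulated history, which is the intuition behind using the teacher instead of plain self-training.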
Based on the pseudo labels given by the teacher, we can explicitly force the target class-conditional structure to be more discriminative using a clustering loss. For the source domain, a similar one can be applied. Formally, we employ a discriminative clustering loss to learn a more discriminative feature space for the two domains (the dependence on the model parameters is omitted for clarity, unless stated otherwise). The loss is defined over pairs of features and uses three ingredients: a distance between two features (e.g., squared Euclidean distance), a pre-defined margin, and an indicator function which outputs 1 only if the two samples have the same ground-truth label (source domain) or teacher-annotated label (target domain). The loss encourages the features from the same class to concentrate together and pushes the features from different classes apart by a distance of at least the margin. This loss modifies the structures in the representation space gradually and, consequently, the space exhibits a class-conditional cluster structure (as shown in Sec. 4.4). Note that minimizing this loss is consistent with the cluster assumption of classification and benefits classification performance.
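The behaviour described above (pull same-class pairs together, push different-class pairs at least a margin apart) can be sketched as a contrastive-style pairwise loss. The function name, the averaging over all pairs and the default margin value are assumptions made for illustration; only the squared Euclidean distance, the margin and the same-label indicator come from the text.

```python
import numpy as np


def discriminative_clustering_loss(features, labels, margin=3.0):
    """Sketch of a pairwise clustering loss on one domain's mini-batch.

    `labels` are ground-truth labels for source features and teacher
    pseudo labels for target features. `margin` is a placeholder value.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    # Pairwise squared Euclidean distances, shape (n, n).
    diff = features[:, None, :] - features[None, :, :]
    dist = (diff ** 2).sum(axis=-1)
    # Indicator: True where the two samples share a label.
    same = labels[:, None] == labels[None, :]
    pull = np.where(same, dist, 0.0)
    push = np.where(same, 0.0, np.maximum(0.0, margin - dist))
    return float((pull + push).mean())
```

Under this sketch the loss would be evaluated once on the source batch and once on the target batch and the two values summed; pairs from different classes contribute nothing once they are farther apart than the margin.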
It’s a common doubt whether the incorrect predictions of the teacher classifier would destroy the training dynamics. However, previous works on semi-supervised learning [17, 43] have validated that this kind of training always leads to good convergence and demonstrates robustness against incorrect labels. Intuitively, the teacher instructs the training of one instance through a bundle of others’ predictions which alleviates the negative influence of incorrect predictions notably and aids the student to give better predictions.
Once the feature space presents a discriminative cluster structure, the classifier is expected to make more accurate predictions. However, the label predictor trained on source domain features may fail due to the geometrical mismatch between the clusters which correspond to the same class in different domains. This kind of mismatch is caused by the individual characteristics of each domain. As a result, it is necessary to impose a class-conditional alignment of the two domains to learn better domain-invariant features and adjust the target feature space to be more suitable for classification. Naturally, we expect to minimize, for each class, the divergence between the set of source features belonging to that class and the set of target features annotated as that class by the teacher. Many previous works [10, 20] minimize the distance between two sets of samples, but we aim to achieve this in a simpler and more efficient way by exploiting the separable and tight clusters in the feature space. Drawing inspiration from feature matching GANs, which optimize the distance between the first-order statistics of distributions and demonstrate striking results on SSL tasks, we extend feature matching to work in a conditional way. Formally, we introduce a cluster alignment loss that measures, for each class, the distance between the mean feature of the source samples whose ground-truth label is that class and the mean feature of the target samples annotated as that class by the teacher classifier. This loss is slightly different from the original feature matching loss: it matches the statistics of the representation space, which totally determine the predictions, instead of those produced by an extra critic network. Arguably, the objective has a local optimum where the class-conditional distributions are matched thoroughly. The cluster alignment loss and the discriminative clustering loss work together to align the class-conditional structures of the two domains in a discriminative way.
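A minimal mini-batch sketch of conditional feature matching follows. It assumes a squared Euclidean distance between class means and an average over the classes present in both batches; classes missing from either batch are skipped, as the mini-batch discussion later in the text prescribes. The function and argument names are illustrative, not the paper's code.

```python
import numpy as np


def cluster_alignment_loss(src_feats, src_labels, tgt_feats, tgt_pseudo, num_classes):
    """Mean over classes of the squared distance between class-conditional feature means."""
    src_feats = np.asarray(src_feats, dtype=float)
    tgt_feats = np.asarray(tgt_feats, dtype=float)
    src_labels = np.asarray(src_labels)
    tgt_pseudo = np.asarray(tgt_pseudo)
    terms = []
    for k in range(num_classes):
        src_k = src_feats[src_labels == k]
        tgt_k = tgt_feats[tgt_pseudo == k]
        if len(src_k) == 0 or len(tgt_k) == 0:
            continue  # class absent from one batch: drop its term
        gap = src_k.mean(axis=0) - tgt_k.mean(axis=0)
        terms.append(float(gap @ gap))
    # Average only over the classes that contributed a term.
    return sum(terms) / len(terms) if terms else 0.0
```

Matching only first-order statistics is cheap, and it is plausible here because the clustering loss is expected to make each class a tight, well-separated cluster whose mean summarises it well.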
In fact, the source domain and target domain in the existing popular UDA tasks (e.g., digits adaptation and Office-31) have analogous marginal distributions. Therefore, in these experiments, we combine CAT with the marginal distribution alignment methods, and CAT biases them toward matching the cluster-based marginal distributions. Because these methods ignore discriminability, the stability of training and the quality of the converged models may suffer. For example, several target circle samples in Fig. 1-left will be misclassified by them.
We aim to improve these models based on the observation that, in the early stages of training, a portion of target samples lie around the decision boundaries of the adapted classifier, i.e., they have low classification confidence (the largest output probability) and are likely to be misclassified. Therefore, these samples are likely to be mapped into incorrect clusters in the marginal alignment process, and the training falls into local optima. To solve this, we propose a confidence-thresholding method that holds out uncertain data points whose confidence is below a threshold, while aligning the confident instances, which are geometrically closer to the source domain, with the source data. Formally, we instantiate this technique in the typical and simple RevGrad and propose robust RevGrad (rRevGrad). Its adversarial loss involves a parameterized critic model and an indicator function which outputs 1 only if the teacher’s confidence on a target sample is greater than the threshold. As the divergence between the two domains decreases, more and more target samples are selected into the domain adversarial training. Gradually, almost all the target samples are included in the training, which avoids the loss of target information. We empirically observed that rRevGrad improves the classification performance on the target domain and enhances the stability of training (see Sec. 4).
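The confidence-thresholding idea can be illustrated with a masked version of the standard domain-adversarial objective. This is a sketch under assumptions: the function name, the threshold value of 0.9, the normalisation over the selected target samples and the convention that the critic outputs the probability of "source" are all illustrative choices, not details confirmed by the text.

```python
import numpy as np


def masked_domain_loss(src_prob, tgt_prob, teacher_conf, threshold=0.9):
    """Critic objective in which only confident target samples take part.

    src_prob / tgt_prob: critic's probability that each feature comes from
    the source domain. teacher_conf: teacher's largest output probability
    for each target sample. Returns the objective and the selection mask.
    """
    src_prob = np.asarray(src_prob, dtype=float)
    tgt_prob = np.asarray(tgt_prob, dtype=float)
    # Indicator: keep a target sample only if the teacher is confident.
    mask = np.asarray(teacher_conf, dtype=float) > threshold
    eps = 1e-12  # guards against log(0)
    src_term = np.log(src_prob + eps).mean()
    tgt_term = np.log(1.0 - tgt_prob[mask] + eps).mean() if mask.any() else 0.0
    return src_term + tgt_term, mask
```

In a RevGrad-style setup the critic would maximise this quantity while the feature extractor minimises it through gradient reversal; as training progresses and the teacher grows more confident, the mask admits more target samples.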
Comparison with SSL based deep UDA methods [42, 37, 6]. CAT not only implements the cluster assumption for better classification but also imposes a class-conditional alignment between domains, which is more fundamental in UDA. In contrast, [42, 37, 6] focus on improving the classifier to make it more consistent and robust for the target domain based on the cluster assumption. Thus CAT is compatible with these methods (see Sec. 4).
Comparison with MSTN. The cluster alignment loss using the conditional feature matching technique is somewhat similar to the semantic loss in MSTN. However, in MSTN, minimizing the distance between centers is necessary but not sufficient to achieve semantic alignment. In CAT, we regularize the features to form separable and tight clusters, so the feature matching based loss can match the clusters naturally. The two methods also differ in implementation.
Mini-batch stochastic training of CAT. We implement the two objectives in CAT using stochastically sampled mini-batches. Specifically, the discriminative clustering loss is an instance-wise loss and works well with mini-batches. In the cluster alignment loss, the class-conditional expectation of the source or target features may be undefined when a mini-batch contains no points belonging to a given class. In that case, we remove the term corresponding to that class in Eq. 8 and calculate the mean of the other terms. We empirically find that CAT adds only modest training time when combined with existing methods.
Teacher-student paradigm. First, using the teacher as the labeling function on the target domain avoids the error amplification issue. Furthermore, once the classifier becomes more accurate on the target domain, the teacher classifier performs better as well. Then the pseudo labels used in the clustering and alignment losses are more likely to be correct, which in turn enhances the classifier. Consequently, a mutually reinforcing cycle between them is formed.
To demonstrate the effectiveness of CAT, we evaluate it through various experiments on a synthetic imbalanced dataset and three challenging UDA tasks: SVHN-MNIST-USPS, Office-31 and ImageCLEF-DA. We show that CAT considers and exploits the fine-level class-conditional structures of the source and target domains, and makes the learned feature space discriminative and aligned, thus yielding improved performance on the target domain.
Datasets and configurations. SVHN-MNIST-USPS is a challenging digit adaptation task across three datasets: SVHN, MNIST and USPS. We conduct experiments in three directions: SVHN to MNIST, MNIST to USPS and USPS to MNIST. Following the established protocol, we use the whole training sets for the adaptation from SVHN to MNIST and randomly sample 2000 images from MNIST and 1800 images from USPS for the adaptation between the two datasets. Following MSTN, the images are cast to a fixed input size when using LeNet as the classifier. When combining with MCD and VADA, we use the same settings as the original methods.
Imbalanced SVHN-MNIST-USPS. We randomly sample 1000 instances from class 0 and 100 instances from class 1 from the original source domain and construct a new one. Then we sample 100 instances from class 0 and 1000 instances from class 1 from the target domain to form a new target. Therefore, the synthetic adaptation dataset contains several imbalanced two-class adaptation tasks. The experiment settings are the same as those of SVHN-MNIST-USPS.
Office-31 and ImageCLEF-DA are two real-world datasets which are widely used in domain adaptation research. Office-31 is composed of three domains: Amazon (A), DSLR (D) and Webcam (W), containing 2817, 498 and 795 images from 31 categories, respectively. ImageCLEF-DA includes three domains: Caltech-256 (C), ImageNet ILSVRC 2012 (I) and Pascal VOC 2012 (P), each containing 600 images from 12 classes. We use data augmentation such as random flipping and cropping in training for fair comparison with the baselines.
Implementation. In the synthetic and digits experiments using LeNet, we select a hyper-parameter value according to the performance of CAT on the synthetic dataset, and we forward-propagate a target sample twice under different perturbations (e.g., dropout) and use the latter output as the prediction of the teacher for its simplicity (similar to the Π-model). In all the other experiments, we use a fixed value and deploy a temporal ensemble of the classifier's previous predictions as the teacher (the accumulation decay constant is set to 0.6). We design a ramp-up function similar to that used in prior work to update the trade-off weight in the experiments using LeNet and, in the others, set it with the schedule suggested by RevGrad, in which the training progress increases linearly from 0 to 1. One further hyper-parameter is kept at a single value in all the experiments without tuning. Refer to Appendix D for more details of the architectures and optimization settings.
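For reference, the schedule introduced with RevGrad (Ganin and Lempitsky) raises a weight smoothly from 0 towards 1 as training progresses. The sketch below uses that published form, 2 / (1 + exp(-gamma * p)) - 1 with gamma = 10; whether the experiments here use exactly this constant is an assumption, and the function name and arguments are illustrative.

```python
import math


def revgrad_weight(step, total_steps, gamma=10.0):
    """Ramp-up weight as a function of training progress p in [0, 1]."""
    # p grows linearly with the iteration count and is clipped to [0, 1].
    p = min(max(step / total_steps, 0.0), 1.0)
    return 2.0 / (1.0 + math.exp(-gamma * p)) - 1.0
```

Starting near zero keeps the noisy early pseudo labels and domain signal from dominating the supervised loss, while the weight approaches 1 once the features have stabilised.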
We first test CAT on the imbalanced SVHN-MNIST-USPS dataset, a challenging task where the source domains have a 10:1 class imbalance ratio (class 0 to class 1) while the target domains have 1:10. We implement CAT and RevGrad based on the official code of MSTN using LeNet. The results are shown in Table 1. We repeat each task 3 times and report the averaged test accuracy and standard deviation.
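Table 1 (imbalanced SVHN-MNIST-USPS) columns: Method | SVHN to MNIST | MNIST to USPS | USPS to MNIST
Table 2 (Office-31) columns: Method | A to W | D to W | W to D | A to D | D to A | W to A | Avg
Table 3 (SVHN-MNIST-USPS) columns: Method | SVHN to MNIST | MNIST to USPS | USPS to MNIST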
It is notable that RevGrad and MSTN fail thoroughly owing to their exclusive focus on matching the marginal distributions. In contrast, CAT gives almost completely correct predictions for the target domains. This experiment verifies that existing methods that align marginal distributions are restrictive and require the modes corresponding to the same class in different domains to be geometrically similar. They are sensitive and fragile in practical tasks.
We apply CAT to the popular digits adaptation task SVHN-MNIST-USPS and compare it to the state-of-the-art approaches in Table 3 (all baseline results are taken from the related literature). CAT, RevGrad+CAT and rRevGrad+CAT follow the settings of MSTN using LeNet. We implement MCD+CAT and VADA+CAT based on the official code of MCD and VADA, using their architectures instead of LeNet for fair comparison. We only integrate CAT with the first-stage algorithm VADA in DIRT-T and discard the fine-tuning stage for simplicity.
There are several conclusions we can draw. First, CAT achieves strikingly improved test accuracy on the SVHN to MNIST task without tuning the hyper-parameters, and CAT even outperforms MCD and VADA, which use much wider and deeper neural networks, thanks to the class-conditional discriminative alignment between the source and target domains. This task is the most challenging one among the three because of the complex samples and the internal class imbalance of SVHN. Second, CAT alone does not perform well enough on the other two tasks, but when combined with rRevGrad and MCD, CAT outperforms the strong baselines MSTN and MCD by clear margins. Third, applying CAT to RevGrad, MCD and VADA enhances the base methods significantly, especially the typical and simple RevGrad. Finally, rRevGrad+CAT displays higher test accuracy and lower variance than RevGrad+CAT, and the advantage is particularly clear when the two domains have different class-conditional structures (e.g., SVHN to MNIST), so we use rRevGrad+CAT on the more challenging tasks.
We evaluate CAT using two sets of extensive experiments on the widely used Office-31 and ImageCLEF-DA. They contain more realistic and high-dimensional images, providing a good complement to the digits adaptation task. The results are provided in Table 2 and Table 4, respectively. We integrate CAT with rRevGrad (using ResNet-50 and AlexNet as the classifiers) and JAN (using ResNet-50 as the classifier), which is sufficient to demonstrate the effectiveness of discriminative cluster-based alignment and the teacher-student paradigm.
We observe that CAT can boost JAN and rRevGrad significantly, especially on the difficult A to W, A to D, D to A and W to A tasks in Office-31, and the combined models surpass the strong baselines RevGrad and JAN by clear margins (more than 1%). CAT based methods also outperform MSTN substantially on various tasks, which suggests that the class-conditional discriminative alignment is superior to the semantic alignment used by MSTN. The improvement in test accuracy on most tasks of ImageCLEF-DA shows that CAT can still work well when the domains are small, containing only 600 images. We further confirm that CAT can deliver a discriminative and aligned cluster-structured feature space by visualizing the learned features in Appendix A.
Table 4 (ImageCLEF-DA) columns: Method | I to P | P to I | I to C | C to I | C to P | P to C | Avg
Visualization of feature space. We visualize the features of the two domains learned by rRevGrad+CAT and RevGrad on the SVHN to MNIST task using t-SNE. The results are shown in Fig. 3. As expected, using CAT (Fig. 3(b)), the features are concentrated and form tight clusters, and those from different classes are separated. In contrast, the features learned by RevGrad (Fig. 3(a)) are more overlapping and less discriminative.
Clustering in the feature space.
We further examine the feature space shaped by CAT and other baselines by conducting K-means clustering on the aligned features. We use the trained models to infer the hidden features of all the images from the two domains. Then the features are clustered by K-means in scikit-learn, with the number of clusters set to the number of categories. We greedily set the label of a cluster to the most frequent label in it to calculate the clustering accuracy, as shown in Fig. 4. As expected, the feature spaces learned by rRevGrad+CAT demonstrate a more discriminative cluster structure, and this is consistent with the classification results of CAT.
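The greedy labelling step described above can be sketched as follows: each cluster takes its most frequent true label, and accuracy is the fraction of samples whose true label matches their cluster's label. The function name is illustrative, and the cluster assignments are taken as given; they could come, for example, from scikit-learn's `KMeans(n_clusters=K).fit_predict(features)`.

```python
import numpy as np


def clustering_accuracy(cluster_ids, labels):
    """Accuracy when every cluster is labelled with its majority class."""
    cluster_ids = np.asarray(cluster_ids)
    labels = np.asarray(labels)
    correct = 0
    for c in np.unique(cluster_ids):
        members = labels[cluster_ids == c]
        # Count how many members carry the cluster's most frequent label.
        _, counts = np.unique(members, return_counts=True)
        correct += counts.max()
    return correct / len(labels)
```

Note that this greedy mapping lets two clusters claim the same class, so it gives an optimistic estimate compared with a one-to-one assignment.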
Convergence. To inspect how CAT converges, we plot the test accuracy with respect to the number of iterations in Fig. 5. On the two adaptation tasks using AlexNet, CAT shows a convergence rate similar to that of RevGrad but better performance. Appendices B and C provide more analyses.
In this paper, we address the challenge of better aligning domains and advocate exploiting the discriminative class-conditional structures for effective adaptation in deep UDA. We propose Cluster Alignment with a Teacher (CAT) to achieve the objectives of discriminative learning and class-conditional alignment via a discriminative clustering loss and a cluster-based alignment loss. CAT produces a domain-invariant feature space with improved discriminative power and enhances performance significantly. CAT establishes new state-of-the-art baselines on benchmarks, and additional analyses support its effectiveness.
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, page 7, 2017.
Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
Generative moment matching networks. In International Conference on Machine Learning, pages 1718–1727, 2015.
First, we visualize the learned feature spaces of CAT, RevGrad and MSTN on the imbalanced SVHN to MNIST task using t-SNE, as shown in Fig. 6. It is evident that CAT forces samples from the same class to concentrate into tighter clusters than those of RevGrad and MSTN, and the clusters present a strip pattern in the 2-D space. CAT also aligns the class-conditional distributions of the source and the target domains correctly. In contrast, RevGrad and MSTN tend to align the ‘0’ images in SVHN with the ‘1’ images in MNIST, so their learned feature spaces are confused and not discriminative. This visualization supports the results in Table 1.
Second, we plot the feature spaces learned by CAT+rRevGrad and RevGrad on the MNIST to USPS and USPS to MNIST tasks in Fig. 7 using t-SNE. CAT+rRevGrad delivers more discriminative feature spaces with separable and tight class-conditional clusters. Therefore, it is sufficient to use the matching loss based on first-order statistics to match the class-conditional distributions of the two domains. The aligned clusters of the source and the target domains also verify the effectiveness of this loss.
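To make the idea of matching first-order statistics concrete, the sketch below computes class-wise feature means in each domain and penalizes the squared distance between corresponding means. It is an illustrative approximation under stated assumptions, not the paper's implementation: the function names are hypothetical, target labels are assumed to be pseudo-labels (for example from a teacher model), classes missing from a batch are simply skipped, and the average over the number of classes is one possible normalization.

```python
def class_centroids(features, labels, num_classes):
    """Mean feature vector per class; None for classes with no samples."""
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for x, y in zip(features, labels):
        counts[y] += 1
        for d in range(dim):
            sums[y][d] += x[d]
    return [
        [s / c for s in row] if c else None
        for row, c in zip(sums, counts)
    ]


def centroid_alignment_loss(src_feats, src_labels, tgt_feats, tgt_pseudo,
                            num_classes):
    """Average squared distance between source and target class centroids."""
    src_c = class_centroids(src_feats, src_labels, num_classes)
    tgt_c = class_centroids(tgt_feats, tgt_pseudo, num_classes)
    total = 0.0
    for a, b in zip(src_c, tgt_c):
        if a is None or b is None:
            continue  # class absent from one domain in this batch
        total += sum((u - v) ** 2 for u, v in zip(a, b))
    return total / num_classes


# Toy 2-D features (hypothetical values) for two classes.
loss = centroid_alignment_loss(
    src_feats=[[0.0, 0.0], [2.0, 0.0], [0.0, 4.0]], src_labels=[0, 0, 1],
    tgt_feats=[[1.0, 1.0], [0.0, 2.0]], tgt_pseudo=[0, 1],
    num_classes=2,
)
```

When the clusters are tight, each class is well summarized by its mean, which is why comparing means alone can be enough to align the class-conditional distributions.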
Furthermore, we examine the feature space learned by CAT on more challenging tasks in the Office-31 and ImageCLEF-DA datasets; the results are shown in Fig. 8. These features are output by the AlexNet model trained with the rRevGrad+CAT method. The class-conditional distributions are shaped into tight and separable clusters, and the corresponding clusters from the source domain and the target domain are aligned. CAT therefore achieves the objectives of discriminative learning and class-conditional alignment, which explains its strong performance in the extensive experiments on the Office-31 and ImageCLEF-DA datasets.
When aligning the source domain and the target domain via the combination of RevGrad and CAT, the adversarial loss, which is maximized w.r.t. the critic, can be viewed as a lower bound of an affine function of the Jensen-Shannon divergence between the feature distributions of the two domains. Therefore, we plot the estimate derived from this loss to quantify the divergence between the two domains. The results are shown in Fig. 9, where AlexNet is used as the classifier. CAT boosts RevGrad significantly, leading to faster and better convergence. This set of experiments verifies that, when CAT is combined with marginal distribution alignment approaches, it provides a discriminative class-conditional alignment and biases the existing approaches toward better alignment of the cluster-structured marginal distributions.
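For reference, the relation between an adversarial critic loss and the Jensen-Shannon divergence can be written out using the standard result for the binary cross-entropy GAN objective. The notation below ($D$ for the critic, $P_s$ and $P_t$ for the source and target feature distributions, $\mathcal{L}_d$ for the critic objective) is assumed for illustration and may differ from the paper's own symbols; whether the plotted quantity matches the last line exactly is likewise an assumption.

```latex
% Critic objective (assumed notation), maximized w.r.t. D:
\mathcal{L}_d(D) = \mathbb{E}_{x \sim P_s}\big[\log D(x)\big]
                 + \mathbb{E}_{x \sim P_t}\big[\log\big(1 - D(x)\big)\big]

% Standard result for the optimal critic:
\max_{D} \mathcal{L}_d(D) = 2\,\mathrm{JSD}(P_s \,\|\, P_t) - 2\log 2

% Hence any trained critic gives a lower bound on the divergence:
\mathrm{JSD}(P_s \,\|\, P_t) \;\ge\; \tfrac{1}{2}\,\mathcal{L}_d(D) + \log 2
```

A critic that cannot distinguish the domains outputs $D(x) = 1/2$ everywhere, which gives $\mathcal{L}_d = -2\log 2$ and an estimated divergence of zero.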
We use the confidence-thresholding technique in the RevGrad method and claim that, as training proceeds, more and more samples have confidence above the threshold and are selected for domain adversarial training. Here we verify this empirically through three experiments on tasks from SVHN-MNIST-USPS, Office-31 and ImageCLEF-DA, respectively. We plot the selection rate of this technique in the rRevGrad+CAT method w.r.t. the number of iterations in Fig. 10. After several thousand iterations, nearly all samples are selected in the Amazon to DSLR and p to i tasks. In the SVHN to MNIST task, we apply a ramp-up function after 5000 iterations, as suggested by related SSL works. Consequently, after around 15000 iterations, the discriminative clustering structure forms and the samples are pushed far away from the decision boundaries, so almost all samples have confidence above the threshold and are selected for domain adversarial training.
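The selection rate plotted in such an experiment can be computed per batch as the fraction of samples whose maximum predicted class probability exceeds the threshold. The sketch below illustrates this; the function names, the toy logits, and the threshold value of 0.9 are assumptions for illustration only, since the value used in the paper is not stated in this section.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one sample's logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def selection_rate(batch_logits, threshold):
    """Fraction of samples whose top class probability exceeds threshold."""
    selected = sum(
        1 for logits in batch_logits if max(softmax(logits)) > threshold
    )
    return selected / len(batch_logits)


# Toy batch (hypothetical logits): two confident and two uncertain samples.
batch = [
    [4.0, 0.0, 0.0],
    [0.1, 0.0, 0.2],
    [0.0, 5.0, 0.0],
    [1.0, 1.0, 1.0],
]
rate = selection_rate(batch, threshold=0.9)  # threshold is an assumed value
```

As clusters tighten and samples move away from the decision boundaries, the top probabilities grow and this rate rises toward one.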
On digits adaptation tasks, we use the simple LeNet with Batch Normalization after the convolutional layers and use the probability logits as features for adaptation, following [49, 45]. When CAT is combined with RevGrad and rRevGrad, the critic model operates on these features.
On more challenging tasks, we conduct experiments based on AlexNet and ResNet-50 equipped with 256-D bottleneck layers (placement following [25, 49]). We use the features output by the bottleneck layers as image representations for adaptation and use a three-layer critic. We fine-tune all the layers before the bottleneck layers in AlexNet and ResNet-50 and train the bottleneck layers and the classification layers via backpropagation.
We use stochastic gradient descent with momentum 0.9 and an annealed learning rate, where the training progress p changes from 0 to 1 [7, 49], when using LeNet and AlexNet as the classifiers. The learning rate for fine-tuned layers is set to ten percent of that for layers trained from scratch. We use batches of 128 elements in experiments with LeNet, 200 elements with AlexNet and 36 elements with ResNet-50.
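An annealed schedule of this kind can be sketched as below. The functional form and the constants (`lr0=0.01`, `alpha=10`, `beta=0.75`) are assumptions taken from the schedule commonly used with RevGrad-style training, not values confirmed by this section; the function names and the two parameter-group labels are hypothetical.

```python
def annealed_lr(p, lr0=0.01, alpha=10.0, beta=0.75):
    """Learning rate at training progress p in [0, 1].

    Assumed form: lr0 / (1 + alpha * p) ** beta, which decays smoothly
    from lr0 at the start of training.
    """
    return lr0 / (1.0 + alpha * p) ** beta


def param_group_lrs(step, total_steps, finetune_scale=0.1):
    """Learning rates for layers trained from scratch and fine-tuned layers.

    Fine-tuned layers use ten percent of the rate of the new layers,
    matching the ratio described in the text.
    """
    p = step / total_steps
    base = annealed_lr(p)
    return {"scratch": base, "finetuned": finetune_scale * base}


lrs_start = param_group_lrs(step=0, total_steps=1000)
lrs_end = param_group_lrs(step=1000, total_steps=1000)
```

In a framework with per-group learning rates, the two values would be assigned to the bottleneck and classification layers and to the pretrained backbone, respectively, at every iteration.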
We use the same architectures and optimization settings (batch size, learning rate, optimizer and weight decay) as those of the original methods [42, 38] when combining CAT with them. We will release the code after review.