Cluster Alignment with a Teacher for Unsupervised Domain Adaptation

03/24/2019 ∙ by Zhijie Deng, et al. ∙ Tsinghua University

Deep learning methods have shown promise in unsupervised domain adaptation, which aims to leverage a labeled source domain to learn a classifier for the unlabeled target domain with a different distribution. However, such methods typically learn a domain-invariant representation space to match the marginal distributions of the source and target domains, while ignoring their fine-level structures. In this paper, we propose Cluster Alignment with a Teacher (CAT) for unsupervised domain adaptation, which can effectively incorporate the discriminative clustering structures in both domains for better adaptation. Technically, CAT leverages an implicit ensembling teacher model to reliably discover the class-conditional structure in the feature space for the unlabeled target domain. Then CAT forces the features of both the source and the target domains to form discriminative class-conditional clusters and aligns the corresponding clusters across domains. Empirical results demonstrate that CAT achieves state-of-the-art results in several unsupervised domain adaptation scenarios.

1 Introduction

Deep learning has achieved remarkable performance in a wide variety of computer vision tasks, such as image recognition [16] and object detection [34]. However, classifiers trained on specific datasets cannot always generalize effectively to new datasets owing to the well-known domain shift problem [5, 44]. Enabling models to generalize from a source domain to a target domain is usually referred to as domain adaptation (DA) [1]. In many cases, it is expensive or difficult to collect annotations on the target domain. Learning algorithms that tackle the transfer problem from a fully labeled source domain to an unlabeled target domain are called unsupervised domain adaptation (UDA) [11]. UDA is particularly challenging because the target domain cannot provide explicit information to facilitate the adaptation of classifiers.

Figure 1: (Best viewed in color.) Left: The two domains have diverse modes. Right: The two domains have different class imbalance ratios. Existing methods that align the marginal distributions while ignoring the class-conditional structures cannot perform well in these cases. In contrast, CAT incorporates the discriminative clustering structures in both domains for better adaptation, thus delivering a more reasonable domain-invariant cluster-structured feature space with enhanced discriminative power. See Fig. 3 and Appendix A for the learned feature space of real data.

Recently, deep models have shown promise in unsupervised domain adaptation by learning expressive features [46, 7, 8, 23, 21, 45, 24, 38, 40, 37]. These deep UDA methods mainly focus on matching the source and target domains via adversarial training [7, 45, 2, 21, 24, 49, 38, 14] or kernelized training [23, 24, 25]. The main hypothesis behind them is that the marginal distributions of the two domains can be aligned in some feature space learned by optimizing a deep network, so that the classifier trained with source data tends to perform well on the target domain. Theoretical analyses [1] also show that minimizing the divergence between the marginal distributions in the learned feature space helps to reduce the classifier's error.

However, these methods are not without problems. In classification, as the classes correspond to different semantics and characteristics, the marginal distribution of the data naturally has a class-conditional multi-modal structure. Moreover, the modes corresponding to the same class in different domains are not always geometrically similar. Thus, it is not sufficient for existing deep UDA methods to only minimize the discrepancy between the marginal distributions while neglecting their structures, and such methods tend to fail in challenging cases, such as those in Fig. 1. Properly incorporating this fine-grained class-conditional structure has been shown to be beneficial in various tasks. For example, Shi and Sha [41] make the discriminative clustering assumption, which helps to adapt the decision boundaries for the source domain to the target domain discriminatively.¹ However, one limitation of [41] is that it adopts a simple linear transformation to learn the feature space, which cannot extract high-order features from raw data (e.g., images) as effectively as the deep UDA methods. Another limitation is that [41] builds a nearest neighbor based prediction model, which outputs the prediction of one sample based on all the source data; the training is therefore incompatible with the stochastic training of deep networks and has a high complexity.

¹Besides UDA tasks, previous work [31] has also shown an interesting exploration of the class-conditional structures for learning deep models that are robust against adversarial attacks.

In this paper, we present Cluster Alignment with a Teacher (CAT), a new deep UDA model that incorporates the class-conditional structures for more effective adaptation. CAT conjoins the complementary advantages of deep learning methods and discriminative clustering methods for UDA. Technically, there are three learning objectives in CAT. First, CAT minimizes the supervised classification loss on the labeled source data and builds a teacher classifier, an implicit ensemble of the source classifier, to provide pseudo labels for unlabeled target data. The underlying notion is that the classifier trained on the source domain can perform well on a majority of target samples because of the similarity between the two domains, and the teacher-student paradigm is not sensitive to the false pseudo labels [17]. To exploit the fine-grained class-conditional structures in the feature space and address the aforementioned issues of existing deep UDA methods, CAT also includes two objectives which depend on the pseudo labels provided by the teacher classifier. On one hand, for discriminative learning in both domains, CAT deploys a class-conditional clustering loss to force the features from the same class to concentrate together and the features from different classes to be separated. On the other hand, for the class-conditional alignment between the two domains, CAT aligns the clusters which correspond to the same class but come from different domains via a conditional feature matching loss. The prediction models used in CAT are a student deep network and its implicit ensemble (i.e., the teacher classifier), thus CAT can address the training issues of [41] and also enjoys more flexible feature learning. Furthermore, CAT is compatible with the marginal distribution alignment methods on tasks where the source data is distributed similarly to the target data: the former provides a fine-grained class-conditional alignment of domains and the latter provides a global alignment of them.

We evaluate the proposed CAT through extensive experiments on both synthetic and real-world datasets. Empirical results show that CAT presents striking performance across various tasks. Moreover, CAT can be combined with the existing deep UDA methods which globally match the marginal distributions; we find that CAT successfully biases them to achieve a discriminative alignment between domains, establishing new state-of-the-art baselines on popular benchmarks. In the combined methods, we also propose a confidence-thresholding technique to filter out low-confidence target samples (which are likely to be mapped into incorrect clusters by the marginal distribution alignment methods) to enhance the stability of training.

To summarize, the contributions of the paper are three-fold:

  • We consider and exploit the discriminative class-conditional structures of distributions in deep UDA and propose CAT to achieve better alignment between the source domain and the target domain.

  • The proposed CAT is compatible with and applicable to the existing UDA methods which rely on marginal distribution alignment.

  • We empirically show that CAT is not sensitive to hyper-parameters and can boost the marginal distribution alignment approaches significantly, achieving new state-of-the-art across various settings.

2 Related work

Unsupervised domain adaptation has drawn increasing interest and has been developed mainly in two directions: Maximum Mean Discrepancy (MMD) based approaches and adversarial training based approaches. Tzeng et al. [46] and Long et al. [23] minimize MMD to match the two domains, while [24] proposes to align their joint distributions using the Joint MMD criterion. Since the development of Generative Adversarial Networks (GANs) [10], adversarial training has been applied to domain adaptation and fruitful works have emerged. Ganin et al. [8, 7] develop the framework of domain adversarial training, and plenty of works improve it by aligning the source and target domains better in the feature space [45, 49, 15] or the image space [21]. Zhang et al. [50] successfully extend RevGrad [8] to consider each domain's characteristics using collaborative games. Image-to-image translation approaches [9, 2, 14, 35, 29, 40] also play an important role in the advancement of domain adaptation and demonstrate impressive performance, especially on semantic segmentation tasks. In addition, Saito et al. [38] propose to align the two domains using the decision boundaries of task-specific classifiers. Associative DA [12] proposes an associative loss to reduce the discrepancy between domains, and SimNet [32] uses a similarity-based classifier in UDA. Though the existing methods match the two domains in different ways, most of them ignore the discriminative information in the alignment procedure, which may lead to failed adaptation because of improper alignment between the source and target domains. In contrast, CAT explicitly discovers the class-conditional structures using a teacher model and constructs a more reasonable matching procedure based on them.

Using a teacher model is inspired by consistency-based methods in semi-supervised learning (SSL) [17, 43]. Recent attempts to apply SSL techniques in UDA include [6, 42, 47]. CAT differs from these previous works in that CAT exploits the discriminative class-conditional structures in both the alignment and classification procedures, while they focus on improving the classifier for the target domain by implementing the cluster assumption [4]. CAT thus imposes a much stronger regularization and assists in a better alignment.

3 Methodology

In this section, we first introduce the setting and framework of deep UDA and then present Cluster Alignment with a Teacher (CAT). Finally, we provide discussions on CAT.

3.1 Deep unsupervised domain adaptation

In a UDA task, we are given a set of labeled source samples $\mathcal{X}_s=\{x_i^s\}_{i=1}^{n_s}$ with labels $\mathcal{Y}_s=\{y_i^s\}_{i=1}^{n_s}$, and a set of unlabeled target samples $\mathcal{X}_t=\{x_j^t\}_{j=1}^{n_t}$. Notably, the two sets of samples are drawn from different distributions $p$ and $q$, which leads to the domain shift challenge. Therefore, UDA algorithms should learn to adapt the classifier trained on the source domain to the unlabeled target domain. Deep learning techniques have been introduced into UDA [8, 45, 2, 21, 24, 38, 23, 25] and they demonstrate remarkable performance across tasks. Generally, in these methods, the classifier $h$ (parameterized by $\theta$) is constructed as $h = g \circ f$, where $f$ maps samples into features in the space $\mathcal{Z}$ and $g$ outputs the predictions based on the extracted features. The learning includes simultaneously optimizing the classifier w.r.t. the labeled source data and minimizing the distance between the marginal distributions of the two domains in the feature space $\mathcal{Z}$, resulting in a domain-invariant feature space. Technically, in the source domain, we minimize the supervised loss as:

$$\mathcal{L}_y(\mathcal{X}_s, \mathcal{Y}_s) = \frac{1}{n_s}\sum_{i=1}^{n_s} \ell\big(h(x_i^s), y_i^s\big), \quad (1)$$

where $\ell$ is a pre-defined loss, e.g., the cross-entropy loss. Meanwhile, we minimize the discrepancy loss as:

$$\mathcal{L}_d(\mathcal{X}_s, \mathcal{X}_t) = d\big(f(\mathcal{X}_s), f(\mathcal{X}_t)\big), \quad (2)$$

where $d(\cdot, \cdot)$ is a distance measure, usually correlated with the $\mathcal{H}$-divergence in the error bound theory of DA [1]. The theory reveals that the expected error $\epsilon_t(h)$ on target samples of any classifier $h$ drawn from a hypothesis set $\mathcal{H}$ has the following bound [1, 49]:

$$\epsilon_t(h) \le \epsilon_s(h) + d(p, q) + \mathbb{E}_{x\sim q}\big[|f_s(x) - f_t(x)|\big], \quad (3)$$

where $\epsilon_s(h)$ denotes the expected error on source samples of $h$, and $f_s$ and $f_t$ represent the labelling functions [1] for the source and target domains, respectively. $\mathbb{E}_{x\sim q}[|f_s(x) - f_t(x)|]$ denotes the disagreement between the labelling functions in the target domain. Notably, $\epsilon_s(h)$ can be made small enough by optimizing $h$ w.r.t. the labeled source data. The supervised loss and the discrepancy loss focus on minimizing $\epsilon_s(h)$ and $d(p, q)$, respectively, to obtain a small target domain classification error.

However, the methods working in the above framework are not without problems. Theoretically, they ignore minimizing the disagreement term $\mathbb{E}_{x\sim q}[|f_s(x) - f_t(x)|]$, which may lead to a large upper bound of $\epsilon_t(h)$ and result in unsatisfactory target domain performance [49]. Empirically, the data in classification naturally has a class-conditional multi-modal structure, thus aligning marginal distributions while ignoring the fine-level discriminative structures of the domains may hurt the target classification performance (Fig. 1, left). Moreover, these methods may fail in more practical and challenging problems where the source and target domains have obviously different class imbalance ratios (Fig. 1, right).

3.2 Cluster Alignment with a Teacher

To overcome these issues, we expect to exploit the fine-level structures in the feature space for discriminative learning and to match the class-conditional distributions of the source and target domains, reducing the mismatch between the labelling functions $f_s$ and $f_t$. Therefore, in the deep UDA scenario, we propose Cluster Alignment with a Teacher (CAT), a new deep UDA model for more effective adaptation. Specifically, for the objectives of discriminative learning and class-conditional alignment between domains, we propose a discriminative clustering loss $\mathcal{L}_c$ to force the features of both the source and the target domains to form discriminative clusters, and a cluster-based alignment loss $\mathcal{L}_a$ to align the clusters corresponding to the same class in different domains. Given them, we propose to train CAT by solving the following optimization problem:

$$\min_{\theta}\ \mathcal{L} = \mathcal{L}_y(\mathcal{X}_s, \mathcal{Y}_s) + \alpha\big[\mathcal{L}_c(\mathcal{X}_s, \mathcal{Y}_s, \mathcal{X}_t, \tilde{\mathcal{Y}}_t) + \mathcal{L}_a(\mathcal{X}_s, \mathcal{Y}_s, \mathcal{X}_t, \tilde{\mathcal{Y}}_t)\big], \quad (4)$$

where the hyper-parameter $\alpha$ sets a relative trade-off and $\tilde{\mathcal{Y}}_t$ denotes the pseudo labels of the target data. The whole framework is shown in Fig. 2. We build a teacher classifier $\tilde{h}$, an implicit ensemble of the classifier $h$ to be optimized, to provide pseudo labels for the unlabeled target data. These pseudo labels are used in $\mathcal{L}_c$ and $\mathcal{L}_a$. We use stochastically sampled mini-batches in the two objectives, and the two classifiers make predictions in a forward-propagation way, thus CAT can be trained more efficiently than [41]. Furthermore, $\mathcal{L}_c$ and $\mathcal{L}_a$ optimize the feature space directly and are therefore more effective than the nearest neighbor based clustering loss in [41]. We elaborate $\mathcal{L}_c$ and $\mathcal{L}_a$ in the following sections.
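To make the overall objective of Eq. 4 concrete, the following is a minimal, PyTorch-style sketch of one CAT training step. It is our illustration, not the released implementation: the names `cat_training_step`, `clustering_loss` and `alignment_loss` are placeholders (the last two are sketched in Secs. 3.2.1 and 3.2.2 below), and the exact bookkeeping may differ from the authors' code.

```python
import torch
import torch.nn.functional as F

def cat_training_step(f, g, teacher, opt, xs, ys, xt, alpha, num_classes):
    """One CAT step: supervised loss + alpha * (clustering + alignment), cf. Eq. 4.

    f: feature extractor, g: classification head, teacher: implicit-ensemble
    classifier used only to annotate target samples (no gradient flows through it).
    """
    zs, zt = f(xs), f(xt)                          # features of both domains
    with torch.no_grad():
        yt_pseudo = teacher(xt).argmax(dim=1)      # teacher-annotated pseudo labels

    loss_y = F.cross_entropy(g(zs), ys)            # Eq. 1 on labeled source data
    loss_c = clustering_loss(zs, ys) + clustering_loss(zt, yt_pseudo)   # Eq. 5-6
    loss_a = alignment_loss(zs, ys, zt, yt_pseudo, num_classes)         # Eq. 8-9

    loss = loss_y + alpha * (loss_c + loss_a)      # Eq. 4
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```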

3.2.1 Discriminative clustering with a teacher

Figure 2: The framework of CAT (The source supervised loss is omitted for clarity).

For better classification and alignment, we propose to discover the class-conditional structures in the feature space in both the source and the target domains, and then shape them into discriminative clusters. In the source domain, the class-conditional structure is obvious because the data is fully labeled. Nevertheless, in the target domain, we cannot obtain the class-conditional structure easily due to the lack of labels. The semantic similarity between the two domains implies that the classifier trained on the source domain can predict most target samples correctly. Consequently, using pseudo labels [19] as the annotations for target data and conducting self-training is a direct approach, but it suffers from the error amplification issue, which can be detrimental to the learning procedure. To discover the class-conditional structure of the target features in a reliable way, we introduce a teacher classifier $\tilde{h}$, defined as an implicit ensemble of the student classifier $h$ [17], to provide pseudo labels for target data.
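As one illustration of how such a teacher could be realized, the sketch below maintains a temporal ensemble of the student's predictions on target samples in the spirit of [17]; the class name `EnsembleTeacher` and the decay value are ours and only indicative of the paper's temporal-ensemble variant.

```python
import torch

class EnsembleTeacher:
    """Temporal ensemble of the student's target predictions, in the spirit of [17]."""

    def __init__(self, num_target_samples, num_classes, decay=0.6):
        self.decay = decay
        self.ensemble = torch.zeros(num_target_samples, num_classes)

    def update(self, indices, probs):
        """Accumulate the current student probabilities for the given target samples."""
        self.ensemble[indices] = (self.decay * self.ensemble[indices]
                                  + (1.0 - self.decay) * probs.detach().cpu())

    def pseudo_labels(self, indices):
        """Pseudo labels from the accumulated (ensembled) predictions.

        The startup-bias correction of [17] rescales all classes equally and
        therefore does not change the argmax, so it is omitted here.
        """
        return self.ensemble[indices].argmax(dim=1)
```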

Based on the pseudo labels given by the teacher, we can explicitly force the target class-conditional structure to be more discriminative using a clustering loss. For the source domain, a similar one can be applied. Formally, we employ the following discriminative clustering loss to learn a more discriminative feature space for the two domains (we omit the dependence of the losses on $\theta$ for clarity, unless stated otherwise):

$$\mathcal{L}_c(\mathcal{X}_s, \mathcal{Y}_s, \mathcal{X}_t, \tilde{\mathcal{Y}}_t) = \mathcal{L}_c(\mathcal{X}_s, \mathcal{Y}_s) + \mathcal{L}_c(\mathcal{X}_t, \tilde{\mathcal{Y}}_t), \quad (5)$$
$$\mathcal{L}_c(\mathcal{X}, \mathcal{Y}) = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\Big[\delta_{ij}\, d\big(f(x_i), f(x_j)\big) + (1-\delta_{ij})\max\big(0,\ m - d(f(x_i), f(x_j))\big)\Big], \quad (6)$$

where $d(\cdot, \cdot)$ is the distance (e.g., the squared Euclidean distance) between two features, $m$ is a pre-defined margin, and $\delta_{ij}$ is an indicator function which outputs 1 only if $x_i$ and $x_j$ have the same ground-truth label (source domain) or teacher-annotated label (target domain). Obviously, $\mathcal{L}_c$ encourages the features from the same class to concentrate together and pushes the features from different classes at least a distance $m$ away from each other. This loss modifies the structures in the representation space gradually, and consequently the space demonstrates a class-conditional cluster structure (as shown in Sec. 4.4). Note that minimizing $\mathcal{L}_c$ is consistent with the cluster assumption [26] of the classifier and benefits the classification performance.
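A possible mini-batch implementation of Eq. 6 is sketched below: same-class pairs are pulled together and different-class pairs are pushed beyond the margin. `clustering_loss` is the placeholder name used in the earlier training-step sketch, and the margin value is illustrative.

```python
import torch

def clustering_loss(features, labels, margin=3.0):
    """Discriminative clustering loss of Eq. 6 on one mini-batch (sketch).

    Same-class pairs are attracted; different-class pairs are repelled until
    their squared Euclidean distance reaches the margin (value illustrative).
    """
    n = features.size(0)
    dist = torch.cdist(features, features, p=2) ** 2            # pairwise squared distances
    same = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()  # delta_ij
    loss = same * dist + (1.0 - same) * torch.clamp(margin - dist, min=0.0)
    return loss.sum() / (n * n)
```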

It’s a common doubt whether the incorrect predictions of the teacher classifier would destroy the training dynamics. However, previous works on semi-supervised learning [17, 43] have validated that this kind of training always leads to good convergence and demonstrates robustness against incorrect labels. Intuitively, the teacher instructs the training of one instance through a bundle of others’ predictions which alleviates the negative influence of incorrect predictions notably and aids the student to give better predictions.

3.2.2 Cluster alignment via conditional feature matching

Once the feature space presents a discriminative cluster structure, the classifier is expected to make more accurate predictions. However, the label predictor trained on source domain features may still fail due to the geometrical mismatch between the clusters which correspond to the same class in different domains. This kind of mismatch is caused by the individual characteristics of each domain. As a result, it is necessary to impose a class-conditional alignment of the two domains to learn better domain-invariant features and adjust the target feature space to be more suitable for classification. Naturally, we expect to minimize the divergence between the corresponding clusters in the source domain and the teacher-annotated target domain:

$$\min_{\theta}\ \sum_{k=1}^{K} d\big(\mathcal{Z}_s^k, \mathcal{Z}_t^k\big), \quad (7)$$

where $\mathcal{Z}_s^k$ ($\mathcal{Z}_t^k$) denotes the set consisting of all the features belonging to class $k$ of the source (target) domain. Extensive previous works [10, 20] have been proposed to minimize the distance between two sets of samples, but we expect to achieve this in a simpler and more efficient way by exploiting the separable and tight clusters in the feature space. Drawing inspiration from feature matching GANs [39], which optimize the distance between the first-order statistics of distributions and demonstrate striking results on SSL tasks, we extend this idea to work in a conditional way. Formally, we introduce the following cluster alignment loss:

$$\mathcal{L}_a(\mathcal{X}_s, \mathcal{Y}_s, \mathcal{X}_t, \tilde{\mathcal{Y}}_t) = \frac{1}{K}\sum_{k=1}^{K} \big\|\lambda_s^k - \lambda_t^k\big\|_2^2, \quad (8)$$

where $\lambda_s^k$ and $\lambda_t^k$ are calculated by

$$\lambda_s^k = \frac{1}{|\mathcal{X}_s^k|}\sum_{x\in \mathcal{X}_s^k} f(x), \qquad \lambda_t^k = \frac{1}{|\mathcal{X}_t^k|}\sum_{x\in \mathcal{X}_t^k} f(x), \quad (9)$$

where $\mathcal{X}_s^k$ is the subset of $\mathcal{X}_s$ containing all the source samples whose ground-truth labels are $k$, and $\mathcal{X}_t^k$ is the subset of $\mathcal{X}_t$ including all the target samples annotated as class $k$ by the teacher classifier $\tilde{h}$. This loss is slightly different from the original feature matching loss: it matches the statistics of the representation space, which totally determine the predictions, instead of those produced by an extra critic network. Arguably, the objective has a local optimum where the class-conditional distributions are matched thoroughly. The cluster alignment loss $\mathcal{L}_a$ and the discriminative clustering loss $\mathcal{L}_c$ work together to align the class-conditional structures of the two domains in a discriminative way.
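The conditional feature-matching loss of Eqs. 8-9 can be sketched as follows: for every class present in both the source mini-batch and the teacher-annotated target mini-batch, the class-wise feature means are matched; classes absent from either batch are simply skipped, as discussed in Sec. 3.3. `alignment_loss` is the placeholder name used in the training-step sketch, not the authors' own function.

```python
import torch

def alignment_loss(src_feats, src_labels, tgt_feats, tgt_pseudo, num_classes):
    """Cluster alignment loss of Eq. 8: match class-wise first-order statistics (sketch)."""
    terms = []
    for k in range(num_classes):
        src_k = src_feats[src_labels == k]
        tgt_k = tgt_feats[tgt_pseudo == k]
        if len(src_k) == 0 or len(tgt_k) == 0:
            continue                       # class absent in this mini-batch: drop its term
        lam_s = src_k.mean(dim=0)          # Eq. 9, source class mean
        lam_t = tgt_k.mean(dim=0)          # Eq. 9, target class mean
        terms.append(((lam_s - lam_t) ** 2).sum())
    if not terms:
        return src_feats.new_zeros(())
    return torch.stack(terms).mean()       # average over the classes that are present
```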

3.2.3 Improved marginal distribution alignment

In fact, the source and target domains in the existing popular UDA tasks (e.g., digits adaptation and Office-31) have analogous marginal distributions. Therefore, in these experiments, we combine CAT with the marginal distribution alignment methods, and CAT biases them to match the cluster-based marginal distributions. Ignoring discriminability, these methods alone may hurt the stability of training and the capability of the converged models; for example, several target circle samples in Fig. 1 (left) would be misclassified by them.

We are dedicated to delivering a technique to improve these models, given the observation that in the early stages of training, a portion of target samples lie around the decision boundaries of the adapted classifier, i.e., they have low classification confidence (the largest output probability) and are likely to be misclassified. These samples are therefore prone to being mapped into incorrect clusters in the marginal alignment process, and the training falls into poor local optima. To solve this, we propose a confidence-thresholding method to hold out uncertain data points with confidence less than a threshold $p$, while aligning the confident instances, which are geometrically closer to the source domain, with the source data. Formally, we instantiate this technique in the typical and simple RevGrad [8] and propose robust RevGrad (rRevGrad), which optimizes the following loss:

$$\mathcal{L}_d(\mathcal{X}_s, \mathcal{X}_t) = \frac{1}{n_s}\sum_{i=1}^{n_s}\log c\big(f(x_i^s)\big) + \frac{1}{n_t}\sum_{j=1}^{n_t} \mathbb{1}\big[\tilde{h}(x_j^t)\big]\log\Big(1 - c\big(f(x_j^t)\big)\Big), \quad (10)$$

where $c$ is the critic model parameterized by $\phi$ and $\mathbb{1}[\tilde{h}(x_j^t)]$ is an indicator function which outputs 1 only if the teacher's confidence on $x_j^t$ is greater than $p$. As the divergence between the two domains decreases, more and more target samples are selected into the domain adversarial training. Gradually, almost all the target samples are included in the training, which avoids the loss of target information. We empirically observe that rRevGrad improves the classification performance on the target domain and enhances the stability of training (see Sec. 4).
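The sketch below illustrates the confidence-thresholded adversarial loss in the spirit of Eq. 10, written as the binary cross-entropy that the critic minimizes (equivalent up to sign to the objective it maximizes). It assumes a binary critic `c` outputting the probability that a feature comes from the source domain and uses the teacher's maximum class probability as the confidence; the threshold value is illustrative, and the gradient-reversal mechanics of RevGrad are omitted for brevity.

```python
import torch

def rrevgrad_loss(critic, src_feats, tgt_feats, teacher_probs, threshold=0.9):
    """Confidence-thresholded domain adversarial loss (sketch of Eq. 10).

    Only target samples whose teacher confidence exceeds `threshold` take part
    in the domain adversarial training.
    """
    eps = 1e-6
    d_src = critic(src_feats).view(-1).clamp(eps, 1 - eps)   # P(source) for source features
    d_tgt = critic(tgt_feats).view(-1).clamp(eps, 1 - eps)   # P(source) for target features

    keep = (teacher_probs.max(dim=1).values > threshold).float()   # indicator in Eq. 10
    loss_src = -torch.log(d_src).mean()
    loss_tgt = -(keep * torch.log(1.0 - d_tgt)).sum() / keep.sum().clamp(min=1.0)
    return loss_src + loss_tgt
```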

3.3 Discussion

Comparison with SSL based deep UDA methods [42, 37, 6]. CAT not only implements the cluster assumption for better classification but also imposes a class-conditional alignment between domains, which is more essential in UDA. In contrast, [42, 37, 6] focus on improving the classifier to make it more consistent and robust for the target domain based on the cluster assumption. Thus CAT is compatible with these methods (see Sec. 4).

Comparison with MSTN [49]. The cluster alignment loss using the conditional feature matching technique is similar in spirit to the semantic loss in MSTN [49]. However, in MSTN, minimizing the distance between centers is necessary but not sufficient to achieve semantic alignment. In CAT, we regularize the features to form separable and tight clusters, so the feature matching based loss can match the clusters naturally. They also differ in implementation.

Mini-batch stochastic training of CAT. We implement the two objectives $\mathcal{L}_c$ and $\mathcal{L}_a$ in CAT using stochastically sampled mini-batches. Specifically, $\mathcal{L}_c$ is an instance-wise loss and works well on mini-batches. The class-conditional mean $\lambda_s^k$ or $\lambda_t^k$ in $\mathcal{L}_a$ can be undefined when there are no points belonging to class $k$ in the mini-batch. In this case, we remove the term corresponding to class $k$ in Eq. 8 and calculate the mean of the other terms. We empirically find that CAT needs only marginally more training time when combined with existing methods.

Teacher-student paradigm. First, using the teacher as the labelling function on the target domain avoids the error amplification issue. Furthermore, once the classifier becomes more accurate on the target domain, the teacher classifier performs better as well. Then the pseudo labels used in $\mathcal{L}_c$ and $\mathcal{L}_a$ are more likely to be correct, which in turn enhances the classifier. Consequently, a boosting cycle between them is formed.

4 Experiments

To demonstrate the effectiveness of CAT, we evaluate it through various experiments on a synthetic imbalanced dataset and three challenging UDA tasks: SVHN-MNIST-USPS, Office-31 [36] and ImageCLEF-DA. We show that CAT considers and exploits the fine-level class-conditional structures of the source and target domains, making the learned feature space discriminative and aligned, and thus yielding improved performance on the target domain.

Datasets and configurations. SVHN-MNIST-USPS is a challenging digits adaptation task among three datasets: SVHN [30], MNIST [18] and USPS. We conduct experiments in three directions: SVHN→MNIST, MNIST→USPS and USPS→MNIST. We follow the protocol in [45]: we use the whole training sets for the adaptation from SVHN to MNIST, and randomly sample 2000 images from MNIST and 1800 images from USPS for the adaptation between those two datasets. Following MSTN [49], the images are resized to a common resolution when using LeNet [18] as the classifier. When combining with MCD [38] and VADA [42], we take the identical settings as the original methods.

Imbalanced SVHN-MNIST-USPS. We randomly sample 1000 instances from class 0 and 100 instances from class 1 of the original source domain to construct a new source domain. Then we sample 100 instances from class 0 and 1000 instances from class 1 of the target domain to form a new target domain. Therefore, the synthetic adaptation dataset contains several imbalanced two-class adaptation tasks. The experiment settings are the same as those of SVHN-MNIST-USPS.

Office-31 and ImageCLEF-DA are two real-world datasets which are widely used in domain adaptation research. Office-31 is composed of three domains: Amazon (A), DSLR (D) and Webcam (W), containing 2817, 498 and 795 images from 31 categories, respectively. ImageCLEF-DA includes three domains: Caltech-256 (C), ImageNet ILSVRC 2012 (I) and Pascal VOC 2012 (P), each containing 600 images from 12 classes. We use data augmentation such as random flipping and cropping in training for fair comparison with the baselines.

Implementation. In the synthetic and digits experiments using LeNet, we set $\alpha$ according to the performance of CAT on the synthetic dataset, and we forward-propagate a target sample twice under different perturbations (e.g., dropout) and use the latter output as the prediction of the teacher, for its simplicity (similar to the Π-model of [17]). In all the other experiments, we fix $\alpha$ and deploy a temporal ensemble [17] of previous predictions of $h$ as the teacher (the accumulation decay constant is set to 0.6). We design a ramp-up function similar to that of [17] to update $\alpha$ in the experiments using LeNet, and in the others we follow the schedule suggested by RevGrad [8], in which the coefficient increases from 0 to 1 as the training progress variable goes linearly from 0 to 1. We set the margin $m$ in all the experiments without tuning. Refer to Appendix D for more details of the architectures and optimization settings.
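For concreteness, the sketch below shows two common schedule shapes matching the descriptions above: the Gaussian ramp-up used by [17] and the adaptation-factor schedule of [8]. The constants are those reported in the cited works and are included here as assumptions about what the text intends, not as the paper's exact values.

```python
import math

def rampup_weight(step, rampup_steps, alpha_max=1.0):
    """Gaussian ramp-up in the style of [17]: 0 -> alpha_max over rampup_steps."""
    if step >= rampup_steps:
        return alpha_max
    phase = 1.0 - step / rampup_steps
    return alpha_max * math.exp(-5.0 * phase * phase)

def revgrad_schedule(progress, gamma=10.0):
    """Adaptation factor of [8]: increases from 0 to 1 as training progress goes 0 -> 1."""
    return 2.0 / (1.0 + math.exp(-gamma * progress)) - 1.0
```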

4.1 Experiments on imbalanced SVHN-MNIST-USPS

We first test CAT on the imbalanced SVHN-MNIST-USPS dataset, a challenging task where the source domains have a 10:1 ratio of class imbalance while the target domains have 1:10. We implement CAT and RevGrad based on the official code of MSTN [49] using LeNet. The results are shown in Table 1. We repeat each task 3 times and report the averaged test accuracy and standard deviation.

Method SVHN to MNIST MNIST to USPS USPS to MNIST
RevGrad [8]
MSTN [49]
CAT
Table 1: Summary of domain adaptation results on the imbalanced digits datasets in terms of test accuracy (%).
Method A to W D to W W to D A to D D to A W to A Avg
ResNet-50 [13]
DAN [23] 80.4
RevGrad [8]
RTN [25]
JAN [24]
SimNet [33]
GenToAdapt [40]
JAN+CAT
rRevGrad+CAT
AlexNet [16]
DDC [46]
DRCN [9]
RTN [25]
RevGrad [8]
JAN [24]
AutoDIAL [3]
MSTN [49]
rRevGrad+CAT
Table 2: Accuracy on the Office-31 datasets in terms of test accuracy (%) (ResNet-50 and AlexNet).
Method SVHN to MNIST MNIST to USPS USPS to MNIST
Source Only
DDC [46]
CoGAN [21] -
DRCN [9]
ADDA [45]
LEL [27] - -
AssocDA [12] - -
MSTN [49] -
CAT
RevGrad [8]
RevGrad+CAT
rRevGrad+CAT
MCD [38]
MCD+CAT
VADA [42] - -
VADA+CAT - -
Table 3: Summary of domain adaptation results on the digits datasets in terms of test accuracy (%).

It is notable that RevGrad and MSTN fail thoroughly owing to their insistence on matching the marginal distributions. In contrast, CAT gives almost completely correct predictions for the target domains. This experiment verifies that existing methods based on aligning marginal distributions are restrictive and require the modes corresponding to the same class in different domains to be geometrically similar; they are sensitive and fragile in practical tasks.

4.2 SVHN-MNIST-USPS digits datasets

We apply CAT to the popular digits adaptation task SVHN-MNIST-USPS and compare it to state-of-the-art approaches in Table 3 (all baseline results are taken from the related literature). CAT, RevGrad+CAT and rRevGrad+CAT follow the settings of MSTN [49] using LeNet. We implement MCD+CAT and VADA+CAT based on the official codes of MCD and VADA, using their architectures instead of LeNet for fair comparison. We only integrate CAT with the first-stage algorithm VADA in DIRT-T [42], discarding the fine-tuning stage for simplicity.

There are several conclusions we can make. First, CAT reveals strikingly improved test accuracy on the SVHN to MNIST task without tuning the hyper-parameters, and CAT even outperforms MCD [38] and VADA [42], which use much wider and deeper neural networks, thanks to the class-conditional discriminative alignment between the source and target domains. This task is the most challenging one of the three because of the complex samples and the internal class imbalance of SVHN. Second, CAT does not perform well enough on the other two tasks, but when combined with rRevGrad and MCD, CAT outperforms the strong baselines MSTN [49] and MCD [38] by obvious margins. Third, applying CAT to RevGrad [8], MCD [38] and VADA [42] enhances the base methods significantly, especially the typical and simple RevGrad. Finally, rRevGrad+CAT displays higher test accuracy and lower variance than RevGrad+CAT, and the advantage is particularly obvious when the two domains have different class-conditional structures (e.g., SVHN to MNIST), so we utilize rRevGrad+CAT on the more challenging tasks.

4.3 Experiments on Office-31 and ImageCLEF-DA

We evaluate CAT using two sets of extensive experiments on the widely used Office-31 and ImageCLEF-DA. They contain more realistic and high-dimensional images, providing a good complement to the digits adaptation task. The results are provided in Table 2 and Table 4, respectively. We integrate CAT with rRevGrad (using ResNet-50 [13] and AlexNet [16] as the classifiers) and JAN [24] (using ResNet-50 [13] as the classifier), which is sufficient to verify the effectiveness of the discriminative cluster-based alignment and the teacher-student paradigm.

We observe that CAT can boost JAN [24] and rRevGrad significantly, especially on the difficult A to W, A to D, D to A and W to A tasks in Office-31, and the combined models surpass the strong baselines RevGrad and JAN by obvious margins (more than 1%). CAT based methods also outperform MSTN on various tasks substantially, which shows that the class-conditional discriminative alignment is superior to the semantic alignment used by MSTN. The improvement of test accuracy on most tasks of ImageCLEF-DA shows that CAT can still work well when the domains are small, containing only 600 images each. We further confirm that CAT can deliver a discriminative and aligned cluster-structured feature space by visualizing the learned features in Appendix A.

Method I to P P to I I to C C to I C to P P to C Avg
ResNet-50 [13] 91.5
DAN [23] 82.5
RevGrad [8]
JAN [24]
JAN+CAT
rRevGrad+CAT
AlexNet [16]
RTN [25]
RevGrad [8]
JAN [24]
MSTN [49]
rRevGrad+CAT
Table 4: Accuracy on the ImageCLEF-DA datasets in terms of test accuracy (%) (ResNet-50 and AlexNet).
(a) RevGrad
(b) rRevGrad+CAT
Figure 3: (Best viewed in color.) (a) Feature space learned by RevGrad. (b) Feature space learned by rRevGrad+CAT. The features are projected to 2-D using t-SNE. Blue violet denotes the source domain and the other colors denote different classes of the target domain. See Appendix A for more results.

4.4 Analysis

Visualization of feature space. We visualize the features of the two domains learned by the powerful rRevGrad+CAT and by RevGrad [8] on the SVHN to MNIST task using t-SNE [28]. The results are shown in Fig. 3. As expected, using CAT (Fig. 3(b)), the features are concentrated and form tight clusters, and those from different classes are separated. In contrast, the features learned by RevGrad [8] (Fig. 3(a)) are more overlapping and less discriminative.

Figure 4: Summary of clustering accuracy (%).

Clustering in the feature space. We further examine the feature space shaped by CAT and other baselines by conducting K-means [22] clustering on the aligned features. We utilize the trained models to infer the hidden features of all the images from the two domains. Then the features are clustered into $K$ components by K-means in scikit-learn [48], with $K$ set to the number of categories. We greedily set the label of a cluster as the most frequent label within it to calculate the clustering accuracy, as shown in Fig. 4. As expected, the feature space learned by rRevGrad+CAT demonstrates a more discriminative cluster structure, and this is consistent with the classification results of CAT.
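The clustering accuracy of Fig. 4 can be computed as in the sketch below: K-means clusters the aligned features, each cluster is greedily assigned its most frequent ground-truth class, and accuracy is measured against the true labels. The function name and details are ours, not the authors' evaluation script.

```python
import numpy as np
from sklearn.cluster import KMeans

def clustering_accuracy(features, labels, num_classes):
    """K-means on aligned features; each cluster takes its majority label.

    `features`: (n, d) array, `labels`: (n,) array of integer class ids.
    """
    kmeans = KMeans(n_clusters=num_classes, n_init=10).fit(features)
    assignments = kmeans.labels_
    correct = 0
    for k in range(num_classes):
        members = labels[assignments == k]
        if len(members) == 0:
            continue
        majority = np.bincount(members).argmax()   # greedy cluster-to-class mapping
        correct += int((members == majority).sum())
    return correct / len(labels)
```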

Convergence. To inspect how CAT converges, we plot the test accuracy with respect to the number of iterations in Fig. 5. On the two adaptation tasks using AlexNet, CAT shows a convergence rate similar to RevGrad [8] but better performance. Appendix B and Appendix C provide more analyses.

(a) DSLR to Amazon
(b) Amazon to DSLR
Figure 5: Test accuracy curves.

5 Conclusion

In this paper we address the challenge of better alignment between domains and advocate exploiting the discriminative class-conditional structures for effective adaptation in deep UDA. We propose Cluster Alignment with a Teacher (CAT) to achieve the objectives of discriminative learning and class-conditional alignment via a discriminative clustering loss and a cluster-based alignment loss. CAT produces a domain-invariant feature space with improved discriminative power and enhances the performance significantly. CAT establishes new state-of-the-art baselines on benchmarks, and additional analyses verify its effectiveness.

References

  • [1] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine learning, 79(1-2):151–175, 2010.
  • [2] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, page 7, 2017.
  • [3] F. M. Carlucci, L. Porzi, B. Caputo, E. Ricci, and S. R. Bulò. Autodial: Automatic domain alignment layers. In ICCV, pages 5077–5085, 2017.
  • [4] O. Chapelle, B. Scholkopf, and A. Zien. Semi-supervised learning (chapelle, o. et al., eds.; 2006)[book reviews]. IEEE Transactions on Neural Networks, 20(3):542–542, 2009.
  • [5] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In International conference on machine learning, pages 647–655, 2014.
  • [6] G. French, M. Mackiewicz, and M. Fisher. Self-ensembling for visual domain adaptation. ICLR, 2017.
  • [7] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. arXiv preprint arXiv:1409.7495, 2014.
  • [8] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
  • [9] M. Ghifary, W. B. Kleijn, M. Zhang, D. Balduzzi, and W. Li. Deep reconstruction-classification networks for unsupervised domain adaptation. In European Conference on Computer Vision, pages 597–613. Springer, 2016.
  • [10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [11] R. Gopalan, R. Li, and R. Chellappa. Domain adaptation for object recognition: An unsupervised approach. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 999–1006. IEEE, 2011.
  • [12] P. Haeusser, T. Frerix, A. Mordvintsev, and D. Cremers. Associative domain adaptation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2765–2773, 2017.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [14] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell. Cycada: Cycle-consistent adversarial domain adaptation. arXiv preprint arXiv:1711.03213, 2017.
  • [15] W. Hong, Z. Wang, M. Yang, and J. Yuan. Conditional generative adversarial network for structured domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1335–1344, 2018.
  • [16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [17] S. Laine and T. Aila. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242, 2016.
  • [18] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [19] D.-H. Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, volume 3, page 2, 2013.
  • [20] Y. Li, K. Swersky, and R. Zemel. Generative moment matching networks. In International Conference on Machine Learning, pages 1718–1727, 2015.
  • [21] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In Advances in neural information processing systems, pages 469–477, 2016.
  • [22] S. Lloyd. Least squares quantization in pcm. IEEE transactions on information theory, 28(2):129–137, 1982.
  • [23] M. Long, Y. Cao, J. Wang, and M. I. Jordan. Learning transferable features with deep adaptation networks. arXiv preprint arXiv:1502.02791, 2015.
  • [24] M. Long, H. Zhu, J. Wang, and M. I. Jordan. Deep transfer learning with joint adaptation networks. arXiv preprint arXiv:1605.06636, 2016.
  • [25] M. Long, H. Zhu, J. Wang, and M. I. Jordan. Unsupervised domain adaptation with residual transfer networks. In Advances in Neural Information Processing Systems, pages 136–144, 2016.
  • [26] Y. Luo, J. Zhu, M. Li, Y. Ren, and B. Zhang. Smooth neighbors on teacher graphs for semi-supervised learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [27] Z. Luo, Y. Zou, J. Hoffman, and L. F. Fei-Fei. Label efficient learning of transferable representations acrosss domains and tasks. In Advances in Neural Information Processing Systems, pages 165–177, 2017.
  • [28] L. v. d. Maaten and G. Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008.
  • [29] Z. Murez, S. Kolouri, D. Kriegman, R. Ramamoorthi, and K. Kim. Image to image translation for domain adaptation. arXiv preprint arXiv:1712.00479, 13, 2017.
  • [30] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5, 2011.
  • [31] T. Pang, C. Du, and J. Zhu. Max-mahalanobis linear discriminant analysis networks. In International Conference on Machine Learning, pages 4013–4022, 2018.
  • [32] P. O. Pinheiro. Unsupervised domain adaptation with similarity learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8004–8013, 2018.
  • [33] P. O. Pinheiro. Unsupervised domain adaptation with similarity learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [34] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
  • [35] P. Russo, F. M. Carlucci, T. Tommasi, and B. Caputo. From source to target and back: symmetric bi-directional adaptive gan. arXiv preprint arXiv:1705.08824, 3, 2017.
  • [36] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. In European conference on computer vision, pages 213–226. Springer, 2010.
  • [37] K. Saito, Y. Ushiku, and T. Harada. Asymmetric tri-training for unsupervised domain adaptation. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2988–2997. JMLR. org, 2017.
  • [38] K. Saito, K. Watanabe, Y. Ushiku, and T. Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [39] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In Advances in neural information processing systems, pages 2234–2242, 2016.
  • [40] S. Sankaranarayanan, Y. Balaji, C. D. Castillo, and R. Chellappa. Generate to adapt: Aligning domains using generative adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [41] Y. Shi and F. Sha. Information-theoretical learning of discriminative clusters for unsupervised domain adaptation. arXiv preprint arXiv:1206.6438, 2012.
  • [42] R. Shu, H. H. Bui, H. Narui, and S. Ermon. A dirt-t approach to unsupervised domain adaptation. ICLR, 2018.
  • [43] A. Tarvainen and H. Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in neural information processing systems, pages 1195–1204, 2017.
  • [44] A. Torralba and A. A. Efros. Unbiased look at dataset bias. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1521–1528. IEEE, 2011.
  • [45] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), volume 1, page 4, 2017.
  • [46] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.
  • [47] R. Volpi, P. Morerio, S. Savarese, and V. Murino. Adversarial feature augmentation for unsupervised domain adaptation. arXiv preprint arXiv:1711.08561, 2017.
  • [48] I. H. Witten, E. Frank, M. A. Hall, and C. J. Pal. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, 2016.
  • [49] S. Xie, Z. Zheng, L. Chen, and C. Chen. Learning semantic representations for unsupervised domain adaptation. In International Conference on Machine Learning, pages 5419–5428, 2018.
  • [50] W. Zhang, W. Ouyang, W. Li, and D. Xu. Collaborative and adversarial network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3801–3809, 2018.

Appendix A Class-conditional cluster structure in more tasks

At first, we visualize the learned feature spaces of CAT, RevGrad [8] and MSTN [49] on the imbalanced SVHN to MNIST task using t-SNE [28], as shown in Fig. 6. It is obvious that CAT forces the samples from the same class to concentrate together and form tighter clusters than those of RevGrad and MSTN, and the clusters present a strip-like pattern in the 2-D space. CAT can also align the class-conditional distributions of the source and the target domains correctly. However, RevGrad and MSTN tend to align the '0' images in SVHN with the '1' images in MNIST, thus their learned feature spaces are confusing and not discriminative. This visualization verifies the results in Table 1.

Secondly, we draw the feature spaces learned by rRevGrad+CAT and RevGrad on the MNIST to USPS and USPS to MNIST tasks in Fig. 7 using t-SNE [28]. rRevGrad+CAT delivers more discriminative feature spaces with separable and tight class-conditional clusters. Therefore, it is sufficient to use the first-order statistics based matching loss to match the class-conditional distributions of the two domains. The aligned clusters of the source and the target domains also verify the effectiveness of the loss $\mathcal{L}_a$.

Furthermore, we examine the feature spaces learned by CAT on more challenging tasks in the Office-31 and ImageCLEF-DA datasets, with results demonstrated in Fig. 8. These features are output by the AlexNet model trained with the rRevGrad+CAT method. The class-conditional distributions are shaped into tight and separable clusters, and the corresponding clusters from the source domain and the target domain are aligned. Therefore, CAT achieves the objectives of discriminative learning and class-conditional alignment, and thus performs well in the extensive experiments on the Office-31 and ImageCLEF-DA datasets.

(a) CAT
(b) RevGrad
(c) MSTN
Figure 6: (Best viewed in color.) Feature space learned on imbalanced SVHN to MNIST task. Green, red, blue and orange points represent ‘0’ images from SVHN, ‘1’ images from SVHN, ‘0’ images from MNIST and ‘1’ images from MNIST, respectively.
(a) RevGrad
(b) rRevGrad+CAT
(c) RevGrad
(d) rRevGrad+CAT
Figure 7: (Best viewed in color.) Feature space learned on MNIST to USPS (Fig. 7(a) and Fig. 7(b)) and USPS to MNIST (Fig. 7(c) and Fig. 7(d)) tasks. Blue violet denotes the source domain and the other colors denote different classes of the target domain.
(a) Amazon to Webcam
(b) Amazon to DSLR
(c) p to i
(d) i to p
Figure 8: (Best viewed in color.) Feature space learned on four challenging tasks. Blue violet (in (a) and (b)) and deep sky blue (in (c) and (d)) denote the source domain and the other colors denote different classes of the target domain.
(a) DSLR to Amazon
(b) Amazon to DSLR
Figure 9: Jensen-Shannon divergence (JSD) curves during training.
(a) SVHN to MNIST
(b) Amazon to DSLR
(c) p to i
Figure 10: The selection rate of the confidence-thresholding technique on different tasks.

Appendix B Quantitative estimate of the divergence between domains

When aligning the source and target domains via the combination of RevGrad and CAT, the loss which is maximized w.r.t. the critic can be viewed as a lower bound of $2\,\mathrm{JSD}(p, q) - 2\log 2$ (see [10] for the details), where $\mathrm{JSD}$ denotes the Jensen-Shannon divergence between distributions. Therefore, we plot this quantity to quantitatively estimate the divergence between the two domains, following [49]. The results are shown in Fig. 9, and we use AlexNet as the classifier here. CAT boosts RevGrad significantly, leading to faster and better convergence. This set of experiments verifies that, when combined with the marginal distribution alignment approaches, CAT provides a discriminative class-conditional alignment and biases the existing approaches to align the cluster-structured marginal distributions better.
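Under the standard GAN analysis [10], the optimal-critic value of the adversarial objective equals $2\,\mathrm{JSD}(p, q) - 2\log 2$, so a JSD estimate can be recovered from the observed critic objective as sketched below; this is an approximation that assumes the critic is near-optimal, and the helper name is ours.

```python
import math

def jsd_estimate(critic_objective):
    """Estimate JSD(p, q) from the maximized critic objective V(c) [10].

    For an optimal critic, V(c*) = 2 * JSD(p, q) - 2 * log 2; with a
    finite-capacity, partially trained critic the result is a lower bound.
    """
    return 0.5 * (critic_objective + 2.0 * math.log(2.0))
```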

Appendix C Verification of confidence-thresholding technique

We use the confidence-thresholding technique in the RevGrad method and claim that, over the course of training, more and more samples obtain confidence greater than $p$ and are selected into the domain adversarial training. Here we verify this through three experiments on tasks from SVHN-MNIST-USPS, Office-31 and ImageCLEF-DA, respectively. We plot the selection rate of this technique in the rRevGrad+CAT method w.r.t. the number of iterations in Fig. 10. We note that after several thousand iterations, the selection rate is almost 100% in the Amazon to DSLR and P to I tasks. In the SVHN to MNIST task, we apply a ramp-up schedule after 5000 iterations, as suggested by related SSL works. Therefore, after around 15000 iterations, the discriminative clustering structure forms and the samples are pushed far away from the decision boundaries, so almost all the samples have confidence greater than $p$ and are selected into the domain adversarial training.

Appendix D Experiment details

On the digits adaptation tasks, we use the simple LeNet with Batch Normalization after the convolutional layers and use the probability logits as features for adaptation, following [49, 45]. When combining with RevGrad [8] and rRevGrad, the critic model has a simple architecture.

On the more challenging tasks, we conduct experiments based on AlexNet [16] and ResNet-50 [13], each equipped with a 256-D bottleneck layer after its final feature layer (following [25, 49]). We use the features output by the bottleneck layers as image representations for adaptation and use a three-layer critic network. We finetune all the layers before the bottleneck layers in AlexNet and ResNet-50, and train the bottleneck layers and the classification layers via back propagation.

We use stochastic gradient descent with 0.9 momentum and an annealed learning rate that decays with the training progress $p$ (which changes linearly from 0 to 1) following [7, 49], when using LeNet and AlexNet as the classifiers. The learning rate for the finetuned layers is set to ten percent of that for the layers trained from scratch. We use a batch size of 128 in the experiments using LeNet, 200 in the experiments using AlexNet, and 36 in the experiments using ResNet-50.
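For reference, the annealing schedule of [7] has the form sketched below; the constants 10 and 0.75 are the values reported in [7] and are included here as an assumption about what the text intends.

```python
def annealed_lr(progress, base_lr=0.01, alpha=10.0, beta=0.75):
    """Learning-rate annealing of [7]: eta_p = eta_0 / (1 + alpha * p) ** beta,
    with the training progress p going linearly from 0 to 1."""
    return base_lr / (1.0 + alpha * progress) ** beta
```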

We use the same architectures and optimization settings (e.g., batch size, learning rate, optimizer and weight decay) as those of the original methods [42, 38] when combining CAT with them. We will release the code after review.