Towards Self-similarity Consistency and Feature Discrimination for Unsupervised Domain Adaptation

04/13/2019 · by Chao Chen, et al. · Zhejiang University

Recent advances in unsupervised domain adaptation mainly focus on learning shared representations by global distribution alignment, without considering class information across domains. This neglect of class information, however, may lead to partial alignment (or even misalignment) and poor generalization performance. For comprehensive alignment, we argue that the similarities across different features in the source domain should be consistent with those in the target domain. Based on this assumption, we propose a new domain discrepancy metric, i.e., Self-similarity Consistency (SSC), to enforce the feature structure to be consistent across domains. The renowned correlation alignment (CORAL) is proven to be a special case, and a sub-optimal measure, of our proposed SSC. Furthermore, we also propose to mitigate the side effects of partial alignment and misalignment by incorporating the discriminative information of the deep representations. Specifically, an embarrassingly simple and effective feature norm constraint is exploited to enlarge the discrepancy between inter-class samples. It relaxes the requirement of strict alignment when performing adaptation, therefore improving the adaptation performance significantly. Extensive experiments on visual domain adaptation tasks demonstrate the effectiveness of our proposed SSC metric and feature discrimination approach.


1. Introduction

Convolutional neural networks (CNNs) have shown promising results on supervised learning tasks. However, the generalization ability of a learned model may degrade severely when it is applied to related but different domains, and it is often expensive or impractical to annotate massive samples in each new domain. Domain adaptation, which aims to utilize labeled samples from a source domain to annotate samples in a target domain, has therefore emerged as a learning framework to address this problem (Csurka, 2017; Wang and Deng, 2018). In this paper, we mainly focus on unsupervised domain adaptation (UDA) problems. Recent advances in UDA show satisfactory performance with deep CNNs. Among them, the most successful methods encourage similarity between the latent deep representations of different domains. The similarity is often maximized via some discrepancy metric (Tzeng et al., 2014; Long et al., 2015; Sun and Saenko, 2016; Long et al., 2017; Zellinger et al., 2017) or adversarial training (Ganin et al., 2016; Tzeng et al., 2017; Hoffman et al., 2018).

Figure 1. We propose a self-similarity consistency measure, which enforces the similarity/distance across different features to be identical in the source and target domains.

Maximum Mean Discrepancy (MMD) and Correlation Alignment (CORAL) are the two most commonly used metrics for measuring distribution discrepancy. Many methods based on MMD (Tzeng et al., 2014; Long et al., 2015, 2017; Kang et al., 2019) and CORAL (Sun et al., 2016; Sun and Saenko, 2016; Zhang et al., 2018) have achieved considerable adaptation performance. However, the MMD-based methods suffer from a critical limitation: they only align the global distribution statistics without any semantic information, which may even lead to negative transfer (Hoffman et al., 2018; Xie et al., 2018). For example, the features of digit "2" in the source domain may be aligned with the features of digit "3" in the target domain (Hoffman et al., 2018). Although some studies attempt to alleviate this problem by leveraging pseudo-labels (Saito et al., 2017; Xie et al., 2018), it has not been solved well due to the lack of target label information. Prior adversarial adaptation methods also suffer from this limitation, as the discriminator only guarantees the global alignment of domain statistics and lacks crucial semantic information for each category. Several recent works also attempt to address this problem by pixel-level domain alignment (Bousmalis et al., 2017; Hoffman et al., 2018), which performs image-to-image translation for domain adaptation.

Figure 2. Enforcing feature discrimination by the feature norm constraint. Left: with previous domain alignment methods, the target domain samples distributed at the edge of a cluster are prone to be misclassified. Right: with the feature discrimination loss, the norms of the deep features are enlarged toward the target norm value, which makes the inter-class samples more separable. (Best viewed in color)

Motivation of SSC In this paper, we propose a new self-similarity consistency (SSC) metric to measure the domain discrepancy. As illustrated in Fig. 1, an intuitive motivation is that if the source domain and the target domain are well aligned, the similarity (or distance) across different features in the source domain should be consistent with the corresponding similarity in the target domain. For instance, if the distance between the i-th feature and the j-th feature in the source domain is d_ij^s, then this relationship should also hold in the target domain, i.e., d_ij^t ≈ d_ij^s. From this perspective, we propose an SSC constraint to match the source and target domains, and we demonstrate that correlation alignment (CORAL) is only a special case, and a sub-optimal measure, of the proposed SSC. Compared with the MMD-based methods, which align the centroids of all the samples between the source and target domains, our proposed SSC enforces a structural similarity constraint on the deep features and is therefore less susceptible to misalignment.

Motivation of Feature Discrimination Existing alignment methods can only reduce, but not remove, the domain discrepancy, which we call partial alignment. As a result, the samples distributed at the edge of a cluster are prone to be misclassified. From this perspective, one intuitive approach to improve the adaptation performance is to enforce the deep features across domains to have better intra-class compactness and inter-class separability (Chen et al., 2018). As illustrated in Fig. 2, we propose an embarrassingly simple, yet extremely effective, method to enhance the discrimination of the deep features. Specifically, a feature norm constraint is employed to scale the deep features to a given larger hypersphere (feature norm). Thanks to the resulting large margins between different classes, satisfactory adaptation performance can be expected even when the source and target domains are not perfectly aligned (only partially aligned). Besides, the possibility of misalignment is also reduced due to the high discrimination of the deep features.

For simplicity, we denote our proposal, "Towards Self-similarity Consistency and Feature Discrimination for Unsupervised Domain Adaptation", as SDDA. The main contributions of this work can be summarized as follows: (1) We introduce a new metric, self-similarity consistency (SSC), to measure the domain discrepancy, which performs better than previous metrics on most transfer tasks, and we demonstrate that CORAL can be viewed as a special case of SSC. (2) From the perspective of feature discrimination, we propose an embarrassingly simple approach to enlarge the separability of inter-class samples, which improves the adaptation performance significantly.

2. Related Work

Domain-Invariant Feature Learning. Deep domain adaptation has provided promising performance for visual applications. Typical deep domain adaptation methods follow a Siamese CNN architecture with two streams (Sun and Saenko, 2016; Long et al., 2017; Rozantsev et al., 2018; Chen et al., 2018), representing the source model and the target model respectively. A practical way to perform domain adaptation is to minimize the domain discrepancy to obtain domain-invariant features, and a large body of work learns domain-invariant representations via some measure of domain discrepancy. The most widely used discrepancy metrics include MMD (Tzeng et al., 2014; Long et al., 2015, 2017; Kang et al., 2019), CORAL (Sun et al., 2016; Sun and Saenko, 2016; Zhang et al., 2018; Morerio et al., [n. d.]) and Central Moment Discrepancy (CMD) (Zellinger et al., 2017). Specifically, Long et al. proposed DAN (Long et al., 2015) and JAN (Long et al., 2017), which perform domain matching via multi-kernel MMD or a joint MMD criterion in multiple domain-specific layers across domains. Sun et al. proposed correlation alignment (CORAL) (Sun et al., 2016; Sun and Saenko, 2016) to align the second-order statistics of the source and target distributions. Recent work also extended CORAL to mapped CORAL (MCA) (Zhang et al., 2018) and logCORAL (Morerio and Murino, 2017; Morerio et al., [n. d.]). Besides, CMD (Zellinger et al., 2017), which aligns the central moments of each order across domains, is also an effective approach for domain alignment. Another fruitful line of work learns domain-invariant features through adversarial training (Ganin et al., 2016; Tzeng et al., 2017), which encourages domain confusion via a domain adversarial objective whereby a discriminator (domain classifier) is trained to distinguish between the source and target representations. Also, recent work performing pixel-level adaptation by image-to-image translation (Zhu et al., 2017; Murez et al., 2018; Hoffman et al., 2018) has achieved satisfactory performance and attracted much attention, and is also widely used for cross-domain segmentation (Huo et al., 2018; Sankaranarayanan et al., 2018) and person re-identification (Deng et al., 2018). In this work, we propose a novel self-similarity consistency (SSC) metric, which exploits the structural similarity of the feature space for domain matching, instead of aligning the global statistics of all the samples across domains.

2.1. Discriminative Feature Learning.

Recently, there is a trend to improve the performance of CNNs with discriminative feature learning, especially in the fields of face recognition (Wen et al., 2016; Zheng et al., 2018; Liu et al., 2016) and person re-identification (Wang et al., 2018). Wen et al. proposed the Center Loss (Wen et al., 2016) to learn discriminative features by penalizing the distance of each sample to its corresponding class center. Liu et al. proposed the large-margin softmax (L-Softmax) (Liu et al., 2016), which enforces angular constraints to improve feature discrimination. Besides, some recent work also improved domain adaptation methods by incorporating discriminative feature learning (Lu et al., 2017, 2018; Li et al., 2018; Kang et al., 2019; Chen et al., 2018). Chen et al. (Chen et al., 2018) proposed joint domain alignment and discriminative feature learning (JDDA), where an instance-based and a center-based discriminative feature learning method are proposed to guarantee domain-invariant features with better intra-class compactness and inter-class separability. Kang et al. (Kang et al., 2019) proposed the Contrastive Adaptation Network (CAN), which explicitly models the intra-class domain discrepancy and the inter-class domain discrepancy by revisiting MMD. In this paper, we propose an elegant feature norm constraint for discriminative feature learning.

Figure 3. Two-stream CNNs with shared parameters are adopted for unsupervised deep domain adaptation. The first stream processes the source data and the second stream processes the target data. The last FC layer (the input of the output layer) is used as the adapted layer. (Best viewed in color)

3. Methodology

In this work, we consider the unsupervised domain adaptation problem. Let D_s = {(x_i^s, y_i^s)}_{i=1}^{n_s} denote the source domain with n_s labeled samples and D_t = {x_j^t}_{j=1}^{n_t} denote the target domain with n_t unlabeled samples. We aim to train a cross-domain CNN classifier which minimizes the target risk using the labeled source domain samples and the unlabeled target domain samples. Here f(x; Θ) denotes the output of the deep neural network, and Θ denotes the model parameters to be learned. Following (Chen et al., 2018; Long et al., 2017; Sun and Saenko, 2016), we adopt a two-stream CNN architecture for unsupervised deep domain adaptation. As shown in Fig. 3, the two streams share the same parameters (tied weights), processing the source and target domain samples respectively, and we perform the domain alignment in the last fully connected (FC) layer (Sun and Saenko, 2016; Chen et al., 2018). In this paper, we minimize the domain discrepancy with the proposed SSC metric. Besides, we also optimize the domain-invariant feature representations for better intra-class compactness and inter-class separability. The overall objective can be given as

L(Θ) = L_c + λ1 L_ssc + λ2 L_intra + λ3 L_inter    (1)
L_c = (1/n_s) Σ_{i=1}^{n_s} J(f(x_i^s; Θ), y_i^s)    (2)

where L_c represents the standard classification loss in the source domain and J(·,·) represents the cross-entropy loss function. L_ssc represents the domain discrepancy loss measured by the proposed SSC metric. L_intra and L_inter denote the intra-class compactness loss and the inter-class separability loss, respectively. λ1, λ2 and λ3 are three trade-off parameters which balance the contributions of the domain discrepancy loss and the feature discrimination loss.

Figure 4. Performing the self-similarity consistency constraint to minimize the domain discrepancy. The SSC metric matches the pairwise similarities across the deep features in the source and target domains, which ensures that the structure of the source feature space is consistent with the structure of the target feature space. (Best viewed in color)

3.1. Self-similarity Consistency

To minimize the domain discrepancy, we propose a novel self-similarity consistency metric to match the structure of the feature space across domains. Let H^s and H^t ∈ R^{d×b} denote the centralized source and target outputs of the last fully connected (FC) layer (as shown in Fig. 3). Here, d denotes the number of hidden nodes in the last FC layer and b denotes the batch size during the training stage. As can be seen in Fig. 4, the motivation of the SSC metric is that the similarity or distance across different features in the source domain should be consistent with the corresponding similarity or distance in the target domain, i.e., S_ij^s ≈ S_ij^t. Then, the generic SSC loss can be defined as

L_ssc = ||S^s − S^t||_F^2    (3)

Here S^s and S^t represent the self-similarity matrices across different features in the source and target domains, and S_ij^s (S_ij^t) denotes the similarity between the i-th feature and the j-th feature in the source (target) domain. Commonly used similarities can be defined as:

  • Dot-product similarity: S_ij = h_i^T h_j

  • Euclidean distance: S_ij = ||h_i − h_j||_2

  • Cosine similarity: S_ij = h_i^T h_j / (||h_i||_2 ||h_j||_2)

Rethinking Correlation Alignment (CORAL) The renowned correlation alignment (CORAL) diminishes the domain discrepancy by aligning the covariances of the source and target domains. It is one of the most widely used metrics for domain adaptation, and can be expressed as

L_CORAL = (1 / 4d^2) ||C_s − C_t||_F^2    (4)

Here, C_s and C_t denote the source feature covariance and the target feature covariance, respectively, and the entry (C)_ij can be regarded as the dot-product similarity between the i-th feature and the j-th feature. In this respect, CORAL can be viewed as a special case of our proposed SSC metric in which the dot-product similarity is adopted.
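To make this concrete, the following NumPy sketch (variable names and the 1/(b−1) covariance normalization are our assumptions, not notation from the paper) computes the generic SSC loss with the dot-product similarity on centralized features and checks that it matches the squared Frobenius distance between the covariances up to a constant scale:

```python
import numpy as np

rng = np.random.default_rng(0)
d, b = 8, 32                           # feature dimension, batch size
Hs = rng.normal(size=(d, b))           # source features (rows = feature dims)
Ht = rng.normal(size=(d, b))           # target features
Hs -= Hs.mean(axis=1, keepdims=True)   # centralize along the batch axis
Ht -= Ht.mean(axis=1, keepdims=True)

def ssc_loss(Hs, Ht, sim):
    # Generic SSC loss: squared Frobenius distance between the
    # d x d self-similarity matrices of the two domains.
    return np.sum((sim(Hs) - sim(Ht)) ** 2)

dot = lambda H: H @ H.T                # dot-product similarity matrix

# CORAL: squared Frobenius distance between the feature covariances
Cs = Hs @ Hs.T / (b - 1)
Ct = Ht @ Ht.T / (b - 1)
coral = np.sum((Cs - Ct) ** 2)

# SSC with dot-product similarity equals CORAL up to the 1/(b-1)^2 scale
assert np.isclose(ssc_loss(Hs, Ht, dot) / (b - 1) ** 2, coral)
```

With the dot-product similarity the two objectives differ only by a constant scale factor, which is why CORAL behaves as a special case of SSC.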

Heat Kernel Similarity In this work, we adopt the heat kernel function to measure the similarity across different features, i.e.,

S_ij = exp(−||h_i − h_j||_2^2 / σ)    (5)
L_ssc = ||S^s − S^t||_F^2 = Σ_i Σ_j (S_ij^s − S_ij^t)^2    (6)

where σ is the bandwidth that adjusts the influence of each single pairwise similarity. Note that the heat kernel similarity has the same expression as the Gaussian kernel (or RBF kernel).
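A heat-kernel version of the SSC loss can be sketched in NumPy as follows (the function names and the squared-Frobenius form of the loss are our assumptions; only the similarity of Eq. (5) is fixed by the text):

```python
import numpy as np

def heat_kernel_sim(H, sigma=1.0):
    # Pairwise heat-kernel similarity between the rows of H
    # (one row per feature dimension): S_ij = exp(-||h_i - h_j||^2 / sigma).
    sq = np.sum(H ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (H @ H.T)
    return np.exp(-np.maximum(d2, 0.0) / sigma)

def ssc_heat_loss(Hs, Ht, sigma=1.0):
    # SSC loss under the heat-kernel similarity: squared Frobenius
    # distance between source and target self-similarity matrices.
    diff = heat_kernel_sim(Hs, sigma) - heat_kernel_sim(Ht, sigma)
    return float(np.sum(diff ** 2))
```

The loss vanishes only when the two self-similarity matrices coincide, i.e., when the pairwise feature structure is identical across domains.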

Kernel Embedding Perspective Suppose we have a feature space with d features and a feature map φ: H → V, where V is an embedding space. The kernel matrix K is defined as K_ij = k(h_i, h_j), where k(h_i, h_j) = ⟨φ(h_i), φ(h_j)⟩. The dot-product similarity corresponds to the linear kernel embedding, and the heat kernel similarity corresponds to the RBF kernel (or Gaussian kernel) embedding. From this perspective, CORAL can be regarded as a kernel embedding method which aligns the source and target features under the linear kernel embedding, while our proposed SSC aligns the features under the RBF kernel embedding. Both theoretical analysis and experimental results show that our method is superior to CORAL, and that CORAL can be viewed as a form of the self-similarity consistency metric.

Relationship to Mean Map Kernel The mean map kernel (MMK) (Muandet et al., 2012; Shan and Zhang, 2016) measures the similarity between two distributions, and can be formulated as

MMK(X, Y) = (1 / (n_x n_y)) Σ_{i=1}^{n_x} Σ_{j=1}^{n_y} k(x_i, y_j)    (7)

where k(·,·) is a kernel function and n_x (n_y) denotes the number of samples in X (Y). In terms of domain adaptation, MMK can be further transformed into a mutual-similarity maximization (MSM) metric when we replace the kernel function and the distributions X, Y with a similarity function and the centralized features, i.e.,

(8)

Similar to SSC, MSM is also an effective metric for domain matching. What sets SSC apart from MSM is that SSC aligns the self-similarity relationships across domains for domain matching, while MSM aggregates the pairwise similarities over two feature sets to measure the similarity of two distributions. In this work, we focus on the effectiveness of our proposed SSC with the heat kernel similarity; an investigation of mutual-similarity maximization is beyond the scope of this paper.
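For reference, a minimal sketch of the mean map kernel of Eq. (7); the RBF kernel choice and all names here are our assumptions:

```python
import numpy as np

def mean_map_kernel(X, Y, k):
    # MMK(X, Y): average kernel value over all cross pairs,
    # (1 / (n_x * n_y)) * sum_i sum_j k(x_i, y_j).
    return float(np.mean([k(x, y) for x in X for y in Y]))

def rbf(x, y, sigma=1.0):
    # RBF (Gaussian) kernel between two vectors.
    return float(np.exp(-np.sum((x - y) ** 2) / sigma))
```

Note that MMK grows as the two sample sets overlap more, so a domain-matching objective would maximize it, whereas SSC is a discrepancy to be minimized.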

3.2. Feature Discrimination

Recently, a line of work improves the adaptation performance by pursuing better intra-class compactness and inter-class separability of the domain-invariant features (Lu et al., 2017, 2018; Li et al., 2018; Kang et al., 2019; Chen et al., 2018). However, most of these methods are rather involved, and several are built on traditional (shallow) classifiers (Lu et al., 2017, 2018; Li et al., 2018). JDDA (Chen et al., 2018) provides an elegant approach to learn discriminative features for deep domain adaptation, but the inter-class separability is not well guaranteed. In this paper, we propose an extremely intuitive and effective approach, namely the feature norm constraint, to ensure that the shared representations have better inter-class separability.

Intra-class Compactness To give the shared representations better intra-class compactness, we follow (Chen et al., 2018) and penalize the distances between the deep features and their corresponding class centers. Let h_i^s denote the deep features of the i-th source sample in the last FC layer, c_{y_i} denote its corresponding class center in the feature space, and c be the number of classes. Then, the intra-class compactness loss can be formulated as

L_intra = Σ_{i=1}^{b} max(0, ||h_i^s − c_{y_i}||_2^2 − m)    (9)

The intra-class compactness loss enforces the distance between each deep feature and its corresponding center to be no more than a given margin m. Note that the actual global centers should be calculated by averaging over all the samples. However, since we perform updates based on mini-batches, the centers might be misestimated because of noisy samples. Therefore, we use a moving average over the samples as the global centers, which can be updated in each iteration as

Δc_j = Σ_{i=1}^{b} δ(y_i = j)(c_j − h_i^s) / (1 + Σ_{i=1}^{b} δ(y_i = j))    (10)
c_j ← c_j − α · Δc_j    (11)

where α is the learning rate of the centers, b denotes the batch size, and δ(condition) = 1 if the condition is satisfied (0 otherwise). For simplicity, the margin m and the center learning rate α are fixed throughout the experiments. Note that the intra-class compactness constraint only penalizes the source domain compactness, since the target samples lack category information. Fortunately, we observe that the visual representations learned by deep CNNs are fairly domain invariant, i.e., the source samples and target samples have similar distributions in the feature space. As a result, it is reasonable to make the source features more discriminative, such that the target features, maximally aligned with the source domain, will become discriminative automatically (Chen et al., 2018).
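The margin-based compactness loss and the mini-batch center update above can be sketched as follows (array shapes, default values, and function names are our assumptions):

```python
import numpy as np

def intra_loss(feats, labels, centers, margin=0.0):
    # Margin-based intra-class compactness: penalize squared distances
    # of features to their class centers beyond the margin.
    d2 = np.sum((feats - centers[labels]) ** 2, axis=1)
    return float(np.sum(np.maximum(d2 - margin, 0.0)))

def update_centers(centers, feats, labels, alpha=0.5):
    # Center-loss-style moving-average update per mini-batch:
    # delta_j = sum_i 1[y_i = j] (c_j - h_i) / (1 + sum_i 1[y_i = j]),
    # c_j <- c_j - alpha * delta_j.
    new = centers.copy()
    for j in range(centers.shape[0]):
        mask = labels == j
        if mask.any():
            delta = np.sum(centers[j] - feats[mask], axis=0) / (1.0 + mask.sum())
            new[j] = centers[j] - alpha * delta
    return new
```

The 1 in the denominator keeps the update stable when a class has few (or zero) samples in the current mini-batch.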

Inter-class Separability Recall that the feature norm represents the distance between the hypersphere origin and the feature vector, so the inter-class samples should be better separated if the feature norms are enlarged. As can be seen in Fig. 2, we propose to maximize the inter-class discrepancy by penalizing the difference between a given target norm value and the feature norms in the shared feature space. In this way, the feature representations are pulled away from the hypersphere origin, making the domain-invariant features more separable. The feature norm constraint loss can be defined as

L_inter = (1/b) Σ_{i=1}^{b} (||h_i||_2 − r)^2    (12)

Here, r is the target norm constant that we would like the domain-invariant features to be scaled to. The gradient with respect to the input features can be calculated as

∂L_inter / ∂h_i = (2/b) (||h_i||_2 − r) h_i / ||h_i||_2    (13)

There are several advantages to performing the feature norm constraint. (1) The larger feature norms guarantee better separability of the inter-class samples. (2) The norm constraint enforces the features to distribute around the same hypersphere, which mitigates the domain discrepancy to a certain degree. (3) The large gaps between different classes relax the requirement of strict alignment across domains, and mitigate the side effects of partial alignment and misalignment.
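A minimal sketch of the feature norm constraint and its gradient (the averaging over the batch and the function names are our assumptions; the gradient follows directly from the squared-deviation loss):

```python
import numpy as np

def inter_loss(H, r):
    # Feature-norm constraint: mean squared deviation of each feature
    # vector's L2 norm from the target norm r. H has shape (b, dim).
    norms = np.linalg.norm(H, axis=1)
    return float(np.mean((norms - r) ** 2))

def inter_loss_grad(H, r):
    # Analytic gradient: dL/dh_i = (2 / b) * (||h_i|| - r) * h_i / ||h_i||.
    norms = np.linalg.norm(H, axis=1, keepdims=True)
    return 2.0 * (norms - r) * H / (norms * H.shape[0])
```

Because the gradient points along h_i itself, the update only rescales each feature toward the target hypersphere of radius r without rotating it, which is what keeps the class structure intact while enlarging the margins.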

4. Experiments

4.1. Setup

Dataset We evaluate the performance of our proposed SDDA by comparing against several state-of-the-art deep domain adaptation methods on two public unsupervised visual adaptation benchmarks: the digit recognition datasets and the Office-31 dataset. The digit recognition suite includes five widely used benchmarks: MNIST, USPS, Street View House Numbers (SVHN), MNIST-M, and SYN (synthetic digits). Following the experimental protocol of (Chen et al., 2018), we evaluate our proposal across four adaptation shifts: SVHN→MNIST, MNIST→MNIST-M, USPS→MNIST and SYN→MNIST. Office-31 is another commonly used dataset for real-world domain adaptation scenarios, which contains 31 categories acquired from office environments in three distinct image domains: Amazon (product images downloaded from amazon.com), Webcam (low-resolution images taken by a webcam) and DSLR (high-resolution images taken by a digital SLR camera). The Office-31 dataset contains 4110 images in total, with 2817 images in the A domain, 795 images in the W domain and 498 images in the D domain. We evaluate our method on all six transfer tasks, A→W, D→W, W→D, A→D, D→A and W→A, as in (Sun and Saenko, 2016; Long et al., 2017).

Baselines To evaluate the effectiveness of our proposed SDDA, we compare the SDDA with the following competing methods, which are most related to our work.
Source Only: As an empirical lower bound, we train a model with the source samples only, and test it directly with the target samples.
DDC (Tzeng et al., 2014): DDC is the first method that maximizes domain invariance with the MMD metric using a two-stream CNN.
DAN (Long et al., 2015): DAN learns more transferable features by embedding the deep features of all task-specific layers in a reproducing kernel Hilbert space (RKHS) and matching the domain distributions optimally using multi-kernel MMD.
CORAL (Sun and Saenko, 2016): Deep correlation alignment minimizes domain discrepancy by aligning the second-order statistics of the source and target distributions.
DANN (Ganin et al., 2016): DANN is an adversarial representation learning approach that uses a domain classifier to learn features that are both discriminative and invariant to the change of domains.
ADDA (Tzeng et al., 2017): ADDA uses a discriminative base model, unshared weights, and the standard GAN loss to learn a discriminative mapping of target images to the source feature space by fooling a domain discriminator.
CMD (Zellinger et al., 2017): CMD proposes to align the central moment of each order across domains for domain adaptation.
CyCADA (Hoffman et al., 2018): CyCADA is a pixel-level unsupervised domain adaptation method that unifies cycle-consistent image translation with adversarial adaptation methods.
JDDA (Chen et al., 2018): JDDA is the first work to perform domain matching and discriminative feature learning jointly for deep domain adaptation.

Implementation details. In our experiments on the digit recognition datasets, we utilize a modified LeNet whereby a bottleneck layer with 64 hidden nodes is added before the output layer for domain matching. Since the image size differs across domains, we resize all the images to a common size and convert the RGB images to grayscale. For the experiments on Office-31, we use a ResNet-50 pretrained on ImageNet as our backbone network, and the activation output of the last fully connected layer is used as the feature representation for domain matching. Due to the small sample size of the Office-31 dataset, we only update the weights of the fully connected (fc) layers as well as the final block (scale5/block3), and freeze the other parameters pretrained on ImageNet. Following the standard protocol of (Long et al., 2015; Sun and Saenko, 2016; Chen et al., 2018), we use all the labeled source domain samples and all the unlabeled target domain samples for training. All the compared methods are based on the same CNN architecture for a fair comparison, and all the model parameters are shared between the source and target domains.

For the MMD-based methods (DDC, DAN), we use a linear combination of 19 RBF kernels whose standard deviation parameters span a wide range of scales. For DANN regularization, we add a gradient reversal layer (GRL) and then a domain classifier with one hidden layer of 100 nodes. When training with ADDA, our adversarial discriminator consists of 3 fully connected layers: two layers with 500 hidden units each, followed by the final discriminator output.

Method | SVHN→MNIST | MNIST→MNIST-M | USPS→MNIST | SYN→MNIST | Avg
Source Only | 67.3±0.3 | 62.8±0.2 | ±0.4 | 89.7±0.2 | 71.6
DDC (Tzeng et al., 2014) | 71.9±0.4 | 78.4±0.1 | 75.8±0.3 | 89.9±0.2 | 79.0
DAN (Long et al., 2015) | 79.5±0.3 | ±0.2 | ±0.2 | 75.2±0.1 | 81.0
DANN (Ganin et al., 2016) | 70.6±0.2 | 76.7±0.4 | 76.6±0.3 | 90.2±0.2 | 78.5
CMD (Zellinger et al., 2017) | 86.5±0.3 | 85.5±0.2 | 86.3±0.4 | 96.1±0.2 | 88.6
ADDA (Tzeng et al., 2017) | 72.3±0.2 | ±0.3 | 92.1±0.2 | 96.3±0.4 | 85.4
CORAL (Sun and Saenko, 2016) | 89.5±0.2 | 81.6±0.2 | 96.5±0.3 | 96.5±0.2 | 91.0
CyCADA (Hoffman et al., 2018) | 92.8±0.1 | 98.3±0.2 | 97.4±0.3 | 97.5±0.1 | 96.5
JDDA (Chen et al., 2018) | 94.2±0.1 | 88.4±0.2 | 96.7±0.1 | 97.7±0.0 | 94.3
SDDA (w/o FD) | 94.2±0.2 | 82.2±0.2 | 96.5±0.2 | 97.3±0.1 | 92.6
SDDA (intra) | 94.9±0.1 | 88.9±0.2 | 96.9±0.1 | 97.6±0.0 | 94.6
SDDA (inter) | 96.5±0.1 | 87.9±0.3 | 98.5±0.1 | 98.8±0.0 | 95.4
SDDA (Full) | 97.3±0.3 | 90.5±0.3 | 97.6±0.1 | 98.8±0.0 | 96.1

SDDA (w/o FD) indicates that the feature discrimination loss is not involved in SDDA, i.e., both the intra-class compactness loss and the inter-class separability loss are removed.

Table 1. Results (accuracy %) on the digit recognition datasets for unsupervised domain adaptation based on the modified LeNet
Method | A→W | D→W | W→D | A→D | D→A | W→A | Avg
Source Only | 73.1±0.2 | 93.2±0.2 | ±0.1 | 72.6±0.2 | 55.8±0.1 | 56.4±0.3 | 75.0
DDC (Tzeng et al., 2014) | 74.4±0.3 | 94.0±0.1 | 98.2±0.1 | 74.6±0.4 | 56.4±0.1 | 56.9±0.1 | 75.8
DAN (Long et al., 2015) | 78.3±0.3 | ±0.2 | ±0.1 | 75.2±0.2 | 58.9±0.2 | 64.2±0.3 | 78.5
DANN (Ganin et al., 2016) | 73.6±0.3 | 94.5±0.1 | 99.5±0.1 | 74.4±0.5 | 57.2±0.1 | 60.8±0.2 | 76.7
CMD (Zellinger et al., 2017) | 76.9±0.4 | 94.6±0.3 | 99.2±0.2 | 75.4±0.4 | 56.8±0.1 | 61.9±0.2 | 77.5
CORAL (Sun and Saenko, 2016) | 79.3±0.3 | 94.3±0.2 | 99.4±0.2 | 74.8±0.1 | 56.4±0.2 | 63.4±0.2 | 78.0
CyCADA (Hoffman et al., 2018) | 82.2±0.3 | 94.6±0.2 | 99.7±0.1 | 78.7±0.1 | 60.5±0.2 | 67.8±0.2 | 80.6
JDDA (Chen et al., 2018) | 82.6±0.4 | 95.2±0.2 | 99.7±0.0 | 79.8±0.1 | 57.4±0.0 | 66.7±0.2 | 80.2
SDDA (w/o FD) | 82.4±0.2 | 94.7±0.1 | 99.3±0.0 | 77.8±0.2 | 56.9±0.1 | 65.1±0.3 | 79.4
SDDA (intra) | 83.9±0.3 | 95.3±0.2 | 99.3±0.0 | 80.4±0.1 | 59.7±0.3 | 67.3±0.3 | 81.0
SDDA (inter) | 84.7±0.2 | 99.1±0.1 | 99.8±0.0 | 81.2±0.2 | 64.9±0.2 | 66.7±0.2 | 82.7
SDDA (Full) | 87.5±0.3 | 98.8±0.0 | 99.8±0.0 | 86.4±0.3 | 67.1±0.2 | 69.4±0.3 | 84.8
Table 2. Results (accuracy %) on the Office-31 dataset for unsupervised domain adaptation based on ResNet-50

All the methods are implemented in TensorFlow, and we use the Adam optimizer with a learning rate of 1e-4 to train the network. The optimal hyper-parameters are determined by grid search, and they may differ across transfer tasks. For the digit recognition tasks, the bandwidth hyper-parameter of SSC is set to 0.001; for the experiments on Office-31, it is set to 0.0001. The trade-off parameters and the target norm value are likewise selected by grid search. When implementing the comparison baselines, we follow the learning rate schedule described in (Long et al., 2015, 2017), i.e., the adaptation factor λ_p is gradually updated from 0 to 1 by the progressive schedule λ_p = 2 / (1 + exp(−γ·p)) − 1, where γ is a constant set to 10 and p is the training progress linearly changing from 0 to 1.
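For reference, the progressive schedule of the adaptation factor can be sketched as follows (this is the standard formulation from (Long et al., 2015; Ganin et al., 2016); the function name is ours):

```python
import numpy as np

def adaptation_factor(p, gamma=10.0):
    # Progressive schedule: lambda_p = 2 / (1 + exp(-gamma * p)) - 1,
    # growing smoothly from 0 to ~1 as training progress p goes from 0 to 1.
    return 2.0 / (1.0 + np.exp(-gamma * p)) - 1.0
```

This suppresses the noisy adaptation signal early in training and lets it dominate once the classifier has stabilized.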

4.2. Experimental results

Digit Recognition Table 1 shows the adaptation performance on four transfer tasks of the digit recognition datasets based on the modified LeNet. As can be seen, all the domain adaptation methods outperform the source-only (non-adapted) model by a large margin, while our proposed SDDA yields notable improvements over the compared methods on most of the transfer tasks. In particular, our method improves the adaptation performance significantly on the hard transfer tasks, such as SVHN→MNIST and MNIST→MNIST-M. To the best of our knowledge, our approach achieves the highest classification accuracy on the transfer task SVHN→MNIST among all unsupervised domain adaptation methods. SVHN is colored and some of its images contain multiple digits, while MNIST is grayscale without a messy background, so this domain shift is a challenging adaptation scenario as well as a representative transfer task. Note that MNIST-M was created by using each MNIST digit as a binary mask and inverting with it the colors of a background image randomly cropped from the Berkeley Segmentation Data Set (Arbelaez et al., 2011). Therefore, pixel-level domain adaptation methods, such as (Hoffman et al., 2018; Bousmalis et al., 2017; Murez et al., 2018), can transfer the "MNIST-like" images to "MNIST-M-like" images easily. This is why CyCADA achieves much higher classification accuracy than the other methods on the transfer task MNIST→MNIST-M.

(a) source loss only (2D)
(b) source loss + intra-class compactness loss (2D)
(c) source loss + inter-class separability loss (2D)
(d) source loss + full discrimination loss (2D)
Figure 5. 2D visualization of the deep features. The model is trained (a) with the source loss only, (b) with the source loss and the intra-class compactness loss, (c) with the source loss and the inter-class separability loss, and (d) with the source loss and the full discrimination loss. Note that we set the feature dimension of the last FC layer to 2 and color the features by class. (Best viewed in color)
(a) source only
(b) + CORAL
(c) + SSC
(d) + SSC + feature discrimination
(e)–(h) the same four models, colored by domain
Figure 6. The t-SNE visualization of the SVHN→MNIST task. The first row illustrates the t-SNE embedding of the deep features marked by category information; each color represents a category. The second row illustrates the t-SNE embedding of the deep features marked by domain information; red and blue points represent the samples drawn from the source and target domains.

Office-31 The results on the Office-31 dataset for unsupervised domain adaptation based on ResNet-50 are shown in Table 2. It can be seen that our proposal outperforms all the competing methods on all six transfer tasks. Moreover, our approach improves the classification accuracy substantially on four transfer tasks: A→W, D→W, A→D and D→A. Note that we have not reported the performance of ADDA on the Office-31 dataset, because we obtained very poor results on some transfer tasks, which may be caused by insufficient training data or an inappropriate parameter setting; to avoid misleading readers, we do not report these results.

From Table 1 and Table 2, we make several observations: (1) Both the discrepancy-based methods (Long et al., 2015; Sun and Saenko, 2016; Zellinger et al., 2017) and the adversarial methods (Ganin et al., 2016; Tzeng et al., 2017) outperform the source-only model by a large margin, which verifies the effectiveness of deep domain adaptation. (2) Our proposed SSC metric outperforms other typical discrepancy metrics, such as MMD, CORAL and CMD, on most of the transfer tasks, as reflected by comparing SDDA (w/o FD) with the other representative methods. (3) SDDA (Full) distinctly outperforms SDDA (w/o FD), which indicates that the proposed feature discrimination constraint can significantly improve the adaptation performance. (4) Compared with the intra-class compactness loss, introducing the inter-class separability loss is more effective for improving the transfer accuracy, as demonstrated by comparing the transfer accuracies of the SDDA variants trained with each loss alone. (5) The pixel-level domain adaptation method CyCADA (Hoffman et al., 2018) performs well on the digit recognition datasets, especially on the domain shift MNIST→MNIST-M, which indicates that CyCADA is more effective at eliminating pixel-level and low-level domain discrepancy.

Figure 7. Parameter sensitivity and convergence analysis. (a-c) The sensitivity of accuracy to each of the three trade-off parameters. (d) The convergence of SDDA on SVHN→MNIST. The dashed lines in (a) represent the performance of CORAL, while the dashed lines in (b-c) represent the performance of SDDA (w/o FD).

4.3. Analysis

Feature Visualization To verify the effectiveness of the introduced feature discrimination constraint, we set the number of hidden nodes in the last FC layer to 2 and visualize the 2D features of 2000 randomly selected source samples. As illustrated in Fig. 5, the features trained only with the source classification loss follow a strip-like distribution, while those trained with the source loss plus the intra-class discrepancy loss are tightly clustered. Moreover, the features trained with the source loss plus the inter-class discrepancy loss are distributed around a circumference, while those trained with the full discrimination loss (source, intra-class and inter-class losses together) are both well clustered and distributed around a circumference. This demonstrates the effectiveness of the introduced feature discrimination loss. We conclude that with the feature discrimination constraint, the deep features are pulled away from the hypersphere origin and well clustered, which benefits domain adaptation tasks.
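For intuition, here is a minimal numpy sketch of a feature discrimination term in the spirit described above: intra-class compactness around class centers plus a feature-norm constraint that pushes features away from the origin. The center-based compactness, the target-norm formulation, and the radius hyper-parameter `r` are our assumptions, not the paper's exact losses.

```python
import numpy as np

def discrimination_loss(features, labels, r=10.0):
    """Feature discrimination sketch.

    features: (n, d) deep features; labels: (n,) integer class ids;
    r: hypothetical target feature norm (radius hyper-parameter).
    Returns (intra, inter): intra-class compactness and
    norm-based inter-class separability terms.
    """
    intra = 0.0
    for c in np.unique(labels):
        Fc = features[labels == c]
        center = Fc.mean(axis=0)
        intra += np.sum((Fc - center) ** 2)    # pull same-class features together
    intra /= len(features)

    norms = np.linalg.norm(features, axis=1)
    inter = np.mean((norms - r) ** 2)          # push feature norms toward r,
                                               # i.e., away from the origin
    return intra, inter
```

Enlarging feature norms spreads the classes over a larger hypersphere, so inter-class margins grow even without explicit pairwise separation terms.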

The visualization of the 2D features above only shows the effectiveness of the feature discrimination. Here, we also visualize the t-SNE (Maaten and Hinton, 2008) embeddings of the learned deep features to demonstrate the effectiveness of our approach visually. As can be seen in Fig. 6, we choose a representative domain shift, SVHN→MNIST, under different losses for visualization. From the visualizations, we can make several intuitive observations: (1) The feature distributions of the source-only (non-adapted) model in Fig. 6(a) and Fig. 6(e) suggest that the domain shift between the SVHN and MNIST datasets is significant, demonstrating the necessity of domain adaptation. Moreover, Fig. 6(a) and Fig. 6(e) also show that the deep features are far from domain invariant: samples drawn from the same class across domains are not sufficiently close, and samples drawn from different classes do not keep large enough margins, which motivates learning discriminative features for domain adaptation. (2) Compared with Fig. 6(b) and Fig. 6(f) (source loss with CORAL), there are fewer scattered points in the intervals between classes in Fig. 6(c) and Fig. 6(g) (source loss with SSC), which demonstrates that our proposed SSC metric is more effective than CORAL at eliminating the domain discrepancy. (3) There are fewer incorrectly clustered samples in Fig. 6(d) and Fig. 6(h) (source loss with SSC and feature discrimination) than in Fig. 6(c) and Fig. 6(g) (source loss with SSC), which demonstrates the merits of learning discriminative features for domain adaptation. (4) As can be seen in Fig. 6(d) and Fig. 6(h), although the source and target domains are not perfectly aligned, there is a distinct gap between different classes, which guarantees high transfer accuracy.
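t-SNE plots of this kind can be produced along the following lines (a scikit-learn-based sketch; the joint embedding of both domains, the perplexity value, and the PCA initialization are assumptions about the setup, not the paper's stated configuration):

```python
import numpy as np
from sklearn.manifold import TSNE

def tsne_embed(source_feats, target_feats, perplexity=30, seed=0):
    """Joint 2-D t-SNE embedding of source and target deep features.

    Returns the embedding plus a domain mask (0 = source, 1 = target),
    which can be used to color points by domain; coloring by class
    instead just requires the corresponding label array.
    """
    X = np.vstack([source_feats, target_feats])
    emb = TSNE(n_components=2, perplexity=perplexity,
               init='pca', random_state=seed).fit_transform(X)
    domain = np.concatenate([np.zeros(len(source_feats)),
                             np.ones(len(target_feats))])
    return emb, domain
```

Embedding both domains jointly (rather than separately) is what makes cross-domain overlap, or the lack of it, visible in a single plot.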

Parameter Sensitivity We conduct an empirical parameter sensitivity study on two representative transfer tasks, SVHN→MNIST and A→D. The three trade-off parameters involved in our approach are evaluated one at a time, and Figs. 7(a)-7(c) show the transfer accuracy as each parameter is varied. We observe that the accuracy of SDDA first increases and then decreases, showing a bell-shaped curve in all three illustrations, which confirms the effectiveness of the proposed SSC metric and feature discrimination constraint. It is worth noting that since the sample distributions differ across datasets, the optimal trade-off parameters are distinct on different transfer tasks. In addition, the model performance is more sensitive to the inter-class separability weight than to the intra-class compactness weight, which again reflects that enhancing the inter-class separability is more effective at improving the feature discrimination and boosting the adaptation performance.
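A one-factor-at-a-time sweep of this kind can be sketched as follows (the `evaluate` callable and the parameter names are placeholders, since training a full model per grid point is outside the scope of this sketch):

```python
def sensitivity_sweep(evaluate, base_params, grids):
    """One-factor-at-a-time parameter sensitivity analysis.

    evaluate:    callable(params_dict) -> accuracy
    base_params: dict of default trade-off weights
    grids:       dict mapping a parameter name to the values to sweep
    Returns {name: [(value, accuracy), ...]} per swept parameter.
    """
    results = {}
    for name, values in grids.items():
        curve = []
        for v in values:
            params = dict(base_params, **{name: v})  # vary one weight, fix the rest
            curve.append((v, evaluate(params)))
        results[name] = curve
    return results
```

Plotting each returned curve reproduces the accuracy-versus-weight panels: a bell shape indicates a useful term whose weight must simply be tuned.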

Convergence Performance We also evaluate the convergence of our proposal through the test error during the training phase. Fig. 7(d) shows the test errors of different methods on the transfer task SVHN→MNIST. It suggests that our proposed SDDA performs best and converges fastest compared with the other state-of-the-art methods, which reveals the effectiveness of our proposal.

5. Conclusions

To minimize the domain discrepancy for cross-domain knowledge transfer, we propose to match the structure of the deep features in the source and target domains through a self-similarity consistency constraint, instead of aligning global distribution statistics across domains. The experimental analysis demonstrates the superiority of the proposed SSC metric over the widely used MMD and CORAL. In addition, we propose to improve the adaptation performance by learning more discriminative features when performing domain matching: an elegant feature norm constraint is exploited to enlarge the margins between inter-class samples. Experimental results and feature visualizations suggest that learning more discriminative features eases the requirement of strict domain alignment and effectively improves the transfer accuracy. The source code of this work will be released soon.

References

  • Arbelaez et al. (2011) Pablo Arbelaez, Michael Maire, Charless Fowlkes, and Jitendra Malik. 2011. Contour detection and hierarchical image segmentation. IEEE transactions on pattern analysis and machine intelligence 33, 5 (2011), 898–916.
  • Bousmalis et al. (2017) Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, and Dilip Krishnan. 2017. Unsupervised pixel-level domain adaptation with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3722–3731.
  • Chen et al. (2018) Chao Chen, Zhihong Chen, Boyuan Jiang, and Xinyu Jin. 2018. Joint domain alignment and discriminative feature learning for unsupervised deep domain adaptation. arXiv preprint arXiv:1808.09347 (2018).
  • Csurka (2017) Gabriela Csurka. 2017. Domain adaptation for visual applications: A comprehensive survey. arXiv preprint arXiv:1702.05374 (2017).
  • Deng et al. (2018) Weijian Deng, Liang Zheng, Qixiang Ye, Guoliang Kang, Yi Yang, and Jianbin Jiao. 2018. Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 994–1003.
  • Ganin et al. (2016) Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. 2016. Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17, 1 (2016), 2096–2030.
  • Hoffman et al. (2018) Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei Efros, and Trevor Darrell. 2018. CyCADA: Cycle-Consistent Adversarial Domain Adaptation. In International Conference on Machine Learning. 1994–2003.
  • Huo et al. (2018) Yuankai Huo, Zhoubing Xu, Hyeonsoo Moon, Shunxing Bao, Albert Assad, Tamara K Moyo, Michael R Savona, Richard G Abramson, and Bennett A Landman. 2018. SynSeg-Net: Synthetic Segmentation Without Target Modality Ground Truth. IEEE transactions on medical imaging (2018).
  • Kang et al. (2019) Guoliang Kang, Lu Jiang, Yi Yang, and Alexander G Hauptmann. 2019. Contrastive Adaptation Network for Unsupervised Domain Adaptation. arXiv preprint arXiv:1901.00976 (2019).
  • Li et al. (2018) Shuang Li, Shiji Song, Gao Huang, Zhengming Ding, and Cheng Wu. 2018. Domain invariant and class discriminative feature learning for visual domain adaptation. IEEE Transactions on Image Processing 27, 9 (2018), 4260–4273.
  • Liu et al. (2016) Weiyang Liu, Yandong Wen, Zhiding Yu, and Meng Yang. 2016. Large-Margin Softmax Loss for Convolutional Neural Networks.. In ICML, Vol. 2. 7.
  • Long et al. (2015) Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. 2015. Learning Transferable Features with Deep Adaptation Networks. In International Conference on Machine Learning. 97–105.
  • Long et al. (2017) Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. 2017. Deep Transfer Learning with Joint Adaptation Networks. In International Conference on Machine Learning. 2208–2217.
  • Lu et al. (2017) Hao Lu, Zhiguo Cao, Yang Xiao, and Yanjun Zhu. 2017. Two-dimensional subspace alignment for convolutional activations adaptation. Pattern Recognition 71 (2017), 320–336.
  • Lu et al. (2018) Hao Lu, Chunhua Shen, Zhiguo Cao, Yang Xiao, and Anton van den Hengel. 2018. An embarrassingly simple approach to visual domain adaptation. IEEE Transactions on Image Processing 27, 7 (2018), 3403–3417.
  • Maaten and Hinton (2008) Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research 9, Nov (2008), 2579–2605.
  • Morerio et al. (2018) Pietro Morerio, Jacopo Cavazza, and Vittorio Murino. 2018. Minimal-Entropy Correlation Alignment for Unsupervised Deep Domain Adaptation. In International Conference on Learning Representations.
  • Morerio and Murino (2017) Pietro Morerio and Vittorio Murino. 2017. Correlation Alignment by Riemannian Metric for Domain Adaptation. arXiv preprint arXiv:1705.08180 (2017).
  • Muandet et al. (2012) Krikamol Muandet, Kenji Fukumizu, Francesco Dinuzzo, and Bernhard Schölkopf. 2012. Learning from distributions via support measure machines. In Advances in neural information processing systems. 10–18.
  • Murez et al. (2018) Zak Murez, Soheil Kolouri, David Kriegman, Ravi Ramamoorthi, and Kyungnam Kim. 2018. Image to image translation for domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4500–4509.
  • Rozantsev et al. (2018) Artem Rozantsev, Mathieu Salzmann, and Pascal Fua. 2018. Beyond sharing weights for deep domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2018).
  • Saito et al. (2017) Kuniaki Saito, Yoshitaka Ushiku, and Tatsuya Harada. 2017. Asymmetric tri-training for unsupervised domain adaptation. In Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR.org, 2988–2997.
  • Sankaranarayanan et al. (2018) Swami Sankaranarayanan, Yogesh Balaji, Arpit Jain, Ser Nam Lim, and Rama Chellappa. 2018. Learning from synthetic data: Addressing domain shift for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3752–3761.
  • Shan and Zhang (2016) Hongming Shan and Junping Zhang. 2016. Randomized distribution feature for image classification. In Proceedings of the Twenty-second European Conference on Artificial Intelligence. IOS Press, 426–434.
  • Sun et al. (2016) Baochen Sun, Jiashi Feng, and Kate Saenko. 2016. Return of frustratingly easy domain adaptation.. In AAAI, Vol. 6. 8.
  • Sun and Saenko (2016) Baochen Sun and Kate Saenko. 2016. Deep coral: Correlation alignment for deep domain adaptation. In European Conference on Computer Vision. Springer, 443–450.
  • Tzeng et al. (2017) Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. 2017. Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), Vol. 1. 4.
  • Tzeng et al. (2014) Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. 2014. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474 (2014).
  • Wang et al. (2018) Guanshuo Wang, Yufeng Yuan, Xiong Chen, Jiwei Li, and Xi Zhou. 2018. Learning discriminative features with multiple granularities for person re-identification. In 2018 ACM Multimedia Conference on Multimedia Conference. ACM, 274–282.
  • Wang and Deng (2018) Mei Wang and Weihong Deng. 2018. Deep visual domain adaptation: A survey. Neurocomputing 312 (2018), 135–153.
  • Wen et al. (2016) Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. 2016. A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision. Springer, 499–515.
  • Xie et al. (2018) Shaoan Xie, Zibin Zheng, Liang Chen, and Chuan Chen. 2018. Learning semantic representations for unsupervised domain adaptation. In International Conference on Machine Learning. 5419–5428.
  • Zellinger et al. (2017) Werner Zellinger, Thomas Grubinger, Edwin Lughofer, Thomas Natschläger, and Susanne Saminger-Platz. 2017. Central moment discrepancy (cmd) for domain-invariant representation learning. arXiv preprint arXiv:1702.08811 (2017).
  • Zhang et al. (2018) Yun Zhang, Nianbin Wang, Shaobin Cai, and Lei Song. 2018. Unsupervised Domain Adaptation by Mapped Correlation Alignment. IEEE Access 6 (2018), 44698–44706.
  • Zheng et al. (2018) Yutong Zheng, Dipan K Pal, and Marios Savvides. 2018. Ring loss: Convex feature normalization for face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 5089–5097.
  • Zhu et al. (2017) Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision. 2223–2232.