HoMM: Higher-order Moment Matching for Unsupervised Domain Adaptation

12/27/2019 ∙ by Chao Chen, et al. ∙ Zhejiang University

Minimizing the discrepancy of feature distributions between different domains is one of the most promising directions in unsupervised domain adaptation. From the perspective of distribution matching, most existing discrepancy-based methods are designed to match second-order or lower statistics, which, however, have limited expressive power for non-Gaussian distributions. In this work, we explore the benefits of using higher-order statistics (mainly third-order and fourth-order statistics) for domain matching. We propose a Higher-order Moment Matching (HoMM) method, and further extend HoMM into reproducing kernel Hilbert spaces (RKHS). In particular, our proposed HoMM can perform arbitrary-order moment tensor matching; we show that the first-order HoMM is equivalent to Maximum Mean Discrepancy (MMD) and the second-order HoMM is equivalent to Correlation Alignment (CORAL). Moreover, third-order and fourth-order moment tensor matching are expected to perform comprehensive domain alignment, as higher-order statistics can approximate more complex, non-Gaussian distributions. Besides, we also exploit pseudo-labeled target samples to learn discriminative representations in the target domain, which further improves the transfer performance. Extensive experiments are conducted, showing that our proposed HoMM consistently outperforms the existing moment matching methods by a large margin. Code is available at <https://github.com/chenchao666/HoMM-Master>


1 Introduction

Convolutional neural networks (CNNs) have shown promising results on supervised learning tasks. However, the performance of a learned model always degrades severely when dealing with data from other domains. Considering that constantly annotating massive samples from new domains is expensive and impractical, unsupervised domain adaptation (UDA) has emerged as a learning framework to address this problem [7]. UDA aims to utilize fully-labeled samples in the source domain to annotate the completely-unlabeled target domain samples. Thanks to deep CNNs, recent advances in UDA show satisfactory performance on several computer vision tasks [15]. Among them, most methods bridge the source and target domains by learning domain-invariant features. These dominant methods can be further divided into two categories: (1) learning domain-invariant features by minimizing the discrepancy between feature distributions [25, 36, 26, 46, 4]; (2) encouraging domain confusion via a domain adversarial objective, whereby a discriminator (domain classifier) is trained to distinguish between the source and target representations [10, 37, 34, 15].

Figure 1: The merits of using higher-order moment tensors for domain alignment. 300 points in $\mathbb{R}^2$ and the level sets of moment tensors of different orders. As observed, using a higher-order moment tensor captures the shape of the cloud of samples more accurately.

From the perspective of moment matching, most of the existing discrepancy-based methods in UDA are based on Maximum Mean Discrepancy (MMD) [26] or Correlation Alignment (CORAL) [36], which are designed to match the first-order (mean) and second-order (covariance) statistics of different distributions. However, for real-world applications (such as image recognition), the deep features always follow a complex, non-Gaussian distribution [17, 44], which cannot be completely characterized by its first-order or second-order statistics. Therefore, aligning the second-order or lower statistics only guarantees coarse-grained alignment of two distributions. To address this limitation, we propose to perform domain alignment by matching the higher-order moment tensors (mainly the third-order and fourth-order moment tensors), which contain more discriminative information and can better represent the feature distribution [30, 12]. Inspired by [30], Fig. 1 illustrates the merits of the higher-order moment tensor, where we plot a cloud of points (consisting of three different Gaussians) and the level sets of moment tensors of different orders. As observed, the higher-order moment tensor characterizes the distribution more accurately.

Our contributions can be summarized as follows: (1) We propose a Higher-order Moment Matching (HoMM) method to minimize the domain discrepancy, which is expected to perform fine-grained domain alignment. HoMM integrates MMD and CORAL into a unified framework and generalizes the first-order and second-order moment matching to higher-order moment tensor matching. Without bells and whistles, the third- and fourth-order moment matching outperform all existing discrepancy-based methods by a large margin. The source code of HoMM is released. (2) Due to the lack of labels in the target domain, we propose to learn discriminative clusters in the target domain by assigning pseudo-labels to the reliable target samples, which also improves the transfer performance.

2 Related Work

Learning Domain-Invariant Features To minimize the domain discrepancy and learn domain-invariant features, various distribution discrepancy metrics have been introduced. The representative ones include Maximum Mean Discrepancy (MMD) [14, 38, 25, 26], Correlation Alignment [36, 28, 3, 5] and the Wasserstein distance [20, 6]. MMD was first introduced for the two-sample test problem [14], and is currently the most widely used metric to measure the distance between two feature distributions. Specifically, Long et al. proposed DAN [25] and JAN [26], which perform domain matching via multi-kernel MMD or a joint MMD criterion in multiple domain-specific layers across domains. Sun et al. proposed correlation alignment (CORAL) [35, 36] to align the second-order statistics of the source and target distributions. Some recent work also extended CORAL into reproducing kernel Hilbert spaces (RKHS) [47] or deployed alignment along geodesics by considering the log-Euclidean distance [28]. Interestingly, [23] theoretically demonstrated that matching the second-order statistics is equivalent to minimizing MMD with the second-order polynomial kernel. Besides, the approach most relevant to our proposal is the Central Moment Discrepancy (CMD) [46], which matches the higher-order central moments of probability distributions by means of order-wise moment differences. Both CMD and our HoMM propose to match higher-order statistics for domain alignment; the CMD matches the higher-order central moments while our HoMM matches the higher-order cumulant tensor. Another fruitful line of work tries to learn domain-invariant features through adversarial training [10, 37]. These efforts encourage domain confusion by a domain adversarial objective whereby a discriminator (domain classifier) is trained to distinguish between the source and target representations. Also, recent work performing pixel-level adaptation by image-to-image transformation [29, 15] has achieved satisfactory performance and obtained much attention. In this work, we propose a higher-order moment tensor matching approach to minimize the domain discrepancy, which shows great superiority over existing discrepancy-based methods.

Higher-order Statistics

Statistics higher than first-order have been successfully used in many classical and deep learning methods [9, 19, 30, 22, 12]. Especially in the field of fine-grained image/video recognition [24, 8], second-order statistics, such as covariance and Gaussian descriptors, have demonstrated better performance than descriptors exploiting zeroth- or first-order statistics [22, 24, 40]. However, using second-order or lower statistical information might not be enough when the feature distribution is non-Gaussian [12]. Therefore, higher-order (greater than two) statistics have been explored in many signal processing problems [27, 16, 9, 12]. In the field of Blind Source Separation (BSS) [9, 27], for example, fourth-order statistics are widely used to identify different signals from mixtures. Gou et al. utilize third-order statistics for person ReID [12], and Xu et al. exploit the third-order cumulant for blind image quality assessment [44]. In [16, 19], the authors exploit higher-order statistics for image recognition and detection. Matching the second-order statistics cannot always make two distributions indistinguishable, just as second-order statistics cannot identify different signals from underdetermined mixtures [9]. That is why we explore higher-order moment tensors for domain alignment.

Discriminative Clustering

Discriminative clustering is a critical task in the unsupervised and semi-supervised learning paradigms

[13, 21, 42, 2]. Due to the paucity of labels in the target domain, how to obtain the discriminative representations in the target domain is of great importance for the UDA tasks. Therefore, a large body of work pays attention to learn the discriminative clusters in the target domain via entropy minimization [13, 28, 34], pseudo label [21, 32, 43] or distance-based metrics [3, 18]. Specifically, Saito et al. [32] assign pseudo-labels to the reliable unlabeled samples to learn discriminative representations for the target domain. Shu et al. [34] consider the cluster assumption and minimize the conditional entropy to ensure the decision boundaries not cross high-density data regions. MCD [33] also considers to align distributions of source and target by utilizing the task-specific decision boundaries. Besides, JDDA [3] and CAN [18] propose to model the intra-class domain discrepancy and the inter-class domain discrepancy to learn more discriminative features.

3 Method

Figure 2: Two-stream CNNs with shared parameters are adopted for unsupervised deep domain adaptation. The first stream operates on the source data and the second stream operates on the target data. The last FC layer (the input of the output layer) is used as the adapted layer.

In this work, we consider the unsupervised domain adaptation problem. Let $\mathcal{D}_s=\{(\mathbf{x}_i^s, y_i^s)\}_{i=1}^{n_s}$ denote the source domain with $n_s$ labeled samples and $\mathcal{D}_t=\{\mathbf{x}_j^t\}_{j=1}^{n_t}$ denote the target domain with $n_t$ unlabeled samples. Given $\mathcal{D}_s$ and $\mathcal{D}_t$, we aim to train a cross-domain CNN classifier which can minimize the target risk $\epsilon_t=\Pr_{(\mathbf{x},y)\sim\mathcal{D}_t}\big[f_\theta(\mathbf{x})\neq y\big]$. Here $f_\theta(\cdot)$ denotes the output of the deep neural network and $\theta$ denotes the model parameters to be learned. Following [26, 3], we adopt a two-stream CNN architecture for unsupervised deep domain adaptation. As shown in Fig. 2, the two streams share the same parameters (tied weights) and operate on the source and target domain samples respectively, and we perform domain alignment in the last fully-connected (FC) layer [36, 3]. According to the theory proposed by Ben-David et al. [1], a basic domain adaptation model should, at least, involve the source domain loss and the domain discrepancy loss, i.e.,

$\mathcal{L} = \mathcal{L}_s + \lambda_d\,\mathcal{L}_d$   (1)
$\mathcal{L}_s = \frac{1}{n_s}\sum_{i=1}^{n_s} J\big(f_\theta(\mathbf{x}_i^s),\, y_i^s\big)$   (2)

where $\mathcal{L}_s$ represents the classification loss in the source domain, $J(\cdot,\cdot)$ represents the cross-entropy loss function, $\mathcal{L}_d$ represents the domain discrepancy loss, and $\lambda_d$ is the trade-off parameter. As aforementioned, most existing discrepancy-based methods are designed to minimize the distance between the second-order or lower statistics of different domains. In this work, we propose a higher-order moment matching method, which matches the higher-order statistics of different domains.

3.1 Higher-order Moment Matching

To perform fine-grained domain alignment, we propose the higher-order moment matching loss as

$\mathcal{L}_d = \frac{1}{L^p}\Big\|\frac{1}{n_s}\sum_{i=1}^{n_s}(\mathbf{h}_i^s)^{\otimes p} - \frac{1}{n_t}\sum_{j=1}^{n_t}(\mathbf{h}_j^t)^{\otimes p}\Big\|_F^2$   (3)

where $n_s=n_t=b$ ($b$ is the batch size) during the training process, and $\mathbf{h}^s$, $\mathbf{h}^t$ denote the activation outputs of the adapted layer in the source and target streams. As illustrated in Fig. 2, $\mathbf{h}_i\in\mathbb{R}^L$ denotes the activation outputs of the $i$-th sample, and $L$ is the number of hidden neurons in the adapted layer. Here, $\mathbf{h}^{\otimes p}$ denotes the $p$-level tensor power of the vector $\mathbf{h}$. That is

$\mathbf{h}^{\otimes p} = \underbrace{\mathbf{h}\otimes\mathbf{h}\otimes\cdots\otimes\mathbf{h}}_{p}$   (4)

where $\otimes$ denotes the outer product (or tensor product). We have $\mathbf{h}^{\otimes 1}=\mathbf{h}$ and $\mathbf{h}^{\otimes 2}=\mathbf{h}\otimes\mathbf{h}$. The 2-level tensor product is defined as

$\mathbf{h}\otimes\mathbf{h} = \mathbf{h}\mathbf{h}^\top$, i.e. $(\mathbf{h}\otimes\mathbf{h})_{ij} = h_i h_j$   (5)

When $p\geq 3$, $\mathbf{h}^{\otimes p}$ is a $p$-level tensor with entries $(\mathbf{h}^{\otimes p})_{i_1 i_2\cdots i_p}=h_{i_1}h_{i_2}\cdots h_{i_p}$.
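As a concrete illustration, here is a minimal NumPy sketch (ours, not part of the released TensorFlow code) of the $p$-level tensor power in Eq. (4), built from repeated outer products; the helper name is an assumption for illustration.

```python
# Minimal NumPy sketch of the p-level tensor power in Eq. (4); helper names are ours.
import numpy as np
from functools import reduce

def tensor_power(h, p):
    """p-level tensor power of a vector h: shape (L,) -> shape (L,) * p."""
    return reduce(np.multiply.outer, [h] * p)

h = np.random.randn(8)
assert np.allclose(tensor_power(h, 2), np.outer(h, h))  # 2-level case of Eq. (5): h h^T
print(tensor_power(h, 3).shape)                          # (8, 8, 8)
```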

Figure 3: An illustration of first-order, second-order and third-order moments in the source domain. HoMM matches the higher-order ($p\geq 3$) moments across different domains.

Instantiations According to Eq. (3), when $p=1$, the first-order moment matching is equivalent to the linear MMD [38], which is expressed as

$\mathcal{L}_d = \frac{1}{L}\Big\|\frac{1}{n_s}\sum_{i=1}^{n_s}\mathbf{h}_i^s - \frac{1}{n_t}\sum_{j=1}^{n_t}\mathbf{h}_j^t\Big\|_2^2$   (6)

When $p=2$, the second-order HoMM is formulated as

$\mathcal{L}_d = \frac{1}{L^2}\Big\|\frac{1}{n_s}(\mathbf{H}^s)^\top\mathbf{H}^s - \frac{1}{n_t}(\mathbf{H}^t)^\top\mathbf{H}^t\Big\|_F^2$   (7)

where $\mathbf{G}=\mathbf{H}^\top\mathbf{H}$ is the Gram matrix, $\mathbf{H}=[\mathbf{h}_1,\mathbf{h}_2,\ldots,\mathbf{h}_b]^\top\in\mathbb{R}^{b\times L}$, and $b$ is the batch size. Therefore, the second-order HoMM is equivalent to Gram matrix matching, which is also widely used for cross-domain matching in neural style transfer [11, 23] and knowledge distillation [45]. Li et al. [23] theoretically demonstrate that matching the Gram matrix of feature maps is equivalent to minimizing the MMD with the second-order polynomial kernel. Besides, when the activation outputs are normalized by subtracting the mean value, the centralized Gram matrix turns into the covariance matrix. In this respect, the second-order HoMM is also equivalent to CORAL, which matches the covariance matrices for domain alignment [36].
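For concreteness, the toy NumPy sketch below (ours, not the released TensorFlow code) implements the plain HoMM loss of Eq. (3) on a small adapted layer and checks that $p=1$ recovers the linear MMD of Eq. (6) and $p=2$ the Gram-matrix matching of Eq. (7); the $1/L^p$ scaling follows the equations above and the official implementation may normalize differently.

```python
# Toy sketch of the plain HoMM loss in Eq. (3); names and scaling are our assumptions.
import numpy as np
from functools import reduce

def homm_loss(Hs, Ht, p):
    """Hs, Ht: (batch, L) activations of the adapted layer; p: moment order."""
    L = Hs.shape[1]
    mean_power = lambda H: np.mean([reduce(np.multiply.outer, [h] * p) for h in H], axis=0)
    return np.sum((mean_power(Hs) - mean_power(Ht)) ** 2) / L ** p

Hs, Ht = np.random.randn(32, 8), np.random.randn(32, 8)
mmd1 = np.sum((Hs.mean(0) - Ht.mean(0)) ** 2) / 8             # linear MMD, Eq. (6)
gram = np.sum((Hs.T @ Hs / 32 - Ht.T @ Ht / 32) ** 2) / 8**2  # Gram matching, Eq. (7)
assert np.isclose(homm_loss(Hs, Ht, 1), mmd1)
assert np.isclose(homm_loss(Hs, Ht, 2), gram)
```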

As illustrated in Fig. 3, in addition to the first-order moment matching (e.g. MMD) and the second-order moment matching (e.g. CORAL and Gram matrix matching), our proposed HoMM can also perform higher-order moment tensor matching when $p\geq 3$. Since higher-order statistics can characterize non-Gaussian distributions better, applying higher-order moment matching is expected to perform fine-grained domain alignment. However, the space complexity of calculating the higher-order tensor ($p\geq 3$) reaches $\mathcal{O}(L^p)$, which makes the higher-order moment matching infeasible in many real-world applications. Adding bottleneck layers to shrink the width of the adapted layer does not fully solve the problem: even for a moderate number of hidden neurons $L$, the dimension of a third-order tensor still reaches $L^3$, and the dimension of a fourth-order tensor reaches $L^4$, which is computationally unfriendly. To address this problem, we propose two practical techniques to perform compact tensor matching.

Group Moment Matching. As the space complexity $\mathcal{O}(L^p)$ grows rapidly with the number of hidden neurons $L$ and the order $p$, one practical approach is to divide the $L$ hidden neurons in the adapted layer into $n_g$ groups, with $L/n_g$ neurons in each group. Then we calculate and match the high-level tensor in each group respectively. That is,

$\mathcal{L}_d = \frac{1}{n_g}\sum_{k=1}^{n_g}\frac{1}{(L/n_g)^p}\Big\|\frac{1}{n_s}\sum_{i=1}^{n_s}(\mathbf{h}_{i,k}^s)^{\otimes p} - \frac{1}{n_t}\sum_{j=1}^{n_t}(\mathbf{h}_{j,k}^t)^{\otimes p}\Big\|_F^2$   (8)

where $\mathbf{h}_{i,k}$ is the activation outputs of the $k$-th group for the $i$-th sample. In this way, the space complexity can be reduced from $\mathcal{O}(L^p)$ to $\mathcal{O}(n_g(L/n_g)^p)$. In practice, the group size $L/n_g$ cannot be made too small if satisfactory performance is to be maintained.
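A hedged NumPy sketch of this grouping idea (Eq. (8)) is given below; the per-group normalization and helper names are our reading of the formulation, not the released code.

```python
# Sketch of Group Moment Matching, Eq. (8): match the p-level tensor per group of units.
import numpy as np
from functools import reduce

def group_homm(Hs, Ht, p, n_groups):
    """Hs, Ht: (batch, L); the L units are split into n_groups equal groups."""
    mean_power = lambda H: np.mean([reduce(np.multiply.outer, [h] * p) for h in H], axis=0)
    d = Hs.shape[1] // n_groups
    loss = 0.0
    for gs, gt in zip(np.split(Hs, n_groups, axis=1), np.split(Ht, n_groups, axis=1)):
        loss += np.sum((mean_power(gs) - mean_power(gt)) ** 2) / d ** p
    return loss / n_groups

Hs, Ht = np.random.randn(32, 90), np.random.randn(32, 90)
print(group_homm(Hs, Ht, p=3, n_groups=30))   # each group holds 3 units -> only 3^3 entries
```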

Random Sampling Matching. The group moment matching works well for moderate orders and group sizes, but it tends to fail when the order $p$ becomes large. Therefore, we also propose a random sampling matching strategy which is able to perform arbitrary-order moment matching. Instead of calculating and matching two high-dimensional tensors, we randomly select $N$ entries of the high-level tensor, and only calculate and align these entries in the source and target domains. In this respect, the $p$-order moment matching with the random sampling strategy can be formulated as

$\mathcal{L}_d = \frac{1}{N}\sum_{n=1}^{N}\Big(\frac{1}{n_s}\sum_{i=1}^{n_s}\prod_{k=1}^{p} h_i^s[\mathrm{ind}(n,k)] - \frac{1}{n_t}\sum_{j=1}^{n_t}\prod_{k=1}^{p} h_j^t[\mathrm{ind}(n,k)]\Big)^2$   (9)

where $\mathrm{ind}\in\{1,2,\ldots,L\}^{N\times p}$ denotes the randomly generated position index matrix. Therefore, $\prod_{k=1}^{p}h_i[\mathrm{ind}(n,k)]$ denotes a randomly sampled entry of the $p$-level tensor $\mathbf{h}_i^{\otimes p}$. With the random sampling strategy, we can perform arbitrary-order moment matching, and the space complexity can be reduced from $\mathcal{O}(L^p)$ to $\mathcal{O}(N)$. In practice, the model can achieve very competitive results even with a relatively small $N$.
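The sketch below (ours, with an assumed mean-squared aggregation over the $N$ sampled entries) illustrates the random sampling strategy of Eq. (9): the same index tuples are drawn once and shared by both domains.

```python
# Sketch of Random Sampling Matching, Eq. (9): align N randomly sampled tensor entries.
import numpy as np

def random_sampling_homm(Hs, Ht, p, N, seed=0):
    """Hs, Ht: (batch, L); N index tuples of length p are drawn once and shared."""
    L = Hs.shape[1]
    ind = np.random.default_rng(seed).integers(0, L, size=(N, p))   # position index matrix
    # Entry (i1,...,ip) of h's p-level tensor power is the product h[i1] * ... * h[ip].
    sampled = lambda H: np.prod(H[:, ind], axis=2).mean(axis=0)     # (batch, N, p) -> (N,)
    return np.mean((sampled(Hs) - sampled(Ht)) ** 2)

Hs, Ht = np.random.randn(32, 128), np.random.randn(32, 128)
print(random_sampling_homm(Hs, Ht, p=4, N=1000))
```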

3.2 Higher-order Moment Matching in RKHS

Similar to the KMMD [26], we generalize the higher-order moment matching into reproducing kernel Hilbert spaces (RKHS). That is,

$\mathcal{L}_d = \frac{1}{L^p}\Big\|\frac{1}{n_s}\sum_{i=1}^{n_s}\phi(\mathbf{h}_i^s)^{\otimes p} - \frac{1}{n_t}\sum_{j=1}^{n_t}\phi(\mathbf{h}_j^t)^{\otimes p}\Big\|_{\mathcal{H}}^2$   (10)

where $\phi(\mathbf{h}_i^s)$ denotes the feature representation of the $i$-th source sample in RKHS. According to the proposed random sampling strategy, the two mean moment tensors in Eq. (10) can be approximated by two $N$-dimensional vectors $\mathbf{v}^s$ and $\mathbf{v}^t$, whose entries are the randomly sampled tensor values defined in Eq. (9). In this respect, the domain matching loss can be formulated as

$\mathcal{L}_d = k(\mathbf{v}^s,\mathbf{v}^s) + k(\mathbf{v}^t,\mathbf{v}^t) - 2\,k(\mathbf{v}^s,\mathbf{v}^t)$   (11)

where $k(\cdot,\cdot)$ is the RBF kernel function, $k(\mathbf{x},\mathbf{y})=\exp(-\gamma\|\mathbf{x}-\mathbf{y}\|^2)$. Particularly, when $p=1$, the kernelized HoMM (KHoMM) reduces to the KMMD.
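A minimal sketch of this kernelized matching is shown below, assuming the MMD-style expansion of Eq. (11) with an RBF kernel; $\mathbf{v}^s$ and $\mathbf{v}^t$ stand for the sampled moment vectors from Eq. (9), and the function names are ours.

```python
# Sketch of the kernelized HoMM, Eq. (11); gamma and the loss form follow the text above.
import numpy as np

def rbf(x, y, gamma=1e-4):
    return np.exp(-gamma * np.sum((x - y) ** 2))

def khomm(vs, vt, gamma=1e-4):
    """vs, vt: N-dimensional sampled moment vectors of the source and target domains."""
    return rbf(vs, vs, gamma) + rbf(vt, vt, gamma) - 2.0 * rbf(vs, vt, gamma)

vs, vt = np.random.randn(1000), np.random.randn(1000)
print(khomm(vs, vt))   # 0 when the two sampled moment vectors coincide
```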

3.3 Discriminative Clustering

When the target domain features are well aligned with the source domain features, unsupervised domain adaptation turns into a semi-supervised classification problem, where discriminative clustering of the unlabeled data is always encouraged [13, 42]. There has been a lot of work trying to learn discriminative clusters in the target domain [34, 28], most of which minimize the conditional entropy to ensure that the decision boundaries do not cross high-density data regions,

$\mathcal{L}_{ent} = -\frac{1}{n_t}\sum_{j=1}^{n_t}\sum_{c=1}^{C} p_c(\mathbf{x}_j^t)\log p_c(\mathbf{x}_j^t)$   (12)

where $C$ is the number of classes and $p_c(\mathbf{x}_j^t)$ is the softmax output of the $c$-th node in the output layer. We find that entropy regularization works well when the target domain already has high test accuracy, but it helps little or even degrades the accuracy when the test accuracy is unsatisfactory. A plausible reason is that entropy regularization enforces over-confident probabilities on some misclassified samples and thereby misleads the classifier. Instead of clustering in the output layer by minimizing the conditional entropy, we propose to cluster in the shared feature space. First, we pick the highly confident target samples whose predicted probabilities are greater than a given threshold $\tau$, and assign pseudo-labels to these reliable samples. Then, we penalize the distance of each pseudo-labeled sample to its class center. The discriminative clustering loss can be given as

$\mathcal{L}_{dc} = \frac{1}{n_l}\sum_{j=1}^{n_l}\big\|\mathbf{h}_j^t - \mathbf{c}_{\hat{y}_j}\big\|_2^2$   (13)

where $n_l$ is the number of pseudo-labeled target samples in a mini-batch, $\hat{y}_j$ is the assigned pseudo-label of $\mathbf{x}_j^t$, and $\mathbf{c}_{\hat{y}_j}$ denotes its estimated class center. As we perform updates based on mini-batches, the centers cannot be accurately estimated from such a small number of samples. Therefore, we update the class centers in each iteration via a moving-average method. That is,

$\Delta\mathbf{c}_k = \frac{\sum_{j=1}^{n_l}\delta(\hat{y}_j = k)\,(\mathbf{c}_k - \mathbf{h}_j^t)}{1+\sum_{j=1}^{n_l}\delta(\hat{y}_j = k)}$   (14)
$\mathbf{c}_k^{t+1} = \mathbf{c}_k^{t} - \alpha\,\Delta\mathbf{c}_k^{t}$   (15)

where $\alpha$ is the learning rate of the centers, $\mathbf{c}_k^t$ is the class center of the $k$-th class in the $t$-th iteration, and $\delta(\hat{y}_j=k)=1$ if $\mathbf{x}_j^t$ belongs to the $k$-th class and 0 otherwise.
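The sketch below (variable names and the simplified center update are ours) shows one mini-batch of the discriminative clustering step: pseudo-labeling confident target samples, computing the center-distance loss of Eq. (13), and nudging the centers with a moving average in the spirit of Eqs. (14)-(15).

```python
# Sketch of the discriminative clustering step; the center update is a simplified
# moving average toward the batch mean, in the spirit of Eqs. (14)-(15).
import numpy as np

def clustering_step(H_t, probs, centers, tau=0.9, alpha=0.3):
    """H_t: (batch, L) target features; probs: (batch, C) softmax outputs;
    centers: (C, L) running class centers. Returns (loss, updated centers)."""
    labels = probs.argmax(axis=1)
    mask = probs.max(axis=1) > tau                      # keep only reliable samples
    loss, new_centers = 0.0, centers.copy()
    if mask.any():
        loss = np.mean(np.sum((H_t[mask] - centers[labels[mask]]) ** 2, axis=1))  # Eq. (13)
        for k in np.unique(labels[mask]):
            batch_mean = H_t[mask][labels[mask] == k].mean(axis=0)
            new_centers[k] = (1 - alpha) * centers[k] + alpha * batch_mean
    return loss, new_centers

H_t = np.random.randn(64, 90)
probs = np.random.dirichlet(np.ones(10), size=64)
loss, centers = clustering_step(H_t, probs, centers=np.zeros((10, 90)))
```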

3.4 Full Objective Function

Based on the aforementioned analysis, to enable effective unsupervised domain adaptation, we propose a holistic approach with an integration of (1) source domain loss minimization, (2) domain alignment with the higher-order moment matching and (3) discriminative clustering in the target domain. The full objective function is as follows,

$\mathcal{L} = \mathcal{L}_s + \lambda_d\,\mathcal{L}_d + \lambda_{dc}\,\mathcal{L}_{dc}$   (16)

where $\mathcal{L}_s$ is the classification loss in the source domain, $\mathcal{L}_d$ is the domain discrepancy loss measured by the higher-order moment matching, and $\mathcal{L}_{dc}$ denotes the discriminative clustering loss. Note that, in order to obtain reliable pseudo-labels for discriminative clustering, we set $\lambda_{dc}=0$ during the initial iterations, and enable the clustering loss after the total loss tends to be stable.
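A small sketch of how the three terms could be combined per training step is shown below; the trade-off values and the warm-up length are illustrative placeholders, not the tuned values used in the paper.

```python
# Sketch of the full objective in Eq. (16) with a warm-up before the clustering loss.
def total_loss(loss_src, loss_homm, loss_cluster, step,
               lambda_d=1.0, lambda_dc=0.1, warmup_steps=2000):
    dc_weight = lambda_dc if step > warmup_steps else 0.0   # enable clustering late
    return loss_src + lambda_d * loss_homm + dc_weight * loss_cluster
```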

4 Experiments

4.1 Setup

Dataset. We conduct experiments on three public visual adaptation datasets: the digits recognition dataset, the Office-31 dataset, and the Office-Home dataset. The digits recognition dataset includes four widely used benchmarks: MNIST, USPS, Street View House Numbers (SVHN), and SYN (synthetic digits). We evaluate our proposal across three typical transfer tasks: SVHN→MNIST, USPS→MNIST and SYN→MNIST. The details of this dataset can be found in [3]. Office-31 is another commonly used dataset for real-world domain adaptation scenarios, which contains 31 categories acquired from the office environment in three distinct image domains: Amazon (product images downloaded from amazon.com), Webcam (low-resolution images taken by a webcam) and Dslr (high-resolution images taken by a digital SLR camera). The Office-31 dataset contains 4110 images in total, with 2817 images in the A domain, 795 images in the W domain and 498 images in the D domain. We evaluate our method on all six transfer tasks, as in [26]. The Office-Home dataset [39] is a more challenging dataset for domain adaptation, which consists of images from four different domains: Artistic images (A), Clip Art images (C), Product images (P) and Real-world images (R). The dataset contains around 15500 images in total from 65 object categories in office and home scenes.

Baseline Methods. We compare our proposal with the following methods, which are most related to our work: Deep Domain Confusion (DDC) [38], Deep Adaptation Network (DAN) [25], Deep Correlation Alignment (CORAL) [36], Domain-adversarial Neural Network (DANN) [10], Adversarial Discriminative Domain Adaptation (ADDA) [37], Joint Adaptation Network (JAN) [26], Central Moment Discrepancy (CMD) [46], Cycle-consistent Adversarial Domain Adaptation (CyCADA) [15], and Joint Discriminative feature Learning and Domain Adaptation (JDDA) [3]. Specifically, DDC, DAN, JAN, CORAL and CMD are representative moment-matching based methods, while DANN, ADDA and CyCADA are representative adversarial-training based methods.

Method SN→MT US→MT SYN→MT Avg
Source Only 67.3±0.3 – 89.7±0.2 74.5
DDC 71.9±0.4 75.8±0.3 89.9±0.2 79.2
DAN 79.5±0.3 – 75.2±0.1 81.5
DANN 70.6±0.2 76.6±0.3 90.2±0.2 79.1
CMD 86.5±0.3 86.3±0.4 96.1±0.2 89.6
ADDA 72.3±0.2 92.1±0.2 96.3±0.4 86.9
CORAL 89.5±0.2 96.5±0.3 96.5±0.2 94.2
CyCADA 92.8±0.1 97.4±0.3 97.5±0.1 95.9
JDDA 94.2±0.1 96.7±0.1 97.7±0.0 96.2
HoMM(p=3) 96.5±0.2 97.8±0.0 97.6±0.1 97.3
HoMM(p=4) 95.7±0.2 97.6±0.0 97.6±0.0 96.9
KHoMM(p=3) 97.2±0.1 97.9±0.1 98.2±0.1 97.8
Full 98.8±0.1 99.0±0.1 99.0±0.0 98.9
KHoMM+ 99.0±0.0 99.1±0.1 99.2±0.0 99.1

We denote SVHN, MNIST and USPS as SN, MT and US respectively. Entries marked "–" could not be recovered from the extracted text.

Table 1: Test accuracy (%) on digits recognition dataset for unsupervised domain adaptation based on modified LeNet
Method A→W D→W W→D A→D D→A W→A Avg
Source Only 73.1±0.2 93.2±0.2 – 72.6±0.2 55.8±0.1 56.4±0.3 75.0
DDC [38] 74.4±0.3 94.0±0.1 98.2±0.1 74.6±0.4 56.4±0.1 56.9±0.1 75.8
DAN [25] 78.3±0.3 – – 75.2±0.2 58.9±0.2 64.2±0.3 78.5
DANN [10] 73.6±0.3 94.5±0.1 99.5±0.1 74.4±0.5 57.2±0.1 60.8±0.2 76.7
CORAL [36] 79.3±0.3 94.3±0.2 99.4±0.2 74.8±0.1 56.4±0.2 63.4±0.2 78.0
JAN [26] 85.4±0.3 97.4±0.2 99.8±0.2 84.7±0.4 68.6±0.3 70.0±0.4 84.3
CMD [46] 76.9±0.4 94.6±0.3 99.2±0.2 75.4±0.4 56.8±0.1 61.9±0.2 77.5
CyCADA [15] 82.2±0.3 94.6±0.2 99.7±0.1 78.7±0.1 60.5±0.2 67.8±0.2 80.6
JDDA [3] 82.6±0.4 95.2±0.2 99.7±0.0 79.8±0.1 57.4±0.0 66.7±0.2 80.2
HoMM(p=3) 87.6±0.2 96.3±0.1 99.8±0.0 83.9±0.2 66.5±0.1 68.5±0.3 83.7
HoMM(p=4) 89.8±0.3 97.1±0.1 100.0±0.0 86.6±0.1 69.6±0.3 69.7±0.3 85.5
KHoMM(p=4) 90.5±0.2 98.3±0.1 100.0±0.0 87.7±0.2 70.4±0.2 70.3±0.2 86.2
Full 91.7±0.3 98.8±0.0 100.0±0.0 89.1±0.3 71.2±0.2 70.6±0.3 86.9
KHoMM+ 90.8±0.1 99.3±0.1 100.0±0.0 87.9±0.2 69.3±0.3 69.5±0.4 86.1
Table 2: Test accuracy (%) on Office-31 dataset for unsupervised domain adaptation based on ResNet-50
Method A→P A→R C→R P→R R→P
Source Only 50.0 58.0 46.2 60.4 59.5
DDC 54.9 61.3 50.5 64.1 65.9
DAN 57.0 67.9 60.4 67.7 74.3
DANN 59.3 70.1 60.9 68.5 76.8
CORAL 58.6 65.4 59.8 68.3 74.7
JAN 61.2 68.9 61.0 70.3 76.8
HoMM(p=3) 60.7 68.3 61.4 69.2 76.7
HoMM(p=4) 63.5 70.2 64.6 72.6 79.3
KHoMM(p=4) 63.9 70.5 65.3 73.3 79.8
Full 64.7 71.8 66.1 74.5 81.2
KHoMM+ 64.2 70.1 65.5 73.2 80.1
Table 3: Test accuracy (%) on Office-Home dataset for unsupervised domain adaptation based on ResNet-50

Implementation Details. In our experiments on the digits recognition dataset, we utilize a modified LeNet whereby a bottleneck layer is added before the output layer. Since the image size differs across domains, we resize all the images to a common size and convert the RGB images to grayscale. For the experiments on Office-31, we use ResNet-50 pretrained on ImageNet as our backbone network, and we add a bottleneck layer with 180 hidden nodes before the output layer for domain matching. It is worth noting that the relu activation function cannot be applied to the adapted layer, as relu sets most of the values in the high-level tensor to zero, which makes our HoMM fail. Therefore, we adopt the tanh activation function in the adapted layer. Due to the small sample size of the Office-31 and Office-Home datasets, we only update the weights of the fully-connected (fc) layers as well as the final block (scale5/block3), and fix the other parameters pretrained on ImageNet. Following the standard protocol of [26], we use all the labeled source domain samples and all the unlabeled target domain samples for training. All the comparison methods are based on the same CNN architecture for a fair comparison. For DDC, DAN, CORAL and CMD, we embed the official implementation code into our model and carefully select the trade-off parameters to get the best results. When training with ADDA, our adversarial discriminator consists of 3 fully connected layers: two layers with 500 hidden units followed by the final discriminator output. For the other compared methods, we report the results from the original papers directly.
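A quick NumPy check of the relu-versus-tanh remark above (illustrative only, not part of the released code): with relu roughly half the activations are exactly zero, so most entries of a third-order tensor power vanish, whereas tanh keeps them non-zero.

```python
# Tiny check: relu zeroes out most entries of the higher-order tensor, tanh does not.
import numpy as np

h = np.random.default_rng(0).standard_normal(64)
for name, v in [("relu", np.maximum(h, 0)), ("tanh", np.tanh(h))]:
    t3 = np.einsum('i,j,k->ijk', v, v, v)       # third-order tensor power
    print(name, "fraction of zero entries:", np.mean(t3 == 0))
```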

Parameters.

Our model is trained with the Adam optimizer in TensorFlow. The optimal hyper-parameters are determined by running multiple experiments with a grid search strategy, and they may differ across transfer tasks. Specifically, the trade-off parameters $\lambda_d$ and $\lambda_{dc}$ are selected by grid search. For the digits recognition tasks, the trade-off $\lambda_d$ of the fourth-order HoMM is set much larger than that of the third-order HoMM; this is because most deep features of the digits are very small, so higher-order moments, which involve the cumulative multiplication of feature values, become very close to zero, and hence on the digits dataset the higher the order, the larger the trade-off should be. For the experiments on Office-31 and Office-Home, $\lambda_d$ is likewise set separately for the third-order and the fourth-order HoMM. Besides, the hyper-parameter $\gamma$ in the RBF kernel is set to 1e-4 across the experiments, and the learning rate $\alpha$ of the centers is set separately for the digits dataset and for the Office-31 and Office-Home datasets. The threshold $\tau$ of the predicted probability is chosen from a small candidate set, and the best results are reported. The parameter sensitivity can be seen in Fig. 5.

4.2 Experimental results

Digits Dataset For the experiments on the digits recognition dataset, we set the batch size to 128 for each domain and set the learning rate to 1e-4 throughout the experiments. Table 1 shows the adaptation performance on three typical transfer tasks based on the modified LeNet. As can be seen, our proposed HoMM yields notable improvements over the comparison methods on all of the transfer tasks. In particular, our method improves the adaptation performance significantly on the hard transfer task SVHN→MNIST. Without bells and whistles, the proposed third-order KHoMM achieves 97.2% accuracy, improving over the second-order moment matching (CORAL) by about 8%. Besides, the results also indicate that the third-order HoMM outperforms the fourth-order HoMM and slightly underperforms the KHoMM.

Office-31 Table 2 lists the test accuracies on the Office-31 dataset. We set the batch size to 70 for each domain. The learning rate of the fc-layer parameters is set to 3e-4 and the learning rate of the conv-layer (scale5/block3) parameters is set to 3e-5. As we can see, the fourth-order HoMM outperforms the third-order HoMM and achieves the best results among all the moment-matching based methods. Besides, it is worth noting that the fourth-order HoMM outperforms the second-order statistics matching (CORAL) by more than 10% on several representative transfer tasks, A→W, A→D and D→A, which demonstrates the merits of our proposed higher-order moment matching.

Office-Home Table 3 gives the results on the more challenging Office-Home dataset. The parameter settings are the same as for Office-31. We only evaluate our method on 5 out of the 12 representative transfer tasks due to space limitations. As we can see, on all five transfer tasks, the HoMM outperforms DAN, CORAL and DANN by a large margin and also outperforms JAN by 3%-5%. Note that the experimental results of the compared methods are taken directly from [41].

The results in Table 1, Table 2 and Table 3 reveal several interesting observations: (1) All the domain adaptation methods outperform the source-only model by a large margin, which demonstrates that minimizing the domain discrepancy contributes to learning more transferable representations. (2) Our proposed HoMM significantly outperforms the discrepancy-based methods (DDC, CORAL, CMD) and the adversarial-training based methods (DANN, ADDA and CyCADA), which reveals the advantages of matching the higher-order statistics for domain adaptation. (3) JAN performs slightly better than the third-order HoMM on several transfer tasks, but it is consistently not as good as the fourth-order HoMM, in spite of aligning the joint distributions of multiple domain-specific layers across domains. The performance of our HoMM would be improved as well if we adopted such a strategy. (4) The kernelized HoMM (KHoMM) consistently outperforms the plain HoMM, but the improvement seems limited. We believe the reason is that the higher-order statistics are already high-dimensional features, which conceals the advantage of further embedding them into RKHS. (5) In all transfer tasks, the performance increases consistently when employing discriminative clustering in the target domain. In contrast, entropy regularization improves the transfer performance when the test accuracy is high, but it helps little or even degrades the performance when the test accuracy is low.

Figure 4: 2D visualization of the deep features generated by different models on SVHN→MNIST; panels (a)-(e) and (f)-(j) correspond to Source Only, KMMD, CORAL, HoMM(p=3) and Full Loss. The first row illustrates the t-SNE embedding of deep features marked by category information, where each color represents a category. The second row illustrates the t-SNE embedding of deep features marked by domain information, where red and blue points represent samples drawn from the source and target domains.
Figure 5: Analysis of parameter sensitivity (a)-(d) and convergence analysis (e). The dashed lines in (b) and (d) indicate the performance of HoMM without the clustering loss.

4.3 Analysis

Feature Visualization We utilize t-SNE to visualize the deep features on the task SVHN→MNIST for the source-only model, KMMD, CORAL, HoMM(p=3) and the Full Loss model. As shown in Fig. 4, the feature distribution of the source-only model in (a) suggests that the domain shift between SVHN and MNIST is significant, which demonstrates the necessity of performing domain adaptation. Besides, the global distributions of the source and target samples are well aligned by KMMD (b) and CORAL (c), but there are still many samples being misclassified. With our proposed HoMM, the source and target samples are aligned better and the categories are discriminated better as well.

First/Second-order versus Higher-order Since our proposed HoMM can perform arbitrary-order moment matching, we compare the performance of different-order moment matching on three typical transfer tasks. As shown in Table 4, the order is chosen from {1, 2, 3, 4, 5, 6, 10}. The results show that the third-order and fourth-order moment matching significantly outperform the other orders: the accuracy first increases with the order and then decreases as the order grows further. Besides, the fifth-order moment matching also achieves very competitive results. Regarding why the fifth-order and above perform worse than the fourth-order, one reason we believe is that the fifth-order and above moments cannot be accurately estimated due to the small sample size problem [31].

order 1 2 3 4 5 6 10
SN→MT 71.9 89.5 96.5 95.7 94.8 91.5 58.6
A→W 74.4 79.3 87.6 89.8 86.6 85.3 80.2
A→P 54.9 58.6 60.7 63.5 60.9 58.2 57.3

We denote SVHN and MNIST as SN and MT respectively.

Table 4: Test accuracy (%) comparison of different-order moment matching on three transfer tasks

Parameter Sensitivity and Convergence We conduct an empirical parameter sensitivity study on SVHN→MNIST and A→W in Fig. 5(a)-(d). The evaluated parameters include the two trade-off parameters $\lambda_d$ and $\lambda_{dc}$, the number of sampled values $N$ in the Random Sampling Matching, and the threshold $\tau$ of the predicted probability. As we can see, our model is quite sensitive to the trade-off parameters, and the bell-shaped curves illustrate their regularization effect. The convergence behavior is provided in Fig. 5(e), which shows that our proposal converges the fastest compared with the other methods. It is worth noting that the test error of the Full Loss model exhibits an obvious change at the iteration where we enable the clustering loss $\mathcal{L}_{dc}$, which also demonstrates the effectiveness of the proposed discriminative clustering loss.

5 Conclusion

Minimizing the statistical distance between the source and target distributions is an important line of work for domain adaptation. Unlike previous methods that utilize the second-order or lower statistics for domain alignment, this paper exploits the higher-order statistics for domain alignment. Specifically, a higher-order moment matching method has been presented, which integrates MMD and CORAL into a unified framework and generalizes the existing first-order and second-order moment matching to arbitrary-order moment matching. We experimentally demonstrate that the third-order and fourth-order moment matching significantly outperform the existing moment matching methods. Besides, we also extend the HoMM into RKHS and learn discriminative clusters in the target domain, which further improves the adaptation performance. The proposed HoMM can be easily integrated into other domain adaptation models, and it is also expected to benefit knowledge distillation and image style transfer.

References

  • [1] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010.
  • [2] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pages 132–149, 2018.
  • [3] Chao Chen, Zhihong Chen, Boyuan Jiang, and Xinyu Jin. Joint domain alignment and discriminative feature learning for unsupervised deep domain adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3296–3303, 2019.
  • [4] Chao Chen, Zhihang Fu, Zhihong Chen, Zhaowei Cheng, Xinyu Jin, and Xian-Sheng Hua. Towards self-similarity consistency and feature discrimination for unsupervised domain adaptation. arXiv preprint arXiv:1904.06490, 2019.
  • [5] Zhihong Chen, Chao Chen, Zhaowei Cheng, Ke Fang, and Xinyu Jin. Selective transfer with reinforced transfer network for partial domain adaptation. arXiv preprint arXiv:1905.10756, 2019.
  • [6] Zhihong Chen, Chao Chen, Xinyu Jin, Yifu Liu, and Zhaowei Cheng. Deep joint two-stream wasserstein auto-encoder and selective attention alignment for unsupervised domain adaptation. Neural Computing and Applications, pages 1–14.
  • [7] Gabriela Csurka. Domain adaptation for visual applications: A comprehensive survey. arXiv preprint arXiv:1702.05374, 2017.
  • [8] Yin Cui, Feng Zhou, Jiang Wang, Xiao Liu, Yuanqing Lin, and Serge Belongie. Kernel pooling for convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2921–2930, 2017.
  • [9] Lieven De Lathauwer, Joséphine Castaing, and Jean-François Cardoso. Fourth-order cumulant-based blind identification of underdetermined mixtures. IEEE Transactions on Signal Processing, 55(6):2965–2973, 2007.
  • [10] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
  • [11] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2414–2423, 2016.
  • [12] Mengran Gou, Octavia Camps, and Mario Sznaier. mom: Mean of moments feature for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1294–1303, 2017.
  • [13] Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In Advances in neural information processing systems, pages 529–536, 2005.
  • [14] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
  • [15] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei Efros, and Trevor Darrell. Cycada: Cycle-consistent adversarial domain adaptation. In International Conference on Machine Learning, pages 1994–2003, 2018.
  • [16] Jacek Jakubowski, Krzystof Kwiatos, Augustyn Chwaleba, and Stanislaw Osowski. Higher order statistics and neural network for tremor recognition. IEEE Transactions on Biomedical Engineering, 49(2):152–159, 2002.
  • [17] Yangqing Jia and Trevor Darrell. Heavy-tailed distances for gradient based image descriptors. In Advances in Neural Information Processing Systems, pages 397–405, 2011.
  • [18] Guoliang Kang, Lu Jiang, Yi Yang, and Alexander G Hauptmann. Contrastive adaptation network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4893–4902, 2019.
  • [19] Piotr Koniusz, Fei Yan, Philippe-Henri Gosselin, and Krystian Mikolajczyk. Higher-order occurrence pooling for bags-of-words: Visual concept detection. IEEE transactions on pattern analysis and machine intelligence, 39(2):313–326, 2016.
  • [20] Chen-Yu Lee, Tanmay Batra, Mohammad Haris Baig, and Daniel Ulbricht. Sliced wasserstein discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10285–10295, 2019.
  • [21] Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, volume 3, page 2, 2013.
  • [22] Peihua Li, Jiangtao Xie, Qilong Wang, and Wangmeng Zuo. Is second-order information helpful for large-scale visual recognition? In Proceedings of the IEEE International Conference on Computer Vision, pages 2070–2078, 2017.
  • [23] Yanghao Li, Naiyan Wang, Jiaying Liu, and Xiaodi Hou. Demystifying neural style transfer. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 2230–2236. AAAI Press, 2017.
  • [24] Tsung-Yu Lin, Aruni RoyChowdhury, and Subhransu Maji. Bilinear cnn models for fine-grained visual recognition. In Proceedings of the IEEE international conference on computer vision, pages 1449–1457, 2015.
  • [25] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning, pages 97–105, 2015.
  • [26] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Deep transfer learning with joint adaptation networks. In International Conference on Machine Learning, pages 2208–2217, 2017.
  • [27] Ali Mansour and Christian Jutten. Fourth-order criteria for blind sources separation. IEEE transactions on signal processing, 43(8):2022–2025, 1995.
  • [28] Pietro Morerio, Jacopo Cavazza, and Vittorio Murino. Minimal-entropy correlation alignment for unsupervised deep domain adaptation. In International Conference on Learning Representations, 2018.
  • [29] Zak Murez, Soheil Kolouri, David Kriegman, Ravi Ramamoorthi, and Kyungnam Kim. Image to image translation for domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4500–4509, 2018.
  • [30] Edouard Pauwels and Jean B Lasserre. Sorting out typicality with the inverse moment matrix sos polynomial. In Advances in Neural Information Processing Systems, pages 190–198, 2016.
  • [31] Sarunas J Raudys and Anil K Jain. Small sample size effects in statistical pattern recognition: Recommendations for practitioners. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(3):252–264, 1991.
  • [32] Kuniaki Saito, Yoshitaka Ushiku, and Tatsuya Harada. Asymmetric tri-training for unsupervised domain adaptation. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2988–2997. JMLR. org, 2017.
  • [33] Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tatsuya Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3723–3732, 2018.
  • [34] Rui Shu, Hung H Bui, Hirokazu Narui, and Stefano Ermon. A dirt-t approach to unsupervised domain adaptation. arXiv preprint arXiv:1802.08735, 2018.
  • [35] Baochen Sun, Jiashi Feng, and Kate Saenko. Return of frustratingly easy domain adaptation. In AAAI, volume 6, page 8, 2016.
  • [36] Baochen Sun and Kate Saenko. Deep coral: Correlation alignment for deep domain adaptation. In European Conference on Computer Vision, pages 443–450. Springer, 2016.
  • [37] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), volume 1, page 4, 2017.
  • [38] Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.
  • [39] Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. Deep hashing network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5018–5027, 2017.
  • [40] Qilong Wang, Peihua Li, and Lei Zhang. G2denet: Global gaussian distribution embedding network and its application to visual recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2730–2739, 2017.
  • [41] Ximei Wang, Liang Li, Weirui Ye, Mingsheng Long, and Jianmin Wang. Transferable attention for domain adaptation. In AAAI Conference on Artificial Intelligence (AAAI), 2019.
  • [42] Junyuan Xie, Ross Girshick, and Ali Farhadi. Unsupervised deep embedding for clustering analysis. In International Conference on Machine Learning, pages 478–487, 2016.
  • [43] Shaoan Xie, Zibin Zheng, Liang Chen, and Chuan Chen. Learning semantic representations for unsupervised domain adaptation. In International Conference on Machine Learning, pages 5419–5428, 2018.
  • [44] Jingtao Xu, Peng Ye, Qiaohong Li, Haiqing Du, Yong Liu, and David Doermann. Blind image quality assessment based on high order statistics aggregation. IEEE Transactions on Image Processing, 25(9):4444–4457, 2016.
  • [45] Junho Yim, Donggyu Joo, Jihoon Bae, and Junmo Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4133–4141, 2017.
  • [46] Werner Zellinger, Thomas Grubinger, Edwin Lughofer, Thomas Natschläger, and Susanne Saminger-Platz. Central moment discrepancy (cmd) for domain-invariant representation learning. arXiv preprint arXiv:1702.08811, 2017.
  • [47] Zhen Zhang, Mianzhi Wang, Yan Huang, and Arye Nehorai. Aligning infinite-dimensional covariance matrices in reproducing kernel hilbert spaces for domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3437–3445, 2018.