1 Introduction
Convolutional neural networks (CNNs) have shown promising results on supervised learning tasks. However, the performance of a learned model often degrades severely when dealing with data from other domains. Considering that constantly annotating massive samples from new domains is expensive and impractical, unsupervised domain adaptation (UDA) has emerged as a new learning framework to address this problem [7]. UDA aims to utilize fully-labeled samples in the source domain to annotate the completely-unlabeled target domain samples. Thanks to deep CNNs, recent advances in UDA show satisfactory performance on several computer vision tasks [15]. Among them, most methods bridge the source and target domains by learning domain-invariant features. These dominant methods can be further divided into two categories: (1) learning domain-invariant features by minimizing the discrepancy between feature distributions [25, 36, 26, 46, 4]; (2) encouraging domain confusion by a domain adversarial objective, whereby a discriminator (domain classifier) is trained to distinguish between the source and target representations [10, 37, 34, 15].
From the perspective of moment matching, most of the existing discrepancy-based methods in UDA are based on Maximum Mean Discrepancy (MMD) [26] or Correlation Alignment (CORAL) [36],
which are designed to match the first-order (mean) and second-order (covariance) statistics of different distributions. However, in real-world applications (such as image recognition), the deep features usually follow a complex, non-Gaussian distribution [17, 44], which cannot be completely characterized by its first-order or second-order statistics. Therefore, aligning the second-order or lower statistics only guarantees a coarse-grained alignment of two distributions. To address this limitation, we propose to perform domain alignment by matching the higher-order moment tensors (mainly referring to the third-order and fourth-order moment tensors), which contain more discriminative information and can better represent the feature distribution [30, 12]. Inspired by [30], Fig. 1 illustrates the metrics of higher-order moment tensors, where we plot a cloud of points (consisting of three different Gaussians) and the level sets of moment tensors of different orders. As observed, the higher-order moment tensor characterizes the distribution more accurately.
Our contributions can be summarized as follows: (1) We propose a Higher-order Moment Matching (HoMM) method to minimize the domain discrepancy, which is expected to perform fine-grained domain alignment. HoMM integrates MMD and CORAL into a unified framework and generalizes first-order and second-order moment matching to higher-order moment tensor matching. Without bells and whistles, the third- and fourth-order moment matching outperforms all existing discrepancy-based methods by a large margin. The source code of HoMM is released. (2) Due to the lack of labels in the target domain, we propose to learn discriminative clusters in the target domain by assigning pseudo-labels to reliable target samples, which further improves the transfer performance.
2 Related Work
Learning Domain-Invariant Features To minimize the domain discrepancy and learn domain-invariant features, various distribution discrepancy metrics have been introduced. The representative ones include Maximum Mean Discrepancy (MMD) [14, 38, 25, 26], Correlation Alignment [36, 28, 3, 5] and the Wasserstein distance [20, 6]. MMD was first introduced for the two-sample test problem [14], and is currently the most widely used metric for measuring the distance between two feature distributions. Specifically, Long et al. proposed DAN [25] and JAN [26], which perform domain matching via multi-kernel MMD or a joint MMD criterion in multiple domain-specific layers across domains. Sun et al. proposed correlation alignment (CORAL) [35, 36] to align the second-order statistics of the source and target distributions. Some recent work also extended CORAL to reproducing kernel Hilbert spaces (RKHS) [47] or deployed alignment along geodesics by considering the log-Euclidean distance [28]. Interestingly, [23] theoretically demonstrated that matching the second-order statistics is equivalent to minimizing MMD with the second-order polynomial kernel. Besides, the approach most relevant to our proposal is the Central Moment Discrepancy (CMD) [46]
, which matches the higher-order central moments of probability distributions by means of order-wise moment differences. Both CMD and our HoMM propose to match higher-order statistics for domain alignment: CMD matches the higher-order central moments, while our HoMM matches the higher-order moment tensors. Another fruitful line of work tries to learn the domain-invariant features through adversarial training [10, 37]. These efforts encourage domain confusion by a domain adversarial objective whereby a discriminator (domain classifier) is trained to distinguish between the source and target representations. Also, recent work performing pixel-level adaptation by image-to-image transformation [29, 15] has achieved satisfactory performance and attracted much attention. In this work, we propose a higher-order moment tensor matching approach to minimize the domain discrepancy, which shows great superiority over existing discrepancy-based methods.
Higher-order Statistics
Statistics higher than first-order have been successfully used in many classical and deep learning methods [9, 19, 30, 22, 12]. Especially in the field of fine-grained image/video recognition [24, 8], second-order statistics such as covariance and Gaussian descriptors have demonstrated better performance than descriptors exploiting zeroth- or first-order statistics [22, 24, 40]. However, using second-order or lower statistical information might not be enough when the feature distribution is non-Gaussian [12]. Therefore, higher-order (greater than two) statistics have been explored in many signal processing problems [27, 16, 9, 12]. In the field of Blind Source Separation (BSS) [9, 27], for example, fourth-order statistics are widely used to identify different signals from mixtures. Gou et al. utilize the third-order statistics for person re-identification [12], and Xu et al. exploit the third-order cumulant for blind image quality assessment [44]. In [16, 19], the authors exploit higher-order statistics for image recognition and detection. Matching the second-order statistics cannot always make two distributions indistinguishable, just as second-order statistics cannot identify different signals from underdetermined mixtures [9]. That is why we explore higher-order moment tensors for domain alignment.
Discriminative Clustering
Discriminative clustering is a critical task in the unsupervised and semi-supervised learning paradigms [13, 21, 42, 2]. Due to the paucity of labels in the target domain, obtaining discriminative representations in the target domain is of great importance for UDA tasks. Therefore, a large body of work focuses on learning discriminative clusters in the target domain via entropy minimization [13, 28, 34], pseudo-labels [21, 32, 43] or distance-based metrics [3, 18]. Specifically, Saito et al. [32] assign pseudo-labels to reliable unlabeled samples to learn discriminative representations for the target domain. Shu et al. [34] consider the cluster assumption and minimize the conditional entropy to ensure that the decision boundaries do not cross high-density data regions. MCD [33] also aligns the distributions of source and target by utilizing task-specific decision boundaries. Besides, JDDA [3] and CAN [18] propose to model the intra-class and inter-class domain discrepancies to learn more discriminative features.
3 Method
In this work, we consider the unsupervised domain adaptation problem. Let $\mathcal{D}_s=\{(\mathbf{x}_i^s, y_i^s)\}_{i=1}^{n_s}$ denote the source domain with $n_s$ labeled samples and $\mathcal{D}_t=\{\mathbf{x}_j^t\}_{j=1}^{n_t}$ denote the target domain with $n_t$ unlabeled samples. Given $\mathcal{D}_s$ and $\mathcal{D}_t$, we aim to train a cross-domain CNN classifier $\Theta$ which can minimize the target risk $\epsilon_t=\Pr_{\mathbf{x}\sim\mathcal{D}_t}\left[\Theta(\mathbf{x})\neq y\right]$. Here $\Theta(\mathbf{x})$ denotes the output of the deep neural network, and $\theta$ denotes the model parameters to be learned. Following [26, 3], we adopt the two-stream CNN architecture for unsupervised deep domain adaptation. As shown in Fig. 2, the two streams share the same parameters (tied weights), operating on the source and target domain samples respectively, and we perform the domain alignment in the last fully-connected (FC) layer [36, 3]. According to the theory proposed by Ben-David et al. [1], a basic domain adaptation model should, at least, involve the source domain loss and the domain discrepancy loss, i.e.,

$$\mathcal{L} = \mathcal{L}_s + \lambda_d\,\mathcal{L}_d \qquad (1)$$

$$\mathcal{L}_s = \frac{1}{n_s}\sum_{i=1}^{n_s} J\big(\Theta(\mathbf{x}_i^s),\, y_i^s\big) \qquad (2)$$

where $\mathcal{L}_s$ represents the classification loss in the source domain and $J(\cdot,\cdot)$ represents the cross-entropy loss function.
$\mathcal{L}_d$ represents the domain discrepancy loss and $\lambda_d$ is the trade-off parameter. As aforementioned, most existing discrepancy-based methods are designed to minimize the distance between second-order or lower statistics of different domains. In this work, we propose a higher-order moment matching method, which matches the higher-order statistics of different domains.
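To make this setup concrete, the following is a minimal NumPy sketch of the objective in Eqs. (1)-(2); the `encoder`, `classifier` and `discrepancy` callables are hypothetical placeholders rather than the paper's actual TensorFlow implementation.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross-entropy J over a batch (Eq. 2)."""
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def total_loss(encoder, classifier, xs, ys, xt, discrepancy, lambda_d):
    """Basic adaptation objective L = L_s + lambda_d * L_d (Eq. 1).

    The same encoder/classifier (tied weights) processes both streams;
    the discrepancy is measured on the last FC-layer activations.
    """
    hs = encoder(xs)                 # source activations of the adapted layer
    ht = encoder(xt)                 # target activations (no labels used)
    loss_s = cross_entropy(classifier(hs), ys)
    loss_d = discrepancy(hs, ht)
    return loss_s + lambda_d * loss_d
```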
3.1 Higher-order Moment Matching
To perform fine-grained domain alignment, we propose the higher-order moment matching loss

$$\mathcal{L}_d = \frac{1}{L^p}\left\|\frac{1}{n_s}\sum_{i=1}^{n_s}\otimes^p(\mathbf{h}_i^s) - \frac{1}{n_t}\sum_{j=1}^{n_t}\otimes^p(\mathbf{h}_j^t)\right\|_F^2 \qquad (3)$$

where $n_s=n_t=b$ ($b$ is the batch size) during the training process, and $\mathbf{h}$ denotes the activation outputs of the adapted layer. As illustrated in Fig. 2, $\mathbf{h}_i\in\mathbb{R}^{L}$ denotes the activation outputs of the $i$-th sample, and $L$ is the number of hidden neurons in the adapted layer. Here, $\otimes^p(\mathbf{h})$ denotes the $p$-level tensor power of the vector $\mathbf{h}$. That is,

$$\otimes^p(\mathbf{h}) = \underbrace{\mathbf{h}\otimes\mathbf{h}\otimes\cdots\otimes\mathbf{h}}_{p} \qquad (4)$$

where $\otimes$ denotes the outer product (or tensor product). We have $\otimes^1(\mathbf{h})=\mathbf{h}$ and $\otimes^2(\mathbf{h})=\mathbf{h}\mathbf{h}^\top$. The 2-level tensor power is defined as

$$\left[\otimes^2(\mathbf{h})\right]_{jk} = h_j\, h_k \qquad (5)$$

When $p\geq 3$, $\otimes^p(\mathbf{h})$ is a $p$-level tensor with $L^p$ entries.
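As a reference point, here is a NumPy sketch that computes Eq. (3) literally via the tensor power of Eq. (4). It materializes the full $L^p$ tensor, so it is only feasible for small $L$; the compact variants below must reproduce its semantics. The $1/L^p$ normalization follows Eq. (3).

```python
import numpy as np

def tensor_power(h, p):
    """p-level tensor power of a vector h (Eq. 4): h (x) h (x) ... (x) h."""
    t = h
    for _ in range(p - 1):
        t = np.tensordot(t, h, axes=0)   # each outer product adds one mode
    return t

def homm_loss(Hs, Ht, p):
    """Naive higher-order moment matching (Eq. 3).

    Hs, Ht: (b, L) activation batches of the adapted layer.
    Warning: the moment tensor has L**p entries.
    """
    L = Hs.shape[1]
    Ts = np.mean([tensor_power(h, p) for h in Hs], axis=0)  # source moment tensor
    Tt = np.mean([tensor_power(h, p) for h in Ht], axis=0)  # target moment tensor
    return np.sum((Ts - Tt) ** 2) / L ** p
```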
Instantiations According to Eq. (3), when $p=1$, the first-order moment matching is equivalent to the linear MMD [38], which is expressed as

$$\mathcal{L}_d = \frac{1}{L}\left\|\frac{1}{b}\sum_{i=1}^{b}\mathbf{h}_i^s - \frac{1}{b}\sum_{i=1}^{b}\mathbf{h}_i^t\right\|_2^2 \qquad (6)$$

When $p=2$, the second-order HoMM is formulated as

$$\mathcal{L}_d = \frac{1}{L^2}\left\|\frac{1}{b}\mathbf{H}_s^\top\mathbf{H}_s - \frac{1}{b}\mathbf{H}_t^\top\mathbf{H}_t\right\|_F^2 \qquad (7)$$

where $\mathbf{H}^\top\mathbf{H}$ is the Gram matrix, $\mathbf{H}\in\mathbb{R}^{b\times L}$ stacks the activations row-wise, and $b$ is the batch size. Therefore, the second-order HoMM is equivalent to Gram matrix matching, which is also widely used for cross-domain matching in neural style transfer [11, 23] and knowledge distillation [45]. Li et al. [23] theoretically demonstrate that matching the Gram matrix of feature maps is equivalent to minimizing MMD with the second-order polynomial kernel. Besides, when the activation outputs are normalized by subtracting the mean value, the centralized Gram matrix turns into the covariance matrix. In this respect, the second-order HoMM is also equivalent to CORAL, which matches the covariance matrix for domain alignment [36].
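A quick numerical sanity check (a sketch reusing `homm_loss` from above, under the same normalization) that the $p=1$ and $p=2$ instantiations coincide with Eq. (6) and Eq. (7):

```python
rng = np.random.default_rng(0)
b, L = 64, 8
Hs, Ht = rng.normal(size=(b, L)), rng.normal(size=(b, L))

mmd_linear = np.sum((Hs.mean(0) - Ht.mean(0)) ** 2) / L          # Eq. (6)
gram_match = np.sum((Hs.T @ Hs / b - Ht.T @ Ht / b) ** 2) / L**2  # Eq. (7)

assert np.isclose(homm_loss(Hs, Ht, 1), mmd_linear)
assert np.isclose(homm_loss(Hs, Ht, 2), gram_match)
```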
As illustrated in Fig. 3, in addition to first-order moment matching (e.g., MMD) and second-order moment matching (e.g., CORAL and Gram matrix matching), our proposed HoMM can also perform higher-order moment tensor matching when $p\geq 3$. Since higher-order statistics can better characterize non-Gaussian distributions, applying higher-order moment matching is expected to achieve fine-grained domain alignment. However, the space complexity of the higher-order tensor $\otimes^p(\mathbf{h})$ reaches $O(L^p)$, which makes higher-order moment matching infeasible in many real-world applications. Even adding a bottleneck layer to shrink the width of the adapted layer does not solve the problem: the number of entries in a third-order tensor still reaches $L^3$, and that of a fourth-order tensor reaches $L^4$, which is computationally prohibitive. To address this problem, we propose two practical techniques to perform compact tensor matching.
Group Moment Matching. As the space complexity grows exponentially with the number of neurons $L$, one practical approach is to divide the hidden neurons in the adapted layer into $n_g$ groups, with $L/n_g$ neurons in each group. Then we can calculate and match the high-level tensor within each group respectively (see the sketch below). That is,

$$\mathcal{L}_d = \frac{1}{n_g\,(L/n_g)^p}\sum_{j=1}^{n_g}\left\|\frac{1}{b}\sum_{i=1}^{b}\otimes^p(\mathbf{h}_{i,[j]}^s) - \frac{1}{b}\sum_{i=1}^{b}\otimes^p(\mathbf{h}_{i,[j]}^t)\right\|_F^2 \qquad (8)$$

where $\mathbf{h}_{i,[j]}$ denotes the activation outputs of the $j$-th group for the $i$-th sample. In this way, the space complexity can be reduced from $O(L^p)$ to $O(n_g(L/n_g)^p)$. In practice, the group size must be chosen properly to ensure satisfactory performance.
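A sketch of the grouping strategy, reusing `homm_loss` from the earlier sketch; the per-group averaging mirrors the normalization assumed in Eq. (8):

```python
def group_homm_loss(Hs, Ht, p, n_g):
    """Group moment matching (Eq. 8, sketch): split the L neurons into
    n_g groups and match the p-level tensor within each group."""
    b, L = Hs.shape
    assert L % n_g == 0, "group count must divide L"
    g = L // n_g                               # neurons per group
    loss = 0.0
    for j in range(n_g):
        sl = slice(j * g, (j + 1) * g)
        # homm_loss already divides by g**p, so each group tensor stays small
        loss += homm_loss(Hs[:, sl], Ht[:, sl], p)
    return loss / n_g
```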
Random Sampling Matching. The group moment matching works well for moderate orders, but it tends to fail when $p$ becomes large. Therefore, we also propose a random sampling matching strategy which is able to perform arbitrary-order moment matching. Instead of calculating and matching two high-dimensional tensors, we randomly select $N$ values in the high-level tensor, and only calculate and align these values across the source and target domains. In this respect, the $p$-th order moment matching with the random sampling strategy can be formulated as

$$\mathcal{L}_d = \frac{1}{N}\sum_{i=1}^{N}\Big(\mathcal{T}^s\big[\mathbf{rs}_{i,1},\dots,\mathbf{rs}_{i,p}\big] - \mathcal{T}^t\big[\mathbf{rs}_{i,1},\dots,\mathbf{rs}_{i,p}\big]\Big)^2 \qquad (9)$$

where $\mathcal{T}^s=\frac{1}{b}\sum_{i=1}^{b}\otimes^p(\mathbf{h}_i^s)$, $\mathcal{T}^t=\frac{1}{b}\sum_{i=1}^{b}\otimes^p(\mathbf{h}_i^t)$, and $\mathbf{rs}\in\mathbb{N}^{N\times p}$ denotes the randomly generated position index matrix whose entries lie in $[1, L]$. Therefore, $\mathcal{T}\big[\mathbf{rs}_{i,1},\dots,\mathbf{rs}_{i,p}\big]$ denotes a randomly sampled value in the $p$-level tensor $\mathcal{T}$. With the random sampling strategy, we can perform arbitrary-order moment matching, and the space complexity can be reduced from $O(L^p)$ to $O(N)$. In practice, the model can achieve very competitive results even with a relatively small $N$.
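The key to an efficient implementation is that a single entry of the moment tensor at index $(j_1,\dots,j_p)$ equals $\frac{1}{b}\sum_i \prod_k h_i[j_k]$, so the $N$ sampled entries can be computed by gathering and multiplying columns without ever materializing the tensor. A NumPy sketch, assuming the indices are drawn uniformly:

```python
def sampled_homm_loss(Hs, Ht, p, N, rng=None):
    """Random sampling matching (Eq. 9, sketch)."""
    if rng is None:
        rng = np.random.default_rng(0)
    b, L = Hs.shape
    rs = rng.integers(0, L, size=(N, p))           # random position index matrix
    # gather: (b, N, p) -> product over the p selected coordinates -> (b, N)
    Ts = np.prod(Hs[:, rs], axis=2).mean(axis=0)   # N sampled entries, source
    Tt = np.prod(Ht[:, rs], axis=2).mean(axis=0)   # N sampled entries, target
    return np.mean((Ts - Tt) ** 2)
```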
3.2 Higher-order Moment Matching in RKHS
Similar to KMMD [26], we generalize the higher-order moment matching into reproducing kernel Hilbert spaces (RKHS). That is,

$$\mathcal{L}_d = \left\|\frac{1}{b}\sum_{i=1}^{b}\otimes^p\big(\phi(\mathbf{h}_i^s)\big) - \frac{1}{b}\sum_{i=1}^{b}\otimes^p\big(\phi(\mathbf{h}_i^t)\big)\right\|_{\mathcal{H}}^2 \qquad (10)$$

where $\phi(\mathbf{h}_i^s)$ denotes the feature representation of the $i$-th source sample in the RKHS. According to the proposed random sampling strategy, $\mathcal{T}^s$ and $\mathcal{T}^t$ can be approximated by two $N$-dimensional vectors $\hat{\mathcal{T}}^s$ and $\hat{\mathcal{T}}^t$, where $\hat{\mathcal{T}}^s_i=\mathcal{T}^s\big[\mathbf{rs}_{i,1},\dots,\mathbf{rs}_{i,p}\big]$ and $\hat{\mathcal{T}}^t_i=\mathcal{T}^t\big[\mathbf{rs}_{i,1},\dots,\mathbf{rs}_{i,p}\big]$. In this respect, the domain matching loss can be formulated as

$$\mathcal{L}_d = k\big(\hat{\mathcal{T}}^s,\hat{\mathcal{T}}^s\big) + k\big(\hat{\mathcal{T}}^t,\hat{\mathcal{T}}^t\big) - 2\,k\big(\hat{\mathcal{T}}^s,\hat{\mathcal{T}}^t\big) \qquad (11)$$

where $k(\cdot,\cdot)$ is the RBF kernel function. Particularly, when $p=1$, the kernelized HoMM (KHoMM) is equivalent to KMMD.
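A sketch of Eq. (11) built on the random-sampling approximation above; the bandwidth `gamma` here is a placeholder value, not the tuned setting from the experiments:

```python
def khomm_loss(Hs, Ht, p, N, gamma=1e-4, rng=None):
    """Kernelized HoMM (Eq. 11, sketch): compare the N-dimensional sampled
    moment vectors through an RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    if rng is None:
        rng = np.random.default_rng(0)
    b, L = Hs.shape
    rs = rng.integers(0, L, size=(N, p))
    Ts = np.prod(Hs[:, rs], axis=2).mean(axis=0)
    Tt = np.prod(Ht[:, rs], axis=2).mean(axis=0)
    k = lambda x, y: np.exp(-gamma * np.sum((x - y) ** 2))
    return k(Ts, Ts) + k(Tt, Tt) - 2.0 * k(Ts, Tt)
```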
3.3 Discriminative Clustering
When the target domain features are well aligned with the source domain features, unsupervised domain adaptation turns into a semi-supervised classification problem, where discriminative clustering of the unlabeled data is always encouraged [13, 42]. There has been a lot of work trying to learn discriminative clusters in the target domain [34, 28], most of which minimizes the conditional entropy to ensure that the decision boundaries do not cross high-density data regions,

$$\mathcal{L}_{ent} = -\frac{1}{n_t}\sum_{i=1}^{n_t}\sum_{j=1}^{c} p_{ij}\log p_{ij} \qquad (12)$$

where $c$ is the number of classes and $p_{ij}$ is the softmax output of the $j$-th node in the output layer for the $i$-th target sample. We find that entropy regularization works well when the target domain has high test accuracy, but it helps little or even downgrades the accuracy when the test accuracy is unsatisfactory. The likely reason is that the classifier may be misled, since entropy regularization enforces overconfident probabilities on some misclassified samples. Instead of clustering in the output layer by minimizing the conditional entropy, we propose to cluster in the shared feature space. First, we pick highly confident target samples whose predicted probabilities are greater than a given threshold $\tau$, and assign pseudo-labels to these reliable samples. Then, we penalize the distance of each pseudo-labeled sample to its class center. The discriminative clustering loss can be given as
$$\mathcal{L}_{dc} = \sum_{i}\left\|\mathbf{h}_i^t - \mathbf{c}_{\hat{y}_i}\right\|_2^2 \qquad (13)$$

where $\hat{y}_i$ is the assigned pseudo-label of $\mathbf{x}_i^t$ and $\mathbf{c}_{\hat{y}_i}$ denotes its estimated class center. As we perform updates based on mini-batches, the centers cannot be accurately estimated from such a small number of samples. Therefore, we update the class centers in each iteration via a moving average. That is,

$$\Delta\mathbf{c}_j = \frac{\sum_{i=1}^{b}\delta(\hat{y}_i=j)\,\big(\mathbf{c}_j-\mathbf{h}_i^t\big)}{1+\sum_{i=1}^{b}\delta(\hat{y}_i=j)} \qquad (14)$$

$$\mathbf{c}_j^{t+1} = \mathbf{c}_j^{t} - \alpha\,\Delta\mathbf{c}_j^{t} \qquad (15)$$

where $\alpha$ is the learning rate of the centers, $\mathbf{c}_j^{t}$ is the class center of the $j$-th class at the $t$-th iteration, and $\delta(\hat{y}_i=j)=1$ if $\mathbf{x}_i^t$ belongs to the $j$-th class, otherwise it is 0.
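A sketch of the pseudo-labeling and center update (Eqs. (13)-(15)); `tau` and `eta` are placeholder values, and the in-place update is one reading of the moving-average rule above:

```python
def pseudo_label_and_center_loss(Ht, probs, centers, tau=0.9, eta=0.5):
    """Discriminative clustering sketch.

    Ht: (b, L) target activations; probs: (b, c) softmax outputs;
    centers: (c, L) running class centers, updated in place.
    """
    conf = probs.max(axis=1)
    labels = probs.argmax(axis=1)
    keep = conf > tau                          # reliable samples only
    h, y = Ht[keep], labels[keep]
    loss = np.sum((h - centers[y]) ** 2)       # Eq. (13)
    # moving-average center update (Eqs. 14-15)
    for j in np.unique(y):
        mask = y == j
        delta = (centers[j] - h[mask]).sum(axis=0) / (1 + mask.sum())
        centers[j] = centers[j] - eta * delta
    return loss, centers
```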
3.4 Full Objective Function
Based on the aforementioned analysis, to enable effective unsupervised domain adaptation, we propose a holistic approach that integrates (1) source domain loss minimization, (2) domain alignment via higher-order moment matching, and (3) discriminative clustering in the target domain. The full objective function is as follows,

$$\mathcal{L} = \mathcal{L}_s + \lambda_d\,\mathcal{L}_d + \lambda_{dc}\,\mathcal{L}_{dc} \qquad (16)$$

where $\mathcal{L}_s$ is the classification loss in the source domain, $\mathcal{L}_d$ is the domain discrepancy loss measured by the higher-order moment matching, and $\mathcal{L}_{dc}$ denotes the discriminative clustering loss. Note that, in order to obtain reliable pseudo-labels for discriminative clustering, we set $\lambda_{dc}=0$ during the initial iterations, and enable the clustering loss only after the total loss becomes stable.
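A sketch of this scheduling; the fixed `warmup` iteration count is an assumption standing in for "after the total loss tends to be stable":

```python
def full_objective(loss_s, loss_d, loss_dc, step, warmup, lambda_d, lambda_dc):
    """Full objective (Eq. 16) with the clustering term disabled during
    the initial warm-up iterations, as described above."""
    lam_dc = 0.0 if step < warmup else lambda_dc
    return loss_s + lambda_d * loss_d + lam_dc * loss_dc
```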
4 Experiments
4.1 Setup
Dataset. We conduct experiments on three public visual adaptation datasets: the digits recognition datasets, the Office-31 dataset, and the Office-Home dataset. The digits recognition benchmark includes four widely used datasets: MNIST, USPS, Street View House Numbers (SVHN), and SYN (synthetic digits). We evaluate our proposal across three typical transfer tasks: SVHN→MNIST, USPS→MNIST and SYN→MNIST. The details of these datasets can be found in [3]. Office-31 is another commonly used dataset for real-world domain adaptation scenarios, which contains 31 categories acquired from office environments in three distinct image domains: Amazon (product images downloaded from amazon.com), Webcam (low-resolution images taken by a webcam) and DSLR (high-resolution images taken by a digital SLR camera). The Office-31 dataset contains 4,110 images in total, with 2,817 images in the A domain, 795 images in the W domain and 498 images in the D domain. We evaluate our method on all six transfer tasks, as in [26]. The Office-Home dataset [39] is a more challenging dataset for domain adaptation, which consists of images from four different domains: Artistic images (A), Clip Art images (C), Product images (P) and Real-world images (R). The dataset contains around 15,500 images in total from 65 object categories in office and home scenes.
Baseline Methods. We compare our proposal with the following methods, which are most related to our work: Deep Domain Confusion (DDC) [38], Deep Adaptation Network (DAN) [25], Deep Correlation Alignment (CORAL) [36], Domain-Adversarial Neural Network (DANN) [10], Adversarial Discriminative Domain Adaptation (ADDA) [37], Joint Adaptation Network (JAN) [26], Central Moment Discrepancy (CMD) [46], Cycle-Consistent Adversarial Domain Adaptation (CyCADA) [15], and Joint Discriminative feature Learning and Domain Adaptation (JDDA) [3]. Specifically, DDC, DAN, JAN, CORAL and CMD are representative moment matching based methods, while DANN, ADDA and CyCADA are representative adversarial training based methods.
Table 1: Accuracy (%) on the digits recognition tasks.

Method | SN→MT | US→MT | SYN→MT | Avg
Source Only | 67.3±0.3 | ±0.4 | 89.7±0.2 | 74.5
DDC | 71.9±0.4 | 75.8±0.3 | 89.9±0.2 | 79.2
DAN | 79.5±0.3 | ±0.2 | 75.2±0.1 | 81.5
DANN | 70.6±0.2 | 76.6±0.3 | 90.2±0.2 | 79.1
CMD | 86.5±0.3 | 86.3±0.4 | 96.1±0.2 | 89.6
ADDA | 72.3±0.2 | 92.1±0.2 | 96.3±0.4 | 86.9
CORAL | 89.5±0.2 | 96.5±0.3 | 96.5±0.2 | 94.2
CyCADA | 92.8±0.1 | 97.4±0.3 | 97.5±0.1 | 95.9
JDDA | 94.2±0.1 | 96.7±0.1 | 97.7±0.0 | 96.2
HoMM (p=3) | 96.5±0.2 | 97.8±0.0 | 97.6±0.1 | 97.3
HoMM (p=4) | 95.7±0.2 | 97.6±0.0 | 97.6±0.0 | 96.9
KHoMM (p=3) | 97.2±0.1 | 97.9±0.1 | 98.2±0.1 | 97.8
Full | 98.8±0.1 | 99.0±0.1 | 99.0±0.0 | 98.9
KHoMM+ | 99.0±0.0 | 99.1±0.1 | 99.2±0.0 | 99.1

We denote SVHN, MNIST and USPS as SN, MT and US, respectively.
Table 2: Accuracy (%) on the Office-31 dataset.

Method | A→W | D→W | W→D | A→D | D→A | W→A | Avg
Source Only | 73.1±0.2 | 93.2±0.2 | ±0.1 | 72.6±0.2 | 55.8±0.1 | 56.4±0.3 | 75.0
DDC [38] | 74.4±0.3 | 94.0±0.1 | 98.2±0.1 | 74.6±0.4 | 56.4±0.1 | 56.9±0.1 | 75.8
DAN [25] | 78.3±0.3 | ±0.2 | ±0.1 | 75.2±0.2 | 58.9±0.2 | 64.2±0.3 | 78.5
DANN [10] | 73.6±0.3 | 94.5±0.1 | 99.5±0.1 | 74.4±0.5 | 57.2±0.1 | 60.8±0.2 | 76.7
CORAL [36] | 79.3±0.3 | 94.3±0.2 | 99.4±0.2 | 74.8±0.1 | 56.4±0.2 | 63.4±0.2 | 78.0
JAN [26] | 85.4±0.3 | 97.4±0.2 | 99.8±0.2 | 84.7±0.4 | 68.6±0.3 | 70.0±0.4 | 84.3
CMD [46] | 76.9±0.4 | 94.6±0.3 | 99.2±0.2 | 75.4±0.4 | 56.8±0.1 | 61.9±0.2 | 77.5
CyCADA [15] | 82.2±0.3 | 94.6±0.2 | 99.7±0.1 | 78.7±0.1 | 60.5±0.2 | 67.8±0.2 | 80.6
JDDA [3] | 82.6±0.4 | 95.2±0.2 | 99.7±0.0 | 79.8±0.1 | 57.4±0.0 | 66.7±0.2 | 80.2
HoMM (p=3) | 87.6±0.2 | 96.3±0.1 | 99.8±0.0 | 83.9±0.2 | 66.5±0.1 | 68.5±0.3 | 83.7
HoMM (p=4) | 89.8±0.3 | 97.1±0.1 | 100.0±0.0 | 86.6±0.1 | 69.6±0.3 | 69.7±0.3 | 85.5
KHoMM (p=4) | 90.5±0.2 | 98.3±0.1 | 100.0±0.0 | 87.7±0.2 | 70.4±0.2 | 70.3±0.2 | 86.2
Full | 91.7±0.3 | 98.8±0.0 | 100.0±0.0 | 89.1±0.3 | 71.2±0.2 | 70.6±0.3 | 86.9
KHoMM+ | 90.8±0.1 | 99.3±0.1 | 100.0±0.0 | 87.9±0.2 | 69.3±0.3 | 69.5±0.4 | 86.1
Table 3: Accuracy (%) on five transfer tasks of the Office-Home dataset.

Method | A→P | A→R | C→R | P→R | R→P
Source Only | 50.0 | 58.0 | 46.2 | 60.4 | 59.5
DDC | 54.9 | 61.3 | 50.5 | 64.1 | 65.9
DAN | 57.0 | 67.9 | 60.4 | 67.7 | 74.3
DANN | 59.3 | 70.1 | 60.9 | 68.5 | 76.8
CORAL | 58.6 | 65.4 | 59.8 | 68.3 | 74.7
JAN | 61.2 | 68.9 | 61.0 | 70.3 | 76.8
HoMM (p=3) | 60.7 | 68.3 | 61.4 | 69.2 | 76.7
HoMM (p=4) | 63.5 | 70.2 | 64.6 | 72.6 | 79.3
KHoMM (p=4) | 63.9 | 70.5 | 65.3 | 73.3 | 79.8
Full | 64.7 | 71.8 | 66.1 | 74.5 | 81.2
KHoMM+ | 64.2 | 70.1 | 65.5 | 73.2 | 80.1
Implementation Details. In our experiments on the digits recognition datasets, we utilize a modified LeNet, whereby a bottleneck layer is added before the output layer. Since the image size differs across domains, we resize all the images to the same resolution and convert the RGB images to grayscale. For the experiments on Office-31, we use ResNet-50 pretrained on ImageNet as our backbone network, and we add a bottleneck layer with 180 hidden nodes before the output layer for domain matching. It is worth noting that the ReLU activation function cannot be applied to the adapted layer, as ReLU sets most of the values in the high-level tensor to zero, which would make our HoMM fail. Therefore, we adopt the tanh activation function in the adapted layer. Due to the small sample size of the Office-31 and Office-Home datasets, we only update the weights of the fully-connected (fc) layers as well as the final block (scale5/block3), and freeze the other parameters pretrained on ImageNet. Following the standard protocol of [26], we use all the labeled source domain samples and all the unlabeled target domain samples for training. All the comparison methods are based on the same CNN architecture for a fair comparison. For DDC, DAN, CORAL and CMD, we embed the official implementation code into our model and carefully select the trade-off parameters to get the best results. When training with ADDA, our adversarial discriminator consists of 3 fully-connected layers: two layers with 500 hidden units each, followed by the final discriminator output. For the other compared methods, we report the results from the original papers directly.
Parameters.
Our model is trained with the Adam optimizer in TensorFlow. The optimal hyperparameters are determined by grid search, and they may differ across transfer tasks. For the digits recognition tasks, the trade-off parameter $\lambda_d$ is set to a larger value for the fourth-order HoMM than for the third-order HoMM.¹ Likewise, for the experiments on Office-31 and Office-Home, $\lambda_d$ is set separately for the third-order and fourth-order HoMM. Besides, the bandwidth hyperparameter in the RBF kernel is set to 1e-4 across the experiments; the learning rate $\alpha$ of the centers is set separately for the digits datasets and for the Office-31 and Office-Home datasets. The threshold $\tau$ of the predicted probability is selected by grid search, and the best results are reported. The parameter sensitivity can be seen in Fig. 5.

¹Note that the trade-off on the fourth-order HoMM is much larger than that on the third-order HoMM. This is because most deep features of the digits are very small, so higher-order moments, which compute cumulative products over features, become very close to zero. Therefore, on the digits datasets, the higher the order, the larger the trade-off should be.

4.2 Experimental Results
Digits Dataset For the experiments on the digits recognition datasets, we set the batch size to 128 for each domain and the learning rate to 1e-4 throughout the experiments. Table 1 shows the adaptation performance on three typical transfer tasks based on the modified LeNet. As can be seen, our proposed HoMM yields notable improvements over the comparison methods on all of the transfer tasks. In particular, our method improves the adaptation performance significantly on the hard transfer task SVHN→MNIST. Without bells and whistles, the proposed third-order KHoMM achieves 97.2% accuracy, improving over the second-order moment matching (CORAL) by about 8%. Besides, the results also indicate that the third-order HoMM outperforms the fourth-order HoMM and slightly underperforms the KHoMM.
Office-31 Table 2 lists the test accuracies on the Office-31 dataset. We set the batch size to 70 for each domain. The learning rate of the fc-layer parameters is set to 3e-4 and the learning rate of the conv-layer (scale5/block3) parameters is set to 3e-5. As we can see, the fourth-order HoMM outperforms the third-order HoMM and achieves the best results among all the moment-matching based methods. Besides, it is worth noting that the fourth-order HoMM outperforms the second-order statistics matching (CORAL) by more than 10% on several representative transfer tasks (A→W, A→D and D→A), which demonstrates the merits of our proposed higher-order moment matching.
Office-Home Table 3 gives the results on the challenging Office-Home dataset. The parameter settings are the same as for Office-31. We only evaluate our method on 5 out of the 12 representative transfer tasks due to space limitations. As we can see, on all five transfer tasks, HoMM outperforms DAN, CORAL and DANN by a large margin and also outperforms JAN by 3%-5%. Note that the experimental results of the compared methods are taken directly from [41].
The results in Tables 1, 2 and 3 reveal several interesting observations: (1) All the domain adaptation methods outperform the source-only model by a large margin, which demonstrates that minimizing the domain discrepancy contributes to learning more transferable representations. (2) Our proposed HoMM significantly outperforms the discrepancy-based methods (DDC, CORAL, CMD) and the adversarial training based methods (DANN, ADDA and CyCADA), which reveals the advantages of matching higher-order statistics for domain adaptation. (3) JAN performs slightly better than the third-order HoMM on several transfer tasks, but it is consistently worse than the fourth-order HoMM, in spite of aligning the joint distributions of multiple domain-specific layers across domains. The performance of our HoMM would likely improve as well if we utilized such a strategy. (4) The kernelized HoMM (KHoMM) consistently outperforms the plain HoMM, but the improvement seems limited. We believe the reason is that the higher-order statistics are already high-dimensional features, which conceals the advantages of embedding the features into an RKHS. (5) On all transfer tasks, the performance increases consistently by employing discriminative clustering in the target domain. In contrast, entropy regularization improves the transfer performance when the test accuracy is high, but it helps little or even downgrades the performance when the test accuracy is unsatisfactory.
4.3 Analysis
Feature Visualization We utilize t-SNE to visualize the deep features on the task SVHN→MNIST for the source-only model, KMMD, CORAL, HoMM (p=3) and the Full Loss model. As shown in Fig. 4, the feature distribution of the source-only model in (a) suggests that the domain shift between SVHN and MNIST is significant, which demonstrates the necessity of performing domain adaptation. Besides, the global distributions of the source and target samples are well aligned by KMMD (b) and CORAL (c), but many samples are still misclassified. With our proposed HoMM, the source and target samples are aligned better and the categories are also better discriminated.
First/Second-order versus Higher-order Since our proposed HoMM can perform arbitrary-order moment matching, we compare the performance of moment matching with different orders on three typical transfer tasks. As shown in Table 4, the order $p$ is chosen from $\{1, 2, 3, 4, 5, 6, 10\}$. The results show that the third-order and fourth-order moment matching significantly outperform the other orders: accuracy generally increases with the order up to $p=4$ and decreases as the order grows beyond it. Besides, the fifth-order moment matching also achieves very competitive results. Regarding why the fifth-order and above perform worse than the fourth-order, one reason we believe is that the fifth-order and above moments cannot be accurately estimated due to the small sample size problem [31].
Table 4: Accuracy (%) of HoMM with different orders $p$.

Order $p$ | 1 | 2 | 3 | 4 | 5 | 6 | 10
SN→MT | 71.9 | 89.5 | 96.5 | 95.7 | 94.8 | 91.5 | 58.6
A→W | 74.4 | 79.3 | 87.6 | 89.8 | 86.6 | 85.3 | 80.2
A→P | 54.9 | 58.6 | 60.7 | 63.5 | 60.9 | 58.2 | 57.3

We denote SVHN and MNIST as SN and MT, respectively.
Parameter Sensitivity and Convergence We conduct an empirical parameter sensitivity study on SVHN→MNIST and A→W in Fig. 5(a)-(d). The evaluated parameters include the two trade-off parameters $\lambda_d$ and $\lambda_{dc}$, the number of sampled values $N$ in random sampling matching, and the threshold $\tau$ of the predicted probability. As we can see, our model is quite sensitive to the change of $\lambda_d$, and the bell-shaped curves illustrate the regularization effect of $\lambda_d$ and $\lambda_{dc}$. The convergence behavior is shown in Fig. 5(e), which indicates that our proposal converges fastest compared with the other methods. It is worth noting that the test error of the Full Loss model exhibits an obvious change at the iteration where we enable the clustering loss $\mathcal{L}_{dc}$, which also demonstrates the effectiveness of the proposed discriminative clustering loss.
5 Conclusion
Minimizing the statistical distance between source and target distributions is an important line of work for domain adaptation. Unlike previous methods that utilize second-order or lower statistics for domain alignment, this paper exploits higher-order statistics for domain alignment. Specifically, a higher-order moment matching (HoMM) method has been presented, which integrates MMD and CORAL into a unified framework and generalizes the existing first-order and second-order moment matching to arbitrary-order moment matching. We experimentally demonstrate that third-order and fourth-order moment matching significantly outperform the existing moment matching methods. Besides, we also extend HoMM into RKHS and learn discriminative clusters in the target domain, which further improves the adaptation performance. The proposed HoMM can be easily integrated into other domain adaptation models, and it is also expected to benefit knowledge distillation and image style transfer.
References
 [1] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010.

 [2] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pages 132–149, 2018.
 [3] Chao Chen, Zhihong Chen, Boyuan Jiang, and Xinyu Jin. Joint domain alignment and discriminative feature learning for unsupervised deep domain adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3296–3303, 2019.
 [4] Chao Chen, Zhihang Fu, Zhihong Chen, Zhaowei Cheng, Xinyu Jin, and Xian-Sheng Hua. Towards self-similarity consistency and feature discrimination for unsupervised domain adaptation. arXiv preprint arXiv:1904.06490, 2019.
 [5] Zhihong Chen, Chao Chen, Zhaowei Cheng, Ke Fang, and Xinyu Jin. Selective transfer with reinforced transfer network for partial domain adaptation. arXiv preprint arXiv:1905.10756, 2019.
 [6] Zhihong Chen, Chao Chen, Xinyu Jin, Yifu Liu, and Zhaowei Cheng. Deep joint two-stream Wasserstein auto-encoder and selective attention alignment for unsupervised domain adaptation. Neural Computing and Applications, pages 1–14.
 [7] Gabriela Csurka. Domain adaptation for visual applications: A comprehensive survey. arXiv preprint arXiv:1702.05374, 2017.

 [8] Yin Cui, Feng Zhou, Jiang Wang, Xiao Liu, Yuanqing Lin, and Serge Belongie. Kernel pooling for convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2921–2930, 2017.
 [9] Lieven De Lathauwer, Joséphine Castaing, and Jean-François Cardoso. Fourth-order cumulant-based blind identification of underdetermined mixtures. IEEE Transactions on Signal Processing, 55(6):2965–2973, 2007.
 [10] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
 [11] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2414–2423, 2016.
 [12] Mengran Gou, Octavia Camps, and Mario Sznaier. MoM: Mean of moments feature for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1294–1303, 2017.
 [13] Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems, pages 529–536, 2005.
 [14] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
 [15] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei Efros, and Trevor Darrell. CyCADA: Cycle-consistent adversarial domain adaptation. In International Conference on Machine Learning, pages 1994–2003, 2018.
 [16] Jacek Jakubowski, Krzysztof Kwiatos, Augustyn Chwaleba, and Stanislaw Osowski. Higher order statistics and neural network for tremor recognition. IEEE Transactions on Biomedical Engineering, 49(2):152–159, 2002.
 [17] Yangqing Jia and Trevor Darrell. Heavy-tailed distances for gradient based image descriptors. In Advances in Neural Information Processing Systems, pages 397–405, 2011.
 [18] Guoliang Kang, Lu Jiang, Yi Yang, and Alexander G Hauptmann. Contrastive adaptation network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4893–4902, 2019.
 [19] Piotr Koniusz, Fei Yan, Philippe-Henri Gosselin, and Krystian Mikolajczyk. Higher-order occurrence pooling for bags-of-words: Visual concept detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(2):313–326, 2016.
 [20] Chen-Yu Lee, Tanmay Batra, Mohammad Haris Baig, and Daniel Ulbricht. Sliced Wasserstein discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10285–10295, 2019.
 [21] Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, volume 3, page 2, 2013.
 [22] Peihua Li, Jiangtao Xie, Qilong Wang, and Wangmeng Zuo. Is second-order information helpful for large-scale visual recognition? In Proceedings of the IEEE International Conference on Computer Vision, pages 2070–2078, 2017.
 [23] Yanghao Li, Naiyan Wang, Jiaying Liu, and Xiaodi Hou. Demystifying neural style transfer. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 2230–2236. AAAI Press, 2017.
 [24] Tsung-Yu Lin, Aruni RoyChowdhury, and Subhransu Maji. Bilinear CNN models for fine-grained visual recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 1449–1457, 2015.
 [25] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning, pages 97–105, 2015.

 [26] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Deep transfer learning with joint adaptation networks. In International Conference on Machine Learning, pages 2208–2217, 2017.
 [27] Ali Mansour and Christian Jutten. Fourth-order criteria for blind sources separation. IEEE Transactions on Signal Processing, 43(8):2022–2025, 1995.
 [28] Pietro Morerio, Jacopo Cavazza, and Vittorio Murino. Minimal-entropy correlation alignment for unsupervised deep domain adaptation. In International Conference on Learning Representations, 2018.
 [29] Zak Murez, Soheil Kolouri, David Kriegman, Ravi Ramamoorthi, and Kyungnam Kim. Image to image translation for domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4500–4509, 2018.
 [30] Edouard Pauwels and Jean B Lasserre. Sorting out typicality with the inverse moment matrix sos polynomial. In Advances in Neural Information Processing Systems, pages 190–198, 2016.
 [31] Sarunas J Raudys and Anil K Jain. Small sample size effects in statistical pattern recognition: Recommendations for practitioners. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(3):252–264, 1991.
 [32] Kuniaki Saito, Yoshitaka Ushiku, and Tatsuya Harada. Asymmetric tri-training for unsupervised domain adaptation. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 2988–2997. JMLR.org, 2017.
 [33] Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tatsuya Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3723–3732, 2018.
 [34] Rui Shu, Hung H Bui, Hirokazu Narui, and Stefano Ermon. A DIRT-T approach to unsupervised domain adaptation. arXiv preprint arXiv:1802.08735, 2018.
 [35] Baochen Sun, Jiashi Feng, and Kate Saenko. Return of frustratingly easy domain adaptation. In AAAI, volume 6, page 8, 2016.
 [36] Baochen Sun and Kate Saenko. Deep CORAL: Correlation alignment for deep domain adaptation. In European Conference on Computer Vision, pages 443–450. Springer, 2016.
 [37] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), volume 1, page 4, 2017.
 [38] Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.
 [39] Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. Deep hashing network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5018–5027, 2017.
 [40] Qilong Wang, Peihua Li, and Lei Zhang. G2DeNet: Global Gaussian distribution embedding network and its application to visual recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2730–2739, 2017.
 [41] Ximei Wang, Liang Li, Weirui Ye, Mingsheng Long, and Jianmin Wang. Transferable attention for domain adaptation. In AAAI Conference on Artificial Intelligence (AAAI), 2019.

 [42] Junyuan Xie, Ross Girshick, and Ali Farhadi. Unsupervised deep embedding for clustering analysis. In International Conference on Machine Learning, pages 478–487, 2016.
 [43] Shaoan Xie, Zibin Zheng, Liang Chen, and Chuan Chen. Learning semantic representations for unsupervised domain adaptation. In International Conference on Machine Learning, pages 5419–5428, 2018.
 [44] Jingtao Xu, Peng Ye, Qiaohong Li, Haiqing Du, Yong Liu, and David Doermann. Blind image quality assessment based on high order statistics aggregation. IEEE Transactions on Image Processing, 25(9):4444–4457, 2016.
 [45] Junho Yim, Donggyu Joo, Jihoon Bae, and Junmo Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4133–4141, 2017.
 [46] Werner Zellinger, Thomas Grubinger, Edwin Lughofer, Thomas Natschläger, and Susanne Saminger-Platz. Central moment discrepancy (CMD) for domain-invariant representation learning. arXiv preprint arXiv:1702.08811, 2017.
 [47] Zhen Zhang, Mianzhi Wang, Yan Huang, and Arye Nehorai. Aligning infinite-dimensional covariance matrices in reproducing kernel Hilbert spaces for domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3437–3445, 2018.