In many real-world applications, data are collected under different conditions and thus hardly satisfy the identical-distribution assumption that underlies statistical learning theory. This naturally leads to a crucial issue: a classifier trained on a well-annotated source domain cannot be directly applied to a related but different target domain. To surmount this issue, considerable efforts have been devoted to domain adaptation, an important branch of transfer learning. By now, domain adaptation has become a fundamental technology for cross-domain knowledge discovery and has been applied to various tasks, such as object recognition [2, 3, 4, 5] and person re-identification.
The major issue for domain adaptation is how to reduce the difference in distributions between the source and target domains. Most recent works aim to seek a common feature space where the distribution difference across domains is minimized [8, 9, 10, 11, 12]. To achieve this goal, various metrics have been proposed to measure the distribution discrepancy, among which the Maximum Mean Discrepancy (MMD)
is probably the most widely used one. The typical procedure for MMD based methods includes three key steps in each iteration: 1) projecting the original source and target data to a common feature space; 2) training a standard supervised learning algorithm on the projected source domain; 3) assigning pseudo-labels to target data with the source classifier. Generally, this procedure predicts labels for target samples independently, ignoring the data distribution structure of the two domains, which can be crucial to the pseudo-label assignment of target data.
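To make the discrepancy measure concrete, the following numpy sketch computes the empirical MMD with a linear kernel (the kernel choice, function name, and toy data are ours, not the paper's); in this special case the MMD reduces to the squared distance between the two domain means.

```python
import numpy as np

def linear_mmd(Xs, Xt):
    """Empirical MMD with a linear kernel: reduces to the squared
    Euclidean distance between the two domain means."""
    diff = Xs.mean(axis=0) - Xt.mean(axis=0)
    return float(diff @ diff)

# Samples from identical distributions give (near-)zero MMD; a mean
# shift between domains yields a clearly positive value.
rng = np.random.default_rng(0)
Xs = rng.normal(0.0, 1.0, size=(200, 5))   # "source" features
Xt = rng.normal(1.0, 1.0, size=(200, 5))   # shifted "target" features
```

With a nonlinear (e.g. Gaussian) kernel, MMD also captures higher-order moment differences; the linear case shown here is just the simplest instance.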
To illustrate this more explicitly, a toy example is shown in Fig. 1. The red line is the discriminant hyperplane trained on source data in the projected feature space. As we can see, the hyperplane tends to misclassify the target data due to the distribution discrepancy between the two domains. In such a case, the misclassified samples will seriously mislead the learning of the common feature space in subsequent iterations, and ultimately cause a significant performance drop. Actually, from the perspective of sample distribution in the two domains, the class centroids in the target domain can be readily matched to their corresponding class centroids in the source domain. Motivated by this insight, in this paper, instead of labeling target samples individually, we introduce a novel approach that assigns pseudo-labels to target samples under the guidance of the class centroids of the two domains, such that the data distribution structure of both source and target domains can be emphasized.
To achieve this goal, the first key issue to be handled is how to determine the class centroids of the target domain when labels are absent. For this problem, we resort to the classical k-means clustering algorithm, which has been widely used to partition unlabeled data into several groups in which similar samples are represented by a specific cluster prototype. Intuitively, the cluster prototypes obtained by the k-means algorithm can be regarded as a good approximation of the class centroids of the target domain. After obtaining the cluster prototypes of target data, the distribution discrepancy minimization problem in domain adaptation can be reformulated as a class centroid matching problem, which can be solved efficiently by nearest neighbor search.
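The idea of approximating the unknown target class centroids by cluster prototypes can be sketched with a plain Lloyd's k-means (a minimal hypothetical implementation; the paper does not prescribe this exact code):

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's k-means: the final centroids serve as cluster
    prototypes approximating the unknown target class centroids."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        d = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)  # squared dists
        labels = d.argmin(axis=1)
        # keep a centroid unchanged if its cluster goes empty
        C = np.stack([X[labels == j].mean(0) if np.any(labels == j) else C[j]
                      for j in range(k)])
    return C, labels

# Two well-separated blobs: prototypes land near the true class means.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)), rng.normal(5.0, 0.3, (50, 2))])
C, labels = kmeans(X, 2)
```

On well-separated data like this, the returned prototypes are close to the per-class means, which is exactly the approximation the method relies on.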
Clearly, in the process of cluster prototype learning for target data, the quality of the cluster prototypes is vital to the performance of our approach. Actually, it has been shown that clustering performance can be significantly enhanced if the local manifold structure is exploited [15, 16]. Nevertheless, most existing manifold learning methods highly depend on a predefined similarity matrix built in the original feature space [17, 18]
, and thus may fail to capture the inherent local structure of high-dimensional data due to the curse of dimensionality. To tackle this problem, inspired by the recently proposed adaptive neighbors learning method, we introduce a local structure self-learning strategy into our proposal. Specifically, we learn the data similarity matrix according to the local connectivity in the projected low-dimensional feature space rather than the original high-dimensional space, such that the intrinsic local manifold structure of target data can be captured adaptively.
Based on the above analysis, we propose a novel domain adaptation method that adequately exploits the data distribution structure via joint class Centroid Matching and local Manifold Self-learning (CMMS). It is noteworthy that, more recently, the need for tackling the semi-supervised domain adaptation (SDA) problem is growing, as some labeled target samples may be available in practice [20, 21, 22, 23, 24]. While unsupervised domain adaptation (UDA) methods are well established, most of them cannot be naturally applied to the semi-supervised scenario. Excitingly, the proposed CMMS can be extended to SDA, in both homogeneous and heterogeneous settings, in a direct but elegant way. The flowchart of our proposed CMMS is shown in Fig. 2. The main contributions of this paper are summarized as follows:
We propose a novel domain adaptation method called CMMS, which thoroughly explores the structure information of the data distribution via joint class centroid matching and local manifold self-learning.
We present an efficient optimization algorithm to solve the objective function of the proposal, with theoretical convergence guarantee.
In addition to unsupervised domain adaptation, we further extend our approach to the semi-supervised scenario including both homogeneous and heterogeneous settings.
We conduct extensive evaluation of our method on five benchmark datasets, which validates the superior performance of our method in both unsupervised and semi-supervised manners.
The rest of this paper is organized as follows. Section II reviews related literature. Section III presents our proposed method, the optimization algorithm, and the convergence and complexity analysis. We describe our semi-supervised extension in Section IV. Extensive experimental results are reported in Section V. Finally, we conclude this paper in Section VI.
II Related Work
In this section, we review some previous works closely related to this paper. First, we briefly review the unsupervised domain adaptation methods. Next, related studies of semi-supervised domain adaptation are reviewed. Finally, we introduce some local manifold learning techniques.
II-A Unsupervised Domain Adaptation
Unsupervised domain adaptation handles the scenario where labeled samples are only available in the source domain and the distributions of the source and target domains differ. In the past decades, numerous methods have been proposed to overcome the distribution discrepancy.
Existing UDA methods can be classified into: 1) instance reweighting [25, 26], 2) classifier adaptation [7, 27], and 3) feature adaptation [8, 9, 10, 11]. We refer interested readers to the excellent survey in . Our proposal falls into the third category, i.e., feature adaptation, which addresses domain shift either by searching intermediate subspaces to achieve domain transfer [10, 11] or by learning a common feature space where the source and target domains have similar distributions [8, 9]. In this paper, we focus on the latter line. Among existing works, TCA  is a pioneering approach, which learns a transformation matrix to align the marginal distributions of the two domains via MMD. Later, JDA  considers conditional distribution alignment by forcing the class means to be close to each other. In subsequent research, several works further employ discriminative information to facilitate classification performance. For instance, Li et al.  utilize the discriminative information of the source and target domains by encouraging intra-class compactness and inter-class dispersion. Liang et al.  achieve this goal by promoting class clustering. Despite the promising performance, all the above methods classify target samples independently, which may cause misclassification since the structure information of the data distribution is ignored.
To tackle this issue, several recent works attempt to exploit the data distribution structure via clustering. For example, Liang et al. 
propose to seek a subspace where the target centroids are forced to approach those in the source domain. Inspired by the fact that target samples are well clustered in the deep feature space, Wang et al.  propose a selective pseudo-labeling approach based on structured prediction. Note that the basic framework of our proposal is completely different from theirs. The more recently proposed SPL  is most relevant to our proposal; nevertheless, ours differs significantly. First, subspace learning and clustering structure discovery are treated as two separate steps in SPL, so the learned projection matrix may not be optimal for clustering. Second, SPL ignores the local manifold structure information, which is crucial to exploring the target data structure. By contrast, we integrate the projection matrix learning, the k-means clustering in the projected space, the class centroid matching and the local manifold structure self-learning for target data into a unified optimization objective, so the data distribution structure can be exploited more thoroughly.
II-B Semi-supervised Domain Adaptation
Unlike unsupervised domain adaptation, where no target labels are available, a more common scenario in practice is that the target domain contains a few labeled samples. This scenario leads to a promising research direction referred to as semi-supervised domain adaptation.
According to the properties of the sample features, SDA algorithms are developed under two different settings: 1) the homogeneous setting, i.e., the source and target data are sampled from the same feature space; 2) the heterogeneous setting, i.e., the source and target data often have different feature dimensions. In the homogeneous setting, the labeled target samples are used in various ways. For example, Hoffman et al.  jointly learn the transformation matrix and classifier parameters, forcing source and target samples with identical labels to have high similarity. Similarly, Herath et al.  propose to learn the structure of a Hilbert space to reduce the dissimilarity between labeled samples and further match the source and target domains via second-order statistics. Recently, based on the Fredholm integral, Wang et al.
propose to learn a cross-domain kernel classifier that correctly classifies the labeled target data using a square or hinge loss function. In the heterogeneous setting, relieving the feature discrepancy and reducing the distribution divergence are two inevitable issues. For the first issue, one incredibly simple approach  is to augment each transformed sample to the same size using the original features or zeros. Another natural approach  is to learn two projection matrices, one per domain, to derive a domain-invariant feature subspace. Recently, after employing two matrices to project the source and target data into a common feature space, Li et al.  employ a shared codebook to match the new feature representations on the same bases. For the second issue, one popular solution is to minimize the MMD distance between the source and target domains [24, 33]. Additionally, Tsai et al.
propose a representative landmark selection approach, which is similar to instance reweighting in the UDA scenario. When a limited number of labeled target samples is available, manifold regularization, an effective strategy for semi-supervised learning, has also been employed by several previous works [35, 36].
In contrast to these SDA methods, our semi-supervised extension is quite simple and intuitive. Specifically, the labeled target data are used to improve the cluster prototype learning of the unlabeled target data. Besides, connections between the labeled and unlabeled target data are built, which is a common strategy for developing semi-supervised models. Unlike the homogeneous setting, where a unified projection is learnt, we learn two projection matrices in the heterogeneous setting, like . Notably, the resulting optimization problems in the two settings share the same standard form and can be solved by the algorithm for the UDA scenario with only minor modifications.
II-C Local Manifold Learning
The goal of local manifold learning is to capture the underlying local manifold structure of the given data in the original high-dimensional space and preserve it in the low-dimensional embedding. Generally, local manifold learning methods contain three main steps: 1) selecting neighbors; 2) computing affinity matrix; 3) calculating the low-dimensional embedding.
Locally linear embedding  and Laplacian eigenmaps  are two typical methods. In locally linear embedding, the local manifold structure is captured by linearly reconstructing each sample from its neighbors in the original space, and the reconstruction coefficients are preserved in the low-dimensional space. In Laplacian eigenmaps, the adjacency matrix of the given data is obtained in the original feature space using a Gaussian function. However, the local manifold structure is artificially captured using pairwise distances with a heat kernel, which yields a relatively weak representation since the properties of local neighbors are ignored . Recently, to learn a more reliable adjacency matrix, Nie et al.  propose to assign the neighbors of each sample adaptively based on the Euclidean distances in the low-dimensional space. This strategy has been widely utilized in clustering [39] and feature representation learning .
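For concreteness, the predefined heat-kernel adjacency of the Laplacian-eigenmaps kind can be sketched as follows (function name and the toy data are ours; sigma is a hand-tuned bandwidth, which is precisely the kind of fixed choice the adaptive strategy avoids):

```python
import numpy as np

def heat_kernel_affinity(X, sigma=1.0):
    """Predefined adjacency: W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)),
    computed once in the original feature space with zero diagonal."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
    W = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    return W

# Nearby points get large affinity, distant points nearly zero.
X = np.array([[0.0, 0.0], [0.0, 0.1], [5.0, 5.0]])
W = heat_kernel_affinity(X)
```

Note that W is fixed once computed in the original space; the adaptive scheme discussed next instead re-learns the similarities in the low-dimensional space.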
In domain adaptation, several works have borrowed the advantages of local manifold learning. For example, Long et al.  and Wang et al.  employ manifold regularization to maintain the manifold consistency underlying the marginal distributions of the two domains. Hou et al.  and Li et al.  use label propagation to predict target labels. However, they all calculate the adjacency matrix in the original high-dimensional space with a predefined distance measure, which is unreliable due to the curse of dimensionality. By contrast, our proposal captures and employs the inherent local manifold structure of target data adaptively, and thus leads to superior performance.
III Proposed Method
In this section, we first introduce the notations and basic concepts used throughout this paper. Then, the details of our approach are described. Next, an efficient algorithm is designed to solve the optimization problem of our proposal. Finally, the convergence and complexity analysis of the optimization algorithm are given.
III-A Notations

A domain consists of a feature space and a marginal probability distribution , where . For a specific domain, a task consists of a label space and a labeling function , denoted by . For simplicity, we use subscripts and to denote the source domain and target domain, respectively.
We denote the source domain data as , where is a source sample and is the corresponding label. Similarly, we denote the target domain data as , where . For clarity, we show the key notations used in this paper and the corresponding descriptions in Table I.
|source/target original data|
|number of source/target samples|
|target cluster centroids|
|target label matrix|
|target adjacency matrix|
|identity matrix with dimension |
|dimension of original features|
|dimension of projected features|
|number of shared classes|
|number of source samples in class |
|a matrix of size with all elements as |
|a column vector of size with all elements as |
III-B Problem Formulation
The core idea of our CMMS lies in emphasizing the data distribution structure via class centroid matching of the two domains and local manifold structure self-learning for target data. The overall framework of CMMS is stated by the following formula:
The first term matches the class centroids. is the clustering term for target data in the projected space. captures the data structure information. is the regularization term to avoid overfitting. Hyper-parameters , and balance the influence of the different terms. Next, we introduce these terms in detail.
III-B1 Clustering for Target Data
In our CMMS, we borrow the idea of clustering to obtain cluster prototypes, which can be regarded as pseudo class centroids. In this way, the sample distribution structure of the target data can be acquired. Various existing clustering algorithms are possible candidates for this purpose. Without loss of generality, and for the sake of simplicity, we adopt the classical k-means algorithm to obtain the cluster prototypes in this paper. Thus, we have the following formula:
where is the projection matrix, denotes the cluster centroids of target data, and is the cluster indicator matrix of target data, defined as if the cluster label of is , and otherwise.
III-B2 Class Centroid Matching of Two Domains
Once the cluster prototypes of target data are obtained, we can reformulate the distribution discrepancy minimization problem in domain adaptation as a class centroid matching problem. Note that the class centroids of source data can be computed exactly as the mean of the sample features within each class. In this paper, we solve the class centroid matching problem by nearest neighbor search, since it is simple and efficient. Specifically, we search for the nearest source class centroid of each target cluster centroid, and minimize the sum of distances over all matched centroid pairs. Finally, the class centroid matching of the two domains is formulated as:
where is a constant matrix used to compute the class centroids of source data in the projected space, with each element if , and otherwise.
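The two ingredients of this step, exact source class centroids and nearest-neighbor matching of the target cluster centroids, can be sketched in numpy (function names and toy data are ours; features are assumed to already live in the projected space):

```python
import numpy as np

def source_class_centroids(Zs, ys, classes):
    """Exact source class centroids: per-class means of the projected
    source features."""
    return np.stack([Zs[ys == c].mean(axis=0) for c in classes])

def match_centroids(F, Cs):
    """Nearest-neighbor search: each target cluster centroid (row of F)
    takes the label of its closest source class centroid."""
    d = ((F[:, None, :] - Cs[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

# Toy check: two source classes around 0 and 5; target clusters nearby.
Zs = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.0, 4.8]])
ys = np.array([0, 0, 1, 1])
Cs = source_class_centroids(Zs, ys, [0, 1])
F = np.array([[4.9, 5.1], [0.1, -0.1]])   # target cluster centroids
```

The matching then propagates the label of each source centroid to every target sample in the corresponding cluster, rather than labeling samples one by one.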
III-B3 Local Manifold Structure Self-learning for Target Data
In our proposed CMMS, the cluster prototypes of target samples are in fact approximations of their corresponding class centroids. Hence, the quality of the cluster prototypes plays an important role in the final performance of CMMS. Existing works have shown that clustering performance can be significantly improved by exploiting the local manifold structure. Nevertheless, most of them highly depend on an adjacency matrix predefined in the original feature space, and thus may fail to capture the inherent local manifold structure of high-dimensional data due to the curse of dimensionality. To address this issue, inspired by the recent work , we introduce a local manifold self-learning strategy into CMMS. Instead of predefining the adjacency matrix in the original high-dimensional space, we adaptively learn the data similarity according to the local connectivity in the projected low-dimensional space, such that the intrinsic local manifold structure of target data can be captured. The formulation of local manifold self-learning is as follows:
where is the adjacency matrix of the target domain and is a hyper-parameter. is the corresponding graph Laplacian matrix calculated by , where is a diagonal matrix with each element .
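Assuming the usual symmetrization convention for a learned (possibly asymmetric) adjacency matrix S, the graph Laplacian used in such regularizers can be formed as follows (a generic sketch; the paper's exact symbols are elided above):

```python
import numpy as np

def graph_laplacian(S):
    """Graph Laplacian L = D - (S + S^T)/2 for a possibly asymmetric
    learned adjacency S, with D the diagonal degree matrix of the
    symmetrized weights."""
    W = (S + S.T) / 2.0                 # symmetrize the learned adjacency
    D = np.diag(W.sum(axis=1))          # degree matrix
    return D - W

# A one-directional edge becomes a symmetric weight of 1/2.
S = np.array([[0.0, 1.0], [0.0, 0.0]])
Lg = graph_laplacian(S)
```

By construction every row of the Laplacian sums to zero, which is what makes the trace regularizer penalize differences between connected points.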
The above descriptions have highlighted the main components of our CMMS. Intuitively, a reasonable hypothesis for source data is that samples in the same class should be as close as possible in the projected space, so that the discriminative structure information of the source domain is preserved. As a simple but effective trick, inspired by , we formulate this idea as follows:
where is the trace operator and
The coefficient is used to remove the effect of different class sizes.
Besides, to avoid overfitting and improve the generalization capacity, we further add an -norm regularization term to the projection matrix :
where is an identity matrix of dimension and is the centering matrix defined as . The first constraint in (8) follows common practice in subspace learning . For a simplified format, we reformulate the objective function in (8) as the following standard form:
where , , , and .
III-C Optimization Procedure
According to the objective function of our CMMS in Eq.(9), four variables , , and need to be optimized. Since the problem is not jointly convex in all variables, we update each of them alternately while keeping the others fixed. Specifically, each subproblem is solved as follows:
1. -subproblem: When , and are fixed, the optimization problem (9) becomes:
By setting the derivative of (10) with respect to as 0, we obtain:
2. -subproblem: With the other variables fixed, the resulting problem (12) can be transformed into a generalized eigenvalue problem as follows:
is a diagonal matrix whose elements are Lagrange multipliers. The optimal solution is then obtained by computing the eigenvectors of Eq.(13) corresponding to the smallest eigenvalues.
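The eigenvector computation can be sketched in numpy (a generic routine under the assumption that the left-hand matrix is symmetric and the right-hand one symmetric positive definite; names and the toy matrices are ours):

```python
import numpy as np

def smallest_generalized_eigvecs(A, B, d):
    """Solve A p = lambda B p for symmetric A and symmetric positive
    definite B, returning eigenvectors of the d smallest eigenvalues."""
    L = np.linalg.cholesky(B)                        # B = L L^T
    Linv = np.linalg.inv(L)
    vals, vecs = np.linalg.eigh(Linv @ A @ Linv.T)   # ascending order
    return Linv.T @ vecs[:, :d]                      # map back: p = L^{-T} q

# Toy check: with B = I this is an ordinary symmetric eigenproblem.
A = np.diag([3.0, 1.0, 2.0])
B = np.eye(3)
P = smallest_generalized_eigvecs(A, B, 2)
```

The Cholesky reduction to a standard symmetric eigenproblem is one common way to solve such pairs; a library routine like scipy.linalg.eigh(A, B) would serve equally well.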
3. -subproblem: In variable , only needs to be updated. With , and fixed, the optimization problem with regard to is equal to minimizing Eq.(2). As in k-means clustering, we can solve it by assigning each target sample to its nearest cluster centroid. To this end, we have:
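The assignment rule just described is the standard nearest-centroid step; a numpy sketch (our own names; the toy data are illustrative):

```python
import numpy as np

def assign_labels(Zt, F):
    """Nearest-centroid assignment: each projected target sample takes
    the index of its closest cluster centroid (a k-means-style step)."""
    d = ((Zt[:, None, :] - F[None, :, :]) ** 2).sum(-1)  # squared dists
    return d.argmin(axis=1)

Zt = np.array([[0.1, 0.0], [4.9, 5.2], [0.0, 0.2]])  # projected samples
F = np.array([[0.0, 0.0], [5.0, 5.0]])               # cluster centroids
```

The resulting index vector is exactly the one-hot indicator matrix of the subproblem, written in compact label form.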
4. -subproblem: When , and are fixed, the optimization problem with regard to is equal to minimizing Eq.(5). Actually, we can divide it into independent subproblems with each formulated as:
where is the -th row of . By defining , the above problem can be written as:
The corresponding Lagrangian function is:
where and are the Lagrangian multipliers.
To explore the data locality and reduce computation time, we prefer to learn a sparse , i.e., only the -nearest neighbors of each sample are preserved to be locally connected. Based on the KKT condition, Eq.(17) has a closed-form solution:
where is the element of matrix , obtained by sorting the entries of each row of in ascending order. Following , we define and set the value of parameter as:
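A numpy rendition of this closed-form, k-sparse row update in the style of Nie et al.'s adaptive neighbors might read as follows (our own function name; the input is a row of squared distances in the projected space):

```python
import numpy as np

def adaptive_neighbors_row(dists, k):
    """Closed-form k-sparse similarity row: weights decrease linearly
    with squared distance, vanish beyond the k-th neighbor, and sum to
    one, matching the KKT solution of the per-row subproblem."""
    idx = np.argsort(dists)
    d = dists[idx]
    denom = k * d[k] - d[:k].sum()             # normalization from KKT
    s = np.zeros_like(dists, dtype=float)
    s[idx[:k]] = (d[k] - d[:k]) / max(denom, 1e-12)
    return s

# Row for one sample given squared distances to four other points.
s = adaptive_neighbors_row(np.array([1.0, 3.0, 2.0, 8.0]), k=2)
```

Each row is solved independently, which is what makes the S-subproblem cheap despite being learned afresh at every iteration.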
III-D Convergence and Complexity Analysis
III-D1 Convergence Analysis
We can prove the convergence of the proposed Algorithm 1 via the following proposition:
Assume that at the -th iteration, we get , , , . We denote the value of the objective function in (9) at the -th iteration as . In our Algorithm 1, we divide problem (9) into four subproblems (10), (12), (14) and (15), and each of them is a convex problem with respect to their corresponding variables. By solving the subproblems alternatively, our proposed algorithm can ensure finding the optimal solution of each subproblem, i.e., , , , . Therefore, as the combination of four subproblems, the objective function value of (9) in the -th iteration satisfies:
In light of this, the proof is completed and the algorithm converges to at least a local solution. ∎
III-D2 Complexity Analysis
The optimization Algorithm 1 of our CMMS comprises four subproblems, whose complexities are analyzed as follows. First, the cost of initializing is , and we ignore the time to initialize since the base classifier is very fast. Then, in each iteration, the complexity of constructing and solving the generalized eigenvalue problem (12) for is . The target cluster centroids can be obtained at a cost of . The complexity of updating the target label matrix is . The adjacency matrix is updated at a cost of . Generally, we have . Therefore, the overall computational complexity is , where is the number of iterations.
IV Semi-supervised Extension
In this section, we further extend our CMMS to semi-supervised domain adaptation including both homogeneous and heterogeneous settings.
IV-1 Homogeneous Setting
We denote the target data as , where is the labeled data with corresponding labels and is the unlabeled data. In the SDA scenario, besides the class centroids of the source data, the few but precisely labeled target data provide an additional valuable reference for determining the cluster centroids of the unlabeled data. In this paper, we provide a simple but effective strategy to adaptively combine these two kinds of information. Specifically, our proposed semi-supervised extension is formulated as:
where , . Apart from the balance factors and , the other variables in Eq.(22) can be readily solved with our Algorithm 1. Since the objective function is convex with respect to and , they can be solved easily in closed form: , , where ,
IV-2 Heterogeneous Setting
In the heterogeneous setting, the source and target data usually have different feature dimensions. Our proposed Eq.(21) can be naturally extended to the heterogeneous manner simply by replacing the projection matrix with two separate ones :
where denotes the new feature representations of the two domains with the same dimension. By defining , , Eq.(23) can be transformed into the standard form of Eq.(22), and thus can be solved with the same algorithm.
V Experiments

|Dataset|Subsets (Abbr.)|Samples|Feature (Dim)|Classes|
In this section, we first describe all involved datasets. Next, we give the details of the experimental setup, including the comparison methods in the UDA and SDA scenarios, the training protocol and the parameter setting. Then, the experimental results in the UDA scenario, the ablation study, the parameter sensitivity and the convergence analysis are presented. Finally, we show the results in the SDA scenario. The source code of this paper is available at https://github.com/LeiTian-qj/CMMS/tree/master.
V-A Datasets and Descriptions
We evaluate our method on five benchmark datasets widely used in domain adaptation. These datasets are represented with different kinds of features, including AlexNet-FC, SURF, DeCAF, pixel, ResNet50 and BoW. Table II gives an overall description of these datasets. We introduce them in detail as follows.
Office31  contains 4,110 images of office objects in 31 categories from three domains: Amazon (A), DSLR (D) and Webcam (W). Amazon images are downloaded from online merchants. Images in the DSLR domain are captured by a digital SLR camera, while those in the Webcam domain are taken by a web camera. We adopt the AlexNet-FC features (https://github.com/VisionLearningGroup/CORAL/tree/master/dataset) fine-tuned on the source domain. Following , we have 6 cross-domain tasks, i.e., "A→D", "A→W", …, "W→D".
Office-Caltech10  includes 2,533 images of objects in the 10 classes shared between the Office31 dataset and the Caltech256 (C) dataset. The Caltech256 dataset is a widely used benchmark for object recognition. We use the 800-dim SURF features (http://boqinggong.info/assets/GFK.zip) and 4,096-dim DeCAF features (https://github.com/jindongwang/transferlearning/blob/master/data/). Following , we construct 12 cross-domain tasks, i.e., "A→C", "A→D", …, "W→D".
MSRC-VOC2007  consists of two subsets: MSRC (M) and VOC2007 (V). It is constructed by selecting 1,269 images from MSRC and 1,530 images from VOC2007 that share 6 semantic categories: aeroplane, bicycle, bird, car, cow and sheep. We utilize the 256-dim pixel features (http://ise.thss.tsinghua.edu.cn/~mlong/). Finally, we establish 2 tasks, "M→V" and "V→M".
Office-Home  involves 15,585 images of daily objects in 65 shared classes from four domains: Art (artistic depictions of objects, Ar), Clipart (collections of clipart images, Cl), Product (images of objects without background, Pr) and Real-World (images captured with a regular camera, Re). We use the 4,096-dim ResNet50 features (https://github.com/hellowangqian/domainadaptation-capls) released by . Similarly, we obtain 12 tasks, i.e., "Ar→Cl", "Ar→Pr", …, "Re→Pr".
Multilingual Reuters Collection  is a cross-lingual text dataset with about 11,000 articles from six common classes in five languages: English, French, German, Italian and Spanish. All articles are represented by BoW features with TF-IDF (http://archive.ics.uci.edu/ml/datasets/Reuters+RCV1+RCV2+Multilingual,+Multiview+Text+Categorization+Test+collection). They are then processed by PCA for dimension reduction, and the reduced dimensionalities for English, French, German, Italian and Spanish are 1,131, 1,230, 1,417, 1,041 and 807, respectively. We pick Spanish as the target and each of the rest as the source by turns. Eventually, we obtain four tasks.
V-B Experimental Setup
V-B1 Comparison Methods in UDA Scenario
V-B2 Comparison Methods in SDA Scenario
V-B3 Training Protocol
For the UDA scenario, all source samples are utilized for training, following . We apply z-score standardization  to all kinds of features. For the SDA scenario, in the homogeneous setting, we use the Office-Caltech10 and MSRC-VOC2007 datasets following the same protocol as . Specifically, for the Office-Caltech10 dataset, we randomly choose 20 samples per category for the Amazon domain and 8 for the others as the source. Three labeled target samples per class are selected for training and the rest for testing. For fairness, we use the train/test splits released by . For the MSRC-VOC2007 dataset, all source samples are utilized for training, and 2 or 4 labeled target samples per category are randomly selected for training, leaving the remainder to be recognized. In the heterogeneous setting, we employ the Office-Caltech10 and Multilingual Reuters Collection datasets using the experimental setting of . For the Office-Caltech10 dataset, the SURF and DeCAF features serve as the source and target features, respectively. The source domain contains 20 instances per class, and 3 labeled target instances per category are selected for training with the rest for testing. For the Multilingual Reuters Collection dataset, Spanish is chosen as the target and each of the remaining languages as the source by turns. 100 articles per category are randomly selected to build the source domain, and 10 labeled target articles per category are selected for training, with 500 articles per class from the rest to be classified.
V-B4 Parameter Setting
In both UDA and SDA scenarios, we do not have massive labeled target samples, so we cannot perform a standard cross-validation procedure to obtain the optimal parameters. For a fair comparison, we cite the results from the original papers or run the code provided by the authors. Following , we grid-search the hyper-parameter space and report the best results. For GFK, JDA, DICD, JGSA, DICE and MEDA, the optimal reduced dimension is searched in . The best value of the regularization parameter for projection is searched in the range of . For the two recent methods, SPL and MSC, we adopt the default parameters used in their public codes or follow the tuning procedures in the corresponding original papers. For our method, we fix and , leaving and tunable. We obtain the optimal parameters by searching and .
V-C Unsupervised Domain Adaptation
V-C1 Experimental Results on UDA
Results on Office31 dataset. Table III summarizes the classification results on the Office31 dataset, where the highest accuracy of each cross-domain task is boldfaced. We can observe that CMMS has the best average performance, with a 1.1% improvement over the strongest competitor MSC. CMMS achieves the highest results on 2 out of 6 tasks, while MSC only works best for task A→W, with an accuracy just 0.4% higher than CMMS. Generally, SPL, MSC and CMMS perform better than the methods that classify target samples independently, which demonstrates that exploring the structure information of the data distribution can facilitate classification performance. Moreover, compared with SPL and MSC, CMMS further mines and exploits the inherent local manifold structure of target data to promote cluster prototype learning, and thus achieves better performance.
Results on Office-Caltech10 dataset. The results on the Office-Caltech10 dataset with SURF features are listed in Table IV. Regarding average accuracy, CMMS shows a large advantage, improving 1.7% over the second best method MEDA. CMMS is the best method on 4 out of 12 tasks, while MEDA only wins two. On C→A, D→A and C→W, CMMS leads MEDA by a margin of over 4.5%. Following , we also employ the DeCAF features, and the classification results are shown in Table V. CMMS is superior to all competitors in average accuracy and works best or second best on all tasks except C→W. Carefully comparing the results on SURF and DeCAF features, we find that SPL and MSC favor deep features. CMMS shows no such preference, which illustrates its better generalization capacity.
Results on MSRC-VOC2007 dataset. The experimental results on the MSRC-VOC2007 dataset are reported in Table VI. The average classification accuracy of CMMS is 55.4%, which is significantly higher than those of all competitors. Especially, on task V→M, CMMS gains a huge performance improvement of 15.3% compared with the second best method SPL, which verifies the significant effectiveness of our proposal.
Results on Office-Home dataset. For fairness, we employ the deep features recently released by , which are extracted using the ResNet50 model pre-trained on ImageNet. Table VII summarizes the classification accuracies. CMMS outperforms the second best method SPL in average performance, and achieves the best results on 10 out of all 12 tasks, while SPL only works best on task Re→Pr, with just 0.8% superiority over CMMS. This phenomenon shows that even if the target samples are well clustered within the deep feature space, exploiting the inherent local manifold structure is still crucial to improving the classification performance.
V-C2 Ablation Study
To understand our CMMS more deeply, we propose four variants of CMMS: a) a variant that only considers class Centroid Matching between the two domains, i.e., the combination of Eq. (2), Eq. (3) and Eq. (7); b) a variant that does not utilize the local manifold structure of target data, i.e., Eq. (4) is removed from the objective function Eq. (8); c) a variant that considers the local manifold structure of target data with a Predefined Adjacency matrix in the original feature space, i.e., Eq. (4) is replaced with the Laplacian regularization; d) a variant that exploits the Discriminative Structure information of the target domain by assigning pseudo-labels to target data and then minimizing the intra-class scatter in the projected space as in Eq. (5). Table VIII shows the results of CMMS and all variants. The results of the classical JDA are also provided. Based on this table, we analyze our approach in more detail as follows.
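As a rough illustration of the centroid-matching idea behind variant a), the sketch below seeds a k-means-style clustering of target data with the source class centroids, so that each target cluster prototype inherits a class label. This is a simplified stand-in under assumed toy data, not the paper's actual optimization of Eq. (2), Eq. (3) and Eq. (7).

```python
import numpy as np

def centroid_match_labels(Xs, ys, Xt, n_iter=10):
    """Assign pseudo-labels to target samples by matching cluster
    prototypes to source class centroids: a k-means loop seeded with
    the source centroids, so each target cluster keeps a class."""
    classes = np.unique(ys)
    # Source class centroids in the (shared) feature space.
    centroids = np.stack([Xs[ys == c].mean(axis=0) for c in classes])
    for _ in range(n_iter):
        # Nearest-prototype assignment for every target sample.
        d = ((Xt[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        # Move each prototype to the mean of its assigned samples.
        for k in range(len(classes)):
            if np.any(labels == k):
                centroids[k] = Xt[labels == k].mean(axis=0)
    return classes[labels]

# Toy two-class example with a constant shift between domains.
rng = np.random.default_rng(0)
Xs = np.vstack([rng.normal(0, .1, (20, 2)), rng.normal(3, .1, (20, 2))])
ys = np.array([0] * 20 + [1] * 20)
Xt = Xs + 0.5          # target = shifted source
yt_pred = centroid_match_labels(Xs, ys, Xt)
```

Because whole clusters are labeled at once, a moderate domain shift that leaves the cluster structure intact does not flip individual labels, which is the intuition the ablation probes.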
Effectiveness of class centroid matching. Variant a) consistently outperforms JDA on the five datasets, which confirms the remarkable superiority of our proposal over the MMD based pioneering approach. With the class centroid matching strategy, we can make full use of the structure information of the data distribution, so target samples are expected to present a favourable cluster distribution. For a clear illustration, in Fig. 3, we display the t-SNE  visualization of the target features in the projected space on task V→M of the MSRC-VOC2007 dataset. We can observe that JDA features are mixed together while CMMS features are well separated with clear cluster structure, which verifies the significant effectiveness of our class centroid matching strategy.
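The cluster-structure picture in Fig. 3 can be reproduced, in spirit, with off-the-shelf t-SNE. The synthetic features below are placeholders for the actual projected target features of task V→M; only the visualization recipe is the point here.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for the projected target features (n_samples x dim);
# in the paper these would come from task V->M of MSRC-VOC2007.
rng = np.random.default_rng(0)
Z_t = np.vstack([rng.normal(i, 0.3, (30, 16)) for i in range(3)])
labels = np.repeat(np.arange(3), 30)

# Embed the features into 2-D for visualization.
emb = TSNE(n_components=2, perplexity=15, init="pca",
           random_state=0).fit_transform(Z_t)

# Plotting `emb` colored by `labels` gives the kind of cluster
# picture shown in Fig. 3, e.g. with matplotlib:
#   plt.scatter(emb[:, 0], emb[:, 1], c=labels)
```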
Effectiveness of local manifold self-learning strategy for target data. Variant c) performs better than variant b) on all datasets, which indicates that exploiting the local manifold structure of target samples helps to classify them more successfully, even though a predefined manifold structure is not so reliable. However, if we capture it more faithfully, we can achieve superior performance, which is verified by comparing the full CMMS with variant c). For a better understanding, we show the visualization of the target adjacency matrix on task A→D (SURF) in Fig. 4. These matrices are obtained by either the self-learned distance or the predefined distances, which include the Euclidean distance, the heat-kernel distance with kernel width 1.0 and the cosine distance. As we can see from Fig. 4, all predefined distances tend to incorrectly connect unrelated samples and can hardly capture the inherent local manifold structure of target data. In contrast, the self-learned distance can adaptively build connections between intrinsically similar samples and thus improves the classification performance. Generally, variant d) performs much worse than the full CMMS and even worse than variant b), which verifies that utilizing the discriminative information of the target domain by assigning pseudo-labels to target samples independently is far from enough to achieve satisfactory results. The reason is that the pseudo-labels may be inaccurate and can cause error accumulation during learning, degrading the performance dramatically. In summary, our local manifold self-learning strategy can effectively enhance the utilization of the structure information contained in target data.
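The predefined adjacency baselines compared in Fig. 4 can be sketched as a k-nearest-neighbour graph under each similarity; the self-learned similarity of CMMS is instead solved jointly inside the optimization and is not shown here. The `k` value and the toy data are illustrative assumptions.

```python
import numpy as np

def knn_adjacency(X, k=3, metric="euclidean", width=1.0):
    """Predefined adjacency in the ORIGINAL feature space: connect each
    sample to its k most similar samples under the chosen metric."""
    n = len(X)
    if metric == "cosine":
        Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
        S = Xn @ Xn.T                          # cosine similarity
    else:
        d2 = ((X[:, None] - X[None]) ** 2).sum(-1)
        # Heat kernel with the given width, or plain (negated) distance.
        S = np.exp(-d2 / (2 * width ** 2)) if metric == "heat" else -d2
    np.fill_diagonal(S, -np.inf)               # no self-loops
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(S[i])[-k:]           # k most similar samples
        W[i, nbrs] = 1.0
    return np.maximum(W, W.T)                  # symmetrise the graph

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, .2, (10, 5)), rng.normal(2, .2, (10, 5))])
W = knn_adjacency(X, k=3, metric="heat", width=1.0)
```

On well-separated toy clusters any of these metrics works; the failure mode discussed above appears when the original features are noisy, which is exactly when a fixed graph mis-connects unrelated samples.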
V-C3 Parameters Sensitivity and Convergence Analysis
In our CMMS, there are two tunable parameters: and . We have conducted extensive parameter sensitivity analysis on all datasets over a wide range. We vary one parameter at a time and fix the other at its optimal value. The results of C→W (SURF), V→M, A→D (AlexNet) and Ar→Pr are reported in Fig. 5 (a)-(b). Meanwhile, to demonstrate the effectiveness of our CMMS, we also display the results of the best competitor as dashed lines.
First, we run our CMMS as varies from 0.001 to 10.0. From Fig. 5 (a), it is observed that when the value is very small, it may not contribute to the improvement of performance. However, as it increases appropriately, the clustering process of target data is emphasized, and thus our CMMS can exploit the cluster structure information more effectively. We find that, when this parameter is located within a wide range , our proposal achieves consistently optimal performance. Then, we evaluate the influence of the second parameter on our CMMS by varying its value from 0.001 to 10.0. It is infeasible to determine its optimal value in advance, since it highly depends on the domain prior knowledge of the datasets. However, we empirically find that, when it is located within the range , our CMMS obtains better classification results than the most competitive competitor. We also display the convergence analysis in Fig. 5 (c). We can see that CMMS converges quickly, within 10 iterations.
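The stopping behaviour described above can be sketched with a generic alternating-optimization driver that halts once the relative decrease of the objective becomes negligible or 10 iterations are reached. The toy `step`/`obj` pair is a hypothetical stand-in for the actual CMMS updates.

```python
def run_until_converged(step, obj, max_iter=10, tol=1e-4):
    """Generic alternating-optimization driver: repeat the update step
    until the relative change of the objective drops below `tol`."""
    prev = obj()
    history = [prev]
    for _ in range(max_iter):
        step()
        cur = obj()
        history.append(cur)
        if abs(prev - cur) <= tol * max(abs(prev), 1.0):
            break
        prev = cur
    return history

# Toy objective that halves its distance to the optimum each step,
# standing in for the alternating variable updates of CMMS.
state = {"x": 8.0}
def step():
    state["x"] *= 0.5
def obj():
    return state["x"] ** 2

history = run_until_converged(step, obj)
```

Plotting `history` against the iteration index gives a curve of the same shape as the convergence plot in Fig. 5 (c): a steep early drop followed by a flat tail.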
V-D Semi-supervised Domain Adaptation
V-D1 Results in Homogeneous Setting
The averaged classification results of all methods on the Office-Caltech10 dataset over 20 random splits are shown in Table IX. We can observe that, regarding the total average accuracy, our CMMS obtains an improvement over the second best method TFMKL-S. The results on the MSRC-VOC2007 dataset over 5 random splits are shown in Table X. Some results are cited from . Compared with the most competitive competitors, CMMS achieves improvements when the number of labeled target samples per class is set to 2 and 4, respectively.
V-D2 Results in Heterogeneous Setting
The results on the Office-Caltech10 and Multilingual Reuters Collection datasets are listed in Table XI and Table XII, respectively. Some results are cited from . We can observe that our CMMS achieves the optimal average accuracy on both datasets, with clear improvements over the best competitors. Especially, CMMS works the best on 8 out of all 10 tasks on the two datasets, which adequately confirms the excellent generalization capacity of our CMMS in the heterogeneous setting.
VI Conclusions and Future Work
In this paper, a novel domain adaptation method named CMMS is proposed. Unlike most existing methods, which generally assign pseudo-labels to target data independently, CMMS makes label prediction for target samples by class centroid matching between the source and target domains, such that the data distribution structure of the two domains can be exploited. To explore the structure information of target data more thoroughly, a local manifold self-learning strategy is further introduced into CMMS, which captures the inherent local manifold structure of target data by adaptively learning the data similarity in the projected space. The CMMS optimization problem is not jointly convex in all variables, and thus an iterative optimization algorithm is designed to solve it, whose computational complexity and convergence are carefully analyzed. We further extend CMMS to the semi-supervised scenario, including both homogeneous and heterogeneous settings, which is appealing and promising. Extensive experimental results on five datasets reveal that CMMS significantly outperforms the baselines and several state-of-the-art methods in both unsupervised and semi-supervised scenarios.
Future research will include the following: 1) Considering the computational bottleneck of CMMS optimization, we will design a more efficient algorithm for local manifold self-learning; 2) Besides the class centroids, we can introduce additional measures to represent the structure of the data distribution, such as the covariance; 3) In this paper, we extend CMMS to the semi-supervised scenario in a direct but effective way; more elaborate designs of semi-supervised methods are worth further exploration.
The authors are grateful for the financial support from the National Natural Science Foundation of China (61432008, 61472423, U1636220 and 61772524).
-  S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1345-1359, Oct. 2010.
-  Z. Guo and Z. Wang, “Cross-domain object recognition via input-output kernel analysis,” IEEE Trans. Image Process., vol. 22, no. 8, pp. 3108-3119, Aug. 2013.
-  A. Rozantsev, M. Salzmann, and P. Fua, “Beyond sharing weights for deep domain adaptation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 4, pp. 801-814, Apr. 2019.
-  C.-X. Ren, D.-Q. Dai, K.-K. Huang, and Z.-R. Lai, “Transfer learning of structured representation for face recognition,” IEEE Trans. Image Process., vol. 23, no. 12, pp. 5440-5454, Dec. 2014.
-  Q. Qiu and R. Chellappa, “Compositional dictionaries for domain adaptive face recognition,” IEEE Trans. Image Process., vol. 24, no. 12, pp. 5152-5165, Dec. 2015.
-  A. J. Ma, J. Li, P. C. Yuen, and P. Li, “Cross-domain person reidentification using domain adaptation ranking SVMs,” IEEE Trans. Image Process., vol. 24, no. 5, pp. 1599-1613, May 2015.
-  M. Long, J. Wang, G. Ding, S. J. Pan, and P. S. Yu, “Adaptation regularization: A general framework for transfer learning,” IEEE Trans. Knowl. Data Eng., vol. 26, no. 5, pp. 1076-1089, May 2014.
-  S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, “Domain adaptation via transfer component analysis,” IEEE Trans. Neural Netw., vol. 22, no. 2, pp. 199-210, Feb. 2011.
-  M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu, “Transfer feature learning with joint distribution adaptation,” in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 2200-2207.
-  B. Gong, Y. Shi, F. Sha, and K. Grauman, “Geodesic flow kernel for unsupervised domain adaptation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 2066-2073.
-  B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars, “Unsupervised visual domain adaptation using subspace alignment,” in Proc. Int. Conf. Comput. Vis., Dec. 2013, pp. 2960-2967.
-  B. Sun, J. Feng, and K. Saenko, “Return of frustratingly easy domain adaptation,” in Proc. Amer. Assoc. Artif. Intell. Conf., 2016, pp. 2058-2065.
-  A. Gretton, K. M. Borgwardt, M. Rasch, B. Scholkopf, and A. J. Smola, “A kernel method for the two-sample-problem,” in Proc. Adv. in Neural Inf. Process. Syst., 2007, pp. 513-520.
-  J. Macqueen, “Some methods for classification and analysis of multivariate observations,” in Proc. 5th Berkeley Symp. Math. Statist. Probab., 1967, pp. 281-297.
-  A. Goh and R. Vidal, “Segmenting motions of different types by unsupervised manifold clustering,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2007, pp. 1-6.
-  A. Goh and R. Vidal, “Clustering and dimensionality reduction on Riemannian manifolds,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2008, pp. 1-7.
-  L. K. Saul and S. T. Roweis, “Think globally, fit locally: unsupervised learning of low dimensional manifolds,” J. Mach. Learn. Res., vol. 4, pp. 119-155, Dec. 2003.
-  M. Belkin and P. Niyogi, “Laplacian eigenmaps and spectral techniques for embedding and clustering,” in Proc. Adv. Neural Inf. Process. Syst., Dec. 2001, pp. 585-591.
-  F. Nie, X. Wang, and H. Huang, “Clustering and projected clustering with adaptive neighbors,” in Proc. 20th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2014, pp. 977-986.
-  J. Hoffman, E. Rodner, J. Donahue, B. Kulis, and K. Saenko, “Asymmetric and category invariant feature transformations for domain adaptation,” Int. J. Comput. Vis., vol. 41, nos. 1-2, pp. 28-41, 2014.
-  S. Herath, M. Harandi, and F. Porikli, “Learning an invariant hilbert space for domain adaptation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jul. 2017, pp. 3845-3854.
-  W. Wang, H. Wang, Z. X. Zhang, C. Zhang, and Y. Gao, “Semi-supervised domain adaptation via Fredholm integral based kernel methods,” Pattern Recognit., vol. 85, pp. 185-197, Jan. 2019.
-  W. Li, L. Duan, D. Xu, and I. W. Tsang, “Learning with augmented features for supervised and semi-supervised heterogeneous domain adaptation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 6, pp. 1134-1148, Jun. 2014.
-  Y.-T. Hsieh, S.-Y. Tao, Y.-H. H. Tsai, Y.-R. Yeh, and Y.-C. F. Wang, “Recognizing heterogeneous cross-domain data via generalized joint distribution adaptation,” in Proc. IEEE Int. Conf. Multimedia Expo., Jul. 2016, pp. 1-6.
-  M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu, “Transfer joint matching for unsupervised domain adaptation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 1410-1417.
-  S. Chen, F. Zhou, and Q. Liao, “Visual domain adaptation using weighted subspace alignment,” in Proc. SPIE Int. Conf. Vis. Commun. Image Process., Nov. 2016, pp. 1-4.
-  J. Wang, W. Feng, Y. Chen, H. Yu, M. Huang, and P. S. Yu, “Visual domain adaptation with manifold embedded distribution alignment,” in Proc. ACM Int. Conf. Multimedia, Oct. 2018, pp. 402-410.
-  L. Zhang. (2019). “Transfer adaptation learning: a decade survey.” [Online]. Available: https://arxiv.xilesou.top/abs/1903.04687
-  S. Li, S. Song, G. Huang, and Z. Ding, “Domain invariant and class discriminative feature learning for visual domain adaptation,” IEEE Trans. Image Process., vol. 27, no. 9, pp. 4260-4273, Sept. 2018.
-  J. Liang, R. He, and T. Tan, “Aggregating randomized clustering-promoting invariant projections for domain adaptation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 5, pp. 1027-1042, May 2019.
-  J. Liang, R. He, Z. Sun and T. Tan, “Distant Supervised Centroid Shift: A Simple and Efficient Approach to Visual Domain Adaptation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2019, pp. 2975-2984.
-  Q. Wang and T. P. Breckon. (2019). “Unsupervised domain adaptation via structured prediction based selective pseudo-labeling.” [Online]. Available: https://arxiv.xilesou.top/abs/1911.07982
-  J. Li, K. Lu, Z. Huang, L. Zhu, and H. Shen, “Heterogeneous domain adaptation through progressive alignment,” IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 5, pp. 1381-1391, May 2019.
-  Y.-H. H. Tsai, Y.-R. Yeh, and Y.-C. F. Wang, “Learning cross-domain landmarks for heterogeneous domain adaptation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2016, pp. 5081-5090.
-  M. Xiao and Y. Guo, “Feature space independent semi-supervised domain adaptation via kernel matching,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 1, pp. 54-66, Jan. 2015.
-  T. Yao, Y. Pan, C.-W. Ngo, H. Li, and T. Mei, “Semi-supervised domain adaptation with subspace learning for visual recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 2142-2150.
-  D. Hong, N. Yokoya, and X. Zhu, “Learning a robust local manifold representation for hyperspectral dimensionality reduction,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 10, no. 6, pp. 2960-2975, Jun. 2017.
-  K. Zhan, F. Nie, J. Wang, and Y. Yang, “Multiview consensus graph clustering,” IEEE Trans. Image Process., vol. 28, no. 3, pp. 1261-1270, Mar. 2019.
-  C. Hou, F. Nie, H. Tao, and D. Yi, “Multi-view unsupervised feature selection with adaptive similarity and view weight,” IEEE Trans. Knowl. Data Eng., vol. 29, no. 9, pp. 1998-2011, Sept. 2017.
-  W. Wang, Y. Yan, F. Nie, S. Yan, and N. Sebe, “Flexible manifold learning with optimal graph for image and video representation,” IEEE Trans. Image Process., vol. 27, no. 6, pp. 2664-2675, Jun. 2018.
-  C.-A. Hou, Y.-H. H. Tsai, Y.-R. Yeh, and Y.-C. F. Wang, “Unsupervised domain adaptation with label and structural consistency,” IEEE Trans. Image Process., vol. 25, no. 12, pp. 5552-5562, Dec. 2016.
-  J. Li, M. Jing, K. Lu, L. Zhu and H. Shen, “Locality preserving joint transfer for domain adaptation,” IEEE Trans. Image Process., vol. 28, no. 12, pp. 6103-6115, Dec. 2019.
-  S. Wang, J. Lu, X. Gu, H. Du, and J. Yang, “Semi-supervised linear discriminant analysis for dimension reduction and classification,” Pattern Recognit., vol. 57, pp. 179-189, Sept. 2016.
-  K. Saenko, B. Kulis, M. Fritz, and T. Darrell, “Adapting visual category models to new domains,” in Proc. Eur. Conf. Comput. Vis., 2010, pp. 213-226.
-  J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, “DeCAF: A deep convolutional activation feature for generic visual recognition,” in Proc. Int. Conf. Mach. Learn., 2014, pp. 647-655.
-  H. Venkateswara, J. Eusebio, S. Chakraborty, and S. Panchanathan, “Deep hashing network for unsupervised domain adaptation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2017, pp. 5018-5027.
-  M.-R. Amini, N. Usunier, and C. Goutte, “Learning from multiple partially observed views-an application to multilingual text categorization,” in Proc. Adv. in Neural Inf. Process. Syst., Dec. 2009, pp. 28-36.
-  J. Zhang, W. Li, and P. Ogunbona, “Joint geometrical and statistical alignment for visual domain adaptation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2017, pp. 1859-1867.
-  L. Duan, I. W. Tsang, and D. Xu, “Domain transfer multiple kernel learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 3, pp. 465-479, Mar. 2012.
-  L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” J. Mach. Learn. Res., vol. 9, pp. 2579-2605, Nov. 2008.