I. Introduction
Traditional machine learning tasks assume that both training and testing data are drawn from the same data distribution [29, 31, 5]. However, in many real-life applications, due to factors as diverse as sensor differences, lighting changes, viewpoint variations, etc., data from a target domain may have a different distribution w.r.t. the labeled data in a source domain, so that a predictor cannot be reliably learned for the target domain because of this data distribution shift. On the other hand, manually labeling enough target data for the purpose of training an effective predictor can be very expensive, tedious and thus prohibitive. Domain adaptation (DA) [29, 31, 5] aims to leverage possibly abundant labeled data from a source domain to learn an effective predictor for data in a target domain despite the data distribution discrepancy between the source and target. While DA can be semi-supervised by assuming a certain amount of labeled data is available in the target domain, in this paper we are interested in unsupervised DA [32], where we assume that the target domain has no labels.
State-of-the-art DA methods can be categorized as instance-based [29, 7], feature-based [30, 22, 42], or classifier-based. Classifier-based DA is not suitable for unsupervised DA, as it aims to fit a classifier trained on the source data to the target data through adaptation of its parameters, and thereby requires some labels in the target domain [38]. The instance-based approach generally assumes that 1) the conditional distributions of the source and target domain are identical [44], and 2) a certain portion of the data in the source domain can be reused [29] for learning in the target domain through re-weighting. Feature-based adaptation relaxes such a strict assumption and only requires that there exists a mapping from the input data space to a latent shared feature representation space. This latent shared feature space captures the information necessary for training classifiers for the source and target tasks. In this paper, we propose a feature-based DA method.

A common way to approach feature adaptation is to seek a low-dimensional latent subspace [31, 30] via dimension reduction. The state of the art features two main lines of approaches, namely data geometric structure alignment-based and data distribution centered. Data geometric structure alignment-based approaches, e.g., LTSL [35], LRSR [42], seek a subspace where source and target data can be well aligned and interlaced while preserving the inherent hidden geometric data structure via a low-rank constraint and/or sparse representation. Data distribution centered methods aim to search a latent subspace where the discrepancy between the source and target data distributions is minimized, via various distances, e.g., the Bregman divergence-based distance [36], the Geodesic distance [13] or the Maximum Mean Discrepancy (MMD) [14]. The most popular distance is MMD due to its simplicity and solid theoretical foundations.
A cornerstone theoretical result in DA [2, 17] is achieved by Ben-David et al., who estimate an error bound of a learned hypothesis $h$ on a target domain:
$$e_T(h) \le e_S(h) + d_{\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) + \min\big\{ \mathbb{E}_{\mathcal{D}_S}\!\left[\,|f_S(\mathbf{x}) - f_T(\mathbf{x})|\,\right],\; \mathbb{E}_{\mathcal{D}_T}\!\left[\,|f_S(\mathbf{x}) - f_T(\mathbf{x})|\,\right] \big\} \qquad (1)$$
Eq.(1) provides insight into the way to improve DA algorithms, as it states that the performance of a hypothesis $h$ on a target domain is determined by: 1) the classification error $e_S(h)$ on the source domain; 2) the data divergence $d_{\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T)$, which measures the divergence [17] between the two distributions $(\mathcal{D}_S, \mathcal{D}_T)$; and 3) the difference in labeling functions across the two domains. In light of this theoretical result, we can see that data distribution centered DA methods only seek to minimize the second term in reducing data distribution discrepancies, whereas data geometric structure alignment-based methods account for the underlying data geometric structure and expect, but without theoretical guarantee, the alignment of data distributions.
In this paper, we argue that an effective DA method should: P1) search a shared feature subspace where source and target data are not only aligned in terms of distributions, as most state-of-the-art DA methods do, e.g., TCA [28], JDA [22], but also made discriminative in that instances of different classes are well separated; P2) account for the geometric structure of the underlying data manifold when inferring data labels on the target domain.
As a result, we propose in this paper a novel Discriminative Geometry Aware DA (DGADA) method which provides a unified framework for a simultaneous optimization of the three terms of the upper error bound in Eq.(1). Specifically, the proposed DGADA seeks a latent feature subspace to align data distributions as most state-of-the-art DA methods do, but additionally introduces a repulsive force term into the proposed model so as to increase inter-class distances and thereby facilitate discriminative learning and minimize the classification error of the learned hypothesis on the source data. Furthermore, the proposed DGADA introduces into its model two additional constraints, namely Label Smoothness Consistency and Geometric Structure Consistency, to account for the geometric structure of the underlying data manifold when inferring data labels in the target domain, thereby minimizing the third term of the error bound of the underlying learned hypothesis on the target domain. Fig.1 illustrates the proposed DA method.
To gain insight into the proposed method and highlight the contribution of P1) and P2) in comparison with a baseline DA method, i.e., JDA [22], which only cares about data distribution alignment, we further derive two partial DA methods from our DA model, namely Close yet Discriminative DA (CDDA), which implements P1), and Geometry Aware DA (GADA), based on P2), in addition to our Discriminative and Geometry Aware DA (DGADA), which jointly integrates P1) and P2). Comprehensive experiments carried out on standard DA benchmarks, i.e., 36 cross-domain image classification tasks built from 6 datasets, verify the effectiveness of the proposed method, which consistently outperforms state-of-the-art DA methods. In-depth analysis using both synthetic data and the two additional partial models further provides insight into the proposed DA model and highlights its interesting properties.
To sum up, the contributions of this paper are fourfold:

We propose a novel repulsive force term in the DA model to increase the discriminative power of the shared latent subspace, aside from narrowing discrepancies of both the marginal and conditional distributions between the source and target domains.

We introduce data geometry awareness, through Label Smoothness and Geometric Structure Consistencies, for label inference in the proposed DA model and thereby account for the geometric structures of the underlying data manifold.

We derive from our DA model three novel DA methods, namely CDDA, GADA and DGADA, which successively implement data discriminativeness, geometry awareness, and both, and quantify the contribution of each term beyond a baseline DA method, i.e., JDA, which only cares about the alignment of data distributions.

We perform extensive experiments on 36 image classification DA tasks through 6 popular DA benchmarks and verify the effectiveness of the proposed method, which consistently outperforms twenty-two state-of-the-art DA algorithms by a significant margin. Moreover, we also carry out an in-depth analysis of the proposed DA methods, in particular w.r.t. their hyper-parameters and convergence speed. In addition, using both synthetic and real data, we provide insights into the proposed DA model by visualizing the effect of data discriminativeness and geometry awareness.
The paper is organized as follows. Section 2 discusses the related work. Section 3 presents the method. Section 4 benchmarks the proposed DA method and provides an in-depth analysis. Section 5 concludes the paper.
II. Related Work
Unsupervised domain adaptation assumes that no labeled data are provided in the target domain. Thus, in order to achieve satisfactory classification performance on the target domain, one needs to learn a classifier with labeled samples provided only from the source domain as well as unlabeled samples from the target domain. In earlier days, this problem was also known as covariate shift and was solved by sample re-weighting [37]. These methods aim to reduce the distribution difference by re-weighting the source samples according to their relevance to the target samples. While proving useful when the data divergence between the source and target domain is small, these methods fall short of aligning source and target data when this divergence becomes large.
As a result, recent research in DA has focused its attention on the feature-based adaptation approach [22, 44, 35, 23, 42, 25], which only assumes a shared latent feature space between the source and target domain. In the learned latent space, the divergence between the projected source and target data distributions is supposed to be minimized, so that a classifier learned with the projected labeled source samples can be applied for classification of target samples. To find such a latent shared feature space, many existing methods, e.g., [28, 22, 44, 23, 1], embrace dimensionality reduction and propose to explicitly minimize some predefined distance measures to reduce the mismatch between source and target in terms of the marginal distribution [36, 27, 28], the conditional distribution [33], or both [22]. For example, [36] proposed a Bregman divergence-based regularization schema, which combines the Bregman divergence with conventional dimensionality reduction algorithms. In [28], the authors use a similar dimensionality reduction framework while making use of the Maximum Mean Discrepancy (MMD) in a Reproducing Kernel Hilbert Space (RKHS) [3] to estimate the distance between distributions. In [22], the authors further improve this work by minimizing not only the mismatch of the cross-domain marginal probability distributions, but also the mismatch of the conditional probability distributions.
In line with the focus of manifold learning [45], an increasing number of DA methods, e.g., [24, 35, 42], emphasize the importance of aligning the underlying data manifold structures between the source and the target domain for effective DA. In these methods, low-rank and sparse constraints are introduced into DA to extract a low-dimensional feature subspace where target samples can be sparsely reconstructed from source samples [35], or interleaved by source samples [42], thereby aligning the geometric structures of the underlying data manifolds. A few recent DA methods, e.g., RSA-CDDA [24], JGSA [44], further propose unified frameworks to reduce the shift between domains both statistically and geometrically.
However, in light of the upper error bound as defined in Eq.(1), we can see that data distribution centered DA methods only seek to minimize the second term in reducing data distribution discrepancies, whereas data geometric structure alignment-based methods account for the underlying data geometric structure and expect, but without theoretical guarantee, the alignment of data distributions. In contrast, the proposed DGADA method optimizes altogether the three error terms of the upper error bound in Eq.(1).
The proposed DGADA builds on JDA [22] in seeking a latent feature subspace while minimizing the mismatch of both the marginal and conditional probability distributions across domains, thereby decreasing the data divergence term in Eq.(1). However, DGADA goes beyond and differs from JDA as we introduce into the proposed DA model a repulsive force term so as to increase inter-class distances for discriminative DA, thereby optimizing the first term of the upper error bound in Eq.(1), i.e., the error rate of the learned hypothesis on the source domain. Furthermore, the proposed DGADA also accounts in its model for the geometric structures of the underlying data manifolds, through label smoothness consistency (LSC) and geometric structure consistency (GSC), which require that the inferred labels on the source and target data be smooth, with similar labels on nearby data. These two constraints thus further optimize the third term of the upper error bound in Eq.(1). DGADA also differs markedly from a recent DA method, SCA [11], which also tries to introduce data discriminativeness, but through between- and within-class scatter defined only on the source domain. Besides lacking the data geometry awareness that we consider, SCA neither seeks explicit data distribution alignment as we do in heritage of JDA, nor does it have the repulsive force term that we introduce in our model to push away inter-class data based on both the source and target domain. Using both synthetic and real data, Sect. IV-F provides insights into and visualizes the differences of the proposed model with a number of state-of-the-art DA methods, e.g., SCA, and highlights its interesting properties, in particular data distribution alignment, data discriminativeness and geometry awareness.
III. Discriminative Geometry Aware Domain Adaptation
We first introduce the notations and formalize the problem in Sect. III-A, then present in Sect. III-B the proposed model for Discriminative and Geometry Aware Domain Adaptation (DGADA), and solve the model in Sect. III-C. Sect. III-D further analyzes the kernelization of the proposed DA model for nonlinear DA problems.
III-A. Notations and Problem Statement
Matrices are written as boldface uppercase letters and vectors as boldface lowercase letters. For a matrix $\mathbf{A}$, its $i$-th row is denoted as $\mathbf{a}^i$, and its $j$-th column is denoted by $\mathbf{a}_j$. We define the Frobenius norm as $\|\mathbf{A}\|_F = \sqrt{\sum_i \sum_j a_{ij}^2}$.

A domain $\mathcal{D}$ is defined as an $m$-dimensional feature space $\mathcal{X}$ and a marginal probability distribution $P(\mathbf{x})$, i.e., $\mathcal{D} = \{\mathcal{X}, P(\mathbf{x})\}$ with $\mathbf{x} \in \mathcal{X}$. Given a specific domain $\mathcal{D}$, a task $\mathcal{T}$ is composed of a $C$-cardinality label set $\mathcal{Y}$ and a classifier $f(\mathbf{x})$, i.e., $\mathcal{T} = \{\mathcal{Y}, f(\mathbf{x})\}$, where $f(\mathbf{x}) = \mathcal{Q}(y\,|\,\mathbf{x})$ can be interpreted as the class conditional probability distribution for each input sample $\mathbf{x}$.
In unsupervised domain adaptation, we are given a source domain $\mathcal{D}_S$ with $n_s$ labeled samples $\mathbf{X}_S = \{\mathbf{x}_{S_1}, \ldots, \mathbf{x}_{S_{n_s}}\}$, which are associated with their class labels $\mathbf{Y}_S$, and an unlabeled target domain $\mathcal{D}_T$ with $n_t$ unlabeled samples $\mathbf{X}_T = \{\mathbf{x}_{T_1}, \ldots, \mathbf{x}_{T_{n_t}}\}$, whose labels are unknown. Here, each source domain label $\mathbf{y}_i \in \{0,1\}^C$ is a binary vector in which $y_i^{(c)} = 1$ if $\mathbf{x}_{S_i}$ belongs to the $c$-th class. We define the data matrix $\mathbf{X} = [\mathbf{X}_S, \mathbf{X}_T]$ in packing both the source and target data. The source domain and target domain are assumed to be different, i.e., $\mathcal{X}_S = \mathcal{X}_T$, $\mathcal{Y}_S = \mathcal{Y}_T$, $P(\mathbf{X}_S) \ne P(\mathbf{X}_T)$, $Q(\mathbf{Y}_S\,|\,\mathbf{X}_S) \ne Q(\mathbf{Y}_T\,|\,\mathbf{X}_T)$.
We also define the notion of subdomain, denoted as $\mathcal{D}_S^{(c)}$, representing the set of samples in $\mathcal{D}_S$ with label $c$. Similarly, a subdomain $\mathcal{D}_T^{(c)}$ can be defined for the target domain as the set of samples in $\mathcal{D}_T$ with label $c$. However, as the samples in the target domain are unlabeled, the definition of subdomains in the target domain requires a base classifier, e.g., Nearest Neighbor (NN), to attribute pseudo labels to the samples in $\mathcal{D}_T$.
The Maximum Mean Discrepancy (MMD) is an effective non-parametric distance measure that compares the distributions of two sets of data by mapping the data to a Reproducing Kernel Hilbert Space (RKHS) [3]. Given two distributions $P$ and $Q$, the MMD between $P$ and $Q$ is defined as:
$$Dist(P, Q) = \left\| \frac{1}{n_1} \sum_{i=1}^{n_1} \phi(\mathbf{x}_i) - \frac{1}{n_2} \sum_{j=1}^{n_2} \phi(\mathbf{y}_j) \right\|_{\mathcal{H}} \qquad (2)$$
where $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_{n_1}\}$ and $Y = \{\mathbf{y}_1, \ldots, \mathbf{y}_{n_2}\}$ are two random variable sets drawn from the distributions $P$ and $Q$, respectively, and $\mathcal{H}$ is a universal RKHS with the reproducing kernel mapping $\phi: \mathcal{X} \to \mathcal{H}$.

The aim of the Discriminative and Geometry Aware Domain Adaptation (DGADA) is to learn a latent feature subspace with the following properties: P1) the distances of both the marginal and conditional probability distributions between the source and target domains are reduced; P2) the distances between each subdomain and the others are increased so as to enlarge inter-class distances and thereby enable discriminative DA; and P3) label inference accounts for the underlying data geometric structure.
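To make the MMD of Eq.(2) concrete, the following numpy sketch (illustrative only; the RBF kernel choice and its bandwidth are our assumptions, not prescribed by the paper) estimates the squared MMD between two samples:

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Pairwise RBF kernel k(x, y) = exp(-gamma * ||x - y||^2).
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(Xs, Xt, gamma=1.0):
    # Squared MMD = ||mean embedding(Xs) - mean embedding(Xt)||_H^2,
    # estimated from the three blocks of the joint kernel matrix.
    k_ss = rbf_kernel(Xs, Xs, gamma).mean()
    k_tt = rbf_kernel(Xt, Xt, gamma).mean()
    k_st = rbf_kernel(Xs, Xt, gamma).mean()
    return k_ss + k_tt - 2.0 * k_st

rng = np.random.default_rng(0)
same = mmd2(rng.normal(0, 1, (200, 5)), rng.normal(0, 1, (200, 5)))
shifted = mmd2(rng.normal(0, 1, (200, 5)), rng.normal(3, 1, (200, 5)))
```

As expected, `shifted` comes out much larger than `same`, since the second pair of samples differs by a mean shift.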
III-B. The Model
The proposed DA model (Sect. III-B5) builds on TCA (Sect. III-B1) and JDA (Sect. III-B2), to which discriminative DA (CDDA) is introduced (Sect. III-B3) and data geometry awareness (GADA) is accounted for in label inference and in the search of the shared latent feature subspace (Sect. III-B4).
III-B1. Search of a Latent Feature Space with Dimensionality Reduction (TCA)
The search of a latent feature subspace with dimensionality reduction has been demonstrated to be useful for DA in several previous works, e.g., [28, 22, 24, 35, 44]. In projecting the original raw data into a lower-dimensional space, the principal data structure is preserved while its complexity is decreased. In the proposed method, we also apply Principal Component Analysis (PCA) to capture the major data structure. Mathematically, given an input data matrix $\mathbf{X} \in \mathbb{R}^{m \times n}$, the centering matrix is defined as $\mathbf{H} = \mathbf{I} - \frac{1}{n}\mathbf{1}$, where $\mathbf{1}$ is the $n \times n$ matrix of ones. The optimization of PCA is to find a projection transformation $\mathbf{A}$ which maximizes the embedded data variance:
$$\max_{\mathbf{A}^T \mathbf{A} = \mathbf{I}} \; tr(\mathbf{A}^T \mathbf{X} \mathbf{H} \mathbf{X}^T \mathbf{A}) \qquad (3)$$
where $tr(\cdot)$ denotes the trace of a matrix, $\mathbf{X}\mathbf{H}\mathbf{X}^T$ is the data covariance matrix, and $\mathbf{A} \in \mathbb{R}^{m \times k}$, with $m$ the feature dimension and $k$ the dimension of the projected subspace. The optimal solution is calculated by solving the eigendecomposition problem $\mathbf{X}\mathbf{H}\mathbf{X}^T\mathbf{A} = \mathbf{A}\Phi$, where $\Phi$ contains the $k$ largest eigenvalues. Finally, the original data $\mathbf{X}$ is projected into the optimal $k$-dimensional subspace using $\mathbf{Z} = \mathbf{A}^T\mathbf{X}$.

III-B2. Joint Marginal and Conditional Distribution Domain Adaptation (JDA)
However, the feature subspace calculated via PCA does not explicitly align the data distributions between the source and target domain. Following [22, 21], we also empirically measure the distance of both the marginal and conditional distributions across domains via the non-parametric MMD in an RKHS [3], once the original data are projected into the low-dimensional feature space. Formally, the empirical distance between the two domains is defined as:
$$Dist(\mathcal{D}_S, \mathcal{D}_T) = \left\| \frac{1}{n_s} \sum_{i=1}^{n_s} \mathbf{A}^T \mathbf{x}_i - \frac{1}{n_t} \sum_{j=n_s+1}^{n_s+n_t} \mathbf{A}^T \mathbf{x}_j \right\|^2 = tr(\mathbf{A}^T \mathbf{X} \mathbf{M}_0 \mathbf{X}^T \mathbf{A}) \qquad (4)$$
where $\mathbf{M}_0$ encodes the marginal distribution discrepancy between $\mathcal{D}_S$ and $\mathcal{D}_T$ and is obtained by:
$$(\mathbf{M}_0)_{ij} = \begin{cases} \dfrac{1}{n_s n_s}, & \mathbf{x}_i, \mathbf{x}_j \in \mathcal{D}_S \\[4pt] \dfrac{1}{n_t n_t}, & \mathbf{x}_i, \mathbf{x}_j \in \mathcal{D}_T \\[4pt] \dfrac{-1}{n_s n_t}, & \text{otherwise} \end{cases} \qquad (5)$$
where $n_s$ and $n_t$ are the numbers of source and target samples, respectively. The difference between the marginal distributions $P(\mathbf{X}_S)$ and $P(\mathbf{X}_T)$ is reduced in minimizing $Dist(\mathcal{D}_S, \mathcal{D}_T)$.
Similarly, the distance between the conditional probability distributions is defined as the sum of the empirical distances, over the class labels, between the subdomains with the same label in the source and target domain:
$$Dist^c(\mathcal{D}_S, \mathcal{D}_T) = \sum_{c=1}^{C} \left\| \frac{1}{n_s^{(c)}} \sum_{\mathbf{x}_i \in \mathcal{D}_S^{(c)}} \mathbf{A}^T \mathbf{x}_i - \frac{1}{n_t^{(c)}} \sum_{\mathbf{x}_j \in \mathcal{D}_T^{(c)}} \mathbf{A}^T \mathbf{x}_j \right\|^2 = \sum_{c=1}^{C} tr(\mathbf{A}^T \mathbf{X} \mathbf{M}_c \mathbf{X}^T \mathbf{A}) \qquad (6)$$
where $C$ is the number of classes, $\mathcal{D}_S^{(c)}$ represents the $c$-th subdomain in the source domain, and $n_s^{(c)}$ is the number of samples in that source subdomain. $\mathcal{D}_T^{(c)}$ and $n_t^{(c)}$ are defined similarly for the target domain. Finally, $\mathbf{M}_c$ encodes the conditional distribution discrepancy between the subdomains of label $c$ in $\mathcal{D}_S$ and $\mathcal{D}_T$ and is defined as:
$$(\mathbf{M}_c)_{ij} = \begin{cases} \dfrac{1}{n_s^{(c)} n_s^{(c)}}, & \mathbf{x}_i, \mathbf{x}_j \in \mathcal{D}_S^{(c)} \\[4pt] \dfrac{1}{n_t^{(c)} n_t^{(c)}}, & \mathbf{x}_i, \mathbf{x}_j \in \mathcal{D}_T^{(c)} \\[4pt] \dfrac{-1}{n_s^{(c)} n_t^{(c)}}, & \mathbf{x}_i \in \mathcal{D}_S^{(c)}, \mathbf{x}_j \in \mathcal{D}_T^{(c)} \;\text{ or }\; \mathbf{x}_i \in \mathcal{D}_T^{(c)}, \mathbf{x}_j \in \mathcal{D}_S^{(c)} \\[4pt] 0, & \text{otherwise} \end{cases} \qquad (7)$$
In minimizing $Dist^c(\mathcal{D}_S, \mathcal{D}_T)$, the mismatch of the conditional distributions between $\mathcal{D}_S$ and $\mathcal{D}_T$ is reduced.
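As a sketch (function names are ours, not the paper's), MMD matrices of this block-constant form can be built as outer products of a signed indicator vector, so that the trace terms of Eqs.(4) and (6) reduce to squared differences of (sub)domain means:

```python
import numpy as np

def mmd_matrix_marginal(ns, nt):
    # Marginal MMD matrix M0: tr(A^T X M0 X^T A) equals the squared
    # distance between the projected source and target means.
    e = np.concatenate([np.full(ns, 1.0 / ns), np.full(nt, -1.0 / nt)])
    return np.outer(e, e)

def mmd_matrix_conditional(ys, yt_pseudo, c):
    # Conditional MMD matrix Mc for class c, built from the source
    # labels and the target pseudo-labels.
    ns, nt = len(ys), len(yt_pseudo)
    e = np.zeros(ns + nt)
    src = np.flatnonzero(np.asarray(ys) == c)
    tgt = ns + np.flatnonzero(np.asarray(yt_pseudo) == c)
    if len(src):
        e[src] = 1.0 / len(src)
    if len(tgt):
        e[tgt] = -1.0 / len(tgt)
    return np.outer(e, e)
```

With these helpers, a JDA-style alignment objective takes the form tr(A^T X (M0 + Σc Mc) X^T A) over the concatenated data matrix X.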
III-B3. Close yet Discriminative Domain Adaptation (CDDA)
However, the joint alignment of the marginal and conditional distributions across domains does not explicitly render data discriminative in the searched feature subspace. As a result, we introduce discriminative domain adaptation via a repulsive force term, so as to increase the distances between subdomains with different labels and improve the discriminative power of the latent shared features, thereby making it possible to learn a better predictive model for both the source and target data.
Specifically, the repulsive force term is defined as $Dist_{repulsive} = Dist^{S \to T} + Dist^{T \to S}$, where the superscripts $S \to T$ and $T \to S$ index the distances computed from $\mathcal{D}_S$ to $\mathcal{D}_T$ and from $\mathcal{D}_T$ to $\mathcal{D}_S$, respectively. $Dist^{S \to T}$ represents the sum of the distances between each source subdomain $\mathcal{D}_S^{(c)}$ and all the target subdomains except the one with label $c$, i.e., $\mathcal{D}_T^{(\bar{c})} = \mathcal{D}_T \setminus \mathcal{D}_T^{(c)}$ with cardinality $n_t^{(\bar{c})}$. This sum of distances is explicitly defined as:
$$Dist^{S \to T} = \sum_{c=1}^{C} \left\| \frac{1}{n_s^{(c)}} \sum_{\mathbf{x}_i \in \mathcal{D}_S^{(c)}} \mathbf{A}^T \mathbf{x}_i - \frac{1}{n_t^{(\bar{c})}} \sum_{\mathbf{x}_j \in \mathcal{D}_T^{(\bar{c})}} \mathbf{A}^T \mathbf{x}_j \right\|^2 = \sum_{c=1}^{C} tr(\mathbf{A}^T \mathbf{X} \mathbf{M}_{S \to T} \mathbf{X}^T \mathbf{A}) \qquad (8)$$
where $\mathbf{M}_{S \to T}$ is defined as
$$(\mathbf{M}_{S \to T})_{ij} = \begin{cases} \dfrac{1}{n_s^{(c)} n_s^{(c)}}, & \mathbf{x}_i, \mathbf{x}_j \in \mathcal{D}_S^{(c)} \\[4pt] \dfrac{1}{n_t^{(\bar{c})} n_t^{(\bar{c})}}, & \mathbf{x}_i, \mathbf{x}_j \in \mathcal{D}_T^{(\bar{c})} \\[4pt] \dfrac{-1}{n_s^{(c)} n_t^{(\bar{c})}}, & \mathbf{x}_i \in \mathcal{D}_S^{(c)}, \mathbf{x}_j \in \mathcal{D}_T^{(\bar{c})} \;\text{ or }\; \mathbf{x}_i \in \mathcal{D}_T^{(\bar{c})}, \mathbf{x}_j \in \mathcal{D}_S^{(c)} \\[4pt] 0, & \text{otherwise} \end{cases} \qquad (9)$$
Symmetrically, $Dist^{T \to S}$ represents the sum of the distances from each target subdomain $\mathcal{D}_T^{(c)}$ to all the source subdomains except the source subdomain with label $c$, i.e., $\mathcal{D}_S^{(\bar{c})} = \mathcal{D}_S \setminus \mathcal{D}_S^{(c)}$ with cardinality $n_s^{(\bar{c})}$. Similarly, this sum of distances is explicitly defined as:
$$Dist^{T \to S} = \sum_{c=1}^{C} \left\| \frac{1}{n_t^{(c)}} \sum_{\mathbf{x}_i \in \mathcal{D}_T^{(c)}} \mathbf{A}^T \mathbf{x}_i - \frac{1}{n_s^{(\bar{c})}} \sum_{\mathbf{x}_j \in \mathcal{D}_S^{(\bar{c})}} \mathbf{A}^T \mathbf{x}_j \right\|^2 = \sum_{c=1}^{C} tr(\mathbf{A}^T \mathbf{X} \mathbf{M}_{T \to S} \mathbf{X}^T \mathbf{A}) \qquad (10)$$
where $\mathbf{M}_{T \to S}$ is defined as
$$(\mathbf{M}_{T \to S})_{ij} = \begin{cases} \dfrac{1}{n_t^{(c)} n_t^{(c)}}, & \mathbf{x}_i, \mathbf{x}_j \in \mathcal{D}_T^{(c)} \\[4pt] \dfrac{1}{n_s^{(\bar{c})} n_s^{(\bar{c})}}, & \mathbf{x}_i, \mathbf{x}_j \in \mathcal{D}_S^{(\bar{c})} \\[4pt] \dfrac{-1}{n_t^{(c)} n_s^{(\bar{c})}}, & \mathbf{x}_i \in \mathcal{D}_T^{(c)}, \mathbf{x}_j \in \mathcal{D}_S^{(\bar{c})} \;\text{ or }\; \mathbf{x}_i \in \mathcal{D}_S^{(\bar{c})}, \mathbf{x}_j \in \mathcal{D}_T^{(c)} \\[4pt] 0, & \text{otherwise} \end{cases} \qquad (11)$$
Finally, we obtain

$$Dist_{repulsive} = Dist^{S \to T} + Dist^{T \to S} = \sum_{c=1}^{C} tr\big(\mathbf{A}^T \mathbf{X} (\mathbf{M}_{S \to T} + \mathbf{M}_{T \to S}) \mathbf{X}^T \mathbf{A}\big) \qquad (12)$$

We define $\mathbf{M}_{repulsive} = \mathbf{M}_{S \to T} + \mathbf{M}_{T \to S}$ as the repulsive force matrix. While the minimization of Eq.(4) and Eq.(6) brings closer both the marginal and conditional distributions between source and target, the maximization of Eq.(12) increases the distances between source and target subdomains of different labels, thereby improving the discriminative power of the searched latent feature subspace.
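A hypothetical numpy sketch of the repulsive force matrix (our illustrative reading: for each class $c$, an MMD-style term between the source subdomain $c$ and the target subdomains with labels $\ne c$, plus the symmetric term; all names are ours):

```python
import numpy as np

def repulsive_matrix(ys, yt_pseudo, classes):
    # Sum of MMD-style outer products: source subdomain c against
    # target subdomains != c, and symmetrically target subdomain c
    # against source subdomains != c.
    ns, nt = len(ys), len(yt_pseudo)
    M = np.zeros((ns + nt, ns + nt))
    for c in classes:
        for src_mask, tgt_mask in [(ys == c, yt_pseudo != c),
                                   (ys != c, yt_pseudo == c)]:
            if src_mask.sum() == 0 or tgt_mask.sum() == 0:
                continue
            e = np.zeros(ns + nt)
            e[np.flatnonzero(src_mask)] = 1.0 / src_mask.sum()
            e[ns + np.flatnonzero(tgt_mask)] = -1.0 / tgt_mask.sum()
            M += np.outer(e, e)
    return M
```

Maximizing tr(A^T X M X^T A) with such a matrix then pushes differently-labeled subdomains apart in the projected space.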
III-B4. Geometry Aware Domain Adaptation (GADA)
In a number of state-of-the-art DA methods, e.g., [27, 28, 22], the simple Nearest Neighbor (NN) classifier is applied for label inference. In JDA and LRSR [42], NN-based label inference is applied twice at each iteration. NN is first applied to the target domain in order to generate the pseudo labels of the target data and enable the computation of the conditional probability distance as defined in Sect. III-B2. Once the optimized latent subspace is identified, NN is then applied once again at the end of the iteration for the label prediction of the target domain. However, given that the neighborhood is usually based on the $\ell_1$ or $\ell_2$ distance, NN could fall short of measuring the similarity of source and target domain data, which may be embedded into a manifold with complex geometric structures.
To account for the underlying data manifold structure in data similarity measurement, we further introduce two consistency constraints, namely label smoothness consistency and geometric structure consistency for both the pseudo and final label inference.
Label Smoothness Consistency (LSC): LSC is a constraint designed to prevent too many changes from the initial label assignment $\mathbf{Y}^{(0)}$:
$$LSC = \sum_{i=1}^{n} \left\| \mathbf{y}_i - \mathbf{y}_i^{(0)} \right\|^2 \qquad (13)$$
where $\mathbf{Y} \in \mathbb{R}^{n \times C}$ and $Y_{ic}$ is the calculated probability of datum $\mathbf{x}_i$ belonging to class $c$. Each datum $\mathbf{x}_i$ has a predicted label $\hat{y}_i = \arg\max_{c \le C} Y_{ic}$, and $\mathbf{Y}^{(0)}$ is the initial prediction. For the unlabeled target data, traditional ranking methods [18, 43] assign the labels $Y^{(0)}_{ic} = 0$. However, this definition lacks discriminative properties due to the equal probability assignments in $\mathbf{Y}^{(0)}$. In this work, we define the initial $\mathbf{Y}^{(0)}$ as:
$$Y^{(0)}_{ic} = \begin{cases} 1, & \mathbf{x}_i \in \mathcal{D}_S \text{ and } y_i = c \\ 1, & \mathbf{x}_i \in \mathcal{D}_T \text{ and } \hat{y}^{pseudo}_i = c \\ 0, & \text{otherwise} \end{cases} \qquad (14)$$
where $\hat{y}^{pseudo}_i$ denotes the pseudo label of target datum $\mathbf{x}_i$, generated via a base classifier, e.g., NN.
Geometric Structure Consistency (GSC): GSC is designed to ensure that the inferred data labels comply with the geometric structures of the underlying data manifolds. We propose to characterize the alignment of label inference with the underlying data geometric structure through the Laplacian matrix $\mathbf{L}$:
$$GSC = \frac{1}{2} \sum_{i,j=1}^{n} W_{ij} \left\| \mathbf{y}_i - \mathbf{y}_j \right\|^2 = tr(\mathbf{Y}^T \mathbf{L} \mathbf{Y}), \qquad \mathbf{L} = \mathbf{D} - \mathbf{W} \qquad (15)$$
where $\mathbf{W}$ is an affinity matrix [26], with $W_{ij}$ giving the affinity between two data samples $\mathbf{x}_i$ and $\mathbf{x}_j$, defined as $W_{ij} = \exp\left(-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / (2\sigma^2)\right)$ if $\mathbf{x}_i$ is among the nearest neighbors of $\mathbf{x}_j$ (or vice versa) and $W_{ij} = 0$ otherwise, and $\mathbf{D}$ is the degree matrix with $D_{ii} = \sum_j W_{ij}$. When Eq.(15) is minimized, the geometric structure consistency ensures that the label space does not change too much between nearby data.

III-B5. The Final Model (DGADA)
Our final DA model integrates: 1) the alignment of both the marginal and conditional distributions across domains as defined by Eq.(4) and Eq.(6), 2) the repulsive force as in Eq.(12), and 3) data geometry aware label inference through both the label smoothness (Eq.(13)) and geometric structure (Eq.(15)) consistencies. Therefore, our final model is defined as:
$$\min \; Dist(\mathcal{D}_S, \mathcal{D}_T) + Dist^c(\mathcal{D}_S, \mathcal{D}_T) - Dist_{repulsive} + LSC + \mu \cdot GSC \qquad (16)$$
It can be rewritten mathematically as:
$$\min_{\mathbf{A}^T \mathbf{X} \mathbf{H} \mathbf{X}^T \mathbf{A} = \mathbf{I},\; \mathbf{Y}} \; tr\Big(\mathbf{A}^T \mathbf{X} \big(\mathbf{M}_0 + \sum_{c=1}^{C} \mathbf{M}_c - \mathbf{M}_{repulsive}\big) \mathbf{X}^T \mathbf{A}\Big) + \lambda \|\mathbf{A}\|_F^2 + \sum_{i=1}^{n} \left\|\mathbf{y}_i - \mathbf{y}_i^{(0)}\right\|^2 + \mu \, tr(\mathbf{Y}^T \mathbf{L} \mathbf{Y}) \qquad (17)$$
where the constraint $\mathbf{A}^T \mathbf{X} \mathbf{H} \mathbf{X}^T \mathbf{A} = \mathbf{I}$ removes an arbitrary scaling factor in the embedding and prevents the above optimization from collapsing onto a subspace of dimension less than the required $k$ dimensions; $\lambda$ is a regularization parameter which guarantees that the optimization problem is well-defined, and $\mu$ is a trade-off parameter which balances LSC and GSC.
III-C. Solving the Model
Direct solution to Eq.(17) is nontrivial. We divide it into two subproblems.
Subproblem (a):
$$\min_{\mathbf{A}^T \mathbf{X} \mathbf{H} \mathbf{X}^T \mathbf{A} = \mathbf{I}} \; tr\Big(\mathbf{A}^T \mathbf{X} \big(\mathbf{M}_0 + \sum_{c=1}^{C} \mathbf{M}_c - \mathbf{M}_{repulsive}\big) \mathbf{X}^T \mathbf{A}\Big) + \lambda \|\mathbf{A}\|_F^2 \qquad (18)$$
Subproblem (b):
$$\min_{\mathbf{Y}} \; \sum_{i=1}^{n} \left\|\mathbf{y}_i - \mathbf{y}_i^{(0)}\right\|^2 + \mu \, tr(\mathbf{Y}^T \mathbf{L} \mathbf{Y}) \qquad (19)$$
These two subproblems are then iteratively optimized.
Subproblem (a) amounts to solving a generalized eigendecomposition problem. The Augmented Lagrangian method [10, 22] can be used to solve it. In setting its partial derivative w.r.t. $\mathbf{A}$ equal to zero, we obtain:
$$\Big(\mathbf{X} \big(\mathbf{M}_0 + \sum_{c=1}^{C} \mathbf{M}_c - \mathbf{M}_{repulsive}\big) \mathbf{X}^T + \lambda \mathbf{I}\Big) \mathbf{A} = \mathbf{X} \mathbf{H} \mathbf{X}^T \mathbf{A} \Phi \qquad (20)$$
where $\Phi$ is the diagonal matrix of Lagrange multipliers. The optimal subspace is obtained by solving Eq.(20) for the $k$ smallest eigenvectors. We then obtain the projection matrix $\mathbf{A}$ and the underlying embedding space $\mathbf{Z} = \mathbf{A}^T \mathbf{X}$.

Subproblem (b) is nontrivial. Inspired by the solutions proposed in [45, 18, 43], the minimum is approached where the derivative of the objective function is zero. The solution is provided by:
$$\mathbf{Y}^{*} = (\mathbf{I} + \mu \mathbf{L})^{-1} \, \mathbf{Y}^{(0)} \qquad (21)$$
where $\mathbf{Y}^{*}$ gathers the prediction probabilities of the data over the different class labels, $\mathbf{W}$ is the affinity matrix and $\mathbf{D}$ is its diagonal degree matrix. The inferred label for each datum is then:
$$\hat{y}_i = \arg\max_{c \le C} \; Y^{*}_{ic} \qquad (22)$$
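A minimal sketch of subproblem (b) (names and the Gaussian k-nearest-neighbor affinity are our assumptions; it uses the unnormalized Laplacian $\mathbf{L} = \mathbf{D} - \mathbf{W}$): setting the derivative of the objective to zero yields a single linear system, followed by an argmax over classes.

```python
import numpy as np

def knn_affinity(X, k=3, sigma=1.0):
    # Gaussian affinity restricted to the k nearest neighbors, symmetrized.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.zeros_like(d2)
    for i in range(len(X)):
        nbrs = np.argsort(d2[i])[1:k + 1]   # skip the point itself
        W[i, nbrs] = np.exp(-d2[i, nbrs] / (2 * sigma ** 2))
    return np.maximum(W, W.T)

def infer_labels(W, Y0, mu=1.0):
    # Zero-derivative condition of ||Y - Y0||^2 + mu * tr(Y^T L Y):
    # (I + mu * L) Y* = Y0 with L = D - W, then labels by argmax.
    L = np.diag(W.sum(axis=1)) - W
    Ystar = np.linalg.solve(np.eye(len(W)) + mu * L, Y0)
    return Ystar.argmax(axis=1), Ystar
```

On a 4-node chain graph with only the two end nodes labeled, the two middle nodes inherit the label of their nearest labeled end, which illustrates how the geometric structure steers label inference.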
To sum up, at a given iteration, subproblem (a) as in Eq.(18) searches a latent feature subspace Z that brings closer both the marginal and conditional data distributions between source and target, while making use of the source and current target labels to push away inter-class data; subproblem (b) as in Eq.(19) infers through Eq.(22) novel labels for the target data in line with the source data labels, while making use of the geometric structures of the underlying data manifolds in the current subspace Z. This iterative process eventually ends up in a latent feature subspace where: 1) the discrepancies of both the marginal and conditional data distributions between source and target are narrowed; 2) source and target data are rendered more discriminative thanks to the increase of inter-class distances; and 3) the geometric structures of the underlying data manifolds are aligned.
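Subproblem (a) reduces to a generalized eigendecomposition; a minimal numpy sketch (matrix names are illustrative: `A_mat` would play the role of the left-hand-side matrix and `B_mat` the right-hand-side matrix of the generalized eigenproblem):

```python
import numpy as np

def smallest_generalized_eigvecs(A_mat, B_mat, k):
    # Solve A a = phi * B a and return the k eigenvectors with the
    # smallest eigenvalues, via the equivalent standard problem
    # B^{-1} A a = phi a (assumes B_mat is invertible).
    vals, vecs = np.linalg.eig(np.linalg.solve(B_mat, A_mat))
    order = np.argsort(vals.real)[:k]
    return vecs[:, order].real
```

In practice, dedicated symmetric generalized eigensolvers are preferable for numerical stability; this direct inversion is only meant to show the structure of the computation.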
The complete learning algorithm is summarized in Algorithm 1  DGADA.
III-D. Kernelization Analysis
The proposed DGADA method can be extended to nonlinear problems in a Reproducing Kernel Hilbert Space via a kernel mapping $\psi: \mathbf{x} \mapsto \psi(\mathbf{x})$, i.e., $\psi(\mathbf{X}) = [\psi(\mathbf{x}_1), \ldots, \psi(\mathbf{x}_n)]$, and the kernel matrix $\mathbf{K} = \psi(\mathbf{X})^T \psi(\mathbf{X}) \in \mathbb{R}^{n \times n}$. We utilize the Representer theorem, $\mathbf{A} = \psi(\mathbf{X}) \mathbf{\Theta}$, to formulate the kernel DGADA as
$$\Big(\mathbf{K} \big(\mathbf{M}_0 + \sum_{c=1}^{C} \mathbf{M}_c - \mathbf{M}_{repulsive}\big) \mathbf{K}^T + \lambda \mathbf{I}\Big) \mathbf{\Theta} = \mathbf{K} \mathbf{H} \mathbf{K}^T \mathbf{\Theta} \Phi \qquad (23)$$
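For instance, with a Gaussian kernel (a common choice; the bandwidth value is our assumption), the kernel matrix $\mathbf{K}$ over the concatenated source and target data can be sketched as:

```python
import numpy as np

def kernel_matrix(X, sigma=1.0):
    # Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) computed
    # over the n = ns + nt concatenated source and target samples.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))
```

Such a Gram matrix is symmetric positive semi-definite, so the kernelized eigenproblem remains well-posed.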
IV. Experiments
In this section, we verify and analyze in depth the effectiveness of the proposed domain adaptation model, i.e., DGADA, on 36 cross-domain image classification tasks generated by permuting six datasets (see Fig.2). Sect. IV-A describes the benchmarks and the features. Sect. IV-B lists the baseline methods against which the proposed DGADA is compared. Sect. IV-C presents the experimental setup and introduces in particular the two partial DA methods, namely CDDA and GADA, in addition to the proposed DGADA based on our full DA model. Sect. IV-D discusses the experimental results in comparison with the state of the art. Sect. IV-E analyzes the convergence and parameter sensitivity of the proposed method. Sect. IV-F further provides insight into the proposed DA model by visualizing the achieved feature subspaces through both synthetic and real data.
IV-A. Benchmarks and Features
As illustrated in Fig.2, USPS [15]+MNIST [20], COIL20 [22], PIE [22] and Office+Caltech [22, 42, 35] are standard benchmarks for the purpose of evaluation and comparison with the state of the art in DA. In this paper, we follow the data preparation of most previous works [40, 42, 12, 11, 6, 24] and construct 36 datasets for different image classification tasks.
Office+Caltech consists of 2533 images of ten categories (8 to 151 images per category per domain)[11]. These images come from four domains: (A) AMAZON, (D) DSLR, (W) WEBCAM, and (C) CALTECH. AMAZON images were acquired in a controlled environment with studio lighting. DSLR consists of high resolution images captured by a digital SLR camera in a home environment under natural lighting. WEBCAM images were acquired in a similar environment to DSLR, but with a lowresolution webcam. CALTECH images were collected from Google Images.
We use two types of image features extracted from these datasets, i.e., SURF and DeCAF6, that are publicly available. The SURF [13] features are shallow features extracted and quantized into an 800-bin histogram using a codebook computed with K-means on a subset of images from Amazon. The resultant histograms are further standardized by z-score. The Deep Convolutional Activation Features (DeCAF6) [8] are deep features computed as in AELM [40], which makes use of the VLFeat MatConvNet library with different pre-trained CNN models, including in particular the Caffe implementation of AlexNet [19] trained on the ImageNet dataset. The outputs of the 6th layer are used as deep features, leading to 4096-dimensional DeCAF6 features. In this experiment, we denote the datasets Amazon, Webcam, DSLR, and Caltech-256 as A, W, D, and C, respectively. In denoting by an arrow “→” the direction from “source” to “target”, 12 DA tasks can then be constructed, namely A→W, …, C→D. For example, “W→D” means that the Webcam image dataset is considered as the labeled source domain whereas the DSLR image dataset is the unlabeled target domain.
USPS+MNIST shares ten common digit categories across two subsets, namely USPS and MNIST, but with very different data distributions (see Fig.2). We construct a first DA task, USPS vs MNIST, by randomly sampling 1,800 images from USPS to form the source data, and randomly sampling 2,000 images from MNIST to form the target data. Then, we switch the source/target pair to get another DA task, i.e., MNIST vs USPS. We uniformly rescale all images to size 16×16 and represent each one by a feature vector encoding the grayscale pixel values, so that the source and target data share the same feature space. As a result, we have defined two cross-domain DA tasks, namely USPS→MNIST and MNIST→USPS.
COIL20 contains 20 objects with 1440 images (Fig.2). The images of each object were taken at pose intervals of 5 degrees, resulting in 72 poses per object. Each image has a resolution of 32×32 pixels with 256 gray levels per pixel. In this experiment, we partition the dataset into two subsets, namely COIL1 and COIL2 [42]. COIL1 contains all images taken in the directions of quadrants 1 and 3, resulting in 720 images. COIL2 contains all images taken in the directions of quadrants 2 and 4, and thus also numbers 720 images. In this way, we construct two subsets with relatively different distributions. The COIL20 dataset, with 20 classes, is thus split into two DA tasks, i.e., COIL1→COIL2 and COIL2→COIL1.
The PIE face database consists of 68 subjects, each under 21 various illumination conditions [6, 22]. We adopt five pose subsets: C05, C07, C09, C27, C29, which provide a rich basis for domain adaptation; that is, we can choose one pose as the source and any remaining one as the target, yielding 5 × 4 = 20 different source/target combinations. We also combine all five poses together to form a single dataset for a large-scale transfer learning experiment. We crop all images to 32 × 32 and only adopt the pixel values as the input. Finally, the five selected pose subsets, denoted as PIE1, PIE2, etc., result in 20 DA tasks, i.e., PIE1 vs PIE2, …, PIE5 vs PIE4.
IV-B. Baseline Methods
The proposed DGADA method is compared with twenty-two methods from the literature, including deep learning-based approaches for unsupervised domain adaptation. They are: (1) 1-Nearest Neighbor classifier (NN); (2) Principal Component Analysis (PCA) + NN; (3) Geodesic Flow Kernel (GFK) [13] + NN; (4) Transfer Component Analysis (TCA) [28] + NN; (5) Transfer Subspace Learning (TSL) [36] + NN; (6) Joint Domain Adaptation (JDA) [22] + NN; (7) Extreme Learning Machine (ELM) [40] + NN; (8) Augmented Extreme Learning Machine (AELM) [40] + NN; (9) Subspace Alignment (SA) [9]; (10) Marginalized Stacked Denoising Autoencoder (mSDA) [4]; (11) Transfer Joint Matching (TJM) [23]; (12) Robust Transfer Metric Learning (RTML) [6]; (13) Scatter Component Analysis (SCA) [11]; (14) Cross-Domain Metric Learning (CDML) [41]; (15) Deep Domain Confusion (DDC) [39]; (16) Low-Rank Transfer Subspace Learning (LTSL) [35]; (17) Low-Rank and Sparse Representation (LRSR) [42]; (18) Kernel Principal Component Analysis (KPCA) [34]; (19) Joint Geometric and Statistical Alignment (JGSA) [44]; (20) Deep Adaptation Networks (DAN) [21]; (21) Deep Convolutional Neural Network (AlexNet) [19]; and (22) Domain Adaptation with Low-rank Reconstruction (RVDLR) [16]. In addition, for the purpose of fair comparison, we follow the experimental settings of JGSA, AlexNet and SCA, and apply DeCAF6 as the features for some of the evaluated methods. Whenever possible, the reported performance scores of the twenty-two methods of the literature are directly collected from previous research [22, 40, 6, 11, 42, 44] and are assumed to be their best performance.
IV-C. Experimental Setup
For the problem of domain adaptation, it is not possible to tune a set of optimal hyper-parameters, given that the target domain has no labeled data. Following the setting of previous research [24, 22, 42], we also evaluate the proposed DGADA by empirically searching the parameter space for the optimal settings. Specifically, the proposed DGADA method has three hyper-parameters, i.e., the subspace dimension $k$ and the regularization parameters $\lambda$ and $\mu$, which are set to one configuration for USPS, MNIST, COIL20 and PIE, and to another for Office and Caltech-256.
In our experiments, accuracy on the test dataset, as defined by Eq.(24), is the evaluation measure. It is widely used in the literature, e.g., [27, 21, 24, 22, 42].
$$\text{Accuracy} = \frac{\left|\{x : x \in \mathcal{D}_T \wedge \hat{y}(x) = y(x)\}\right|}{\left|\mathcal{D}_T\right|} \quad (24)$$
where $\mathcal{D}_T$ is the target domain treated as test data, $\hat{y}(x)$ is the predicted label and $y(x)$ is the ground-truth label for a test sample $x$.
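The accuracy measure of Eq.(24) is simply the fraction of target samples whose predicted label matches the ground truth. A minimal NumPy sketch (the function name is ours, not from the paper):

```python
import numpy as np

def accuracy(y_pred, y_true):
    """Eq.(24): fraction of target (test) samples correctly labeled."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    if y_pred.shape != y_true.shape:
        raise ValueError("prediction/ground-truth size mismatch")
    return float(np.mean(y_pred == y_true))

# Toy usage: 3 of 4 target samples are correctly predicted.
print(accuracy([1, 2, 3, 4], [1, 2, 0, 4]))  # 0.75
```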
To provide insight into the proposed DA method and highlight the individual contribution of each term in our final model, i.e., the discriminative term using the repulsive force as defined in Eq.(12), and the geometry-aware term through label smoothness consistency as in Eq.(13) and geometric structure consistency as in Eq.(15), we evaluate the proposed DA method using three settings:

CDDA: In this setting, sub-problem (b) in Sect. III-C as defined in Eq.(19) is simply replaced by the Nearest Neighbor (NN) predictor. This corresponds to our final DA model as defined in Eq.(18), which only makes use of the repulsive force term, without the geometry-aware label inference defined by Eq.(13) and Eq.(15). This setting makes it possible to understand how important discriminative DA is w.r.t. state-of-the-art baseline DA methods focused solely on data distribution alignment, e.g., JDA.

GADA: In this setting, we extend popular data distribution alignment-based DA methods, e.g., JDA, with geometry-aware label inference but ignore the repulsive force term in our final model reformulated in Eq.(17). This setting thus jointly considers cross-domain conditional and marginal distribution alignment (Eq.(5) and Eq.(6)) and geometry-aware label inference (Eq.(13) and Eq.(15)). It enables quantification of the contribution of the geometry-aware label inference term as defined by Eq.(13) and Eq.(15) in comparison with state-of-the-art baseline DA methods focused solely on data distribution alignment, e.g., JDA.
IV-D Experimental Results and Discussion
IV-D1 Experiments on the COIL20 Dataset
The COIL20 dataset (see Fig.2) features the challenge of pose variations between the source and target domains. Fig.3 depicts the experimental results on the COIL dataset. As can be seen in this figure, where top results are highlighted in red, the two partial models, i.e., CDDA and GADA, as well as the proposed final model, DGADA, all outperform the eight baseline DA algorithms in overall average accuracy by a significant margin.
It is worth noting that, when adding label inference based on the underlying data manifold structure, the proposed DGADA improves its sibling CDDA by a margin as high as roughly 7 points, thereby highlighting the importance of data geometry aware label inference as introduced in DGADA. As compared to JDA, the proposed CDDA, which adds a discriminative repulsive force term w.r.t. JDA, also shows its effectiveness and improves the latter by more than 3 points.
IV-D2 Experiments on the Office+Caltech256 Data Sets
Fig.4 and Fig.5 synthesize the experimental results in comparison with the state of the art when deep features (i.e., DeCAF6 features) and classic shallow features (i.e., SURF features) are used, respectively.
As can be seen in Fig.5, both CDDA and DGADA outperform the state-of-the-art methods in terms of average accuracy, thereby demonstrating the effectiveness of the proposed DA method. In particular, in comparison with JDA, which only addresses data distribution alignment between source and target and upon which the proposed DA method is built, CDDA improves JDA by 2 points thanks to the discriminative repulsive force term introduced in our model. When label inference accounts for the underlying data structure, our final model DGADA further improves CDDA by roughly 1 point.

Fig.4 compares the proposed DA method using deep features w.r.t. the state of the art, in particular end-to-end deep learning-based DA methods. As can be seen in Fig.4, the use of deep features enables an impressive accuracy improvement over shallow features. Simple baseline methods, e.g., NN and PCA, see their accuracy soar by roughly 40 points, demonstrating the power of the deep learning paradigm. Our proposed DA method also benefits from this jump and sees its accuracy rise from 48.22 to 89.13 for CDDA and from 49.02 to 90.43 for DGADA. As with shallow features, CDDA improves JDA by 3 points, and DGADA further improves CDDA by 1 point when label inference accounts for the underlying data geometric structure. As a result, DGADA displays the best average accuracy and slightly outperforms DAN.
IV-D3 Experiments on the USPS+MNIST Data Set
The USPS+MNIST dataset features different writing styles between the source and target domains. Fig.6 lists the experimental results in comparison with 14 state-of-the-art DA methods. As can be seen in the table, CDDA displays a 69.14% average accuracy and ranks as the third best performer. It shows its effectiveness once more as it improves its baseline JDA by more than 5 points on average. When accounting for the underlying data geometric structure, the proposed DGADA further improves its sibling CDDA by a margin of more than 7 points and displays state-of-the-art performance with a 76.54% accuracy. It is worth noting that the second best DA performer on this dataset, i.e., JGSA, also suggests aligning data both statistically and geometrically, and thereby corroborates our data geometry-aware DA approach.
IV-D4 Experiments on the CMU PIE Data Set
The CMU PIE dataset is a large face dataset featuring both illumination and pose variations. Fig.7 synthesizes the experimental results for DA using this dataset. As can be seen in the figure, as in the previous experiments, the proposed DGADA displays the best average accuracy over the 20 cross-domain adaptation experiments. By aligning both marginal and conditional data distributions, JDA performs quite well and displays a 60.24% average accuracy. By integrating the discriminative repulsive force term, CDDA improves JDA by roughly 3 points. DGADA further improves CDDA by more than 1 point.
It is interesting to note that the second best performer on this dataset, namely LRSR, also tries to align source and target data geometrically, through both low-rank and sparse constraints, so that source and target data are interleaved within a novel shared feature subspace.
IV-E Convergence and Parameter Sensitivity
While the proposed DGADA displays state-of-the-art performance over 36 DA tasks across six datasets (USPS, MNIST, COIL20, PIE, Amazon, Caltech), two important questions are how fast the proposed method converges (Sect. IV-E2) and how sensitive it is to its hyper-parameters (Sect. IV-E1).
IV-E1 Parameter sensitivity
Three hyper-parameters are introduced in the proposed methods: the dimension of the extracted feature subspace, which determines the structure of the low-dimensional embedding, and two regularization parameters. In Fig.8, we plot the classification accuracies of the proposed DA method w.r.t. different values of the subspace dimension on the COIL and USPS+MNIST datasets. As shown in Fig.8, the three proposed DA variants, namely CDDA, GADA and DGADA, remain stable over a wide range of subspace dimensions. In our experiments, we fix the subspace dimension so as to balance efficiency and accuracy.
The regularization parameter introduced in Eq.(17) and Eq.(18) aims to regularize the projection matrix A so as to avoid overfitting the chosen shared feature subspace to both the source and target data. The trade-off parameter defined in Eq.(22) balances LSC and GSC. We study the sensitivity of the proposed GADA and DGADA methods over a wide range of values for both parameters. We plot in Fig.9 the results of both methods on the D→W and COIL1→COIL2 tasks, with the remaining parameter held fixed. As can be seen from Fig.9, the proposed GADA and DGADA display their stability, as the resultant classification accuracies remain roughly the same over a wide range of parameter values.
IV-E2 Convergence analysis
In Fig.10, we further perform a convergence analysis of the proposed CDDA, GADA and DGADA methods using the DeCAF6 features on the Office+Caltech datasets. The question here is how fast a DA method achieves its best performance w.r.t. the number of iterations. Fig.10 reports the 12 cross-domain adaptation experiments (C→A, C→W, …, D→A, D→W) with an increasing number of iterations.
As shown in Fig.10, CDDA, GADA and DGADA converge within 35 iterations during optimization.
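A practical way to produce a convergence curve like Fig.10 is to record the target accuracy after each iteration and stop once it plateaus. The loop below is a hedged sketch of such monitoring, not the paper's optimization procedure; `step` is a hypothetical callback that runs one optimization iteration and returns the current accuracy.

```python
def run_until_converged(step, max_iters=50, tol=1e-3):
    """Call `step(t)` once per iteration; stop when accuracy stops improving.

    Returns the per-iteration accuracy history, i.e., the data one
    would plot as a convergence curve.
    """
    history = []
    for t in range(max_iters):
        history.append(step(t))
        if len(history) > 1 and abs(history[-1] - history[-2]) < tol:
            break  # accuracy plateaued: treat as converged
    return history

# Toy stand-in that saturates at 0.9 after a few iterations.
history = run_until_converged(lambda t: min(9, 5 + t) / 10)
```

With real DA methods, `step` would solve one round of the alternating optimization (projection update followed by label inference) before measuring accuracy.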
IV-F Analysis and Verification
To further gain insight into the domain adaptation skills of the proposed CDDA, GADA and DGADA, we also evaluate the proposed methods on a synthetic dataset in comparison with several state-of-the-art DA methods. Fig.11 visualizes the original data distributions with 4 classes and the resultant shared feature subspaces as computed by TCA, JDA, TJM, SCA, CDDA, GADA and DGADA, respectively. In this experiment, we focus our attention on the ability of the DA methods to: (a) narrow the discrepancies of data distributions between source and target; (b) increase data discriminativeness; and (c) align data geometric structures between source and target. As such, the original synthetic data depicts slight distribution discrepancies between source and target for the first two classes and wide distribution mismatch for the third and fourth classes. The fourth-class data further depict a moon-like geometric structure.
As can be seen in Fig.11, baseline methods, e.g., TCA, SCA and TJM, have difficulty aligning data distributions with wide discrepancies, e.g., the third-class data. JDA narrows data distribution discrepancies but lacks class data discriminativeness. The proposed variant CDDA improves on JDA and makes class data well separated thanks to the introduced repulsive force term, but falls short of preserving the data geometric structure (see the fourth, moon-like class data). The variant GADA aligns data distributions and preserves the underlying data geometric structures thanks to label smoothness consistency (LSC) and geometric structure consistency (GSC), but lacks data discriminativeness. In contrast, thanks to the joint consideration of data discriminativeness and geometric structure awareness, the proposed DGADA not only aligns data distributions compactly but also separates class data very distinctively. Furthermore, it also preserves the underlying data geometric structures.
The above findings can be further verified on real data through the MNIST→USPS DA task, where the proposed DA methods achieve remarkable results (see Fig.6). Fig.12 visualizes class-explicit data distributions in their original subspace and in the resultant shared feature subspaces using JDA and the three variants of the proposed DA method, namely CDDA, GADA and DGADA, with the same experimental setting.

Data distributions and geometric structures. Fig.12(a,b,c) visualizes the MNIST, USPS and MNIST+USPS datasets in their original data space, respectively. As shown in these figures, the MNIST and USPS datasets depict different data distributions and various data structures. In particular, yellow dots represent digit 2. They show a long and narrow shape in MNIST (Fig.12(a)) but a circle-like shape in USPS (Fig.12(b)). They further display large cross-domain data discrepancies (Fig.12(c)), as do all the other classes.

Contribution of the repulsive force term. Visualization results in Fig.12(h,i,j,k) show that, in comparison with their respective baseline DA methods, i.e., JDA (Fig.12(h)) and GADA (Fig.12(j)), the proposed two DA variants, i.e., CDDA (Fig.12(i)) and DGADA (Fig.12(k)), which integrate in their model the repulsive force term as introduced in Sect. III-B3, achieve data discriminativeness by compacting intra-class instances and separating inter-class data. As a result, as shown in Fig.6, DGADA outperforms GADA and CDDA outperforms JDA, thereby illustrating the importance of increasing data discriminativeness in DA.

Contribution of Geometric Structure Awareness. Visualization results in Fig.12(d,e) show that the JDA and CDDA subspaces fail to preserve the geometric structures of the underlying data manifold. For instance, the long and narrow shape of the orange dots in the source MNIST domain and the corresponding circle-like orange cloud in the target USPS domain (Fig.12(c)) are no longer preserved in the JDA (Fig.12(d)) and CDDA (Fig.12(e)) subspaces. In contrast, thanks to the geometry awareness constraints, i.e., label smoothness consistency (LSC) and geometric structure consistency (GSC), as introduced in Sect. III-B4, the two variants of the proposed DA methods, i.e., GADA (Fig.12(f)) and DGADA (Fig.12(g)), succeed in preserving the geometric structures of the underlying data, and thereby the inherent data similarities and the consistency of label inference. As a result, DGADA outperforms CDDA and GADA outperforms JDA, suggesting the importance of geometric structure awareness in DA.
V Conclusion and Future Work
In this paper, we have proposed a novel discriminative and geometry-aware unsupervised DA method based on feature adaptation. Comprehensive experiments on 36 cross-domain image classification tasks across six popular DA datasets highlight the interest of enhancing the data discriminative properties within the model and of propagating labels with respect to the geometric structure of the underlying data manifold, and verify the effectiveness of the proposed method compared with twenty-two baseline DA methods of the literature. Using both synthetic and real data and three variants of the proposed DA method, we have further provided in-depth analysis of and insights into the proposed DGADA, quantifying and visualizing the contributions of data discriminativeness and data geometry awareness.
Our future work will concentrate on embedding the proposed method in deep networks and on studying other vision tasks, e.g., object detection, within the setting of transfer learning.
References
 [1] Mahsa Baktashmotlagh, Mehrtash Harandi, and Mathieu Salzmann. Distribution-matching embedding for visual domain adaptation. Journal of Machine Learning Research, 17(108):1–30, 2016.
 [2] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning, 79(1):151–175, 2010.
 [3] Karsten M. Borgwardt, Arthur Gretton, Malte J. Rasch, Hans-Peter Kriegel, Bernhard Schölkopf, and Alex J. Smola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49–e57, 2006.
 [4] Minmin Chen, Zhixiang Eddie Xu, Kilian Q. Weinberger, and Fei Sha. Marginalized denoising autoencoders for domain adaptation. CoRR, abs/1206.4683, 2012.
 [5] Gabriela Csurka. Domain adaptation for visual applications: A comprehensive survey. CoRR, abs/1702.05374, 2017.
 [6] Zhengming Ding and Yun Fu. Robust transfer metric learning for image classification. IEEE Trans. Image Processing, 26(2):660–670, 2017.

 [7] Jeff Donahue, Judy Hoffman, Erik Rodner, Kate Saenko, and Trevor Darrell. Semi-supervised domain adaptation with instance constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 668–675, 2013.
 [8] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 647–655, 2014.
 [9] Basura Fernando, Amaury Habrard, Marc Sebban, and Tinne Tuytelaars. Unsupervised visual domain adaptation using subspace alignment. In IEEE International Conference on Computer Vision, ICCV 2013, Sydney, Australia, December 1-8, 2013, pages 2960–2967, 2013.
 [10] Michel Fortin and Roland Glowinski. Augmented Lagrangian methods: applications to the numerical solution of boundary-value problems, volume 15. Elsevier, 2000.
 [11] Muhammad Ghifary, David Balduzzi, W. Bastiaan Kleijn, and Mengjie Zhang. Scatter component analysis: A unified framework for domain adaptation and domain generalization. IEEE Trans. Pattern Anal. Mach. Intell., 39(7):1414–1430, 2017.
 [12] Boqing Gong, Kristen Grauman, and Fei Sha. Connecting the dots with landmarks: Discriminatively learning domain-invariant features for unsupervised domain adaptation. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, pages 222–230, 2013.
 [13] Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman. Geodesic flow kernel for unsupervised domain adaptation. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2066–2073. IEEE, 2012.
 [14] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
 [15] Jonathan J. Hull. A database for handwritten text recognition research. IEEE Trans. Pattern Anal. Mach. Intell., 16(5):550–554, 1994.
 [16] I-Hong Jhuo, Dong Liu, D. T. Lee, and Shih-Fu Chang. Robust visual domain adaptation with low-rank reconstruction. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2168–2175. IEEE, 2012.
 [17] Daniel Kifer, Shai Ben-David, and Johannes Gehrke. Detecting change in data streams. In Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30, pages 180–191. VLDB Endowment, 2004.
 [18] T. H. Kim, K. M. Lee, and S. U. Lee. Learning full pairwise affinities for spectral segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(7):1690–1703, July 2013.
 [19] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
 [20] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov 1998.
 [21] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I. Jordan. Learning transferable features with deep adaptation networks. In ICML, pages 97–105, 2015.

 [22] Mingsheng Long, Jianmin Wang, Guiguang Ding, Jiaguang Sun, and Philip S. Yu. Transfer feature learning with joint distribution adaptation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2200–2207, 2013.
 [23] Mingsheng Long, Jianmin Wang, Guiguang Ding, Jiaguang Sun, and Philip S. Yu. Transfer joint matching for unsupervised domain adaptation. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pages 1410–1417, 2014.
 [24] Lingkun Luo, Xiaofang Wang, Shiqiang Hu, and Liming Chen. Robust data geometric structure aligned close yet discriminative domain adaptation. CoRR, abs/1705.08620, 2017.
 [25] Lingkun Luo, Xiaofang Wang, Shiqiang Hu, Chao Wang, Yuxing Tang, and Liming Chen. Close yet distinctive domain adaptation. CoRR, abs/1704.04235, 2017.

 [26] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 849–856. MIT Press, 2002.
 [27] Sinno Jialin Pan, James T. Kwok, and Qiang Yang. Transfer learning via dimensionality reduction. In AAAI, volume 8, pages 677–682, 2008.
 [28] Sinno Jialin Pan, Ivor W Tsang, James T Kwok, and Qiang Yang. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22(2):199–210, 2011.
 [29] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2010.
 [30] Pau Panareda Busto and Juergen Gall. Open set domain adaptation. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
 [31] V. M. Patel, R. Gopalan, R. Li, and R. Chellappa. Visual domain adaptation: A survey of recent advances. IEEE Signal Processing Magazine, 32(3):53–69, May 2015.
 [32] Kuniaki Saito, Yoshitaka Ushiku, and Tatsuya Harada. Asymmetric tri-training for unsupervised domain adaptation. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 2988–2997, 2017.
 [33] Sandeepkumar Satpal and Sunita Sarawagi. Domain adaptation of conditional probability models via feature subsetting. In PKDD, volume 4702, pages 224–235. Springer, 2007.
 [34] Bernhard Schölkopf, Alexander J. Smola, and Klaus-Robert Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.
 [35] Ming Shao, Dmitry Kit, and Yun Fu. Generalized transfer subspace learning through low-rank constraint. International Journal of Computer Vision, 109(1-2):74–93, 2014.
 [36] S. Si, D. Tao, and B. Geng. Bregman divergence-based regularization for transfer subspace learning. IEEE Transactions on Knowledge and Data Engineering, 22(7):929–942, July 2010.
 [37] Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul V. Buenau, and Motoaki Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in Neural Information Processing Systems, pages 1433–1440, 2008.
 [38] Yuxing Tang, Josiah Wang, Xiaofang Wang, Boyang Gao, Emmanuel Dellandrea, Robert Gaizauskas, and Liming Chen. Visual and semantic knowledge transfer for large scale semi-supervised object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
 [39] Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance. CoRR, abs/1412.3474, 2014.
 [40] Muhammad Uzair and Ajmal S. Mian. Blind domain adaptation with augmented extreme learning machine features. IEEE Trans. Cybernetics, 47(3):651–660, 2017.

 [41] Hao Wang, Wei Wang, Chen Zhang, and Fanjiang Xu. Cross-domain metric learning based on information theory. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, July 27-31, 2014, Québec City, Québec, Canada, pages 2099–2105, 2014.
 [42] Yong Xu, Xiaozhao Fang, Jian Wu, Xuelong Li, and David Zhang. Discriminative transfer subspace learning via low-rank and sparse representation. IEEE Trans. Image Processing, 25(2):850–863, 2016.
 [43] C. Yang, L. Zhang, H. Lu, X. Ruan, and M. H. Yang. Saliency detection via graph-based manifold ranking. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 3166–3173, June 2013.
 [44] Jing Zhang, Wanqing Li, and Philip Ogunbona. Joint geometrical and statistical alignment for visual domain adaptation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
 [45] Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bernhard Schölkopf. Learning with local and global consistency. In Advances in Neural Information Processing Systems 16, pages 321–328. MIT Press, 2004.