Discriminative and Geometry Aware Unsupervised Domain Adaptation

12/28/2017
by   Lingkun Luo, et al.

Domain adaptation (DA) aims to generalize a learning model across training and testing data despite the mismatch of their data distributions. In light of a theoretical estimation of the upper error bound, we argue in this paper that an effective DA method should 1) search a shared feature subspace where source and target data are not only aligned in terms of distributions as most state of the art DA methods do, but also discriminative in that instances of different classes are well separated; 2) account for the geometric structure of the underlying data manifold when inferring data labels on the target domain. In comparison with a baseline DA method which only cares about data distribution alignment between source and target, we derive three different DA models, namely CDDA, GA-DA, and DGA-DA, to highlight the contribution of Close yet Discriminative DA (CDDA) based on 1), Geometry Aware DA (GA-DA) based on 2), and finally Discriminative and Geometry Aware DA (DGA-DA) implementing jointly 1) and 2). Using both synthetic and real data, we show the effectiveness of the proposed approach which consistently outperforms state of the art DA methods over 36 image classification DA tasks through 6 popular benchmarks. We further carry out in-depth analysis of the proposed DA method in quantifying the contribution of each term of our DA model and provide insights into the proposed DA methods in visualizing both real and synthetic data.


I Introduction

Traditional machine learning tasks assume that both training and testing data are drawn from the same data distribution [29, 31, 5]. However, in many real-life applications, due to factors as diverse as sensor differences, lighting changes, viewpoint variations, etc., data from a target domain may have a different data distribution w.r.t. the labeled data in a source domain, where a predictor therefore cannot be reliably learned due to the data distribution shift. On the other hand, manually labeling enough target data for the purpose of training an effective predictor can be very expensive, tedious and thus prohibitive.

Domain adaptation (DA) [29, 31, 5] aims to leverage possibly abundant labeled data from a source domain to learn an effective predictor for data in a target domain despite the data distribution discrepancy between the source and target. While DA can be semi-supervised by assuming a certain amount of labeled data is available in the target domain, in this paper we are interested in unsupervised DA[32] where we assume that the target domain has no labels.

State of the art DA methods can be categorized into instance-based [29, 7], feature-based [30, 22, 42], or classifier-based. Classifier-based DA is not suited to unsupervised DA as it aims to fit a classifier trained on the source data to the target data through adaptation of its parameters, and thereby requires some labels in the target domain [38]. The instance-based approach generally assumes that 1) the conditional distributions of the source and target domain are identical [44], and 2) a certain portion of the data in the source domain can be reused [29] for learning in the target domain through re-weighting. Feature-based adaptation relaxes such a strict assumption and only requires that there exists a mapping from the input data space to a latent shared feature representation space. This latent shared feature space captures the information necessary for training classifiers for the source and target tasks. In this paper, we propose a feature-based DA method.

A common method to approach feature adaptation is to seek a low-dimensional latent subspace [31, 30] via dimension reduction. The state of the art features two main lines of approaches, namely data geometric structure alignment-based or data distribution centered. Data geometric structure alignment-based approaches, e.g., LTSL [35], LRSR [42], seek a subspace where source and target data can be well aligned and interlaced in preserving the inherent hidden geometric data structure via low rank constraints and/or sparse representation. Data distribution centered methods aim to search a latent subspace where the discrepancy between the source and target data distributions is minimized, via various distances, e.g., Bregman divergence-based distance [36], Geodesic distance [13] or Maximum Mean Discrepancy (MMD) [14]. The most popular distance is MMD due to its simplicity and solid theoretical foundations.

A cornerstone theoretical result in DA [2, 17] is achieved by Ben-David et al., who estimate an error bound of a learned hypothesis $h$ on a target domain:

$$e_T(h) \le e_S(h) + d_{\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) + \min\big\{ \mathbb{E}_{\mathcal{D}_S}\big[|f_S(\mathbf{x}) - f_T(\mathbf{x})|\big],\; \mathbb{E}_{\mathcal{D}_T}\big[|f_S(\mathbf{x}) - f_T(\mathbf{x})|\big] \big\} \quad (1)$$

Eq.(1) provides insight on the way to improve DA algorithms as it states that the performance of a hypothesis on a target domain is determined by: 1) the classification error $e_S(h)$ on the source domain; 2) the data divergence $d_{\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T)$, which measures the $\mathcal{H}$-divergence [17] between the two distributions ($\mathcal{D}_S$, $\mathcal{D}_T$); 3) the difference in labeling functions $f_S$ and $f_T$ across the two domains. In light of this theoretical result, we can see that data distribution centered DA methods only seek to minimize the second term in reducing data distribution discrepancies, whereas data geometric structure alignment-based methods account for the underlying data geometric structure and expect, but without theoretical guarantee, the alignment of data distributions.

Fig. 1: Illustration of the proposed DGA-DA method. Fig.1 (a): source and target data, e.g., mouse, bike, smartphone images, with different distributions and inherent hidden data geometric structures, the source in red and the target in blue. Samples of different class labels are represented by different geometrical shapes, e.g., round, triangle and square; Fig.1 (b) illustrates JDA, which brings data distributions closer, whereas CDDA (Fig.1 (c)) further makes data discriminative using an inter-class repulsive force. Both make use of the nonparametric distance, i.e., the Maximum Mean Discrepancy (MMD). Fig.1 (d) accounts for the geometric structures of the underlying data manifolds and the initial label knowledge in the source domain for label inference; in the proposed DA methods, the MMD matrix and the label matrix are updated iteratively within the processes in Fig.1 (b-d); Fig.1 (e): the achieved latent joint subspace where both the marginal and class conditional data distributions are aligned between source and target, as well as their data geometric structures; furthermore, data instances of different classes are well separated from each other, thereby enabling discriminative DA.

In this paper, we argue that an effective DA method should: P1) search a shared feature subspace where source and target data are not only aligned in terms of distributions as most state of the art DA methods do, e.g., TCA[28], JDA[22], but also discriminative in that instances of different classes are well separated; P2) account for the geometric structure of the underlying data manifold when inferring data labels on the target domain.

As a result, we propose in this paper a novel Discriminative Geometry Aware DA (DGA-DA) method which provides a unified framework for a simultaneous optimization of the three terms in the upper error bound in Eq.(1). Specifically, the proposed DGA-DA also seeks a latent feature subspace to align data distributions as most state of the art DA methods do, but also introduces a repulsive force term in the proposed model so as to increase inter-class distances and thereby facilitate discriminative learning and minimize the classification error of the learned hypothesis on source data. Furthermore, the proposed DGA-DA also introduces in its model two additional constraints, namely Label Smoothness Consistency and Geometric Structure Consistency, to account for the geometric structure of the underlying data manifold when inferring data labels in the target domain, thereby minimizing the third term of the error bound of the underlying learned hypothesis on the target domain. Fig.1 illustrates the proposed DA method.

To gain insight into the proposed method and highlight the contribution of P1) and P2) in comparison with a baseline DA method, i.e., JDA [22], which only cares about data distribution alignment, we further derive two partial DA methods from our DA model, namely Close yet Discriminative DA (CDDA) which implements P1), and Geometry Aware DA (GA-DA) based on P2), in addition to our Discriminative and Geometry Aware DA (DGA-DA) which integrates jointly P1) and P2). Comprehensive experiments carried out on standard DA benchmarks, i.e., 36 cross-domain image classification tasks through 6 datasets, verify the effectiveness of the proposed method, which consistently outperforms the state-of-the-art DA methods. In-depth analysis using both synthetic data and the two additional partial models further provides insight into the proposed DA model and highlights its interesting properties.

To sum up, the contributions of this paper are fourfold:

  • We propose a novel repulsive force term in the DA model to increase the discriminative power of the shared latent subspace, aside from narrowing discrepancies of both the marginal and conditional distributions between the source and target domains.

  • We introduce data geometry awareness, through Label Smoothness and Geometric Structure Consistencies, for label inference in the proposed DA model and thereby account for the geometric structures of the underlying data manifold.

  • We derive from our DA model three novel DA methods, namely CDDA, GA-DA and DGA-DA, which successively implement data discriminativeness, geometry awareness and both, and quantify the contribution of each term beyond a baseline DA method, i.e., JDA, which only cares about the alignment of data distributions.

  • We perform extensive experiments on 36 image classification DA tasks through 6 popular DA benchmarks and verify the effectiveness of the proposed method which consistently outperforms twenty-two state-of-the-art DA algorithms with a significant margin. Moreover, we also carry out in-depth analysis of the proposed DA methods, in particular w.r.t. their hyper-parameters and convergence speed. In addition, using both synthetic and real data, we also provide insights into the proposed DA model in visualizing the effect of data discriminativeness and geometry awareness.

The paper is organized as follows. Section 2 discusses the related work. Section 3 presents the method. Section 4 benchmarks the proposed DA method and provides in-depth analysis. Section 5 draws conclusion.

II Related Work

Unsupervised Domain Adaptation assumes no labeled data are provided in the target domain. Thus, in order to achieve satisfactory classification performance on the target domain, one needs to learn a classifier with labeled samples provided only from the source domain as well as unlabeled samples from the target domain. In its early days, this problem was also known as covariate shift and was addressed by sample re-weighting [37]. These methods aim to reduce the distribution difference by re-weighting the source samples according to their relevance to the target samples. While proving useful when the data divergence between the source and target domain is small, these methods fall short of aligning source and target data when this divergence becomes large.

As a result, recent research in DA has focused its attention on the feature-based adaptation approach [22, 44, 35, 23, 42, 25], which only assumes a shared latent feature space between the source and target domain. In the learned latent space, the divergence between the projected source and target data distributions is supposed to be minimized. Therefore, a classifier learned with the projected labeled source samples can be applied for classification on target samples. To find such a latent shared feature space, many existing methods, e.g., [28, 22, 44, 23, 1], embrace dimensionality reduction and propose to explicitly minimize some predefined distance measures to reduce the mismatch between source and target in terms of marginal distribution [36] [27] [28], or conditional distribution [33], or both [22]. For example, [36] proposed a Bregman Divergence-based regularization schema, which combines Bregman divergence with conventional dimensionality reduction algorithms. In [28], the authors use a similar dimensionality reduction framework while making use of the Maximum Mean Discrepancy (MMD) based on the Reproducing Kernel Hilbert Space (RKHS) [3] to estimate the distance between distributions. In [22], the authors further improve this work by minimizing not only the mismatch of the cross-domain marginal probability distributions, but also the mismatch of conditional probability distributions.

In line with the focus of manifold learning [45], an increasing number of DA methods, e.g., [24, 35, 42], emphasize the importance of aligning the underlying data manifold structures between the source and the target domain for effective DA. In these methods, low-rank and sparse constraints are introduced into DA to extract a low-dimensional feature subspace where target samples can be sparsely reconstructed from source samples [35], or interleaved by source samples [42], thereby aligning the geometric structures of the underlying data manifolds. A few recent DA methods, e.g., RSA-CDDA [24], JGSA [44], further propose unified frameworks to reduce the shift between domains both statistically and geometrically.

However, in light of the upper error bound as defined in Eq.(1), we can see that data distribution centered DA methods only seek to minimize the second term in reducing data distribution discrepancies, whereas data geometric structure alignment-based methods account for the underlying data geometric structure and expect but without theoretical guarantee the alignment of data distributions. In contrast, the proposed DGA-DA method optimizes altogether the three error terms of the upper error bound in Eq.(1).

The proposed DGA-DA builds on JDA [22] in seeking a latent feature subspace while minimizing the mismatch of both the marginal and conditional probability distributions across domains, thereby decreasing the data divergence term in Eq.(1). But DGA-DA goes beyond and differs from JDA as we introduce in the proposed DA model a repulsive force term so as to increase inter-class distances for discriminative DA, thereby optimizing the first term of the upper error bound in Eq.(1), i.e., the error rate of the learned hypothesis on the source domain. Furthermore, the proposed DGA-DA also accounts in its model for the geometric structures of the underlying data manifolds, through label smoothness consistency (LSC) and geometric structure consistency (GSC), which require the inferred labels on the source and target data to be smooth and similar for nearby data. These two constraints thus further optimize the third term of the upper error bound in Eq.(1). DGA-DA also differs markedly from a recent DA method, i.e., SCA [11], which also tries to introduce data discriminativeness, but through between- and within-class scatter defined only on the source domain. Moreover, besides data geometry awareness, which it does not consider, SCA does not explicitly seek data distribution alignment as we do in heritage of JDA, nor does it have the repulsive force term we introduce in our model to push away inter-class data based on both the source and target domains. Using both synthetic and real data, sect.IV-F provides insights into and visualizes the differences of the proposed model with a number of state of the art DA methods, e.g., SCA, and highlights its interesting properties, in particular data distribution alignment, data discriminativeness and geometry awareness.

III Discriminative Geometry Aware Domain Adaptation

We first introduce the notations and formalize the problem in sect.III-A, then present in sect.III-B the proposed model for Discriminative and Geometry Aware Domain Adaptation (DGA-DA), and solve the model in sect.III-C. Sect.III-D further analyzes the kernelization of the proposed DA model for nonlinear DA problems.

III-A Notations and Problem Statement

Matrices are written as boldface uppercase letters. Vectors are written as boldface lowercase letters. For a matrix $\mathbf{X}$, its $i$-th row is denoted as $\mathbf{x}^i$, and its $j$-th column is denoted by $\mathbf{x}_j$. We define the Frobenius norm as $\|\mathbf{X}\|_F = \big(\sum_{i,j} |X_{ij}|^2\big)^{1/2}$.

A domain $\mathcal{D}$ is defined as an m-dimensional feature space $\mathcal{X}$ and a marginal probability distribution $P(\mathbf{x})$, i.e., $\mathcal{D} = \{\mathcal{X}, P(\mathbf{x})\}$ with $\mathbf{x} \in \mathcal{X}$. Given a specific domain $\mathcal{D}$, a task $\mathcal{T}$ is composed of a C-cardinality label set $\mathcal{Y}$ and a classifier $f(\mathbf{x})$, i.e., $\mathcal{T} = \{\mathcal{Y}, f(\mathbf{x})\}$, where $f(\mathbf{x})$ can be interpreted as the class conditional probability distribution for each input sample $\mathbf{x}$.

In unsupervised domain adaptation, we are given a source domain $\mathcal{D}_S$ with $n_s$ labeled samples $\mathbf{X}_S$, which are associated with their class labels $\mathbf{Y}_S$, and an unlabeled target domain $\mathcal{D}_T$ with $n_t$ unlabeled samples $\mathbf{X}_T$, whose labels are unknown. Here, each source domain label is a binary vector in which the $c$-th entry equals 1 if the corresponding sample belongs to the $c$-th class. We define the data matrix $\mathbf{X} = [\mathbf{X}_S, \mathbf{X}_T]$ in packing both the source and target data. The source domain and target domain are assumed to share the same feature and label spaces but to differ in their distributions, i.e., $\mathcal{X}_S = \mathcal{X}_T$, $\mathcal{Y}_S = \mathcal{Y}_T$, $P(\mathbf{X}_S) \neq P(\mathbf{X}_T)$, $Q(\mathbf{Y}_S|\mathbf{X}_S) \neq Q(\mathbf{Y}_T|\mathbf{X}_T)$.

We also define the notion of sub-domain, denoted as $\mathcal{D}_S^{(c)}$, representing the set of samples in $\mathcal{D}_S$ with the label $c$. Similarly, a sub-domain $\mathcal{D}_T^{(c)}$ can be defined for the target domain as the set of samples in $\mathcal{D}_T$ with the label $c$. However, as samples in the target domain are unlabeled, the definition of sub-domains in the target domain requires a base classifier, e.g., Nearest Neighbor (NN), to attribute pseudo labels to the samples in $\mathcal{D}_T$.
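For illustration, such pseudo labels could be produced with a 1-NN base classifier as in the minimal sketch below; the function name and the use of scikit-learn are our own assumptions, not details taken from the paper.

```python
# Minimal sketch (assumed helper): attribute pseudo labels to unlabeled target
# samples with a 1-Nearest-Neighbor base classifier trained on the source data.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def pseudo_label_target(Zs, ys, Zt):
    """Zs: (n_s, k) projected source features, ys: (n_s,) source labels,
    Zt: (n_t, k) projected target features. Returns (n_t,) pseudo labels."""
    clf = KNeighborsClassifier(n_neighbors=1)
    clf.fit(Zs, ys)          # 1-NN "trained" on the labeled source samples
    return clf.predict(Zt)   # nearest-source-neighbor label for each target sample
```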

The maximum mean discrepancy (MMD) is an effective non-parametric distance measure that compares the distributions of two sets of data by mapping the data to a Reproducing Kernel Hilbert Space [3] (RKHS). Given two distributions $P$ and $Q$, the MMD between $P$ and $Q$ is defined as:

$$MMD^2(P, Q) = \Big\| \frac{1}{n_1}\sum_{i=1}^{n_1} \phi(\mathbf{x}_i) - \frac{1}{n_2}\sum_{j=1}^{n_2} \phi(\mathbf{y}_j) \Big\|^2_{\mathcal{H}} \quad (2)$$

where $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_{n_1}\}$ and $Y = \{\mathbf{y}_1, \ldots, \mathbf{y}_{n_2}\}$ are two random variable sets drawn from the distributions $P$ and $Q$, respectively, and $\mathcal{H}$ is a universal RKHS with the reproducing kernel mapping $\phi: \mathcal{X} \rightarrow \mathcal{H}$, $f(\mathbf{x}) = \langle f, \phi(\mathbf{x})\rangle$.
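As an illustration, a biased empirical estimate of the squared MMD can be computed directly from a kernel, e.g., an RBF kernel; the sketch below is a generic estimator and its bandwidth choice is an assumption, not a detail taken from the paper.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # k(x, y) = exp(-gamma * ||x - y||^2), computed for all pairs of rows of X and Y
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * d2)

def mmd2(Xs, Xt, gamma=1.0):
    """Biased empirical estimate of the squared MMD between two samples (rows)."""
    Kss = rbf_kernel(Xs, Xs, gamma)
    Ktt = rbf_kernel(Xt, Xt, gamma)
    Kst = rbf_kernel(Xs, Xt, gamma)
    return Kss.mean() + Ktt.mean() - 2.0 * Kst.mean()
```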

The aim of the Discriminative and Geometry Aware Domain Adaptation (DGA-DA) is to learn a latent feature subspace with the following properties: P1) the distances of both marginal and conditional probabilities between the source and target domains are reduced; P2) The distances between each sub-domain to the others are increased so as to increase inter-class distances and thereby enable discriminative DA; and P3) label inference accounts for the underlying data geometric structure.

III-B The Model

The proposed DA model (sect.III-B5) builds on TCA (sect.III-B1) and JDA (sect.III-B2) to which discriminative DA (CDDA) is introduced (sect.III-B3) and the data geometry awareness (GA-DA) is accounted for in label inference and the search of the shared latent feature subspace (sect.III-B4).

III-B1 Search of a Latent Feature Space with Dimensionality Reduction (TCA)

The search of a latent feature subspace with dimensionality reduction has been demonstrated useful for DA in several previous works, e.g., [28, 22, 24, 35, 44]. In projecting original raw data into a lower dimensional space, the principal data structure is preserved while its complexity decreases. In the proposed method, we also apply Principal Component Analysis (PCA) to capture the major data structure. Mathematically, given an input data matrix $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_n] \in \mathbb{R}^{m \times n}$, the centering matrix is defined as $\mathbf{H} = \mathbf{I} - \frac{1}{n}\mathbf{1}$, where $\mathbf{1}$ is the $n \times n$ matrix of ones. The optimization of PCA is to find a projection transformation $\mathbf{A}$ which maximizes the embedded data variance:

$$\max_{\mathbf{A}^T\mathbf{A} = \mathbf{I}} \; tr(\mathbf{A}^T \mathbf{X}\mathbf{H}\mathbf{X}^T \mathbf{A}) \quad (3)$$

where $tr(\cdot)$ denotes the trace of a matrix, $\mathbf{X}\mathbf{H}\mathbf{X}^T$ is the data covariance matrix, and $\mathbf{A} \in \mathbb{R}^{m \times k}$ with $m$ the feature dimension and $k$ the dimension of the projected subspace. The optimal solution is calculated by solving an eigendecomposition problem, $\mathbf{X}\mathbf{H}\mathbf{X}^T\mathbf{A} = \mathbf{A}\Phi$, where $\Phi$ holds the $k$ largest eigenvalues. Finally, the original data $\mathbf{X}$ is projected into the optimal $k$-dimensional subspace using $\mathbf{Z} = \mathbf{A}^T\mathbf{X}$.
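The following sketch illustrates this PCA step with the column-sample convention used above; it is a plain eigendecomposition of the centered covariance and not the authors' implementation.

```python
import numpy as np

def pca_projection(X, k):
    """X: (m, n) data matrix with samples as columns. Returns the projection
    matrix A (m, k) and the embedded data Z = A^T X (k, n)."""
    m, n = X.shape
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    C = X @ H @ X.T                          # (scaled) data covariance
    eigvals, eigvecs = np.linalg.eigh(C)     # eigenvalues in ascending order
    A = eigvecs[:, ::-1][:, :k]              # k leading eigenvectors
    return A, A.T @ X
```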

III-B2 Joint Marginal and Conditional Distribution Domain Adaptation (JDA)

However, the feature subspace calculated via PCA does not explicitly align the data distributions between the source and target domain. Following [22, 21], we also empirically measure the distance of both the marginal and conditional distributions across domains via the nonparametric MMD in RKHS [3], once the original data are projected into the low-dimensional feature space. Formally, the empirical distance between the two domains is defined as:

$$Dist^{marginal}(\mathcal{D}_S, \mathcal{D}_T) = \Big\| \frac{1}{n_s}\sum_{i=1}^{n_s} \mathbf{A}^T\mathbf{x}_i - \frac{1}{n_t}\sum_{j=n_s+1}^{n_s+n_t} \mathbf{A}^T\mathbf{x}_j \Big\|^2 = tr(\mathbf{A}^T\mathbf{X}\mathbf{M}_0\mathbf{X}^T\mathbf{A}) \quad (4)$$

where $\mathbf{M}_0$ represents the marginal distribution discrepancy between $\mathcal{D}_S$ and $\mathcal{D}_T$ and its calculation is obtained by:

$$(\mathbf{M}_0)_{ij} = \begin{cases} \frac{1}{n_s n_s}, & \mathbf{x}_i, \mathbf{x}_j \in \mathcal{D}_S \\ \frac{1}{n_t n_t}, & \mathbf{x}_i, \mathbf{x}_j \in \mathcal{D}_T \\ \frac{-1}{n_s n_t}, & \text{otherwise} \end{cases} \quad (5)$$

The difference between the marginal distributions $P(\mathbf{X}_S)$ and $P(\mathbf{X}_T)$ is reduced in minimizing Eq.(4).

Similarly, the distance of the conditional probability distributions is defined as the sum of the empirical distances over the class labels between the sub-domains of a same label in the source and target domain:

$$Dist^{conditional}(\mathcal{D}_S, \mathcal{D}_T) = \sum_{c=1}^{C} tr(\mathbf{A}^T\mathbf{X}\mathbf{M}_c\mathbf{X}^T\mathbf{A}) \quad (6)$$

where $C$ is the number of classes, $\mathcal{D}_S^{(c)}$ represents the $c$-th sub-domain in the source domain and $n_s^{(c)}$ is the number of samples in this source sub-domain. $\mathcal{D}_T^{(c)}$ and $n_t^{(c)}$ are defined similarly for the target domain. Finally, $\mathbf{M}_c$ represents the conditional distribution discrepancy between the sub-domains in $\mathcal{D}_S$ and $\mathcal{D}_T$ and it is defined as:

$$(\mathbf{M}_c)_{ij} = \begin{cases} \frac{1}{n_s^{(c)} n_s^{(c)}}, & \mathbf{x}_i, \mathbf{x}_j \in \mathcal{D}_S^{(c)} \\ \frac{1}{n_t^{(c)} n_t^{(c)}}, & \mathbf{x}_i, \mathbf{x}_j \in \mathcal{D}_T^{(c)} \\ \frac{-1}{n_s^{(c)} n_t^{(c)}}, & \mathbf{x}_i \in \mathcal{D}_S^{(c)}, \mathbf{x}_j \in \mathcal{D}_T^{(c)} \text{ or } \mathbf{x}_j \in \mathcal{D}_S^{(c)}, \mathbf{x}_i \in \mathcal{D}_T^{(c)} \\ 0, & \text{otherwise} \end{cases} \quad (7)$$

In minimizing Eq.(6), the mismatch of the conditional distributions between $\mathcal{D}_S^{(c)}$ and $\mathcal{D}_T^{(c)}$ is reduced.
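As an illustration, the MMD matrices of Eq.(5) and Eq.(7) can be assembled as follows; this follows the usual JDA-style construction and is a sketch rather than the authors' code.

```python
import numpy as np

def marginal_mmd_matrix(ns, nt):
    """M0 of Eq.(5): 1/ns^2 on source-source entries, 1/nt^2 on target-target
    entries, and -1/(ns*nt) on cross entries."""
    e = np.concatenate([np.ones(ns) / ns, -np.ones(nt) / nt])
    return np.outer(e, e)

def conditional_mmd_matrix(ys, yt_pseudo, c):
    """Mc of Eq.(7) for class c, built from source labels and target pseudo labels."""
    ns, nt = len(ys), len(yt_pseudo)
    e = np.zeros(ns + nt)
    src_c = np.where(np.asarray(ys) == c)[0]
    tgt_c = ns + np.where(np.asarray(yt_pseudo) == c)[0]
    if len(src_c): e[src_c] = 1.0 / len(src_c)
    if len(tgt_c): e[tgt_c] = -1.0 / len(tgt_c)
    return np.outer(e, e)
```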

III-B3 Close yet Discriminative Domain Adaptation (CDDA)

However, the previous joint alignment of the marginal and conditional distributions across domains does not explicitly render data discriminative in the searched feature subspace. As a result, we introduce discriminative domain adaptation via a repulsive force term, so as to increase the distances between sub-domains with different labels and improve the discriminative power of the latent shared features, thereby making possible a better predictive model for both the source and target data.

Specifically, the repulsive force term is defined as the sum of two directional terms, indexing the distances computed from the source to the target sub-domains and from the target to the source sub-domains, respectively. The source-to-target term represents the sum of the distances between each source sub-domain and all the target sub-domains except the one with the same label. The sum of these distances is explicitly defined as:

(8)

where the corresponding MMD matrix is defined as

(9)

Symmetrically, the target-to-source term represents the sum of the distances from each target sub-domain to all the source sub-domains except the one with the same label. Similarly, the sum of these distances is explicitly defined as:

(10)

where the corresponding MMD matrix is defined as

(11)

Finally, we obtain

(12)

We define the resulting matrix as the repulsive force matrix. While the minimization of Eq.(4) and Eq.(6) brings closer both the marginal and conditional distributions between source and target, the maximization of Eq.(12) increases the distances between source and target sub-domains of different labels, thereby improving the discriminative power of the searched latent feature subspace.
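A possible instantiation of the repulsive force matrix is sketched below, under the assumption that it accumulates, for every pair of distinct labels, the MMD matrix between the source sub-domain of one label and the target sub-domain of the other; the exact weighting of Eq.(8)-(12) may differ.

```python
import numpy as np

def repulsive_mmd_matrix(ys, yt_pseudo, classes):
    """Sketch of the repulsive force matrix: for every ordered pair of labels
    (c, r) with c != r, add the MMD matrix between the source sub-domain of
    label c and the target sub-domain of label r; the double loop covers both
    the source-to-target and target-to-source directions of Eq.(8) and Eq.(10)."""
    ns, nt = len(ys), len(yt_pseudo)
    M_rep = np.zeros((ns + nt, ns + nt))
    for c in classes:
        for r in classes:
            if r == c:
                continue
            src_c = np.where(np.asarray(ys) == c)[0]
            tgt_r = ns + np.where(np.asarray(yt_pseudo) == r)[0]
            if len(src_c) == 0 or len(tgt_r) == 0:
                continue
            e = np.zeros(ns + nt)
            e[src_c] = 1.0 / len(src_c)
            e[tgt_r] = -1.0 / len(tgt_r)
            M_rep += np.outer(e, e)
    return M_rep
```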

III-B4 Geometry Aware Domain Adaptation (GA-DA)

In a number of state of the art DA methods, e.g., [27, 28, 22], the simple Nearest Neighbor (NN) classifier is applied for label inference. In JDA and LRSR [42], NN-based label deduction is applied twice at each iteration. NN is first applied to the target domain in order to generate the pseudo labels of the target data and enable the computation of the conditional probability distance as defined in sect. III-B2. Once the optimized latent subspace is identified, NN is then applied once again at the end of the iteration for the label prediction of the target domain. However, given that the neighborhood is usually based on the $\ell_1$ or $\ell_2$ distance, NN may fall short of measuring the similarity of source and target domain data, which may be embedded in a manifold with complex geometric structures.

To account for the underlying data manifold structure in data similarity measurement, we further introduce two consistency constraints, namely label smoothness consistency and geometric structure consistency for both the pseudo and final label inference.

Label Smoothness Consistency (LSC): LSC is a constraint designed to prevent too many changes from the initial label assignment:

(13)

where $Y_{ic}$ is the calculated probability of sample $\mathbf{x}_i$ belonging to the $c$-th class, so that each sample $\mathbf{x}_i$ has a predicted label $y_i = \arg\max_{c} Y_{ic}$, and $\mathbf{Y}^{(0)}$ is the initial prediction. As for the unlabeled target data, traditional ranking methods [18, 43] assign them equal initial label values. However, this definition lacks discriminative properties due to the equal probability assignments. In this work, we define the initial $\mathbf{Y}^{(0)}$ as:

(14)

where the target part of $\mathbf{Y}^{(0)}$ is given by pseudo labels, generated via a base classifier, e.g., NN.

Geometric Structure Consistency (GSC): GSC is designed to ensure that inferred data labels comply with the geometric structures of the underlying data manifolds. We propose to characterize the alignment of label inference with the underlying data geometric structure through the Laplacian matrix $\mathbf{L}$:

$$GSC = tr(\mathbf{Y}^T \mathbf{L} \mathbf{Y}), \quad \mathbf{L} = \mathbf{D} - \mathbf{W} \quad (15)$$

where $\mathbf{W}$ is an affinity matrix [26], with $W_{ij}$ giving the affinity between two data samples $\mathbf{x}_i$ and $\mathbf{x}_j$ and defined as $W_{ij} = \exp(-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / 2\sigma^2)$ if $i \neq j$ and $W_{ii} = 0$ otherwise, and $\mathbf{D}$ is the degree matrix with $D_{ii} = \sum_j W_{ij}$. When Eq.(15) is minimized, the geometric structure consistency ensures that the label space does not change too much between nearby data.
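For illustration, the affinity, degree and Laplacian matrices can be built as below; the Gaussian bandwidth and the use of the un-normalized Laplacian L = D - W are assumptions (a normalized Laplacian is an equally common choice).

```python
import numpy as np

def graph_laplacian(Z, sigma=1.0):
    """Z: (k, n) embedded data with samples as columns. Returns the affinity
    matrix W (Gaussian kernel, zero diagonal, as in [26]), the degree matrix D
    and the un-normalized Laplacian L = D - W."""
    sq = np.sum(Z**2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z.T @ Z   # pairwise squared distances
    W = np.exp(-d2 / (2.0 * sigma**2))
    np.fill_diagonal(W, 0.0)                          # W_ii = 0
    D = np.diag(W.sum(axis=1))                        # degree matrix
    return W, D, D - W
```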

III-B5 The Final Model (DGA-DA)

Our final DA model integrates: 1) the alignment of both marginal and conditional distributions across domains as defined by Eq.(4) and Eq.(6), 2) the repulsive force term as in Eq.(12), and 3) data geometry aware label inference through both the label smoothness (Eq.(13)) and geometric structure (Eq.(15)) consistencies. Therefore, our final model is defined as:

(16)

It can be re-written mathematically as:

(17)

where the constraint removes an arbitrary scaling factor in the embedding and prevents the above optimization from collapsing onto a subspace of dimension less than required. The regularization parameter guarantees that the optimization problem is well-defined, and the trade-off parameter balances LSC and GSC.

III-C Solving the Model

Direct solution to Eq.(17) is nontrivial. We divide it into two sub-problems.

Sub-problem (a):

(18)

Sub-problem (b):

(19)

These two sub-problems are then iteratively optimized.

Sub-problem (a) amounts to solving a generalized eigendecomposition problem. The Augmented Lagrangian method [10, 22] can be used to solve it. In setting its partial derivative w.r.t. $\mathbf{A}$ equal to zero, we obtain:

(20)

where $\Phi$ is the Lagrange multiplier. The optimal subspace is obtained by solving Eq.(20) for the eigenvectors associated with the $k$ smallest eigenvalues. We then obtain the projection matrix $\mathbf{A}$ and the underlying embedding $\mathbf{Z} = \mathbf{A}^T\mathbf{X}$.
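A sketch of this step is given below; the exact composition of the left-hand matrix (combined MMD matrix plus regularizer) against the right-hand centered covariance follows the usual JDA-style derivation and is therefore an assumption.

```python
import numpy as np
from scipy.linalg import eigh

def solve_subspace(X, M, lam, k):
    """Sketch of sub-problem (a): generalized eigendecomposition of
    (X M X^T + lam*I) a = phi (X H X^T) a, keeping the eigenvectors associated
    with the k smallest eigenvalues (JDA-style construction, assumed)."""
    m, n = X.shape
    H = np.eye(n) - np.ones((n, n)) / n
    left = X @ M @ X.T + lam * np.eye(m)
    right = X @ H @ X.T + 1e-8 * np.eye(m)   # small ridge for numerical stability
    vals, vecs = eigh(left, right)           # generalized eigenproblem, ascending order
    A = vecs[:, :k]                          # projection matrix
    return A, A.T @ X                        # adaptation matrix and embedding
```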

Sub-problem (b) is nontrivial. Inspired by the solutions proposed in [45, 18, 43], the minimum is approached where the derivative of the function is zero. An approximate solution can be provided by:

(21)

where $\mathbf{Y}^*$ is the predicted label probability matrix of the target domain over the different class labels, $\mathbf{W}$ is the affinity matrix and $\mathbf{D}$ is the degree (diagonal) matrix.

For the sake of simplicity, we introduce a shorthand notation so that Eq.(21) is reformulated as Eq.(22):

(22)
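As a stand-in for Eq.(21)-(22), the sketch below uses the standard closed form of graph-based label propagation from Zhou et al. [45]; the actual formula in Eq.(22) may weight the terms differently.

```python
import numpy as np

def propagate_labels(W, Y0, alpha=0.9):
    """Graph-based label inference in closed form (Zhou et al. [45]):
    F = (I - alpha * S)^{-1} Y0 with S = D^{-1/2} W D^{-1/2}.
    Y0 stacks the one-hot source labels and the target pseudo labels of Eq.(14)."""
    d = W.sum(axis=1)
    Dinv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    S = Dinv_sqrt @ W @ Dinv_sqrt
    n = W.shape[0]
    F = np.linalg.solve(np.eye(n) - alpha * S, Y0)
    return F.argmax(axis=1)   # hard label per sample; the argmax is unaffected
                              # by the usual (1 - alpha) scaling factor
```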

To sum up, at a given iteration, sub-problem (a) as in Eq.(18) searches a latent feature subspace Z bringing closer both the marginal and conditional data distributions between source and target while making use of the source and current target labels to push away inter-class data; sub-problem (b) as in Eq.(19) infers, through Eq.(22), novel labels for the target data in line with the source data labels while making use of the geometric structures of the underlying data manifolds in the current subspace Z. This iterative process eventually ends up in a latent feature subspace where: 1) the discrepancies of both the marginal and conditional data distributions between source and target are narrowed; 2) source and target data are rendered more discriminative thanks to the increase of inter-class distances; and 3) the geometric structures of the underlying data manifolds are aligned.

The complete learning algorithm is summarized in Algorithm 1 - DGA-DA.

Input: Data X, source domain labels Y_S, subspace dimension k, number of iterations T, regularization and trade-off parameters
1 Step 1: Initialize the iteration counter t = 0 and compute M_0 as in Eq.(5).
2 Step 2: Initialize pseudo target labels and projection space A:
3 (1) Solve the generalized eigendecomposition problem [10, 22] as in Eq.(20) (using only M_0, since no pseudo target labels are available yet) and obtain the adaptation matrix A, then embed the data via the transformation Z = A^T X;
4 (2) Initialize the pseudo target labels via a base classifier, e.g., 1-NN, based on the source domain labels Y_S. while not converged and t < T do
5       Step 3: Update projection space A
6       (i) Compute the conditional MMD matrices as in Eq.(7); (ii) compute the repulsive force matrices as in Eq.(9) and Eq.(11) via the current pseudo target labels; (iii) combine them into the overall MMD matrix;
7       (iv) Solve Eq.(20), then update A and Z;
8       Step 4: Label deduction
9       (i) construct the label matrix as in Eq.(14);
10       (ii) construct the affinity matrix [26] and the degree matrix;
11       (iii) obtain the label predictions in solving Eq.(21);
12       Step 5: update the pseudo target labels;
13       Step 6: t = t + 1; return to Step 3;
14      
Output: Adaptation matrix A, embedding Z, target domain labels Y_T
Algorithm 1 Discriminative Geometry Aware Domain Adaptation (DGA-DA)

III-D Kernelization Analysis

The proposed DGA-DA method can be extended to nonlinear problems in a Reproducing Kernel Hilbert Space via a kernel mapping $\psi: \mathbf{x} \rightarrow \psi(\mathbf{x})$, or $\psi(\mathbf{X}) = [\psi(\mathbf{x}_1), \ldots, \psi(\mathbf{x}_n)]$, and the kernel matrix $\mathbf{K} = \psi(\mathbf{X})^T\psi(\mathbf{X})$. We utilize the Representer theorem to formulate kernelized DGA-DA as

(23)

IV Experiments

In this section, we verify and analyze in depth the effectiveness of our proposed domain adaptation model, i.e., DGA-DA, on 36 cross-domain image classification tasks generated by permuting six datasets (see Fig.2). Sect.IV-A describes the benchmarks and the features. Sect.IV-B lists the baseline methods to which the proposed DGA-DA is compared. Sect.IV-C presents the experimental setup and introduces in particular two partial DA methods, namely CDDA and GA-DA, in addition to the proposed DGA-DA based on our full DA model. Sect.IV-D discusses the experimental results in comparison with the state of the art. Sect.IV-E analyzes the convergence and parameter sensitivity of the proposed method. Sect.IV-F further provides insight into the proposed DA model in visualizing the achieved feature subspaces through both synthetic and real data.

IV-A Benchmarks and Features

As illustrated in Fig.2, USPS [15]+MNIST [20], COIL20 [22], PIE [22] and Office+Caltech [22, 42, 35] are standard benchmarks for the purpose of evaluation and comparison with the state of the art in DA. In this paper, we follow the data preparation of most previous works [40, 42, 12, 11, 6, 24]. We construct 36 cross-domain image classification tasks.

Office+Caltech consists of 2533 images of ten categories (8 to 151 images per category per domain)[11]. These images come from four domains: (A) AMAZON, (D) DSLR, (W) WEBCAM, and (C) CALTECH. AMAZON images were acquired in a controlled environment with studio lighting. DSLR consists of high resolution images captured by a digital SLR camera in a home environment under natural lighting. WEBCAM images were acquired in a similar environment to DSLR, but with a low-resolution webcam. CALTECH images were collected from Google Images.

We use two types of image features extracted from these datasets, i.e., SURF and DeCAF6, which are publicly available. The SURF [13] features are shallow features, extracted and quantized into an 800-bin histogram using a codebook computed with K-means on a subset of images from Amazon. The resulting histograms are further standardized by z-score. The Deep Convolutional Activation Features (DeCAF6) [8] are deep features computed as in AELM [40], which makes use of the VLFeat MatConvNet library with different pretrained CNN models, in particular the Caffe implementation of AlexNet [19] trained on the ImageNet dataset. The outputs from the 6th layer are used as the deep features, leading to 4096-dimensional DeCAF6 features. In this experiment, we denote the datasets Amazon, Webcam, DSLR and Caltech-256 as A, W, D and C, respectively.

In denoting the direction from "source" to "target" by an arrow "→", 12 DA tasks can then be constructed, namely A→W, …, C→D. For example, "W→D" means the Webcam image dataset is considered as the labeled source domain whereas the DSLR image dataset is the unlabeled target domain.

USPS+MNIST shares ten common digit categories from two subsets, namely USPS and MNIST, but with very different data distributions (see Fig.2). We construct a first DA task, USPS vs MNIST, by randomly sampling 1,800 images in USPS to form the source data, and randomly sampling 2,000 images in MNIST to form the target data. Then, we switch the source/target pair to get another DA task, i.e., MNIST vs USPS. We uniformly rescale all images to size 16×16, and represent each one by a feature vector encoding the gray-scale pixel values, so that the source and target data share the same feature space. As a result, we have defined two cross-domain DA tasks, namely USPS→MNIST and MNIST→USPS.
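The preprocessing described above can be sketched as follows; the use of scikit-image for resizing is our own assumption.

```python
import numpy as np
from skimage.transform import resize

def digit_to_feature_vector(img):
    """Rescale a gray-scale digit image to 16x16 and flatten it into a
    256-dimensional feature vector of pixel values."""
    img16 = resize(img.astype(float), (16, 16), anti_aliasing=True)
    return img16.ravel()
```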

COIL20 contains 20 objects with 1440 images (Fig.2). The images of each object were taken by varying its pose in steps of about 5 degrees, resulting in 72 poses per object. Each image has a resolution of 32×32 pixels and 256 gray levels per pixel. In this experiment, we partition the dataset into two subsets, namely COIL 1 and COIL 2 [42]. COIL 1 contains all images taken within the directions of quadrants 1 and 3, resulting in 720 images. COIL 2 contains all images taken within the directions of quadrants 2 and 4, and thus also contains 720 images. In this way, we construct two subsets with relatively different distributions. In this experiment, the COIL20 dataset with 20 classes is thus split into two DA tasks, i.e., COIL1→COIL2 and COIL2→COIL1.

PIE face database consists of 68 subjects, each under 21 various illumination conditions [6, 22]. We adopt five pose subsets: C05, C07, C09, C27, C29, which provide a rich basis for domain adaptation: we can choose one pose as the source and any remaining one as the target, and therefore obtain 20 different source/target combinations. We crop all images to 32 × 32 and only adopt the pixel values as the input. Finally, the five selected pose subsets, denoted as PIE1, PIE2, etc., result in 20 DA tasks, i.e., PIE1 vs PIE2, …, PIE5 vs PIE4.

Fig. 2: Sample images from six datasets used in our experiments. Each dataset represents a different domain. The OFFICE dataset contains three sub-datasets, namely DSLR, Amazon and Webcam.

IV-B Baseline Methods

The proposed DGA-DA method is compared with twenty-two methods of the literature, including deep learning-based approaches for unsupervised domain adaptation. They are: (1) 1-Nearest Neighbor Classifier (NN); (2) Principal Component Analysis (PCA) + NN; (3) Geodesic Flow Kernel (GFK) [13] + NN; (4) Transfer Component Analysis (TCA) [28] + NN; (5) Transfer Subspace Learning (TSL) [36] + NN; (6) Joint Domain Adaptation (JDA) [22] + NN; (7) Extreme Learning Machine (ELM) [40] + NN; (8) Augmented Extreme Learning Machine (AELM) [40] + NN; (9) Subspace Alignment (SA) [9]; (10) Marginalized Stacked Denoising Auto-encoder (mSDA) [4]; (11) Transfer Joint Matching (TJM) [23]; (12) Robust Transfer Metric Learning (RTML) [6]; (13) Scatter Component Analysis (SCA) [11]; (14) Cross-Domain Metric Learning (CDML) [41]; (15) Deep Domain Confusion (DDC) [39]; (16) Low-Rank Transfer Subspace Learning (LTSL) [35]; (17) Low-Rank and Sparse Representation (LRSR) [42]; (18) Kernel Principal Component Analysis (KPCA) [34]; (19) Joint Geometrical and Statistical Alignment (JGSA) [44]; (20) Deep Adaptation Networks (DAN) [21]; (21) Deep Convolutional Neural Network (AlexNet) [19]; and (22) Domain adaptation with low-rank reconstruction (RVDLR) [16].

In addition, for the purpose of fair comparison, we follow the experimental settings of JGSA, AlexNet and SCA, and apply DeCAF6 features for some of the methods to be evaluated. Whenever possible, the reported performance scores of the twenty-two methods of the literature are directly collected from previous research [22, 40, 6, 11, 42, 44]; they are assumed to be their best performance.

IV-C Experimental Setup

For the problem of domain adaptation, it is not possible to tune a set of optimal hyper-parameters, given the fact that the target domain has no labeled data. Following the setting of previous research [24, 22, 42], we also evaluate the proposed DGA-DA by empirically searching the parameter space for the optimal settings. Specifically, the proposed DGA-DA method has three hyper-parameters, i.e., the subspace dimension $k$ and two regularization/trade-off parameters. In our experiments, we fix $k$ and use one setting of the two remaining parameters for USPS, MNIST, COIL20 and PIE, and another for Office and Caltech-256.

In our experiments, the classification accuracy on the test dataset, as defined by Eq.(24), is used as the evaluation metric. It is widely used in the literature, e.g., [27, 21, 24, 22, 42], etc.

$$Accuracy = \frac{|\{\mathbf{x} : \mathbf{x} \in \mathcal{D}_T \wedge \hat{y}(\mathbf{x}) = y(\mathbf{x})\}|}{|\mathcal{D}_T|} \quad (24)$$

where $\mathcal{D}_T$ is the target domain treated as test data, $\hat{y}(\mathbf{x})$ is the predicted label and $y(\mathbf{x})$ is the ground truth label for a test sample $\mathbf{x}$.

To provide insight into the proposed DA method and highlight the individual contribution of each term in our final model, i.e., the discriminative term using the repulsive force as defined in Eq.(12) and the geometry aware term through the label smoothness consistency as in Eq.(13) and the geometric structure consistency as in Eq.(15), we evaluate the proposed DA method using three settings:

  • CDDA: In this setting, sub-problem (b) in sect. III-C as defined in Eq.(19) is simply replaced by the Nearest Neighbor (NN) predictor. This corresponds to the part of our final DA model defined in Eq.(18), which only makes use of the repulsive force term but not the geometry aware label inference defined by Eq.(13) and Eq.(15). This setting makes it possible to understand how important discriminative DA is w.r.t. state of the art baseline DA methods focused only on data distribution alignment, e.g., JDA.

  • GA-DA: In this setting, we extend popular data distribution alignment-based DA methods, e.g., JDA, with geometry aware label inference but ignore the repulsive force term of our final model reformulated in Eq.(17). This setting thus jointly considers cross-domain conditional and marginal distribution alignment (Eq.(5) and Eq.(6)) and geometry aware label inference (Eq.(13) and Eq.(15)). It enables quantification of the contribution of the geometry aware label inference term as defined by Eq.(13) and Eq.(15) in comparison with state of the art baseline DA methods focused only on data distribution alignment, e.g., JDA.

  • DGA-DA: This setting corresponds to our full final model as defined in Eq.(17). It thus contains CDDA as expressed by sub-problem (a) in sect. III-C, to which we further add the geometry aware label inference defined by sub-problem (b) in sect. III-C.

IV-D Experimental Results and Discussion

IV-D1 Experiments on the COIL20 Dataset

The COIL dataset (see Fig.2) features the challenge of pose variations between the source and target domain. Fig.3 depicts the experimental results on the COIL dataset. As can be seen in this figure, where top results are highlighted in red, the two partial models, i.e., CDDA and GA-DA, and the proposed final model, DGA-DA, all outperform the eight baseline DA algorithms in terms of overall average accuracy, with a significant margin.

It is worth noting that, when adding label inference based on the underlying data manifold structure, the proposed DGA-DA improves its sibling CDDA by a margin as high as roughly 7 points, thereby highlighting the importance of data geometry aware label inference as introduced in DGA-DA. As compared to JDA, the proposed CDDA, which adds a discriminative repulsive force term w.r.t. JDA, also shows its effectiveness and improves the latter by more than 3 points.

Fig. 3: Accuracy on the COIL Images Dataset.

IV-D2 Experiments on the Office+Caltech-256 Data Sets

Fig. 4: Accuracy on the Office+Caltech Images with DeCAF6 Features.

Fig.4 and Fig.5 synthesize the experimental results in comparison with the state of the art when deep features (i.e., DeCAF6 features) and classic shallow features (i.e., SURF features) are used, respectively.

  • As can be seen in Fig.5, both CDDA and DGA-DA outperform the state of the art methods in terms of average accuracy, thereby demonstrating the effectiveness of the proposed DA method. In particular, in comparison with JDA, which only cares about data distribution alignment between source and target and upon which the proposed DA method is built, CDDA improves JDA by 2 points thanks to the discriminative repulsive force term introduced in our model. When label inference accounts for the underlying data structure, our final model DGA-DA further improves CDDA by roughly 1 point.

    Fig. 5: Accuracy on the Office+Caltech Images with SURF-BoW Features.
  • Fig.4 compares the proposed DA method using deep features w.r.t. the state of the art, in particular end-to-end deep learning-based DA methods. As can be seen in Fig.4, the use of deep features has enabled impressive accuracy improvements over shallow features. Simple baseline methods, e.g., NN, PCA, see their accuracy soar by roughly 40 points, demonstrating the power of the deep learning paradigm. Our proposed DA method also takes advantage of this jump and sees its accuracy soar from 48.22% to 89.13% for CDDA and from 49.02% to 90.43% for DGA-DA. As with shallow features, CDDA improves JDA by 3 points and DGA-DA further ameliorates CDDA by 1 point when label inference accounts for the underlying data geometric structure. As a result, DGA-DA displays the best average accuracy and slightly outperforms DAN.

IV-D3 Experiments on the USPS+MNIST Data Set

The USPS+MNIST dataset features different writing styles between source and target. Fig.6 lists the experimental results in comparison with 14 state of the art DA methods. As can be seen in the figure, CDDA displays a 69.14% average accuracy and ranks as the third best performer. It shows its effectiveness once more as it improves its baseline JDA by more than 5 points on average. When accounting for the underlying data geometric structure, the proposed DGA-DA further improves its sibling CDDA by a margin of more than 7 points and displays state of the art performance with a 76.54% accuracy. It is worth noting that the second best DA performer on this dataset, i.e., JGSA, also suggests aligning data both statistically and geometrically, and thereby corroborates our data geometry aware DA approach.

Fig. 6: Accuracy on the USPS+MNIST Images Dataset.

IV-D4 Experiments on the CMU PIE Data Set

The CMU PIE dataset is a large face dataset featuring both illumination and pose variations. Fig.7 synthesizes the experimental results for DA using this dataset. As can be seen in the figure, as in the previous experiments, the proposed DGA-DA displays the best average accuracy over the 20 cross-domain adaptation experiments. In aligning both marginal and conditional data distributions, JDA performs quite well and displays a 60.24% average accuracy. In integrating the discriminative repulsive force term, CDDA improves JDA by roughly 3 points. DGA-DA further ameliorates CDDA by more than 1 point.

It is interesting to note that the second best performer on this dataset, namely LRSR, also tries to align geometrically source and target data through both low rank and sparse constraints so that source and target data are interleaved within a novel shared feature subspace.

Fig. 7: Accuracy on the PIE Images Dataset.

IV-E Convergence and Parameter Sensitivity

Fig. 8: Sensitivity analysis of the proposed methods: (a) accuracy w.r.t. the subspace dimension for CDDA; (b) accuracy w.r.t. the subspace dimension for GA-DA; (c) accuracy w.r.t. the subspace dimension for DGA-DA. Four datasets are used, i.e., COIL1, COIL2, USPS and MNIST.
Fig. 9: The classification accuracies of the proposed GA-DA and DGA-DA methods vs. the two regularization/trade-off parameters on four selected cross-domain datasets, i.e., DSLR (D), Webcam (W), COIL1 and COIL2, with the subspace dimension held fixed.
Fig. 10: Convergence analysis using 12 cross-domain image classification tasks on the Office+Caltech256 datasets with DeCAF6 features (accuracy w.r.t. iterations).
Fig. 11: Comparisons of baseline domain adaptation methods and the proposed CDDA, GA-DA and DGA-DA method on the synthetic data
Fig. 12: Accuracy and visualization results of the MNIST→USPS DA task. Fig.12(a), Fig.12(b) and Fig.12(c) are visualization results of the MNIST, USPS and MNIST+USPS datasets in their original data space, respectively. After domain adaptation, Fig.12(d), Fig.12(e), Fig.12(f) and Fig.12(g) visualize the MNIST+USPS datasets in the JDA, CDDA, GA-DA and DGA-DA subspaces, respectively. Fig.12(h), Fig.12(i), Fig.12(j) and Fig.12(k) show the visualization results of the target domain USPS in the JDA, CDDA, GA-DA and DGA-DA subspaces, respectively. The ten digit classes are represented by different colors.

While the proposed DGA-DA displays state of the art performance over 36 DA tasks through six datasets (USPS, MNIST, COIL20, PIE, Amazon, Caltech), an important question is how fast the proposed method converges (sect.IV-E2) as well as its sensitivity w.r.t. its hyper-parameters (sect.IV-E1).

IV-E1 Parameter Sensitivity

Three hyper-parameters, namely the subspace dimension $k$ and two regularization/trade-off parameters, are introduced in the proposed methods. $k$ is the dimension of the extracted feature subspace, which determines the structure of the low-dimensional embedding. In Fig.8, we plot the classification accuracies of the proposed DA methods w.r.t. different values of $k$ on the COIL and USPS+MNIST datasets. As shown in Fig.8, the accuracy varies with $k$, yet the proposed three DA variants, namely CDDA, GA-DA and DGA-DA, remain stable over a wide range of $k$. In our experiments, we fix $k$ to balance efficiency and accuracy.

The regularization parameter introduced in Eq.(17) and Eq.(18) aims to regularize the projection matrix A so as to avoid over-fitting the chosen shared feature subspace with respect to both the source and target data. The trade-off parameter defined in Eq.(22) balances LSC and GSC. We study the sensitivity of the proposed GA-DA and DGA-DA methods over a wide range of values for these two parameters. We plot in Fig.9 the results of both methods on the D→W and COIL1→COIL2 datasets, with the subspace dimension held fixed. As can be seen from Fig.9, the proposed GA-DA and DGA-DA display their stability, as the resulting classification accuracies remain roughly the same over a wide range of parameter values.

IV-E2 Convergence Analysis

In Fig.10, we further perform a convergence analysis of the proposed CDDA, GA-DA and DGA-DA methods using the DeCAF6 features on the Office+Caltech datasets. The question here is how fast a DA method achieves its best performance w.r.t. the number of iterations. Fig.10 reports the 12 cross-domain adaptation experiments (C→A, C→W, …, D→W) as a function of the number of iterations.

As shown in Fig.10, CDDA, GA-DA and DGA-DA converge within 35 iterations during optimization.

IV-F Analysis and Verification

To further gain insight into the proposed CDDA, GA-DA and DGA-DA w.r.t. their domain adaptation skills, we also evaluate the proposed methods on a synthetic dataset in comparison with several state of the art DA methods. Fig.11 visualizes the original data distributions with 4 classes and the resultant shared feature subspaces as computed by TCA, JDA, TJM, SCA, CDDA, GA-DA and DGA-DA, respectively. In this experiment, we focus our attention on the ability of the DA methods to: (a) narrow the discrepancies of data distributions between source and target; (b) increase data discriminativeness; and (c) align data geometric structures between source and target. As such, the original synthetic data depict slight distribution discrepancies between source and target for the first two classes, and a wide distribution mismatch for the third and fourth classes. The fourth class data further depict a moon-like geometric structure.

As can be seen in Fig.11, baseline methods, e.g., TCA, SCA, TJM, have difficulty aligning data distributions with wide discrepancies, e.g., the third class data. JDA narrows data distribution discrepancies but lacks class data discriminativeness. The proposed variant CDDA ameliorates JDA and makes class data well separated thanks to the introduced repulsive force term, but falls short of preserving the data geometric structure (see the fourth, moon-like class data). The variant GA-DA aligns data distributions and preserves the underlying data geometric structures thanks to label smoothness consistency (LSC) and geometric structure consistency (GSC), but lacks data discriminativeness. In contrast, thanks to the joint consideration of data discriminativeness and geometric structure awareness, the proposed DGA-DA not only aligns data distributions compactly but also separates class data very distinctively. Furthermore, it also preserves the underlying data geometric structures.

The above findings can be further verified using real data through the MNIST→USPS DA task, where the proposed DA methods achieve remarkable results (see Fig.6). Fig.12 visualizes class-explicit data distributions in their original space and in the resultant shared feature subspaces of JDA and the three variants of the proposed DA method, namely CDDA, GA-DA and DGA-DA, with the same experimental setting.

  • Data distributions and geometric structures. Fig.12(a,b,c) visualize the MNIST, USPS and MNIST+USPS datasets in their original data space, respectively. As shown in these figures, the MNIST and USPS datasets depict different data distributions and various data structures. In particular, the yellow dots represent digit 2. They show a long and narrow shape in MNIST (Fig.12(a)) while a circle-like shape in USPS (Fig.12(b)). They further display large data discrepancies across domains (Fig.12(c)), as do all the other classes.

  • Contribution of the repulsive force term. Visualization results in Fig.12(h,i,j,k) show that, in comparison with their respective baseline DA methods, i.e., JDA (Fig.12(h)) and GA-DA (Fig.12(j)), the two proposed DA variants, i.e., CDDA (Fig.12(i)) and DGA-DA (Fig.12(k)), which integrate in their model the repulsive force term introduced in Sect.III-B3, achieve data discriminativeness in compacting intra-class instances and separating inter-class data. As a result, as shown in Fig.6, DGA-DA outperforms GA-DA and CDDA outperforms JDA, thereby illustrating the importance of increasing data discriminativeness in DA.

  • Contribution of Geometric Structure Awareness. Visualization results in Fig.12(d,e) show that the JDA and CDDA subspaces fail to preserve the geometric structures of the underlying data manifold. For instance, the long and narrow shape of the orange dots in the source MNIST domain and the corresponding circular orange blob in the target USPS domain (Fig.12(c)) are not preserved anymore in the JDA (Fig.12(d)) and CDDA (Fig.12(e)) subspaces. In contrast, thanks to the geometry awareness constraints, i.e., label smoothness consistency (LSC) and geometric structure consistency (GSC), introduced in Sect.III-B4, the two variants of the proposed DA method, i.e., GA-DA (Fig.12(f)) and DGA-DA (Fig.12(g)), succeed in preserving the geometric structures of the underlying data, and thereby the inherent data similarities and the consistency of label inference. As a result, DGA-DA outperforms CDDA and GA-DA outperforms JDA, thus suggesting the importance of geometric structure awareness in DA.

V Conclusion and Future Work

In this paper, we have proposed a novel Discriminative and Geometry Aware unsupervised DA method based on feature adaptation. Comprehensive experiments on 36 cross-domain image classification tasks through six popular DA datasets highlight the interest of enhancing the data discriminative properties within the model and of propagating labels with respect to the geometric structure of the underlying data manifold, and verify the effectiveness of the proposed method compared with twenty-two baseline DA methods of the literature. Using both synthetic and real data and three variants of the proposed DA method, we have further provided in-depth analysis and insights into the proposed DGA-DA, quantifying and visualizing the contribution of data discriminativeness and data geometry awareness.

Our future work will concentrate on embedding the proposed method in deep networks and studying other vision tasks, e.g., object detection, within the setting of transfer learning.

References

  • [1] Mahsa Baktashmotlagh, Mehrtash Harandi, and Mathieu Salzmann. Distribution-matching embedding for visual domain adaptation. Journal of Machine Learning Research, 17(108):1–30, 2016.
  • [2] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine learning, 79(1):151–175, 2010.
  • [3] Karsten M Borgwardt, Arthur Gretton, Malte J Rasch, Hans-Peter Kriegel, Bernhard Schölkopf, and Alex J Smola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49–e57, 2006.
  • [4] Minmin Chen, Zhixiang Eddie Xu, Kilian Q. Weinberger, and Fei Sha. Marginalized denoising autoencoders for domain adaptation. CoRR, abs/1206.4683, 2012.
  • [5] Gabriela Csurka. Domain adaptation for visual applications: A comprehensive survey. CoRR, abs/1702.05374, 2017.
  • [6] Zhengming Ding and Yun Fu. Robust transfer metric learning for image classification. IEEE Trans. Image Processing, 26(2):660–670, 2017.
  • [7] Jeff Donahue, Judy Hoffman, Erik Rodner, Kate Saenko, and Trevor Darrell. Semi-supervised domain adaptation with instance constraints. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 668–675, 2013.
  • [8] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 647–655, 2014.
  • [9] Basura Fernando, Amaury Habrard, Marc Sebban, and Tinne Tuytelaars. Unsupervised visual domain adaptation using subspace alignment. In IEEE International Conference on Computer Vision, ICCV 2013, Sydney, Australia, December 1-8, 2013, pages 2960–2967, 2013.
  • [10] Michel Fortin and Roland Glowinski. Augmented Lagrangian methods: applications to the numerical solution of boundary-value problems, volume 15. Elsevier, 2000.
  • [11] Muhammad Ghifary, David Balduzzi, W. Bastiaan Kleijn, and Mengjie Zhang. Scatter component analysis: A unified framework for domain adaptation and domain generalization. IEEE Trans. Pattern Anal. Mach. Intell., 39(7):1414–1430, 2017.
  • [12] Boqing Gong, Kristen Grauman, and Fei Sha. Connecting the dots with landmarks: Discriminatively learning domain-invariant features for unsupervised domain adaptation. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, pages 222–230, 2013.
  • [13] Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman. Geodesic flow kernel for unsupervised domain adaptation. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2066–2073. IEEE, 2012.
  • [14] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
  • [15] Jonathan J. Hull. A database for handwritten text recognition research. IEEE Trans. Pattern Anal. Mach. Intell., 16(5):550–554, 1994.
  • [16] I-Hong Jhuo, Dong Liu, DT Lee, and Shih-Fu Chang. Robust visual domain adaptation with low-rank reconstruction. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2168–2175. IEEE, 2012.
  • [17] Daniel Kifer, Shai Ben-David, and Johannes Gehrke. Detecting change in data streams. In Proceedings of the Thirtieth international conference on Very large data bases-Volume 30, pages 180–191. VLDB Endowment, 2004.
  • [18] T. H. Kim, K. M. Lee, and S. U. Lee. Learning full pairwise affinities for spectral segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(7):1690–1703, July 2013.
  • [19] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [20] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov 1998.
  • [21] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I Jordan. Learning transferable features with deep adaptation networks. In ICML, pages 97–105, 2015.
  • [22] Mingsheng Long, Jianmin Wang, Guiguang Ding, Jiaguang Sun, and Philip S Yu. Transfer feature learning with joint distribution adaptation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2200–2207, 2013.
  • [23] Mingsheng Long, Jianmin Wang, Guiguang Ding, Jiaguang Sun, and Philip S. Yu. Transfer joint matching for unsupervised domain adaptation. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pages 1410–1417, 2014.
  • [24] Lingkun Luo, Xiaofang Wang, Shiqiang Hu, and Liming Chen. Robust data geometric structure aligned close yet discriminative domain adaptation. CoRR, abs/1705.08620, 2017.
  • [25] Lingkun Luo, Xiaofang Wang, Shiqiang Hu, Chao Wang, Yuxing Tang, and Liming Chen. Close yet distinctive domain adaptation. CoRR, abs/1704.04235, 2017.
  • [26] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 849–856. MIT Press, 2002.
  • [27] Sinno Jialin Pan, James T Kwok, and Qiang Yang. Transfer learning via dimensionality reduction. In AAAI, volume 8, pages 677–682, 2008.
  • [28] Sinno Jialin Pan, Ivor W Tsang, James T Kwok, and Qiang Yang. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22(2):199–210, 2011.
  • [29] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2010.
  • [30] Pau Panareda Busto and Juergen Gall. Open set domain adaptation. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [31] V. M. Patel, R. Gopalan, R. Li, and R. Chellappa. Visual domain adaptation: A survey of recent advances. IEEE Signal Processing Magazine, 32(3):53–69, May 2015.
  • [32] Kuniaki Saito, Yoshitaka Ushiku, and Tatsuya Harada. Asymmetric tri-training for unsupervised domain adaptation. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 2988–2997, 2017.
  • [33] Sandeepkumar Satpal and Sunita Sarawagi. Domain adaptation of conditional probability models via feature subsetting. In PKDD, volume 4702, pages 224–235. Springer, 2007.
  • [34] Bernhard Schölkopf, Alexander J. Smola, and Klaus-Robert Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.
  • [35] Ming Shao, Dmitry Kit, and Yun Fu. Generalized transfer subspace learning through low-rank constraint. International Journal of Computer Vision, 109(1-2):74–93, 2014.
  • [36] S. Si, D. Tao, and B. Geng. Bregman divergence-based regularization for transfer subspace learning. IEEE Transactions on Knowledge and Data Engineering, 22(7):929–942, July 2010.
  • [37] Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul V Buenau, and Motoaki Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in neural information processing systems, pages 1433–1440, 2008.
  • [38] Yuxing Tang, Josiah Wang, Xiaofang Wang, Boyang Gao, Emmanuel Dellandrea, Robert Gaizauskas, and Liming Chen. Visual and semantic knowledge transfer for large scale semi-supervised object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
  • [39] Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance. CoRR, abs/1412.3474, 2014.
  • [40] Muhammad Uzair and Ajmal S. Mian. Blind domain adaptation with augmented extreme learning machine features. IEEE Trans. Cybernetics, 47(3):651–660, 2017.
  • [41] Hao Wang, Wei Wang, Chen Zhang, and Fanjiang Xu. Cross-domain metric learning based on information theory. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, Québec City, Québec, Canada, pages 2099–2105, 2014.
  • [42] Yong Xu, Xiaozhao Fang, Jian Wu, Xuelong Li, and David Zhang. Discriminative transfer subspace learning via low-rank and sparse representation. IEEE Trans. Image Processing, 25(2):850–863, 2016.
  • [43] C. Yang, L. Zhang, H. Lu, X. Ruan, and M. H. Yang. Saliency detection via graph-based manifold ranking. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 3166–3173, June 2013.
  • [44] Jing Zhang, Wanqing Li, and Philip Ogunbona. Joint geometrical and statistical alignment for visual domain adaptation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [45] Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bernhard Schölkopf. Learning with local and global consistency. In Advances in Neural Information Processing Systems 16, pages 321–328. MIT Press, 2004.