1 Introduction
Traditional machine learning tasks assume that both training and testing data are drawn from the same data distribution [25, 27]. However, in many real-life applications, due to factors as diverse as sensor differences, lighting changes, viewpoint variations, etc., data from a target domain may have a different distribution from the labeled data in a source domain where a predictor can be reliably learned. On the other hand, manually labeling enough target data for the purpose of training an effective predictor can be very expensive, tedious and thus prohibitive. Domain adaptation (DA) [25, 27] aims to leverage possibly abundant labeled data from a source domain to learn an effective predictor for data in a target domain despite the data distribution discrepancy between the source and target. While DA can be semi-supervised by assuming a certain amount of labeled data is available in the target domain, in this paper we are interested in unsupervised DA, where we assume that the target domain has no labels.
Recent DA methods can be categorized into instance-based adaptation [25, 7] and feature-based adaptation [21, 36]. The instance-based approach generally assumes that 1) the conditional distributions of the source and target domains are identical [37], and 2) a certain portion of the data in the source domain can be reused [25] for learning in the target domain through re-weighting. Feature-based adaptation relaxes such a strict assumption and only requires that there exists a mapping from the input data space to a latent shared feature representation space. This latent shared feature space captures the information necessary for training classifiers for the source and target tasks. In this paper, we propose a feature-based DA method.
A common method to approach feature adaptation is to seek a low-dimensional latent subspace [27, 26] via dimensionality reduction. The state of the art features two main lines of approaches, namely data geometric structure alignment-based and data distribution-centered methods. Data geometric structure alignment-based approaches, e.g., LTSL [29], LRSR [36], seek a subspace where source and target data can be well aligned and interlaced while preserving the inherent hidden geometric data structure via low-rank constraints and/or sparse representation. Data distribution-centered methods aim to search for a latent subspace where the discrepancy between the source and target data distributions is minimized, via various distances, e.g., the Bregman divergence-based distance [30], the Geodesic distance [13] or the Maximum Mean Discrepancy (MMD) [14]. The most popular distance is MMD due to its simplicity and solid theoretical foundations.
A cornerstone theoretical result in DA [3, 2, 17] was achieved by Ben-David et al., who estimated an error bound of a hypothesis $h$ learned on the source domain when applied to the target domain:

$$\epsilon_T(h) \le \epsilon_S(h) + d_{\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) + \min\big\{\mathbb{E}_{\mathcal{D}_S}\big[|f_S(\mathbf{x}) - f_T(\mathbf{x})|\big],\ \mathbb{E}_{\mathcal{D}_T}\big[|f_S(\mathbf{x}) - f_T(\mathbf{x})|\big]\big\} \qquad (1)$$
Eq.(1) provides insight into how to improve DA algorithms, as it states that the performance of a hypothesis $h$ on a target domain is determined by: 1) the classification error $\epsilon_S(h)$ on the source domain; 2) $d_{\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T)$, which measures the divergence [17] between the two distributions $(\mathcal{D}_S, \mathcal{D}_T)$; and 3) the difference in labeling functions across the two domains. In light of this theoretical result, we can see that data distribution-centered DA methods only seek to minimize the second term in reducing data distribution discrepancies, whereas data geometric structure alignment-based methods account for the underlying data geometric structure and expect, but without theoretical guarantee, the alignment of data distributions.
Different from state-of-the-art DA methods, we propose in this paper a novel Discriminative Label Consistent DA (DLC-DA) method which provides a unified framework for the simultaneous optimization of the three terms in the upper error bound in Eq.(1). Specifically, the proposed DLC-DA also seeks a latent feature subspace to align data distributions, as other state-of-the-art methods, e.g., TCA [24], JDA [21], do, but additionally introduces a repulsive force term into the proposed model so as to increase inter-class distances and thereby facilitate discriminative learning. More importantly, the proposed DLC-DA leverages the existing labels in the source domain and ensures label consistency between the source and target domains through an iterative, integrated linear label regression, thereby jointly minimizing the first and third terms of the error bound of the underlying learned hypothesis on the target domain.
Comprehensive experiments carried out on standard DA benchmarks, i.e., 12 cross-domain image classification tasks, verify the effectiveness of the proposed method, which consistently outperforms state-of-the-art methods. In-depth analysis using both synthetic data and two additional partial models further provides insight into the proposed DA model and highlights its interesting properties.
The paper is organized as follows. Section 2 discusses the related work. Section 3 presents the proposed method. Section 4 benchmarks the proposed DA method and provides in-depth analysis. Section 5 concludes the paper.
2 Related work
Unsupervised Domain Adaptation (DA) assumes that no labeled data are provided in the target domain. In earlier days this problem [25] was also known as covariate shift and could be solved by sample re-weighting. However, those methods fail when the divergence between the source and target domains becomes significant.
Recent DA methods follow a mainstream approach based on feature adaptation. The core idea of these methods is to search for a latent shared feature space where both source and target data are statistically aligned, so that a hypothesis learned using labeled source data can be an effective predictor for the unlabeled target data. Eq.(1) by Ben-David et al. [3, 2, 17] provides the theoretical foundation of this approach. By minimizing the divergence of data distributions between the source and target domains, feature adaptation-based DA methods decrease the second term of the upper error bound in Eq.(1) and thereby improve the performance of the learned predictor on the target domain.
The literature in feature adaptation has so far featured two main research lines: data distribution convergence (DDC)-based and data geometric structure alignment (DGSA)-based. In DDC-based DA methods, one aims to seek a latent shared feature subspace where the disparity of data distributions between source and target is minimized. For example, [30] proposed a Bregman divergence-based regularization schema, which combines Bregman divergence with conventional dimensionality reduction algorithms. In TCA [24], the authors use a similar dimensionality reduction framework while making use of MMD to minimize the marginal distribution shift. JDA [21] goes one step further and proposes to simultaneously minimize the discrepancies of the marginal and conditional distributions between source and target. Baktashmotlagh et al. [1] propose a novel dimension reduction DA method which learns two different distances to compare the source and target distributions: the Maximum Mean Discrepancy and the Hellinger distance. In DGSA-based DA methods, one seeks a shared feature subspace where target data can be sparsely reconstructed from source data [29] or source and target data are interleaved [36].
In light of the three terms in the upper error bound defined by the right-hand side of Eq.(1), an optimized hypothesis on the target domain should simultaneously 1) minimize the prediction errors on the source domain, 2) decrease the divergence of data distributions and 3) ensure label consistency between the source and target. However, state-of-the-art feature adaptation-based DA methods have so far focused only on data alignment, either statistically or geometrically. DDC-based DA methods only focus on bringing data distributions closer and may fall short of capturing the inherent underlying data geometric structure. DGSA-based DA methods align data geometric structures between source and target in the searched feature subspace, and expect, but without theoretical guarantee, that the discrepancy of data distributions between source and target is implicitly reduced in the resultant subspace. As such, an interesting recent move is SCA [10] and JGSA [37], which jointly leverage data statistical and geometric properties in the search of the latent shared feature subspace. As shown in Fig.7 using synthetic data, another major disadvantage of most state-of-the-art DA methods is that they do not consider the discriminative knowledge hidden in the conditional distributions and as such fall short of ensuring the discriminativeness of data in the resultant feature subspace.
In contrast to those previous DA methods, the proposed DLC-DA seeks a latent feature subspace which simultaneously optimizes the three terms of the upper error bound in Eq.(1). Specifically, the proposed DLC-DA iteratively searches a shared feature subspace where 1) the source and target data distributions are discriminatively matched, and 2) the source and target labels can be sparsely regressed, thereby minimizing the prediction errors on the source data and ensuring label consistency between the source and target domains.
3 The proposed method
3.1 Notations and Problem Statement
Matrices are written as boldface uppercase letters and vectors as boldface lowercase letters. For a matrix $\mathbf{M} = [m_{ij}]$, its $i$-th row is denoted as $\mathbf{m}^i$ and its $j$-th column as $\mathbf{m}_j$. We define the Frobenius norm and the $\ell_{2,1}$-norm as $\|\mathbf{M}\|_F = \sqrt{\sum_i \sum_j m_{ij}^2}$ and $\|\mathbf{M}\|_{2,1} = \sum_i \sqrt{\sum_j m_{ij}^2}$, respectively.

A domain $\mathcal{D}$ is defined as an $m$-dimensional feature space $\mathcal{X}$ and a marginal probability distribution $P(\mathbf{x})$, i.e., $\mathcal{D} = \{\mathcal{X}, P(\mathbf{x})\}$ with $\mathbf{x} \in \mathcal{X}$. Given a specific domain $\mathcal{D}$, a task $\mathcal{T}$ is composed of a $C$-cardinality label set $\mathcal{Y}$ and a classifier $f(\mathbf{x})$, i.e., $\mathcal{T} = \{\mathcal{Y}, f(\mathbf{x})\}$, where $f(\mathbf{x})$ can be interpreted as the class conditional probability distribution for each input sample $\mathbf{x}$.

In unsupervised domain adaptation, we are given a source domain $\mathcal{D}_S$ with $n_s$ labeled samples $\mathbf{X}_S$, which are associated with their class labels $\mathbf{Y}_S$, and an unlabeled target domain $\mathcal{D}_T$ with $n_t$ unlabeled samples $\mathbf{X}_T$, whose labels $\mathbf{Y}_T$ are unknown. Each label $\mathbf{y}_i \in \{0,1\}^{C}$ is a binary vector in which $y_i^{(c)} = 1$ if the sample $\mathbf{x}_i$ belongs to the $c$-th class. We define the data matrix $\mathbf{X} = [\mathbf{X}_S, \mathbf{X}_T]$ by packing both the source and target data. The source and target domains are assumed to be different, i.e., their feature and label spaces coincide while their marginal and conditional distributions differ: $P(\mathbf{X}_S) \ne P(\mathbf{X}_T)$ and $Q(\mathbf{Y}_S|\mathbf{X}_S) \ne Q(\mathbf{Y}_T|\mathbf{X}_T)$.
We also define the notion of sub-domain, denoted as $\mathcal{D}_S^{(c)}$, representing the set of samples in $\mathbf{X}_S$ with label $c$. Similarly, a sub-domain $\mathcal{D}_T^{(c)}$ can be defined for the target domain as the set of samples in $\mathbf{X}_T$ with label $c$. However, as samples in the target domain are unlabeled, the definition of sub-domains in the target domain requires a base classifier, e.g., Nearest Neighbor (NN), to attribute pseudo labels to the samples in $\mathbf{X}_T$.
The Maximum Mean Discrepancy (MMD) is an effective non-parametric distance measure that compares the distributions of two sets of data by mapping the data into a Reproducing Kernel Hilbert Space (RKHS) [4]. Given two distributions $P$ and $Q$, the MMD between $P$ and $Q$ is defined as:

$$MMD(X, Y) = \left\| \frac{1}{n_1} \sum_{i=1}^{n_1} \phi(\mathbf{x}_i) - \frac{1}{n_2} \sum_{j=1}^{n_2} \phi(\mathbf{y}_j) \right\|_{\mathcal{H}} \qquad (2)$$

where $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_{n_1}\}$ and $Y = \{\mathbf{y}_1, \ldots, \mathbf{y}_{n_2}\}$ are two random variable sets drawn from distributions $P$ and $Q$, respectively, and $\mathcal{H}$ is a universal RKHS with the reproducing kernel mapping $\phi: \mathcal{X} \rightarrow \mathcal{H}$.

The aim of DLC-DA is to jointly search a transformation matrix $\mathbf{A}$, projecting both the source and target data discriminatively into a latent shared feature subspace, and a label regressor, while simultaneously minimizing the three terms of the upper error bound in Eq.(1).
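To make the measure above concrete, below is a minimal numerical sketch of the empirical MMD of Eq.(2) under the simplifying assumption of an identity (linear) feature map, in which case the squared MMD reduces to the squared distance between sample means; the data and names are purely illustrative.

```python
import numpy as np

def linear_mmd2(Xs, Xt):
    """Empirical squared MMD of Eq.(2) with the identity feature map phi(x) = x.

    Xs: (n1, d) samples drawn from P, Xt: (n2, d) samples drawn from Q.
    With a linear kernel the squared MMD is the squared gap between the means.
    """
    mean_gap = Xs.mean(axis=0) - Xt.mean(axis=0)
    return float(mean_gap @ mean_gap)

# Illustrative usage: two Gaussian clouds whose means are shifted.
rng = np.random.default_rng(0)
Xs = rng.normal(loc=0.0, size=(200, 10))
Xt = rng.normal(loc=0.5, size=(300, 10))
print(linear_mmd2(Xs, Xt))  # grows with the mean shift between the two samples
```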
3.2 Formulation
Specifically, we aim to define an integrated optimization model with the following properties: P1) the classification error on the source domain is minimized; P2) the discrepancy between the two distributions $(\mathcal{D}_S, \mathcal{D}_T)$ is reduced; P3) inter-class distances in both domains are increased so as to facilitate discriminative learning; P4) label consistency across the two domains is explicitly maximized through iterative linear label regression; and P5) data outliers are accounted for to avoid negative transfer.
3.2.1 Matching Marginal and Conditional Distributions
To meet property P2, we follow JDA [21] and explicitly leverage the MMD in RKHS to measure the distances between the expectations of the source domain/sub-domains and target domain/sub-domains: 1) the marginal (empirical) distance between the source and target domains is defined as $\left\| \frac{1}{n_s} \sum_{i=1}^{n_s} \mathbf{A}^T\mathbf{x}_i - \frac{1}{n_t} \sum_{j=1}^{n_t} \mathbf{A}^T\mathbf{x}_j \right\|^2 = tr\big(\mathbf{A}^T\mathbf{X}\mathbf{M}_0\mathbf{X}^T\mathbf{A}\big)$; 2) the conditional distance is defined as the sum of the empirical distances between sub-domains in $\mathcal{D}_S$ and $\mathcal{D}_T$ with the same label $c$:

$$\sum_{c=1}^{C} \left\| \frac{1}{n_s^{(c)}} \sum_{\mathbf{x}_i \in \mathcal{D}_S^{(c)}} \mathbf{A}^T\mathbf{x}_i - \frac{1}{n_t^{(c)}} \sum_{\mathbf{x}_j \in \mathcal{D}_T^{(c)}} \mathbf{A}^T\mathbf{x}_j \right\|^2 = \sum_{c=1}^{C} tr\big(\mathbf{A}^T\mathbf{X}\mathbf{M}_c\mathbf{X}^T\mathbf{A}\big) \qquad (3)$$

where $C$ is the number of classes, $\mathcal{D}_S^{(c)}$ represents the $c$-th sub-domain in the source domain and $\mathcal{D}_T^{(c)}$ is defined similarly for the target domain. $\mathbf{M}_0$ represents the marginal MMD matrix between $\mathcal{D}_S$ and $\mathcal{D}_T$, with $(\mathbf{M}_0)_{ij} = \frac{1}{n_s n_s}$ if $\mathbf{x}_i, \mathbf{x}_j \in \mathcal{D}_S$, $(\mathbf{M}_0)_{ij} = \frac{1}{n_t n_t}$ if $\mathbf{x}_i, \mathbf{x}_j \in \mathcal{D}_T$, and $(\mathbf{M}_0)_{ij} = \frac{-1}{n_s n_t}$ otherwise. $\mathbf{M}_c$ represents the conditional MMD matrix between the sub-domains in $\mathcal{D}_S$ and $\mathcal{D}_T$, with $(\mathbf{M}_c)_{ij} = \frac{1}{n_s^{(c)} n_s^{(c)}}$ if $\mathbf{x}_i, \mathbf{x}_j \in \mathcal{D}_S^{(c)}$, $(\mathbf{M}_c)_{ij} = \frac{1}{n_t^{(c)} n_t^{(c)}}$ if $\mathbf{x}_i, \mathbf{x}_j \in \mathcal{D}_T^{(c)}$, $(\mathbf{M}_c)_{ij} = \frac{-1}{n_s^{(c)} n_t^{(c)}}$ if $\mathbf{x}_i \in \mathcal{D}_S^{(c)}$ and $\mathbf{x}_j \in \mathcal{D}_T^{(c)}$ or vice versa, and $(\mathbf{M}_c)_{ij} = 0$ otherwise. $n_s^{(c)}$ is the number of samples in the source sub-domain $\mathcal{D}_S^{(c)}$ and $n_t^{(c)}$ is defined similarly for the target sub-domain $\mathcal{D}_T^{(c)}$.
The discrepancy between the marginal distributions $P(\mathbf{X}_S)$ and $P(\mathbf{X}_T)$ can be reduced by minimizing $tr(\mathbf{A}^T\mathbf{X}\mathbf{M}_0\mathbf{X}^T\mathbf{A})$, whereas the mismatch of the conditional distributions between $\mathcal{D}_S^{(c)}$ and $\mathcal{D}_T^{(c)}$ can be decreased by minimizing $\sum_{c=1}^{C} tr(\mathbf{A}^T\mathbf{X}\mathbf{M}_c\mathbf{X}^T\mathbf{A})$. In summary, the discrepancies of both the marginal and conditional distributions between the source and target can be jointly reduced by minimizing $\sum_{c=0}^{C} tr(\mathbf{A}^T\mathbf{X}\mathbf{M}_c\mathbf{X}^T\mathbf{A})$.
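The following sketch shows one way the MMD matrices described above can be assembled, assuming the data matrix stacks the source samples before the target ones; the function and variable names are illustrative rather than taken from the paper.

```python
import numpy as np

def mmd_matrices(ys, yt_pseudo, num_classes):
    """Build the marginal (M_0) and class-conditional (M_c) MMD matrices.

    Indices assume X = [X_S, X_T] stacks the n_s source samples before the
    n_t target samples. ys: source labels, yt_pseudo: target pseudo labels.
    """
    ys = np.asarray(ys)
    yt_pseudo = np.asarray(yt_pseudo)
    ns, nt = len(ys), len(yt_pseudo)
    n = ns + nt

    e0 = np.concatenate([np.full(ns, 1.0 / ns), np.full(nt, -1.0 / nt)])
    M0 = np.outer(e0, e0)                         # marginal MMD matrix

    Mc_list = []
    for c in range(num_classes):
        ec = np.zeros(n)
        src_c = np.where(ys == c)[0]              # source samples of class c
        tgt_c = ns + np.where(yt_pseudo == c)[0]  # target samples of class c
        if src_c.size:
            ec[src_c] = 1.0 / src_c.size
        if tgt_c.size:
            ec[tgt_c] = -1.0 / tgt_c.size
        Mc_list.append(np.outer(ec, ec))          # conditional MMD matrix M_c
    return M0, Mc_list
```

Minimizing $tr(\mathbf{A}^T\mathbf{X}(\mathbf{M}_0 + \sum_{c} \mathbf{M}_c)\mathbf{X}^T\mathbf{A})$ with these matrices then implements the joint marginal and conditional alignment described above.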
3.2.2 Repulsing inter-class data for discriminative DA
Aligning data distributions as in the previous subsection does not guarantee that both the source and target data are discriminative with respect to the class labels. To satisfy properties P1 and P3, we introduce a repulsive force term $Dist^{S \rightarrow S}$, where $S \rightarrow S$ indexes the distances computed from the source sub-domains to the source sub-domains. $Dist^{S \rightarrow S}$ represents the sum of the distances from each source sub-domain $\mathcal{D}_S^{(c)}$ to all the other source sub-domains $\mathcal{D}_S^{(r)}$, $r \ne c$, i.e., excluding the $c$-th source sub-domain:
(4) 
Similarly, we can also introduce repulsive force terms between the source sub-domains and those in the target domain, denoted $Dist^{S \rightarrow T}$ and $Dist^{T \rightarrow S}$, which index the distances computed from $\mathcal{D}_S$ to $\mathcal{D}_T$ and from $\mathcal{D}_T$ to $\mathcal{D}_S$, respectively. $Dist^{S \rightarrow T}$ represents the sum of the distances between each source sub-domain $\mathcal{D}_S^{(c)}$ and all the target sub-domains $\mathcal{D}_T^{(r)}$ excluding the $c$-th target sub-domain. $Dist^{T \rightarrow S}$ represents the sum of the distances from each target sub-domain $\mathcal{D}_T^{(c)}$ to all the source sub-domains $\mathcal{D}_S^{(r)}$ excluding the $c$-th source sub-domain. These two distances are explicitly defined as:
(5) 
Eq.(5) is defined in a similar way to Eq.(4) and Eq.(3), with the corresponding MMD matrices built from the sub-domain pairs $(\mathcal{D}_S^{(c)}, \mathcal{D}_T^{(r)})$ and $(\mathcal{D}_T^{(c)}, \mathcal{D}_S^{(r)})$, $r \ne c$, following the same pattern of entries as $\mathbf{M}_c$ in Eq.(3).
Finally, by integrating Eq.(4) and Eq.(5), we obtain the final repulsive force term as
(6) 
We further define the sum of the corresponding MMD matrices as the repulsive force constraint matrix.
The maximization of Eq.(6) increases the distances between source sub-domains as well as those between source and target sub-domains carrying different labels, thereby enhancing the discriminative power of the underlying latent feature space.
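As a concrete illustration of the repulsive force construction described in this subsection, the sketch below assembles one plausible repulsive constraint matrix by summing MMD matrices over every pair of sub-domains carrying different labels (source-to-source, source-to-target and target-to-source); it follows the textual description above rather than the exact matrices of Eqs.(4)-(6), and all names are illustrative.

```python
import numpy as np

def repulsive_mmd_matrix(ys, yt_pseudo, num_classes):
    """Sum of MMD matrices between sub-domains with mismatched labels.

    Maximizing tr(A^T X M_rep X^T A) with this matrix pushes apart the
    source/source, source/target and target/source sub-domains whose
    labels differ, as described in Section 3.2.2.
    """
    ys = np.asarray(ys)
    yt_pseudo = np.asarray(yt_pseudo)
    ns, nt = len(ys), len(yt_pseudo)
    n = ns + nt
    M_rep = np.zeros((n, n))

    def pair_matrix(idx_a, idx_b):
        # MMD matrix between two index sets over the stacked data [X_S, X_T]
        e = np.zeros(n)
        if idx_a.size:
            e[idx_a] = 1.0 / idx_a.size
        if idx_b.size:
            e[idx_b] = -1.0 / idx_b.size
        return np.outer(e, e)

    for c in range(num_classes):
        src_c = np.where(ys == c)[0]
        tgt_c = ns + np.where(yt_pseudo == c)[0]
        for r in range(num_classes):
            if r == c:
                continue
            src_r = np.where(ys == r)[0]
            tgt_r = ns + np.where(yt_pseudo == r)[0]
            M_rep += pair_matrix(src_c, src_r)   # source -> source repulsion
            M_rep += pair_matrix(src_c, tgt_r)   # source -> target repulsion
            M_rep += pair_matrix(tgt_c, src_r)   # target -> source repulsion
    return M_rep
```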
3.2.3 Label Consistent Regression
Eq.(3) and Eq.(6) do not explicitly optimize the prediction errors of a learned hypothesis on the source data, nor do they ensure label consistency between the source and target domains.
To meet P1 and P4, and thereby explicitly optimize the first and third terms in Eq.(1), we introduce a novel label regression consistency constraint, where $\mathbf{A}$ is the transformation matrix projecting both the source and target data into a latent shared feature subspace of dimension $k$. Specifically, we first embed each $C$-dimensional label vector into the $k$-dimensional latent shared feature subspace by padding it with zeros. We can then perform class label regression and explicitly enhance the class prediction accuracy on the source data and the class label consistency between the source and target domains through least square regression (LSR): $\min_{\mathbf{A}} \|\mathbf{Y} - \mathbf{A}^T\mathbf{X}\|_F^2$, with $\mathbf{Y}$ the class label matrix as defined in Section 3.1, extended into a $k \times (n_s + n_t)$ matrix by embedding each label vector into a $k$-dimensional unit vector. This constraint simply expresses that each data sample should be projected into the vicinity of its corresponding unit label vector in the latent shared feature subspace.
As source and target data can be noisy, we also introduce a matrix E to model the noise, and reformulate the LSR accordingly. The introduction of the error matrix E makes it possible to account for outliers and thereby alleviates the influence of negative transfer, meeting property P5.
In many real-life applications, especially in the field of visual recognition, data of a given class generally lie within a manifold of much lower dimension compared with the original data space, e.g., the number of pixels of images. Therefore, we further introduce an $\ell_{2,1}$-norm constraint so as to express the property that the class label of a data sample should be regressed from a sparse combination of features in the latent shared feature subspace. This constraint introduces a regularization term on A for discriminative subspace projection.
The label regression consistency constraint is finally formulated as
(7) 
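A minimal sketch of the label embedding described above: each one-hot label (C-dimensional) is placed in the first C rows of a k-dimensional target vector and zero-padded, so that the projected data A^T X can be regressed directly towards Y; the helper name and the uniform treatment of source and pseudo-labeled target samples are assumptions made for illustration.

```python
import numpy as np

def embedded_label_matrix(ys, yt_pseudo, num_classes, k):
    """Label matrix Y used by the label-consistent regression of Eq.(7).

    One-hot labels (C-dimensional) are embedded into the k-dimensional
    subspace by zero-padding (requires k >= num_classes), giving the
    regression target for A^T X.
    """
    assert k >= num_classes, "subspace dimension must be at least C"
    labels = np.concatenate([np.asarray(ys), np.asarray(yt_pseudo)]).astype(int)
    n = labels.size
    Y = np.zeros((k, n))
    Y[labels, np.arange(n)] = 1.0   # one-hot in the first C rows, zeros below
    return Y
```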
3.2.4 The final model
By integrating all the properties expressed in the previous subsections, i.e., Eq.(3), Eq.(6) and Eq.(7), we obtain our final DA model, formulated as Eq.(8)
(8) 
Through the iterative optimization of Eq.(8), our DA method jointly searches a latent subspace and a label regression model satisfying properties P1 through P5 at the same time.
3.3 Solving the model
Eq.(8) is not convex. We propose an effective method that solves for each variable in a coordinate-descent manner. The main steps for solving Eq.(8) are as follows; all the key steps have a closed-form solution:
Step 1 (Initialization of the MMD matrix): since there are no labels or pseudo labels on the target domain yet, the MMD matrix is initialized from the marginal term $\mathbf{M}_0$ only, as defined in Eq.(3).
Step 2 (Initialization of $\mathbf{A}$): $\mathbf{A}$ can be initialized so as to reduce the marginal distribution discrepancy between $\mathcal{D}_S$ and $\mathcal{D}_T$ and to calculate an adaptive subspace via the Rayleigh quotient algorithm by solving Eq.(9):

$$\big(\mathbf{X}\mathbf{M}_0\mathbf{X}^T + \lambda\mathbf{I}\big)\mathbf{A} = \mathbf{X}\mathbf{H}\mathbf{X}^T\mathbf{A}\boldsymbol{\Phi} \qquad (9)$$

where $\mathbf{H}$ is the centering matrix and the diagonal elements of $\boldsymbol{\Phi}$ are the Lagrange multipliers. $\mathbf{A}$ is then initialized as the $k$ eigenvectors of Eq.(9) associated with the smallest eigenvalues.
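Under the assumption that Eq.(9) has the usual TCA/JDA form reconstructed above, initializing A amounts to a generalized eigenproblem; the sketch below illustrates this step, with the regularization weight and the small ridge on the right-hand side as assumed numerical details.

```python
import numpy as np
import scipy.linalg

def init_projection(X, M0, k, reg=1e-3):
    """Initialize A from the generalized eigenproblem of Eq.(9).

    X: (m, n) data with source and target samples stacked column-wise,
    M0: (n, n) marginal MMD matrix, k: subspace dimension.
    """
    m, n = X.shape
    H = np.eye(n) - np.ones((n, n)) / n              # centering matrix
    lhs = X @ M0 @ X.T + reg * np.eye(m)
    rhs = X @ H @ X.T + 1e-8 * np.eye(m)             # small ridge for stability
    w, V = scipy.linalg.eig(lhs, rhs)                # solves lhs v = w * rhs v
    order = np.argsort(w.real)                       # smallest eigenvalues first
    return V[:, order[:k]].real                      # A: (m, k)
```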
Step 3 (Update of E): E is updated by solving Eq.(8) with the other variables held fixed. To update E, one should solve Eq.(10)
(10) 
Setting the partial derivative of Eq.(10) with respect to E to zero, we obtain the optimal solution of E as
(11) 
Step 4 (Update of A): A is updated by solving the optimization problem in Eq.(8) with the other variables held fixed. To make sure Eq.(8) is differentiable, we regularize each row norm $\|\mathbf{a}^i\|_2$ of the $\ell_{2,1}$-norm term as $\sqrt{\mathbf{a}^i (\mathbf{a}^i)^T + \epsilon}$ so as to avoid division by zero. As a result, Eq.(8) becomes Eq.(12)
(12) 
Since $\epsilon$ is infinitely close to zero, Eq.(12) is closely equivalent to Eq.(8). As solving Eq.(12) directly is non-trivial, we introduce a new variable $\mathbf{G}$, a diagonal matrix with $G_{ii} = \frac{1}{2\sqrt{\mathbf{a}^i (\mathbf{a}^i)^T + \epsilon}}$. $\mathbf{G}$ and $\mathbf{A}$ can then be optimized iteratively. With $\mathbf{G}$ held fixed and $\mathbf{E}$ computed as in Eq.(11), we can reformulate Eq.(12) as Eq.(13)
(13) 
The closed-form solution of Eq.(13) is
(14) 
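Since Eqs.(13)-(14) are not reproduced here, the sketch below illustrates the G/A alternation of Step 4 on a simplified stand-in objective, ||Y - A^T X||_F^2 + alpha * ||A||_{2,1}, using the classical re-weighting trick described above; the objective, names and defaults are assumptions, not the paper's exact formulas.

```python
import numpy as np

def l21_reweighted_regression(X, Y, alpha=1.0, n_iter=20, eps=1e-8):
    """Alternate between the diagonal re-weighting matrix G and A.

    G_ii = 1 / (2 * sqrt(||a^i||^2 + eps)); given G, A has a ridge-like
    closed form. X: (m, n) data, Y: (k, n) regression targets, A: (m, k).
    """
    m = X.shape[0]
    XXt = X @ X.T
    XYt = X @ Y.T
    A = np.linalg.solve(XXt + alpha * np.eye(m), XYt)      # ridge initialization
    for _ in range(n_iter):
        row_norms = np.sqrt(np.sum(A ** 2, axis=1) + eps)
        G = np.diag(1.0 / (2.0 * row_norms))               # re-weighting matrix
        A = np.linalg.solve(XXt + alpha * G, XYt)          # closed form given G
    return A
```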
Step 5 (Update of Y): the label matrix Y contains two parts: the true source labels $\mathbf{Y}_S$ and the target pseudo labels $\mathbf{Y}_T$. Our aim is to iteratively refine the latter. With A, E and G held fixed, each target pseudo label can be updated by solving the following problem:
(15) 
Using the Lagrangian multiplier method, the final optimal solution of each target pseudo label is
(16) 
where the coefficient of the Lagrangian constraint can be obtained by solving the corresponding constraint equation.
Step 6 (Update of the MMD matrices): with the labeled source data and the pseudo labels generated on the target data in Step 5, we can update the MMD matrices as
(17) 
Algorithm 1 summarizes the whole process for solving Eq.(8).
3.4 Kernelization Analysis
The proposed DLC-DA method can be extended to nonlinear problems in a Reproducing Kernel Hilbert Space via a kernel mapping $\psi: \mathbf{x} \mapsto \psi(\mathbf{x})$, with $\psi(\mathbf{X}) = [\psi(\mathbf{x}_1), \ldots, \psi(\mathbf{x}_{n_s+n_t})]$, and the kernel matrix $\mathbf{K} = \psi(\mathbf{X})^T\psi(\mathbf{X})$. We utilize the Representer theorem to formulate Kernel DLC-DA as
(18) 
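For illustration, the sketch below builds a kernel matrix that can replace the raw data matrix in the kernelized variant; the RBF kernel and the median-heuristic bandwidth are assumed choices, not prescribed by the paper.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=None):
    """Kernel matrix K = [k(x_i, x_j)] for the kernelized variant.

    X: (m, n) data with samples as columns. Via the Representer theorem,
    the projection is expressed over the mapped samples, so the solver
    operates on K (n x n) instead of X.
    """
    sq_norms = np.sum(X ** 2, axis=0)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X.T @ X)
    d2 = np.maximum(d2, 0.0)
    if gamma is None:
        pos = d2[d2 > 0]
        gamma = 1.0 / np.median(pos) if pos.size else 1.0   # median heuristic
    return np.exp(-gamma * d2)
```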
4 Experiments
4.1 Benchmarks
In domain adaptation, Office+Caltech [21, 36, 15, 29] is a standard benchmark for evaluation and comparison with the state of the art. In this paper, we follow the data preparation of most previous works [33, 36, 15, 12, 10, 6] and construct 12 datasets for different image classification tasks.
Office+Caltech consists of 2533 images of ten categories (8 to 151 images per category per domain) [10], which form four domains: (A) AMAZON, (D) DSLR, (W) WEBCAM, and (C) CALTECH. Fig.1 illustrates some sample images from each domain. We denote the datasets Amazon, Webcam, DSLR, and Caltech-256 by A, W, D, and C, respectively. Twelve domain adaptation tasks can then be constructed, namely A → W, …, C → D.
Note that the arrow "→" indicates the direction from "source" to "target". For example, "Webcam → DSLR" means Webcam is the labeled source domain while DSLR is the unlabeled target domain.
4.2 State-of-the-art DA Methods
The proposed DLC-DA method is compared with twenty-three methods from the literature, including deep learning-based approaches for unsupervised domain adaptation, given that we also make use of deep features in our experiments. They are: (1) 1-Nearest Neighbor classifier (NN); (2) Principal Component Analysis (PCA); (3) GFK [13]; (4) TCA [24]; (5) TSL [31]; (6) JDA [21]; (7) ELM [33]; (8) AELM [33]; (9) SA [9]; (10) mSDA [5]; (11) TJM [22]; (12) RTML [6]; (13) SCA [10]; (14) CDML [35]; (15) DDC [32]; (16) LTSL [29]; (17) LRSR [36]; (18) KPCA [28]; (19) JGSA [37]; (20) DAN [20]; (21) AlexNet [18]; (22) PUnDA [11]; (23) TAISL [23].

4.3 Experimental Setup
We used two types of publicly available features extracted from these datasets, namely SURF and DeCAF6 features. The SURF [13] features are extracted and quantized into an 800-bin histogram with a codebook computed by K-means on a subset of images from Amazon. The histograms are then standardized by z-score. The Deep Convolutional Activation Features (DeCAF6) [8] were constructed as in previous research [10, 37, 23], which uses the VLFeat MatConvNet [34] library with a number of pre-trained CNN models. With the Caffe [16] implementation of AlexNet [19] trained on the ImageNet dataset, we used the outputs of the 6th layer as the features, leading to 4096-dimensional DeCAF6 features.

4.4 Results and Discussion
4.4.1 Experiments on the Office+Caltech-256 Data Sets
We follow the experimental settings of JDA, LTSL, DAN and LRSR, and apply DeCAF6 features for some of the methods to be evaluated. However, whenever possible, we directly report the performance scores of the twenty-three methods listed above from their publications, i.e., [21, 33, 6, 10, 36, 37, 11]. They are assumed to represent their best performance.
As can be seen from Fig.2 and Fig.3, the experimental results verify the effectiveness of the proposed DLC-DA method, which consistently outperforms state-of-the-art DA methods, whether with the traditional shallow SURF features or the deep DeCAF6 features. Given such a performance, one natural question is how sensitive the proposed method is to the choice of its hyper-parameters. This issue is analyzed in the next subsection.
4.4.2 Parameter Sensitivity Analysis
The two regularization parameters defined in Eq.(8) are the major hyper-parameters of the proposed DLC-DA method. While the first aims to regularize the projection matrix to avoid overfitting the chosen shared feature subspace with respect to both the source and target data, the second, as expressed in Eq.(7), controls the dimensionality of the class-dependent data manifold in the searched shared feature subspace, or in other words the sparsity level of the linear combination of the projected features used to regress the class label. We study the sensitivity of the proposed DLC-DA method over a wide range of values for both parameters. Due to space limitations, we only report the results on the C → D and W → D datasets. Fig.4 illustrates these results with the remaining settings held fixed. As can be seen from Fig.4, the proposed DLC-DA displays its stability, as the resultant classification accuracies remain roughly the same over a wide range of parameter values.
In Fig.5, we further perform a convergence analysis of the proposed DLC-DA method, namely the performance convergence w.r.t. the number of iterations, as well as the impact of the chosen dimensionality $k$ of the searched shared feature subspace. We report the results on D → W, W → C and D → A. In Fig.5.a, we vary the number of iterations, whereas the subspace dimensionality $k$ varies in Fig.5.b. As can be seen from Fig.5.a, the proposed DLC-DA achieves its optimal performance after only 2 iterations. Furthermore, Fig.5.b shows that DLC-DA remains stable w.r.t. a wide range of $k$.
4.4.3 Analysis and Verification
The core model of the proposed DLC-DA method is Eq.(8), which adds up two optimization terms, namely the discriminative data distribution alignment (DDA) term as defined in Eq.(19) and the label consistent regression (LCR) term as defined in Eq.(7). The DDA and LCR terms aim, respectively, to discriminatively decrease the data distribution mismatch and to ensure label consistency, as defined by the second and third terms of the error bound in Eq.(1). One interesting question is how much each of these terms contributes to the proposed final model as defined by Eq.(8). For this purpose, we derive from Eq.(8) two additional partial DLC-DA methods, namely DLC-DA(DDA), making use only of the discriminative distribution alignment defined in Eq.(19), and DLC-DA(LCR), restricted to the label consistent regression defined in Eq.(7). They are benchmarked on the Office+Caltech datasets with the DeCAF6 features against the proposed full DLC-DA method. Fig.6 plots these experimental results.
(19) 
As can be seen from this figure, the proposed DLC-DA outperforms both DLC-DA(LCR) and DLC-DA(DDA). These results thus suggest the complementarity of the DDA and LCR terms and the added value of their joint optimization. To gain intuition, Fig.6 further visualizes class-explicit data distributions in the resultant shared feature subspace for the three variants of DLC-DA over three cross-domain datasets, namely W → C, D → C and W → D. Different colors represent different classes. As can be seen in the figure, DLC-DA(DDA) shows its effectiveness in compacting intra-class instances. When the DDA term is further combined with the LCR term, the proposed DLC-DA better separates data from different classes by increasing inter-class distances.
To further gain insight into the proposed DLC-DA w.r.t. its domain adaptation abilities, we also evaluate DLC-DA on a synthetic dataset in comparison with several state-of-the-art DA methods. Fig.7 visualizes the original data distributions with 4 classes and the resultant shared feature subspaces computed by TCA, JDA, TJM, SCA and DLC-DA, respectively. In this experiment, we focus on the ability of the DA methods to discriminatively align data distributions between source and target. As such, the original synthetic data depicts slight distribution discrepancies between source and target for the first two classes and a wide distribution mismatch for the third and fourth classes; the fourth class further depicts a moon-like geometric structure. As can be seen in Fig.7, the baseline methods have difficulty aligning data distributions with wide discrepancies, i.e., the third and fourth classes. In contrast, thanks to the joint use of the DDA and LCR terms, the proposed DLC-DA not only aligns data distributions compactly but also separates class data very distinctively.
5 Conclusion
We have proposed in this paper a novel unsupervised DA method, namely Discriminative Label Consistent Domain Adaptation (DLC-DA), which, in contrast to state-of-the-art DA methods focused only on data alignment, simultaneously optimizes the three terms of the upper error bound of a learned classifier on the target domain. Furthermore, data outliers are also explicitly accounted for in our model to avoid negative transfer. Comprehensive experiments on the standard Office+Caltech DA benchmark show the effectiveness of the proposed method, which consistently outperforms state-of-the-art DA methods.
References
 [1] M. Baktashmotlagh, M. Harandi, and M. Salzmann. Distribution-matching embedding for visual domain adaptation. Journal of Machine Learning Research, 17(108):1–30, 2016.
 [2] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine Learning, 79(1):151–175, 2010.
 [3] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adaptation. In Advances in neural information processing systems, pages 137–144, 2007.
 [4] K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J. Smola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49–e57, 2006.
 [5] M. Chen, Z. E. Xu, K. Q. Weinberger, and F. Sha. Marginalized denoising autoencoders for domain adaptation. CoRR, abs/1206.4683, 2012.
 [6] Z. Ding and Y. Fu. Robust transfer metric learning for image classification. IEEE Trans. Image Processing, 26(2):660–670, 2017.
 [7] J. Donahue, J. Hoffman, E. Rodner, K. Saenko, and T. Darrell. Semi-supervised domain adaptation with instance constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 668–675, 2013.
 [8] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 647–655, 2014.
 [9] B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars. Unsupervised visual domain adaptation using subspace alignment. In IEEE International Conference on Computer Vision, ICCV 2013, Sydney, Australia, December 1-8, 2013, pages 2960–2967, 2013.
 [10] M. Ghifary, D. Balduzzi, W. B. Kleijn, and M. Zhang. Scatter component analysis: A unified framework for domain adaptation and domain generalization. IEEE Trans. Pattern Anal. Mach. Intell., 39(7):1414–1430, 2017.
 [11] B. Gholami, O. Rudovic, and V. Pavlovic. PUnDA: Probabilistic unsupervised domain adaptation for knowledge transfer across visual categories. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
 [12] B. Gong, K. Grauman, and F. Sha. Connecting the dots with landmarks: Discriminatively learning domain-invariant features for unsupervised domain adaptation. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, pages 222–230, 2013.
 [13] B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2066–2073. IEEE, 2012.
 [14] A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola. A kernel method for the two-sample problem. In Advances in neural information processing systems, pages 513–520, 2007.
 [15] C. Hou, Y. H. Tsai, Y. Yeh, and Y. F. Wang. Unsupervised domain adaptation with label and structural consistency. IEEE Trans. Image Processing, 25(12):5552–5562, 2016.
 [16] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, MM '14, Orlando, FL, USA, November 03-07, 2014, pages 675–678, 2014.
 [17] D. Kifer, S. Ben-David, and J. Gehrke. Detecting change in data streams. In Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30, pages 180–191. VLDB Endowment, 2004.
 [18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 [19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. Commun. ACM, 60(6):84–90, 2017.
 [20] M. Long, Y. Cao, J. Wang, and M. I. Jordan. Learning transferable features with deep adaptation networks. In ICML, pages 97–105, 2015.
 [21] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu. Transfer feature learning with joint distribution adaptation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2200–2207, 2013.
 [22] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu. Transfer joint matching for unsupervised domain adaptation. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pages 1410–1417, 2014.
 [23] H. Lu, L. Zhang, Z. Cao, W. Wei, K. Xian, C. Shen, and A. van den Hengel. When unsupervised domain adaptation meets tensor representations. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
 [24] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22(2):199–210, 2011.
 [25] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2010.
 [26] P. Panareda Busto and J. Gall. Open set domain adaptation. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
 [27] V. M. Patel, R. Gopalan, R. Li, and R. Chellappa. Visual domain adaptation: A survey of recent advances. IEEE Signal Processing Magazine, 32(3):53–69, May 2015.
 [28] B. Schölkopf, A. J. Smola, and K. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.
 [29] M. Shao, D. Kit, and Y. Fu. Generalized transfer subspace learning through low-rank constraint. International Journal of Computer Vision, 109(1-2):74–93, 2014.
 [30] S. Si, D. Tao, and B. Geng. Bregman divergencebased regularization for transfer subspace learning. IEEE Transactions on Knowledge and Data Engineering, 22(7):929–942, 2010.
 [31] S. Si, D. Tao, and B. Geng. Bregman divergencebased regularization for transfer subspace learning. IEEE Transactions on Knowledge and Data Engineering, 22(7):929–942, July 2010.
 [32] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell. Deep domain confusion: Maximizing for domain invariance. CoRR, abs/1412.3474, 2014.
 [33] M. Uzair and A. S. Mian. Blind domain adaptation with augmented extreme learning machine features. IEEE Trans. Cybernetics, 47(3):651–660, 2017.
 [34] A. Vedaldi and K. Lenc. MatConvNet - convolutional neural networks for MATLAB. CoRR, abs/1412.4564, 2014.
 [35] H. Wang, W. Wang, C. Zhang, and F. Xu. Cross-domain metric learning based on information theory. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, July 27-31, 2014, Québec City, Québec, Canada, pages 2099–2105, 2014.
 [36] Y. Xu, X. Fang, J. Wu, X. Li, and D. Zhang. Discriminative transfer subspace learning via low-rank and sparse representation. IEEE Trans. Image Processing, 25(2):850–863, 2016.
 [37] J. Zhang, W. Li, and P. Ogunbona. Joint geometrical and statistical alignment for visual domain adaptation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.