1 Introduction
Supervised machine learning requires large amounts of labeled training data for effective training, especially when the underlying prediction model is complex, e.g., deep learning models [9, 3] with millions of parameters. Furthermore, it assumes that training and testing data follow the same distribution for the learned prediction model to be effective. However, such requirements are hardly satisfied in real-life applications, as labeled data are generally rare and manual annotation can be tedious [15, 14], expensive, and even imprecise or impractical, especially when labels require pixel-wise precision in a number of computer vision tasks, e.g., object edge detection, semantic segmentation or medical image segmentation. Transfer learning (TL) aims to mitigate or even bypass such labeled-data starvation by leveraging existing related labeled source data. As such, TL has received increasing interest from various research communities [14]. In this paper, we investigate a specific TL problem, namely unsupervised Domain Adaptation (DA), which assumes a shared task space, a source domain with labeled data and a target domain with only unlabeled data. While the source and target domains are related, their data distributions are assumed to be very different. As a result, a prediction model learned on the labeled source data generally performs very poorly when directly applied to the unlabeled target data. Because DA is certainly one of the most frequently encountered TL problems in real-life applications, it has been the focus of significant research efforts in recent years [3, 9, 11, 15].

The mainstream research in DA is the search of a latent joint subspace [13] between source and target and features two main lines of approaches. The first line, e.g., LTSL [16] and LRSR [18], seeks a subspace where source and target data can be well aligned and interlaced while preserving the inherent hidden geometric data structure via low-rank constraints and/or sparse representation, whereas the second line, e.g., JDA [10] and CDDA [11], searches a subspace where the discrepancy between the source and target data distributions is minimized via a non-parametric distance, i.e., the Maximum Mean Discrepancy (MMD) [5]. While the first line of approaches has no guarantee that the data distribution discrepancy between source and target will be minimized, the second needs to resort to additional ad hoc methods to make explicit the hidden data geometric structure and enable effective label propagation.
In this paper we propose a novel DA method, namely Robust Data Structure Aligned Close yet Discriminative DA (RSACDDA), which unifies in a single model the previous two lines of approaches. Specifically, the proposed DA method incorporates into its model objective functions which, in searching a joint subspace, 1) bring closer, via a non-parametric distance, i.e., the Maximum Mean Discrepancy (MMD), both the marginal and class-conditional data distributions while repulsing inter-class source and target data; 2) align the inherent hidden data geometric structures of source and target through a locality-aware, i.e., low-rank, and sparse reconstruction of the target data using source data; and 3) provide robustness against data outliers through a column-wise reconstruction error matrix which is enforced to be sparse. The proposed model is solved through iterative optimization, alternating between the Rayleigh quotient algorithm and the inexact augmented Lagrange multiplier algorithm. Extensive experiments carried out on 16 cross-domain image classification tasks verify the effectiveness of the proposed DA method, which consistently outperforms state-of-the-art methods. Figure 1 intuitively illustrates the proposed method.
To sum up, the contributions of this paper are threefold: (1) the design of a novel discriminative DA model, which unifies in a single framework the alignment of both data distributions and geometric structures between source and target while repulsing inter-class data; (2) the introduction of a novel DA algorithm, which iteratively searches a joint subspace optimizing the proposed DA model, alternately using the Rayleigh quotient algorithm and the inexact augmented Lagrange multiplier algorithm; (3) the verification of the effectiveness of the proposed DA method on 16 cross-domain image classification tasks.
2 Related Work
State of the art in DA has so far featured two main approaches: 1) the search of a novel joint subspace where source and target domains have a similar data distribution [13, 10, 11]; or 2) the adaptation of the prediction model trained on the source data [18, 16, 6]. In the first approach, source and target data are changed because they are projected into the novel joint subspace. A prediction model trained on labeled source data in this subspace can thus be applied to the projected target data, because they are aligned or have similar data distributions there. The second approach modifies the parameters of the prediction model trained on the source domain so that its decision boundaries are adapted to the target data, which remain unchanged.

Because the target domain only contains unlabeled data, modifying a prediction model's parameters proves difficult, and an increasing research focus has recently turned to the first approach. One can distinguish two lines of research in this direction: 1) data distribution oriented methods [12, 13, 10, 11]; and 2) data reconstruction-based methods [18, 16, 6, 19].
Data distribution oriented methods seek a joint subspace in which the mismatch between the source and target data distributions is decreased, using a non-parametric metric, namely the Maximum Mean Discrepancy (MMD), based on the Reproducing Kernel Hilbert Space (RKHS) [2]. TCA [13, 12] brings closer the source and target marginal distributions; JDA [10] further decreases the discrepancy of both the marginal and conditional distributions. CDDA [11] goes one step further with respect to JDA and achieves discriminative DA by introducing a repulsive force term into its model to repulse inter-class instances. However, none of these methods explicitly accounts in its model for the inherent hidden data structure, which is important for reliable label propagation and thereby for avoiding negative knowledge transfer from the source domain. One exception is CDDA, which, however, resorts to an ad hoc method, namely spectral clustering, to make the label propagation respect a constraint of geometric structure consistency.
This drawback has been precisely addressed by data reconstruction-based methods, e.g., RDALR [6], LTSL [16], LRSR [18] and LSDT [19], which search a learned subspace minimizing the reconstruction error of the target data using the source data, or both the source and target data, and make use of low-rank and sparse representation to segment the inherent hidden data structure and account for data outliers as well. While these methods ensure that source and target data are well aligned and interleaved, they have no theoretical guarantee that the aligned source and target data have similar distributions.

The proposed RSACDDA combines the advantages of the previous two research lines and unifies in a single framework both the data distribution oriented methods and the reconstruction-based ones.
3 Method
3.1 Notations and Problem Statement
We begin with the definitions of the notations and concepts, most of which are borrowed directly from CDDA [11].
Vectors and matrices are frequently used in the sequel and are represented with bold symbols. Given a matrix $\mathbf{M}$, its Frobenius norm and nuclear norm are defined as $\|\mathbf{M}\|_F = \sqrt{\sum_i \sigma_i^2}$ and $\|\mathbf{M}\|_* = \sum_i \sigma_i$, where $\sigma$ is the singular value vector of $\mathbf{M}$. A domain $\mathcal{D}$ is defined as an $m$-dimensional feature space $\mathcal{X}$ and a marginal probability distribution $P(\mathbf{x})$, i.e., $\mathcal{D} = \{\mathcal{X}, P(\mathbf{x})\}$ with $\mathbf{x} \in \mathcal{X}$. Given a specific domain $\mathcal{D}$, a task $\mathcal{T}$ is composed of a $C$-cardinality label set $\mathcal{Y}$ and a classifier $f(\mathbf{x})$, i.e., $\mathcal{T} = \{\mathcal{Y}, f(\mathbf{x})\}$, where $f(\mathbf{x}) = Q(y \mid \mathbf{x})$ can be interpreted as the class-conditional probability distribution for each input sample $\mathbf{x}$. In unsupervised domain adaptation (DA), we are given a source domain $\mathcal{D}_S$ with $n_s$ labeled samples and a target domain $\mathcal{D}_T$ with $n_t$ unlabeled samples, with the assumption that the source and target domains are different, i.e., $\mathcal{X}_S = \mathcal{X}_T$, $\mathcal{Y}_S = \mathcal{Y}_T$, $P(\mathbf{X}_S) \neq P(\mathbf{X}_T)$, $Q(\mathbf{Y}_S \mid \mathbf{X}_S) \neq Q(\mathbf{Y}_T \mid \mathbf{X}_T)$. We also define the notion of sub-domain, denoted $\mathcal{D}_S^{(c)}$, representing the set of samples in $\mathcal{D}_S$ with label $c$. Similarly, a sub-domain $\mathcal{D}_T^{(c)}$ can be defined for the target domain as the set of samples in $\mathcal{D}_T$ with label $c$. However, as $\mathcal{D}_T$ consists of unlabeled samples, a basic classifier, e.g., Nearest Neighbor (NN), is needed to attribute pseudo labels to the samples in $\mathcal{D}_T$.
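Since the target domain is unlabeled, the pseudo-labeling step above can be realized with a 1-NN classifier trained on the source. The following NumPy sketch is a minimal illustration (columns are samples; the function name `pseudo_label` is ours, not from the paper):

```python
import numpy as np

def pseudo_label(Xs, ys, Xt):
    """Assign pseudo labels to unlabeled target samples with a 1-NN
    classifier trained on the labeled source samples.
    Xs: (m, n_s) source data, ys: (n_s,) labels, Xt: (m, n_t) target data."""
    # Squared Euclidean distances between every target and source sample.
    d2 = ((Xt[:, :, None] - Xs[:, None, :]) ** 2).sum(axis=0)  # (n_t, n_s)
    # Each target sample inherits the label of its nearest source sample.
    return ys[d2.argmin(axis=1)]
```

In the full method these pseudo labels are refreshed at each iteration once the projection improves.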
The aim of the Robust Data Geometric Structure Aligned Close yet Discriminative Domain Adaptation (RSACDDA) is to learn a latent subspace with the following properties: P1) the discrepancy of both the marginal and conditional distributions between the source and target domains is reduced; P2) the distances between each sub-domain and the others are increased in order to enable a discriminative DA; P3) both the inherent local and global data geometric structures are preserved and aligned for reliable label prediction; and P4) data outliers are accounted for to avoid negative transfer.
3.2 Model
3.2.1 Latent Feature Space with Dimensionality Reduction
The finding of a latent feature space with dimensionality reduction has been demonstrated useful for DA in several previous works, e.g., [12, 13, 10]. One of its important properties is that the original data are projected into a lower-dimensional space which captures the principal structure of the data. In the proposed method, we also apply Principal Component Analysis (PCA). Mathematically, given an input data matrix $\mathbf{X} = [\mathbf{X}_S, \mathbf{X}_T] \in \mathbb{R}^{m \times n}$, the centering matrix is defined as $\mathbf{H} = \mathbf{I} - \frac{1}{n}\mathbf{1}$, where $\mathbf{1}$ is the $n \times n$ matrix of ones. The optimization of PCA is to find a projection matrix $\mathbf{A}$ which maximizes the embedded data variance:

$$\max_{\mathbf{A}^\top \mathbf{A} = \mathbf{I}} \operatorname{tr}(\mathbf{A}^\top \mathbf{X} \mathbf{H} \mathbf{X}^\top \mathbf{A}) \quad (1)$$

where $\operatorname{tr}(\cdot)$ denotes the trace of a matrix, $\mathbf{X}\mathbf{H}\mathbf{X}^\top$ is the data covariance matrix, and $\mathbf{A} \in \mathbb{R}^{m \times k}$ with $m$ the feature dimension and $k$ the dimension of the projected subspace. The optimal solution is calculated by solving the eigendecomposition problem $\mathbf{X}\mathbf{H}\mathbf{X}^\top \mathbf{A} = \mathbf{A}\Phi$, where $\Phi$ contains the $k$ largest eigenvalues. Finally, the original data $\mathbf{X}$ are projected into the optimal $k$-dimensional subspace as $\mathbf{Z} = \mathbf{A}^\top \mathbf{X}$.

3.2.2 Bringing Closer the Marginal and Conditional Distributions
However, the subspace calculated via PCA does not decrease the mismatch of data distributions between the source and target domains. As a result, to meet property P1, we explicitly leverage the non-parametric MMD distance in RKHS [1] to compute the distance between the expectations of the source domain/sub-domains and the target domain/sub-domains, once the original data are projected into the low-dimensional feature space. Formally, the empirical distance between the source and target domains is defined as

$$Dist^{marginal} = \left\| \frac{1}{n_s} \sum_{i=1}^{n_s} \mathbf{A}^\top \mathbf{x}_i - \frac{1}{n_t} \sum_{j=1}^{n_t} \mathbf{A}^\top \mathbf{x}_j \right\|^2.$$

The distance between the conditional probability distributions is defined as the sum of the empirical distances over the class labels between the sub-domains of a same label in the source and target domains:

$$Dist^{conditional} = \sum_{c=1}^{C} \left\| \frac{1}{n_s^{(c)}} \sum_{\mathbf{x}_i \in \mathcal{D}_S^{(c)}} \mathbf{A}^\top \mathbf{x}_i - \frac{1}{n_t^{(c)}} \sum_{\mathbf{x}_j \in \mathcal{D}_T^{(c)}} \mathbf{A}^\top \mathbf{x}_j \right\|^2 \quad (2)$$

where $C$ is the number of classes, $\mathcal{D}_S^{(c)}$ represents the $c$-th sub-domain in the source domain, $n_s^{(c)}$ is the number of samples in the $c$-th source sub-domain, and $\mathcal{D}_T^{(c)}$ and $n_t^{(c)}$ are defined similarly for the target domain. Finally, $Dist^{marginal}$ represents the marginal distribution distance between $\mathcal{D}_S$ and $\mathcal{D}_T$ and $Dist^{conditional}$ the conditional distribution distance between the sub-domains of $\mathcal{D}_S$ and $\mathcal{D}_T$; they can be rewritten as:

$$Dist^{marginal} = \operatorname{tr}(\mathbf{A}^\top \mathbf{X} \mathbf{M}_0 \mathbf{X}^\top \mathbf{A}), \qquad Dist^{conditional} = \sum_{c=1}^{C} \operatorname{tr}(\mathbf{A}^\top \mathbf{X} \mathbf{M}_c \mathbf{X}^\top \mathbf{A}) \quad (3)$$

where $\mathbf{M}_0$ and $\mathbf{M}_c$ are the marginal and conditional MMD matrices as defined in [10]. The mismatch between the marginal distributions of $\mathcal{D}_S$ and $\mathcal{D}_T$ is reduced by minimizing $Dist^{marginal}$, and the mismatch between their conditional distributions is reduced by minimizing $Dist^{conditional}$.
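The two distances above can be illustrated with a minimal NumPy sketch computing the empirical marginal and conditional MMD terms on already-projected samples (rows are samples; the function names and the use of pseudo labels for the target sub-domains are our illustrative choices, not the paper's code):

```python
import numpy as np

def mmd_marginal(Zs, Zt):
    """Squared distance between the means of the projected source and
    target samples (rows = samples): the empirical marginal MMD term."""
    return np.sum((Zs.mean(axis=0) - Zt.mean(axis=0)) ** 2)

def mmd_conditional(Zs, ys, Zt, yt_pseudo):
    """Sum over classes of the squared distances between the means of
    same-label source and target sub-domains; target labels are pseudo
    labels, since the target domain is unlabeled."""
    total = 0.0
    for c in np.unique(ys):
        s, t = Zs[ys == c], Zt[yt_pseudo == c]
        if len(s) and len(t):  # skip classes absent from either domain
            total += np.sum((s.mean(axis=0) - t.mean(axis=0)) ** 2)
    return total
```

Minimizing both quantities over the projection is what Eq. (3) expresses in closed trace form with the MMD matrices.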
3.2.3 Repulsing inter-class data for discriminative DA
In bringing closer the data distributions, the previous DA model does not explicitly achieve a discriminative DA by repulsing inter-class data. Such a discriminative DA can be achieved by introducing a novel repulsive force $Dist^{repulsive} = Dist_{S \to T} + Dist_{T \to S} + Dist_{S \to S}$, which aims to increase inter-class distances and thereby satisfies property P2; here $S \to T$, $T \to S$ and $S \to S$ index the distances computed from $\mathcal{D}_S$ to $\mathcal{D}_T$, from $\mathcal{D}_T$ to $\mathcal{D}_S$, and within $\mathcal{D}_S$, respectively. $Dist_{S \to T}$ is the sum of the distances between each source sub-domain $\mathcal{D}_S^{(c)}$ and all the target sub-domains except the one with label $c$. $Dist_{T \to S}$ is the sum of the distances from each target sub-domain $\mathcal{D}_T^{(c)}$ to all the source sub-domains except the one with label $c$. $Dist_{S \to S}$ is the sum of the distances from each source sub-domain $\mathcal{D}_S^{(c)}$ to all the other source sub-domains. The sum of these distances is explicitly defined as:

$$Dist_{S \to T} + Dist_{T \to S} + Dist_{S \to S} = \operatorname{tr}\left(\mathbf{A}^\top \mathbf{X} \left(\mathbf{M}_{S \to T} + \mathbf{M}_{T \to S} + \mathbf{M}_{S \to S}\right) \mathbf{X}^\top \mathbf{A}\right) \quad (4)$$

where $\mathbf{M}_{S \to T}$, $\mathbf{M}_{T \to S}$ and $\mathbf{M}_{S \to S}$ are the inter-class MMD matrices, built analogously to $\mathbf{M}_c$ in Eq. (3) but over sub-domain pairs with different labels (see [11] for their element-wise definitions), so that

$$Dist_{S \to T} = \operatorname{tr}(\mathbf{A}^\top \mathbf{X} \mathbf{M}_{S \to T} \mathbf{X}^\top \mathbf{A}), \quad Dist_{T \to S} = \operatorname{tr}(\mathbf{A}^\top \mathbf{X} \mathbf{M}_{T \to S} \mathbf{X}^\top \mathbf{A}), \quad Dist_{S \to S} = \operatorname{tr}(\mathbf{A}^\top \mathbf{X} \mathbf{M}_{S \to S} \mathbf{X}^\top \mathbf{A}) \quad (5)$$

Finally, we obtain

$$Dist^{repulsive} = \operatorname{tr}\left(\mathbf{A}^\top \mathbf{X} \mathbf{M}_{repulsive} \mathbf{X}^\top \mathbf{A}\right), \qquad \mathbf{M}_{repulsive} = \mathbf{M}_{S \to T} + \mathbf{M}_{T \to S} + \mathbf{M}_{S \to S} \quad (6)$$

We define $\mathbf{M}_{repulsive}$ as the repulsive force constraint matrix. While the minimization of Eq. (2) brings closer both the marginal and conditional distributions between source and target, the maximization of Eq. (4) increases the distances between source and target sub-domains with different labels, as well as between source sub-domains with different labels, thereby enhancing the discriminative power of the underlying latent feature space.
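The repulsive force can be sketched in the same spirit as the MMD terms. The snippet below sums the squared distances between sub-domain means with different labels over the three directions (source-to-target, target-to-source, source-to-source); it is a hedged NumPy illustration of the quantity in Eq. (4) on projected samples, not the authors' implementation:

```python
import numpy as np

def repulsive_force(Zs, ys, Zt, yt):
    """Sum of squared distances between sub-domain means with *different*
    labels, over the S->T, T->S and S->S directions (rows = samples;
    yt holds target pseudo labels)."""
    classes = np.unique(ys)
    means_s = {c: Zs[ys == c].mean(axis=0) for c in classes}
    means_t = {c: Zt[yt == c].mean(axis=0) for c in classes}
    total = 0.0
    for c in classes:
        for e in classes:
            if e == c:
                continue
            total += np.sum((means_s[c] - means_t[e]) ** 2)  # S -> T
            total += np.sum((means_t[c] - means_s[e]) ** 2)  # T -> S
            total += np.sum((means_s[c] - means_s[e]) ** 2)  # S -> S
    return total
```

Maximizing this quantity while minimizing the marginal and conditional MMD terms pushes sub-domains with different labels apart in the learned subspace.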
3.2.4 Data geometric structure alignment
Both source and target data can be embedded into a manifold with complex geometric structure. While the model developed in the previous subsections brings closer the data distributions while repulsing inter-class data, it does not explicitly account for the inherent hidden data geometric structure. We tackle this problem so that our DA model enables data geometric structure alignment between source and target, thereby meeting property P3. For this purpose, we propose to use the source data to linearly reconstruct the target data in a common latent subspace by learning a reconstruction coefficient matrix $\mathbf{Z}$. Noting $\mathbf{A}$ the projection transformation, the reconstruction problem can be formulated as: $\mathbf{A}^\top \mathbf{X}_T = \mathbf{A}^\top \mathbf{X}_S \mathbf{Z}$.
To further align the data geometric structures between source and target, we introduce into our model two additional constraints, namely a locality-aware and a sparse representation constraint. The locality constraint is already widely explored in manifold learning [15, 16, 18, 11]. It ensures that target data in a neighborhood are only reconstructed from neighboring source data, and thereby intuitively preserves and aligns the source and target data geometric structures. This locality constraint is achieved by enforcing the reconstruction matrix $\mathbf{Z}$ to be low rank with a block-wise structure. As a result, the locality-aware reconstruction problem is now defined as: $\min_{\mathbf{Z}} \operatorname{rank}(\mathbf{Z})$ s.t. $\mathbf{A}^\top \mathbf{X}_T = \mathbf{A}^\top \mathbf{X}_S \mathbf{Z}$. However, rank minimization is non-convex and difficult to solve. Fortunately, [7] points out that the rank constraint can be relaxed to its convex surrogate, the nuclear norm, and the problem reformulated as: $\min_{\mathbf{Z}} \|\mathbf{Z}\|_*$ s.t. $\mathbf{A}^\top \mathbf{X}_T = \mathbf{A}^\top \mathbf{X}_S \mathbf{Z}$.
The sparse representation constraint further ensures the alignment of the data geometric structures between source and target by enforcing that each target datum is only sparsely reconstructed from a few meaningful source data, so that source and target data are locally interleaved in the searched subspace. The reconstruction problem can thus be further formulated as: $\min_{\mathbf{Z}, \mathbf{E}} \|\mathbf{Z}\|_* + \lambda_1 \|\mathbf{Z}\|_1$ s.t. $\mathbf{A}^\top \mathbf{X}_T = \mathbf{A}^\top \mathbf{X}_S \mathbf{Z} + \mathbf{E}$, with $\mathbf{E}$ the column-wise error matrix.
To account for a few data outliers and meet property P4, we simply enforce the column-wise error matrix $\mathbf{E}$ to be sparse, which provides robustness of the proposed DA method to noisy data and alleviates the influence of negative transfer. As a result, the objective function of our DA method for the alignment of data geometric structures between source and target and robustness to data outliers is defined as follows:

$$\min_{\mathbf{Z}, \mathbf{E}} \|\mathbf{Z}\|_* + \lambda_1 \|\mathbf{Z}\|_1 + \lambda_2 \|\mathbf{E}\|_{2,1} \quad \text{s.t.} \quad \mathbf{A}^\top \mathbf{X}_T = \mathbf{A}^\top \mathbf{X}_S \mathbf{Z} + \mathbf{E} \quad (7)$$
3.2.5 Final energy function

The final energy function unifies the previous terms into a single objective, Eq. (8), which jointly minimizes the marginal and conditional distribution distances of Eq. (3) and the structure alignment and robustness term of Eq. (7), while maximizing the repulsive force of Eq. (6).
3.3 Optimization
We solve Eq. (8) in two main steps. First, the Rayleigh quotient algorithm is applied to calculate an initial projection matrix $\mathbf{A}$. Then, Eq. (8) is iteratively optimized via the inexact augmented Lagrange multiplier (ALM) method as in Eq. (9):

(9)
To solve the problem defined in Eq. (8) efficiently, we calculate the initial MMD matrix via the Rayleigh quotient algorithm [10, 11], as shown in Algorithm 1(a). The process for solving the projection matrix $\mathbf{A}$ is shown in Algorithm 1(b). Steps 1 and 3 of Algorithm 1(b) are derived through analytical calculation. $\mathbf{A}$ is updated at each iteration of Algorithm 1(b) via the same process as Step 3 of Algorithm 1(a). Steps 4, 5 and 6 of Algorithm 1(b) are calculated according to [18, 8, 7]. Detailed proofs are omitted here due to space limitations and will be provided online as supplementary material.
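The inner updates of such inexact ALM schemes rely on two standard proximal operators: element-wise soft thresholding for the l1 term and singular value thresholding (SVT) [7] for the nuclear norm term. A minimal NumPy sketch of these two building blocks (illustrative, not the authors' code):

```python
import numpy as np

def soft_threshold(M, tau):
    """Proximal operator of the l1 norm: element-wise shrinkage,
    used for the sparse update of Z (and E) in each ALM iteration."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def svt(M, tau):
    """Singular value thresholding: proximal operator of the nuclear
    norm, used for the low-rank update of Z in each ALM iteration."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
```

Each ALM iteration applies these operators to intermediate matrices and then updates the Lagrange multipliers and the penalty parameter, as in the standard inexact ALM recipe of [7, 8].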
4 Experiments
In this section, we verify the effectiveness of our proposed domain adaptation model, i.e., RSACDDA, on 16 cross-domain image classification tasks.
4.1 Benchmarks, Baseline Methods and Experimental setup
In DA, USPS+MNIST, COIL20 and Office+Caltech are standard benchmarks for evaluation and comparison with the state of the art. In this paper, we follow the data preparation of most previous works and construct 16 cross-domain image classification tasks from: (1) the USPS and MNIST digit datasets, which have different data distributions; we build the cross-domain tasks USPS vs MNIST and MNIST vs USPS; (2) the COIL20 dataset with 20 classes, split into COIL1 vs COIL2 and COIL2 vs COIL1; (3) Office and Caltech-256. Office contains three real-world datasets: Amazon (images downloaded from online merchants), Webcam (low-resolution images) and DSLR (high-resolution images taken by a digital SLR camera). Caltech-256 is a standard dataset for object recognition which contains 30,607 images in 256 categories. We denote the datasets Amazon, Webcam, DSLR and Caltech-256 as A, W, D and C, respectively. Twelve domain adaptation tasks can then be constructed, namely C→A, C→W, C→D, A→C, A→W, A→D, W→C, W→A, W→D, D→C, D→A and D→W.
The proposed RSACDDA is compared with nine methods from the literature, excluding only CNN-based works, given that we are not using deep features: (1) 1-Nearest Neighbor classifier (NN); (2) Principal Component Analysis (PCA) + NN; (3) Geodesic Flow Kernel (GFK) [4] + NN; (4) Transfer Component Analysis (TCA) [13] + NN; (5) Transfer Subspace Learning (TSL) [17] + NN; (6) Joint Domain Adaptation (JDA) [10] + NN; (7) Close yet Discriminative Domain Adaptation (CDDA) [11] + NN; (8) Low-rank and Sparse Representation (LRSR) [18] + NN; (9) Low-rank Transfer Subspace Learning (LTSL) [16] + NN. Note that TCA and TSL can be viewed as special cases of JDA, and JDA as a special case of CDDA when the repulsive force term is ignored. All reported performance scores of the baseline methods are directly collected from the authors' publications; partial experimental results are quoted from CDDA and LRSR and are assumed to be their best performance.
In terms of experimental setup, it is not possible to tune an optimal set of hyper-parameters, given that the target domain has no labeled data. Following the setting of CDDA, LRSR and LTSL, we evaluate the proposed RSACDDA by empirically searching the parameter space for the optimal settings. Specifically, Algorithm 1(a) has two hyper-parameters, i.e., the subspace dimension and a regularization parameter, set empirically with one setting for USPS, MNIST and COIL20 and another for Office and Caltech-256. Algorithm 1(b) has three hyper-parameters, i.e., the subspace dimension and two regularization parameters; we set the subspace dimension as in Algorithm 1(a) and the remaining parameters similarly to those of LRSR.
4.2 Experimental Results and Discussion
Table 1: Classification accuracy (%) on the 16 cross-domain image classification tasks.

Datasets  NN  PCA  GFK  TCA  TSL  JDA  CDDA  LRSR  LTSL  RSACDDA 

USPS vs MNIST  44.70  44.95  46.45  51.05  53.75  59.65  62.05  52.33  63.20  
MNIST vs USPS  65.94  66.22  67.22  56.28  66.06  67.28  76.22  58.55  77.50  
COIL1 vs COIL2  83.61  84.72  72.50  88.47  88.06  89.31  91.53  88.61  75.69  95.42 
COIL2 vs COIL1  82.78  84.03  74.17  85.83  87.92  88.47  93.89  89.17  72.22  95.28 
C A  23.70  36.95  41.02  38.20  44.47  44.78  48.33  51.25  25.26  45.30 
C W  25.76  32.54  40.68  38.64  34.24  41.69  44.75  38.64  19.32  41.69 
C D  25.48  38.22  38.85  41.40  43.31  45.22  48.41  47.13  21.02  49.04 
A C  26.00  34.73  40.25  37.76  37.58  39.36  42.12  43.37  16.92  39.09 
A W  29.83  35.59  38.98  37.63  33.90  37.97  41.69  36.61  14.58  43.39 
A D  25.48  27.39  36.31  33.12  26.11  39.49  37.58  38.85  21.02  39.49 
W C  19.86  26.36  30.72  29.30  29.83  31.17  31.97  29.83  34.64  32.95 
W A  22.96  31.00  29.75  30.06  30.27  32.78  37.27  34.13  39.56  35.28 
W D  59.24  77.07  80.89  87.26  87.26  89.17  87.90  82.80  72.61  94.90 
D C  26.27  29.65  30.28  31.70  28.50  31.52  34.64  31.61  35.08  33.66 
D A  28.50  32.05  32.05  32.15  27.56  33.09  33.51  33.19  39.67  36.01 
D W  63.39  75.93  75.59  86.10  85.42  89.49  90.51  77.29  74.92  90.17 
Average (USPS)  55.32  55.59  56.84  53.67  59.90  63.47  69.14  55.44  70.35  
Average (COIL)  83.20  84.38  73.34  87.15  87.99  88.89  92.71  88.89  73.96  95.35 
Average (Amazon)  31.37  39.79  42.95  43.61  42.37  46.31  48.22  45.39  34.55  48.41 
Overall Average  40.84  47.34  48.48  50.31  50.27  53.78  56.40  52.09  57.02 
The classification accuracies of the proposed method and the nine baseline methods are shown in Table 1, with the highest accuracy for each cross-domain adaptation task highlighted in bold. For fair comparison, all methods listed in Table 1 use the nearest neighbor classifier. As shown in Table 1, the proposed method achieves the best overall average accuracy and thus outperforms the nine baseline algorithms. The proposed method ranks first in accuracy on 9 of the 16 cross-domain tasks, and achieves the best average accuracy on each of the three groups of datasets as well as the best overall average, thereby demonstrating its effectiveness. It is worth noting that the proposed method reaches an average accuracy of 95.35% on COIL20, a rather unexpected and impressive score given the unsupervised nature of the domain adaptation for the target domain.
5 Conclusion and Future Work
In this paper, we have proposed a novel DA method, namely RSACDDA, which brings closer both the marginal and class-conditional data distributions between source and target and aligns the inherent hidden source and target data geometric structures, while achieving discriminative DA by repulsing inter-class source and target data. Comprehensive experiments on 16 cross-domain datasets for image classification verify the effectiveness of the proposed method in comparison with nine baseline methods from the literature. Our future work will concentrate on embedding the proposed method in deep networks and on studying other vision tasks, e.g., object detection, within the setting of transfer learning.
References
 [1] Karsten M. Borgwardt, Arthur Gretton, Malte J. Rasch, Hans-Peter Kriegel, Bernhard Schölkopf, and Alex J. Smola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49–e57, 2006.
 [2] Karsten M. Borgwardt, Arthur Gretton, Malte J. Rasch, Hans-Peter Kriegel, Bernhard Schölkopf, and Alexander J. Smola. Integrating structured biological data by kernel maximum mean discrepancy. In Proceedings of the 14th International Conference on Intelligent Systems for Molecular Biology, Fortaleza, Brazil, August 6-10, 2006, pages 49–57, 2006.
 [3] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 513–520, 2011.

 [4] Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman. Geodesic flow kernel for unsupervised domain adaptation. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2066–2073. IEEE, 2012.
 [5] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander J. Smola. A kernel method for the two-sample problem. In Advances in Neural Information Processing Systems 19, Vancouver, British Columbia, Canada, December 4-7, 2006, pages 513–520, 2006.
 [6] I-Hong Jhuo, Dong Liu, D. T. Lee, and Shih-Fu Chang. Robust visual domain adaptation with low-rank reconstruction. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June 16-21, 2012, pages 2168–2175, 2012.
 [7] Zhouchen Lin, Minming Chen, and Yi Ma. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. CoRR, abs/1009.5055, 2010.
 [8] Guangcan Liu, Zhouchen Lin, Shuicheng Yan, Ju Sun, Yong Yu, and Yi Ma. Robust recovery of subspace structures by low-rank representation. IEEE Trans. Pattern Anal. Mach. Intell., 35(1):171–184, 2013.
 [9] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I Jordan. Learning transferable features with deep adaptation networks. In ICML, pages 97–105, 2015.

 [10] Mingsheng Long, Jianmin Wang, Guiguang Ding, Jiaguang Sun, and Philip S. Yu. Transfer feature learning with joint distribution adaptation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2200–2207, 2013.
 [11] Lingkun Luo, Xiaofang Wang, Shiqiang Hu, Chao Wang, Yuxing Tang, and Liming Chen. Close yet distinctive domain adaptation. CoRR, abs/1704.04235, 2017.
 [12] Sinno Jialin Pan, James T Kwok, and Qiang Yang. Transfer learning via dimensionality reduction. In AAAI, volume 8, pages 677–682, 2008.

 [13] Sinno Jialin Pan, Ivor W. Tsang, James T. Kwok, and Qiang Yang. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22(2):199–210, 2011.
 [14] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
 [15] V. M. Patel, R. Gopalan, R. Li, and R. Chellappa. Visual domain adaptation: A survey of recent advances. IEEE Signal Processing Magazine, 32(3):53–69, May 2015.
 [16] Ming Shao, Dmitry Kit, and Yun Fu. Generalized transfer subspace learning through lowrank constraint. International Journal of Computer Vision, 109(12):74–93, 2014.
 [17] S. Si, D. Tao, and B. Geng. Bregman divergencebased regularization for transfer subspace learning. IEEE Transactions on Knowledge and Data Engineering, 22(7):929–942, July 2010.
 [18] Yong Xu, Xiaozhao Fang, Jian Wu, Xuelong Li, and David Zhang. Discriminative transfer subspace learning via low-rank and sparse representation. IEEE Trans. Image Processing, 25(2):850–863, 2016.
 [19] Lei Zhang, Wangmeng Zuo, and David Zhang. LSDT: latent sparse domain transfer learning for visual adaptation. IEEE Trans. Image Processing, 25(3):1177–1191, 2016.