Large volumes of unlabeled data are available online, owing to the exponential increase in the number of images and videos uploaded. It would be easy to obtain labeled data if trained classifiers could predict the labels for unlabeled data. However, classifiers do not perform well when applied to unlabeled data drawn from a different distribution, owing to domain shift [Torralba and Efros2011]. Domain adaptation deals with adapting classifiers trained on data from a source distribution to work effectively on data from a target distribution [Pan and Yang2010]. Some domain adaptation techniques assume the presence of a few labels for the target data to assist in training a domain-adaptive classifier [Aytar and Zisserman2011, Duan, Tsang, and Xu2012, Hoffman et al.2013]. However, real-world applications often provide no labeled data in the target domain; adaptation in this setting is termed unsupervised domain adaptation.
Many unsupervised domain adaptation techniques can be organized into linear and nonlinear procedures, based on how the data is handled by the domain adaptation model. A linear domain adaptation model either performs linear transformations on the data to align the source and target domains, or trains an adaptive linear classifier for both domains, for example a linear SVM [Bruzzone and Marconcini2010]. Nonlinear techniques are deployed in situations where the source and target domains cannot be aligned using linear transformations. These techniques apply nonlinear transformations to the source and target data in order to align them. For example, Maximum Mean Discrepancy (MMD) is applied to learn nonlinear representations in which the difference between the source and target distributions is minimized [Pan et al.2011]. Even though nonlinear transformations may align the domains, the resulting data may not be conducive to classification. If, after domain alignment, the data were also clustered based on similarity, classification could become more effective.
We demonstrate this intuition through a binary classification problem using a toy dataset. Figure (1a) displays the source and target domains of a two-moon dataset. Figure (1b) depicts the transformed data after Kernel-PCA (KPCA), a nonlinear projection. In trying to project the data onto a common ‘subspace’, the source data gets dispersed. Figure (1c) presents the data after domain alignment using MMD. Although the domains are now aligned, this does not necessarily ensure enhanced classification. Figure (1d) shows the data after MMD and similarity-based embedding, where data is clustered based on class label similarity. Cross-domain alignment along with similarity-based embedding makes the data classification friendly.
In this work, we present the Nonlinear Embedding Transform (NET) procedure for unsupervised domain adaptation. The NET performs a nonlinear transformation to align the source and target domains and also clusters the data based on label similarity. The NET algorithm is a spectral (eigen) technique that requires certain parameters (such as the number of eigen bases) to be pre-determined. These parameters are often given arbitrary values, which need not be optimal [Pan et al.2011, Long et al.2013, Long et al.2014]. In this work, we also outline a validation procedure to fine-tune model parameters with a validation set created from the source data. The two main contributions of our work are:
- Nonlinear Embedding Transform (NET) algorithm for unsupervised domain adaptation.
- Validation procedure to estimate optimal parameters for an unsupervised domain adaptation algorithm.
We evaluate the validation procedure and the NET algorithm using 7 popular domain adaptation image datasets, including object, face, facial expression and digit recognition datasets. We conduct 50 different domain adaptation experiments to compare the proposed techniques with existing competitive procedures for unsupervised domain adaptation.
For the purpose of this paper, we discuss the relevant literature under two categories: linear domain adaptation methods and nonlinear domain adaptation methods. A detailed survey of transfer learning procedures can be found in [Pan and Yang2010]. A survey of domain adaptation techniques for vision data is provided by [Patel et al.2015].
The Domain Adaptive SVM (DASVM) [Bruzzone and Marconcini2010] is an unsupervised method that iteratively adapts a linear SVM from the source to the target. In recent years, the popular unsupervised linear domain adaptation procedures have been Subspace Alignment (SA) [Fernando et al.2013] and Correlation Alignment (CA) [Sun, Feng, and Saenko2015]. The SA algorithm determines a linear transformation to project the source and target to a common subspace, where the domain disparity is minimized. The CA technique argues that aligning the correlation matrices of the source and target data is sufficient to reduce domain disparity. Both SA and CA are linear procedures, whereas the NET is a nonlinear method.
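To make the idea of a linear alignment concrete, the following is a minimal NumPy sketch of a subspace-alignment style transform. It is an illustration under stated assumptions, not the reference implementation of [Fernando et al.2013]: rows are samples, the bases come from plain PCA, and the function names `pca_basis` and `subspace_alignment` are chosen here for illustration.

```python
import numpy as np

def pca_basis(X, k):
    """Return the top-k principal directions of X (rows are samples) as a d x k matrix."""
    Xc = X - X.mean(axis=0)                              # center the features
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)    # rows of Vt are orthonormal directions
    return Vt[:k].T

def subspace_alignment(Xs, Xt, k):
    """Align the source PCA basis to the target PCA basis and project both domains."""
    Ps, Pt = pca_basis(Xs, k), pca_basis(Xt, k)
    M = Ps.T @ Pt                # k x k matrix that maps the source basis onto the target basis
    Zs = Xs @ Ps @ M             # source data in the aligned subspace
    Zt = Xt @ Pt                 # target data in its own subspace
    return Zs, Zt
```

When the two domains share the same basis, the alignment matrix reduces to the identity, so the transform only changes the source representation to the extent that the two subspaces differ.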
Although deep learning procedures are inherently highly nonlinear, we limit the scope of our work to nonlinear transformations of data that usually involve a positive semi-definite kernel function. Such procedures are closely related to the NET. However, in our experiments, we also study the NET with deep features. The Geodesic Flow Kernel (GFK) [Gong et al.2012] is a popular domain adaptation method, where the subspace spanning the source data is gradually transformed into the target subspace along a path on the Grassmann manifold of subspaces. Spectral procedures like Transfer Component Analysis (TCA) [Pan et al.2011], Joint Distribution Adaptation (JDA) [Long et al.2013] and Transfer Joint Matching (TJM) [Long et al.2014] are the techniques most closely related to the NET. All of these procedures involve a solution to a generalized eigen-value problem in order to determine a projection matrix to nonlinearly align the source and target data. In these spectral methods, domain alignment is implemented using variants of MMD, which was first introduced in the TCA procedure. JDA introduces joint distribution alignment, an improvement over TCA, which only incorporates marginal distribution alignment. The TJM performs domain alignment along with instance selection by sampling only relevant source data points. In addition to domain alignment with MMD, the NET algorithm implements similarity-based embedding for enhanced classification. We also introduce a validation procedure to estimate the model parameters for unsupervised domain adaptation approaches.
Domain Adaptation With Nonlinear Embedding
In this section, we first outline the NET algorithm for unsupervised domain adaptation. We then describe a cross-validation procedure that is used to estimate the model parameters for the NET algorithm.
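We begin with the problem definition, where we consider two domains: a source domain $\mathcal{S}$ and a target domain $\mathcal{T}$. Let $\mathcal{D}_s = \{(x_i^s, y_i^s)\}_{i=1}^{n_s} \subset \mathcal{S}$ be a subset of the source domain and $\mathcal{D}_t = \{(x_i^t, y_i^t)\}_{i=1}^{n_t} \subset \mathcal{T}$ be a subset of the target domain. Let $X_S = [x_1^s, \ldots, x_{n_s}^s] \in \mathbb{R}^{d \times n_s}$ and $X_T = [x_1^t, \ldots, x_{n_t}^t] \in \mathbb{R}^{d \times n_t}$ be the source and target data points respectively. Let $Y_S = [y_1^s, \ldots, y_{n_s}^s]$ and $Y_T = [y_1^t, \ldots, y_{n_t}^t]$ be the source and target labels respectively. Here, $x_i^s$ and $x_i^t \in \mathbb{R}^d$ are data points and $y_i^s$ and $y_i^t \in \{1, \ldots, C\}$ are the associated labels. We define $X := [x_1, \ldots, x_n] = [X_S, X_T]$, where $n = n_s + n_t$. The problem of domain adaptation deals with the situation where the joint distributions for the source and target domains are different, i.e. $P_S(X, Y) \neq P_T(X, Y)$, where $X$ and $Y$ denote random variables for data points and labels respectively. In the case of unsupervised domain adaptation, the labels $Y_T$ are unknown. The goal of unsupervised domain adaptation is to estimate the labels of the target data, $\hat{Y}_T = [\hat{y}_1^t, \ldots, \hat{y}_{n_t}^t]$, corresponding to $X_T$, using $\mathcal{D}_s$ and $X_T$.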
Nonlinear Domain Alignment
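A common procedure to align two datasets is to first project them to a common subspace. Kernel-PCA (KPCA) estimates a nonlinear basis for such a projection. In this case, data is internally mapped to a high-dimensional (possibly infinite-dimensional) space defined by $\Phi(X) = [\phi(x_1), \ldots, \phi(x_n)]$. Here, $\phi: \mathbb{R}^d \to \mathcal{H}$ is the mapping function and $\mathcal{H}$ is a RKHS (Reproducing Kernel Hilbert Space). The dot product between the mapped vectors $\phi(x)$ and $\phi(y)$ is estimated by a positive semi-definite (psd) kernel, $k(x, y) = \phi(x)^\top \phi(y)$. The dot product captures the similarity between $x$ and $y$. The kernel similarity (gram) matrix consisting of similarities between all the data points in $X$ is given by $K = \Phi(X)^\top \Phi(X) \in \mathbb{R}^{n \times n}$. The matrix $K$ is used to determine the projection matrix $A$ by solving,

$$\max_{A^\top A = I} \; \mathrm{tr}(A^\top K H K^\top A). \tag{1}$$

Here, $H$ is the $n \times n$ centering matrix given by $H = I - \frac{1}{n}\mathbf{1}$, where $I$ is an $n \times n$ identity matrix and $\mathbf{1}$ is an $n \times n$ matrix of 1s. $A \in \mathbb{R}^{n \times k}$ is the matrix of coefficients and the nonlinear projected data is given by $Z = [z_1, \ldots, z_n] = A^\top K \in \mathbb{R}^{k \times n}$. Along with projecting the source and target data to a common subspace, the domain disparity between the two datasets must also be reduced. We employ the Maximum Mean Discrepancy (MMD) [Gretton et al.2009], which is a standard nonparametric measure to estimate domain disparity. We adopt the Joint Distribution Adaptation (JDA) [Long et al.2013] algorithm, which seeks to align both the marginal and conditional probability distributions of the projected data. The marginal distributions are aligned by estimating the coefficient matrix $A$ which minimizes:

$$\min_A \; \Big\| \frac{1}{n_s} \sum_{i=1}^{n_s} A^\top k_i - \frac{1}{n_t} \sum_{j=n_s+1}^{n} A^\top k_j \Big\|^2 = \min_A \; \mathrm{tr}(A^\top K M_0 K^\top A), \tag{2}$$

where $k_i$ denotes the $i$-th column of $K$. $M_0$ is the MMD matrix, which is given by,

$$(M_0)_{ij} = \begin{cases} \frac{1}{n_s n_s}, & x_i, x_j \in \mathcal{D}_s \\ \frac{1}{n_t n_t}, & x_i, x_j \in \mathcal{D}_t \\ \frac{-1}{n_s n_t}, & \text{otherwise.} \end{cases} \tag{3}$$

Likewise, the conditional distribution difference can also be minimized by introducing matrices $M_c$, with $c = 1, \ldots, C$, defined as,

$$(M_c)_{ij} = \begin{cases} \frac{1}{n_s^{(c)} n_s^{(c)}}, & x_i, x_j \in \mathcal{D}_s^{(c)} \\ \frac{1}{n_t^{(c)} n_t^{(c)}}, & x_i, x_j \in \mathcal{D}_t^{(c)} \\ \frac{-1}{n_s^{(c)} n_t^{(c)}}, & x_i \in \mathcal{D}_s^{(c)}, x_j \in \mathcal{D}_t^{(c)} \text{ or } x_j \in \mathcal{D}_s^{(c)}, x_i \in \mathcal{D}_t^{(c)} \\ 0, & \text{otherwise.} \end{cases} \tag{4}$$

Here, $\mathcal{D}_s$ and $\mathcal{D}_t$ are the sets of source and target data points respectively. $\mathcal{D}_s^{(c)}$ is the subset of source data points with class label $c$ and $n_s^{(c)} = |\mathcal{D}_s^{(c)}|$. Similarly, $\mathcal{D}_t^{(c)}$ is the subset of target data points with class label $c$ and $n_t^{(c)} = |\mathcal{D}_t^{(c)}|$. Since the target labels are unknown, we use predicted target labels to determine $\mathcal{D}_t^{(c)}$. We initialize the target labels using a classifier trained on the source data and refine the labels over iterations. Combining both the marginal and conditional distribution terms leads us to the JDA model, which is given by,

$$\min_A \; \sum_{c=0}^{C} \mathrm{tr}(A^\top K M_c K^\top A). \tag{5}$$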
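The marginal and class-conditional MMD matrices described above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: data points are ordered with the source first and the target second, the names `mmd_matrix` and `joint_mmd_matrix` are chosen here for illustration, and a class with no predicted target points keeps only its source term (a convention assumed for this sketch).

```python
import numpy as np

def mmd_matrix(ns, nt, src_idx=None, tgt_idx=None):
    """MMD coefficient matrix over n = ns + nt points (source first, then target).

    With no index arguments this is the marginal matrix. Passing the positions of
    the class-c source and target points gives the class-conditional matrix.
    """
    n = ns + nt
    src_idx = np.arange(ns) if src_idx is None else np.asarray(src_idx)
    tgt_idx = np.arange(ns, n) if tgt_idx is None else np.asarray(tgt_idx)
    e = np.zeros(n)
    if len(src_idx) > 0:
        e[src_idx] = 1.0 / len(src_idx)       # +1/ns on the selected source points
    if len(tgt_idx) > 0:
        e[tgt_idx] = -1.0 / len(tgt_idx)      # -1/nt on the selected target points
    return np.outer(e, e)                     # entries 1/ns^2, 1/nt^2, -1/(ns*nt), else 0

def joint_mmd_matrix(ys, yt_pred):
    """Sum of the marginal matrix and one conditional matrix per source class."""
    ys, yt_pred = np.asarray(ys), np.asarray(yt_pred)
    ns, nt = len(ys), len(yt_pred)
    M = mmd_matrix(ns, nt)
    for c in np.unique(ys):
        src = np.flatnonzero(ys == c)
        tgt = ns + np.flatnonzero(yt_pred == c)   # shift target positions past the source block
        M += mmd_matrix(ns, nt, src, tgt)
    return M
```

Writing each matrix as an outer product makes the link to the mean difference explicit: for projected data with one column per point, the trace term equals the squared distance between the source mean and the target mean.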
Similarity Based Embedding
In addition to domain alignment, the NET algorithm ensures that the projected data $Z$ is classification friendly (easily classifiable). To this end, we introduce Laplacian eigenmaps in order to cluster data points based on class label similarity. The adjacency matrix $W \in \mathbb{R}^{n \times n}$ captures the similarity relationships between data points, where,

$$W_{ij} = \begin{cases} 1, & y_i = y_j \text{ or } i = j \\ 0, & \text{otherwise.} \end{cases} \tag{6}$$

To ensure that the projected data is clustered based on data similarity, we minimize the sum of squared distances between data points weighted by the adjacency matrix. This can be expressed as a minimization problem,

$$\min_Z \; \frac{1}{2} \sum_{ij} \Big\| \frac{z_i}{\sqrt{d_i}} - \frac{z_j}{\sqrt{d_j}} \Big\|^2 W_{ij} = \min_A \; \mathrm{tr}(A^\top K L K^\top A). \tag{7}$$

Here, $d_i = \sum_k W_{ik}$ and $d_j = \sum_k W_{jk}$. They form the diagonal entries of $D$, the diagonal matrix. $\big\| z_i/\sqrt{d_i} - z_j/\sqrt{d_j} \big\|^2$ is the squared normalized distance between the projected data points $z_i$ and $z_j$, which get clustered together when $W_{ij} = 1$ (as they belong to the same category). The normalized distance is a more robust clustering measure as compared to the standard Euclidean distance $\| z_i - z_j \|^2$ [Chung1997]. Substituting $Z = A^\top K$ yields the trace term, where $L$ denotes the symmetric positive semi-definite graph Laplacian matrix with $L := I - D^{-1/2} W D^{-1/2}$, and $I$ is an identity matrix.
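The label-based adjacency matrix and the normalized graph Laplacian described above can be sketched as follows. This is a minimal NumPy illustration, assuming one label per data point (for target points these would be predicted labels); the function name `label_laplacian` is chosen here and does not come from the paper.

```python
import numpy as np

def label_laplacian(labels):
    """Return (W, D, L): same-label adjacency, degree matrix and normalized Laplacian."""
    y = np.asarray(labels)
    W = (y[:, None] == y[None, :]).astype(float)   # 1 when labels match, so the diagonal is 1
    d = W.sum(axis=1)                              # degrees, always at least 1
    D = np.diag(d)
    inv_sqrt = 1.0 / np.sqrt(d)
    # L = I - D^{-1/2} W D^{-1/2}, computed entry-wise
    L = np.eye(len(y)) - inv_sqrt[:, None] * W * inv_sqrt[None, :]
    return W, D, L
```

Because every point is adjacent to itself, no degree is zero and the normalization is always well defined.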
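To arrive at the optimization problem, we consider the nonlinear projection in Equation (1), the joint distribution alignment in Equation (5) and the similarity-based embedding in Equation (7). Maximizing Equation (1) and minimizing Equations (5) and (7) can also be achieved by maintaining Equation (1) constant and minimizing Equations (5) and (7). Minimizing the similarity embedding in Equation (7) can result in the projected vectors being embedded in a low-dimensional subspace. In order to maintain the subspace dimensionality, we introduce a new constraint in place of Equation (1). The optimization problem for NET is obtained by minimizing Equations (5) and (7). The goal is to estimate the $n \times k$ projection matrix $A$. Along with regularization and the dimensionality constraint, we get,

$$\min_{A^\top K D K^\top A = I} \; \alpha \, \mathrm{tr}\Big(A^\top K \sum_{c=0}^{C} M_c K^\top A\Big) + \beta \, \mathrm{tr}(A^\top K L K^\top A) + \gamma \|A\|_F^2. \tag{8}$$

The first term controls the domain alignment and is weighted by $\alpha$. The second term ensures similarity-based embedding and is weighted by $\beta$. The third term is the regularization (Frobenius norm) that ensures a smooth projection matrix $A$, and it is weighted by $\gamma$. The constraint on $A^\top K D K^\top A$ (in place of $A^\top K H K^\top A$) prevents the projection from collapsing onto a subspace with dimensionality less than $k$ [Belkin and Niyogi2003]. We solve Equation (8) by forming the Lagrangian,

$$\mathcal{L}(A, \Lambda) = \alpha \, \mathrm{tr}\Big(A^\top K \sum_{c=0}^{C} M_c K^\top A\Big) + \beta \, \mathrm{tr}(A^\top K L K^\top A) + \gamma \|A\|_F^2 + \mathrm{tr}\big((I - A^\top K D K^\top A) \Lambda\big), \tag{9}$$

where the Lagrangian constants are represented by the diagonal matrix $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_k)$. Setting the derivative $\partial \mathcal{L} / \partial A = 0$ yields the generalized eigen-value problem,

$$\Big(\alpha K \sum_{c=0}^{C} M_c K^\top + \beta K L K^\top + \gamma I\Big) A = K D K^\top A \Lambda. \tag{10}$$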
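One way to solve a generalized eigen-value problem of this form numerically is sketched below with NumPy only. Several choices here are assumptions made for the sketch rather than details from the paper: the argument names, the small `ridge` added to the right-hand matrix to keep it positive definite, the Cholesky-based reduction to a standard symmetric problem, and keeping the eigenvectors with the `k` smallest eigenvalues (the usual choice for a minimization).

```python
import numpy as np

def net_projection(K, M, L, D, alpha, beta, gamma, k, ridge=1e-8):
    """Solve (alpha*K M K^T + beta*K L K^T + gamma*I) A = (K D K^T) A Lambda.

    Returns the n x k matrix A of eigenvectors with the k smallest eigenvalues,
    together with those eigenvalues in ascending order.
    """
    n = K.shape[0]
    P = alpha * K @ M @ K.T + beta * K @ L @ K.T + gamma * np.eye(n)
    Q = K @ D @ K.T + ridge * np.eye(n)        # ridge is an assumption of this sketch
    P, Q = (P + P.T) / 2, (Q + Q.T) / 2        # remove round-off asymmetry
    R = np.linalg.cholesky(Q)                  # Q = R R^T, R lower triangular
    S = np.linalg.solve(R, np.linalg.solve(R, P).T)   # S = R^{-1} P R^{-T}
    vals, Y = np.linalg.eigh((S + S.T) / 2)    # eigenvalues in ascending order
    A = np.linalg.solve(R.T, Y[:, :k])         # map back: A = R^{-T} Y
    return A, vals[:k]
```

The projected data would then be obtained as `A.T @ K`, with one column per data point. Since the conditional MMD matrices depend on predicted target labels, such a solve would sit inside the label-refinement loop described earlier.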
In unsupervised domain adaptation, the target labels are treated as unknown. Current domain adaptation methods that need to validate the optimum parameters for their models inherently assume the availability of target labels [Long et al.2013], [Long et al.2014]. However, in the case of real-world applications, when target labels are not available, it is difficult to verify whether the model parameters are optimal. In the case of the NET model, we have 4 parameters, $(\alpha, \beta, \gamma, k)$, that we want to pre-determine. We introduce a technique using Kernel Mean Matching (KMM) to sample the source data to create a validation set. KMM has been used to weight source data points in order to reduce the distribution difference between the source and target data [Fernando et al.2013], [Gong, Grauman, and Sha2013]. Source data points with large weights have a similar marginal distribution to the target data. These data points are chosen to form the validation set. The KMM estimates the weights $w_i$, $i = 1, \ldots, n_s$, by minimizing $\big\| \frac{1}{n_s} \sum_{i=1}^{n_s} w_i \phi(x_i^s) - \frac{1}{n_t} \sum_{j=1}^{n_t} \phi(x_j^t) \big\|_{\mathcal{H}}^2$. In order to simplify, we define $\kappa_i := \frac{n_s}{n_t} \sum_{j=1}^{n_t} k(x_i^s, x_j^t)$, $i = 1, \ldots, n_s$, and $(K_S)_{ij} = k(x_i^s, x_j^s)$. The minimization is then represented as a quadratic programming problem,

$$\min_{w} \; \frac{1}{2} w^\top K_S w - \kappa^\top w, \quad \text{s.t.} \;\; w_i \in [0, B], \;\; \Big| \sum_{i=1}^{n_s} w_i - n_s \Big| \le n_s \epsilon. \tag{11}$$

The first constraint limits the scope of discrepancy between source and target distributions, with $B \to 1$ leading to an unweighted solution. The second constraint ensures the measure $w(x) P_S(x)$ is a probability distribution [Gretton et al.2009]. In our experiments, we select 10% of the source data with the largest weights to create the validation set. We fine-tune the values of $(\alpha, \beta, \gamma, k)$ using the validation set. For fixed values of $(\alpha, \beta, \gamma, k)$, the NET model is trained using the source data (without the validation set) and target data. The model is tested on the validation set to estimate the parameters yielding the highest classification accuracies.
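The two bookkeeping steps around the quadratic program can be sketched as follows: building its linear term from a source-by-target kernel matrix, and splitting the source data once the weights are available. This is a minimal sketch; the quadratic program itself is assumed to be solved by an external QP solver that is not shown, and the names `kmm_linear_term` and `split_by_weight` are chosen here for illustration.

```python
import numpy as np

def kmm_linear_term(K_cross):
    """Linear term of the KMM program: (ns / nt) * row sums of the ns x nt cross-kernel matrix."""
    ns, nt = K_cross.shape
    return (ns / nt) * K_cross.sum(axis=1)

def split_by_weight(weights, frac=0.10):
    """Return (validation_idx, train_idx): the top `frac` of source points by weight, and the rest."""
    w = np.asarray(weights, dtype=float)
    n_val = max(1, int(round(frac * len(w))))
    order = np.argsort(-w, kind="stable")      # indices sorted by decreasing weight
    val_idx = np.sort(order[:n_val])           # highest-weight points form the validation set
    train_idx = np.sort(order[n_val:])         # remaining source points are used for training
    return val_idx, train_idx
```

For example, `split_by_weight(w, frac=0.10)` on the source weights `w` gives the indices of the 10% validation subset and of the remaining source points used for training.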
In this section, we evaluate the NET algorithm and the model selection proposition across multiple image classification datasets and several existing procedures for unsupervised domain adaptation.
We conduct our experiments across 7 different datasets. Their characteristics are outlined in Table (1).
MNIST-USPS datasets: These are popular handwritten digit recognition datasets. Here, the digit images are subsampled to $16 \times 16$ pixels. Based on [Long et al.2014], we consider two domains, MNIST (2,000 images from MNIST) and USPS (1,800 images from USPS).
CKPlus-MMI datasets: The CKPlus [Lucey et al.2010] and MMI [Pantic et al.2005] datasets are popular facial expression recognition datasets. They contain videos of facial expressions. We choose 6 categories of facial expression, viz., anger, disgust, fear, happy, sad and surprise. We create two domains, CKPlus and MMI, by selecting video frames with the most intense expressions. We use a pre-trained deep convolutional neural network (CNN) to extract features from these images. In our experiments, we use the VGG-F model [Chatfield et al.2014], trained on the popular ImageNet object recognition dataset. The VGG-F network is similar in architecture to the popular AlexNet [Krizhevsky, Sutskever, and Hinton2012]. We extract the 4096-dimensional features that are fed into the fully-connected layer. We apply PCA on the combined source and target data to reduce the dimension to 500 and use these features across all the experiments.
COIL20 dataset: It is an object recognition dataset consisting of 20 categories with two domains, COIL1 and COIL2. The domains consist of images of objects captured from views that are 5 degrees apart. The images are $32 \times 32$ pixels with gray scale values [Long et al.2013].
PIE dataset: The “Pose, Illumination and Expression” (PIE) dataset consists of face images ($32 \times 32$ pixels) of 68 individuals. The images were captured with different head-pose, illumination and expression. Similar to [Long et al.2013], we select 5 subsets with differing head-pose to create 5 domains, namely, P05 (C05, left pose), P07 (C07, upward pose), P09 (C09, downward pose), P27 (C27, frontal pose) and P29 (C29, right pose).
Office-Caltech dataset: This is currently the most popular benchmark dataset for object recognition in the domain adaptation computer vision community. The dataset consists of images of everyday objects. It consists of 4 domains: Amazon, Dslr and Webcam from the Office dataset, and the Caltech domain from the Caltech-256 dataset. The Amazon domain has images downloaded from the www.amazon.com website. The Dslr and Webcam domains have images captured using a DSLR camera and a webcam respectively. The Caltech domain is a subset of the Caltech-256 dataset that was created by selecting categories common with the Office dataset. The Office-Caltech dataset has 10 categories of objects and a total of 2533 images (data points). We experiment with two kinds of features for the Office-Caltech dataset: (i) 800-dimensional SURF features [Gong et al.2012], and (ii) deep features. The deep features are extracted using a pre-trained network, in the same way as for the CKPlus-MMI datasets.
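| Dataset | Type | Samples | Features | Classes | Domains |
|---|---|---|---|---|---|
| PIE | Face | 11,554 | 1,024 | 68 | P05, …, P29 |
| Ofc-Cal SURF | Object | 2,533 | 800 | 10 | A, C, W, D |
| Ofc-Cal Deep | Object | 2,505 | 4,096 | 10 | A, C, W, D |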
We compare the NET algorithm with the following baseline and state-of-the-art methods.
| Method | Description |
|---|---|
| SA | Subspace Alignment [Fernando et al.2013] |
| CA | Correlation Alignment [Sun, Feng, and Saenko2015] |
| GFK | Geodesic Flow Kernel [Gong et al.2012] |
| TCA | Transfer Component Analysis [Pan et al.2011] |
| TJM | Transfer Joint Matching [Long et al.2014] |
| JDA | Joint Distribution Adaptation [Long et al.2013] |
Like NET, the TCA, TJM and JDA are all spectral methods. While all four algorithms use MMD to align the source and target datasets, the NET, in addition, uses nonlinear embedding for classification enhancement. TCA, TJM and JDA solve for the projection matrix $A$ in a setting similar to Equation (10). However, unlike NET, they do not have the similarity-based embedding term, and the domain-alignment weight $\alpha = 1$ is fixed in all three algorithms. Therefore, these models have only 2 free parameters (the regularization weight $\gamma$ and the subspace dimension $k$) that need to be pre-determined, in contrast to NET, which has 4 parameters, $(\alpha, \beta, \gamma, k)$. Since TCA, TJM and JDA are all quite similar to each other, for the sake of brevity, we evaluate model selection (estimating optimal model parameters) using only JDA and NET. The SA, CA and GFK algorithms do not have any critical free model parameters that need to be pre-determined.
In our experiments, $\mathrm{NET}_v$ is a special case of the NET, where the model parameters $(\alpha, \beta, \gamma, k)$ have been determined using a validation set derived from Equation (11). Similarly, $\mathrm{JDA}_v$ is a special case of JDA, where $(\gamma, k)$ have been determined using a validation set derived from Equation (11). In order to ascertain how close to optimal the parameters determined with a source-based validation set are, we also estimate the best model parameters using the target data (with labels) as a validation set. These results are represented by NET in the figures and tables. The results for the rest of the algorithms (SA, CA, GFK, TCA, TJM and JDA) are obtained with the parameter settings described in their respective works.
For fair comparison with existing methods, we follow the same experimental protocol as in [Gong et al.2012, Long et al.2014]. We conduct 50 different domain adaptation experiments with the previously mentioned datasets. Each of these is an unsupervised domain adaptation experiment with one source domain (data points and labels) and one target domain (data points only). When estimating the class-conditional MMD matrices $M_c$, we run 10 iterations to converge to the predicted test/validation labels. Wherever necessary, we use a Gaussian kernel for $k(\cdot, \cdot)$, with a standard width equal to the median of the squared distances over the dataset. We train a 1-Nearest Neighbor (NN) classifier using the projected source data and test on the projected target data for all the experiments. We choose a NN classifier as in [Gong et al.2012, Long et al.2014], since it does not require tuning of cross-validation parameters. The accuracies reflect the percentage of correctly classified target data points.
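The kernel choice and the evaluation step in this protocol can be sketched in NumPy as below. The sketch rests on a few assumptions: rows are samples (unlike a column-per-point convention), the Gaussian kernel is parametrized as `exp(-d^2 / width)` with `width` set to the median off-diagonal squared distance (the exact parametrization is not stated in the text), and all function names are chosen here for illustration.

```python
import numpy as np

def sq_dists(A, B):
    """Pairwise squared Euclidean distances between rows of A and rows of B."""
    d = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.maximum(d, 0.0)                  # clip tiny negatives caused by round-off

def gaussian_kernel_median(X):
    """Gram matrix exp(-d^2 / width), with width = median off-diagonal squared distance."""
    d2 = sq_dists(X, X)
    width = np.median(d2[np.triu_indices_from(d2, k=1)])
    return np.exp(-d2 / width)

def one_nn_predict(Zs, ys, Zt):
    """Label each projected target row with the label of its nearest projected source row."""
    nearest = np.argmin(sq_dists(Zt, Zs), axis=1)
    return np.asarray(ys)[nearest]

def accuracy(y_true, y_pred):
    """Percentage of correctly classified points."""
    return 100.0 * np.mean(np.asarray(y_true) == np.asarray(y_pred))
```

A 1-NN classifier has no hyperparameters of its own, so any difference in accuracy between methods can be attributed to the learned projection rather than to classifier tuning.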
Parameter Estimation Study
Here we evaluate our model selection procedure. The NET algorithm has 4 parameters, $(\alpha, \beta, \gamma, k)$, and the JDA has 2 parameters, $(\gamma, k)$, that need to be pre-determined. To determine these parameters, we weight the source data points using Equation (11) and select 10% of the source data points with the largest weights. These source data points have a distribution similar to the target and they are used as a validation set to determine the optimal values for the model parameters. The parameter space consists of a finite set of candidate values for each parameter. For the sake of brevity, we present one set of parameters for every dataset, although in practice, a unique set of parameters can be evaluated for every domain adaptation experiment. Given a set of model parameters, we conduct the domain adaptation experiment using the entire source data (data and labels) and the target data (data only). The accuracies obtained are represented as the shaded columns $\mathrm{JDA}_v$ and $\mathrm{NET}_v$ (JDA and NET with parameters chosen on the source-derived validation set) in Tables (3) and (4).
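The parameter search described above amounts to an exhaustive search over a grid, scored on the validation set. The sketch below shows that loop in plain Python. The function `select_parameters`, the `evaluate` callback and the candidate values in the usage example are hypothetical placeholders, not the values or code used in the experiments.

```python
from itertools import product

def select_parameters(grid, evaluate):
    """Return (best_params, best_score) over the Cartesian product of `grid`.

    grid: dict mapping a parameter name to a list of candidate values.
    evaluate: callable taking the parameters as keyword arguments and
              returning the validation accuracy for that setting.
    """
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)             # e.g. train on source minus validation, test on validation
        if score > best_score:                 # strict '>' keeps the first of any tied settings
            best_params, best_score = params, score
    return best_params, best_score

# Usage with a toy scoring function standing in for a real training run.
toy_grid = {"alpha": [0.1, 1, 10], "k": [5, 10]}
best, score = select_parameters(toy_grid, lambda alpha, k: -abs(alpha - 1) - abs(k - 10))
```

In a real run, `evaluate` would train the model on the source data without the validation subset plus the unlabeled target data, and return the accuracy on the validation subset.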
In order to evaluate the validity of our proposed model selection method, we also determine the parameters using the target data as a validation set. These results are represented by the NET column in Tables (3) and (4). Since the NET column values have been determined using the target data, they can be considered the best accuracies for the NET model. The values in the remaining columns (SA, CA, GFK, TCA, TJM and JDA) were estimated with the model parameters suggested in their respective papers. The recognition accuracies for $\mathrm{NET}_v$ (NET with parameters selected on the source-derived validation set) are greater than those of the other domain adaptation methods and are nearly comparable to those of the NET. In Table (3), $\mathrm{JDA}_v$ (JDA with parameters selected in the same way) has better performance than the JDA. This goes to show that a proper validation procedure does help select the best set of model parameters. It demonstrates that the proposed model selection procedure is a valid technique for evaluating an unsupervised domain adaptation algorithm in the absence of target data labels. Figures (2) and (3) depict the variation of average validation set accuracies over the model parameters. Based on these curves, the optimal parameters are chosen for each of the datasets.
NET Algorithm Evaluation
The NET algorithm has been compared to existing unsupervised domain adaptation procedures across multiple datasets. The results of the NET algorithm are depicted under the NET column in Tables (3) and (4). The parameters used to obtain these results are depicted in Table (5). The accuracies obtained with the NET algorithm are consistently better than any of the other spectral methods (TCA, TJM and JDA). NET also consistently performs better compared to non-spectral methods like SA, CA and GFK.
Discussion and Conclusions
The average accuracies obtained with JDA and NET using the validation set are comparable to the best accuracies with JDA and NET. This empirically validates the model selection proposition. However, there is no theoretical guarantee that the parameters selected are the best. In the absence of theoretical validation, further empirical analysis is advised when using the proposed technique for model selection.
In this paper, we have proposed the Nonlinear Embedding Transform algorithm and a model selection procedure for unsupervised domain adaptation. The NET performs favorably compared to competitive visual domain adaptation methods across multiple datasets.
This material is based upon work supported by the National Science Foundation (NSF) under Grant No:1116360. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.
- [Aytar and Zisserman2011] Aytar, Y., and Zisserman, A. 2011. Tabula rasa: Model transfer for object category detection. In Intl. Conference on Computer Vision.
- [Belkin and Niyogi2003] Belkin, M., and Niyogi, P. 2003. Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation 15(6):1373–1396.
- [Bruzzone and Marconcini2010] Bruzzone, L., and Marconcini, M. 2010. Domain adaptation problems: A dasvm classification technique and a circular validation strategy. Pattern Analysis and Machine Intelligence, IEEE Trans. on 32(5):770–787.
- [Chatfield et al.2014] Chatfield, K.; Simonyan, K.; Vedaldi, A.; and Zisserman, A. 2014. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference.
- [Chung1997] Chung, F. R. 1997. Spectral graph theory, volume 92. American Mathematical Soc.
- [Duan, Tsang, and Xu2012] Duan, L.; Tsang, I. W.; and Xu, D. 2012. Domain transfer multiple kernel learning. Pattern Analysis and Machine Intelligence, IEEE Trans. on 34(3):465–479.
- [Fernando et al.2013] Fernando, B.; Habrard, A.; Sebban, M.; and Tuytelaars, T. 2013. Unsupervised visual domain adaptation using subspace alignment. In IEEE Conference on Computer Vision and Pattern Recognition, 2960–2967.
- [Gong et al.2012] Gong, B.; Shi, Y.; Sha, F.; and Grauman, K. 2012. Geodesic flow kernel for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition.
- [Gong, Grauman, and Sha2013] Gong, B.; Grauman, K.; and Sha, F. 2013. Connecting the dots with landmarks: Discriminatively learning domain-invariant features for unsupervised domain adaptation. In Intl. Conference on Machine Learning, 222–230.
- [Gretton et al.2009] Gretton, A.; Smola, A.; Huang, J.; Schmittfull, M.; Borgwardt, K.; and Schölkopf, B. 2009. Covariate shift by kernel mean matching. Dataset shift in machine learning 3(4):5.
- [Hoffman et al.2013] Hoffman, J.; Rodner, E.; Donahue, J.; Saenko, K.; and Darrell, T. 2013. Efficient learning of domain-invariant image representations. In Intl. Conference on Learning Representations.
- [Krizhevsky, Sutskever, and Hinton2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. In Neural Information Processing Systems, 1097–1105.
- [Long et al.2013] Long, M.; Wang, J.; Ding, G.; Sun, J.; and Yu, P. S. 2013. Transfer feature learning with joint distribution adaptation. In Intl. Conference on Machine Learning, 2200–2207.
- [Long et al.2014] Long, M.; Wang, J.; Ding, G.; Sun, J.; and Yu, P. 2014. Transfer joint matching for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition, 1410–1417.
- [Lucey et al.2010] Lucey, P.; Cohn, J. F.; Kanade, T.; Saragih, J.; Ambadar, Z.; and Matthews, I. 2010. The extended cohn-kanade dataset (ck+): A complete dataset for action unit and emotion-specified expression. In IEEE Conference on Computer Vision and Pattern Recognition, 94–101.
- [Pan and Yang2010] Pan, S. J., and Yang, Q. 2010. A survey on transfer learning. Knowledge and Data Engineering, IEEE Trans. on 22(10):1345–1359.
- [Pan et al.2011] Pan, S. J.; Tsang, I. W.; Kwok, J. T.; and Yang, Q. 2011. Domain adaptation via transfer component analysis. Neural Networks, IEEE Trans. on 22(2):199–210.
- [Pantic et al.2005] Pantic, M.; Valstar, M.; Rademaker, R.; and Maat, L. 2005. Web-based database for facial expression analysis. In IEEE Conference on Multimedia and Expo. IEEE.
- [Patel et al.2015] Patel, V. M.; Gopalan, R.; Li, R.; and Chellappa, R. 2015. Visual domain adaptation: A survey of recent advances. IEEE Signal Processing Magazine 32(3):53–69.
- [Sun, Feng, and Saenko2015] Sun, B.; Feng, J.; and Saenko, K. 2015. Return of frustratingly easy domain adaptation. In Intl. Conference on Computer Vision, TASK-CV.
- [Torralba and Efros2011] Torralba, A., and Efros, A. A. 2011. Unbiased look at dataset bias. In IEEE Conference on Computer Vision and Pattern Recognition, 1521–1528.