In traditional supervised learning, it is assumed that training samples and test samples are drawn from the same distribution when a learner (e.g., a classifier) is trained. However, this assumption may not hold in domain adaptation problems. We often have plenty of labeled examples from one domain (the source domain) for training a classifier and intend to apply the trained classifier to a different domain (the target domain) with very few or even no labeled examples. Domain adaptation is closely related to semi-supervised learning [Liu et al.2012] and has been studied before under different names, including covariate shift [Shimodaira2000] and sample selection bias [Heckman1979, Zadrozny2004].
The difficulty of domain adaptation comes from the gap between the source distribution and the target distribution. Hence, a model trained in the source domain cannot be used directly in the target domain. Many algorithms have been proposed to address domain adaptation problems. For example, a semi-supervised method, structural correspondence learning (SCL), was proposed in [Blitzer et al.2006]. SCL defines pivot features that are common to both domains and tries to find the correlation between pivot and non-pivot features. It extracts the corresponding subspace and augments the original feature space with this subspace for more effective classification. The method in [Blitzer et al.2011] tries to find the correlations between original features and identify the subspaces in which a good predictor can be obtained by training a classifier merely on samples from the source domain. The kernel mean matching (KMM) method proposed in [Huang et al.2007] aims to minimize the distance between the means of the training and test samples in a reproducing kernel Hilbert space (RKHS) by reweighting the samples from the source domain. In [Chen et al.2011b], a co-training method for domain adaptation called CODA was proposed. In each iteration, CODA formulates an individual optimization problem that simultaneously learns a target predictor, a split of the feature space into views, and a subset of source and target features to be included in the predictor. It tries to progressively bridge the gap between the source and target domains by adding both the features and the instances that the current predictor is most confident about. Nonnegative matrix factorization was employed to bridge the gap between domains by sharing feature clusters [Wang et al.2011]. Glorot et al. [Glorot et al.2011] proposed to learn robust feature representations with stacked denoising auto-encoders (SDA) for domain adaptation.
Marginalized stacked denoising auto-encoder (mSDA) [Chen et al.2012], a variant of SDA with a slightly different network structure, was proposed to address the slow training of SDA. Chen et al. [Chen et al.2012] noticed that the random feature corruption in SDA can be marginalized out, which is conceptually equivalent to training the model with an infinitely large number of corrupted inputs. Moreover, the linear denoising auto-encoders used in mSDA have a closed-form solution, which helps speed up the computation. Promising performance was achieved on cross-domain sentiment analysis tasks. In [Sun et al.2015], correlation alignment (CORAL) uses the difference between the covariance matrices as a measure of the gap between the source and target domains, and a shallow linear feature learning algorithm was proposed on that basis. Inspired by the theory of domain adaptation [Ben-david et al.2006], domain-adversarial neural networks (DANN) [Ajakan et al.2014] and domain adaptation by back-propagation (DAB) [Ganin and Lempitsky2015] optimize an approximated domain distance and an empirical training error on the source domain to seek hidden representations for both source and target domain samples. DANN and DAB can be treated as classifiers for tackling domain adaptation problems, and can also be trained on top of other feature learning algorithms, e.g., mSDA.
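To make the closed-form idea behind marginalized denoising concrete, the following is a minimal sketch of a single marginalized linear denoising layer, written from the description in [Chen et al.2012]. The function names, the samples-as-rows layout, and the small regularizer `reg` are assumptions made for illustration; this is not the authors' code.

```python
import numpy as np

def marginalized_denoising_layer(X, p, reg=1e-5):
    """One marginalized linear denoising layer in closed form (sketch).

    X   : (n_samples, n_features) data matrix (layout is an assumption).
    p   : probability that a feature is zeroed out by the corruption.
    reg : small ridge term for numerical stability (assumed value).
    Returns W with shape (n_features, n_features + 1); the last column
    multiplies a constant bias feature that is never corrupted.
    """
    n, d = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])      # append the bias feature
    q = np.append(np.full(d, 1.0 - p), 1.0)   # survival probabilities
    S = Xb.T @ Xb                             # scatter matrix
    # Expected corrupted scatter: off-diagonal entries scale by q_i * q_j,
    # diagonal entries scale by q_i only.
    Q = S * np.outer(q, q)
    np.fill_diagonal(Q, q * np.diag(S))
    # Expected clean-times-corrupted scatter: column j scales by q_j.
    P = S * q[np.newaxis, :]
    # W = P[:d] (Q + reg * I)^{-1}, computed with a linear solve (Q is symmetric).
    return np.linalg.solve(Q + reg * np.eye(d + 1), P[:d].T).T

def apply_layer(X, W):
    """Linear reconstruction of X; mSDA would follow this with a nonlinearity."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ W.T
```

With `p = 0` there is no corruption and the layer reduces to a (slightly regularized) identity map, which is a convenient sanity check.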
Besides the aforementioned methods, domain adaptation has also been studied theoretically. Most of these studies are built on distances that measure the dissimilarity between distributions, and generalization bounds are derived from the proposed distances. In [Ben-david et al.2006], the $\mathcal{A}$-distance was used to analyze representations for domain adaptation, and VC-dimension-based generalization bounds were derived. The analysis showed that a good feature learning algorithm should simultaneously achieve a low training error on the source domain and a small $\mathcal{A}$-distance. In [Blitzer et al.2008], a uniform convergence bound was provided and extended to domain adaptation with multiple source domains combined with weights. In [Mansour et al.2009, Zhang et al.2013], the $\mathcal{A}$-distance was extended to a more general form that can be used to compare distributions for more general tasks such as regression. All the distances mentioned above are defined in a worst-case sense. In [Germain et al.2013], a distance defined in an average sense by means of PAC-Bayesian theory was suggested, and an algorithm that simultaneously optimizes the error on the source domain, the hypothesis complexity, and the distance was proposed. In [Ben-David et al.2010], the conditions for the success of domain adaptation were analyzed. It was shown that a small distance between the source and target domains and the existence of a classifier in the hypothesis class with low error on both domains are necessary for the success of domain adaptation.
In this paper, we provide a theoretical analysis of several key issues pertaining to effective domain adaptive feature learning, showing that the difference (measured by the Frobenius norm) between the second moments of the source and target domain distributions should be small. Based on this analysis, we propose a simple yet effective feature learning algorithm to be used in conjunction with linear classifiers. To further improve the quality of the learned features, we employ a deep learning approach inspired by the stacked denoising auto-encoders in [Glorot et al.2011, Chen et al.2012], leading to a deep linear model (DLM). DLMs are easy to analyze and are usually the starting point for the theoretical development of neural networks [Goodfellow et al.2016]. Finally, we demonstrate the effectiveness of the proposed feature learning algorithms on the Amazon review and spam datasets.
2 Theoretic Analysis and New Algorithms
2.1 Notations and Background
Usually, a domain is considered as a pair consisting of a distribution on and a labeling function . In this paper, we consider two domains, a source domain and a target domain . The probability density functions of the source distribution and the target distribution are and , respectively. There exist samples with labels drawn from the source domain, while samples from the target domain are drawn without labels. They have the same number of features, which is denoted as . The samples from the source domain form the data matrix , and the samples from the target domain form . We use to denote the data matrix containing the samples from both domains.
A hypothesis is a function . The risk of a hypothesis over domain is denoted as , which is the difference between the two functions. We use the notations and to denote the risk of on the source domain and the target domain, respectively. In this paper, we consider only linear classifiers. We denote the best linear classifier for the source domain as with corresponding parameter vector , and the best classifier for the target domain as with parameter vector . For simplicity of expression, we define the difference of the expectation of a function on the source domain and the target domain as
Our goal in this paper is to learn a new feature representation with samples from source domain and target domain such that we can train a classifier with the learned representations on source domain and apply it directly to target domain for achieving low value.
2.2 Analysis of Feature Learning for Linear Classifiers
We assume that for all samples and there exists a low-risk linear classifier with parameter for both domains. Its risk on source domain is , and risk on target domain is and , where is assumed to be small.
With triangle inequality, it was shown that for any classifier , the following is satisfied [Blitzer et al.2008]:
From this inequality, we can see that the performance of classifier on the target domain is determined by: 1) the risk of the best classifier on both domains; 2) the risk of on the source domain; 3) the difference of the dissimilarity between and on the two domains. A good feature learning algorithm should decrease the sum of these three terms. In this paper, we focus on the third part and provide an analysis of effective linear feature learning algorithms for linear classifiers.
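For reference, a bound with exactly these three terms can be obtained by applying the triangle inequality twice. The notation below is an assumption introduced for illustration and is not necessarily that of [Blitzer et al.2008]: $\epsilon_S(h)$ and $\epsilon_T(h)$ denote the source and target risks of a classifier $h$, $h^*$ is the classifier with low risk on both domains, and $\epsilon_S(h, h^*)$, $\epsilon_T(h, h^*)$ denote the disagreement between $h$ and $h^*$ under the source and target distributions.

```latex
\begin{align*}
\epsilon_T(h) &\le \epsilon_T(h^*) + \epsilon_T(h, h^*) \\
              &\le \epsilon_T(h^*) + \epsilon_S(h, h^*)
                 + \left| \epsilon_T(h, h^*) - \epsilon_S(h, h^*) \right| \\
              &\le \epsilon_T(h^*) + \epsilon_S(h^*) + \epsilon_S(h)
                 + \left| \epsilon_T(h, h^*) - \epsilon_S(h, h^*) \right|
\end{align*}
```

The first two summands correspond to the risk of the best classifier on both domains, the third to the source risk of $h$, and the last to the cross-domain difference in dissimilarity that the rest of the analysis targets.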
In the above inequality, measures the difference between and , which is a non-smooth 0-1 loss and is difficult to analyze. In this paper, we use a smooth approximation for the measure of dissimilarity between and . We denote the smooth loss as , where is the parameter for the linear classifier and is the true label of sample . can be logistic loss or smooth approximation of hinge loss [Zhang and Oles2001]. We denote and as the first and second derivatives of respectively. Thus, we have the following theorem, which is proved in the supplementary material.
Theorem 1. Let and . Assume that any sample in our problem satisfies . Then we have the following inequality:
where , and are defined as
and are partial derivatives, is the label of sample predicted by as .
Our goal is to find new representations such that the approximation for the measure of dissimilarity between and is small. Here, we minimize its upper bound instead. On the right-hand side of inequality (1), is the unknown best possible classifier; hence constants like , and , which are related to , cannot be optimized directly by feature learning algorithms. We assume that the learned features lie in the same region as the original features; hence cannot be optimized either. Therefore, in order to achieve good performance, we should minimize . We will use as a measure of domain distance and design a feature learning algorithm based on it.
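The domain distance discussed here, the Frobenius norm of the difference between the second moments of the two domains, can be estimated from samples. The sketch below is a minimal illustration; the function names and the samples-as-rows layout are assumptions, and the empirical second moment is taken as the average outer product of the samples.

```python
import numpy as np

def second_moment(X):
    """Empirical second moment (1/n) * X^T X for X of shape (n_samples, n_features)."""
    return X.T @ X / X.shape[0]

def second_moment_distance(Xs, Xt):
    """Frobenius norm of the gap between source and target second moments."""
    return np.linalg.norm(second_moment(Xs) - second_moment(Xt), ord="fro")

# Tiny hand-checkable example: the two second moments are
# diag(0.5, 0.5) and diag(1, 0), so the distance is sqrt(0.5).
Xs = np.array([[1.0, 0.0], [0.0, 1.0]])
Xt = np.array([[1.0, 0.0], [1.0, 0.0]])
print(second_moment_distance(Xs, Xt))  # about 0.7071
```

Normalizing by the number of samples keeps the estimate comparable when the source and target sets have different sizes.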
Let us denote , and , and our goal is to learn a linear transformation such that the learned data matrix is suitable for domain adaptation. Based on the above analysis, one of the objectives is to minimize , which is difficult to optimize. Because positive-semidefinite matrices and satisfy , we have . Therefore, the domain distance is bounded by . Hence, our goal can be expressed as finding a matrix with not only a small Frobenius norm but also a small . Moreover, the learned representation should also be similar to the original one. Hence, we can simply use
as the objective function (for clarity of expression, we omit the bias term). In order to force each feature to contribute equally, the length of the features should be incorporated into the second term. The final objective function of our method is expressed as follows
where is a diagonal matrix with , and is the th row of data matrix . We term our method Feature LeArning with second Moment Matching (FLAMM). We will prove that our method decreases the distance between domains under some conditions. If , this algorithm becomes regularized linear regression, which we call the simple feature learning algorithm (SFL). (The linear transformation step of mSDA can be rewritten approximately in a form similar to SFL. Hence, each layer of mSDA can be seen as SFL plus a non-linear transformation step.) SFL does not explicitly minimize the distance between the second moments, but it can still improve the adaptation performance in many cases, as can be seen in the experimental results section. With our theoretical analysis of domain adaptation, we attempt to illustrate why this simple method works.
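As an illustration of feature learning by regularized linear regression, the sketch below solves a plain ridge reconstruction problem, min_W ||X - XW||_F^2 + alpha ||W||_F^2, in closed form. This is a simplified stand-in: the feature-length weighting in the second term of the objective above is not reproduced, and the function name, the samples-as-rows layout, and the synthetic data are assumptions.

```python
import numpy as np

def ridge_feature_map(X, alpha):
    """Closed-form minimizer of ||X - X W||_F^2 + alpha * ||W||_F^2.

    X stacks source and target samples as rows (layout assumed).
    Returns W = (X^T X + alpha I)^{-1} X^T X, a (d, d) matrix.
    """
    S = X.T @ X
    return np.linalg.solve(S + alpha * np.eye(S.shape[0]), S)

rng = np.random.default_rng(0)
Xs = rng.normal(size=(40, 6))                # synthetic "source" samples
Xt = rng.normal(size=(30, 6)) * 1.5 + 0.3    # synthetic "target" samples
X = np.vstack([Xs, Xt])                      # labels are not needed here
W = ridge_feature_map(X, alpha=10.0)
Xs_new, Xt_new = Xs @ W, Xt @ W              # learned representations
```

In this simplified form the learned map is symmetric with eigenvalues strictly between 0 and 1, so it shrinks every direction of the feature space, and more strongly along directions in which the pooled data has little energy.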
To further improve the performance, we adopt the strategy of deep learning methods and apply this algorithm to the learned features repeatedly. Following the convention of deep learning, we call the process of finding one linear transformation matrix and updating the data matrix one layer. The whole algorithm is summarized in Algorithm 1. The output of our method has the same dimensionality as the input. Moreover, the final output is just a linear transformation of the original data matrix, which makes it easier to analyze than other auto-encoder based methods.
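The layer-wise procedure can be sketched as follows. This is a hedged illustration rather than Algorithm 1 itself: each layer uses a plain ridge reconstruction map (without the feature-length weighting or the explicit moment-matching term), and all function names and the samples-as-rows layout are assumptions. The sketch tracks the composed transformation to show that the final output is a single linear map of the input.

```python
import numpy as np

def second_moment_distance(Xs, Xt):
    """Frobenius distance between empirical second moments."""
    Ms = Xs.T @ Xs / Xs.shape[0]
    Mt = Xt.T @ Xt / Xt.shape[0]
    return np.linalg.norm(Ms - Mt, ord="fro")

def ridge_layer(X, alpha):
    """W = (X^T X + alpha I)^{-1} X^T X (plain ridge, assumed layer)."""
    S = X.T @ X
    return np.linalg.solve(S + alpha * np.eye(S.shape[0]), S)

def stack_linear_layers(Xs, Xt, alpha, n_layers):
    """Apply the layer repeatedly; return composed map, outputs, distances."""
    W_total = np.eye(Xs.shape[1])
    Hs, Ht = Xs, Xt
    distances = [second_moment_distance(Hs, Ht)]
    for _ in range(n_layers):
        W = ridge_layer(np.vstack([Hs, Ht]), alpha)  # fit on current features
        Hs, Ht = Hs @ W, Ht @ W                      # update the data matrices
        W_total = W_total @ W                        # compose the linear maps
        distances.append(second_moment_distance(Hs, Ht))
    return W_total, Hs, Ht, distances
```

In this toy version the recorded distance shrinks at every layer, but part of that decrease comes from the overall shrinkage of the features, so the numbers should be read as an illustration of the mechanism rather than as evidence about the method.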
In [Sun et al.2015], a linear feature learning algorithm called correlation alignment (CORAL) was proposed. It uses the difference between the covariance matrices of the source distribution and the target distribution as a measure of the gap between the source domain and the target domain. CORAL simply whitens the source domain and then recolors it with the covariance matrix of the target domain. The authors did not provide a theoretical analysis of the gap measure used, and CORAL is not a deep feature learning method. We will show in the experiments that our method performs better than CORAL.
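The whiten-then-recolor step described above can be sketched as follows. This is a minimal illustration of the idea in [Sun et al.2015], not the authors' implementation; the function names, the samples-as-rows layout, and the default value of the regularizer `reg` are assumptions.

```python
import numpy as np

def sym_matrix_power(C, power):
    """Power of a symmetric positive definite matrix via eigendecomposition."""
    w, V = np.linalg.eigh(C)
    return (V * w ** power) @ V.T

def coral_transform(Xs, Xt, reg=1.0):
    """Whiten the source features, then recolor with the target covariance.

    Xs, Xt : (n_samples, n_features) matrices (layout assumed).
    reg    : multiple of the identity added to both covariance matrices
             so that they are invertible (default value is an assumption).
    """
    d = Xs.shape[1]
    Cs = np.cov(Xs, rowvar=False) + reg * np.eye(d)
    Ct = np.cov(Xt, rowvar=False) + reg * np.eye(d)
    return Xs @ sym_matrix_power(Cs, -0.5) @ sym_matrix_power(Ct, 0.5)
```

Only the source samples are transformed; a classifier is then trained on the transformed source data and applied to the unchanged target data. With a very small `reg` and full-rank covariances, the covariance of the transformed source data matches the target covariance almost exactly.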
2.4 Analysis of Our Method
In this subsection, we will provide an analysis of FLAMM. Since FLAMM is just a revision of SFL by adding a term, we will analyze SFL first. We will prove that the domain distance decreases for each layer of SFL under some conditions, which also explains why mSDA works since the layer of mSDA is similar to that of SFL. Based on that, we will provide an analysis of FLAMM. Here are some necessary lemmas.
Lemma 1 (Weyl’s inequality [Horn and Johnson2012] Theorem 4.3.1).
Let be symmetric matrices, and denote the eigenvalues of matrix as , which are arranged in increasing order . For , we have .
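As a quick numerical illustration, a commonly used corollary form of Weyl's inequality states that for symmetric n-by-n matrices A and B with eigenvalues in increasing order, lambda_i(A) + lambda_1(B) <= lambda_i(A + B) <= lambda_i(A) + lambda_n(B) for every i. Whether this is the exact form intended in Lemma 1 is an assumption; the helper name below is hypothetical.

```python
import numpy as np

def weyl_bounds_hold(A, B, tol=1e-10):
    """Check lambda_i(A) + lambda_1(B) <= lambda_i(A+B) <= lambda_i(A) + lambda_n(B)."""
    a = np.linalg.eigvalsh(A)       # eigenvalues in increasing order
    b = np.linalg.eigvalsh(B)
    c = np.linalg.eigvalsh(A + B)
    lower = a + b[0]                # shift by the smallest eigenvalue of B
    upper = a + b[-1]               # shift by the largest eigenvalue of B
    return bool(np.all(lower <= c + tol) and np.all(c <= upper + tol))

rng = np.random.default_rng(0)
M, N = rng.normal(size=(5, 5)), rng.normal(size=(5, 5))
A, B = (M + M.T) / 2, (N + N.T) / 2   # random symmetric matrices
print(weyl_bounds_hold(A, B))
```

A check like this does not prove anything, but it is a cheap way to catch an indexing or ordering mistake when the inequality is used inside a longer derivation.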
Lemma 2 ([Horn and Johnson1991] Theorem 3.3.16).
Let be matrices, and denote the singular values of matrix as , which are arranged in increasing order . For , we have .
Lemma 3. Suppose is computed by solving problem (2) with . If , then is a positive definite matrix.
Proof: If , we will have , which means . Hence, . Therefore, is a positive definite matrix.
Lemma 4. Suppose , , are symmetric matrices and , . Then .
Proof: If is a rank-one matrix, we can express as . Then we have . If is not a rank-one matrix, it can be expressed as . Since , we denote and write . Then we have
Based on the above lemmas, we have the following theorem:
Theorem 2. If is computed by solving problem (2) with and , then the inequality is satisfied.
Proof: In this proof, we will use the following notations:
By Lemma 3, we know that if , then is a positive definite matrix, which means that is also a positive definite matrix. Hence, we have . Therefore, we obtain:
From the above theorem, we can see that on a given layer of SFL, if is big enough, the distance between the second moments is guaranteed to be smaller than that of the input matrix. Hence, SFL implicitly decreases the distance on each layer. This also explains why mSDA works for domain adaptation problems, since a layer of mSDA is very similar to a layer of SFL. Therefore, even though SFL and mSDA do not explicitly optimize the difference between the two second moments, they are capable of improving domain adaptation performance. FLAMM, with the third term in its objective function, decreases the domain distance explicitly. Hence, FLAMM can be seen as a trade-off between reconstruction error and domain distance, and it can achieve the same reconstruction error with a smaller domain distance. This is why FLAMM performs better than SFL, as can be seen in the following section. In practice, we do not need to compute the exact condition in Theorem 2; we can simply treat and as ordinary parameters and select them using the validation set. We will illustrate the changes of distance empirically in the experimental results section.
3 Experimental Results
We evaluate and analyze the proposed methods on the Amazon review dataset (http://www.cs.jhu.edu/~mdredze/datasets/sentiment/) [Blitzer et al.2007] and the ECML/PKDD 2006 spam dataset (http://www.ecmlpkdd2006.org/challenge.html) [Bickel2008]. As in [Blitzer et al.2007], we use a smaller subset of the Amazon review dataset, which contains reviews of four types of products: Books, DVDs, Electronics, and Kitchen appliances. In this dataset, each domain consists of 2000 labeled inputs and approximately 4000 unlabeled ones. We only consider the binary classification problem, i.e., whether a review is positive (higher than 3 stars) or negative (3 stars or lower), as in [Glorot et al.2011, Chen et al.2012], and use the 5000 most frequent terms as features.
The second dataset is from the ECML/PKDD 2006 discovery challenge, which is about personalized spam filtering and generalization across related learning tasks. It contains two tasks: task A and task B. We adopt the dataset of task A for our comparisons and analysis because it contains more samples. In this dataset, 4000 labeled training samples were collected from publicly available sources, with half of them being spam and the other half being non-spam. The testing samples were collected from 3 different user inboxes, denoted U0, U1 and U2, each of which consists of 2500 samples. Hence, the distributions of the source domain and the target domains are different, since they come from different sources. In this dataset, there are three adaptation tasks in total. As in the Amazon review dataset, the 5000 most frequent terms were chosen as features. Three samples were deleted because they did not contain any of these 5000 terms. Hence, we have 7497 testing samples in total.
3.2 Comparison and Analysis
For domain adaptation tasks, traditional cross validation cannot be used to select parameters since the source distribution and the target distribution are different. In our experiments, we simply use a small validation set containing only 500 labeled samples selected randomly from the target domain to select parameters for all feature learning algorithms. (This method is called STV in [Zhong et al.2010], and it works quite well in our setting. Transfer cross validation [Zhong et al.2010] should provide similar results.) Once we have the newly learned features, we treat the cross-domain classification task as a traditional supervised learning problem, and the validation set is not used to select parameters for the classifiers.
We report the results of two baseline representations. The first is the raw tf-idf representation and the second is the PCA representation. For PCA, the subspace is obtained from both source domain samples and target domain samples. Besides these two baselines, we also compare our method with CORAL [Sun et al.2015] and CODA [Chen et al.2011a]. CORAL is an effective shallow feature learning method and has been shown to outperform many well-known methods, e.g., GFK [Gong et al.2012], SVMA [Duan et al.2012] and SCL [Blitzer et al.2006], mainly on image domain adaptation tasks. CODA is a state-of-the-art domain adaptation algorithm based on co-training. Finally, we compare our method with deep learning methods. Since mSDA is better than SDA, as shown in [Chen et al.2012], we only provide comparisons with mSDA. We use binary representations for mSDA as in [Chen et al.2012]; for the other methods, samples are represented with tf-idf and normalized to have unit length. For the representations learned by all feature learning methods, we train a linear SVM on the source domain data and test it on the target domain. The performance metric is classification accuracy.
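The tf-idf representation with unit-length normalization can be sketched as follows. The weighting used here, term count times log(n_docs / document frequency), is one common variant and is an assumption, since the exact tf-idf formula is not stated in the text; the function name is also hypothetical.

```python
import numpy as np

def tfidf_unit_length(counts):
    """Turn a (n_docs, n_terms) term-count matrix into unit-length tf-idf rows.

    Uses tf * log(n_docs / df); other tf-idf weightings exist, and the one
    used in the experiments is not specified in the text.
    """
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    df = np.count_nonzero(counts, axis=0)       # document frequency per term
    idf = np.log(n_docs / np.maximum(df, 1))    # guard terms that never occur
    X = counts * idf
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    # Rows with zero norm (no weighted terms) are left as all zeros.
    return np.divide(X, norms, out=np.zeros_like(X), where=norms > 0)

counts = [[1, 1, 0],
          [2, 0, 1],
          [0, 0, 0]]
X = tfidf_unit_length(counts)
```

Documents that contain none of the vocabulary terms end up as all-zero rows under this sketch, which mirrors why samples without any of the selected terms carry no information for the classifier.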
We denote our method with layers as , and similar notations are also used for SFL and mSDA. We set as 2 for the Amazon review dataset and 5 for the spam dataset. To provide a better understanding of our algorithm, we also provide results with only one layer. Parameters for the representation learning process were selected based on the validation set. On the Amazon review dataset, both and were selected from . On the spam dataset, they were selected from . For mSDA, the noise was selected from on the Amazon dataset and on the spam dataset. For PCA, the reduced dimensionality was selected from for both datasets. For CODA (source code: http://www.cse.wustl.edu/~mchen/code/coda.tar), we used the same parameters as in [Chen et al.2011b], except that we set on the spam dataset. The results of CODA depend on initialization, so we ran CODA 10 times and report the average accuracies. The results are presented in Tables 1 and 2. Our method achieved the best performance on the two datasets. The results of FLAMM are better than those of SFL, and results with multiple layers are usually better than results with only one layer. This indicates that the proposed objective and the multi-layer structure did help improve the adaptation performance.
The performance with different numbers of layers on the two datasets is also plotted in Figure 1. We can see that our method is consistently better than SFL on both datasets. On the Amazon dataset, our method becomes worse when is larger than 4. The reason might be that the difference between the learned representations and the original representations becomes unnecessarily large.
We also present the distance between the second moments of the source domain and target domain through layers with different and (in our implementation, is the product of a number and , where is the number of samples) when , as shown in Figure 2. We can see that the distance decreases through layers for both FLAMM and SFL, and the distance of FLAMM is smaller than that of SFL with the same . This observation explains why SFL also improves the adaptation performance. For FLAMM, the parameters control how small a domain distance we can get. Bigger and lead to a smaller distance between the two second moments. At the same time, however, the learned representations become more different from the original input, which might not be suitable for classifiers. (FLAMM tries to reconstruct the new input data matrix in each layer, which is the output data matrix of the previous layer. Hence, the outputs of each layer usually differ more and more from the original samples as the number of layers grows.) Hence, our method can be understood as a trade-off between domain distance and reconstruction error.
In this paper, a theoretical analysis of the factors that affect the performance of feature learning for domain adaptation was provided. We found that the distance between the second moments of the source domain and the target domain should be small, which is important for a linear classifier to generalize well on the target domain. Based on our analysis, an extremely easy yet effective feature learning algorithm was proposed. Furthermore, our algorithm was extended by leveraging multiple layers, leading to a deep linear model. We also explained why the simple ridge regression based method, which does not explicitly minimize the gap between the source and target distributions, can improve the adaptation performance. The experimental results on the Amazon review and spam datasets corroborated the advantages of our proposed feature learning approach.
This work was partially supported by the following grants: NSF-IIS 1302675, NSF-IIS 1344152, NSF-DBI 1356628 and RGC GRF Grant No. PolyU 152039/14E.
- [Ajakan et al.2014] Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, and Mario Marchand. Domain-adversarial neural networks. arXiv preprint arXiv:1412.4446, 2014.
- [Ben-david et al.2006] S. Ben-david, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adaptation. In NIPS, pages 137–144. 2006.
- [Ben-David et al.2010] Shai Ben-David, Tyler Lu, Teresa Luu, and Dávid Pál. Impossibility theorems for domain adaptation. In International Conference on Artificial Intelligence and Statistics, pages 129–136, 2010.
- [Bickel2008] S. Bickel. ECML-PKDD discovery challenge 2006 overview. In ECML-PKDD Discovery Challenge Workshop, pages 1–9, 2008.
- [Blitzer et al.2006] J. Blitzer, R. McDonald, and F. Pereira. Domain adaptation with structural correspondence learning. In EMNLP, pages 120–128, 2006.
- [Blitzer et al.2007] J. Blitzer, M. Dredze, and F. Pereira. Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. In ACL, pages 440–447, June 2007.
- [Blitzer et al.2008] John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman. Learning bounds for domain adaptation. In Advances in Neural Information Processing Systems 20, pages 129–136. 2008.
- [Blitzer et al.2011] J. Blitzer, D. Foster, and S. Kakade. Domain adaptation with coupled subspaces. In International Conference on Artificial Intelligence and Statistics, pages 173–181, 2011.
- [Chen et al.2011a] M. Chen, Y. Chen, and K. Q. Weinberger. Automatic feature decomposition for single view co-training. In ICML, pages 953–960, 2011.
- [Chen et al.2011b] M. Chen, K. Q. Weinberger, and J. Blitzer. Co-Training for domain adaptation. In NIPS 24, pages 2456–2464. 2011.
- [Chen et al.2012] M. Chen, Z. Xu, F. Sha, and K. Q. Weinberger. Marginalized denoising autoencoders for domain adaptation. In ICML, pages 767–774, 2012.
- [Duan et al.2012] Lixin Duan, Ivor W Tsang, and Dong Xu. Domain transfer multiple kernel learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(3):465–479, 2012.
- [Ganin and Lempitsky2015] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In ICML-15, pages 1180–1189, 2015.
- [Germain et al.2013] Pascal Germain, Amaury Habrard, Emilie Morvant, et al. A PAC-bayesian approach for domain adaptation with specialization to linear classifiers. In ICML, pages 738–746, 2013.
- [Glorot et al.2011] X. Glorot, A. Bordes, and Y. Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML, pages 513–520, 2011.
- [Gong et al.2012] B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In CVPR, pages 2066–2073. IEEE, 2012.
- [Goodfellow et al.2016] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
- [Heckman1979] James J Heckman. Sample selection bias as a specification error. Econometrica: Journal of the econometric society, pages 153–161, 1979.
- [Horn and Johnson1991] R.A. Horn and C.R. Johnson. Topics in Matrix Analysis. Cambridge University Press, 1991.
- [Horn and Johnson2012] Roger A Horn and Charles R Johnson. Matrix analysis. Cambridge university press, 2012.
- [Huang et al.2007] J. Huang, A. J. Smola, A. Gretton, K. M. Borgwardt, and B. Schölkopf. Correcting sample selection bias by unlabeled data. In NIPS 19, pages 601–608. MIT Press, Cambridge, MA, 2007.
- [Liu et al.2012] Wei Liu, Jun Wang, and Shih-Fu Chang. Robust and scalable graph-based semisupervised learning. Proceedings of the IEEE, 100(9):2624–2638, 2012.
- [Mansour et al.2009] Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation with multiple sources. In NIPS 21, pages 1041–1048. 2009.
- [Shimodaira2000] Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of statistical planning and inference, 90(2):227–244, 2000.
- [Sun et al.2015] Baochen Sun, Jiashi Feng, and Kate Saenko. Return of frustratingly easy domain adaptation. arXiv preprint arXiv:1511.05547, 2015.
- [Wang et al.2011] Hua Wang, Feiping Nie, Heng Huang, and Chris Ding. Dyadic transfer learning for cross-domain image classification. In 2011 IEEE International Conference on Computer Vision (ICCV), pages 551–556. IEEE, 2011.
- [Zadrozny2004] Bianca Zadrozny. Learning and evaluating classifiers under sample selection bias. In ICML, pages 114–122, 2004.
- [Zhang and Oles2001] Tong Zhang and Frank J Oles. Text categorization based on regularized linear classification methods. Information retrieval, 4(1):5–31, 2001.
- [Zhang et al.2013] Chao Zhang, Lei Zhang, and Jieping Ye. Generalization bounds for domain adaptation. CoRR, abs/1304.1574, 2013.
- [Zhong et al.2010] Erheng Zhong, Wei Fan, Qiang Yang, Olivier Verscheure, and Jiangtao Ren. Cross validation framework to choose amongst models and datasets for transfer learning. In ECML PKDD, pages 547–562, 2010.