1 Introduction
Supervised learning, and in particular classification, is an important task in machine learning with a broad range of applications. The obtained models are used to predict the labels of unseen test samples. In general, it is assumed that the underlying domain of interest does not change between training and test samples. If the domain changes from one task to a related but different one, one would like to reuse the available learning model. Domain differences are quite common in real-world scenarios and eventually lead to substantial performance drops Weiss2016. A transfer learning example is the classification of web pages: a classifier is trained in the domain of university web pages, with a word distribution according to universities, and in the test scenario the domain has changed to non-university web pages, where the word distribution may not be similar to the training distribution. Figure 1 shows a toy example of a traditional and a transfer classification task with clearly visible domain differences. More formally, let $X_s$ be source data sampled from the source domain distribution $P_s(X)$ and let $X_t$ be target data from the target domain distribution $P_t(X)$. Traditional machine learning assumes similar distributions, i.e. $P_s(X) = P_t(X)$, but transfer learning assumes different distributions, i.e. $P_s(X) \neq P_t(X)$, as in the web page example, where $X_s$ could be features of university websites and $X_t$ features of non-university websites.
In general, transfer learning aims to overcome the divergence between domain distributions by reusing information from one domain to help to learn a target prediction function in a different domain of interest 5288526. However, despite this definition, the proposed solutions implicitly solve the differences by linear transformations, detailed in sections 4.1 and 4.2. Multiple transfer learning methods have already been proposed, following different strategies and improving the prediction performance of the underlying classification algorithms in test scenarios Weiss2016; 5288526. In this paper, we focus on sparse models, which are not yet covered sufficiently by transfer learning approaches. The Probabilistic Classification Vector Machine (PCVM) Chen2009 is a sparse probabilistic kernel classifier pruning unused basis functions during training, found to be very effective Chen2009; Schleif2015 with competitive performance to the Support Vector Machine (SVM) Cortes1995a. The PCVM is naturally sparse and creates interpretable models, as needed in many application domains of transfer learning. The original PCVM is not well suited for transfer learning due to its focus on a stationary Gaussian distribution and is equipped in this work with two transfer learning approaches.
The contributions are detailed in the following:
We integrate Transfer Kernel Learning (TKL) Long2015 into the PCVM to retain its sparsity.
Inspired by Basis Transfer (BT) stvm, a subspace transfer learning approach is proposed and combined with PCVM, boosting prediction performance significantly compared to the baseline. It is enhanced by Nyström techniques, which reduce the computational complexity compared to BT. Finally, a data augmentation strategy is proposed, making the approach independent of a particular domain adaptation task, which is a drawback of BT.
The proposed solutions are tested against other commonly used transfer learning approaches on common datasets in the field.
The rest of the paper is organized as follows: An overview of related work is given in section 2. The mathematical preliminaries of PCVM and the Nyström approximation are introduced in section 3. The proposed transfer learning extensions follow in section 4. An experimental part is given in section 5, addressing the classification performance, the sparsity and the computational time of the approaches. A summary and a discussion of open issues are provided in the conclusion.
2 Related Work
The area of transfer learning provides a broad range of transfer strategies with many competitive approaches Weiss2016; 5288526. In the following, we briefly name these strategies and discuss the key approaches used herein.
The instance transfer methods try to align the distributions by reweighting some source data, which can then directly be used with target data in the training phase Weiss2016.
Approaches implementing the symmetric feature transfer Weiss2016 try to find a common latent subspace for source and target domain with the goal of reducing distribution differences, such that the underlying structure of the data is preserved in the subspace. An example of a symmetric feature transfer method is Transfer Component Analysis (TCA) Pan2011.
The asymmetric feature transfer approaches try to transform source domain data into the target (subspace) domain. This should be done in a way that the transformed source data matches the target distribution. In comparison to the symmetric feature transfer approaches, there is no shared subspace, only the target space Weiss2016. An example is given by Joint Distribution Adaptation (JDA) Long2013, which solves divergences in distributions similarly to TCA, but additionally aligns conditional distributions with pseudo-labeling techniques. Pseudo-labeling assigns labels to unlabeled target data with a baseline classifier, e.g. SVM, resulting in a target conditional distribution, which is then matched to the source conditional distribution of the ground truth source labels Long2013. The Subspace Alignment (SA) Fernando2013a algorithm is another asymmetric transfer learning approach. It computes a target subspace representation where source and target data are aligned, but was only evaluated on image domain adaptation data. We included SA in the experimental study, which also contains non-image data.
The relational-knowledge transfer aims to find some relationship between source and target data Weiss2016. Transfer Kernel Learning (TKL) Long2015 is a recent approach, which approximates the kernel of the training data with the kernel of the test data via the Nyström kernel approximation. It only considers discrepancies in distributions and further claims that it is sufficient to approximate the training kernel via the test kernel, i.e. $\bar{K}_s \approx K_t$, for effective knowledge transfer Long2015. Note that this restriction to kernels does not apply to the Nyström transfer learning extension (section 4.2), because it operates completely in Euclidean space.
All the considered methods have approximately a complexity of $O(n^2)$, where $n$ is the larger number of samples of the test or training set Dai2007; Long2013; Long2015; Pan2011. According to the definition of transfer learning 5288526, these algorithms perform transductive transfer learning, because some unlabeled test data must be available at training time. These transfer solutions cannot be directly used as predictors, but instead are wrappers for classification algorithms. The baseline classifier is most often the SVM. Note that the discussed approaches are only tested on classification tasks and may be limited to this task.
3 Preliminaries
3.1 Probabilistic Classification Vector Machine
The Probabilistic Classification Vector Machine (PCVM) Chen2009 uses a probabilistic kernel model

(1) $f(x; w, b) = \Psi\left(\sum_{i=1}^{N} w_i \phi_{i,\theta}(x) + b\right)$

with a link function $\Psi(\cdot)$, the weights $w_i$ of the basis functions $\phi_{i,\theta}$ and $b$ as bias term. The class assignment of given data $x$ is given by Chen2009

(2) $y(x) = \operatorname{sign}(f(x; w, b)).$
In PCVM the basis functions $\phi_{i,\theta}$ are defined explicitly as part of the model design. In (1) the standard kernel trick can be applied Scholkopf.2001. The probabilistic output of PCVM Chen2009 is calculated by using the probit link function, i.e.

(3) $\Psi(z) = \int_{-\infty}^{z} \mathcal{N}(t \mid 0, 1) \, dt,$

where $\Psi(z)$ is the cumulative distribution of the normal distribution $\mathcal{N}(0, 1)$. The PCVM Chen2009 uses the Expectation-Maximization (EM) algorithm for learning the model. However, the PCVM is not restricted to EM, and other optimization approaches like Monte Carlo techniques are also possible Chen2014b. The underlying sparsity framework within the optimization prunes unused basis functions, independent of the optimization approach, making the PCVM a sparse probabilistic learning machine. In PCVM we will use the standard RBF kernel with a Gaussian width $\theta$. In Schleif2015 a PCVM with linear costs was suggested, which makes use of the Nyström approximation and could additionally be used to improve the runtime and memory complexity. Further details can be found in Chen2009 and Schleif2015.
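The model of eq. (1) and the probit link of eq. (3) can be sketched numerically. This is a hedged illustration, not the reference implementation of Chen2009: the function names, the toy weights and basis points are ours.

```python
import numpy as np
from math import erf

def rbf_kernel(X, B, theta=1.0):
    """RBF basis functions phi_{i,theta} between data X and basis points B."""
    d2 = ((X[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * theta ** 2))

def probit(z):
    """Psi(z): cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + np.vectorize(erf)(z / np.sqrt(2.0)))

def pcvm_predict_proba(X, B, w, b, theta=1.0):
    """p(y=1|x) = Psi(sum_i w_i phi_i(x) + b), cf. eq. (1) and (3)."""
    return probit(rbf_kernel(X, B, theta) @ w + b)

rng = np.random.default_rng(0)
B = rng.normal(size=(5, 2))   # basis points (few remain after pruning)
w = rng.normal(size=5)        # weights of the basis functions
p = pcvm_predict_proba(rng.normal(size=(3, 2)), B, w, b=0.0)
```

The sparsity of the PCVM would show up here as a small number of rows in `B` and non-zero entries in `w`.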
3.2 Nyström Approximation
The computational complexity of calculating kernels or eigensystems scales with $O(n^3)$, where $n$ is the sample size SchleifGT18. Therefore, low-rank approximations and dimensionality reductions of data matrices are popular methods to speed up computational processes NIPS2000_1866. The Nyström approximation NIPS2000_1866 is a reliable technique to approximate a kernel matrix by a low-rank representation, without computing the eigendecomposition of the whole matrix:
By the Mercer theorem, a kernel $k(x, y)$ can be expanded by orthonormal eigenfunctions $\phi_i$ and non-negative eigenvalues $\lambda_i$ in the form

(4) $k(x, y) = \sum_{i=1}^{\infty} \lambda_i \phi_i(x) \phi_i(y).$
The eigenfunctions and eigenvalues of a kernel are defined as solutions of the integral equation

(5) $\int k(y, x) \phi_i(x) p(x) \, dx = \lambda_i \phi_i(y),$

where $p(x)$ is a probability density over the input space. This integral can be approximated based on the Nyström technique by an i.i.d. sample $\{x_1, \dots, x_m\}$ from $p(x)$:

(6) $\frac{1}{m} \sum_{j=1}^{m} k(y, x_j) \phi_i(x_j) \approx \lambda_i \phi_i(y).$
Using this approximation we denote with $K^{(m)}$ the corresponding Gram submatrix and get the corresponding matrix eigenproblem equation as

(7) $K^{(m)} U^{(m)} = U^{(m)} \Lambda^{(m)}$

with $U^{(m)} \in \mathbb{R}^{m \times m}$ column orthonormal and $\Lambda^{(m)}$ a diagonal matrix. Now we can derive the approximations for the eigenfunctions and eigenvalues of the kernel

(8) $\phi_i(y) \approx \frac{\sqrt{m}}{\lambda_i^{(m)}} \, k_y^\top u_i^{(m)}, \qquad \lambda_i \approx \frac{\lambda_i^{(m)}}{m},$

where $u_i^{(m)}$ is the $i$th column of $U^{(m)}$ and $k_y = (k(x_1, y), \dots, k(x_m, y))^\top$. Thus, we can approximate $\phi_i$ at an arbitrary point $y$ as long as we know the vector $k_y$. For a given Gram matrix $K$ one may randomly choose $m$ rows and columns. The corresponding indices are called landmarks and should be chosen such that the data distribution is sufficiently covered. Strategies for choosing the landmarks have recently been addressed in DBLP:journals/tnn/ZhangK10a; Kumar.2012; GittensM16; DBLP:journals/csda/BrabanterBSM10. The approximation is exact if the sample size $m$ is equal to the rank of the original matrix and the rows of the sample matrix are linearly independent.
3.3 Nyström Matrix Form
The technique just introduced can be simplified by rewriting it in matrix form Nemtsov2016, where scaling factors like in equation (8) are neglected. We will use the matrix formulation throughout the remaining paper. Again, given a Gram matrix $G \in \mathbb{R}^{n \times n}$, it can be decomposed into

(9) $G = \begin{pmatrix} A & B \\ F & C \end{pmatrix}$

with $A \in \mathbb{R}^{m \times m}$, $B \in \mathbb{R}^{m \times (n-m)}$, $F \in \mathbb{R}^{(n-m) \times m}$ and $C \in \mathbb{R}^{(n-m) \times (n-m)}$. The submatrix $A$ is called the landmark matrix, containing $m$ randomly chosen rows and columns from $G$, and has the Eigen Value Decomposition (EVD) $A = U \Lambda U^\top$ as in equation (7), where the eigenvectors are $U$ and the eigenvalues are on the diagonal of $\Lambda$. The remaining approximated eigenvectors of $G$, i.e. those corresponding to $F$, are obtained by the Nyström method as $F U \Lambda^{-1}$, as in equation (8). Combining both, the full approximated eigenvectors of $G$ are

(10) $\bar{U} = \begin{pmatrix} U \\ F U \Lambda^{-1} \end{pmatrix}.$

The eigenvectors of $G$ can be inverted by computing (note $U^{-1} = U^\top$):

(11) $\bar{U}^{-1} = \begin{pmatrix} U^\top & \Lambda^{-1} U^\top B \end{pmatrix}.$

Combining equation (10), equation (11) and $\Lambda$, the matrix $G$ is approximated by

(12) $\hat{G} = \bar{U} \Lambda \bar{U}^{-1} = \begin{pmatrix} A & B \\ F & F A^{-1} B \end{pmatrix}.$

The Nyström approximation error is given by the Frobenius norm between the ground truth and the reconstructed matrix, i.e. $E = \|G - \hat{G}\|_F$.
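The matrix-form approximation can be sketched in a few lines; the product form $C_{nm} A^{+} C_{nm}^\top$ used below is algebraically the same reconstruction as eq. (12) for a symmetric Gram matrix. Variable names and the toy data are ours.

```python
import numpy as np

# Build a PSD RBF Gram matrix on random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
d2 = ((X[:, None] - X[None, :]) ** 2).sum(-1)
G = np.exp(-d2)                          # 40 x 40 Gram matrix

# Pick m landmark rows/columns and reconstruct.
m = 20
idx = rng.choice(40, size=m, replace=False)
C = G[:, idx]                            # all rows, landmark columns
A = G[np.ix_(idx, idx)]                  # m x m landmark matrix
G_hat = C @ np.linalg.pinv(A) @ C.T      # Nystrom reconstruction

# Frobenius reconstruction error, relative to ||G||_F.
rel_err = np.linalg.norm(G - G_hat) / np.linalg.norm(G)
```

With well-spread landmarks the relative error stays small, and it vanishes when $m$ reaches the rank of $G$.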
3.4 Kernel Approximation
The Nyström approximation NIPS2000_1866 speeds up kernel computations, because the kernel evaluation need not be done over all pairs of points, but only over a fraction of them. Let $K \in \mathbb{R}^{n \times n}$ be a Mercer kernel matrix with a decomposition as in equation (9). Again we pick $m$ samples with $m \ll n$, leading to the landmark matrix $A$ as defined before. An approximated kernel is constructed by combining equations (12) and (9):

(13) $\hat{K} = K_{nm} A^{-1} K_{nm}^\top$

with $K_{nm} \in \mathbb{R}^{n \times m}$ being the submatrix of $K$ incorporating all rows and the $m$ landmark columns and $A$ as landmark matrix. Based on the definition of kernel matrices, $F = B^\top$ is valid and, therefore, only $K_{nm}$ and $A^{-1}$ must be computed. Note that this kernel approximation is used in section 4.1.
3.5 General Matrix Approximation
Despite its initial restriction to kernel matrices, recent research has expanded the Nyström technique to approximate a Singular Value Decomposition (SVD) Nemtsov2016. Nyström-SVD generalizes the concept of matrix decomposition with the consequence that the respective matrices need not be square. Let $G \in \mathbb{R}^{n \times d}$ be a rectangular matrix with a decomposition as in equation (9). The SVD of the landmark matrix is given by $A = U \Sigma V^\top$, where $U$ are the left and $V$ the right singular vectors and $\Sigma$ contains the positive singular values. The decomposition matrices have the same sizes as in section 3.2. Similarly to the EVD in section 3.2, the left and right singular vectors for the non-symmetric parts $F$ and $B$ are obtained via Nyström techniques Nemtsov2016 and are defined as $F V \Sigma^{-1}$ and $B^\top U \Sigma^{-1}$, respectively. Applying the same principles as for Nyström-EVD, $G$ is approximated by

(14) $\hat{G} = \begin{pmatrix} U \\ F V \Sigma^{-1} \end{pmatrix} \Sigma \begin{pmatrix} V \\ B^\top U \Sigma^{-1} \end{pmatrix}^\top = \begin{pmatrix} A & B \\ F & F A^{-1} B \end{pmatrix}.$

Note that for non-Gram matrices like $G$, $F = B^\top$ is no longer valid. The matrix approximation of equation (14) is used in the performance extension in section 4.2.
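The Nyström-SVD of eq. (14) can be checked numerically on a synthetic rectangular matrix; names and block sizes below are ours. For a matrix whose rank equals the landmark size $m$, the reconstruction is (numerically) exact.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 30, 20, 8
G = rng.normal(size=(n, m)) @ rng.normal(size=(m, d))  # rank-m matrix

A = G[:m, :m]                 # m x m landmark matrix
B = G[:m, m:]                 # top-right block
F = G[m:, :m]                 # bottom-left block

U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_ext = F @ Vt.T @ np.diag(1.0 / s)       # Nystrom-extended left vectors
V_ext = B.T @ U @ np.diag(1.0 / s)        # Nystrom-extended right vectors

U_full = np.vstack([U, U_ext])
V_full = np.vstack([Vt.T, V_ext])
G_hat = U_full @ np.diag(s) @ V_full.T    # eq. (14)

rel_err = np.linalg.norm(G - G_hat) / np.linalg.norm(G)
```

Only the $m \times m$ landmark matrix is decomposed; the remaining singular vectors come from cheap matrix products.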
4 Nyström Transfer Learning
4.1 Transfer Kernel Sparsity Extension
The domain-invariant TKL Long2015 technique is part of the first transfer learning extension. Based on the experimental results shown in section 5, the use of TKL retains the model sparsity of PCVM. Given the categories of transfer learning Weiss2016 introduced in section 2, the approach can be seen as relational-knowledge transfer. It approximates the source kernel by the target kernel via the Nyström method. It aims to find a source kernel close to the target distribution and simultaneously searches for an eigensystem of the source kernel which minimizes the squared Frobenius norm between the ground truth and the approximated kernel Long2015.
Let $X_s$ be training data sampled from $P_s$ with labels $Y_s$ and let $X_t$ be test data sampled from $P_t$ with labels $Y_t$. Note that we assume $P_s \neq P_t$. We obtain the associated kernel matrices $K_s$ for training and $K_t$ for testing by evaluating an appropriate kernel function (e.g. the RBF kernel). For clarity, we rewrite equation (9) in kernel notation

(15) $K = \begin{pmatrix} K_s & K_{st} \\ K_{st}^\top & K_t \end{pmatrix}$

with $K_s$ as train kernel, $K_t$ as test kernel and $K_{st}$ as cross-domain kernel between the domains.
Revisiting the original Nyström approach, it uses randomly chosen columns and rows from $K$. In TKL, however, the target kernel is seen as the landmark matrix and is used for approximating the training kernel $K_s$. Hence, landmarks are not randomly picked from $K$; instead, $K_t$ is used as landmark matrix with the complete target set, so the approximation uses $n_t$ landmarks. $K_s$ is not used in the landmark selection. This differs from the original Nyström approach. The TKL approach assumes that the distribution differences are sufficiently aligned if the approximated training kernel $\bar{K}_s$ matches $K_t$, which also leads to similar domain distributions Long2015.
Rewriting equation (13), we can create an approximated training kernel by

(16) $\bar{K}_s = K_{st} U_t \Lambda_t^{-1} U_t^\top K_{st}^\top,$

where $\Lambda_t$ and $U_t$ are the eigenvalues and eigenvectors of the target kernel, replacing the eigenvalues $\Lambda_s$ of the source kernel. This kernel fixes the eigenvalues to those of the target kernel and should therefore not reduce the distribution differences sufficiently while preserving the training information. Hence, only the eigenvectors of $\bar{K}_s$ are constructed by the target eigensystem,

(17) $\bar{U}_s = K_{st} U_t \Lambda_t^{-1}.$

To fully approximate the eigensystem of the training kernel, the eigenvalues $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_{n_t})$ are defined as model parameters of TKL, leading to the approximated kernel $\bar{K}_s = \bar{U}_s \Lambda \bar{U}_s^\top$. These new parameters must be well-chosen to reduce domain differences while keeping the original training information. The following optimization problem was suggested in Long2015 to solve this issue:

(18) $\min_{\lambda_1, \dots, \lambda_{n_t}} \left\| \bar{U}_s \Lambda \bar{U}_s^\top - K_s \right\|_F^2 \quad \text{s.t.} \quad \lambda_i \geq \zeta \lambda_{i+1}, \; \lambda_{n_t} \geq 0,$

with $\zeta$ as eigenspectrum damping factor. The obtained kernel is domain invariant Long2015 and can be used in any kernel machine. The complexity of the TKL algorithm is quadratic in the number of samples, with additional factors for the number of used eigenvectors and the dimensionality of the data Long2015. TKL in combination with PCVM is called the Probabilistic Classification Transfer Kernel Vector Machine (PCTKVM).
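The construction of the extrapolated eigenvectors in eq. (17) can be sketched as follows. This is a hedged illustration, not the reference implementation of Long2015: the eigenvalue learning of eq. (18) is skipped and the target eigenvalues are reused as a stand-in, and all names are ours.

```python
import numpy as np

def rbf(A, B, s=1.0):
    d2 = ((A[:, None] - B[None, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * s * s))

rng = np.random.default_rng(0)
Xs = rng.normal(0.5, 1.0, size=(30, 3))   # source / training data
Xt = rng.normal(0.0, 1.0, size=(25, 3))   # target / test data

Kt = rbf(Xt, Xt)                          # target kernel (landmark matrix)
lam, Ut = np.linalg.eigh(Kt)              # ascending eigenvalues
lam, Ut = lam[-10:], Ut[:, -10:]          # keep the 10 largest eigenpairs

Kst = rbf(Xs, Xt)                         # cross-domain kernel
Us_bar = Kst @ Ut / lam                   # eq. (17): extrapolated eigenvectors
Ks_bar = (Us_bar * lam) @ Us_bar.T        # approximated training kernel
```

In TKL proper, the vector `lam` in the last line would be replaced by the eigenvalues learned from eq. (18).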
4.1.1 Properties of PCTKVM algorithm
The TKL kernel is used to train the PCVM. In general, an RBF kernel with an in-place optimized distribution-width parameter $\theta$ is used in PCVM. Accordingly, a simple replacement of the standard RBF kernel by a kernel obtained with TKL would be inefficient: in PCVM the kernel is recalculated in each iteration, based on the $\theta$ optimized in the previous iteration. Consequently, we would have to recalculate the entire transfer kernel too, multiplying the cost of TKL into every training iteration of PCVM. However, the performance of PCTKVM strongly depends on the quality of $\theta$. Hence, a reasonable $\theta$ must be obtained via grid search, because in-place optimization is infeasible. With a fixed $\theta$, the transfer kernel is computed only once before training, so the cost of TKL is added to, rather than multiplied into, the training cost of PCVM. Note that TKL and PCTKVM are restricted to kernels, but are independent of the kernel type.
Predictions are made with the PCVM prediction function, but employing the cross-domain kernel $\bar{k}$ for the test data Long2015. In case of the SVM as baseline classifier, the kernel of size $n_t \times n_s$ is used, restricted to the respective support vectors. The prediction function of the SVM has the form $f(x) = \operatorname{sign}\left(\sum_i \alpha_i y_i \bar{k}(x, x_i) + b\right)$, where the $\alpha_i$ are the Lagrange multipliers Long2015.

Because of the sparsity of the PCVM, the number of basis functions used in the decision function is typically small (the decision function may be constructed by only a few samples). If we consider that our model has $\tilde{N}$ non-zero weights with $\tilde{N} \ll n_s$, and because the PCVM uses only the kernel rows/columns corresponding to the non-zero weight indices, our final kernel for prediction has size $n_t \times \tilde{N}$. Therefore, the prediction function of the PCTKVM has the form $f(x) = \Psi\left(\sum_{i=1}^{\tilde{N}} w_i \bar{k}(x, x_i) + b\right)$. The probabilistic output is calculated with the probit link function used in the PCVM. Pseudo code of the sparsity extension is shown in algorithm 1.
4.2 Nyström Basis Transfer Performance Extension
It is a reasonable strategy in TKL to align kernel matrices rather than kernel distributions in a Reproducing Kernel Hilbert Space (RKHS), since distribution alignments are non-trivial in an RKHS Long2015. Hence, TKL modifies the kernel explicitly to reduce the difference between the two kernel matrices. Similar source and target kernels must be obtained, because the underlying classifier is kernel-based and has no transfer learning capability of its own.

In 6790375 it is shown that if source and target datasets are similar, they follow similar distributions, i.e. if $X_s \approx X_t$ then $P_s \approx P_t$, and further have similar kernel distributions and similar kernels. Therefore, we propose a transfer learning approach operating in Euclidean space rather than an RKHS, because it does not limit the approach to kernel classifiers. Further, the kernels obtained after the transfer of data also follow similar distributions. A recent study stvm already showed great transfer capabilities and performance by aligning $X_s$ and $X_t$ with a small error in terms of the Frobenius norm. However, this requires the same sample sizes for $X_s$ and $X_t$, which is assumed in the following with size $n$. The study considered the following optimization problem
(19) $\min_{A, B} \|A X_s B - X_t\|_F^2,$

where $A$ and $B$ are transformation matrices drawing the data closer together. A solution is found analytically, summarized in three steps stvm: First, normalize the data to standard mean and variance. This aligns the marginal distributions in Euclidean space without considering label information stvm. Second, compute an SVD of source and target data, i.e. $X_s = U_s \Sigma_s V_s^\top$ and $X_t = U_t \Sigma_t V_t^\top$. The approach assumes $\Sigma_s \approx \Sigma_t$ in terms of the Frobenius norm due to the normalization to zero mean and variance one, which reduces the scaling of the singular values to the same range. Finally, compute a solution for equation (19) by solving the linear equations. One obtains $A = U_t U_s^\top$ and $B = V_s V_t^\top$. Note that $A$ and $B$ are orthogonal. Apply the transfer operation and approximate the source matrix by using the target basis information:

(20) $\hat{X}_s = A X_s B = U_t \Sigma_s V_t^\top$

with $\hat{X}_s$ as approximated source data, used for training. The three-step process is shown in figure 2 as a geometrical interpretation, demonstrated with a toy example created by Gaussian random sampling.
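The three steps can be sketched on synthetic data. This is a hedged toy illustration of the procedure described above; variable names and the Gaussian toy data are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
Xs = rng.normal(2.0, 3.0, size=(50, 4))    # source domain
Xt = rng.normal(-1.0, 0.5, size=(50, 4))   # target domain

def standardize(X):                        # step 1: zero mean, unit variance
    return (X - X.mean(0)) / X.std(0)

Xs, Xt = standardize(Xs), standardize(Xt)
Us, Ss, Vst = np.linalg.svd(Xs, full_matrices=False)   # step 2: SVDs
Ut, St, Vtt = np.linalg.svd(Xt, full_matrices=False)

# Step 3: with A = Ut Us^T and B = Vs Vt^T, A Xs B = Ut diag(Ss) Vt^T.
Xs_hat = Ut @ np.diag(Ss) @ Vtt            # eq. (20): transferred source

dist_before = np.linalg.norm(Xs - Xt)
dist_after = np.linalg.norm(Xs_hat - Xt)   # equals ||diag(Ss) - diag(St)||_F
```

Since $U_t$ and $V_t$ are orthogonal, the remaining gap to the target is exactly the difference of the singular spectra, which the normalization keeps small.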
In the following, the work of stvm is continued, and we propose a Nyström-based version with three main improvements: reduction of computational complexity via Nyström, implicit dimensionality reduction and dropping the sample size requirements of BT. Further, we introduce a data augmentation strategy that eliminates the restriction of stvm to the task of text transfer learning.
Recall equation (19) and consider a slightly changed optimization problem

(21) $\min_{A} \|A X_s - X_t\|_F^2,$

where a transformation matrix $A$ must be found, which is again obtained analytically. Because we apply a dimensionality reduction technique, only the left-sided transformation matrix must be determined, which is derived in the following: Based on the relationship between SVD and EVD, the Principal Component Analysis (PCA) can be rewritten in terms of the SVD. Consider the target matrix $X_t$ with SVD

(22) $X_t = U_t \Sigma_t V_t^\top,$

where $U_t$ are the eigenvectors and $\Sigma_t^2$ the eigenvalues of $X_t X_t^\top$. By choosing only the $k$ biggest singular values and corresponding singular vectors, the dimensionality of $X_t$ is reduced by

(23) $\tilde{X}_t = U_{t,k} \Sigma_{t,k},$

where $U_{t,k} \in \mathbb{R}^{n_t \times k}$, $\Sigma_{t,k} \in \mathbb{R}^{k \times k}$ and $\tilde{X}_t$ is the reduced target matrix. Hence, only a left-sided transformation is required in equation (21), because the right-sided transformation is omitted in equation (23).
The computational complexity of BT and PCA is decreased by applying Nyström-SVD: Let $X_s$ and $X_t$ have a decomposition as given in equation (9). For clarity, the Nyström notation of section 3.5 is used. For a Nyström-SVD we choose $m$ columns/rows from both matrices, obtaining the landmark matrices $A_s$ and $A_t$. Based on the Nyström-SVD in equation (14), the dimensionality is reduced as in equation (23), keeping only the most relevant data structures:

(24) $\tilde{X}_t = \hat{U}_t \hat{\Sigma}_t,$

where $\hat{U}_t$ are the Nyström-approximated left singular vectors of $X_t$ and $\hat{\Sigma}_t$ the singular values of the landmark matrix $A_t$. Hence, it is sufficient to compute an SVD of the $m \times m$ landmark matrix instead of $X_t$, with $m \ll n_t$, which is considerably lower in computational complexity. Analogously, we approximate the source data by $\tilde{X}_s = \hat{U}_s \hat{\Sigma}_s$. Since we again assume $\hat{\Sigma}_s \approx \hat{\Sigma}_t$ due to the data normalization, solving the linear equation as a possible solution for equation (21) leads to $A = \hat{U}_t \hat{U}_s^\top$. Plugging it back, we obtain

(25) $\hat{X}_s = A \tilde{X}_s = \hat{U}_t \hat{\Sigma}_s,$

where again the basis of the target data transfers structural information into the training domain. The matrix $\hat{X}_s$ is used for training and $\tilde{X}_t$ for testing. According to Weiss2016, it is an asymmetric transfer approach. Further, it is transductive 5288526 and does not need labeled target data. For further reference, we call the approach Nyström Basis Transfer (NBT) and its combination with PCVM the Nyström Transfer Vector Machine (NTVM).
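The transfer rule of eq. (23) and (25) can be sketched compactly. As a hedged simplification, an exact SVD is used here as a stand-in for the Nyström-SVD of eq. (24), and all names and toy data are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 40, 6, 3
Xs = rng.normal(1.0, 2.0, size=(n, d))    # source domain
Xt = rng.normal(0.0, 1.0, size=(n, d))    # target domain
Xs = (Xs - Xs.mean(0)) / Xs.std(0)        # standard normalization
Xt = (Xt - Xt.mean(0)) / Xt.std(0)

Us, Ss, _ = np.linalg.svd(Xs, full_matrices=False)
Ut, St, _ = np.linalg.svd(Xt, full_matrices=False)

Xt_red = Ut[:, :k] * St[:k]               # eq. (23): reduced target, for testing
Xs_hat = Ut[:, :k] * Ss[:k]               # eq. (25): transferred source, for training
```

Both matrices live in the same $k$-dimensional target subspace; only the singular values differ, which the normalization keeps close.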
4.2.1 Properties of Nyström Basis Transfer
We showed by equation (23) that NBT is a valid PCA approximation. It follows from the definition of the SVD that $\hat{U}_t^\top \hat{U}_t \approx I$ and $\hat{U}_t$ is an (approximately) orthogonal basis. Therefore, equations (24) and (25) are orthogonal transformations. In particular, equation (25) transforms the source data into the target subspace and projects it onto the principal components of $X_t$. If the data matrices $X_s$ and $X_t$ are standard normalized (the experimental data are standard normalized to mean zero and variance one in the preprocessing), the geometric interpretation is a rotation of the source data w.r.t. the angles of the target basis, as already shown in figure 2.
The computational complexity of NBT is dominated by the Nyström-SVD, i.e. the SVD of the $m \times m$ landmark matrices $A_s$ and $A_t$ with complexity $O(m^3)$. The inversion of the diagonal matrix in equation (24) can be neglected. The remaining matrix multiplications contribute $O(n m k)$, giving an overall complexity of NBT of $O(m^3 + n m k)$ with $m \ll n$. This makes NBT the fastest transfer learning solution in terms of computational complexity in comparison to the methods discussed in section 2. The approximation error is similar to the original Basis Transfer stvm error:

(26) $E = \|\tilde{X}_t - \hat{X}_s\|_F = \|\hat{\Sigma}_t - \hat{\Sigma}_s\|_F.$
4.2.2 Data Augmentation
In BT stvm, the sample sizes of the data matrices must be aligned. This is not required in NBT, as seen in equations (24) and (25), because differences in size are aligned during the transformation. However, the source dataset has an $n_s$-sized label vector $Y_s$, which would no longer correspond to $\hat{X}_s$ if the sizes differ, and this label vector should not be transformed into the new size, because the semantic label information would not correspond to the transformed data. Hence, the sample sizes must still be the same, i.e. $n_s = n_t$, even though this is not required by the definition of NBT itself. We propose a data augmentation strategy for solving different sample sizes, applied before doing the knowledge transfer. Data augmentation is common in machine and deep learning and has a variety of applications Zhang2016; Hauberg16. However, source and target data should have a reasonable size to properly encode the domain knowledge. In general, there are two cases. First, $n_s < n_t$, meaning there is not enough source data. This is solved by sampling from a class-wise multivariate Gaussian distribution, harmonizing the number of samples per class of the source data. The other case is $n_s > n_t$ and is solved by uniform random removal of source data from the largest class, i.e. the class $c_{\max}$ with the most samples in the source set. The approach reduces the source data to size $n_t$. This is somewhat counterintuitive, because one does not want to reduce the source set. However, we have no class information of the target set at training time, and we would be guessing class labels of target data when adding new artificial samples. The data augmentation strategy is summarized as

(27) $X_s \leftarrow \begin{cases} X_s \cup \{x \sim \mathcal{N}(\mu_c, \Sigma_c)\} & \text{if } n_s < n_t \\ X_s \setminus \{x \mid y(x) = c_{\max}\} & \text{if } n_s > n_t \end{cases}$

where $\mu_c$ is the class-wise mean and $\Sigma_c$ the class-wise variance, the function $y(x)$ maps a training sample to its ground truth label and $c_{\max}$ is the label with the most samples.
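The two cases can be sketched as follows. This is a hedged simplification of the class-wise scheme described above: the helper name `harmonize` and the one-sample-at-a-time policy are ours, and a diagonal (per-feature) Gaussian is used instead of a full covariance.

```python
import numpy as np

def harmonize(Xs, ys, n_t, rng):
    """Grow or shrink the source set (Xs, ys) to n_t samples."""
    Xs, ys = Xs.copy(), ys.copy()
    while len(ys) < n_t:                   # n_s < n_t: sample class-wise
        c = min(np.unique(ys), key=lambda l: (ys == l).sum())  # smallest class
        Xc = Xs[ys == c]
        x_new = rng.normal(Xc.mean(0), Xc.std(0) + 1e-9)       # class Gaussian
        Xs, ys = np.vstack([Xs, x_new]), np.append(ys, c)
    while len(ys) > n_t:                   # n_s > n_t: drop from largest class
        c = max(np.unique(ys), key=lambda l: (ys == l).sum())
        drop = rng.choice(np.flatnonzero(ys == c))
        Xs, ys = np.delete(Xs, drop, 0), np.delete(ys, drop)
    return Xs, ys

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))
y = np.array([0] * 6 + [1] * 4)
X_up, y_up = harmonize(X, y, 14, rng)      # grow the source set to 14
X_down, y_down = harmonize(X, y, 7, rng)   # shrink the source set to 7
```

Only source statistics are used; no target labels are needed, which matches the transductive setting.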
5 Experiments
We follow the experimental design typical for transfer learning algorithms Long2015; Gong2017; Long2013; 5288526; Pan2011. A crucial characteristic of datasets for transfer learning is that the domains for training and testing are different but related. This relation exists because the train and test classes have the same top category or source. The classes themselves are subcategories or subsets. The parameters for the respective methods (source code, parameters and datasets obtainable via https://github.com/ChristophRaab/ntvm) are determined for best performance in terms of accuracy via grid search evaluated on source data.
5.1 Dataset Description
The study consists of benchmark datasets, which are already preprocessed: Reuters from WenyuanDai.2007, 20Newsgroup from Long2014a and Caltech-Office from Gong2017. A summary of the image and text datasets is shown in table 1 and table 2. The respective datasets are detailed in the following.
5.1.1 Image Datasets
Caltech-256 (https://people.eecs.berkeley.edu/~jhoffman/domainadapt/#datasets_code) - Office:
The first, Caltech (C), is an extensive dataset of images and initially contains 30607 images within 257 categories. However, in this setting, only the 1123 images related to the Office dataset are used. We adopt the sampling scheme from Gong2017.
The Office dataset is a collection of images drawn from three sources, namely Amazon (A), digital SLR camera DSLR (D) and webcam (W). They vary regarding camera, light situation and size, but have 31 object categories, e.g. computer or printer, in common.
Duplicates are removed, as well as images which have more than 15 similar Scale Invariant Feature Transform (SIFT) features in common.
To get an overall collection of the four image sets, which are considered as domains, categories with the same description are taken.
From the Caltech and Office dataset, ten similar categories are extracted:
back-pack, touring-bike, calculator, head-phones, computer-keyboard, laptop-101, computer-monitor, computer-mouse, coffee-mug, and projector.
They are the class labels from one to ten.
With this, a classifier should be trained in the training domain, e.g. on projector images (class one) from Amazon (domain A), and should be able to classify a test image into the corresponding image category, e.g. projector images (class one) from Caltech (domain C), against other image types like headphones (class two).
The final feature extraction is done with Speeded Up Robust Features (SURF), encoded with 800-bin histograms.
Finally, the twelve combinations of domain datasets are designed to be trained and tested against each other on the ten labels Gong2017. An overview of the image datasets is given in table 1.
Dataset  Subsets  #Samples  #Features  #Classes

Caltech-256  Caltech (C)  1123  800  10
Office  Amazon (A)  958
  DSLR (D)  295
  Webcam (W)  157
5.1.2 Text Datasets
In the following, the text datasets are discussed. The arrangement of the text domain adaptation datasets is different from the image datasets. The text datasets are structured into top categories and subcategories. The top categories are regarded as labels, and the subcategories are used for training and testing. The variation of subcategories between training and testing creates a transfer problem. The difference to the image datasets is that there the (sub)categories are the labels and the difference in the top category (source of images) between training and testing, e.g. Caltech to Amazon, creates the transfer problem. An overview of the text datasets is given in table 2.
Reuters-21578 (http://www.daviddlewis.com/resources/testcollections/reuters21578): A collection of Reuters newswire articles collected in 1987, with a hierarchical structure of top categories and subcategories to organize the articles. The three top categories Organization (Orgs), Places and People are used in our experiments. The category Orgs has 56 subcategories, Places has 176 and People has 269. In the category Places, all articles about the USA are removed, making the three categories nearly evenly distributed in terms of articles.
We follow the sampling scheme from WenyuanDai.2007, which will be discussed in the following. Note that the top categories, which are just mentioned, are the labels of the datasets.
All subcategories of a top category are randomly divided into two parts with about the same number of articles. This selection is fixed for all experiments. For a top category $A$ this creates two parts $A_1$ and $A_2$ of subcategories, and for another top category $B$ this creates the parts $B_1$ and $B_2$. The top category $A$ is regarded as one class and $B$ as another. The transfer problem is created by using $A_1$ and $B_1$ for training, while $A_2$ and $B_2$ are used for testing. This is a two-class problem because of the two top categories $A$ and $B$. Such a configuration is called $A$ vs $B$. If the second parts are used for training, i.e. $A_2$ and $B_2$, and the first parts for testing, i.e. $A_1$ and $B_1$, the configuration is regarded as $B$ vs $A$. The individual subcategories have different distributions but are related by the top category, creating a change in distribution between training and testing.
Based on this, six datasets are generated: Orgs vs. Places, Orgs vs. People, People vs. Places, Places vs. Orgs, People vs. Orgs and Places vs. People. The articles are converted to lower case, words are stemmed and stopwords are removed. With a Document Frequency (DF) threshold of 3, the number of features is cut down. The features are generated with Term Frequency-Inverse Document Frequency (TF-IDF). For the detailed choice of subcategories see WenyuanDai.2007.
Top Category (Names)  Subcategory  #Samples  #Features  #Classes 

Comp  comp.graphics  970  25804  2 
comp.os.mswindows.misc  963  
comp.sys.ibm.pc.hardware  979  
comp.sys.mac.hardware  958  
Rec  rec.autos  987  
rec.motorcycles  993  
rec.sport.baseball  991  
rec.sport.hockey  997  
Sci  sci.crypt  989  
sci.electronics  984  
sci.med  987  
sci.space  985  
Talk  talk.politics.guns  909  
talk.politics.mideast  940  
talk.politics.misc  774  
talk.religion.misc  627  
Orgs  56 subcategories  1237  4771  2 
People  269 subcategories  1208  
Places  176 subcategories  1016 
20Newsgroup (http://qwone.com/~jason/20Newsgroups/): The original collection has approximately 20000 text documents from 20 newsgroups. The four top categories are comp, rec, talk and sci, each containing four subcategories. We follow the data sampling scheme introduced by Long2015 and generate 216 cross-domain datasets based on the subcategories: Again, the top categories are the labels and the subcategories are varied between training and testing to create a transfer problem.
Let $A$ be a top category with subcategories $A_1, \dots, A_4$ and $B$ another top category with subcategories $B_1, \dots, B_4$. A dataset is constructed by selecting two subcategories per top category, e.g. $A_1$, $A_2$, $B_1$ and $B_2$, for training and selecting another four, e.g. $A_3$, $A_4$, $B_3$ and $B_4$, for testing. The top categories $A$ and $B$ are the respective classes.
For two top categories every permutation of subcategory selections is used, and therefore 36 datasets are generated per combination. By combining each top category with each other, there are 216 dataset combinations in total. The results are summarized as the mean per top category combination; the six combinations are comp vs rec, comp vs talk, comp vs sci, rec vs sci, rec vs talk and sci vs talk. The transfer problem is created by training and testing on different subcategories, analogously to Reuters. This version of 20Newsgroup has 25804 TF-IDF features within 15033 documents Long2015.
Note that to reproduce the results below, one should use the linked versions of the datasets with the same choice of subcategories. Regardless of dataset, features have been normalized to zero mean and unit variance. The samples for training and testing the classifiers are drawn with the cross-validation sampling scheme suggested by Alpaydm.1999, adapted to transfer learning as suggested in Gong2017.
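The feature normalization mentioned above can be sketched in a few lines; a minimal version, assuming the statistics are fitted on the training data only to avoid test-set leakage:

```python
# Minimal sketch: standardize each feature to zero mean and unit variance,
# using the training data's statistics for both training and test sets.
import numpy as np

def standardize(X_train, X_test, eps=1e-12):
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma < eps] = 1.0  # guard against constant (zero-variance) features
    return (X_train - mu) / sigma, (X_test - mu) / sigma
```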
5.2 Comparison of Prediction Performance
The results of the experiments are summarized in table 3, which shows the mean errors of the cross-validation study per dataset. To determine statistically significant differences, we follow Chen2009, using the Friedman test Demsar2006 with Bonferroni-Dunn post-hoc correction; statistically significant differences against NTVM are marked. The PCTKVM and NTVM are compared to standard transfer learning methods and to the non-transfer baseline classifiers SVM and PCVM. The PCTKVM has overall comparable performance to PCVM, but is worse on Newsgroup, showing negative transfer Weiss2016; this should be investigated in future work.
The NTVM method shows excellent performance and outperforms every other algorithm by a clear margin. In the overall comparison, NTVM is significantly better than all other approaches except SA.
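The significance protocol above can be sketched as follows. Note one deliberate substitution: the paper uses the Bonferroni-Dunn post-hoc test on Friedman ranks, while this sketch uses pairwise Wilcoxon signed-rank tests with a plain Bonferroni correction as a simpler stand-in. The error matrix in the example is synthetic.

```python
# Hedged sketch: Friedman test over per-dataset error rates of several
# classifiers (Demsar2006), followed by Bonferroni-corrected pairwise
# Wilcoxon tests (a simplification of the Bonferroni-Dunn procedure).
from scipy.stats import friedmanchisquare, wilcoxon

def friedman_with_bonferroni(errors, alpha=0.05):
    """errors: dict name -> list of error rates over the same datasets.
    Returns the Friedman p-value and, per pair, (p-value, significant?)
    at the Bonferroni-corrected level alpha / #pairs."""
    names = list(errors)
    _, p_friedman = friedmanchisquare(*(errors[n] for n in names))
    m = len(names) * (len(names) - 1) // 2  # number of pairwise tests
    pairs = {}
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            _, p = wilcoxon(errors[names[i]], errors[names[j]])
            pairs[(names[i], names[j])] = (p, p < alpha / m)
    return p_friedman, pairs
```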
Dataset  SVM  PCVM  TCA  JDA  TKL  SA  PCTKVM  NTVM
Orgs vs People  23.0  31.1  23.8  25.7  18.6  6.3  20.7  3.1  
People vs Orgs  21.1  28.1  20.3  24.9  13.0  5.7  23.2  3.1  
Orgs vs Places  30.8  33.4  28.7  27.4  22.7  6.4  33.9  2.6  
Places vs Orgs  35.8  35.7  34.1  34.1  17.5  6.6  40.6  2.4  
People vs Places  38.8  41.3  37.2  41.3  30.6  8.3  36.3  2.6  
Places vs People  41.3  41.2  43.4  43.3  34.0  11.5  39.7  2.5  
Reuters Mean  31.8  35.1  31.3  32.8  22.7  7.5  32.4  2.7  
Comp vs Rec  12.7  17.9  8.1  7.8  3.0  1.8  37.3  0.5  
Comp vs Sci  24.5  29.1  26.3  27.1  9.5  4.8  33.7  0.8  
Comp vs Talk  5.1  6.2  2.9  4.2  2.4  0.9  14.9  7.2  
Rec vs Sci  23.7  36.6  17.3  23.9  5.1  1.6  29.3  0.3  
Rec vs Talk  18.7  27.8  13.6  15.2  5.6  1.8  33.4  3.1  
Sci vs Talk  21.7  30.9  20.1  26.1  14.6  2.9  30.2  6.9  
Newsgroup Mean  17.8  24.7  14.7  17.4  6.7  2.3  29.8  3.1  
C vs A  48.0  55.9  49.7  49.0  49.1  38.0  52.8  19.9  
C vs W  53.8  57.0  55.1  53.5  53.1  68.1  55.2  59.5  
C vs D  62.2  65.9  65.2  58.9  59.4  68.3  60.9  47.4  
A vs C  54.8  59.5  53.8  54.6  53.9  42.4  56.6  37.5  
A vs W  61.0  66.1  58.6  58.8  57.2  67.5  59.7  55.6  
A vs D  62.6  64.8  66.5  59.2  59.2  64.9  61.6  48.9  
D vs C  68.6  72.5  62.4  61.1  65.2  66.1  68.7  19.9  
D vs A  69.0  72.4  65.5  65.8  64.6  66.2  67.0  35.6  
D vs W  40.6  62.5  28.2  30.1  27.4  30.7  45.8  47.9  
W vs C  65.9  67.8  62.1  65.3  63.2  59.4  63.9  16.1  
W vs A  67.6  69.0  63.3  67.0  63.0  64.7  68.4  38.5  
W vs D  23.1  41.1  21.6  22.1  27.4  32.5  45.5  59.8  
Image Mean  56.4  62.9  54.3  53.8  53.6  55.7  58.8  40.6  
Overall Mean  35.3  40.9  33.4  34.6  27.7  21.8  40.4  15.5 
NTVM shows particularly stable and strong results on Reuters, achieving the best performance over multiple datasets. Table 3 shows that, in terms of mean error on Reuters, NTVM is significantly better than all methods except TKL and SA.
The NTVM also performs best on most of the image datasets, demonstrating its ability to handle multi-class problems and its independence from a particular domain adaptation task, unlike previous work stvm. Further, in terms of mean error on the image datasets, NTVM outperforms SVM, PCVM, PCTKVM and SA with statistically significant differences.
The NTVM is also strong on Newsgroup, although less outstanding there: overall it is slightly worse than SA, but not statistically significantly so. It is still best on half of the datasets and achieves error rates below one percent on several of them.
Note that the standard deviations are not shown, as they provide little additional insight: they are similar and small across methods, since the underlying classifier is the same.
The sensitivity of the prediction error to the number of landmarks, the only parameter of NBT, is demonstrated in figure 3. It compares the number of landmarks against the mean classification error over the Reuters and Office-Caltech datasets. The plot indicates that the error decreases to a global minimum as the number of landmarks increases, which matches the expected behaviour of the Nyström approximation error. However, when the number of landmarks is increased further towards the maximum, i.e. all samples, the error starts to rise again. We assume this indicates that only a subset of features is relevant for classification, while the remaining features are noise. Since landmarks are drawn randomly, different landmark matrices make different correlated feature subsets relevant or irrelevant.
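The expected behaviour of the Nyström approximation error can be illustrated on synthetic data: approximating an RBF kernel matrix from m randomly chosen landmarks typically gets more accurate as m grows. This is a generic sketch of the Nyström method, not the authors' NBT implementation.

```python
# Minimal Nyström sketch: approximate the full RBF kernel matrix K from
# m landmark samples via K_hat = C W^+ C^T, where C is the n x m
# cross-kernel and W the m x m landmark kernel.
import numpy as np

def rbf(X, Y, gamma=0.1):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def nystroem_approx(X, m, rng, gamma=0.1):
    idx = rng.choice(len(X), size=m, replace=False)  # random landmarks
    C = rbf(X, X[idx], gamma)       # n x m cross-kernel
    W = C[idx]                      # m x m landmark kernel
    return C @ np.linalg.pinv(W) @ C.T

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
K = rbf(X, X)
errs = {m: np.linalg.norm(K - nystroem_approx(X, m, rng)) for m in (10, 100)}
```

On this toy data, the Frobenius error of the approximation with 100 landmarks is well below that with 10 landmarks, mirroring the decreasing branch of the curve in figure 3.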
5.3 Comparison of Model Complexity
Dataset  PCVM  SVM  TCA  JDA  SA  TKL  PCTKVM  NTVM 
Reuters (1153.66)  49.07  441.78  168.51  201.87  100.87  351.21  1.97  329.51 
Image (633.25)  62.87  284.37  231.65  264.38  238.44  262.64  46.63  27.59 
20 Newsgroup (3758.30)  74.23  1247.10  269.75  252.49  211.57  1046.26  92.89  74.70 
Overall Mean  62.06  640.17  223.30  245.31  183.60  553.27  47.16  143.93 
We measure model complexity as the number of model vectors, e.g. support vectors. The results are shown in table 4 as means per dataset group. The PCTKVM provides very sparse models while maintaining good performance. The sparsity of NTVM is also competitive, although overall worse than that of PCTKVM and PCVM. Compared to all non-PCVM methods, the PCTKVM is by far the sparsest.
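Counting model vectors is straightforward for kernel machines; a minimal sketch using an sklearn SVC as a stand-in for the compared classifiers (the study's actual models differ):

```python
# Hedged sketch of the sparsity measure: the number of model vectors of a
# trained kernel classifier, here the support vectors of an RBF-SVM.
import numpy as np
from sklearn.svm import SVC

def model_complexity(X, y):
    clf = SVC(kernel="rbf", gamma="scale").fit(X, y)
    return int(clf.n_support_.sum())  # total number of support vectors
```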
The difference in model complexity is exemplarily shown in figure 4, which presents a sample classification result of PCTKVM and TKL-SVM on the text dataset Orgs vs People with the settings from above. The PCTKVM achieves its error with only three model vectors, whereas TKL-SVM requires considerably more support vectors. PCTKVM thus achieves sustained performance with a small model complexity and provides a way to interpret the model. Note that the algorithms are trained in the original feature space and the models are plotted in a reduced space, using the t-distributed stochastic neighbor embedding (t-SNE) algorithm vanDerMaaten.2008. For the data shown in figure 4, the Kullback-Leibler divergence between the input distribution (original space) and the output distribution (reduced space) is 0.92.
5.4 Time Comparison
The mean runtimes in seconds of the cross-validation study per dataset group are shown in table 5. Note that SVM and PCVM are the underlying classifiers of the compared approaches; they are listed as baselines, are not marked as winners in the table, and their runtime is included in the measurements of the transfer learning approaches. Overall, the SVM is the fastest algorithm, as it is a plain baseline using the LibSVM implementation. The overall fastest transfer learning approach is TKL, but JDA is also promising and the fastest on Reuters and Newsgroup.
The PCVM is by far the slowest classifier overall. By integrating TKL and NBT, the resulting PCTKVM and NTVM are an order of magnitude faster; we assume that the PCVM converges faster under transfer learning, resulting in less computational time. Overall, NTVM is slightly faster than PCTKVM and requires less time on several dataset groups, which supports the discussion of computational complexity in section 4.2.1. Both approaches are nevertheless slower than the other transfer learning approaches, but the likely reason is the PCVM as underlying classifier, since TKL is the fastest transfer approach when paired with SVM. Future work should measure runtimes with the same underlying classifier to make the results more comparable.
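The timing protocol sketched below measures the wall-clock time of one fit/predict cycle per fold and averages over folds; it is illustrative only, as the study's actual fold scheme and classifiers differ.

```python
# Minimal sketch: mean wall-clock time of fit + predict over simple
# contiguous folds. `clf_factory` builds a fresh classifier per fold.
import time
import numpy as np

def mean_fit_predict_time(clf_factory, X, y, n_folds=5):
    n = len(X)
    fold = n // n_folds
    times = []
    for k in range(n_folds):
        test = slice(k * fold, (k + 1) * fold)
        train = np.r_[0:test.start, test.stop:n]  # all indices outside the fold
        t0 = time.perf_counter()
        clf = clf_factory()
        clf.fit(X[train], y[train])
        clf.predict(X[test])
        times.append(time.perf_counter() - t0)
    return float(np.mean(times))
```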
Dataset  SVM  TCA  JDA  TKL  SA  PCVM  PCTKVM  NTVM
Reuters  0.06  0.86  0.36  0.40  0.87  543.71  16.20  8.78  
Newsgroup  1.35  21.39  4.79  2.80  59.70  1501.7  5.06  25.29  
Image  0.02  0.29  0.16  0.44  0.08  258.78  17.69  3.41  
Overall  0.48  7.51  1.77  1.21  20.22  768.06  12.98  12.49 
6 Conclusion
Summarizing, we proposed two transfer learning extensions of the PCVM, resulting in PCTKVM and NTVM. The former shows the best overall sparsity and performance comparable to common transfer learning approaches. The NTVM shows outstanding performance, both in absolute values and in statistical significance, with competitive sparsity and the lowest computational complexity among the discussed solutions. NBT is an enhancement of previous versions of Basis Transfer via Nyström methods and is no longer limited to specific domain adaptation tasks. The dimensionality reduction paired with the projection of source data into the target subspace via NBT demonstrated its reliability and robustness in this study. The proposed solutions were tested against standard benchmarks in the field, both in terms of algorithms and datasets. Future work should integrate deep transfer learning, different baseline classifiers, and further real-world domain adaptation datasets, and should address smart sampling techniques for landmark selection.