Transfer learning extensions for the probabilistic classification vector machine

07/11/2020 · Christoph Raab et al. · FHWS

Transfer learning focuses on the reuse of supervised learning models in a new context. Prominent applications can be found in robotics, image processing or web mining. In these fields, the learning scenarios are naturally changing but often remain related to each other, motivating the reuse of existing supervised models. Current transfer learning models are neither sparse nor interpretable. Sparsity is very desirable if the methods have to be used in technically limited environments, and interpretability is getting more critical due to privacy regulations. In this work, we propose two transfer learning extensions integrated into the sparse and interpretable probabilistic classification vector machine. They are compared to standard benchmarks in the field and show their relevance either by sparsity or by performance improvements.


1 Introduction

Supervised learning and in particular classification is an important task in machine learning with a broad range of applications. The obtained models are used to predict the label of unseen test samples. In general, it is assumed that the underlying domain of interest is not changing between training and test samples. If the domain is changing from one task to a related but different task, one would like to reuse the available learning model. Domain differences are quite common in real-world scenarios and eventually lead to substantial performance drops Weiss2016.

A transfer learning example is the classification of web pages: A classifier is trained in the domain of university web pages with a word distribution according to universities, and in the test scenario the domain has changed to non-university web pages, where the word distribution may differ from the training distribution. Figure 1 shows a toy example of a traditional and a transfer classification task with clearly visible domain differences.

Figure 1 (panels: Traditional Problem, Transfer Problem): Toy example comparing classification tasks with two classes shown in red and blue. The left panel shows a traditional classification task with one domain; the right panel shows a transfer learning classification task with two domains, indicated by shapes. Transfer learning aims to extract common information from one domain to help a model in another domain.

More formally, let $X_S$ be source data sampled from the source domain distribution $P(X_S)$ and let $X_T$ be target data from the target domain distribution $P(X_T)$. Traditional machine learning assumes similar distributions, i.e. $P(X_S) \approx P(X_T)$, but transfer learning assumes different distributions, i.e. $P(X_S) \neq P(X_T)$. This appears in the web page example, where $X_S$ could be features of university websites and $X_T$ features of non-university websites.

In general, transfer learning aims to solve the divergence between domain distributions by reusing information from one domain to help learn a target prediction function in a different domain of interest 5288526. However, despite this definition, the proposed solutions implicitly solve the differences by linear transformations, detailed in sections 4.1 and 4.2. Multiple transfer learning methods have already been proposed, following different strategies and improving the prediction performance of underlying classification algorithms in test scenarios Weiss2016; 5288526. In this paper, we focus on sparse models, which are not yet covered sufficiently by transfer learning approaches.

The Probabilistic Classification Vector Machine (PCVM) Chen2009 is a sparse probabilistic kernel classifier pruning unused basis functions during training, which has been found to be very effective Chen2009; Schleif2015 with performance competitive to the Support Vector Machine (SVM) Cortes1995a. The PCVM is naturally sparse and creates interpretable models as needed in many application domains of transfer learning. The original PCVM is not well suited for transfer learning due to its focus on a stationary Gaussian distribution and is equipped within this work with two transfer learning approaches.

The contributions are detailed in the following:
We integrate Transfer Kernel Learning (TKL) Long2015 into the PCVM to retain its sparsity. Inspired by Basis-Transfer (BT) stvm, a subspace transfer learning approach is proposed and also combined with the PCVM, boosting prediction performance significantly compared to the baseline. It is enhanced by Nyström techniques, which reduce the computational complexity compared to BT. Finally, a data augmentation strategy is proposed, making the approach independent of a particular domain adaptation task, which is a drawback of BT. The proposed solutions are tested against other commonly used transfer learning approaches on common datasets in the field.

The rest of the paper is organized as follows: An overview of related work is given in section 2. The mathematical preliminaries of the PCVM and the Nyström approximation are introduced in section 3. The proposed transfer learning extensions follow in section 4. An experimental part is given in section 5, addressing the classification performance, the sparsity and the computational time of the approaches. A summary and a discussion of open issues are provided in the conclusion.

2 Related Work

The area of transfer learning provides a broad range of transfer strategies with many competitive approaches Weiss2016; 5288526. In the following, we briefly name these strategies and discuss the key approaches used herein.

The instance transfer methods try to align the distribution by re-weighting some source data, which can directly be used with target data in the training phase Weiss2016.

Approaches implementing the symmetric feature transfer Weiss2016 try to find a common latent subspace for the source and target domain with the goal of reducing distribution differences, such that the underlying structure of the data is preserved in the subspace. An example of a symmetric feature transfer method is the Transfer Component Analysis (TCA) Pan2011.

The asymmetric feature transfer approaches try to transform source domain data into the target (subspace) domain. This should be done in a way that the transformed source data matches the target distribution. In comparison to the symmetric feature transfer approaches, there is no shared subspace, only the target space Weiss2016. An example is given by the Joint Distribution Adaptation (JDA) Long2013, which solves divergences in distributions similarly to TCA, but additionally aligns conditional distributions with pseudo-labeling techniques. Pseudo-labeling is performed by assigning labels to unlabeled target data with a baseline classifier, e.g. an SVM, resulting in a target conditional distribution, which is then matched to the source conditional distribution of the ground truth source labels Long2013. The Subspace Alignment (SA) Fernando2013a algorithm is another asymmetric transfer learning approach. It computes a target subspace representation in which source and target data are aligned, but it has only been evaluated on image domain adaptation data. We include SA in our experimental study, which also contains non-image data.

The relational-knowledge transfer aims to find some relationship between source and target data Weiss2016. Transfer Kernel Learning (TKL) Long2015 is a recent approach, which approximates a kernel of the training data by a kernel of the test data via the Nyström kernel approximation. It only considers discrepancies in distributions and further claims that approximating the training kernel via the test kernel is sufficient for effective knowledge transfer Long2015. Note that this restriction to kernels does not apply to the Nyström transfer learning extension (section 4.2), because it operates completely in Euclidean space.

All the considered methods have approximately a complexity of $\mathcal{O}(n^2)$, where $n$ is the larger number of samples of the training or test set Dai2007; Long2013; Long2015; Pan2011. According to the definition of transfer learning 5288526, these algorithms perform transductive transfer learning, because some unlabeled test data must be available at training time. These transfer solutions cannot be used directly as predictors, but are instead wrappers for classification algorithms. The baseline classifier is most often the SVM. Note that the discussed approaches are only tested on classification tasks and may be limited to them.

3 Preliminaries

3.1 Probabilistic Classification Vector Machine

The Probabilistic Classification Vector Machine (PCVM) Chen2009 uses a probabilistic kernel model

(1)   $f(\mathbf{x}; \mathbf{w}, b) = \Psi\Big(\sum_{i=1}^{N} w_i \phi_i(\mathbf{x}) + b\Big)$

with a link function $\Psi(\cdot)$, $w_i$ being the weights of the basis functions $\phi_i(\mathbf{x})$ and $b$ as bias term. The class assignment of a given sample $\mathbf{x}$ is given by Chen2009

(2)   $y(\mathbf{x}) = \begin{cases} +1 & \text{if } f(\mathbf{x}; \mathbf{w}, b) \geq 0.5 \\ -1 & \text{otherwise.} \end{cases}$

In the PCVM the basis functions $\phi_i(\cdot)$ are defined explicitly as part of the model design. In (1) the standard kernel trick can be applied Scholkopf.2001. The probabilistic output of the PCVM Chen2009 is calculated by using the probit link function, i.e.

(3)   $\Psi(x) = \int_{-\infty}^{x} \mathcal{N}(t \mid 0, 1)\, dt,$

where $\Psi(x)$ is the cumulative distribution function of the standard normal distribution $\mathcal{N}(0, 1)$. The PCVM Chen2009 uses the Expectation-Maximization (EM) algorithm for learning the model. However, the PCVM is not restricted to EM, and other optimization approaches like Monte Carlo techniques are also possible Chen2014b. The underlying sparsity framework within the optimization prunes unused basis functions, independent of the optimization approach, making the PCVM a sparse probabilistic learning machine. In the PCVM we use the standard RBF kernel with a Gaussian width $\theta$.

In Schleif2015 a PCVM with linear costs was suggested, which makes use of the Nyström approximation and could be additionally used to improve the run-time and memory complexity. Further details can be found in Chen2009 and Schleif2015.
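To make the model of equations (1)-(3) concrete, the following minimal sketch (an illustration only, not the reference implementation; the RBF parameterization and the input names are assumptions) shows how a trained PCVM, given its retained basis points, non-zero weights and bias, produces probabilistic predictions with the probit link.

```python
import numpy as np
from scipy.stats import norm

def rbf_kernel(X, B, theta):
    # Pairwise RBF kernel between data X (n x d) and basis points B (v x d).
    # One common parameterization of the Gaussian width theta is assumed here.
    d2 = ((X[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * theta ** 2))

def pcvm_predict_proba(X, basis, w, b, theta):
    # Equation (1): weighted sum over basis functions, squashed by the
    # probit link Psi of equation (3); returns P(y = +1 | x).
    K = rbf_kernel(X, basis, theta)
    return norm.cdf(K @ w + b)

def pcvm_predict(X, basis, w, b, theta):
    # Equation (2): threshold the probabilistic output at 0.5.
    return np.where(pcvm_predict_proba(X, basis, w, b, theta) >= 0.5, 1, -1)
```

Training itself, i.e. obtaining the sparse weights and the bias via EM and pruning basis functions, is described in Chen2009 and Schleif2015.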

3.2 Nyström Approximation

The computational complexity of calculating kernels or eigensystems scales with $\mathcal{O}(N^2)$ to $\mathcal{O}(N^3)$, where $N$ is the sample size SchleifGT18. Therefore, low-rank approximations and dimensionality reductions of data matrices are popular methods to speed up computational processes NIPS2000_1866. The Nyström approximation NIPS2000_1866 is a reliable technique to approximate a kernel matrix by a low-rank representation, without computing the eigendecomposition of the whole matrix:
By the Mercer theorem, kernels $k(\mathbf{x}, \mathbf{y})$ can be expanded by orthonormal eigenfunctions $\phi_i$ and non-negative eigenvalues $\lambda_i$ in the form

(4)   $k(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{\infty} \lambda_i \phi_i(\mathbf{x}) \phi_i(\mathbf{y}).$

The eigenfunctions and eigenvalues of a kernel are defined as solutions of the integral equation

(5)   $\int k(\mathbf{y}, \mathbf{x})\, \phi_i(\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x} = \lambda_i \phi_i(\mathbf{y}),$

where $p(\mathbf{x})$ is a probability density over the input space. This integral can be approximated based on the Nyström technique by an i.i.d. sample $\{\mathbf{x}_1, \ldots, \mathbf{x}_m\}$ from $p(\mathbf{x})$:

(6)   $\frac{1}{m} \sum_{k=1}^{m} k(\mathbf{y}, \mathbf{x}_k)\, \phi_i(\mathbf{x}_k) \approx \lambda_i \phi_i(\mathbf{y}).$

Using this approximation we denote with $K^{(m)}$ the corresponding Gram sub-matrix and get the corresponding matrix eigenproblem equation as

(7)   $K^{(m)} U^{(m)} = U^{(m)} \Lambda^{(m)},$

with $U^{(m)}$ being column orthonormal and $\Lambda^{(m)}$ a diagonal matrix. Now we can derive the approximations for the eigenfunctions and eigenvalues of the kernel

(8)   $\phi_i(\mathbf{y}) \approx \frac{\sqrt{m}}{\lambda_i^{(m)}}\, \mathbf{k}_{\mathbf{y}}^\top \mathbf{u}_i^{(m)}, \qquad \lambda_i \approx \frac{\lambda_i^{(m)}}{m},$

where $\mathbf{u}_i^{(m)}$ is the $i$th column of $U^{(m)}$ and $\mathbf{k}_{\mathbf{y}} = (k(\mathbf{x}_1, \mathbf{y}), \ldots, k(\mathbf{x}_m, \mathbf{y}))^\top$. Thus, we can approximate $\phi_i$ at an arbitrary point $\mathbf{y}$ as long as we know the vector $\mathbf{k}_{\mathbf{y}}$. For a given Gram matrix one may randomly choose $m$ rows and the corresponding columns. The corresponding indices are called landmarks and should be chosen such that the data distribution is sufficiently covered. Strategies for choosing the landmarks have recently been addressed in DBLP:journals/tnn/ZhangK10a; Kumar.2012; GittensM16; DBLP:journals/csda/BrabanterBSM10. The approximation is exact if the sample size is equal to the rank of the original matrix and the rows of the sample matrix are linearly independent.

3.3 Nyström Matrix Form

The technique just introduced can be simplified by rewriting it in matrix form Nemtsov2016, where scaling factors like in equation (8) are neglected. We will use the matrix formulation throughout the remaining paper. Again, given a Gram matrix $G \in \mathbb{R}^{N \times N}$, it can be decomposed into

(9)   $G = \begin{bmatrix} A & B \\ B^\top & C \end{bmatrix}$

with $A \in \mathbb{R}^{m \times m}$, $B \in \mathbb{R}^{m \times (N-m)}$ and $C \in \mathbb{R}^{(N-m) \times (N-m)}$. The submatrix $A$ is called the landmark matrix, containing $m$ randomly chosen rows and columns from $G$, and has the Eigen Value Decomposition (EVD) $A = U \Lambda U^\top$ as in equation (7), where the eigenvectors are $U$ and the eigenvalues are on the diagonal of $\Lambda$. The remaining approximated eigenvectors of $G$, i.e. those belonging to $B$ or $C$, are obtained by the Nyström method with $B^\top U \Lambda^{-1}$ as in equation (8). Combining $U$ and $B^\top U \Lambda^{-1}$, the full approximated eigenvectors of $G$ are

(10)   $\hat{U} = \begin{bmatrix} U \\ B^\top U \Lambda^{-1} \end{bmatrix}.$

The eigenvectors of $G$ can be inverted by computing (note $A^{-1} = U \Lambda^{-1} U^\top$):

(11)   $\hat{U}^{-1} = \Lambda^{-1} U^\top \begin{bmatrix} A & B \end{bmatrix}.$

Combining equation (10), equation (11) and $A = U \Lambda U^\top$, the matrix $G$ is approximated by

(12)   $\hat{G} = \hat{U} \Lambda \hat{U}^{-1} = \begin{bmatrix} A & B \\ B^\top & B^\top A^{-1} B \end{bmatrix}.$

The Nyström approximation error is given by the Frobenius norm between the ground truth and the reconstructed matrix, i.e. $\|G - \hat{G}\|_F$.
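The matrix formulation of equations (9)-(12) can be illustrated in a few lines of numpy. The sketch below is a toy illustration under the notation above (synthetic PSD matrix, random landmark selection), not the authors' code; it reconstructs the Gram matrix from the landmark blocks and reports the Frobenius error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a synthetic PSD Gram matrix G of rank r < N.
N, r, m = 500, 30, 60
Z = rng.standard_normal((N, r))
G = Z @ Z.T

# Landmark selection: m random rows/columns (equation (9)).
idx = rng.choice(N, size=m, replace=False)
rest = np.setdiff1d(np.arange(N), idx)
A = G[np.ix_(idx, idx)]          # landmark matrix
B = G[np.ix_(idx, rest)]         # off-diagonal block

# Reconstruction of equation (12): G_hat = [[A, B], [B^T, B^T A^+ B]].
# A pseudo-inverse is used in case the landmark block is rank deficient.
C_hat = B.T @ np.linalg.pinv(A) @ B
G_hat = np.block([[A, B], [B.T, C_hat]])

# Compare against G with rows/columns permuted into landmark-first order.
order = np.r_[idx, rest]
G_perm = G[np.ix_(order, order)]
print("relative Frobenius error:",
      np.linalg.norm(G_perm - G_hat) / np.linalg.norm(G_perm))
```

Since the number of landmarks (60) exceeds the rank of the synthetic matrix (30), the printed error is essentially zero, matching the exactness statement above.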

3.4 Kernel Approximation

The Nyström approximation NIPS2000_1866 speeds up kernel computations, because the kernel does not have to be evaluated over all points, but only over a fraction of them. Let $K \in \mathbb{R}^{N \times N}$ be a Mercer kernel matrix with a decomposition as in equation (9). Again we pick $m$ samples with $m \ll N$, leading to the landmark matrix $K_{m,m}$ as defined before. An approximated kernel is constructed by combining equations (12) and (9):

(13)   $\hat{K} = K_{N,m} K_{m,m}^{-1} K_{m,N},$

with $K_{N,m}$ being the sub-matrix of $K$ incorporating all rows and the landmark columns and $K_{m,m}$ being the landmark matrix. Based on the definition of kernel matrices, $K_{m,N} = K_{N,m}^\top$ is valid and, therefore, only $K_{N,m}$ and $K_{m,m}$ must be computed. Note that the kernel approximation is used in section 4.1.
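A corresponding minimal sketch for equation (13) (an illustration with an assumed RBF kernel helper, not part of the original work): only the $N \times m$ and $m \times m$ kernel blocks are ever evaluated.

```python
import numpy as np

def rbf(X, Y, theta):
    # Assumed RBF kernel between row sets X and Y.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * theta ** 2))

def nystroem_kernel(X, m, theta, seed=0):
    # Equation (13): K_hat = K_{N,m} K_{m,m}^{-1} K_{m,N};
    # only K_{N,m} and K_{m,m} are computed, never the full N x N kernel.
    rng = np.random.default_rng(seed)
    landmarks = X[rng.choice(len(X), size=m, replace=False)]
    K_Nm = rbf(X, landmarks, theta)
    K_mm = rbf(landmarks, landmarks, theta)
    return K_Nm @ np.linalg.pinv(K_mm) @ K_Nm.T
```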

3.5 General Matrix Approximation

Despite its initial restriction to kernel matrices, recent research has expanded the Nyström technique to approximate a Singular Value Decomposition (SVD) Nemtsov2016.
The Nyström-SVD generalizes the concept of matrix decomposition with the consequence that the respective matrices need not be square.
Let $X$ be a rectangular matrix with a decomposition as in equation (9), where the lower-left block is denoted $F$ instead of $B^\top$. The SVD of the landmark matrix is given by $A = U \Sigma V^\top$, where $U$ are left and $V$ are right singular vectors and $\Sigma$ contains the positive singular values. The decomposition matrices have the same sizes as in section 3.2. Similarly to the EVD in section 3.2, the left and right singular vectors for the non-symmetric parts $F$ and $B$ are obtained via Nyström techniques Nemtsov2016 and are defined as $F V \Sigma^{-1}$ and $B^\top U \Sigma^{-1}$, respectively.
Applying the same principles as for the Nyström-EVD, $X$ is approximated by

(14)   $\hat{X} = \hat{U} \Sigma \hat{V}^\top, \qquad \hat{U} = \begin{bmatrix} U \\ F V \Sigma^{-1} \end{bmatrix}, \quad \hat{V} = \begin{bmatrix} V \\ B^\top U \Sigma^{-1} \end{bmatrix}.$

Note that for non-Gram matrices like $X$, $F = B^\top$ is no longer valid. The matrix approximation (equation (14)) described in this section is used in the performance extension in section 4.2.
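The rectangular case of equation (14) can be sketched analogously. For readability the sketch below takes the leading $k$ rows and columns as landmarks, which is an assumption for illustration; the approximation is only accurate for (approximately) low-rank matrices and a full-rank landmark block.

```python
import numpy as np

def nystroem_svd(X, k):
    """Approximate X (n x d) from its leading k x k landmark block.

    Returns U_hat (n x k), S (k,), V_hat (d x k) such that
    X_hat = U_hat @ diag(S) @ V_hat.T, as in equation (14).
    Assumes the landmark block has full rank."""
    A = X[:k, :k]                 # landmark matrix
    B = X[:k, k:]                 # top-right block
    F = X[k:, :k]                 # bottom-left block
    U, S, Vt = np.linalg.svd(A)
    V = Vt.T
    S_inv = np.diag(1.0 / S)
    U_hat = np.vstack([U, F @ V @ S_inv])        # extrapolated left vectors
    V_hat = np.vstack([V, B.T @ U @ S_inv])      # extrapolated right vectors
    return U_hat, S, V_hat

# Shape check on random data (the reconstruction is only close to X
# when X is approximately low rank).
X = np.random.default_rng(1).standard_normal((200, 50))
U_hat, S, V_hat = nystroem_svd(X, k=20)
X_hat = U_hat @ np.diag(S) @ V_hat.T
```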

4 Nyström Transfer Learning

4.1 Transfer Kernel Sparsity-Extension

The domain invariant TKL Long2015 technique is part of the first transfer learning extension. Based on the experimental results shown in section 5, the use of TKL retains the model sparsity of the PCVM. Given the categories of transfer learning Weiss2016 introduced in section 2, the approach can be seen as relational-knowledge transfer. It approximates the source kernel by the target kernel via the Nyström method. It aims to find a source kernel close to the target distribution and simultaneously searches for an eigensystem of the source kernel which minimizes the squared Frobenius norm between the ground truth and the approximated kernel Long2015.

Let $X_S \in \mathbb{R}^{n \times d}$ be training data sampled from $P(X_S)$ with labels $Y_S$ and $X_T \in \mathbb{R}^{m \times d}$ be test data sampled from $P(X_T)$ with labels $Y_T$. Note that we assume $P(X_S) \neq P(X_T)$. We obtain the associated kernel matrices $K_S$ for training and $K_T$ for testing by evaluating an appropriate kernel function (e.g. the RBF kernel). For clarity, we rewrite equation (9) in kernel notation

(15)   $K = \begin{bmatrix} K_S & K_{ST} \\ K_{ST}^\top & K_T \end{bmatrix},$

with $K_S$ as train-, $K_T$ as test- and $K_{ST}$ as cross-domain kernel between the domains. Revisiting the original Nyström approach, it uses randomly chosen columns and rows from $K$; in TKL, however, the target kernel $K_T$ is seen as the landmark matrix and is used for approximating the training kernel $K_S$. Hence, landmarks are not randomly picked from $K$, but $K_T$ is used as landmark matrix and is the complete target set; therefore the approximation uses $m$ landmarks. $K_S$ is not used in the landmark selection. This differs from the original Nyström approach.
The TKL approach assumes that the distribution differences are sufficiently aligned if the approximated training kernel resembles the target kernel, which also leads to similar distributions in the RKHS Long2015.
Rewriting equation (13), we can create an approximated training kernel by

(16)

where $\Lambda_S$ denotes the eigenvalues of the source kernel $K_S = U_S \Lambda_S U_S^\top$ and $\Lambda_T$ the eigenvalues of the target kernel $K_T = U_T \Lambda_T U_T^\top$. This approximated kernel alone should not reduce the distribution differences sufficiently. Hence, the eigenvectors of the approximated training kernel are constructed by the target eigensystem

(17)   $\bar{U}_S = K_{ST} U_T \Lambda_T^{-1}.$

To fully approximate the eigensystem of the training kernel, the eigenvalues are defined as model parameters of TKL, leading to the approximated kernel $\bar{K}_S = \bar{U}_S \Lambda \bar{U}_S^\top$. These new parameters must be well-chosen to reduce domain differences while keeping the original training information. The following optimization problem was suggested in Long2015 to solve this issue

(18)   $\min_{\Lambda} \; \| \bar{U}_S \Lambda \bar{U}_S^\top - K_S \|_F^2 \quad \text{s.t.} \quad \lambda_i \geq \zeta \lambda_{i+1},\; i = 1, \ldots, m-1; \quad \lambda_m \geq 0,$

with $\zeta$ as eigenspectrum damping factor. The obtained kernel is domain invariant Long2015 and can be used in any kernel machine. The complexity of the TKL algorithm depends on the number of used eigenvectors and the dimensionality of the data Long2015. TKL in combination with the PCVM is called Probabilistic Classification Transfer Kernel Vector Machine (PCTKVM).
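For illustration only, the following sketch assembles a transfer kernel from the target eigensystem as in equation (17). It is not the TKL algorithm itself: the eigenspectrum relaxation of equation (18) is an optimization problem in Long2015, which is replaced here by simply reusing the leading target eigenvalues as a stand-in; the RBF helper and parameter names are assumptions.

```python
import numpy as np

def rbf(X, Y, theta):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * theta ** 2))

def extrapolated_transfer_kernel(Xs, Xt, theta, n_eig=20):
    # Target kernel and its leading eigensystem (target = landmark set).
    Kt = rbf(Xt, Xt, theta)
    lam, U = np.linalg.eigh(Kt)                     # ascending order
    lam, U = lam[::-1][:n_eig], U[:, ::-1][:, :n_eig]
    # n_eig should be small enough that the kept eigenvalues are well above zero.

    # Equation (17): Nystroem extrapolation of the source eigenvectors
    # from the target eigensystem via the cross-domain kernel.
    Kst = rbf(Xs, Xt, theta)
    Us_bar = Kst @ U @ np.diag(1.0 / lam)

    # Stand-in for the learned eigenspectrum of equation (18):
    # the target eigenvalues are reused unchanged here.
    return Us_bar @ np.diag(lam) @ Us_bar.T
```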

4.1.1 Properties of PCTKVM algorithm

The TKL kernel is used to train the PCVM. In general, an RBF kernel with an in-place optimized width parameter $\theta$ is used in the PCVM. Accordingly, a simple replacement of the standard RBF kernel by a kernel obtained with TKL would be inefficient: in the PCVM the kernel is recalculated in each iteration, based on the $\theta$ optimized in the previous iteration. Consequently, we would have to recalculate the entire transfer kernel too. The complexity of the standard PCVM grows with the number of basis functions $B$, where $B = n$ at the beginning of the training and before pruning basis functions; combining both naively, the cost of TKL would have to be paid in every PCVM iteration on top of this.

However, the performance of the PCTKVM strongly depends on the quality of $\theta$. Hence, a reasonable $\theta$ must be obtained via grid search, because in-place optimization is infeasible. The PCTKVM using a fixed $\theta$ therefore only requires a single computation of the transfer kernel in addition to the PCVM training. Note that TKL and PCTKVM are restricted to kernels, but are independent of the kernel type.

Predictions are made with the PCVM prediction function, but employing the transfer kernel for the test data Long2015. In case of the SVM as baseline classifier, the cross-domain kernel of size $m \times n$ is used, restricted to the respective support vectors. The prediction function of the SVM has the form $f(\mathbf{z}) = \operatorname{sign}\big(\sum_{i} \alpha_i y_i \bar{k}(\mathbf{z}, \mathbf{x}_i) + b\big)$, where the $\alpha_i$ are the Lagrange multipliers Long2015.

Because of the sparsity of the PCVM, the number of basis functions used in the decision function is typically small (the decision function may be constructed from only a few samples). If we consider that our model has $v$ non-zero weights with $v \ll n$, and because the PCVM uses only the kernel rows/columns corresponding to the non-zero weight indices, our final kernel for prediction has size $m \times v$. Therefore, the prediction function of the PCTKVM has the same form as equation (1), restricted to the $v$ remaining basis functions and evaluated with the transfer kernel. The probabilistic output is calculated with the probit link function used in the PCVM. Pseudo code of the sparsity extension is shown in algorithm 1.

1: $X_S$ as $n \times d$ sized training and $X_T$ as $m \times d$ sized test set; $Y_S$ as $n \times 1$ sized training label vector; kernel(-type) ker; eigenspectrum damping factor $\zeta$; $\theta$ as kernel parameter.
2: Weight vector $\mathbf{w}$; bias $b$; kernel parameter $\theta$; transfer kernel $\bar{K}$.
3: $D$ = calculate_dissimilarity_matrix($X_S$, $X_T$);
4: $\bar{K}$ = transfer_kernel_learning($D$, ker, $\zeta$, $\theta$); According to equation (18)
5: [$\mathbf{w}$, $b$] = pcvm_training($\bar{K}$, $Y_S$); According to section 3.1
Algorithm 1 Probabilistic Classification Transfer Kernel Vector Machine

The sparsity extension performs similarly to common approaches, as shown in the experiments in section 5, but to fully capture the advantages of Nyström techniques and the PCVM, the following section presents a transfer learning approach focused on performance.

4.2 Nyström Basis Transfer Performance-Extension

It is a reasonable strategy in TKL to align kernel matrices rather than kernel distributions in a Reproducing Kernel Hilbert Space (RKHS), since distribution alignment is non-trivial in the RKHS Long2015. Hence, TKL modifies the kernel explicitly to reduce the difference between two kernel matrices. Similar source and target kernels must be obtained, because the underlying classifier is kernel-based and performs no transfer learning itself.

In 6790375 it is shown that if source and target datasets are similar, they follow similar distributions, i.e. if $X_S \approx X_T$ then $P(X_S) \approx P(X_T)$, and further have similar kernel distributions and similar kernels. Therefore, we propose a transfer learning approach operating in Euclidean space rather than in an RKHS, because this does not limit the approach to kernel classifiers. Further, the kernels obtained after the transfer of data also follow similar distributions. A recent study stvm already showed great transfer capabilities and performance by aligning $X_S$ and $X_T$ with a small error in terms of the Frobenius norm. However, this requires the same sample sizes for $X_S$ and $X_T$, which is assumed in the following with size $n$. The study considered the following optimization problem

(19)   $\min_{M, N} \; \| M X_S N - X_T \|_F^2,$

where $M$ and $N$ are transformation matrices drawing the data closer together. A solution is found analytically, summarized in three steps stvm: First, normalize the data to standard mean and variance. This aligns the marginal distributions in Euclidean space without considering label information stvm. Second, compute an SVD of source and target data, i.e. $X_S = U_S \Sigma_S V_S^\top$ and $X_T = U_T \Sigma_T V_T^\top$. Next, the approach assumes $\Sigma_S \approx \Sigma_T$ in terms of the Frobenius norm due to the normalization to zero mean and variance one, which brings the scaling factors of the singular values into the same range. Finally, compute a solution for equation (19) by solving the linear equations. One obtains $M = U_T U_S^\top$ and $N = V_S V_T^\top$. Note that $M X_S N = U_T \Sigma_S V_T^\top$. Apply the transfer operation and approximate the source matrix by using the target basis information

(20)   $\hat{X}_S = M X_S N = U_T \Sigma_S V_T^\top,$

with $\hat{X}_S$ as approximated source data, used for training. The three-step process is shown in figure 2 as a geometrical interpretation, demonstrated with a toy example created by Gaussian random sampling.

Figure 2 (panels: (a) Data unnormalized, (b) Data after standard normalization, (c) Data after Basis-Transfer): Process of Basis-Transfer with samples from two domains. Class information is given by color (red/blue) and the domain is indicated by the marker shape. First (a), the non-normalized data with a knowledge gap. Second (b), the normalized feature space. Third (c), the Basis-Transfer approximation is applied, correcting the samples, i.e. shapes with the same color are aligned, and the training data is usable for learning a classification model. Best viewed in color.
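As a compact illustration of these three steps (a sketch under the notation reconstructed above, assuming equal sample sizes and per-feature standardization; not the reference implementation of stvm):

```python
import numpy as np

def basis_transfer(Xs, Xt):
    """Approximate the source matrix in the target basis, equation (20).

    Assumes Xs and Xt have the same shape and are already standardized
    to zero mean and unit variance per feature."""
    Us, Ss, Vst = np.linalg.svd(Xs, full_matrices=False)
    Ut, St, Vtt = np.linalg.svd(Xt, full_matrices=False)
    # With M = Ut Us^T and N = Vs Vt^T it holds that M Xs N = Ut diag(Ss) Vt^T.
    return Ut @ np.diag(Ss) @ Vtt

# Toy usage with Gaussian random sampling, as in figure 2.
rng = np.random.default_rng(0)
Xs = rng.normal(loc=2.0, size=(100, 2))
Xt = rng.normal(loc=-1.0, size=(100, 2))
Xs = (Xs - Xs.mean(0)) / Xs.std(0)   # step 1: standard normalization
Xt = (Xt - Xt.mean(0)) / Xt.std(0)
Xs_hat = basis_transfer(Xs, Xt)      # steps 2 and 3: SVDs and transfer
```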

In the following, the work stvm is continued, and we propose a Nyström based version with three main improvements: reduction of computational complexity via Nyström, implicit dimensionality reduction and dropping the sample size requirements of BT. Further, we introduce a data augmentation strategy that eliminates the restriction of stvm to the task of text transfer learning.

Recall equation (19) and consider a slightly changed optimization problem

(21)   $\min_{M} \; \| M X_S - X_T \|_F^2,$

where only a transformation matrix $M$ must be found, which is again obtained analytically. Because we apply a dimensionality reduction technique, just the left-sided transformation matrix must be determined, which is derived in the following: Based on the relationship between SVD and EVD, the Principal Component Analysis (PCA) can be rewritten in terms of the SVD. Consider the target matrix $X_T$ with SVD

(22)   $X_T = U_T \Sigma_T V_T^\top,$

where $V_T$ are the eigenvectors and $\Sigma_T^2$ the eigenvalues of $X_T^\top X_T$. By choosing only the $k$ biggest eigenvalues and corresponding eigenvectors, the dimensionality of $X_T$ is reduced by

(23)   $\tilde{X}_T = X_T V_T^{(k)} = U_T^{(k)} \Sigma_T^{(k)},$

where $U_T^{(k)} \in \mathbb{R}^{m \times k}$, $\Sigma_T^{(k)} \in \mathbb{R}^{k \times k}$ and $\tilde{X}_T \in \mathbb{R}^{m \times k}$ is the reduced target matrix. Hence, only a left-sided transformation is required in equation (21), because the right-sided transformation is omitted in equation (23).

The computational complexity of BT and PCA is decreased by applying the Nyström-SVD: Let $X_S$ and $X_T$ have a decomposition as given in equation (9). Note that for clarity the Nyström notation is used as in section 3.5. For a Nyström-SVD we choose $k$ columns/rows from both matrices, obtaining the landmark matrices $A_S$ and $A_T$. Based on the Nyström-SVD in equation (14), the dimensionality is reduced as in equation (23), keeping only the most relevant data structures

(24)   $\tilde{X}_T = \hat{U}_T \Sigma_T,$

where $\hat{U}_T$ are the Nyström-approximated left singular vectors of $X_T$ and $\Sigma_T$ are the singular values of the landmark matrix $A_T$. Hence, it is sufficient to compute an SVD of the $k \times k$ landmark matrix instead of $X_T$ with $k \ll m$, which is considerably lower in computational complexity. Analogously, we approximate the source data by $\tilde{X}_S = \hat{U}_S \Sigma_S$. Since we again assume $\Sigma_S \approx \Sigma_T$ due to the data normalization, solving the linear equation as a possible solution for equation (21) leads to $M = \hat{U}_T \hat{U}_S^\top$. Plugging it back we obtain

(25)   $\hat{X}_S = M \tilde{X}_S = \hat{U}_T \hat{U}_S^\top \hat{U}_S \Sigma_S \approx \hat{U}_T \Sigma_S,$

where again the basis of the target data transfers structural information into the training domain. The matrix $\hat{X}_S$ is used for training and $\tilde{X}_T$ is used for testing. According to Weiss2016, it is an asymmetric transfer approach. Further, it is transductive 5288526 and does not need labeled target data. For further reference, we call the approach Nyström Basis Transfer (NBT) and, in combination with the PCVM, Nyström Transfer Vector Machine (NTVM).
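Putting equations (22)-(25) together, a minimal NBT sketch could look as follows. It uses the leading $k$ rows/columns as landmark blocks for readability and assumes $k \le \min(n, m, d)$ as well as equal sample sizes; the authors' implementation (https://github.com/ChristophRaab/ntvm) may differ in detail.

```python
import numpy as np

def nystroem_basis_transfer(Xs, Xt, k):
    """Nystroem Basis Transfer sketch (equations (22)-(25)).

    Xs: (n x d) standardized source data, Xt: (m x d) standardized target
    data; n == m is enforced beforehand by the data augmentation step so
    that the source labels still match the transformed training data.
    Returns the k-dimensional training and test representations."""
    # k x k landmark blocks and the remaining target rows (equation (9)).
    As = Xs[:k, :k]
    At, Ft = Xt[:k, :k], Xt[k:, :k]

    # SVDs of the small landmark matrices only.
    _, Ss, _ = np.linalg.svd(As)          # only source singular values needed
    Ut, St, Vt_t = np.linalg.svd(At)
    Vt = Vt_t.T

    # Nystroem-extrapolated left singular vectors of the target (eq. (14)).
    Ut_hat = np.vstack([Ut, Ft @ Vt @ np.diag(1.0 / St)])

    Xt_red = Ut_hat @ np.diag(St)   # reduced target data, equation (24)
    Xs_hat = Ut_hat @ np.diag(Ss)   # approximated source data, equation (25)
    return Xs_hat, Xt_red
```

The PCVM is then trained on the returned training matrix with the original source labels and evaluated on the reduced target data, as in algorithm 2.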

4.2.1 Properties of Nyström Basis Transfer

We showed that NBT is a valid PCA approximation by equation (23). It follows from the definition of the SVD that $U_T$, and approximately $\hat{U}_T$, is an orthogonal basis. Therefore, equation (24) and equation (25) are orthogonal transformations. In particular, equation (25) transforms the source data into the target subspace and projects it onto the principal components of $X_T$. If the data matrices $X_S$ and $X_T$ are standard normalized (experimental data are standard normalized to mean zero and variance one in the preprocessing), the geometric interpretation is a rotation of the source data w.r.t. the angles of the target basis, as already shown in figure 2.
The computational complexity of NBT is given by the complexity of the Nyström-SVD and the calculation of the respective landmark matrices $A_S$ and $A_T$, i.e. an SVD of a $k \times k$ matrix with complexity $\mathcal{O}(k^3)$. The inversion of the diagonal matrix $\Sigma_T$ needed for $\hat{U}_T$ can be neglected. The remaining matrix multiplications are of order $\mathcal{O}(m k^2)$, contributing to an overall complexity of NBT of $\mathcal{O}(k^3 + m k^2)$ with $k \ll m$. This makes NBT the fastest transfer learning solution in terms of computational complexity in comparison to the discussed methods in section 2.
The approximation error is similar to the original Basis-Transfer stvm error:

(26)   $\| \hat{X}_S - \tilde{X}_T \|_F = \| \hat{U}_T \Sigma_S - \hat{U}_T \Sigma_T \|_F = \| \hat{U}_T (\Sigma_S - \Sigma_T) \|_F.$

4.2.2 Data Augmentation

In BT stvm, the sample sizes of the data matrices must be aligned. This is not required in NBT, as seen in equation (24) and equation (25), because differences in size are aligned during the transformation. However, the original dataset has an $n$ sized label vector $Y_S$ with $n \neq m$, which does not correspond to the transformed training data, and this label vector should not be transformed into the new size because the semantic label information would no longer correspond to the transformed data. Hence, the sample sizes must still be the same, i.e. $n = m$, although this is not required by the definition of NBT itself. We propose a data augmentation strategy for solving different sample sizes, applied before doing the knowledge transfer. Data augmentation is common in machine or deep learning and has a variety of applications Zhang2016; Hauberg16. However, source and target data should have a reasonable size to properly encode domain knowledge.

In general, there are two cases. The first is $n < m$, meaning there is not enough source data. This is handled by sampling from a class-wise multivariate Gaussian distribution, harmonizing the number of samples per class of the source data. The other case is $n > m$ and is solved by uniform random removal of source data from the largest class, i.e. the class $c_{\max} = \arg\max_c N_c$ with $N_c$ as the number of samples of class $c$, in the source set $X_S$. The approach reduces the source data to size $m$. This is somewhat counter-intuitive, because one does not want to reduce the source set. However, we have no class information of the target set at training time, and we would be guessing class labels of target data when adding new artificial samples. The data augmentation strategy is summarized as

(27)   $X_S \leftarrow \begin{cases} X_S \cup \{\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}_c, \boldsymbol{\sigma}_c^2)\} & \text{if } n < m, \\ X_S \setminus \{\mathbf{x} \in X_S \mid y(\mathbf{x}) = c_{\max}\} & \text{if } n > m, \end{cases}$

applied until $n = m$, where $\boldsymbol{\mu}_c$ is the class-wise mean and $\boldsymbol{\sigma}_c^2$ the class-wise variance. The function $y(\mathbf{x})$ maps a training sample to its ground truth label and $N_c$ is the number of samples of class $c$.
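A sketch of the two cases of equation (27), applied repeatedly until the source set has exactly $m$ samples (a hypothetical helper assuming integer class labels and a diagonal-covariance Gaussian per class; the concrete sampling details are assumptions):

```python
import numpy as np

def augment_source(Xs, ys, m, seed=0):
    """Grow or shrink the source set to exactly m samples (equation (27))."""
    rng = np.random.default_rng(seed)
    Xs, ys = Xs.copy(), ys.copy()
    while len(ys) < m:
        # n < m: sample from a class-wise Gaussian, preferring the currently
        # smallest class to harmonize the class sizes.
        classes, counts = np.unique(ys, return_counts=True)
        c = classes[np.argmin(counts)]
        Xc = Xs[ys == c]
        x_new = rng.normal(loc=Xc.mean(axis=0), scale=Xc.std(axis=0) + 1e-12)
        Xs = np.vstack([Xs, x_new])
        ys = np.append(ys, c)
    while len(ys) > m:
        # n > m: uniformly remove one sample from the currently largest class.
        classes, counts = np.unique(ys, return_counts=True)
        c = classes[np.argmax(counts)]
        drop = rng.choice(np.flatnonzero(ys == c))
        Xs = np.delete(Xs, drop, axis=0)
        ys = np.delete(ys, drop)
    return Xs, ys
```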

Pseudo code of Data Augmentation, NBT and PCVM summarized as NTVM is shown in algorithm 2.

1: $X_S$ as $n \times d$ sized training set; $X_T$ as $m \times d$ sized test set; $Y_S$ as $n \times 1$ sized training label vector; $k$ as number of landmarks parameter; $\theta$ as RBF-kernel parameter.
2: Weight vector $\mathbf{w}$; bias $b$.
3: [$X_S$, $X_T$] = standard_normalization($X_S$, $X_T$) Similar as in Fig. 2
4: [$X_S$, $Y_S$] = data_augmentation($X_S$, $Y_S$) According to equation (27)
5: [$A_S$, $F_S$] = matrix_decomposition($X_S$, $k$) According to equation (9)
6: [$A_T$, $F_T$] = matrix_decomposition($X_T$, $k$) According to equation (9)
7: $\Sigma_S$ = svd($A_S$); Singular values of landmark matrix of $X_S$
8: [$U_T$, $\Sigma_T$, $V_T$] = svd($A_T$); SVD of landmark matrix of $X_T$
9: $\hat{U}_T$ = $[U_T;\; F_T V_T \Sigma_T^{-1}]$ According to equation (24)
10: $\tilde{X}_T = \hat{U}_T \Sigma_T$ According to equation (24)
11: $\hat{X}_S = \hat{U}_T \Sigma_S$ According to equation (25) and similar to Fig. 2
12: [$\mathbf{w}$, $b$] = pcvm_training($\hat{X}_S$, $Y_S$, $\theta$); According to Chen2009
Algorithm 2 Nyström Transfer Vector Machine (NTVM)

5 Experiments

We follow the experimental design typical for transfer learning algorithms Long2015; Gong2017; Long2013; 5288526; Pan2011. A crucial characteristic of datasets for transfer learning is that domains for training and testing are different but related. This relation exists because the train and test classes have the same top category or source. The classes themselves are subcategories or subsets. The parameters for the respective methods (source code, parameters and datasets obtainable via https://github.com/ChristophRaab/ntvm) are determined for best performance in terms of accuracy via grid search evaluated on source data.

5.1 Dataset Description

The study uses benchmark datasets which are already preprocessed: Reuters from WenyuanDai.2007, 20-Newsgroup from Long2014a and Caltech-Office from Gong2017. A summary of the image and text datasets is shown in table 1 and table 2. The respective datasets are detailed in the following.

5.1.1 Image Datasets

Caltech-256 (https://people.eecs.berkeley.edu/~jhoffman/domainadapt/#datasets_code) - Office: The first, Caltech (C), is an extensive dataset of images and initially contains 30607 images within 257 categories. However, in this setting, only the 1123 images related to the Office dataset are used. We adopt the sampling scheme from Gong2017. The Office dataset is a collection of images drawn from three sources, which are Amazon (A), digital SLR camera DSLR (D) and webcam (W). They vary regarding camera, lighting situation and size, but have 31 object categories, e.g. computer or printer, in common. Duplicates are removed, as well as images which have more than 15 similar Scale Invariant Feature Transform (SIFT) features in common.
To get an overall collection of the four image sets, which are considered as domains, categories with the same description are taken. From the Caltech and Office dataset, ten similar categories are extracted: backpack, touring-bike, calculator, head-phones, computer-keyboard, laptop-101, computer-monitor, computer mouse, coffee-mug, and projector. They are the class labels from one to ten.
With this, a classifier should be trained in the training domain, e.g. on projector images (Class One) from Amazon (Domain A), and should be able to classify test images into the corresponding image category, e.g. projector (Class One) images from Caltech (Domain C), against other image types like head-phones (Class Two). The final feature extraction is done with Speeded Up Robust Features (SURF) and encoded with 800-bin histograms. Finally, the twelve combinations of domain datasets are designed to be trained and tested against each other with the ten labels Gong2017. An overview of the image datasets is given in table 1.

Dataset | Subsets | #Samples | #Features | #Classes
Caltech-256 | Caltech (C) | 1123 | 800 | 10
Office | Amazon (A) | 958 | |
       | DSLR (D) | 295 | |
       | Webcam (W) | 157 | |
Table 1: Dataset characteristics of the image domain adaptation datasets, containing datasets and corresponding subsets, numbers of samples, features and classes.

5.1.2 Text Datasets

In the following, the text datasets are discussed. The arrangement of the text domain adaptation datasets differs from that of the image datasets. The text datasets are structured into top categories and subcategories. These top categories are regarded as labels and the subcategories are used for training and testing. The variation of subcategories between training and testing creates the transfer problem. The difference to the image datasets is that there the (sub-)categories are the labels and the difference in the top category (source of images) between training and testing, e.g. Caltech to Amazon, creates the transfer problem. An overview of the text datasets is given in table 2.

Reuters-21578 (http://www.daviddlewis.com/resources/testcollections/reuters21578): A collection of Reuters newswire articles collected in 1987 with a hierarchical structure of top categories and subcategories to organize the articles. The three top categories Organization (Orgs), Places and People are used in our experiments. The category Orgs has 56 subcategories, Places has 176 and People has 269. In the category Places, all articles about the USA are removed, making the three categories nearly evenly distributed in terms of articles.

We follow the sampling scheme from WenyuanDai.2007, which will be discussed in the following. Note that the top categories, which are just mentioned, are the labels of the datasets.

All subcategories of a top category are randomly divided into two parts with about the same number of articles. This selection is fixed for all experiments. For a top category $A$ this creates two parts $A_1$ and $A_2$ of subcategories, and for another top category $B$ this creates the parts $B_1$ and $B_2$. The top category $A$ is regarded as one class and $B$ as the other.

The transfer problem is created by using $A_1$ and $B_1$ for training, while $A_2$ and $B_2$ are used for testing. This is a two-class problem because of the two top categories $A$ and $B$. Such a configuration is called $A$ vs $B$. If the second parts are used for training, i.e. $A_2$ and $B_2$, and the first parts for testing, i.e. $A_1$ and $B_1$, it is regarded as the reversed configuration. The individual subcategories have different distributions but are related by the top category, creating a change in distribution between training and testing.

Based on this, six datasets are generated: Orgs vs. Places, Orgs vs. People, People vs. Places, Places vs. Orgs, People vs. Orgs and Places vs. People. The articles are converted to lower case, words are stemmed and stopwords are removed. With a Document Frequency (DF) threshold of 3, the number of features is reduced. The features are generated with Term-Frequency Inverse-Document-Frequency (TF-IDF). For the detailed choice of subcategories see WenyuanDai.2007.

Top Category (Names) | Subcategory | #Samples | #Features | #Classes
Comp | comp.graphics | 970 | 25804 | 2
     | comp.os.ms-windows.misc | 963 | |
     | comp.sys.ibm.pc.hardware | 979 | |
     | comp.sys.mac.hardware | 958 | |
Rec | rec.autos | 987 | |
    | rec.motorcycles | 993 | |
    | rec.sport.baseball | 991 | |
    | rec.sport.hockey | 997 | |
Sci | sci.crypt | 989 | |
    | sci.electronics | 984 | |
    | sci.med | 987 | |
    | sci.space | 985 | |
Talk | talk.politics.guns | 909 | |
     | talk.politics.mideast | 940 | |
     | talk.politics.misc | 774 | |
     | talk.religion.misc | 627 | |
Orgs | 56 subcategories | 1237 | 4771 | 2
People | 269 subcategories | 1208 | |
Places | 176 subcategories | 1016 | |
Table 2: Dataset characteristics of the text domain adaptation datasets, containing top categories and corresponding subcategories, numbers of samples, features and labels. The horizontal line separates the datasets into 20-Newsgroup (upper half) and Reuters (lower half) Long2015. For Reuters there are many subcategories; therefore we only show their number.

20-Newsgroup (http://qwone.com/~jason/20Newsgroups/): The original collection has approximately 20000 text documents from 20 newsgroups. The four top categories are comp, rec, talk and sci, containing four subcategories each. We follow the data sampling scheme introduced by Long2015 and generate 216 cross-domain datasets based on subcategories: Again, the top categories are the labels and the subcategories are varied between training and testing to create a transfer problem.

Let $P$ be a top category with subcategories $p_1, \ldots, p_4$ and let $Q$ be another top category with subcategories $q_1, \ldots, q_4$. A dataset is constructed by selecting two subcategories of each top category, e.g. $p_1$, $p_2$, $q_1$ and $q_2$, for training and selecting another four, e.g. $p_3$, $p_4$, $q_3$ and $q_4$, for testing. The top categories $P$ and $Q$ are the respective classes.

For two top categories every permutation is used and therefore 36 datasets are generated per combination. By combining each top category with each other there are 216 dataset combinations. The datasets are summarized as the mean per top category combination, i.e. comp vs rec, comp vs talk, comp vs sci, rec vs sci, rec vs talk and sci vs talk. The transfer problem is created by training and testing on different subcategories, analogously to Reuters. This version of 20-Newsgroup has 25804 TF-IDF features within 15033 documents Long2015.

Note that to reproduce the results below, one should use the linked version of the datasets with the same choice of subcategories. Regardless of the dataset, features have been normalized to standard mean and variance. The samples for training and testing the classifiers are drawn with the 5x2-fold sampling scheme suggested by Alpaydm.1999, with a transfer learning adapted data sampling scheme as suggested in Gong2017.

5.2 Comparison of Prediction Performance

The results of the experiments are summarized in table 3, showing the mean errors of the cross-validation study per dataset. To determine statistically significant differences, we follow Chen2009, using the Friedman test Demsar2006 with Bonferroni-Dunn post-hoc correction. Statistically significant differences against NTVM are marked in table 3. The PCTKVM and NTVM are compared to standard transfer learning methods and to the non-transfer learning baseline classifiers, i.e. SVM and PCVM. The PCTKVM has overall comparable performance to the PCVM; however, it is worse on Newsgroup, showing negative transfer Weiss2016. This should be investigated in future work.

The NTVM method has excellent performance and outperforms every other algorithm by far. In the overall comparison, the NTVM is significantly better compared to the other approaches, except SA.

Dataset | SVM | PCVM | TCA | JDA | TKL | SA | PCTKVM (Our Work) | NTVM (Our Work)
Orgs vs People | 23.0 | 31.1 | 23.8 | 25.7 | 18.6 | 6.3 | 20.7 | 3.1
People vs Orgs | 21.1 | 28.1 | 20.3 | 24.9 | 13.0 | 5.7 | 23.2 | 3.1
Orgs vs Place | 30.8 | 33.4 | 28.7 | 27.4 | 22.7 | 6.4 | 33.9 | 2.6
Place vs Orgs | 35.8 | 35.7 | 34.1 | 34.1 | 17.5 | 6.6 | 40.6 | 2.4
People vs Place | 38.8 | 41.3 | 37.2 | 41.3 | 30.6 | 8.3 | 36.3 | 2.6
Place vs People | 41.3 | 41.2 | 43.4 | 43.3 | 34.0 | 11.5 | 39.7 | 2.5
Reuters Mean | 31.8 | 35.1 | 31.3 | 32.8 | 22.7 | 7.5 | 32.4 | 2.7
Comp vs Rec | 12.7 | 17.9 | 8.1 | 7.8 | 3.0 | 1.8 | 37.3 | 0.5
Comp vs Sci | 24.5 | 29.1 | 26.3 | 27.1 | 9.5 | 4.8 | 33.7 | 0.8
Comp vs Talk | 5.1 | 6.2 | 2.9 | 4.2 | 2.4 | 0.9 | 14.9 | 7.2
Rec vs Sci | 23.7 | 36.6 | 17.3 | 23.9 | 5.1 | 1.6 | 29.3 | 0.3
Rec vs Talk | 18.7 | 27.8 | 13.6 | 15.2 | 5.6 | 1.8 | 33.4 | 3.1
Sci vs Talk | 21.7 | 30.9 | 20.1 | 26.1 | 14.6 | 2.9 | 30.2 | 6.9
Newsgroup Mean | 17.8 | 24.7 | 14.7 | 17.4 | 6.7 | 2.3 | 29.8 | 3.1
C vs A | 48.0 | 55.9 | 49.7 | 49.0 | 49.1 | 38.0 | 52.8 | 19.9
C vs W | 53.8 | 57.0 | 55.1 | 53.5 | 53.1 | 68.1 | 55.2 | 59.5
C vs D | 62.2 | 65.9 | 65.2 | 58.9 | 59.4 | 68.3 | 60.9 | 47.4
A vs C | 54.8 | 59.5 | 53.8 | 54.6 | 53.9 | 42.4 | 56.6 | 37.5
A vs W | 61.0 | 66.1 | 58.6 | 58.8 | 57.2 | 67.5 | 59.7 | 55.6
A vs D | 62.6 | 64.8 | 66.5 | 59.2 | 59.2 | 64.9 | 61.6 | 48.9
D vs C | 68.6 | 72.5 | 62.4 | 61.1 | 65.2 | 66.1 | 68.7 | 19.9
D vs A | 69.0 | 72.4 | 65.5 | 65.8 | 64.6 | 66.2 | 67.0 | 35.6
D vs W | 40.6 | 62.5 | 28.2 | 30.1 | 27.4 | 30.7 | 45.8 | 47.9
W vs C | 65.9 | 67.8 | 62.1 | 65.3 | 63.2 | 59.4 | 63.9 | 16.1
W vs A | 67.6 | 69.0 | 63.3 | 67.0 | 63.0 | 64.7 | 68.4 | 38.5
W vs D | 23.1 | 41.1 | 21.6 | 22.1 | 27.4 | 32.5 | 45.5 | 59.8
Image Mean | 56.4 | 62.9 | 54.3 | 53.8 | 53.6 | 55.7 | 58.8 | 40.6
Overall Mean | 35.3 | 40.9 | 33.4 | 34.6 | 27.7 | 21.8 | 40.4 | 15.5
Table 3: Result of the cross-validation test shown as mean error per dataset, with the mean over each dataset group at the end of each section. Bold marks the winner; markers indicate statistically significant differences against NTVM. The study shows that none of the listed algorithms is statistically significantly better than NTVM.

Especially on Reuters, NTVM convinces with stable and best performance over multiple datasets. Table 3 shows that NTVM is significantly better in terms of the mean on Reuters compared to all approaches except TKL and SA.

The NTVM also outperforms the other methods most of the time on the image datasets, showing its capability to tackle multi-class problems and its independence from a particular domain adaptation task, unlike previous work stvm. Further, in terms of mean error on the image data, the NTVM outperforms SVM, PCVM, PCTKVM and SA with statistically significant differences.

The NTVM is also very good on Newsgroup, but not as outstanding. It is overall slightly worse than SA, but not with statistical significance. Further, it is best on half of the datasets and convinces with error rates under one percent.

Note that the standard deviation is not shown, because it will not provide more insights into the performance. It is overall very similar and small, because the underlying classifier is the same.

Figure 3 (left: Office - Caltech, right: Reuters): Relationship between the number of landmarks and the classification error, shown as the mean over all Office vs Caltech datasets (left) and as the mean over all Reuters datasets (right). For datasets smaller than the maximum rank, the number of landmarks is capped accordingly.

The sensitivity of the prediction error to the number of landmarks, the only parameter of NBT, is demonstrated in figure 3. It shows the relationship between the number of landmarks and the mean classification error over the Reuters and Office - Caltech datasets. The plot indicates that the error decreases to a global minimum with an increasing number of landmarks, which matches the expected Nyström error behaviour. However, when further increasing to the maximum number of landmarks, i.e. all samples, the error starts to increase again. We assume this indicates that only a subset of features is relevant for classification and the remaining features are noise. Further, this subset is drawn randomly; hence, by choosing different landmark matrices, other features become relevant or irrelevant, depending on which features they correlate with.

5.3 Comparison of Model Complexity

Dataset | PCVM | SVM | TCA | JDA | SA | TKL | PCTKVM | NTVM
Reuters (1153.66) | 49.07 | 441.78 | 168.51 | 201.87 | 100.87 | 351.21 | 1.97 | 329.51
Image (633.25) | 62.87 | 284.37 | 231.65 | 264.38 | 238.44 | 262.64 | 46.63 | 27.59
20 Newsgroup (3758.30) | 74.23 | 1247.10 | 269.75 | 252.49 | 211.57 | 1046.26 | 92.89 | 74.70
Overall Mean | 62.06 | 640.17 | 223.30 | 245.31 | 183.60 | 553.27 | 47.16 | 143.93
Table 4: Average mean number of model vectors of each classifier for the Reuters, Image and Newsgroup datasets. The average number of examples per dataset group is shown in brackets next to the group name.

We measure model complexity by the number of model vectors, e.g. support vectors. The results of our experiment are shown as means per dataset group in table 4. We see that the PCTKVM provides very sparse models while having good performance. The sparsity of NTVM is also very competitive; however, its overall sparsity is worse in comparison to PCTKVM and PCVM. In comparison to all non-PCVM methods, the PCTKVM is by far the sparsest.

The difference in model complexity is exemplarily shown in figure 4. It shows a sample classification result of PCTKVM and TKL-SVM on the text dataset Orgs vs People with the settings from above. The PCTKVM achieves its error with only three model vectors, while the TKL-SVM requires considerably more support vectors. The PCTKVM achieves sustained performance with a small model complexity and provides a way to interpret the model. Note that the algorithms are trained in the original feature space and the models are plotted in a reduced space, using the t-distributed stochastic neighbor embedding (t-SNE) algorithm vanDerMaaten.2008. Note that the Kullback-Leibler divergence of the data shown in figure 4 between the input distribution (original space) and the output distribution (reduced space) is 0.92.

Figure 4 (panels: (a) PCTKVM, (b) TKL-SVM): Sample run on Orgs vs People (text dataset). Red marks the class Orgs and blue the class People. The plot includes training and testing data. Model complexity of PCTKVM is shown on the left and of TKL-SVM on the right; the PCTKVM uses three model vectors, while the TKL-SVM needs considerably more support vectors. The black circled points are the used model vectors. Reduced with t-SNE vanDerMaaten.2008. Best viewed in color.

5.4 Time Comparison

The mean time results in seconds of the cross-validation study per dataset group are shown in table 5. Note that SVM and PCVM are the underlying classifiers of the compared approaches; they are presented as baselines and not marked as winners in the table. They are also included in the time measurements of the transfer learning approaches. Overall, the SVM is the fastest algorithm, because it is the baseline and uses the LibSVM implementation. The overall fastest transfer learning approach is TKL, but JDA is also promising and is the fastest at Reuters.
The PCVM is overall by far the slowest classifier. By integration of TKL and NBT, the resulting PCTKVM and NTVM are orders of magnitude faster. We assume that the PCVM converges faster with transfer learning, resulting in less computational time. Overall, NTVM is slightly faster and needs less time on Reuters and Image in comparison to PCTKVM, which supports the discussion about computational complexity in section 4.2.1. However, both approaches are slower than the other transfer learning approaches. The reason for this should be the PCVM as underlying classifier, because TKL is the fastest transfer approach with the SVM. Further work should measure the time with the same classifier to make the results more comparable.

Dataset | SVM (Baseline) | TCA | JDA | TKL | SA | PCVM (Baseline) | PCTKVM | NTVM
Reuters | 0.06 | 0.86 | 0.36 | 0.40 | 0.87 | 543.71 | 16.20 | 8.78
Newsgroup | 1.35 | 21.39 | 4.79 | 2.80 | 59.70 | 1501.7 | 5.06 | 25.29
Image | 0.02 | 0.29 | 0.16 | 0.44 | 0.08 | 258.78 | 17.69 | 3.41
Overall | 0.48 | 7.51 | 1.77 | 1.21 | 20.22 | 768.06 | 12.98 | 12.49
Table 5: Result of the cross-validation test shown as mean time in seconds per dataset group. The SVM is the underlying baseline classifier of TCA, JDA, TKL and SA, and the PCVM is the underlying baseline classifier of PCTKVM and NTVM. Note that the SVM time is naturally lower than that of the transfer learning approaches.

6 Conclusion

Summarizing, we proposed two transfer learning extensions for the PCVM, resulting in the PCTKVM and the NTVM. The first shows the best overall sparsity and comparable performance to common transfer learning approaches. The NTVM has an outstanding performance, both in absolute values and in statistical significance. It has competitive sparsity and the lowest computational complexity compared to the discussed solutions. NBT is an enhancement of previous versions of Basis-Transfer via Nyström methods and is no longer limited to specific domain adaptation tasks. The dimensionality reduction paired with the projection of source data into the target subspace via NBT showed its reliability and robustness in this study. The proposed solutions were tested against standard benchmarks in the field in terms of algorithms and datasets. In future work, deep transfer learning, different baseline classifiers and real-world or different domain adaptation datasets should be integrated. Further, smart sampling techniques for landmark selection should be tackled.

References