Parameter Transfer Extreme Learning Machine based on Projective Model

09/04/2018
by Chao Chen, et al., Zhejiang University

In recent years, transfer learning has attracted much attention in the machine learning community. In this paper, we mainly focus on the task of parameter transfer under the framework of the extreme learning machine (ELM). Unlike existing parameter transfer approaches, which incorporate the source model information into the target by regularizing the difference between the source and target domain parameters, an intuitively appealing projective model is proposed to bridge the source and target model parameters. Specifically, we formulate parameter transfer in ELM networks by means of parameter projection, and train the model by optimizing the projection matrix and classifier parameters jointly. Furthermore, an ℓ2,1-norm structured sparsity penalty is imposed on the source domain parameters, which encourages joint feature selection and parameter transfer. To evaluate the effectiveness of the proposed method, comprehensive experiments on several commonly used domain adaptation datasets are presented. The results show that the proposed method significantly outperforms non-transfer ELM networks and other classical transfer learning methods.

I Introduction

In traditional machine learning and pattern classification methods, there is a strong assumption that all the data are drawn from the same distribution. However, this assumption may not always hold in many real-world scenarios. For example, when the training samples are difficult or expensive to obtain, or when the distribution of the samples changes over time, we have to borrow knowledge from a different but highly related domain. Therefore, how to transfer knowledge from another different but related domain has become increasingly important. During the past two decades, transfer learning has emerged as a framework to solve this problem and has received growing attention in the machine learning and data mining community. As discussed in [1], feature-matching based methods are the most widely used transfer learning approaches; they aim to learn a shared feature representation that minimizes the distribution discrepancy between the source and target domains [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]. Among them, learning a cross-domain transformation that maps the target features into the source domain [7, 10, 8, 9, 6, 11] is of particular importance. Apart from this, the parameter transfer approach is another line of work that has attracted considerable attention. It assumes that the transferred knowledge is encoded in the hyper-parameters of the classification model [13, 14, 15, 16]. Therefore, the source model and the target model should share some parameters or a prior distribution over the model parameters. Based on this assumption, parameter transfer approaches can adapt the learned source hyperplane to the target domain with a small number of target samples.

Figure 1 compares the transform-based methods and the parameter transfer methods. The transform-based methods map the features to adapt to the learned hyperplane, while the parameter transfer approaches adjust the learned hyperplane to adapt to the shifted features. Although a large number of transform-based methods and parameter transfer methods have been proposed to address the knowledge transfer problem, few works have tried to combine the two. In this paper, we attempt to perform parameter transfer based on a projective model, in particular under the framework of the extreme learning machine.


Fig. 1: Illustration of the transform-based approaches and the parameter transfer approaches in a four-class classification problem. (a) The hyperplane learned in the source domain. (b) The hyperplane learned in the source domain applied directly to the target domain; due to the domain shift, several samples are misclassified. (c) The domain shift is first corrected with a transform-based approach, and the learned hyperplane is then used for classification. (d) The learned hyperplane is adapted to the target domain with a small number of target instances.

The extreme learning machine (ELM), a special type of feed-forward neural network first proposed by Huang [17], determines its input weights randomly and has become a very popular classifier due to its fast learning speed, satisfactory performance and little human intervention [18]. Since its first appearance, various extensions have been proposed to make the original ELM model more efficient and suitable for specific applications. Chen et al. optimized the input weights of ELM by generalized Hebbian learning and intrinsic plasticity learning [20]. Based on the manifold regularization framework, Huang et al. extended ELM to semi-supervised and unsupervised learning in [19]. To handle the imbalanced data problem, Zong et al. extended the traditional ELM to the weighted ELM (WELM) in [21]. Considering the computational cost and memory requirements, online versions of ELM have also been proposed and studied [22]. Besides, several multi-layer ELM frameworks [23, 24] have recently been put forward to learn deep representations of the original data.

In this paper, we mainly focus on a parameter transfer approach based on the ELM algorithm. We would like to learn a high-quality ELM classifier using a small number of labeled target domain samples and a large number of source domain samples. To achieve this goal, we assume that the source domain classification hyperplane and the target domain classification hyperplane can be bridged by a projection matrix, i.e. the target domain parameters can be represented as a projection matrix multiplied by the source domain parameters. In this way, the parameter transfer ELM model can be learned by jointly optimizing the ELM model parameters and the projection matrix. Furthermore, the ℓ2,1-norm of the source domain parameters is incorporated into the objective function, which encourages the selection of useful source domain features during model training. For ease of notation, the proposed parameter transfer ELM is referred to as PTELM.

The contributions of this paper are four-fold. Firstly, we are among the first to exploit a projective model for parameter transfer, especially under the framework of the ELM. Secondly, unlike most existing works which learn the transformation by minimizing the distribution discrepancy or maximizing some similarity metric between the source and target feature spaces, the proposed PTELM jointly learns the projection matrix and the model parameters by directly minimizing the classification error. Thirdly, the ℓ2,1-norm penalty is imposed on the source domain hyperplane, so that the learned source model tends to select informative features for knowledge transfer. Lastly, we demonstrate that the proposed parameter transfer ELM can also be regarded as a special transform-based domain adaptation method.

II Related Works

Recently, some researchers have focused their attention on domain adaptation with ELM. Zhang et al. proposed a domain adaptation ELM to address the sensor drift problem in E-nose systems [25]. In [5], a unified subspace transfer framework based on ELM was proposed, which learns a subspace that jointly minimizes the maximum mean discrepancy (MMD) and satisfies the maximum margin criterion (MMC). Uzair et al. [26] proposed a blind domain adaptation ELM with an extreme learning machine auto-encoder (ELM-AE), which does not need target domain samples for training. In [4], Zhang et al. proposed an ELM-based domain adaptation (EDA) method for visual knowledge transfer and extended EDA to multi-view learning. In EDA, manifold regularization is incorporated into the objective function, and the authors minimize the ℓ2,1-norm of the hyperplane and the prediction error simultaneously. Besides, a parameter transfer based transfer learning ELM (TLELM) has been proposed in [16], which regularizes the difference between the source and target parameters. In addition, Salaken et al. [27] surveyed the available literature in the field of ELM-based transfer learning methods.

Among the parameter transfer approaches, most related works incorporate the source model information into the target model by regularizing the difference between the source and target domain parameters [13, 14, 15, 16]. A representative method is the adaptive SVM (A-SVM) [13], which learns from the source domain parameters by directly regularizing the distance between the learned model and the target model. After that, Aytar et al. [14] proposed two new parameter transfer SVMs, which extend and relax the A-SVM. Li et al. [16] proposed a transfer learning ELM by introducing the same regularizer as the A-SVM into the ELM.

III Preliminaries

III-A A Brief Review of ELM

Consider a supervised learning problem in which the training set with N samples and the corresponding targets is given as {(x_i, t_i)}_{i=1}^{N}. Here x_i ∈ R^n is the n-dimensional input data and t_i ∈ R^c is its associated one-hot label. The ELM network learns a decision rule in the following two stages. In the first stage, it randomly generates the input weights W and bias b, and maps the original data from the input space into the L-dimensional feature space h(x) = g(Wx + b), where L is the number of hidden nodes and g(·) is the activation function. In this respect, the only free parameter of the ELM is the output weight matrix β ∈ R^{L×c}. In the second stage, the ELM solves for the output weights by simultaneously minimizing the sum of squared prediction errors and the norm of the output weights, which leads to

min_{β,ξ}  (1/2)‖β‖_F² + (C/2) Σ_{i=1}^{N} ‖ξ_i‖²,   s.t.  h(x_i)β = t_i^T − ξ_i^T,  i = 1, …, N      (1)

where ξ_i is the prediction error with respect to the i-th training sample, and the first term of the objective function is the regularization term that prevents the network from overfitting. By substituting the constraint into the objective function, problem (1) can be simplified to the following unconstrained optimization problem:

min_{β}  L_ELM = (1/2)‖β‖_F² + (C/2)‖T − Hβ‖_F²      (2)

where H = [h(x_1)^T, …, h(x_N)^T]^T ∈ R^{N×L} is the hidden-layer output matrix and T = [t_1, …, t_N]^T ∈ R^{N×c} is the label matrix. The optimal solution of β can then be determined analytically by setting the derivative of L_ELM with respect to β to zero, i.e.

∂L_ELM/∂β = β − C H^T(T − Hβ) = 0      (3)

Then, the output weights can be effectively solved by

β* = (I/C + H^T H)^{-1} H^T T      (4)

where I is the identity matrix and C is the regularization coefficient. With this closed-form solution, the ELM model is remarkably efficient and reaches the global optimum of (2).
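To make the two-stage training procedure concrete, the following minimal NumPy sketch (our own illustration, not the authors' code) trains a basic ELM with the closed-form solution of Eq. (4); the function names, the sigmoid activation and the default of 500 hidden nodes are assumptions for the example.

```python
import numpy as np

def train_elm(X, T, n_hidden=500, C=1.0, seed=0):
    """Minimal ELM training sketch: random feature map + closed-form ridge solution."""
    rng = np.random.default_rng(seed)
    # Stage 1: randomly generate the input weights W and bias b (never updated).
    W = rng.standard_normal((X.shape[1], n_hidden))
    b = rng.standard_normal(n_hidden)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))            # hidden-layer output, sigmoid g(.)
    # Stage 2: output weights from Eq. (4): beta = (I/C + H^T H)^{-1} H^T T.
    beta = np.linalg.solve(np.eye(n_hidden) / C + H.T @ H, H.T @ T)
    return W, b, beta

def predict_elm(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return np.argmax(H @ beta, axis=1)                # predicted class indices
```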

III-B Notations and Definitions

We summarize the frequently used notations and definitions below.
Notations: For a matrix A ∈ R^{m×n}, let the i-th row of A be denoted by a^i. The Frobenius norm of the matrix A is defined as

‖A‖_F = sqrt( Σ_{i=1}^{m} Σ_{j=1}^{n} a_{ij}² )      (5)

The ℓ2,1-norm of a matrix, first introduced in [28] as a rotation-invariant penalty that ensures the row sparsity of a matrix, has been widely used for feature selection and as a structured sparsity regularizer [29, 30, 2, 4]. It is defined as

‖A‖_{2,1} = Σ_{i=1}^{m} sqrt( Σ_{j=1}^{n} a_{ij}² ) = Σ_{i=1}^{m} ‖a^i‖_2      (6)
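The contrast between the two norms can be seen with a few lines of NumPy (an illustrative snippet of ours; the helper names are not from the paper): the ℓ2,1-norm sums the Euclidean norms of the rows, so penalizing it drives whole rows toward zero, which is exactly the row sparsity used later for source-feature selection.

```python
import numpy as np

def frobenius_norm(A):
    return np.sqrt(np.sum(A ** 2))              # Eq. (5)

def l21_norm(A):
    return np.sum(np.linalg.norm(A, axis=1))    # Eq. (6): sum of row norms

A = np.array([[3.0, 4.0],
              [0.0, 0.0],    # an all-zero row contributes nothing to either norm
              [1.0, 2.0]])
print(frobenius_norm(A))     # sqrt(30) ~ 5.48
print(l21_norm(A))           # 5 + 0 + sqrt(5) ~ 7.24
```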

Definition 1. Domain. A domain D is composed of a feature space X and a marginal distribution P(X). D_S and D_T represent the source and target domains respectively, which are sampled from different but related distributions. Generally, X_S = X_T and P(X_S) ≠ P(X_T).
Definition 2. Transfer Learning. Given a source domain D_S and a target domain D_T, with generally D_S ≠ D_T, the data in the target domain are insufficient to learn a high-quality classification model. Transfer learning aims to learn a satisfactory target classifier by incorporating the source domain information.
Definition 3. Parameter Transfer. Let β_S and β_T be the model parameters learned from the two domains, each obtained by minimizing a regularized empirical risk of the form

β_* = argmin_β  Σ_{x_i ∈ D_*} ℓ(x_i, t_i; β) + λ R(β),   * ∈ {S, T}      (7)

where the first term is the loss function and the second term is the parameter regularization. β_S is the classification model learned from the source domain D_S, and β_T is the classification model learned from the target domain D_T. Based on the assumption that β_S and β_T should share some parameters or a prior distribution, parameter transfer learning aims to transfer knowledge from β_S to improve the target domain classification model:

β_T = argmin_β  Σ_{x_j ∈ D_T} ℓ(x_j, t_j; β) + λ R(β) + γ Ω(β, β_S)      (8)

The last term incorporates the information of β_S into β_T.
Most parameter transfer approaches seek to leverage the target model by penalizing the discrepancy between β_T and β_S, i.e. Ω(β, β_S) = ‖β − β_S‖². This penalty, which directly regularizes the distance between β_T and β_S, is sometimes too strict: when γ is large enough, it forces β_T ≈ β_S. In order to relax this constraint, in this paper we propose a projective-model based parameter transfer approach to bridge the source and target domain parameters.
Definition 4. Projective-Model based Parameter Transfer. Define a projection matrix A; projective-model based parameter transfer assumes that the source domain parameters β_S and the target domain parameters β_T can be bridged by A, i.e. β_T = A β_S.

IV Proposed Method

In this section, we present the proposed projective-model based parameter transfer ELM and its learning algorithm.

IV-A Problem Formulation

Suppose we have a source domain with N_S labeled samples {(x_i^S, t_i^S)}_{i=1}^{N_S} and a target domain with N_T labeled samples {(x_j^T, t_j^T)}_{j=1}^{N_T}. Generally, in domain adaptation algorithms, N_T is very small and N_T ≪ N_S. Denote by β_S and β_T the source and target ELM model parameters to be optimized. As discussed above, we bridge the source domain parameters and the target parameters by a projection matrix A, i.e. β_T = A β_S. Our goal is to learn the ELM classification hyperplane and the projection matrix jointly. In this respect, the objective function can be formulated as

min_{β_S, A}  Σ_{i=1}^{N_S} ‖ξ_i^S‖² + λ1 Σ_{j=1}^{N_T} ‖ξ_j^T‖² + λ2 ‖β_S‖_{2,1} + λ3 ‖A β_S‖_F²      (9)
s.t.  h(x_i^S) β_S = (t_i^S)^T − (ξ_i^S)^T,  i = 1, …, N_S;   h(x_j^T) A β_S = (t_j^T)^T − (ξ_j^T)^T,  j = 1, …, N_T      (10)

where h(x_i^S), t_i^S and ξ_i^S denote the hidden-layer output, the one-hot label vector and the prediction error with respect to the i-th sample from the source domain. Similarly, h(x_j^T), t_j^T and ξ_j^T denote the hidden-layer output, the one-hot label vector and the prediction error with respect to the j-th sample from the target domain. As can be seen, there are four terms in the objective function, which are intuitive to understand. The first two terms simultaneously minimize the training error in the source and target domains, and the last two terms prevent the source and target ELM models from overfitting. λ1, λ2 and λ3 are trade-off parameters that balance the contributions of the four terms. The merits that distinguish our proposal from other related works are two-fold. On the one hand, different from the traditional parameter transfer approach [16], we bridge the source domain and the target domain parameters by a projection matrix. On the other hand, the ℓ2,1-norm instead of the Frobenius norm is imposed on the source domain hyperplane as a regularizer. With this penalty, row sparsity is obtained, and our model therefore tends to select the informative features of the source domain for knowledge transfer.
By substituting the constraints into the objective function, the optimization problem (9) can be reformulated as an equivalent unconstrained optimization problem:

min_{β_S, A}  L(β_S, A) = ‖H_S β_S − T_S‖_F² + λ1 ‖H_T A β_S − T_T‖_F² + λ2 ‖β_S‖_{2,1} + λ3 ‖A β_S‖_F²      (11)

where H_T ∈ R^{N_T×L} and H_S ∈ R^{N_S×L} denote the hidden-layer outputs of the target and source ELM models, β_T = A β_S ∈ R^{L×c} and β_S ∈ R^{L×c} denote the output weights of the target and source ELM models, and T_T ∈ R^{N_T×c} and T_S ∈ R^{N_S×c} denote the label matrices of the target and source domain samples. Here, L denotes the number of hidden nodes in the ELM model, and c is the number of classes shared by the source and target domains.

IV-B Learning Algorithm

As can be seen in problem (11), our goal is to jointly learn the output weights β_S of the source ELM model and the projection matrix A. The target ELM model parameters can then be easily obtained as β_T = A β_S. However, with two free parameters to be solved, this optimization problem cannot be solved directly like problem (2). Therefore, we adopt a coordinate descent method to alternately optimize the two free parameters.
(1) Fix A and optimize β_S: In the first step, we fix the projection matrix A. The sub-problem can then be solved by setting the derivative of the objective function with respect to β_S to zero, which gives

H_S^T(H_S β_S − T_S) + λ1 A^T H_T^T(H_T A β_S − T_T) + (λ2/2) ∂‖β_S‖_{2,1}/∂β_S + λ3 A^T A β_S = 0      (12)

Note that ‖β_S‖_{2,1} is non-smooth at zero; we therefore use its sub-gradient instead, ∂‖β_S‖_{2,1}/∂β_S = 2 D β_S, where D is a diagonal sub-gradient matrix whose i-th diagonal element is

D_{ii} = 1 / (2 ‖β_S^i‖_2 + ε)      (13)

Here, β_S^i denotes the i-th row of β_S, and ε is a very small constant that prevents the denominator from being zero. With the matrix D fixed, β_S can be solved from Eq. (12) as

β_S = (H_S^T H_S + λ1 A^T H_T^T H_T A + λ2 D + λ3 A^T A)^{-1} (H_S^T T_S + λ1 A^T H_T^T T_T)      (14)

Note that the sub-gradient matrix D depends on the unknown parameters β_S. Thus, we employ an alternating optimization strategy and solve for β_S and D according to Eq. (13) and Eq. (14). In each iteration, only one variable is updated while the other is fixed. The procedure is summarized in Algorithm 1. It is worth noting that the iterations terminate once the number of iterations reaches a preset maximum or β_S converges. The convergence of this algorithm can be proved analogously to [30].

Input: H_S, T_S, H_T, T_T, A, λ1, λ2, λ3
Output: β_S
Set t = 0. Initialize D as an identity matrix;
repeat
       Update β_S according to Eq. (14);
       Update D according to Eq. (13);
       t = t + 1
until converged;
Algorithm 1 An efficient iterative algorithm to solve β_S
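A minimal NumPy sketch of Algorithm 1 is given below. It implements the reconstructed updates (13)–(14) above; the function name, the iteration cap and the convergence tolerance are our own choices, not part of the paper.

```python
import numpy as np

def solve_beta_s(Hs, Ts, Ht, Tt, A, lam1, lam2, lam3,
                 max_iter=30, tol=1e-5, eps=1e-8):
    """Solve beta_S with the projection matrix A fixed (Algorithm 1 sketch)."""
    L = Hs.shape[1]
    HtA = Ht @ A                                             # H_T A, fixed in this sub-problem
    M = Hs.T @ Hs + lam1 * HtA.T @ HtA + lam3 * A.T @ A      # terms independent of D
    rhs = Hs.T @ Ts + lam1 * HtA.T @ Tt
    D = np.eye(L)                                            # initialize D as an identity matrix
    beta = np.zeros((L, Ts.shape[1]))
    for _ in range(max_iter):
        beta_new = np.linalg.solve(M + lam2 * D, rhs)        # Eq. (14)
        D = np.diag(1.0 / (2.0 * np.linalg.norm(beta_new, axis=1) + eps))  # Eq. (13)
        if np.linalg.norm(beta_new - beta) < tol:            # stop once beta_S converges
            return beta_new
        beta = beta_new
    return beta
```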

(2) Fix β_S and optimize A: With β_S fixed, the sub-problem can be solved by setting the derivative of Eq. (11) with respect to A to zero. We get

λ1 H_T^T(H_T A β_S − T_T) β_S^T + λ3 A β_S β_S^T = 0      (15)

which leads to

A = λ1 (λ1 H_T^T H_T + λ3 I)^{-1} H_T^T T_T β_S^T (β_S β_S^T)^†      (16)
where (·)^† denotes the Moore–Penrose pseudo-inverse.

The overall learning algorithm is summarized in Algorithm 2. With the randomly initialized input parameters, the hidden-layer outputs of the source and target ELM models, denoted H_S and H_T, can be calculated beforehand. In each iteration, we update β_S with the current A, and then update A with the newly calculated β_S. Owing to the closed-form solutions in each iteration, the learning algorithm converges after a few iterations.

Input: X_S, T_S, X_T, T_T, λ1, λ2, λ3, L
Output: β_S, A
Calculate H_S and H_T with randomly initialized input parameters;
Set t = 0. Initialize A as an identity matrix;
repeat
       Update β_S according to Algorithm 1;
       Update A according to Eq. (16);
       t = t + 1
until converged;
Algorithm 2 Learning Algorithm of the PTELM Method
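Putting the two updates together, Algorithm 2 can be sketched as follows. This is a sketch under the reconstructed objective (11) and our reading of Eq. (16), reusing solve_beta_s from the Algorithm 1 sketch; it is not the authors' released implementation (linked in Section V). The A-update uses a pseudo-inverse because β_S β_S^T is rank-deficient whenever the number of classes is smaller than the number of hidden nodes.

```python
import numpy as np

def hidden_layer(X, W, b):
    # Same random feature map as in the basic ELM sketch (sigmoid activation assumed).
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

def train_ptelm(Xs, Ts, Xt, Tt, n_hidden=500, lam1=1.0, lam2=1.0, lam3=1.0,
                n_outer=10, seed=0):
    """PTELM training sketch (Algorithm 2): alternate beta_S and A updates."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((Xs.shape[1], n_hidden))
    b = rng.standard_normal(n_hidden)
    Hs = hidden_layer(Xs, W, b)                 # computed once, before the loop
    Ht = hidden_layer(Xt, W, b)
    A = np.eye(n_hidden)                        # initialize A as an identity matrix
    for _ in range(n_outer):
        beta_s = solve_beta_s(Hs, Ts, Ht, Tt, A, lam1, lam2, lam3)   # Algorithm 1
        # Eq. (16) as reconstructed above:
        #   (lam1 Ht^T Ht + lam3 I) A (beta_s beta_s^T) = lam1 Ht^T Tt beta_s^T,
        # solved with a Moore-Penrose pseudo-inverse of beta_s beta_s^T.
        lhs = lam1 * Ht.T @ Ht + lam3 * np.eye(n_hidden)
        A = np.linalg.solve(lhs, lam1 * Ht.T @ Tt @ beta_s.T) \
            @ np.linalg.pinv(beta_s @ beta_s.T)
    beta_t = A @ beta_s                         # target model parameters beta_T = A beta_S
    return W, b, beta_s, beta_t
```

Target-domain prediction then uses beta_t exactly as in the basic ELM sketch, i.e. argmax over hidden_layer(X, W, b) @ beta_t.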

IV-C Relationship to Transform-based Methods

Most existing domain adaptation methods perform knowledge transfer by learning a cross-domain transformation [9, 10, 31, 11] that maps the source domain data into the target domain by applying a transformation to the source features. Instead, our proposed PTELM transforms the source domain hyperplane into the target domain by β_T = A β_S. In fact, the proposed PTELM can also be regarded as a transform-based method. As can be seen in Eq. (11), if we implicitly define H̃_T = H_T A and set λ3 to zero, the objective function can be reformulated as

min_{β_S, A}  ‖H_S β_S − T_S‖_F² + λ1 ‖H̃_T β_S − T_T‖_F² + λ2 ‖β_S‖_{2,1}      (17)

Similar to the cross-domain transformation approaches, the rewritten objective function jointly learns a transformation matrix that transforms the target features into the source domain, together with the classification hyperplane. The differences between our proposal and other transform-based methods are three-fold. First, our proposed PTELM transforms the target into the source by a column transformation, while the majority of transform-based methods align the source and target by applying a row transformation to the source data. Second, the PTELM learns the transformation directly based on the prediction error, while other related works take the distribution discrepancy or a similarity metric as the guideline. Lastly, the PTELM learns the transformation and an ELM classifier simultaneously, while many transform-based methods only learn the transformation and then rely on other classifiers (e.g. KNN) for classification.

V Experiments

In this section, we evaluate our proposed PTELM method on several challenging real-world datasets. The source code of the PTELM is released online at https://github.com/BoyuanJiang/PTELM.

V-A Datasets and Setup

Two types of domain adaptation problems are considered: object recognition and text categorization. A summary of the properties of each domain considered in our experiments is provided in Table I.

Caltech-Office dataset. This dataset [8] combines the Office [7] and Caltech-256 [32] datasets. It contains images from four different domains: Amazon (product images downloaded from amazon.com), Webcam (low-resolution images taken by a webcam), Dslr (high-resolution images taken by a digital SLR camera) and Caltech. Ten common categories are extracted from all four domains, with each category containing 8 to 151 samples and 2,533 images in total. Several factors (such as image resolution, lighting condition, noise, background and viewpoint) cause the shift between domains. Figure 2 highlights the differences among these domains with example images from the keyboard and headphones categories. We use the SURF-BoW image features (SURF in short) provided by [8], which encode the images as 800-bin histograms with a codebook trained on a subset of Amazon images using SURF descriptors [33]. These histograms are then normalized to zero mean and unit variance in each dimension.

Multilingual Reuters Collection dataset. This dataset (http://ama.liglab.fr/~amini/DataSets/Classification/Multiview/ReutersMutliLingualMultiView.htm) [34, 35], collected by sampling from the Reuters RCV1 and RCV2 collections, contains features of 111,740 documents originally written in five different languages together with their translations (i.e., English, French, German, Italian, and Spanish), over a common set of 6 categories (i.e., C15, CCAT, E21, ECAT, GCAT, and M11). Documents belonging to more than one of the 6 categories are assigned the label of their smallest category. In total there are 12–30K documents per language and 11–34K documents per category. All documents are represented as bags of words, from which TF-IDF features are extracted.

Baselines. We compare against the following baselines and competing methods that are well adapted to domain shift scenarios:

  • SVM_S: support vector machine trained on the source domain only.

  • SVM_T: support vector machine trained on the target domain only.

  • ELM_S: extreme learning machine trained on the source domain only.

  • ELM_T: extreme learning machine trained on the target domain only.

  • GFK: Geodesic Flow Kernel [8].

  • MMDT: Max-Margin Domain Transforms [9, 6].

  • CDLS: Cross-Domain Landmark Selection [31].

Problem | Domains | Dataset      | # Samples | # Features | # Classes | Abbr.
Objects | Amazon  | Office       | 958       | 800        | 10        | A
Objects | Webcam  | Office       | 295       | 800        | 10        | W
Objects | DSLR    | Office       | 157       | 800        | 10        | D
Objects | Caltech | Caltech-256  | 1,123     | 800        | 10        | C
Texts   | English | Multilingual | 18,758    | 11,547     | 6         | EN
Texts   | French  | Multilingual | 26,648    | 11,547     | 6         | FR
Texts   | German  | Multilingual | 29,953    | 11,547     | 6         | GR
Texts   | Italian | Multilingual | 24,039    | 11,547     | 6         | IT
Texts   | Spanish | Multilingual | 12,342    | 11,547     | 6         | SP

TABLE I: Summary of the domains used in the experiments

Method | A→C | A→D | A→W | C→A | C→D | C→W | D→A | D→C | D→W | W→A | W→C | W→D | Mean
SVM_S  | 38.6±0.4 | 33.4±1.3 | 34.8±0.8 | 38.5±0.6 | 33.9±1.0 | 30.2±1.0 | 36.4±0.5 | 32.8±0.3 | 76.6±0.8 | 34.1±0.6 | 29.6±0.6 | 67.9±0.7 | 40.6±0.7
SVM_T  | 34.2±0.6 | 55.5±0.8 | 63.1±0.8 | 47.0±1.1 | 55.3±1.1 | 59.4±1.4 | 46.5±1.0 | 33.4±0.6 | 60.3±1.2 | 48.5±0.9 | 31.1±0.8 | 53.5±1.0 | 49.0±0.9
ELM_S  | 36.8±0.4 | 31.2±1.2 | 31.0±1.1 | 38.1±0.7 | 35.2±1.0 | 30.3±1.3 | 36.5±0.6 | 30.7±0.5 | 78.2±0.5 | 32.7±0.7 | 29.1±0.5 | 72.8±0.9 | 40.2±0.8
ELM_T  | 33.2±0.7 | 54.5±1.0 | 65.5±1.1 | 48.8±0.9 | 56.6±0.8 | 64.8±1.4 | 48.6±0.9 | 34.0±0.7 | 65.9±0.8 | 49.9±1.0 | 31.4±0.9 | 57.6±0.8 | 50.9±0.9
GFK    | 36.0±0.5 | 50.7±0.8 | 58.6±1.0 | 44.7±0.8 | 57.7±1.1 | 63.7±0.8 | 45.7±0.6 | 32.9±0.5 | 76.5±0.5 | 44.1±0.4 | 31.1±0.6 | 70.5±0.7 | 51.0±0.7
MMDT   | 36.4±0.8 | 56.7±1.3 | 64.6±1.2 | 49.4±0.8 | 56.5±0.9 | 63.8±1.1 | 46.9±1.0 | 34.1±0.8 | 74.1±0.8 | 47.7±0.9 | 32.2±0.8 | 64.0±0.7 | 52.2±0.9
CDLS   | 28.7±1.0 | 54.4±1.3 | 60.5±1.1 | 41.0±1.0 | 53.2±1.1 | 61.6±0.9 | 49.1±0.8 | 35.7±0.6 | 75.1±0.8 | 49.8±0.7 | 34.6±0.6 | 64.0±0.7 | 50.6±0.9
PTELM  | 36.0±0.7 | 57.0±0.8 | 67.0±0.8 | 51.2±0.9 | 57.3±0.8 | 64.9±1.0 | 50.6±0.8 | 36.2±0.6 | 67.2±0.8 | 52.3±0.7 | 33.5±0.9 | 59.2±0.8 | 52.7±0.8
Red indicates the best result for each domain split. Blue indicates the group of results that are close to the best performing result. (A: Amazon, C: Caltech, D: DSLR and W: Webcam)

TABLE II: Recognition accuracies (%) on the Caltech-Office dataset with SURF features

# labeled target domain data / category = 10
Source  | SVM_S    | SVM_T    | ELM_S    | ELM_T    | GFK      | MMDT     | CDLS     | PTELM
English | 28.8±1.3 | 68.5±1.0 | 39.9±1.6 | 67.0±1.0 | 64.2±0.7 | 71.4±0.6 | 70.2±0.7 | 72.2±0.3
French  | 53.0±0.9 |    —     | 56.6±1.0 |    —     | 66.9±0.6 | 72.8±0.4 | 70.5±0.8 | 73.1±0.5
German  | 39.1±1.2 |    —     | 48.2±0.8 |    —     | 65.2±0.7 | 72.1±0.6 | 70.8±0.8 | 73.8±0.4
Italian | 63.5±0.6 |    —     | 56.9±1.0 |    —     | 65.7±0.7 | 72.5±0.6 | 71.0±0.9 | 73.3±0.5
Mean    | 46.1±1.0 | 68.5±1.0 | 50.4±1.1 | 67.0±1.0 | 65.5±0.7 | 72.2±0.6 | 70.6±0.8 | 73.1±0.4

# labeled target domain data / category = 20
Source  | SVM_S    | SVM_T    | ELM_S    | ELM_T    | GFK      | MMDT     | CDLS     | PTELM
English | 28.9±1.3 | 74.5±0.6 | 40.9±1.5 | 72.2±0.5 | 71.7±0.5 | 75.2±0.6 | 76.5±0.5 | 77.2±0.3
French  | 52.6±0.9 |    —     | 58.3±0.8 |    —     | 72.3±0.6 | 74.7±0.4 | 75.6±0.6 | 76.7±0.3
German  | 39.0±1.2 |    —     | 46.0±1.3 |    —     | 70.8±0.5 | 75.7±0.5 | 75.9±0.5 | 77.2±0.3
Italian | 63.2±0.6 |    —     | 57.9±0.7 |    —     | 71.6±0.6 | 76.2±0.5 | 75.9±0.5 | 76.2±0.4
Mean    | 45.9±1.0 | 74.5±0.6 | 50.8±1.1 | 72.2±0.5 | 71.6±0.6 | 75.5±0.5 | 76.0±0.5 | 76.8±0.3
SVM_T and ELM_T do not depend on the source language. Red indicates the best result for each domain split. Blue indicates the group of results that are close to the best performing result.

TABLE III: Recognition accuracies (%) on the Multilingual Reuters Collection dataset with Spanish as the target domain


Fig. 2: Example images of Office-Caltech dataset. Amazon, Dslr and Webcam are from Office dataset while Caltech is from Caltech-256 dataset. It is obvious that domain shifts are large among different domains. (Best viewed in color.)

V-B Cross-Domain Object Recognition

For our first experiment, we use the Caltech-Office domain adaptation benchmark dataset to evaluate our method on real-world computer vision adaptation tasks.

V-B1 Experiment Setup

Following the setup of [8, 7, 9], the number of labeled source samples selected per class for amazon, webcam, dslr and caltech is 20, 8, 8, and 8, respectively. When these domains serve as the target domain, 3 labeled target samples per class are used instead. For a fair comparison, we use the same 20 random train/test splits provided by the authors of [9] (downloaded from https://people.eecs.berkeley.edu/~jhoffman/domainadapt/) and report results averaged across them.

For our method, we fix the trade-off parameters λ1, λ2 and λ3 across all runs. The number of hidden nodes of the ELM networks is set to 500 in all experiments. For the other baseline methods, we use the recommended parameters.

V-B2 Results

We report the mean and standard deviation of the classification accuracies of all methods on the Office-Caltech dataset in Table II. Each result in the same column is based on the same 20 random trials. As can be seen, our proposed method outperforms all other methods on 7 out of the 12 individual domain shifts and achieves the highest average accuracy of 52.7% over all 12 domain shift experiments. It is worth noticing that our PTELM typically outperforms the other competing methods when amazon serves as source or target domain. We believe the reason is that the domain discrepancy between amazon and webcam, and between amazon and dslr, is much more significant than for the other domain pairs, as indicated by the larger performance gap between ELM_S and ELM_T on these shifts. This suggests that our approach is particularly effective for large domain shifts.

We also visualize the effectiveness of the proposed PTELM via confusion matrices. Figure 3 plots the confusion matrices of ELM_S, PTELM and ELM_T for the amazon→webcam domain shift experiment. Inspecting the confusion matrix of ELM_S, which is trained with 20 labeled source samples per class, we find that the source-only model is heavily confused about several classes. This also reveals the large domain shift between amazon and webcam and explains the performance discrepancy between ELM_S and ELM_T. On the other hand, the confusion matrix of ELM_T, which is trained with only 3 labeled target samples per class, is also somewhat confused. In contrast, as can be seen in Figure 3(b), the off-diagonal elements of the PTELM confusion matrix are close to zero, which demonstrates that our PTELM method can effectively use source and target domain samples together to train a high-quality classifier.


Fig. 3: Confusion matrices for the amazon→webcam domain shift experiment. Left: ELM_S model trained with the source domain only. Middle: our proposed PTELM method trained with the source and target domains together. Right: ELM_T model trained with the target domain only.

V-C Cross-Domain Text Categorization

For the second experiment, we utilize the Multilingual Reuters Collection dataset to evaluate our method in the context of text categorization.

V-C1 Experiment Setup

In this dataset, documents written in different languages can be viewed as different domains. We take Spanish as the target domain and the other four languages (English, French, German and Italian) as individual source domains, giving four combinations in total. For each category, we randomly sample 100 labeled training documents from the source domain and m labeled training documents from the target domain, where m = 5, 10, 15 and 20, respectively; the remaining documents in the target domain are used as the test set (the splits we used can be downloaded from https://github.com/BoyuanJiang/PTELM/tree/master/DataSplits). Note that the dimensionality of the original TF-IDF features is up to 11,547; in order to compare our method fairly with the other competing methods, we perform principal component analysis (using a randomized singular value decomposition algorithm as the SVD solver for efficiency) for dimension reduction, and the dimensionality after PCA is 40.
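As an illustration of this preprocessing step (our own sketch using scikit-learn rather than the authors' exact script; whether PCA is fit on the source features alone or on pooled source and target features is an assumption here):

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder matrices standing in for the real 11,547-dimensional TF-IDF features.
rng = np.random.default_rng(0)
Xs_tfidf = rng.random((600, 11547))     # source documents (100 per category x 6 categories)
Xt_tfidf = rng.random((60, 11547))      # a few labeled target documents

# Reduce to 40 dimensions; the randomized SVD solver is much faster than the
# exact solver at this dimensionality.
pca = PCA(n_components=40, svd_solver="randomized", random_state=0)
Xs_low = pca.fit_transform(Xs_tfidf)    # fit on the source features (assumption)
Xt_low = pca.transform(Xt_tfidf)
print(Xs_low.shape, Xt_low.shape)       # (600, 40) (60, 40)
```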

In this experiment, we also fix the trade-off parameters λ1, λ2 and λ3. The number of hidden nodes is set to a different value from the one used in the object recognition experiments.

V-C2 Results


Fig. 4: Classification accuracies of all methods with varying numbers of labeled target data per class (i.e. m = 5, 10, 15 and 20) on the Multilingual Reuters Collection dataset. Spanish is the target domain, while the source domain is (a) English, (b) French, (c) German and (d) Italian, respectively.


Fig. 5: Parameter sensitivity study for the PTELM algorithm on the IT→SP and amazon→webcam domain shifts.

We report the means and standard deviations of all methods on the Multilingual Reuters Collection dataset for m = 10 and m = 20 in Table III. It is obvious that our proposed PTELM method beats the other competing methods under both settings. It is interesting to note that the GFK algorithm performs worse than ELM_T and SVM_T. A possible explanation is that GFK was put forward for unsupervised domain adaptation and therefore does not utilize the given target labels during training.

We also plot the means and standard deviations of all methods over different numbers of labeled target samples (5, 10, 15 and 20, respectively) in Figure 4, except for SVM_S and ELM_S, as these two methods perform much worse than the others. From the figure, it can be seen that the performance of all methods improves as the number of labeled target samples increases, and our method performs best in most cases. It is worth noting that MMDT performs a little better than our method, and much better than the other methods, when m = 5, which suggests that MMDT is more suitable when very few labeled target samples are available. Another key insight from the figure is that our method is more stable than the competing methods, with lower standard deviations.

V-D Parameter Sensitivity

In this section, we investigate the sensitivity of the four parameters involved in our method, namely the three trade-off parameters λ1, λ2, λ3 and the number of hidden nodes L. Due to space limitations, we only choose amazon→webcam from the Office-Caltech dataset and IT→SP from the Multilingual Reuters Collection dataset to evaluate accuracy. Each time, only one parameter is varied while the others are fixed. The results are shown in Figure 5, and we give a brief analysis here. λ1 is the trade-off parameter that balances the contributions of the source and target domains: when λ1 is smaller than 1, the model learns more from the source domain; on the contrary, when λ1 is larger than 1, the target domain counts more. Therefore, a reasonable value of λ1 is close to 1, as can be seen in Figure 5(a). λ2 and λ3 are two penalty terms that prevent the model from overfitting the source and target domain data; reasonable choices for them can be read from Figures 5(b) and 5(c). For the number of hidden nodes L, it is highly related to the feature dimensionality, and a reasonable value is about 500 in our experiments.

VI Conclusion and Future Work

In this paper, we presented a novel approach for parameter transfer under the ELM framework, which explicitly bridges the source domain parameters and the target domain parameters by a projection matrix. In order to select informative source domain features for knowledge transfer, an ℓ2,1-norm penalty is applied to the source parameters. Additionally, an effective alternating optimization method was introduced to jointly learn the projection matrix and the model parameters. Experiments on several challenging datasets showed that the proposed PTELM outperforms the non-transfer ELM and SVM by a large margin, and also achieves better performance than other representative methods.
In the future, we plan to extend our proposal in two directions: (1) extending PTELM to multi-source domain adaptation; (2) reformulating the model by transforming the source and target parameters into a shared parameter space with two different projection matrices.

References

  • [1] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on knowledge and data engineering, vol. 22, no. 10, pp. 1345–1359, 2010.
  • [2] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu, “Transfer joint matching for unsupervised domain adaptation,” pp. 1410–1417, 2014.
  • [3] M. Long, J. Wang, G. Ding, S. J. Pan, and S. Y. Philip, “Adaptation regularization: A general framework for transfer learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 5, pp. 1076–1089, 2014.
  • [4] L. Zhang and D. Zhang, “Robust visual knowledge transfer via extreme learning machine-based domain adaptation,” IEEE Transactions on Image Processing, vol. 25, no. 10, pp. 4959–4973, 2016.
  • [5] Y. Liu, L. Zhang, P. Deng, and Z. He, “Common subspace learning via cross-domain extreme learning machine,” Cognitive Computation, pp. 1–9, 2017.
  • [6] J. Hoffman, E. Rodner, J. Donahue, B. Kulis, and K. Saenko, “Asymmetric and category invariant feature transformations for domain adaptation,” International journal of computer vision, vol. 109, no. 1-2, pp. 28–41, 2014.
  • [7] K. Saenko, B. Kulis, M. Fritz, and T. Darrell, “Adapting visual category models to new domains,” Computer Vision–ECCV 2010, pp. 213–226, 2010.
  • [8] B. Gong, Y. Shi, F. Sha, and K. Grauman, “Geodesic flow kernel for unsupervised domain adaptation,” pp. 2066–2073, 2012.
  • [9] J. Hoffman, E. Rodner, J. Donahue, T. Darrell, and K. Saenko, “Efficient learning of domain-invariant image representations,” international conference on learning representations, 2013.
  • [10] B. Kulis, K. Saenko, and T. Darrell, “What you saw is not what you get: Domain adaptation using asymmetric kernel transforms,” pp. 1785–1792, 2011.
  • [11] B. Sun, J. Feng, and K. Saenko, “Return of frustratingly easy domain adaptation,” National Conference on Artificial Intelligence, pp. 2058–2065, 2016.
  • [12] C. Chen, Z. Chen, B. Jiang, and X. Jin, “Joint domain alignment and discriminative feature learning for unsupervised deep domain adaptation,” arXiv preprint arXiv:1808.09347, 2018.
  • [13] J. Yang, R. Yan, and A. G. Hauptmann, “Adapting svm classifiers to data with shifted distributions,” pp. 69–76, 2007.
  • [14] Y. Aytar and A. Zisserman, “Tabula rasa: Model transfer for object category detection,” pp. 2252–2259, 2011.
  • [15] T. Tommasi, F. Orabona, and B. Caputo, “Learning categories from few examples with multi model knowledge transfer,” IEEE transactions on pattern analysis and machine intelligence, vol. 36, no. 5, pp. 928–941, 2014.
  • [16] X. Li, W. Mao, and W. Jiang, “Extreme learning machine based transfer learning for data classification,” Neurocomputing, vol. 174, pp. 203–210, 2016.
  • [17] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, “Extreme learning machine: theory and applications,” Neurocomputing, vol. 70, no. 1, pp. 489–501, 2006.
  • [18] G.-B. Huang, H. Zhou, X. Ding, and R. Zhang, “Extreme learning machine for regression and multiclass classification,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 42, no. 2, pp. 513–529, 2012.
  • [19] G. Huang, S. Song, J. N. Gupta, and C. Wu, “Semi-supervised and unsupervised extreme learning machines,” IEEE transactions on cybernetics, vol. 44, no. 12, pp. 2405–2417, 2014.
  • [20] C. Chen, X. Jin, B. Jiang, and L. Li, “Optimizing extreme learning machine via generalized hebbian learning and intrinsic plasticity learning,” Neural Processing Letters, pp. 1–17, 2018.
  • [21] W. Zong, G.-B. Huang, and Y. Chen, “Weighted extreme learning machine for imbalance learning,” Neurocomputing, vol. 101, pp. 229–242, 2013.
  • [22] N.-Y. Liang, G.-B. Huang, P. Saratchandran, and N. Sundararajan, “A fast and accurate online sequential learning algorithm for feedforward networks,” IEEE Transactions on neural networks, vol. 17, no. 6, pp. 1411–1423, 2006.
  • [23] H. Zhou, G.-B. Huang, Z. Lin, H. Wang, and Y. C. Soh, “Stacked extreme learning machines,” IEEE transactions on cybernetics, vol. 45, no. 9, pp. 2013–2025, 2015.
  • [24] G.-B. Huang, Z. Bai, L. L. C. Kasun, and C. M. Vong, “Local receptive fields based extreme learning machine,” IEEE Computational Intelligence Magazine, vol. 10, no. 2, pp. 18–29, 2015.
  • [25] L. Zhang and D. Zhang, “Domain adaptation extreme learning machines for drift compensation in e-nose systems,” IEEE Transactions on instrumentation and measurement, vol. 64, no. 7, pp. 1790–1801, 2015.
  • [26] M. Uzair and A. Mian, “Blind domain adaptation with augmented extreme learning machine features,” IEEE transactions on cybernetics, vol. 47, no. 3, pp. 651–660, 2017.
  • [27] S. M. Salaken, A. Khosravi, T. Nguyen, and S. Nahavandi, “Extreme learning machine based transfer learning algorithms: A survey,” Neurocomputing, vol. 267, pp. 516–524, 2017.
  • [28] C. H. Q. Ding, D. Zhou, X. He, and H. Zha, “R1-pca: rotational invariant principal component analysis for robust subspace factorization,” pp. 281–288, 2006.
  • [29] Q. Gu, Z. Li, and J. Han, “Joint feature selection and subspace learning,” pp. 1294–1299, 2011.
  • [30] F. Nie, H. Huang, X. Cai, and C. H. Q. Ding, “Efficient and robust feature selection via joint ℓ2,1-norms minimization,” pp. 1813–1821, 2010.
  • [31] Y. H. Tsai, Y. Yeh, and Y. F. Wang, “Learning cross-domain landmarks for heterogeneous domain adaptation,” pp. 5081–5090, 2016.
  • [32] G. Griffin, A. Holub, and P. Perona, “Caltech-256 object category dataset,” 2007.
  • [33] H. Bay, T. Tuytelaars, and L. Van Gool, “Surf: Speeded up robust features,” Computer vision–ECCV 2006, pp. 404–417, 2006.
  • [34] M.-R. Amini, N. Usunier, and C. Goutte, “Learning from multiple partially observed views - an application to multilingual text categorization,” in NIPS 22, 2009.
  • [35] N. Ueffing, M. Simard, S. Larkin, and H. Johnson, “Nrc’s portage system for wmt 2007,” ACL 2007, pp. 185–188, 2007.