How to Pick the Best Source Data? Measuring Transferability for Heterogeneous Domains

12/23/2019
by Seungcheol Park, et al.
Seoul National University
SAMSUNG

Given a set of source data with pre-trained classification models, how can we quickly and accurately select the most useful source data to improve the performance of a target task? We address the problem of measuring transferability for heterogeneous domains, where the source and the target data have different feature spaces and distributions. We propose Transmeter, a novel method to efficiently and accurately measure the transferability of two datasets. Transmeter utilizes a pre-trained source classifier and a reconstruction loss to increase its efficiency and performance. Furthermore, Transmeter uses feature transformation layers, label-wise discriminators, and a mean distance loss to learn common representations for the source and target domains. As a result, Transmeter and its variant measure transferability most accurately, with running times comparable to those of competitors.


1 Introduction

Given a set of source data with pre-trained classification models, how can we quickly and accurately select the most useful source data to improve the performance of a target task? In supervised learning, the amount of labeled data has a direct effect on the performance of the target task. However, labeling a sufficient amount of data is costly and time-consuming, and it is often impossible to get enough data when it comes to rare events or restricted data, e.g., mechanical faults or personal information. For this reason, there has been growing interest in Transfer Learning, which aims to transfer data or a model from a source task to a target task to reduce the demand for target data. There are several flavors of transfer learning. In Homogeneous Transfer Learning [7, 21, 11, 6, 10], the source and the target domains have an identical feature space. In Heterogeneous Transfer Learning [15, 8, 17, 3, 23, 9, 22], which we focus on in this paper, the two feature spaces have different dimensions.

Heterogeneous transfer learning enlarges the pool of available source data; however, it also introduces a significant challenge: the distributions, as well as the meanings of the features, differ between the source and target domains. Such differences may lead to negative transfer, where the accuracy of the target task decreases after the transfer. Thus, it is important to quickly measure the transferability between source and target data so that we avoid transferring source data that leads to negative transfer.

In this paper, we propose Transmeter, a novel method to accurately and quickly measure the transferability between source and target data. The base model of Transmeter consists of four modules: a source encoder, a decoder, a label predictor, and a domain classifier (see Figure 2). The source encoder maps the source data into the target's feature space so that the two datasets have homogeneous representations. We feed the target data and the mapped source data into the same decoder and label predictor to predict labels. We decrease the training time by reusing the pre-trained source model with fixed weights. The domain classifier forms an adversarial architecture with the source encoder to maximize the accuracy: the domain gap between the source and the target data is reduced by the competition between the domain classifier and the source encoder, which enhances the accuracy of the transferability measurement. We improve the accuracy of the base model with three additional ideas: label-wise domain classifiers for better adversarial training, a reconstruction loss for enhancing the label prediction, and a mean distance loss for better learning the homogeneous representations. Extensive experiments show that Transmeter and its variant give the best accuracy, with running times comparable to those of competitors (see Figure 1).

Figure 1: Comparison of (a) Spearman's rank correlation and (b) running time among Transmeter, Transmeter-M, and competitors. Note that Transmeter and Transmeter-M give the highest overall rank correlation while running in comparable time to the other methods.

The main contributions of this paper are as follows.

Problem Definition. We define the problem of measuring the transferability between heterogeneous datasets. Unlike previous works that focus on full model transfer, our problem focuses on measuring the transferability between datasets efficiently and accurately.

Method. Our proposed method Transmeter uses a pre-trained source model and an adversarial architecture to efficiently and accurately measure the transferability between two datasets. Transmeter learns homogeneous representations of source and target domains using feature transformation layers, label-wise discriminators, and a newly designed mean distance function.

Experiments. Extensive experiments show that Transmeter and its variant give the best accuracy in measuring transferability, with running times similar to those of other methods.

The rest of the paper is organized as follows: related works in Section 2, proposed method in Section 3, experiments in Section 4, and conclusion in Section 5.

2 Related Works

We review previous works on heterogeneous domain adaptation, negative transfer, and measuring transferability.

2.1 Heterogeneous Domain Adaptation

Heterogeneous domain adaptation aims at transfer learning across heterogeneous domains, where the different feature spaces and distributions impose significant challenges. Recent studies addressing these challenges are divided into two groups: symmetric and asymmetric feature-based transfer learning. Symmetric approaches transform both the source and the target domains into a common latent space [15, 3, 20], while asymmetric approaches transform only the source domain to the target domain [8, 23, 22]. The base model of Transmeter can be regarded as asymmetric feature-based transfer learning, but it also uses the idea of symmetrically transforming the target feature space to increase flexibility.

2.2 Negative Transfer

[12, 19] define negative transfer as a situation where "transferring knowledge from the source can have a negative impact on the target learner". Negative transfer stems from the difference between the source and the target data distributions, and has been observed in various settings [13, 2, 5, 1]. [18] observed that the three main factors behind negative transfer are the transfer learning algorithm, the divergence between the joint distributions, and the size of the labeled target data.

2.3 Measuring Transferability

As the number of available source datasets grows, it becomes increasingly important to exploit source data to boost the performance of a target task. It is thus necessary to efficiently and accurately estimate the transferability between source and target data before a full transfer.

[16] and [14] use the ratio of the clustered source data in the unified feature space, and the confidence of the pseudo-labeled target data, respectively, to remove suspicious source data and avoid negative transfer. However, none of the previous works explicitly evaluates transferability. Our proposed Transmeter explicitly measures the transferability and chooses the best source data.

Symbol Description
n Number of given source tasks
{D_1, …, D_n} A set of source datasets
{C_1, …, C_n} A set of source classifiers
X_i Input features of the i-th source data
y_i Label of the i-th source data
C_i Classifier of the i-th source data
d_i Input dimension of the i-th source data
X_t Input features of the target data
y_t Label of the target data
C_t Classifier of the target data
d_t Input dimension of the target data
Table 1: Symbol description.
Figure 2: Architecture of Transmeter.

3 Proposed Method

We formally define the problem and propose Transmeter, our novel method for transferability measurement. Table 1 summarizes the symbols used.

3.1 Problem Definition

Given a set of n source datasets {(X_1, y_1), …, (X_n, y_n)} with pre-trained classifiers {C_1, …, C_n} and the target data (X_t, y_t), where the i-th source data has input dimension d_i and the target data has input dimension d_t, our objective is to find the best source data and its classifier that improve the target performance the most after transferring them to the target task. We focus on heterogeneous transfer learning, where d_i ≠ d_t.

3.2 Overview

We propose Transmeter, a novel method to determine the most useful source data by measuring transferability. Figure 2 depicts the overall structure of Transmeter, which consists of three learnable networks: the source encoder, the decoder, and the two label-wise domain classifiers. The label predictor is the pre-trained source model and is fixed during training. The source encoder generates a homogeneous representation by mapping the source input features into the target feature space. To utilize the pre-trained source model, whose input dimension must equal that of the source data, the decoder transforms the homogeneous representations of both domains from the target dimension back to the source input dimension. Similar to Domain Adversarial Neural Networks [4], the domain classifiers distinguish the source and the target domains, while the source encoder extracts domain-invariant features. After training, the domain classifiers can no longer discriminate the two domains, and they are not used in the inference step.

0:  Input: source labeled data, the parameters of the pre-trained source model, and target labeled data
0:  Output: the learned parameters of the source encoder, the decoder, and the two label-wise domain classifiers
1:  initialize the learnable parameters randomly
2:  while the stopping criterion is not met do
3:     encode the source features into the target feature space to obtain their homogeneous representations
4:     decode the homogeneous representations of the source and the target data back to the source input dimension, and predict their labels with the fixed source model (softmax outputs)
5:     for each instance, feed its homogeneous representation to the domain classifier matching its label (0 or 1) and predict its domain
6:     compute the encoder, decoder, and domain-classifier losses according to equations (1)-(8)
7:     update the learnable parameters using gradient descent
8:  end while
Algorithm 1 Transmeter
Data Abbreviation Field Features Instances
Australian Credit Approval Australian Financial 14 690
Breast Cancer Wisconsin (Diagnostic) Cancer-diag Health 32 569
Breast Cancer Wisconsin (Original) Cancer-orig Health 10 699
Student Grade Prediction Grade Education 33 649
HTRU2 Data Set Pulsar Astronomy 8 17898
Table 2: Description of the Datasets.

Algorithm 1 shows Transmeter. Given source data, a source model pre-trained on the source data, and target data, the algorithm learns the model parameters to measure transferability. Note that the parameters of the pre-trained source model are fixed. All learnable parameters are initialized with random values. During training, the source and the target data flow through the model simultaneously, and the parameters are updated using gradient descent.

3.3 Objective Function

We define our learning objective as follows:

(1)

The three terms in equation (1) denote the loss functions for the parameters of the source encoder, the decoder, and the domain classifiers, respectively. They are constructed from four different loss functions: the label predictor loss, the feature reconstruction loss, the domain discrimination loss, and the mean distance loss; the relative weights of these terms are hyperparameters. In the following, we describe the four loss functions in detail.
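As a minimal sketch of how the three objectives could combine the four loss terms, the following assumes a standard adversarial weighting in the style of DANN [4]; the hyperparameter names alpha, beta, and gamma are illustrative, not from the paper:

```python
def transmeter_objectives(l_label, l_recon, l_domain, l_mean,
                          alpha=1.0, beta=1.0, gamma=1.0):
    """Hypothetical combination of the four loss terms into the three
    objectives of Section 3.3: the encoder minimizes the task-related
    losses while fooling the domain classifiers (hence the negated
    domain term), the decoder minimizes the prediction and
    reconstruction losses, and the domain classifiers minimize their
    own discrimination loss."""
    encoder_loss = l_label + alpha * l_recon + beta * l_mean - gamma * l_domain
    decoder_loss = l_label + alpha * l_recon
    classifier_loss = l_domain
    return encoder_loss, decoder_loss, classifier_loss
```

The sign flip on the domain term is what makes the training adversarial: the encoder is rewarded when the domain classifiers fail.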

3.3.1 Label Predictor

The label predictor loss is designed to correctly classify instances.

(2)
(3)

In equations (2) and (3), the two losses are the sums of per-instance losses between the ground-truth and the predicted labels over the source and the target data, respectively, each normalized by the number of instances in its domain.
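As a concrete sketch, the per-instance loss in equations (2)-(3) can be taken to be standard binary cross-entropy (an assumption on our part; the paper's exact loss may differ):

```python
import math

def label_predictor_loss(y_true, y_prob):
    """Average binary cross-entropy between ground-truth labels and
    predicted probabilities for one batch; in Transmeter it would be
    computed separately for the mapped source data and the target data."""
    eps = 1e-12  # avoid log(0)
    return -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for y, p in zip(y_true, y_prob)) / len(y_true)
```

Near-perfect predictions give a loss close to zero, while an uninformed prediction of 0.5 gives ln 2 ≈ 0.693.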

3.3.2 Feature Reconstruction

The feature reconstruction loss is designed to recover the original source features so that the pre-trained source model can be reused. This can be thought of as an autoencoder: the source encoder maps the source input features into a constrained code, and the decoder recovers the input features from the code. The model is trained to minimize the reconstruction error for each source data point.

(4)

Equation (4) measures the error between each source instance and its decoded features.
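A minimal sketch of such a reconstruction loss, assuming the usual autoencoder choice of mean squared error (the paper's equation (4) may use a different norm):

```python
def reconstruction_loss(x_src, x_dec):
    """Mean squared error between the original source features and the
    decoder's reconstruction, averaged over features and instances."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(xi, xhat)) / len(xi)
        for xi, xhat in zip(x_src, x_dec)
    ) / len(x_src)
```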

3.3.3 Label-wise Discrimination

The domain discrimination loss is designed to improve the accuracy of the label prediction while making the source and the target features indistinguishable. We separate instances for labels 0 and 1, and perform domain classification for each label. Such separation prevents all the source instances from being mapped close to target points of only a single label.

(5)
(6)

In equations (5) and (6), the losses are normalized by the numbers of source and target instances with labels 0 and 1, and are computed from the predicted domain classes of the source and the target instances within each label group.
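The label-wise split can be sketched as follows, assuming each per-label domain classifier is trained with binary cross-entropy (the function names and the input layout are illustrative, not from the paper):

```python
import math

def domain_bce(truths, probs):
    """Binary cross-entropy of one domain classifier: the truth is 1
    for source instances and 0 for target instances."""
    eps = 1e-12
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for t, p in zip(truths, probs)) / len(truths)

def label_wise_domain_loss(batches):
    """Sum of per-label domain-classification losses. `batches` maps a
    class label (0 or 1) to the (domain_truths, predicted_source_probs)
    of the instances carrying that label, so instances of each label
    are judged by their own discriminator (Section 3.3.3)."""
    return sum(domain_bce(t, p) for t, p in batches.values())
```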

3.3.4 Mean Distance

The mean distance loss is designed to further make the source and the target data indistinguishable. For a batch of source and target instances, we minimize the distance between the average source vector and the average target vector in the homogeneous representation space.

(7)
(8)

Equations (7) and (8) are computed from the mean vectors for each label and domain.
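For a single label group, the idea can be sketched as the squared Euclidean distance between batch means (an assumption; the paper computes mean vectors for each label and domain, so the full loss would sum this over the label groups):

```python
def mean_distance_loss(src_reprs, tgt_reprs):
    """Squared Euclidean distance between the batch-mean source vector
    and the batch-mean target vector in the homogeneous representation
    space, for one label group."""
    d = len(src_reprs[0])
    mu_s = [sum(v[j] for v in src_reprs) / len(src_reprs) for j in range(d)]
    mu_t = [sum(v[j] for v in tgt_reprs) / len(tgt_reprs) for j in range(d)]
    return sum((a - b) ** 2 for a, b in zip(mu_s, mu_t))
```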

4 Experiments

Model Description
Transmeter Our proposed model
Transmeter-0 Transmeter without any improvements
Transmeter-A Transmeter without reconstruction loss
Transmeter-L Transmeter without label-wise discriminators
Transmeter-M Transmeter without mean distance loss
Table 3: Descriptions of Transmeter and its four variants.
Australian Cancer-diag Cancer-orig
Model Accuracy Baseline Accuracy Baseline Accuracy Baseline
Transmeter 65.70 65.22 94.74 93.57 90.73 90.73
Transmeter-0 61.84 65.22 92.98 93.57 90.73 90.73
Transmeter-A 65.70 65.22 94.74 93.57 90.73 90.73
Transmeter-L 65.70 65.22 92.98 93.57 90.73 90.73
Transmeter-M 62.80 65.22 94.74 93.57 90.73 90.73
Grade Pulsar Average
Model Accuracy Baseline Accuracy Baseline improvement
Transmeter 68.07 63.87 96.78 97.62 1.00
Transmeter-0 62.18 63.87 97.32 97.62 -1.19
Transmeter-A 65.55 63.87 96.67 97.62 0.48
Transmeter-L 64.71 63.87 97.08 97.62 0.04
Transmeter-M 66.39 63.87 97.21 97.62 0.17
Table 4: The result of the self-transfer. Transmeter and its variants succeed in improving the accuracy (%) except for Transmeter-0.

We conduct experiments to answer the following questions on the performance and efficiency of Transmeter.

Q1. Model sanity check (Section 4.2). Does Transmeter improve the accuracy of a target task?

Q2. Ablation study (Section 4.3). Which variant of Transmeter provides the best accuracy?

Q3. Comparison to competitors (Section 4.4). How does Transmeter compare to its competitors?

4.1 Experimental Settings

We introduce experimental settings including datasets, pre-trained models, and baseline methods. All of our experiments are done on a workstation with a GeForce GTX 1080 Ti GPU.

4.1.1 Datasets

We use the five binary classification datasets in Table 2, obtained from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php). We select diverse datasets with different domains, sizes, and dimensions.

4.1.2 Pre-trained models

We train an MLP for each dataset and use it as the pre-trained source model in Transmeter. Since we use only part of the features or of the training data in Sections 4.2 and 4.3, we also train MLPs on those reduced datasets.

4.1.3 Baselines

We compare the results of Transmeter with HeMap [15], the most recent heterogeneous transfer learning method, and its variant. HeMap samples source data near a given target data, and determines that it is too risky to transfer when the ratio of the selected source data is lower than a threshold. We exploit this ratio of the selected source data as a transferability measure between two datasets, and use it as a baseline method named HeMap-t. We also use the full process of HeMap as a full-transfer algorithm.

For the ablation study, we further compare Transmeter with its variants shown in Table 3.

4.2 Model Sanity Check

To verify that Transmeter and its variants improve the accuracy of a target task, we perform self-transfer by transferring a source dataset to a feature-removed copy of the same dataset. For each dataset, we generate a new dataset by keeping only 20% of its features, and check whether Transmeter and its variants improve the accuracy through transfer learning.
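The feature-removed copy can be produced as follows; this sketch assumes a uniformly random choice of retained columns, since the paper does not specify how the 20% of features are selected:

```python
import random

def drop_features(rows, keep_ratio=0.2, seed=0):
    """Build a 'feature-removed copy' for the self-transfer check: keep
    a random keep_ratio fraction of the columns (at least one), and
    return the reduced rows along with the kept column indices."""
    rng = random.Random(seed)
    d = len(rows[0])
    k = max(1, int(d * keep_ratio))
    keep = sorted(rng.sample(range(d), k))
    return [[row[j] for j in keep] for row in rows], keep
```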

Table 4 shows the result of the self-transfer. Note that Transmeter and its variants, except Transmeter-0, improve the accuracy over the baseline on average and in most cases.

4.3 Ablation Study

We compare the performance of Transmeter with those of its variants by measuring the transferability between all source and target pairs.

Model Average Transferability
Transmeter 0.77
Transmeter-0 -2.13
Transmeter-A -0.91
Transmeter-L -0.74
Transmeter-M 0.59
Table 5: Average transferability (%) of each model. Transmeter outperforms its variants.

Table 5 shows the average transferability. Note that Transmeter outperforms all of its variants and shows positive transferability on average. We also observe that Transmeter-A, Transmeter-L, and Transmeter-M outperform Transmeter-0, which means that each of the three improvements is meaningful. Based on these results, we select Transmeter and Transmeter-M as our best models when comparing to other competitors in Section 4.4.

4.4 Comparison to Competitors

We compare Transmeter to other methods for measuring transferabilities. We compare 1) the accuracies of the transferability measurement, and 2) the running times of all the methods. We use Spearman’s rank correlation coefficient between the predicted ranks and the ground truth ranks to evaluate the accuracy. Since Transmeter and Transmeter-M often improve the accuracy of the target task and outperform HeMap, we define the ground truth rank using the maximum accuracy of HeMap, Transmeter, and Transmeter-M for each source and target pair.
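The evaluation metric above, Spearman's rank correlation, is simply the Pearson correlation of the rank vectors; a self-contained sketch (with average ranks assigned to ties) is:

```python
def spearman_rank_correlation(xs, ys):
    """Spearman's rank correlation between two score lists: replace
    each list by its ranks (ties get the average rank of their group)
    and compute the Pearson correlation of the ranks."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        i = 0
        while i < len(order):
            j = i
            # extend j over the group of equal values
            while j + 1 < len(order) and vals[order[j + 1]] == vals[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # average 1-based rank of the tie group
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Identical rankings give a coefficient of 1 and exactly reversed rankings give -1, so a method whose predicted source ranking matches the ground truth scores highest.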

Figure 1 shows the results. Note that Transmeter and Transmeter-M provide the overall best accuracy. In terms of running time, HeMap is the slowest, but there is no clear winner among Transmeter, Transmeter-M, and HeMap-t. This shows that Transmeter and its variant give the most accurate transferability measurement, and their running times are not worse than competitors.

5 Conclusion

In this paper, we propose Transmeter, a novel algorithm that measures the transferability between two datasets. The base model of Transmeter comprises feature transformation layers, a label predictor, and domain classifiers. We use 1) a pre-trained source model as the label predictor to reduce the training time, and 2) domain classifiers to reduce the gap between the source and the target domains. We improve the accuracy of the base model by introducing a reconstruction loss, label-wise discriminators, and a mean distance loss. Experiments show that Transmeter gives the best accuracy in measuring transferability, with running times comparable to those of competitors.

Acknowledgement

This work is supported by Samsung Research, Samsung Electronics Co., Ltd.

References

  • [1] Z. Cao, M. Long, J. Wang, and M. I. Jordan (2018) Partial transfer learning with selective adversarial networks. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 2724–2732. Cited by: §2.2.
  • [2] L. Duan, D. Xu, and S. Chang (2012) Exploiting web images for event recognition in consumer videos: A multiple source domain adaptation approach. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June 16-21, 2012, pp. 1338–1345. Cited by: §2.2.
  • [3] L. Duan, D. Xu, and I. W. Tsang (2012) Learning with augmented features for heterogeneous domain adaptation. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 - July 1, 2012, Cited by: §1, §2.1.
  • [4] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. S. Lempitsky (2017) Domain-adversarial training of neural networks. In Domain Adaptation in Computer Vision Applications, pp. 189–209. Cited by: §3.2.
  • [5] L. Ge, J. Gao, H. Q. Ngo, K. Li, and A. Zhang (2014) On handling negative transfer and imbalanced distributions in multiple source transfer learning. Statistical Analysis and Data Mining 7 (4), pp. 254–271. Cited by: §2.2.
  • [6] B. Gong, Y. Shi, F. Sha, and K. Grauman (2012) Geodesic flow kernel for unsupervised domain adaptation. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June 16-21, 2012, pp. 2066–2073. Cited by: §1.
  • [7] H. Daumé III (2007) Frustratingly easy domain adaptation. In ACL 2007, Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, June 23-30, 2007, Prague, Czech Republic. Cited by: §1.
  • [8] B. Kulis, K. Saenko, and T. Darrell (2011) What you saw is not what you get: domain adaptation using asymmetric kernel transforms. In The 24th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2011, Colorado Springs, CO, USA, 20-25 June 2011, pp. 1785–1792. Cited by: §1, §2.1.
  • [9] W. Li, L. Duan, D. Xu, and I. W. Tsang (2014) Learning with augmented features for supervised and semi-supervised heterogeneous domain adaptation. IEEE Trans. Pattern Anal. Mach. Intell. 36 (6), pp. 1134–1148. Cited by: §1.
  • [10] M. Oquab, L. Bottou, I. Laptev, and J. Sivic (2014) Learning and transferring mid-level image representations using convolutional neural networks. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pp. 1717–1724. Cited by: §1.
  • [11] S. J. Pan, X. Ni, J. Sun, Q. Yang, and Z. Chen (2010) Cross-domain sentiment classification via spectral feature alignment. In Proceedings of the 19th International Conference on World Wide Web, WWW 2010, Raleigh, North Carolina, USA, April 26-30, 2010, pp. 751–760. Cited by: §1.
  • [12] S. J. Pan and Q. Yang (2010) A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22 (10), pp. 1345–1359. Cited by: §2.2.
  • [13] M. T. Rosenstein, Z. Marx, and L. P. Kaebling (2005) To transfer or not to transfer. In NIPS 2005 workshop on transfer learning, Vol. 898, pp. 1–4. Cited by: §2.2.
  • [14] C. Seah, Y. Ong, and I. W. Tsang (2013) Combating negative transfer from predictive distribution differences. IEEE Trans. Cybernetics 43 (4), pp. 1153–1165. Cited by: §2.3.
  • [15] X. Shi, Q. Liu, W. Fan, P. S. Yu, and R. Zhu (2010) Transfer learning on heterogenous feature spaces via spectral transformation. In 2010 IEEE International Conference on Data Mining, pp. 1049–1054. Cited by: §1, §2.1, §4.1.3.
  • [16] X. Shi, Q. Liu, W. Fan, and P. S. Yu (2013) Transfer across completely different feature spaces via spectral embedding. IEEE Trans. Knowl. Data Eng. 25 (4), pp. 906–918. Cited by: §2.3.
  • [17] C. Wang and S. Mahadevan (2011) Heterogeneous domain adaptation using manifold alignment. In IJCAI 2011, Proceedings of the 22nd International Joint Conference on Artificial Intelligence, Barcelona, Catalonia, Spain, July 16-22, 2011, pp. 1541–1546. Cited by: §1.
  • [18] Z. Wang, Z. Dai, B. Póczos, and J. G. Carbonell (2018) Characterizing and avoiding negative transfer. CoRR abs/1811.09751. Cited by: §2.2.
  • [19] K. R. Weiss, T. M. Khoshgoftaar, and D. Wang (2016) A survey of transfer learning. J. Big Data 3, pp. 9. Cited by: §2.2.
  • [20] Y. Yan, W. Li, M. K. P. Ng, M. Tan, H. Wu, H. Min, and Q. Wu (2017) Learning discriminative correlation subspace for heterogeneous domain adaptation. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, pp. 3252–3258. Cited by: §2.1.
  • [21] Y. Yao and G. Doretto (2010) Boosting for transfer learning with multiple sources. In The Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2010, San Francisco, CA, USA, 13-18 June 2010, pp. 1855–1862. Cited by: §1.
  • [22] H. Ye, X. Sheng, D. Zhan, and P. He (2018) Distance metric facilitated transportation between heterogeneous domains. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden, pp. 3012–3018. Cited by: §1, §2.1.
  • [23] J. T. Zhou, I. W. Tsang, S. J. Pan, and M. Tan (2014) Heterogeneous domain adaptation for multiple classes. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, AISTATS 2014, Reykjavik, Iceland, April 22-25, 2014, pp. 1095–1103. Cited by: §1, §2.1.