1 Introduction
Given a set of source datasets with pre-trained classification models, how can we quickly and accurately select the most useful source data to improve the performance of a target task? In supervised learning, the amount of labeled data has a direct effect on the performance of the target task. However, labeling a sufficient amount of data is costly and time-intensive, and it is often impossible to obtain enough data for rare events or restricted data, e.g., mechanical faults or personal information. For this reason, there has been growing interest in
Transfer Learning, which aims to transfer data or models from a source task to a target task in order to reduce the demand for target data. There are several flavors of transfer learning. In
Homogeneous Transfer Learning [7, 21, 11, 6, 10], both the source and the target domains have an identical feature space. In Heterogeneous Transfer Learning [15, 8, 17, 3, 23, 9, 22], which we focus on in this paper, the two feature spaces have different dimensions. Heterogeneous transfer learning enlarges the pool of available source data for transfer learning; however, it also introduces a significant challenge: the distributions as well as the meanings of the features in the source and the target domains are different. Such differences may lead to negative transfer, where the accuracy of the target task decreases after the transfer. Thus, it is important to quickly measure the transferability between a source data and a target data, so that we avoid transferring source data that lead to negative transfer.
In this paper, we propose Transmeter, a novel method to accurately and quickly measure the transferability between a source data and a target data. The base model of Transmeter consists of four modules: source encoder, decoder, label predictor, and domain classifier (see Figure 2). The source encoder maps the source data into the target’s feature space such that they have homogeneous representations. We feed the target data and the mapped source data into the same decoder and label predictor to predict labels. We decrease the training time by reusing the pre-trained source model with fixed weights. The domain classifier forms an adversarial architecture with the source encoder to maximize the accuracy: the domain gap between the source and the target data is reduced by the competition between the domain classifier and the source encoder. This reduced domain gap enhances the accuracy of measuring the transferability. We improve the accuracy of the base model of Transmeter with three additional ideas: label-wise domain classifiers for better adversarial training, a reconstruction loss for enhancing the label prediction, and a mean distance loss for better learning the homogeneous representations. Extensive experiments show that Transmeter and its variant give the best accuracy, while running in time comparable to that of competitors (see Figure 1).
[Figure 1: Comparison of Transmeter and competitors. (a) Spearman's rank correlation; (b) running time.]
The main contributions of this paper are as follows.
Problem Definition. We define the problem of measuring the transferability between heterogeneous datasets. Unlike previous works that focus on fully transferring models, this problem requires measuring the transferability between datasets efficiently and accurately.
Method. Our proposed method Transmeter uses a pre-trained source model and an adversarial architecture to efficiently and accurately measure the transferability between two datasets. Transmeter learns homogeneous representations of source and target domains using feature transformation layers, label-wise discriminators, and a newly designed mean distance function.
Experiments. Extensive experiments show that Transmeter and its variant give the best accuracy in measuring transferability, with running times similar to those of other methods.
2 Related Works
We review previous works on heterogeneous domain adaptation, negative transfer, and measuring transferability.
2.1 Heterogeneous Domain Adaptation
Heterogeneous domain adaptation aims for transfer learning across heterogeneous domains. However, the different feature spaces and distributions impose significant challenges. Recent studies addressing these challenges are divided into two groups: symmetric and asymmetric feature-based transfer learning. Symmetric approaches transform both the source and the target domains into a common latent space [15, 3, 20], while asymmetric approaches transform only the source domain to the target domain [8, 23, 22]. The base model of Transmeter can be regarded as asymmetric feature-based transfer learning, but it also uses the idea of symmetrically transforming the target feature space to increase flexibility.
2.2 Negative Transfer
[12, 19] define negative transfer as "transferring knowledge from the source can have a negative impact on the target learner". Negative transfer comes from the difference between the source and the target data distributions, and has been observed in various settings [13, 2, 5, 1]. [18] observed that the three main factors for negative transfer are the transfer learning algorithm, the divergence between joint distributions, and the size of the labeled target data.
2.3 Measuring Transferability
As the number of available source datasets grows, it becomes increasingly important to exploit the source data to boost the performance of a target task. It is then necessary to efficiently and accurately estimate the transferability between a source and a target data before a full transfer.
[16] and [14] use the ratio of the clustered source data in the unified feature space, and the confidence of the pseudo-labeled target data, respectively, to remove suspicious source data and avoid negative transfer. However, none of the previous works explicitly evaluate the transferability. Our proposed Transmeter explicitly measures the transferability and chooses the best data.

Table 1: Symbols and their descriptions.

| Symbol | Description |
|---|---|
| $n$ | Number of given source tasks |
| $\{D_i\}_{i=1}^{n}$ | A set of source datasets |
| $\{C_i\}_{i=1}^{n}$ | A set of source classifiers |
| $X_i$ | Input features of the $i$-th source data |
| $y_i$ | Label of the $i$-th source data |
| $C_i$ | Classifier of the $i$-th source data |
| $d_i$ | Input dimension of the $i$-th source data |
| $X_t$ | Input features of the target data |
| $y_t$ | Label of the target data |
| $C_t$ | Classifier of the target data |
| $d_t$ | Input dimension of the target data |
3 Proposed Method
We formally define the problem and propose Transmeter, our novel method for transferability measurement. Table 1 summarizes the symbols used.
3.1 Problem Definition
Given a set of $n$ source datasets $\{D_i\}_{i=1}^{n}$ with classifiers $\{C_i\}_{i=1}^{n}$ and the target data with features $X_t$ and labels $y_t$, where the $i$-th source data consist of features $X_i$ with input dimension $d_i$ and labels $y_i$, and the target input dimension is $d_t$, our objective is to find the best source data and its related classifier that improve the target performance the most after transferring them to the target task. We focus on heterogeneous transfer learning where $d_i \neq d_t$.
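As an illustration of how such a measure drives the selection, the following sketch iterates over the candidate source datasets and returns the one with the highest estimated transferability; `measure_transferability` is a placeholder for Transmeter or any other transferability measure.

```python
from typing import Callable, List, Tuple

def select_best_source(
    sources: List[Tuple[object, object]],   # (source dataset, pre-trained source classifier) pairs
    target: object,                          # target dataset
    measure_transferability: Callable[[object, object, object], float],
) -> Tuple[int, float]:
    """Return the index and score of the source with the highest estimated transferability."""
    best_idx, best_score = -1, float("-inf")
    for i, (source_data, source_clf) in enumerate(sources):
        score = measure_transferability(source_data, source_clf, target)
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx, best_score
```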
3.2 Overview
We propose Transmeter, a novel method to determine the most useful source data by measuring the transferability. Figure 2 depicts the overall structure of Transmeter, which consists of three learnable networks: the source encoder, the decoder, and the domain classifiers. The label predictor is the pre-trained source model and is kept fixed during training. The source encoder generates the homogeneous representation by mapping the source input features into the target feature space. To utilize the pre-trained source model, whose input dimension is $d_i$, the decoder transforms the homogeneous representations of both domains from dimension $d_t$ back to $d_i$. Similar to Domain Adversarial Neural Networks
[4], the domain classifier distinguishes the source and the target domains, while the source encoder extracts domain-invariant features. After training, the domain classifier can no longer discriminate the two domains, and it is not used in the inference step.
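To make the data flow concrete, below is a minimal PyTorch sketch of the base architecture under our own assumptions: MLP layers, a frozen pre-trained predictor passed in as `pretrained_predictor`, and two label-wise domain classifiers. Layer sizes and interfaces are illustrative, not the authors' exact design.

```python
import torch
import torch.nn as nn

class TransmeterSketch(nn.Module):
    """Illustrative base architecture; dimensions and layers are assumptions."""

    def __init__(self, d_src: int, d_tgt: int, pretrained_predictor: nn.Module, hidden: int = 64):
        super().__init__()
        # Source encoder: maps source features (d_src) into the target feature space (d_tgt).
        self.encoder = nn.Sequential(nn.Linear(d_src, hidden), nn.ReLU(), nn.Linear(hidden, d_tgt))
        # Decoder: maps homogeneous representations (d_tgt) back to the source dimension (d_src),
        # so that the frozen pre-trained source model can consume them.
        self.decoder = nn.Sequential(nn.Linear(d_tgt, hidden), nn.ReLU(), nn.Linear(hidden, d_src))
        # Pre-trained source label predictor, kept frozen during training.
        self.predictor = pretrained_predictor
        for p in self.predictor.parameters():
            p.requires_grad = False
        # Two label-wise domain classifiers, one for each binary label.
        self.domain_clf = nn.ModuleList(
            nn.Sequential(nn.Linear(d_tgt, hidden), nn.ReLU(), nn.Linear(hidden, 1)) for _ in range(2)
        )

    def forward(self, x_src: torch.Tensor, x_tgt: torch.Tensor):
        z_src = self.encoder(x_src)   # homogeneous representation of the source data
        z_tgt = x_tgt                 # target features are assumed to already live in the target space
        pred_src = self.predictor(self.decoder(z_src))
        pred_tgt = self.predictor(self.decoder(z_tgt))
        return z_src, z_tgt, pred_src, pred_tgt
```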
Table 2: Summary of the datasets.

| Data | Abbreviation | Field | Features | Instances |
|---|---|---|---|---|
| Australian Credit Approval | Australian | Financial | 14 | 690 |
| Breast Cancer Wisconsin (Diagnostic) | Cancer-diag | Health | 32 | 569 |
| Breast Cancer Wisconsin (Original) | Cancer-orig | Health | 10 | 699 |
| Student Grade Prediction | Grade | Education | 33 | 649 |
| HTRU2 Data Set | Pulsar | Astronomy | 8 | 17898 |
Algorithm 1 shows the training of Transmeter. Given a source data, a source model pre-trained on that data, and a target data, the algorithm learns the parameters of the source encoder, the decoder, and the domain classifiers to measure the transferability. Note that the parameters of the pre-trained source model are fixed. All the learned parameters are initialized with random values. In the training process, the source and the target data flow into the model simultaneously, and the parameters are updated using gradient descent.
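Building on the module sketch above, one illustrative training step could look as follows. The exact instance losses, loss weights, and update schedule of Transmeter are not reproduced here; the alternating adversarial scheme and the hyperparameters `alpha`, `beta`, `gamma` are assumptions, and each batch is assumed to contain both labels with integer-typed class labels.

```python
import torch
import torch.nn.functional as F

def train_step(model, opt_model, opt_domain, x_s, y_s, x_t, y_t,
               alpha=1.0, beta=1.0, gamma=1.0):
    """One illustrative training step.
    opt_model optimizes the encoder and the decoder; opt_domain optimizes the domain classifiers."""
    z_s, z_t, pred_s, pred_t = model(x_s, x_t)

    def domain_loss(detach):
        # Label-wise domain discrimination: source -> domain 0, target -> domain 1.
        loss = 0.0
        for c in (0, 1):
            zs, zt = z_s[y_s == c], z_t[y_t == c]
            if detach:
                zs, zt = zs.detach(), zt.detach()
            logits = torch.cat([model.domain_clf[c](zs), model.domain_clf[c](zt)]).squeeze(-1)
            domains = torch.cat([torch.zeros(len(zs)), torch.ones(len(zt))]).to(logits.device)
            loss = loss + F.binary_cross_entropy_with_logits(logits, domains)
        return loss

    # (1) Update the domain classifiers to distinguish the two domains.
    opt_domain.zero_grad()
    domain_loss(detach=True).backward()
    opt_domain.step()

    # (2) Update the encoder and the decoder: predict labels with the frozen source model,
    #     reconstruct the source features, align label-wise means, and fool the domain classifiers.
    label_l = F.cross_entropy(pred_s, y_s) + F.cross_entropy(pred_t, y_t)
    recon_l = F.mse_loss(model.decoder(z_s), x_s)
    mean_l = sum(((z_s[y_s == c].mean(0) - z_t[y_t == c].mean(0)) ** 2).sum() for c in (0, 1))
    total = label_l + alpha * recon_l + beta * mean_l - gamma * domain_loss(detach=False)
    opt_model.zero_grad()
    total.backward()
    opt_model.step()
```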
3.3 Objective Function
We define our learning objective as follows:
(1)

Here, the three terms denote the loss functions for the parameters of the source encoder, the decoder, and the domain classifiers, respectively. They are constructed from four loss functions: the label predictor loss, the feature reconstruction loss, the domain discrimination loss, and the mean distance loss, whose relative weights are controlled by hyperparameters. In the following, we describe the four loss functions in detail.
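As a concrete illustration, one plausible DANN-style instantiation of Eq. (1) is given below; the loss symbols and the weights $\alpha$, $\beta$, $\gamma$ are our own notation rather than the authors' exact formulation.

```latex
% Illustrative instantiation only: symbols and weights are assumptions.
\begin{aligned}
\mathcal{L}_{\mathrm{enc}} &= \mathcal{L}_{y} + \alpha\,\mathcal{L}_{\mathrm{recon}} + \beta\,\mathcal{L}_{\mathrm{mean}} - \gamma\,\mathcal{L}_{\mathrm{domain}}, \\
\mathcal{L}_{\mathrm{dec}} &= \mathcal{L}_{y} + \alpha\,\mathcal{L}_{\mathrm{recon}}, \\
\mathcal{L}_{\mathrm{dc}}  &= \mathcal{L}_{\mathrm{domain}},
\end{aligned}
```

where the negative sign on the domain term reflects the adversarial game: the source encoder tries to fool the domain classifiers while they try to discriminate the two domains.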
3.3.1 Label Predictor
The label predictor loss is designed to correctly classify instances.
(2)
(3)

Here, Eqs. (2) and (3) are the summations of instance losses, which compare the ground truth and the predicted labels, over the source and the target instances, respectively; each is normalized by the number of instances in its domain.
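For concreteness, a plausible instantiation of Eqs. (2) and (3), with a per-instance loss $\ell$ (e.g., cross-entropy) and our own notation $n_s$, $n_t$ for the numbers of source and target instances, is:

```latex
% Illustrative instantiation only: the per-instance loss and normalization are assumptions.
\mathcal{L}_{y}^{S} = \frac{1}{n_s}\sum_{i=1}^{n_s} \ell\!\left(y_i^{S}, \hat{y}_i^{S}\right),
\qquad
\mathcal{L}_{y}^{T} = \frac{1}{n_t}\sum_{j=1}^{n_t} \ell\!\left(y_j^{T}, \hat{y}_j^{T}\right)
```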
3.3.2 Feature Reconstruction
The feature reconstruction loss is designed to recover the original source features so that the pre-trained source model can be reused. This can be thought of as an autoencoder: the source encoder maps the source input features into a constrained code, and the decoder recovers the input features from the code. The model is trained to minimize the reconstruction error for each source data point.
(4)

Here, Eq. (4) measures, for each source instance, the reconstruction error between its original input features and its decoded features.
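A plausible instantiation of Eq. (4), assuming a squared error between the source input features $x_i^{S}$ and their decoded version $\tilde{x}_i^{S}$ (our notation), is:

```latex
% Illustrative instantiation only: the squared-error form is an assumption.
\mathcal{L}_{\mathrm{recon}} = \frac{1}{n_s}\sum_{i=1}^{n_s} \left\lVert x_i^{S} - \tilde{x}_i^{S} \right\rVert_2^{2}
```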
3.3.3 Label-wise Discrimination
The domain discrimination loss is designed to improve the accuracy of the label prediction while making the source and the target features indistinguishable. We separate instances for labels 0 and 1, and perform domain classification for each label. Such separation prevents all the source instances from being mapped close to target points of only a single label.
(5)
(6)

Here, Eqs. (5) and (6) are the domain classification losses for instances with labels 0 and 1, respectively. Each compares the predicted domain class of every source and target instance carrying that label with its true domain, and is normalized by the numbers of source and target instances in that label group.
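A plausible instantiation of Eqs. (5) and (6) for the label-$c$ group ($c \in \{0, 1\}$), assuming binary cross-entropy with source instances assigned domain 0 and target instances domain 1 (our notation), is:

```latex
% Illustrative instantiation only: the binary cross-entropy form is an assumption.
\mathcal{L}_{\mathrm{domain}}^{(c)} =
 -\frac{1}{n_s^{(c)}}\sum_{i=1}^{n_s^{(c)}} \log\!\left(1 - \hat{d}_i^{S,(c)}\right)
 -\frac{1}{n_t^{(c)}}\sum_{j=1}^{n_t^{(c)}} \log\!\left(\hat{d}_j^{T,(c)}\right)
```

where $\hat{d}$ denotes the predicted probability that an instance belongs to the target domain.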
3.3.4 Mean Distance
The mean distance loss is designed to further make the source and the target data indistinguishable. For a batch of source and target instances, we minimize the distance between the average source vector and the average target vector in the homogeneous representation.
(7)
(8)

Here, Eqs. (7) and (8) use the mean vectors of the homogeneous representations computed for each label and each domain.
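A plausible instantiation of Eqs. (7) and (8), with $\bar{z}_{S}^{(c)}$ and $\bar{z}_{T}^{(c)}$ denoting the batch means of the homogeneous representations of source and target instances with label $c$ (our notation), is:

```latex
% Illustrative instantiation only: the squared Euclidean distance is an assumption.
\mathcal{L}_{\mathrm{mean}} = \sum_{c \in \{0, 1\}} \left\lVert \bar{z}_{S}^{(c)} - \bar{z}_{T}^{(c)} \right\rVert_2^{2}
```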
4 Experiments
Table 3: Transmeter and its variants.

| Model | Description |
|---|---|
| Transmeter | Our proposed model |
| Transmeter-0 | Transmeter without any improvements |
| Transmeter-A | Transmeter without reconstruction loss |
| Transmeter-L | Transmeter without label-wise discriminators |
| Transmeter-M | Transmeter without mean distance loss |
Table 4: Accuracy of self-transfer for Transmeter and its variants.

| Model | Australian Accuracy | Australian Baseline | Cancer-diag Accuracy | Cancer-diag Baseline | Cancer-orig Accuracy | Cancer-orig Baseline |
|---|---|---|---|---|---|---|
| Transmeter | 65.70 | 65.22 | 94.74 | 93.57 | 90.73 | 90.73 |
| Transmeter-0 | 61.84 | 65.22 | 92.98 | 93.57 | 90.73 | 90.73 |
| Transmeter-A | 65.70 | 65.22 | 94.74 | 93.57 | 90.73 | 90.73 |
| Transmeter-L | 65.70 | 65.22 | 92.98 | 93.57 | 90.73 | 90.73 |
| Transmeter-M | 62.80 | 65.22 | 94.74 | 93.57 | 90.73 | 90.73 |

| Model | Grade Accuracy | Grade Baseline | Pulsar Accuracy | Pulsar Baseline | Average Improvement |
|---|---|---|---|---|---|
| Transmeter | 68.07 | 63.87 | 96.78 | 97.62 | 1.00 |
| Transmeter-0 | 62.18 | 63.87 | 97.32 | 97.62 | -1.19 |
| Transmeter-A | 65.55 | 63.87 | 96.67 | 97.62 | 0.48 |
| Transmeter-L | 64.71 | 63.87 | 97.08 | 97.62 | 0.04 |
| Transmeter-M | 66.39 | 63.87 | 97.21 | 97.62 | 0.17 |
We conduct experiments to answer the following questions on the performance and efficiency of Transmeter.
Q1. Model sanity check (Section 4.2). Does Transmeter improve the accuracy of a target task?
Q2. Ablation study (Section 4.3). Which variant of Transmeter provides the best accuracy?
Q3. Comparison to competitors (Section 4.4). How do the accuracy and the running time of Transmeter compare with those of competitors?
4.1 Experimental Settings
We introduce the experimental settings including datasets, pre-trained models, and baseline methods. All experiments are done on a workstation with a GeForce GTX 1080 Ti GPU.
4.1.1 Datasets
We use the five binary classification datasets listed in Table 2, collected from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php). We select diverse datasets with different domains, sizes, and dimensions.

4.1.2 Pre-trained models
4.1.3 Baselines
We compare the results of Transmeter with HeMap [15] and its variant. HeMap [15] is the most recent heterogeneous transfer learning method. HeMap samples source data near a given target data, and determines that the transfer is too risky when the ratio of the selected source data is lower than a threshold. We exploit this ratio of the selected source data as the transferability between two datasets, and use it as a baseline method named HeMap-t. We also use the full process of HeMap as a full-transfer algorithm.
As an ablation study, we further compare Transmeter with its variants shown in Table 3.
4.2 Model Sanity Check
To verify that Transmeter and its variants improve the accuracy of a target task, we perform self-transfer by transferring a source dataset to a feature-removed copy of the same dataset. For each dataset, we generate a new dataset by keeping only 20% of its features, and check whether Transmeter and its variants improve the accuracy through transfer learning.
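A minimal sketch of generating such a feature-removed copy is shown below; keeping 20% of the features follows the text, while the random choice of which columns to keep is our assumption.

```python
import numpy as np

def feature_removed_copy(X: np.ndarray, keep_ratio: float = 0.2, seed: int = 0) -> np.ndarray:
    """Keep only a fraction of the columns (features) of X and drop the rest."""
    rng = np.random.default_rng(seed)
    n_keep = max(1, int(X.shape[1] * keep_ratio))
    kept_columns = rng.choice(X.shape[1], size=n_keep, replace=False)
    return X[:, np.sort(kept_columns)]
```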
Table 4 shows the result of the self-transfer. Note that Transmeter and its variants, except Transmeter-0, improve the accuracy over the baseline on average and in most cases.
4.3 Ablation Study
We compare the performance of Transmeter with those of its variants by measuring the transferability between all source and target pairs.
Table 5: Average transferability of Transmeter and its variants.

| Model | Average Transferability |
|---|---|
| Transmeter | 0.77 |
| Transmeter-0 | -2.13 |
| Transmeter-A | -0.91 |
| Transmeter-L | -0.74 |
| Transmeter-M | 0.59 |
Table 5 shows the average transferability. Note that Transmeter outperforms all of its variants, and shows a positive transferability on average. We also observe that Transmeter-A, Transmeter-L, and Transmeter-M outperform Transmeter-0, which means that each of the three improvements is meaningful. Based on these results, we select Transmeter and Transmeter-M as our best models when comparing with other competitors in Section 4.4.
4.4 Comparison to Competitors
We compare Transmeter to other methods for measuring transferability, in terms of 1) the accuracy of the transferability measurement and 2) the running time of each method. We use Spearman's rank correlation coefficient between the predicted ranks and the ground truth ranks to evaluate the accuracy. Since Transmeter and Transmeter-M often improve the accuracy of the target task and outperform HeMap, we define the ground truth rank using the maximum accuracy of HeMap, Transmeter, and Transmeter-M for each source and target pair.
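The ranking accuracy described above can be computed with SciPy's Spearman correlation; the scores below are illustrative placeholders, not measured values.

```python
from scipy.stats import spearmanr

def ranking_accuracy(predicted_scores, ground_truth_scores):
    """Spearman's rank correlation between predicted and ground-truth transferability scores."""
    correlation, _p_value = spearmanr(predicted_scores, ground_truth_scores)
    return correlation

# Toy example: one target task with four candidate source datasets.
predicted = [0.8, 0.1, -0.3, 0.5]      # transferability estimated by the method under test
ground_truth = [2.1, 0.4, -1.0, 1.5]   # accuracy improvement after a full transfer
print(ranking_accuracy(predicted, ground_truth))
```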
Figure 1 shows the results. Note that Transmeter and Transmeter-M provide the overall best accuracy. In terms of running time, HeMap is the slowest, but there is no clear winner among Transmeter, Transmeter-M, and HeMap-t. This shows that Transmeter and its variant give the most accurate transferability measurement, and their running times are not worse than competitors.
5 Conclusion
In this paper, we propose Transmeter, a novel algorithm that measures the transferability between two datasets. The base model of Transmeter comprises feature transformation layers, a label predictor, and domain classifiers. We use 1) a pre-trained source model as the label predictor to reduce the training time, and 2) a domain classifier to reduce the gap between the source and the target domains. We improve the accuracy of the base model by introducing a reconstruction loss, label-wise discriminators, and a mean distance loss. Experiments show that Transmeter gives the best accuracy in measuring transferability, with running time comparable to that of competitors.
Acknowledgement
This work is supported by Samsung Research, Samsung Electronics Co., Ltd.
References
- [1] (2018) Partial transfer learning with selective adversarial networks. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 2724–2732.
- [2] (2012) Exploiting web images for event recognition in consumer videos: A multiple source domain adaptation approach. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June 16-21, 2012, pp. 1338–1345.
- [3] (2012) Learning with augmented features for heterogeneous domain adaptation. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 - July 1, 2012.
- [4] (2017) Domain-adversarial training of neural networks. In Domain Adaptation in Computer Vision Applications, pp. 189–209.
- [5] (2014) On handling negative transfer and imbalanced distributions in multiple source transfer learning. Statistical Analysis and Data Mining 7 (4), pp. 254–271.
- [6] (2012) Geodesic flow kernel for unsupervised domain adaptation. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June 16-21, 2012, pp. 2066–2073.
- [7] (2007) Frustratingly easy domain adaptation. In ACL 2007, Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, June 23-30, 2007, Prague, Czech Republic.
- [8] (2011) What you saw is not what you get: domain adaptation using asymmetric kernel transforms. In The 24th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2011, Colorado Springs, CO, USA, 20-25 June 2011, pp. 1785–1792.
- [9] (2014) Learning with augmented features for supervised and semi-supervised heterogeneous domain adaptation. IEEE Trans. Pattern Anal. Mach. Intell. 36 (6), pp. 1134–1148.
- [10] (2014) Learning and transferring mid-level image representations using convolutional neural networks. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pp. 1717–1724.
- [11] (2010) Cross-domain sentiment classification via spectral feature alignment. In Proceedings of the 19th International Conference on World Wide Web, WWW 2010, Raleigh, North Carolina, USA, April 26-30, 2010, pp. 751–760.
- [12] (2010) A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22 (10), pp. 1345–1359.
- [13] (2005) To transfer or not to transfer. In NIPS 2005 Workshop on Transfer Learning, Vol. 898, pp. 1–4.
- [14] (2013) Combating negative transfer from predictive distribution differences. IEEE Trans. Cybernetics 43 (4), pp. 1153–1165.
- [15] (2010) Transfer learning on heterogenous feature spaces via spectral transformation. In 2010 IEEE International Conference on Data Mining, pp. 1049–1054.
- [16] (2013) Transfer across completely different feature spaces via spectral embedding. IEEE Trans. Knowl. Data Eng. 25 (4), pp. 906–918.
- [17] (2011) Heterogeneous domain adaptation using manifold alignment. In IJCAI 2011, Proceedings of the 22nd International Joint Conference on Artificial Intelligence, Barcelona, Catalonia, Spain, July 16-22, 2011, pp. 1541–1546.
- [18] (2018) Characterizing and avoiding negative transfer. CoRR abs/1811.09751.
- [19] (2016) A survey of transfer learning. J. Big Data 3, pp. 9.
- [20] (2017) Learning discriminative correlation subspace for heterogeneous domain adaptation. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, pp. 3252–3258.
- [21] (2010) Boosting for transfer learning with multiple sources. In The Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2010, San Francisco, CA, USA, 13-18 June 2010, pp. 1855–1862.
- [22] (2018) Distance metric facilitated transportation between heterogeneous domains. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden, pp. 3012–3018.
- [23] (2014) Heterogeneous domain adaptation for multiple classes. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, AISTATS 2014, Reykjavik, Iceland, April 22-25, 2014, pp. 1095–1103.