P2L: Predicting Transfer Learning for Images and Semantic Relations

08/20/2019 ∙ by Bishwaranjan Bhattacharjee, et al. ∙ Keck Graduate Institute, IBM, Columbia University

Transfer learning enhances learning across tasks by leveraging previously learned representations, provided they are properly chosen. We describe an efficient method to accurately estimate the appropriateness of a previously trained model for use in a new learning task. We use this measure, which we call "Predict To Learn" ("P2L"), in the two very different domains of images and semantic relations, where it predicts, from a set of "source" models, the one model most likely to produce effective transfer for training a given "target" model. We validate our approach thoroughly, by assembling a collection of candidate source models, then fine-tuning each candidate to perform each of a collection of target tasks, and finally measuring how well transfer has been enhanced. Across 95 tasks within multiple domains (image classification and semantic relations), the P2L approach was able to select the best transfer learning model on average, while the heuristic of choosing the model trained with the largest data set selected the best model in only 55 cases. These results suggest that P2L captures important information in common between source and target tasks, and that this shared informational structure contributes to successful transfer learning more than simple data size.




1 Introduction

Good machine learning quality often benefits from a large number of examples to capture a robust representation of the unknown input distribution [14] [16]. Small data sets may not sufficiently sample the input space. However, in practice, small training jobs are common and labeled data is scarce in many domains. In a survey of industry visual recognition workloads the average number of images submitted by customers was only 250, and the average number of labels was 5 (see Section 4.3.4).

To be clear, our goal is not cross-task transfer: our aim is to devise a heuristic for domain adaptation, for intra-task (such as image classification, or relationship prediction) cross-domain transfer, such as transfer from a classification model trained on a subset of ImageNet to a classification model for some unknown image classes.

Inductive transfer learning methods [24], [35] have been identified as a possible solution to this problem. These methods use knowledge acquired in a "source" task to enhance the learning of a new "target" task. However, these methods commonly assume that there is a "best" transfer model, usually the model trained with the largest data set [26]. Yet this assumption stands in tension with results showing that while a well chosen source can improve performance significantly, a poorly chosen one can be worse than random initialization [24] [28]. An open challenge remains: for fine-tuning of neural nets, how to predict the degree of transfer between different source and target domains prior to training.

In this work, we describe a method for identifying good transfer models prior to training that we validate in both the semantic relations and image domains. This is valuable since a general learning service must be prepared to train accurate models from widely varied target tasks automatically. Such a service must balance efficient training time and classification accuracy, precluding the exhaustive approach of fine-tuning all existing source models. At target task training time, P2L requires only a single forward pass of the target data set through a single reference model to identify the most likely candidate for fine-tuning.

Beginning with a single reference model (for images, VGG16 trained on ImageNet-1k, and for semantic relations PCNN), we generate feature vectors for each source dataset. We then use these models to characterize the similarity between source domain features and each target domain to select the source most likely to "donate" useful features, independent of the reference model. Using this metric, we estimate the similarity between a conceptual category of inputs, and each member of our family of classifiers. We then fine-tune a network for each combination of source and target to assess the degree to which each of the source models enhanced learning.

2 Related Work

The transfer learning literature explores a number of different topics and strategies such as few-shot learning [4] [30], domain adaptation [25], weight synthesis [31], and multi-task learning [15] [22] [32]. Some works propose novel combinations of these approaches, yielding new training architectures and optimization objectives to improve transfer performance under conditions of domain transfer with limited or incomplete annotations [19].

Representation transfer: Several approaches have been tried to transfer robust representations based on large numbers of examples to new tasks. These transfer learning approaches share a common intuition [3]: that networks which have learned compact representations of a "source" task, can reuse these representations to achieve higher performance on a related "target" task. Different approaches use different techniques to transfer previous representations. Instance-based approaches attempt to identify appropriate data used in the source task to supplement target task training, feature-representation approaches attempt to leverage source task weight matrices, and parameter-transfer approaches involve re-using the architecture or hyper-parameters of the source network [6] [24]. These approaches, often supplemented by related small-data techniques such as bootstrapping, can yield improvements in performance [2] [20] [22] [29] [37]. One approach to transfer learning is to leverage existing deep nets trained on a large dataset, for example VGG16 [26] [29] for images, or PCNN [39] for relation prediction. The trained weights in these networks have captured a representation of the input that can be transferred by fine-tuning the weights or retraining the final dense layer of the network on the new task.

While all these methods seek to improve performance on the target task by transfer from the source task, most assume there is only one source model, usually trained on ImageNet [8]. Additionally, this approach involves a number of meta-learning decisions, although in general each change from the original source architecture tends to decrease resulting classification performance [37]. Meta-learning [18] is another approach for representation transfer. While meta-learning typically deals with training a base model on a variety of different learning tasks, transfer learning is about learning from multiple related learning tasks [9]. Efficiency of transfer learning depends on the right source data selection, whereas meta-learning models could suffer from "negative transfer" [24] of knowledge if source and target domains are unrelated. Surprisingly, in image classification, performance gains are commonly observed even in cases where initialization data appears visually and semantically different from the target dataset (such as ImageNet and medical imaging datasets).

The Learning to Transfer [36] framework learns a reflection function that transforms feature vector representations to be more effectively classified by a kNN approach. Although it uses a model trained on ImageNet to produce the initial feature vectors, it is not a parameter-transfer method, since this model is not fine-tuned on the target domains.

In contrast, for relation prediction, semantic dissimilarity between source and target task typically prevents effective transfer learning [20], and so semantic-relations transfer is less well explored. However, semantic relations can contain information that supports transfer: one approach used vectorized representations of semantic relations as an added source of information to support image segmentation [21] [27].

Fine-tuning with co-training: Our approach is most similar to that of fine-tuning with co-training [37]. That method begins by using low-level features to identify images within a source dataset having similar textures to a target dataset, and concludes by using a multi-task objective to fine-tune the target task using these images. A related approach has been used to enhance performance and reduce training time in document classification [7] and to identify examples to supplement training data [11] [37]. Our goal is to extend this approach to high-level features, and to domains outside computer vision, to construct a more complete map of the feature space of a trained network. In this way our approach has some parallels with "learning to transfer" approaches [34], which attempt to train a source model optimized for transfer rather than target accuracy.

Taskonomy and Model-Recommendation: When transferring information captured by previous task-learning for a new task, it is important to take into account the nature of both tasks. One promising approach involves using recommender systems which identify models with similar latent-space representations of labeled data. In an object-detection context [33], this approach has been used to select likely candidates for inclusion in an ensemble model for object recognition. In multi-task visual learning, a model learned to estimate the similarity space of various visual tasks, to estimate the degree to which models trained to perform these tasks might contribute to transfer on a novel task [38]. The current paper aims, in part, to combine the low computational cost of the former technique with the enhanced transfer performance of the latter by learning a novel method for selection among previously trained source models. However, our goal is not cross-task transfer: for example, we are not trying to transfer a model learned for a depth estimation task to a classification task. Although within a task type (such as image classification) we do sometimes refer to a source task and target task, our aim is to devise a heuristic for domain adaptation, for intra-task, cross-domain transfer, such as transfer from a classification model trained on ImageNet to a classification model for Oxford Flowers. We show that P2L works for multiple domains and intra-task for 2 types of tasks: image classification and semantic relation prediction.

3 Methods

3.1 Theoretical Framework

This work addresses the problem of how to make an optimal choice of pre-trained network weights learned from source tasks $s_i \in S$ for some target task $t_j \in T$. Given a target task and dataset $t_j$, a model $M_{ij}$ is generated by first training on a source task and dataset $s_i$, and then transferring this information to $t_j$, through mechanisms such as fine-tuning. For each pair $(s_i, t_j)$, performance improvement by transfer in each scenario can be measured:

$$I(s_i, t_j) = P(M_{ij}) - P(M_{\emptyset j}) \tag{1}$$

where $P$ is some defined performance evaluation (such as accuracy), $\emptyset$ represents the nil dataset (randomly initialized weights), and $I$ is the measured performance improvement. Selecting the optimal $s_i$ would then trivially be achieved by:

$$s^{*}(t_j) = \arg\max_{s_i \in S} I(s_i, t_j) \tag{2}$$

However, since training a model for all combinations of $s_i$ and $t_j$ is computationally expensive, we build $E$ as an estimate of $I$, which could be used in its place to predict $s^{*}$ more efficiently. In this work, we demonstrate the approach by simply defining $E$ as

$$E(s_i, t_j) = G\big(A(F(s_i)),\; A(F(t_j)),\; |s_i|\big) \tag{3}$$

where $G$ is an empirically-derived monotonically increasing goodness measure, and $A$ is a statistical aggregation technique to combine sets of individual data instances into vectors representing the entire dataset. As an example, $F(s_i)$ is a set of feature vectors over images contained in $s_i$, and $A$ is simply the average over those feature vectors. As another example, $F$ could be a set of SIFT features over images in the dataset, and $A$ a corresponding codebook histogram. Noting that performance of $M_{ij}$ should increase as the cardinality of the datasets increases, $|s_i|$ is the number of elements, or size, of dataset $s_i$. Specifically,

$$E(s_i, t_j) = \sigma\!\left(-\alpha_1 \frac{D_{ij} - \mu_D}{\sigma_D}\right) \cdot \sigma\!\left(\alpha_2 \frac{|s_i| - \mu_S}{\sigma_S}\right) \tag{4}$$

where $D_{ij}$ represents the statistical dissimilarities in the datasets as measured by standard methods, such as KL or Jensen-Shannon divergence [10] [17], or Chi-square or Euclidean distance. $(\mu_D, \sigma_D)$ and $(\mu_S, \sigma_S)$ are the mean and standard deviations of the dissimilarities and the source dataset size, respectively, and $\alpha_1, \alpha_2$ are learned parameters (from the training datasets) that change how quickly each term reaches saturation. $\sigma(\cdot)$ is the logistic sigmoid function

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Figure 1: Image Deep Learning Pipeline

Intuitively, the first term captures the negative effect of dissimilarity of the average feature vectors, while the second term captures the positive effect of dataset size. Both effects are normalized and bounded, and the central multiplication effectively "ANDs" them together. In practice, we have found that the KL-divergence measure works best, possibly because of its asymmetry. In order to evaluate the performance of a given approximation function $E$ in comparison to some ground truth $I$, $E$ is learned by experimenting on a collection of target and source datasets, and then evaluated on a held-out set of datasets. In order to evaluate the performance of $E$ in comparison to exhaustive ground truth, we measure both the Spearman’s $\rho$ of its choices, as well as the accuracy of its Top-1 choice.

While this work takes an engineering design approach to an approximation function $E$, this framework paves the way for future works which may explicitly learn linear or non-linear functions to approximate $I$.
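To make the selection step concrete, here is a minimal Python sketch of a sigmoid-product estimator of the kind described in this section: a normalized dissimilarity pushed through a decreasing sigmoid, multiplied by a normalized source size pushed through an increasing sigmoid. The exact functional form, the parameter names `alpha_dis` and `alpha_size`, their default values, and the toy numbers are all assumptions made for illustration; they are not taken from the paper.

```python
import math
from statistics import mean, pstdev


def sigmoid(x):
    """Logistic sigmoid, bounded in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def zscores(values):
    """Normalize values by their mean and population standard deviation."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]


def estimate_transfer(dissimilarities, source_sizes, alpha_dis=1.0, alpha_size=1.0):
    """Score every candidate source for one target (higher is better).

    Assumed form: sigmoid(-alpha_dis * z(D)) * sigmoid(alpha_size * z(size)).
    """
    z_dis = zscores(list(dissimilarities))
    z_size = zscores(list(source_sizes))
    return [
        sigmoid(-alpha_dis * d) * sigmoid(alpha_size * s)
        for d, s in zip(z_dis, z_size)
    ]


def pick_source(names, dissimilarities, source_sizes, **params):
    """Return the name of the highest-scoring source and all scores."""
    scores = estimate_transfer(dissimilarities, source_sizes, **params)
    best = max(range(len(names)), key=scores.__getitem__)
    return names[best], scores


# Hypothetical candidates: dissimilarity to the target and training-set size.
names = ["animal", "plant", "tool"]
dissimilarities = [0.9, 0.2, 0.5]
sizes = [2_800_000, 2_100_000, 180_000]
best, scores = pick_source(names, dissimilarities, sizes)
print(best, [round(s, 3) for s in scores])
```

In this toy case the largest source ("animal") is not chosen, because its dissimilarity term is small; the product only rewards sources that are both close to the target and reasonably large.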

3.2 Implementation Details for Images

As described in Figure 1, we use the VGG16 model pre-trained on ImageNet-1k. For the feature extractor F, we extract the response of the penultimate fully connected layer, a 4096-dimensional vector. For the aggregation A, in a learning task with N images, we extract N such vectors, compute their mean, and then L1-normalize this mean, giving the summary feature vector for this task. For the dissimilarity D, we compute one of several possible distance measures, smoothing any zero components by adding an appropriate value.
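The aggregation and dissimilarity steps can be sketched in a few lines of plain Python. This is an illustrative sketch only: it assumes non-negative feature activations, uses a hypothetical smoothing constant `eps`, and leaves open which direction of the KL divergence (target to source or source to target) is used, since the text does not specify these details.

```python
import math


def summary_vector(feature_vectors):
    """Mean of per-image feature vectors, L1-normalized so it sums to 1."""
    n = len(feature_vectors)
    dim = len(feature_vectors[0])
    mean_vec = [sum(v[k] for v in feature_vectors) / n for k in range(dim)]
    total = sum(abs(x) for x in mean_vec)
    return [x / total for x in mean_vec] if total else mean_vec


def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) after adding eps to every component and renormalizing.

    The smoothing keeps zero components from producing log(0) or division by zero.
    """
    ps = [x + eps for x in p]
    qs = [x + eps for x in q]
    zp, zq = sum(ps), sum(qs)
    return sum((a / zp) * math.log((a / zp) / (b / zq)) for a, b in zip(ps, qs))


# Toy 3-dimensional "features" standing in for 4096-dimensional ones.
target = summary_vector([[1.0, 0.0, 3.0], [3.0, 0.0, 1.0]])
source = summary_vector([[1.0, 2.0, 1.0], [1.0, 2.0, 1.0]])
print(kl_divergence(target, source), kl_divergence(source, target))
```

The two printed values differ, which illustrates the asymmetry of the KL divergence mentioned in Section 3.1.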

3.3 Implementation Details for Semantic Relations

The task of relation prediction provides a second benchmark for source domain selection. In this task, a semantic relations base is extended with information extracted from text. We use the CC-DBP [12] dataset: the text of Common Crawl (http://commoncrawl.org) and the semantic relations schema and training data from DBpedia [1]. DBpedia is a knowledge graph extracted from the infoboxes of Wikipedia. An example edge in the DBpedia knowledge graph is (Larry McCray, genre, Blues), meaning Larry McCray is a blues musician. This relationship is expressed through the DBpedia genre relation, a sub-relation of the high-level relation isClassifiedBy. The relation prediction task is to predict the relations (if any) between two nodes in the knowledge graph from the entire set of textual evidence, rather than from each sentence separately as in mention-level relation extraction.

Figure 2 shows the relation prediction neural architecture. The feature representations are taken from the penultimate layer, the max-pooled network-in-network. All models have the same architecture and hyperparameters, shown in Table 1.

Table 1: Hyperparameters used
Figure 2: Deep Learning Architecture for Relation Prediction

4 Experimental Results and Analysis

4.1 Experimental Approach Images

ImageNet22k contains 21841 categories spread across hierarchical categories such as person, animal, fungus. We extracted some of the major hierarchies from ImageNet22k (Table 3) to form multiple source and target domain image sets for our evaluation. As the figure indicates, approximately 9 million images were used. Some of the domains like animal, plant, person and food contained substantially more images (and labels) than categories such as weapon, tool, or sport. This skew is reflective of real world situations and provides a natural test bed for our method when comparing training sets of different size.

Each of these domains was then split into four equal partitions. One was used to train the source model, two were used to validate the source and target models, and the last was used for the transfer learning task. One-tenth of the fourth partition was used to create a transfer learning target. For example, the person hierarchy has more than one million images. This was split into four equal partitions of more than 250K each. The source model was trained with data of that size, whereas the target model was fine-tuned with one-tenth of that data size taken from one of the partitions. The smaller target dataset is reflective of real transfer learning tasks.
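The partitioning scheme above can be sketched as follows. This is a hedged illustration: the function name `partition_domain`, the shuffling with a fixed seed, and the specific assignment of the two validation partitions to the source and target models are assumptions, since the text does not spell them out.

```python
import random


def partition_domain(items, seed=0, target_fraction=0.1):
    """Split one domain into four equal parts and carve out a small target set.

    Assumed roles: part 0 trains the source model, parts 1 and 2 validate the
    source and target models, and a tenth of part 3 becomes the target
    fine-tuning set.
    """
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    quarter = len(shuffled) // 4
    parts = [shuffled[k * quarter:(k + 1) * quarter] for k in range(4)]
    target_size = max(1, round(quarter * target_fraction))
    return {
        "source_train": parts[0],
        "source_val": parts[1],
        "target_val": parts[2],
        "target_train": parts[3][:target_size],
    }


# Hypothetical domain with 1,000 image ids.
splits = partition_domain(range(1000))
print({name: len(ids) for name, ids in splits.items()})
```

With 1,000 items, each of the first three splits holds 250 items and the target fine-tuning set holds 25, mirroring the 10:1 ratio between source and target training data described above.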

In this way, we generated 15 source workloads and 15 target training workloads. These source and target workloads were divided into two groups. One group, consisting of sport, garment, fungus, weapon, plant, and animal as both sources and targets, was used to generate parameters for the approximation function E in equation 4, as well as to determine which dissimilarity measure to use. A second, held-out group, consisting of furniture, food, person, nature, music, fruit, fabric, tool, and building as both sources and targets, was used to validate these parameters. The same parameters were also validated on 71 real-world image classification tasks, on the Oxford Flowers dataset, and on semantic relations.

The training of the source and target models was done using Caffe [14] with a ResNet-27 model [13]. The source models were trained using SGD [5] for 900,000 iterations with a step size of 300,000 iterations and an initial learning rate of 0.01. The target models were trained with an identical network architecture, but with a training method with one-tenth of both iterations and step size. A fixed random seed was used throughout all training.

4.2 Experimental Approach Semantic Relations

We split the task of relation prediction into seven subtasks composed of the high-level relations with the most positive examples in CC-DBP (other relations were discarded). This was intended to mirror the partitions of ImageNet by high-level class. The seven source domains are shown in Table 2. A model is trained for each of these domains on the full training data for the relevant relation types.

Division Name         Number of Relations   Number of Positives in Train
coparticipatesWith    227                   78598
hasLocation           85                    72065
sameSettingAs         169                   40359
isClassifiedBy        34                    22743
hasPart               64                    12319
hasMember             45                    36706
hasRole               4                     7320
Table 2: Source domains division of CC-DBP for binary relation extraction

Our approach to transfer learning was the same as in images: a deep neural network trained on the source domain was fine-tuned on the target domain. Fine-tuning involves re-initializing and re-sizing the final layer, since different domains have different numbers of relations. The final layer is updated at the full learning rate, while the previous layers are updated at a reduced rate obtained by scaling the full rate by a fixed fine-tune multiplier.

A new, small training set is built for each target task. For each split of CC-DBP we take 20 positive examples for each relation from the full training set or all the training examples if there are fewer than 20. We then sample ten times as many negatives (unrelated pairs of entities). These form the target training sets. The model trained from the full training data of each of the different subtasks is then fine-tuned on the target domain. We measure the area under the precision/recall curve for each trained model. We also measure the area under the precision/recall curve for a model trained without transfer learning.
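The construction of a small target training set can be sketched as below. This is an illustration under stated assumptions: the function name `build_target_set`, the data layout (entity pairs tagged with a relation), and the use of random sampling to pick the 20 positives are hypothetical choices, since the text only says that up to 20 positives per relation are taken and ten times as many negatives are sampled.

```python
import random
from collections import defaultdict


def build_target_set(positives, negative_pool, per_relation=20, neg_ratio=10, seed=0):
    """Build a small fine-tuning set for one target split.

    positives: list of (entity_pair, relation) tuples.
    negative_pool: list of unrelated entity pairs to sample negatives from.
    """
    rng = random.Random(seed)
    by_relation = defaultdict(list)
    for pair, relation in positives:
        by_relation[relation].append(pair)

    chosen = []
    for relation in sorted(by_relation):
        pairs = by_relation[relation]
        k = min(per_relation, len(pairs))  # keep everything if fewer than 20
        chosen.extend((pair, relation) for pair in rng.sample(pairs, k))

    # Ten times as many negatives as positives, capped by the pool size.
    n_neg = min(neg_ratio * len(chosen), len(negative_pool))
    negatives = [(pair, None) for pair in rng.sample(negative_pool, n_neg)]
    return chosen, negatives


# Hypothetical split with one frequent and one rare relation.
positives = [((f"a{i}", f"b{i}"), "genre") for i in range(30)]
positives += [((f"c{i}", f"d{i}"), "birthPlace") for i in range(5)]
pool = [(f"x{i}", f"y{i}") for i in range(1000)]
pos, neg = build_target_set(positives, pool)
print(len(pos), len(neg))
```

Here the frequent relation contributes 20 positives and the rare one contributes all 5, so the target set has 25 positives and 250 sampled negatives.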

4.3 Results

When training a model, a user commonly may (1) choose the source model trained with the largest amount of data (LTD), or (2) randomly choose a model from the basket of available models as a source for transfer learning, or (3) not use transfer learning at all but instead initialize the weights of the network randomly. We compare P2L against these three baselines across two domains: images (ImageNet22k in 4.3.1, Oxford Flowers in 4.3.3, Real World Tasks in 4.3.5) and semantic relations (DBpedia in 4.3.2).

In summary, across 95 tasks in the above 4 contexts, P2L was able to pick a better source model on average. In contrast, the heuristic LTD (to choose the source model trained with the largest amount of data) was able to pick the best source in 55 cases only.

Table 5 shows the relative increase in final performance for our proposed method in comparison to each of these three methods, across ImageNet22k (in 4.3.1) and DBpedia (in 4.3.2). Our method selects the best dataset for transfer learning in all but one case. On average, we improve the accuracy over the next best method by 2 percent. While it is fair to say that the gain shown in Table 5 is consistent but modest, we found it encouraging, and sought to test it further in two more independent experiments: on the Oxford Flower dataset and on tasks sampled from a real-world, commercial classifier training service.

For the validation on real world tasks (in 4.3.5), the largest source, Imagenet1K, was the best in only 35 of 71 cases. P2L picked better source models on average, boosting mean top-1 accuracy across the 71 real-world tasks from the public cloud service. We feel this result is the most significant of this work, since the real, "in-the-wild" classification tasks from the service had no guaranteed relationship to the ImageNet classification images used in Table 5. Similarly for the Oxford Flower dataset (in 4.3.3), Imagenet1K was not optimal, and P2L identified a better data source.

4.3.1 Validation on subsets of ImageNet22K

Figure 3: Spearman’s ρ for various measures

Table 3: ImageNet22k partitions used for evaluation
Domain            % of Evaluated Images
animal            32%
plant             24%
person            13%
food              11%
tool, building    2% each
sport, garment    2% each
nature, music     2% each
furniture         2%
fruit, fabric     2% each
fungus, weapon    1% each

We tested distance measures based on Kullback-Leibler Divergence (KLD), Jensen-Shannon Divergence (JSD), Chi-square (Chi2), and Euclidean distance (ED). For each training task in the 6x6 training group, we calculated the rank correlation (Spearman’s ρ) between the predictions of each of these measures and the ground-truth transfer performance based on top-1 classification accuracy.

Figure 3 shows the average Spearman’s ρ of the top-1 ground truth rank and our prediction rank as they varied with the values of the learned parameters of equation (4). For parameter values in this interval, the KLD-based measure is most sensitive, and we use it exclusively for evaluations, with one of the two learned parameters fixed at 1 and the other set to 4. The former was fixed at 1 because what is important is the ratio between the two. For these parameters, the average Spearman’s ρ for the transfer learning task is 0.83. The gains from our prediction method are shown in Table 5.

The two parameters, fitted as described above on the 6x6 ImageNet22k training set, were subsequently used for validation on the 9x9 ImageNet22k set (in 4.3.1), real-world image classification tasks (in 4.3.4), Oxford Flowers (in 4.3.3), and DBpedia (in 4.3.2). The training set could be made larger than the current 6x6, but even at its current size it shows the potential of P2L.

This parameter selection is essentially offline, and only needs to be done once to pick the parameter values. It required 30 custom training jobs. All subsequent predictions for the 9x9 ImageNet22k set, real-world image classification tasks, Oxford Flowers, and DBpedia did not require further training.

Target Dataset    Spearman’s ρ
Fabric            0.976
Nature            0.952
Person            0.929
Music             0.929
Building          0.905
Food              0.833
Fruit             0.762
Tool              0.667
Furniture         0.595
Table 4: Spearman’s ρ for predictions vs ground truth for transfer learning on images
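For readers who want to reproduce this kind of rank comparison, the sketch below computes Spearman's rank correlation in plain Python (Pearson correlation of the rank vectors, with tied values sharing an average rank) and then averages the nine per-target values listed in Table 4. The helper names and the four-element toy vectors are illustrative assumptions; only the `table4` values come from the table above.

```python
from statistics import mean


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman_rho(predicted, observed):
    """Spearman's rho as the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(predicted), average_ranks(observed)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical predicted scores and measured improvements for four sources.
print(spearman_rho([0.9, 0.5, 0.7, 0.1], [0.30, 0.10, 0.25, 0.12]))

# Per-target values as listed in Table 4.
table4 = {
    "Fabric": 0.976, "Nature": 0.952, "Person": 0.929, "Music": 0.929,
    "Building": 0.905, "Food": 0.833, "Fruit": 0.762, "Tool": 0.667,
    "Furniture": 0.595,
}
mean_rho = mean(list(table4.values()))
print(round(mean_rho, 3))
```

The mean over the nine held-out targets in Table 4 comes out at roughly 0.839.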

4.3.2 Validation on Common Crawl - DBpedia

Figure 4 shows the correlation of the prediction E with the measured improvement, when using KLD in addition to the size of the source domains’ training set in E. Figure 5 shows the same when only size is used. Using the full estimator produced better predictions, that is, prediction and improvement were then better correlated (Spearman’s ρ = .763, Pearson’s r = .823).

Figure 4: Transfer Learning Improvement Predicted by KLD with Size in CC-DBP
Figure 5: Transfer Learning Improvement Predicted by Size in CC-DBP
Domain              Target Dataset   P2L Picked      Largest Training   Random Dataset   No Transfer
                                     Best Dataset?   Dataset            Selection        Learning
                                                     (OP-LTD)/LTD       (OP-RDS)/RDS     (OP-A)/A
Images              Fruit            Yes             0.18               0.59             1.00
                    Fabric           Yes             0.00               0.32             0.67
                    Building         Yes             0.00               0.36             0.63
                    Music            Yes             0.00               0.25             0.53
                    Nature           Yes             0.00               0.21             0.42
                    Food             Yes             0.00               0.37             0.31
                    Tool             Yes             0.00               0.12             0.25
                    Furniture        Yes             0.00               0.22             0.22
                    Person           Yes             0.00               0.13             0.25
Semantic Relations  hasPart          Yes             0.15               0.13             0.40
                    copartWith       Yes             0.00               0.14             0.34
                    sameSettingAs    Yes             0.00               0.17             0.30
                    hasLocation      Yes             0.00               0.09             0.14
                    hasMember        Yes             0.00               0.07             0.14
                    hasRole          Yes             0.00               0.01             0.13
                    isClassifiedBy   No              -0.01              0.08             0.08
Table 5: Gain Summary for Images and Semantic Relation Prediction
OP = Accuracy using P2L; LTD = Accuracy of largest source training dataset;
RDS = Avg accuracy of randomly picked source dataset; A = Accuracy with no transfer learning
4.3.3 Validation on Oxford Flower 102 Dataset

We also evaluated fine-tuning on the Oxford Flower 102 dataset [23] using P2L and compared it to other methods on the ResNet27 architecture, using the same training regime as in Section 4.1. The dataset contains 102 commonly occurring flower types, each with only 10 training images per class. Of the 16 source candidates, including ImageNet1k, P2L predicted plants as the best source. Intuitively, this is because of the strong visual resemblance of many plants and flowers. Experimental evaluation validated this prediction: fine-tuning with plants as the source produced a top-1 accuracy of 91.6%, compared to 85.12% for ImageNet1k.

4.3.4 Validation on Real World Image Classification Tasks

To provide a practical test of P2L, we obtained data for 71 training tasks that were submitted to a commercial machine learning service, by users of the service who had allowed their data to be used for research. This service takes images with labels as input, and produces a classifier via supervised learning. Our goal was to validate the prediction made by equation 4 on real-world data, by selecting the single most appropriate source model from the collection of 16 candidates generated in our transfer learning experiment, and then fine-tuning that candidate for each of the 71 target jobs. We validated our prediction method by exhaustively fine-tuning for each of the 1,136 possible source-target pairs. We assume that for efficiency at classification time it is necessary to select the single best source model instead of using an ensemble of multiple source models.

For our experiments, we randomly split each set of images with labels into 80% for fine-tuning and 20% for validation. For these 71 training sets, we had a total of approximately 18,000 images: an average of 204 training images and 50 held-out validation images each. There were 5.2 classes per classifier on average, with a range of 2 to 60 classes per classifier. We used 14 models trained from sub-domains of ImageNet as possible source models (listed in Table 3, excluding "music"), plus a variant of the "animal" source model which was trained with twice as many examples. We also used a "standard" model trained on all of the ImageNet-1K training data. We used the same ResNet-27 architecture [13] described above. Due to the small size of target domain data, we set the learning rate to 0 for the convolutional layers, and otherwise used a learning rate of 0.01 for 40 epochs. We ranked the performance of each of these models by top-1 accuracy using the 20% held-out data.

Based on a manual inspection of the classifier labels for the 71 target jobs, we found a wide variety of domains, with the largest (animals) representing only about 14% of the total. This high level of variety appears common in real-world learning service scenarios, since users are training custom classifiers to address problems for which ready-made models don’t exist.

Domain            Mean top-1 accuracy (%)
                  P2L (ours)    LTD
Oxford Flowers    91.6          85.1
Real World        79.3          78.1
Table 6: Results on Non-ImageNet Data

4.3.5 Results of Using P2L for Real World Tasks

Compared to the most robust baseline we identified in our experimental results (i.e., using ImageNet-1k as the source model for every target), our method was able to enhance the performance of target learning jobs. For our sample of 71 tasks, the P2L method provided a top-1 image classification accuracy of 79.3% compared to an ImageNet-1k baseline result of 78.1%. ImageNet-1k was the optimal baseline in 35 out of the 71 cases, but in the 36 remaining tasks P2L was able on average to identify a more effective source model. This increase was most often driven by the selection of the food and person source models. We speculate that variation within these domains may not be well captured by ImageNet-1k (which contains relatively few labeled examples of people or food), or that the target task may rely on a very specific feature domain. These findings are summarized in table 6.
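The comparison described here (a per-task selector versus always using one fixed source, both scored against exhaustive fine-tuning results) can be sketched as follows. The function `evaluate_selector`, the dictionary layout, and every number in the toy accuracy table are hypothetical and serve only to show the bookkeeping; they are not the study's data.

```python
def evaluate_selector(accuracy, picks, baseline_source):
    """Compare per-target picks with a fixed baseline source.

    accuracy[target][source] is the top-1 accuracy from exhaustive fine-tuning;
    picks[target] is the source chosen by the selector for that target.
    """
    targets = sorted(accuracy)
    picked = [accuracy[t][picks[t]] for t in targets]
    baseline = [accuracy[t][baseline_source] for t in targets]
    baseline_is_best = sum(
        1 for t in targets
        if accuracy[t][baseline_source] == max(accuracy[t].values())
    )
    return {
        "pairs_trained": sum(len(accuracy[t]) for t in targets),
        "mean_picked": sum(picked) / len(picked),
        "mean_baseline": sum(baseline) / len(baseline),
        "baseline_is_best": baseline_is_best,
    }


# Made-up exhaustive results for three targets and three candidate sources.
accuracy = {
    "cars": {"imagenet1k": 0.80, "tool": 0.85, "food": 0.60},
    "dishes": {"imagenet1k": 0.75, "tool": 0.55, "food": 0.82},
    "pets": {"imagenet1k": 0.90, "tool": 0.70, "food": 0.65},
}
picks = {"cars": "tool", "dishes": "food", "pets": "imagenet1k"}
print(evaluate_selector(accuracy, picks, "imagenet1k"))
```

In the study itself the exhaustive table has 71 targets and 16 sources, which is where the 1,136 source-target pairs mentioned in Section 4.3.4 come from (71 × 16 = 1,136).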

4.3.6 Comparing against merged source datasets

We have investigated how a merged dataset of various source domains could do in comparison to its individual components. While it may seem that a single merged dataset would perform as well or better than individual sources, in reality we have noticed results to the contrary. For example, we trained a custom learning workload for "car", using the real world image classification dataset (in 4.3.4). Using the "weapon" dataset as a source delivered 87.5% accuracy. But combining the "music" and "tool" datasets with "weapon" actually reduced final accuracy to 73%. This combination was chosen since "music", "tool", and "weapon" are the three most convergent datasets.

The likely reason for this is that merging datasets without consideration for the semantic similarity of labels results in confusion for the neural network. For example, images of "knife" appear among the labels of both the weapon and tool datasets. When these two are merged, the training process has to differentiate between two labels which are similar, and it does not learn much.

5 Future Work

The current P2L approach estimates transfer performance at the level of large conceptual categories (e.g., "animal", or "location"). However, large labeled data sets, such as those used in ImageNet-1k, contain deep hierarchies (e.g., animal → mammal → cat → cheetah) that may help to characterize finer resolution maps of the feature space. Identifying crucial sub-features can assist further in selecting more specific source categories, and in developing more efficient source models and transfer techniques.

We currently use one modality in isolation for determining which source model to use. However, there is a lot of information besides the image (or semantic relation) that could additionally aid in determining a good source model, such as accompanying text or an audio feed. Bringing in these multi-modal aspects could enhance the accuracy of prediction. For example, blight is a crop disease, and crops are more likely to occur in a plant dataset than in any other dataset. If one can determine these links from external datasets, it would help zero in on a good dataset and choose the best one, especially when there are two or more close candidates. Extracting tags from the images, or using other available information, and using them to find the semantically closest source categories in a large knowledge graph can yield substantial improvement in image recognition.

Additionally, we have so far validated our method on images and knowledge (semantic relations). We propose to extend it to temporal domains like machine translation and video. We also propose to investigate refining and simplifying our method and improving its understandability.

6 Conclusion

We described an efficient method for using a small data sample to select and fine-tune a candidate from a family of pre-trained models, applicable to both the image and semantic relations. We conducted an empirical test of the method using models trained on specific conceptual categories across images and semantic relations, demonstrating improved transfer learning results, outperforming baselines such as picking the model trained with the largest data set, or using a common industry standards such as a model based based on ImageNet-1k. These findings suggest that a learned representation from previous tasks can be used to select the best transfer candidate, and to provide greater transfer learning.

Despite order-of-magnitude differences in training set sizes, we were able to obtain transfer gains by computing an estimate of conceptual closeness. Although prior work has described a saturating curve for training set size contributions to accuracy [16], which we also observed in our data, we showed that feature similarity provided transfer benefits not predicted by set size alone.
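To make the idea of ranking candidates by conceptual closeness concrete, the following is a minimal sketch rather than the measure used in our experiments. It assumes that each data set is summarized by the mean of its feature vectors, that these means are normalized into distributions, and that closeness is scored with a Kullback-Leibler divergence [17]. The function names, the toy feature values, and the source names "plant" and "vehicle" are all hypothetical, and the sketch deliberately ignores training set size.

```python
import math


def normalize(vec, eps=1e-12):
    """Turn a feature vector into a probability distribution (assumed step).

    Negative entries are clipped to zero and a small eps avoids log(0).
    """
    shifted = [max(v, 0.0) + eps for v in vec]
    total = sum(shifted)
    return [v / total for v in shifted]


def kl_divergence(p, q):
    """KL(p || q) for two distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def rank_sources(target_features, source_features):
    """Rank hypothetical source models from closest to farthest.

    A lower divergence is read here as greater conceptual closeness.
    """
    target = normalize(mean_vector(target_features))
    scores = {
        name: kl_divergence(target, normalize(mean_vector(feats)))
        for name, feats in source_features.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])


# Toy, made-up feature vectors: a small target sample and two candidate sources.
target = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0]]
sources = {
    "plant": [[0.85, 0.15, 0.0], [0.9, 0.1, 0.0]],
    "vehicle": [[0.0, 0.1, 0.9], [0.1, 0.1, 0.8]],
}

ranking = rank_sources(target, sources)
print(ranking[0][0])  # name of the closest candidate under this toy score
```

In this toy setting the "plant" source is ranked first because its mean feature vector nearly coincides with the target's. A symmetric alternative such as the Jensen-Shannon divergence [10] could be substituted for `kl_divergence` without changing the rest of the sketch.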

Our method is efficient at training and classification time, and has been shown to improve accuracy over the baseline on publicly available image and semantic relations datasets as well as on real-world datasets, across a wide range of task sizes. These results help to explain the tension in the literature between results showing that larger datasets usually outperform smaller ones [26] and results showing that ill-selected transfer models can degrade performance [28]. We suggest that rather than there being a single "best" transfer model, transfer performance depends critically upon the similarity between the source and target models. Further, methods such as P2L can map the degree of overlap between disparate tasks to select better models and enhance transfer learning performance. Exploring these "maps" of feature space similarities could be a valuable future direction for machine learning research.


References

  • [1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, and Z. Ives. DBpedia: A nucleus for a web of open data. In 6th Int’l Semantic Web Conference, Busan, Korea, pages 11–15. Springer, 2007.
  • [2] H. Azizpour, A. S. Razavian, J. Sullivan, A. Maki, and S. Carlsson. Factors of Transferability for a Generic ConvNet Representation. In IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 38, pages 1790–1802, 2016.
  • [3] Y. Bengio. Deep Learning of Representations for Unsupervised and Transfer Learning. In JMLR: Workshop and Conference Proceedings, volume 7, pages 1–20, 2011.
  • [4] D. Bollegala, Y. Matsuo, and M. Ishizuka. Relation adaptation: learning to extract novel relations with minimum supervision. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, Volume Three, pages 2205–2210. AAAI Press, 2011.
  • [5] L. Bottou. Stochastic gradient tricks. In G. Montavon, G. B. Orr, and K.-R. Müller, editors, Neural Networks, Tricks of the Trade, Reloaded, Lecture Notes in Computer Science (LNCS 7700), pages 430–445. Springer, 2012.
  • [6] W. Dai, Q. Yang, G.-R. Xue, and Y. Yu. Boosting for transfer learning. Intl Conf on Machine Learning, pages 193–200, 2007.
  • [7] A. Das, S. Roy, and U. Bhattacharya. Document Image Classification with Intra-Domain Transfer Learning and Stacked Generalization of Deep Convolutional Neural Networks.
  • [8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on CVPR, 2009.
  • [9] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. CoRR, abs/1703.03400, 2017.
  • [10] B. Fuglede and F. Topsøe. Jensen-Shannon divergence and Hilbert space embedding. In International Symposium on Information Theory, 2004.
  • [11] W. Ge and Y. Yu. Borrowing treasures from the wealthy: Deep transfer learning through selective joint fine-tuning. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, volume 6, 2017.
  • [12] M. Glass and A. Gliozzo. A Dataset for Web-scale Knowledge Base Population. In Proceedings of the 15th Extended Semantic Web Conference, 2018.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In IEEE Conference on CVPR, 2016.
  • [14] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional Architecture for Fast Feature Embedding. In ACM Multimedia, 2014.
  • [15] J. Jiang. Multi-task transfer learning for weakly-supervised relation extraction. In 4th International Joint Conference on Natural Language Processing. Association for Computational Linguistics, 2009.
  • [16] T. Kavzoglu and I. Colkesen. The effects of training set size for performance of support vector machines and decision trees. Symposium on Spatial Accuracy Assessment in Natural Resources, 2012.
  • [17] S. Kullback and R. A. Leibler. On information and sufficiency. Ann. Math. Statist., 22(1):79–86, 03 1951.
  • [18] C. Lemke, M. Budka, and B. Gabrys. Metalearning: a survey of trends and technologies. Artificial Intelligence Review, 44(1):117–130, Jun 2015.
  • [19] Z. Luo, Y. Zou, J. Hoffman, and L. Fei-Fei. Label efficient learning of transferable representations across domains and tasks. In NIPS, 2017.
  • [20] L. Mou, Z. Meng, R. Yan, G. Li, Y. Xu, L. Zhang, and Z. Jin. How Transferable are Neural Networks in NLP Applications? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 479–489, 2016.
  • [21] H. Myeong and K. M. Lee. Tensor-based high-order semantic relation transfer for semantic scene segmentation. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 3073–3080, 2013.
  • [22] T. H. Nguyen, L. Fu, K. Cho, and R. Grishman. A Two-stage Approach for Extending Event Detection to New Types via Neural Networks. ACL Representation Learning for NLP Workshop, 2016.
  • [23] M. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In ICVGIP, 2008.
  • [24] S. J. Pan and Q. Yang. A Survey on Transfer Learning. IEEE Transactions on knowledge and data engineering, 22(10):1–15, 2010.
  • [25] N. Patricia and B. Caputo. Learning to learn, from transfer learning to domain adaptation: A unifying perspective. Conference on Computer Vision and Pattern Recognition, 2014.
  • [26] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: an astounding baseline for recognition. CoRR, abs/1403.6382, 2014.
  • [27] M. Rohrbach, M. Stark, G. Szarvas, I. Gurevych, and B. Schiele. What helps where and why? Computer Vision and Pattern Recognition (CVPR), 2010.
  • [28] M. T. Rosenstein, Z. Marx, L. P. Kaelbling, and T. G. Dietterich. To transfer or not to transfer. In NIPS 2005 workshop on transfer learning, 2005.
  • [29] K. Simonyan and A. Zisserman. Very Deep Convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
  • [30] R. Socher, M. Ganjoo, H. Sridhar, O. Bastani, C. D. Manning, and A. Y. Ng. Zero-Shot Learning Through Cross-Modal Transfer. pages 935–943, 2013.
  • [31] D. Sussillo and L. Abbott. Transferring learning from external to internal weights in Echo-State networks with sparse connectivity. PLoS ONE, 2012.
  • [32] L. Torrey and J. Shavlik. Transfer learning. In Handbook of research on machine learning applications and trends: algorithms, methods, and techniques, pages 242–264. IGI Global, 2010.
  • [33] Y.-X. Wang and M. Hebert. Model recommendation: Generating object detectors from few samples. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1619–1628, 2015.
  • [34] Y. Wei, Y. Zhang, and Q. Yang. Learning to transfer. CoRR, abs/1708.05629, 2017.
  • [35] K. Weiss, T. M. Khoshgoftaar, and D. Wang. A survey of transfer learning. Journal of Big Data, 3(1):9, May 2016.
  • [36] W. Ying, Y. Zhang, J. Huang, and Q. Yang. Transfer learning via learning to transfer. In International Conference on Machine Learning, pages 5072–5081, 2018.
  • [37] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In NIPS, 2014.
  • [38] A. R. Zamir, A. Sax, W. B. Shen, L. J. Guibas, J. Malik, and S. Savarese. Taskonomy: Disentangling task transfer learning. CoRR, abs/1804.08328, 2018.
  • [39] D. Zeng, K. Liu, Y. Chen, and J. Zhao. Distant supervision for relation extraction via piecewise convolutional neural networks. In EMNLP, pages 1753–1762, 2015.