I Introduction
In real-world applications, the performance of deep learning is often limited by the size of the training set. Training a deep neural network with a small number of training instances usually results in the so-called overfitting problem, and the generalization capability of the resulting model is low. A simple yet effective approach to obtaining high-quality deep neural network models is transfer learning [22] from pretrained models. In such practice [5], a deep neural network is first trained on a large (and possibly irrelevant) source dataset (e.g., ImageNet). The weights of this network are then fine-tuned using data from the target application domain.
By incorporating the weights of appropriate pretrained networks as starting points of optimization and/or references for regularization [16, 31], deep transfer learning can usually boost performance with better accuracy and faster convergence. For example, Li et al. [16] recently proposed the L2-SP algorithm, which leverages the squared Euclidean distance between the weights of the source and target networks as a regularizer for knowledge transfer, and further sets the weights of the source network as the starting point of the optimization procedure. In this approach, L2-SP transfers the "knowledge" of the pretrained source network to the target one by constraining the difference between the weights of the two networks while minimizing the empirical loss on the target task. In addition to direct regularization on weights, [31] demonstrated the capacity of "knowledge distillation" [9] to transfer knowledge from the source to the target network in a teacher-student training manner, where the source network acts as a teacher regularizing the outputs of the outer layers of the target network. In summary, the knowledge-distillation-based approach transfers the capacity of feature learning from the source network to the target one by constraining the difference between the feature maps output by the outer layers (e.g., convolutional layers) [32, 10] of the two networks.
In addition to the aforementioned strategies, a great number of methods, e.g., [13, 17], have been proposed to transfer knowledge from the weights of pretrained source networks to the target task for better accuracy. However, incorporating weights from inappropriate networks via inappropriate transfer learning strategies may hurt the training procedure and lead to even lower accuracy. This phenomenon is called "negative transfer" [26, 22]. For example, [32] observed that reusing the pretrained weights of the ImageNet task in inappropriate ways could poison CNN training and prevent deep transfer learning from achieving its best performance. This indicates that reusing pretrained weights from inappropriate datasets and/or via inappropriate transfer learning strategies hurts deep learning.
In this paper, we view deep transfer learning as minimizing a linear combination of an empirical loss and a regularizer based on pretrained weights, where the empirical loss measures fitness on the target dataset and the regularizer controls the divergence (of weights, feature maps, etc.) between the source and target networks. From an optimization perspective, the regularizer may restrict the training procedure from lowering the empirical loss, as the descent direction (e.g., the derivatives) of the regularizer can conflict with the descent direction of the empirical loss.
We illustrate the above observation with an example based on L2-SP [16], shown in Figure 1. The black line refers to the empirical loss descent flow of common gradient-based learning algorithms using the pretrained weights as the starting point. It shows that, with the empirical loss gradients as the descent direction, such a method quickly converges to a local minimum in a narrow cone, which is usually considered an overfitting solution. Meanwhile, the blue line demonstrates a possible empirical loss descent path of the L2-SP algorithm, where a strong regularization prevents the learning algorithm from further lowering the empirical loss while traversing the area around the point of the pretrained weights. An ideal case is illustrated by the red line, where the L2-SP regularizer helps the learning algorithm avoid overfitting solutions: the overall descent direction, adapting the L2-SP regularizer with respect to the empirical loss, leads to generalizable solutions. A method is thus needed that lets both the empirical loss and the regularizer continue descending, so as to boost the performance of deep transfer learning.
Our Contribution. To this end, we propose a novel deep transfer learning strategy, DTNH, which makes regularization-based Deep Transfer learning Never Hurt. Specifically, in each iteration of the training procedure, DTNH computes the derivatives of the empirical loss and regularizer terms separately, then re-estimates a new descent direction that does not hurt empirical loss minimization while preserving the regularization effects of the pretrained weights. Extensive experiments have been carried out with common transfer learning regularizers, such as L2-SP and knowledge distillation, on a wide range of deep transfer learning benchmarks including Caltech, MIT Indoors 67, CIFAR 10 and ImageNet. The experiments show that DTNH consistently improves the performance of deep transfer learning tasks, even when reusing pretrained weights from inappropriate networks (i.e., when vanilla transfer learning from the source task performs even worse than direct training on the target dataset alone). All in all, DTNH works with the above regularizers in all cases, with 0.1%–7% higher accuracy than state-of-the-art algorithms.
Organization of the Paper. The rest of this paper is organized as follows. In Section II, we introduce the work relevant to this research and discuss its relation to ours. Section III covers the preliminaries and technical background, where the state-of-the-art models for regularization-based deep transfer learning are introduced in detail. Section IV presents the design of the proposed algorithm DTNH: we first formulate the research problem with two key assumptions, then present the algorithm design and discuss its novelty. In Section V, we report the experiments conducted to validate DTNH: we first present the experiment setups and datasets, then the main results with accuracy comparisons between DTNH and existing transfer learning algorithms, and finally several case studies that validate the two key assumptions made in the problem formulation. We discuss open issues in Section VI and conclude in Section VII.
II Related Work
In this section, we introduce the work related to deep transfer learning, with the work most relevant to our study emphasized: we first survey existing work on deep transfer learning in general, then focus on deep transfer learning algorithms that use pretrained models for regularization, and finally discuss the connection of existing work to this paper.
II-A Transfer Learning with Deep Architectures
Transfer learning refers to a class of machine learning paradigms that aim at transferring knowledge obtained in a source task to a (possibly irrelevant) target task [22, 2], where the source and target tasks can share either the same or different label spaces. In our research, we primarily consider inductive transfer learning with a different target label space for deep neural networks. As early as 2014, the authors of [5] reported significant performance improvements from directly reusing the weights of a pretrained source network for the target task when training a large CNN with a tremendous number of filters and parameters. However, when reusing all pretrained weights, the target network might be overloaded with learning tons of inappropriate features (that cannot be used for classification in the target task), while the key features of the target task may be ignored. Yosinski et al. [32] therefore proposed to understand whether a feature can be transferred to the target network by quantifying the "transferability" of features from each layer in terms of the performance gain. Furthermore, Huh et al. [10] conducted an empirical study analyzing how the features a CNN learns from the ImageNet dataset transfer to other computer vision tasks, so as to detail the factors affecting deep transfer learning accuracy. Recently, this line of research has been further developed with an increasing number of algorithms and tools that improve the performance of deep transfer learning, including subset selection
[6, 3], sparse transfer [18], filter distribution constraining [1], and parameter transfer [34].
II-B Regularization-based Deep Transfer Learning
In our research, we focus on knowledge transfer through reusing pretrained weights for regularization. We categorize the most relevant recent work as follows.

Regularizing through Weight Distance. The squared Euclidean distance between the weights of the source and target networks is frequently used as a regularizer for deep transfer learning [16]. Specifically, [16] studied how to accelerate deep transfer learning while preventing fine-tuning from overfitting, using a simple norm regularization on top of the "Starting Point as a Reference" optimization. This method, namely L2-SP, significantly outperforms a wide range of regularization-based deep transfer learning mechanisms, such as standard L2 regularization.

Regularizing through Knowledge Distillation. Yet another way to regularize deep transfer learning is "knowledge distillation" [9, 25]. Methodologically, knowledge distillation was originally proposed to compress deep neural networks [9, 25] through teacher-student network training, where the teacher and student networks are usually based on the same task [9]. For inductive transfer learning, the authors of [31] were the first to investigate using the distance between intermediate results (e.g., feature maps generated by the same layers) of the source and target networks as the regularization term. Further, [33] proposed to use the distance between activation maps as the regularization term for so-called "attention transfer".
In addition to the above two types of regularization, fine-tuning from a pretrained model with a simple norm regularization is also commonly adopted by the deep transfer learning community [5].
II-C Discussion on the Connection to Our Work
Compared to the above work and other transfer learning studies, our work aims at providing a generic descent direction estimation strategy that improves the performance of regularization-based deep transfer learning. The intuition of DTNH is to re-estimate, in each iteration of the learning procedure, a new descent direction that incorporates the effect of the regularizers without hurting empirical loss minimization. In our work, we demonstrate the capacity of DTNH working with two recent deep transfer learning regularizers, L2-SP [16] and knowledge distillation [31], which are based on two typical deep learning philosophies (constraining weights and feature maps, respectively), on a wide range of transfer learning tasks. The consistent performance boosts with DTNH in all experiments suggest that DTNH can improve such regularization-based deep transfer learning with higher accuracy.
Other techniques, including continual learning [13, 17] and attention mechanisms for CNN models [20, 29, 30, 33], can also improve the performance of knowledge transfer between tasks. We believe our work makes complementary contributions in this area.
III Preliminaries and Background
In this section, we first introduce the preliminary setting of regularization-based transfer learning, then the background of L2-SP and knowledge-distillation-based transfer learning used in our studies.
III-1 Regularization-based Transfer Learning
Deep convolutional networks usually consist of a great number of parameters that need to be fit to the dataset. For example, ResNet-110 has more than one million free parameters. This large number of free parameters incurs a risk of overfitting. Regularization-based transfer learning reduces this risk by constraining the parameters to a limited space around a set of pretrained weights. The general learning problem is usually formulated as follows.
Definition 1 (Regularization-based Deep Transfer Learning)
Let's first denote the dataset for the desired task as $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$, where $n$ tuples are offered in total and each tuple $(x_i, y_i)$ refers to an input image and its label in the dataset. We then denote $\omega \in \mathbb{R}^d$ as the $d$-dimensional parameter vector containing all parameters of the target model. Further, given a pretrained network with parameter $\omega_s$ based on an extremely large source dataset, one can estimate the parameter of the target network through transfer learning paradigms. The objective of regularization-based deep transfer learning is to obtain the minimizer of

$$\min_{\omega}\ L_{\mathrm{emp}}(\omega) + \lambda\,\Omega(\omega, \omega_s), \quad L_{\mathrm{emp}}(\omega) = \sum_{i=1}^{n} L\big(f(x_i; \omega),\, y_i\big), \tag{1}$$

where (i) the first term $L_{\mathrm{emp}}(\omega)$ refers to the empirical loss of data fitting, with loss function $L$ and predictor $f$, while (ii) the second term $\Omega(\omega, \omega_s)$ characterizes the difference between the parameters of the target and source networks. The tuning parameter $\lambda > 0$ balances the trade-off between the empirical loss and the regularization term.
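To make the objective concrete, the following sketch evaluates the regularized objective of Eq. 1 for a toy scalar linear model; the names (`transfer_objective`, `emp_loss`, `omega_reg`) are illustrative placeholders, not from the paper.

```python
def transfer_objective(w, w_src, data, emp_loss, omega_reg, lam):
    """Eq. 1: empirical loss on the target data plus lam times the
    divergence between target weights w and source weights w_src."""
    empirical = sum(emp_loss(w, x, y) for x, y in data)
    return empirical + lam * omega_reg(w, w_src)

# Toy instantiation: scalar linear model with squared error, and
# squared Euclidean distance to the source weights as the regularizer.
emp_loss = lambda w, x, y: (w[0] * x - y) ** 2
omega_reg = lambda w, ws: sum((a - b) ** 2 for a, b in zip(w, ws))
```

With `w = [2.0]`, `w_src = [1.0]`, `data = [(1.0, 1.0)]` and `lam = 0.5`, the objective evaluates to 1.0 + 0.5 * 1.0 = 1.5.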
III-2 Deep Transfer Learning via L2-SP and Knowledge Distillation
As mentioned, the two common regularization-based deep transfer learning algorithms studied in this paper are L2-SP [16] and knowledge-distillation-based regularization [31]. Both can be implemented within the general regularization-based deep transfer learning procedure with the objective function in Eq. 1, using the following two regularizers.

L2-SP [16]. As its regularizer, this algorithm uses the squared Euclidean distance between the target weights (i.e., the optimization variable $\omega$) and the pretrained weights $\omega_s$ of the source network (listed in Eq. 2) to constrain the learning procedure:

$$\Omega(\omega) = \|\omega - \omega_s\|_2^2. \tag{2}$$

As for the optimization procedure, L2-SP starts learning from the pretrained weights (i.e., it uses $\omega_s$ to initialize the learning procedure).
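As a minimal sketch (plain Python over flattened weight lists, names illustrative), the L2-SP penalty of Eq. 2 and its gradient $2(\omega - \omega_s)$ can be written as:

```python
def l2_sp_penalty(w, w_src):
    # Omega(w) = ||w - w_src||_2^2  (Eq. 2)
    return sum((wi - si) ** 2 for wi, si in zip(w, w_src))

def l2_sp_grad(w, w_src):
    # Gradient of the penalty: 2 * (w - w_src); it always points
    # away from the pretrained weights w_src.
    return [2.0 * (wi - si) for wi, si in zip(w, w_src)]
```

Adding `lam * l2_sp_penalty(w, w_src)` to the training loss pulls the fine-tuned weights toward the starting point $\omega_s$ rather than toward zero, which is the key difference from standard weight decay.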

Knowledge Distillation based Regularization [31]. Given the target dataset and $N$ filters in the target/source networks selected for knowledge transfer, this algorithm models the regularization as the aggregation of squared Euclidean distances between the feature maps output by the filters of the source and target networks, such that

$$\Omega(\omega) = \sum_{i=1}^{n} \sum_{j=1}^{N} \big\| \mathrm{FM}_j(\omega, x_i) - \mathrm{FM}_j(\omega_s, x_i) \big\|_2^2, \tag{3}$$

where $\mathrm{FM}_j(\omega, x_i)$ refers to the feature map output by the $j$-th filter ($1 \le j \le N$) of the network with weights $\omega$ on input image $x_i$ ($1 \le i \le n$). The optimization algorithm can start from $\omega_s$ as the initialization of learning.
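A minimal sketch of the inner part of the regularizer in Eq. 3, operating on flattened feature maps (lists of floats) for one input image; `fmaps_target[j]` and `fmaps_source[j]` stand for $\mathrm{FM}_j$ of the target and source networks, and the names are illustrative:

```python
def knowdist_penalty(fmaps_target, fmaps_source):
    # Sum over filters j of || FM_j(target) - FM_j(source) ||_2^2
    total = 0.0
    for fm_t, fm_s in zip(fmaps_target, fmaps_source):
        total += sum((a - b) ** 2 for a, b in zip(fm_t, fm_s))
    return total
```

In Eq. 3 this quantity is additionally summed over all training images $x_i$.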
In the rest of this work, we present the strategy DTNH, which improves the general form of regularization-based deep transfer learning shown in Eq. 1, then evaluate and compare DTNH using the above two regularizers on common deep transfer learning benchmarks.
IV DTNH: Towards Making Deep Transfer Learning Never Hurt
In this section, we formulate the technical problem of our research together with its assumptions, then present the design of our solution DTNH.
IV-A Problem Formulation
Prior to formulating our research problem, this section introduces the settings and assumptions of the problem.
Definition 2 (Descent Directions)
Gradient-based learning algorithms are frequently used in deep transfer learning to minimize the loss function in Eq. 1. In each iteration of the learning procedure, the algorithms estimate a descent direction $d(\omega)$, such as a stochastic gradient, that approximates the gradient of the optimization objective, such that

$$d(\omega) = \nabla L_{\mathrm{emp}}(\omega) + \lambda\,\nabla \Omega(\omega), \tag{4}$$

where $\nabla L_{\mathrm{emp}}(\omega)$ refers to the gradient of the empirical loss based on the training set and $\nabla \Omega(\omega)$ is the gradient of the regularization term, both evaluated at the current parameter $\omega$.
IV-A1 Key Assumptions
Due to the effect of the regularization $\lambda\,\nabla\Omega(\omega)$, the angle between the actual descent direction $d(\omega)$ and the gradient of the empirical loss $\nabla L_{\mathrm{emp}}(\omega)$, i.e., $\theta_{\mathrm{emp}} = \angle\big(d(\omega), \nabla L_{\mathrm{emp}}(\omega)\big)$, can be large. Intuitively, when $\theta_{\mathrm{emp}}$ is large, the descent direction cannot effectively lower the empirical loss, causing a potential performance bottleneck of deep transfer learning. We thus formulate the technical problem with the following assumptions.
Assumption 1 (Efficient Empirical Loss Minimization)
It is reasonable to assume that a descent direction having a smaller angle with the gradient of the empirical loss, i.e., a smaller $\theta_{\mathrm{emp}}$, lowers the empirical loss more efficiently.
Assumption 2 (Regularization Effect Preservation)
It is reasonable to assume that there exists a threshold $\theta_{\max}$ such that, when $\angle\big(d(\omega), \nabla\Omega(\omega)\big) \le \theta_{\max}$, the descent direction preserves the effect of the regularization for knowledge transfer.
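The angles appearing in both assumptions can be computed directly from inner products; a small stdlib-only helper (names illustrative, non-zero vectors assumed):

```python
import math

def angle_between(u, v):
    # angle(u, v) = arccos( <u, v> / (||u|| * ||v||) ), in radians
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # clamp the cosine to [-1, 1] to guard against floating-point drift
    return math.acos(max(-1.0, min(1.0, dot / (norm_u * norm_v))))
```

An angle below $\pi/2$ (acute) means the two directions agree; an angle above $\pi/2$ (obtuse) means one direction actively opposes the other.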
IV-A2 The Problem
Based on the above definitions and assumptions, the problem of this research is to propose a new descent direction algorithm: in every iteration, the algorithm re-estimates a new descent direction $d^*(\omega)$ to lower the training loss based on the optimization objective, such that

$$d^*(\omega) = \operatorname*{arg\,min}_{d}\ \angle\big(d, \nabla L_{\mathrm{emp}}(\omega)\big) \quad \text{s.t.} \quad \angle\big(d, \nabla\Omega(\omega)\big) \le \theta_{\max}, \tag{5}$$

where $\theta_{\max}$ refers to the maximal angle allowed between the actual descent direction and the gradient of the regularizer in order to preserve the regularization effect (Assumption 2). Note that in this research we do not intend to study the exact setting of $\theta_{\max}$; our algorithm implementation is indeed independent of the setting of $\theta_{\max}$.
IV-B DTNH: Descent Direction Estimation Strategy
In this section, we present the design of DTNH as a descent direction estimator that solves the above problem. Given the empirical loss function $L_{\mathrm{emp}}(\omega)$, the regularization term $\Omega(\omega)$, the set of training data $D$, the mini-batch size and the regularization coefficient $\lambda$, we propose Algorithm 1 to estimate the descent direction at the point $\omega_t$ for the $t$-th iteration of deep transfer learning.
With such a descent direction estimator, the learning algorithm can replace the original stochastic gradient estimators used in stochastic gradient descent (SGD), Momentum and/or Adam for deep learning. Specifically, in each (e.g., the $t$-th) iteration of the learning procedure, DTNH estimates the gradients of the empirical loss and the regularization term, i.e., $g_{\mathrm{emp}} = \nabla L_{\mathrm{emp}}(\omega_t)$ and $g_{\mathrm{reg}} = \nabla\Omega(\omega_t)$, separately. When the angle between the two gradients is acute, i.e., $\langle g_{\mathrm{emp}}, g_{\mathrm{reg}} \rangle \ge 0$, DTNH uses the original stochastic gradient $g_{\mathrm{emp}} + \lambda\,g_{\mathrm{reg}}$ as the descent direction (line 8 in Algorithm 1); in this case, we believe the effect of regularization does not hurt empirical loss minimization. On the other hand, when the angle is obtuse, DTNH decomposes the gradient of the regularization term into two orthogonal directions $g_{\parallel}$ and $g_{\perp}$, where $g_{\parallel}$ is parallel with $g_{\mathrm{emp}}$, such that

$$g_{\parallel} = \frac{\langle g_{\mathrm{reg}},\, g_{\mathrm{emp}} \rangle}{\|g_{\mathrm{emp}}\|_2^2}\, g_{\mathrm{emp}}, \tag{6}$$

$$g_{\perp} = g_{\mathrm{reg}} - g_{\parallel}. \tag{7}$$
DTNH then truncates the component opposing the gradient of the empirical loss, i.e., $g_{\parallel}$, and composes the remaining direction with the gradient of the empirical loss as the actual descent direction, i.e., $d^* = g_{\mathrm{emp}} + \lambda\,g_{\perp}$.
For instance, Figure 2 illustrates an example of DTNH descent direction estimation when the angle between the gradients of the empirical loss and the regularization term is obtuse. The effect of the regularization term forms a direction that might slow down the empirical loss descent; DTNH decomposes the gradient of the regularization term and truncates the conflicting component when estimating the actual descent direction. Meanwhile, the angle between the actual descent direction and the regularization gradient remains acute, which secures the regularization effect of knowledge transfer from the pretrained weights.
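The per-iteration direction estimation described above can be sketched in a few lines of plain Python over flattened gradient vectors; this is an illustrative reading of Eqs. 6-7 and the truncation step, not the authors' reference implementation:

```python
def dtnh_direction(g_emp, g_reg, lam):
    """Estimate the DTNH descent direction from the empirical-loss
    gradient g_emp and the regularizer gradient g_reg."""
    dot = sum(a * b for a, b in zip(g_emp, g_reg))
    if dot >= 0:
        # Acute angle: keep the vanilla direction g_emp + lam * g_reg.
        return [ge + lam * gr for ge, gr in zip(g_emp, g_reg)]
    # Obtuse angle: g_par = (<g_reg, g_emp> / ||g_emp||^2) * g_emp  (Eq. 6)
    coef = dot / sum(a * a for a in g_emp)
    # g_perp = g_reg - g_par  (Eq. 7); the conflicting g_par is truncated.
    g_perp = [gr - coef * ge for gr, ge in zip(g_reg, g_emp)]
    return [ge + lam * gp for ge, gp in zip(g_emp, g_perp)]
```

For example, with `g_emp = [1, 0]` and `g_reg = [-1, 1]` the angle is obtuse; the conflicting component `[-1, 0]` is dropped and the returned direction `[1, lam]` still descends the empirical loss while retaining the non-conflicting part of the regularizer.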
IV-C Discussion
Note that the DTNH strategy is derived from the common stochastic gradient estimation used in stochastic-gradient-based learning algorithms, such as SGD, Momentum, conditioned SGD, Adam and so on. It can be considered an alternative approach to descent direction estimation on top of vanilla stochastic gradient estimation; one can still use natural-gradient-like methods to condition the descent direction, or adopt Momentum-like acceleration methods for the weight update mechanism. We do not intend to compare DTNH with any gradient-based learning algorithms, as the contributions are complementary: one can freely use DTNH to improve any gradient-based optimization algorithm (if applicable) by correcting the descent direction.
V Experiment
In this section, we report our experimental results for DTNH. As stated, we evaluate DTNH with the two types of regularization-based deep transfer learning paradigms, i.e., L2-SP [16] and knowledge-distillation-based transfer [31].
V-A Datasets and Experiment Setups
Specifically, we use ResNet-18 [8] as our base model, with three common source datasets for weight pretraining: ImageNet [4], Places 365 [35], and Stanford Dogs 120 [12]. To demonstrate the performance of transfer learning, we further select target tasks based on four datasets: Caltech 256 [7], MIT Indoors 67 [23], Flowers 102 [21] and CIFAR 10 [14]. Note that we follow the settings of [16] for Caltech 256, where 30 or 60 samples are randomly drawn from each category for training, with 20 remaining samples per category for testing; the two sampling settings are treated as separate target tasks. Table I presents basic statistics of the datasets used in these experiments.
Datasets | Domains | # Train/Test

Source Tasks
ImageNet | visual objects | 1,419K+/100K
Places 365 | indoor scenes | 10,000K+
Stanford Dogs 120 | dogs | 12K/8.5K

Target Tasks
CIFAR 10 | objects | 50K/10K
Caltech 256 | objects | 30K+
MIT Indoors 67 | indoor scenes | 5K+/1K+
Flowers 102 | flowers | 1K+/6K+
Furthermore, to obtain the pretrained weights of all source tasks, we adopt the pretrained models of ImageNet (https://github.com/PaddlePaddle/models), Places 365 (https://github.com/CSAILVision/places365), and Stanford Dogs 120 (https://github.com/stormy-ua/dog-breeds-classification) released online. Interestingly, the pretrained models of Places 365 and Stanford Dogs 120 were themselves trained from the pretrained ImageNet model; the pretrained models for Places 365 and Stanford Dogs 120 have thus already been enhanced by ImageNet.
Source/Target Task Pairing. The above configuration leads to 15 source/target task pairs, and regularization would hurt the performance of transfer learning in some of these cases. For example, the image contents of ImageNet and CIFAR 10 are quite similar, so knowledge transfer from ImageNet to CIFAR 10 should improve performance. On the other hand, the images in Stanford Dogs 120 and MIT Indoors 67 are quite different (dogs vs. indoor scenes), so regularization based on the pretrained weights of the Stanford Dogs 120 task would hurt learning on the MIT Indoors 67 task.
Image Classification Task Setups. All images are resized and normalized to zero mean for each channel, followed by data augmentation with random mirroring and random cropping. We use a fixed mini-batch size, and SGD with a momentum of 0.9 is used for optimizing all models [27]. The learning rate for the base model starts at 0.01 and is divided by 10 after 6,000 iterations. Training is terminated after 8,000 iterations for Caltech 256, MIT Indoors 67 and Flowers 102, and after 20,000 iterations for CIFAR 10 (i.e., 18 epochs). Cross-validation was performed to search for the best regularization coefficient. Note that the regularization coefficient decays at a certain ratio, ranging from 0 to 1, per epoch. The pretrained weights obtained from the source task were used not only as the initialization of the model, i.e., the starting point of optimization, but also as the reference for regularization. Under the best configuration, each experiment is repeated five times, and the average accuracy with standard deviation is reported in this paper.
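The step learning-rate schedule above can be sketched as follows (a hypothetical helper reflecting the stated setup: base rate 0.01, divided by 10 after 6,000 iterations):

```python
def learning_rate(iteration, base_lr=0.01, drop_at=6000, factor=10.0):
    # Step schedule: constant base_lr, then base_lr / factor once
    # `drop_at` iterations have been completed.
    return base_lr if iteration < drop_at else base_lr / factor
```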
Hyperparameter Tuning. The tuning parameter ($\lambda$ in Eq. 1) for all experiments was tuned via cross-validation. The top-1 accuracies for the deep transfer learning algorithms are listed in Tables II, III and IV. We reproduced the experiments using the fine-tuning, L2-SP [16] and KnowDist [31] algorithms on the benchmark datasets, and found that these baseline algorithms indeed achieved better accuracy than reported in the original work [16, 31] under the same settings.
[Table: Top-1 accuracy (%) of Fine-tuning, L2-SP, DTNH (L2-SP), KnowDist and DTNH (KnowDist) on the target datasets Caltech 256, MIT Indoors 67, Flowers 102 and CIFAR 10; the numeric entries were lost in extraction.]
[Table: Top-1 accuracy (%) of Fine-tuning, L2-SP, DTNH (L2-SP), KnowDist and DTNH (KnowDist) on the target datasets Caltech 256, MIT Indoors 67, Flowers 102 and CIFAR 10; the numeric entries were lost in extraction.]
[Table: Top-1 accuracy (%) of Fine-tuning, L2-SP, DTNH (L2-SP), KnowDist and DTNH (KnowDist) on the target datasets Caltech 256, MIT Indoors 67, Flowers 102 and CIFAR 10; the numeric entries were lost in extraction.]
V-B Overall Performance Comparison
In this section, we report the results of the overall performance comparison on the above 15 source/target task pairs using the two deep transfer learning regularizers, L2-SP [16] and Knowledge Distillation (namely KnowDist) [31]. We mainly focus on evaluating the performance improvement contributed by DTNH on top of L2-SP and KnowDist, compared to the vanilla implementations of these two algorithms. We evaluate the four algorithms (DTNH based on L2-SP, DTNH based on KnowDist, and the vanilla implementations of L2-SP and KnowDist) on all 15 aforementioned source/target task pairs under the same machine learning settings. The overall accuracy comparisons are presented in Tables II, III, and IV. DTNH significantly improves L2-SP and KnowDist, achieving better accuracy on all 15 source/target pairs.
V-C General Transfer Learning Cases
DTNH improves the performance of deep transfer learning in the above cases, whether or not negative transfer occurs. For example, the CIFAR 10 target task works well with the source task ImageNet using the L2-SP algorithm, achieving 93.30% accuracy, while DTNH (L2-SP) improves it to 96.41% under the same setting (more than 3.1% accuracy improvement). For the same experiment, KnowDist achieves 96.43%, while DTNH (KnowDist) further improves it to 96.57%. To the best of our knowledge, this approaches the known limit [24] for CIFAR 10 training from ImageNet sources with only 18 epochs.
An interesting fact observed in the experiments is that, across all four algorithms and 15 source/target pairs, using Stanford Dogs 120 as the source task performs similarly to sourcing from ImageNet. We attribute this to the fact that the public release of the Stanford Dogs 120 model is pretrained from ImageNet, while the Stanford Dogs 120 dataset is relatively small: it cannot "wash out" the knowledge obtained from ImageNet, so the model preserves knowledge from both the ImageNet and Stanford Dogs 120 datasets. In this way, knowledge transfer from Stanford Dogs 120 can be as good as that based on ImageNet. Meanwhile, DTNH still improves the performance of L2-SP and KnowDist, gaining 0.12%–2.2% higher accuracy with low variance, even given the well-trained Stanford Dogs 120 model.
V-C1 Performance with Negative Transfer Effects
According to the results presented in Tables II, III, and IV, negative transfer may happen in the cross-domain cases "Visual Objects/Dogs → Indoor Scenes" (please refer to the domain definitions in Table I), while DTNH improves the performance of L2-SP and KnowDist to relieve such negative effects. Two detailed cases are addressed as follows.
Cases of Negative Transfer. For both the L2-SP and KnowDist algorithms, when using ImageNet or Stanford Dogs 120 as the source task while transferring to MIT Indoors 67 as the target task, we observe significant performance degradation compared to knowledge transfer from Places 365 to MIT Indoors 67. For example (Case I), the accuracy on MIT Indoors 67 using L2-SP is 84.09% based on the pretrained weights of Places 365, while the accuracy degrades to 75.11% and 74.64% under the same settings with ImageNet and Stanford Dogs 120 as the pretrained models, respectively. Furthermore, we observe similar negative transfer effects when using Places 365 as the source while transferring to target tasks based on Caltech 256, Flowers 102 and CIFAR 10. For example (Case II), the accuracy on Flowers 102 is 77.66% using Places 365 as the source, while sourcing from ImageNet and Stanford Dogs achieves as high as 88.96% and 88.14%, respectively, all based on L2-SP.
Relieving Negative Transfer Effects. We believe the performance degradation in Cases I and II is due to negative transfer, as the domains of these datasets are quite different. DTNH can, however, relieve such negative transfer. DTNH (L2-SP) achieves 84.11% on the Flowers 102 dataset even when sourcing from Places 365, i.e., a 7% accuracy improvement compared to vanilla L2-SP under the same settings. For the remaining negative transfer cases, DTNH still improves performance, with around 2% higher accuracy compared to the vanilla implementations of the L2-SP and KnowDist algorithms. We thus conclude that DTNH improves the performance of L2-SP and KnowDist in negative transfer cases with higher accuracy.
Note that we do not claim that DTNH eliminates negative transfer effects. It does, however, improve the performance of regularization-based deep transfer learning even with inappropriate source/target pairs, and such accuracy improvements partially mitigate the problem of negative transfer.
V-D Case Studies
We report the results of the following two case studies, which directly show that DTNH works in the way we assumed.
V-D1 Empirical Loss Minimization
As elaborated in the introduction, we suspect that the regularizer might restrict the learning procedure from lowering the empirical loss in deep transfer learning. Such a restriction helps deep transfer learning avoid overfitting but, at the same time, hurts the learning procedure. We therefore study the trends of empirical loss minimization with and without DTNH using the regularization-based deep transfer learning algorithms. Note that the empirical loss here is NOT the training loss; it refers to the data-fitting error part of the training loss.
Figure 3 illustrates the trends of both the empirical loss and the testing loss over increasing numbers of iterations, for both L2-SP and DTNH (L2-SP), in the Places 365 → MIT Indoors 67 case. As expected, the empirical loss of both vanilla L2-SP and DTNH (L2-SP) decreases with the number of iterations, while the empirical loss of L2-SP is always higher than that of DTNH (L2-SP). Meanwhile, DTNH (L2-SP) always enjoys a lower testing loss than vanilla L2-SP. This phenomenon indicates that, compared to DTNH (L2-SP), the L2-SP regularization term restricts empirical loss minimization in vanilla L2-SP and ultimately hurts the learning procedure, resulting in lower testing accuracy.
V-D2 Descent Direction vs. Original Gradients
The intuition behind the DTNH design is based on the two assumptions made in Section IV-A: it is possible to find a new descent direction that is very close to the direction of the empirical loss gradient (Assumption 1), while always sharing a small angle with the gradient of the regularization term (Assumption 2).
In this case study, we examine the angle (denoted Angle 1) between the actual descent direction of DTNH (L2-SP) and the (stochastic) gradient of the empirical loss, i.e., a noisy estimate of $\angle\big(d^*(\omega), \nabla L_{\mathrm{emp}}(\omega)\big)$, and the angle (denoted Angle 2) between the actual descent direction of DTNH (L2-SP) and the (stochastic) gradient of the L2-SP regularization term, i.e., a noisy estimate of $\angle\big(d^*(\omega), \nabla\Omega(\omega)\big)$ (both defined in Section IV-A).

Validation of Assumption 1. As shown in Figure 4, we compare Angle 1 with the angle (denoted Angle 3) between the (stochastic) gradient of vanilla L2-SP and the (stochastic) gradient of the empirical loss, i.e., a noisy estimate of $\angle\big(d(\omega), \nabla L_{\mathrm{emp}}(\omega)\big)$. As discussed in Section IV-A, when Angle 1 is smaller than Angle 3, DTNH moves in a direction that minimizes the empirical loss faster than L2-SP.

Validation of Assumption 2. As shown in Figure 5, we further compare Angle 2 with the angle (denoted Angle 4) between the (stochastic) gradient of vanilla L2-SP and the (stochastic) gradient of the L2-SP regularization term, i.e., a noisy estimate of $\angle\big(d(\omega), \nabla\Omega(\omega)\big)$. When the gap between Angles 2 and 4 is small, Angle 2 is relatively small; in this case, although the DTNH descent direction is affected by the regularizer, it still preserves the power of the regularizer.
VI Discussions
In this paper, we claimed that one of our major contributions is to alleviate the “negative transfer” phenomenon caused by the reuse of inappropriate pretrained weights in regularization-based transfer learning (e.g., SP [16] and KnowDist [31]).
In terms of evidence, previous work [26, 11, 28] has already demonstrated the effects of “negative transfer”. In addition, our experiments provide 15 detailed real-life cases, using pretrained weights from 3 source datasets (i.e., ImageNet, Places 365, and Stanford Dogs 120) and transferring to 5 different target tasks (i.e., Caltech 30, Caltech 60, MIT Indoors 67, Flowers 102, and CIFAR 10); the comparison results are presented in Tables II, III, and IV in Section V. It has been shown that, while regularization-based transfer learning can outperform fine-tuning in most cases, the regularization-based approaches perform even worse than fine-tuning for some specific source-target pairs. Our experimental results validate the existence of “negative transfer” effects. They also suggest that one can alleviate the “negative transfer” effects that used to appear in “inappropriate source-target pairs” by incorporating the proposed algorithm DTNH, achieving better accuracy than fine-tuning and the vanilla implementations of SP [16] and KnowDist [31].
To evaluate the improvement of DTNH over regularization-based deep transfer learning algorithms, we use SP [16] and KnowDist [31] as references. To our knowledge, SP [16] and KnowDist [31] are considered state-of-the-art transfer learning algorithms that require no modifications to the deep architectures. A newer state-of-the-art algorithm for regularization-based deep transfer learning is DELTA [15], which uses feature maps weighted by attention mechanisms as regularization to further improve KnowDist [31]. In our experiments, we did not include DELTA among the baselines for comparison. Since both DELTA and KnowDist use regularization between feature maps to enable knowledge transfer from teacher to student networks, we believe DTNH should also work with algorithms like DELTA.
In terms of methodology, the work most relevant to our study is Gradient Episodic Memory (GEM) for continual learning [19], which continually learns new tasks using the well-trained models of past tasks through regularization terms. In terms of objectives, DTNH aims at preventing the knowledge-transfer regularization from hurting empirical loss minimization, while GEM prevents the empirical loss minimization from hurting the regularization effects (i.e., the accuracy on old tasks). In terms of algorithms, in every iteration of learning, GEM estimates the descent direction with respect to the gradients of the new task and all past tasks using a time-consuming Quadratic Program (QP), while DTNH re-estimates the descent direction from the gradients of the regularizer term and the empirical loss term with a low-complexity orthogonal decomposition. All in all, GEM can be considered a special case of DTNH using the SP regularizer [16] based on two tasks.
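The low-complexity orthogonal decomposition contrasted with GEM's QP above can be sketched as follows. This is a minimal illustration under our own reading of the method: the variable and function names are ours, and the exact projection rule in the paper's formulation may differ. The idea is that when the regularizer gradient conflicts with the empirical loss gradient, its component along the conflicting direction is projected out before the two gradients are combined.

```python
import numpy as np

def dtnh_direction(g_emp, g_reg):
    """Re-estimate the descent direction from the empirical-loss gradient
    g_emp and the regularizer gradient g_reg (both flattened vectors)."""
    dot = np.dot(g_reg, g_emp)
    if dot < 0:
        # orthogonal decomposition: remove the component of g_reg that
        # opposes the empirical loss descent, keeping the rest intact
        g_reg = g_reg - (dot / np.dot(g_emp, g_emp)) * g_emp
    # the combined direction never points against the empirical loss gradient
    return g_emp + g_reg
```

Unlike GEM's per-iteration Quadratic Program, this step costs only a few dot products over the flattened gradients.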
VII Conclusions
In this paper, we studied a descent direction estimation strategy, DTNH, that improves common regularization techniques for deep transfer learning, such as SP [16] and Knowledge Distillation [31]. A nontrivial contribution has been made compared to existing methods, which simply aggregate the empirical loss for data fitting and the regularizer for knowledge transfer through a linear combination [16, 31].
Specifically, we designed a new method to re-estimate the direction of loss descent based on the (stochastic) gradient estimations of the empirical loss and the regularizers, where an orthogonal decomposition is performed on the gradient of the regularizers so as to eliminate the component conflicting with the empirical loss descent. We conducted extensive experiments to evaluate DTNH using several real-world datasets and typical convolutional neural networks. The experimental results and comparisons show that DTNH can significantly outperform the state of the art with higher accuracy, even in negative transfer cases.
Acknowledgement
We appreciate the program committee and reviewers’ efforts in reviewing and improving the manuscript. This paper was completed while Mr. Ruosi Wan was a full-time research intern at Baidu Inc. Please refer to the open-source repository for the PaddlePaddle-based implementation of DTNH. The deep transfer learning algorithms proposed in this paper have been transferred into technologies adopted by PaddleHub (https://github.com/PaddlePaddle/PaddleHub) and Baidu EZDL (https://ai.baidu.com/ezdl/). The first two authors contributed equally to this paper. Mr. Ruosi Wan contributed the algorithm design, and Dr. Haoyi Xiong led the research and wrote parts of the paper. Please contact Dr. Haoyi Xiong via xionghaoyi@baidu.com for correspondence.
References
 [1] (2017) Exploiting convolution filter patterns for transfer learning. In ICCV Workshops, pp. 2674–2680.
 [2] (1997) Multi-task learning. Machine Learning 28 (1), pp. 41–75.
 [3] (2018) Large scale fine-grained categorization and domain-specific transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4109–4118.
 [4] (2009) ImageNet: a large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009), pp. 248–255.
 [5] (2014) DeCAF: a deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning, pp. 647–655.
 [6] (2017) Borrowing treasures from the wealthy: deep transfer learning through selective joint fine-tuning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10–19.
 [7] (2007) Caltech-256 object category dataset.
 [8] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
 [9] (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
 [10] (2016) What makes ImageNet good for transfer learning?. arXiv preprint arXiv:1608.08614.
 [11] (2014) Improving deep neural network performance by reusing features trained with transductive transference. In International Conference on Artificial Neural Networks, pp. 265–272.
 [12] (2011) Novel dataset for fine-grained image categorization: Stanford dogs. In Proc. CVPR Workshop on Fine-Grained Visual Categorization (FGVC), Vol. 2, pp. 1.
 [13] (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, pp. 201611835.
 [14] (2014) The CIFAR-10 dataset. Online: http://www.cs.toronto.edu/kriz/cifar.html.
 [15] (2019) DELTA: deep learning transfer using feature map with attention for convolutional networks. In International Conference on Learning Representations.
 [16] (2018) Explicit inductive bias for transfer learning with convolutional networks. Thirty-fifth International Conference on Machine Learning.
 [17] (2017) Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence.
 [18] (2017) Sparse deep transfer learning for convolutional neural network. In AAAI, pp. 2245–2251.
 [19] (2017) Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pp. 6467–6476.
 [20] (2014) Recurrent models of visual attention. In Advances in Neural Information Processing Systems, pp. 2204–2212.
 [21] (2008) Automated flower classification over a large number of classes. In Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing.
 [22] (2010) A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22 (10), pp. 1345–1359.
 [23] (2009) Recognizing indoor scenes. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 413–420.

 [24] (2018) Do CIFAR-10 classifiers generalize to CIFAR-10?. arXiv preprint arXiv:1806.00451.
 [25] (2014) FitNets: hints for thin deep nets. arXiv preprint arXiv:1412.6550.
 [26] (2005) To transfer or not to transfer. In NIPS 2005 Workshop on Transfer Learning, Vol. 898.
 [27] (2013) On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pp. 1139–1147.
 [28] (2016) A survey of transfer learning. Journal of Big Data 3 (1), pp. 9.
 [29] (2015) Show, attend and tell: neural image caption generation with visual attention. In International Conference on Machine Learning, pp. 2048–2057.
 [30] (2016) Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 21–29.
 [31] (2017) A gift from knowledge distillation: fast optimization, network minimization and transfer learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 2.
 [32] (2014) How transferable are features in deep neural networks?. In Advances in Neural Information Processing Systems, pp. 3320–3328.
 [33] (2016) Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928.
 [34] (2018) Parameter transfer unit for deep neural networks. arXiv preprint arXiv:1804.08613.
 [35] (2017) Places: a 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.