Deep convolutional networks achieve great success on large-scale vision tasks such as ImageNet [russakovsky2015imagenet] and Places365 [zhou2017places]. In addition to their notable improvements in accuracy, deep representations learned by modern CNNs have been demonstrated to be transferable across relevant tasks [yosinski2014transferable]. This is rather fortunate for many real-world applications with insufficient labeled examples. Transfer learning aims to obtain good performance on such tasks by leveraging knowledge learned from relevant large-scale datasets. The auxiliary and desired tasks are called the source and target tasks respectively. Following the categorization in [pan2009survey], we focus on inductive transfer learning, which concerns the situation where the source and target tasks have different label spaces.
The most popular practice is to fine-tune a model pre-trained on the source task with $L^2$ regularization, which has the effect of constraining the parameters around the origin. [li2018explicit] points out that, since the parameters may be driven far from the Start Point (SP) of the pre-trained model, a major disadvantage of naive fine-tuning is the risk of catastrophic forgetting of the knowledge learned from the source. They recommend using the $L^2$-SP regularizer instead of the popular $L^2$. In parallel, knowledge distillation, which was originally designed to compress the knowledge of a complex model into a simpler one [hinton2015distilling], has proved useful for transfer learning tasks where knowledge is distilled from a different dataset [zagoruyko2016paying, yim2017gift]. Recent work [li2019delta] formulates knowledge distillation in transfer learning as a regularizer on features and further improves it by reusing unactivated channels to better fit the training samples.
Although the starting point regularization and knowledge distillation methods mentioned above succeed in preserving the knowledge contained in the source model, fine-tuning also carries the obvious risk of negative transfer. Intuitively, if the source and target data distributions are dissimilar to some extent, not all the knowledge from the source is transferable to the target, and an indiscriminate transfer could be detrimental to the target model. However, the impact of negative transfer has rarely been considered in inductive transfer learning studies. The most related work, [chen2019catastrophic], proposes the regularizer of Batch Spectral Shrinkage (BSS) to inhibit negative transfer, where the small singular values of feature representations are considered not transferable and are suppressed. Yet, it is hard to adaptively determine the scope of small singular values when faced with different target tasks. Moreover, BSS does not take the risk of catastrophic forgetting into consideration, which means it has to be combined with other fine-tuning techniques (e.g., [li2018explicit], [li2019delta]) to achieve considerable performance.
According to the above analysis, it is natural to look for a better solution which simultaneously preserves relevant knowledge and avoids negative transfer. In this paper, we intend to improve the standard fine-tuning paradigm by accurate knowledge transfer. Assuming that the knowledge contained in the source model consists of one part relevant to the target task and another part which is irrelevant (note that although this is not mathematically guaranteed, it is very common in practice that the source task is general-purpose while the target task focuses on a specific domain), we explicitly disentangle the former from the source model. Thus, a target-task-specific starting point is used as the reference instead of the original one. Specifically, we design a novel regularizer for deep transfer learning through Target-awareness REpresentation Disentanglement (TRED). The whole algorithm includes two steps. First, we use a lightweight disentangler to separate middle representations of the pre-trained source model into positive and negative parts. The disentanglement is achieved by maximizing the maximum mean discrepancy (MMD) between these two parts while requiring that they can reconstruct the original representation. Supervision information from labeled target examples is utilized to distinguish the positive part from the negative part. The second step is to perform fine-tuning using the disentangled positive part of the representations as the reference. In summary, our main contributions are as follows:
We are the first to involve the idea of representation disentanglement to improve inductive transfer learning.
Our algorithm aiming at accurate knowledge transfer contributes to the study of negative transfer.
Our proposed TRED significantly outperforms state-of-the-art transfer learning regularizers including $L^2$, $L^2$-SP, AT, DELTA and BSS on various real-world datasets.
| Approach | Risk: CF* | Risk: NT* |
|---|---|---|
| $L^2$ [donahue2014decaf] | ✗ | ✗ |
| SPAR [zagoruyko2016paying, li2018explicit, li2019delta] | ✓ | ✗ |
| BSS [chen2019catastrophic] | ✗ | ✓ |
| TRED | ✓ | ✓ |

* CF = Catastrophic Forgetting, NT = Negative Transfer, SPAR = Starting Point As the Reference.
2 Related Work
2.1 Shrinkage Towards Chosen Parameters
Regularization techniques have a long history since Stein’s paradox [stein1956inadmissibility, efron1977stein], showing that shrinking towards chosen parameters obtains an estimate more accurate than simply using observed averages. Most common choices like Lasso and Ridge Regression pull the model towards zero, while it is widely believed that shrinking towards “true parameters” is more effective. In transfer learning, models pre-trained on relevant source tasks with sufficient labeled examples are often regarded as “true parameters”. Earlier works demonstrate its effectiveness on Maximum Entropy [chelba2006adaptation] or Support Vector Machine models [yang2007adapting, li2007regularized, aytar2011tabula].
2.2 Deep Inductive Transfer Learning
In order to overcome over-fitting, various transfer learning regularizers have been proposed. According to the type of regularized objectives, they can be categorized as parameter based [li2018explicit], feature based [zagoruyko2016paying, yim2017gift, li2019delta] or spectral based [chen2019catastrophic] methods.
Our paper adopts the general idea of preserving knowledge by regularizing features of the source model. Unlike previous methods, however, we do not directly use the original knowledge provided by the source model. Instead, we disentangle the useful part and use it as the reference to avoid negative transfer. The main differences are summarized in Table 1.
Studies from other angles, such as sample selection [ge2017borrowing, ngiam2018domain, jeon2020sample], dynamic computing [guo2019spottune], sparse transfer [wang2019pay] and cross-modality transfer [hu2020cross] are also important topics but out of this paper’s scope.
2.3 Representation Disentanglement
Techniques of representation disentanglement are developed to help depict the underlying regularities between the input and output in an interpretable way. Note that the concept of disentanglement in this paper is somewhat different from the mainstream understanding, where the objective is to separate latent factors of variation [goodfellow2009measuring, bengio2013representation]. Our work is highly inspired by the idea of domain information disentanglement [liu2018unified, belghazi2018mine], which intends to extract domain-invariant representations in unsupervised domain adaptation tasks. In those works, the disentangled components are domain representations, or groups of features, rather than individual features.
3.1 Problem Definition
In inductive transfer learning, we are given a model pre-trained on the source task, with the parameter vector $\omega^0$. For the desired task, the training set contains $n$ tuples, each of which is denoted as $(x_i, y_i)$, where $x_i$ and $y_i$ refer to the $i$-th example and its corresponding label.
Let us further denote $z(\cdot, \omega)$ as the function of the neural network and $\omega$ as the parameter vector of the target network. We have the objective of structural risk minimization
$$\min_{\omega} \sum_{i=1}^{n} L\big(z(x_i, \omega), y_i\big) + \lambda \cdot \Omega(\omega),$$
where the first term is the empirical loss and the second is the regularization term. $\lambda$ is the coefficient that balances data fitting against reducing over-fitting.
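As a minimal illustration of the structural risk objective, the sketch below adds a weighted regularization term to a summed empirical loss. The toy linear model, the squared loss and all function names are assumptions made for this example only; they are not the network or loss used in the paper.

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[Sequence[float], float]


def structural_risk(
    params: Sequence[float],
    data: Sequence[Example],
    predict: Callable[[Sequence[float], Sequence[float]], float],
    loss: Callable[[float, float], float],
    regularizer: Callable[[Sequence[float]], float],
    lam: float,
) -> float:
    """Empirical loss summed over the data plus lam times the regularizer."""
    empirical = sum(loss(predict(x, params), y) for x, y in data)
    return empirical + lam * regularizer(params)


# Hypothetical stand-ins for the network, the loss and the regularizer.
def linear(x, w):
    return sum(xi * wi for xi, wi in zip(x, w))


def squared(pred, y):
    return (pred - y) ** 2


def l2(w):
    return sum(wi * wi for wi in w)


toy_data = [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0)]
toy_risk = structural_risk([1.0, 1.0], toy_data, linear, squared, l2, lam=0.1)
```

Swapping the `regularizer` argument is the only change needed to move between the explicit regularizers discussed in this paper.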
3.2 Regularizers for Transfer Learning
Recent studies in the deep learning paradigm show that SGD itself has the effect of implicit regularization that helps generalization in the over-parameterized regime [soltanolkotabi2018theoretical]. In addition, since fine-tuning is usually performed with a smaller learning rate and fewer epochs, it can be regarded as a form of implicit regularization towards the initial solution with good generalization properties [liu2019towards]. Besides, we give a brief introduction of state-of-the-art explicit regularizers for deep transfer learning.
$L^2$. The most common choice is the $L^2$ penalty of the form $\Omega(\omega) = \|\omega\|_2^2$, also named weight decay in deep learning. From a Bayesian perspective, it corresponds to a Gaussian prior on the parameters with zero mean. The shortcoming is that the meaningful initial point is ignored.
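$L^2$-SP. [li2018explicit] follows the idea of shrinking towards chosen targets instead of zero. They propose to use the starting point as the reference
$$\Omega(\omega) = \frac{\alpha}{2}\,\|\omega_{S} - \omega_{S}^{0}\|_2^2 + \frac{\beta}{2}\,\|\omega_{\bar{S}}\|_2^2,$$
where $\omega_{S}$ denotes the parameters shared with the source model and $\omega_{\bar{S}}$ the remaining parameters. The first term constrains the parameters of the part responsible for representation learning around the starting point, and the second is weight decay on the remaining, task-specific part. Since the second term is common to all mentioned methods, we omit it in the following formulas.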
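The difference between shrinking towards zero and shrinking towards the starting point can be sketched in a few lines. This is an illustrative sketch only: parameters are flat Python lists, and the coefficient names `alpha` and `beta` are assumptions for this example.

```python
def l2_penalty(w):
    """Plain weight decay: squared distance of the parameters from zero."""
    return sum(v * v for v in w)


def l2_sp_penalty(w_shared, w0_shared, w_new, alpha=1.0, beta=1.0):
    """Starting-point penalty on the shared part plus weight decay on the rest.

    w_shared  : current parameters of the layers shared with the source model
    w0_shared : pre-trained values of those parameters (the starting point)
    w_new     : task-specific parameters with no pre-trained counterpart
    """
    sp_term = sum((v - v0) ** 2 for v, v0 in zip(w_shared, w0_shared))
    decay_term = sum(v * v for v in w_new)
    return 0.5 * alpha * sp_term + 0.5 * beta * decay_term
```

At the starting point itself the first term vanishes, whereas plain weight decay still penalizes the pre-trained weights.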
DELTA. [li2019delta] extends the framework of feature distillation [Romero2014FitNetsHF, zagoruyko2016paying] by incorporating an attention mechanism. They constrain the 2-D activation maps of different channels with different strengths according to their value to the target task. Given a training tuple $(x_i, y_i)$ and the distance metric $D(\cdot, \cdot)$ between activation maps, the regularization is formulated as
$$\Omega(\omega, x_i, y_i) = \sum_{j=1}^{C} W_j \cdot D\big(F_j(\omega, x_i),\, F_j(\omega^0, x_i)\big),$$
where $C$ is the number of channels, $F_j(\omega, x_i)$ is the activation map of the $j$-th channel, and $W_j$ refers to the regularization weight assigned to the $j$-th channel. Specifically, each weight is independently evaluated by the performance drop when disabling that channel.
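A channel-weighted feature regularizer of this kind can be sketched as follows. The sketch assumes a squared Euclidean distance between flattened activation maps and takes the per-channel weights as given; how the weights are obtained is outside its scope, and the function name is hypothetical.

```python
def channel_weighted_distance(fm_target, fm_source, weights):
    """Weighted sum over channels of squared distances between activation maps.

    fm_target, fm_source : lists of channels, each channel a flat list of
                           activations from the target / source model
    weights              : one non-negative weight per channel
    """
    total = 0.0
    for w, ch_t, ch_s in zip(weights, fm_target, fm_source):
        total += w * sum((a - b) ** 2 for a, b in zip(ch_t, ch_s))
    return total
```

Setting all weights to the same value recovers an unweighted feature-map penalty.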
BSS. The authors of [chen2019catastrophic] propose Batch Spectral Shrinkage (BSS), which penalizes untransferable spectral components of deep representations. They find that the less transferable spectral components are those corresponding to relatively small singular values. They apply differentiable SVD to compute all singular values of a feature matrix and penalize the $k$ smallest ones:
$$\Omega(\omega) = \sum_{i=1}^{k} \sigma_{-i}^{2},$$
where all singular values $\sigma_1, \ldots, \sigma_b$ are in descending order and $\sigma_{-i}$ denotes the $i$-th smallest one. $\omega^0$ is not involved, as BSS does not consider preserving the knowledge in the source model.
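The penalty on the smallest singular values can be illustrated as below. To keep the sketch dependency-free it assumes the singular values of the batch feature matrix have already been computed by some SVD routine; `k` and the function name are hypothetical.

```python
def smallest_singular_value_penalty(singular_values, k=1):
    """Sum of squares of the k smallest singular values.

    singular_values : non-negative singular values in any order
    k               : how many of the smallest values to penalize
    """
    smallest = sorted(singular_values)[:k]
    return sum(s * s for s in smallest)
```

In an actual training loop the singular values would come from a differentiable SVD so that the penalty can be back-propagated.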
4 Disentangled Starting Point As the Reference
The most important component of our algorithm is to disentangle the original knowledge into two parts, which are respectively relevant and irrelevant to the target task. Imitating the visual attention mechanism of humans, we force the two parts to pay attention to different spatial regions within the original image. Mathematically, this is achieved by enlarging the distance between their statistical distributions, measured by MMD. The positive part is distinguished by a simple discriminator. Then we fine-tune the target model with a regularization that restricts the distance between feature maps and the corresponding disentangled ones. The framework of the approach is illustrated in Figure 1. We explain the components in the following paragraphs.
Representation Disentanglement. Unlike the mainstream of disentanglement studies, which try to separate atomic attributes such as color or angle, we care about disentangling the components relevant to the target task from the whole representation produced by the source model. Formally, we disentangle the original representation $F^{0}$ obtained from the pre-trained model into the positive part $F^{+}$ and the negative part $F^{-}$ with the disentangler module $G$:
$$(F^{+}, F^{-}) = G(F^{0}),$$
where $F^{+}$ and $F^{-}$ have the same shape as $F^{0}$. For efficient estimation and optimization of the disentanglement, we further denote the mapping functions $\phi_s$ and $\phi_c$, representing dimension reduction along the spatial and channel direction respectively. Therefore we get
$$F_s^{*} = \phi_s(F^{*}), \qquad F_c^{*} = \phi_c(F^{*}),$$
where $*$ refers to either $+$ or $-$.
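Maximum Mean Discrepancy. Acting as a relevant criterion for comparing distributions based on the Reproducing Kernel Hilbert Space (RKHS), Maximum Mean Discrepancy (MMD) is widely used in statistical and machine learning tasks such as two-sample tests [fukumizu2009kernel, gretton2012kernel], domain adaptation [pan2010domain, tzeng2014deep, long2017deep] and generative modeling [sutherland2016generative, arbel2018gradient].
Denoting $U = \{u_i\}_{i=1}^{n_u}$ and $V = \{v_j\}_{j=1}^{n_v}$ as random variable sets with distributions $P$ and $Q$, an empirical estimate [tzeng2014deep, long2015learning] of the MMD between $P$ and $Q$ compares the squared distance between the empirical kernel mean embeddings as
$$\mathrm{MMD}(U, V) = \Big\| \frac{1}{n_u} \sum_{i=1}^{n_u} \psi(u_i) - \frac{1}{n_v} \sum_{j=1}^{n_v} \psi(v_j) \Big\|_{\mathcal{H}}^{2},$$
where $\psi$ is the feature map of the RKHS $\mathcal{H}$ and $k(u, v) = \langle \psi(u), \psi(v) \rangle_{\mathcal{H}}$ refers to the kernel, for which a Gaussian radial basis function (RBF) is usually used in practice [long2015learning, louizos2016variational].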
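The empirical estimate can be computed entirely through kernel evaluations, since the squared distance between mean embeddings expands into three averaged kernel sums. The sketch below uses the biased (V-statistic) form with a Gaussian RBF kernel; the bandwidth `sigma` and the function names are assumptions for illustration.

```python
import math


def rbf_kernel(u, v, sigma=1.0):
    """Gaussian RBF kernel between two vectors given as sequences of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd_squared(xs, ys, sigma=1.0):
    """Biased empirical estimate of the squared MMD between two samples.

    Expands ||mean(psi(x)) - mean(psi(y))||^2 into kernel averages:
    mean k(x, x') + mean k(y, y') - 2 * mean k(x, y).
    """
    n, m = len(xs), len(ys)
    k_xx = sum(rbf_kernel(a, b, sigma) for a in xs for b in xs) / (n * n)
    k_yy = sum(rbf_kernel(a, b, sigma) for a in ys for b in ys) / (m * m)
    k_xy = sum(rbf_kernel(a, b, sigma) for a in xs for b in ys) / (n * m)
    return k_xx + k_yy - 2.0 * k_xy
```

The estimate is zero for identical samples and grows as the two samples move apart, which is the property exploited when the distance between two sets of representations is to be enlarged.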
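Our objective is to enlarge the MMD between the disentangled positive and negative parts along the spatial dimension. Intuitively, this explicitly encourages the two parts to recognize different regions of the input image. For more stable optimization, we minimize the exponential of the negative MMD as follows:
$$L_{\mathrm{mmd}} = \exp\big(-\mathrm{MMD}(F_s^{+}, F_s^{-})\big),$$
where $F_s^{+}$ and $F_s^{-}$ denote the dimension-reduced positive and negative parts.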
Reconstruction Requirement. As both the positive and negative parts are produced by the flexible disentangler, it is easy to obtain two meaningless representations if the only objective is to maximize the distribution distance. To ensure the disentanglement is informative rather than an arbitrary transformation, we add a reconstruction requirement that restricts the disentangled representations to the original knowledge. Specifically, the disentangled positive and negative parts are required to recover the original representation by point-wise addition:
$$L_{\mathrm{rec}} = \big\| F^{+} + F^{-} - F^{0} \big\|_2^2,$$
where $F^{0}$ is the original representation and $F^{+}$, $F^{-}$ are its disentangled positive and negative parts.
Distinguishing the Positive Part. Since the above representation disentanglement is symmetric in the two parts, an explicit signal is required to distinguish the features which are useful for the target task. In particular, the selected layer for representation transfer is followed by a classifier consisting of a global pooling layer and a fully connected layer. A regular cross-entropy loss is added to explicitly drive the disentangler to extract into the positive part the components which are discriminative for the target task. Denoting the involved classifier as $h$ and the positive part of the representation as $F^{+}$, we have
$$L_{\mathrm{ce}} = \mathrm{CE}\big(h(F^{+}),\, y_i\big).$$
Regularizing the Disentangled Representation. After the step of representation disentanglement, we perform fine-tuning on the target task. We regularize the distance between a feature map and its corresponding starting point. Unlike previous feature-map-based regularizers such as [Romero2014FitNetsHF, zagoruyko2016paying, li2019delta], the starting point here is the disentangled positive part of the original representation. The regularization term corresponding to some example $(x_i, y_i)$ becomes:
$$\Omega(\omega, \theta, x_i, y_i) = \big\| F(\omega, x_i) - G_{\theta}^{+}\big(F(\omega^0, x_i)\big) \big\|_2^2,$$
where $F(\omega, x_i)$ is the feature map of the target model, $G_{\theta}^{+}$ returns the positive part produced by the disentangler, and $\theta$ refers to the parameters of the disentangler, which are frozen during the fine-tuning stage. The complete training procedure is presented in Algorithm 1.
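The two stages can be summarized by the loss terms they optimize. The sketch below is a simplified illustration on flattened feature vectors: it assumes squared Euclidean distances, equal weighting of the three disentangler terms, and an MMD value computed elsewhere. These choices and all function names are assumptions for this example, not the paper's exact formulation.

```python
import math


def reconstruction_loss(f_pos, f_neg, f_orig):
    """Squared error between the point-wise sum of both parts and the original."""
    return sum((p + n - o) ** 2 for p, n, o in zip(f_pos, f_neg, f_orig))


def cross_entropy(logits, label):
    """Softmax cross entropy for one example, using a stable log-sum-exp."""
    top = max(logits)
    log_z = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_z - logits[label]


def disentangler_loss(mmd_value, f_pos, f_neg, f_orig, logits, label):
    """Stage 1: separation + reconstruction + classification on the positive part.

    mmd_value : MMD between the positive and negative parts (computed elsewhere)
    logits    : classifier output computed from the positive part
    """
    separation = math.exp(-mmd_value)  # small when the two parts are far apart
    return (separation
            + reconstruction_loss(f_pos, f_neg, f_orig)
            + cross_entropy(logits, label))


def disentangled_reference_penalty(f_target, f_pos):
    """Stage 2: squared distance between the target feature and the positive part."""
    return sum((t - p) ** 2 for t, p in zip(f_target, f_pos))
```

In stage 2 the positive part is treated as a constant, because the disentangler is frozen while the target model is fine-tuned.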
We select several popular transfer learning datasets to evaluate the effectiveness of our method.
Stanford Dogs. The Stanford Dogs [KhoslaYaoJayadevaprakashFeiFei_FGVC2011] dataset consists of images of 120 breeds of dogs, each containing 100 examples used for training and 72 for testing. It is a subset of ImageNet.
MIT Indoor-67. MIT Indoor-67 [quattoni2009recognizing] is an indoor scene classification task consisting of 67 categories. There are 80 images for training and 20 for testing for each category.
CUB-200-2011. Caltech-UCSD Birds-200-2011 [WelinderEtal2010] contains 11,788 images of 200 bird species from around the world. Each species is associated with a Wikipedia article and organized by scientific classification.
Food-101. Food-101 [bossard14] is a large-scale dataset consisting of more than 100k food images divided into 101 different kinds. To better fit transfer learning applications, we use two subsets which contain 30 and 150 training examples per category.
Flower-102. Flower-102 [Nilsback2008Automated] consists of 102 flower categories. 1020 images are used for training and 6149 images for testing. Only 10 samples are provided for each category during training.
Stanford Cars. The Stanford Cars [KrauseStarkDengFei-Fei_3DRR2013] dataset contains 16,185 images of 196 classes of cars. The data is split into 8,144 training and 8,041 testing images.
Oxford-IIIT Pet. The Oxford-IIIT Pet [parkhi2012cats] dataset is a 37-category pet dataset with about 200 cat or dog images for each class.
Textures. Describable Textures Dataset [cimpoi14describing] is a texture database, containing 5640 images organized by 47 categories according to perceptual properties of textures.
5.2 Settings and Hyperparameters
We implement transfer learning experiments based on ResNet [he2016deep]. For MIT Indoor-67, we use ResNet-50 pre-trained on the large-scale scene recognition dataset Places365 [zhou2017places] as the source model. For the remaining datasets, we use ImageNet pre-trained ResNet-101 as the source model. Input images are resized so that the shorter edge is 256 and then randomly cropped to a fixed size during training.
For optimization, we first train 5 epochs to optimize the disentangler by Adam with the learning rate of 0.01. All involved hyperparameters are set to default values of. Then we use SGD with the momentum of 0.9, batch size of 64 and initial learning rate of 0.01 for fine-tuning the target model. We train 40 epochs for each dataset. The learning rate is divided by 10 after 25 epochs. We run each experiment three times and report the average top-1 accuracy.
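The fine-tuning schedule described above (initial learning rate 0.01, divided by 10 after 25 of the 40 epochs) can be written as a small step function. The 0-indexed epoch convention and the function name are assumptions of this sketch.

```python
def fine_tuning_learning_rate(epoch, base_lr=0.01, decay_epoch=25, factor=10.0):
    """Learning rate for a 0-indexed epoch: constant, then divided once.

    The first `decay_epoch` epochs (0 .. decay_epoch - 1) use `base_lr`;
    all later epochs use `base_lr / factor`.
    """
    return base_lr / factor if epoch >= decay_epoch else base_lr


# Learning rates for the 40 fine-tuning epochs.
schedule = [fine_tuning_learning_rate(e) for e in range(40)]
```

Most deep learning frameworks ship an equivalent multi-step scheduler, so a hand-written function like this is only needed for illustration.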
TRED is compared with state-of-the-art transfer learning regularizers including $L^2$-SP [li2018explicit], AT [zagoruyko2016paying], DELTA [li2019delta] and BSS [chen2019catastrophic]. We perform 3-fold cross validation searching for the best hyperparameter in each experiment. For , and , the search space is [, , ]. Although authors in and recommended fixed values of ( for and for ), we also extend the search space to [, , ] for and [, , ] for .
While recent theoretical studies proved that weight decay actually has no regularization effect [van2017l2, golatkar2019time] when combined with the commonly used batch normalization, we use No Regularization as the most naive baseline to re-evaluate $L^2$. From Table 1 we observe that $L^2$ does not outperform fine-tuning without any regularization. This may imply that deep transfer learning hardly benefits from regularizers with non-informative priors.
Advanced works [zagoruyko2016paying, li2018explicit, li2019delta] adopt regularizers that use the starting point as the reference for knowledge preservation. From a Bayesian perspective, these are equivalent to an informative prior which trusts the knowledge contained in the source model, in the form of either parameters or behavior. Table 1 shows that these algorithms obtain significant improvements on some datasets such as Stanford Dogs and MIT Indoor-67, where the target dataset is very similar to the source dataset. However, the benefits are much smaller on other datasets such as CUB-200-2011, Flower-102, Stanford Cars and Food-101.
Table 1 illustrates that TRED consistently outperforms all the above baselines over all evaluated datasets. It outperforms the naive fine-tuning regularizer $L^2$ by more than 2% on average. Except for Stanford Dogs and MIT Indoor-67, improvements are still obvious even compared with the state-of-the-art regularizers $L^2$-SP, AT, DELTA and BSS.
To evaluate the scalability of our algorithm with more limited data, we conduct additional experiments on subsets of the standard dataset CUB-200-2011. Baseline methods include $L^2$, BSS [chen2019catastrophic], $L^2$-SP [li2018explicit], AT [zagoruyko2016paying] and DELTA [li2019delta]. Specifically, we randomly sample 50%, 30% and 15% of the training examples for each category to construct new training sets. Results show that our proposed TRED achieves remarkable improvements compared with all competitors, as presented in Table 2.
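Constructing the reduced training sets amounts to stratified subsampling: the same fraction of examples is drawn independently within each category. The sketch below is one possible way to do this; the rounding rule, the fixed seed and the function name are assumptions, not details given in the paper.

```python
import random
from collections import defaultdict


def subsample_per_category(labels, rate, seed=0):
    """Return sorted indices keeping roughly `rate` of the examples per category.

    labels : list of category labels, one per training example
    rate   : fraction of examples to keep in each category (e.g. 0.5, 0.3, 0.15)
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for idx, label in enumerate(labels):
        by_category[label].append(idx)

    kept = []
    for label in sorted(by_category):
        indices = by_category[label]
        # Keep at least one example so that no category disappears.
        n_keep = max(1, round(rate * len(indices)))
        kept.extend(rng.sample(indices, n_keep))
    return sorted(kept)
```

Sampling within each category keeps the class balance of the original training set, so accuracy changes can be attributed to the amount of data rather than to a shifted label distribution.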
Figure 2 shows how these methods behave when reducing the size of the training set. For clear illustration, we treat all regularizers that only consider the risk of catastrophic forgetting as a group, namely SPAR, as they all follow the framework of using the Starting Point As the Reference. The average accuracy of $L^2$-SP, AT and DELTA is used to represent SPAR. BSS, which is designed for suppressing the untransferable ingredients of features, only tackles the problem of negative transfer, while TRED deals with both challenges. We plot the improvements of these methods compared with naive fine-tuning with $L^2$ regularization. We can observe from Figure 2 that BSS and TRED obtain increased improvements as the sampling rate decreases, implying that the negative impact from the source model is greater when the target dataset is smaller. Although sharing the same trend, TRED always outperforms BSS by an obvious margin at all sampling rates. In contrast, the curve of SPAR is much more stable as the sampling rate decreases.
In this section, we dive deeper into the mechanism and experimental results to explain why target-awareness disentanglement provides a better reference. In subsection Representation Visualization, we show the effect of our method by visualizing attention maps and feature embeddings. In subsection Shrinking Towards True Behavior, we briefly discuss the theoretical understanding related to shrinkage estimation. Then we provide more statistical evidence to validate the advantage of the disentangled positive representation. In subsection Ablation Study, we empirically analyze why the disentanglement component is essential.
6.1 Representation Visualization
Show Cases. The authors of [zagoruyko2016paying] show that the spatial attention map plays a critical role in knowledge transfer. We demonstrate the effect of representation disentanglement by visualizing the attention map in Fig 3. As observed in typical cases from CUB-200-2011 and Stanford Cars, the original representations generated by the ImageNet pre-trained model usually contain a wide range of semantic features, such as other objects or backgrounds, in addition to parts of the target object. Our proposed disentangler is able to “purify” the concepts of interest into the positive part, while the negative part pays more attention to the complementary constituents.
Embedding Visualization. Since the most important change of our method is to use the disentangled rather than the original representation as the reference, we are interested in comparing the properties of these two representations on the target task. We visualize the original and disentangled feature representations of Flower-102 and MIT Indoor-67. The dimension of features is reduced along the spatial direction and then plotted in the 2D space using t-SNE embeddings. As illustrated in Figure 4, deep representations derived by our proposed disentangler are separated more clearly than the original ones for different categories and clustered more tightly for samples of the same category.
6.2 Shrinking Towards True Behavior
Recent work [li2018explicit] discusses the connection between their proposed $L^2$-SP and the classical statistical theory of shrinkage estimation [efron1977stein]. The key hypothesis is that shrinking towards a value which is close to the “true parameters” is more effective than shrinking towards an arbitrary one. [li2018explicit] argues that the starting point is supposed to be closer to the “true parameters” than zero. [zagoruyko2016paying, li2019delta] regularize the features rather than the parameters, which can be interpreted as shrinking towards the “true behavior”. Our proposed TRED further improves on them by explicitly disentangling a “truer behavior”, utilizing the global distribution and supervision information of the target dataset. To support the claim, we provide some additional evidence as follows.
Reducing Untransferable Components. Inspired by [chen2019catastrophic], we compute the singular vectors and singular values of the deep representation by SVD. All singular values are sorted in descending order and plotted in Fig 5. The authors of [chen2019catastrophic] demonstrate that the spectral components corresponding to smaller singular values are less transferable. They find that these less transferable components can be suppressed by involving more training examples. Interestingly, we find similar trends with the proposed representation disentanglement. As observed in Fig 5, the smaller singular values of the disentangled positive representation are further reduced compared with the original representation. Fig 5 also shows that the spectral components corresponding to larger singular values are increased, a phenomenon which does not appear in [chen2019catastrophic]. This is intuitively consistent with the hypothesis that features relevant to the target task are disentangled and strengthened.
Robustness to Regularization Strength. We also provide empirical evidence to illustrate the effect of the “truer behavior” obtained by our proposed disentangler. The intuition is straightforward: if the behavior (representation) used as the reference is “truer”, it should be more robust to a larger regularization strength. We compare TRED with the baseline regularizer which uses the original representation as the reference. We select three transfer learning tasks for evaluation: Places365 → MIT Indoor-67, ImageNet → Stanford Cars and Places365 → Stanford Dogs. The regularization strength $\lambda$ is gradually increased from 0.001 to 1. As illustrated in Fig 6, the performance of the baseline falls rapidly as $\lambda$ increases, especially for ImageNet → Stanford Cars and Places365 → Stanford Dogs, indicating that the regularizer using original representations as the reference suffers seriously from negative transfer. In contrast, TRED is much more robust to the increase of $\lambda$.
| Dataset | Target Task: TRED | Target Task: TRED- | Source Task: TRED | Source Task: TRED- |
|---|---|---|---|---|
| MIT Indoor-67* | 85.79 | 82.42 | 28.92 | 25.57 |

* Pre-trained on Places365 and evaluated on ImageNet.
6.3 Mechanism Analysis
Since the desired output for the target task is the disentangled positive part, it seems reasonable to obtain the discriminative representation using only the classifier and its cross-entropy loss. In this section, we conduct an ablation study on this simpler framework, which does not maximize the distribution distance and therefore performs a direct transformation of the original representation rather than disentanglement. This version is denoted by TRED-.
We can observe in Table 3 that all evaluated tasks suffer a significant performance drop on the target task without disentanglement. A reasonable guess is that disentangling helps preserve knowledge in the source model and restrains the representation transformation from over-fitting the classifier. To verify this hypothesis, we compare the top-1 accuracy of ImageNet classification (the source task) between TRED and TRED-. Specifically, we train a randomly initialized classifier to recognize the ImageNet categories, using the fixed transformed representation as input. The top-1 accuracy of the pre-trained ResNet-101 model is 75.99%. As shown in Table 3, TRED- suffers a larger performance drop on ImageNet than TRED, indicating that representation disentanglement performs better in preserving the general knowledge learned from the source task.
In this paper, we extend the study of negative transfer in inductive transfer learning. Specifically, we propose a novel approach that regularizes towards the disentangled deep representation, achieving accurate knowledge transfer. We implement target-awareness disentanglement using the MMD metric together with reconstruction and classification constraints. Extensive experimental results on various real-world transfer learning datasets show that TRED significantly outperforms state-of-the-art transfer learning regularizers. Moreover, we provide empirical analysis to verify that the disentangled target-awareness representation is indeed closer to the expected “true behavior” of the target task.