Pseudo-task Regularization for ConvNet Transfer Learning

by Yang Zhong, et al.

This paper is about regularizing deep convolutional networks (ConvNets) based on an adaptive multi-objective framework for transfer learning with limited training data in the target domain. Recent advances in ConvNet regularization in this context are commonly due to the use of additional regularization objectives, which guide the training away from the target task using some concrete tasks. Unlike those related approaches, we report that an objective without a concrete goal can serve surprisingly well as a regularizer. In particular, we demonstrate Pseudo-task Regularization (PtR), which dynamically regularizes a network by simply attempting to regress image representations to a pseudo-target during fine-tuning. Through numerous experiments, the improvements in classification accuracy by PtR are shown to be greater than or on a par with those of recent state-of-the-art methods. These results also leave room for rethinking the requirements for a regularizer, i.e., whether a specifically designed task for regularization is indeed a key ingredient. The contributions of this paper are: a) PtR provides an effective and efficient alternative for regularization without dependence on concrete tasks or extra data; b) the desired strength of the regularization effect in PtR is dynamically adjusted and maintained based on the gradient norms of the target objective and the pseudo-task.




1 Introduction

Deep convolutional neural networks (ConvNets) have recently advanced the development of computer vision and flourished in many large-scale computer vision applications [1, 2, 3, 4]. Since the introduction of AlexNet [5], deeper and more complex network architectures, such as VGG [6], Inception [7], GAN [8], and ResNet [9], have been proposed. In addition, other contributions have been made toward network optimization, which have improved the performance and efficiency of ConvNets, e.g. BatchNorm [10] and MiniBatchSGD [11]. Despite these improvements, one known issue is that ConvNets are normally over-parameterized and demand a large-scale labeled dataset.

It is a common practice to exploit transfer learning, which adapts a model pre-trained on a source task to a new target task when only a small amount of labeled data is given. Specifically, by leveraging the transferability of deep features, one can map images to a middle- or high-level feature through a pre-trained model and therewith train target-specific classifiers [13, 14, 15], which is often called feature selection. It is also viable to fine-tune a source model for the target data. As fine-tuning aims to optimize the entire network for the target task, it often achieves higher effectiveness and has therefore been a rule of thumb in ConvNet transfer learning with a limited amount of domain data [16]. During fine-tuning, a source model needs to be tuned mildly to avoid overfitting, because deep networks are still over-parameterized for small-scale target tasks.

One of the challenges during fine-tuning, which this paper also tries to address, is to achieve network regularization for an over-parameterized model efficiently and effectively. In the recent state-of-the-art transfer learning solutions, there is a trend of using an auxiliary training objective in a framework of multi-objective learning for improved regularization [17, 18, 19, 20]. (Footnote 1: For ease of discussion, we do not distinguish training “objectives” from “tasks”; thus, multi-objective and multi-task learning may be used interchangeably.) These auxiliary objectives are designed in a concrete manner, through which models enforce certain desired properties that facilitate multiple purposes in the learned image representations. The key to the enhanced regularization on the target task is then attributed to the improved generality learned from the imposed auxiliary objectives. However, the regularization gain comes at a resource-dependent cost: storing off-the-shelf predictions for multiple steps of network training [18], selecting qualified labeled data samples from the source domain for a target task [17], using a complex network architecture during training [20], or recalling the source model [19].

From a network training perspective, a basic effect of training with a regularization objective can be considered as simply distracting the minimization of the empirical loss (typically, through a structural loss). As a result, the regularization power can also be seen to come from the extra gradients generated by the employed distracting (regularization) objective. These gradients cause useful distortions in the gradient-descent trajectory, forcing the network to tolerate a slightly higher empirical loss in the course of training, which allows for more chances of finding better optima. Now, if such a distraction effect is the essence of network regularization, it is worthwhile to study whether the regularization objective could take some alternative form rather than being a real and concrete task.

Intuitively, if it is the distraction (rather than the convergence) that is the primary concern, there could be diverse ways to construct a distractor which interferes with the training of the target task while looking for an improved regularization. One potential approach could be through a virtual “task” which depends neither on the above-mentioned data and storage availability for multi-task learning, nor on a concrete goal as designed in [17, 18, 19, 20].

Fig. 1: An overview of the proposed Pseudo-task Regularization (PtR). The path linking the blue modules illustrates a vanilla fine-tuning pipeline with a target classifier trained by a cross-entropy loss. The PtR loss, which is connected to the feature representation layer of a ConvNet, is brought into the network training when the convergence of target classifier is relatively stable. The total loss therewith is the sum of cross-entropy loss on the target task and the weighted PtR loss. The PtR loss module automatically weights the strength of regularization according to the gradient norm of the target task on the feature layer. The PtR is explained in Algorithm 1 in detail.

In this paper, we consider image classification tasks in a transfer learning scenario where only limited annotated target domain data and an off-the-shelf model are available. Like the aforementioned recent best performing approaches, we also employ a multi-objective learning framework to take a regularizer into account. However, we aim to devise a regularizer which generates distractions but is independent of any concrete task; our regularizer simply exploits a pseudo-task that injects random noise into the gradients to distract the training on the target tasks and seeks an improved regularization. Experimental results consistently support our conjecture on various datasets and with different network architectures. The contributions of this paper are:

  1. We demonstrate a simple Pseudo-task Regularization (PtR) that provides an efficient and strong alternative to other recent best regularizations based on real and concrete tasks.

  2. In PtR, we use a pseudo-task to generate useful gradients for regularization; we also propose to dynamically adjust the strength of the regularization based on the gradient norms of the target objective and the pseudo-task.

These results suggest a novel view on the key elements of network regularization for ConvNet transfer learning, which we hope future research will exploit further.

2 Pseudo-task Regularization

2.1 Overview

Our motivation is to let a ConvNet learn the representations for a target task while also being distracted, so that the learned representations do not become excessively target-task specific and thereby lose generality. For this purpose, as shown in Figure 1, we choose to exploit a multi-task learning framework which utilizes two training objectives: one is the cross-entropy loss for the target-task classification; the other generates distractions to promote generalization through a pseudo regression task. We call it Pseudo-task Regularization (PtR).

In using both of the training objectives, a significant aspect of PtR is to balance the impact of the two loss functions. The distraction should be at a proper level: neither so strong that it hinders model convergence, nor of a negligible magnitude compared to the gradients of the target classifier. To this end, inspired by [21], we propose to dynamically balance the strengths of gradients from the two losses according to the gradient norms with respect to the image representation during the training process. The training procedure of our adaptive multi-task learning framework is described in Algorithm 1.

Input: a) Off-the-shelf net; b) Labeled data in target domain
for iteration (batch) $t$ do
       Compute cross-entropy loss $L_{CE}$.
       if $L_{CE}$ far from minimum then
            Back propagate $L_{CE}$ only;
       else
            First, perform the following calculations:
              1. $L_{PtR}$: the pseudo-task (regression) loss, w.r.t. the noise being generated on-line;
              2. $g_{CE}$ and $g_{PtR}$: the gradient norms of $L_{CE}$ and $L_{PtR}$ w.r.t. $x$, where $x$ stands for the image representations of the batch;
              3. $\bar{g}_{CE}$ and $\bar{g}_{PtR}$: the average of $g_{CE}$ and $g_{PtR}$ over the batch;
              4. Weight $w$: $w = \bar{g}_{CE} / (r \, \bar{g}_{PtR})$ for a target ratio $r$.
            Then, back propagate $L_{CE} + w L_{PtR}$.
       end if
end for
Algorithm 1 Training with Pseudo-task Regularization

2.2 Algorithm

Our method learns image representations on a target task using a pre-trained model in an adaptive multi-objective learning framework, as shown in Algorithm 1. For a training iteration $t$, it computes the cross-entropy loss $L_{CE}$ and additionally the loss by the distraction regressor, the Pseudo-task Regularization loss $L_{PtR}$, starting when $L_{CE}$ falls below a certain threshold of average epoch loss. (The choice of the threshold is not critical as explained in Appendix A.) $L_{PtR}$ is calculated by regressing the image representation to a pseudo regression target. In PtR, we use random noise generated on-line such that:

$$L_{PtR} = R(x, n),$$

where $x$ stands for the output of the representation layer in a ConvNet, $n$ for regression targets (noise signals) with the same dimension as $x$, and $R$ is a regression function for which we consider two popular choices: the L2 loss and the “smooth-L1” (denoted by SML1) loss. Note that the noise signals are randomly generated during training and the training instances are not bound to the generated regression targets. The details of the noise signals are described in Section 3.1.
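The two regression functions named above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the mean reduction over feature dimensions, the smooth-L1 transition point `beta=1.0`, and the noise range passed to `make_noise` are all hypothetical choices that the text does not specify.

```python
import random

def l2_loss(x, n):
    """Mean squared difference between features x and noise targets n."""
    return sum((a - b) ** 2 for a, b in zip(x, n)) / len(x)

def smooth_l1_loss(x, n, beta=1.0):
    """Smooth-L1: quadratic for small residuals, linear for large ones.
    beta (assumed 1.0 here) is the point where the two regimes meet."""
    total = 0.0
    for a, b in zip(x, n):
        d = abs(a - b)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(x)

def make_noise(dim, mean, half_width, rng):
    """Draw a fresh uniform noise target of the same dimension as x.
    mean and half_width are placeholders; the source does not give them here."""
    return [rng.uniform(mean - half_width, mean + half_width) for _ in range(dim)]

rng = random.Random(0)
features = [0.5, 3.0]                      # toy 2-d "image representation"
target = make_noise(len(features), mean=0.0, half_width=1.0, rng=rng)
loss_sml1 = smooth_l1_loss(features, target)
loss_l2 = l2_loss(features, target)
```

Because a new target is drawn on every call, no training instance is tied to a fixed regression target, which matches the note above.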

The total loss is the weighted sum of the cross-entropy loss and the regression loss:

$$L = L_{CE} + w \, L_{PtR},$$

where $w$ is a coefficient to balance the impact of the distraction regressor, as explained below; $L_{CE}$ and the balanced regression loss $w L_{PtR}$ are back propagated through the network. Weight decay is omitted in the equation for a concise representation.

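To generate a proper level of distraction for regularization, we first calculate the gradient norms of the cross-entropy loss and those of the regression loss w.r.t. the output feature $x$ for each instance, which are denoted by $g_{CE}$ and $g_{PtR}$, respectively (for brevity, the indexes of the training instance in the batch are omitted):

$$g_{CE} = \left\| \frac{\partial L_{CE}}{\partial x} \right\|, \qquad g_{PtR} = \left\| \frac{\partial L_{PtR}}{\partial x} \right\|.$$

The gradient norms are then averaged over the batch as:

$$\bar{g}_{CE} = \frac{1}{N} \sum_{i=1}^{N} g_{CE}^{(i)}, \qquad \bar{g}_{PtR} = \frac{1}{N} \sum_{i=1}^{N} g_{PtR}^{(i)},$$

where $N$ is the batch size and $i$ indexes the instances in the batch.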
In order to balance the relative impact of the batch-averaged gradient norms $\bar{g}_{CE}$ and $\bar{g}_{PtR}$, we introduce a target gradient norm ratio $r$. It is defined in terms of signal-to-noise ratio by the gradient norm ratio of the cross-entropy loss and a desired regression loss as: $r = \bar{g}_{CE} / (w \, \bar{g}_{PtR})$. Thus, for the gradient norm ratio to satisfy $r$ at iteration $t$, the pseudo-task loss needs to be weighted by a factor $w$:

$$w = \frac{\bar{g}_{CE}}{r \, \bar{g}_{PtR}},$$

which is calculated on-line per batch before back propagation.
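The balancing rule can be illustrated with a short, framework-free sketch. It assumes that the per-instance gradients of both losses with respect to the feature layer are already available as plain lists (in practice they would come from the deep learning framework's automatic differentiation), and it assumes the L2 norm, which the text does not state explicitly. The names `ptr_weight` and `target_ratio` and the small constant `eps` are illustrative, not taken from the paper.

```python
import math

def l2_norm(vec):
    """Euclidean norm of one gradient vector (assumed norm type)."""
    return math.sqrt(sum(v * v for v in vec))

def ptr_weight(grads_ce, grads_ptr, target_ratio, eps=1e-12):
    """Weight for the pseudo-task loss so that the ratio of batch-averaged
    gradient norms (cross-entropy over weighted pseudo-task) matches
    target_ratio.

    grads_ce, grads_ptr: one gradient vector per instance in the batch,
    both taken w.r.t. the feature representation.
    eps guards against division by zero (an added safeguard)."""
    mean_ce = sum(l2_norm(g) for g in grads_ce) / len(grads_ce)
    mean_ptr = sum(l2_norm(g) for g in grads_ptr) / len(grads_ptr)
    return mean_ce / (target_ratio * mean_ptr + eps)

# Toy batch of two instances with 2-d features.
grads_ce = [[3.0, 4.0], [6.0, 8.0]]    # norms 5 and 10, mean 7.5
grads_ptr = [[0.0, 1.0], [0.0, 2.0]]   # norms 1 and 2, mean 1.5
w = ptr_weight(grads_ce, grads_ptr, target_ratio=5.0)

# After weighting, the averaged norm ratio is restored to the target.
achieved_ratio = 7.5 / (w * 1.5)
```

The weight is recomputed for every batch, so the strength of the distraction follows the target-task gradients as they shrink during training rather than staying fixed.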

3 Experiments and Results

3.1 Experimental Setup

Datasets. For transfer learning, training ConvNets with data across domains has been found to be an important regularization method. However, our experiments focus on a situation where data from other domains is not available. We also focus on a challenging scenario where the training samples are sparse. For this purpose, four commonly used small-scale transfer learning datasets are selected to comparatively evaluate PtR: Flower102 [22], CUB200-2011 [23], MIT67 [24], and Stanford40 [25], two of which represent fine-grained classification tasks of different scenarios. In addition, we chose 500 identities from the WebFace [26] dataset, denoted by “WebFace500”, to evaluate PtR when performing transfer learning from image classification to closed-set face identification with scarce samples per class. (Footnote 2: Random 500 identities that have the most training instances on the WebFace dataset.) Caltech256 [27] was also used for performance evaluation in a general image recognition scenario.

On Flower102 we faithfully follow the data splits for training and testing. On the WebFace500 dataset, each identity has random 20 training images, five validation images, and on average 24 test images. Faces were segmented and normalized to a fixed scale with a face detector [28] before training. From Caltech256, we formed two independent training sets with 30 and 60 training samples per class, respectively, for consistency with [17, 19]. For other datasets, of the training images were randomly separated to form the validation sets for model training.

Training and Evaluation. To augment the training images, we employed random jittering and subtle scaling and rotation perturbations to the training images. We resized images of all involved datasets to

pixels, and the aspect ratio of the images was retained by applying zero padding all the time. During test time, we averaged over the network responses from the target-task classifiers over ten crops which were sampled from the corners and the centers of originals and the flipped counterparts.

As we consider the vanilla fine-tuning procedure as the baseline, it is very important to ensure that the effectiveness of vanilla fine-tuning is not underestimated. To this end, we carefully selected learning rate schedules for fine-tuning to demonstrate the test accuracy on each dataset with each type of network architecture. To conduct fair comparisons to fine-tuning as much as possible, we also used the same learning rates used by fine-tuning in our dynamic pseudo-task regularization approach; the learning rate schedules were slightly different due to the difference in converging speeds. A learning rate was decreased when the validation loss and validation accuracy stopped progressing and it was decreased twice before model training was terminated. The models trained after the last epoch of their learning rate schedules were always used for performance evaluations.

Implementation details. Experiments on different datasets shared many common settings. We used the standard SGD optimizer with momentum set to 0.9. The batch size was set to 20 (unless otherwise stated); weight decay was set to 0.0005 for VGG networks [6] and 0.0001 for ResNet [9] architectures except in a number of ablation studies. The dropout ratio for VGG networks was set to 0.5. Our experiments were implemented with PyTorch. We always started the experiments from an ImageNet [1] pretrained model. As the training data was visited randomly, we ran five independent runs and averaged the results to mitigate the impact of randomness for all the experiments. The classification accuracy was used as the performance metric when comparing to related methods except [17].

Other hyperparameters to PtR.

In the PtR, the impact of the additional loss is adjusted primarily by the target gradient norm ratio that controls the interference gradient magnitude. Then, the gradients with respect to each feature dimension are largely determined by the nature of a pseudo-task which we employ in our experiments, such as the distribution of noise signals. Without loss of generality we considered noise signals,

, following a uniform distribution with a mean value of

such that i.e., for any single regression target in , , where is an instance in batch () and is the batch size.

We used hold-out sets to efficiently determine and (to avoid expensive cross-validation parameter search as in [30]): for ResNet structures =3 and =1 are consistently used on all datasets; for the VGG-16 structure, varies in the range between 3 and 5, and around 10 to 15. We chose =1 as a sensible setting in all the experiments given that the influence by the choice of is limited (see Appendix A).

3.2 Results and Comparisons

Dataset      Baseline         Regularization Gain              Error Rate Reduction
                              SML1            L2               SML1      L2
Flower102    83.92% (0.36)    2.38% (0.32)    2.61% (0.42)     14.80%    16.23%
CUB200       75.07% (0.26)    3.05% (0.39)    2.84% (0.37)     12.23%    11.39%
MIT67        71.55% (0.38)    1.42% (0.58)    1.39% (0.40)     4.99%     4.89%
Stanford40   76.99% (0.19)    2.50% (0.09)    2.21% (0.16)     10.86%    9.60%
WebFace500   77.54% (0.52)    0.95% (0.56)    0.83% (0.47)     4.23%     3.70%
TABLE I: The comparative classification accuracy of Pseudo-task Regularization (PtR) against vanilla fine-tuning (column Baseline) with two different choices of regression functions, SML1 and L2, with the VGG-16 architecture. The performance gains brought about by PtR with the two regression loss functions are in the two center columns under SML1 and L2, respectively. The corresponding error rate reduction values are in the rightmost columns. The standard deviation of each experiment is given in parentheses.

As the VGG-16 architecture has been used very commonly in many different transfer learning applications, we first evaluate PtR with it across five different datasets using SML1 and L2 as the regression function, respectively, and compare against the fine-tuning baseline. The results are listed in Table I.

PtR improves on vanilla fine-tuning for all the classification tasks tested with both regression functions; it brings about a reasonable and consistent performance gain. On WebFace500, which contains the most training samples, it reduced the error rate by around 4%, but on Flower102 and CUB200, where the training samples are sparser, it is more effective and reduces the error rates by more than 10%. These results suggest that, on the one hand, collecting more data helps regularization even for small datasets. On the other hand, when the training samples become sparse, PtR manages to keep the learned representations from being excessively target-task specific and further allows the networks to learn more useful representations. The choice of regression function does not appear to be a significant factor, as the test accuracies with SML1 and L2 are close; SML1 is used in all the following experiments.
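The error rate reduction values in Table I follow from the baseline accuracy and the accuracy gain: the gain is divided by the baseline error rate. The small sketch below recomputes the SML1 column from the reported numbers; the function name is illustrative.

```python
def error_rate_reduction(baseline_acc, gain):
    """Relative reduction (in %) of the baseline error rate.
    baseline_acc and gain are both given in percent."""
    baseline_error = 100.0 - baseline_acc
    return 100.0 * gain / baseline_error

# (dataset, baseline accuracy, SML1 gain) as reported in Table I.
table_i_sml1 = [
    ("Flower102", 83.92, 2.38),
    ("CUB200", 75.07, 3.05),
    ("MIT67", 71.55, 1.42),
    ("Stanford40", 76.99, 2.50),
    ("WebFace500", 77.54, 0.95),
]

reductions = {name: round(error_rate_reduction(acc, gain), 2)
              for name, acc, gain in table_i_sml1}
```

For Flower102, for example, the baseline error is 16.08% and a gain of 2.38 points removes about 14.8% of it, which agrees with the table.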

We have also conducted a large number of comparative experiments to recent best performing multi-task/objective based regularization approaches: Joint Training (JointTrain) [18], Learning without Forgetting (LwF) [18], Borrowing Treasures from the Wealthy (BTfW) [17], Inductive Bias (Ind.Bias) [19], and Pair-wise Confusion (PC) [20]. In addition, we evaluated the regularization through feature norm penalty (we denote this method by FNP) [30] in the context of transfer learning (hyperparameters of FNP were therefore set by using the same procedure of PtR for fairness). We have also compared the impact of weight decay on CUB200 and Caltech256 by disabling weight decay (denoted by “w/o WD”). As we intended to perform all experiments on a single GPU module with 12GB memory, the ResNet-101 was used as a compromise to compare to the results achieved by a special memory-saving version of ResNet-152 in [17]. For fair evaluations, all the corresponding vanilla fine-tuning baseline and improved test accuracy are shown in the following tables together with accuracy gain.

The comparative results on the CUB200 dataset are shown in Table II. The accuracy gain by PtR (with VGG-16) is slightly better than that of JointTrain, where real source data was used for regularization; it is also better than that of LwF, where off-the-shelf predictions were used. Compared to FNP, PtR outperforms by a visible margin of 1.3%. Although PtR achieved a lower accuracy gain than PC, the gap is not significant regardless of network architecture (around 0.5%). In terms of absolute accuracy, PtR also achieved the highest baseline performance as well as the highest accuracy of the optimized models among all the methods. Weight decay does not seem to affect the accuracy of PtR, but the baseline accuracy is 0.7% higher when training without it.

Method Baseline Acc. Gain
JointTrain (VGG-16) 72.1 74.6 2.5
LwF (VGG-16) 72.1 72.3 0.2
PC(VGG-16) 73.3 76.5 3.2
PC (ResNet-50) 78.2 80.3 2.1
FNP (ResNet-50) 80.3 80.6 0.3
PtR (VGG-16) 75.1 77.8 2.8
PtR (ResNet-50) 80.3 81.9 1.6
PtR (ResNet-50, w/o WD) 81.0 82.0 1.0
TABLE II: Comparing classification accuracy on the CUB200 dataset. All the numerical results are in %. The network and other settings (if any) used by each method are given in parenthesis.
Method Baseline Acc. Gain
PC(VGG-16) 85.2 86.2 1.0
BTfW (ResNet-152) 92.3 94.7 2.4
PC (ResNet-50) 92.5 93.5 1.0
FNP (ResNet-50) 91.0 91.5 0.5
PtR (VGG-16) 83.9 86.3 2.4
PtR (ResNet-50) 91.0 91.8 0.8
PtR (ResNet-101) 90.6(92.3) 91.6(93.2) 1.0(0.9)
TABLE III: Classification accuracy and accuracy gain (in %) comparisons on the Flower102 dataset. The mean class accuracies used to compare to BTfW [17] are listed in parenthesis.

On the Flower102 dataset, as shown in Table III, the gain by PtR is larger than that of PC by 1.4% with the VGG-16 structure; it is comparable to that of PC with the ResNet-50 network. FNP brings some regularization margin, but it is 0.3% lower than that of PtR. PtR achieves a consistent gain in accuracy with ResNet-50 and ResNet-101, and the depth of the network does not appear to deteriorate the regularization effect. Although we achieved a baseline performance equal to that of BTfW (in parentheses in the bottom row of Table III), the regularization gain of BTfW is higher than that of PtR or any other method. The difference in regularization may suggest that training with sufficient labeled data in a multi-task learning framework is a stronger regularization for transfer learning.

Method Baseline Acc. Gain
JointTrain (VGG-16) 74 75.5 1.5
LwF (VGG-16) 74 74.7 0.7
BTfW (ResNet-152) 81.7 82.8 1.1
Ind.Bias (ResNet-101) 77.5 78 0.5
FNP (ResNet-50) 77.4 78.0 0.6
PtR (VGG-16) 71.6 73.0 1.4
PtR (ResNet-50) 77.4 77.9 0.5
PtR (ResNet-101) 78.7(78.7) 79.2(79.2) 0.5(0.5)
TABLE IV: Comparative results on the MIT67 (in %). The mean class accuracies used to compare to BTfW [17] are in parenthesis.

Similar observations can be made from the results on the MIT67 dataset, as shown in Table IV. With the VGG-16 architecture, the regularization effect of PtR is again very close to that of JointTrain and outperforms LwF. With ResNet, the regularization gain by PtR is equivalent to that of Ind.Bias and FNP. BTfW also achieves a higher gain than the other methods with ResNet. We would infer that optimizing a network simultaneously on multiple tasks with sufficient selected real data samples might be more effective than other related methods. As for PtR, it brings about a consistent margin over the fine-tuning baseline regardless of the depth of the ResNet architecture, which also coincides with the results in Table III.

The results on the Caltech256 dataset are in Table V. In these experiments, we increased the batch size to 32, a value between those used by [17, 19], to make the comparisons as fair as possible. Interestingly, we achieved the best baseline accuracy among all the compared methods with both of the Caltech256 partitions. Consequently, it could be harder for PtR to demonstrate its regularization power in comparison to others, because a better generalized baseline usually has less room to improve in generalization. However, we can still observe some similar trends. First, as in the previous experiments, by training a network with sufficient annotated data of multiple classes, BTfW achieves the best regularization gain (around 2.6% with both setups). Second, PtR consistently delivers a regularization gain; for both data splits the gains are equivalent, which indicates that PtR is not very sensitive to the size of the training data of each category. The improvement brought about by FNP might be marginal or even unstable given the negative gain on Caltech256-30. The impact of weight decay on the classification accuracy of PtR is not visible.

Through the analysis of the comparative results, we argue that PtR delivers consistent gains which are on a par with the recent state-of-the-art approaches. All of the other compared methods consider using auxiliary objectives attached to concrete tasks while enhancing regularization, but PtR which leverages an additional pseudo-task as a regularizer is free from design of concrete auxiliary tasks and more straightforward. Compared to LwF, the “warmup” training stage and collecting predictions of the target data from an off-the-shelf model is not required by PtR. It is not needed for PtR to remember the “starting point” parameters of an off-the-shelf model as in [19]. PtR is also more efficient than [20] which requires a Siamese network, and it does not depend on annotated data from other domain either as in [17].

Caltech256-30 Caltech256-60
Method Bsln. Acc. Gain Bsln. Acc. Gain
BTfW 81.2 83.8 2.6 86.4 89.1 2.7
Ind.Bias 81.5 83.5 2.0 85.3 86.4 1.1
FNP 84.0 83.8 -0.2 86.8 86.9 0.1
PtR 84.0 84.5 0.5 86.8 87.2 0.4
PtR,w/o WD 84.0 84.5 0.5 86.9 87.2 0.3
TABLE V: Mean class accuracies and accuracy gains (in %) on the Caltech256 with two partitions of training data. Bsln is short for baseline. Mean class accuracies were the same as the average classification accuracy over the test set, thus are not given in parenthesis in this table. BTfW [17] was using ResNet-152 while others were using ResNet-101. The network used by each method is not shown in this table for brevity.
Fig. 2: A sample from the validation set of CUB200 that PtR correctly rectified mis-classification caused in the vanilla fine-tuning. Left: An input image. Middle: Categorical distributions made by the baseline model and PtR. The second largest prediction made by FT baseline model is around 30% for Class 171. Right: Two randomly picked training samples of the class predicted by the baseline (top) and two samples of the class predicted by PtR (bottom).
Fig. 3: A sample from the validation set of CUB200 that the standard fine-tuning correctly classified but PtR wrongly predicted. See also the caption of Figure 2.

4 The Effect of PtR on Predictions

To investigate the impact of PtR, we conduct a case study on the validation set of the CUB200 dataset with the ResNet-50 network to explore how the predictions have been altered in comparison to those from vanilla fine-tuning. We base our analysis on the concept of a confusion matrix and define a matrix $M \in \mathbb{R}^{D \times D}$ (D=200), each row of which contains cumulative predicted probabilities across all the validation samples for different class categories. We compute matrices $M^{PtR}$ and $M^{FT}$ for the cases of using PtR and the baseline fine-tuned model, respectively. We then sum their diagonal elements into:

$$S_{d}^{PtR} = \sum_{i=1}^{D} M^{PtR}_{ii}, \qquad S_{d}^{FT} = \sum_{i=1}^{D} M^{FT}_{ii}.$$

Feeding in 584 validation images, we had $S_{d}^{PtR}$ = 425 and $S_{d}^{FT}$ = 404, which indicates that the pseudo-task regularized model shows more certainty in the correct classes on average than the fine-tuned model. Furthermore, in terms of the average entropy of the predictions, the pseudo-task regularizer reduced it from 1.33 to 1.15 bits. This is due to better regularization, which allows the model to eliminate minor probabilities in false predictions, which in turn reduces the average entropy. The reduction of entropy also implies that the class predictions have been disambiguated by PtR.

Correspondingly, we computed $S_{o}^{PtR}$ and $S_{o}^{FT}$, the sums of off-diagonal elements of $M^{PtR}$ and $M^{FT}$, respectively. We observed that the regularized model tends to make predictions with fewer minor probabilities than the vanilla fine-tuning model. In addition, PtR gives higher certainty to its predictions given the smaller $S_{o}^{PtR}$, which is consistent with the reduced entropy in comparison to vanilla fine-tuning.
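The quantities used in this analysis can be sketched on a toy example. The code assumes that rows of the matrix are indexed by the ground-truth class (the text does not state the indexing explicitly) and that entropy is measured with a base-2 logarithm, matching the reported unit of bits. All function names are illustrative.

```python
import math

def cumulative_confusion(probs, labels, num_classes):
    """Accumulate each sample's predicted distribution into the row of its
    ground-truth class (row indexing by true class is an assumption)."""
    m = [[0.0] * num_classes for _ in range(num_classes)]
    for p, y in zip(probs, labels):
        for c in range(num_classes):
            m[y][c] += p[c]
    return m

def diagonal_sum(m):
    """Total probability mass assigned to the correct classes."""
    return sum(m[i][i] for i in range(len(m)))

def off_diagonal_sum(m):
    """Total probability mass assigned to wrong classes."""
    return sum(sum(row) for row in m) - diagonal_sum(m)

def mean_entropy_bits(probs):
    """Average Shannon entropy of the predictions, in bits."""
    total = 0.0
    for p in probs:
        total += -sum(q * math.log2(q) for q in p if q > 0.0)
    return total / len(probs)

# Three toy predictions over three classes.
probs = [[0.5, 0.5, 0.0], [0.0, 1.0, 0.0], [0.25, 0.25, 0.5]]
labels = [0, 1, 2]
m = cumulative_confusion(probs, labels, num_classes=3)
s_diag = diagonal_sum(m)
s_off = off_diagonal_sum(m)
avg_entropy = mean_entropy_bits(probs)
```

Since every prediction sums to one, the diagonal and off-diagonal sums always add up to the number of samples, so a larger diagonal sum directly implies a smaller off-diagonal sum.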

To further qualitatively study how the predictions made by the vanilla fine-tuning model (baseline) have evolved by PtR, we case study two types of input samples for which:
       i) PtR rectified the baseline model’s errors, and
      ii) PtR mis-classified on the contrary to the baseline.
Namely, i) is true rectification and ii) is false rectification. Two of these examples are compared in Figures 2 and 3. It can be seen that PtR has the effect of encouraging predictions of other classes which are visually close to the ground truth class. (Footnote 3: In Figure 2, the second largest prediction made by PtR is at another class of similar appearance; in Figure 3, the wrongly predicted class is also similar to the ground truth.) This indicates that the random noise regularizer implicitly helps the network focus on and distinguish visually similar classes, which in the case of [20] was encouraged through a concrete regularization objective, and hence helps the classification accuracy overall (by around 1% on CUB200’s validation set). At the same time, from our observations, PtR does not tend to produce as many minor probabilities on other, less similar classes as the baseline model does. This helps the regularized model suppress the uncertainties and focus on a few most similar candidate classes.

5 Related Work

Regularization methods that are commonly used for fine-tuning ConvNets can be generally grouped into four categories: data perturbation, parameter norm penalty, dropout, and multi-task learning. Image augmentation as a popular form of image perturbation has been proven to be particularly useful to prevent ConvNets from overfitting. In this paper, we also assume to perturb our training instances by random augmentations. The supervision signal can be perturbed for better regularization as well. This can be achieved by learning to predict soft targets rather than hard binary ones as in [31]. In this work, label perturbation is not considered so that we can deliver more ablated studies on the effectiveness of auxiliary training objectives.

The parameter norm penalty, or weight decay more specifically, has been one of the most common ways of regularization in training deep models. Our PtR leveraged weight decay by default, but we also evaluated PtR without weight decay to study its impact on accuracy. Another method that is apparently similar to weight decay is to apply a feature norm penalty (FNP) on the representation layer of a network [30]. Superficially, FNP would resemble PtR if the regression target for PtR were a static norm of zero without involving randomness (an additional feature of PtR which should be noted is that it also balances objectives automatically). The technical difference and benefit of PtR over FNP will be elaborated in Appendix B. Dropout is also one of the standard techniques to improve model regularization by temporarily shielding a part of the hidden units in the bottleneck layer and fully connected layers during training [32, 33]. For the VGG-16 structures which we employed in our experiments, dropout was also used after flattened hidden layers.

A more recent approach to regularization in transfer learning is to train ConvNets with an auxiliary task or objective through multi-task learning [18, 17]. In the recent best performing methods, these objectives are designed in a few different ways, with the expectation that more generic features are less likely to overfit to the target task. In [17], the network was trained simultaneously on the target data and on a number of selected source data samples that are similar to the target data in terms of low-level features. Another way to encourage regularization, instead of relying on the availability of foreign data, is to keep the model from moving too far from the original model trained on a large source task. As demonstrated by [18], one can attempt to retain the predictions that the off-the-shelf source model makes on the target domain images while learning the new target task (we acknowledge that the original intention of doing so in [18] was not regularization). It is therefore reasonable to interpret the use of off-the-shelf predictions as implicitly using the source domain training data. Yet another way to keep the trained model close to the original one is to explicitly force the target model to stay in the vicinity of the source model in the weight space; the work in [19] leveraged this inductive bias in a fine-tuning scenario to prevent the learned features from becoming overly specific to the target task.

For fine-grained vision tasks, a very recent approach [20] suggested “confusing” the network by encouraging different class-conditional probability distributions to come closer together, thus reducing the inter-class distance. In cases where only style transfer is considered, one can attempt to reduce the domain variance through certain metrics (i.e., perform domain adaptation) as in [34, 35].

A common aspect of the aforementioned regularizers is that they depend on a concrete task or objective. Since they are not designed to optimize the target objective (cross-entropy loss), they can all largely be seen as “distractors”. The approach studied in this paper also causes distractions, but we suggest that a distractor can work equally effectively without involving a concrete auxiliary objective or any form of source domain data as supervision labels.

6 Discussions and Conclusions

We have introduced Pseudo-task Regularization (PtR), which leverages a multi-task learning framework and dynamically regularizes a network by simply attempting to regress image representations to a rather random target in the course of fine-tuning. It is dynamic in that it adjusts the strength of the regularization based on the gradient norms of the target objective and the pseudo regression task. Unlike existing approaches, PtR does not depend on a concrete or real regularization objective. Surprisingly, we observed that the performance gain brought about by the simple PtR was on a par with or better than that of related recent solutions, and therefore PtR can serve as an efficient alternative to the recent best performing regularization methods that are based on concrete tasks.
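The dynamic balancing summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the squared-error pseudo-task, the Gaussian pseudo-target, the fixed `ratio`, and the function names `pseudo_task_grad` and `balanced_update` are all assumptions made for illustration. The sketch only shows the general idea of scaling the pseudo-task gradient relative to the norm of the target-task gradient, with both gradients taken with respect to the same representation.

```python
import math
import random


def l2_norm(vec):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in vec))


def pseudo_task_grad(features, rng, sigma=1.0):
    """Gradient of 0.5 * ||features - target||^2 w.r.t. features.

    A fresh random target is drawn on every call. The Gaussian
    distribution and the squared-error loss are assumptions.
    """
    target = [rng.gauss(0.0, sigma) for _ in features]
    return [f - t for f, t in zip(features, target)]


def balanced_update(target_grad, pseudo_grad, ratio=0.1, eps=1e-12):
    """Combine the two gradients.

    The weight lam is chosen so that the weighted pseudo-task gradient
    has norm ratio * ||target_grad||. This rule is hypothetical.
    """
    lam = ratio * l2_norm(target_grad) / (l2_norm(pseudo_grad) + eps)
    combined = [gt + lam * gp for gt, gp in zip(target_grad, pseudo_grad)]
    return combined, lam


rng = random.Random(0)
features = [0.8, 1.5, 0.2]        # stand-in for an image representation
target_grad = [0.3, -0.4, 0.0]    # stand-in for the cross-entropy gradient
pseudo_grad = pseudo_task_grad(features, rng)
combined, lam = balanced_update(target_grad, pseudo_grad, ratio=0.1)

# The weighted pseudo-task gradient norm relative to the target gradient norm.
print(lam, l2_norm([lam * g for g in pseudo_grad]) / l2_norm(target_grad))
```

Because the weight is recomputed at every step from the two gradient norms, the relative strength of the distraction stays roughly constant even as the target loss and its gradients shrink during training.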

On the one hand, the use of PtR could have the effect of reducing the effective network capacity for the target task at hand, which should also help enhance generalization [36]. On the other hand, PtR is expected to cause useful distortions in the trajectory of gradient descent while back-propagating controlled distraction gradients. Like other regularization objectives, it encourages prolonged training to explore better local minima, which could in turn result in better generalization (hence class disambiguation). However, in PtR such generalization does not have to depend on a concrete or real regularization objective.

Appendix A The timing to bring in PtR

The pseudo regression task is brought into the training when the network has become more stable in the course of convergence on the target task. The timing factor at which PtR is brought in was settled by considering training efficiency and model effectiveness. Given that optimizing hyper-parameters is not our intention, we simply performed a parameter search on the Flower102 and Stanford40 datasets to determine it. As Fig. 4 shows, PtR improves on the baseline accuracies (fine-tuning) in all cases, and PtR with a timing factor of 1 resulted in a higher accuracy on average. Hence, we chose a timing factor of 1 as a practical setting in all the experiments, since a larger value does not seem to contribute to model effectiveness while it prolongs the convergence time.

Fig. 4: Validation accuracy (in %) on the Flower102 and Stanford40, according to different timing when PtR joins the network training (SML1 was used for regression).

Appendix B Variance in gradients

In this section, we explore the effect of using a randomly varying regression target on the gradients by studying a minimal toy network composed of one hidden neuron with one input and one output. The single hidden neuron is denoted by $h$, whose output is seen as the feature representation learned by the example network. The input to $h$, denoted by $z$, is the product of an input $x$ and its weight $w$ on the input path. That is: $z = wx$, $h = \sigma(z)$, and $\sigma$ represents the ReLU function.

When a regression target $t$ is applied to $h$, the regression loss is $L = \frac{1}{2}(h - t)^2$. During back propagation, if neuron $h$ is activated, the gradient on $w$ by the chain rule is:

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial h}\,\frac{\partial h}{\partial z}\,\frac{\partial z}{\partial w} = (h - t)\,x$$

Under the simplifying assumption that $x$ is a constant, it can be seen that the variance of the gradient of $w$, $\mathrm{Var}(\partial L / \partial w)$, is determined by that of $h$ and $t$, such that

$$\mathrm{Var}\left(\frac{\partial L}{\partial w}\right) = x^2\left(\mathrm{Var}(h) + \mathrm{Var}(t)\right)$$

given that the regression target $t$ is a random variable independent of $h$.

It can be seen that, if regularization is achieved through feature norm penalization, $t$ in Equation 7 needs to be a constant of 0. Consequently, the variance of the gradients is smaller than with regression targets that follow a certain distribution. By leveraging the noisier gradients generated by an independent, randomly varying regression target, PtR can explore more local optima, with a higher chance of avoiding saddle points and achieving stronger regularization.
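The variance argument above can be checked numerically with a small Monte Carlo sketch. The symbol names used here (`h` for the activated neuron's output, `t` for the regression target, `x` for the constant input) and the squared-error loss, whose gradient on the weight is `(h - t) * x`, are assumptions made for this illustration, as are the uniform distribution of `h` and the Gaussian target. The sketch compares a constant zero target (the feature norm penalty case) with an independent random target.

```python
import random


def variance(values):
    """Population variance of a list of floats."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def gradient_samples(n, x, target_sigma, rng):
    """Sample the gradient (h - t) * x for an activated ReLU neuron.

    h is drawn uniformly from [0.5, 1.5], so it is always positive and
    the ReLU is active. t is Gaussian with standard deviation
    target_sigma and independent of h; target_sigma = 0 gives the
    constant target t = 0.
    """
    grads = []
    for _ in range(n):
        h = rng.uniform(0.5, 1.5)
        t = rng.gauss(0.0, target_sigma) if target_sigma > 0 else 0.0
        grads.append((h - t) * x)
    return grads


rng = random.Random(0)
x = 2.0
var_h = 1.0 / 12.0  # variance of a uniform variable on an interval of length 1

var_zero = variance(gradient_samples(100_000, x, 0.0, rng))
var_random = variance(gradient_samples(100_000, x, 1.0, rng))

predicted_zero = x ** 2 * var_h            # Var(t) = 0
predicted_random = x ** 2 * (var_h + 1.0)  # Var(t) = 1

print("zero target:  ", var_zero, "predicted", predicted_zero)
print("random target:", var_random, "predicted", predicted_random)
```

Under these assumptions the empirical variances should land close to the two predicted values, with the random target giving the noisier gradient.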


The authors gratefully acknowledge the support of our industrial partner, Toshiba Corporation, for this research. We thank NVIDIA Corporation for their generous donation of NVIDIA GPUs. We also thank the Swedish Research Council for its support.


  • [1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
  • [2] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in European Conference on Computer Vision (ECCV), 2014.
  • [3] Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the wild,” in The IEEE International Conference on Computer Vision (ICCV), 2015.
  • [4] A. Kendall, M. Grimes, and R. Cipolla, “PoseNet: A convolutional network for real-time 6-dof camera relocalization,” in Proc. of the International Conference on Computer Vision (ICCV), 2015.
  • [5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012.
  • [6] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint 1409.1556, 2014.
  • [7] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems 27, 2014.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [10] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint 1502.03167, 2015.
  • [11] P. Goyal, P. Dollár, R. B. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, large minibatch SGD: training imagenet in 1 hour,” arXiv preprint 1706.02677, 2017.
  • [12] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?” in Advances in Neural Information Processing Systems 27, 2014.
  • [13] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, “Decaf: A deep convolutional activation feature for generic visual recognition,” in International Conference on Machine Learning (ICML), 2014.
  • [14] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, “Learning and transferring mid-level image representations using convolutional neural networks,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
  • [15] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in European Conference on Computer Vision (ECCV), 2014.
  • [16] H. Azizpour, A. Razavian, J. Sullivan, A. Maki, and S. Carlsson, “Factors of transferability for a generic convnet representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
  • [17] W. Ge and Y. Yu, “Borrowing treasures from the wealthy: Deep transfer learning through selective joint fine-tuning,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [18] Z. Li and D. Hoiem, “Learning without forgetting,” in European Conference on Computer Vision (ECCV), 2016.
  • [19] X. Li, Y. Grandvalet, and F. Davoine, “Explicit inductive bias for transfer learning with convolutional networks,” in International Conference on Machine Learning (ICML), 2018.
  • [20] A. Dubey, O. Gupta, P. Guo, R. Raskar, R. Farrell, and N. Naik, “Pairwise confusion for fine-grained visual classification,” in European Conference on Computer Vision (ECCV), 2018.
  • [21] Z. Chen, V. Badrinarayanan, C.-Y. Lee, and A. Rabinovich, “Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks,” in International Conference on Machine Learning (ICML), 2018.
  • [22] M.-E. Nilsback and A. Zisserman, “Automated flower classification over a large number of classes,” in Proc. of the Indian Conference on Computer Vision, Graphics and Image Processing, 2008.
  • [23] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The Caltech-UCSD Birds-200-2011 Dataset,” California Institute of Technology, Tech. Rep. CNS-TR-2011-001, 2011.
  • [24] A. Quattoni and A. Torralba, “Recognizing indoor scenes,” 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops), 2009.
  • [25] B. Yao, X. Jiang, A. Khosla, A. L. Lin, L. Guibas, and L. Fei-Fei, “Human action recognition by learning bases of action attributes and parts,” in The IEEE International Conference on Computer Vision (ICCV), 2011.
  • [26] D. Yi, Z. Lei, S. Liao, and S. Z. Li, “Learning face representation from scratch,” arXiv preprint 1411.7923, 2014.
  • [27] G. Griffin, A. Holub, and P. Perona, “Caltech-256 object category dataset,” California Institute of Technology, Tech. Rep., 2007.
  • [28] V. Kazemi and J. Sullivan, “One millisecond face alignment with an ensemble of regression trees,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
  • [29] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” in NIPS-W, 2017.
  • [30] B. Hariharan and R. Girshick, “Low-shot visual recognition by shrinking and hallucinating features,” in The IEEE International Conference on Computer Vision (ICCV), 2017.
  • [31] L. Xie, J. Wang, Z. Wei, M. Wang, and Q. Tian, “Disturblabel: Regularizing cnn on the loss layer,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [32] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, 2014.
  • [33] P. Morerio, J. Cavazza, R. Volpi, R. Vidal, and V. Murino, “Curriculum dropout,” in IEEE International Conference on Computer Vision (ICCV), 2017.
  • [34] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko, “Simultaneous deep transfer across domains and tasks,” in The IEEE International Conference on Computer Vision (ICCV), 2015.
  • [35] M. Long, Y. Cao, J. Wang, and M. Jordan, “Learning transferable features with deep adaptation networks,” in International Conference on Machine Learning (ICML), 2015.
  • [36] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning requires rethinking generalization,” in Proc. of the International Conference on Learning Representations (ICLR), 2017.