Pre-training is a standard transfer learning approach for deep nets. It entails training a net on a source task, randomly reinitializing one or more of the higher layers, and then retraining the resultant net on a desired target task. The rationale rests on the structure of the network: layers closer to the input extract general features, whereas layers closer to the output extract features that are more specific to the task and dataset. The advantages of reusing general features are threefold: 1) shorter training times, 2) better accuracy, and 3) reduced data requirements.
The logic of pre-training deep neural networks and its success in a wide variety of domains have led to its routine use. State-of-the-art (SOTA) results have been achieved for object detection, image segmentation and other recognition tasks [4, 5, 17, 22, 1, 21] by using nets that have been pre-trained on the ImageNet dataset as the backbone of the net to be trained on the target task.
Investigations of the methodology for pre-training have generally focused either upon which layers to transfer, or on using source task training sets significantly larger than ImageNet, such as ImageNet-5K, the JFT-300M dataset and Instagram. However, it has recently been demonstrated that target task datasets must be far smaller than previously believed before any accuracy benefits accrue from pre-training. In this paper we illustrate a new consideration for pre-training: the amount of training performed on the source task.
Too much source task training (i.e. pre-training) can significantly degrade a pre-trained net’s ability to learn a target task.
This loss reflects a general loss of learning ability. It has the characteristics of ‘fragile co-adaptation’, which Yosinski et al. argue can leave a pre-trained net less able to relearn a source task or learn a related task.
We are the first to connect this effect with the amount of source task training.
Unlike Yosinski et al., we observed this when reinitializing only the uppermost layer of the net, rather than several of the top layers. This observation is important for practitioners who rely upon pre-trained nets.
II Related Work
Two of the main hurdles in training neural nets are the large quantities of time and data needed to train them. These can be mitigated with transfer learning, which allows a deep net to reuse weights learned from one task to more rapidly learn a new task. In spite of the lack of explainability of deep nets, their structured nature lends itself to transfer learning. It has long been known that deep neural nets trained on visual data learn very similar first layer features, namely Gabor features and color patches, regardless of the specific task they have been trained upon [12, 11, 13]. Yet the output layers of nets are, by necessity, extremely task specific. It seems intuitively obvious that initializing a net with features that are more similar to the desired end state should save training time and allow for improved accuracy with smaller training sets.
The technique of copying lower layers of a net and training higher layers from random initialization is known as ‘pre-training and fine-tuning’. Indeed, as noted in , the first significant results using deep learning approaches for detecting objects involved pre-training with the ImageNet dataset before learning the target (i.e. desired) task [5, 21, 9, 19]. This is now the dominant approach for state-of-the-art object detectors [5, 8, 20, 24, 16, 15].
This leads us to ask the question: how can one best pre-train? Previous studies have centered on which layers to transfer and how large the transfer training set should be [25, 6]. It has also been observed that the accuracy benefits of transfer learning decline rapidly with increasing size of the target task training set [9, 7].
In this paper we ask: how much pre-training should be performed? Yosinski et al. observed that keeping the lower and middle layers led to worse performance than keeping just the lower layers, or keeping all layers save the output layers. They attributed this to a ‘fragile co-adaptation’ of the nodes in the middle layers. This meant that the net’s layers learned “features that interact with each other in a complex or fragile way such that this co-adaptation could not be relearned”. We observed a similar effect at the top layers of the net, which manifested quickly after too much source task training. Because this effect manifested at the top network layers, it may impact practitioners who rely upon pre-trained nets.
We performed experiments using three standard datasets: 1) CIFAR-100, 2) Tiny ImageNet200 and 3) Caltech101. CIFAR-100 contains 100 classes divided into a training set with 50,000 images (500 samples per class (SPC)) and a test set with 10,000 images (100 SPC). All images in CIFAR-100 are 32x32x3.
Tiny ImageNet200 consists of 200 classes drawn from the ImageNet dataset. It also has 500 SPC in the training set and 100 SPC in its test set. However, the images in this set are 64x64x3. Hence, we downsampled them to 32x32x3 for consistency and to decrease training time. Finally, the Caltech101 dataset comprises 101 classes with SPCs ranging from a few dozen to a few hundred. On average each class has about 50 SPC. We similarly resized the Caltech101 images to 32x32x3 to decrease training time and for consistency, since they were not all of uniform size.
Each dataset was divided into sets of living and not living classes. This was done to ensure that there would not be similar classes in both the source and target sets and to ensure source and target tasks with an appreciable number of classes.
The CIFAR-100 dataset was divided into a source task of 65 living classes and a target task of 35 not living classes. The Tiny ImageNet200 dataset was divided into a source task of 65 not living classes and a target task of 35 living classes. Finally, the Caltech101 dataset was divided into a source task of 51 living classes and a target task of 49 not living classes. These splits were determined by class availability and the desire to have more classes in the source tasks than in the target tasks.
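A class-level split of this kind can be sketched as follows. This is only an illustration, not the code used for the experiments: the helper `split_classes` and the toy class labels are hypothetical and do not reproduce the actual living/not living assignment.

```python
from typing import Dict, List, Tuple


def split_classes(is_living: Dict[str, bool],
                  source_is_living: bool) -> Tuple[List[str], List[str]]:
    """Partition class names into disjoint source and target class lists.

    Classes whose living/not-living label equals `source_is_living` go to
    the source task; all remaining classes go to the target task.
    """
    source = sorted(c for c, living in is_living.items() if living == source_is_living)
    target = sorted(c for c, living in is_living.items() if living != source_is_living)
    return source, target


# Toy labels (hypothetical; not the actual CIFAR-100 assignment).
toy_labels = {"beaver": True, "oak_tree": True, "trout": True,
              "bicycle": False, "castle": False}
source_classes, target_classes = split_classes(toy_labels, source_is_living=True)
print(source_classes)  # living classes form the source task
print(target_classes)  # not-living classes form the target task
```

Because the split is made on whole classes, no class can appear in both the source and the target task.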
III-B Neural Net Architecture
The architecture we used was a Wide ResNet 28-10 architecture. We chose it because it achieved high performance on the CIFAR-100 dataset without requiring novel augmentation or regularization techniques, balancing good performance with ease of implementation. Because we used Keras (v. 2.3.0), whereas Zagoruyko et al.’s original implementation used PyTorch, we based our instantiation on the Keras-based code at: https://github.com/titu1994/Wide-Residual-Networks
III-C Training Procedure
All experiments were run using Keras (v. 2.3.1) with a Tensorflow backend (v. 2.1.0) on machines with Quadro M6000 GPUs and either CUDA 10.2 or 11.0. Minibatches of 128 images were preprocessed with featurewise centering and normalization. Data augmentation was also employed, using both random horizontal image flips and uniformly random vertical and horizontal shifts of up to 4 pixels. These techniques were both implemented with the Keras image preprocessing module. After each training epoch, the image order was randomly shuffled.
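The preprocessing and augmentation steps can be illustrated with a small NumPy sketch. It is not the Keras preprocessing code used in the experiments; in particular, the zero-filling of vacated pixels, the integer-valued shifts and the flip probability of 0.5 are assumptions made for the illustration, and the function names are hypothetical.

```python
import numpy as np


def featurewise_normalize(images: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """Center and scale with per-channel statistics computed on the training set."""
    return (images - mean) / (std + 1e-7)


def random_flip_and_shift(image: np.ndarray, rng: np.random.Generator,
                          max_shift: int = 4) -> np.ndarray:
    """Apply a random horizontal flip and a random shift of up to `max_shift` pixels."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]  # flip along the width axis
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    h, w, _ = image.shape
    shifted = np.zeros_like(image)  # assumption: vacated pixels are zero-filled
    # Copy the region that overlaps between the original and the shifted frame.
    src_y, dst_y = slice(max(0, -dy), min(h, h - dy)), slice(max(0, dy), min(h, h + dy))
    src_x, dst_x = slice(max(0, -dx), min(w, w - dx)), slice(max(0, dx), min(w, w + dx))
    shifted[dst_y, dst_x, :] = image[src_y, src_x, :]
    return shifted


rng = np.random.default_rng(0)
train = rng.random((8, 32, 32, 3)).astype(np.float32)  # stand-in for a minibatch
mean = train.mean(axis=(0, 1, 2))  # one mean per color channel
std = train.std(axis=(0, 1, 2))    # one standard deviation per color channel
normalized = featurewise_normalize(train, mean, std)
batch = np.stack([random_flip_and_shift(img, rng) for img in normalized])
print(batch.shape)
```

The statistics are computed once on the training data and then reused, whereas the flip and shift are redrawn for every image each time it is presented.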
Following Zagoruyko et al., we used SGD with Nesterov momentum to optimize, with a categorical cross-entropy loss function. We also used the hyper-parameters they employed for CIFAR-100. Most notably, we used their learning rate schedule, which is reflected in the abrupt changes in the accuracies achieved on the source task validation set at the 60th and 120th epochs, though not at the 160th epoch.
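A step schedule of this form can be sketched as below. Only the epochs at which the rate decreases are taken from our description; the initial rate of 0.1 and the multiplicative factor of 0.2 are assumed defaults based on the settings reported by Zagoruyko et al., and the function name is hypothetical.

```python
from typing import Sequence


def step_learning_rate(epoch: int,
                       milestones: Sequence[int] = (60, 120, 160),
                       initial_lr: float = 0.1,
                       factor: float = 0.2) -> float:
    """Return the learning rate for a 0-indexed epoch under a step schedule.

    The rate is multiplied by `factor` once for every milestone that has
    been reached. `initial_lr` and `factor` are assumed values.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return initial_lr * factor ** drops


for epoch in (0, 59, 60, 120, 160):
    print(epoch, step_learning_rate(epoch))
```

Passing `milestones=(100, 160, 180)` gives a delayed variant of the same schedule.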
III-C1 Source Task Training
Source task training consisted of training a net from scratch, using the full source task training set. It lasted for 200 epochs, by which time the net displayed very little change in its validation set accuracy. The state of the net was recorded every 10 epochs, resulting in an initial set of 21 nets (i.e. Epoch 0 - Epoch 200, inclusive) for transfer. Furthermore, we saved the net with the highest validation set accuracy, since it is the one normally selected for transfer.
This was done 5 times, providing 5 different source nets at each of 21 different recorded source task training epochs and at the optimal source task epoch.
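The bookkeeping for the recorded nets can be sketched as follows. The helper names and the toy validation curve are hypothetical; the sketch only shows how recording every 10 epochs from epoch 0 to epoch 200 yields 21 nets, and how the epoch with the highest validation accuracy might be selected.

```python
from typing import List, Sequence


def checkpoint_epochs(total_epochs: int = 200, interval: int = 10) -> List[int]:
    """Epochs at which the net state is recorded, including the untrained net (epoch 0)."""
    return list(range(0, total_epochs + 1, interval))


def best_epoch(val_accuracy: Sequence[float]) -> int:
    """1-indexed epoch with the highest validation accuracy (first one on ties)."""
    return max(range(len(val_accuracy)), key=lambda i: val_accuracy[i]) + 1


epochs = checkpoint_epochs()
print(len(epochs))  # 21 recorded nets per run

# Hypothetical validation curve for a short 5-epoch run.
toy_curve = [0.31, 0.52, 0.61, 0.60, 0.61]
print(best_epoch(toy_curve))  # 3
```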
III-C2 Target Task Training
The initial state of the target task net was a trained source task net with the final fully connected layer and softmax layer reinitialized and changed to reflect the new number of output classes.
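The construction of the initial target task net can be illustrated with plain NumPy arrays standing in for the net’s weights. This is a sketch under stated assumptions, not our Keras code: the layer names `fc/kernel` and `fc/bias`, the toy layer shapes and the Glorot-uniform initialization of the new layer are all assumptions. The softmax itself has no weights, so only its output size changes.

```python
import numpy as np


def reinitialize_top_layer(weights: dict, num_target_classes: int,
                           rng: np.random.Generator) -> dict:
    """Copy all layers except the final dense layer, which is freshly initialized.

    `weights` maps layer names to arrays; the keys "fc/kernel" and "fc/bias"
    are hypothetical names for the final fully connected layer.
    """
    new_weights = {name: w.copy() for name, w in weights.items()}
    fan_in = weights["fc/kernel"].shape[0]
    # Assumption: Glorot-uniform initialization for the new output layer.
    limit = np.sqrt(6.0 / (fan_in + num_target_classes))
    new_weights["fc/kernel"] = rng.uniform(-limit, limit, size=(fan_in, num_target_classes))
    new_weights["fc/bias"] = np.zeros(num_target_classes)
    return new_weights


rng = np.random.default_rng(0)
# Toy stand-in for a source net trained on 65 classes.
source_net = {"conv1/kernel": rng.normal(size=(3, 3, 3, 16)),
              "fc/kernel": rng.normal(size=(640, 65)),
              "fc/bias": np.zeros(65)}
target_net = reinitialize_top_layer(source_net, num_target_classes=35, rng=rng)
print(target_net["fc/kernel"].shape)  # (640, 35)
```

All retained layers start from their source task values and remain trainable; only the output layer starts from a random state.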
The target task training sets were decreased to 10 samples per class (SPC), as the effects of transfer learning are more visible with small target task training sets. This also enabled us to choose 5 different transfer task training sets. By combining them with the 5 different source nets, we performed 25 quasi-independent transfer experiments for each recorded source task training epoch and for the optimal source task epoch. Our graphs reflect the minimum, median and maximum accuracy achieved on the target task validation set. This set was the same for all transfer experiments within each dataset.
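The experimental grid can be sketched in a few lines. The helpers `sample_per_class` and `summarize` are hypothetical, and the label list is a toy stand-in for a target task with 35 classes and 500 SPC; the sketch shows only how 5 subsets of 10 SPC combine with 5 source nets into 25 runs and how their accuracies would be summarized.

```python
import random
import statistics
from itertools import product
from typing import Dict, List, Sequence, Tuple


def sample_per_class(labels: Sequence[int], spc: int, rng: random.Random) -> List[int]:
    """Return sorted indices of a subset with `spc` randomly chosen samples of each class."""
    by_class: Dict[int, List[int]] = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    chosen: List[int] = []
    for indices in by_class.values():
        chosen.extend(rng.sample(indices, spc))
    return sorted(chosen)


def summarize(accuracies: Sequence[float]) -> Tuple[float, float, float]:
    """Minimum, median and maximum accuracy over a set of transfer runs."""
    return min(accuracies), statistics.median(accuracies), max(accuracies)


rng = random.Random(0)
toy_labels = [c for c in range(35) for _ in range(500)]  # 35 classes, 500 SPC
subsets = [sample_per_class(toy_labels, spc=10, rng=rng) for _ in range(5)]
runs = list(product(range(5), range(5)))  # (source net, target subset) pairs
print(len(subsets[0]), len(runs))  # 350 25
```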
III-D Specific Experimental Questions
III-D1 Does optimal source task performance imply optimal pre-training for a target task?
The first set of experiments was designed to illustrate both that training for optimal source task performance might not be optimal for learning a target task, and the degree to which it might be suboptimal. These experiments involved the CIFAR-100 and Tiny ImageNet200 datasets.
III-D2 Does too much source task training place the pre-trained net in a state that is overly specific to relearning the source task?
The premise of transfer learning via pre-training is that the weights of the retained layers create an initial state that is conducive to learning tasks similar to the source task. Presumably, the better the original net is at its source task, the better the resulting pre-trained net will be for learning related tasks. More assuredly, the better the pre-trained net would be at relearning the original source task, since that is the exact task the retained layers had been trained to do. A drop in accuracy when learning a different target task, but not when relearning the original source task, would indicate that the pre-trained net’s state had become overly specific for relearning the source task.
To test this, we performed experiments with identical source and target tasks. If the remaining layers of the net are overly specific for learning the source task, then relearning it should not display a drop in accuracy associated with too many ‘pre-transfer’ source task training epochs, though other target tasks (i.e. different than the source task) might display such a drop. If, however, the drop is also present when relearning the source task, then we are observing something similar to what Yosinski et al. saw, and which they ascribed to “fragile co-adapted features”; essentially, the net is not able to relearn the source task when its upper layer(s) are reinitialized, nor to learn related tasks.
III-D3 Does source task training set size affect loss of transfer accuracy associated with too much source task training?
We also experimented with the Caltech101 dataset. It is both much smaller than either of the other datasets and more unbalanced. It has an approximate median of 50 SPC, as opposed to the 500 SPC of CIFAR-100 and Tiny ImageNet200. Furthermore, Caltech101 has classes with as few as 28 samples and classes with as many as a few hundred samples. This led to different behavior than seen with the other two datasets.
To determine if this difference was due to the decreased size (i.e. decreased amount of training per epoch) of the Caltech101 dataset, we took the source task training set from the Tiny ImageNet200 dataset and abridged it to 50 SPC. Then, in order to increase the size of this training set (i.e. recreate an equivalent amount of training per epoch), without changing the variety of data, we copied each image 10x, giving us 500 SPC again.
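The construction of the abridged and replicated training set can be sketched as follows. The helper `abridge_and_replicate` is hypothetical, and keeping the first 50 samples of each class is an assumption made for the illustration; the selection rule is not specified above.

```python
from typing import Dict, List, Sequence, Tuple


def abridge_and_replicate(samples: Sequence[Tuple[str, int]],
                          keep_spc: int, copies: int) -> List[Tuple[str, int]]:
    """Keep the first `keep_spc` samples of each class, then repeat each kept sample `copies` times."""
    kept_per_class: Dict[int, int] = {}
    result: List[Tuple[str, int]] = []
    for image_id, label in samples:
        if kept_per_class.get(label, 0) < keep_spc:
            kept_per_class[label] = kept_per_class.get(label, 0) + 1
            result.extend([(image_id, label)] * copies)
    return result


# Toy dataset: 3 classes with 500 samples each, identified by hypothetical ids.
toy = [(f"img_{c}_{i}", c) for c in range(3) for i in range(500)]
replicated = abridge_and_replicate(toy, keep_spc=50, copies=10)
print(len(replicated), len(set(replicated)))  # 1500 150
```

The replicated set has the original number of samples per epoch (500 SPC), and hence the same number of weight updates per epoch, but only 50 distinct images per class.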
III-D4 Does learning rate affect loss of transfer accuracy associated with too much source task training?
There was a weak correlation between optimal target task performance and decreases in learning rate. This raised the question of whether the decline in target task learning was caused by the amount of training that had occurred at the time of the learning rate decreases, or by the decrease itself. So, we performed an experiment with the CIFAR-100 dataset, using delayed learning rate decreases at 100, 160 & 180 epochs instead of at 60, 120 & 160 epochs.
Each of our graphs shows median source task accuracy with a green curve that uses the right y-axis. The target task results use the left y-axis. Transfer learning results are indicated with a data point for the median accuracy with ‘error bars’ to indicate the max and min values, obtained over 25 trials. Red denotes target task results obtained with 0 source task training epochs - i.e. no transfer learning. Yellow indicates target task results using the net with optimal source task performance (i.e. the nets currently used for transfer). Because each of the 5 source nets achieved optimal performance at a different epoch, we associate their target task results with the median of those epochs. Dark blue indicates the best target task results obtained without using the optimal source task performance net. Horizontal lines are used to aid comparison of the median accuracies.
IV-A Does optimal source task performance imply optimal pre-training for a target task?
The results in Figs. 1 and 2 clearly show that optimal source task performance does not imply optimal pre-training for a target task. Fig. 1 shows transfer with the CIFAR-100 dataset; Fig. 2 shows results from the same experiment, but with the Tiny ImageNet200 dataset. The difference in accuracy between the dark blue horizontal line (i.e. best overall transfer accuracy) and the yellow horizontal line (i.e. accuracy achieved by pre-training to optimal source task performance) indicates how much transfer learning of the target task may be degraded by training to peak source task performance rather than stopping at the point that is best for transfer.
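The comparison between the dark blue and yellow lines amounts to a simple difference of median accuracies, sketched below. The function name and all accuracy values are hypothetical and are not results from our experiments.

```python
from typing import Dict


def transfer_gap(checkpoint_acc: Dict[int, float], optimal_net_acc: float) -> float:
    """Best median target accuracy over recorded epochs minus that of the optimal source net."""
    return max(checkpoint_acc.values()) - optimal_net_acc


# Hypothetical median target task accuracies, keyed by source task training epoch.
toy_medians = {0: 0.30, 10: 0.42, 20: 0.45, 60: 0.40, 200: 0.33}
best_transfer_epoch = max(toy_medians, key=toy_medians.get)
gap = transfer_gap(toy_medians, optimal_net_acc=0.36)
print(best_transfer_epoch, round(gap, 2))  # 20 0.09
```

A positive gap means that some earlier checkpoint transferred better than the net selected for peak source task performance.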
IV-B Does too much source task training place the pre-trained net in a state that is overly specific to relearning the source task?
To test whether our nets had been pre-trained to a state that was overly specific to their source task, we repeated the experiments from Figs. 1 and 2, but this time we attempted to relearn the original source task with transfer learning, in a ‘degenerate’ example of transfer learning. The results in Figs. 3 and 4 demonstrate that the net is not in a state that is overly specific to the source task. In each case, not only is the original optimal source task net far from optimal at relearning that same task, but it also, surprisingly, displays negative transfer.
Negative transfer is evidenced by the fact that the horizontal yellow line (i.e. target task accuracy achieved after obtaining peak source task accuracy) is beneath the horizontal red line (i.e. target task accuracy achieved without any source task training).
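This criterion can be written as a one-line check on median accuracies. The sketch below uses a hypothetical function name and made-up accuracy values purely to illustrate the comparison between the yellow and red lines.

```python
import statistics
from typing import Sequence


def is_negative_transfer(transfer_acc: Sequence[float], scratch_acc: Sequence[float]) -> bool:
    """True if the median accuracy with pre-training is below the median accuracy without it."""
    return statistics.median(transfer_acc) < statistics.median(scratch_acc)


# Hypothetical target task accuracies over a handful of runs.
with_optimal_source_net = [0.21, 0.24, 0.22, 0.25, 0.23]
without_pretraining = [0.26, 0.27, 0.25, 0.28, 0.26]
print(is_negative_transfer(with_optimal_source_net, without_pretraining))  # True
```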
The loss of the pre-trained net’s accuracy, when relearning the original source task, demonstrates that excessive training on a source task can leave a pre-trained net in a state where it is less able to learn. This is similar to the ‘fragile co-adapted’ nodes first observed by Yosinski et al., except that we have demonstrated that ‘fragile co-adaptation’ can occur at the high, rather than middle, layers of a net and that this phenomenon can be associated with the amount of source task training.
IV-B1 Does source task training set size affect loss of transfer accuracy associated with too much source task training?
The results in Fig. 5 show that the optimal source task net is not necessarily sub-optimal for transfer learning. They were obtained with the Caltech101 dataset, which, as noted above, is notably smaller and less balanced than CIFAR-100 and Tiny ImageNet200. The smaller size means that not only does Caltech101 have less variety than our other two datasets, but it also yields fewer weight updates per training epoch.
To investigate the degree to which the smaller size of the Caltech101 dataset caused this, we truncated the source task data from the Tiny ImageNet200 dataset to 50 SPC. We felt this would help it best mimic the Caltech101 dataset. The transfer learning experiments were then repeated, with the results shown in Fig. 6. They show the same lack of degradation of transfer learning as Fig. 5.
Fig. 7 shows results when each image in the source training set of our truncated Tiny ImageNet200 is repeated 10x each epoch. This restores the size of the original training set, though not its variety. As can be seen, the degradation in the transfer task accuracy reappears, showing that it is due more to excessive training than to variety in the dataset. Even more significantly, the bottom graph shows evidence of negative transfer if one uses the net with optimal source task performance since, once again, the horizontal yellow line has fallen below the horizontal red line.
IV-B2 Does learning rate affect loss of transfer accuracy associated with too much source task training?
Fig. 8 illustrates the importance of learning rate, as opposed to just the number of training epochs. It repeats the experiment from Fig. 1 using delayed learning rate decreases at 100, 160 & 180 epochs instead of at 60, 120 & 160 epochs. The decreases in transfer learning accuracy shift to coincide with the scheduled decrements in learning rate, which indicates the role that smaller learning rates play in creating pre-trained nets that are ill-suited for transfer.
Pre-trained nets are the most common transfer learning technique for deep nets. We have demonstrated that optimal source task performance is not indicative of optimal target task performance for pre-trained nets. Too much source task training results in a pre-trained net that is less able to learn target tasks. This observation seems to be due to the ‘fragile co-adapted features’ first noted by Yosinski et al., except we observe them in the highest layer, which is commonly used for pre-training. As a result, more caution is warranted in the use of pre-trained nets, especially for small target task training sets.
The authors would like to thank Travis Cuvelier for many useful conversations.
[1] (2017) Quo vadis, action recognition? A new model and the Kinetics dataset. pp. 4724–4733.
[2] (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
[3] (2004) Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. In 2004 Conference on Computer Vision and Pattern Recognition Workshop, pp. 178–178.
[4] (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587.
[5] (2015) Fast R-CNN. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1440–1448.
[6] (2007) Knowledge transfer in deep convolutional neural nets. In Proceedings of the Twentieth International Florida Artificial Intelligence Research Society Conference, May 7-9, 2007, Key West, Florida, USA, D. Wilson and G. Sutcliffe (Eds.), pp. 104–109.
[7] (2011) Latent learning - what your net also learned. In The 2011 International Joint Conference on Neural Networks, IJCNN 2011, San Jose, California, USA, July 31 - August 5, 2011, pp. 1316–1321.
[8] (2017) Mask R-CNN. In 2017 IEEE International Conference on Computer Vision, pp. 1440–1448.
[9] (2019) Rethinking ImageNet pre-training. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 4917–4926.
[10] (2009) Learning multiple layers of features from tiny images.
[11] (2012) ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25, pp. 1097–1105.
[12] (2011) ICA with reconstruction cost for efficient overcomplete feature learning. In Advances in Neural Information Processing Systems, J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Q. Weinberger (Eds.), Vol. 24.
[13] (2009) In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 609–616.
[14] (2015) (Website).
[15] (2016) Feature pyramid networks for object detection. CoRR abs/1612.03144.
[16] (2016) SSD: single shot multibox detector. In European Conference on Computer Vision, pp. 21–37.
[17] (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440.
[18] (2018) Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 181–196.
[19] (2014) Learning and transferring mid-level image representations using convolutional neural networks. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1717–1724.
[20] (2016) You only look once: unified, real-time object detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779–788.
[21] (2014) OverFeat: integrated recognition, localization and detection using convolutional networks. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.).
[22] (2014) Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems.
[23] (2017) Revisiting unreasonable effectiveness of data in deep learning era. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 843–852.
[24] (2019) Detectron2. https://github.com/facebookresearch/detectron2
[25] (2014) How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger (Eds.), Vol. 27.
[26] (2016) Wide residual networks. NIN 8, pp. 35–67.