An Efficient Transfer Learning Technique by Using Final Fully-Connected Layer Output Features of Deep Networks

by   Tasfia Shermin, et al.
Federation University Australia

In this paper, we propose a computationally efficient transfer learning approach that uses the output vector of the final fully-connected layer of deep convolutional neural networks for classification. Our proposed technique uses a single layer perceptron classifier whose hyper-parameters are chosen to improve computational efficiency without adversely affecting classification performance compared to the baseline technique. Our investigations show that our technique converges much faster than the baseline while yielding very competitive classification results. We conduct thorough experiments to understand how the similarity between pre-trained and new classes, the similarity among new classes, and the number of training samples affect classification performance when transferring the final fully-connected layer's output features.




1 Introduction

Influential internal representations developed in human infancy are reused later in life to solve various problems, as stated by the cognitive study of [2]. As a result of such development, humans perform visual tasks such as classification, detection and recognition of objects with ease compared to machines. Like humans, deep neural networks built for computer vision problems also learn data representations (features) which they later use to solve multiple tasks. More precisely, a network can be trained on a source task and then be reused on a target task. This transferability of learned data representations is termed transfer learning [8, 4, 5]. Transfer learning is particularly convenient when a large dataset is available for training on the source task and the data for the target task is scarce. This technique works well when the learned data representations or features are generic, that is, suitable for both base and target datasets. The ImageNet [10] dataset paved the way for deep networks to learn generic features. Deep neural networks manifest an interesting attribute after training on images: all of them tend to learn features in the first layer that resemble Gabor filters and colour blobs. This behaviour is common across different datasets and training objectives [20, 21, 24], which indicates that first-layer features are general regardless of the optimization function. A number of works in various computer vision tasks have reported significant results by transferring inner layer features of deep networks [12, 40, 31]. As the deep network architecture moves toward fully-connected (FC) layers, the specificity increases while the generic nature of the features decreases [39], i.e., the intuition is that the last layer's features of a deep network are highly specific to the pre-trained classes and might not generalize well in transfer learning. Therefore, the final FC layer features have not been considered generic enough for transfer learning tasks.

However, some empirical evidence about the transferability of the final FC layer's output features has been provided by [18]. They have shown that these features achieve high nearest neighbour accuracy when tested with unknown classes. This indicates that the final FC layer output features might generalize in transfer learning. Motivated by their findings, we use the final fully-connected layer's output features for transfer learning based classification tasks in this paper. Our proposed transfer learning approach extracts the final FC layer's output features (a 1000-dimensional feature vector) and classifies them using our proposed single layer perceptron classifier network. The performance of the proposed approach is compared with an existing transfer learning approach which replaces the final FC layer with one that has as many neurons as there are new classes and fine-tunes it along with the penultimate layer, which consists of a 4096 or higher dimensional feature vector. Section 3 discusses both techniques. Furthermore, to investigate the performance of the proposed classifier and to observe the behaviour of the final FC layer's output features in classifying new classes using transfer learning, we systematically investigate the following research questions (RQ).

RQ1: Does similarity of new classes with the pre-trained classes influence performance of classification using transfer learning?

RQ2: Does similarity among new classes influence the performance of classification using transfer learning?

RQ3: How much is the performance of classification using transfer learning influenced by the number of training and validation images used for new classes?

RQ4: How much is the performance of classification using transfer learning influenced when mixed types of new classes are trained?

RQ5: Can the proposed single layer perceptron classifier be used to improve computational efficiency without adversely affecting the performance of classification using output vectors from the final fully-connected layer of deep networks?

Section 4 describes the experiments carried out, and their outcomes are explored in Section 5. Empirical studies show that the proposed approach achieves test accuracy that is highly competitive with the baseline while taking significantly less time to converge in training. Finally, Section 6 concludes our findings.

2 Related work

A significant number of papers have studied transfer learning in CNNs, including various factors affecting fine-tuning, pre-training and freezing layers. It has become a trend in the computer vision community to treat convolutional neural networks [20, 23, 35, 34, 16] trained on ImageNet as feature extractors that can be reused in almost all categories of visual tasks. ImageNet pre-trained CNN features have yielded impressive results in image classification [12, 32], action recognition [33], object detection [13, 31], image captioning [11, 19], human pose estimation [7], image segmentation [9], optical flow [38], and others [22]. A study of suitable architectural choices for transfer learning has been reported by [3]. Whether to stop pre-training early to avoid overfitting and which layers transfer best are studied by [1, 39]. A transfer learning approach that fine-tunes for new tasks without forgetting the old ones is proposed by [25]. To limit the need for annotated data for the supervised pre-training required for transfer learning, [36] has proposed a method of more universal representations. A method of inferring the curriculum by transfer learning from another network pre-trained on a different task is proposed by [37]. The nature of transfer learning in mid-level features for transferring learned features across different visual tasks is studied by [28]. CNN features were used as off-the-shelf features without fine-tuning by [32]. CNN features pre-trained on road scenes were reused for more specific road scene classification by [17], who considered the dataset size and training time along with transfer learning performance. We consider similar performance metrics for the proposed transfer learning approach.

In this paper, we propose a single layer perceptron (SLP) classifier that uses the final FC layer's output features for classifying the target dataset. To increase the accuracy of object detection in the field of pathology, Romain et al. [26] have used different classifiers for transfer learning, including a single layer perceptron (SLP). They used the SLP implementation from Scikit-learn [29]. Our proposed classifier uses a different set of training hyper-parameters and a different activation function compared to [29]. Moreover, our proposed classifier is highly computationally efficient and yields results competitive with the baseline.

Figure 1: Feature extraction from final FC layer of pre-trained convolutional neural network and forwarding to single layer perceptron classifier. Proposed classifier is trained separately.
Figure 2: Flow diagram of baseline transfer learning procedure described in Section 3.2

3 Proposed architecture and baseline

The pre-processing of images and feature extraction are briefly explained in Section 3.1. The final FC layer's output features are extracted from ImageNet-1000 pre-trained CNNs [34, 16] for our proposed approach. The baseline technique and the proposed transfer learning approach using the SLP classifier are discussed in Sections 3.2 and 3.3 respectively.

3.1 Pre-processing and feature extraction

Unlike a regular neural network, the layers of a deep network have neurons arranged in three dimensions: width (W), height (H) and depth (D), so the volume of an input image I can be denoted as W_I × H_I × D_I, where D_I represents the number of color channels. For augmenting the training dataset, input images from the training sets are first randomly cropped, randomly flipped horizontally and finally normalized. The pre-trained base networks are designed to take square images as inputs. Therefore, to match the input dimension of the network, square patches S of maximum height and width are randomly cropped from the image. The cropped patches are then resized to the network input size, preserving the aspect ratio of the image. For validation and testing, a center crop of the image is taken instead of a random crop, followed by resizing. The final normalized cropped patch is passed through the pre-trained base network P. Then the output of the last FC layer F, a 1000-dimensional feature vector, is extracted. High-dimensional tensors extracted for transfer learning usually require global average pooling, principal component analysis or max pooling for dimensionality reduction. However, given the low dimension of the last layer's output, it does not need to be reduced.
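The cropping step can be illustrated with a small sketch. It assumes that the "maximum" square patch has a side equal to the shorter image dimension; the function name `square_crop_box` and the example image size are hypothetical, and flipping, resizing and normalization are left out.

```python
import random


def square_crop_box(width, height, train, rng=random):
    """Return (left, top, right, bottom) of the largest square patch.

    Assumption: the maximum square patch has side min(width, height).
    Training uses a random offset; validation and testing use the center.
    """
    side = min(width, height)
    max_left = width - side
    max_top = height - side
    if train:
        # randint is inclusive on both ends, so every valid offset can occur.
        left = rng.randint(0, max_left)
        top = rng.randint(0, max_top)
    else:
        left = max_left // 2
        top = max_top // 2
    return (left, top, left + side, top + side)


# Center crop of a hypothetical 640 x 480 image.
center_box = square_crop_box(640, 480, train=False)
```

The returned box is always square, so resizing it to the network input size does not distort the patch.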

3.2 Baseline transfer learning

For the baseline technique, the final FC layer is replaced with as many neurons as the number of classes in the target experiment [28, 32]. The baseline is represented in Figure 2. Initially, the base network is frozen. Then only the newly appended and the second last layers are fine-tuned for 25 epochs. The data augmentation process described in Section 3.1 is used for processing the input training images. Stochastic gradient descent (SGD) [6] optimization is used with a momentum of 0.9 and no weight decay in the fine-tuning process. During training, the categorical cross-entropy loss is used, and the learning rate scheduler has a step size of 7 epochs and a fixed decay factor gamma. Every epoch has one training and one validation phase; at the end of each epoch, the model is evaluated on the validation set and the weights that give the best validation performance so far are saved. The baseline technique takes more time to converge during training.

Figure 3: Architecture of our single layer perceptron classifier.

3.3 Proposed single layer perceptron network

This section explains the motivation for designing our proposed single layer perceptron classifier and its working principle. Figure 1 portrays the transfer learning procedure step by step using the proposed classifier. The extracted output features of the final FC layer are passed to the proposed single layer perceptron classifier as input. The proposed classifier is then trained separately.

3.3.1 Motivation

When the deep network architecture moves towards fully-connected layers, the features are found to be less generic. Therefore, each output neuron of the final FC layer is considered to be specific to one class. The motivation behind designing the single layer perceptron classifier is that the extracted final FC layer's output feature vector already contains globally optimized class specific information from the pre-trained images, so appending further fully connected hidden layers in the classifier might result in a drop in performance due to representation specificity, as reported by [39].

3.3.2 Architecture

The proposed classifier is designed on the principle of a traditional fully-connected artificial neural network, where all the neurons in one layer are connected to the neurons in the next layer. Figure 3 shows the architecture of the proposed classifier. The first layer is composed of K neurons, where K = 1000 is the dimension of the features extracted from the final FC layer of the pre-trained network. Each of the 1000 neurons is connected to all neurons in the next layer. The number of output neurons is decided according to the number of classes in the target task and is denoted by N. To introduce non-linearity in the model, the neurons of the output layer are activated with the Rectified Linear Unit (ReLU), and finally the outputs are fed into a softmax classifier to obtain class probabilities.
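The forward pass described above (one fully-connected layer, ReLU on the output neurons, then softmax) can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: `slp_forward` is a hypothetical name, the parameters are random stand-ins, and N = 5 is taken from the 5-class experiments.

```python
import math
import random


def slp_forward(features, weights, biases):
    """Linear layer -> ReLU -> softmax, returning class probabilities."""
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    activated = [max(0.0, z) for z in logits]  # ReLU on the output layer
    peak = max(activated)  # subtract the maximum for numerical stability
    exps = [math.exp(a - peak) for a in activated]
    total = sum(exps)
    return [e / total for e in exps]


K, N = 1000, 5  # 1000 input features; N = 5 is an illustrative class count
rng = random.Random(0)
# Illustrative random parameters, not the initialization used in the paper.
weights = [[rng.uniform(-0.03, 0.03) for _ in range(K)] for _ in range(N)]
biases = [rng.uniform(-0.03, 0.03) for _ in range(N)]
# Stand-in for one extracted final FC layer output vector.
features = [rng.uniform(-1.0, 1.0) for _ in range(K)]

probs = slp_forward(features, weights, biases)
```

The classifier holds only K × N weights and N biases (5005 parameters for N = 5), which is consistent with the short training times reported for the proposed technique.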

3.3.3 Initialization and training

To derive a good weight and bias initialization, the proposed model was evaluated with the main types of initialization that are currently popular in the deep learning community, for example, Xavier initialization [14] in the form of uniform and normal distributions, random initialization, and Kaiming uniform [15] initialization with bias 0.01. Our empirical study indicates that, for this architecture, initializing the learnable weights and biases from a uniform distribution yields the best results. To train the single layer perceptron classifier, stochastic gradient descent (SGD) optimization was used with no momentum. The learning rate was decayed by a constant factor after every 7 epochs. Categorical cross-entropy was used as the loss function. Every training phase was followed by a validation phase in each epoch, and the weights that achieved the best validation accuracy were saved.
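For reference, the training objective described above can be written out as follows. The notation is introduced here for illustration and is not taken from the paper: a_n is the ReLU-activated output of neuron n, y is the one-hot label vector, θ collects the weights and biases, and η is the learning rate. With no momentum, SGD reduces to the plain update in the last expression.

```latex
p_n = \frac{\exp(a_n)}{\sum_{m=1}^{N} \exp(a_m)}, \qquad
\mathcal{L} = -\sum_{n=1}^{N} y_n \log p_n, \qquad
\theta \leftarrow \theta - \eta \, \nabla_{\theta} \mathcal{L}
```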

Figure 4: Confusion matrix between new and pre-trained classes of species stating the similarity measures. Other denotes other known classes. (a) and (b) show outcomes of pre-trained ResNet18 and VGG19 respectively.
Figure 5: Four types of species with five classes each for testing transfer learning.

4 Experimentation

In this section, the classification performance of the proposed approach (Section 3.3) using extracted features is compared with that of the baseline explained in Section 3.2. In addition, the behaviour of the final FC layer's output features is empirically established in transfer learning based classification of new classes. For feature extraction, ResNet18 and VGG19 networks pre-trained on ImageNet-1000 [30] are used. Following the research questions stated in Section 1, four types of species (Bird, Fruit, Flower and Pepper), each consisting of five different classes with approximately different degrees of similarity (80%, 70%, 60%, and 50%) to the pre-trained classes, have been selected by creating the confusion matrices of Figure 4. These matrices state the percentage of similarity of each new class with respect to the pre-trained classes of a species, as produced by the pre-trained network without transfer learning. For experimentation, 500 images for each class of the species were collected according to the ImageNet synsets by web crawling. Figure 5 shows examples of classes of each species from our target dataset. We have ensured that the target and base datasets have no overlapping classes. In order to observe in depth the impact of transfer learning on classification among different classes of the same species, three types of classification (i.e., 3-class, 4-class and 5-class) are investigated. We denote this as A-type classification in this article. The results of A-type classification are presented in Figure 6. To understand the influence of transfer learning on classification among different classes of different species, experiments are designed with a fixed number of classes from each species. We denote this as B-type classification in this article.
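One possible reading of the similarity measure is sketched below: the images of a new class are passed through the pre-trained network, and the share of top-1 predictions that fall on each pre-trained class of the same species is reported, with all remaining predictions pooled under "Other". This interpretation, the function `similarity_row` and the example labels are assumptions made for illustration.

```python
from collections import Counter


def similarity_row(predicted_labels, species_classes):
    """Percentage of a new class's images assigned to each pre-trained class.

    Predictions outside `species_classes` are pooled under "Other".
    """
    counts = Counter(
        label if label in species_classes else "Other"
        for label in predicted_labels
    )
    total = len(predicted_labels)
    return {label: 100.0 * n / total for label, n in counts.items()}


# Hypothetical top-1 predictions for 10 images of one new bird class.
predictions = ["robin"] * 8 + ["jay"] + ["tennis ball"]
row = similarity_row(predictions, species_classes={"robin", "jay"})
```

Under this reading, the new class in the example would be about 80% similar to the pre-trained class it is most often confused with.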
For each category of classification, we have organized experiments with three target sets, each with a different number of images per class for training, validation and testing. For example, the first target set is composed of 50 training images and 25 validation images per class, with the rest of the images left for testing. The retrieval of pre-trained weights and all other experiments are done in PyTorch. Hyper-parameters of all experiments were tuned by 30-fold cross-validation on the dataset. Section 4.1 gives a detailed discussion of our evaluation metrics.
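The per-class split of the first target set follows directly from the 500 images collected per class. The helper `split_sizes` below is a hypothetical name used only to make the arithmetic explicit.

```python
IMAGES_PER_CLASS = 500  # images collected per class, as stated above


def split_sizes(n_train, n_val, total=IMAGES_PER_CLASS):
    """Return (train, validation, test) image counts for one class."""
    n_test = total - n_train - n_val
    if n_test < 0:
        raise ValueError("training and validation images exceed the class total")
    return n_train, n_val, n_test


# First target set: 50 training and 25 validation images per class.
first_target_set = split_sizes(n_train=50, n_val=25)
```

For the first target set this leaves 425 test images per class.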

Species  CNN       Similarity  Classes {1,2,3}         Classes {1,2,3,4}       Classes {1,2,3,4,5}
                               Baseline  ANN   Gain    Baseline  ANN   Gain    Baseline  ANN   Gain
Bird     ResNet18  81.8        91.0      89.8  -1.3%   90.5      89.5  -1.1%   90.0      89.0  -1.1%
Bird     VGG19     80.8        92.0      90.0  -2.2%   90.0      89.7  -0.3%   90.5      89.5  -1.1%
Fruit    ResNet18  72.5        80.5      80.1  -0.5%   80.3      80.0  -0.4%   80.2      80.0  -0.2%
Fruit    VGG19     72.7        79.0      79.3   0.4%   78.3      79.0   0.9%   78.5      78.8   0.4%
Flower   ResNet18  64.8        72.3      73.0   0.9%   72.1      72.8   1.0%   72.1      72.7   0.9%
Flower   VGG19     63.7        70.5      69.1  -2.1%   70.3      69.0  -1.9%   70.3      70.0  -0.4%
Pepper   ResNet18  50.9        61.2      64.2   4.7%   61.1      64.1   4.7%   61.1      64.7   5.6%
Pepper   VGG19     52.2        62.2      62.6   0.7%   62.3      62.0  -0.5%   62.3      62.0  -0.5%
Average            67.4        76.1      76.0   0.1%   75.6      75.8   0.3%   75.6      75.9   0.4%
Table 1: Transfer learning classification test accuracy (%) of the proposed ANN-based (SLP) technique against the baseline for 50 training images per class of each species.
Species  CNN       Similarity  Classes {1,2,3}         Classes {1,2,3,4}       Classes {1,2,3,4,5}
                               Baseline  ANN   Gain    Baseline  ANN   Gain    Baseline  ANN   Gain
Bird     ResNet18  81.8        92.0      90.6  -1.5%   91.0      89.7  -1.4%   90.7      89.2  -1.7%
Bird     VGG19     80.8        91.4      89.8  -1.8%   91.0      89.9  -1.2%   90.4      90.5   0.1%
Fruit    ResNet18  72.5        80.6      80.3  -0.4%   80.4      80.0  -0.5%   80.2      80.0  -0.2%
Fruit    VGG19     72.7        80.0      80.3   0.4%   79.3      80.0   0.9%   79.5      80.2   0.9%
Flower   ResNet18  64.8        72.5      74.0   2.0%   72.4      73.3   1.2%   72.5      73.7   1.7%
Flower   VGG19     63.7        70.7      71.1   0.5%   70.6      71.0   0.6%   70.4      71.8   2.0%
Pepper   ResNet18  50.9        62.2      64.3   3.3%   62.1      64.2   3.2%   62.1      64.3   3.5%
Pepper   VGG19     52.2        62.6      62.4  -0.3%   62.4      62.3  -0.1%   62.5      62.1  -0.6%
Average            67.4        76.5      76.6   0.3%   76.2      76.3   0.3%   76.0      76.5   0.7%
Table 2: Transfer learning classification test accuracy (%) of the proposed ANN-based (SLP) technique against the baseline for 100 training images per class of each species.
Species  CNN       Similarity  Classes {1,2,3}         Classes {1,2,3,4}       Classes {1,2,3,4,5}
                               Baseline  ANN   Gain    Baseline  ANN   Gain    Baseline  ANN   Gain
Bird     ResNet18  81.8        92.4      90.9  -1.6%   91.8      89.9  -2.1%   91.7      89.0  -3.0%
Bird     VGG19     80.8        91.1      90.8  -0.3%   90.9      89.5  -1.6%   90.5      90.7   0.2%
Fruit    ResNet18  72.5        81.0      80.8  -0.2%   80.7      80.2  -0.6%   80.7      80.2  -0.6%
Fruit    VGG19     72.7        80.4      81.4   1.2%   80.3      81.0   0.9%   80.4      81.4   1.3%
Flower   ResNet18  64.8        72.8      74.2   1.9%   72.7      74.2   2.0%   72.7      74.2   2.1%
Flower   VGG19     63.7        71.9      72.0   0.2%   71.8      72.0   0.3%   71.4      72.0   0.9%
Pepper   ResNet18  50.9        63.9      64.5   1.0%   63.7      64.2   0.7%   63.4      64.4   1.6%
Pepper   VGG19     52.2        63.7      63.4  -0.4%   63.6      63.3  -0.5%   63.4      63.2  -0.3%
Average            67.4        77.1      77.3   0.2%   76.9      76.8  -0.1%   76.8      76.9   0.3%
Table 3: Transfer learning classification test accuracy (%) of the proposed ANN-based (SLP) technique against the baseline for 200 training images per class of each species.
Figure 6: Transfer learning classification test accuracy (%) trend against similarity among training species and the number of classes trained per species for pre-trained CNN models ResNet18 and VGG19.
Species       CNN       3 classes per species
                        Baseline  ANN   Gain
Indep. (avg)  ResNet18  76.2      76.8   0.7%
Indep. (avg)  VGG19     75.9      75.2  -0.9%
Indep. (avg)  Average   76.1      76.0  -0.1%
Mixed         ResNet18  71.1      71.9   1.2%
Mixed         VGG19     73.8      74.1   0.5%
Mixed         Average   72.4      73.0   0.8%
Table 4: Transfer learning classification test accuracy (%) of the proposed ANN-based (SLP) technique against the baseline for training each species independently or in a mix with fixed number of classes per species.
CNN       Classes  50 training images  100 training images  200 training images
                   Baseline  ANN       Baseline  ANN        Baseline  ANN
ResNet18  3        990       15        1110      17         1130      18
ResNet18  4        1110      16        1230      19         1230      19
ResNet18  5        1808      18        1832      22         1868      23
ResNet18  Average  1303      16        1391      19         1409      20
ResNet18  Gain     -98.7%              -98.6%               -98.6%
VGG19     3        1215      18        1315      20         1325      23
VGG19     4        1255      22        1505      23         1535      25
VGG19     5        1935      28        2115      29         2175      30
VGG19     Average  1468      23        1645      24         1678      26
VGG19     Gain     -98.5%              -98.5%               -98.5%
Table 5: Transfer learning training time (s) of the proposed ANN-based (SLP) technique against the baseline.

4.1 Performance metrics

To evaluate the experimental results of the two transfer learning strategies (Sections 3.2 and 3.3), two performance metrics are considered. The first is the test accuracy (TA) obtained on the test sets, which measures the final classification performance of the network after training. The second is the training time (TT), the amount of time needed to train the model without using a GPU.
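Both metrics are simple to compute. The sketch below shows one way to do so; the helper names `accuracy_percent` and `timed`, the example labels and the stand-in workload are all hypothetical.

```python
import time


def accuracy_percent(predicted, actual):
    """Percentage of test samples whose predicted class matches the label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)


def timed(train_fn):
    """Run `train_fn` and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = train_fn()
    return result, time.perf_counter() - start


# Hypothetical predictions and labels for four test images.
ta = accuracy_percent([0, 1, 2, 2], [0, 1, 1, 2])
# Stand-in workload in place of an actual training run.
_, tt = timed(lambda: sum(range(100_000)))
```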

5 Results and discussion

This section discusses the findings of our experiments in the light of our five research questions; the impact of transfer learning on classification tasks is established by observing the outcomes presented in the tables and graphs. The percentage gain in test accuracy for A-type and B-type classification is obtained from the difference between the test accuracy of the proposed technique and that of the baseline, where a negative (-) sign indicates lower test accuracy of the proposed technique compared to the baseline.

RQ1: Classification outcomes using the final fully-connected layer's output features follow the general behaviour of classification in computer vision. More precisely, the results in Tables 1, 2 and 3 show that, for all cases of A-type classification, test accuracy for both the proposed and the baseline technique decreases as similarity gradually diminishes. This observation establishes that the similarity between new and pre-trained classes strongly influences classification outcomes.

RQ2: Moreover, test accuracy decreases as the number of classes increases. For example, moving right from column 4 in the first row of Table 1 (Bird, ResNet18), the test accuracy of the baseline decreases from 91.0% to 90.0% and that of the proposed technique from 89.8% to 89.0%. This indicates that the marginal improvement gradually decreases as more classes of the same species are added. Therefore, similarity among new classes of the same type does not seem to have much impact on increasing the performance of classification based on transfer learning.

RQ3: As the number of training samples per class increases, test accuracy goes up. For example, for Birds with ResNet18 features (A-type classification with 3 classes), the test accuracy of the proposed classifier is 89.8% in Table 1 and rises to 90.9% in Table 3 as the number of training samples per class increases. A similar increase is frequently observed in all species with different similarity, which indicates that more training samples help the classifier learn more and yield better performance.

A comparison of similarity and test accuracy shows an increase of roughly 8 to 10 percentage points on average in classification after transfer learning using both approaches (Tables 1, 2 and 3). This establishes that the final FC layer's output features are suitable for A-type classification tasks. In addition, the proposed classifier yields very competitive classification test accuracy by using only a 1000-dimensional feature vector, compared to the baseline technique which uses 4096 or higher dimensional features. The proposed technique achieves an average gain in the range of -0.1% to 0.7%, as observed from Tables 1, 2 and 3. Precisely, the proposed technique has an average gain of -0.1% for only one type of classification among nine, in Table 3, whereas it acquires a positive average gain for the other eight cases. This means that the proposed classifier yields better results in the majority of cases. Concerning test accuracy, the results show that on average the proposed technique performs similarly to the baseline, and for cases with positive gain it outperforms the baseline. To understand the behaviour of the proposed technique further from a pictorial view, we plot the 3D graphs of Figure 6 with classification accuracy, similarity and number of training samples as the axes (blue and green represent the proposed technique and the baseline respectively). Figures 6(a) and 6(b) show that the proposed classifier using ResNet18 features surpasses the test accuracy of the baseline technique when the similarity among new classes tends to decrease. In addition, for fewer training samples and more classes, our technique gains more test accuracy than the baseline technique. This establishes that, with ResNet18 features, the proposed classifier generalizes better on new classes when they have less similarity with the pre-trained classes. However, for higher similarity it yields outcomes similar to the baseline.
On the other hand, Figure 6(c) shows that using VGG19 features with fewer training samples per class leads the proposed classifier to outperform the baseline at certain similarity levels. Figure 6(d) shows that it surpasses the baseline test accuracy with a greater number of training samples and classes across the whole range of similarities. In the rest of the cases, our classifier gives approximately the same results as the baseline technique.

RQ4: To understand the behaviour of mixed species in classification, 3-class A-type classification is compared with B-type classification with 3 classes per species. From Table 4 it is apparent that the proposed classifier achieves a higher average gain for the mixed class experiments (0.8%, against -0.1% when species are trained independently). This establishes that the proposed approach classifies better than the baseline when the similarity among classes decreases as the number of classes increases.

RQ5: To provide evidence of computational efficiency, the training times of both the proposed and baseline techniques are listed in Table 5. All training times are given in seconds. For training time, negative (-) gains indicate that less training time is needed to converge. For all cases of classification, our approach trains in approximately 98.5% to 98.7% less time than the baseline, because its architecture contains fewer neurons. The proposed network does not overfit because of early stopping at the time of convergence. This allows the proposed classifier to generalize better on test samples.
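The Gain rows of Table 5 can be reproduced from the per-class training times. The sketch below assumes that gain is the relative change of the mean proposed training time with respect to the mean baseline training time; this definition is an assumption that happens to match the reported values, and `time_gain` is a hypothetical helper.

```python
# Training times in seconds from Table 5, keyed by CNN and training images
# per class. Each list holds (baseline, proposed) for 3, 4 and 5 classes.
TRAINING_TIMES = {
    "ResNet18": {
        50: [(990, 15), (1110, 16), (1808, 18)],
        100: [(1110, 17), (1230, 19), (1832, 22)],
        200: [(1130, 18), (1230, 19), (1868, 23)],
    },
    "VGG19": {
        50: [(1215, 18), (1255, 22), (1935, 28)],
        100: [(1315, 20), (1505, 23), (2115, 29)],
        200: [(1325, 23), (1535, 25), (2175, 30)],
    },
}


def time_gain(rows):
    """Relative change (%) of mean proposed time versus mean baseline time."""
    baseline = sum(b for b, _ in rows) / len(rows)
    proposed = sum(p for _, p in rows) / len(rows)
    return round(100.0 * (proposed - baseline) / baseline, 1)


gains = {
    cnn: {n_images: time_gain(rows) for n_images, rows in by_images.items()}
    for cnn, by_images in TRAINING_TIMES.items()
}
```

Under this definition the computed gains agree with the Gain rows of Table 5 for both networks.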

6 Conclusion

A new transfer learning approach using the final fully-connected layer's output features (1000 dimensions) is proposed in this work. The proposed single layer perceptron classifier is used to classify new classes from the extracted features. We empirically examine and compare the transfer learning performance of the baseline and the proposed technique. In terms of training time, the baseline approach lags far behind the proposed approach when trained on a CPU. The proposed classifier converges during training in far less time than the baseline, which is a crucial attribute in the field of deep learning. In addition, the proposed classifier outperforms the baseline technique in the majority of cases and yields very similar results in the other cases. Furthermore, guided by our RQs, the behaviour of the final fully-connected layer's output features in transfer learning is established by empirical investigation. It is pertinent to mention that experiments with new classes having higher similarity to the base dataset and the least similarity among themselves yielded better results. This might be explained by the fact that the final fully-connected layer features are more specific to the base task. Overall classification accuracy increases with the number of training samples. We hope our thorough investigation will help researchers formulate best practices for efficient use of the proposed strategy. In future, we want to explore the transferability of the final fully-connected layer's output features in other visual tasks, for example object detection, image captioning and image recognition, and to revisit our RQs with more classes. Moreover, we will investigate the proposed transfer learning approach with a multi-layer perceptron.


  • [1] P. Agrawal, R. Girshick, and J. Malik. Analyzing the performance of multilayer neural networks for object recognition. In European conference on computer vision, pages 329–344. Springer, 2014.
  • [2] J. Atkinson. The developing visual brain. Oxford University Press UK, 2002.
  • [3] H. Azizpour, A. Sharif Razavian, J. Sullivan, A. Maki, and S. Carlsson. From generic to specific deep representations for visual recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 36–45, 2015.
  • [4] Y. Bengio. Deep learning of representations for unsupervised and transfer learning. In Proceedings of ICML Workshop on Unsupervised and Transfer Learning, pages 17–36, 2012.
  • [5] Y. Bengio, A. Bergeron, N. Boulanger-Lewandowski, T. Breuel, Y. Chherawala, M. Cisse, D. Erhan, J. Eustache, X. Glorot, X. Muller, et al. Deep learners benefit more from out-of-distribution examples. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 164–172, 2011.
  • [6] L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010.
  • [7] J. Carreira, P. Agrawal, K. Fragkiadaki, and J. Malik. Human pose estimation with iterative error feedback. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4733–4742, 2016.
  • [8] R. Caruana. Learning many related tasks at the same time with backpropagation. In Advances in neural information processing systems, pages 657–664, 1995.
  • [9] J. Dai, K. He, and J. Sun. Instance-aware semantic segmentation via multi-task network cascades. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3150–3158, 2016.
  • [10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. Ieee, 2009.
  • [11] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2625–2634, 2015.
  • [12] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531, 2013.
  • [13] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.
  • [14] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256, 2010.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [17] C. J. Holder, T. P. Breckon, and X. Wei. From on-road to off: transfer learning within a deep convolutional neural network for segmentation and classification of off-road scenes. In European Conference on Computer Vision, pages 149–162. Springer, 2016.
  • [18] M. Huh, P. Agrawal, and A. A. Efros. What makes imagenet good for transfer learning? arXiv preprint arXiv:1608.08614, 2016.
  • [19] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3128–3137, 2015.
  • [20] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [21] Q. V. Le, A. Karpenko, J. Ngiam, and A. Y. Ng. Ica with reconstruction cost for efficient overcomplete feature learning. In Advances in neural information processing systems, pages 1017–1025, 2011.
  • [22] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436, 2015.
  • [23] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [24] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th annual international conference on machine learning, pages 609–616. ACM, 2009.
  • [25] Z. Li and D. Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
  • [26] R. Mormont, P. Geurts, and R. Marée. Comparison of deep transfer learning strategies for digital pathology. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 2262–2271, 2018.
  • [27] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814, 2010.
  • [28] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1717–1724, 2014.
  • [29] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. Scikit-learn: Machine learning in python. Journal of machine learning research, 12(Oct):2825–2830, 2011.
  • [30] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • [31] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. In International Conference on Learning Representations (ICLR), 2014.
  • [32] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. Cnn features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 806–813, 2014.
  • [33] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems, pages 568–576, 2014.
  • [34] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [35] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
  • [36] Y. Tamaazousti, H. L. Borgne, C. Hudelot, M. E. A. Seddik, and M. Tamaazousti. Learning more universal representations for transfer-learning. arXiv preprint arXiv:1712.09708, 2017.
  • [37] D. Weinshall and G. Cohen. Curriculum learning by transfer learning: Theory and experiments with deep networks. arXiv preprint arXiv:1802.03796, 2018.
  • [38] P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid. Deepflow: Large displacement optical flow with deep matching. In Proceedings of the IEEE International Conference on Computer Vision, pages 1385–1392, 2013.
  • [39] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In Advances in neural information processing systems, pages 3320–3328, 2014.
  • [40] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014.