Deep Transfer Learning with Ridge Regression

06/11/2020 ∙ by Shuai Tang, et al. ∙ University of California, San Diego

The large amount of online data and vast array of computing resources enable current researchers in both industry and academia to employ the power of deep learning with neural networks. While deep models trained with massive amounts of data demonstrate promising generalisation ability on unseen data from relevant domains, the computational cost of finetuning gradually becomes a bottleneck in transferring the learning to new domains. We address this issue by leveraging the low-rank property of learnt feature vectors produced from deep neural networks (DNNs) with the closed-form solution provided in kernel ridge regression (KRR). This frees transfer learning from finetuning and replaces it with an ensemble of linear systems with many fewer hyperparameters. Our method is successful on supervised and semi-supervised transfer learning tasks.




1 Introduction

Besides soaring high on tasks they are trained on, deep neural networks have also excelled on tasks where datasets are collected from similar domains. Prior work Yosinski et al. (2014) showed that filters/parameters learnt in DNNs pretrained on ImageNet generalise better with slight finetuning than those learnt from random initialisations. Since then, applications in Computer Vision have had major breakthroughs by initialising DNNs with pretrained parameters and finetuning them to adapt to new tasks. Similarly, Natural Language Processing (NLP) welcomed its “ImageNet Era” with large and deep pretrained language models including BERT Devlin et al. (2018), and state-of-the-art performance on downstream NLP tasks has been renewed frequently by employing more data and deeper models during pretraining, and using smarter methods for finetuning.

These advances in transfer learning using pretrained DNNs and finetuning, however, come with a large computational cost. An essential step to boost the performance on a given new task is to finetune the pretrained DNN until it converges, which is computationally intense since these models tend to have hundreds of millions of parameters. An alternative approach is to freeze the parameters and treat the pretrained DNN model as a feature extractor which produces abstracted vector representations of data samples with the knowledge from pretraining, and then train a simple classifier on top of these extracted vectors. But as the parameters are not adapted to the new task, the latter approach provides inferior performance to finetuning.

We here propose a new way of augmenting the latter approach without finetuning the DNN. Our approach is to take an accumulation of feature vectors produced at individual layers, which encode different aspects of the data. Since feature vectors are highly correlated with each other, as they are generated from a single DNN, only a few of them are needed to make predictions. We adopt the alignment maximisation algorithm for combining kernels Cortes et al. (2012), in which we first find a convex combination of linear kernels constructed from individual layers that gives maximal alignment with the target kernel constructed from one-hot encoding of the labels. Then, we take the ensemble of feature vectors of layers selected by non-zero elements in the sparse combination, and make predictions using kernel ridge regression (KRR).

2 Related Work

Transfer learning with classical machine learning methods has been studied for a couple of decades Pan and Yang (2010), including boosting Dai et al. (2007), support vector machines Muandet et al. (2013), ridge regression Cortes and Mohri (2011), etc. These methods benefit from the transparency of classical machine learning models, and universal function approximators including boosting and kernel methods with strong theoretical guarantees. However, it is not easy to incorporate structural priors into regularising the learning process, such as our knowledge about images and text. This information is crucial in advancing machine learning systems.

Neural networks are also universal function approximators Hornik et al. (1989), and learnt vectorised representations are generalisable across tasks, with recent advances in various architecture designs specifically for individual types of inputs, including convolutional layers for image recognition LeCun et al. (1998), recurrent layers Elman (1990); Hochreiter and Schmidhuber (1997) and transformers Vaswani et al. (2017) for text processing, etc. Recent research has demonstrated that deep models pretrained on large amounts of training data give decent performance on unseen data sampled from relevant domains Yosinski et al. (2014) by finetuning. With growing depth of networks, the cost of finetuning becomes non-negligible. Efforts in knowledge distillation from deep models to shallow ones Ba and Caruana (2014); Hinton et al. (2015) and to simple ones Frosst and Hinton (2017) showed that neural networks can be simplified after learning, although the learnt transferable features can potentially be degraded during distillation.

Our approach takes the best of both worlds by using feature vectors produced from multiple layers of a pretrained neural network but without explicit finetuning, and makes predictions with KRRs on a downstream task. With help from low-rank approximations, our approach only requires passing the training data once through a neural network without backpropagation.

3 Method

The key concept is to apply KRR with a few layers of feature vectors produced from a pretrained neural network to make predictions, classification in our case, on a downstream task. The notations include: $X \in \mathbb{R}^{n \times d}$ is the data matrix with $n$ samples with each sample in $d$-dimensional space, $Y$ is the corresponding labels with one-hot encoding, $X_l$ is the flattened feature vectors produced at the $l$-th layer from a pretrained neural network, $S \in \mathbb{R}^{m \times n}$ is the random projection matrix that meets the requirement of subspace embedding with $m \ll n$, $I$ is the identity matrix, $L$ is the number of layers in a pretrained neural network, and $\lambda$ is the regularisation term in ridge regression. Other notations will be introduced as needed.

3.1 Low-rank Approximation at Individual Layers

Flattened feature vectors generated from neural networks are generally high-dimensional and redundant. We therefore build on the theoretical work of Udell and Townsend (2019) showing that big data matrices are approximately low rank, and use random projections to obtain low-rank approximations of high-dimensional feature vectors with many fewer dimensions. Given that the Nyström method is well-studied in approximating large-scale kernel matrices Gittens and Mahoney (2016), we follow the formula to approximate a linear kernel as

$$X_l X_l^\top \approx X_l X_l^\top S^\top \left( S X_l X_l^\top S^\top \right)^{\dagger} S X_l X_l^\top \qquad (1)$$

where $X_l$ holds the feature vectors of the $l$-th layer, $S$ is the random projection matrix with $m$ rows, and $(\cdot)^{\dagger}$ is the pseudo-inverse of a square matrix. If the feature dimension $d_l$ of the layer exceeds $m$, which is mostly the case for feature vectors generated from neural networks, and the eigendecomposition is written as $S X_l X_l^\top S^\top = U \Lambda U^\top$, then the low-rank approximation of $X_l$ can be obtained by $X_l X_l^\top S^\top U (\Lambda^{\dagger})^{1/2}$ with each sample in at most $m$-dimensional space. As we aim to conduct layer-wise low-rank approximations, it is preferable to apply sparse random projections instead of dense ones. Therefore, we consider a stack of CountSketch Clarkson and Woodruff (2013) to approximate the sparse Johnson-Lindenstrauss Transformation Woodruff (2014).
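The layer-wise Nyström step can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the implementation used in our experiments: the function names, the single CountSketch (instead of a stack) and the eigenvalue tolerance used for the pseudo-inverse are all choices made here for the example.

```python
import numpy as np

def countsketch_rows(X, m, rng):
    """Apply a CountSketch with m buckets to the rows (samples) of X
    without materialising the projection matrix."""
    n = X.shape[0]
    buckets = rng.integers(0, m, size=n)        # bucket index of each sample
    signs = rng.choice([-1.0, 1.0], size=n)     # random sign of each sample
    SX = np.zeros((m, X.shape[1]))
    np.add.at(SX, buckets, signs[:, None] * X)  # signed sum of rows per bucket
    return SX

def nystrom_linear_features(X, m, seed=0):
    """Return Z such that Z @ Z.T is the Nystrom approximation of X @ X.T.
    Z has at most m columns."""
    rng = np.random.default_rng(seed)
    SX = countsketch_rows(X, m, rng)            # S X, shape (m, d)
    C = X @ SX.T                                # K S^T, shape (n, m)
    W = SX @ SX.T                               # S K S^T, shape (m, m)
    evals, evecs = np.linalg.eigh(W)
    keep = evals > 1e-10 * evals.max()          # assumed tolerance for the pseudo-inverse
    return C @ (evecs[:, keep] / np.sqrt(evals[keep]))
```

When the feature matrix has rank below `m`, the approximation is exact up to numerical error, which gives a convenient sanity check for the sketch.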

In CountSketch, the random projection matrix is considered as a hash table that uniformly hashes samples into buckets with a binary value randomly sampled from so there is no need to materialise . Successful applications of CountSketch including polynomial kernel approximation Pham and Pagh (2013) and large-scale regressions are due to its scalability with theoretical guarantees when few hash tables are used Woodruff (2014). Generally, larger leads to better approximations, yet the performance improvement becomes marginal. Prior work Jagadeesan (2019) showed that empirically works on real-world datasets, thus, we set , and it drastically reduces the cost for low-rank approximations at individual layers. The time complexity of Nyström is .

With limited GPU memory, producing feature vectors for a downstream task given a pretrained neural network is often done in batches of samples. CountSketch is also well-suited in this situation as, technically, the approximation can be done in only one forward pass of .
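The batch-friendliness of CountSketch follows from its additive structure: each sample only adds a signed copy of its feature vector to one bucket. The sketch below illustrates this with NumPy; `make_countsketch` and `sketch_in_batches` are hypothetical helper names, and the batches stand in for feature vectors produced by a frozen network.

```python
import numpy as np

def make_countsketch(n, m, seed=0):
    """Draw the hash (bucket index) and sign of each of the n samples."""
    rng = np.random.default_rng(seed)
    buckets = rng.integers(0, m, size=n)
    signs = rng.choice([-1.0, 1.0], size=n)
    return buckets, signs

def sketch_in_batches(batches, buckets, signs, m):
    """Accumulate the sketched feature matrix one batch at a time,
    so the full feature matrix never has to be held in memory."""
    SX, start = None, 0
    for xb in batches:                          # xb: (batch_size, d) features
        if SX is None:
            SX = np.zeros((m, xb.shape[1]))
        stop = start + xb.shape[0]
        np.add.at(SX, buckets[start:stop], signs[start:stop, None] * xb)
        start = stop
    return SX
```

The result is identical to multiplying the full feature matrix by the dense projection matrix, which is never formed.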

3.2 Convex Combination of Features across Layers by Learning Kernel Alignment

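Storing feature vectors at all $L$ layers is memory-intensive, thus we aim to select only a few layers that give the maximum alignment with the target. Specifically, a non-negative weight vector $\mu$ with one entry per layer is optimised to maximise the following alignment Cortes et al. (2012):

$$\max_{\mu \geq 0,\ \|\mu\|_2 = 1} \frac{\langle K_\mu^c, K_Y^c \rangle_F}{\|K_\mu^c\|_F \, \|K_Y^c\|_F}, \qquad K_\mu = \sum_{l=1}^{L} \mu_l K_l \qquad (2)$$

where $K_l$ is the linear kernel constructed from the $l$-th layer, $K_Y = Y Y^\top$ is the target kernel constructed from the one-hot labels, and the superscript $c$ denotes a centred kernel.

Proposition 9 in Cortes et al. (2012) showed that it is equivalent to the quadratic programming problem: $v^\star = \arg\min_{v \geq 0} v^\top M v - 2 v^\top a$, where $M_{kl} = \langle K_k^c, K_l^c \rangle_F$ and $a_k = \langle K_k^c, K_Y \rangle_F$, then $\mu^\star = v^\star / \|v^\star\|$. Intuitively, non-zero entries in $\mu^\star$ provide a weighted sparse combination of feature vectors from a few layers that gives the highest linear alignment with targets. The time complexity is dominated by materialising $M$.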
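The quadratic programme of Cortes et al. (2012) is small, with one variable per layer, so it can be solved with a few lines of code. The following NumPy sketch uses projected gradient descent, which is a solver chosen here for brevity and not necessarily the one used in our experiments; the function names and the number of steps are assumptions.

```python
import numpy as np

def center(K):
    """Centre a kernel matrix in feature space."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def alignment_weights(features, Y, steps=5000):
    """features: list of (n, d_l) arrays, one per layer; Y: (n, c) one-hot labels.
    Returns non-negative, unit-norm layer weights maximising kernel alignment."""
    Ks = [center(F @ F.T) for F in features]    # centred linear kernels
    KY = Y @ Y.T                                # target kernel
    M = np.array([[np.sum(Ka * Kb) for Kb in Ks] for Ka in Ks])
    a = np.array([np.sum(Ka * KY) for Ka in Ks])
    step = 1.0 / (2.0 * np.linalg.eigvalsh(M).max())  # 1 / Lipschitz constant
    v = np.zeros(len(Ks))
    for _ in range(steps):
        grad = 2.0 * M @ v - 2.0 * a            # gradient of v'Mv - 2v'a
        v = np.maximum(v - step * grad, 0.0)    # project onto v >= 0
    return v / np.linalg.norm(v)
```

A layer whose features are close to the label indicators should receive a much larger weight than a layer of pure noise, which is what the sanity check for this sketch verifies.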

The kernel induces an embedding space which is a concatenation of feature vectors weighted by , then the optimisation problem in Eq. 2 can be written in a weight-space perspective:


It is worth noting that the objective is not the “goodness-of-fit” measure for linear regression, the $R^2$ statistic, which is defined through the eigenvectors of the kernel. Optimising the layer weights to maximise $R^2$ will lead to drastic overfitting by accumulating all layers, and subsequently to a meaningless solution. The aforementioned objective finds a convex combination of features that maximises the alignment between the subspace spanned by the concatenated features and that by the one-hot encoded label space, so it stops accumulating more feature vectors once an optimal subset is obtained. Therefore, the alignment-based objective prevents overfitting to a certain degree.

3.3 [Optional Step] Nyström for Large-scale Kernel Approximation

We denote as the number of layers with positive ’s. Since, in the end, the predictions are made by kernel ridge regression, if is at a manageable order, then there is no need to conduct kernel approximation through Nyström. However, low-rank approximation can potentially help reduce the noise in data, which leads to a better generalisation compared to computing the exact kernel function.

We consider approximating an RBF kernel function with the Nyström method using the same subsampling in Sec. 3.1, CountSketch, to further promote fast computation on accumulated feature vectors. The time complexity of this step depends on the number of buckets in the hash functions, the number of samples and the dimension of the accumulated feature vectors. The kernel hyperparameter is set heuristically. One could cross-validate it as well; however, for the sake of reducing the complexity of transfer learning, we stick to the heuristic value.
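As an illustration of this optional step, the sketch below builds a Nyström feature map for an RBF kernel in NumPy. The landmark points are passed in explicitly (for instance, the CountSketch-ed rows of the accumulated features); the function names, the bandwidth parameter `gamma` and the eigenvalue tolerance are assumptions of this example and do not reproduce the heuristic value used in our experiments.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """RBF kernel exp(-gamma * ||a - b||^2) between rows of A and rows of B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def nystrom_rbf_features(X, landmarks, gamma):
    """Low-rank feature map Z with Z @ Z.T approximating rbf_kernel(X, X)."""
    W = rbf_kernel(landmarks, landmarks, gamma)   # (m, m)
    C = rbf_kernel(X, landmarks, gamma)           # (n, m)
    evals, evecs = np.linalg.eigh(W)
    keep = evals > 1e-10 * evals.max()            # assumed pseudo-inverse tolerance
    return C @ (evecs[:, keep] / np.sqrt(evals[keep]))
```

Using every sample as a landmark recovers the exact kernel, so fewer landmarks trade accuracy for speed.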

3.4 Ridge Regression for Predictions

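The approximated low-rank feature map of an RBF function is denoted as $\Phi$ for the training samples and $\phi(x)$ for a single sample. Given a new data sample $x$, the prediction is given by the closed-form solution of ridge regression:

$$\hat{y} = \phi(x)^\top \left( \Phi^\top \Phi + \lambda I \right)^{-1} \Phi^\top Y$$

where $Y$ holds the one-hot labels and $\lambda$ is the regularisation term. Then the label of a test sample is the index of the maximum value in the predicted $\hat{y}$. The time complexity of ridge regression is determined by the inverse of a square matrix and the matrix multiplication that gives the square matrix, and it is $O(nk^2 + k^3)$ for $n$ training samples and a $k$-dimensional feature map.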
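The closed-form prediction step amounts to solving one linear system. A minimal NumPy sketch follows; the function names are hypothetical, and the feature maps are assumed to be given as row-wise matrices.

```python
import numpy as np

def ridge_fit(Phi, Y, lam):
    """Closed-form ridge regression weights for features Phi (n, k)
    and one-hot labels Y (n, c) with regularisation strength lam."""
    k = Phi.shape[1]
    # Solve (Phi' Phi + lam I) W = Phi' Y instead of forming an explicit inverse.
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(k), Phi.T @ Y)

def ridge_predict_labels(Phi_new, W):
    """Predicted label = index of the maximum regression output."""
    return np.argmax(Phi_new @ W, axis=1)
```

Solving the system rather than inverting the matrix is a numerical choice made here; both correspond to the same closed-form solution.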

In summary, our proposed method has four steps including 1) CountSketch to obtain low-rank feature vectors at individual layers to a manageable size, 2) convex combination to take weighted accumulation of feature vectors, 3) Nyström for approximating an RBF kernel, and 4) KRR to make predictions. Compared to multiple forward and backward passes required in finetuning or training classifiers, our method drastically reduces the computational cost.

4 Experiments

We demonstrate the effectiveness of our method through experiments on transferring ResNet-based models He et al. (2015, 2016) pretrained on the ImageNet dataset Deng et al. (2009); Russakovsky et al. (2015) to downstream tasks, including three in-domain datasets, CIFAR-10, CIFAR-100 Krizhevsky (2009), STL10 Coates et al. (2011), and three out-of-domain ones, Street View House Number (SVHN) Netzer et al. (2011), Caltech-UCSD-200 (CUB200) Welinder et al. (2010), Kuzushiji49 Clanuwat et al. (2018). (Footnote 1: The full Kuzushiji49 dataset has 232k training images, which takes too long to cross-validate hyperparameters for LogReg. Thus, the same half of the dataset is used.) Basic statistics of each dataset are presented in Table 1.

Dataset       Role                     # Train / # Test   [# Classes]
ImageNet      Training                 1.2m / -           [1000]
CIFAR10       In-domain Transfer       50k / 10k          [10]
CIFAR100      In-domain Transfer       50k / 10k          [100]
STL10         In-domain Transfer       5k / 8k            [10]
SVHN          Out-of-domain Transfer   73k / 26k          [10]
CUB200        Out-of-domain Transfer   6k / 6k            [200]
Kuzushiji49   Out-of-domain Transfer   116k / 38k         [49]
Table 1: Dataset details. Each row indicates (# Training Samples / # Test Samples / [# Classes]).

Hyperparameter Settings: We report results with , and . ResNet-18 and ResNet-34 pretrained on ImageNet are selected as base models to transfer from. To reduce the memory cost, instead of hashing all layers, we only hash feature vectors from every residual block in a model as each block usually has two or three convolutional layers. The regularisation strength is cross-validated on the training set of the downstream task with values ranging from .

Comparison Partner: Finetuning the top layer on each downstream task with softmax regression.

Models are finetuned for 30 epochs with the Adam optimiser, and the learning rate decays by a factor of 2 every 10 epochs. Cross-validation is conducted to optimise the following hyperparameters: data augmentation (with or without), weight decay rate (three candidate values), and initial learning rate (two candidate values). Note that finetuning with data augmentation tremendously increases the training time as the neural network needs to be kept during finetuning, while for others one can store feature vectors from the last layer prior to finetuning. Results are marked with LogReg in the following tables and figures.


Since our method involves random projections and comparison partners require initialisation, for fair comparison, we run each method five times with different random seeds, and each marker in each plot presents the mean of five trials along with a vertical bar indicating the standard deviation. The vertical bars are often invisible, as the hyperparameters of each method are cross-validated on the training set. The main results are presented in Tab. 2.


         CIFAR10 [In]    CIFAR100 [In]   STL10 [In]      CUB200 [Out]    SVHN [Out]      Kuzushiji49 [Out]
LogReg   87.45 / 89.94   69.08 / 72.76   95.08 / 96.55   60.80 / 61.60   64.36 / 59.47   74.56 / 71.08
Ours     90.77 / 92.31   71.31 / 74.63   96.30 / 97.31   58.78 / 61.70   88.76 / 88.53   88.12 / 88.00
Table 2: Results of supervised transfer learning. Median accuracy of five trials is reported in each cell, and each cell has two accuracy terms of transferring from [ResNet-18 / ResNet-34]. Except for CUB200, our method outperforms LogReg significantly since the variance of five trials is very small as presented in figures.

4.1 Supervised Transfer with Varying Portions of Training Samples

Since individual downstream tasks have ample samples in the training set, we study our method and its comparison method when varying the portion of training samples. Specifically, the kept portion of training samples is varied over values spaced linearly in log-space. The results are presented in Fig. 1.

Our method significantly outperforms finetuning the top layer on five out of six transfer tasks with different portions of training samples, and only performs similarly to finetuning on CUB200, which is a fine-grained bird species recognition task.

(a) In-domain transfer tasks
(b) Out-of-domain transfer tasks
Figure 1: Supervised transfer with varying portions of training samples from the transfer task. Except for CUB200, our method with all three ’s generalises better than LogReg does (purple lines in plots) when the portion of training samples varies from to , and the observation is consistent across two different depths of ImageNet models.
Figure 2: Convex combination of layers vs. Varying portions of training samples. In-domain transfer tasks assign higher values to top few layers, and out-of-domain ones tend to give higher values to slightly lower layers.

4.2 Insights provided by the Layer Weights

The solution to Eq. 2 provides insights on the number of accumulated layers and their weights. We plot a heatmap with the y-axis indicating the index of layers, the x-axis indicating the portion of training samples, and a gradient colour scheme presenting the value of each layer weight in Fig. 2.

As shown in He et al. (2016), the penultimate layer (index 11 for ResNet-18 and 19 for ResNet-34) of a ResNet removes all spatial information by averaging outputs from the previous layer. As illustrated in Fig. 2, layers before the penultimate layer have been assigned non-zero weights across six tasks, confirming that preserving spatial information helps in transfer learning.

For in-domain transfer tasks, it turns out that the top few layers are the most useful, and the improvement of our method is brought by the ability to identify and accumulate these layers. STL10 contains images from the ImageNet dataset but with lower resolutions, so the penultimate layer provides adequately abstract information of the images, which explains the observation that our method assigns a dominating weight to the penultimate layer.

For SVHN and Kuzushiji49, clearly, the selected layers don’t include the feature vectors generated from the penultimate layer, and lower layers receive higher weights, which results in better performance than finetuning the top linear layer. However, our method doesn’t provide better performance compared to finetuning the last layer on CUB200. A potential explanation comes from the fact that kernel ridge regression learns one-vs-all classifiers, and it is suitable when many classes are presented. This is a limitation of our method, but also a research direction for future study.

(a) In-domain transfer tasks
(b) Out-of-domain transfer tasks
Figure 3: Accuracy of accumulating Residual Blocks vs. that of individual ones. Line plots indicate the accuracy of accumulating blocks until the exhaustion of non-zero weights, and bar plots indicate the performance of these blocks separately. (I) In-domain transfer tasks demonstrate increasing accuracy when blocks are accumulated gradually, as does Kuzushiji49, which validates that accumulating layers helps. (II) Note that the weight values don’t directly imply the importance of layers, and that explains why the bar plots don’t have a monotonic trend.

4.3 Accumulated Feature Vectors vs Individual Feature Vectors

As the embedding in our method is a weighted concatenation of feature vectors from layers with non-zero weights, it is important to conduct a sanity check on the effectiveness of accumulating layers compared to using these layers alone. Therefore, we gradually accumulate layers sorted by their weights, and plot the performance curve versus the number of accumulated layers. Then these layers are applied individually to make predictions as a comparison. We use the full training dataset in this subsection. The results are shown in Fig. 3.

For in-domain transfer tasks, we see that the performance improves as our method accumulates layers, while the trend is not obvious/significant for out-of-domain transfer tasks. Overall, accumulating a few layers provides better performance than making predictions based on individual layers.

(a) in-domain transfer tasks
(b) out-of-domain transfer tasks
Figure 4: Accuracy of semi-supervised learning with varying number of labelled samples per class. The number of labelled examples changes from 2 to 100 per class and the rest are left unlabelled for semi-supervised learning. The left y-axis for line plots refers to the accuracy of semi-supervised learning, and the right y-axis for bar plots refers to the relative improvement brought by unlabelled data. (I) Our method gives better performance than LogReg overall except for CUB200. (II) Our method is also better at leveraging unlabelled samples for learning, as indicated by taller bars for ours than for LogReg, except for STL10 and CUB200.

4.4 Semi-supervised Transfer Learning via Transductive Regression

There are many ways of incorporating unlabelled data into kernel ridge regression, including manifold regularisation Belkin et al. (2006) and transductive learning Cortes and Mohri (2006). Since manifold regularisation requires exact computation or an approximation of the Laplacian matrix on labelled and unlabelled samples, which leads to increased learning time, we adopted the transductive learning method for regression problems to leverage unlabelled data when extremely limited labelled training samples with large amount of unlabelled samples are provided. The solution of transductive ridge regression Cortes and Mohri (2006) is


where two hyperparameters control the contribution from unlabelled data and labelled data respectively, and can be cross-validated on the labelled data. The targets for the unlabelled samples come from the ridge regression model learnt only on labelled data: a function sets the maximum value of the vector prediction of a data sample to one constant and the rest to another, and the resulting target involves a further hyperparameter. For finetuning the top layer, we use the supervised classifier trained on labelled samples to annotate unlabelled data samples, and incorporate these samples into the training set and retrain the classifier with cross-validation.
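A rough NumPy sketch of transductive ridge regression in the spirit of Cortes and Mohri (2006) is given below. It assumes the objective $\|W\|^2 + C \sum_{\text{labelled}} \|W^\top x_i - y_i\|^2 + C' \sum_{\text{unlabelled}} \|W^\top x_j - \hat{y}_j\|^2$, and it assumes pseudo-targets that are 1 at the predicted class and 0 elsewhere. The names `c_lab`, `c_unl` and `lam`, as well as the exact form of the pseudo-targets, are illustrative assumptions and may differ from the setting used in our experiments.

```python
import numpy as np

def ridge(Phi, Y, lam):
    """Supervised closed-form ridge regression."""
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ Y)

def transductive_ridge(Phi_lab, Y_lab, Phi_unl, c_lab, c_unl, lam=1.0):
    """Closed-form minimiser of the assumed transductive objective."""
    # Step 1: ridge regression learnt only on labelled data.
    W0 = ridge(Phi_lab, Y_lab, lam)
    # Step 2: pseudo-targets for unlabelled samples (assumed: 1 at argmax, 0 elsewhere).
    pred = Phi_unl @ W0
    Y_hat = np.eye(Y_lab.shape[1])[pred.argmax(axis=1)]
    # Step 3: solve (I + C X_l'X_l + C' X_u'X_u) W = C X_l'Y_l + C' X_u'Y_hat.
    k = Phi_lab.shape[1]
    A = np.eye(k) + c_lab * Phi_lab.T @ Phi_lab + c_unl * Phi_unl.T @ Phi_unl
    B = c_lab * Phi_lab.T @ Y_lab + c_unl * Phi_unl.T @ Y_hat
    return np.linalg.solve(A, B)
```

Setting `c_unl` to zero removes the unlabelled term and recovers ordinary ridge regression with regularisation strength `1 / c_lab`.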

We simulate a semi-supervised learning environment by keeping 2, 5, 10, 20, 50, or 100 labelled training samples per class on each dataset, and leave the rest as unlabelled samples. The accuracy of semi-supervised transfer learning on the test set is reported in line plots in Fig. 4, and the relative improvement against supervised transfer learning is reported in bar plots in the same figure.

Our method outperforms finetuning the last layer on five out of six transfer tasks indicated by the lineplots. Our method also gives significant relative improvement when unlabelled samples are incorporated through transductive regression on these five tasks as well, while unlabelled samples don’t improve the generalisation ability for the top layer finetuning method. Negative results are concentrated on CUB200, where unlabelled samples become detrimental to our method while helpful for finetuning the top layer.

5 Discussion

(a) in-domain transfer tasks
(b) out-of-domain transfer tasks
Figure 5: Expected Calibration Error (lower the better) with varying portions of training data. LogReg provides better calibrated models than ours does. However, a posthoc adjustment by Temperature Scaling helps our method to match the calibration performance with LogReg.

5.1 Temperature Scaling for Calibration

Although our method provides both speed-up and accuracy improvement in transfer learning, we are also interested in how well-calibrated our learnt classifier is compared to finetuning the top layer. It is expected that KRR, SVM and tree-based boosted classifiers are not well-calibrated, as the predicted outputs cannot be directly interpreted as classifiers’ confidence Niculescu-Mizil and Caruana (2005). We calculate the Expected Calibration Error (ECE) Guo et al. (2017) for our method and for the baseline model, finetuning the top layer with logistic regression. The formula of ECE is given as

$$\mathrm{ECE} = \sum_{b=1}^{B} \frac{|\mathcal{B}_b|}{n} \left| \mathrm{acc}(\mathcal{B}_b) - \mathrm{conf}(\mathcal{B}_b) \right|$$

where $n$ is the number of samples, $B$ is the number of bins in the estimation, and $\mathcal{B}_b$ is the set of samples whose confidence falls into the $b$-th bin.
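For concreteness, ECE as defined by Guo et al. (2017) can be computed with plain Python as sketched below. The function name and the default of 10 bins are assumptions of this example, and the bins are taken as half-open intervals closed on the left, which differs from some implementations only at bin boundaries.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """confidences: predicted confidence in [0, 1] of each sample;
    correct: 1 if the prediction was right, else 0."""
    n = len(confidences)
    # Per bin: [sample count, sum of confidences, number of correct predictions].
    bins = [[0, 0.0, 0.0] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        b = min(int(conf * n_bins), n_bins - 1)   # confidence 1.0 goes to the last bin
        bins[b][0] += 1
        bins[b][1] += conf
        bins[b][2] += hit
    ece = 0.0
    for count, conf_sum, hit_sum in bins:
        if count:
            # Weighted gap between accuracy and average confidence of the bin.
            ece += (count / n) * abs(hit_sum / count - conf_sum / count)
    return ece
```

For example, four predictions made with confidence 0.95 of which three are correct give a gap of 0.2 between confidence and accuracy.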

The results shown in Fig. 5 validate our expectation that our method gives worse calibration on the test set compared to logistic regression. A simple cure is Temperature Scaling Guo et al. (2017), which optimises a parameter to rescale the output from the classifier in order to reduce the ECE on training data, and doesn’t change the predicted labels. In our case, we can simply cross-validate this parameter very efficiently once the output from KRR is produced.
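Because Temperature Scaling has a single scalar parameter, a grid search over stored classifier outputs is enough. The plain-Python sketch below selects the temperature by negative log-likelihood, the criterion used by Guo et al. (2017); this criterion, the grid and the function names are assumptions of the example, and ECE could be substituted as the selection score.

```python
import math

def softmax(scores, temperature):
    """Softmax of the scores divided by a positive temperature."""
    scaled = [s / temperature for s in scores]
    top = max(scaled)                             # subtract the max for stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(outputs, labels, temperature):
    """Average negative log-likelihood of the true labels."""
    losses = [-math.log(softmax(o, temperature)[y]) for o, y in zip(outputs, labels)]
    return sum(losses) / len(losses)

def fit_temperature(outputs, labels, grid):
    """Pick the temperature in the grid with the lowest negative log-likelihood."""
    return min(grid, key=lambda t: nll(outputs, labels, t))
```

Dividing all scores of a sample by the same positive temperature preserves their order, so the predicted label is unchanged.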

(a) in-domain transfer tasks
(b) out-of-domain transfer tasks
Figure 6: Nyström vs. Random Projection in the first step of our method. The performance improvement of Nyström over random projection is relatively larger on out-of-domain transfer tasks than it is on in-domain ones, and the observation is consistent across varying portions of training data.

5.2 Random Projections

Our method adopted the Nyström method for low-rank approximation of feature vectors at individual layers, which involves hashing data samples into buckets first, then solving a linear system, and the hash functions can be applied across all layers. A more direct approach is to hash individual features in each feature vector into buckets as in . This approach eliminates the step of solving a linear system, which reduces the time complexity to for each layer. The comparison between Nyström and random projection is presented in Fig. 6.
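The more direct alternative, hashing individual features rather than samples, can be sketched as follows in NumPy. The function name and the use of a single hash table are assumptions made for illustration.

```python
import numpy as np

def hash_features(X, m, seed=0):
    """Hash each of the d feature columns of X (n, d) into one of m
    signed buckets, giving an (n, m) representation without solving
    a linear system."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    buckets = rng.integers(0, m, size=d)          # bucket index of each feature
    signs = rng.choice([-1.0, 1.0], size=d)       # random sign of each feature
    ZT = np.zeros((m, X.shape[0]))
    np.add.at(ZT, buckets, (X * signs).T)         # signed sum of features per bucket
    return ZT.T
```

Each input feature contributes to exactly one output coordinate, so the cost per sample is linear in the number of input features.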

Overall, Nyström provides better accuracy across all six tasks than random projection. However, it is noticeable that the difference between the two is smaller on in-domain transfer tasks. The observation also serves as a piece of supporting evidence that our method is relatively consistent when different low-rank approximation schemes are applied in the first step.

5.3 Task-dependent Distillation

Our method in previous sections still requires the pretrained ImageNet model during testing time. Now we present task-dependent distillation to leverage the predictions constructed in our method as a regularisation in training smaller networks for individual downstream tasks.

ResNet 8 / ResNet 10
w/o 91.50 / 92.22 69.24 / 69.41 67.31 / 69.21
w 92.23 / 93.22 69.03 / 70.87 69.91 / 71.64
Table 3: Predictions from our method as a regularisation for training small models on individual tasks. “w/o” refers to training models with cross-entropy loss only and “w” refers to training with MSE loss as a regulariser. The median performance of five random initialisations is reported. Overall, predictions from our method help smaller models to generalise better.

Once predictions on the training set of a task are made by KRR in our method, we store the predictions and remove the pretrained model. This step only requires memory complexity of $O(nc)$, where $n$ is the number of training samples and $c$ is the number of classes.

We train a ResNet with 8 layers and one with 10 layers on individual tasks with the cross-entropy loss, and use Mean Squared Error loss (MSE) to regress the output of neural networks to the predictions made by KRR from our method. The results are presented in Tab. 3. Overall, regularising small models with predictions of our methods from ImageNet models helps them to obtain better generalisation.
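The combined training objective can be written compactly. The NumPy sketch below assumes that the MSE term is applied to the raw network outputs (logits) and that the two terms are mixed with a scalar `weight`; both are assumptions of this example, as is the function name.

```python
import numpy as np

def distillation_loss(logits, labels, krr_predictions, weight=1.0):
    """Cross-entropy on the true labels plus an MSE term that regresses
    the network outputs to the stored KRR predictions."""
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    cross_entropy = -log_probs[np.arange(len(labels)), labels].mean()
    mse = np.mean((logits - krr_predictions) ** 2)
    return cross_entropy + weight * mse
```

With `weight=0` the objective reduces to the plain cross-entropy baseline reported as “w/o”.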

6 Conclusion

We provided a promising four-step Ridge Regression based transfer learning scheme for deep learning models. It doesn’t require finetuning, which simplifies the transfer learning problem to simple regressions, and it is capable of identifying a few layers to accumulate for making better predictions.

We evaluated our method on supervised transfer with varying portions of training data, and handled semi-supervised transfer learning problems via transductive regression. Both show significant improvement compared to finetuning only the last layer.

Our discussion addressed the issue of calibration by Temperature Scaling, and demonstrated the superiority of Nyström over plain random projections. Lastly, we showed that the predictions from our method can be used to improve shallow/small models when training directly on transfer tasks.


Shuai Tang and Virginia R. de Sa are supported by NSF IIS-1817226, and Shuai’s research is partly funded by Adobe’s gift funding. We gratefully thank Charlie Dickens and Wesley J. Maddox for fruitful discussions, and appreciate Mengting Wan and Shi Feng for comments on our draft.


  • J. Ba and R. Caruana (2014) Do deep nets really need to be deep?. ArXiv abs/1312.6184. Cited by: §2.
  • M. Belkin, P. Niyogi, and V. Sindhwani (2006) Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research 7, pp. 2399–2434. Cited by: §4.4.
  • T. Clanuwat, M. Bober-Irizar, A. Kitamoto, A. Lamb, K. Yamamoto, and D. Ha (2018) Deep learning for classical japanese literature. ArXiv abs/1812.01718. Cited by: Appendix A, §4.
  • K. L. Clarkson and D. P. Woodruff (2013) Low rank approximation and regression in input sparsity time. In STOC ’13, Cited by: §3.1.
  • A. Coates, A. Y. Ng, and H. Lee (2011) An analysis of single-layer networks in unsupervised feature learning. In AISTATS, Cited by: Appendix A, §4.
  • C. Cortes, M. Mohri, and A. Rostamizadeh (2012) Algorithms for learning kernels based on centered alignment. Journal of Machine Learning Research 13, pp. 795–828. Cited by: Appendix B, §1, §3.2.
  • C. Cortes and M. Mohri (2006) On transductive regression. In Advances in Neural Information Processing Systems, Cited by: §4.4.
  • C. Cortes and M. Mohri (2011) Domain adaptation in regression. In Algorithmic Learning Theory, Cited by: §2.
  • W. Dai, Q. Yang, G. Xue, and Y. Yu (2007) Boosting for transfer learning. In International Conference on Machine Learning, Cited by: §2.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li (2009) ImageNet: a large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Cited by: §4.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805. Cited by: §1.
  • J. L. Elman (1990) Finding structure in time. Cognitive Science 14, pp. 179–211. Cited by: §2.
  • N. Frosst and G. E. Hinton (2017) Distilling a neural network into a soft decision tree. ArXiv abs/1711.09784. Cited by: §2.
  • A. Gittens and M. W. Mahoney (2016) Revisiting the nyström method for improved large-scale machine learning. The Journal of Machine Learning Research 17 (1), pp. 3977–4041. Cited by: §3.1.
  • C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. In International Conference on Machine Learning, Cited by: §5.1, §5.1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §4.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Identity mappings in deep residual networks. ArXiv abs/1603.05027. Cited by: §4.2, §4.
  • G. E. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. ArXiv abs/1503.02531. Cited by: §2.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9, pp. 1735–1780. Cited by: §2.
  • K. Hornik, M. B. Stinchcombe, and H. White (1989) Multilayer feedforward networks are universal approximators. Neural Networks 2, pp. 359–366. Cited by: §2.
  • M. Jagadeesan (2019) Understanding sparse jl for feature hashing. In Advances in Neural Information Processing Systems, Cited by: §3.1.
  • A. Krizhevsky (2009) Learning Multiple Layers of Features from Tiny Images. Master’s Thesis, University of Toronto, Toronto. Cited by: Appendix A, §4.
  • Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Cited by: §2.
  • K. Muandet, D. Balduzzi, and B. Schölkopf (2013) Domain generalization via invariant feature representation. In International Conference on Machine Learning, Cited by: §2.
  • Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng (2011) Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, Cited by: Appendix A, §4.
  • A. Niculescu-Mizil and R. Caruana (2005) Predicting good probabilities with supervised learning. In International Conference on Machine Learning, Cited by: §5.1.
  • S. J. Pan and Q. Yang (2010) A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22, pp. 1345–1359. Cited by: §2.
  • N. Pham and R. Pagh (2013) Fast and scalable polynomial kernels via explicit feature maps. In SIGKDD Conference on Knowledge Discovery and Data Mining, Cited by: §3.1.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115, pp. 211–252. Cited by: §4.
  • M. Udell and A. Townsend (2019) Why are big data matrices approximately low rank?. SIAM Journal on Mathematics of Data Science 1 (1), pp. 144–160. Cited by: §3.1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, Cited by: §2.
  • P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona (2010) Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001. Cited by: Appendix A, §4.
  • D. P. Woodruff (2014) Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science 10, pp. 1–157. Cited by: §3.1, §3.1.
  • J. Yosinski, J. Clune, Y. Bengio, and H. Lipson (2014) How transferable are features in deep neural networks?. In Advances in Neural Information Processing Systems, Cited by: §1, §2.

Appendix A Dataset Descriptions

CIFAR10 Krizhevsky (2009) consists of 60,000 images, each of size 32×32. A train/test split is provided: the training set contains 50,000 images and the test set contains the remaining 10,000. Each image has an object at the center, and the object categories are similar to those in the ImageNet dataset.

CIFAR100 Krizhevsky (2009) has 60,000 images as well, each of size 32×32. The ratio of training to test images is the same as in CIFAR10. Each image has an object at the center, and there are 100 object categories in total, which makes the task harder than CIFAR10.

STL10 Coates et al. (2011) has 500 images for training and 800 images for testing per class, each of size 96×96, and there are 10 classes in total. Since the images in this dataset are labelled samples from ImageNet at lower resolution, models pretrained on ImageNet are expected to generalise well.

SVHN Netzer et al. (2011) consists of real-world images of house numbers obtained from Google Street View, so there are 10 categories, one per digit. The training set contains 73,257 images, and the test set contains 26,032. The dataset also provides a set of unlabelled images, which we did not use in our study.

CUB200 Welinder et al. (2010) is an image dataset with photos of 200 bird species (mostly North American). The total number of training images is , therefore, each class has around 30 training examples. The task is considered difficult because it requires the model to attend to fine details of the bird in each image, which makes it a fine-grained classification problem.

Kuzushiji49 Clanuwat et al. (2018) is a Japanese character recognition task covering 48 Hiragana characters and one Hiragana iteration mark. The dataset is much larger than the aforementioned ones and contains only gray-scale images. The training set contains 232,365 images, of which we used only half in our study. The test set contains 38,547 images, which are used to evaluate the effectiveness of our method and other methods.

Appendix B Results: RBF Baselines

RBF kernels, being universal kernels, are widely used in many research domains. Since our method feeds features produced by neural networks trained on ImageNet into an RBF kernel, it is reasonable to compare against a baseline that applies an ensemble of RBF kernels with various bandwidths directly to the vectorised images. The Nyström approximation is applied to reduce the memory complexity.
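The baseline described above can be sketched roughly as follows. This is a minimal illustration under our own assumptions, not the configuration used in the experiments: the helper names (`rbf_kernel`, `nystroem_features`, `ridge_fit`, `ensemble_predict`), the uniformly sampled landmarks, and the uniform averaging of per-bandwidth scores are all hypothetical choices made for the example.

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    # Pairwise squared distances via ||x||^2 + ||y||^2 - 2 x.y, clipped at 0.
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def nystroem_features(X, landmarks, gamma, eps=1e-10):
    # Features Z with Z @ Z.T approximating K(X, X) as K_nm K_mm^{-1} K_mn.
    K_mm = rbf_kernel(landmarks, landmarks, gamma)
    K_nm = rbf_kernel(X, landmarks, gamma)
    w, V = np.linalg.eigh(K_mm)
    w = np.maximum(w, eps)           # guard against tiny or negative eigenvalues
    return K_nm @ (V / np.sqrt(w))   # K_nm @ V @ diag(w^{-1/2})

def ridge_fit(Z, Y, lam):
    # Closed-form ridge regression on the approximate feature map.
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ Y)

def ensemble_predict(X_train, Y_train, X_test, gammas, n_landmarks, lam, seed=0):
    # Assumption: landmarks are sampled uniformly and scores are averaged uniformly.
    rng = np.random.default_rng(seed)
    idx = rng.choice(X_train.shape[0], size=n_landmarks, replace=False)
    landmarks = X_train[idx]
    scores = np.zeros((X_test.shape[0], Y_train.shape[1]))
    for gamma in gammas:
        Z_train = nystroem_features(X_train, landmarks, gamma)
        Z_test = nystroem_features(X_test, landmarks, gamma)
        scores += Z_test @ ridge_fit(Z_train, Y_train, lam)
    return np.argmax(scores / len(gammas), axis=1)
```

Here `X_train` would hold flattened pixel vectors and `Y_train` one-hot labels. With m landmarks, only an n×m kernel block is stored per bandwidth rather than the full n×n kernel matrix, which is where the memory saving comes from.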

Individual RBF kernels are selected as follows, and learning kernel alignment Cortes et al. (2012) is also applied to find the optimal combination of RBF kernels with different bandwidths.
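As a rough illustration of how kernel alignment can weight kernels with different bandwidths, the sketch below computes the centered alignment between each candidate kernel and the label kernel, then normalises the scores into combination weights. The independent, per-kernel weighting is a simplified heuristic assumed for this example, the function names are hypothetical, and it is not necessarily the variant used in the experiments.

```python
import numpy as np

def center_kernel(K):
    # Center the kernel matrix in feature space: H K H with H = I - 11^T / n.
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def alignment(K1, K2):
    # Centered alignment: Frobenius inner product over the product of norms.
    K1c, K2c = center_kernel(K1), center_kernel(K2)
    return np.sum(K1c * K2c) / (np.linalg.norm(K1c) * np.linalg.norm(K2c))

def alignment_weights(kernels, Y):
    # Assumption: each kernel is scored independently against the target
    # kernel Y Y^T, and at least one score is strictly positive.
    K_target = Y @ Y.T
    scores = np.array([max(alignment(K, K_target), 0.0) for K in kernels])
    return scores / scores.sum()
```

The combined kernel would then be the weighted sum of the individual kernels, using the weights returned by `alignment_weights`.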


The results are presented in Tab. 4. Since the RBF kernels operate directly on image pixels without neural networks, their performance is worse than that of our method or of finetuning the top layer (LogReg). This serves as supporting evidence that the inductive biases (prior knowledge) introduced by convolutional layers are important in image recognition tasks.

Methods   CIFAR10 [In]    CIFAR100 [In]   STL10 [In]
LogReg    87.45 / 89.94   69.08 / 72.76   95.08 / 96.55
RBF       51.40           21.76           43.58
Ours      90.77 / 92.31   71.31 / 74.63   96.30 / 97.31
Table 4: Results of LogReg, our method and RBF kernels.

Appendix C Results: Transferring within In-domain tasks

We have three in-domain transfer tasks. We train a model on each task and then evaluate it on the other tasks using our method. The results are presented in Tab. 5. Overall, our method provides reasonable performance across tasks without finetuning the models. In particular, because STL10 has very few training images, models trained on CIFAR10 and CIFAR100 generalise better on STL10 than those trained on STL10 itself.
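To make concrete what transferring without finetuning can look like, the sketch below fits a closed-form ridge regression on features from a frozen network and reports accuracy on the target task. It is a simplified stand-in for the full method, which uses an ensemble of such systems: the function names are hypothetical, and `F_train` and `F_test` are assumed to be feature matrices already extracted from the pretrained model.

```python
import numpy as np

def ridge_primal(F, Y, lam):
    # Solve a d x d system: W = (F^T F + lam I)^{-1} F^T Y.
    d = F.shape[1]
    return np.linalg.solve(F.T @ F + lam * np.eye(d), F.T @ Y)

def ridge_dual(F, Y, lam):
    # Solve an n x n system instead: W = F^T (F F^T + lam I)^{-1} Y.
    n = F.shape[0]
    alpha = np.linalg.solve(F @ F.T + lam * np.eye(n), Y)
    return F.T @ alpha

def transfer_accuracy(F_train, y_train, F_test, y_test, n_classes, lam=1.0):
    # One-hot regression targets; choose whichever linear system is smaller.
    Y = np.eye(n_classes)[y_train]
    solver = ridge_dual if F_train.shape[0] < F_train.shape[1] else ridge_primal
    W = solver(F_train, Y, lam)
    return float((np.argmax(F_test @ W, axis=1) == y_test).mean())
```

The primal and dual forms give the same weights for any positive `lam`, so the choice between them only affects cost. The dual form is the cheaper one when the target task has fewer training examples than feature dimensions, as is the case for small datasets such as STL10.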

Tasks for Pretraining   Transfer Tasks:
                        CIFAR10   CIFAR100   STL10
CIFAR10                 94.61     57.16      82.81
CIFAR100                87.85     76.21      80.63
STL10                   73.72     43.65      67.85
Table 5: Transferring within In-domain tasks.