1 Introduction
Besides soaring on tasks they are trained on, deep neural networks (DNNs) have also excelled on tasks whose datasets are collected from similar domains. Prior work Yosinski et al. (2014) showed that filters/parameters learnt by DNNs pretrained on ImageNet generalise better with slight fine-tuning than those learnt from random initialisations. Since then, applications in Computer Vision have made major breakthroughs by initialising DNNs with pretrained parameters and fine-tuning them to adapt to new tasks. Similarly, Natural Language Processing (NLP) welcomed its "ImageNet era" with large and deep pretrained language models such as BERT Devlin et al. (2018), and performance on downstream NLP tasks has reached new state-of-the-art levels on a near-daily basis by employing more data and deeper models during pretraining, and smarter methods for fine-tuning.
These advances in transfer learning with pretrained DNNs, however, come at a large computational cost. An essential step to boost performance on a given new task is to fine-tune the pretrained DNN until it converges, which is computationally intense since these models tend to have hundreds of millions of parameters. An alternative is to freeze the parameters and treat the pretrained DNN as a feature extractor that produces abstract vector representations of data samples with the knowledge from pretraining, and then train a simple classifier on top of these extracted vectors. But as the parameters are not adapted to the new task, the latter approach yields inferior performance to fine-tuning.
We here propose a new way of augmenting the latter approach without fine-tuning the DNN. Our approach takes an accumulation of feature vectors produced at different individual layers, which encode various aspects of the data. Since the feature vectors are highly correlated with each other, as they are generated from a single DNN, only a few of them are needed to make predictions. We adopt the alignment maximisation algorithm for combining kernels Cortes et al. (2012), in which we first find a convex combination of linear kernels constructed from individual layers that gives maximal alignment with the target kernel constructed from the one-hot encoding of the labels. Then, we take the ensemble of feature vectors from the layers selected by non-zero elements in the sparse combination, and make predictions using kernel ridge regression (KRR).
2 Related Work
Transfer learning with classical machine learning methods has been studied for a couple of decades Pan and Yang (2010), including boosting Dai et al. (2007), domain generalisation Muandet et al. (2013), ridge regression Cortes and Mohri (2011), etc. These methods benefit from the transparency of classical machine learning models, and universal function approximators such as boosting and kernel methods come with strong theoretical guarantees. However, it is not easy to incorporate structural priors, such as our knowledge about images and text, into regularising the learning process, and this information is crucial in advancing machine learning systems.
Neural networks are also universal function approximators Hornik et al. (1989), and the learnt vectorised representations are generalisable across tasks, with recent advances in architecture designs for individual types of inputs, including convolutional layers for image recognition LeCun et al. (1998), recurrent layers Elman (1990); Hochreiter and Schmidhuber (1997) and transformers Vaswani et al. (2017) for text processing, etc. Recent research has demonstrated that deep models pretrained on large amounts of training data give decent performance on unseen data sampled from relevant domains after fine-tuning Yosinski et al. (2014). With the growing depth of networks, however, the cost of fine-tuning becomes non-negligible. Efforts in distilling knowledge from deep models to shallow ones Ba and Caruana (2014); Hinton et al. (2015) and to simple ones Frosst and Hinton (2017) showed that neural networks can be simplified after learning, although the learnt transferable features can be degraded during distillation.
Our approach takes the best of both worlds by using feature vectors produced at multiple layers of a pretrained neural network, without explicit fine-tuning, and makes predictions with KRR on a downstream task. With the help of low-rank approximations, our approach only requires passing the training data once through the neural network, without backpropagation.
3 Method
The key concept is to apply KRR to feature vectors produced at a few layers of a pretrained neural network to make predictions, classification in our case, on a downstream task. The notation is as follows: X is the data matrix of n samples, each in d-dimensional space; Y is the matrix of corresponding labels with one-hot encoding; Z_l is the matrix of flattened feature vectors produced at the l-th layer of the pretrained neural network; S is a random projection matrix with m rows that meets the requirement of a subspace embedding; I is the identity matrix; L is the number of layers in the pretrained neural network; and \lambda is the regularisation term in ridge regression. Other notation will be introduced as needed.
3.1 Low-rank Approximation at Individual Layers
Flattened feature vectors generated from neural networks are generally high-dimensional and redundant. We therefore build on theoretical work Udell and Townsend (2019) showing that big data matrices are approximately low rank, and use random projections to obtain low-rank approximations of high-dimensional feature vectors with many fewer dimensions. Given that the Nyström method is well studied for approximating large-scale kernel matrices Gittens and Mahoney (2016), we follow its formula to approximate the linear kernel of a layer's features Z as

K = ZZ^\top \approx (ZZ^\top S^\top)\,(S ZZ^\top S^\top)^{+}\,(S ZZ^\top), (1)

where (\cdot)^{+} is the pseudoinverse of a square matrix. If the number of rows m of S is much smaller than n, which is mostly the case for feature vectors generated from neural networks, and the eigendecomposition of the sketched kernel is written as S ZZ^\top S^\top = V \Lambda V^\top, then a low-rank approximation can be obtained from the feature map F = ZZ^\top S^\top V \Lambda^{-1/2}, with each sample in at most m-dimensional space. As we aim to conduct layer-wise low-rank approximations, it is preferable to apply sparse random projections instead of dense ones. Therefore, we consider a stack of CountSketch matrices Clarkson and Woodruff (2013) to approximate the sparse Johnson-Lindenstrauss Transformation Woodruff (2014).
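As a concrete illustration, the Nyström construction for a linear kernel can be sketched as below. This is a minimal NumPy version using a dense Gaussian sketch for readability (the paper itself uses CountSketch); the function and variable names are ours.

```python
import numpy as np

def nystrom_linear(Z, m, seed=0):
    """Low-rank approximation of the linear kernel K = Z @ Z.T.

    Z: (n, d) feature matrix from one layer; m: sketch size.
    Returns F of shape (n, r) with F @ F.T ~= K and r <= m.
    """
    rng = np.random.default_rng(seed)
    n = Z.shape[0]
    # dense Gaussian sketch over samples; a CountSketch works the same way
    S = rng.standard_normal((m, n)) / np.sqrt(m)
    C = Z @ (Z.T @ S.T)            # K S^T without ever forming K, shape (n, m)
    W = S @ C                      # S K S^T, shape (m, m)
    # eigendecompose the small sketched kernel and form F = C V Lambda^{-1/2}
    vals, vecs = np.linalg.eigh((W + W.T) / 2)
    keep = vals > 1e-8 * vals.max()
    F = (C @ vecs[:, keep]) / np.sqrt(vals[keep])
    return F
```

When the sketch size exceeds the rank of the features, the reconstruction F @ F.T recovers the linear kernel essentially exactly.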
In CountSketch, the random projection matrix is treated as a hash table that uniformly hashes samples into buckets, each entry carrying a binary value randomly sampled from {-1, +1}, so there is no need to materialise the matrix itself. Successful applications of CountSketch, including polynomial kernel approximation Pham and Pagh (2013) and large-scale regression, are due to its scalability with theoretical guarantees even when few hash tables are used Woodruff (2014). Generally, a larger number of buckets leads to better approximations, yet the performance improvement becomes marginal. Prior work Jagadeesan (2019) showed that a small number of hash tables empirically works well on real-world datasets; we follow this setting, which drastically reduces the cost of low-rank approximations at individual layers. The time complexity of the Nyström step is then dominated by sketching the features and eigendecomposing the small sketched kernel.
With limited GPU memory, producing feature vectors for a downstream task given a pretrained neural network is often done in batches of samples. CountSketch is also well suited to this situation as, technically, the approximation can be done in a single forward pass over the data.
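To make the batched setting concrete, a CountSketch over samples can be applied one batch at a time without ever materialising the projection matrix; the following is a minimal NumPy sketch with our own function and variable names.

```python
import numpy as np

def countsketch_rows(batches, n, m, seed=0):
    """Compute S @ Z over samples without materialising S.

    batches: iterable of (row_indices, Z_batch) pairs, Z_batch of shape (b, d);
    n: total number of samples; m: number of buckets.
    Returns the (m, d) sketch of the stacked (n, d) feature matrix.
    """
    rng = np.random.default_rng(seed)
    h = rng.integers(0, m, size=n)          # bucket assignment per sample
    s = rng.choice([-1.0, 1.0], size=n)     # random sign per sample
    out = None
    for idx, Zb in batches:
        if out is None:
            out = np.zeros((m, Zb.shape[1]))
        # accumulate signed rows into their buckets (unbuffered add)
        np.add.at(out, h[idx], s[idx][:, None] * Zb)
    return out
```

Each batch only touches the rows it hashes to, so the sketch can be built during the single forward pass described above.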
3.2 Convex Combination of Features across Layers by Learning Kernel Alignment
Storing feature vectors from all L layers has a considerable memory cost, thus we aim to select only a few layers that give the maximum alignment with the target. Specifically, a weight vector \mu is optimised to maximise the following centred kernel alignment Cortes et al. (2012):

\rho(K_\mu, K_Y) = \frac{\langle K_\mu^c, K_Y^c \rangle_F}{\|K_\mu^c\|_F \, \|K_Y^c\|_F}, \qquad K_\mu = \sum_{l=1}^{L} \mu_l K_l, \; \mu_l \ge 0, (2)

where K_l is the linear kernel constructed from the l-th layer, K_Y = YY^\top is the target kernel constructed from the one-hot encoded labels, and the superscript c denotes centring. Proposition 9 in Cortes et al. (2012) showed that this is equivalent to the quadratic programming problem \min_{v \ge 0} v^\top M v - 2 v^\top a, where M_{kl} = \langle K_k^c, K_l^c \rangle_F and a_k = \langle K_k^c, K_Y \rangle_F, and then \mu = v^\ast / \|v^\ast\|. Intuitively, non-zero entries in \mu provide a weighted sparse combination of feature vectors from a few layers that gives the highest linear alignment with the targets. The time complexity is dominated by materialising M.
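As a concrete sketch of this step, the non-negative quadratic programme can be solved with a simple projected-gradient loop; this optimiser choice is our stand-in for a dedicated QP solver, and all names below are ours.

```python
import numpy as np

def center(K):
    """Center a kernel matrix: K_c = (I - 11^T/n) K (I - 11^T/n)."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def align_weights(Ks, Y, steps=500):
    """Non-negative combination maximising centred alignment with the labels.

    Ks: list of (n, n) layer kernels; Y: (n, c) one-hot label matrix.
    Returns mu >= 0 with unit norm, one weight per layer.
    """
    Kc = [center(K) for K in Ks]
    Ky = center(Y @ Y.T)
    M = np.array([[np.sum(Ki * Kj) for Kj in Kc] for Ki in Kc])
    a = np.array([np.sum(Ki * Ky) for Ki in Kc])
    v = np.ones(len(Ks)) / len(Ks)
    lr = 1.0 / (2 * np.linalg.norm(M, 2) + 1e-12)  # step from gradient Lipschitz constant
    for _ in range(steps):
        v = np.maximum(0.0, v - lr * 2 * (M @ v - a))  # gradient step, then project to v >= 0
    return v / (np.linalg.norm(v) + 1e-12)
```

A kernel that is perfectly aligned with the labels receives all the weight, while a kernel orthogonal to the label kernel is driven to zero.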
The kernel K_\mu induces an embedding space that is a concatenation of the layers' feature vectors weighted by \sqrt{\mu_l}, i.e. \Phi_\mu = [\sqrt{\mu_1} F_1, \dots, \sqrt{\mu_L} F_L]; then the optimisation problem in Eq. 2 can be written from a weight-space perspective:

\max_{\mu \ge 0} \; \frac{\|(\Phi_\mu^c)^\top Y^c\|_F^2}{\|(\Phi_\mu^c)^\top \Phi_\mu^c\|_F \, \|Y^c (Y^c)^\top\|_F}. (3)
It is worth noting that this objective is not the "goodness-of-fit" measure for linear regression, i.e. the R^2 statistic. Optimising \mu to maximise R^2 would lead to drastic overfitting by accumulating all layers, and subsequently a meaningless \mu. The aforementioned objective instead finds a convex combination of features that maximises the alignment between the subspace spanned by the concatenated features and that spanned by the one-hot encoded labels, so it stops accumulating feature vectors once an optimal subset is obtained. The alignment-based objective therefore prevents overfitting to a certain degree.
3.3 [Optional Step] Nyström for Large-scale Kernel Approximation
We denote by L' the number of layers with positive \mu's. Since, in the end, the predictions are made by kernel ridge regression, if the dimensionality of the accumulated features is of a manageable order, then there is no need to conduct kernel approximation through Nyström. However, low-rank approximation can potentially help reduce the noise in the data, which leads to better generalisation compared with computing the exact kernel function.
We consider approximating an RBF kernel with the Nyström method, using the same subsampling scheme as in Sec. 3.1, CountSketch, to further promote fast computation on the accumulated feature vectors. Denoting the number of buckets in the hash functions of this step as m', the time complexity is dominated by sketching the accumulated features and eigendecomposing the m' x m' sketched kernel. The hyperparameter m' is set heuristically. One could cross-validate it as well; however, for the sake of reducing the complexity of transfer learning, we stick to the heuristic value.
3.4 Ridge Regression for Predictions
The approximated low-rank feature map of the RBF kernel is denoted as \Phi, and \phi(x) is the corresponding feature map of a new data sample x. The prediction is given by the closed-form solution of ridge regression in the table below, with the branch chosen by whichever matrix is cheaper to invert:

condition | n \ge d | n < d
prediction | \hat{y} = \phi(x)^\top (\Phi^\top \Phi + \lambda I)^{-1} \Phi^\top Y | \hat{y} = \phi(x)^\top \Phi^\top (\Phi \Phi^\top + \lambda I)^{-1} Y

Then the label of a test sample is the index of the maximum value in the predicted \hat{y}. The time complexity of ridge regression is determined by the inversion of a square matrix and the matrix multiplication that produces that square matrix.
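The two closed-form branches above can be sketched as follows; `ridge_fit_predict` and its arguments are our own names, and the primal/dual switch mirrors the condition in the table.

```python
import numpy as np

def ridge_fit_predict(Phi, Y, Phi_test, lam=1.0):
    """Closed-form ridge regression, choosing the primal or dual form by shape.

    Phi: (n, d) training features; Y: (n, c) one-hot labels; Phi_test: (t, d).
    Returns (t, c) predictions; argmax over columns gives the label.
    """
    n, d = Phi.shape
    if n >= d:
        # primal form: invert a (d, d) matrix
        W = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ Y)
        return Phi_test @ W
    else:
        # dual (kernel) form: invert an (n, n) matrix
        alpha = np.linalg.solve(Phi @ Phi.T + lam * np.eye(n), Y)
        return Phi_test @ (Phi.T @ alpha)
```

By the identity (\Phi^\top \Phi + \lambda I)^{-1} \Phi^\top = \Phi^\top (\Phi \Phi^\top + \lambda I)^{-1}, both branches produce the same predictions; the choice only affects the cost.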
In summary, our proposed method has four steps: 1) CountSketch to reduce feature vectors at individual layers to low-rank approximations of manageable size, 2) convex combination to take a weighted accumulation of feature vectors across layers, 3) Nyström approximation of an RBF kernel, and 4) KRR to make predictions. Compared with the multiple forward and backward passes required for fine-tuning or training classifiers, our method drastically reduces the computational cost.
4 Experiments
We demonstrate the effectiveness of our method through experiments on transferring ResNet-based models He et al. (2015, 2016) pretrained on the ImageNet dataset Deng et al. (2009); Russakovsky et al. (2015) to downstream tasks, including three in-domain datasets, CIFAR10, CIFAR100 Krizhevsky (2009) and STL10 Coates et al. (2011), and three out-of-domain ones, Street View House Numbers (SVHN) Netzer et al. (2011), Caltech-UCSD Birds 200 (CUB200) Welinder et al. (2010) and Kuzushiji49 Clanuwat et al. (2018).¹ Basic statistics of each dataset are presented in Table 1.
¹The full Kuzushiji49 dataset has 232k training images, which takes too long to cross-validate hyperparameters for LogReg; thus, the same half of the dataset is used throughout.
Table 1: Train / test sizes [number of classes] for each dataset.
 | Training | In-domain Transfer | | | Out-of-domain Transfer | |
 | ImageNet | CIFAR10 | CIFAR100 | STL10 | SVHN | CUB200 | Kuzushiji49
 | 1.2m / [1000] | 50k / 10k [10] | 50k / 10k [100] | 5k / 8k [10] | 73k / 26k [10] | 6k / 6k [200] | 116k / 38k [49]
Hyperparameter Settings: ResNet18 and ResNet34 pretrained on ImageNet are selected as the base models to transfer from. To reduce the memory cost, instead of hashing all layers, we only hash feature vectors from every residual block in a model, as each block usually contains two or three convolutional layers. The regularisation strength \lambda is cross-validated on the training set of the downstream task over a predefined range of values.
Comparison Partner: Fine-tuning the top layer on each downstream task with softmax regression. Models are fine-tuned for 30 epochs with the Adam optimiser, and the learning rate decays by a factor of 2 every 10 epochs. Cross-validation is conducted over the following hyperparameters: data augmentation = {with, without}, the weight decay rate, and the initial learning rate. Note that fine-tuning with data augmentation tremendously increases the training time, as the neural network needs to be kept in memory during fine-tuning, while otherwise one can store feature vectors from the last layer prior to fine-tuning. Results are marked as LogReg in the following tables and figures.
Trials: Since our method involves random projections and the comparison partners require initialisation, for a fair comparison we run each method five times with different random seeds, and each marker in each plot presents the mean of five trials along with a vertical bar indicating the standard deviation. The vertical bars are often invisible, as the hyperparameters of each method are cross-validated on the training set. The main results are presented in Tab. 2.
 | CIFAR10 [In] | CIFAR100 [In] | STL10 [In] | CUB200 [Out] | SVHN [Out] | Kuzushiji49 [Out]
LogReg | 87.45 / 89.94 | 69.08 / 72.76 | 95.08 / 96.55 | 60.80 / 61.60 | 64.36 / 59.47 | 74.56 / 71.08
Ours | 90.77 / 92.31 | 71.31 / 74.63 | 96.30 / 97.31 | 58.78 / 61.70 | 88.76 / 88.53 | 88.12 / 88.00
The median accuracy of five trials is reported in each cell, with the two numbers corresponding to transferring from [ResNet18 / ResNet34]. Except for CUB200, our method outperforms LogReg significantly, as the variance across the five trials is very small, as presented in the figures.
4.1 Supervised Transfer with Varying Portions of Training Samples
Since the individual downstream tasks have ample samples in their training sets, we study our method and its comparison partner when varying the portion of training samples kept, with the intervals determined linearly in log-space. The results are presented in Fig. 1.
Our method significantly outperforms fine-tuning the top layer on five out of six transfer tasks across different portions of training samples, and only performs comparably to fine-tuning on CUB200, which is a fine-grained bird species recognition task.
4.2 Insights provided by \mu
The solution \mu to Eq. 2 provides insights into the number of accumulated layers and their weights. We plot a heatmap with the y-axis indicating the index of layers, the x-axis indicating the portion of training samples, and a gradient colour scheme presenting the value of \mu in Fig. 2.
As shown in He et al. (2016), the penultimate layer (index 11 for ResNet18 and 19 for ResNet34) of a ResNet removes all spatial information by averaging the outputs of the previous layer. As illustrated in Fig. 2, layers before the penultimate layer are assigned non-zero \mu's across all six tasks, confirming that preserving spatial information helps in transfer learning.
For in-domain transfer tasks, it turns out that the top few layers are the most useful, and the improvement of our method comes from its ability to identify and accumulate these layers. STL10 contains images from the ImageNet dataset but at lower resolution, so the penultimate layer provides adequately abstract information about the images, which explains the observation that our method assigns a strongly dominating \mu to the penultimate layer.
For SVHN and Kuzushiji49, the selected layers clearly do not include the feature vectors generated at the penultimate layer, and lower layers receive higher \mu's, which results in better performance than fine-tuning the top linear layer. However, our method does not outperform fine-tuning the last layer on CUB200. A potential explanation is that kernel ridge regression learns one-vs-all classifiers, which can struggle when many classes are present. This is a limitation of our method, but also a direction for future study.
4.3 Accumulated Feature Vectors vs Individual Feature Vectors
As the representation in our method is a weighted concatenation of feature vectors from layers with non-zero \mu's, it is important to conduct a sanity check on the effectiveness of accumulating layers compared with using these layers alone. Therefore, we gradually accumulate layers sorted by their \mu's, and plot the performance curve against the number of accumulated layers. These layers are then applied individually to make predictions as a comparison. We use the full training dataset in this subsection. The results are shown in Fig. 3.
For indomain transfer tasks, we see that the performance improves as our method accumulates layers, while the trend is not obvious/significant for outofdomain transfer tasks. Overall, accumulating a few layers provides better performance than making predictions based on individual layers.
Figure 4: Accuracy of semi-supervised learning with a varying number of labelled samples per class. The number of labelled examples per class is varied and the rest are left unlabelled for semi-supervised learning. The left y-axis (line plots) refers to the accuracy of semi-supervised learning, and the right y-axis (bar plots) refers to the relative improvement brought by unlabelled data. (I) Our method gives better performance than LogReg overall, except on CUB200. (II) Our method is also better at leveraging unlabelled samples for learning, as indicated by taller bars for ours than for LogReg, except on STL10 and CUB200.
4.4 Semi-supervised Transfer Learning via Transductive Regression
There are many ways of incorporating unlabelled data into kernel ridge regression, including manifold regularisation Belkin et al. (2006) and transductive learning Cortes and Mohri (2006). Since manifold regularisation requires the exact computation or an approximation of the Laplacian matrix over labelled and unlabelled samples, which increases the learning time, we adopt the transductive learning method for regression problems to leverage unlabelled data when extremely limited labelled samples and a large amount of unlabelled samples are provided. The solution of transductive ridge regression Cortes and Mohri (2006) is

W = (\lambda I + C_1 \Phi_L^\top \Phi_L + C_2 \Phi_U^\top \Phi_U)^{-1} (C_1 \Phi_L^\top Y_L + C_2 \Phi_U^\top Y_U^\ast), (4)

where C_1 and C_2 are hyperparameters that control the contributions of the labelled and unlabelled data, and can be cross-validated on the labelled data. The pseudo-label matrix Y_U^\ast comes from the ridge regression model learnt only on labelled data: the maximum entry of each unlabelled sample's vector prediction is set to one and the rest to zero. For fine-tuning the top layer, we use the supervised classifier trained on labelled samples to annotate unlabelled samples, incorporate these samples into the training set, and retrain the classifier with cross-validation.
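The transductive solution can be sketched as below; this is a minimal NumPy version following the description above, where the default values of `c1`, `c2` and `lam` are our own placeholders rather than the paper's settings.

```python
import numpy as np

def transductive_ridge(Xl, Yl, Xu, c1=1.0, c2=0.1, lam=1.0):
    """Transductive ridge regression sketch (after Cortes and Mohri, 2006).

    Xl: (nl, d) labelled features; Yl: (nl, c) one-hot labels;
    Xu: (nu, d) unlabelled features. c1/c2 weight the labelled/unlabelled terms.
    """
    d = Xl.shape[1]
    # supervised solution on the labelled data only
    W0 = np.linalg.solve(Xl.T @ Xl + lam * np.eye(d), Xl.T @ Yl)
    # hard one-hot pseudo-labels for the unlabelled samples
    P = Xu @ W0
    Yu = np.eye(Yl.shape[1])[P.argmax(axis=1)]
    # joint closed-form solution over labelled and pseudo-labelled sets
    A = lam * np.eye(d) + c1 * Xl.T @ Xl + c2 * Xu.T @ Xu
    b = c1 * Xl.T @ Yl + c2 * Xu.T @ Yu
    return np.linalg.solve(A, b)
```

The unlabelled term acts as a soft constraint pulling the joint solution towards the pseudo-labels, with c2 controlling how much to trust them.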
We simulate a semi-supervised learning environment by keeping 2, 5, 10, 20, 50, or 100 labelled training samples per class on each dataset, and leave the rest as unlabelled samples. The accuracy of semi-supervised transfer learning on the test set is reported as line plots in Fig. 4, and the relative improvement over supervised transfer learning is reported as bar plots in the same figure.
Our method outperforms fine-tuning the last layer on five out of six transfer tasks, as indicated by the line plots. Our method also gives a significant relative improvement when unlabelled samples are incorporated through transductive regression on these five tasks, while unlabelled samples do not improve the generalisation ability of the top-layer fine-tuning method. Negative results are concentrated on CUB200, where unlabelled samples become detrimental to our method while helpful for fine-tuning the top layer.
5 Discussion
5.1 Temperature Scaling for Calibration
Although our method provides both a speed-up and an accuracy improvement in transfer learning, we are also interested in how well calibrated our learnt classifier is compared with fine-tuning the top layer. KRR, SVMs and tree-based boosted classifiers are expected not to be well calibrated, as their predicted outputs cannot be directly interpreted as the classifier's confidence Niculescu-Mizil and Caruana (2005). We calculate the Expected Calibration Error (ECE) Guo et al. (2017) for our method and for the baseline model, fine-tuning the top layer with logistic regression. The formula of ECE is

\mathrm{ECE} = \sum_{b=1}^{B} \frac{n_b}{n} \left| \mathrm{acc}(b) - \mathrm{conf}(b) \right|,

where n is the number of samples, n_b is the number of samples falling into bin b, and B is the number of bins in the estimation.
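The binned estimator above can be sketched in a few lines of NumPy; the bin count default is an assumption of ours (Guo et al. use equal-width confidence bins), and the names are our own.

```python
import numpy as np

def ece(probs, labels, n_bins=15):
    """Expected Calibration Error over equal-width confidence bins.

    probs: (n, c) predicted class probabilities; labels: (n,) true class ids.
    """
    conf = probs.max(axis=1)                 # per-sample confidence
    pred = probs.argmax(axis=1)              # per-sample prediction
    correct = (pred == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            # weight each bin's |accuracy - confidence| gap by its sample share
            total += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return total
```

A classifier that is always 60% confident but always wrong, for instance, has an ECE of 0.6.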
The results shown in Fig. 5 validate our expectation that our method gives worse calibration on the test set than logistic regression. A simple cure is Temperature Scaling Guo et al. (2017), which optimises a single parameter to rescale the output of the classifier in order to reduce the ECE on training data, and does not change the predicted labels. In our case, the temperature can be cross-validated very efficiently once the output of KRR is produced.
5.2 Random Projections
Our method adopted the Nyström method for the low-rank approximation of feature vectors at individual layers, which involves hashing data samples into buckets first and then solving a linear system, where the hash functions can be shared across all layers. A more direct approach is to hash the individual features of each feature vector into buckets, as in feature hashing. This approach eliminates the step of solving a linear system, which reduces the time complexity at each layer. The comparison between Nyström and this direct random projection is presented in Fig. 6.
Overall, Nyström provides better accuracy across all six tasks than random projection. However, it is noticeable that the difference between the two is smaller on indomain transfer tasks. The observation also serves as a piece of supporting evidence that our method is relatively consistent when different lowrank approximation schemes are applied in the first step.
5.3 Taskdependent Distillation
Our method in previous sections still requires the pretrained ImageNet model during testing time. Now we present taskdependent distillation to leverage the predictions constructed in our method as a regularisation in training smaller networks for individual downstream tasks.
Table 3: Test accuracy with (w/) and without (w/o) the KRR-based regulariser, [ResNet8 / ResNet10].
 | CIFAR10 | CIFAR100 | STL10
w/o | 91.50 / 92.22 | 69.24 / 69.41 | 67.31 / 69.21
w/ | 92.23 / 93.22 | 69.03 / 70.87 | 69.91 / 71.64
Once predictions on the training set of a task are made by KRR in our method, we store the predictions and remove the pretrained model. This step only requires memory proportional to the number of training samples times the number of classes.
We train ResNets with 8 and with 10 layers on the individual tasks with the cross-entropy loss, and use a Mean Squared Error (MSE) loss to regress the outputs of the small networks onto the predictions made by KRR in our method. The results are presented in Tab. 3. Overall, regularising small models with the predictions our method derives from ImageNet models helps them obtain better generalisation.
6 Conclusion
We presented a promising four-step ridge-regression-based transfer learning scheme for deep learning models. It does not require fine-tuning, which reduces the transfer learning problem to simple regressions, and it is capable of identifying a few layers whose accumulation makes for better predictions.
We evaluated our method on supervised transfer with varying portions of training data, and handled semi-supervised transfer learning via transductive regression; both show significant improvements compared with fine-tuning only the last layer.
Our discussion addressed the issue of calibration via Temperature Scaling and demonstrated the advantage of Nyström over plain random projections. Lastly, we showed that the predictions from our method can be used to improve shallow/small models trained directly on the transfer tasks.
Acknowledgements
Shuai Tang and Virginia R. de Sa are supported by NSF IIS1817226, and Shuai’s research is partly funded by Adobe’s gift funding. We gratefully thank Charlie Dickens and Wesley J. Maddox for fruitful discussions, and appreciate Mengting Wan and Shi Feng for comments on our draft.
References
Ba, J. and Caruana, R. (2014). Do deep nets really need to be deep? ArXiv abs/1312.6184.
Belkin, M. et al. (2006). Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research 7, pp. 2399-2434.
Clanuwat, T. et al. (2018). Deep learning for classical Japanese literature. ArXiv abs/1812.01718.
Clarkson, K. L. and Woodruff, D. P. (2013). Low rank approximation and regression in input sparsity time. In STOC '13.
Coates, A. et al. (2011). An analysis of single-layer networks in unsupervised feature learning. In AISTATS.
Cortes, C. et al. (2012). Algorithms for learning kernels based on centered alignment. Journal of Machine Learning Research 13, pp. 795-828.
Cortes, C. and Mohri, M. (2006). On transductive regression. In Advances in Neural Information Processing Systems.
Cortes, C. and Mohri, M. (2011). Domain adaptation in regression. In Algorithmic Learning Theory.
Dai, W. et al. (2007). Boosting for transfer learning. In International Conference on Machine Learning.
Deng, J. et al. (2009). ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255.
Devlin, J. et al. (2018). BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805.
Elman, J. L. (1990). Finding structure in time. Cognitive Science 14, pp. 179-211.
Frosst, N. and Hinton, G. (2017). Distilling a neural network into a soft decision tree. ArXiv abs/1711.09784.
Gittens, A. and Mahoney, M. W. (2016). Revisiting the Nyström method for improved large-scale machine learning. The Journal of Machine Learning Research 17 (1), pp. 3977-4041.
Guo, C. et al. (2017). On calibration of modern neural networks. In International Conference on Machine Learning.
He, K. et al. (2016). Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778.
He, K. et al. (2016). Identity mappings in deep residual networks. ArXiv abs/1603.05027.
Hinton, G. et al. (2015). Distilling the knowledge in a neural network. ArXiv abs/1503.02531.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation 9, pp. 1735-1780.
Hornik, K. et al. (1989). Multilayer feedforward networks are universal approximators. Neural Networks 2, pp. 359-366.
Jagadeesan, M. (2019). Understanding sparse JL for feature hashing. In Advances in Neural Information Processing Systems.
Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Master's thesis, University of Toronto.
LeCun, Y. et al. (1998). Gradient-based learning applied to document recognition.
Muandet, K. et al. (2013). Domain generalization via invariant feature representation. In International Conference on Machine Learning.
Netzer, Y. et al. (2011). Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning.
Niculescu-Mizil, A. and Caruana, R. (2005). Predicting good probabilities with supervised learning. In International Conference on Machine Learning.
Pan, S. J. and Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22, pp. 1345-1359.
Pham, N. and Pagh, R. (2013). Fast and scalable polynomial kernels via explicit feature maps. In SIGKDD Conference on Knowledge Discovery and Data Mining.
Russakovsky, O. et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115, pp. 211-252.
Udell, M. and Townsend, A. (2019). Why are big data matrices approximately low rank? SIAM Journal on Mathematics of Data Science 1 (1), pp. 144-160.
Vaswani, A. et al. (2017). Attention is all you need. In Advances in Neural Information Processing Systems.
Welinder, P. et al. (2010). Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001.
Woodruff, D. P. (2014). Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science 10, pp. 1-157.
Yosinski, J. et al. (2014). How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems.
Appendix A Dataset Descriptions
CIFAR10 Krizhevsky (2009) consists of 60k images, each of size 32x32. A train/test split is provided: the training set contains 50k images and the test set the remaining 10k. Each image has an object at the centre, and the 10 object categories are similar to ones in the ImageNet dataset.
CIFAR100 Krizhevsky (2009) also has 60k images, each of size 32x32, with the same ratio of training to test images as CIFAR10. Each image has an object at the centre, and in total there are 100 object categories, which makes the task harder than CIFAR10.
STL10 Coates et al. (2011) has 500 images for training and 800 for testing per class, each of size 96x96, and in total there are 10 classes. Since the images of this dataset come from labelled samples of ImageNet but at lower resolution, models pretrained on ImageNet are expected to generalise well.
SVHN Netzer et al. (2011) consists of real-world images obtained from house numbers in Google Street View, therefore there are 10 categories. The training set contains 73k images, and the test set contains 26k. The dataset also provides a set of unlabelled images, which we did not use in our study.
CUB200 Welinder et al. (2010) is an image dataset with photos of 200 bird species (mostly North American). The total number of training images is around 6k, so each class has around 30 training examples. The task is considered difficult as it requires the model to pay attention to details of the bird presented in each image, which makes it a fine-grained classification problem.
Kuzushiji49 Clanuwat et al. (2018) is a Japanese character recognition task, covering 48 Hiragana characters and one Hiragana iteration mark. The dataset is much larger than the aforementioned ones and contains only greyscale images. The training set contains 232k images, of which we only used half in our study. The test set contains 38k images, which is used to evaluate the effectiveness of our method and other methods.
Appendix B Results: RBF Baselines
RBF kernels, as universal kernels, are widely used across many research domains. Since we used features produced by neural networks learnt on the ImageNet dataset as inputs to an RBF kernel, it is reasonable to compare against a method that takes an ensemble of RBF kernels with various bandwidths and directly takes the vectorised images as inputs. The Nyström approximation is applied to reduce the memory complexity. Individual RBF kernels are of the form

k_\sigma(x, x') = \exp\!\left( -\frac{\|x - x'\|_2^2}{2\sigma^2} \right), (5)

where the bandwidth \sigma of each kernel is drawn from a predefined range, and learning kernel alignment Cortes et al. (2012) is applied to find the optimal combination of RBF kernels with different bandwidths.
The results are presented in Tab. 4. Since the RBF kernels operate directly on the pixels of images, without neural networks, the performance is worse than ours or than fine-tuning the top layer (LogReg). This serves as supporting evidence that the inductive biases (prior knowledge) introduced by convolutional layers are important in image recognition tasks.
Methods  CIFAR10 [In]  CIFAR100 [In]  STL10 [In] 
LogReg  87.45 / 89.94  69.08 / 72.76  95.08 / 96.55 
RBF  51.40  21.76  43.58 
Ours  90.77 / 92.31  71.31 / 74.63  96.30 / 97.31 
Appendix C Results: Transferring within Indomain tasks
We have three in-domain transfer tasks; we train a model on each task and then evaluate its performance on the other tasks using our method. The results are presented in Tab. 5. Overall, our method provides reasonable performance across tasks without fine-tuning the models. Notably, for STL10, as the dataset has very few images in its training set, models trained on CIFAR10 and CIFAR100 give better generalisation on STL10 than those trained on STL10 itself.
Tasks for Pretraining  Transfer Tasks  
CIFAR10  CIFAR100  STL10  
CIFAR10  94.61  57.16  82.81 
CIFAR100  87.85  76.21  80.63 
STL10  73.72  43.65  67.85 