1 Introduction
Deep neural networks (DNNs) have demonstrated extraordinary success in a variety of fields such as computer vision
(Krizhevsky & Hinton, 2012; He et al., 2016), speech recognition (Hinton et al., 2012), and natural language processing
(Mikolov et al., 2010). However, DNNs are resource-hungry, which hinders their wide deployment in resource-limited scenarios, especially on low-power embedded devices in the emerging Internet-of-Things (IoT) domain. To address this limitation, extensive work has been done to accelerate or compress deep neural networks. Putting aside the works on designing (Chollet, 2016) or automatically searching for efficient network architectures (Zoph & Le, 2016), most studies try to optimize DNNs from four perspectives: network pruning (Han et al., 2016; Li et al., 2016), network decomposition (Denton et al., 2014; Jaderberg et al., 2014), network quantization (or low-precision networks) (Gupta et al., 2015; Courbariaux et al., 2016; Rastegari et al., 2016), and knowledge distillation (Hinton et al., 2015; Romero et al., 2015). Among these categories, knowledge distillation is somewhat different due to its utilization of information from a pretrained teacher-net. The concept was proposed by (Bucila et al., 2006; Ba & Caruana, 2014; Hinton et al., 2015)
for transferring knowledge from a large "teacher" model to a compact yet efficient "student" model by matching certain statistics between the two. Further research introduced various kinds of matching mechanisms for DNN optimization. The distillation procedure designs a loss function based on the matching mechanism and enforces this loss during a full training process. Hence, all these methods usually require a time-consuming training procedure along with a fully annotated large-scale training dataset.
Meanwhile, some network pruning (Li et al., 2016; Liu et al., 2017) and decomposition (Zhang et al., 2016; Kim et al., 2016) methods can produce extremely small networks, but with large accuracy drops, so that time-consuming fine-tuning is required for possible accuracy recovery. Even then, fine-tuning with the original cross-entropy loss may still fail to recover the accuracy drop due to the loss's low representation capacity. Hence, knowledge distillation may be used to alleviate the problem, since the compact student-net can sometimes be trained to match the performance of the teacher-net. For instance, Crowley et al. (2017) use cheap group convolutions and pointwise convolutions to build a small student-net and adopt knowledge distillation to transfer knowledge from a full-sized "teacher-net" to the "student-net". However, this still suffers from high training cost.
As is well known, children can learn concepts from adults with few examples. This cognitive phenomenon has motivated the development of few-shot learning (Fei-Fei et al., 2006; Bart & Ullman, 2005), which aims to learn information about object categories from a few training samples and focuses mostly on the image classification task. Nevertheless, it inspires us to consider the possibility of knowledge distillation from few samples. Some recent works on knowledge distillation address this problem by constructing "pseudo" training data (Kimura et al., 2018; Lopes et al., 2017) with complicated heuristics and heavy engineering, which are still costly.
This paper proposes a novel and simple three-step method for few-sample knowledge distillation (FSKD) as illustrated in Figure 1: student-net design, teacher-student alignment, and absorbing the added conv-layer. We assume that the "teacher" and "student" nets have the same feature-map sizes at each corresponding block; the relatively small student-net can otherwise be obtained in various ways, such as by pruning/decomposing the teacher-net, or by fully redesigning a network with random initialization. We add a 1×1 conv-layer at the end of each block of the student-net and align the block-level outputs between "teacher" and "student" by estimating the parameters of the added layer from few samples using least-squares regression. Since the added 1×1 conv-layers have relatively few parameters, we can obtain a good approximation from a small number of samples. We further prove that the added 1×1 conv-layer can be absorbed/merged into the previous conv-layer when certain conditions are fulfilled, so that the new conv-layer has the same number of parameters and computation cost as the previous one.
We argue that FSKD has many potential applications, especially when fine-tuning or full training is not feasible in practice; we name a few such cases below. First, edge devices have limited computing resources, and FSKD offers the possibility of on-device learning to compress deep models with a limited number of samples. Second, FSKD may help software/hardware vendors optimize the deep models of their customers when full training data is unavailable due to privacy or confidentiality issues. Third, FSKD enables fast model-deployment optimization under a strict time budget. Our major contributions can be summarized as follows:

To the best of our knowledge, we are the first to show that knowledge distillation can be done with few samples within minutes on a desktop PC.

The proposed FSKD method is widely applicable, not only to fully redesigned student-nets but also to networks compressed by pruning- and decomposition-based methods.

We demonstrate significant performance improvements of the student-net by FSKD, compared to existing distillation techniques, on various datasets and network structures.
2 Related Work
Knowledge Distillation (KD) transfers knowledge from a pretrained large "teacher" network (or even an ensemble of networks) to a small "student" network, to facilitate deployment at test time. Originally, this was done by regressing the softmax output of the teacher model (Hinton et al., 2015). The soft continuous regression loss used here provides richer information than the label-based loss, so that the distilled model can be more accurate than one trained on labeled data with the cross-entropy loss. Later works extended this approach by matching other statistics, including intermediate feature responses (Romero et al., 2015; Chen et al., 2016), gradients (Srinivas & Fleuret, 2018), distributions (Huang & Wang, 2017), Gram matrices (Yim et al., 2017), etc. Deep mutual learning (Zhang et al., 2018) trains a cohort of student-nets that teach each other collaboratively with model distillation throughout the training process. All these methods require a large amount of data (known as the "transfer set") to transfer the knowledge, whereas we aim to provide a solution with a limited number of samples. We must emphasize that FSKD has a quite different philosophy on aligning intermediate responses from the closest knowledge distillation method, FitNet (Romero et al., 2015). FitNet retrains the whole student-net with intermediate supervision using a large amount of data, while FSKD only estimates parameters for the added conv-layers from few samples. We verify in experiments that FSKD is not only more efficient but also more accurate than FitNet.
Network Pruning methods obtain a small network by pruning weights from a trained larger network, which can retain the accuracy of the larger model if the prune ratio is set properly. Han et al. (2015) propose to prune individual weights that are near zero. Recently, channel pruning has become increasingly popular thanks to its better compatibility with off-the-shelf computing libraries, compared with weight pruning. Different criteria have been proposed to select the channels to be pruned, including the norm of weights (Li et al., 2016), the scale of multiplicative coefficients (Liu et al., 2017), statistics of the next layer (Luo et al., 2017), etc. An iterative loop between pruning and fine-tuning is usually required to achieve a better pruning ratio and speedup. A similar gradual-adjustment trick is also applied to train very deep neural networks (Smith et al., 2016). Meanwhile, Network Decomposition methods try to factorize heavy layers in DNNs into multiple lightweight ones. For instance, one may apply low-rank decomposition to fully-connected layers (Denton et al., 2014), and different kinds of tensor decomposition to conv-layers (Zhang et al., 2016; Kim et al., 2016). However, aggressive network pruning or decomposition usually leads to large accuracy drops, so fine-tuning is required to alleviate them (Li et al., 2016; Liu et al., 2017; Zhang et al., 2016). Since KD is more accurate than directly training on labeled data, as mentioned above, it is of great interest to explore KD on extremely pruned or decomposed networks, especially under the few-sample setting.
Learning with few samples has been extensively studied under the concept of one-shot or few-shot learning. One category of methods directly models few-shot samples with generative models (Fei-Fei et al., 2006; Lake et al., 2011), while most others study the problem under the notion of transfer learning (Bart & Ullman, 2005; Ravi & Larochelle, 2017). In the latter category, meta-learning methods (Vinyals et al., 2016; Finn et al., 2017) solve the problem in a learning-to-learn fashion, which has recently gained momentum due to its application versatility. Most studies are devoted to the image classification task, while knowledge distillation from few samples remains less explored. Recently, some works have tried to address this problem. Kimura et al. (2018) construct pseudo-examples using the inducing point method and develop a complicated algorithm to optimize the model and the pseudo-examples alternately. Lopes et al. (2017) record per-layer metadata of the teacher-net in order to reconstruct a training set, and then adopt a standard training procedure to obtain the student-net. Both are very costly due to their complicated and heavy training procedures. In contrast, we aim for a simple solution for knowledge distillation from few samples.
3 Few-Sample Knowledge Distillation (FSKD)
3.1 Overview
Our FSKD method consists of three steps as shown in Figure 1. First, we design a student-net, either by pruning/decomposing the teacher-net or by fully redesigning a small network with random initialization. Second, we add a 1×1 conv-layer at the end of each block of the student-net and align the block-level outputs between "teacher" and "student" by estimating the parameters of the added layer from few samples. Third, we absorb the added 1×1 conv-layer into the previous conv-layer, so that no extra parameters or computations are introduced into the student-net.
Two reasons make this idea work efficiently. First, the 1×1 conv-layers have relatively few parameters, so their estimation does not require much data. Second, the block-level output of the teacher-net provides rich information, as shown in FitNet (Romero et al., 2015). Below, we first give the theoretical derivation of why the added 1×1 conv-layer can be absorbed/merged into the previous conv-layer, and then provide details on the block-level output alignment.
3.2 Absorbable conv-layer
Let us first introduce some mathematical notation for the different kinds of convolutions before moving to the theoretical derivation. A regular convolution consists of multi-channel, multi-kernel filters which capture both cross-channel correlations and spatial correlations. Formally, a regular convolution layer can be represented by a 4-dimensional tensor $W \in \mathbb{R}^{n \times c \times k \times k}$, where $n$ and $c$ are the numbers of output and input channels respectively, and $k \times k$ is the spatial kernel size. A pointwise (PW) convolution, also known as a 1×1 convolution (Lin et al., 2014), can be represented by a tensor $P \in \mathbb{R}^{n \times c \times 1 \times 1}$, which degrades from a 4-dimensional tensor to a 2-dimensional matrix $P \in \mathbb{R}^{n \times c}$. A depthwise (DW) convolution (Chollet, 2016) performs per-channel 2D convolution for each input channel, so it can be represented by a tensor $D \in \mathbb{R}^{c \times k \times k}$. Since a DW convolution models no correlation among output channels, it is usually followed by a pointwise convolution to model those correlations. This combination (DW + PW) is also named depthwise separable convolution by Chollet (2016).
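To make these shapes concrete, here is a small PyTorch sketch; the dimensions `n`, `c`, `k` are arbitrary toy values, not taken from the paper:

```python
import torch
import torch.nn as nn

n, c, k = 8, 4, 3  # toy output channels, input channels, kernel size

# Regular convolution: a 4-D weight tensor of shape (n, c, k, k).
regular = nn.Conv2d(c, n, kernel_size=k, padding=1, bias=False)

# Depthwise (DW) convolution: per-channel 2-D filters, weight shape (c, 1, k, k).
dw = nn.Conv2d(c, c, kernel_size=k, padding=1, groups=c, bias=False)
# Pointwise (PW, 1x1) convolution: weight (n, c, 1, 1), i.e. an (n, c) matrix.
pw = nn.Conv2d(c, n, kernel_size=1, bias=False)

x = torch.randn(1, c, 16, 16)
# A DW + PW pair (depthwise separable convolution) maps to the same output
# shape as the regular convolution.
assert regular(x).shape == pw(dw(x)).shape == (1, n, 16, 16)
```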
Theorem 1.
A pointwise convolution with tensor $P \in \mathbb{R}^{n' \times n}$ can be absorbed into the previous convolution layer with tensor $W \in \mathbb{R}^{n \times c \times k \times k}$ to obtain the absorbed tensor $W' = P \circ W \in \mathbb{R}^{n' \times c \times k \times k}$, where $\circ$ is the absorbing operator, if the following conditions are satisfied.

The number of output channels of $W$ equals the number of input channels of $P$, i.e., both equal $n$.

There is no nonlinear activation layer such as ReLU (Nair & Hinton, 2010) between $W$ and $P$.
Due to the space limitation, we put the proof and the detailed form of the absorbing operator $\circ$ in Appendix A. The number of output channels of $W'$ is $n'$, which is in general different from that of $W$ (i.e., $n$). It is easy to obtain the following corollary.
Corollary 1.
When the following condition is satisfied for $P$:

the number of input and output channels of $P$ equals the number of output channels of $W$, i.e., $n' = n$,

the absorbed convolution tensor $W'$ has the same number of parameters and computation cost as $W$, i.e., both lie in $\mathbb{R}^{n \times c \times k \times k}$.
This condition is required not only to ensure the same parameter size and computation cost, but also to keep the current layer's output size matching (connectable to) the next layer's input size.
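The absorption is simply linearity of convolution; the following sketch (toy shapes, with the corollary's condition $n' = n$ and no nonlinearity in between) merges a pointwise convolution into the preceding conv-layer and checks that the outputs coincide:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, c, k = 8, 4, 3
W = torch.randn(n, c, k, k)   # previous conv-layer
P = torch.randn(n, n)         # added 1x1 conv with n' = n

# Absorbing operator: W'[i] = sum_j P[i, j] * W[j]; shape stays (n, c, k, k).
W_merged = torch.einsum('ij,jckl->ickl', P, W)

x = torch.randn(2, c, 16, 16)
y_two_layer = F.conv2d(F.conv2d(x, W, padding=1), P.view(n, n, 1, 1))
y_merged = F.conv2d(x, W_merged, padding=1)
# Identical up to float rounding; a ReLU between W and P would break this.
assert torch.allclose(y_two_layer, y_merged, atol=1e-3)
```

Note that the merged weight has exactly the same shape as `W`, matching the corollary's claim of unchanged parameter count and computation cost.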
3.3 Block-level Alignment and Absorbing
Now we consider the knowledge distillation problem. Suppose $F^s, F^t \in \mathbb{R}^{n \times hw}$ are the block-level outputs in matrix form for the student-net and teacher-net respectively, where $h \times w$ is the per-channel feature-map resolution. We add a 1×1 conv-layer $Q$ at the end of each block of the student-net before the nonlinear activation, which satisfies the condition of Corollary 1. As $Q$ degrades to the matrix form $Q \in \mathbb{R}^{n \times n}$, it can be estimated with least-squares regression as

$$\min_{Q} \sum_{i=1}^{N} \left\| Q * F^s_i - F^t_i \right\|_2^2 \qquad (1)$$

where $N$ is the number of samples used, and "$*$" denotes matrix product. The number of parameters of $Q$ is $n \times n$, where $n$ is the number of output channels in the block, which is usually not too large, so that we can estimate $Q$ with a limited number of samples.
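A minimal numerical sketch of this estimation (synthetic block responses with hypothetical sizes; in the real method $F^s$ and $F^t$ come from forwarding the few samples through both networks):

```python
import torch

torch.manual_seed(0)
n, n_cols = 16, 6400  # channels; columns = N samples x h x w spatial positions
F_s = torch.randn(n, n_cols)                        # student block output
Q_true = torch.randn(n, n)
F_t = Q_true @ F_s + 0.01 * torch.randn(n, n_cols)  # noisy teacher output

# Solve min_Q ||Q F_s - F_t||_F^2; torch.linalg.lstsq solves A X = B for X,
# so transpose both sides: F_s^T Q^T = F_t^T.
Q = torch.linalg.lstsq(F_s.T, F_t.T).solution.T
assert torch.allclose(Q, Q_true, atol=0.1)  # n*n parameters, well determined
```

With only $n \times n$ unknowns, a few hundred samples already over-determine the system, which is why the estimate is reliable in the few-sample regime.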
Suppose there are $M$ corresponding blocks in the teacher and student networks to be aligned. To achieve our goal, we need to minimize the following loss function

$$\mathcal{L} = \sum_{j=1}^{M} \sum_{i=1}^{N} \left\| Q_j * F^s_{j,i} - F^t_{j,i} \right\|_2^2 \qquad (2)$$

where $Q_j$ is the tensor for the added conv-layer of the $j$-th block. In practice, instead of optimizing this loss all together with standard SGD, we optimize it with a block-coordinate descent (BCD) algorithm (Hong et al., 2017), which greedily handles each of the terms/blocks in Equation 2 sequentially, as shown in Algorithm 1 in Appendix B. The BCD algorithm for FSKD has the following advantages:

The BCD algorithm processes each block greedily with a sequential update rule, and each $Q_j$ can be solved much more cheaply with a small number of samples by aligning the block-level responses between teacher and student networks, while SGD considers all blocks together, which theoretically requires more data.

The alignment procedure is very efficient; it can usually be done within several minutes for the entire network.

The alignment procedure itself does not require class-label information of the input data due to its regression nature. However, if we fully redesign the student-net from scratch with random weights, we may leverage SGD on a few labeled samples to initialize the network. FSKD still produces significant performance gains over SGD in this case.

Our FSKD works extremely well for student-nets obtained by aggressively pruning/decomposing the teacher-net. It beats the standard fine-tuning based solution in the amount of data required, processing speed, and accuracy of the resulting student-net.
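The sequential update above can be sketched as follows. This is a simplified illustration, not the paper's Algorithm 1: the block callables and shapes are hypothetical stand-ins, nonlinearities inside blocks are omitted so the per-block least-squares solve is exact, and the final absorbing step is skipped:

```python
import torch
import torch.nn.functional as F

def fskd_bcd(student_blocks, teacher_blocks, x):
    """Sequentially estimate a 1x1 conv Q per block by least squares and
    apply it to the student's running activations (sketch; no absorbing)."""
    x_s, x_t = x, x
    for s_block, t_block in zip(student_blocks, teacher_blocks):
        x_t = t_block(x_t)                         # teacher block output
        x_s = s_block(x_s)                         # student block output
        n = x_s.shape[1]
        F_s = x_s.transpose(0, 1).reshape(n, -1)   # (n, N*h*w)
        F_t = x_t.transpose(0, 1).reshape(n, -1)
        # Solve min_Q ||Q F_s - F_t||_F^2 via F_s^T Q^T = F_t^T.
        Q = torch.linalg.lstsq(F_s.T, F_t.T).solution.T
        x_s = F.conv2d(x_s, Q.view(n, n, 1, 1))    # apply (then absorb) Q
    return x_s

# Toy check: the teacher equals the student followed by an unknown 1x1 mixing,
# so one block-level solve recovers the teacher's output exactly.
torch.manual_seed(0)
c, n = 3, 6
Ws = torch.randn(n, c, 3, 3)
Q_true = torch.randn(n, n)
Wt = torch.einsum('ij,jckl->ickl', Q_true, Ws)
x = torch.randn(4, c, 8, 8)
out = fskd_bcd([lambda z: F.conv2d(z, Ws, padding=1)],
               [lambda z: F.conv2d(z, Wt, padding=1)], x)
assert torch.allclose(out, F.conv2d(x, Wt, padding=1), atol=1e-2)
```

Because each $Q_j$ has a closed-form solution, each block is handled in one pass before moving to the next, mirroring the sequential block-by-block update described above.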
4 Experiment
We perform extensive experiments on different image classification datasets to verify the effectiveness of FSKD with various student-net construction methods. Student-nets can be obtained either by compressing the teacher-net or by redesigning the network structure with random initialization (termed "zero student network"). For the former case, we evaluate FSKD on three well-known compression methods: filter pruning (Li et al., 2016), network slimming (Liu et al., 2017), and network decoupling (Guo et al., 2018). We implement the code with PyTorch, and conduct experiments on a desktop PC with an Intel i7-7700K CPU and one NVIDIA 1080 Ti GPU.
Table 1: Results of recovering filter-pruned VGG-16 on CIFAR-10 with different methods.

Method  Top-1 before (%)  Top-1 after (%)  FLOPs reduced  #Param pruned  #Samples
VGG-16  92.66  -  -  -  -
Scheme-A + FSKD-BCD  85.42  92.37  34%  64%  100
Scheme-A + FSKD-SGD  85.42  92.18  34%  64%  100
Scheme-A + FitNet  85.42  91.23  34%  64%  100
Scheme-A + FSKD-BCD  85.42  92.46  34%  64%  500
Scheme-A + FSKD-SGD  85.42  92.42  34%  64%  500
Scheme-A + FitNet  85.42  92.13  34%  64%  500
Scheme-A + Fine-tuning  85.42  90.25  34%  64%  500
Scheme-A + Full fine-tuning  85.42  92.54  34%  64%  50000
Scheme-B + FSKD-BCD  47.90  90.17  58%  77%  100
Scheme-B + FSKD-SGD  47.90  89.41  58%  77%  100
Scheme-B + FitNet  47.90  88.76  58%  77%  100
Scheme-B + FSKD-BCD  47.90  91.21  58%  77%  500
Scheme-B + FSKD-SGD  47.90  90.76  58%  77%  500
Scheme-B + FitNet  47.90  90.68  58%  77%  500
Scheme-B + Fine-tuning  47.90  83.36  58%  77%  500
Scheme-B + Full fine-tuning  47.90  91.53  58%  77%  50000
4.1 Student Network from Compressing the Teacher Network
Filter Pruning
We first obtain student-nets using the filter pruning method (Li et al., 2016), which prunes conv-filters according to the ℓ1-norm of their weights. The ℓ1-norms of the filter weights are sorted, and the filters with the smallest norms are pruned to reduce the number of filter channels in a conv-layer.
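A minimal sketch of this selection rule in PyTorch (keeping the largest-norm filters; the actual method also adjusts the next layer's input channels, which is omitted here):

```python
import torch
import torch.nn as nn

def prune_filters(conv: nn.Conv2d, keep_ratio: float) -> nn.Conv2d:
    """Keep the keep_ratio fraction of filters with largest L1 weight norm."""
    w = conv.weight.data                    # shape (n, c, k, k)
    norms = w.abs().sum(dim=(1, 2, 3))      # per-filter L1 norm
    n_keep = max(1, int(w.shape[0] * keep_ratio))
    keep = torch.argsort(norms, descending=True)[:n_keep]
    pruned = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                       stride=conv.stride, padding=conv.padding, bias=False)
    pruned.weight.data = w[keep].clone()
    return pruned

conv = nn.Conv2d(16, 32, kernel_size=3, padding=1, bias=False)
small = prune_filters(conv, keep_ratio=0.5)   # prune half of the filters
assert small.weight.shape == (16, 16, 3, 3)
```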
We make a comprehensive study of VGG-16 (Simonyan & Zisserman, 2015) on the CIFAR-10 dataset to evaluate the performance of FSKD under different configuration settings. Following Li et al. (2016), we first prune half of the filters in conv1_1, conv4_1, conv4_2, conv4_3, conv5_1, conv5_2, and conv5_3 while keeping the other layers unchanged (scheme-A). We also propose a more aggressive pruning scheme named scheme-B, which prunes 10% more filters in the aforementioned layers, and also prunes 20% of the filters in the remaining layers. Scheme-A removes 64% of the total parameters with a 7% accuracy drop. Scheme-B removes 77% of the total parameters with an almost 50% accuracy drop. We use these two pruned networks as student-nets in this study. (An extremely pruned case, "scheme-C", is also provided in Appendix D.)
We illustrate in Figure 7 in Appendix C how teacher and student are aligned at block-level. For the few-sample setting, we randomly select 100 (10 per category) and 500 (50 per category) images from the CIFAR-10 training set, and keep them fixed in all experiments. Table 1 lists the results of different methods for recovering a pruned network, including FitNet (Romero et al., 2015) and fine-tuning with limited data or the full training data (Li et al., 2016). Note that we optimize FSKD with two algorithms: FSKD-BCD uses the BCD algorithm at block-level, while FSKD-SGD optimizes the loss (Equation 2) all together with SGD. In the BCD algorithm, we observe no benefit from increasing the number of iterations, so we keep it fixed in all the following experiments. This is consistent with the finding by Hong et al. (2017) that the convergence is sublinear when each block is minimized exactly (as here, due to the linear structure). Regarding processing speed, FSKD-BCD finishes in 19.3 seconds for the scheme-B student-net with 500 samples, while FitNet requires 157.3 seconds to converge, about 8.1× slower. This verifies our previous claim that FSKD is more efficient than FitNet. In the few-sample setting, FSKD-BCD provides better accuracy recovery than both FitNet and the fine-tuning procedure adopted in Li et al. (2016). For instance, for scheme-B with only 500 samples, FSKD recovers the accuracy from 47.9% to 91.2%, while few-sample fine-tuning only recovers it to 83.36%. When the full training set is available, full fine-tuning takes about 30 minutes to reach an accuracy similar to FSKD. This demonstrates the big advantage of FSKD over full fine-tuning based solutions.
Figure 3 further studies the performance with different amounts of training samples available. Our FSKD-BCD consistently outperforms FSKD-SGD and FitNet given the same training samples. In particular, FSKD-SGD and FitNet experience a noticeable accuracy drop when the number of samples is less than 100, while FSKD-BCD can still recover the accuracy of the pruned network to a high level. It is also interesting to note that fine-tuning experiences even larger accuracy drops than FitNet when the data amount is limited. This verifies that knowledge distillation methods like FitNet provide richer information than fine-tuning/retraining with a label-based loss.
As shown, FSKD-BCD performs better and tends to be more sample-efficient than FSKD-SGD. We therefore choose it as the default algorithm and denote it simply as FSKD in the following studies. We further illustrate the per-layer (block) feature-response differences between teacher and student before and after FSKD in Figure 2(a). Before applying FSKD, the correlation between teacher and student is broken due to the aggressive compression. After FSKD, the per-layer correlations between teacher and student are restored. This verifies the ability of FSKD to recover lost information. We do see a decreasing trend as the layer depth increases, possibly due to error accumulation through multiple convolutional layers. We also show the accuracy change during sequential block-level alignment in Figure 2(b), which clearly demonstrates the effectiveness of the sequential block-by-block update in the BCD algorithm.
Network Slimming
We then study student-nets from another channel pruning method named network slimming (Liu et al., 2017), which removes insignificant filter channels and the corresponding feature maps using sparsified channel scaling factors. Network slimming consists of three steps: sparsity-regularized training, pruning, and fine-tuning. Here, we replace the time-consuming fine-tuning step with our FSKD, and follow the original paper (Liu et al., 2017) in pruning different networks on different datasets. We apply FSKD to networks pruned from VGG-19, ResNet-164, and DenseNet-40 (Huang et al., 2017), on both the CIFAR-10 and CIFAR-100 datasets. Table 2 lists results on CIFAR-10, while Table 3 lists results on CIFAR-100. Note that the channel-prune-ratio (e.g., 70% in Table 2) is the portion of channels removed relative to the total number of channels in the network. FSKD consistently outperforms FitNet and fine-tuning by a notable margin under the few-sample setting on all evaluated networks and datasets.
Table 2: Results on CIFAR-10 for networks pruned by network slimming.

Channel-prune-ratio  Top-1 before (%)  Top-1 after (%)  FLOPs reduced  #Param pruned
VGG-19  93.38  -  -  -
70% + FSKD  15.90  93.41  51%  89%
70% + FitNet  15.90  90.47  51%  89%
70% + Fine-tuning  15.90  62.86  51%  89%
ResNet-164  95.07  -  -  -
60% + FSKD  54.46  94.19  45%  37%
60% + FitNet  54.46  88.94  45%  37%
60% + Fine-tuning  54.46  60.94  45%  37%
DenseNet-40  94.18  -  -  -
60% + FSKD  88.24  93.62  46%  54%
60% + FitNet  88.24  91.37  46%  54%
60% + Fine-tuning  88.24  88.98  46%  54%
Table 3: Results on CIFAR-100 for networks pruned by network slimming.

Channel-prune-ratio  Top-1 before (%)  Top-1 after (%)  FLOPs reduced  #Param pruned
VGG-19  72.08  -  -  -
50% + FSKD  9.24  71.98  37%  75%
50% + FitNet  9.24  69.52  37%  75%
50% + Fine-tuning  9.24  48.75  37%  75%
ResNet-164  76.56  -  -  -
40% + FSKD  46.07  76.11  33%  14%
40% + FitNet  46.07  73.87  33%  14%
40% + Fine-tuning  46.07  57.45  33%  14%
DenseNet-40  73.21  -  -  -
40% + FSKD  60.62  73.26  30%  36%
40% + FitNet  60.62  71.08  30%  36%
40% + Fine-tuning  60.62  62.36  30%  36%
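For reference, the channel-selection step of network slimming can be sketched as follows (a single BN layer with synthetic scaling factors; the actual method thresholds the factors globally across all layers after sparsity-regularized training):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(64)
with torch.no_grad():
    bn.weight.copy_(torch.rand(64))   # stand-in for sparsified gammas

prune_ratio = 0.6                     # portion of channels to remove
threshold = torch.quantile(bn.weight.abs(), prune_ratio)
keep_mask = bn.weight.abs() > threshold
# Channels above the threshold survive; the matching conv filters and feature
# maps are kept, and the pruned network is then recovered (here, by FSKD).
```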
Network Decoupling
Network decoupling (Guo et al., 2018) decomposes a regular convolution layer into a sum of several blocks, where each block consists of a depthwise (DW) convolution layer and a pointwise (PW, 1×1) convolution layer. The compression ratio increases as the number of blocks decreases, but the accuracy of the compressed model also drops. Since each decoupled block ends with a 1×1 convolution, we can apply FSKD at the end of each decoupled block.
Following Guo et al. (2018), we obtain student-nets by decoupling VGG-16 and ResNet-18 pretrained on ImageNet with different numbers of decoupled blocks, i.e., the number of DW + PW blocks that each conv-layer is decoupled into. Figure 7 in Appendix C illustrates how teacher and student are aligned at block-level in this case. For VGG-16, we also decouple half of the conv-layers with one block number and the other half with another, and denote this case as "mix". We evaluate the resulting networks on the validation set of the ImageNet classification task. We randomly select one image from each of the 1000 classes in the ImageNet training set to obtain 1000 samples as our FSKD training set. Table 4 shows the top-1 accuracy of the student-nets before and after applying FSKD on VGG-16 and ResNet-18. It is quite interesting to see that, in the "mix" case for VGG-16 and the most aggressive setting for ResNet-18, we can recover the accuracy of the student-net from nearly random guessing (0.12%, 0.21%) to a much higher level (51.3% and 49.5%) with only 1000 samples. In all the other cases, FSKD recovers the accuracy of a highly compressed network to be comparable with the original network. One possible explanation is that the highly compressed networks still inherit some representation power from the teacher-net, i.e., the depthwise 3×3 convolutions, while lacking the ability to output meaningful predictions due to the inaccurate/degraded convolutions. FSKD calibrates the convolutions by aligning the block-level responses between "teacher" and "student", so that the information lost in the convolutions is compensated and reasonable recovery is achieved. (We thus hypothesize that the pointwise part is more critical for performance than the depthwise part, so that even if the depthwise conv-layers are initialized to be orthogonal from random data, training only the pointwise conv-layers could provide sufficiently accurate results. We verify this in Appendix E.)
4.2 Zero Student Network
Table 4: Top-1 accuracy on ImageNet before and after FSKD for decoupled VGG-16 and ResNet-18.

Method  Top-1 before (%)  Top-1 after (%)  GFLOPs  Reduced  #Param (M)  Pruned
VGG-16 (teacher)  68.4  -  15.47  -  14.71  -
Decoupled () + FSKD  0.24  62.7  3.76  75.7%  3.35  77.2%
Decoupled () + FSKD  1.57  67.1  5.54  64.2%  5.02  65.8%
Decoupled () + FSKD  54.6  67.6  7.31  52.7%  6.69  54.5%
ResNet-18 (teacher)  67.1  -  1.83  -  11.17  -
Decoupled () + FSKD  0.21  49.5  0.55  70.0%  2.69  75.9%
Decoupled () + FSKD  3.99  61.9  0.75  59.0%  3.95  64.6%
Decoupled () + FSKD  26.5  65.1  0.95  48.1%  5.20  53.4%
Decoupled () + FSKD  53.6  66.3  1.15  37.2%  6.46  42.2%
Finally, we evaluate FSKD on a fully redesigned student-net with a structure different from the teacher's and randomly initialized parameters (named a zero student-net). We conduct experiments on CIFAR-10 and CIFAR-100 with VGG-19 as the teacher-net and a shallower VGG-13 as the student-net. Due to the similar structure of VGG-13 and VGG-19, they can easily be aligned at block-level.
A randomly initialized network does not contain any information about the training set. Simply training this network using SGD with few samples leads to poor generalization, as shown in Figure 4. We propose two schemes to combine FSKD into the training procedure: SGD+FSKD and FitNet+FSKD.
In the SGD+FSKD case, we first use SGD to train the student-net (without using teacher-net information) on the given few labeled samples for 160 epochs (multi-step learning-rate decay by 1/10 at epochs 80 and 120, from 0.01 to 0.0001), and then apply FSKD to the obtained student-net using the same few-sample set. We repeat these two steps until the training loss converges. In the FitNet+FSKD case, we keep the same few-sample set and simply replace SGD with FitNet (using teacher-net information) to add supervision on intermediate responses during training.

We compare the results of four recovery methods: running SGD until convergence, SGD+FSKD, running FitNet until convergence, and FitNet+FSKD. To better simulate the few-sample setting, we do not apply data augmentation to the training set. We randomly pick 100, 200, 500, and 1000 samples from the CIFAR-10 training set, and 500, 1000, 1500, and 2000 samples from the CIFAR-100 training set, and keep these few-sample sets fixed in this study. Figure 4 shows the comparison of the four methods on the four few-sample sets. SGD+FSKD takes a big jump over pure SGD, and FitNet+FSKD likewise takes a big jump over pure FitNet. SGD+FSKD performs much better than FitNet on CIFAR-10, though not on CIFAR-100. There are two possible reasons. First, we did not enable data augmentation, so few-sample SGD underfits, providing much less information than the student-net can get from the teacher-net in FitNet. Second, CIFAR-100 is much more difficult than CIFAR-10, so performance is more sensitive to the number of samples. However, SGD+FSKD can still achieve accuracy on par with FitNet. We should also note that the zero student-nets have accuracy gaps to the fully trained teacher-nets on both CIFAR-10 and CIFAR-100. This is reasonable and acceptable, considering that we did not use data augmentation and trained the models with very few samples. Nevertheless, the results still demonstrate the advantages of our FSKD over SGD- and FitNet-based methods. To further illustrate the benefit of FSKD over SGD, we visualize the convolution kernels (in terms of decoupled pointwise convolutions) before SGD, after SGD, and after SGD+FSKD in Appendix F.
5 Conclusion
We proposed a novel and simple method for knowledge distillation from few samples (FSKD). The method works for student-nets constructed in various ways, including compression of teacher-nets and fully redesigned networks with random initialization, on various datasets. Experiments demonstrate that FSKD outperforms existing knowledge distillation methods by a large margin in the few-sample setting, while requiring a significantly smaller computation budget. This advantage opens up many potential applications and extensions for FSKD.
References
 Ba & Caruana (2014) Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In NIPS, 2014.
 Bart & Ullman (2005) Evgeniy Bart and Shimon Ullman. Cross-generalization: Learning novel classes from a single example by feature replacement. In CVPR, 2005.
 Bucila et al. (2006) Cristian Bucila, Rich Caruana, Alexandru Niculescu-Mizil, et al. Model compression. In SIGKDD. ACM, 2006.
 Chen et al. (2016) Tianqi Chen, Ian Goodfellow, Jonathon Shlens, et al. Net2net: Accelerating learning via knowledge transfer. In ICLR, 2016.
 Chollet (2016) François Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357, 2016.
 Courbariaux et al. (2016) M. Courbariaux, Y. Bengio, Jean-Pierre David, et al. BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1. In ICLR, 2016.
 Crowley et al. (2017) Elliot J Crowley, Gavin Gray, and Amos Storkey. Moonshine: Distilling with cheap convolutions. arXiv preprint arXiv:1711.02613, 2017.
 Denton et al. (2014) Emily Denton, Zaremba, Yann Lecun, et al. Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS, 2014.
 Fei-Fei et al. (2006) Li Fei-Fei, Rob Fergus, Pietro Perona, et al. One-shot learning of object categories. IEEE Trans. PAMI, 2006.
 Finn et al. (2017) Chelsea Finn, Pieter Abbeel, Sergey Levine, et al. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
 Guo et al. (2018) Jianbo Guo, Yuxi Li, Weiyao Lin, Yurong Chen, and Jianguo Li. Network decoupling: From regular to depthwise separable convolutions. In BMVC, 2018.
 Gupta et al. (2015) Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, et al. Deep learning with limited numerical precision. In ICML, 2015.
 Han et al. (2015) Song Han, Jeff Pool, John Tran, William Dally, et al. Learning both weights and connections for efficient neural network. In NIPS, 2015.
 Han et al. (2016) Song Han, Huizi Mao, Bill Dally, et al. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In NIPS, 2016.
 He et al. (2016) K. He, X. Zhang, J. Sun, et al. Deep residual learning for image recognition. In CVPR, 2016.
 Hinton et al. (2015) G. Hinton, O. Vinyals, Jeff Dean, et al. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
 Hinton et al. (2012) Geoffrey Hinton, Li Deng, Dong Yu, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6), 2012.
 Hong et al. (2017) Mingyi Hong, Xiangfeng Wang, Meisam Razaviyayn, and Zhi-Quan Luo. Iteration complexity analysis of block coordinate descent methods. Mathematical Programming, 163(1-2), 2017.
 Huang et al. (2017) Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. In CVPR, 2017.
 Huang & Wang (2017) Zehao Huang and Naiyan Wang. Like what you like: Knowledge distill via neuron selectivity transfer. arXiv preprint arXiv:1707.01219, 2017.
 Jaderberg et al. (2014) M. Jaderberg, A. Vedaldi, A. Zisserman, et al. Speeding up convolutional neural networks with low rank expansions. In BMVC, 2014.
 Kim et al. (2016) Y. Kim, E. Park, S. Yoo, et al. Compression of deep convolutional neural networks for fast and low power mobile applications. In ICLR, 2016.
 Kimura et al. (2018) Akisato Kimura, Zoubin Ghahramani, Koh Takeuchi, et al. Few-shot learning of neural networks from scratch by pseudo example optimization. In BMVC, 2018.
 Krizhevsky & Hinton (2012) A. Krizhevsky and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
 Lake et al. (2011) Brenden Lake, Ruslan Salakhutdinov, Jason Gross, et al. One shot learning of simple visual concepts. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 33, 2011.
 Li et al. (2016) Hao Li, Asim Kadav, Igor Durdanovic, et al. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
 Lin et al. (2014) Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. In ICLR, 2014.
 Liu et al. (2017) Zhuang Liu, Jianguo Li, Zhiqiang Shen, et al. Learning efficient convolutional networks through network slimming. In ICCV, 2017.
 Lopes et al. (2017) Raphael Gontijo Lopes, Stefano Fenu, Thad Starner, et al. Data-free knowledge distillation for deep neural networks. arXiv preprint arXiv:1710.07535, 2017.
 Luo et al. (2017) J. Luo, J. Wu, W. Lin, et al. ThiNet: A filter level pruning method for deep neural network compression. In ICCV, 2017.
 Mikolov et al. (2010) Tomáš Mikolov, Martin Karafiát, Lukáš Burget, et al. Recurrent neural network based language model. In INTERSPEECH, 2010.
 Nair & Hinton (2010) Vinod Nair and Geoffrey Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
 Rastegari et al. (2016) Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In ECCV, 2016.
 Ravi & Larochelle (2017) Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In ICLR, 2017.
 Romero et al. (2015) Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, et al. FitNets: Hints for thin deep nets. In ICLR, 2015.
 Simonyan & Zisserman (2015) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
 Smith et al. (2016) Leslie N Smith, Emily M Hand, and Timothy Doster. Gradual dropin of layers to train very deep neural networks. In CVPR, 2016.
 Srinivas & Fleuret (2018) Suraj Srinivas and Francois Fleuret. Knowledge transfer with jacobian matching. arXiv preprint arXiv:1803.00443, 2018.
 Vinyals et al. (2016) Oriol Vinyals, Charles Blundell, Tim Lillicrap, et al. Matching networks for one shot learning. In NIPS, 2016.
 Yim et al. (2017) Junho Yim, Donggyu Joo, Jihoon Bae, et al. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In CVPR, 2017.
 Zhang et al. (2016) X. Zhang, J. Zou, J. Sun, et al. Accelerating very deep convolutional networks for classification and detection. IEEE TPAMI, 38(10), 2016.
 Zhang et al. (2018) Ying Zhang, Tao Xiang, Timothy Hospedales, et al. Deep mutual learning. In CVPR, 2018.
 Zoph & Le (2016) B. Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. In ICLR, 2016.
Appendix A: Proof of Theorem 1
Proof.
When $W$ is a pointwise convolution with tensor $W \in \mathbb{R}^{n \times c}$, both $W$ and $Q$ degrade into matrix form. It is obvious that when the condition of Theorem 1 is satisfied, the theorem holds with $W' = Q \otimes W$ in this case, where $\otimes$ indicates matrix multiplication.
When $W$ is a regular convolution with tensor $W \in \mathbb{R}^{n \times c \times k \times k}$, the proof is non-trivial. Fortunately, recent work on network decoupling (Guo et al., 2018) presents an important theoretical result that serves as the basis of our derivation.
Lemma 1.
A regular convolution can be exactly expanded into a sum of depthwise separable convolutions. Formally, $\forall\, W \in \mathbb{R}^{n \times c \times k \times k}$, $\exists\, K$ such that
$$W = \sum_{j=1}^{K} P_j \circ D_j, \qquad (3)$$
where $P_j \in \mathbb{R}^{n \times c \times 1 \times 1}$ is a pointwise convolution, $D_j \in \mathbb{R}^{c \times 1 \times k \times k}$ is a depthwise convolution, and $\circ$ is the compound operation, which means performing $D_j$ before $P_j$.
Please refer to Guo et al. (2018) for the details of the proof of this lemma. When $W$ is applied to an input patch $x \in \mathbb{R}^{c \times k \times k}$, we obtain a response vector $y \in \mathbb{R}^{n}$ as
$$y = W * x, \qquad y_j = \sum_{i=1}^{c} w_{ij} * x_i, \qquad (4)$$
where $*$ means the convolution operation, $w_{ij} \in \mathbb{R}^{k \times k}$ is a tensor slice along the $i$-th input and $j$-th output channels, and $x_i$ is a tensor slice along the $i$-th channel of the 3D tensor $x$.
When a pointwise convolution $Q \in \mathbb{R}^{n' \times n}$ is added after $W$ without non-linear activation between them, we have
$$y' = Q \otimes y = Q \otimes (W * x). \qquad (5)$$
With Lemma 1, we have
$$y' = Q \otimes \sum_{j=1}^{K} (P_j \circ D_j) * x = \sum_{j=1}^{K} \big( (Q \otimes P_j) \circ D_j \big) * x. \qquad (6)$$
As both $Q$ and $P_j$ degrade into matrix form, denoting $P'_j = Q \otimes P_j$ and $W' = \sum_{j=1}^{K} P'_j \circ D_j$, we have $y' = W' * x$. This proves the case when $W$ is a regular convolution. ∎
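The merging in the proof can be checked numerically. The following NumPy sketch is a toy verification, not code from the paper: the naive `conv2d` helper and all tensor shapes are illustrative assumptions. It confirms that applying a 1x1 convolution $Q$ after a regular convolution $W$ gives the same responses as a single convolution with the merged kernel $W'$, whose $o$-th output filter is $\sum_j Q_{oj} W_j$:

```python
import numpy as np

def conv2d(x, w):
    """Naive 'valid' multi-channel convolution: x is (c, H, W), w is (n, c, k, k)."""
    n, c, k, _ = w.shape
    H, W = x.shape[1] - k + 1, x.shape[2] - k + 1
    y = np.zeros((n, H, W))
    for i in range(H):
        for j in range(W):
            patch = x[:, i:i + k, j:j + k]
            # contract the (c, k, k) patch against each output filter
            y[:, i, j] = np.tensordot(w, patch, axes=3)
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))      # input: 3 channels
W = rng.standard_normal((4, 3, 3, 3))   # regular 3x3 conv, 4 output channels
Q = rng.standard_normal((2, 4))         # added 1x1 conv, as a 2x4 matrix

# two-layer path: conv with W, then mix response channels pointwise with Q
y_two_layer = np.einsum('oj,jhw->ohw', Q, conv2d(x, W))

# merged path: fold Q into the kernel, then convolve once
W_merged = np.einsum('oj,jckl->ockl', Q, W)
y_merged = conv2d(x, W_merged)

assert np.allclose(y_two_layer, y_merged)
```

The equality holds because convolution is linear in its output channels, which is exactly what Equations (5) and (6) exploit.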
Appendix B: Algorithm of Block-level Alignment for FSKD
The block-level alignment algorithm for FSKD is in fact a block-coordinate descent (BCD) algorithm with a greedy, sequential block-level update rule, as shown in Algorithm 1.
The accuracy of FSKD-BCD versus the number of iterations is illustrated in Figure 5, which shows that more iterations do not bring noticeable performance gain. This is because in each iteration the subproblem is a linear optimization problem, so that we can find its exact minimizer. This is consistent with the findings of Hong et al. (2017). Therefore, in the paper we only report the accuracy of the first iteration for FSKD-BCD.
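To make the greedy sequential update concrete, here is a minimal NumPy sketch of the idea (an illustrative analogue, not Algorithm 1 itself: blocks are modeled as plain matrices, and the hypothetical `fit_q` helper solves each block's least-squares subproblem exactly, which is why a single pass already suffices):

```python
import numpy as np

def fit_q(s, t):
    """Exact minimizer of ||Q @ s - t||_F over Q: the linear BCD subproblem."""
    q, *_ = np.linalg.lstsq(s.T, t.T, rcond=None)
    return q.T

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 100))                 # few-sample calibration data
teacher = [rng.standard_normal((4, 4)) for _ in range(2)]
student = [rng.standard_normal((4, 4)) for _ in range(2)]

s, t = x, x
for ws, wt in zip(student, teacher):              # greedy: align block by block
    s, t = ws @ s, wt @ t                         # forward one block each
    s = fit_q(s, t) @ s                           # 1x1 alignment, then carry on

assert np.allclose(s, t)                          # exact match after one pass
```

Because each subproblem is solved exactly, a second sweep over the blocks changes nothing in this toy setting, mirroring the observation above that extra BCD iterations bring no noticeable gain.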
Appendix C: Illustration of FSKD on pruning and decoupling
Appendix D: Iterative pruning and FSKD
Previous works show that one-time extreme pruning may leave the pruned network unable to recover through fine-tuning, while an iterative pruning-and-fine-tuning procedure has been observed to be effective for obtaining extreme model compression (Han et al., 2016; Li et al., 2016; Liu et al., 2017). Inspired by these works, we propose the iterative pruning-and-FSKD procedure described in Algorithm 2 to achieve an extreme compression rate. This solution is still much more efficient than iterative pruning and fine-tuning, due to the great efficiency of FSKD over fine-tuning.
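The alternation of pruning and FSKD can be sketched on a toy linear layer (this is an illustrative analogue, not the actual Algorithm 2: pruning is modeled as zeroing the weakest input channels, and each FSKD step fits and merges a 1x1 matrix against the teacher's responses on a few unlabeled samples):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((16, 200))             # a few calibration samples
W_teacher = rng.standard_normal((16, 16))
T = W_teacher @ X                              # teacher responses

# channels to drop, weakest first (fixed order so both runs prune the same set)
order = np.argsort(np.linalg.norm(W_teacher, axis=0))

W_fskd = W_teacher.copy()
for step in range(3):                          # iterate: prune a little, then FSKD
    W_fskd[:, order[step * 4:(step + 1) * 4]] = 0.0       # prune 4 channels
    Q, *_ = np.linalg.lstsq((W_fskd @ X).T, T.T, rcond=None)
    W_fskd = Q.T @ W_fskd                      # merge the fitted 1x1 (Theorem 1)

# baseline: identical pruning with no alignment step in between
W_plain = W_teacher.copy()
W_plain[:, order[:12]] = 0.0

err_fskd = np.linalg.norm(W_fskd @ X - T)
err_plain = np.linalg.norm(W_plain @ X - T)
assert err_fskd < err_plain                    # alignment recovers accuracy
```

Each FSKD step here is a closed-form least-squares fit, which is why the whole loop needs only a handful of samples and no gradient-based fine-tuning.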
Based on this procedure, we extremely prune VGG-16 on CIFAR-10, removing 88% of its total parameters. Table 5 lists the comparison with fine-tuning, FitNet, etc. It verifies the effectiveness of our FSKD in this extreme pruning case.
Model  Top-1 before (%)  Top-1 after (%)  FLOPs reduced  #Params pruned  #Samples
VGG-16  92.66  -  -  -  -
Scheme-C + FSKD-BCD  13.05  89.55  65%  88%  100
Scheme-C + FSKD-SGD  13.05  89.01  65%  88%  100
Scheme-C + FitNet  13.05  85.09  65%  88%  100
Scheme-C + FSKD-BCD  13.05  90.41  65%  88%  500
Scheme-C + FSKD-SGD  13.05  90.12  65%  88%  500
Scheme-C + FitNet  13.05  88.31  65%  88%  500
Scheme-C + Fine-tuning  13.05  78.13  65%  88%  500
Scheme-C + Full fine-tuning  13.05  90.77  65%  88%  50000
Appendix E: Training only the pointwise conv-layers is accurate enough
People may challenge that learning only the 1x1 conv may lose representation power, and ask why the added 1x1 convolution works so well with such few samples. According to the network decoupling theory (Lemma 1), any regular conv-layer can be decomposed into a sum of depthwise separable blocks, where each depthwise separable block consists of a depthwise (DW) convolution (for spatial correlation modeling) followed by a pointwise (PW) convolution (for cross-channel correlation modeling). The added 1x1 conv-layer is finally absorbed/merged into the previous PW layer. The decoupling shows that the PW layers occupy most (>=80%) of the parameters of the whole network. We argue that learning the 1x1 conv is still very powerful, and make a bold hypothesis in subsection 4.1 that the PW conv-layers are more critical for performance than the DW conv-layers. To verify this hypothesis, we conduct experiments with VGG-16 and ResNet-50 on CIFAR-10 and CIFAR-100 under the following settings. (Footnote 4: This experiment was conducted by Mingjie Sun, another intern at Intel Labs from Tsinghua University. Details are in preparation in a separate technical report.)
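A rough parameter count supports the ">=80%" figure. Assuming each decoupling term contributes one DW kernel per channel and one PW mixing matrix (an illustrative model of Equation 3, with the number of terms K cancelling in the ratio):

```python
def pw_fraction(c_in, c_out, k=3):
    """Fraction of a decoupled block's parameters sitting in the PW layer."""
    dw = c_in * k * k        # depthwise: one k x k filter per input channel
    pw = c_in * c_out        # pointwise: c_in -> c_out channel mixing
    return pw / (dw + pw)    # the K terms of Eq. (3) cancel out

print(round(pw_fraction(64, 64), 3))     # narrow early layer -> 0.877
print(round(pw_fraction(512, 512), 3))   # wide late layer -> 0.983
```

Since the ratio simplifies to c_out / (k*k + c_out), every 3x3 layer with at least 64 output channels already puts roughly 88% or more of its parameters into the PW part, consistent with the claim above.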

Case 1: We train the network from random initialization for 160 epochs, with the learning rate decayed by 1/10 at epochs 80 and 120, going from 0.01 to 0.0001.

Case 2: We start from a randomly initialized network (VGG-16 or ResNet-50) and perform full-rank decoupling (i.e., take K at its maximum in Equation 3) so that the channels in the DW layers are orthogonal, while the PW layers remain fully random. Note that Lemma 1 ensures that the networks before and after decoupling are equivalent (i.e., they can be transformed back and forth into each other). We keep all the DW layers fixed (with random orthogonal bases) and train only the PW layers for 160 epochs. We denote this scheme as ND-1*1.
Model  CIFAR-10 (%)  CIFAR-100 (%)
VGG-16  93.00  73.35
VGG-16 (ND-1*1)  93.91  73.61
ResNet-50  92.64  69.93
ResNet-50 (ND-1*1)  93.51  70.83
Note that except for the settings explicitly described, all other configurations (training epochs, hyper-parameters, hardware platform, etc.) are kept the same in both cases. Table 6 lists the results for the two cases on both datasets with the two network structures. It is obvious that the second case (ND-1*1) clearly outperforms the first. Figure 8 further illustrates the test accuracy at different training epochs, showing that the second case (ND-1*1) converges faster and better than the first. This experiment verifies our hypothesis that, when keeping the DW channels orthogonal, training only the pointwise (1x1) conv-layers is accurate enough, or even better than training all the parameters together.
Appendix F: Filter visualization on the zero student-net
To better understand how FSKD impacts the filters, we visualize the filter kernels. As the regular conv-layer kernel size is just 3x3 in the zero student-net (VGG-13), it is hard to see differences at such a small kernel size. Instead, we visualize the pointwise convolution tensor (degraded to matrix form) in Figure 9 for the following three cases:

We initialize VGG-13 with the MSRA initialization method, and then decouple one layer (64 input channels and 64 output channels) to obtain the PW conv-layers. For simplicity, we only visualize the PW tensor of the first decoupling block (on the left), which has size 64x64;

We run SGD on few samples for VGG-13 from random initialization until convergence, and then decouple the same layer to obtain the first-rank PW tensor (visualized in the middle);

We further run FSKD on few samples for the VGG-13 already optimized by SGD, and then decouple the same layer to obtain the first-rank PW tensor (visualized on the right).
It shows that the tensor before SGD is fairly random over its value range, the tensor after SGD is less random, while the tensor after FSKD further exhibits some regular patterns, which indicates that there are strong correlations among the depthwise channels.