Accelerating Large Scale Knowledge Distillation via Dynamic Importance Sampling

by Minghan Li, et al.
University of Alberta

Knowledge distillation is an effective technique that transfers knowledge from a large teacher model to a shallow student. However, just like massive classification, large scale knowledge distillation imposes heavy computational costs on training deep neural networks, as the softmax activations at the last layer involve computing probabilities over numerous classes. In this work, we apply importance sampling, which is often used in neural machine translation, to large scale knowledge distillation. We present a method called dynamic importance sampling, where ranked classes are sampled from a dynamic distribution derived from the interaction between the teacher and student in full distillation. We highlight the utility of our proposal prior, which helps the student capture the main information in the loss function. Our approach reduces the computational cost at training time while maintaining competitive performance on the CIFAR-100 and Market-1501 person re-identification datasets.



1 Introduction

In the last few years, deep neural networks have achieved state-of-the-art performance in applications like image recognition [8, 21] and natural language processing [12, 24], yet large scale classification such as face verification [15, 20, 23] and neural machine translation [11, 18] remains challenging. The main difficulty of such massive classification tasks comes from the last softmax layer of modern deep neural nets. Computing the full activations involves calculating probabilities over all classes in the normalization constant, which requires a dot product between the last hidden layer and the weight vector of every class and therefore substantial computational power.

Figure 1: Predictions over classes and ranks of two different samples. We sort the classes in descending order of the teacher model’s prediction to obtain the class ranking. The term “rank” will be frequently used in this work.

The same problem also occurs in large scale knowledge distillation. Knowledge distillation is a model compression technique where a shallow student model tries to mimic the output of a large complex teacher model. Similar to the regular training paradigm, knowledge distillation also suffers from a growing time complexity of computing softmax probabilities over a large number of classes. To mitigate the problem, the specialists ensemble [9] was proposed where each specialist was assigned to a subset of data and learned from a single teacher in parallel. Co-distillation [1] trained multiple neural nets together on disjoint sets of data and encouraged them to share knowledge with each other. Although these methods managed to accelerate training by a large margin, more parameters are involved since we are training multiple models in parallel, which needs massive computational power and storage space, making it difficult to deploy on mobile devices.

Aiming to efficiently distill the knowledge from a teacher model on large scale datasets, we apply the sampling based approaches [3, 11] which are often used in neural machine translation. However, such methods often require a prior distribution of the word frequency (such as a unigram), making them difficult to extend to other areas. Fortunately, in knowledge distillation [9] it is very likely that we can obtain this prior from an oracle that already generalizes well, yet we find little research in this direction.

Figure 2: Forward-view of importance-sampling based distillation. We first relabel the dataset with the teacher’s prediction, i.e. the soft labels. We then sample ranks from the proposal distribution and find the corresponding subset of classes. We finally compute the cross-entropy loss between the teacher’s and student’s predictions over the subset. This forward-view is for illustration. For implementation, we use the backward-view described in Alg. 1.

In this work, we present a simple yet effective approach called dynamic importance sampling, which samples a subset of ranked classes from a proposal distribution that is dynamically adjusted during training. The proposal prior is derived from a method called dynamic class selection [26], which enables the student to back-propagate the main information in the loss function without computing the full softmax activation. We also compare our method with fixed importance sampling approaches which sample from a uniform distribution or directly from the teacher’s prediction. We show that, when the sampled subset is small, neither of these two methods performs as well as sampling from the dynamic distribution. Our approach reduces computational costs during training and sometimes even outperforms full distillation on the CIFAR-100 and Market-1501 person re-identification datasets.

2 Background

Suppose we have a training set consisting of pairs of sample $x$ and label $y \in \mathcal{C}$, where $\mathcal{C}$ is the set of all classes. In order to avoid confusion when deriving the importance-sampling based distillation in the following section, we adopt the terms in energy based models [22] to describe the basic components of a neural net. Given the input $x$ and the network’s parameters $\theta$, the energy function of the neural network is:

$$E(x; \theta) = W^\top h(x) \tag{1}$$

where $h(x)$ is the final representation, $W$ is the weight matrix at the last layer and $\theta$ denotes all the trainable parameters in the network. To obtain the prediction of the student network, we need to normalize the exponential energy:

$$p_c = \frac{\exp\left(E_c(x; \theta)/T\right)}{\sum_{j \in \mathcal{C}} \exp\left(E_j(x; \theta)/T\right)} \tag{2}$$

where $T$, which is called the temperature parameter, controls the entropy of the network’s prediction $p$. As $T \to \infty$, $p$ gradually converges to the uniform distribution. In practice, we often pick a medium temperature so as to reveal sufficient inter-class information in the teacher’s prediction $q$. Let $p$ be the prediction of the student. We can present our loss function as:

$$\mathcal{L} = (1 - \lambda)\, H(y, p) + \lambda\, H(q, p) \tag{3}$$

where $H$ is the cross entropy loss and $\lambda$ is a hyperparameter that balances the two cross entropy losses.

[19] shows that distillation can be seen as a special form of curriculum learning if $\lambda$ is gradually decreased as training proceeds. However, in this paper we keep $\lambda$ fixed in all our experiments for simplicity and clarity.
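As an illustration of the temperature-scaled softmax and the two-term distillation loss described in this section, here is a minimal NumPy sketch. The function names, the default temperature `T=4.0`, the placement of the weight `lam` on the teacher term, and the choice to evaluate the hard-label term at temperature 1 are all assumptions made for the example, not details confirmed by the text.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Softmax of logits / T along the last axis (max-subtracted for stability)."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(target, pred, eps=1e-12):
    """Mean cross entropy between target and predicted distributions."""
    return float(-(target * np.log(pred + eps)).sum(axis=-1).mean())

def distillation_loss(student_logits, teacher_logits, onehot, T=4.0, lam=1.0):
    """Weighted sum of a hard-label term and a soft-label (teacher) term.

    Assumption: lam weights the teacher term and (1 - lam) the one-hot term.
    """
    q = softmax(teacher_logits, T)        # teacher's soft labels
    p_soft = softmax(student_logits, T)   # student at the same temperature
    p_hard = softmax(student_logits, 1.0) # student at temperature 1
    return (1.0 - lam) * cross_entropy(onehot, p_hard) + lam * cross_entropy(q, p_soft)
```

With a very large temperature the softmax output approaches the uniform distribution, and with `lam=1.0` the loss is smallest when the student reproduces the teacher’s logits.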

3 Methodology

In this section, we formalize our approach in an importance-sampling based framework [3, 11] which samples classes from a proposal distribution. From the dynamic class selection process [26], we derive a mixture of Laplace distributions whose parameters can be dynamically adjusted during training. Our method not only speeds up training significantly but also maintains performance competitive with full distillation.

3.1 Importance-Sampling Based Distillation

The main idea of importance-sampling based distillation is to approximate the expected gradients of the full energy function with those computed over a set of sampled classes. Moreover, instead of directly sampling from the student’s prediction, which is costly to compute, we sample from a proposed prior distribution $Q$ to estimate the expected gradients. We formalize our approach using the framework of importance-sampling based approximation [3, 11], which avoids computing the full matrix multiplication at the softmax layer. The gradients of the cross-entropy loss in Equation (3) w.r.t. the model’s parameters $\theta$ over the complete set of classes are:

$$\nabla_\theta H(q, p) = \frac{1}{T}\left(\mathbb{E}_{c \sim p}\left[\nabla_\theta E_c\right] - \mathbb{E}_{c \sim q}\left[\nabla_\theta E_c\right]\right) \tag{4}$$

where $q$ is the teacher’s prediction, $p$ is the student’s prediction and $E$ is the energy function of the student, with $E_c$ its value for class $c$ and $T$ the temperature. The main difficulty here is to estimate both $\mathbb{E}_{c \sim p}[\nabla_\theta E_c]$ and $\mathbb{E}_{c \sim q}[\nabla_\theta E_c]$ when the number of classes is large. Therefore, we need to sample from another pre-defined distribution to efficiently estimate the expectation. If we have a proposal distribution $Q$ such as the one in Fig. 2, we can approximate this expectation by importance sampling:

$$\mathbb{E}_{c \sim p}\left[\nabla_\theta E_c\right] \approx \frac{1}{|V|} \sum_{c \in V} \frac{p_c}{Q_c}\, \nabla_\theta E_c \tag{5}$$

where the classes in $V$ are drawn from $Q$.

Algorithm 1: Backward-View of Importance Sampling Based Distillation
    // Initialization
    // Add target class
    for j = 1 to ... do
        // Sample negative classes

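However, although we don’t have to sample from the student’s prediction $p$ anymore, we still need to compute $p$ over all the classes. [3] proposed a biased but more efficient version of the importance sampling estimator of $\mathbb{E}_{c \sim p}[\nabla_\theta E_c]$:

$$\mathbb{E}_{c \sim p}\left[\nabla_\theta E_c\right] \approx \frac{1}{R} \sum_{c \in V} r_c\, \nabla_\theta E_c \tag{6}$$

where $E_c$ is the student’s energy for class $c$, $T$ is the temperature, $r_c = \exp(E_c / T) / Q_c$ and $R = \sum_{c \in V} r_c$. $V$ is a subset of classes sampled from the proposal distribution $Q$ with replacement. Though this estimator is biased, it was shown [3] that the estimation converges to the true mean as $|V| \to \infty$. So the gradients in Equation (4) can be approximated by:

$$\nabla_\theta H(q, p) \approx \frac{1}{T}\left(\frac{1}{R} \sum_{c \in V} r_c\, \nabla_\theta E_c - \frac{1}{R'} \sum_{c \in V} r'_c\, \nabla_\theta E_c\right) \tag{7}$$

where $r'_c = \exp(E'_c / T) / Q_c$, $R' = \sum_{c \in V} r'_c$ and $E'$ is the energy function of the teacher. Since we manually add the target class to the sampled subset, $Q_c$ is set to 1 for the target class when computing $r_c$ and $r'_c$. We describe the importance-sampling based distillation in Alg. 1. As we can see, the proposal distribution $Q$ plays an important role in our method. The default option for this prior is usually the uniform distribution, which assumes that we have no prior information about the frequency distribution of classes. However, in distillation we can utilize the teacher model to derive our own proposal distribution. Ideally, this prior should tell us about the main information we need to back-propagate.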
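The sampled approximation of this section can be sketched for a single training example as follows. This is an illustrative NumPy version under stated assumptions, not the authors’ implementation: the function name `sampled_logit_gradient`, its argument names, and the treatment of the target class (prepended once, with proposal probability 1) are choices made for the example. It returns the gradient of the loss with respect to the student’s logits, which is non-zero only on the sampled classes.

```python
import numpy as np

def sampled_logit_gradient(student_logits, teacher_logits, proposal, k, target,
                           T=1.0, rng=None):
    """Self-normalized importance-sampling estimate of dLoss/dlogits.

    proposal: probability of each class under the proposal distribution.
    k: number of classes drawn with replacement; the target class is added.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = student_logits.shape[0]
    sampled = rng.choice(n, size=k, replace=True, p=proposal)
    classes = np.concatenate(([target], sampled))
    # The manually added target class uses a proposal probability of 1.
    q_prop = np.concatenate(([1.0], proposal[sampled]))

    s = student_logits[classes] / T
    t = teacher_logits[classes] / T
    # Subtracting the max is only for stability; it cancels after normalization.
    r_student = np.exp(s - s.max()) / q_prop
    r_teacher = np.exp(t - t.max()) / q_prop
    w_student = r_student / r_student.sum()
    w_teacher = r_teacher / r_teacher.sum()

    grad = np.zeros(n)
    # np.add.at accumulates correctly when a class was sampled more than once.
    np.add.at(grad, classes, (w_student - w_teacher) / T)
    return grad, classes
```

Only the logits of the sampled classes are touched, so in a real network only the corresponding rows of the last-layer weight matrix would need to be multiplied with the hidden representation.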

3.2 Prediction-Difference based Selection

Before we present our design of the dynamic distribution, we first introduce a dynamic class selection method [26], which we refer to as prediction-difference based selection (PDBS) in this paper, for it is the keystone for deriving the proposal prior. The basic idea of prediction-difference based selection in distillation is to select the classes that have the largest absolute difference between the teacher’s and student’s predictions. After the selection stage, we use the selected subset of classes $S$ to approximate the full softmax activation:

$$p_c \approx \frac{\exp\left(w_c^\top h(x)/T\right)}{\sum_{j \in S} \exp\left(w_j^\top h(x)/T\right)}, \qquad c \in S \tag{8}$$

where $h(x)$ is the final representation, $T$ is the temperature, $W_S = [w_j]_{j \in S}$ is the submatrix of the complete weight matrix $W$ and $S \subseteq \mathcal{C}$ with $|S| = k$.

The major assumption behind PDBS is that most gradients are concentrated on the classes that have the biggest absolute difference between the predictions and labels. An empirical study [26] shows that the gradients w.r.t. the logits of the classes that have the k biggest absolute values indeed account for most of the total gradient. And we know from Equation (4) that the gradients are proportional to the difference between the labels and predictions, which explains why the prediction difference is used to select the classes. To avoid confusion, it is worth mentioning that this selection method is deterministic while the sampling approach introduces randomness. We compare these two approaches further and analyze their experimental results in a later section.

Although this method performs competitively with training on the full softmax, obtaining the k largest absolute prediction differences still involves computing a full softmax activation for the student’s prediction. In the next section, we use a mixture of distributions to approximate the dynamics of the PDBS method. Combined with the importance sampling technique, this frees us from querying the whole softmax activation of the student model while still providing an effective approximation.
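The selection step of PDBS can be sketched for a single sample as below. The helper names `softmax` and `pdbs_select` and the default temperature are illustrative assumptions rather than the authors’ code. Note that the full student softmax is still evaluated here, which is the cost the sampling approach is meant to avoid.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Numerically stable softmax of logits / T for one sample."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def pdbs_select(student_logits, teacher_probs, k, T=1.0):
    """Indices of the k classes with the largest |teacher - student| gap."""
    p = softmax(student_logits, T)          # full softmax over all classes
    diff = np.abs(np.asarray(teacher_probs) - p)
    return np.argsort(-diff)[:k]            # deterministic top-k selection

# A freshly initialized student predicts almost uniformly, so both the
# teacher's top class and its least likely class show a large gap.
selected = pdbs_select(np.zeros(4), np.array([0.7, 0.2, 0.09, 0.01]), k=2)
```

In this toy case the selected classes are the teacher’s most likely class and its least likely one, which mirrors the two-ended pattern discussed in the next section.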

Figure 3: Frequency distribution of each rank being selected during the training with PDBS method. The variation in the distribution is dataset-independent.

3.3 Dynamic Mixture of Laplace Distributions

In order to better approximate the deterministic selection process of PDBS with a stochastic distribution, we first count the frequency of each rank being selected during training. Fig. 1 illustrates the difference between class and rank. By organizing the classes in descending order of the teacher’s prediction, we observe some interesting patterns in the frequency statistics. We make three important observations from Fig. 3:

(1) At the early stage of training, PDBS method prefers to select classes with both high and low ranks, which corresponds to the ends in the frequency distribution. The left tip is often twice as high as the one on the right.

(2) By the middle of training, the height of the tip on the right gradually decreases.

(3) In the end, it converges to an exponential distribution and ends up selecting high rank classes more often, which forms a distribution just like the teacher’s prediction over ranked classes.

Figure 4: Predictions of both the teacher and student at different training stages. At the beginning of training, the difference between the teacher’s and student’s predictions is concentrated at both ends of the ranking.

Moreover, we find that this pattern seems to be dataset-independent as it emerges across different datasets. The main reason for this peculiar distribution is the initialization strategy we choose for the student network. Fig. 4 presents both the teacher’s and student’s predictions over a set of ranked classes. Since we initialize the weights and biases with very small values, the student’s prediction is almost a uniform distribution at the beginning of training. Also, since the teacher is a trained network, its prediction is likely to have a wider range. Therefore, the prediction difference on both high-rank and low-rank classes is relatively large.

Based on these observations, we propose to fit the normalized frequency distribution with a mixture of two Laplace distributions as shown in Fig. 5(a). In fact, we can choose any appropriate distribution to fit the frequency distribution, such as a mixture of Gaussian distributions. The reasons we choose a mixture of two Laplace distributions are: (1) Both ends in Fig. 3 are pointy, similar to the peak of a Laplacian. (2) The distribution seems to decrease exponentially from the ends towards the middle. (3) It simulates the variation in Fig. 3 easily by increasing the scale of the second Laplace as training proceeds.

It is also reasonable to choose other distributions like a mixture of Gaussians. However, we can see from Fig. 5(b) that a Gaussian has a flatter top which doesn’t approximate the frequency distribution very well. We also verify in practice that using the mixture of Laplace distributions is slightly better than using a mixture of Gaussians.

A typical Laplace distribution is defined as follows:

$$f(x \mid \mu, b) = \frac{1}{2b} \exp\left(-\frac{|x - \mu|}{b}\right) \tag{9}$$

where $\mu$ is the location parameter and $b$ is the scale parameter, which correspond to the mean and variance in a Gaussian distribution. To fit the frequency distribution in Fig. 3, we choose separate location and scale parameters for the left Laplace and for the right one. Then we discretize the composite distribution within [0,1] into bins. Finally we normalize the mixture over all the bins. In order to further simulate the dynamics of the PDBS selection process, we fix the scale of the left Laplacian and linearly increase the scale of the second. In this way, the right Laplacian will gradually converge to a uniform distribution, which makes the overall distribution similar to the one in Fig. 3(d).

We stress again that this mixture distribution is defined over a set of ranks. During training, we sample a subset of ranks for each mini-batch and then find the corresponding weight vector of each rank as shown in Fig. 2. Following this method, we obtain an effective approximation to the PDBS selection process without computing the full softmax. Combined with the importance-sampling based distillation, we present the dynamic importance sampling (DIS) method to accelerate large scale distillation, which reduces the computational costs significantly while maintaining competitive performance. For comparison, we also develop a method called fixed-teacher-importance-sampling (FTIS) which uses the prediction of the teacher as the proposal distribution. Experiments show that our approach beats the FTIS method as well as other sampling based methods on benchmark datasets.
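The construction of the dynamic proposal over ranks can be sketched as follows. The concrete numbers are assumptions chosen only to reproduce the qualitative shape described above and are not the values used in the paper: locations 0 and 1 for the two components, a 2:1 mixing weight (motivated by the observation that the left tip is often about twice as high), a fixed left scale of 0.05, and a right scale that grows linearly from 0.05 to 5.0 with training progress.

```python
import numpy as np

def laplace_pdf(x, mu, b):
    """Density of a Laplace distribution with location mu and scale b."""
    return np.exp(-np.abs(x - mu) / b) / (2.0 * b)

def rank_proposal(n_ranks, progress, left_scale=0.05, right_scale_start=0.05,
                  right_scale_end=5.0, left_weight=2.0 / 3.0):
    """Discretized mixture of two Laplace densities over ranks.

    progress: training progress in [0, 1]; all numeric defaults are
    hypothetical values for illustration.
    """
    # Bin centres of n_ranks equal-width bins on [0, 1]; rank 0 is the
    # teacher's most likely class.
    x = (np.arange(n_ranks) + 0.5) / n_ranks
    # The left scale stays fixed, the right scale increases linearly.
    right_scale = right_scale_start + progress * (right_scale_end - right_scale_start)
    mix = (left_weight * laplace_pdf(x, 0.0, left_scale)
           + (1.0 - left_weight) * laplace_pdf(x, 1.0, right_scale))
    return mix / mix.sum()  # normalize over all bins

early = rank_proposal(100, progress=0.0)  # peaks at both ends
late = rank_proposal(100, progress=1.0)   # right component nearly flat
```

Early in training the proposal favours both the highest and the lowest ranks; as the right scale grows, that component flattens and most of the mass concentrates on the high-rank classes.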

Figure 5: Fitting the normalized frequency distribution over ranks in Fig. 3 obtained from PDBS with (a) a mixture of two Laplace distributions (b) a mixture of two Gaussian distributions.

4 Experiments

We compare our approach with distillation and other baseline methods on two benchmark datasets. We adopt the notations in [2] to denote the model structures. We evaluate the model performance using different metrics, as well as compare the computational costs of each method with different hyperparameters.

4.1 Datasets

Experiments are conducted on two datasets. CIFAR-100 [13] consists of 60,000 32x32 color images in 100 classes, and we use 50,000 samples as the training set and the rest for testing. Market-1501 [27] is a benchmark dataset for the person re-identification problem, which requires algorithms to spot a person of interest across different camera views. The dataset consists of 32,668 images of 1,501 identities captured from 6 non-overlapping camera views. We use 751 identities for training and the rest for testing.

Methods                  Accuracy
teacher (ResNet32)       69.42%
student (shallow CNN)    37.39%
distillation             44.28%
PDBS (k=10)              44.61%
uniform (k=10)           42.79%
FTIS (k=10)              44.84%
DIS (k=10)               45.16%
Table 1: Top-1 classification accuracy of the shallow CNN trained by various methods on the CIFAR-100 dataset. We pick the optimal hyperparameters for each method using grid search.

Methods                  Accuracy
teacher (ResNet32)       69.42%
student (LeNet)          39.63%
distillation             46.35%
PDBS (k=10)              46.90%
uniform (k=10)           46.27%
FTIS (k=10)              46.48%
DIS (k=10)               47.30%
Table 2: Top-1 classification accuracy of LeNet trained by various methods on the CIFAR-100 dataset. We pick the optimal hyperparameters for each method using grid search.

4.2 Metrics

We report top-1 classification accuracy on CIFAR-100 dataset. As for Market-1501, we report one extra metric in information retrieval called mean average precision (meanAP) which computes the mean of the average precision scores for each query. As for evaluating the computational costs during training, we report the runtime of computing the last softmax activation and the gradients against the corresponding top-1 classification accuracy.
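For reference, mean average precision as described here can be computed as in the sketch below. This is the plain textbook definition with illustrative function names; the official Market-1501 evaluation protocol may differ in details such as how junk images and same-camera matches are handled, which are not modelled here.

```python
import numpy as np

def average_precision(relevant):
    """Average precision of one ranked result list.

    relevant: sequence of booleans, True where the item at that rank
    matches the query identity.
    """
    relevant = np.asarray(relevant, dtype=bool)
    if not relevant.any():
        return 0.0
    hits = np.cumsum(relevant)                       # matches found so far
    precision_at_k = hits / (np.arange(len(relevant)) + 1)
    return float(precision_at_k[relevant].mean())    # average over the hits

def mean_average_precision(queries):
    """Mean of the average precision scores over all queries."""
    return float(np.mean([average_precision(r) for r in queries]))

# Two toy queries: matches at ranks 1 and 3, and a single match at rank 2.
score = mean_average_precision([[True, False, True], [False, True]])
```

For the first toy query the precision at the two hits is 1 and 2/3, and for the second it is 1/2, so the mean over both queries is 2/3.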

4.3 Implementation Details

On CIFAR-100, we choose ResNet32 [8] as the teacher model. For the student models, we choose LeNet [14] and a shallow neural network that has 1 convolutional layer with 32 5x5 kernels (stride=2) followed by a 2x2 max-pooling layer. To reduce the network parameters, we insert a 1200-dim linear bottleneck layer between the pooling layer and the last 2048 FC layer with ReLU activation. We use the Adam optimizer (initial learning rate=0.01, =0.99) to train all the student models for 30 epochs.

On Market-1501, we use ResNet152 as the teacher model and ResNet18 as the student model. All experiments run for 180 epochs. The student baseline is trained with the original one-hot labels and dropout. Each model is trained by the RMSProp optimizer (initial learning rate=0.01, momentum=0.9). We normalize the samples without performing any other data augmentation. We use the teacher model to relabel the datasets before training the students.

4.4 Methods

Figure 6: Top-1 accuracy of the shallow CNN trained by different methods vs. the size of the selected subset on CIFAR-100 dataset. The number of classes for full distillation is 100, but for visualization purpose we set it to 35.

We train each student model by all the approaches mentioned below with different hyperparameters:

(1) Hard Labels: A regular training method where the student is trained with the one-hot labels.

(2) Knowledge Distillation: A model transfer technique that enables a shallow student model to learn from a well-performing teacher by minimizing the cross entropy between the two predictions.

(3) Prediction-Difference based Selection (PDBS): A heuristic class selection approach which selects the classes that have the biggest prediction difference between the teacher and student and then computes the partial softmax over the selected subset. However, this method needs to compute the full softmax activation to obtain the prediction difference.

(4) Uniform Sampling: An importance-sampling based distillation that uses the uniform distribution as the proposal distribution for each sample.

(5) Fixed Teacher Importance Sampling (FTIS): An importance-sampling based distillation that uses the teacher’s prediction as the proposal distribution for each corresponding sample.

(6) Dynamic Importance Sampling (DIS): An importance-sampling based approach that uses the mixture of Laplace distributions as the proposal distribution for each mini-batch. Note that we can only sample ranks from this distribution and we need to convert those ranks to the corresponding classes as shown in Fig. 2. The mixture of distributions varies during training.
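The three sampling-based methods in the list above differ only in the proposal distribution. The sketch below expresses each proposal as a distribution over classes for one sample; the function names are hypothetical, and the rank-to-class conversion for DIS simply follows the teacher’s ranking as described in the text.

```python
import numpy as np

def uniform_proposal(n_classes):
    """Uniform sampling: every class is equally likely."""
    return np.full(n_classes, 1.0 / n_classes)

def ftis_proposal(teacher_probs):
    """FTIS: the teacher's prediction itself is the proposal."""
    teacher_probs = np.asarray(teacher_probs, dtype=float)
    return teacher_probs / teacher_probs.sum()

def dis_class_proposal(teacher_probs, rank_probs):
    """DIS: map a distribution over ranks onto classes.

    rank_probs[r] is the probability of sampling rank r, where rank 0 is
    the class the teacher considers most likely for this sample.
    """
    ranked = np.argsort(-np.asarray(teacher_probs))  # classes by teacher rank
    class_probs = np.empty(len(rank_probs))
    class_probs[ranked] = rank_probs
    return class_probs

teacher = np.array([0.1, 0.6, 0.3])
per_class = dis_class_proposal(teacher, np.array([0.5, 0.3, 0.2]))
```

Here class 1 holds rank 0, class 2 holds rank 1 and class 0 holds rank 2, so `per_class` assigns them the probabilities 0.5, 0.3 and 0.2 respectively.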

4.5 Results on CIFAR-100

Tables 1 and 2 summarize the results on the CIFAR-100 dataset. We select the optimal hyperparameters for each method given the size of the selected subset. Fig. 6 illustrates the trade-off between the number of classes and the performance of each method. We can conclude from these results that: (1) All the sampling based or selection based approaches reach similar accuracy when the size of the subset is large. (2) Our DIS method has the highest performance and is the most stable one over different sizes of the subset. (3) The FTIS method achieves similar performance to the PDBS method, but it is still worse than our proposed DIS method. (4) Uniform sampling performs the worst among all those methods. However, it still manages to surpass the student trained by the original one-hot labels, suggesting that even a little knowledge from the teacher can be very helpful.

The performance of both sampling from a fixed teacher and the pure selection method implies that introducing randomness in the selection process and back-propagating the gradients with the largest absolute values are equally important. Our method combines the advantages of both the sampling based approach and the selection based approach.

Methods                  meanAP    Accuracy
teacher (ResNet152)      63.7%     84.2%
student (ResNet18)       55.5%     79.3%
distillation             61.7%     82.6%
PDBS (k=20)              59.5%     81.9%
PDBS (k=120)             62.0%     82.7%
uniform (k=20)           52.8%     73.9%
uniform (k=120)          59.5%     79.8%
FTIS (k=20)              52.1%     77.9%
FTIS (k=120)             59.3%     82.1%
DIS (k=20)               58.9%     79.6%
DIS (k=120)              61.2%     81.9%
Table 3: Top-1 classification accuracy and meanAP of ResNet18 trained by various methods on the Market-1501 dataset. We pick the optimal hyperparameters for each method using grid search.

4.6 Results on Market-1501

Table 3 summarizes the meanAP, allshots and the top-1 classification accuracy of the student model trained by different methods on Market-1501. Fig. 7 compares the computational cost against the performance for various methods. We choose top-1 accuracy to characterize the performance. We can see that our method strikes a good balance between the approximation precision and computational costs. The dynamic importance sampling method reduces the time of computing the last softmax activation per iteration from 60.68s to 46.01s on a 2 GHz Intel Core i5 CPU and a Tesla M60 GPU, a reduction of about 24%. We observe that PDBS and DIS still outperform the FTIS and uniform sampling methods, and come very close to the performance of full distillation.

The computational cost of sampling from a distribution is non-negligible because of the way we implement it. For the FTIS method, which needs to sample from the teacher’s prediction for every sample, the run-time goes up quickly as the size of the subset increases. These sampling methods could have used less time if we had optimized the sampling procedure. However, even with this avoidable overhead, our method still reduces the training time by a large margin while maintaining competitive performance. The experimental results on Market-1501 further support the effectiveness of our dynamic importance sampling method.

Figure 7: Top-1 classification accuracy vs. computational cost of ResNet18 trained by different methods on Market-1501. Points closer to the upper left mean high accuracy with low computational costs.

5 Related Work

5.1 Knowledge Distillation

Model compression [4] aims to compress a large complex model into a smaller one without significant loss in performance. For models like neural networks, a direct approach is to minimize the L2 loss between the two networks’ logits [2]. Knowledge distillation [9] is also a model compression method, which transfers the knowledge within a teacher model to a shallower student model. Other than the accuracy improvement, distilling knowledge from a deep neural net to models such as decision trees helps to interpret how the network makes decisions [6]. Moreover, methods that utilize the teacher model’s intermediate layers to guide the training of a student also provide extra benefits for training very deep models [19, 17, 10]. The born-again network distills the knowledge to itself in order to learn from past experience [7]. There are also works that try to provide a theoretical explanation for distillation by unifying it with the privileged information theory [16]. Distillation can also be applied to meta-learning such as transferring the attention map of a deep CNN [25].

5.2 Approximate Softmax

Sampling based approaches [11] sample a small subset of the full classes. In hierarchical softmax [18], the flat softmax layer is replaced with a hierarchical layer that has the words as leaf nodes. Differentiated softmax [5] is based on the intuition that not all words need the same number of parameters to fit. There are also selection based approaches [26], which are designed to pick the classes according to some heuristics. Most of the approaches mentioned above aim to lower the computational cost of the softmax activation; however, in some cases the model performance was also improved by using those approximation methods [11, 26]. It is also possible to replace the sampling-based method in our work with any of the above approximation methods to accelerate knowledge distillation. However, in our initial exploratory experiments we found the sampling approach more intuitive and more compatible with distillation.

6 Discussion

Our experimental results show that gradients computed over a subset of classes can provide an effective approximation given a proper selection approach. We have presented two kinds of methods for selecting such subsets: importance sampling and heuristic selection. The purpose of both is the same: to approximate the expected gradients of the energy function as accurately as possible. The difference between them is whether randomness is introduced in the selection process. Our results in Fig. 6 demonstrate that introducing noise while selecting the subset can sometimes provide extra regularization on the representation. As we can see, when the number of classes is extremely small, the performance of our DIS method does not seem to drop as quickly as that of other methods.

7 Conclusions

In this work, we present a novel importance-sampling based method which not only reduces the computational costs for large scale distillation, but also sometimes outperforms the original distillation method. We highlight the utility of our dynamic distribution, which is derived from the frequency statistics of the prediction-difference based selection. By sampling from this prior, we save the cost of querying the full softmax activation while retaining the major information to back-propagate. Experiments on large scale datasets show that our proposed method can accelerate training by a large margin without significant loss in precision.