1 Introduction
In the last few years, deep neural networks have achieved state-of-the-art performance in applications like image recognition [8, 21] and natural language processing [12, 24], yet large scale classification such as face verification [15, 20, 23] and neural machine translation [11, 18] still remains challenging. The main difficulty of such massive classification tasks comes from the last softmax layer of modern deep neural nets. Computing the full activation requires the normalization constant, which sums over all classes; this in turn requires the dot product between the last hidden layer and the weight vector of every class, so the cost grows linearly with the number of classes.
The same problem also occurs in large scale knowledge distillation. Knowledge distillation is a model compression technique where a shallow student model tries to mimic the output of a large complex teacher model. Similar to the regular training paradigm, knowledge distillation also suffers from a growing time complexity of computing softmax probabilities over a large number of classes. To mitigate the problem, the specialists ensemble [9] was proposed where each specialist was assigned to a subset of data and learned from a single teacher in parallel. Codistillation [1] trained multiple neural nets together on disjoint sets of data and encouraged them to share knowledge with each other. Although these methods managed to accelerate training by a large margin, more parameters are involved since we are training multiple models in parallel, which needs massive computational power and storage space, making it difficult to deploy on mobile devices.
Aiming to efficiently distill the knowledge from a teacher model on large scale datasets, we apply the sampling based approaches [3, 11] that are often used in neural machine translation. However, such methods often require a prior distribution over word frequency (such as a unigram distribution), which makes them difficult to extend to other areas. Fortunately, in knowledge distillation [9] it is very likely that we can obtain this prior from an oracle that already generalizes well, yet we find little research in this direction.
In this work, we present a simple yet effective approach called dynamic importance sampling, which samples a subset of ranked classes from a proposal distribution that is dynamically adjusted during training. The proposal prior is derived from a method called dynamic class selection [26], which enables the student to backpropagate the main information in the loss function without computing the full softmax activation. We also compare our method with fixed importance sampling approaches, which sample from a uniform distribution or directly from the teacher’s prediction. We show that neither of these two methods performs as well as the dynamic distribution when the sampled subset is small. Our approach reduces computational costs during training and sometimes even outperforms full distillation on the CIFAR-100 and Market-1501 person re-identification datasets.
2 Background
Suppose we have a training set consisting of pairs of sample and label $(x, y)$, where $y \in \mathcal{Y}$ and $\mathcal{Y}$ is the set of all classes. In order to avoid confusion when deriving the importance-sampling based distillation in the following section, we adopt the terms of energy-based models [22] to describe the basic components of a neural net. Given the input $x$ and the network’s parameters $\theta$, the energy function of the neural network is:
$$E(x, y; \theta) = -W_y^{\top} h(x) \tag{1}$$
where $h(x)$ is the final representation, $W$ is the weight matrix at the last layer and $\theta$ denotes all the trainable parameters in the network. To obtain the prediction of the student network, we need to normalize the exponential energy:
$$P(y \mid x; \theta) = \frac{\exp\big(-E(x, y; \theta)/T\big)}{\sum_{y' \in \mathcal{Y}} \exp\big(-E(x, y'; \theta)/T\big)} \tag{2}$$
where $T$, which is called the temperature parameter, controls the entropy of the network’s prediction $P$. As $T \to \infty$, $P$ gradually converges to the uniform distribution. In practice, we often pick a medium temperature so as to reveal sufficient inter-class information in the teacher’s prediction $Q$. Let $P$ be the prediction of the student. We can present our loss function as:
$$\mathcal{L} = (1 - \lambda)\, H(y, P) + \lambda\, H(Q, P) \tag{3}$$
where $H$ is the cross entropy loss and $\lambda$ is a hyperparameter that balances the two cross entropy losses.
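As an illustration of Equations (2) and (3), the following is a minimal Python sketch of a temperature-scaled softmax and the combined loss. It is not the authors’ implementation: logits stand in for negative energies, all function names are hypothetical, and putting the weight `lam` on the teacher term is an assumption, since the text only states that a hyperparameter balances the two cross entropy losses.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Normalize exp(logit / T); a higher T gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, pred, eps=1e-12):
    """H(target, pred) = -sum_y target(y) * log pred(y)."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, pred))


def distillation_loss(student_logits, teacher_logits, label, lam, temperature):
    """Convex combination of the hard-label and teacher cross entropies.

    ASSUMPTION: `lam` weights the teacher term and the same temperature is
    applied to both terms; the source does not pin down either choice.
    """
    student = softmax_with_temperature(student_logits, temperature)
    teacher = softmax_with_temperature(teacher_logits, temperature)
    one_hot = [1.0 if i == label else 0.0 for i in range(len(student_logits))]
    hard = cross_entropy(one_hot, student)
    soft = cross_entropy(teacher, student)
    return (1.0 - lam) * hard + lam * soft
```

Raising `temperature` flattens both predictions, which is what exposes the inter-class information in the teacher’s output.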
[19] shows that distillation can be seen as a special form of curriculum learning if this balancing hyperparameter is gradually decreased as training proceeds. However, in this paper we keep it fixed in all our experiments for simplicity and clarity.
3 Methodology
In this section, we formalize our approach in an importance-sampling based framework [3, 11] which samples classes from a proposal distribution. We derive a mixture of Laplace distributions whose parameters can be dynamically adjusted during training from the dynamic class selection process [26]. Our method not only speeds up training significantly but also maintains performance competitive with full distillation.
3.1 Importance-Sampling Based Distillation
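The main idea of importance-sampling based distillation is to approximate the expected gradients of the full energy function with those computed over a set of sampled classes. Moreover, instead of directly sampling from the student’s prediction $P$, which is costly to compute, we sample from a proposed prior distribution $D$ to estimate the expected gradients. We formalize our approach using the framework of importance-sampling based approximation [3, 11], which avoids computing the full matrix multiplication at the softmax layer. The gradients of the cross-entropy loss in Equation (3) w.r.t. the model’s parameters over the complete set of classes are:
$$\nabla_\theta H(Q, P) = \frac{1}{T}\Big(\sum_{y \in \mathcal{Y}} Q(y \mid x)\, \nabla_\theta E(x, y; \theta) - \sum_{y \in \mathcal{Y}} P(y \mid x; \theta)\, \nabla_\theta E(x, y; \theta)\Big) \tag{4}$$
where $Q$ is the teacher’s prediction, $P$ is the student’s prediction and $E$ is the energy function of the student. The main difficulty here is to estimate both $\mathbb{E}_Q[\nabla_\theta E]$ and $\mathbb{E}_P[\nabla_\theta E]$ when the number of classes is large. Therefore, we need to sample from another predefined distribution to efficiently estimate the expectation. If we have a proposal distribution $D$ such as the one in Fig. 2, we can approximate this expectation by importance sampling:
$$\mathbb{E}_P[\nabla_\theta E] \approx \frac{1}{N} \sum_{i=1}^{N} \frac{P(y_i \mid x; \theta)}{D(y_i)}\, \nabla_\theta E(x, y_i; \theta), \qquad y_i \sim D \tag{5}$$
However, although we don’t have to sample from $P$ anymore, we still need to compute $P$, whose normalization runs over all the classes. [3] proposed a biased but more efficient version of the importance sampling estimator of $\mathbb{E}_P[\nabla_\theta E]$:
$$\mathbb{E}_P[\nabla_\theta E] \approx \frac{1}{W} \sum_{y_i \in V} w(y_i)\, \nabla_\theta E(x, y_i; \theta) \tag{6}$$
where $w(y_i) = \exp\big(-E(x, y_i; \theta)/T\big) / D(y_i)$ and $W = \sum_{y_j \in V} w(y_j)$. $V$ is a subset of classes sampled from $D$ with replacement. Though this estimator is biased, it was shown [3] that the estimate converges to the true mean as $|V| \to \infty$. So the gradients in Equation (4) can be approximated by:
$$\nabla_\theta H(Q, P) \approx \frac{1}{T}\Big(\frac{1}{W^{Q}} \sum_{y_i \in V} w^{Q}(y_i)\, \nabla_\theta E(x, y_i; \theta) - \frac{1}{W} \sum_{y_i \in V} w(y_i)\, \nabla_\theta E(x, y_i; \theta)\Big) \tag{7}$$
where $w^{Q}(y_i) = \exp\big(-E^{t}(x, y_i)/T\big) / D(y_i)$, $W^{Q} = \sum_{y_j \in V} w^{Q}(y_j)$, and $E^{t}$ is the energy function of the teacher. Since we manually add the target class to the sampled subset, its proposal probability is set to 1 when computing the weights and their normalizers. We describe the importance-sampling based distillation in Alg. 1. As we can see, the proposal distribution plays an important role in our method. The default option for this prior is usually the uniform distribution, which assumes that we have no prior information about the frequency distribution of classes. However, in distillation we can utilize the teacher model to derive our own proposal distribution. Ideally, this prior should tell us about the main information we need to backpropagate.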
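The biased estimator of [3] described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors’ code: a scalar `values[y]` stands in for the per-class gradient of the energy, the temperature is omitted, the special handling of the target class is left out, and all names are hypothetical.

```python
import math
import random


def exact_expectation(energies, values):
    """E_P[values] with P(y) proportional to exp(-energy(y)), over all classes."""
    m = min(energies)  # shift for numerical stability; cancels after normalizing
    unnorm = [math.exp(-(e - m)) for e in energies]
    z = sum(unnorm)
    return sum(u / z * v for u, v in zip(unnorm, values))


def sampled_expectation(energies, values, proposal, n_samples, rng):
    """Self-normalized (biased) estimate that only touches the sampled classes."""
    classes = list(range(len(energies)))
    # Draw classes from the proposal distribution with replacement.
    sampled = rng.choices(classes, weights=proposal, k=n_samples)
    m = min(energies[y] for y in sampled)  # the shift cancels in the ratio below
    weights = [math.exp(-(energies[y] - m)) / proposal[y] for y in sampled]
    total = sum(weights)
    return sum(w * values[y] for w, y in zip(weights, sampled)) / total


if __name__ == "__main__":
    rng = random.Random(0)
    energies = [0.1 * i for i in range(10)]   # hypothetical per-class energies
    values = [float(i) for i in range(10)]    # hypothetical per-class "gradients"
    uniform = [0.1] * 10
    print(exact_expectation(energies, values))
    print(sampled_expectation(energies, values, uniform, 50000, rng))
```

The normalizer is computed from the sampled classes only, which is what removes the sum over all classes at the price of a bias that shrinks as more classes are sampled.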
3.2 Prediction-Difference Based Selection
Before we present our design of the dynamic distribution, we first introduce a dynamic class selection method [26], which we refer to as prediction-difference based selection (PDBS) in this paper, for it is the keystone for deriving the proposal prior. The basic idea of PDBS in distillation is to select the classes that have the largest absolute difference between the teacher’s and the student’s predictions. After the selection stage, we use the selected subset of classes to approximate the full softmax activation:
$$P(y \mid x; \theta) \approx \frac{\exp\big(W_{V,y}^{\top} h(x)/T\big)}{\sum_{y' \in V} \exp\big(W_{V,y'}^{\top} h(x)/T\big)} \tag{8}$$
where $W_V$ is the submatrix of the complete weight matrix $W$ that holds the weight vectors of the selected subset of classes $V$, and $y \in V$.
The major assumption behind PDBS is that most of the gradient is concentrated on the classes that have the largest absolute difference between the predictions and the labels. An empirical study [26] shows that the gradients w.r.t. the logits of the classes with the k largest absolute values indeed account for most of the total. We also know from Equation (4) that the gradients are proportional to the difference between the labels and the predictions, which explains why the prediction difference is used to select the classes. To avoid confusion, it is worth mentioning that this selection method is deterministic while the sampling approach introduces randomness. We further compare these two approaches and analyze their experimental results in a later section.
Although this method shows performance competitive with training by the full softmax, obtaining the k largest absolute prediction differences still involves computing a full softmax activation for the student’s prediction. In the next section, we use a mixture of distributions to approximate the dynamics of the PDBS method. Combined with the importance sampling technique, this exempts us from querying the whole softmax activation of the student model while providing an effective approximation.
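The selection step can be sketched as follows. This is a hedged illustration of the idea rather than the implementation of [26]: the function names are hypothetical, the predictions are plain Python lists, and the partial softmax is normalized over the selected classes only.

```python
import math


def select_top_k_by_difference(teacher_probs, student_probs, k):
    """Return the k class indices with the largest |teacher - student|."""
    diffs = [abs(t - s) for t, s in zip(teacher_probs, student_probs)]
    order = sorted(range(len(diffs)), key=lambda c: diffs[c], reverse=True)
    return order[:k]


def partial_softmax(logits, selected):
    """Softmax normalized over the selected classes only."""
    m = max(logits[c] for c in selected)  # subtract the max for stability
    exps = {c: math.exp(logits[c] - m) for c in selected}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


if __name__ == "__main__":
    teacher = [0.70, 0.10, 0.10, 0.05, 0.05]  # hypothetical teacher prediction
    student = [0.20, 0.20, 0.20, 0.20, 0.20]  # hypothetical student prediction
    chosen = select_top_k_by_difference(teacher, student, k=3)
    print(chosen, partial_softmax([3.0, 1.0, 1.0, 0.5, 0.5], chosen))
```

Note that `select_top_k_by_difference` needs the student’s full prediction as input, which is exactly the cost that the sampling approach of the next section avoids.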
3.3 Dynamic Mixture of Laplace Distributions
In order to better approximate the deterministic selection process of PDBS with a stochastic distribution, we first count the frequency of each rank being selected during training. Fig. 1 illustrates the difference between class and rank. By organizing the classes in descending order of the teacher’s prediction, we observe some interesting patterns in the frequency statistics. We make three important observations from Fig. 3:
(1) At the early stage of training, the PDBS method prefers to select classes with both high and low ranks, which correspond to the two ends of the frequency distribution. The left tip is often twice as high as the one on the right.
(2) By the middle of training, the height of the tip on the right gradually decreases.
(3) In the end, it converges to an exponential distribution and ends up selecting high rank classes more often, which forms a distribution just like the teacher’s prediction over ranked classes.
Moreover, we find that this pattern seems to be dataset-independent, as it emerges across different datasets. The main reason such a peculiar distribution forms is the initialization strategy we choose for the student network. Fig. 4 presents both the teacher’s and the student’s predictions over a set of ranked classes. Since we initialize the weights and biases with very small floating-point numbers, the student’s prediction is almost uniform at the beginning of training. Also, since the teacher is a trained network, its prediction is likely to have a wider range. Therefore, the prediction differences on both high-rank and low-rank classes are relatively large.
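This argument can be checked numerically with a toy example. In the sketch below, the geometrically decaying teacher prediction and the exactly uniform student are assumptions made purely for illustration (they are not data from our experiments), and the function names are hypothetical.

```python
def prediction_gap_by_rank(num_classes, decay):
    """|teacher - student| per rank for a toy teacher and a uniform student.

    ASSUMPTION: the teacher's ranked prediction decays geometrically; this is
    a hypothetical stand-in for a trained teacher.
    """
    raw = [decay ** r for r in range(num_classes)]
    total = sum(raw)
    teacher = [v / total for v in raw]   # already sorted in descending order
    student = 1.0 / num_classes          # near-uniform student at initialization
    return [abs(t - student) for t in teacher]


def top_k_ranks(gaps, k):
    """Ranks with the k largest prediction differences."""
    return sorted(range(len(gaps)), key=lambda r: gaps[r], reverse=True)[:k]


gaps = prediction_gap_by_rank(num_classes=100, decay=0.8)
selected = top_k_ranks(gaps, k=20)
print(sorted(selected))
```

In this toy setting the selected ranks cluster at the two ends: the teacher is far above the uniform level at the top ranks and far below it at the bottom ranks, while the two predictions nearly coincide in between.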
Based on these observations, we propose to fit the normalized frequency distribution with a mixture of two Laplace distributions, as shown in Fig. 5(a). In fact, we can choose any appropriate distribution to fit the frequency distribution, such as a mixture of Gaussian distributions. The reasons we choose a mixture of two Laplace distributions are: (1) Both ends in Fig. 3 are pointy, similar to the peak of a Laplace distribution. (2) The distribution seems to decrease exponentially from the ends towards the middle. (3) It simulates the variation in Fig. 3 easily by increasing the scale of the second Laplace as training proceeds.
It is also reasonable to choose other distributions like a mixture of Gaussians. However, we can see from Fig. 5(b) that a Gaussian has a flatter top, which does not approximate the frequency distribution very well. We also verify in practice that using the mixture of Laplace distributions is slightly better than using a mixture of Gaussians.
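A discretized two-component Laplace mixture over ranks can be sketched as below. The sketch rests on several assumptions that are ours alone: the components are centered at the two ends of the unit interval, they have equal mixture weights, each bin is evaluated at its center, and the scale values in the example are arbitrary. The function names are hypothetical.

```python
import math


def laplace_pdf(x, loc, scale):
    """Density of a Laplace distribution with the given location and scale."""
    return math.exp(-abs(x - loc) / scale) / (2.0 * scale)


def rank_proposal(num_ranks, left_scale, right_scale, left_weight=0.5):
    """Discretize a two-Laplace mixture on [0, 1] into num_ranks bins.

    ASSUMPTION: locations 0 and 1 and the default equal weights are
    illustrative choices, not values taken from the paper.
    """
    centers = [(i + 0.5) / num_ranks for i in range(num_ranks)]
    density = [
        left_weight * laplace_pdf(c, 0.0, left_scale)
        + (1.0 - left_weight) * laplace_pdf(c, 1.0, right_scale)
        for c in centers
    ]
    total = sum(density)
    return [d / total for d in density]  # normalize over all bins


def right_scale_schedule(step, total_steps, start, end):
    """Linearly increase the right component's scale as training proceeds."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)


if __name__ == "__main__":
    early = rank_proposal(100, left_scale=0.05, right_scale=0.05)
    late = rank_proposal(100, left_scale=0.05, right_scale=5.0)
    print(early[0], early[-1], late[0], late[-1])
```

With a small right scale both ends of the proposal are peaked; as the right scale grows, the right component flattens and the probability mass moves towards the high-rank classes on the left.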
A typical Laplace distribution is defined as follows:
$$f(x \mid \mu, b) = \frac{1}{2b} \exp\left(-\frac{|x - \mu|}{b}\right) \tag{9}$$
where $\mu$ is the location parameter and $b$ is the scale parameter, which play roles analogous to the mean and variance of a Gaussian distribution (the variance of a Laplace distribution is $2b^2$). To fit the frequency distribution in Fig. 3, we use separate location and scale settings for the left Laplace and for the right one. Then we discretize the composite distribution within [0,1] into bins. Finally we normalize the mixture over all the bins. In order to further simulate the dynamics of the PDBS selection process, we fix the scale of the left Laplace and linearly increase the scale of the right one. In this way, the right Laplace gradually converges to a uniform distribution, which makes the overall distribution similar to the one in Fig. 3(d).
We stress again that this mixture distribution is defined over a set of ranks. During training, we sample a subset of ranks for each minibatch and then find the corresponding weight vector of each rank as shown in Fig. 2. Following this method, we obtain an effective approximation to the PDBS selection process without computing the full softmax. Combined with the importance-sampling based distillation, we present the dynamic importance sampling (DIS) method to accelerate large scale distillation, which reduces the computational costs significantly while maintaining competitive performance. For comparison, we also develop a method called fixed teacher importance sampling (FTIS), which uses the prediction of the teacher as the proposal distribution. Experiments show that our approach beats the FTIS method as well as other sampling based methods on benchmark datasets.
4 Experiments
We compare our approach with distillation and other baseline methods on two benchmark datasets. We adopt the notations in [2] to denote the model structures. We evaluate model performance using different metrics and compare the computational costs of each method under different hyperparameters.
4.1 Datasets
Experiments are conducted on two datasets. CIFAR-100 [13] consists of 60,000 32x32 color images in 100 classes; we use 50,000 samples as the training set and the rest for testing. Market-1501 [27] is a benchmark dataset for the person re-identification problem, which requires algorithms to spot a person of interest across different camera views. The dataset consists of 32,668 images of 1,501 identities captured from 6 non-overlapping camera views. We use 751 identities for training and the rest for testing.
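Table 1: Top-1 accuracy on CIFAR-100 with the shallow CNN student.

| Methods | Accuracy |
| --- | --- |
| teacher (ResNet32) | 69.42% |
| student (shallow CNN) | 37.39% |
| distillation | 44.28% |
| PDBS (k=10) | 44.61% |
| uniform (k=10) | 42.79% |
| FTIS (k=10) | 44.84% |
| DIS (k=10) | 45.16% |
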
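Table 2: Top-1 accuracy on CIFAR-100 with the LeNet student.

| Methods | Accuracy |
| --- | --- |
| teacher (ResNet32) | 69.42% |
| student (LeNet) | 39.63% |
| distillation | 46.35% |
| PDBS (k=10) | 46.90% |
| uniform (k=10) | 46.27% |
| FTIS (k=10) | 46.48% |
| DIS (k=10) | 47.30% |
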
4.2 Metrics
We report top-1 classification accuracy on the CIFAR-100 dataset. For Market-1501, we additionally report a metric from information retrieval called mean average precision (meanAP), which is the mean of the average precision scores over all queries. To evaluate the computational costs during training, we report the runtime of computing the last softmax activation and the gradients against the corresponding top-1 classification accuracy.
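For reference, mean average precision under its standard information-retrieval definition can be computed as in the sketch below. The function names are hypothetical, and the benchmark’s own evaluation protocol may differ in details that this sketch does not model.

```python
def average_precision(ranked_relevance):
    """AP for one query; ranked_relevance[i] is True if the i-th result matches."""
    hits = 0
    precision_sum = 0.0
    for position, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / position  # precision at this cut-off
    return precision_sum / hits if hits else 0.0


def mean_average_precision(all_ranked_relevance):
    """Mean of the per-query average precision scores."""
    scores = [average_precision(r) for r in all_ranked_relevance]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    queries = [[True, False, True], [False, True]]  # hypothetical ranked results
    print(mean_average_precision(queries))
```

Each inner list holds the relevance flags of one query’s gallery ranking, ordered from the most to the least similar result.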
4.3 Implementation Details
On CIFAR-100, we choose ResNet32 [8] as the teacher model. For the student models, we choose LeNet [14] and a shallow neural network that has 1 convolutional layer with 32 5x5 kernels (stride=2) followed by a 2x2 max-pooling layer. To reduce the number of network parameters, we insert a 1200-dim linear bottleneck layer between the pooling layer and the last 2048-dim FC layer with ReLU activation. We use the Adam optimizer (initial learning rate=0.01, $\beta_1$=0.9, $\beta_2$=0.99) to train all the student models for 30 epochs.
On Market-1501, we use ResNet152 as the teacher model and ResNet18 as the student model. We train the student model on the original one-hot labels with dropout. Each model is trained with the RMSProp optimizer (initial learning rate=0.01, momentum=0.9) for 180 epochs. We normalize the samples without performing any other data augmentation. We use the teacher model to relabel the datasets before training the students.
4.4 Methods
We train each student model by all the approaches mentioned below with different hyperparameters:
(1) Hard Labels: A regular training method where the student is trained with the one-hot labels.
(2) Knowledge Distillation: A model transfer technique that enables a shallow student model to learn from a well-performing teacher by minimizing the cross entropy between the two predictions.
(3) Prediction-Difference Based Selection (PDBS): A heuristic class selection approach which selects the classes that have the largest prediction difference between the teacher and the student and then computes the partial softmax over the selected subset. However, this method needs to compute the full softmax activation to obtain the prediction difference.
(4) Uniform Sampling: An importance-sampling based distillation that uses the uniform distribution as the proposal distribution for each sample.
(5) Fixed Teacher Importance Sampling (FTIS): An importance-sampling based distillation that uses the teacher’s prediction as the proposal distribution for each corresponding sample.
(6) Dynamic Importance Sampling (DIS): An importance-sampling based approach that uses the mixture of Laplace distributions as the proposal distribution for each minibatch. Note that we can only sample ranks from this distribution and we need to convert those ranks to the corresponding classes as shown in Fig. 2. The mixture of distributions varies during training.
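The rank-to-class conversion used by DIS can be sketched as follows. This is a minimal illustration with hypothetical names: ranks are drawn from a given proposal over ranks, each rank is mapped to the class holding that position in the teacher’s descending prediction, and the target class is appended when it was not drawn.

```python
import random


def ranks_to_classes(teacher_probs, ranks):
    """Map rank r to the class with the r-th largest teacher probability."""
    order = sorted(range(len(teacher_probs)),
                   key=lambda c: teacher_probs[c], reverse=True)
    return [order[r] for r in ranks]


def sample_classes(teacher_probs, rank_proposal, k, target, rng):
    """Sample k ranks with replacement and convert them to class indices."""
    ranks = rng.choices(range(len(rank_proposal)), weights=rank_proposal, k=k)
    classes = ranks_to_classes(teacher_probs, ranks)
    if target not in classes:
        classes.append(target)  # the target class is always added manually
    return classes


if __name__ == "__main__":
    rng = random.Random(0)
    teacher = [0.1, 0.6, 0.3]      # hypothetical teacher prediction
    proposal = [0.6, 0.1, 0.3]     # hypothetical proposal over ranks
    print(sample_classes(teacher, proposal, k=2, target=0, rng=rng))
```

Because the proposal is defined over ranks, the same distribution can be shared by every sample in a minibatch while the selected classes still differ from sample to sample.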
4.5 Results on CIFAR-100
Tables 1 and 2 summarize the results on the CIFAR-100 dataset. We select the optimal hyperparameters for each method given the size of the selected subset. Fig. 6 illustrates the trade-off between the number of classes and the performance of each method. We can conclude from these results that: (1) All the sampling based or selection based approaches reach similar accuracy when the size of the subset is large. (2) The adaptive guided sampling method has the highest performance and is the most stable one over different sizes of the subset. (3) The FTIS method achieves similar performance to the PDBS method, but it is still worse than our proposed DIS method. (4) Uniform sampling performs the worst among all these methods. However, it still manages to surpass the student trained on the original one-hot labels, suggesting that even a little knowledge from the teacher can be very helpful.
The performance of sampling from a fixed teacher and of the pure selection method implies that introducing randomness in the selection process and backpropagating the gradients with the largest absolute values are equally important. Our method combines the advantages of both the sampling based approach and the selection based approach.
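Table 3: meanAP and top-1 accuracy on Market-1501.

| Methods | meanAP | Accuracy |
| --- | --- | --- |
| teacher (ResNet152) | 63.7% | 84.2% |
| student (ResNet18) | 55.5% | 79.3% |
| distillation | 61.7% | 82.6% |
| PDBS (k=20) | 59.5% | 81.9% |
| PDBS (k=120) | 62.0% | 82.7% |
| uniform (k=20) | 52.8% | 73.9% |
| uniform (k=120) | 59.5% | 79.8% |
| FTIS (k=20) | 52.1% | 77.9% |
| FTIS (k=120) | 59.3% | 82.1% |
| DIS (k=20) | 58.9% | 79.6% |
| DIS (k=120) | 61.2% | 81.9% |
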
4.6 Results on Market-1501
Table 3 summarizes the meanAP, all-shots and the top-1 classification accuracy of the student model trained by different methods on Market-1501. Fig. 7 compares the computational cost against the performance for various methods. We choose top-1 accuracy to characterize the performance. We can see that our method strikes a good balance between approximation precision and computational cost. The dynamic importance sampling method reduces the time of computing the last softmax activation per iteration from 60.68s to 46.01s on a 2 GHz Intel Core i5 CPU and a Tesla M60 GPU, a reduction of about 24%. We observe that PDBS and DIS still outperform the FTIS and uniform sampling methods, and come very close to the performance of distillation.
Fig. 7 summarizes the performance of different methods against their costs. The computational cost of sampling from a distribution is non-negligible because of the way we implement it. For the FTIS method, which needs to sample from the teacher’s prediction for every sample, the runtime goes up quickly as the size of the subset increases. These sampling methods could have used less time if we had optimized the sampling procedure. However, even with this avoidable overhead, our method still reduces the training time by a large margin while maintaining competitive performance. The experimental results on Market-1501 further support the effectiveness of our dynamic importance sampling method.
5 Related Work
5.1 Knowledge Distillation
Model compression [4] aims to compress a large complex model into a smaller one without significant loss in performance. For models like neural networks, a direct approach is to minimize the L2 loss between the two networks’ logits [2]. Knowledge distillation [9] is another model compression method, which transfers the knowledge within a teacher model to a shallower student model. Beyond improving accuracy, distilling knowledge from a deep neural net into models such as decision trees helps to interpret how the network makes decisions [6]. Moreover, methods that utilize the teacher model’s intermediate layers to guide the training of a student also provide extra benefits for training very deep models [19, 17, 10]. The born-again network distills the knowledge to itself in order to learn from past experience [7]. There are also works that try to provide a theoretical explanation for distillation by unifying it with the privileged information theory [16]. Distillation can also be applied to meta-learning, such as transferring the attention map of a deep CNN [25].
5.2 Approximate Softmax
Sampling based approaches [11] sample a small subset of the full set of classes. In hierarchical softmax [18], the flat softmax layer is replaced with a hierarchical layer that has the words as leaf nodes. Differentiated softmax [5] is based on the intuition that not all words need the same number of parameters to fit. There are also selection based approaches [26], which are designed to pick the classes according to some heuristics. Most of the approaches mentioned above aim to lower the computational cost of the softmax activation; however, in some cases the model performance was improved by using those approximation methods [11, 26]. It is also possible to replace the sampling-based method in our work with any of the above approximation methods to accelerate knowledge distillation. However, in our initial exploratory experiments we find the sampling approach more intuitive and more compatible with distillation.
6 Discussion
Our experimental results show that gradients computed over a subset of classes can provide an effective approximation given a proper selection approach. We have presented two kinds of methods for selecting such subsets: importance sampling and heuristic selection. The purpose of these two approaches is actually the same, namely to approximate the expected gradients of the energy function as accurately as possible. The difference between them is whether randomness is introduced in the selection process. Our results in Fig. 6 demonstrate that introducing noise while selecting the subset can sometimes provide extra regularization on the representation. As we can see, when the number of classes is extremely small, the performance of our DIS method does not seem to drop as quickly as that of other methods.
7 Conclusions
In this work, we present a novel importance-sampling based method which not only reduces the computational costs of large scale distillation, but also sometimes even outperforms the original distillation method. We highlight the utility of our dynamic distribution, which is derived from the frequency statistics of the prediction-difference based selection. By sampling from this prior, we save the cost of querying the full softmax activation while retaining the major information to backpropagate. Experiments on large scale datasets show that our proposed method can accelerate training by a large margin without significant loss in precision.
References
 [1] R. Anil, G. Pereyra, A. Passos, R. Ormandi, G. E. Dahl, and G. E. Hinton. Large scale distributed neural network training through online distillation. arXiv preprint arXiv:1804.03235, 2018.
 [2] J. Ba and R. Caruana. Do deep nets really need to be deep? In Advances in neural information processing systems, pages 2654–2662, 2014.
 [3] Y. Bengio and J.-S. Senécal. Adaptive importance sampling to accelerate training of a neural probabilistic language model. IEEE Transactions on Neural Networks, 19(4):713–722, 2008.
 [4] C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil. Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 535–541, 2006.
 [5] W. Chen, D. Grangier, and M. Auli. Strategies for training large vocabulary neural language models. arXiv preprint arXiv:1512.04906, 2015.
 [6] N. Frosst and G. Hinton. Distilling a neural network into a soft decision tree. arXiv preprint arXiv:1711.09784, 2017.
 [7] T. Furlanello, Z. C. Lipton, M. Tschannen, L. Itti, and A. Anandkumar. Born again neural networks. arXiv preprint arXiv:1805.04770, 2018.

 [8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [9] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
 [10] Z. Huang and N. Wang. Like what you like: Knowledge distill via neuron selectivity transfer. arXiv preprint arXiv:1707.01219, 2017.
 [11] S. Jean, K. Cho, R. Memisevic, and Y. Bengio. On using very large target vocabulary for neural machine translation. arXiv preprint arXiv:1412.2007, 2014.
 [12] R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.
 [13] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
 [14] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [15] W. Liu, Y. Wen, Z. Yu, and M. Yang. Large-margin softmax loss for convolutional neural networks. In ICML, pages 507–516, 2016.
 [16] D. Lopez-Paz, L. Bottou, B. Schölkopf, and V. Vapnik. Unifying distillation and privileged information. arXiv preprint arXiv:1511.03643, 2015.
 [17] Y. Luo. Can subclasses help a multiclass learning problem? In Intelligent Vehicles Symposium, 2008 IEEE, pages 214–219, 2008.
 [18] F. Morin and Y. Bengio. Hierarchical probabilistic neural network language model. In Aistats, volume 5, pages 246–252. Citeseer, 2005.
 [19] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.

 [20] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015.
 [21] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.

 [22] Y. W. Teh, M. Welling, S. Osindero, and G. E. Hinton. Energy-based models for sparse overcomplete representations. Journal of Machine Learning Research, 4(Dec):1235–1260, 2003.
 [23] Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision, pages 499–515. Springer, 2016.
 [24] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
 [25] S. Zagoruyko and N. Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928, 2016.
 [26] X. Zhang, L. Yang, J. Yan, and D. Lin. Accelerated training for massive classification via dynamic class selection. arXiv preprint arXiv:1801.01687, 2018.
 [27] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian. Scalable person re-identification: A benchmark. In Proceedings of the IEEE International Conference on Computer Vision, pages 1116–1124, 2015.