1 Introduction
Deploying complex DNNs on resource-constrained edge devices such as smartphones or IoT devices is still challenging due to their tight memory and computation budgets. To address this, several research works have been proposed, which can be roughly categorized into three groups: weight pruning [han2015learning, han2015deep, liu2018rethinking, frankle2018lottery], quantization [krishnamoorthi2018quantizing, zhang2018lq, lin2016fixed, wang2018two, cai2017deep], and knowledge distillation [hinton2015distilling, kim2018paraphrasing, yim2017gift].
Today, a majority of edge devices require fixed-point inference for compute and power efficiency. Hence, quantizing the weights and activations of deep networks to a certain bitwidth becomes a necessity to make them compatible with edge devices. Recently, hardware accelerators like NPUs (Neural Processing Units), NVIDIA’s Tensor Core units and even CIM (compute-in-memory) devices have targeted support for sub-byte-level processing such as 4-bit or 2-bit matrix multiply-and-accumulate operations [markidis2018nvidia, pan2018multilevel, sharma2019accelerated, sumbul2019compute]. These are much more power- and compute-efficient than conventional 8-bit processing. Thus, there is increased demand for lower-bit quantization. The accuracy of very low-bit quantized networks, e.g. using 4, 3 or 2 bits, inevitably decreases because of their low representation power compared to typical 8-bit quantization. Some works [mishra2017apprentice, polino2018model] used knowledge distillation (KD) to improve the accuracy of quantized networks by transferring knowledge from a full-precision teacher network to a quantized student network. However, even though KD is one of the most widely used techniques to enhance a DNN’s accuracy [hinton2015distilling], directly applying KD to very low-bit quantized networks incurs several challenges:


Due to its limited representation power, a quantized network generally shows lower training and test accuracy than its full-precision counterpart. On the other hand, because KD uses not only the ground-truth labels but also the teacher’s estimated class probability distributions, it acts as a heavy regularizer [hinton2015distilling]. Combining these contradictory characteristics of quantization and KD can sometimes degrade performance below that of quantization alone.
In general, it has been shown that a powerful teacher with high accuracy can teach a student better than a weaker one [zhu2018knowledge]. Also, [kim2018paraphrasing, mirzadeh2019improved] show that a large gap in capacity or inherent differences between the teacher and the student can hinder knowledge transfer, because the teacher’s knowledge is not adaptable to the student. Previous works on quantization with KD [mishra2017apprentice, polino2018model] directly applied conventional KD to quantization with a fixed teacher network, and thus suffer from the above-mentioned limitations.
In this paper, we propose Quantization-aware Knowledge Distillation (QKD) to address the above-mentioned challenges, especially for very low-precision quantization. QKD consists of three phases. In the first ‘self-studying’ phase, instead of directly applying knowledge distillation to the quantized student network from the beginning, we first try to find a good initialization. In the second ‘co-studying’ phase, we improve the teacher by using the knowledge of the student network. This phase makes the teacher more powerful and quantization-friendly (see Section 4.5), which is essential to improve accuracy. In the final ‘tutoring’ phase, we freeze the teacher network and transfer its knowledge to the student. This phase saves the unnecessary training time and memory of the teacher network, which has typically already saturated during the co-studying phase. The overall process of QKD is depicted in Figure 2. Our key contributions are the following:

We propose Quantization-aware Knowledge Distillation, which can be effectively applied to very low-bit (2-, 3-, 4-bit) quantized networks. Considering the characteristics of low-bit networks and distillation methods, we design a combination of three phases for QKD that overcomes the shortcomings of conventional quantization + KD methods described above.

We empirically verify that QKD works well on depthwise convolution networks, which are known to be difficult to quantize; our work is the first to apply KD to train low-bit depthwise convolution networks (viz. MobileNetV2 and EfficientNet).

We show that our QKD obtains state-of-the-art accuracies on the CIFAR and ImageNet datasets compared to other existing methods (see Figure 1).
2 Related work
Quantization Reducing the precision of neural networks [courbariaux2015binaryconnect, li2016ternary, zhu2016trained, nagel2019data, sung2015resiliency] has been studied extensively due to its computational and storage benefits. Although binary [courbariaux2015binaryconnect] or ternary [li2016ternary] quantization is commonly used, these methods only consider the quantization of weights. To fully utilize bitwise operations, activation maps should also be quantized in the same way as weights. Some works consider quantizing both weights and activation maps [hubara2016binarized, soudry2014expectation, rastegari2016xnor, zhou2016dorefa, wang2019haq]. Binarized neural networks [hubara2016binarized] quantize both weights and activation maps to binary values and compute gradients with binary values. XNOR-Net [rastegari2016xnor] also quantizes its weights and activation maps, with scaling factors obtained by constrained optimization. Furthermore, HAQ [wang2019haq] adaptively changes the bitwidth per layer by leveraging reinforcement learning and hardware awareness.
Recent works have tried to improve quantization further by learning the ranges of weights and activation maps [choi2018pact, jung2019learning, esser2019learned, uhlich2019differentiable]. These approaches easily outperform previous methods that do not train quantization-related parameters. PACT [choi2018pact] proposes a clipping activation function with a trainable parameter that limits the maximum value of the activation. LSQ [esser2019learned] and TQT [tqt2019] introduce uniform quantization using trainable interval values. QIL [jung2019learning] uses a trainable quantizer that performs both pruning and clipping. Our QKD leverages these trainable approaches in its baseline quantization implementation. On top of this quantization-only method, our elaborated knowledge distillation process boosts the accuracy, thereby achieving state-of-the-art accuracy on low-bit quantization.

Knowledge Distillation
KD is one of the most popular methods for model compression and is widely used in many computer vision tasks. This framework transfers the knowledge of a teacher network to a smaller student network in two ways: offline and online. First, offline KD uses a fixed pretrained teacher network, and the knowledge transfer can happen in different ways. [hinton2015distilling] encourages the student network to mimic the softened distribution of the teacher network. Other works [zagoruyko2016paying, heo2019comprehensive, kim2018paraphrasing, romero2014fitnets] transfer the information using different forms of activation maps from the teacher network. Second, online KD methods [zhu2018knowledge, kim2019feature, zhang2018deep] train both the teacher and the student networks simultaneously, without pretrained teacher models. Deep Mutual Learning (DML) [zhang2018deep] transfers knowledge between independent networks using a KL loss, although it was not studied in the context of quantization. Our QKD uses both online and offline KD methods sequentially.

Quantization + Knowledge distillation Some works have tried to use distillation methods to train low-precision networks [mishra2017apprentice, polino2018model]. They use a full-precision network as the teacher and the low-bit network as the student. In Apprentice (AP) [mishra2017apprentice], the teacher and student networks are initialized with the corresponding pretrained full-precision networks and the student is then fine-tuned using distillation. Due to this initialization of the student, AP tends to get stuck in a local minimum in the case of very low-bit quantized student networks. The self-studying phase of QKD directly mitigates this issue. Also, using a fixed teacher, as in [mishra2017apprentice, polino2018model], can limit the knowledge transfer due to the inherent differences between the distributions of the full-precision teacher and the low-precision student network. We tackle this problem via online co-studying (CS) and offline tutoring (TU).
3 Proposed Method
In this section, we first explain the quantization method used in QKD and then describe the proposed QKD method.
3.1 Quantization
In QKD, we use a trainable uniform quantization scheme for our baseline quantization implementation. We choose uniform quantization because of its hardware-friendly characteristics. In this work, we quantize both weights and activations, so we introduce two trainable parameters for the interval values of each layer’s weight parameters and input activations, similar to [esser2019learned, tqt2019, uhlich2019differentiable]. These parameters can be trained with either the task loss or a combination of the task and distillation losses. Considering $k$-bit quantization for weights and activations, the weight and activation quantizers are defined as follows.
Weight Quantizer: Since weights can take positive as well as negative values, we quantize each weight to integers in the range $[-2^{k-1}, 2^{k-1}-1]$. Before rounding, the weights are first constrained to this range using a clamping function $\mathrm{clamp}(x, a, b)$ where

$$\mathrm{clamp}(x, a, b) = \min(\max(x, a), b). \qquad (1)$$
We use a trainable parameter $s_w$ for the interval value of weight quantization. It is trained along with the weights of the network. We can calculate the quantization level $n$ of an input weight $w$ using a rounding operation on $w / s_w$. The overall quantization–dequantization scheme for the weights of the network is defined as follows:

$$\bar{w} = s_w \cdot \underbrace{\left\lfloor \mathrm{clamp}\!\left(\tfrac{w}{s_w},\, -2^{k-1},\, 2^{k-1}-1\right) + \tfrac{1}{2} \right\rfloor}_{n} \qquad (2)$$

where $\lfloor \cdot \rfloor$ is the flooring operation. Note that the dequantization step (multiplication by $s_w$) just brings the quantized values ($n$) back to their original range and is commonly used when emulating the effect of quantization [bhandare2019efficient, tqt2019, rodriguez2018lower].
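To make the scheme concrete, the quantize–dequantize step of Eqs. (1)–(2) can be sketched for a single weight in a few lines of Python; the function name and the scalar formulation are illustrative only, not from the paper:

```python
import math

def fake_quantize_weight(w, s_w, k):
    """Sketch of the k-bit signed weight quantize-dequantize step (Eqs. 1-2).

    w   : full-precision weight (a scalar, for illustration)
    s_w : trainable interval (step size) of the weight quantizer
    k   : bitwidth
    """
    qmin, qmax = -(2 ** (k - 1)), 2 ** (k - 1) - 1
    # clamp to the representable range, then round via floor(x + 0.5)
    n = math.floor(min(max(w / s_w, qmin), qmax) + 0.5)
    return s_w * n  # dequantize: bring the level n back to the original scale
```

For example, with `s_w = 0.1` and `k = 3` the representable levels are `{-0.4, ..., 0.3}`, so any weight above `0.3` saturates at the top level.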
Activation Quantizer: Since most of the networks today use ReLU as the activation function, the activation values are unsigned.¹ Hence, to quantize the activations, we use an unsigned quantization function with the range $[0, 2^{k}-1]$. The activation quantizer is obtained as

$$\bar{x} = s_x \cdot \left\lfloor \mathrm{clamp}\!\left(\tfrac{x}{s_x},\, 0,\, 2^{k}-1\right) + \tfrac{1}{2} \right\rfloor \qquad (3)$$

where $x$ and $s_x$ represent the activation value and the interval value of activation quantization.

¹EfficientNet uses Swish extensively, which introduces a small proportion of negative activation values, but we can save one bit per activation by clamping these negative values at 0 using unsigned quantization.
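A matching scalar sketch of the unsigned activation quantizer in Eq. (3), with an illustrative name, assuming the same floor-based rounding as the weight quantizer:

```python
import math

def fake_quantize_act(x, s_x, k):
    """Sketch of the k-bit unsigned activation quantizer (Eq. 3)."""
    qmax = 2 ** k - 1
    # ReLU outputs are non-negative, so the range is [0, 2^k - 1];
    # negative inputs (e.g. from Swish) are clamped to 0
    n = math.floor(min(max(x / s_x, 0.0), qmax) + 0.5)
    return s_x * n
```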
Prior to training, we initialize these interval values ($s_w$, $s_x$) for every layer using the min–max values of the weights and activations (from one forward pass), similar to TFLite [tensorflow2015whitepaper]. These quantizers are non-differentiable, so we use a straight-through estimator (STE) [bengio2013estimating] to backpropagate through the quantization layer. The STE approximates the gradient of the rounding operation by 1. Thus, we can approximate the gradient of the loss $L$ with respect to a weight $w$ as

$$\frac{\partial L}{\partial w} \approx \frac{\partial L}{\partial \bar{w}}. \qquad (4)$$
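A minimal, framework-free sketch of how the STE of Eq. (4) can be realized as a forward/backward pair (real implementations would wrap this in a custom autograd function; zeroing the gradient for weights outside the clamp range is a common convention and an assumption here, not stated in the text):

```python
import math

def quantize_forward(w, s_w, k):
    """Forward pass: the same clamp-round-rescale step as in Section 3.1."""
    qmin, qmax = -(2 ** (k - 1)), 2 ** (k - 1) - 1
    return s_w * math.floor(min(max(w / s_w, qmin), qmax) + 0.5)

def quantize_backward(grad_wbar, w, s_w, k):
    """Backward pass under the STE of Eq. (4): the rounding is treated as the
    identity, so dL/dw is simply the upstream gradient dL/dw_bar, zeroed for
    weights that were clamped outside the representable range."""
    qmin, qmax = -(2 ** (k - 1)), 2 ** (k - 1) - 1
    return grad_wbar if qmin <= w / s_w <= qmax else 0.0
```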
3.2 Quantization-aware Knowledge Distillation
We will now describe the three phases of QKD. Algorithm 1 depicts the overall QKD training process.
3.2.1 Phase 1: Self-studying
Directly combining quantization and KD cannot make the most of the positive effects of KD and can easily cause unexpectedly low performance. This is because of the regularizing characteristics of KD and the limited representation power of the quantized network, which can cause the quantized network to easily get trapped in a poor local minimum. To mitigate this issue and to provide a good starting point for KD, we first train the low-bit network for some epochs using only the task loss, i.e. the standard cross-entropy loss. This self-studying phase provides the student with good initial parameters before KD is applied.
Our method can be compared to progressive quantization (PQ) [zhuang2018towards], which also uses two stages and progressively quantizes weights and activation maps to find a good initial point. While PQ uses a higher-precision network as the initialization and iteratively alternates between updating weights and activation maps, our method directly initializes at the same low bitwidth as the target low-bit network, because learning the intervals of weights and activation maps works well without iterative and progressive training.
This strategy of parameter initialization followed by knowledge distillation has good synergy in terms of generalization and finding a good local minimum. More specifically, our initialization scheme provides a good starting point, and the distillation loss then guides the student network to a good local minimum while acting as a regularizer.
3.2.2 Phase 2: Co-studying
To make the teacher powerful and adaptable, we jointly train the teacher network (full precision) and the student network (low precision) in an online manner. The Kullback–Leibler (KL) divergence between the student and teacher distributions is used to make the teacher more powerful in terms of accuracy, as well as more adaptable to the student’s distribution, than a fixed pretrained teacher. In this framework, the teacher network is trained on the softened distribution of the student network and vice versa. Through the KL loss, the teacher becomes aware of the distribution of the quantized network and can adapt to the quantized student.
Assuming that there are $C$ classes, the cross-entropy loss for both the teacher and the student networks is obtained by first computing the softmax posterior with temperature $\tau$ as follows:

$$p_i^{m} = \frac{\exp(z_i^{m}/\tau)}{\sum_{j=1}^{C} \exp(z_j^{m}/\tau)} \qquad (5)$$

$$L_{CE}^{m} = -\sum_{i=1}^{C} y_i \log p_i^{m} \qquad (6)$$

where $z_i^{m}$ and $L_{CE}^{m}$ represent the logits and the cross-entropy loss of the $m$-th network, i.e. the student or the teacher ($m \in \{S, T\}$), and $y$ is the one-hot ground-truth label. The temperature $\tau$ is used to soften the distributions so that the dark knowledge can be used. We can compute the KL loss between the student and teacher networks using their logits:

$$L_{KL}(p^{T} \,\|\, p^{S}) = \sum_{i=1}^{C} p_i^{T} \log \frac{p_i^{T}}{p_i^{S}}. \qquad (7)$$
Then, we update each network with the cross-entropy and KL losses as below:

$$L_{S} = L_{CE}^{S} + \tau^{2}\, L_{KL}(p^{T} \,\|\, p^{S}) \qquad (8)$$

$$L_{T} = L_{CE}^{T} + \tau^{2}\, L_{KL}(p^{S} \,\|\, p^{T}) \qquad (9)$$

$L_{S}$ and $L_{T}$ refer to the losses of the student and teacher networks, respectively. We multiply the KL term by $\tau^{2}$ because the gradient scale of the logits decreases by a factor of $\tau^{2}$.
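A scalar sketch of the co-studying losses, assuming one-hot labels and the standard softened-softmax/KL formulation; `softmax_t` and `student_loss` are hypothetical names, and the teacher loss of Eq. (9) is symmetric with the roles of the two networks swapped:

```python
import math

def softmax_t(logits, tau):
    """Temperature-softened softmax, Eq. (5)."""
    m = max(z / tau for z in logits)                 # for numerical stability
    exps = [math.exp(z / tau - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def student_loss(z_s, z_t, label, tau):
    """Student loss of Eq. (8): hard-label cross-entropy (Eq. 6) plus the
    tau^2-scaled KL divergence (Eq. 7) from the teacher's softened posterior.
    The teacher loss of Eq. (9) swaps the roles of z_s and z_t."""
    ce = -math.log(softmax_t(z_s, 1.0)[label])       # one-hot cross-entropy
    p_t = softmax_t(z_t, tau)                        # softened teacher
    p_s = softmax_t(z_s, tau)                        # softened student
    kl = sum(p * math.log(p / q) for p, q in zip(p_t, p_s))
    return ce + tau ** 2 * kl
```

When the two logit vectors agree, the KL term vanishes and only the cross-entropy remains; any disagreement with the teacher adds a positive penalty.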
3.2.3 Phase 3: Tutoring
After a few epochs of co-studying, the accuracy of the teacher network starts to saturate. This is because the teacher (full precision) has higher representation power than the student (low precision). To reduce the computational cost and memory spent on computing the teacher’s gradients, we freeze the teacher network and train only the low-bit student network with its loss in an offline manner. In this phase, we use the knowledge of a teacher network that is now more quantization-aware as a result of co-studying. As we will show later (see Section 4.5), tutoring gives equal or better student performance compared to using co-studying throughout the entire training.
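The three phases can be summarized as a training schedule. The sketch below is schematic, with caller-supplied one-epoch training callbacks and placeholder epoch counts rather than the paper’s actual training code:

```python
def qkd_train(student_step, teacher_step, e_ss, e_cs, e_tu):
    """Schematic three-phase QKD schedule (cf. Algorithm 1).

    student_step / teacher_step: caller-supplied callbacks that run one
    training epoch with the given loss description (placeholders here).
    e_ss, e_cs, e_tu: epochs per phase (placeholders, e.g. 50/35/35).
    Returns the per-epoch phase log for inspection."""
    log = []
    # Phase 1: self-studying -- student alone, task (cross-entropy) loss only
    for _ in range(e_ss):
        student_step("CE")
        log.append("SS")
    # Phase 2: co-studying -- online mutual distillation; both are trained
    for _ in range(e_cs):
        student_step("CE + tau^2 * KL(teacher||student)")
        teacher_step("CE + tau^2 * KL(student||teacher)")
        log.append("CS")
    # Phase 3: tutoring -- teacher frozen; offline distillation to the student
    for _ in range(e_tu):
        student_step("CE + tau^2 * KL(frozen teacher||student)")
        log.append("TU")
    return log
```

Note that the teacher callback fires only in the co-studying phase; freezing it afterwards is what saves the teacher’s gradient computation during tutoring.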
4 Experiments
We perform a comprehensive evaluation of our method on the CIFAR-10/100 [krizhevsky2009learning] and ImageNet [ILSVRC15] datasets. We compare the proposed QKD with existing state-of-the-art quantization methods at 2, 3, and 4 bits (i.e., W2A2, W3A3, W4A4). To show the robustness of our method, we provide comparisons on standard convolutions (i.e., ResNet [he2016deep]) as well as depthwise separable convolutions (i.e., MobileNetV2 [sandler2018mobilenetv2] and EfficientNet [tan2019efficientnet]). Furthermore, we perform an extensive ablation study to analyze the effectiveness of the different components of QKD. The following methods are considered for performance comparison:


‘Baseline (BL)’ is the baseline quantization-only version of QKD as described in Section 3.1; no teacher is used during training. We train the low-precision network using cross-entropy and initialize the weights with the pretrained full-precision weights.

‘SS + BL’ is BL, but the low-bit network is initialized with the weights and interval values trained in the self-studying (SS) phase. (Note that SS is the same as BL; the only difference is that during SS, the low-bit network is trained for far fewer epochs.)

‘AP*’ is a modified version of the original Apprentice [mishra2017apprentice] method for knowledge distillation. The original Apprentice uses the WRPN quantization scheme [mishra2017wrpn]; we replace this quantizer with BL and initialize both the teacher and the student with pretrained full-precision networks.

‘SS + AP*’ is AP* with the low-precision network initialized with the weights and interval values trained in the self-studying (SS) phase.

‘CS + TU’ means that we initialize the student network with the pretrained full-precision network in the same way as AP*, then perform co-studying and tutoring between the teacher and the student using BL.

‘QKD (SS + CS + TU)’ is our proposed method. We initialize the student network via the SS phase and then perform KD in the CS + TU phases.
4.1 Implementation Details
We quantize the weights and input activations of all the Conv and Linear layers as described in Section 3.1. We quantize the first and last layers to 8 bits to ensure compatibility with any fixed-point hardware. We set the temperature $\tau$ to 2 in QKD. In all the experiments, we use the same settings when training all six baselines mentioned above.
CIFAR We train for 200 epochs in total with a step learning-rate schedule and SGD, same as [zhang2018lq], using separate starting learning rates (LRs) for the CIFAR-10 and CIFAR-100 models. We use 30 epochs for the SS phase, dividing the LR every 10 epochs. After the SS phase, we reset the LR to its starting value and use the remaining 170 epochs for the CS + TU phase, during which the LR is divided at two fixed epochs: 100 epochs are used for CS and the remaining 70 for TU. For the teacher network, we use the same schedule as the student during the CS phase, but we freeze the teacher in TU. Compared to the LR used for the student model’s weight parameters, we use a 100× smaller LR for updating $s_w$ and the same LR for updating $s_x$.
ImageNet For all the compared methods, we use a total of 120 epochs for training (the same schedule length as QIL [jung2019learning]). In SS + BL, SS + AP* and SS + CS + TU, we fine-tune the student model for 50 epochs in the SS phase and the remaining 70 epochs are used for the rest. In the CS + TU method, the first 60 epochs are used for CS and the remaining 60 for TU. For training the student, we use a Mixed Optimizer (MO) setting [opennmt]: SGD [bottou2010large] to update the weights and Adam [kingma2014adam] to update the interval values ($s_w$, $s_x$). We found that an exponential learning-rate schedule works best for QKD. For the ResNet architectures, we use the same initial learning rate for both the SS and CS + TU phases; for EfficientNet and MobileNetV2, a different initial learning rate was used. Similar to the CIFAR setting, we use a 100× smaller LR for $s_w$ and the same LR for $s_x$, compared to the LR for the model weights. Not much fine-tuning of these interval LRs was required because of the Mixed Optimizer setting. More details are given in the supplementary material.
4.2 CIFAR-10 Results
We evaluate our method on the CIFAR-10 dataset; the performance numbers are shown in Table 1. We use ResNet-56 (FP: 93.4%) as the teacher network for both AP* and QKD. Interestingly, for AP* and CS + TU, which combine BL and knowledge distillation, performance is worse than BL at W2A2. This is because the network has low representation power at W2A2 and can be negatively affected by the heavy regularization imposed by KD. To alleviate this issue, we use the weights trained in the SS phase as a better initialization for the student. SS + AP* and QKD outperform the existing methods and the baseline (BL) method in all cases. These results suggest that initialization is very important when combining quantization with any KD method; in our proposed method, a 2% gain was observed at W2A2 with the SS phase. Compared to the distillation method AP*, QKD, which also trains the teacher, shows a significant improvement in performance. QKD also provides the best results at W3A3 and W4A4.
4.3 CIFAR-100 Results
Table 2 shows the experimental results on CIFAR-100. We use ResNet-56 (FP: 73.4%) as the teacher in this experiment. These experiments show tendencies similar to those on CIFAR-10. The accuracy increases with the SS phase, which provides a better initialization, especially for very low-bit quantization (W2A2). Among the KD methods, QKD performs significantly better than SS + AP*. We attribute this gain in accuracy to the use of an adaptable teacher trained during co-studying, as opposed to a fixed teacher; further discussion is provided in Section 4.5. Compared to the full-precision version of ResNet-32, the W3A3 and W4A4 quantized versions show gains of 1.4% and 2.5%, respectively.
4.4 ImageNet Results
4.4.1 Standard Convolutions
Table 3 shows the performance of the proposed method on the original ResNet-18, ResNet-34 and ResNet-50 architectures [he2016deep]. We compare our proposed QKD method with QIL [jung2019learning], PACT [choi2018pact], DSQ [gong2019differentiable] and LQ-Nets [zhang2018lq], the current state-of-the-art methods reporting results with 2-, 3-, and 4-bit quantization on the original ResNet architectures. We use ResNet-101, ResNet-50 and ResNet-34 as teachers for the student networks ResNet-50, ResNet-34 and ResNet-18, respectively.
We observe that QKD outperforms the existing approaches in terms of both top-1 and top-5 accuracy. Our distillation method consistently gives a 0.5–1.1% gain in top-1 accuracy across all bitwidths over our baseline quantization method (BL). QKD even exceeds the full-precision accuracy at W4A4 for all ResNet architectures. Interestingly, CS + TU outperforms AP* everywhere, but SS + AP* performs better than CS + TU at W2A2. This again confirms the efficacy of self-studying, especially at 2-bit quantization. Also, it can be seen that QKD (SS + CS + TU) outperforms all the other methods at W2A2.
4.4.2 Depthwise Separable Convolutions
Table 4 shows the performance of our method on MobileNetV2 [sandler2018mobilenetv2] and EfficientNet-B0 [tan2019efficientnet]. DSQ [gong2019differentiable] reports results on MobileNetV2 at W4A4, and we take the PACT [choi2018pact] performance on MobileNetV2 from the HAQ paper [wang2019haq]. EfficientNet-B0 is the current state-of-the-art on ImageNet among architectures with similar parameter and MAC counts, so we also provide the results of our method on EfficientNet-B0, and we include W6A6 in our set of bitwidths. We use a MobileNetV2 with a larger width multiplier as the teacher for the MobileNetV2 student, and EfficientNet-B1 as the teacher for EfficientNet-B0.
In general, we see larger gains from QKD (over the BL and AP* methods) on MobileNetV2 and EfficientNet-B0 than on the ResNet architectures. If we compare a similar accuracy range (66–67%), observed at W4A4 on MobileNetV2 and W3A3 on EfficientNet-B0, we see 1.2% and 1.7% top-1 accuracy improvements, respectively, from QKD over BL. Interestingly, we observe a significant difference of 1.3% between AP* and SS + AP* at W3A3 on MobileNetV2.
With W2A2, we observe a drastic drop in top-1 accuracy. This was expected, since depthwise convolution layers are known to be highly sensitive to quantization [wang2019haq, tqt2019]. We therefore ran another set of experiments in which the weights and input activations of the depthwise separable layers are quantized to W8A8 while the rest of the layers are quantized to W2A2, W3A3 or W4A4. We include these results in the supplementary material.
4.5 Discussion
Effectiveness of Self-studying Figure 3 shows the training accuracy of the student in the final epoch, with and without the self-studying phase, on the CIFAR-100 dataset. At very low-bit quantization the model has low representation power, and without SS it gets stuck in a local minimum, as reflected by the lower training accuracy. The SS phase helps the student start from a better initialization point, and KD then guides it to a better local minimum, increasing the training accuracy. Note that although the training accuracy of QKD appears considerably lower than the others, its test accuracy is better, as was shown in Table 2. This tendency is commonly seen when using KD in the full-precision domain, because KD regularizes the network for better generalization. In QKD, the gap between the blue and red curves is larger than for BL, meaning that QKD experiences a stronger regularization effect.
Adaptability of the teacher network In co-studying, we train the teacher network with (9), with the goal of making its distribution more adaptable to that of the low-precision student. In Figure 4, we plot the KL divergence between the teacher and student class distributions during QKD training and during SS + AP* training for one of the CIFAR-100 experiments. The KL divergence during QKD is consistently lower than that of SS + AP*, indicating that QKD makes the teacher more adaptable to the low-precision student in terms of the similarity of their class distributions.
Powerful teacher network In addition to improving the teacher’s adaptability to the quantized student network, we observe that the teacher network’s accuracy increases significantly during the co-studying phase. This can be attributed to the regularizing effect of, and the knowledge transferred from, the student’s posterior distribution. Figure 5 shows the teacher accuracy over epochs for different settings during the CS + TU phase, compared to the fixed teacher used in AP*. The improvement in teacher accuracy is largest with W4A4 and smallest with W2A2; the reason could be that the knowledge transferred by a W2A2 quantized network is limited.
Reasoning behind the tutoring phase From Figure 5, we note that the teacher accuracy saturates towards the end of co-studying, owing to the teacher’s higher representation power. Considering this, we freeze the teacher and switch to the tutoring phase. This helps tremendously in terms of training speed, since only the student is now being trained. Interestingly, we also see performance gains from using SS + CS + TU (50 + 35 + 35 epochs) instead of just SS + CS (50 + 70 epochs). To verify this, we used SS + CS to train ResNet-18 at 2 and 4 bits with ResNet-34 as the teacher. The performance of SS + CS was 0.2% lower at W4A4 and 0.5% lower at W2A2 compared to QKD.
Activations vs. class posterior The works [zagoruyko2016paying, kim2018paraphrasing, heo2019comprehensive, romero2014fitnets] have shown promising performance in the full-precision (FP) domain (teacher and student both FP) using offline activation (feature-map) distillation. To compare distilling activations with distilling the softmaxed posterior used in QKD for training a quantized network, we transfer the activations of the last layer of the teacher (full precision) to the student (low precision) using an L2 loss, similar to [kim2018paraphrasing, romero2014fitnets], with a simple regressor as used in [romero2014fitnets, heo2019comprehensive]. We call this baseline activation distillation (AD) and use ResNet-56 as the teacher. Table 5 shows the comparison between AD and QKD. AD performs reasonably at 4 bits, but its performance degrades at lower bitwidths. The reason is that the activation distributions of the low-precision and full-precision networks are quite different, so the activations do not provide good guidance for the student. Hence, we verify that using the posterior is more useful than using activations for training a low-precision network.
5 Conclusion
In this paper, we propose a Quantization-aware Knowledge Distillation method that can be effectively applied to very low-bit quantization. We propose a combination of self-studying, co-studying and tutoring to effectively combine model quantization and KD, and we provide a comprehensive ablation study of the impact of each of these components on the quantized model’s accuracy. We show how self-studying is important in alleviating the regularization effect imposed by KD, and how co-studying enhances quantization performance by making the teacher more adaptable and more accurate. Overall, with an extensive set of experiments, we show that QKD gives a significant performance boost over our baseline quantization-only method and outperforms the existing state-of-the-art approaches. We demonstrate QKD’s results on networks with both standard and depthwise separable convolutions, and show that we can recover the full-precision accuracy at as low as W3A3 quantization for the ResNet architectures and W6A6 quantization for MobileNetV2.