Deploying complex DNNs on resource-constrained edge devices such as smartphones or IoT devices is still challenging because of the devices' tight memory and computation budgets. To address this, several lines of research have been proposed, which can be roughly grouped into three categories: weight pruning [han2015learning, han2015deep, liu2018rethinking, frankle2018lottery], quantization [krishnamoorthi2018quantizing, zhang2018lq, lin2016fixed, wang2018two, cai2017deep], and knowledge distillation [hinton2015distilling, kim2018paraphrasing, yim2017gift].
Today, a majority of edge devices require fixed-point inference for compute and power efficiency. Hence, quantizing the weights and activations of deep networks to a certain bit-width becomes a necessity to make them compatible with edge devices. Recently, hardware accelerators such as NPUs (Neural Processing Units), NVIDIA’s Tensor Core units, and even CIM (Compute-in-Memory) devices have targeted support for sub-byte processing, such as 4-bit × 4-bit or 2-bit × 2-bit matrix multiply-and-accumulate operations [markidis2018nvidia, pan2018multilevel, sharma2019accelerated, sumbul2019compute]. These are much more power- and compute-efficient than conventional 8-bit processing. Thus, demand for lower-bit quantization is increasing.
The accuracy of very low-bit quantized networks, such as those using 4, 3, or 2 bits, inevitably decreases because of their lower representation power compared to typical 8-bit quantization. Some works [mishra2017apprentice, polino2018model] used knowledge distillation (KD) to improve the accuracy of quantized networks by transferring knowledge from a full-precision teacher network to a quantized student network. However, even though KD is one of the most widely used techniques to enhance a DNN’s accuracy [hinton2015distilling], directly applying KD to very low-bit quantized networks raises several challenges:
Due to their limited representation power, quantized networks generally show lower training and test accuracies than full-precision networks. On the other hand, because KD uses not only the ground-truth labels but also the teacher’s estimated class probability distributions, it acts as a heavy regularizer [hinton2015distilling]. Combining these conflicting characteristics of quantization and KD can sometimes degrade performance below that of quantization alone.
In general, it has been shown that a powerful teacher with high accuracy can teach a student better than a weaker one [zhu2018knowledge]. Also, [kim2018paraphrasing, mirzadeh2019improved] show that a large gap in capacity, or inherent differences between the teacher and the student, can hinder knowledge transfer because the teacher’s knowledge is not adapted to the student. Previous works on quantization with KD [mishra2017apprentice, polino2018model] directly applied conventional KD to quantization with a fixed teacher network, and thus suffer from the above-mentioned limitations.
In this paper, we propose Quantization-aware Knowledge Distillation (QKD) to address the above-mentioned challenges, especially for very low-precision quantization. QKD consists of three phases. In the first ‘self-studying’ phase, instead of directly applying knowledge distillation to the quantized student network from the beginning, we first try to find a good initialization. In the second ‘co-studying’ phase, we improve the teacher by using the knowledge of the student network. This phase makes the teacher more powerful and quantization-friendly (see Section 4.5), which is essential to improve accuracy. In the final ‘tutoring’ phase, we freeze the teacher network and transfer its knowledge to the student. This phase saves unnecessary training time and memory for the teacher network, whose accuracy tends to have already saturated in the co-studying phase. The overall process of QKD is depicted in Figure 2. Our key contributions are the following:
We propose Quantization-aware Knowledge Distillation, which can be effectively applied to very low-bit (2-, 3-, 4-bit) quantized networks. Considering the characteristics of low-bit networks and distillation methods, we design a combination of three phases for QKD that overcomes the shortcomings of conventional quantization + KD methods described above.
We empirically verify that QKD works well on depth-wise convolution networks, which are known to be difficult to quantize, and our work is the first to apply KD to train low-bit depth-wise convolution networks (viz. MobileNetV2 and EfficientNet).
We show that our QKD obtains state-of-the-art accuracies on CIFAR and ImageNet datasets compared to other existing methods. (See Figure 1.)
2 Related work
Quantization Reducing the precision of neural networks [courbariaux2015binaryconnect, li2016ternary, zhu2016trained, nagel2019data, sung2015resiliency] has been studied extensively due to its computational and storage benefits. Although binary [courbariaux2015binaryconnect] or ternary [li2016ternary] quantization is typically used, these methods only consider the quantization of weights. To fully utilize bit-wise operations, activation maps should also be quantized in the same way as weights. Some studies quantize both weights and activation maps [hubara2016binarized, soudry2014expectation, rastegari2016xnor, zhou2016dorefa, wang2019haq]. Binary neural networks [hubara2016binarized] quantize both weights and activation maps to binary values and compute gradients with binary values. XNOR-Net [rastegari2016xnor] also quantizes its weights and activation maps, with scaling factors obtained by constrained optimization. Furthermore, HAQ [wang2019haq] adaptively changes the bit-width per layer by leveraging reinforcement learning and hardware awareness.
Recent works have tried to improve quantization further by learning the range of weights and activation maps [choi2018pact, jung2019learning, esser2019learned, uhlich2019differentiable]. These approaches can easily outperform previous methods which do not train quantization-related parameters. PACT [choi2018pact] proposes a clipping activation function with a trainable parameter that limits the maximum value of the activation. LSQ [esser2019learned] and TQT [tqt2019] introduce uniform quantization with trainable interval values. QIL [jung2019learning] uses a trainable quantizer that performs both pruning and clipping. Our QKD leverages these trainable approaches in the baseline quantization implementation. On top of this quantization-only method, our knowledge distillation process boosts the accuracy, thereby achieving state-of-the-art accuracy on low-bit quantization.
Knowledge distillation KD is one of the most popular methods in model compression and is widely used in many computer vision tasks. This framework transfers the knowledge of a teacher network to a smaller student network in two ways: offline and online. First, offline KD uses a fixed pre-trained teacher network, and the knowledge can be transferred in different ways. The authors of [hinton2015distilling] encourage the student network to mimic the softened distribution of the teacher network. Other works [zagoruyko2016paying, heo2019comprehensive, kim2018paraphrasing, romero2014fitnets] transfer information using different forms of activation maps from the teacher network. Second, online KD methods [zhu2018knowledge, kim2019feature, zhang2018deep] train the teacher and the student networks simultaneously, without pre-trained teacher models. Deep Mutual Learning (DML) [zhang2018deep] transfers knowledge between independent networks using a KL loss, although it has not been studied in the context of quantization. Our QKD uses online and offline KD methods sequentially.
Quantization + Knowledge distillation Some studies have used distillation methods to train low-precision networks [mishra2017apprentice, polino2018model]. They use a full-precision network as the teacher and a low-bit network as the student. In Apprentice (AP) [mishra2017apprentice], the teacher and student networks are initialized with the corresponding pre-trained full-precision networks and the student is then fine-tuned using distillation. Because of this initialization of the student, AP tends to get stuck in a local minimum in the case of very low-bit quantized student networks. The self-studying phase of QKD directly mitigates this issue. Also, using a fixed teacher, as in [mishra2017apprentice, polino2018model], can limit the knowledge transfer due to the inherent differences between the distributions of the full-precision teacher and the low-precision student network. We tackle this problem via online co-studying (CS) and offline tutoring (TU).
3 Proposed Method
In this section, we first explain the quantization method used in QKD and then describe the proposed QKD method.
In QKD, we use a trainable uniform quantization scheme for our baseline quantization implementation. We choose uniform quantization because of its hardware-friendly characteristics. In this work, we quantize both weights and activations, so we introduce two trainable parameters for the interval values of each layer’s weight parameters and input activations, similar to [esser2019learned, tqt2019, uhlich2019differentiable]. These parameters can be trained with either the task loss or a combination of the task and distillation losses. Considering $k$-bit quantization for weights and activations, the weight and activation quantizers are defined as follows.
Weight Quantizer: Since weights can take positive as well as negative values, we quantize each weight to integers in the range $[-2^{k-1}, 2^{k-1}-1]$. Before rounding, the weights are first constrained to this range using a clamping function $f_c^W$, where
$$f_c^W(x) = \begin{cases} -2^{k-1}, & x < -2^{k-1} \\ 2^{k-1}-1, & x > 2^{k-1}-1 \\ x, & \text{otherwise.} \end{cases}$$
We use a trainable parameter $I_W$ for the interval value of weight quantization. It is trained along with the weights of the network. We can calculate the quantization level of the input weight $W$ using a rounding operation on $f_c^W(W / I_W)$. The overall quantization-dequantization scheme for the weights of the network is defined as follows:
$$\bar{W} = \lfloor f_c^W(W / I_W) \rfloor, \qquad \hat{W} = \bar{W} \times I_W$$
where $\lfloor \cdot \rfloor$ is the flooring operation. Note that the dequantization step (multiplication by $I_W$) just brings the quantized values ($\bar{W}$) back to their original range and is commonly used when emulating the effect of quantization [bhandare2019efficient, tqt2019, rodriguez2018lower].
Since most networks today use ReLU as the activation function, the activation values are unsigned. (EfficientNet uses Swish extensively, which introduces a small proportion of negative activation values, but we can save one bit per activation by clamping these negative values at 0 using unsigned quantization.) Hence, to quantize the activations, we use an unsigned quantization function with the range $[0, 2^{k}-1]$. The activation quantizer can be obtained as
$$\bar{X} = \lfloor f_c^X(X / I_X) \rfloor, \qquad \hat{X} = \bar{X} \times I_X$$
where $X$ and $I_X$ represent the activation value and the interval value of activation quantization, and $f_c^X$ clamps its input to $[0, 2^{k}-1]$.
Prior to training, we initialize these interval values ($I_W$, $I_X$) for every layer using the min-max values of weights and activations (from one forward pass), similar to TF-Lite [tensorflow2015-whitepaper]. These quantizers are non-differentiable, so we use a straight-through estimator (STE) [bengio2013estimating] to backpropagate through the quantization layer. Writing $x$ for a value before rounding and $\bar{x}$ for its rounded value, STE approximates the gradient $\partial \bar{x} / \partial x$ by 1. Thus, we can approximate the gradient of the loss $L$, $\partial L / \partial x$, with
$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial \bar{x}} \cdot \frac{\partial \bar{x}}{\partial x} \approx \frac{\partial L}{\partial \bar{x}}.$$
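The quantize-dequantize scheme and the straight-through estimator described above can be sketched in plain Python for a single scalar. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the signed range is taken to be [-2^(k-1), 2^(k-1)-1] and the unsigned range [0, 2^k - 1], and Python's built-in `round` (round-half-to-even) stands in for the rounding step. The text also mentions a flooring operation; replacing `round` with `math.floor` follows that reading.

```python
def clamp(x, lo, hi):
    """Constrain x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def quantize_weight(w, interval, k):
    """Quantize-dequantize one weight with a signed k-bit uniform quantizer."""
    lo, hi = -(2 ** (k - 1)), 2 ** (k - 1) - 1   # assumed signed integer range
    level = round(clamp(w / interval, lo, hi))    # integer quantization level
    return level * interval                       # back to the original scale


def quantize_activation(x, interval, k):
    """Quantize-dequantize one activation with an unsigned k-bit quantizer."""
    level = round(clamp(x / interval, 0, 2 ** k - 1))
    return level * interval


def ste_grad(x, interval, lo, hi, upstream):
    """Straight-through gradient w.r.t. x (assumed form).

    Rounding is treated as the identity, so the upstream gradient passes
    through unchanged inside the clamp range and is zero where the clamp
    saturates.
    """
    return upstream if lo <= x / interval <= hi else 0.0


if __name__ == "__main__":
    # 2-bit example: signed levels are {-2, -1, 0, 1}, unsigned are {0, 1, 2, 3}.
    print(quantize_weight(0.37, 0.1, 2))      # saturates at the top level
    print(quantize_activation(0.26, 0.1, 2))  # rounds to the nearest level
```

In a real training setup the interval would be a trainable parameter per layer; here it is passed in as a plain number to keep the sketch self-contained.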
3.2 Quantization-aware Knowledge Distillation
We will now describe the three phases of QKD. Algorithm 1 depicts the overall QKD training process.
3.2.1 Phase 1: Self-studying
Directly combining quantization and KD cannot make the most of the positive effects of KD and can easily lead to unexpectedly low performance. This is because of the regularizing characteristics of KD and the limited representation power of the quantized network, which can cause the quantized network to get trapped in a poor local minimum. To mitigate this issue and to provide a good starting point for KD, we train the low-bit network for some epochs using only the task loss, i.e. the standard cross-entropy loss. Such a self-studying phase provides the student with good initial parameters before KD is applied.
Our method can be compared to progressive quantization (PQ) [zhuang2018towards], which also uses two stages and progressively quantizes weights and activation maps to obtain a good initial point. While PQ uses a higher-precision network as an initialization and iteratively alternates updates between weights and activation maps, our method trains the student directly at the target low bit-width, because learning the intervals of weights and activation maps works well without iterative or progressive training.
This strategy of parameter initialization followed by knowledge distillation has good synergy in terms of generalization and finding a good local minimum. More specifically, our initialization scheme provides a good starting point, and the distillation loss then guides the student network toward a good local minimum by acting as a regularizer.
3.2.2 Phase 2: Co-studying
To make a powerful and adaptable teacher, we jointly train the teacher network (full-precision) and the student network (low-precision) in an online manner. The Kullback–Leibler (KL) divergence between the student and teacher distributions is used to make the teacher both more accurate and more adaptable to the student distribution than a fixed pre-trained teacher. In this framework, the teacher network is trained with the softened distribution of the student network and vice versa. Through the KL loss, the teacher becomes aware of the distribution of the quantized network and can adapt to it.
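Assuming that there are $C$ classes, the cross-entropy loss for both the teacher and the student networks is obtained by first computing the softmax posterior with temperature $\tau$ as follows:
$$p^i(z_k; \tau) = \frac{\exp(z^i_k / \tau)}{\sum_{j=1}^{C} \exp(z^i_j / \tau)}, \qquad L^i_{ce} = -\sum_{k=1}^{C} y_k \log p^i(z_k; 1)$$
where $z^i_k$ and $L^i_{ce}$ represent the $k$-th logit and the cross-entropy of the $i$-th network, i.e. the student or the teacher ($i \in \{S, T\}$), and $y$ is the one-hot ground-truth label. The temperature value, $\tau$, is used to make the distribution softer in order to exploit the dark knowledge. We can compute the KL loss between the student and teacher networks using their logits:
$$L_{kl}(z^T \,\|\, z^S; \tau) = \sum_{j=1}^{C} p^T(z_j; \tau) \log \frac{p^T(z_j; \tau)}{p^S(z_j; \tau)}$$
Then, we update each network with the cross-entropy and KL losses as below:
$$L^S = L^S_{ce} + \tau^2 \times L_{kl}(z^T \,\|\, z^S; \tau), \qquad L^T = L^T_{ce} + \tau^2 \times L_{kl}(z^S \,\|\, z^T; \tau)$$
$L^S$ and $L^T$ refer to the losses of the student and teacher networks, respectively. We multiply by $\tau^2$ because the gradient scale of the logits decreases by a factor of $1/\tau^2$.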
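The co-studying losses described above can be illustrated for a single sample with the Python standard library alone. This is a sketch under assumptions rather than the authors' code: the function names are hypothetical, the KL direction is taken as KL(teacher || student) for the student loss and KL(student || teacher) for the teacher loss, and the default temperature of 2 is the value reported in the implementation details.

```python
import math


def softmax_t(logits, temperature=1.0):
    """Softmax posterior over class logits, softened by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Cross-entropy against a hard label, computed at temperature 1."""
    return -math.log(softmax_t(logits)[label])


def kl_div(p, q):
    """KL(p || q) for two discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def co_studying_losses(student_logits, teacher_logits, label, temperature=2.0):
    """Return (student_loss, teacher_loss) for one training sample."""
    p_s = softmax_t(student_logits, temperature)
    p_t = softmax_t(teacher_logits, temperature)
    t2 = temperature ** 2                # compensates the 1/T^2 gradient scaling
    loss_s = cross_entropy(student_logits, label) + t2 * kl_div(p_t, p_s)
    loss_t = cross_entropy(teacher_logits, label) + t2 * kl_div(p_s, p_t)
    return loss_s, loss_t


if __name__ == "__main__":
    student, teacher = [0.2, 1.5, -0.3], [0.1, 2.4, -1.0]
    print(co_studying_losses(student, teacher, label=1))
```

When both networks produce identical logits the KL terms vanish and each loss reduces to its cross-entropy term, which is a quick sanity check for an implementation.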
3.2.3 Phase 3: Tutoring
After a few epochs of co-studying, the accuracy of the teacher network starts saturating. This is because the teacher (full-precision) has higher representation power than the student (low-precision). To reduce the computational cost and memory needed to calculate the gradient of the teacher network, we freeze the teacher network and train only the low-bit student network with the student loss of the co-studying phase, now in an offline manner. In this phase, we use the knowledge of a teacher network which is now more quantization-aware as a result of co-studying. As we will show later (see Section 4.5), using tutoring gives equal or better student performance compared to using only co-studying throughout the training.
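The three phases can be summarized as a simple epoch-based schedule. The sketch below is illustrative only and is not Algorithm 1 itself: the function names and the dictionary layout are hypothetical, and the phase lengths are passed in by the caller.

```python
def qkd_phase(epoch, ss_epochs, cs_epochs):
    """Return the QKD phase that is active at a 0-based epoch index."""
    if epoch < ss_epochs:
        return "self-studying"
    if epoch < ss_epochs + cs_epochs:
        return "co-studying"
    return "tutoring"


def phase_plan(phase):
    """Describe which networks are updated and which losses are used."""
    plans = {
        # Student alone, task loss only: finds a good initialization.
        "self-studying": {"update": ("student",), "losses": ("cross_entropy",)},
        # Student and teacher trained jointly with cross-entropy plus KL.
        "co-studying": {"update": ("student", "teacher"),
                        "losses": ("cross_entropy", "kl")},
        # Teacher frozen; the student keeps its cross-entropy plus KL loss.
        "tutoring": {"update": ("student",), "losses": ("cross_entropy", "kl")},
    }
    return plans[phase]


if __name__ == "__main__":
    # Example split: 30 self-studying epochs, then 100 co-studying epochs.
    for epoch in (0, 30, 130):
        phase = qkd_phase(epoch, ss_epochs=30, cs_epochs=100)
        print(epoch, phase, phase_plan(phase)["update"])
```

Keeping the schedule separate from the training step makes it easy to drop the teacher's optimizer state and gradient buffers once the tutoring phase begins.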
We perform a comprehensive evaluation of our method on the CIFAR-10/100 [krizhevsky2009learning] and ImageNet [ILSVRC15] datasets. We compare the proposed QKD with existing state-of-the-art quantization methods at 2, 3, and 4 bits (i.e., W2A2, W3A3, W4A4). To show the robustness of our method, we provide comparisons on standard convolutions (i.e., ResNet [he2016deep]) as well as depth-wise separable convolutions (i.e., MobileNetV2 [sandler2018mobilenetv2] and EfficientNet [tan2019efficientnet]). Furthermore, we perform an extensive ablation study to analyze the effectiveness of the different components of QKD. The following methods are considered for performance comparison:
‘Baseline (BL)’ is the baseline quantization-only version of QKD as described in 3.1; no teacher is used during training. We train the low precision network using cross-entropy and initialize the weights with the pretrained full-precision weights.
‘SS + BL’ is BL, but the low-bit network is initialized with the weights and interval values trained in the self-studying (SS) phase. (Note that SS is the same as BL, the only difference being that during SS the low-bit network is trained for far fewer epochs.)
‘AP*’ is a modified version of the original Apprentice [mishra2017apprentice] method for knowledge distillation. The original Apprentice uses the WRPN scheme [mishra2017wrpn]. We replace this quantizer with BL and initialize both the teacher and the student with the pre-trained full-precision network.
‘SS + AP*’ is AP* initialized with the weights and interval values trained in the self-studying (SS) phase for the low-precision network.
‘CS + TU’ means that we initialize the student network with the pre-trained full-precision network, in the same way as AP*, and then perform “Co-studying” and “Tutoring” between the teacher and the student using BL.
‘QKD (SS + CS + TU)’ is our proposed method. We initialize the student network by the SS phase. Then, we perform KD in the CS + TU phase.
4.1 Implementation Details
We quantize the weights and input activations of all the Conv and Linear layers as described in Section 3.1. We quantize the first and last layers to 8 bits to ensure compatibility with any fixed-point hardware. We set the temperature value to 2 in QKD. In all the experiments, we use the same settings while training all 6 methods mentioned above.
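The layer-wise bit-width rule stated above (8 bits for the first and last layers, the target bit-width everywhere else) can be written as a small helper. This is only a sketch: the function name, the `boundary_bits` parameter, and the layer names in the usage example are hypothetical, and the list is assumed to contain the quantized Conv and Linear layers in network order.

```python
def assign_bitwidths(layer_names, k, boundary_bits=8):
    """Map each layer name to a bit-width.

    The first and last layers get `boundary_bits`; every other layer gets
    the target low bit-width `k`.
    """
    bits = {name: k for name in layer_names}
    if layer_names:
        bits[layer_names[0]] = boundary_bits
        bits[layer_names[-1]] = boundary_bits
    return bits


if __name__ == "__main__":
    # Hypothetical layer names, listed from input to output.
    print(assign_bitwidths(["conv1", "layer1.conv", "layer2.conv", "fc"], k=2))
```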
CIFAR We train for up to 200 epochs with a step learning rate schedule with SGD, the same as [zhang2018lq]. We use a starting learning rate (LR) of and for models corresponding to CIFAR-10/100 respectively. We use 30 epochs for the SS phase, dividing the LR by for every 10 epochs. After the SS phase, we reset the LR to and for CIFAR-10/100. Then, we use another 170 epochs for the CS + TU phase. The LR is divided by at and epoch. We use 100 epochs for CS and the remaining 70 epochs for TU. For the teacher network, we use the same schedule as the student for the CS phase, but we freeze the teacher in TU. Compared to the LR used for the student model’s weight parameters, we use a 100 times smaller LR for updating and the same LR for updating .
ImageNet For all compared methods, we used a total of 120 epochs for training (the same length as QIL [jung2019learning]). In the methods SS + BL, SS + AP* and SS + CS + TU, we fine-tune the student model for 50 epochs in the SS phase and the remaining 70 epochs are used for the rest. In the CS + TU method, the first 60 epochs are for CS and the remaining 60 epochs for TU. For training the student, we used a Mixed Optimizer (MO) setting [opennmt]. In the MO setting, we use the SGD [bottou2010large] optimizer to update the weights and Adam [kingma2014adam] to update the interval values (, ). We found that the exponential learning rate schedule works best for QKD. For the ResNet architectures, we use an initial learning rate of for both SS and CS + TU phases. For EfficientNet and MobileNetV2, initial learning rate of was used. Similar to the CIFAR-100 setting, we use a 100 times smaller LR for and the same LR for , compared to the LR for model weights. For and , not much fine-tuning was required because of the Mixed Optimizer setting. More details are described in supplementary details.
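The Mixed Optimizer idea (SGD for the network weights, Adam for the quantization intervals, with an exponentially decaying learning rate) can be sketched for scalar parameters as follows. This is an assumption-laden illustration rather than the training code used in the experiments: the names are hypothetical, momentum and weight decay are omitted from SGD, Adam uses its commonly cited default coefficients, and all numeric values in the usage example are placeholders.

```python
import math


def sgd_step(param, grad, lr):
    """Plain SGD update for one scalar weight (momentum omitted for brevity)."""
    return param - lr * grad


class AdamScalar:
    """Adam for a single scalar parameter, e.g. one quantization interval."""

    def __init__(self, lr, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = 0.0  # running estimate of the gradient mean
        self.v = 0.0  # running estimate of the squared gradient
        self.t = 0    # number of steps taken

    def step(self, param, grad):
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad * grad
        m_hat = self.m / (1 - self.beta1 ** self.t)  # bias correction
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return param - self.lr * m_hat / (math.sqrt(v_hat) + self.eps)


def exponential_lr(base_lr, gamma, epoch):
    """Exponential schedule: the LR is multiplied by gamma once per epoch."""
    return base_lr * gamma ** epoch


if __name__ == "__main__":
    weight, interval = 0.8, 0.1              # placeholder parameter values
    adam = AdamScalar(lr=1e-3)               # placeholder learning rate
    for epoch in range(3):
        lr = exponential_lr(0.1, 0.9, epoch)  # placeholder base LR and decay
        weight = sgd_step(weight, grad=0.5, lr=lr)
        interval = adam.step(interval, grad=-0.2)
    print(weight, interval)
```

Because Adam normalizes each step by the running gradient magnitude, the interval updates are less sensitive to the raw gradient scale, which is consistent with the remark above that little tuning was needed for the interval learning rates.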
4.2 CIFAR-10 Results
We evaluate our method on the CIFAR-10 dataset. The performance numbers are shown in Table 1. We use ResNet-56 (FP: 93.4%) as the teacher network for both AP* and QKD. Interestingly, for AP* and CS + TU, which combine BL and knowledge distillation, performance is worse than BL at W2A2. This is because the network has low representation power at W2A2 and can be negatively affected by the heavy regularization imposed by KD. To alleviate this issue, we use the weights trained in the SS phase as a better initialization for the weights of the student. SS + AP* and QKD outperform the existing methods and the Baseline (BL) method in all cases. These results suggest that initialization is very important when combining quantization with any KD method. In our proposed method, a 2% gain was observed at W2A2 with the SS phase. Compared to the distillation method AP*, QKD, which also trains the teacher, shows a significant improvement in performance. QKD also provides the best results for W3A3 and W4A4.
4.3 CIFAR-100 Results
Table 2 shows the experimental results on CIFAR-100. We use a ResNet-56 (FP : 73.4%) as a teacher in this experiment. These experiments show similar tendencies to those with CIFAR-10. The accuracy increases with the SS phase which helps us start with a better initialization especially in very low-bit quantization (W2A2). Considering KD methods, QKD performs significantly better than SS+AP*. We attribute this gain in accuracy to the use of an adaptable teacher trained during co-studying compared to a fixed teacher. Further discussion on this is provided in Section 4.5. Compared to the full-precision version of ResNet-32, the W3A3 and W4A4 quantized versions have a 1.4% and 2.5% gain respectively.
4.4 ImageNet Results
4.4.1 Standard Convolutions
Table 3 shows the performance of the proposed method on the original ResNet18, ResNet34 and ResNet50 architectures [he2016deep]. We compare our proposed QKD method with QIL [jung2019learning], PACT [choi2018pact], DSQ [gong2019differentiable] and LQ-Nets [zhang2018lq], which are current SOTA methods that show results with 2-, 3-, and 4-bit quantization on the original ResNet architectures. We use ResNet101, ResNet50 and ResNet34 as teachers for the student networks ResNet50, ResNet34 and ResNet18 respectively.
We observed that the QKD method outperforms the existing approaches in terms of both top-1 and top-5 accuracy. Our distillation method consistently gave us 0.5-1.1% gain in top-1 accuracy across all bit-widths over our baseline quantization method (BL). QKD even exceeds the full-precision accuracy by more than at W4A4 for all ResNet architectures. Interestingly, CS+TU outperforms AP* everywhere but SS+AP* has better performance than CS+TU at W2A2. This again confirms the efficacy of using self-studying especially at 2-bit quantization. Also, it can be seen that QKD (SS+CS+TU) outperforms all the other methods at W2A2.
4.4.2 Depthwise Separable Convolutions
Table 4 shows the performance of our method on MobileNetv2 (width_multiplier=) [sandler2018mobilenetv2] and EfficientNet-B0 [tan2019efficientnet]. DSQ [gong2019differentiable] shows results on MobileNetV2 at W4A4. We refer to the PACT [choi2018pact] performance on MobileNetV2 from the HAQ paper [wang2019haq]. EfficientNet-B0 is the current state-of-the-art on ImageNet among architectures with similar parameter and MAC count. Hence, we also provide the results of our method on EfficientNet-B0. We include W6A6 into our set of bit-widths. We use MobileNetV2 (width=) as teacher for the student network MobileNetV2 (width=) and EfficientNet-B1 as the teacher for EfficientNet-B0.
In general, we see higher gains with QKD (over the BL and AP* methods) on both MobileNetV2 and EfficientNet-B0 than what we observed on the ResNet architectures. If we compare a similar accuracy range (66-67%), which is observed with W4A4 on MobileNetV2 and W3A3 on EfficientNet-B0, we can see 1.2% and 1.7% top-1 accuracy improvements respectively with QKD over BL. Interestingly, we observe a significant difference of 1.3% between AP* and SS+AP* at W3A3 with MobileNetV2.
With W2A2, we observe a drastic drop in top-1 accuracy. This was expected since depth-wise convolution layers are known to be highly sensitive to quantization [wang2019haq, tqt2019]. So, we ran another set of experiments where the weights and input activations of the depth-wise separable layers are quantized to W8A8 whereas the rest of the layers are quantized to W2A2, W3A3 or W4A4. We include these results in the supplementary details.
Effectiveness of Self-studying Figure 3 shows the training accuracy of the student in the final epoch with and without the Self-Studying phase on the CIFAR-100 dataset. At very low-bit quantization, the model has low representation power and, without SS, it gets stuck in a local minimum, as reflected by the lower training accuracy. The SS phase helps the student start from a better initialization point, and KD then guides it to a better local minimum, hence increasing the training accuracy. Note that although the training accuracy of QKD is noticeably lower than that of the other methods, its test accuracy is higher, as was shown in Table 2. This tendency is commonly seen when using KD in the full-precision domain, because KD regularizes the network for better generalization. In QKD, the gap between the blue and the red results is large compared to BL, meaning that QKD is affected more strongly by the regularization effect.
Adaptability of teacher network In Co-studying, we train the teacher network with its co-studying loss, Eq. (9), with the goal of making its distribution more adaptable to that of the low-precision student. In Figure 4, we plot the KL divergence between the teacher and student class distributions during QKD training and during SS+AP* training for one of the CIFAR-100 experiments. The KL divergence during QKD is consistently lower than that of SS + AP*. This indicates that our QKD makes the teacher more adaptable to the low-precision student in terms of the similarity of their class distributions.
Powerful teacher network In addition to improving the teacher’s adaptability to the quantized student network, we observed that the teacher network’s accuracy increases significantly during the Co-studying phase. This can be attributed to the regularizing effect and the knowledge that is being transferred from the student’s posterior distribution. Figure 5 shows the variation of teacher accuracy with epochs for different settings during the CS + TU phase, compared to the fixed teacher used in AP*. The improvement in teacher accuracy was largest for W4A4 and smallest for W2A2. The reason could be that the knowledge transferred by a W2A2 quantized network is limited.
Reasoning behind Tutoring phase From Figure 5, we note that the teacher accuracy saturates towards the end of co-studying. This is due to the higher representation power of the teacher. Considering this, we freeze the teacher and switch to the tutoring phase. This helps tremendously in terms of training speed since now only the student is being trained. Interestingly, we also see performance gains by using SS + CS + TU (50 + 35 + 35 epochs) instead of just SS + CS (50 + 70 epochs). To verify this, we used SS + CS to train ResNet18 at 2-bit and 4-bit with ResNet34 as the teacher. The performance of SS + CS was 0.2% lower at W4A4 and 0.5% lower at W2A2 compared to QKD.
Activations vs. class posterior The works [zagoruyko2016paying, kim2018paraphrasing, heo2019comprehensive, romero2014fitnets] have shown promising performance in the full-precision (FP) domain (teacher and student are both FP) by using offline activation (feature map) distillation. To compare distilling activations with distilling the softmax posterior used in our QKD in terms of effectiveness for training a quantized network, we transfer the activations of the last layer of the teacher (full-precision) to the student (low-precision) using an L2 loss, similar to [kim2018paraphrasing, romero2014fitnets]. We use a simple regressor as in [romero2014fitnets, heo2019comprehensive]. We call this baseline activation distillation (AD). We use ResNet-56 as the teacher. Table 5 shows a comparison between AD and QKD. AD has reasonable performance at 4 bits but becomes less effective at lower bit-widths. The reason is that the activation distributions of low-precision and full-precision networks are quite different, so the teacher’s activations are not good guidance for the student. Hence, we verify that using the posterior is more useful than using activations for training a low-precision network.
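The activation distillation (AD) baseline described above can be sketched as an L2 loss between teacher features and regressed student features. This is a simplified illustration, not the exact baseline: the function names are hypothetical, features are flat Python lists, and the "simple regressor" is assumed here to be a single linear map, whereas the cited works may use a convolutional regressor.

```python
def linear_regressor(features, weight, bias):
    """Map student features to the teacher's feature size: y = W x + b."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weight, bias)]


def activation_distillation_loss(student_feat, teacher_feat, weight, bias):
    """Mean squared (L2) distance between regressed student and teacher features."""
    mapped = linear_regressor(student_feat, weight, bias)
    return sum((m - t) ** 2 for m, t in zip(mapped, teacher_feat)) / len(teacher_feat)


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]   # placeholder regressor weights
    zero_bias = [0.0, 0.0]
    print(activation_distillation_loss([1.0, 2.0], [1.0, 4.0], identity, zero_bias))
```

Unlike the KL loss on class posteriors, this loss forces the student to match the teacher's feature values directly, which is harder when quantization changes the activation distribution.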
In this paper, we propose a Quantization-aware Knowledge Distillation method that can be effectively applied to very low-bit quantization. We propose a combination of self-studying, co-studying and tutoring to effectively combine model quantization and KD, and we provide a comprehensive ablation study of the impact of each of these phases on the quantized model accuracy. We show how self-studying is important in alleviating the regularization effect imposed by KD and how co-studying enhances the quantization performance by making the teacher more adaptable and more powerful in terms of accuracy. Overall, with an extensive set of experiments, we show that QKD gives a significant performance boost over our baseline quantization-only method and outperforms the existing state-of-the-art approaches. We demonstrate QKD’s results on networks with both standard and depthwise separable convolutions and show that we can recover the full-precision accuracy at as low as W3A3 quantization of the ResNet architectures and W6A6 quantization of MobileNetV2.