CTMQ: Cyclic Training of Convolutional Neural Networks with Multiple Quantization Steps

by   HyunJin Kim, et al.
Universidad Complutense de Madrid

This paper proposes a training method having multiple cyclic training for achieving enhanced performance in low-bit quantized convolutional neural networks (CNNs). Quantization is a popular method for obtaining lightweight CNNs, where the initialization with a pretrained model is widely used to overcome degraded performance in low-resolution quantization. However, large quantization errors between real values and their low-bit quantized ones cause difficulties in achieving acceptable performance for complex networks and large datasets. The proposed training method softly delivers the knowledge of pretrained models to low-bit quantized models in multiple quantization steps. In each quantization step, the trained weights of a model are used to initialize the weights of the next model with the quantization bit depth reduced by one. With a small change of the quantization bit depth, the performance gap can be bridged, thus providing better weight initialization. In cyclic training, after training a low-bit quantized model, its trained weights are used in the initialization of its accurate model to be trained. By using the better training ability of the accurate model in an iterative manner, the proposed method can produce enhanced trained weights for the low-bit quantized model in each cycle. Notably, the training method can advance Top-1 and Top-5 accuracies of the binarized ResNet-18 on the ImageNet dataset by 5.80% and 6.85%, respectively.





1 Introduction

In recent years, quantized CNNs have been widely adopted for reducing the hardware complexity (Gupta et al. (2015); Han et al. (2015); Wu et al. (2016); Hubara et al. (2017); Wang et al. (2018); Wu et al. (2020); Yu et al. (2020)). Moreover, it has been shown that low-bit quantized CNNs using binary (Hubara et al. (2016); Courbariaux et al. (2016); Rastegari et al. (2016); Liu et al. (2018)) or ternary (Zhou et al. (2016); Deng et al. (2017); Wan et al. (2018)) quantization are trainable and can produce acceptable performance. However, there is a significant performance gap between real-valued CNNs and low-bit quantized CNNs. Quantization noise disturbs weight updating in the quantization-aware training (Wang et al. (2018)), thus degrading the performance of trained models in low-bit quantized CNNs. The retraining method uses pretrained weights to initialize a target model, and the initialized model is retrained. Because the retraining method automatically searches for suitable parameter values, it can produce better classification results compared with naive post-training quantization (Gholami et al. (2021)). It is known that the initial weights from accurate models are helpful to produce better training results. However, low-bit quantized CNNs show significant performance degradation immediately after the initialization.

In each batch of training, after updating real-valued weights in back-propagation, quantized weights are used in forward paths, which introduces quantization noise. The quantization noise hinders the training step from finding its optimized weights (Helwegen et al. (2019); Xu et al. (2021a)). Besides, the scaling and biasing parameters used in normalization are disturbed by the gap between real-valued parameters and their quantized values. There have been several works to overcome the degraded performance in low-bit quantized models (Helwegen et al. (2019); Kim et al. (2020); Martinez et al. (2020); Xu et al. (2021a, b)). However, the existing studies for low-bit quantized models do not consider a training method for achieving better weight initialization from accurate models.

This paper proposes a method consisting of multiple cyclic training for achieving better performance in low-bit quantized CNNs. Firstly, the proposed method softly transfers the learned knowledge based on accurate models to low-bit quantized models with multiple quantization steps. In a quantization step, the pretrained weights are used to initialize the weights of the next model with the quantization bit depth reduced by one. With a small change of the quantization bit depth, the performance gap can be bridged, thus producing better weight initialization. In the multiple quantization steps, the pretrained weights in a step are used in the initialization of weights to be trained in the next quantization step. As the quantization steps proceed, the quantization bit depth of trained models decreases gradually. Secondly, the proposed cyclic training uses the trained weights of a low-bit quantized model in the weight initialization of its accurate model to be trained. Then, after training the accurate model, its trained weights are used in the training of the low-bit quantized model in an iterative manner. The proposed method uses the better training ability of the accurate model in each cycle, producing better weight initialization for the low-bit quantized model in an iterative manner. The proposed training method is applied to the training of the binarized ResNet-18 (He et al. (2016)). Quantized CNNs based on DoReFa-Net (Zhou et al. (2016)) and XNOR-Net (Rastegari et al. (2016)) are trained with the proposed method. Without any structural changes in ResNet-18, the proposed training method outperforms both training from scratch and training initialized with pretrained real-valued models. In the experiments on the ImageNet dataset, Top-1 and Top-5 accuracies of the binarized ResNet-18 are enhanced by 5.80% and 6.85%, respectively.

This paper is organized as follows: in the preliminaries, the difficulties in training highly approximate CNNs are described in detail. Then, the proposed cyclic training having multiple quantization steps is explained along with a detailed process description. Next, we explain our experimental environments and experimental results for training the binarized ResNet-18.

2 Preliminaries

2.1 Quantized CNNs

As the easiest way to reduce the computational resources and storage requirements of CNNs, quantized CNNs have been studied recently. Normalized activations and weights used in the forward path of CNNs are expressed using a fixed-point format. Quantized CNNs quantize activations and weights to have 8-bit, 4-bit, 2-bit, ternary, and binary formats. The simplified format and its lightweight operations can dramatically reduce the hardware complexity of convolutional operations in CNNs, compared with real-valued CNNs. Notably, the binarized CNNs utilize 1-bit activations and weights in binarized convolutions.

It has been proven in several existing studies that quantized CNNs can achieve a certain level of inference performance (Gholami et al. (2021)). However, due to quantization errors, their performance can be degraded compared with real-valued CNNs. Notably, there may be a significant performance degradation in the quantized CNN using weights based on a format of less than 2-bit (Zhou et al. (2016)). Therefore, it is necessary to improve the performance of quantized CNNs.

When a pretrained CNN model is given, the pretrained model can be used to initialize its quantized CNN model. Quantization of CNNs has been categorized depending on whether retraining is performed or not. Post-training quantization reuses the parameters of the pretrained model without retraining, where quantization levels are uniformly spaced or unequally determined to minimize the effects of quantization errors (Gholami et al. (2021)). However, due to the non-linearity in each layer of the CNN model, it is difficult to guarantee the best value by reducing the amount of quantization errors. On the other hand, the retraining scheme (Han et al. (2015)) uses the existing pretrained model for the weight initialization of a model and performs training on the quantized CNN model. In quantization-aware training (Wang et al. (2018)), the error of the forward path in the quantized model is considered as the amount of loss. Besides, the parameters for scaling and normalizing activations can be optimized during retraining. Therefore, it is known that the performance after retraining can be higher than that without retraining in quantized CNN models.

2.2 Training of Quantized CNNs and its Difficulties

In general, the objective function of a CNN is a non-convex loss function denoted as $L(w)$, where training tries to find a local optimum using the gradient descent method (Ruder (2016)). The gradient descent is described in Algorithm 1.

Input : initial weights $w_0$, number of iterations $T$, learning rate $\eta$
Output : final weights $w_T$
1 for $t = 0$ to $T-1$ do
2        $w_{t+1} = w_t - \eta \nabla L(w_t)$
3 end for
Algorithm 1 Gradient Descent

Although other optimizers such as ADAM (Kingma and Ba (2014)) have their own formulations, the gradient descent is explained using Algorithm 1 for its simplicity. The local optimum is repeatedly searched by updating the weights with the gradient of $L$. The updated weights $w_{t+1}$ are considered when obtaining the next gradient in an iteration.
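The update rule of Algorithm 1 can be sketched in a few lines of Python. This is only a toy illustration: the scalar loss $(w - 3)^2$, the function names, and the chosen learning rate are assumptions made for the example and do not come from the paper.

```python
def gradient_descent(grad, w0, num_iters, lr):
    """Plain gradient descent: repeat w <- w - lr * grad(w)."""
    w = w0
    for _ in range(num_iters):
        w = w - lr * grad(w)
    return w


def toy_grad(w):
    """Gradient of the toy loss (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)


w_final = gradient_descent(toy_grad, w0=0.0, num_iters=100, lr=0.1)
print(w_final)
```

With this convex toy loss the iterate approaches 3; a real CNN loss is non-convex, so only a local optimum is found.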

On the other hand, in quantized CNNs, the quantization error could nullify or suppress the effects of the updating with $\eta \nabla L(w_t)$ in Algorithm 1. Weights $w_t$ at the $t$-th iteration can be rewritten using $n$-bit quantized weights $w_t^{q_n}$ and their following error $e_t^{q_n}$ as:

$$w_t = w_t^{q_n} + e_t^{q_n} \tag{1}$$

From (1), $w_t^{q_n}$ and $e_t^{q_n}$ affect the quantized weights used in the next forward path. The equation in line 2 of Algorithm 1 can be rewritten as:

$$w_{t+1}^{q_n} = w_t^{q_n} - \eta \nabla L(w_t) + e_t^{q_n} - e_{t+1}^{q_n} \tag{2}$$

From (2), we can conclude two facts: firstly, terms $e_t^{q_n}$ and $e_{t+1}^{q_n}$ have different signs, which could somewhat cancel out the quantization error of the updated weights $w_{t+1}^{q_n}$.

Secondly, the entire value of $\eta \nabla L(w_t)$ in (2) cannot be reflected in the weight updating when $e_t^{q_n}$ and $e_{t+1}^{q_n}$ are not the same. When $n$ becomes small in low-bit quantized CNNs, the cancellation can be hard, which can mislead searches for the local optimum in training steps. Conversely, when a model has a high quantization bit depth with small $e_t^{q_n}$, it is expected to produce good training results.
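A small numeric sketch may make the effect concrete. It assumes a uniform quantizer on $[-1, 1]$ (a hypothetical stand-in, not necessarily the quantizer used in the paper) and defines the quantization error as the real value minus its quantized value. The same small update changes the 8-bit quantized weight but leaves the 2-bit quantized weight untouched, because the change in quantization error absorbs the whole update.

```python
def quantize(w, bits):
    """Uniform quantizer on [-1, 1] with 2**bits levels (assumed form)."""
    levels = 2 ** bits - 1
    clipped = max(-1.0, min(1.0, w))
    return round((clipped + 1.0) / 2.0 * levels) / levels * 2.0 - 1.0


def update_effect(w, delta, bits):
    """Return the change in the quantized weight and the error difference."""
    q_old = quantize(w, bits)
    q_new = quantize(w + delta, bits)
    e_old = w - q_old              # error before the update
    e_new = (w + delta) - q_new    # error after the update
    return q_new - q_old, e_old - e_new


w, delta = 0.30, 0.01              # delta plays the role of the gradient step
change_8bit, _ = update_effect(w, delta, bits=8)
change_2bit, err_diff_2bit = update_effect(w, delta, bits=2)
print(change_8bit, change_2bit, err_diff_2bit)
```

In the 2-bit case the error difference is about `-delta`, so the quantized weight seen by the forward path does not move at all.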

Several training methods for quantized CNNs empirically adopt a straight-through estimator (STE) (Bengio et al. (2013); Yin et al. (2019)) for back-propagating gradients instead of using derivatives of the quantizer. For example, binarized CNNs can use the $\mathrm{sign}$ function to binarize activations (Zhou et al. (2016); Rastegari et al. (2016); Liu et al. (2018)). The derivative of the $\mathrm{sign}$ function can be the Dirac Delta function (Hassani (2009)), which is hard to implement. Using an STE, the approximation of the derivative of the $\mathrm{sign}$ function can be shown by (3).

$$\frac{\partial\, \mathrm{sign}(x)}{\partial x} \approx 1 \tag{3}$$

Instead of using the Dirac Delta function, low-bit quantized CNNs adopt the STE, making the incoming gradients equal to their output gradients in the back-propagation.

In spite of long training time, an STE in low-bit quantized and binarized CNNs can produce acceptable performance in several works (Zhou et al. (2016); Rastegari et al. (2016); Liu et al. (2018); Kim (2021); Shin and Kim (2022)). However, the quantization error term in (1) still exists, producing instability in training steps.
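The STE idea described above can be sketched without any deep learning framework. In this minimal example the forward pass binarizes values with a sign function and the backward pass hands the gradients through unchanged. The function names are hypothetical, and treating `sign(0)` as `+1` is an assumption of this sketch.

```python
def sign(x):
    """Binarize a value; zero is mapped to +1 (assumption of this sketch)."""
    return 1.0 if x >= 0 else -1.0


def binarize_forward(values):
    """Forward pass: apply the sign function element-wise."""
    return [sign(v) for v in values]


def binarize_backward_ste(grad_outputs):
    """Backward pass with an STE: gradients pass through unchanged."""
    return list(grad_outputs)


activations = [-0.5, 0.0, 2.0]
upstream_grads = [0.1, -0.2, 0.3]
print(binarize_forward(activations))
print(binarize_backward_ste(upstream_grads))
```

Some STE variants additionally zero the gradient outside a clipping range; the pass-through version here follows the description in the text.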

3 CTMQ: cyclic training with multiple quantization steps

3.1 Proposed training method

As discussed in the preliminaries section, accurate models with small quantization errors can provide better training performance along with more stable weight convergence. In existing works of binarized CNNs (Liu et al. (2018); Bulat and Tzimiropoulos (2019); Kim et al. (2020)), pretrained weights are used in the weight initialization of the binarized models, which can transfer the training results of accurate models into low-bit quantized models. Let be the pretrained weights after training an -bit quantized CNN model. Let us assume that the pretrained weights are used in weight initialization for -bit quantized model. The weight updating in (2) can be rewritten by as:


Depending on the difference between the two quantization bit depths, the distribution of quantized values can be different. For example, small weights in a Gaussian distribution can be quantized into non-zero values in accurate models. On the other hand, when the quantization bit depth becomes low in (2), many small weights are quantized into zeros. Significantly different quantization bit depths could prevent the knowledge of pretrained models from being well transferred into low-bit quantized models.

To overcome the problem, we propose a new method called CTMQ, which is an abbreviation for cyclic training with multiple quantization steps. In the proposed method, the knowledge of a pretrained model is softly transferred into its low-bit quantized model using weight initialization. Then, the knowledge transfer using initial weights is cyclically repeated between the low-bit quantized model and its more accurate model. In the cyclic training, the pretrained weights of the accurate model are used in the initialization for training the low-bit quantized model. Then, the low-bit quantized model is trained with the initial weights. After obtaining the pretrained weights of the low-bit quantized model, these weights are used in the training of the accurate model. The cyclic training is repeated several times. In each cycle, based on the initial weights from the low-bit quantized model, the training of the accurate model can produce better trained weights for the initialization of the low-bit quantized model. The proposed CTMQ for training a low-bit quantized model is described in Algorithm 2.

Input : initial weights , number of cycles , numbers of iterations , learning rate , -bit quantizations
Output : final weights
1 for  to  do // 1st part: soft knowledge transfer
2        for  to  do
3               ();
4        end for
5        ;
6 end for
7 for  to  do // 2nd part: cyclic training with and
8        for  to  do
9               for  to  do
10                      ();
11               end for
12               ;
13        end for
14 end for
15 for  to  do
16        ();
17 end for
18 for  to  do // 3rd part: full training with
19        ();
20 end for
Algorithm 2 CTMQ: cyclic training with multiple quantization steps for -bit quantized CNNs

In Algorithm 2, there are three parts: In the first part, after training an -bit quantized model with iterations, the pretrained model is used in the initialization of an -bit quantized model. The initial weights from the pretrained real-valued model can be used as . In -bit quantization, both weights and activations are quantized into bits. This iteration commented as soft knowledge transfer is repeated until (lines 1-6). By letting be (line 5), the training results of an accurate model are delivered into an inaccurate model. In the second part, cyclic training with and quantized models are performed during cycles (lines 7-18). Alternatively, -bit and -bit quantized models are trained during iterations. The trained weights are used to initialize the next quantized model by (line 12). Then, a -bit quantized model is trained after finishing the cyclic training (lines 15-17). In the third part, the trained weights with the final -bit quantized model is used to initialize the final -bit quantized model. Finally, the -bit quantized model is trained during iterations.
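The order of trainings in the three parts can be sketched as a schedule generator. Several details are assumptions of this sketch rather than statements from the paper: the first part is assumed to stop two bits above the target, each cycle is assumed to train the more accurate model before the low-bit model, and every stage in the first and second parts is assumed to last the same number of epochs. The names `ctmq_schedule` and `stage_start_epochs` are hypothetical.

```python
def ctmq_schedule(start_bits, target_bits, cycles):
    """List the (part, bit depth) of each training stage, in order."""
    stages = []
    # 1st part: soft knowledge transfer, one bit fewer per stage (assumed bound).
    for bits in range(start_bits, target_bits + 1, -1):
        stages.append(("transfer", bits))
    # 2nd part: alternate the more accurate model and the low-bit model.
    for _ in range(cycles):
        stages.append(("cycle", target_bits + 1))
        stages.append(("cycle", target_bits))
    # One more training of the accurate model after the cycles.
    stages.append(("cycle", target_bits + 1))
    # 3rd part: full training of the target low-bit model.
    stages.append(("full", target_bits))
    return stages


def stage_start_epochs(stages, epochs_per_stage):
    """1-based start epoch of each stage, assuming equal stage lengths."""
    return [1 + i * epochs_per_stage for i in range(len(stages))]


stages = ctmq_schedule(start_bits=8, target_bits=1, cycles=3)
starts = stage_start_epochs(stages, epochs_per_stage=5)
binarized_starts = [e for (part, bits), e in zip(stages, starts)
                    if part == "cycle" and bits == 1]
print(binarized_starts)
```

Under these assumptions, with an 8-bit start and 5 epochs per stage, the binarized model is initialized at epochs 36, 46, and 56, which agrees with the epochs reported for the ImageNet experiment.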

3.2 Target quantizations

In the multiple quantization steps of the proposed method, we adopt the quantization scheme shown in DoReFa-Net (Zhou et al. (2016)) for quantizing weights and activations, which is summarized in the following. It is noted that the proposed training does not depend on any specific quantization method.

Except for binarized weights, weights are normalized into for obtaining -bit quantized weights as:


Terms and mean the hyperbolic tangent and absolute functions, respectively. Notably, function is used to limit the value range. Then, normalized weights are quantized into as:


Besides, binarized weights can be obtained using the function as:


In (7), when , let be to avoid the cases with .

On the other hand, -bit quantized activations denoted as are obtained as:


In (8), term denotes the clamp function to limit to the range . Unlike the -bit quantized weights, the proposed method does not use the function for -bit quantized activations. Therefore, can be or .

A value is multiplied by the denominator and the numerator in (6) and (8). In (7), is multiplied by both the denominator and numerator. The range of values in -bit quantized weights are limited by normalizing weights in (5). In -bit quantized activations, the clamp function in (8) limits the range of values. Regardless of , the range of values is fixed so that two quantized models with different have the same range of weights and activations.
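The quantizers summarized above can be sketched as follows. This sketch follows the general form published for DoReFa-Net (tanh normalization and uniform rounding for weights, clamping to $[0, 1]$ for activations) and is an assumption about the exact equations used here. In particular, binarizing weights with a plain sign function without any scaling factor, and all function names, are choices of this example.

```python
import math


def quantize_unit(x, bits):
    """Round x in [0, 1] to one of 2**bits uniformly spaced levels."""
    levels = 2 ** bits - 1
    return round(x * levels) / levels


def quantize_weights(weights, bits):
    """Quantize weights into [-1, 1]; 1-bit weights use sign with sign(0) = +1."""
    if bits == 1:
        return [1.0 if w >= 0 else -1.0 for w in weights]
    squashed = [math.tanh(w) for w in weights]
    max_abs = max(abs(v) for v in squashed) or 1.0   # avoid division by zero
    normalized = [v / (2.0 * max_abs) + 0.5 for v in squashed]  # in [0, 1]
    return [2.0 * quantize_unit(v, bits) - 1.0 for v in normalized]


def quantize_activations(activations, bits):
    """Clamp activations to [0, 1], then round to 2**bits levels."""
    return [quantize_unit(min(1.0, max(0.0, a)), bits) for a in activations]


print(quantize_weights([-2.0, 0.1, 2.0], bits=2))
print(quantize_weights([-0.3, 0.0, 0.7], bits=1))
print(quantize_activations([-0.5, 0.4, 1.7], bits=1))
```

Because both quantizers map into a fixed range regardless of the bit depth, weights trained at one bit depth remain on the same scale when they initialize a model with a different bit depth.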

4 Experimental results and analysis

In experiments, existing binarized CNN models were trained to evaluate the proposed method on the training of low-bit quantized CNNs. Binarized CNNs adopted 1-bit weights and activations in the convolutions so that significant computational and storage resources can be saved. However, the performance degradation of binarized CNNs has been the most critical issue (Qin et al. (2020)). It is expected that the proposed method can mitigate this problem by achieving higher classification accuracy. The pyramid structure with highly stacked convolution layers and residual shortcuts has been adopted in many studies of binarized CNNs (Rastegari et al. (2016); Lin et al. (2017); Liu et al. (2018); Bulat and Tzimiropoulos (2019); Phan et al. (2020); Liu et al. (2020); Kim (2021); Xu et al. (2021a); Shin and Kim (2022)). Considering the popularity of the residual binarized networks, our experiments adopted a residual CNN model called ResNet-18 (He et al. (2016)). Standardized datasets (CIFAR-100 (Krizhevsky et al. (2009)) and ImageNet (Russakovsky et al. (2015))) were used in the evaluations. ResNet-18 was quantized for achieving the final trained model with binarized (1-bit) weights and activations. For fair comparison, we adopted the quantization scheme in Zhou et al. (2016). Besides, we maintained the structure of basic blocks in DoReFa-Net (Zhou et al. (2016)) and XNOR-Net (Rastegari et al. (2016)) in the experiments.

4.1 Experimental frameworks

We evaluated the proposed training method using the PyTorch deep learning framework (Paszke et al. (2019)). We coded our training script that supported multiple GPU-based data parallelism. After finishing training, its real-valued trained model was stored in a file with the .pth extension. As shown in Algorithm 2, the file having pretrained weights was loaded for the initialization used in the next training. The above process was repeated in the proposed cyclic training using shell scripts. It must be noted that quantizations were not implemented in silicon-based hardware circuits. Instead, quantizers were coded as functions to emulate the quantizations of weights and activations in the forward path.

4.2 Target dataset and model structures

Figure 1: ResNet-18 for ImageNet dataset.
Figure 2: Basic blocks for quantized ResNet-18: (a) TypeI; (b) TypeII.

The summary of target datasets is as follows: the CIFAR-100 dataset consists of 60K colour images with 100 classes, where each class contains 500 training and 100 test images. For the data augmentation during training, input images were randomly cropped and flipped from padded images. The ImageNet dataset contains 1.3M training and 50K validation images with 1K classes. During training on the ImageNet dataset, images were randomly resized between 256 and 480 pixels, and then randomly cropped and flipped into images. In inference, we adopted center-cropped images without data augmentations from the validation dataset.

Figure 1 describes the structure of ResNet-18 for the ImageNet dataset. The layers have 5 groups from Conv-1 to Conv-5, where the convolutional layers in a group have the same number of output channels. Firstly, 3 channels from an RGB coloured image are used as inputs of a convolutional layer with $7 \times 7$ kernels (Conv-1). With a stride of 2, the width and height of output features are reduced to half. After the max pooling with a stride of 2, the convolutional layers of Conv-2 are performed with 64 input channels. For the CIFAR dataset, a different convolutional layer is used in Conv-1. A shortcut with direct mapping is described as a thin round arrow. Per two convolutional layers, one shortcut is placed so that a basic block contains two convolutional layers and one shortcut. After finishing two basic blocks, the downsampling is performed by doubling the number of channels and having a stride of 2. The thick round arrow denotes the downsampling using a $1 \times 1$ convolution. This process is repeated until finishing convolutions in Conv-5. Then, the average pooling is applied to 512 channels with $7 \times 7$ features. After performing fully-connected and softmax layers, the final classification results can be obtained. If the quantization is adopted in the first convolution layer of Conv-1, the original 8-bit image pixel data could be degraded. Because the width and height of features in Conv-5 are small, large quantization errors can be critical in the fully-connected layer. In our experiments, the convolution in Conv-1 and fully-connected layers adopted real-valued weights and activations like Zhou et al. (2016); Rastegari et al. (2016).
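The halving of feature sizes across the five groups can be checked with the usual convolution output-size formula. The concrete settings below (a 224-pixel input, a 7x7 stride-2 first convolution with padding 3, a 3x3 stride-2 max pooling, and stride-2 downsampling at the start of Conv-3 to Conv-5) are the standard ResNet-18 configuration of He et al. (2016) and are assumed here; the helper names are hypothetical.

```python
def conv_out(size, kernel, stride, padding):
    """Output width/height of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def resnet18_feature_sizes(input_size=224):
    """Feature width/height at the output of each group (assumed settings)."""
    sizes = {}
    s = conv_out(input_size, kernel=7, stride=2, padding=3)   # Conv-1
    sizes["Conv-1"] = s
    s = conv_out(s, kernel=3, stride=2, padding=1)            # max pooling
    sizes["Conv-2"] = s
    for name in ("Conv-3", "Conv-4", "Conv-5"):
        s = conv_out(s, kernel=3, stride=2, padding=1)        # downsampling
        sizes[name] = s
    return sizes


print(resnet18_feature_sizes())
```

The small feature maps in the last group are one reason the text gives for keeping the fully-connected layer real-valued.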

Figure 2 illustrates two basic blocks denoted as TypeI and TypeII. Our experiments designed the TypeI and TypeII basic blocks based on DoReFa-Net (Zhou et al. (2016)) and XNOR-Net (Rastegari et al. (2016)). Term BN denotes the batch normalization (Ioffe and Szegedy (2015)). The learnable mean and variance were used to scale and shift the normalized activations. In the two convolutional layers and downsampling, $n$-bit weights and activations were used except for the first convolution in Conv-2. Besides, whereas TypeI adopted $n$-bit quantization in the downsampling, real-valued weights and activations were used in the downsampling of TypeII. In TypeII, after the scaling of the batch normalization layer, $n$-bit quantized activations were obtained, where the quantization in (8) was used instead of the sign function in the original XNOR-Net.

4.3 Training on CIFAR-100 and ImageNet datasets

Training on the CIFAR-100 dataset used the ADAM optimizer (Kingma and Ba (2014)) without weight decay. When training a quantized model, the base learning rate was set as 0.001. The learning rate was decayed based on the poly policy at each epoch, which limited the maximum learning rate of the ADAM optimizer. After finishing the training of a quantized model, the learning rate of the next training was reset as 0.001.
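A poly learning-rate policy is commonly written as the base rate scaled by $(1 - e/E)^p$ for epoch $e$ out of $E$. The sketch below uses that common form; the exponent `power` and the function name are assumptions of this example, since the exact decay formula is not recoverable from the text.

```python
def poly_lr(base_lr, epoch, total_epochs, power=1.0):
    """Common poly decay: base_lr * (1 - epoch / total_epochs) ** power."""
    return base_lr * (1.0 - epoch / total_epochs) ** power


# Learning rates over a 20-epoch stage starting from the base rate 0.001.
schedule = [poly_lr(0.001, epoch, total_epochs=20) for epoch in range(20)]
print(schedule[0], schedule[10], schedule[19])
```

Resetting the learning rate for the next quantized model simply means starting a new schedule from the base rate at epoch 0.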

The inputs in Algorithm 2 were given as follows: pretrained real-valued weights were used as the initial weights. Two further inputs were set as 9 and 8, respectively, so that an 8-bit quantized model was trained first. When finishing the first part in Algorithm 2, a trained low-bit quantized model was obtained. In the second part, the 1-bit quantized model and its more accurate counterpart were alternately trained during 9 cycles. The mini-batch size was set as 512. 20 training epochs were adopted for each training in the first and second parts. The third part performed 200 training epochs.

Several hyperparameters used in the training on the ImageNet dataset were the same as those on the CIFAR-100 dataset. Besides, the two inputs above were again set as 9 and 8 in the training on the ImageNet dataset. Considering the large number of images in the ImageNet dataset, 5 training epochs were adopted for each training in the first and second parts. In the third part, 50 training epochs were performed.

Figure 3: Inference accuracies of quantized ResNet-18 models on ImageNet dataset.

Figure 3 illustrates inference accuracies of quantized ResNet-18 models on the ImageNet dataset during training. At each epoch, the inference accuracies of two binarized ResNet-18 models based on TypeI and TypeII are shown. It must be noted that the 1-crop Top-1 and Top-5 inference accuracies of the original ResNet-18 reported in Torch (2022) were 69.76% and 89.08%, respectively. At the starting point of the first part, the quantization error lowered the inference accuracy. As training continued, the accuracies rapidly increased. In the first part, the accuracies from TypeI and TypeII were close, which means that the effects of the quantized downsampling were not significant in the first part. In the second part, compared with those of TypeII, the models using TypeI had low inference accuracies immediately after 1-bit quantized models were initialized. Therefore, it was concluded that the 1-bit quantization of the downsampling can degrade performance in binarized CNNs. The initialization of binarized CNNs with pretrained higher-bit quantized models was performed at the 36-th, 46-th, and 56-th epochs. The inference accuracies at these epochs increased as the cyclic training proceeded. For example, the accuracies at the 46-th epoch were better than those at the 36-th epoch, which indicates that better initial weights can be obtained in the cyclic training. After the third cycle, the increase slowed down, as shown in Figure 3. In the third part, compared with the earlier epochs of the second part, the 1-bit quantized CNNs could start their training with higher inference accuracies after weight initializations. As training proceeded, the inference accuracies gradually increased in the third part.

Dataset    Model      CTMQ  Bits  Initialization  Top-1   Top-5
CIFAR-100  TypeI      N     32    N               73.91%  -
                      N     8     N               68.28%  -
                      N     1     N               64.13%  -
                      N     1     Y               65.41%  -
                      Y     1     -               67.93%  -
           TypeII     N     32    N               72.61%  -
                      N     8     N               67.21%  -
                      N     1     N               67.69%  -
                      N     1     Y               68.89%  -
                      Y     1     Y               68.84%  -
ImageNet   ResNet-18  N     32    N               69.76%  89.08%
           TypeI      N     32    N               67.40%  87.71%
                      N     1     N               47.51%  72.71%
                      N     1     Y               42.38%  67.91%
                      Y     1     Y               51.54%  76.15%
           TypeII     N     32    N               67.03%  87.27%
           XNOR-Net   N     1     N               51.2%   73.20%
           TypeII     Y     1     N               57.0%   80.05%
Table 1: Comparison of quantized models in terms of inference accuracies

Table 1 lists the comparison in terms of inference accuracies. The proposed training method was evaluated based on the average of five runs. The accuracies of ResNet-18 and XNOR-Net in Table 1 were referenced from Torch (2022) and Rastegari et al. (2016). There were no published results on the quantized ResNet-18 on the CIFAR-100 dataset in Zhou et al. (2016) and Rastegari et al. (2016) so that we trained the quantized models during 200 epochs on the CIFAR-100 dataset and 50 epochs on the ImageNet dataset for the counterparts without applying the proposed method. In the training of the counterparts, most of the hyperparameters were identical with those using the proposed method. However, weight decay was set as 0.0001 for the regularization. In Table 1, when the initialization was Y, real-valued pretrained models were used in weight initialization. Besides, when item CTMQ was N, the soft knowledge transfer and cyclic training were not applied.

Due to the small number of training images and the small image size in the CIFAR-100 dataset, overfitting was observed in the experiments of ResNet-18. Therefore, the enhancements with the proposed method were limited in the experiments on the CIFAR-100 dataset. On the other hand, the proposed method significantly enhanced the final inference accuracies on the ImageNet dataset compared with other cases. On the ImageNet dataset, when not using the proposed method, TypeI with initialization showed a significant performance drop, having only 42.38% Top-1 and 67.91% Top-5 inference accuracies. TypeI using the proposed CTMQ achieved 51.54% Top-1 and 76.15% Top-5 accuracies, which outperformed the original XNOR-Net (Rastegari et al. (2016)). The binarized ResNet-18 with TypeII had Top-1 and Top-5 inference accuracies up to 57.0% and 80.05%, respectively, which showed 5.80% and 6.85% enhancements over the training results of the original XNOR-Net.
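The reported enhancements are differences in accuracy between the TypeII model trained with CTMQ and the original XNOR-Net, both taken from Table 1. A short check of that arithmetic, with hypothetical variable names:

```python
# Top-1 / Top-5 accuracies (in %) on ImageNet, taken from Table 1.
xnor_net = {"top1": 51.2, "top5": 73.20}
ctmq_type2 = {"top1": 57.0, "top5": 80.05}

# Gain in percentage points, rounded to two decimals.
gains = {key: round(ctmq_type2[key] - xnor_net[key], 2) for key in xnor_net}
print(gains)
```

The gains are therefore absolute differences in accuracy (percentage points), not relative improvements.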

Considering the experimental results, it is concluded that the proposed training method is effective for training complex residual networks on large datasets.

5 Conclusion

We focus on a new training method performing cyclic training with different quantization bit depths. The training method performs multiple cyclic training for softly delivering the knowledge of pretrained accurate models and using the better training ability of accurate networks. Although the number of training epochs increases over the three training parts, the experimental results show that the final inference accuracies for complex residual networks on large datasets are significantly enhanced. It is noted that there are no structural changes in CNNs for enhancing the final inference accuracy. Although several binarized CNNs are trained in experiments, the proposed method can be applied to other quantized CNNs based on the existing quantization and optimizer. Notably, the training method can advance Top-1 and Top-5 accuracies of the binarized ResNet-18 on the ImageNet dataset by 5.80% and 6.85%, respectively. Considering the enhanced classification outputs, it is concluded that the proposed cyclic training method is useful for producing high-performance systems in quantized CNNs.