Effective and Fast: A Novel Sequential Single Path Search for Mixed-Precision Quantization

03/04/2021 · by Qigong Sun, et al. · Xidian University; NetEase, Inc.

Since model quantization helps to reduce the model size and computation latency, it has been successfully applied in many applications on mobile phones, embedded devices and smart chips. A mixed-precision quantization model can match different quantization bit-precisions to the sensitivity of different layers to achieve great performance. However, it is a difficult problem to quickly determine the quantization bit-precision of each layer in deep neural networks according to some constraints (e.g., hardware resources, energy consumption, model size and computation latency). To address this issue, we propose a novel sequential single path search (SSPS) method for mixed-precision quantization, in which the given constraints are introduced into its loss function to guide the searching process. Single path search cells are used to compose a fully differentiable supernet, which can be optimized by gradient-based algorithms. Moreover, we sequentially determine the candidate precisions according to the selection certainties to exponentially reduce the search space and speed up the convergence of the searching process. Experiments show that our method can efficiently search the mixed-precision models for different architectures (e.g., ResNet-20, 18, 34, 50 and MobileNet-V2) and datasets (e.g., CIFAR-10, ImageNet and COCO) under given constraints, and our experimental results verify that the models searched by SSPS significantly outperform their uniform-precision counterparts.


1 Introduction

With the development of deep neural networks (DNNs), DNNs have achieved impressive performance in many applications. Substantial computing resources and memory footprint are required when deeper networks are used to solve various problems. Moreover, with the rapid development of chip technologies, especially GPU and TPU, computational frequency and efficiency have been greatly improved. Most scholars use GPUs as the basic hardware platform for network training due to their excellent acceleration capability. However, for low-power platforms (e.g., mobile phones, embedded devices and smart chips), whose resources are limited, it is hard to achieve satisfactory performance for industrial applications. As one of the typical methods for model compression and acceleration, model quantization usually quantizes full-precision (32-bit) parameters to low bit-precision (e.g., 8-bit and 4-bit). In the extreme, weights and activation values can be constrained to binary {-1, +1} [18, 40] or ternary {-1, 0, +1} [61], which can be computed by bitwise operations. This logic calculation is more suitable for the implementation of FPGA and other service-oriented computing platforms. As implemented in [27], such bitwise kernels achieved a substantial speedup over CPU and even surpassed GPU throughput in the peak condition.
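As a concrete illustration of the bitwise arithmetic mentioned above, the following pure-Python sketch (function names are ours, not from the paper) shows how a dot product between two {-1, +1} vectors reduces to an xnor followed by a bitcount once the vectors are packed into machine words:

```python
# Sketch: binary networks replace multiply-accumulate with bitwise ops.
# A {-1,+1} vector is packed into an int (bit i = 1 means element i is +1);
# the dot product then reduces to XNOR followed by a population count.
def popcount(x: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")

def binary_dot(a_bits: int, w_bits: int, n: int) -> int:
    """Dot product of two packed {-1,+1} vectors of length n."""
    matches = popcount(~(a_bits ^ w_bits) & ((1 << n) - 1))  # XNOR + bitcount
    return 2 * matches - n  # each agreement contributes +1, disagreement -1

# Example: a = 0b1011 encodes [+1,+1,-1,+1], w = 0b1101 encodes [+1,-1,+1,+1]
print(binary_dot(0b1011, 0b1101, 4))  # -> 0, same as the elementwise sum
```

The mask `(1 << n) - 1` keeps the complement within the n packed bits, so the count of matching positions is exact.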

Figure 1: (a) Comparison of the GPU memory usage of the searching process on multiple models between the DARTS-based method and SSPS, where each search cell has 5 candidates (i.e., 2-bit to 6-bit) and the batch size is 32. (b) The entropy variation curves of some layers in ResNet-20 on CIFAR-10 during searching.

In recent years, many methods [8, 22, 26, 30, 46, 58] have been proposed to improve the performance of low-precision models. However, the bit-precisions of most quantization models are set manually based on experience, and in general all layers share the same quantization precision. Some studies [5, 7, 10, 11] show that different layers have different sensitivities to quantization. Therefore, mixed-precision quantization [36, 48, 6] can achieve better performance by adapting to the characteristics of the network. Besides, some recent smart chips also support mixed precision for DNN inference, e.g., Apple A12 Bionic [12], Nvidia Turing GPU [37], BitFusion [44] and BISMO [49].

With the successful application of Neural Architecture Search (NAS), [6, 33, 50] converted the mixed-precision quantization problem into a NAS task and used reinforcement learning or gradient-based methods to search for an ideal solution. Differentiable neural architecture search methods [51, 55] search over a supernet that contains all candidate architectures, which must reside in memory together with its feature maps during searching. Fig. 1 (a) shows the GPU memory usage of searching on multiple models based on the DARTS [32] framework. Because the search space is exponential in the number of network layers, the searching process requires a lot of hardware resources and is very time-consuming. To relieve the pressure on hardware resources and speed up the searching process, some methods [51, 14] adopted a specific small search space, and BP-NAS [55] searches on small datasets and then extends the model to large-dataset tasks. However, these methods rely on an incomplete definition of the mixed-precision quantization task and easily fall into a locally optimal solution. This task actually seeks a balance between the task-dependent accuracy and the given constraints (e.g., energy consumption, hardware resources, quantization precision, model size and bitwise operations). HAQ [50] and DNAS [51] incorporated the complexity cost into the loss function and tuned the corresponding balance weight, which usually takes multiple searches to obtain an appropriate model. HAWQ [11] manually chose the bit precision within a reduced search space, and HAWQ-V2 [10] developed a Pareto-frontier-based method for selecting the exact bit precision. In these methods, it is difficult to control the search direction towards the given constraints so as to meet deployment requirements.

In order to address the above issues, we propose a novel differentiable sequential single path search (SSPS) method, which can quickly find an ideal mixed-precision model of a specific network (e.g., ResNet-20, 18, 34, 50 and MobileNet-V2) satisfying the given constraints (e.g., average weight bit-width and average operation bit-width). The advantages of our method are summarized as follows:

  • Save Resources. We propose a novel differentiable single path search cell, where only one candidate is sampled at a time to carry out the computation. This avoids caching all candidates in memory or computing them jointly, thus saving hardware resources. Fig. 1(a) shows the GPU memory usage of our SSPS method, which is significantly less than that of the DARTS-based method.

  • Purposeful Search. We use the average weight bit-width and average operation bit-width to measure the given constraints (e.g., model size and bitwise operations) and innovatively introduce them into our constrained loss function. By penalizing quantized candidates that deviate from the target constraints, the loss guides the search direction and largely removes the need to adjust the balance parameters many times to obtain a satisfactory solution.

  • Fast Search. We use entropy to evaluate the selection certainty of each search cell and sequentially determine the quantization bit-precisions of cells during the searching process. Therefore, the complexity is reduced exponentially as layer bit-precisions are fixed, and the searching process is significantly accelerated. Our method takes less than 7 hours on 4 V100 GPUs (i.e., under 28 GPU-hours) to complete a search for ResNet-18 on ImageNet, which is faster than DNAS (40 GPU-hours).

  • State-of-the-art Results. With our proposed techniques applied to various models (e.g., ResNet-20, 18, 34, 50 and MobileNet-V2) and tasks (e.g., classification and detection), the mixed-precision quantization models we search are consistently better than their counterparts under similar constraints.

2 Related Works

2.1 Model Quantization

Model quantization compresses and accelerates DNNs by replacing the full-precision weights or activation values with fixed-precision values. [18, 40] used bitwise operations (e.g., xnor and bitcount) to compute matrix multiplications effectively and achieved outstanding efficiency and performance. To further improve the representation capability, [46, 60] used multi-bit quantization to approximate the full-precision weights and activation values. Quantizers of multiple bit-widths can be categorized into three modalities [8, 22, 26, 30, 35, 50], and according to the quantization granularity of DNNs, model quantization can be applied from coarse network-wise schemes [46, 19] through layer-wise schemes [53, 50] to finer-grained ones [33, 57].

Mixed-precision quantization [6, 36, 48] can match the sensitivity of each layer in DNNs with an appropriate combination of quantization bit-widths, and it can achieve better results under the same constraints. HAQ [50] added feedback of acceleration information evaluated by a hardware simulator to the training cycle, and used reinforcement learning to determine the quantization strategy automatically. [14, 51] converted the quantization task into a NAS problem and optimized the network weights and architecture parameters by back-propagation. However, their pipelines behave like DARTS [32] at the beginning and therefore also require highly configured hardware resources; [51] spent 40 V100 GPU-hours to complete the search of ResNet-18 in a specific small search space. By generating distilled data, ZeroQ [33] can fine-tune models with arbitrary quantization precisions without using any training or validation datasets. In [56, 16], the trained model can match a variety of quantization precisions without any fine-tuning or calibration, which leads to some performance loss.

2.2 Neural Architecture Search

The emergence of Neural Architecture Search (NAS) breaks the bottleneck of designing neural architectures manually and achieves better performance than human-invented architectures on many tasks, such as image classification [64, 31], object detection [64], semantic segmentation [7], and language models [63, 39, 9]. The success of NAS requires a variety of search spaces and huge amounts of computing resources, which makes the optimization of the network a difficult problem. Commonly used optimization methods are mainly divided into three types: reinforcement learning [2, 64, 59], evolutionary algorithms [42, 41], and gradient-based methods [32, 52, 1]. Besides searching computation operators, NAS methods also search for the width and spatial resolution of each block in the network structure [13]. NAS can also be applied to channel pruning [17] or filter number search [47]. In [4], the network latency and sparsity are incorporated into the search objective, and architectures can be searched for different tasks (e.g., CIFAR-10 and ImageNet) and different hardware platforms (e.g., GPU, CPU and mobile phones).

3 Method

In this paper, we model the mixed-precision quantization task as a NAS problem. The goal is to find an ideal mixed-precision quantization model for a specific network under some given constraints to meet real-world requirements. Specifically, the learning procedure of the architecture parameters is formulated as the following bi-level optimization problem:

$$\min_{\alpha_w, \alpha_a \in \mathcal{A}} \; L_{val}\big(W^*(\alpha_w, \alpha_a), \alpha_w, \alpha_a\big) + \lambda \, L_{con}(\alpha_w, \alpha_a; \mathbf{t}) \qquad (1)$$
$$\mathrm{s.t.} \quad W^*(\alpha_w, \alpha_a) = \arg\min_{W} \; L_{train}(W, \alpha_w, \alpha_a) \qquad (2)$$

where $\alpha_a$ and $\alpha_w$ represent the architecture parameters for searching over activation values and weights, and $\mathcal{A}$ denotes the architecture space. $W$ and $W^*$ denote the supernet parameters and the selected model weights, and $L_{val}$ and $L_{train}$ represent the task-dependent losses (e.g., the cross-entropy loss) on the validation and training datasets, respectively. $L_{con}$ measures the constraint loss of the quantization network determined by $\alpha_w$ and $\alpha_a$. $\lambda$ is a super-parameter, and $\mathbf{t}$ is the target vector of the given constraints (e.g., average weight bit-width and average operation bit-width). Fig. 2 shows the framework of our proposed SSPS method.

In order to search for the ideal mixed-precision quantization model effectively and fast, we propose the SSPS method and innovatively introduce the given constraints into its loss function to guide the searching process. In this section, we first describe the search space, and then propose a differentiable single path search cell that composes a fully differentiable search supernet. Finally, we describe how we use the average weight bit-width and average operation bit-width to evaluate the given constraints and introduce them into our loss function to guide the searching process. The searching process itself is described in the next section.

Figure 2:

The framework of our proposed SSPS. Firstly, we take a network as the seed network, which is expanded into a supernet. In order to search for the ideal mixed-precision quantization model, we introduce the given constraints into the loss function and use our sequential search method to update the architecture parameters and weights. Finally, we select the target network according to the probabilities of the architecture parameters. Best viewed in color.

3.1 Weight and Activation Search Cells

Many recently proposed NAS methods [32, 4, 24] focus on cell search (i.e., normal cell and reduction cell). Once the cell architectures are confirmed, many copies of these discovered cells are stacked to make up a deep neural network. The purpose of our task is instead to find a layer-wise quantization for specific networks (e.g., ResNet and MobileNet), in which different layers have different quantization precisions. As shown in Fig. 2, each layer contains two search cells (i.e., a weight search cell and an activation value search cell).

Suppose we use $z_i$ and $z_{i+1}$ to represent the input and output data of the $i$-th layer, $w_i$ denotes the weights, and $g_i$ denotes the calculation operation (e.g., fully connected or convolution). The computation of those two variables can be formulated as follows:

$$z_{i+1} = g_i\big(Q^{k_i^a}(z_i), \; Q^{k_i^w}(w_i)\big) \qquad (3)$$

where $Q^{k_i^a}(z_i)$ denotes the selected $k_i^a$-bit quantized values of $z_i$ produced by the activation bit search cell, and $Q^{k_i^w}(w_i)$ denotes the selected $k_i^w$-bit quantized values of $w_i$ produced by the weight bit search cell. The general search space for the $i$-th search cell is:

$$\mathcal{S} = \{1, 2, \ldots, 8, \mathrm{FP16}, \mathrm{FP32}\} \qquad (4)$$

where all the integers represent bit-precisions, FP16 denotes the half-precision floating-point format, and FP32 denotes the single-precision floating-point format. Thus, the search space size of a whole layer is $|\mathcal{S}|^2$. Obviously, the search space of this task is exponential in the number of model layers $L$, i.e., $|\mathcal{S}|^{2L}$.
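For intuition about the scale of this space, a short sketch using the 5-candidate set (2-bit to 6-bit) from the Fig. 1 example; the variable names are ours:

```python
# Search-space size: with |S| candidate precisions per cell and two cells
# (weights + activations) per layer, an L-layer network has |S|**(2*L)
# mixed-precision configurations.
candidates = [2, 3, 4, 5, 6]       # bit-widths per search cell (Fig. 1 example)
L = 20                             # e.g., ResNet-20
per_layer = len(candidates) ** 2   # weight cell x activation cell
total = per_layer ** L             # = |S|**(2L)
print(per_layer, total)            # 25 configurations per layer; ~9.1e27 total
```

Even this reduced 5-candidate space yields about 10^28 configurations for ResNet-20, which is why exhaustive search or sample-inefficient optimizers are impractical.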

3.2 Differentiable Single Path Search Cell

Because of the huge search space, reinforcement learning techniques and evolutionary algorithms are computationally expensive and time-consuming. DARTS-based methods need to keep all candidate architectures and feature maps resident in memory. Thus, they require multiple GPUs with high memory configurations and a small batch size for searching. Fig. 1 (a) shows the GPU memory usage of the DARTS-based method for searching on multiple models.

Figure 3: The differentiable single path search cell for the quantization of activation values, where the circles represent quantizations of the input $z_i$ at different precisions, and the outputs denote the quantized values $\hat{z}_i$ and the selected bit-width $k_i^a$.

In order to save hardware resources and speed up the searching process, we propose a differentiable single path search cell to compose the supernet. We take activation value quantization as an example, and the search cell is shown in Fig. 3. The input $z_i$ denotes the full-precision activation values, and the outputs $\hat{z}_i$ and $k_i^a$ denote the quantized values and the selected bit-width. We introduce the Gumbel-Softmax [21, 34] to control the search strategy. It approximates the multinomial sampling process by re-parameterization, which provides an efficient way to draw samples from a discrete probability distribution. By this approximation, we can transform the non-differentiable sampling problem into a differentiable computation. Here, we use $p_i$ to represent the sampling probability vector and $p_i^j$ is the $j$-th element of $p_i$, which is formulated as follows:

$$p_i^j = \frac{\exp\big((\alpha_i^j + g_i^j)/\tau\big)}{\sum_{k=1}^{N}\exp\big((\alpha_i^k + g_i^k)/\tau\big)} \qquad (5)$$

where $\alpha_i^j$ is the $j$-th element of the $N$-dimensional learnable architecture parameter vector $\alpha_i$, and $g_i^j$ is a random variable drawn from the Gumbel distribution ($g_i^j = -\log(-\log(u_i^j))$ with $u_i^j \sim U(0, 1)$). $\tau$ is the temperature coefficient used to control the smoothness of the sampling. Therefore, the activation value search cell can be expressed as follows:

$$h_i = \mathrm{one\_hot}\Big(\arg\max_j \; p_i^j\Big) \qquad (6)$$
$$k_i^a = \sum_{j=1}^{N} h_i^j \, s^j \qquad (7)$$
$$\hat{z}_i = Q^{k_i^a}(z_i) \qquad (8)$$

where $h_i^j$ is the $j$-th element of the one-hot vector $h_i$, $s^j$ denotes the $j$-th element of the search space $\mathcal{S}$, and $Q^{k_i^a}(\cdot)$ denotes the $k_i^a$-bit quantize function (e.g., the quantizer in Eq. (14)). From Eq. (7), we get the selected bit precision, which will be used as the input of the constrained loss. In general, the argmax function is used to select the most probable index. However, since our goal is to sample from a discrete probability distribution, we cannot back-propagate gradients through the argmax function to optimize $\alpha_i$. Here, we use the straight-through estimator (STE) [3] to back-propagate through Eq. (6).

During the searching process, the real discrete distribution can be approached by gradually reducing the temperature $\tau$. The higher the temperature, the smoother the distribution; the lower the temperature, the closer the generated distribution is to a discrete one. At the beginning of the search, sampling can be regarded as random; with the decrease of $\tau$, it becomes probability sampling. The resource savings of our search cell can be seen clearly in Fig. 1(a).
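A minimal pure-Python sketch of the sampling step in Eq. (5) follows; it is illustrative only (the paper's implementation works on PyTorch tensors and back-propagates through the soft probabilities via the STE):

```python
# Gumbel-Softmax sampling sketch: perturb the logits with Gumbel noise,
# soften with temperature tau, and take the argmax as the single sampled path.
import math
import random

def gumbel_softmax_sample(alpha, tau):
    """alpha: list of architecture-parameter logits; returns (probs, one_hot)."""
    g = [-math.log(-math.log(random.random())) for _ in alpha]  # Gumbel noise
    logits = [(a + n) / tau for a, n in zip(alpha, g)]
    m = max(logits)                                  # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]                    # soft probabilities, Eq. (5)
    k = max(range(len(alpha)), key=lambda j: logits[j])
    one_hot = [1.0 if j == k else 0.0 for j in range(len(alpha))]
    return probs, one_hot  # STE idea: forward the one-hot, backward the probs

probs, one_hot = gumbel_softmax_sample([0.5, 1.2, -0.3, 0.0, 0.9], tau=1.0)
```

At high `tau` the noise dominates (near-random sampling); as `tau` decreases, the sample concentrates on the largest logit, matching the annealing schedule described above.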

3.3 Constrained Loss Function

Besides the task-dependent loss (e.g., the cross-entropy loss), hardware resources, energy consumption, model size and computational complexity are also important factors affecting real-world applications. These factors can be effectively controlled by restricting the quantization precisions of weights and activation values [44], although the relevant factors differ across hardware platforms. To formulate these factors, we introduce the average weight bit-width and the average operation bit-width, which are usually applied to evaluate the given constraints such as model size and bitwise operations [55, 65], and we introduce them into our constrained loss function to guide the searching process.

Taking the target model size as the first constraint, we focus on weight quantization to compress the model. Generally, model parameters are stored in the 32-bit floating-point type; when we quantize the weights, the model size and storage requirements are reduced. Suppose we have a model of $L$ layers, and $c_i$ represents the number of parameters in the $i$-th layer. The average weight bit-width can be defined as follows:

$$\bar{b}_w(\alpha_w) = \frac{\sum_{i=1}^{L} c_i \, k_i^w}{\sum_{i=1}^{L} c_i} \qquad (9)$$

where $\alpha_w$ denotes the weight architecture parameter vectors, and $k_i^w$ is the output of the weight search cell denoting the selected quantization precision of the $i$-th layer; the computation of $k_i^w$ is similar to that of $k_i^a$, as shown in Eqs. (5)-(7).

The second constraint restricts the quantization precisions of the weights and activation values to achieve a specific computational complexity, which is also one of the important factors affecting industrial applications. Here, we use the average operation bit-width to evaluate the bitwise-operation computational complexity. We use $F_i$ to denote the number of floating-point operations in the $i$-th layer. The average operation bit-width is related to the architecture parameters $\alpha_w$ and $\alpha_a$, and it is formulated as follows:

$$\bar{b}_{op}(\alpha_w, \alpha_a) = \sqrt{\frac{\sum_{i=1}^{L} F_i \, k_i^w k_i^a}{\sum_{i=1}^{L} F_i}} \qquad (10)$$

where $\alpha_a$ denotes the architecture parameters of the activation value search cells, and $k_i^a$ is the output of the activation value search cell denoting the selected activation quantization precision of the $i$-th layer.

Based on the above definitions, we define the constrained loss function as follows:

$$L_{con}(\alpha_w, \alpha_a; \mathbf{t}) = \big(\bar{b}_w - t_w\big)^2 + \big(\bar{b}_{op} - t_{op}\big)^2 \qquad (11)$$

where $t_w$ and $t_{op}$ represent the target average weight bit-width and average operation bit-width, respectively, i.e., $\mathbf{t} = (t_w, t_{op})$.
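The following toy computation illustrates the constraint terms of Eqs. (9)-(11) on made-up layer statistics. All names are ours, and the BOPs-weighted square-root convention in `avg_op_bits` is an assumption chosen so that a uniform k-bit model averages to exactly k bits:

```python
# Toy constraint-loss computation for a 3-layer network; parameter counts,
# FLOP counts, and bit choices are invented for illustration.
def avg_weight_bits(bits_w, params):
    """Parameter-count-weighted average weight bit-width (Eq. (9) analogue)."""
    return sum(b * p for b, p in zip(bits_w, params)) / sum(params)

def avg_op_bits(bits_w, bits_a, flops):
    """Assumed convention: sqrt of the BOPs-weighted mean of bw*ba per layer."""
    return (sum(f * bw * ba for f, bw, ba in zip(flops, bits_w, bits_a))
            / sum(flops)) ** 0.5

def constrained_loss(bits_w, bits_a, params, flops, t_w, t_op):
    """Squared distance of both averages from their targets (Eq. (11) analogue)."""
    bw = avg_weight_bits(bits_w, params)
    bop = avg_op_bits(bits_w, bits_a, flops)
    return (bw - t_w) ** 2 + (bop - t_op) ** 2

# Uniform 4/4 bits: both averages are exactly 4, so the loss at target (4, 4) is 0
print(constrained_loss([4, 4, 4], [4, 4, 4], [100, 200, 50], [1e6, 2e6, 5e5], 4, 4))
```

Any deviation of a layer's bit choice from the target pulls the averages away from `(t_w, t_op)` and is penalized quadratically, which is the mechanism that steers the search toward the given constraints.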

4 Searching Process

Entropy is commonly used to measure the uncertainty of a distribution. In this paper, we use entropy to evaluate the selection certainty of search cells: different entropy values correspond to different selection certainties, and the smaller the entropy, the stronger the selection certainty. We take the $i$-th activation value search cell as an example, and its probability distribution is computed as follows:

$$P_i^j = \frac{\exp(\alpha_i^j)}{\sum_{k=1}^{N} \exp(\alpha_i^k)} \qquad (12)$$

where $\alpha_i$ denotes the activation value architecture parameter vector, and the entropy of this cell is defined as:

$$H_i = -\sum_{j=1}^{N} P_i^j \log P_i^j \qquad (13)$$

Fig. 1 (b) shows the entropy variation curves of some layers in ResNet-20 on CIFAR-10 based on our single path search cell. We can see that the entropies of different layers have different convergence speeds, and many layers gradually converge to a steady state during the searching process. If we gradually fix the quantization precisions of such layers during the searching process, the search space decreases exponentially. Therefore, we propose a sequential single path search method, which divides this task into subtasks over iterations and optimizes them sequentially. Once the decision conditions are satisfied, we prioritize the cells with the highest selection certainty, and then use the selected quantization precision to replace the original search cell in the subsequent searching process. This yields a new search subproblem; as more search cells are determined, the search space shrinks exponentially. Finally, a mixed-precision model satisfying the given constraints is obtained by iterative solution. The iterative procedure is shown in Algorithm 1. During the searching process, the quantization precision of each search cell gradually stabilizes and the entropy gradually decreases through the continuous iterative updating of the architecture parameters.

0:  The sub-training dataset and validation dataset; the search space $\mathcal{S}$, the supernet and the constraints $t_w$ and $t_{op}$; initialized architecture parameters $\alpha_w$, $\alpha_a$ and supernet weights $W$.
0:  The architecture with high accuracy under the given constraints.
1:  while not terminated do
2:     Use Eqs. (5)-(8) to select the forward subnetwork;
3:     Update the weights $W$ by using the sub-training dataset;
4:     Use Eqs. (5)-(8) to select the forward subnetwork;
5:     Update the architecture parameters $\alpha_w$ and $\alpha_a$ by using the validation dataset;
6:     if the current epoch is a decision epoch then
7:        Calculate the selection certainty of each search cell by using Eqs. (12) and (13);
8:        Determine the quantization precision of the most certain cell by probability;
9:        Remove that cell from the search space;
10:     end if
11:  end while
Algorithm 1: Sequential Single Path Search Method
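The decision step (lines 6-9 above) can be sketched in a few lines of pure Python; the cell names and logit values below are illustrative, not from the paper:

```python
# Decision step sketch (Eqs. (12)-(13)): softmax each undecided cell's
# architecture parameters, compute the entropy, and fix the cell with the
# smallest entropy (i.e., the highest selection certainty).
import math

def softmax(alpha):
    m = max(alpha)                       # subtract max for numerical stability
    e = [math.exp(a - m) for a in alpha]
    s = sum(e)
    return [x / s for x in e]

def entropy(alpha):
    """Shannon entropy of the softmax distribution over candidates."""
    p = softmax(alpha)
    return -sum(q * math.log(q) for q in p if q > 0)

def most_certain_cell(cells):
    """cells: dict name -> logit list; returns (cell name, chosen candidate index)."""
    name = min(cells, key=lambda n: entropy(cells[n]))
    probs = softmax(cells[name])
    return name, max(range(len(probs)), key=probs.__getitem__)

cells = {"layer1_w": [0.1, 0.2, 0.1, 0.0, 0.1],    # near-uniform: high entropy
         "layer2_w": [5.0, 0.0, -1.0, -2.0, 0.5]}  # peaked: low entropy
print(most_certain_cell(cells))  # -> ('layer2_w', 0)
```

Each time a cell is fixed and removed, the remaining search space shrinks by a factor of the candidate count, which is the exponential reduction described in the text.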

5 Experiments

In this section, we search the mixed-precision quantization models to verify the effectiveness of our method on two image classification benchmarks (CIFAR-10 and ImageNet) and an object detection benchmark (COCO). We first describe the details of our experimental implementations, and then present the experimental results of our method in comparison with state-of-the-art methods.

5.1 Implementation Details

We implement our method using PyTorch [38], in which we can easily implement and debug quantization functions and NAS algorithms. We use a hardware-friendly quantization function as the quantizer; therefore, the inference process can be efficiently implemented by bitwise operations (e.g., xnor and bitcount) to achieve model compression, computational acceleration and resource saving. We quantize the weights linearly into $k$-bit, which can be formulated as follows:

$$Q^k(w) = a_i \cdot \frac{\mathrm{round}\big(\mathrm{clamp}(w / a_i, -1, 1) \cdot s\big)}{s} \qquad (14)$$

where the clamp function is used to truncate all values into the range of $[-1, 1]$, and $a_i$ is a learned clipping parameter of the $i$-th search cell. The scaling factor is defined as $s = 2^{k-1} - 1$. The search space of each search cell is $\{2, 3, 4, 5, 6\}$.
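A small sketch of such a $k$-bit linear quantizer follows; it is simplified relative to Eq. (14) in that the learned clipping parameter is fixed to 1 and the training-time gradient (STE) is omitted:

```python
# Sketch of a k-bit linear quantizer: clamp to [-1, 1], scale to the
# 2^(k-1)-1 integer grid, round, and rescale. Clipping parameter fixed to 1.
def quantize(w: float, k: int) -> float:
    s = 2 ** (k - 1) - 1              # scaling factor, e.g. k=3 -> s=3
    clamped = max(-1.0, min(1.0, w))  # truncate into [-1, 1]
    return round(clamped * s) / s     # snap to the nearest k-bit level

print([quantize(x, 3) for x in (-1.5, -0.4, 0.1, 0.9)])
```

For k = 3 the representable values are {-1, -2/3, -1/3, 0, 1/3, 2/3, 1}, i.e., at most $2s + 1 = 2^k - 1$ levels, which is what makes the multiply-accumulate implementable with narrow integer (bitwise) arithmetic.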

In the implementation, we combine the weight search cell and the activation value search cell into one layer-level search cell; therefore, there are 25 candidates for each layer. The architecture parameters are optimized by Adam, and the network parameters are updated by SGD with weight decay. For ImageNet, the same batch size is used for all networks. The super-parameter $\lambda$ balances the constrained loss against the task loss. Following [62, 18, 60], we fix the bit-widths of the first convolutional layer and the last fully-connected layer. We apply the pre-trained full-precision model to initialize the supernet, and a warm-up strategy is then adopted. After searching for the desired network, we fine-tune the mixed-precision quantization model to obtain the final parameters. For COCO detection, we use the mixed-precision architecture obtained on the ImageNet classification task as the backbone. Our network is fine-tuned by SGD for 50K iterations with a batch size of 16 on 8 V100 GPUs, and the learning rate is decayed by a factor of 10 at iterations 30K and 40K.

5.2 Experimental Results

5.2.1 CIFAR-10

We search the mixed-precision quantization model of ResNet-20 on the CIFAR-10 dataset under given average weight bit-width and average operation bit-width. This dataset has 50K training images and 10K testing images. We divide the training images into a sub-training dataset (25K images) and a validation dataset (25K images). The sub-training dataset is used to update the weights of the supernet, and the validation dataset is used to update the architecture parameters. After the searching process, we use all the training images to fine-tune the selected model.

Methods W-Bits A-Bits Top-1 W-Comp Ave-Bits
Baseline 32 32 92.37 1.00 32.00
Dorefa [60] 3 3 89.90 10.67 3.00
PACT [8] 3 3 91.10 10.67 3.00
LQ-Nets [58] 3 3 91.60 10.67 3.00
HAWQ [11] M 4 92.22 13.11 -
BP-NAS [55] M M 92.12 10.74 3.30
SSPS M M 92.54 10.74 3.04
Table 1: Accuracy comparisons of ResNet-20 on CIFAR-10. Here 'M' refers to mixed-precision quantization models.
Models Methods W-Bits A-Bits Top-1 W-Comp Ave-Bits
ResNet-18 Baseline 32 32 70.20 1.00 32.00
PACT [8] 3 3 68.10 10.67 3.00
LQ-Nets [58] 3 3 68.20 10.67 3.00
DSQ [15] 3 3 68.66 10.67 3.00
QIL [23] 3 3 69.20 10.67 3.00
SSPS M M 69.64 10.65 2.99
PACT [8] 4 4 69.20 8.00 4.00
LQ-Nets [58] 4 4 69.30 8.00 4.00
DSQ [15] 4 4 69.56 8.00 4.00
QIL [23] 4 4 70.10 8.00 4.00
AutoQ [33] M M 68.20 6.91 -
SSPS M M 70.70 7.95 3.95
ResNet-34 Baseline 32 32 73.8 1.00 32.00
ABC-Net [30] 3 3 66.70 10.67 3.00
LQ-Nets [58] 3 3 71.90 10.67 3.00
DSQ [15] 3 3 72.54 10.67 3.00
QIL [23] 3 3 73.10 10.67 3.00
SSPS M M 73.49 10.69 3.06
BCGD [54] 4 4 70.81 8.00 4.00
DSQ [15] 4 4 72.76 8.00 4.00
QIL [23] 4 4 73.70 8.00 4.00
SSPS M M 74.30 7.99 4.01
ResNet-50 Baseline 32 32 77.15 1.00 32.00
AutoQ [33] M M 63.21 9.12 -
HAQ [50] M M 75.48 - 3.60
HAWQ [11] M M 75.30 - 4.00
BP-NAS [55] M M 76.67 - 3.80
SSPS M M 76.22 8.00 3.98
MobileNet-V2 Baseline 32 32 71.87 1.00 32.00
DSQ [15] 4 4 64.80 8.00 4.00
TQT [20] 4 4 67.79 8.00 4.00
HAQ [50] M M 66.99 - -
AutoQ [33] M M 69.02 7.58 -
SSPS M M 69.10 7.99 4.02
Table 2: Accuracy comparisons of ResNet-18, ResNet-34, ResNet-50 and MobileNet-V2 on ImageNet. Here 'M' refers to mixed-precision quantization models.

For each compared method, we report its average weight bit-width, average activation value bit-width, Top-1 accuracy, model size compression rate and average operation bit-width. The target average weight bit-width and average operation bit-width of the searching process are both set to 3. The results are shown in Table 1. Compared with the full-precision model (Baseline), our model outperforms it by 0.17% while still achieving a 10.74× compression ratio for the weights. Compared with the fixed-precision quantization methods Dorefa, PACT and LQ-Nets, the Top-1 accuracy of our method increases by 2.64%, 1.44% and 0.94%, respectively. Similarly, our method has obvious advantages over the mixed-precision quantization methods: it performs better than HAWQ and BP-NAS, with Top-1 accuracy gains of 0.32% and 0.42%, respectively.

5.2.2 ImageNet

In order to verify the search ability of our method on large-scale datasets and deep networks, we implement ResNet-18, 34, 50 and MobileNet-V2 on the ImageNet (ILSVRC2012) dataset. We choose three-quarters of the training dataset as the sub-training dataset to update the weights of the supernet. The remaining one-quarter of the training dataset is used as the validation dataset to update the architecture parameters.

Table 2 shows the experimental results, where 'M' represents mixed-precision quantization. Similar to other methods, our experiments mainly focus on average 3-bit and 4-bit quantization. In the searching process, we set the target average weight bit-width and average operation bit-width to 3 or 4 to control the search direction. From Table 2, we can see that our selected mixed-precision quantization models of ResNet-18 and ResNet-34 achieve the best accuracies, and the average 4-bit models are even higher than their full-precision counterparts. For ResNet-50, we compare our method with several mixed-precision quantization methods. HAQ and AutoQ apply reinforcement learning to search for mixed-precision quantized architectures; they spend more time on training, yet their results are still worse than ours. HAWQ manually chooses the bit precision within a reduced search space, and its result is 0.92% lower than ours. BP-NAS uses small sampled datasets to complete the searching process and then transfers the result to ResNet-50; our method is still comparable with BP-NAS, although its results are obtained after 150 epochs of fine-tuning with label smoothing. As a lightweight network, MobileNet-V2 eliminates many redundant computations; therefore, quantizing MobileNet-V2 brings a larger precision loss. Even so, compared with DSQ, TQT, HAQ and AutoQ, our method converges well and outperforms them by 4.30%, 1.31%, 2.11% and 0.08%, respectively.

5.2.3 COCO Detection

We further explore the effectiveness of our mixed-precision model on detection tasks using the COCO benchmark [29], one of the most popular large-scale benchmark datasets for object detection. This dataset consists of images in 80 different categories. We use the trainval35k split for training and the minival split for validation. Both the one-stage RetinaNet [28] detector and the two-stage Faster R-CNN [43] detector are applied to verify the effectiveness of our selected mixed-precision model; that is, we use the mixed-precision ResNet-50 model selected in Section 5.2.2 as the backbone. For Faster R-CNN, the RPN and the ROI head are quantized to 4-bit. For RetinaNet, the feature pyramid and the detection heads are quantized to 4-bit, except that the last layer in the detection heads is quantized to 8-bit.

ResNet-50 + Faster R-CNN
Methods W/A-Bits Ave-Bits AP AP50 AP75 APS APM APL
Baseline 32/32 32.00 37.7 59.3 40.9 22.0 41.5 48.9
FQN [25] 4/4 4.00 33.1 54.0 35.5 18.2 36.2 43.6
BP-NAS [55] M/M 4.00 35.8 57.9 38.3 21.7 39.8 47.4
SSPS M/M 4.00 37.4 58.1 40.6 22.1 40.4 47.9
ResNet-50 + RetinaNet
Methods W/A-Bits Ave-Bits AP AP50 AP75 APS APM APL
Baseline 32/32 32.00 37.8 58.0 40.8 23.8 41.6 48.9
FQN [25] 4/4 4.00 32.5 51.5 34.7 17.3 35.6 42.6
Auxi [62] 4/4 4.00 36.1 55.8 38.9 21.2 39.9 46.3
SSPS M/M 4.00 36.4 55.8 38.6 20.8 39.9 47.6
Table 3: Results of Faster R-CNN and RetinaNet on the COCO validation dataset.

We compare the performance of our method with those of FQN [25], Auxi [62] and BP-NAS [55], where FQN and Auxi are fixed-precision methods and BP-NAS is a mixed-precision method. Table 3 shows the experimental results. As we can see, model quantization obviously affects the detection results. For the Faster R-CNN detector, our selected mixed-precision model demonstrates better performance; for example, our 4-bit detector with the ResNet-50 backbone outperforms FQN and BP-NAS by 4.3% and 1.6% AP, respectively. For the RetinaNet detector, our model also shows the best results among the quantized models.

5.3 Effective and Fast

5.3.1 Convergence Analysis

We take the mixed-precision search of ResNet-18 as an example for convergence analysis, where we set the expected average weight bit-width and average operation bit-width to 4. Fig. 4 shows the convergence curves of the searching loss, the average operation bit-width and the average weight bit-width. As an ablation experiment, the blue line represents the convergence curve of our SSPS method, and the orange line represents the convergence curve without the decision operations (steps 6-10 in Algorithm 1), which we call SPS. From Fig. 4 (a), we can see that the decision operations significantly improve the convergence of the searching process: they reduce the search space and the factors affecting the target, thus increasing the stability of the searching process. With the decrease of the temperature coefficient and the convergence of the training loss, the fluctuations become smaller and smaller until the curves approach the target constraints. Fig. 5 shows the selected quantization policy for ResNet-18. The histogram represents the quantization precision of each layer in our model: the upper part shows the weight precision of the different layers, and the bottom part shows the quantization precision of the activation values. In particular, we mark the order in which the precision of each layer was decided during the search.

Figure 4: (a) The convergence curve of searching loss. (b) The convergence curve of average operation bit. (c) The convergence curve of average weight bit.
Figure 5: Selected quantization policy for ResNet-18. The Y-axis represents the quantization precision, and the X-axis represents the order of precision decision for the layers. The numbers in the arrow indicate the order in which the layers are determined.

5.3.2 Comparison with Related Work

In this subsection, we discuss the differences between our method and two similar methods, DNAS [51] and BP-NAS [55].

Compared with DNAS: (1) By using the Gumbel Softmax with an annealing temperature, the pipeline of DNAS behaves very similarly to DARTS at the beginning, where multiple candidates participate in the computation. Therefore, it occupies a lot of memory, just like DARTS, as shown in Fig. 1 (a). In contrast, only one candidate is allowed to pass through our search cell, thus saving hardware resources. (2) DNAS introduces the cost of candidate structures into the loss function to encourage lower-precision weights and activations. Because no target is set, DNAS needs multiple searches to obtain an appropriate model, whereas our method needs only one search to return an optimal model under the given constraints. (3) Our sequential search method exponentially reduces the search space during the search, thus improving the search speed and convergence stability, as shown in Fig. 4. Our method takes less than 28 GPU-hours to complete a search for ResNet-18 on ImageNet, which is much faster than DNAS (40 GPU-hours). (4) DNAS is a block-wise mixed-precision method, in which all layers of a block use the same precision. Our method is a layer-wise mixed-precision method, which allows different precisions within the same block.
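The memory argument in point (1) can be sketched as follows: a DARTS-style cell evaluates every candidate branch and sums them, while a single-path cell samples one index (here with hard Gumbel noise) and evaluates only that branch, so activation memory scales with one candidate instead of all of them. This is a simplified illustration under assumed function names, not the paper's implementation; in particular, the temperature only shapes the soft relaxation used for gradients, and the hard index itself is temperature-invariant.

```python
import numpy as np

def gumbel_hard_select(alphas, rng):
    # Add Gumbel(0, 1) noise to the logits and take the argmax,
    # i.e., draw one candidate index from the Gumbel-Softmax's
    # underlying categorical distribution.
    g = -np.log(-np.log(rng.uniform(size=len(alphas))))
    return int(np.argmax(np.asarray(alphas) + g))

def darts_style_cell(x, branches, alphas):
    # All branches are computed and mixed: memory grows with the
    # number of candidates.
    w = np.exp(alphas - np.max(alphas))
    w = w / w.sum()
    return sum(wi * b(x) for wi, b in zip(w, branches))

def single_path_cell(x, branches, alphas, rng=None):
    # Only the sampled branch is computed: memory stays constant
    # in the number of candidates.
    rng = rng or np.random.default_rng(0)
    idx = gumbel_hard_select(alphas, rng)
    return branches[idx](x)

# Example: two candidate quantization branches (hypothetical).
branches = [lambda x: 2 * x, lambda x: 3 * x]
print(single_path_cell(1.0, branches, alphas=[10.0, 0.0]))
```

With a large logit gap such as `[10.0, 0.0]`, the sampled index is almost always the dominant branch, which is exactly the regime the search converges to as the temperature anneals.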

Compared with BP-NAS: (1) BP-NAS applies DARTS to address the optimization problem, which leads to a sharp increase in resource demand. (2) For ImageNet, BP-NAS randomly samples 10 categories and takes 5000 images as the training dataset for the search. This relies on an incomplete definition of the mixed-precision quantization task and is prone to falling into a local optimum. (3) Similar to DNAS, BP-NAS is also a block-wise mixed-precision method, so its search space is much smaller than ours.

6 Concluding Remarks

In this paper, we proposed a novel SSPS method for mixed-precision quantization search and introduced the given constraints into our search loss function to guide the searching process. The resulting supernet is fully differentiable, and the searching process can be optimized by gradient descent methods. During the search, we determined the quantization precision of each layer according to the selection certainty of its search cell, which reduces the search space exponentially and accelerates the convergence of the search. Experimental results demonstrated that our proposed SSPS method achieves better testing performance under similar constraints, compared to state-of-the-art methods on CIFAR-10, ImageNet and COCO. Our future work will focus on mixed-precision quantization architecture search without training datasets, and on training a universal model that supports multiple quantization precisions to meet more industrial demands.

References