VecQ: Minimal Loss DNN Model Compression With Vectorized Weight Quantization

Quantization has been proven to be an effective method for reducing the computing and/or storage cost of DNNs. However, the trade-off between the quantization bitwidth and final accuracy is complex and non-convex, which makes it difficult to optimize directly. Minimizing the direct quantization loss (DQL) of the coefficient data is an effective local optimization method, but previous works often neglect accurate control of the DQL, resulting in a higher loss of final DNN model accuracy. In this paper, we propose a novel metric called Vector Loss. Based on this new metric, we develop a new quantization solution called VecQ, which can guarantee minimal direct quantization loss and better model accuracy. In addition, to speed up the proposed quantization process during model training, we accelerate it with a parameterized probability estimation method and template-based derivation calculation. We evaluate our proposed algorithm on the MNIST, CIFAR, ImageNet, IMDB movie review and THUCNews text data sets with various DNN models. The results demonstrate that our proposed quantization solution is more accurate and effective than the state-of-the-art approaches, yet with more flexible bitwidth support. Moreover, the evaluation of our quantized models on Saliency Object Detection (SOD) tasks shows that they maintain comparable feature extraction quality with up to a 16× weight size reduction.




1 Introduction

Deep Neural Networks (DNNs) have been widely adopted in machine learning based applications

[34, 17]. However, besides DNN training, DNN inference is also a computation-intensive task that affects the effectiveness of DNN based solutions [3, 16, 41]. Neural network quantization employs low-precision, low-bitwidth data instead of high-precision data for model execution. Compared to DNNs using 32-bit floating-point data (FP32), a quantized model can achieve up to a 32× compression rate with extremely low-bitwidth quantization [18]. The low-bitwidth processing, which reduces the cost of inference by using less memory and reducing the complexity of the multiply-accumulate operations, significantly improves the efficiency of model execution [41, 36].

However, lowering the bitwidth of the data often brings accuracy degradation [16, 4, 14]. This requires the quantization solution to balance computing efficiency and final model accuracy. Unfortunately, this quantitative trade-off is non-convex and hard to optimize, since the impact of quantization on the final accuracy of the DNN model is hard to formulate.

Previous methods neglect the quantitative analysis of the Direct Quantization Loss (DQL) of the weight data and make quantization decisions empirically while directly evaluating the final model accuracy [18, 7, 29, 43, 22], thus achieving only unpredictable accuracy.

In order to achieve higher training accuracy, finding an optimal quantization solution with minimal loss during the training of the learning kernels is effective and practical. One way of finding a locally optimal solution is to minimize the DQL of the weight data, which is widely used in current quantization solutions [15, 27, 26, 38, 12].

As shown in Fig. 1, w denotes the full-precision weight and w_q is the value after quantization. Conventional quantization methods regard w as a point (set as the origin in Fig. 1) in Euclidean space, and w_q as a point close to w in a discrete data space. The discrete data space contains a certain number of data points that can be represented by the selected bitwidth. Therefore, the square of the Euclidean distance (squared 2-norm, also called the L2 distance [45]) between the original weight data and the quantized data, ||w - w_q||^2, is simply used as the loss of the quantization process, which is to be reduced [27, 26, 38, 12].

Fig. 1: Uncertainty of using the L2 distance to evaluate the quantization loss. Opt-Q is the optimal value for the quantization within the representable range; the optimal distance and the achievable distance are both marked in the figure.

Although the L2 based solutions are proven to be effective and provide good training results in terms of model accuracy and weight bitwidth, such solutions still have two major issues. (1) Solving the L2 based quantization always leads to inaccurate results because of the approximation process involved. As shown in Fig. 1, even with an additional quantization scaling factor α, which helps reduce the difference between the original and quantized data, we still cannot avoid the accuracy loss during the quantization process: the quantized results always fall into the sub-optimal space (Sub-Q) instead of reaching the optimal value (Opt-Q). (2) The L2 based quantization process focuses on each individual weight value and neglects the correlation among the data points within a kernel or a layer.
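To make the scaling-factor issue concrete, here is a minimal numeric sketch (ours, not the paper's implementation; the two weight values are borrowed from the example in Section 3): for any fixed discrete assignment q, the optimal α has a closed form, yet the overall loss still depends on which discrete assignment the non-convex search settles on.

```python
import numpy as np

def l2_quant_loss(w, alpha, q):
    """Squared L2 distance between original and quantized weights."""
    return float(np.sum((w - alpha * q) ** 2))

def best_alpha(w, q):
    """Closed-form optimal scaling factor for a fixed discrete assignment q."""
    return float(np.dot(w, q) / np.dot(q, q))

w = np.array([2.5, 1.75])

# Two candidate ternary assignments for the same weights:
q_a = np.array([1.0, 1.0])   # both weights mapped to +1
q_b = np.array([1.0, 0.0])   # second weight truncated to 0

loss_a = l2_quant_loss(w, best_alpha(w, q_a), q_a)
loss_b = l2_quant_loss(w, best_alpha(w, q_b), q_b)
# Even with the per-assignment optimal alpha, the achievable loss depends
# on which discrete assignment the non-convex search lands on.
```

Here the first assignment is clearly better, but an iterative solver has no guarantee of reaching it.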

To address the issues above, instead of using the traditional Euclidean distance, we propose a more accurate quantization loss evaluation metric; we also propose an algorithm to guide the quantization of the weight data quantitatively. We arrange the weights into a vector rather than treating them as scalar data, to take advantage of the fact that the loss between vectors can be split into an orientation loss and a modulus loss, which are independent of each other. As a result, for the first time, we are able to achieve the minimal loss of weight quantization for DNN training.

In this paper, after reviewing the related works, we prove that using vector loss as the optimization target is better than directly optimizing the L2 distance between the weights before and after quantization. Based on our proposed vectorized quantization loss measurement, we further propose a vectorized quantization method (VecQ) to better explore the trade-off between computing efficiency and the accuracy loss of quantization.

In summary, our contributions are as follows:

  • We propose a new metric, vector loss, as the loss function for DNN weight quantization, which provides an optimal quantization solution.

  • A new quantization training flow based on the vectorized quantization process, named VecQ, is proposed, which achieves better model accuracy for different bitwidth quantization targets.

  • A parametric estimation method and a computing template are proposed to reduce the cost of the probability density estimation and derivative calculation in VecQ, speeding up the quantization process in model training.

  • Extensive experiments show that VecQ achieves lower accuracy degradation under the same training settings when compared to state-of-the-art quantization methods on image classification tasks with the same DNN models. Evaluations on the Saliency Object Detection (SOD) task also show that VecQ maintains comparable feature extraction quality with up to a 16× weight size reduction.

This paper is structured as follows. Section 2 introduces the related works. In Section 3, the theoretical analysis of the effectiveness of vector loss compared to the L2 loss is presented. Section 4 presents the detailed approach of VecQ. Section 5 proposes the fast solution for our VecQ quantization as well as the integration of VecQ into the DNN training flow. Section 6 presents the experimental evaluations and Section 7 concludes the paper.

2 Related works and Motivation

As an effective way to compress DNNs, many quantization methods have been explored [14, 18, 31, 7, 27, 45, 1, 29, 12, 37, 42, 22, 11, 38, 26, 9, 4, 20, 21, 15, 40, 43]. These quantization methods can be roughly categorized into three different types based on their objective functions for the quantization process:

  • Methods based on heuristic guidance of the quantization, e.g., directly minimizing the final accuracy loss;

  • Methods based on minimizing Euclidean Distance of weight data before and after quantization;

  • Other methods such as training with discrete weights and teacher-student network.

In this section, we first introduce the existing related works based on their different categories and then present our motivation for vectorized quantization.

2.1 Heuristic guidance

The heuristic methods usually directly evaluate the impact of the quantization on the final output accuracy. They often empirically iterate the training process to improve the final accuracy. For example, BNNs [18] proposed a binary network for fast network inference. It quantizes all the weights and activations in a network to two values, {-1, +1}, based on the sign of the data. Although it provides a DNN with 1-bit weights and activations, it is hard to converge without Batch Normalization layers [19] and leads to a significant accuracy degradation when compared to full-precision networks. Binary Connect [7] and Ternary Connect [29] sample the original weights into binary or ternary values according to a sampling probability defined by the value of the weights (after scaling to [0, 1]). None of these works quantifies the loss introduced during quantization, so the final accuracy is the only guideline of the quantization.
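A sketch of this stochastic sampling rule (our illustration, assuming the common hard-sigmoid probability used by Binary Connect; the function names are ours):

```python
import numpy as np

def hard_sigmoid(x):
    """Map a weight to a sampling probability in [0, 1]."""
    return np.clip((x + 1.0) / 2.0, 0.0, 1.0)

def stochastic_binarize(w, rng):
    """Sample each weight to +1 with probability hard_sigmoid(w), else -1."""
    p = hard_sigmoid(w)
    return np.where(rng.random(w.shape) < p, 1.0, -1.0)

rng = np.random.default_rng(0)
w = np.array([-2.0, -0.2, 0.2, 2.0])
wb = stochastic_binarize(w, rng)
# Weights at or below -1 are always sampled to -1; at or above +1, to +1.
```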

Quantization methods in [43, 14] convert the full-precision weights to a fixed-point representation by dropping the least significant bits, without quantifying the impact.

INQ [42] iteratively applies weight partitioning, quantization and re-training until all the weights are quantized into powers of two or zeros.

STC [22] introduces a ternary quantization which first scales the weights into the range [-1, 1], and then quantizes all scaled weights into ternary values by uniformly partitioning the range: the values located in the lowest and highest thirds of the range are quantized to -1 and 1, respectively, and the rest are set to 0.

TTQ [45] introduces a ternary quantization which quantizes the full-precision weights to ternary values by a heuristic threshold, but with two different scaling factors for positive and negative values, respectively. The scaling factors are optimized during back propagation.

The quantization method in [20] (denoted as QAT) employs an affine mapping between integers and real values with two constant parameters: a scale and a zero-point. It first subtracts the zero-point from the data (weights/activations), then divides the data by the scale factor, and obtains the quantized results with a rounding operation and the affine mapping. The approach of TQT [21] follows QAT but improves it by constraining the scale factors to powers of 2 and relating them to trainable thresholds.
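As an illustration, a common form of this affine mapping can be sketched as follows (a simplified sketch with hypothetical parameter values, not the exact QAT implementation):

```python
import numpy as np

def affine_quantize(x, scale, zero_point, num_bits=8):
    """Affine (asymmetric) quantization: map real values to an integer grid."""
    qmin, qmax = 0, 2 ** num_bits - 1
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return q.astype(np.int64)

def affine_dequantize(q, scale, zero_point):
    """Map integers back to (approximate) real values."""
    return scale * (q.astype(np.float64) - zero_point)

x = np.array([-1.0, 0.0, 0.5, 1.0])
scale, zero_point = 2.0 / 255, 128          # hypothetical calibration result
xq = affine_dequantize(affine_quantize(x, scale, zero_point), scale, zero_point)
# Round-trip error stays within one quantization step.
```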

2.2 Optimizing Euclidean Distance

In order to provide better accuracy control, reducing the Euclidean distance between the data before and after quantization has become a popular solution.

Xnor-Net [31] adds a scaling factor on the basis of BNNs [18] and calculates the optimal scaling factor that minimizes the distance between the weights before and after quantization. The scaling factor boosts the convergence of the model and improves the final accuracy. The follow-up residual quantization method in [11] adopts Xnor-Net [31] to further compensate the errors produced by a single binary quantization, improving the accuracy of the quantized model.
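The closed-form solution for this binary case is simple enough to sketch (a minimal illustration; XNOR-Net applies it per filter and also binarizes activations, which is omitted here):

```python
import numpy as np

def binarize_xnor(w):
    """Binary quantization with the closed-form optimal scaling factor.

    Minimizing ||w - alpha*b||^2 over b in {-1, +1}^n and alpha yields
    b = sign(w) and alpha = mean(|w|).
    """
    b = np.sign(w)
    b[b == 0] = 1.0               # map exact zeros to +1
    alpha = float(np.mean(np.abs(w)))
    return alpha, b

w = np.array([0.5, -1.5, 2.0, -1.0])
alpha, b = binarize_xnor(w)       # alpha = 1.25, b = sign pattern of w
```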

TWNs [27] proposes an additional threshold factor together with the scaling factor for ternary quantization. The optimal parameters (scaling factor and threshold factor) are still obtained by optimizing the Euclidean distance between the weights before and after quantization. TWNs achieves better final accuracy than Xnor-Net and BNNs.

The extremely low bit method (ENN) proposed in [26] quantizes the weights into powers of two by iteratively optimizing the L2 distance between the weights before and after quantization.

TSQ [38] presents a two-step quantization method, which first quantizes the activations to low-bit values, then fixes them and quantizes the weights into ternary values. TSQ employs a scaling factor for each kernel, resulting in limited model size reduction.

L2Q [12] first shifts the weights of a layer to a standard normal distribution with a shifting parameter and then employs a linear quantization for the data. The shifting parameter accounts for the distribution of the weight data, which provides better loss control than simply optimizing the Euclidean distance during the quantization.

Several other works [15, 40] adopt k-means with irregular non-linear quantization. Although the values are clustered before quantization, the final results are still obtained by optimizing the Euclidean distance between the original values and the quantized ones.

2.3 Other works

Besides the heuristic and Euclidean Distance approaches, there are still many other works focusing on low-precision DNN training.

GXNOR-Net [9] utilizes the discrete weights during training instead of the full-precision weights. It regards the discrete values as states and projects the gradients in backward propagation as transition probabilities to update the weights directly, hence providing a network with ternary weights.

T-DLA [4] quantizes the scaling factor of ternary weights and the full-precision activations into fixed-point numbers and constrains the quantization loss of the activation values by adopting the infinity norm. Compared with [14, 43], it shifts the available bitwidth to the most effective data portion to make full use of the targeted bitwidth.

In TNN [1], the authors design a ternary student network, which has the same architecture as the full-precision teacher network, aiming to predict the output of the teacher network without training on the original datasets.

In HAQ [37], the authors propose a range parameter: all weights out of the range are truncated, and the weights within the range are linearly mapped to discrete values. The optimal range parameter is obtained from the KL-divergence of the weights during the quantization.

However, compared with the heuristic guidance and Euclidean distance based methods, the approaches above either focus on a specific bitwidth or perform worse in terms of the accuracy of the trained DNNs.

2.4 Motivation of the VecQ Method

We have witnessed the effectiveness of the L2 distance based methods among all the existing approaches. However, as explained in the introduction, two defects still lead to inaccurate DQL measurement.

The first defect is that the L2 distance based optimization usually cannot be solved accurately [27, 12, 38, 15, 40], even with an additional scaling factor α that scales the data into a proper range. As shown in Fig. 1, the quantization function with the additional scaling factor [12, 27] has a theoretical optimal solution (the green dot) with the minimal L2 distance; however, due to the lack of solvable closed-form expressions [27, 12], only solutions within a sub-optimal range of L2 distances can be obtained, leading to inaccurate quantization results. Additionally, even the methods that involve k-means for clustering the weights still fall into the sub-optimal solution space [15, 40]; their corresponding approximated optimal quantized weights are located in the Sub-Q space colored orange.

The second defect is that L2 based quantization neglects the correlation of the weights within the same kernel or layer and only focuses on the difference between single values. Even with the k-means based solutions, the distribution of the weights in the same layer is ignored in the quantization process. However, considering the distribution of the weight data has proven effective for accuracy control in existing approaches [12, 4].

We discover that representing the quantization loss of the weights of a kernel or a layer using a vector distance instead of the L2 distance intrinsically solves the two problems mentioned above. The two attributes of a vector, orientation and modulus, uniquely determine it, and the distance between vectors is naturally determined by these two attributes. Therefore, we define the DQL with a vector loss that involves the distance of both the orientation and the modulus of the vectors. To the best of our knowledge, no previous work leverages a vector loss for the weight quantization of DNNs.

In this work, we prove that the vector loss can provide the optimal quantization solution and hence achieve a smaller DQL for the weight data quantization during model training, which leads to higher model accuracy. Based on this, we propose VecQ, which minimizes the loss of the quantization process based on the vector loss. We also propose a fast parameter estimation method and a computation template to speed up our vectorized quantization process for easier deployment of our solution.

3 Vectorized distance vs. L2 distance

Fig. 2: The solutions based on L2 loss and Vector loss.

Before introducing our vectorized quantization method, we first explain the effectiveness of loss control with vector loss, using two data points as an example for simplicity. Assume a DNN layer with only two weights, denoted as {w_1, w_2}, whose values are {2.5, 1.75}. The weights will be quantized into k bits. The quantization loss based on the L2 distance is denoted as J_2, and a quantization solution is expressed in the format {α, q_1, q_2}, where α is the floating-point scaling factor and q_1, q_2 are the discrete values in the quantization set; then we get

J_2 = Σ_i (w_i - α q_i)².

Let W be the vector from the origin O to the point (w_1, w_2), and let its quantized counterpart be αQ with Q = (q_1, q_2). J_2 can also be represented as the squared modulus of the difference between the vectors W and αQ. J_2 is calculated as:

J_2 = ||W - αQ||².
As shown in Fig. 2, there are only two dimensions in the solution space, each representing one weight, and each dimension contains 2^k possible discrete values. Since the direct optimization based on the L2 distance under the k-bit constraint is non-convex, a common method to solve this problem is to update α and Q iteratively until reaching an extreme point [26]. The possible solutions are located on the black dotted and the blue solid lines in Fig. 2, thanks to the full-precision scaling factor α. The intersection angle between W and the quantized version αQ is denoted as θ.

However, due to the non-convex characteristic of optimizing the L2 loss under the k-bit constraint [26], the result may be found at the red point in the figure, which is the first sub-optimal extreme point on the loss curve in Fig. 2, ignoring the optimal solution at the second extreme value of the curve. Therefore, instead of using the L2 quantization loss, we measure the difference between the weight vector and the quantized result with a vector loss to guide the convergence process towards the solution.
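The two-weight example can be checked numerically. The sketch below (our illustration; it enumerates a ternary assignment set as an assumption, since the example's exact bitwidth is not recoverable here) uses the identity that, with the per-direction optimal α, the residual L2 loss equals ||W||²·sin²θ, so the discrete direction closest in angle to W is the global optimum:

```python
import numpy as np

w = np.array([2.5, 1.75])

def residual_loss(w, q):
    """L2 loss with the per-direction optimal alpha.

    min_alpha ||w - alpha*q||^2 = ||w||^2 * sin^2(theta),
    where theta is the angle between w and q.
    """
    alpha = np.dot(w, q) / np.dot(q, q)
    return float(np.sum((w - alpha * q) ** 2))

# Enumerate every non-zero ternary assignment (a 2-D toy solution space).
candidates = [np.array([a, b], dtype=float)
              for a in (-1, 0, 1) for b in (-1, 0, 1)
              if (a, b) != (0, 0)]

losses = [residual_loss(w, q) for q in candidates]
best_q = candidates[int(np.argmin(losses))]
# Up to sign (absorbed by alpha), the minimizer is q = (1, 1): the
# direction closest in angle to w, not the component-wise rounding (1, 0).
```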

Generalizing the above observation to multiple weights within a DNN layer, we have the following theorem:

Theorem 1.

Assume W is a weight vector of a layer of a DNN containing n weights. By minimizing the vector loss J_v, the optimal solution for the L2 distance based quantization can be obtained.

We first formally prove that there is always an optimal solution for minimizing the vector loss J_v; we then prove that the optimal solution for J_v is also the optimal solution for minimizing the quantization loss based on the L2 distance.


J_v can be divided into an orientation loss J_o and a modulus loss J_m, which are represented as


where ê_w and ê_q represent the unit vectors of W and the quantized weight vector, respectively. Moreover, we obtain the first-order derivative of J_o with respect to ê_q and hence obtain its Hessian matrix as below:


and the second-order derivative of J_m with respect to α is


Equ. 4 and Equ. 5 indicate that the functions J_o and J_m are both convex, and their optimal solutions can be obtained through efficient convex optimization solvers. The optimization of the orientation loss is defined only by the unit vectors ê_w and ê_q, so the value of α is not involved; the optimization of the modulus loss is to find the optimal value of α. Therefore, the optimization of J_o and J_m can be carried out independently. By optimizing J_o alone, without the effect of a joint optimization with the modulus, the optimal solution indicating the optimal intersection angle can be located directly; the resulting angle is never worse than the one achieved by the L2 loss based solutions, because those solutions cannot guarantee the optimal angle in the same solution space. The modulus loss can then be optimized completely by scaling α without affecting the angle. Furthermore, with the angles we have:


using the same solution under vector loss minimization, we have the following equation for L2 distance based loss:


which proves that with vector loss, we always achieve the optimal solution for the L2 distance based quantization. ∎

Guided by the theorem above, we obtain the final optimal solution for the example points in Fig. 2, which provides the minimal DQL.

Based on our proposed vector loss metric and theorem, we next discuss our algorithms for obtaining the vector loss based quantization solution, as well as the methods to speed up the algorithms for practical purposes.

4 Vectorized Quantization

VecQ is designed to follow the theoretical guideline we developed in Section 3. First, the vectorization of the weights is introduced. Then the adoption of the vector loss in VecQ is explained in detail. Finally, the VecQ quantization process, with its two critical stages, is presented.

4.1 Vectorization of weights

For the weight set of layer l, we flatten and reshape it into an n-dimensional vector w, where n is the product of the number of input channels, the number of output channels, and the squared kernel size of this layer. For simplicity, we use w to represent the weight vector of a certain layer before quantization.

4.2 Loss function definition

We use the vectorized loss instead of the Euclidean distance during the quantization process. Since solving the orientation loss and the modulus loss independently achieves the optimal solution for each of them, we further define the total quantization loss as the sum of the two,

J_v = J_o + J_m,

to provide a stricter constraint during the quantization process.
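As a sketch, one plausible instantiation of this combined loss (an assumption on the exact expressions, which the paper defines through its equations) is:

```python
import numpy as np

def vector_loss(w, wq):
    """Sketch of J_v = J_o + J_m: orientation loss from the unit
    vectors, modulus loss from the vector lengths (an assumed form)."""
    ew = w / np.linalg.norm(w)           # unit vector of the original weights
    ewq = wq / np.linalg.norm(wq)        # unit vector of the quantized weights
    j_o = float(np.sum((ew - ewq) ** 2))                        # orientation
    j_m = float((np.linalg.norm(w) - np.linalg.norm(wq)) ** 2)  # modulus
    return j_o + j_m

w = np.array([2.5, 1.75])
zero_loss = vector_loss(w, w.copy())     # same direction, same length: 0
scaled_loss = vector_loss(w, 2.0 * w)    # same direction: only J_m remains
```

The two terms respond to independent defects of the quantized vector, which is exactly what the two-stage optimization below exploits.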

4.3 Overall process

Fig. 3: The overall flow of quantization process, including both steering and driving stage.

According to our analysis in Section 3, the orientation loss indicates the optimal intersection angle and the modulus loss indicates the optimal scale at this angle. Therefore, our quantization takes two stages to minimize the two losses independently, defined as the steering stage and the driving stage, as shown in Fig. 3. In the steering stage, we adjust the orientation of the weight vector to minimize the orientation loss. Then, we fix the orientation and only scale the modulus of the vector in the driving stage to minimize the modulus loss.

Let w be the weight vector of a layer of a DNN in the real space and w_q be the quantized weight vector in the uniformly discrete subspace. First, we steer w to the orientation vector ŵ_q:

where ŵ_q is an orientation vector that disregards the modulus of the vector and only focuses on the orientation. Second, along the determined orientation vector ŵ_q, we search for the position of the modulus and "drive" the vector to the optimal position with minimal modulus loss. The quantized vector w_q is obtained by driving ŵ_q.


The complete quantization process is the combination of the two stages. The final target is to reduce the loss between the original weights w and the quantized results w_q. The entire quantization process is represented as


4.4 Steering stage

The purpose of the steering stage is to search for an optimal orientation vector that has the least orientation loss with respect to w, in order to minimize J_o.

Fig. 4: Linear quantization with interval λ.

As shown in Fig. 4, w is the weight in floating-point representation and it is quantized into a k-bit representation. This means there are 2^k values in total that can be used to represent the values in w, each of them denoted as q_i. We adopt a linear quantization method, where an interval λ represents the distance between two adjacent quantized values.


The vector with floating-point data is quantized to a vector with discrete data by a rounding operation applied to each value in the vector. The data are limited to the representable range by an extended clip operation. The 0.5 offset is used to avoid aggregation at the 0 position and guarantees that the maximum number of rounding values is 2^k.

Given a bitwidth k as the quantization target, the intermediate quantized weight ŵ_q is


which has the minimal orientation loss with λ as the interval parameter. When k is fixed, λ decides the orientation loss between w and ŵ_q. In order to minimize the loss, we only need to find the optimal λ:


Finding the optimal λ requires several processes with high computational complexity; the detailed processes and the corresponding fast solution are presented in Section 5.
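A minimal sketch of this projection (our reading of the steering step: a half-integer grid with spacing λ and no level at zero; the exact rounding and clipping bounds are an assumption):

```python
import numpy as np

def steer(w, k, lam):
    """Steering-stage sketch: project each weight onto a grid of 2**k
    half-integer levels with spacing lam; the 0.5 offset avoids a level
    at zero, and the clip bounds the grid to 2**k values."""
    top = 2 ** (k - 1) - 0.5             # largest representable level
    q = np.floor(w / lam) + 0.5          # snap to the half-integer grid
    return np.clip(q, -top, top)         # orientation vector, in units of lam

w = np.array([-1.3, -0.2, 0.4, 2.9])
q = steer(w, k=2, lam=1.0)   # levels drawn from {-1.5, -0.5, 0.5, 1.5}
```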

4.5 Driving stage

In the driving stage, we minimize the modulus loss between the orientation vector ŵ_q obtained from the steering stage and the original weight vector w. Since we focus on the modulus in this stage, only the scaling factor α is involved.


Here we only need to find the optimal α that minimizes the modulus loss:


where α can be easily obtained by finding the extreme point of J_m, setting its derivative with respect to α to zero, which yields the closed-form value of α.
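The driving stage thus reduces to a one-dimensional least-squares fit. A sketch (our illustration; the closed-form α below is the standard least-squares minimizer of ||w - α·ŵ_q||², assumed to match the derivative condition above):

```python
import numpy as np

def drive(w, q_hat):
    """Driving-stage sketch: the alpha minimizing ||w - alpha*q_hat||^2
    is alpha = <w, q_hat> / <q_hat, q_hat> (set the derivative to zero)."""
    return float(np.dot(w, q_hat) / np.dot(q_hat, q_hat))

w = np.array([2.5, 1.75])
q_hat = np.array([1.5, 0.5])   # a hypothetical steering-stage output
alpha = drive(w, q_hat)
wq = alpha * q_hat             # final quantized weights
```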

Finally, with the two stages above, the quantized result w_q is determined by λ and α:


5 Fast Quantization

Fig. 5: The illustration of quantization regions and symbols.

With the quantization method proposed in Section 4, the minimal quantization loss is achieved when the optimal λ in Equ. (14) is found. However, as one of the most critical processes in model training, the computational overhead of quantization leads to inefficient training. In order to address this issue, in this section we first analyze the computational complexity of calculating the value of λ. Then, we propose a fast solution based on fast probability estimation and a computing template. In the end, the detailed implementation of our quantization solution, together with the fast solver, is integrated into our training flow.

5.1 Analysis of the optimal λ

The most computationally intensive process in our quantization is the steering stage, specifically, the process of solving Equ. (14). However, Equ. (14) cannot be solved directly due to the clipping and rounding operations. Instead of directly using the values in w, we involve the distribution of the values in w to support a more general computation of the loss. The probability density of the values in w is denoted as p(w).

According to the steering method in Equ. (12), each value w_i is projected to a value q_i; the q_i values are linearly distributed with a uniform distance defined by the interval λ. As shown in Fig. 5, the light blue curve is the distribution of the values in w and the orange dots are the values after quantization, represented as q_i. P(q_i) indicates the probability of q_i.

Specifically, the data within the interval of width λ centered at q_i are replaced by the single value q_i, and the data outside the representable range are truncated and set to the nearest q_i, as given in Equ. (12).

We introduce auxiliary notation for the boundary terms to ease the formulation. Based on the distribution of the data in w, the expanded terms of J_o in Equ. (14) can be obtained as follows,


J_o is then represented as


Since linear quantization with the fixed interval λ is adopted, the remaining terms can be easily derived by the following equations,


So J_o is related only to λ, and Equ. (14) can be solved by solving


Concluding from the discussion above, for each w, three steps are necessary:

  • Estimating the probability density of the values in w.

  • Solving Equ. (23) to get the optimal λ.

  • Using the optimal λ to obtain the final quantization results.

However, the first two steps (probability density estimation and derivative calculation) are complex and costly in terms of CPU/GPU time and operations, which limits the training speed.

5.2 Fast solver

For the computation-intensive probability density estimation and derivative calculation, we propose two methods to speed up the processes: fast parametric estimation and a computing template, respectively.

5.2.1 Fast probability estimation

There are two methods for probability density estimation: parametric estimation and non-parametric estimation.

Non-parametric estimation is usually used to fit the distribution of data without prior knowledge. It requires the density at every data point to be estimated individually, which leads to a huge computational overhead.

We take the widely adopted non-parametric estimation method, Kernel Density Estimation (KDE), as an example to illustrate the complexity of non-parametric estimation.


Here p̂(x) is the estimated probability density at x, N is the number of samples, K(·) is a non-negative kernel function that integrates to 1, and h is the smoothing parameter (bandwidth). The time complexity of computing all the probability densities is O(N²) and the space complexity is O(N), because the probability densities of all N data points need to be computed and stored.
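For reference, a naive KDE sketch (ours) that makes this cost visible: evaluating the density at all N sample points builds an N-by-N kernel matrix.

```python
import numpy as np

def gaussian_kde(samples, xs, h):
    """Naive kernel density estimate with a Gaussian kernel.
    The pairwise difference matrix makes the cost O(len(xs) * len(samples))."""
    z = (xs[:, None] - samples[None, :]) / h
    kern = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return kern.mean(axis=1) / h

rng = np.random.default_rng(0)
samples = rng.standard_normal(1000)
density = gaussian_kde(samples, samples, h=0.3)   # N evaluations over N samples
```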

Parametric estimation is used for data following a known distribution family and only computes the parameters of that distribution. It can be processed quickly under a proper distributional assumption. Thus, we adopt a parametric estimation method in our solution.

There is prior knowledge about the weights of DNN layers: they are assumed to obey a normal distribution with a mean value of 0, so that training can be conducted easily and the model provides better generalization ability [12, 38]:

w ~ N(0, σ²),

where σ is the standard deviation of w. Based on this prior knowledge, we can use parametric estimation to estimate the probability density of w, and the only parameter that needs to be computed during training is σ. The effectiveness of this prior distribution is also confirmed by the final accuracy we obtain in the evaluations.

With Equ. (25), we have


Therefore, parametric estimation only requires the standard deviation σ, which can be calculated with

σ = sqrt(E(w²) - (E(w))²),

where E(·) computes the expectation. Hence, the time complexity of computing the density is reduced to O(N) and the space complexity is reduced to O(1).

5.2.2 Computing template

After reducing the complexity of computing the probability, finding the optimal λ is still complex and time-consuming. In order to improve the computing efficiency of this step, we propose a computing-template-based method.

Since the weights of a layer obey a normal distribution, w ~ N(0, σ²), they can be transformed to the standard normal distribution:

ŵ = w / σ ~ N(0, 1).
Then, we can use ŵ to compute the optimal λ instead of using w, because of:


Here, ŵ is the computing template for w, because it yields the same orientation loss as w does. By choosing this computing template, solving Equ. (23) is equivalent to solving the substitute equation Equ. (30).

Fig. 7: Integrated quantization process in DNN training.

Since W′ ~ N(0, 1), the probability of each quantized value is obtained by integrating the standard normal density over the corresponding quantization interval; for the value vᵢ = iλ,

P(vᵢ) = Φ((i + 1/2)λ) − Φ((i − 1/2)λ)

where Φ is the standard normal cumulative distribution function and the two outermost levels absorb the remaining tail probability. After the probabilities are obtained, the orientation loss can be expressed as a function J(λ, k) that relates only to the interval λ and the targeted quantization bitwidth k.

J(λ, k) is a convex function when k > 1. However, it is constant when k = 1, because the angle between the weight vector and the vector constructed from the signs of the weights does not depend on λ. Due to this independence of λ at k = 1, we simply fix λ for the convenience of the following process. We plot the curve of J(λ, k) with the change of λ under different bitwidths in Fig. LABEL:fig:optimal_orientation_curve.

The optimal λ values for all bitwidth settings, obtained by solving Equ. (32), are shown in Table I. The loss is small enough when the targeted bitwidth is greater than 8, so we omit those results. With the template above, we only need to solve for the optimal λ once, and then apply it to all quantizations without recalculating it. In other words, simply looking up the corresponding value in this table yields the optimal parameter, which reduces the complexity and intensity of the computation and significantly speeds up the training process.


Bitwidth k   2       3       4       5       6       7       8


Optimal λ    0.9957  0.5860  0.3352  0.1881  0.1041  0.0569  0.0308


TABLE I: Optimal value of λ for the standard normal template W′ with bitwidth k (for k = 1 the loss is independent of λ).
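Using the table turns the per-layer interval search into a constant-time lookup; a sketch with Table I's values and an illustrative symmetric uniform quantizer (an assumption here, since the exact quantizer form follows the earlier sections):

```python
import numpy as np

# Optimal template intervals from Table I (for standard-normal weights).
OPTIMAL_LAMBDA = {2: 0.9957, 3: 0.5860, 4: 0.3352, 5: 0.1881,
                  6: 0.1041, 7: 0.0569, 8: 0.0308}

def quantize_layer(w, k):
    """Look up the template lambda, rescale it by the layer's standard
    deviation, and quantize symmetrically (no per-layer solve needed)."""
    step = OPTIMAL_LAMBDA[k] * np.std(w)
    m = 2 ** (k - 1) - 1
    return step * np.clip(np.round(w / step), -m, m)

w = np.random.normal(0.0, 0.02, size=10000)
wq = quantize_layer(w, 4)  # at most 2 * (2^3 - 1) + 1 = 15 distinct levels
```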

5.3 DNN training integration

We integrate our VecQ quantization into the DNN training flow for both the weight data and the activation data, as shown in Fig. 7.

Weight quantization: For layer l, during the forward propagation, we first quantize the full-precision weights (W_f) into the quantized values (W_q), then use the quantized weights to compute the layer output. During the backward propagation, the gradient is calculated with W_q instead of W_f and propagated. In the final update step, the gradient of W_q is used to update W_f [43].
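This straight-through style update can be sketched as one toy training step; the scalar squared-error objective and the simple quantizer here are placeholders for VecQ's actual operators:

```python
import numpy as np

def quantize(w, step=0.05, m=1):
    """Placeholder weight quantizer (symmetric, m levels per side)."""
    return step * np.clip(np.round(w / step), -m, m)

w_full = np.array([0.12, -0.03, 0.07])  # full-precision master weights W_f
x = np.array([1.0, 2.0, -1.0])          # layer input

# Forward: the output is computed with the quantized weights W_q.
w_q = quantize(w_full)
y = np.dot(w_q, x)

# Backward: the gradient is taken with respect to W_q
# (loss = (y - target)^2 with target = 0 in this toy example) ...
grad_wq = x * 2.0 * (y - 0.0)

# ... but it is applied to the full-precision master copy W_f.
lr = 0.01
w_full -= lr * grad_wq
```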

Activation quantization: Inspired by the Batch Normalization (BN) technique [19], instead of using a pre-defined distribution, we compute the distribution parameter of the activation outputs during training and update it with an Exponential Moving Average. During inference, this distribution parameter is employed as a linear factor applied to the output of the activation function, i.e., the non-linear function following the convolutional or fully-connected layers of layer l, such as Sigmoid, Tanh or ReLU.
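The running-statistic idea can be sketched as follows; the momentum value and the use of the standard deviation as the tracked parameter are illustrative assumptions:

```python
import numpy as np

class ActivationScaleTracker:
    """Track the activation distribution parameter with an Exponential
    Moving Average, in the spirit of BN's running statistics."""
    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.sigma = None  # running distribution parameter

    def observe(self, activations):
        batch_sigma = float(np.std(activations))
        if self.sigma is None:
            self.sigma = batch_sigma
        else:
            self.sigma = (self.momentum * self.sigma
                          + (1.0 - self.momentum) * batch_sigma)

tracker = ActivationScaleTracker()
for _ in range(100):  # simulate 100 training batches
    tracker.observe(np.random.normal(0.0, 2.0, size=1024))
# tracker.sigma now approximates the activations' true scale and can be
# applied as a linear factor at inference time.
```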

6 Evaluations

We choose Keras v2.2.4 as the baseline DNN training framework [6]. The layers in Keras are rewritten to support our proposed quantization mechanism as presented in Section 5.3. Specifically, all the weight data in the DNNs are quantized to the same bitwidth in the evaluation of VecQ, including the first and last layers. Our evaluations are conducted on two classic tasks: (1) image classification and (2) salient object detection (SOD). The image classification results are compared to state-of-the-art results with the same bitwidth configuration, and the SOD results are compared to state-of-the-art solutions using the FP32 data type.

6.1 Classification

Image classification is the basis of many computer vision tasks, so the classification accuracy of the quantized model is representative of the effectiveness of our proposed solution.

6.1.1 Evaluation settings

Datasets and DNN models. The MNIST, CIFAR10 [24] and ImageNet [8] datasets are selected for the image classification evaluations; the IMDB movie review dataset [30] and the Chinese THUCNews text dataset [35] are selected for the sentiment and text classification evaluations. The detailed information of the datasets is listed in Table II and Table III.


Datasets            MNIST     CIFAR10   ImageNet


Image size          28×28×1   32×32×3   224×224×3
# of Classes        10        10        1000
# of Images         60000     50000     1281167
# of Pixels (log10) 7.67      8.19      11.29


TABLE II: The image classification dataset attributes.


Datasets IMDB THUCNews


Objectives Movie reviews Text classification
# of Classes 2 10
# of samples 50000 65000


TABLE III: The sentiment and text classification datasets.


Models AlexNet ResNet-18 MobileNetV2


Convs 5 21 35
DepConvs - - 17
BNs 7 19 52
FCs 3 1 1
Parameters (M) 50.88 11.7 3.54


¹ Convs indicates the vanilla convolution layers; DepConvs the depthwise convolution layers [32]; BNs the Batch Normalization layers [19]; FCs the fully-connected layers.

TABLE IV: The models for ImageNet.


W/A¹ LeNet5 VGG-like [34] AlexNet [25] ResNet-18 [17] MobileNetV2 [32]
Size(M) Acc Size(M) Acc Size(M) Top1/Top5 Size(M) Top1/Top5 Size(M) Top1/Top5


32/32 6.35 99.4 20.44 93.49 194.10 60.01/81.90² 44.63 69.60/89.24³ 13.50 71.30/90.10⁴


1/32 0.21 99.34 0.67 90.39 6.21 55.06/77.78 1.45 65.58/86.24 0.67 53.78/77.07
2/32 0.41 99.53 1.31 92.94 12.27 59.31/81.01 2.68 68.23/88.10 1.09 64.67/85.24
3/32 0.60 99.48 1.94 93.02 18.33 60.36/82.40 4.24 68.79/88.45 1.50 69.13/88.35
4/32 0.80 99.47 2.58 93.27 24.39 61.21/82.94 5.63 68.96/88.52 1.92 71.89/90.38
5/32 1.00 99.47 3.22 93.37 30.45 61.65/83.19 7.02 69.41/88.77 2.33 71.47/90.15
6/32 1.20 99.49 3.86 93.51 36.51 62.01/83.32 8.42 69.81/88.97 2.74 72.23/90.61
7/32 1.40 99.48 4.49 93.52 42.57 62.09/83.44 9.81 70.17/89.09 3.16 72.33/90.62
8/32 1.60 99.48 5.13 93.50 48.63 62.22/83.54 11.20 70.36/89.20 3.57 72.24/90.66


2/8 0.41 99.43 1.31 92.46 12.27 58.04/80.21 2.68 67.91/88.30 1.09 63.34/84.42
4/8 0.80 99.53 2.58 93.37 24.39 61.22/83.24 5.63 68.41/88.76 1.92 71.40/90.41
8/8 1.60 99.44 5.13 93.55 48.63 61.60/83.66 11.20 69.86/88.90 3.57 72.11/90.68


¹ W/A denotes the quantization bitwidths of the weights and activations, respectively. ² Results of AlexNet with Batch Normalization layers are cited from [33]. ³ Results of ResNet-18 are cited from [13]. ⁴ Results are cited from the Keras documentation [6].

TABLE V: The accuracy and model size with different bitwidth targets.

For the MNIST dataset, LeNet5 with the architecture 32C5-BN-MP2-64C5-BN-MP2-512FC-10Softmax is used, where C stands for a convolutional layer (the number in front denotes the number of output feature channels and the number behind is the kernel size); BN stands for a Batch Normalization layer; FC represents a fully-connected layer with the number of output channels listed in front of it; MP indicates a max-pooling layer followed by the size of the pooling kernel. The mini-batch size is 200 samples, and the initial learning rate is 0.01, divided by 10 at epochs 35 and 50 for a total of 55 training epochs.

For the CIFAR10 dataset, a VGG-like network [34] with the architectural configuration 64C3-BN-64C3-BN-MP2-128C3-BN-128C3-BN-MP2-256C3-BN-256C3-BN-MP2-1024FC-10Softmax is selected. A simple data augmentation, which pads 4 pixels on each side and randomly crops 32×32 patches from the padded image or its horizontal flip, is adopted during training. Only the original 32×32 images are evaluated in the test phase. The network is trained with a mini-batch size of 128 for a total of 300 epochs. The initial learning rate is 0.01 and decays by 10× at epochs 250 and 290.

For the ImageNet dataset, we select 3 well-known DNN models: AlexNet [25], ResNet-18 [17] and MobileNetV2 [32]. ImageNet contains 1000 categories and its images are relatively large [8, 25, 34, 17]. We replace the original Local Response Normalization (LRN) layers in AlexNet with Batch Normalization (BN) layers for a fair comparison with [45, 26, 38]. The numbers of the different layers and the parameter sizes are listed in Table IV.

The architectures of the models for the IMDB movie review [30] sentiment classification and the THUCNews [35] text classification are 128E-64C5-MP4-70LSTM-1Sigmoid and 128E-128LSTM-128FC-10Softmax, respectively, where E denotes an Embedding layer with its dimension listed in front, and LSTM is an LSTM layer with the number of hidden units listed in front. We quantize all layers of these two models, including the Embedding, convolutional, fully-connected and LSTM layers. Specifically, for the LSTM layer, we quantize the input features, outputs and weights, but leave the intermediate states and activations of each timestep untouched, since quantizing them does not help reduce the model size.

Evaluation Metrics The final classification accuracy results on the corresponding datasets are taken as the evaluation metrics in image classification tasks as used in many other works [27, 26, 38, 14]. Moreover, the Top1 and Top5 classification accuracy results are presented simultaneously on all the models for ImageNet dataset for comprehensive evaluation as used in [34, 17].

6.1.2 Bitwidth flexibility


W/A     IMDB                  THUCNews
        Size (M)   Acc.       Size (M)   Acc.


32/32   10.07      84.98      3.01       94.74


2/32    0.63       85.54      0.19       93.99
4/32    1.26       86.24      0.38       94.47
8/32    2.52       85.53      0.75       94.53
2/8     0.63       85.40      0.19       94.00
4/8     1.26       84.67      0.38       94.09
8/8     2.52       85.72      0.75       94.43


TABLE VI: Evaluation results for the LSTM-based models.

VecQ is designed to support a wide range of targeted bitwidths. We conduct a series of experiments to verify the impact of the bitwidth on model accuracy and model size reduction. The bitwidth in the following evaluations ranges from 1 to 8.

The accuracy results and model sizes for the image classification models are shown in Table V. We first discuss weight quantization only. There is a relatively large accuracy degradation when the bitwidth is set to 1, but from 2 bits upward the accuracy recovers to within a marginal drop of the FP32 version, except for MobileNetV2. As the bitwidth increases, the accuracy of the quantized model improves. The highest accuracies of LeNet-5 and VGG-like are 99.53% and 93.52%, at 2 bits and 7 bits respectively, which outperform the FP32 results. The highest accuracies of AlexNet, ResNet-18 and MobileNetV2 are obtained at 8-bit with 62.22% (Top1), 8-bit with 70.36% (Top1) and 7-bit with 72.33% (Top1), respectively, all outperforming the FP32 values. Overall, with activations kept at 32 bits, the best accuracy improvements over the full-precision versions of the five models are 0.13%, 0.03%, 2.21%, 0.76% and 1.03%, at 2, 7, 8, 8 and 7 bits, respectively. The table also contains the accuracy of the models with 8-bit activation quantization. Although activation quantization leads to an accuracy degradation for most of the models (except VGG-like), the results are in general comparable with the FP32 models.

The accuracy and model size of the sentiment classification and text classification models are shown in Table VI. Our solution easily outperforms the models trained with FP32. Even with activation quantization, the accuracy is still well maintained. The results also indicate that appropriate quantization of the weight data improves the generalization ability of DNN models; in other words, proper quantization achieves higher classification accuracy in these tests.

6.1.3 Comparison with State-of-the-art results


Methods   Weights (Bits / SFB / SFN)   Activation (Bits / SFB / SFN)   FConv IFC LFC


ReBNet [11] 1 32 1 3 32 3 - Y N
BC [7] 1 - 0 32 - - - - -
BWN [31] 1 32 1 32 - - - - -
BPWN [27] 1 32 1 32 - - - N N
TWN [27] 2 32 1 32 - - - N N
TC [29] 2 - 0 32 - - - - -
TNN [1] 2 - 0 2 - 0 - - -
TTQ [45] 2 32 2 32 - - N Y N
INQ [42] 1-5 - 0 32 - - - - -
FP [14] 2,4,8 - 0 32 - - - N N
uL2Q [12] 1,2,4,8 32 1 32 - - Y Y 8
ENN [26] 1,2,3 32 1 32 - - - - -
TSQ [38] 2 32 c⁴ 2 32 2 N Y N
DC [15]² 2,3,4 32 4,8,16 32 - - 8 - 1
HAQ [37]³ flexible 32 1 32 - - 8 - 1
QAT [20] 8 32 1 8 32 1 N - N
TQT [21] 8 - 1 8 - 1 8 - 8


VecQ 1-8 32 1 32,8 -,32 -,1 Y Y Y


¹ Weights and Activation denote the quantized data of the model; Bits is the quantized bitwidth; SFB is the bitwidth of the scaling factor; SFN is the number of scaling factors. FConv, IFC and LFC indicate whether the first convolution layer, the internal fully-connected layers and the last fully-connected layer are quantized. ² The results of DC are from [37]. ³ HAQ is a mixed-precision method; the results here are from experiments that only quantize the weight data. ⁴ TSQ introduces a floating-point scaling factor for each convolutional kernel, so SFN equals the number of kernels.

TABLE VII: Detailed settings of the quantization methods collected from the literature.

We collected the state-of-the-art accuracy results of DNNs quantized to different bitwidths with different quantization methods, and compared them to the results from VecQ with the same bitwidth target. The detailed bitwidth support of the compared methods is listed in Table VII. Note that when activation quantization is not enabled, the SFB and SFN fields are not applicable to VecQ.

The comparisons are shown in Table VIII. The final accuracy with VecQ increases by an average of 0.38% and 3.30% for LeNet-5 and VGG-like, respectively, when compared with the other quantization methods. There is also an average improvement of 0.39%/0.01% (Top1/Top5) for AlexNet, 2.45%/1.39% for ResNet-18 and 2.57%/1.64% for MobileNetV2 over the state-of-the-art methods. For all 3 datasets and 5 models, the models quantized with VecQ achieve higher accuracy than almost all state-of-the-art methods at the same bitwidth. However, at a bitwidth target of 1, the quantized AlexNet and ResNet-18 models based on VecQ perform worse, because we quantize all the weights to 1 bit, including the first and last layers, unlike the counterparts that use a higher bitwidth for those layers. This also leads to more accuracy degradation at low bitwidth on the lightweight network MobileNetV2. It does, however, allow us to provide an even smaller model size. Besides, the AlexNet used by BWN [31] and ENN [26] has a larger parameter count [31], while ours has 50.88 M parameters (Table IV), because eliminating the paddings in the intermediate layers leads to fewer weights in the fully-connected layers. When compared to TWN, uL2Q and TSQ, VecQ achieves significantly higher accuracy at the same targeted bitwidth, which also indicates that our vectorized quantization is superior to the L2-loss-based solutions.


Bitwidth Methods Datasets&Models
MNIST CIFAR10 [24] ImageNet [8]
LeNet5 VGG-like [34] AlexNet [25] ResNet18 [17] MobileNetV2 [32]


32 FP32 99.4 93.49 60.01/81.90 69.60/89.24 71.30/90.10


1 ReBNet[11] - 86.98 - - -
BC [7] 98.71 88.44 - - -
BWN[31] - - 56.80/79.40 60.80/83.00 -
BPWN[27] 99.05 - - 57.50/81.20 -
ENN [26] - - 57.00/79.70 64.80/86.20 -
L2Q [12] 99.06 89.02 - 66.24/86.00 -
VecQ 99.34 90.39 55.06/77.78 65.58/86.24 53.78/77.07
Mean-Imp³ 0.40 2.24 -1.84/-1.77 3.24/2.14 -/-


2 FP [14] 98.90 - - - -
TWN [27] 99.35 - 57.50/79.80¹ 61.80/84.20 -
TC [29] - 89.07 - - -
TNN [1] - 87.89 - - -
TTQ [45] - - 57.50/79.70 66.60/87.20 -
INQ [42] - - - 66.02/87.12 -
ENN - - 58.20/80.60 67.00/87.50 -
TSQ [38] - - 58.00/80.50 - -
L2Q 99.12 89.5 - 65.60/86.12 -
DC⁴ - - - - 58.07/81.24
VecQ 99.53 92.94 59.31/81.01 68.23/88.10 64.67/85.24
Mean-Imp 0.41 4.12 1.51/0.86 2.83/1.67 6.60/4.00


3 INQ - - - 68.08/88.36 -
ENN² - - 60.00/82.20 68.00/88.30 -
DC - - - - 68.00/87.96
VecQ 99.48 93.02 60.36/82.40 68.79/88.45 69.13/88.35
Mean-Imp - - 0.36/0.20 0.75/0.12 1.13/0.39


4 FP 99.1 - - - -
INQ - - - 68.89/89.01 -
L2Q 99.12 89.8 - 65.92/86.72 -
DC - - - - 71.24/89.93
VecQ 99.47 93.27 61.21/82.94 68.96/88.52 71.89/90.38
Mean-Imp 0.36 3.47 -/- 1.55/0.65 0.65/0.45


5 INQ - - - 68.98/89.10 -
VecQ 99.47 93.37 61.65/83.19 69.15/88.57 71.47/90.15
Mean-Imp - - -/- 0.17/-0.53 -/-


6 HAQ - - - - 66.75/87.32
VecQ 99.49 93.51 62.01/83.32 69.81/88.87 72.23/90.61
Mean-Imp - - -/- -/- 5.48/3.29


7 VecQ 99.48 93.52 62.09/83.44 70.17/89.09 72.33/90.62


8 FP 99.1 - - - -
L2Q 99.16 89.7 - 65.52/86.36 -
QAT-c⁵ - - - - 71.10/-
TQT-wt-th - - - - 71.80/90.60
VecQ 99.48 93.5 62.22/83.54 70.36/89.20 72.24/90.66
Mean-Imp 0.35 3.80 -/- 4.84/2.84 0.79/0.06


¹ The results of TWN on AlexNet are from [26]. ² ENN adopts the 2-bit shifting results [26]. ³ Mean-Imp denotes the mean accuracy improvement compared to the state-of-the-art methods. ⁴ The results are from HAQ [37]. ⁵ The results are from [23].

TABLE VIII: The comparison with other state-of-the-art quantization solutions.

6.1.4 Analysis of λ values


Bitwidth k    2       3       4       5       6       7       8


AlexNet       0.9700  0.5700  0.3300  0.1900  0.1000  0.0600  0.0300
ResNet18      1.0000  0.6100  0.3600  0.2100  0.1200  0.0600  0.0300
MobileNetV2   1.0000  0.5900  0.3400  0.1900  0.1100  0.0600  0.0300


MeanError     0.0171  -0.0019 -0.0024 -0.0256 -0.0178 -0.0094 0.0023


TABLE IX: Actual optimal λ values of the selected layers with bitwidth k, and the mean error against the theoretical values.

In order to evaluate the accuracy of the theoretical λ values in Table I, we choose the last convolutional layers of different models and calculate their actual optimal values. Since λ is the quantization interval and the range (0, 3] covers more than 99.74% of the layer data, the actual λ is obtained by exhaustively searching values in (0, 3] with a precision of 0.001. The comparison is shown in Table IX. As Table IX shows, there are differences between the theoretical λ values and the actual ones. However, the final accuracy reported in the previous subsection is maintained and not affected by the small bias in λ, which demonstrates the robustness of our solution.

6.2 Salient object detection

Salient object detection aims at highlighting the region of the salient object in an image. It is an important evaluation because it provides directly visible results. The previous experiments show that 2 bits achieve a good trade-off between accuracy and bitwidth reduction, so in this section only 2-bit VecQ quantization of the weights is used for the DNN models.

6.2.1 Evaluation settings

Datasets All models are trained with the training split of the MSRA10K dataset (80% of the entire dataset) [39]. After training, the evaluation is conducted on multiple datasets, including MSRA10K (the remaining 20%), ECSSD [39], HKU-IS [39], DUTs [39], DUT-OMRON [39], and the images in THUR15K [5] that contain target objects and existing ground-truth maps. The details of the selected datasets are shown in Table X. All images are resized to the same resolution for training and test.


Datasets Images Contrast


ECSSD 1000 High
HKU-IS 4000 Low
DUTs 15572 Low
DUT-OMRON 5168 Low
MSRA10K 10000 (80/20%) High
THUR15K 6233 Low


TABLE X: The datasets for salient object detection.

Models The well-known end-to-end semantic segmentation models U-Net [39], FPN [28], LinkNet [2] and UNet++ [44] are selected for a comprehensive comparison. Their detailed information is shown in Table XI. All models use the same ResNet-50 backbone as the encoder, initialized with weights trained on the ImageNet dataset.


Models U-Net FPN LinkNet UNet++


Backbone ResNet-50 ResNet-50 ResNet-50 ResNet-50
Convs 64 67 69 76
Parameters (M) 36.54 28.67 28.78 37.7
Model size (M) 139.37 109.38 109.8 143.81
Q. size (M)¹ 9.05 7.20 7.24 9.35


¹ Q. size is the model size after the 2-bit VecQ weight quantization used in this section.

TABLE XI: The models for salient object detection.
Fig. 8: The detection results comparison. Models marked with * are trained with 2-bit VecQ quantization.




Model  Size (M)  |  MAE  MaxF  S  E  (repeated for each of the six evaluation datasets)


Unet 139.37 0.030 0.945 0.931 0.962 0.057 0.909 0.886 0.914 0.045 0.907 0.884 0.930 0.060 0.896 0.865 0.874 0.070 0.804 0.803 0.829 0.077 0.769 0.807 0.816
Unet* 9.05 0.032 0.940 0.926 0.959 0.064 0.896 0.871 0.906 0.050 0.893 0.870 0.923 0.065 0.885 0.852 0.871 0.071 0.793 0.795 0.835 0.081 0.749 0.797 0.815


Bias 93.51% -0.002 0.004 0.005 0.003 -0.007 0.013 0.015 0.008 -0.005 0.014 0.015 0.007 -0.005 0.011 0.012 0.003 -0.001 0.011 0.008 -0.006 -0.005 0.020 0.010 0.001


FPN 109.38 0.043 0.920 0.899 0.949 0.070 0.882 0.854 0.901 0.059 0.875 0.848 0.911 0.072 0.875 0.835 0.861 0.081 0.777 0.772 0.812 0.087 0.750 0.778 0.804
FPN* 7.2 0.038 0.935 0.920 0.955 0.070 0.889 0.866 0.897 0.059 0.879 0.859 0.908 0.070 0.878 0.850 0.859 0.080 0.772 0.786 0.809 0.089 0.739 0.789 0.796


Bias 93.42% 0.005 -0.015 -0.022 -0.006 0.000 -0.007 -0.012 0.004 0.001 -0.005 -0.011 0.004 0.002 -0.003 -0.015 0.003 0.001 0.004 -0.014 0.003 -0.002 0.011 -0.011 0.007


Linknet 109.8 0.032 0.942 0.928 0.959 0.060 0.905 0.882 0.911 0.048 0.900 0.878 0.927 0.062 0.892 0.861 0.871 0.071 0.801 0.799 0.825 0.079 0.760 0.803 0.814
Linknet* 7.24 0.034 0.939 0.923 0.959 0.068 0.891 0.865 0.902 0.054 0.887 0.860 0.920 0.068 0.883 0.847 0.870 0.072 0.788 0.787 0.833 0.082 0.746 0.794 0.818


Bias 93.40% -0.002 0.003 0.004 0.001 -0.008 0.014 0.017 0.008 -0.005 0.013 0.018 0.006 -0.005 0.010 0.014 0.001 -0.001 0.014 0.012 -0.008 -0.002 0.014 0.009 -0.004


UNet++ 143.81 0.029 0.948 0.933 0.964 0.056 0.910 0.888 0.915 0.044 0.909 0.887 0.930 0.059 0.897 0.867 0.876 0.070 0.805 0.805 0.829 0.076 0.769 0.810 0.818
UNet++* 9.35 0.033 0.939 0.926 0.958 0.065 0.895 0.872 0.905 0.052 0.890 0.868 0.919 0.066 0.884 0.854 0.867 0.075 0.785 0.792 0.822 0.082 0.750 0.797 0.811


Bias 93.50% -0.004 0.008 0.007 0.007 -0.009 0.015 0.016 0.010 -0.008 0.019 0.018 0.011 -0.007 0.013 0.013 0.009 -0.005 0.020 0.013 0.007 -0.006 0.019 0.013 0.007


TABLE XII: The salient object detection results.

Evaluation Metrics We choose widely used metrics for a comprehensive evaluation: (1) Mean Absolute Error (MAE) [39], (2) Maximal F-measure (MaxF) [39], (3) Structural measure (S-measure) [39], and (4) Enhanced-alignment measure (E-measure) [10].

The MAE measures the average pixel-wise absolute error between the output saliency map S and the ground-truth mask G:

MAE = (1/N) Σₙ (1/(H×W)) Σᵢ Σⱼ |Sₙ(i, j) − Gₙ(i, j)|

Here N is the number of samples, and H and W are the height and width of the maps.

The F-measure comprehensively evaluates the precision and recall with a weight parameter β:

F_β = (1 + β²) · Precision · Recall / (β² · Precision + Recall)

β² is empirically set to 0.3 [39]. The MaxF is the maximal F_β value over all binarization thresholds.
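The MAE and MaxF computations described above can be sketched for a single map as follows; the threshold-sweep granularity is an implementation choice, and β² = 0.3 follows the cited convention:

```python
import numpy as np

def mae(pred, gt):
    """Average pixel-wise absolute error between map and ground truth."""
    return float(np.mean(np.abs(pred - gt)))

def max_f_measure(pred, gt, beta2=0.3, n_thresholds=255):
    """Sweep binarization thresholds over [0, 1) and keep the best F value."""
    best = 0.0
    positives = gt > 0.5
    for t in np.linspace(0.0, 1.0, n_thresholds, endpoint=False):
        b = pred > t
        tp = np.logical_and(b, positives).sum()
        if tp == 0:
            continue
        precision = tp / b.sum()
        recall = tp / positives.sum()
        f = (1 + beta2) * precision * recall / (beta2 * precision + recall)
        best = max(best, float(f))
    return best
```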

S-measure considers the object-aware and region-aware structure similarities.

The E-measure is designed for binary map evaluation. It combines the local pixel-wise difference and the global mean value of the map for a comprehensive evaluation. In our evaluation, the output map is first binarized by comparing it against a threshold of twice its mean value [39]; the binary map is then evaluated with the E-measure.

We also provide a direct visual comparison of the full-precision and quantized weights of a selected model for a more intuitive comparison.

6.2.2 Results and analysis

Fig. 9: The comparison of weights of convolutional layers in Unet++ and Unet++*.

The sampled detection results are shown in Fig. 8 and the quantitative comparison results are in Table XII. The * indicates the model quantized with VecQ, listed beside its full-precision counterpart. Note that the last layer of FPN and FPN* outputs a smaller map, which is then resized to the evaluation resolution. Overall, the performance degradation of the quantized models based on VecQ is less than 0.01 in most metrics for all models, while the size of the quantized models is reduced by more than 93% compared to the FP32 models.

As shown in Table XII, all the quantized models show less than 0.01 degradation on MSRA10K across all evaluation metrics. This is because all the models are trained on the training subset of MSRA10K, so the features of its images are well extracted. The other 5 datasets are only used for testing, and there is more degradation on their evaluation metrics, especially MaxF and S-measure, but the degradation stays within 0.02.

Comparing the quantized models with their corresponding full-precision models, FPN* performs well on almost all the test sets with all the evaluation metrics (shown as bold italic numbers in Table XII), showing an even better feature extraction capability than the full-precision version (FPN). Compared to the other models, FPN outputs a relatively smaller prediction map. Although the backbone models are similar, the feature classification in FPN works on a smaller feature map with a similar number of coefficients. This leaves a good margin for quantization without affecting the detection ability, thanks to the potential redundancy of the coefficients.

To further demonstrate the effectiveness of our proposed quantization solution, we visualize the weights of the convolution layers of the backbone in UNet++ and UNet++*, as shown in Fig. 9. Three kernel sizes are involved in this model: 7×7, 3×3 and 1×1. In addition, we also compare the full-precision and quantized weights of the last convolution layer of the selected model, which directly outputs the detection results.

In the first 7×7 kernels, we notice a significant structural similarity between the full-precision weights and the quantized weights. Since the very first layer of the backbone extracts the overall features, the quantized weights preserve a good feature extraction ability. As the kernels become smaller, we can still observe a good similarity between the full-precision and quantized weights: although the similarity is less pronounced in the Conv 3×3 sets, it becomes obvious in the following Conv 1×1 sets. The Last Conv group directly explains the comparable output results, with the emphasized kernels and weight locations clearly visible.

Overall, the quantized weights show a good similarity to the full precision ones in terms of the value and the location which ensures the high accuracy output when compared to the original full precision model.

7 Conclusion

In this paper, we propose a new quantization solution called VecQ. Different from existing works, it utilizes the vector loss instead of the L2 loss to measure the quantization loss. VecQ quantizes the full-precision weight vector to a specific bitwidth with the least DQL and hence provides better final model accuracy. We further introduce a fast quantization algorithm based on reasonable prior knowledge of normally distributed weights, which reduces the complexity of the quantization process during model training. The integration of VecQ into Keras [6] is also presented and used for our evaluations. Comprehensive evaluations show the effectiveness of VecQ on multiple datasets, models and tasks. The quantized low-bit models based on VecQ show comparable classification accuracy to FP32 models and outperform all the state-of-the-art quantization methods when the targeted weight bitwidth is higher than 2. Moreover, the experiments on salient object detection show that VecQ can greatly reduce the model size while maintaining the feature extraction performance.

For future work, we will focus on combining non-linear quantization with VecQ and on automated mixed-precision quantization to achieve further performance improvement. The source code of the Keras build with VecQ could be found at

8 Acknowledgment

This work is partially supported by the National Natural Science Foundation (61872200), the National Key Research and Development Program of China (2018YFB2100304, 2018YFB1003405), the Natural Science Foundation of Tianjin (19JCZDJC31600, 19JCQNJC00600), the Open Project Fund of State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences (CARCH201905). It is also partially supported by the National Research Foundation, Prime Minister’s Office, Singapore under its Campus for Research Excellence and Technological Enterprise (CREATE) programme, and the IBM-Illinois Center for Cognitive Computing System Research (C3SR) - a research collaboration as part of IBM AI Horizons Network.


  • [1] H. Alemdar, V. Leroy, A. Prost-Boucle, and F. Pétrot (2017) Ternary neural networks for resource-efficient AI applications. In IJCNN, pp. 2547–2554. Cited by: §2.3, §2, TABLE VII, TABLE VIII.
  • [2] A. Chaurasia and E. Culurciello (2017) Linknet: exploiting encoder representations for efficient semantic segmentation. In VCIP, pp. 1–4. Cited by: §6.2.1.
  • [3] Y. Chen, J. He, X. Zhang, C. Hao, and D. Chen (2019) Cloud-DNN: an open framework for mapping dnn models to cloud FPGAs. In FPGA, pp. 73–82. Cited by: §1.
  • [4] Y. Chen, K. Zhang, C. Gong, C. Hao, X. Zhang, T. Li, and D. Chen (2019) T-DLA: an open-source deep learning accelerator for ternarized DNN models on embedded FPGA. In ISVLSI. Cited by: §1, §2.3, §2.4, §2.
  • [5] M. Cheng, NiloyJ. Mitra, X. Huang, and S. Hu (2014) SalientShape: group saliency in image collections. The Visual Computer 30 (4), pp. 443–453. External Links: ISSN 0178-2789, Document Cited by: §6.2.1.
  • [6] F. Chollet et al. (2015) Keras. GitHub. Cited by: TABLE V, §6, §7.
  • [7] M. Courbariaux, Y. Bengio, and J. David (2015) Binaryconnect: training deep neural networks with binary weights during propagations. In NIPS, pp. 3123–3131. Cited by: §1, §2.1, §2, TABLE VII.
  • [8] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: A large-scale hierarchical image database. In IEEE CVPR, pp. 248–255. Cited by: §6.1.1, §6.1.1, TABLE VIII.
  • [9] L. Deng, P. Jiao, J. Pei, Z. Wu, and G. Li (2018) GXNOR-net: training deep neural networks with ternary weights and activations without full-precision memory under a unified discretization framework. Neural Networks 100, pp. 49–58. Cited by: §2.3, §2.
  • [10] D. Fan, C. Gong, Y. Cao, B. Ren, M. Cheng, and A. Borji (2018) Enhanced-alignment measure for binary foreground map evaluation. In IJCAI, pp. 698–704. Cited by: §6.2.1.
  • [11] M. Ghasemzadeh, M. Samragh, and F. Koushanfar (2018) ReBNet: residual binarized neural network. In IEEE FCCM, pp. 57–64. Cited by: §2.2, §2, TABLE VII, TABLE VIII.
  • [12] C. Gong, Y. Lu, T. Li, X. Zhang, C. Hao, D. Chen, and Y. Chen (2019) uL2Q: an ultra-low loss quantization method for DNN. IJCNN. Cited by: §1, §1, §2.2, §2.4, §2.4, §2, §5.2.1, TABLE VII, TABLE VIII.
  • [13] S. Gross and M. Wilber (2016) Training and investigating residual nets. Facebook AI Research. Cited by: TABLE V.
  • [14] P. Gysel, M. Motamedi, and S. Ghiasi (2016) Hardware-oriented approximation of convolutional neural networks. arXiv. Cited by: §1, §2.1, §2.3, §2, §6.1.1, TABLE VII, TABLE VIII.
  • [15] S. Han, H. Mao, and W. J. Dally (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv. Cited by: §1, §2.2, §2.4, §2, TABLE VII.
  • [16] C. Hao, X. Zhang, Y. Li, S. Huang, J. Xiong, K. Rupnow, W. Hwu, and D. Chen (2019) FPGA/DNN co-design: an efficient design methodology for iot intelligence on the edge. In DAC, Cited by: §1, §1.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE CVPR, Cited by: §1, §6.1.1, §6.1.1, TABLE V, TABLE VIII.
  • [18] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio (2016) Binarized neural networks. In NIPS, pp. 4107–4115. Cited by: §1, §1, §2.1, §2.2, §2.
  • [19] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, pp. 448–456. Cited by: §2.1, §5.3, TABLE IV.
  • [20] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In IEEE CVPR, pp. 2704–2713. Cited by: §2.1, §2, TABLE VII.
  • [21] S. R. Jain, A. Gural, M. Wu, and C. Dick (2019) Trained uniform quantization for accurate and efficient neural network inference on fixed-point hardware. arXiv. Cited by: §2.1, §2, TABLE VII.
  • [22] C. Jin, H. Sun, and S. Kimura (2018) Sparse ternary connect: convolutional neural networks using ternarized weights with enhanced sparsity. In ASP-DAC, pp. 190–195. Cited by: §1, §2.1, §2.
  • [23] R. Krishnamoorthi (2018) Quantizing deep convolutional networks for efficient inference: a whitepaper. arXiv. Cited by: TABLE VIII.
  • [24] A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §6.1.1, TABLE VIII.
  • [25] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In NIPS, pp. 1097–1105. Cited by: §6.1.1, TABLE V, TABLE VIII.
  • [26] C. Leng, Z. Dou, H. Li, S. Zhu, and R. Jin (2018) Extremely low bit neural network: squeeze the last bit out with ADMM. In AAAI, Cited by: §1, §1, §2.2, §2, §3, §3, §6.1.1, §6.1.1, §6.1.3, TABLE VII, TABLE VIII.
  • [27] F. Li, B. Zhang, and B. Liu (2016) Ternary weight networks. arXiv. Cited by: §1, §1, §2.2, §2.4, §2, §6.1.1, TABLE VII, TABLE VIII.
  • [28] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In IEEE CVPR, pp. 2117–2125. Cited by: §6.2.1.
  • [29] Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio (2015) Neural networks with few multiplications. arXiv. Cited by: §1, §2.1, §2, TABLE VII, TABLE VIII.
  • [30] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts (2011) Learning word vectors for sentiment analysis. In ACL, pp. 142–150. Cited by: §6.1.1, §6.1.1.
  • [31] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi (2016) XNOR-Net: ImageNet classification using binary convolutional neural networks. In ECCV, pp. 525–542. Cited by: §2.2, §2, §6.1.3, TABLE VII, TABLE VIII.
  • [32] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) MobileNetV2: inverted residuals and linear bottlenecks. In IEEE CVPR, pp. 4510–4520. Cited by: §6.1.1, TABLE IV, TABLE V, TABLE VIII.
  • [33] M. Simon, E. Rodner, and J. Denzler (2016) ImageNet pre-trained models with batch normalization. arXiv. Cited by: TABLE V.
  • [34] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv. Cited by: §1, §6.1.1, §6.1.1, §6.1.1, TABLE V, TABLE VIII.
  • [35] M. Sun, J. Li, Z. Guo, Y. Zhao, Y. Zheng, X. Si, and Z. Liu (2016) THUCTC: an efficient Chinese text classifier. GitHub Repository. Cited by: §6.1.1, §6.1.1.
  • [36] J. Wang, Q. Lou, X. Zhang, C. Zhu, Y. Lin, and D. Chen (2018) Design flow of accelerating hybrid extremely low bit-width neural network in embedded FPGA. In FPL, pp. 163–1636. Cited by: §1.
  • [37] K. Wang, Z. Liu, Y. Lin, J. Lin, and S. Han (2019) HAQ: hardware-aware automated quantization with mixed precision. In IEEE CVPR, pp. 8612–8620. Cited by: §2.3, §2, TABLE VII, TABLE VIII.
  • [38] P. Wang, Q. Hu, Y. Zhang, C. Zhang, Y. Liu, and J. Cheng (2018) Two-step quantization for low-bit neural networks. In IEEE CVPR, pp. 4376–4384. Cited by: §1, §1, §2.2, §2.4, §2, §5.2.1, §6.1.1, §6.1.1, TABLE VII, TABLE VIII.
  • [39] W. Wang, Q. Lai, H. Fu, J. Shen, and H. Ling (2019) Salient object detection in the deep learning era: an in-depth survey. arXiv. Cited by: §6.2.1, §6.2.1, §6.2.1, §6.2.1, §6.2.1.
  • [40] M. Yu, Z. Lin, K. Narra, S. Li, Y. Li, N. S. Kim, A. Schwing, M. Annavaram, and S. Avestimehr (2018) GradiVeQ: vector quantization for bandwidth-efficient gradient aggregation in distributed CNN training. In NIPS, Cited by: §2.2, §2.4, §2.
  • [41] X. Zhang, J. Wang, C. Zhu, Y. Lin, J. Xiong, W. Hwu, and D. Chen (2018) DNNBuilder: an automated tool for building high-performance DNN hardware accelerators for FPGAs. In ICCAD, Cited by: §1.
  • [42] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen (2017) Incremental network quantization: towards lossless CNNs with low-precision weights. arXiv. Cited by: §2.1, §2, TABLE VII, TABLE VIII.
  • [43] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou (2016) DoReFa-Net: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv. Cited by: §1, §2.1, §2.3, §2, §5.3.
  • [44] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang (2018) UNet++: a nested U-Net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pp. 3–11. Cited by: §6.2.1.
  • [45] C. Zhu, S. Han, H. Mao, and W. J. Dally (2016) Trained ternary quantization. arXiv. Cited by: §1, §2.1, §2, §6.1.1, TABLE VII, TABLE VIII.