1 Introduction
Deep Neural Networks (DNNs) have been widely adopted in machine learning based applications
[34, 17]. However, besides DNN training, DNN inference is also a computation-intensive task which affects the effectiveness of DNN based solutions [3, 16, 41]. Neural network quantization employs low-precision and low-bitwidth data instead of high-precision data for the model execution. Compared to DNNs with 32-bit floating point data (FP32), a quantized model can achieve up to a 32× compression rate with extremely low-bitwidth quantization [18]. The low-bitwidth processing, which reduces the cost of inference by using less memory and reducing the complexity of the multiply-accumulate operation, significantly improves the efficiency of model execution [41, 36]. However, lowering the bitwidth of the data often brings accuracy degradation [16, 4, 14]. This requires a quantization solution to balance computing efficiency against final model accuracy. Unfortunately, this quantitative trade-off is non-convex and hard to optimize: the impact of quantization on the final accuracy of a DNN model is hard to formulate.
Previous methods neglect the quantitative analysis of the Direct Quantization Loss (DQL) of the weight data and make the quantization decision empirically, evaluating only the final model accuracy [18, 7, 29, 43, 22], thus achieving unpredictable accuracy.
In order to achieve higher training accuracy, finding an optimal quantization solution with minimal loss during the training of the learning kernels is effective and practical. One way of finding a locally optimal solution is to minimize the DQL of the weight data, which is widely used in current quantization solutions [15, 27, 26, 38, 12].
Fig. 1 illustrates a full-precision weight and its quantized counterpart. Conventional quantization methods regard the full-precision weight as a point (set as the origin in Fig. 1) in Euclidean space, and the quantized weight as a nearby point in a discrete data space. The discrete data space contains the set of data points that can be represented by the selected bitwidth. Therefore, the square of the Euclidean distance (squared 2-norm, also called the L2 distance [45]) between the original weight data and the quantized data is simply used as the loss of the quantization process, which is to be reduced [27, 26, 38, 12].
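To make the L2-based loss concrete, the following sketch (ours, not from the cited works) computes the squared Euclidean distance between original weights and their nearest scaled quantization levels; the ternary level set and the input values are illustrative assumptions.

```python
import numpy as np

def l2_dql(w, alpha, levels):
    """Direct Quantization Loss under the L2 metric: map every weight to the
    nearest scaled level alpha * q and sum the squared differences."""
    w = np.asarray(w, dtype=float)
    levels = np.asarray(levels, dtype=float)
    # index of the nearest scaled level for each weight
    idx = np.argmin(np.abs(w[:, None] - alpha * levels[None, :]), axis=1)
    wq = alpha * levels[idx]
    return float(np.sum((w - wq) ** 2)), wq

# example: ternary quantization set {-1, 0, 1} with scaling factor 1.0
loss, wq = l2_dql([0.8, -0.3, 1.1, -0.9], alpha=1.0, levels=[-1, 0, 1])
```

Reducing this quantity is exactly the objective the L2-based methods discussed below try to minimize.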
Although L2-based solutions are proven to be effective and provide good training results in terms of model accuracy and weight bitwidth, such solutions still have two major issues: (1) Solving the L2-based quantization always leads to inaccurate results because of the approximation process involved. As shown in Fig. 1, even with an additional quantization scaling factor, which helps to reduce the difference between the original and quantized data, we still cannot avoid the accuracy loss during the quantization process. The quantized results always fall into the sub-optimal space (SubQ) instead of reaching the optimal value (OptQ). (2) L2-based quantization focuses on each individual weight value and neglects the correlation among the data points in a kernel or a layer.
To address the issues above, instead of using the traditional Euclidean distance, we propose a more accurate quantization loss evaluation metric; we also propose an algorithm to guide the quantization of the weight data quantitatively. We construct the weights into a vector rather than treating them as scalar data, to take advantage of the fact that the loss between vectors can be split into an orientation loss and a modulus loss, which are independent of each other. As a result, for the first time, we are able to achieve the minimal loss of weight quantization for DNN training. In this paper, after reviewing related works, we prove that using vector loss as the optimization target is better than directly optimizing the L2 distance of the weights before and after quantization. Based on our proposed vectorized quantization loss measurement, we further propose a Vectorized Quantization method (VecQ) to better explore the trade-off between computing efficiency and the accuracy loss of quantization.
In summary, our contributions are as follows:

We propose a new metric, vector loss, as the loss function for DNN weight quantization, which can provide an optimal quantization solution.

A new quantization training flow based on the vectorized quantization process, named VecQ, is proposed, which achieves better model accuracy for different bitwidth quantization targets.

A parametric estimation method and a computing template are proposed to reduce the cost of the probability density estimation and derivative calculation in VecQ, speeding up the quantization process in model training.

Extensive experiments show that VecQ achieves lower accuracy degradation under the same training settings compared to state-of-the-art quantization methods on the image classification task with the same DNN models. Evaluations on the Salient Object Detection (SOD) task also show that VecQ maintains comparable feature extraction quality with up to 16× weight size reduction.
This paper is structured as follows. Section 2 introduces the related works. In Section 3, the theoretical analysis of the effectiveness of vector loss compared to L2 loss is presented. Section 4 presents the detailed approach of VecQ. Section 5 proposes the fast solution for our VecQ quantization as well as the integration of VecQ into the DNN training flow. Section 6 presents the experimental evaluations and Section 7 concludes the paper.
2 Related works and Motivation
As an effective way to compress DNNs, many quantization methods have been explored [14, 18, 31, 7, 27, 45, 1, 29, 12, 37, 42, 22, 11, 38, 26, 9, 4, 20, 21, 15, 40, 43]. These quantization methods can be roughly categorized into 3 different types based on their objective functions for the quantization process:

Methods based on heuristic guidance of the quantization, e.g., directly minimizing the final accuracy loss;

Methods based on minimizing Euclidean Distance of weight data before and after quantization;

Other methods such as training with discrete weights and teacherstudent network.
In this section, we first introduce the existing related works based on their different categories and then present our motivation for vectorized quantization.
2.1 Heuristic guidance
The heuristic methods usually directly evaluate the impact of the quantization on the final output accuracy. They often empirically iterate the training process to improve the final accuracy. For example, BNNs [18] proposed a binary network for fast network inference. It quantizes all the weights and activations in a network to two values, ±1, based on the sign of the data. Although it provides a DNN with 1-bit weights and activations, it is hard to converge without Batch Normalization layers [19] and leads to a significant accuracy degradation when compared to full-precision networks. Binary Connect [7] and Ternary Connect [29] sample the original weights into binary or ternary values according to a sampling probability defined by the value of the weights (after scaling to [0, 1]). None of these works quantifies the loss during the quantization, so only the final accuracy guides the quantization. The quantization methods in [43, 14] convert the full-precision weights to fixed-point representation by dropping the least significant bits without quantifying the impact.
INQ [42] iteratively applies weight partition, quantization and retraining until all the weights are quantized into powers-of-two or zeros.
STC [22] introduces a ternary quantization which first scales the weights into the range [−1, 1], and then quantizes all scaled weights into ternary values by uniformly partitioning this range: the values located in the outer two intervals are quantized to −1 and 1, and the rest are set to 0.
TTQ [45] introduces a ternary quantization which quantizes full-precision weights to ternary values with a heuristic threshold but with two different scaling factors for positive and negative values, respectively. The scaling factors are optimized during back propagation.
The quantization method in [20] (denoted as QAT) employs an affine mapping between integers and real values with two constant parameters: the scale and the zero-point. Data (weights/activations) are divided by the scale factor, rounded, and offset by the zero-point to obtain the quantized integer results. The approach of TQT [21] follows QAT but improves it by constraining the scale factors to powers of two and relating them to trainable thresholds.
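A minimal sketch of this style of affine mapping (our own illustration of the common scale/zero-point scheme, not the authors' code; the example values are arbitrary):

```python
import numpy as np

def affine_quantize(x, scale, zero_point, num_bits=8):
    """Affine mapping of real values to integers, in the spirit of QAT:
    q = clip(round(x / scale) + zero_point, 0, 2^b - 1).
    scale and zero_point are the two constant parameters of the mapping."""
    qmax = 2 ** num_bits - 1
    q = np.round(np.asarray(x, dtype=float) / scale) + zero_point
    return np.clip(q, 0, qmax).astype(np.int64)

def affine_dequantize(q, scale, zero_point):
    """Inverse mapping back to real values: x ~= scale * (q - zero_point)."""
    return scale * (np.asarray(q, dtype=float) - zero_point)

q = affine_quantize([0.5, -0.23, 0.0], scale=0.1, zero_point=128)
x = affine_dequantize(q, scale=0.1, zero_point=128)  # close to the inputs, up to rounding error
```

Note that the round-trip only recovers the input up to the rounding granularity `scale`, which is exactly the per-value error the L2-based methods in the next subsection try to control.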
2.2 Optimizing Euclidean Distance
In order to provide better accuracy control, reducing the Euclidean Distance of the data before and after quantization becomes a popular solution.
XnorNet [31] adds a scaling factor on top of BNNs [18] and calculates the optimal scaling factor that minimizes the distance between the weights before and after quantization. The scaling factor boosts the convergence of the model and improves the final accuracy. The follow-up residual quantization method in [11] adopts XnorNet [31] to further compensate the errors produced by a single binary quantization, improving the accuracy of the quantized model.
TWNs [27] proposes an additional threshold factor together with the scaling factor for ternary quantization. The optimal parameters (scaling factor and threshold factor) are still based on the optimization of the Euclidean distance of weights before and after quantization. TWNs achieves better final accuracy than XnorNet and BNNs.
The extremely low bit method (ENN) proposed in [26] quantizes the weights into powers of two by iteratively optimizing the L2 distance of the weights before and after quantization.
TSQ [38] presents a two-step quantization method, which first quantizes the activations to low-bit values, then fixes them and quantizes the weights into ternary values. TSQ employs a scaling factor for each kernel, resulting in limited model size reduction.
L2Q [12] first shifts the weights of a layer towards a standard normal distribution with a shifting parameter and then employs a linear quantization for the data. The shifting parameter takes the distribution of the weight data into account, which provides better loss control than simply optimizing the Euclidean distance during the quantization.
2.3 Other works
Besides the heuristic and Euclidean Distance approaches, there are still many other works focusing on lowprecision DNN training.
GXNORNet [9] utilizes discrete weights during training instead of full-precision weights. It regards the discrete values as states and projects the gradients in backward propagation as transition probabilities to update the weights directly, hence providing a network with ternary weights.
TDLA [4] quantizes the scaling factor of the ternary weights and the full-precision activations into fixed-point numbers and constrains the quantization loss of the activation values by adopting the infinity norm. Compared with [14, 43], it shifts the available bitwidth to the most effective data portion to make full use of the targeted bitwidth.
In TNN [1], the authors design a ternary student network with the same architecture as the full-precision teacher network, aiming to predict the output of the teacher network without training on the original datasets.
In HAQ [37], the authors propose a range parameter: all weights out of the range are truncated and the weights within the range are linearly mapped to discrete values. The optimal range parameter is obtained by minimizing the KL-divergence of the weights during the quantization.
However, compared to the heuristic guidance and Euclidean distance based methods, the approaches above either focus on a specific bitwidth or perform worse in terms of the accuracy of the trained DNNs.
2.4 Motivation of the VecQ Method
We have witnessed the effectiveness of the L2 distancebased methods among all the existing approaches. However, as explained in the introduction, there are still two defects that lead to inaccurate DQL measurement.
The first defect is that L2 distance based optimization usually cannot be solved accurately [27, 12, 38, 15, 40], even with an additional scaling factor that scales the data into a proper range. As shown in Fig. 1, the quantization function with the additional scaling factor [12, 27] produces the L2 distance curve drawn in blue. The curve has a theoretically optimal solution, shown as the green dot; however, only solutions within a sub-optimal range can be obtained due to the lack of solvable closed-form expressions [27, 12], leading to inaccurate quantization results. Additionally, even the methods that involve k-means clustering of the weights still fall into the sub-optimal solution space [15, 40]. Their corresponding approximated optimal quantized weights are located in the SubQ space colored orange.
The second defect is that L2 based quantization neglects the correlation of the weights within the same kernel or layer and only focuses on the differences between individual values. Even with the k-means based solutions, the distribution of the weights in the same layer is ignored in the quantization process. However, considering the distribution of the weight data is proven to be effective for accuracy control in existing approaches [12, 4].
We discover that representing the quantization loss of the weights of a kernel or a layer with a vector distance instead of the L2 distance intrinsically solves the two problems mentioned above. The two attributes of a vector, orientation and modulus, uniquely determine it, and the distance between vectors is naturally determined by these two attributes. Therefore, we define the DQL with a vector loss that involves the distance of both the orientation and the modulus of the vectors. To the best of our knowledge, no previous work leverages a vector loss for the weight quantization of DNNs.
In this work, we prove that the vector loss can provide an optimal quantization solution and hence achieve a smaller DQL for the weight data during model training, which leads to higher model accuracy. Based on this, we propose VecQ, which minimizes the quantization loss based on the vector loss. We also propose a fast parameter estimation method and a computing template to speed up our vectorized quantization process for easier deployment of our solution.
3 Vectorized distance vs. L2 distance
Before introducing our vectorized quantization method, we first explain the effectiveness of loss control with vector loss using two data points as an example for simplicity. Assume a DNN layer with only two weights, denoted as $\{w_1, w_2\}$, whose values are $\{2.5, 1.75\}$. The weights will be quantized into $k$ bits. The quantization loss based on the L2 distance is denoted as $J_{L2}$ and a quantization solution is expressed in the format $\{\alpha; q_1, q_2\}$, where $\alpha$ is the floating-point scaling factor and $q_1, q_2$ are the discrete values in the quantization set $Q$. Then we get

$J_{L2} = (w_1 - \alpha q_1)^2 + (w_2 - \alpha q_2)^2$ (1)

Let $\vec{w} = (w_1, w_2)$ be a vector from the origin, and its quantized version be $\vec{w_q} = \alpha\,(q_1, q_2)$. $J_{L2}$ can also be represented through the modulus of the difference between the vectors $\vec{w}$ and $\vec{w_q}$. With $\theta$ denoting the intersection angle between them, $J_{L2}$ is calculated as:

$J_{L2} = \|\vec{w} - \vec{w_q}\|_2^2 = \|\vec{w}\|_2^2 + \|\vec{w_q}\|_2^2 - 2\,\|\vec{w}\|_2\,\|\vec{w_q}\|_2\cos\theta$ (2)
As shown in Fig. 2, there are only two dimensions in the solution space, each representing one weight, and each dimension contains $2^k$ possible values. Since the direct optimization of the L2 distance under the bit constraint is non-convex, a common method to solve this problem is to update $\alpha$ and $(q_1, q_2)$ iteratively until reaching an extreme point [26]. The possible solutions are located on the black dotted and blue solid lines in Fig. 2 due to the full-precision scaling factor $\alpha$. The intersection angle between $\vec{w}$ and its quantized version is denoted as $\theta$.
However, due to the non-convex characteristic of optimizing the L2 loss under the bit constraint [26], the iterative process may terminate at the red point in the figure, which is the first sub-optimal extreme point on the loss curve in Fig. 2, ignoring the optimal solution at the second extreme point. Therefore, instead of using the L2 quantization loss, we use the difference between vectors, the vector loss, to evaluate the difference between the weight vector and the quantized result and to guide the convergence process.
Generalizing the above observation to multiple weights within a DNN layer, we have the following theorem:
Theorem 1.
Assume $\vec{w}$ is the weight vector of a DNN layer containing $N$ weights. By minimizing the vector loss $J_v$, the optimal solution for the L2 distance based quantization can be obtained.
We first formally prove that there is always an optimal solution for minimizing the vector loss $J_v$; we then prove that the optimal solution for $J_v$ is also the optimal solution for minimizing the quantization loss based on the L2 distance.
Proof.
$J_v$ can be divided into an orientation loss $J_o$ and a modulus loss $J_m$, which are represented as

$J_o = \|\hat{e}_w - \hat{e}_q\|_2^2, \qquad J_m = \|\vec{w} - \vec{w_q}\|_2^2, \qquad \vec{w_q} = \alpha\,\vec{e_q}$ (3)

where $\hat{e}_w$ and $\hat{e}_q$ represent the unit vectors of $\vec{w}$ and $\vec{w_q}$, and $\vec{e_q}$ is the discrete orientation vector to be scaled by $\alpha$. Moreover, we obtain the first-order derivative of $J_o$ with respect to $\hat{e}_q$ and hence its Hessian matrix as below:

$\frac{\partial J_o}{\partial \hat{e}_q} = 2\,(\hat{e}_q - \hat{e}_w), \qquad H(J_o) = 2I \succ 0$ (4)

and the second-order derivative of $J_m$ with respect to $\alpha$ is

$\frac{\partial^2 J_m}{\partial \alpha^2} = 2\,\vec{e_q}\cdot\vec{e_q} > 0$ (5)

Equ. 4 and Equ. 5 indicate that $J_o$ and $J_m$ are both convex, so their optimal solutions can be obtained through efficient convex optimization solvers. The optimization of the orientation loss is defined only by the unit vectors $\hat{e}_w$ and $\hat{e}_q$ and is not affected by the value of $\alpha$; the optimization of the modulus loss is to find the optimal value of $\alpha$. Therefore, the optimization of $J_o$ and $J_m$ can be carried out independently. By optimizing $J_o$ alone, without the interference of a joint optimization with the modulus, the solution with the optimal intersection angle $\theta_v$ can be directly located; it always satisfies

$\cos\theta_v \ge \cos\theta_{L2}$ (6)

because the L2 loss based solutions cannot guarantee reaching the optimal angle in the same solution space. The modulus loss can then be optimized completely by scaling $\alpha$ without affecting the angle. Furthermore, since the L2 loss under the optimal scaling factor equals $\|\vec{w}\|_2^2\,(1 - \cos^2\theta)$, using the solution obtained under vector loss minimization we have the following for the L2 distance based loss:

$\|\vec{w}\|_2^2\,(1 - \cos^2\theta_v) \le \|\vec{w}\|_2^2\,(1 - \cos^2\theta_{L2})$ (7)
which proves that with vector loss, we always achieve the optimal solution for the L2 distance based quantization. ∎
Guided by the theorem above, we obtain the final optimal solution for the example points in Fig. 2, which provides the minimal DQL.
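As a numerical sanity check of the theorem, the toy script below (ours; it assumes a ternary quantization set for illustration) compares two routes on the two-weight example: exhaustively minimizing the L2 loss over all discrete directions, versus first maximizing the cosine to $\vec{w}$ (steering) and then fitting the scale in closed form (driving). Both reach the same minimal L2 loss.

```python
import numpy as np
from itertools import product

w = np.array([2.5, 1.75])          # the two example weights from the text
levels = [-1.0, 0.0, 1.0]          # assumed ternary quantization set

candidates = [np.array(q) for q in product(levels, repeat=2) if any(q)]

def l2_at_best_alpha(q):
    """L2 loss for a fixed discrete direction q with its optimal scale."""
    alpha = float(w @ q) / float(q @ q)        # least-squares projection
    return float(np.sum((w - alpha * q) ** 2)), alpha

# route 1: exhaustive search over every (alpha, q) pair
exhaustive = min((*l2_at_best_alpha(q), tuple(q)) for q in candidates)

# route 2: steering (largest cosine to w) first, driving (scale) second
steer = max(candidates, key=lambda q: float(w @ q) / float(np.linalg.norm(q)))
alpha = float(w @ steer) / float(steer @ steer)
vec_loss_route = (l2_at_best_alpha(steer)[0], alpha, tuple(steer))
```

This matches the theorem's argument: once the scale is optimal for a direction, the remaining L2 loss shrinks monotonically as the cosine of the intersection angle grows, so the best direction under vector loss is also the best under L2 loss.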
Based on our proposed vector loss metric and theorem, we will discuss our algorithm to obtain the vector loss based quantization solution as well as the methods to speed it up for practical use.
4 Vectorized Quantization
VecQ is designed to follow the theoretical guideline we developed in Section 3. First of all, the vectorization of the weights is introduced. Then the adoption of the vector loss in VecQ is explained in detail. Finally, the process of VecQ quantization with two critical stages is presented.
4.1 Vectorization of weights
For the weight set of layer $l$, we flatten and reshape it into an $N$-dimensional vector, where $N$ is the product of the number of input channels, the number of output channels and the kernel size of this layer. For simplicity, we use $\vec{w}$ to represent the weight vector of a layer before quantization.
4.2 Loss function definition
We use the vectorized loss instead of the Euclidean distance during the quantization process. Since solving the orientation and modulus losses independently achieves the optimal solution for each of them, we define the total quantization loss as the sum of the modulus loss and the orientation loss,

$J_v = J_o + J_m$ (8)

to provide a stricter constraint during the quantization process.
4.3 Overall process
According to our analysis in Section 3, the orientation loss determines the optimal intersection angle and the modulus loss determines the optimal scale at this angle. Therefore, our quantization takes two stages to minimize the two losses independently, defined as the steering stage and the driving stage, as shown in Fig. 3. In the steering stage, we adjust the orientation of the weight vector to minimize the orientation loss. Then, we fix the orientation and only scale the modulus of the vector in the driving stage to minimize the modulus loss.
Let $\vec{w}$ be the weight vector of a layer of a DNN in the real space and $\vec{w_q}$ be the quantized weight vector in the uniformly discrete subspace. First, $\vec{w}$ is steered to $\vec{e_q}$:

$\vec{e_q} = \mathop{\arg\min}_{\vec{e}}\ J_o(\vec{w}, \vec{e})$ (9)

where $\vec{e_q}$ is an orientation vector that disregards the modulus and only captures the orientation. Second, along the determined orientation vector $\vec{e_q}$, we search for the modulus and "drive" to the optimal position with minimal modulus loss. The quantized vector is obtained by driving $\vec{e_q}$:

$\vec{w_q} = \alpha\,\vec{e_q}$ (10)

The complete quantization process is the combination of the two stages. The final target is reducing the loss between the original weights $\vec{w}$ and the quantized result $\vec{w_q}$. The entire quantization process is represented as

$\vec{w} \xrightarrow{\ \text{steering}\ } \vec{e_q} \xrightarrow{\ \text{driving}\ } \vec{w_q}$ (11)
4.4 Steering stage
The purpose of the steering stage is to search for an optimal orientation vector, which has the least orientation loss with respect to $\vec{w}$, i.e., which minimizes $J_o$.
As shown in Fig. 4, $\vec{w}$ holds the weights in floating-point representation, to be quantized into a $k$-bit representation. This means there are in total $2^k$ values that can be used to represent the values in $\vec{w}$, each of them denoted as $q_i$. We adopt a linear quantization method, where an interval $\Delta$ represents the distance between two adjacent quantized values:

$q_i = (i - 0.5)\,\Delta, \qquad i \in \{-(2^{k-1}-1), \ldots, 0, 1, \ldots, 2^{k-1}\}$ (12)

The vector $\vec{w}$ with floating-point data is quantized to a vector with discrete data by a rounding ($\lceil\cdot\rceil$) operation on each of the values in the vector. The data are limited to the representable range by a clip ($\mathrm{clip}$) operation. The subtraction of 0.5 is used to avoid aggregation at the 0 position and guarantees the maximum number of rounded values to be $2^k$.
Given $k$ as the number of bits of the quantization target, the intermediate quantized weight is

$\vec{e_q}(\Delta) = \Big(\mathrm{clip}\big(\lceil \vec{w}/\Delta \rceil,\ -(2^{k-1}-1),\ 2^{k-1}\big) - 0.5\Big)\,\Delta$ (13)

which minimizes the orientation loss with $\Delta$ as the interval parameter. When $k$ is fixed, $\Delta$ decides the orientation loss between $\vec{w}$ and $\vec{e_q}$. In order to minimize the loss, we only need to find the optimal $\Delta$:

$\Delta^* = \mathop{\arg\min}_{\Delta}\ J_o\big(\vec{w}, \vec{e_q}(\Delta)\big)$ (14)

Finding the optimal $\Delta$ requires several steps with high computational complexity; the detailed procedure and the corresponding fast solution are presented in Section 5.
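The steering-stage rounding can be sketched as below (our reading of the described ceil/clip/0.5-subtraction convention, which is an assumption; the interval and inputs are arbitrary examples):

```python
import numpy as np

def steer(w, delta, k):
    """Steering-stage linear quantization: snap each weight onto the midpoint
    grid {(m - 0.5) * delta}, clipped so that at most 2^k distinct levels are
    used and none of them sits exactly at zero."""
    half = 2 ** (k - 1) - 0.5                  # largest level in units of delta
    v = np.ceil(np.asarray(w, dtype=float) / delta) - 0.5
    return np.clip(v, -half, half) * delta

# k = 2, delta = 1.0 -> levels {-1.5, -0.5, 0.5, 1.5}
eq = steer([0.3, -0.2, 2.7, -3.0], delta=1.0, k=2)
```

With this convention, every in-range value lands on its nearest midpoint level, and out-of-range values are saturated to the extreme levels, matching the clipping behavior described above.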
4.5 Driving stage
In the driving stage, we minimize the modulus loss between the orientation vector $\vec{e_q}$ obtained from the steering stage and the original weight vector $\vec{w}$. Since we focus on the modulus in this stage, only the scaling factor $\alpha$ is involved:

$J_m(\alpha) = \|\vec{w} - \alpha\,\vec{e_q}\|_2^2$ (15)

Here we only need to find the optimal $\alpha$ that minimizes the modulus loss:

$\alpha^* = \mathop{\arg\min}_{\alpha}\ J_m(\alpha)$ (16)

where $\alpha^*$ can be easily obtained by finding the extreme of $J_m$ with

$\frac{\partial J_m}{\partial \alpha} = -2\,\vec{e_q}\cdot(\vec{w} - \alpha\,\vec{e_q}) = 0$ (17)

which yields $\alpha^* = \dfrac{\vec{w}\cdot\vec{e_q}}{\vec{e_q}\cdot\vec{e_q}}$.
Finally, with the two stages above, the quantized version of $\vec{w}$ is determined by $\Delta^*$ and $\alpha^*$:

$\vec{w_q} = \alpha^*\,\vec{e_q}(\Delta^*)$ (18)
5 Fast Quantization
With the proposed quantization method in Section 4, the minimum quantization loss is achieved when the optimal $\Delta$ in Equ. (14) is found. However, as one of the most frequently executed steps in model training, the computational overhead of quantization can make training inefficient. To address this issue, in this section we first analyze the computational complexity of finding the optimal $\Delta$. We then propose a fast solution based on parametric probability estimation and a computing template. Finally, the detailed implementation of our quantization solution together with the fast solver is integrated into our training flow.
5.1 Analysis of the optimal $\Delta$
The most computationally intensive part of our quantization is the steering stage, specifically solving Equ. (14). Equ. (14) cannot be solved directly due to the clipping and rounding operations. Instead of directly using the values in $\vec{w}$, we involve the distribution of the values in $\vec{w}$ to support a more general computation of $\Delta$. The probability density of a value $w$ in $\vec{w}$ is denoted as $p(w)$.
According to the steering method in Equ. (12), each value is projected to a value $q_i$; the $q_i$ are linearly distributed with a uniform distance defined by the interval $\Delta$. As shown in Fig. 5, the light blue curve is the distribution of the values in $\vec{w}$ and the orange dots are the values after quantization, represented as $q_i$. $P(q_i)$ indicates the probability of $q_i$.
Specifically, the data within the range $(q_i - \Delta/2,\ q_i + \Delta/2]$ are replaced by the single value $q_i$, and the data out of the representable range are truncated and set to the nearest $q_i$, where the smallest and largest levels are $\pm(2^{k-1}-0.5)\,\Delta$ as given in Equ. (12).
Based on the distribution of the data in $\vec{w}$, the expanded terms of $J_o$ in Equ. (14) can be obtained as follows,
(19)
The orientation loss is then represented as
(20)
Since linear quantization with the fixed interval $\Delta$ is adopted, the probabilities $P(q_i)$ and the corresponding expectation terms can be easily derived by the following equations,
(21)
(22)
So $J_o$ is only related to $\Delta$, and Equ. (14) can be solved by solving

$\frac{\partial J_o(\Delta)}{\partial \Delta} = 0$ (23)
Concluding from the discussion above, for each $\vec{w}$, three steps are necessary:

Estimating the probability density of the values in $\vec{w}$.

Solving Equ. (23) to get the optimal $\Delta$.

Using the optimal $\Delta$ to obtain the final quantization results.
However, the first two steps (probability density estimation and derivative calculation) are complex and costly in terms of CPU/GPU time and operations, which limits the training speed.
5.2 Fast solver
For the computation-intensive probability density estimation and derivative calculation, we propose two methods to speed up these processes: fast parametric estimation and a computing template, respectively.
5.2.1 Fast probability estimation
There are two methods for probability density estimation: parametric estimation and nonparametric estimation.
Nonparametric estimation is usually used for fitting the distribution of data without prior knowledge. It requires the density at every point of interest to be estimated individually, which leads to a huge computational overhead.
We take the widely adopted nonparametric estimation method, Kernel Density Estimation (KDE), as an example to illustrate the complexity of nonparametric estimation:

$\hat{p}(w) = \frac{1}{Nh}\sum_{i=1}^{N} K\!\left(\frac{w - w_i}{h}\right)$ (24)

Here $\hat{p}(w)$ is the estimated probability density of $w$, $N$ is the number of samples, and $K$ is a non-negative kernel function that integrates to 1 and has zero mean; $h$ is the smoothing parameter. The time complexity of computing the probability densities at all $N$ sample points is $O(N^2)$ and the space complexity is $O(N)$, because all the probability densities need to be computed and stored.
Parametric estimation is used for data with a known distribution family and only computes the parameters of that distribution. It can be processed quickly under a proper distribution assumption. Thus, we adopt a parametric estimation method in our solution.
There is prior knowledge about the weights of DNN layers: they are assumed to obey a normal distribution with a mean value of 0, so that training can be conducted easily and the model provides better generalization ability [12, 38]:

$\vec{w} \sim N(0, \sigma^2)$ (25)

where $\sigma$ is the standard deviation of $\vec{w}$. Based on this prior knowledge, we can use parametric estimation for the probability density of $\vec{w}$, and the only parameter that needs to be computed during training is $\sigma$. The effectiveness of this prior distribution is also confirmed by the final accuracy we obtain in the evaluations.
With Equ. (25), we have

$P(q_i) = \int_{q_i - \Delta/2}^{q_i + \Delta/2} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{w^2}{2\sigma^2}}\, dw$ (26)

Therefore, parametric estimation only requires the standard deviation $\sigma$, which can be calculated with

$\sigma = \sqrt{E[\vec{w}^2] - (E[\vec{w}])^2}$ (27)

Here $E[\cdot]$ computes the expectation. Hence, the time complexity of computing $\sigma$ is reduced to $O(N)$ and the space complexity is reduced to $O(1)$.
5.2.2 Computing template
After reducing the complexity of computing the probability, finding the optimal $\Delta$ is still complex and time-consuming. In order to improve the computing efficiency of this step, we propose a computing template based method.
Since the weights of a layer obey a normal distribution, $\vec{w} \sim N(0, \sigma^2)$, they can be transformed to the standard normal distribution:

$\vec{w_s} = \vec{w}/\sigma, \qquad \vec{w_s} \sim N(0, 1)$ (28)

Then, we can use $\vec{w_s}$ to compute the optimal interval instead of using $\vec{w}$, because

$J_o\big(\vec{w},\ \vec{e_q}(\Delta)\big) = J_o\big(\vec{w_s},\ \vec{e_q}(\Delta/\sigma)\big)$ (29)

Here, $\vec{w_s}$ is the computing template for $\vec{w}$, because with the interval $\lambda = \Delta/\sigma$ it has the same orientation loss as $\vec{w}$ with the interval $\Delta$. With this computing template, solving Equ. (23) is equivalent to solving the substitute equation

$\frac{\partial J_o(\lambda)}{\partial \lambda} = 0, \qquad \Delta^* = \sigma\,\lambda^*$ (30)

Since $\vec{w_s} \sim N(0, 1)$, the probability of the value $q_i$ is:

$P(q_i) = \int_{q_i - \lambda/2}^{q_i + \lambda/2} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}\, dx$ (31)

After the probability is obtained, the orientation loss can be expressed as a function $f(\lambda, k)$ relating only to $\lambda$ and the targeted bitwidth $k$ of the quantization.
(32)
$f(\lambda, k)$ is a convex function when $k > 1$. However, it is constant when $k = 1$, because the angle between the weight vector and the vector constructed from the signs of the weight values does not depend on $\lambda$. Due to this independence at $k = 1$, we set $\lambda$ to a fixed value for the convenience of the following process. We plot the curve of $f(\lambda, k)$ in Fig. 6 with the change of $\lambda$ under different bitwidths.
The optimal $\lambda$ values for all bitwidth settings obtained by solving Equ. (32) are shown in Table I. The loss is small enough when the targeted bitwidth is greater than 8, so we omit the results for larger bitwidths. With the template above, we only need to solve for the optimal $\lambda$ once per bitwidth and then apply it to all quantizations without repetitively calculating it. In other words, simply looking up the corresponding value in this table yields the optimal parameter, which reduces the complexity and intensity of the computation and significantly speeds up the training process.


k  1  2  3  4  5  6  7  8
λ  –  0.9957  0.5860  0.3352  0.1881  0.1041  0.0569  0.0308
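Putting the pieces together, template-based steering can be sketched as follows (our own sketch; the lookup values are transcribed from Table I under the assumption that they correspond to bitwidths 2 through 8, since for 1-bit the orientation loss does not depend on the interval):

```python
import numpy as np

# Pre-solved optimal interval on the standard normal, indexed by bitwidth k
# (assumed mapping of Table I entries to k = 2..8).
LAMBDA = {2: 0.9957, 3: 0.5860, 4: 0.3352, 5: 0.1881,
          6: 0.1041, 7: 0.0569, 8: 0.0308}

def template_steer(w, k):
    """Scale the looked-up interval by the layer's standard deviation instead
    of re-solving the optimization for every layer and iteration."""
    w = np.asarray(w, dtype=float)
    delta = LAMBDA[k] * float(np.std(w))       # Delta* = sigma * lambda*
    half = 2 ** (k - 1) - 0.5
    v = np.clip(np.ceil(w / delta) - 0.5, -half, half)
    return v * delta, delta

eq, delta = template_steer([0.5, -0.5, 1.0, -1.0], k=2)
```

The per-layer work is thus reduced to one standard-deviation computation and one table lookup.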

5.3 DNN training integration
We integrate our VecQ quantization into the DNN training flow for both the weight data and the activation data, as shown in Fig. 7.
Weight quantization: For layer $l$, during the forward propagation, we first quantize the full-precision weights into the quantized values, then use the quantized weights to compute the layer output. During the backward propagation, the gradient is calculated with respect to the quantized weights and propagated. In the final update process, this gradient is used to update the full-precision weights [43].
Activation quantization: Inspired by the Batch Normalization (BN) technique [19], instead of using a predefined distribution, we compute the distribution parameter of the activation outputs and update it with an Exponential Moving Average. During inference, the distribution parameter is employed as a linear factor applied to the activation function output of each layer, where the non-linear activation function following the convolutional or fully-connected layers can be, e.g., Sigmoid, Tanh, or ReLU.
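The weight-update rule above can be condensed into one step (a minimal sketch in our own notation; the sign quantizer and the gradient function in the example are toy stand-ins, not the actual VecQ quantizer or a real loss gradient):

```python
import numpy as np

def train_step(w_fp, quantize, grad_fn, lr=0.01):
    """One weight update in the quantized training flow: the forward pass and
    the gradient use the quantized weights, but the gradient is applied to the
    full-precision copy (straight-through style update)."""
    w_q = quantize(w_fp)      # forward: quantized weights produce the output
    g = grad_fn(w_q)          # backward: gradient w.r.t. the quantized weights
    return w_fp - lr * g      # update: applied to the full-precision weights

# toy example: sign quantizer and an illustrative gradient function
w = np.array([0.2, -0.4])
w_new = train_step(w, np.sign, lambda wq: wq, lr=0.01)
```

Keeping the full-precision copy is what allows many small gradient steps to accumulate even when the quantizer's output would otherwise not change between iterations.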
6 Evaluations
We choose Keras v2.2.4 as the baseline DNN training framework [6]. The layers in Keras are rewritten to support our proposed quantization mechanism as presented in Section 5.3. Specifically, all the weight data in the DNNs are quantized to the same bitwidth in the evaluation of VecQ, including the first and last layers. Our evaluations are conducted on two classic tasks: (1) image classification and (2) salient object detection (SOD). The evaluation results for image classification are compared to state-of-the-art results with the same bitwidth configuration, and the SOD results are compared to state-of-the-art solutions conducted with the FP32 data type.
6.1 Classification
Image classification is the basis of many computer vision tasks, so the classification accuracy of the quantized model is representative of the effectiveness of our proposed solution.
6.1.1 Evaluation settings
Datasets and DNN models. The MNIST, CIFAR10 [24] and ImageNet [8] datasets are selected for the image classification evaluations; the IMDB movie review dataset [30] and the THUCNews Chinese text dataset [35] are selected for the sentiment and text classification evaluations. The detailed information of the datasets is listed in Table II and Table III.


Datasets  MNIST  CIFAR10  ImageNet
Image size  28×28×1  32×32×3  224×224×3
# of Classes  10  10  1000
# of Images  60000  50000  1281167
# of Pixels (log10)  7.67  8.19  11.29



Datasets  IMDB  THUCNews 


Objectives  Movie reviews  Text classification 
# of Classes  2  10 
# of samples  50000  65000 



Models          AlexNet  ResNet18  MobileNetV2
Convs           5        21        35
DepConvs        –        –         17
BNs             7        19        52
FCs             3        1         1
Parameters (M)  50.88    11.7      3.54

¹ Convs indicates the vanilla convolution layers; DepConvs are the depthwise convolution layers [32]; BNs stand for the Batch Normalization layers [19]; FCs are the fully-connected layers.


W/A¹   LeNet5         VGG-like [34]  AlexNet [25]         ResNet18 [17]       MobileNetV2 [32]
       Size(M)  Acc   Size(M)  Acc   Size(M)  Top1/Top5   Size(M)  Top1/Top5  Size(M)  Top1/Top5

32/32  6.35   99.4    20.44  93.49   194.10  60.01/81.90²  44.63  69.60/89.24³  13.50  71.30/90.10⁴

1/32   0.21   99.34   0.67   90.39   6.21    55.06/77.78   1.45   65.58/86.24   0.67   53.78/77.07
2/32   0.41   99.53   1.31   92.94   12.27   59.31/81.01   2.68   68.23/88.10   1.09   64.67/85.24
3/32   0.60   99.48   1.94   93.02   18.33   60.36/82.40   4.24   68.79/88.45   1.50   69.13/88.35
4/32   0.80   99.47   2.58   93.27   24.39   61.21/82.94   5.63   68.96/88.52   1.92   71.89/90.38
5/32   1.00   99.47   3.22   93.37   30.45   61.65/83.19   7.02   69.41/88.77   2.33   71.47/90.15
6/32   1.20   99.49   3.86   93.51   36.51   62.01/83.32   8.42   69.81/88.97   2.74   72.23/90.61
7/32   1.40   99.48   4.49   93.52   42.57   62.09/83.44   9.81   70.17/89.09   3.16   72.33/90.62
8/32   1.60   99.48   5.13   93.50   48.63   62.22/83.54   11.20  70.36/89.20   3.57   72.24/90.66

2/8    0.41   99.43   1.31   92.46   12.27   58.04/80.21   2.68   67.91/88.30   1.09   63.34/84.42
4/8    0.80   99.53   2.58   93.37   24.39   61.22/83.24   5.63   68.41/88.76   1.92   71.40/90.41
8/8    1.60   99.44   5.13   93.55   48.63   61.60/83.66   11.20  69.86/88.90   3.57   72.11/90.68

¹ W/A denotes the quantization bitwidths of the weights and activations, respectively. ² Results of AlexNet with Batch Normalization layers are cited from [33]. ³ Results of ResNet18 are cited from [13]. ⁴ Results are cited from the documentation of Keras [6].
For the MNIST dataset, LeNet5 with the architecture 32C5-BN-MP2-64C5-BN-MP2-512FC-10Softmax is used, where C stands for a Convolutional layer (the number in front denotes the number of output feature channels and the number behind is the kernel size); BN stands for a Batch Normalization layer; FC represents a Fully-connected layer with the number of output channels listed in front of it; MP indicates a max-pooling layer followed by the size of the pooling kernel. The mini-batch size is 200 samples; the initial learning rate is 0.01 and is divided by 10 at epoch 35 and epoch 50, for a total of 55 training epochs.
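The learning-rate schedule above can be written out directly:

```python
def learning_rate(epoch, base_lr=0.01, milestones=(35, 50)):
    """LeNet-5 schedule described above: start at 0.01 and divide the
    learning rate by 10 at epoch 35 and epoch 50 (55 epochs in total)."""
    lr = base_lr
    for milestone in milestones:
        if epoch >= milestone:
            lr /= 10.0
    return lr
```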
For the CIFAR10 dataset, a VGG-like network [34] with the architectural configuration 64C3-BN-64C3-BN-MP2-128C3-BN-128C3-BN-MP2-256C3-BN-256C3-BN-MP2-1024FC-10Softmax is selected. A simple data augmentation, which pads 4 pixels on each side and randomly crops a 32×32 patch from the padded image or its horizontal flip, is adopted during training. Only the original 32×32 images are evaluated in the test phase. The network is trained with a mini-batch size of 128 for a total of 300 epochs. The initial learning rate is 0.01 and is divided by 10 at epoch 250 and 290.
For the ImageNet dataset, we select 3 famous DNN models: AlexNet [25], ResNet18 [17] and MobileNetV2 [32]. The ImageNet dataset contains 1000 categories and its images are relatively larger [8, 25, 34, 17]. We use a Batch Normalization (BN) layer instead of the original Local Response Normalization (LRN) layer in AlexNet for a fair comparison with [45, 26, 38]. The numbers of the different layers and the parameter sizes are listed in Table IV.
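The pad-and-crop augmentation described above can be sketched with NumPy; zero padding of the 4-pixel border is assumed.

```python
import numpy as np

def augment(image, pad=4, crop=32, rng=np.random):
    """Pad 4 pixels on each side, randomly crop a 32x32 patch, and apply
    a random horizontal flip, as in the CIFAR10 training setup above."""
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode='constant')
    top = rng.randint(0, 2 * pad + 1)   # valid top-left corners of the crop
    left = rng.randint(0, 2 * pad + 1)
    patch = padded[top:top + crop, left:left + crop, :]
    if rng.randint(2):                  # flip with probability 0.5
        patch = patch[:, ::-1, :]
    return patch
```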
The architectures of the models for the IMDB movie review [30] sentiment classification and the THUCNews [35] text classification are 128E-64C5-MP4-70LSTM-1Sigmoid and 128E-128LSTM-128FC-10Softmax, respectively, where E denotes an Embedding layer and the number in front of it represents its dimension; LSTM is an LSTM layer and the number in front is the number of hidden units. In addition, we quantize all layers, including the Embedding, Convolutional, Fully-connected and LSTM layers, for these two models. Specifically, for the LSTM layer, we quantize the input features, outputs and weights, but leave the intermediate state and activation of each timestep untouched, since quantizing them does not help to reduce the size of the model.
Evaluation Metrics. The final classification accuracy on the corresponding datasets is taken as the evaluation metric for the image classification tasks, as in many other works [27, 26, 38, 14]. Moreover, the Top1 and Top5 classification accuracy results are presented for all models on the ImageNet dataset for a comprehensive evaluation, as in [34, 17].
6.1.2 Bitwidth flexibility


W/A    IMDB               THUCNews
       Size (M)   Acc.    Size (M)   Acc.

32/32  10.07      84.98¹  3.01       94.74

2/32   0.63       85.54   0.19       93.99
4/32   1.26       86.24   0.38       94.47
8/32   2.52       85.53   0.75       94.53
2/8    0.63       85.40   0.19       94.00
4/8    1.26       84.67   0.38       94.09
8/8    2.52       85.72   0.75       94.43

VecQ is designed to support a wide range of targeted bitwidths. We conduct a series of experiments to verify the impact of the bitwidth on model accuracy and model size reduction. The bitwidth in the following evaluations ranges from 1 to 8.
The accuracy results and model sizes for the image classification models are shown in Table V. We first discuss weight quantization only. There is a relatively higher accuracy degradation when the bitwidth is set to 1, but starting from 2 bits the accuracy of the models recovers to within a small drop of the FP32 version, except for MobileNetV2. With the increase of the bitwidth, the accuracy of the quantized model improves. The highest accuracies of LeNet5 and VGG-like are 99.53% and 93.52%, obtained at 2 bits and 7 bits respectively, which outperform the accuracy results with FP32. The highest accuracies of AlexNet, ResNet18 and MobileNetV2 are obtained at 8 bits with 62.22% (Top1), 8 bits with 70.36% (Top1) and 7 bits with 72.33% (Top1), respectively, and all of them outperform the values obtained with FP32. Overall, the best accuracy improvements over the full-precision versions for the five models are 0.13%, 0.03%, 2.21%, 0.76% and 1.03%, at 2, 7, 8, 8 and 7 bits respectively, when the activation is kept at 32 bits. The table also contains the accuracy of the models with 8-bit activation quantization. Although activation quantization leads to an accuracy degradation for most of the models (except VGG-like), the results are in general comparable with the models using the FP32 data type.
The accuracy and model size of the sentiment classification and text classification models are shown in Table VI. Our solution easily outperforms the models trained with FP32. Even with activation quantization, the accuracy results are still well maintained. The results also indicate that adopting appropriate quantization of the weight data improves the generalization ability of the DNN models. In other words, appropriate quantization achieves higher classification accuracy in our tests.
6.1.3 Comparison with state-of-the-art results


Methods       Weights                 Activation            FConv  IFC  LFC
              Bits      SFB  SFN      Bits   SFB   SFN

ReBNet [11]   1         32   1        3      32    3        –      Y    N
BC [7]        1         –    0        32     –     –        –      –    –
BWN [31]      1         32   1        32     –     –        –      –    –
BPWN [27]     1         32   1        32     –     –        –      N    N
TWN [27]      2         32   1        32     –     –        –      N    N
TC [29]       2         –    0        32     –     –        –      –    –
TNN [1]       2         –    0        2      –     0        –      –    –
TTQ [45]      2         32   2        32     –     –        N      Y    N
INQ [42]      1–5       –    0        32     –     –        –      –    –
FP [14]       2,4,8     –    0        32     –     –        –      N    N
uL2Q [12]     1,2,4,8   32   1        32     –     –        Y      Y    8
ENN [26]      1,2,3     32   1        32     –     –        –      –    –
TSQ [38]      2         32   c⁴       2      32    2        N      Y    N
DC [15]²      2,3,4     32   4,8,16   32     –     –        8      –    1
HAQ [37]³     flexible  32   1        32     –     –        8      –    1
QAT [20]      8         32   1        8      32    1        N      –    N
TQT [21]      8         –    1        8      –     1        8      –    8

VecQ          1–8       32   1        32,8   –,32  –,1      Y      Y    Y

¹ Weights and Activation denote the quantized data of the model: Bits refers to the quantized bitwidth; SFB is the bitwidth of the scaling factor; SFN is the number of scaling factors. FConv, IFC and LFC indicate whether the First Convolution layer, the Internal Fully-Connected layers and the Last Fully-Connected layer are quantized. ² The results of DC are from [37]. ³ HAQ is a mixed-precision method; results here are from the experiments that only quantize the weight data. ⁴ TSQ introduces a floating-point scaling factor for each convolutional kernel, so SFN equals the number of kernels.
We collected the state-of-the-art accuracy results of DNNs quantized to different bitwidths with different quantization methods, and compared them to the results from VecQ with the same bitwidth target. The detailed bitwidth support of the compared methods is listed in Table VII. Note that when the quantization of the activations is not enabled, the SFB and SFN are not applicable to VecQ.
The comparisons are shown in Table VIII. The final accuracy based on VecQ increased by an average of 0.38% and 3.30% for LeNet5 and VGG-like, respectively, when compared with the other quantization methods. There is also an average improvement of 0.39%/0.01% (Top1/Top5) on AlexNet, 2.45%/1.39% on ResNet18 and 2.57%/1.64% on MobileNetV2 when compared to the state-of-the-art methods. For all the 3 datasets and 5 models, the quantized models with VecQ achieve higher accuracy than almost all state-of-the-art methods with the same bitwidth. However, when the bitwidth target is 1, the quantized AlexNet and ResNet18 models based on VecQ perform worse, because we quantize all the weights into 1 bit, including the first and last layers, unlike the counterparts that use a higher bitwidth for the first or last layers. This also leads to more accuracy degradation at low bitwidth on the lightweight network MobileNetV2. It does, however, allow us to provide an even smaller model size. Besides, the AlexNet used by BWN [31] and ENN [26] has about 61M parameters [31], while ours has 50.88M (Table IV), because eliminating the layer paddings in the intermediate layers leads to fewer weights in the fully-connected layers. When compared to TWN, uL2Q and TSQ, VecQ achieves significantly higher accuracy at the same targeted bitwidth, which also indicates that our vectorized quantization is superior to the L2-loss-based solutions.


Bitwidth  Methods       MNIST      Cifar10 [24]    ImageNet [8]
                        LeNet5     VGG-like [34]   AlexNet [25]   ResNet18 [17]   MobileNetV2 [32]

32        FP32          99.4       93.49           60.01/81.90    69.60/89.24     71.30/90.10


1         ReBNet [11]   –          86.98           –              –               –
          BC [7]        98.71      88.44           –              –               –
          BWN [31]      –          –               56.80/79.40    60.80/83.00     –
          BPWN [27]     99.05      –               –              57.50/81.20     –
          ENN [26]      –          –               57.00/79.70    64.80/86.20     –
          L2Q [12]      99.06      89.02           –              66.24/86.00     –
          VecQ          99.34      90.39           55.06/77.78    65.58/86.24     53.78/77.07
          MeanImp³      0.40       2.24            -1.84/-1.77    3.24/2.14       –


2  FP [14]  98.90         
TWN [27]  99.35    57.50/79.80^{1}^{1}footnotemark: 1  61.80/84.20    
TC [29]    89.07        
TNN [1]    87.89        
TTQ [45]      57.50/79.70  66.60/87.20    
INQ [42]        66.02/87.12    
ENN      58.20/80.60  67.00/87.50    
TSQ [38]      58.00/80.50      
L2Q  99.12  89.5    65.60/86.12    
DC^{4}^{4}footnotemark: 4          58.07/81.24  
VecQ  99.53  92.94  59.31/81.01  68.23/88.10  64.67/85.24  
MeanImp  0.41  4.12  1.51/0.86  2.83/1.67  6.60/4.00  


3  INQ        68.08/88.36   
ENN²      60.00/82.20  68.00/88.30    
DC          68.00/87.96  
VecQ  99.48  93.02  60.36/82.40  68.79/88.45  69.13/88.35  
MeanImp      0.36/0.20  0.75/0.12  1.13/0.39  


4  FP  99.1         
INQ        68.89/89.01    
L2Q  99.12  89.8    65.92/86.72    
DC          71.24/89.93  
VecQ  99.47  93.27  61.21/82.94  68.96/88.52  71.89/90.38  
MeanImp  0.36  3.47  /  1.55/0.65  0.65/0.45  


5  INQ        68.98/89.10   
VecQ  99.47  93.37  61.65/83.19  69.15/88.57  71.47/90.15  
MeanImp      /  0.17/-0.53  /  


6  HAQ          66.75/87.32 
VecQ  99.49  93.51  62.01/83.32  69.81/88.87  72.23/90.61  
MeanImp      /  /  5.48/3.29  


7  VecQ  99.48  93.52  62.09/83.44  70.17/89.09  72.33/90.62 


8  FP  99.1         
L2Q  99.16  89.7    65.52/86.36    
QAT [20]⁵          71.10/–  
TQT [21]          71.80/90.60  
VecQ  99.48  93.5  62.22/83.54  70.36/89.20  72.24/90.66  
MeanImp  0.35  3.80  /  4.84/2.84  0.79/0.06  

¹ The results of TWN on AlexNet are from [26]. ² ENN adopts the 2-bit shifting results from [26]. ³ MeanImp denotes the mean accuracy improvement compared to the state-of-the-art methods. ⁴ The results are from HAQ [37]. ⁵ The results are from [23].
6.1.4 Analysis of the quantization interval values


Bitwidth     2       3       4       5       6       7       8

AlexNet      0.9700  0.5700  0.3300  0.1900  0.1000  0.0600  0.0300
ResNet18     1.0000  0.6100  0.3600  0.2100  0.1200  0.0600  0.0300
MobileNetV2  1.0000  0.5900  0.3400  0.1900  0.1100  0.0600  0.0300

MeanError    0.0171  0.0019  0.0024  0.0256  0.0178  0.0094  0.0023

In order to evaluate the accuracy of the theoretical interval values in Table I, we choose the last convolutional layers from different models to calculate the actual optimal values. Since the quantization interval is expressed in units of the standard deviation of the weights, the range (0,3] covers more than 99.74% of the layer data, so the actual value is obtained by exhaustively searching the range (0,3] with a precision of 0.001. The comparison is shown in Table IX. As we can learn from Table IX, there are differences between the theoretical values and the actual values. However, the final accuracy reported in the previous subsection is maintained and not affected by the small bias of the interval, which demonstrates the robustness of our solution.
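The exhaustive search described above can be sketched as follows. The weights are normalized to unit standard deviation so the interval is searched in (0, 3]; the L2 reconstruction error is used here as an illustrative objective, whereas the paper measures the loss in its vectorized form, so the optimum of this sketch need not coincide exactly with the values in Table IX.

```python
import numpy as np

def best_interval(weights, bits, step=0.001):
    """Exhaustively search the quantization interval in (0, 3] with a
    precision of 0.001 (in units of the weight standard deviation) and
    return the interval minimizing the L2 reconstruction error."""
    w = weights / np.std(weights)      # normalize to unit standard deviation
    n_levels = 2 ** (bits - 1) - 1
    best_a, best_err = None, np.inf
    for a in np.arange(step, 3.0 + step / 2, step):
        q = np.clip(np.round(w / a), -n_levels, n_levels) * a
        err = np.sum((w - q) ** 2)
        if err < best_err:
            best_a, best_err = a, err
    return best_a
```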
6.2 Salient object detection
Salient object detection aims at highlighting the region of the salient object in an image. It is an important evaluation that provides good visible results. The previous experiments show that 2 bits achieves a good trade-off between accuracy and bitwidth reduction, so in this section only 2-bit quantization of the weights with VecQ is used for the DNN models.
6.2.1 Evaluation settings
Datasets. All models are trained with the training data in the MSRA10K dataset (80% of the entire dataset) [39]. After training, the evaluation is conducted on multiple datasets, including MSRA10K (the remaining 20% of the dataset), ECSSD [39], HKUIS [39], DUTs [39], DUTOMRON [39] and the images containing target objects with existing ground-truth maps in THUR15K [5]. The details of the selected datasets are shown in Table X. All images are resized to the same resolution for training and test.


Datasets   Images          Contrast

ECSSD      1000            High
HKUIS      4000            Low
DUTs       15572           Low
DUTOMRON   5168            Low
MSRA10K    10000 (80/20%)  High
THUR15K    6233            Low

Models. The well-known end-to-end semantic segmentation models, i.e., UNet [39], FPN [28], LinkNet [2] and UNet++ [44], are selected for a comprehensive comparison. Their detailed information is shown in Table XI. The models are based on the same ResNet50 backbone as the encoder and initialized with weights trained on the ImageNet dataset.


Models          UNet      FPN       LinkNet   UNet++

Backbone        ResNet50  ResNet50  ResNet50  ResNet50
Convs           64        67        69        76
Parameters (M)  36.54     28.67     28.78     37.7
Model size (M)  139.37    109.38    109.8     143.81
Q. size (M)¹    9.05      7.20      7.24      9.35



Datasets  MSRA10K-test  ECSSD  HKUIS  DUTs  DUTOMRON  THUR15K  


Model  Size (M)  MAE  MaxF  S  E  MAE  MaxF  S  E  MAE  MaxF  S  E  MAE  MaxF  S  E  MAE  MaxF  S  E  MAE  MaxF  S  E 


Unet  139.37  0.030  0.945  0.931  0.962  0.057  0.909  0.886  0.914  0.045  0.907  0.884  0.930  0.060  0.896  0.865  0.874  0.070  0.804  0.803  0.829  0.077  0.769  0.807  0.816 
Unet*  9.05  0.032  0.940  0.926  0.959  0.064  0.896  0.871  0.906  0.050  0.893  0.870  0.923  0.065  0.885  0.852  0.871  0.071  0.793  0.795  0.835  0.081  0.749  0.797  0.815 


Bias  93.51%  0.002  0.004  0.005  0.003  0.007  0.013  0.015  0.008  0.005  0.014  0.015  0.007  0.005  0.011  0.012  0.003  0.001  0.011  0.008  0.006  0.005  0.020  0.010  0.001 


FPN  109.38  0.043  0.920  0.899  0.949  0.070  0.882  0.854  0.901  0.059  0.875  0.848  0.911  0.072  0.875  0.835  0.861  0.081  0.777  0.772  0.812  0.087  0.750  0.778  0.804 
FPN*  7.2  0.038  0.935  0.920  0.955  0.070  0.889  0.866  0.897  0.059  0.879  0.859  0.908  0.070  0.878  0.850  0.859  0.080  0.772  0.786  0.809  0.089  0.739  0.789  0.796 


Bias  93.42%  0.005  0.015  0.022  0.006  0.000  0.007  0.012  0.004  0.001  0.005  0.011  0.004  0.002  0.003  0.015  0.003  0.001  0.004  0.014  0.003  0.002  0.011  0.011  0.007 


Linknet  109.8  0.032  0.942  0.928  0.959  0.060  0.905  0.882  0.911  0.048  0.900  0.878  0.927  0.062  0.892  0.861  0.871  0.071  0.801  0.799  0.825  0.079  0.760  0.803  0.814 
Linknet*  7.24  0.034  0.939  0.923  0.959  0.068  0.891  0.865  0.902  0.054  0.887  0.860  0.920  0.068  0.883  0.847  0.870  0.072  0.788  0.787  0.833  0.082  0.746  0.794  0.818 


Bias  93.40%  0.002  0.003  0.004  0.001  0.008  0.014  0.017  0.008  0.005  0.013  0.018  0.006  0.005  0.010  0.014  0.001  0.001  0.014  0.012  0.008  0.002  0.014  0.009  0.004 


UNet++  143.81  0.029  0.948  0.933  0.964  0.056  0.910  0.888  0.915  0.044  0.909  0.887  0.930  0.059  0.897  0.867  0.876  0.070  0.805  0.805  0.829  0.076  0.769  0.810  0.818 
UNet++*  9.35  0.033  0.939  0.926  0.958  0.065  0.895  0.872  0.905  0.052  0.890  0.868  0.919  0.066  0.884  0.854  0.867  0.075  0.785  0.792  0.822  0.082  0.750  0.797  0.811 


Bias  93.50%  0.004  0.008  0.007  0.007  0.009  0.015  0.016  0.010  0.008  0.019  0.018  0.011  0.007  0.013  0.013  0.009  0.005  0.020  0.013  0.007  0.006  0.019  0.013  0.007 

Evaluation Metrics. We choose widely used metrics for a comprehensive evaluation: (1) Mean Absolute Error (MAE) [39], (2) Maximal F-measure (MaxF) [39], (3) Structure-measure (S-measure) [39], and (4) Enhanced-alignment measure (E-measure) [10].
The MAE measures the average pixel-wise absolute error between the output map $S$ and the ground-truth mask $G$:
\[
\mathrm{MAE} = \frac{1}{N}\sum_{n=1}^{N}\frac{1}{H_n W_n}\sum_{i=1}^{H_n}\sum_{j=1}^{W_n}\left|S^{n}_{ij}-G^{n}_{ij}\right| \tag{33}
\]
Here $N$ is the number of samples, and $H_n$ and $W_n$ are the height and width of the $n$-th map.
The F-measure comprehensively evaluates the precision and recall with a weight parameter $\beta$:
\[
F_{\beta}=\frac{(1+\beta^{2})\cdot \mathrm{Precision}\cdot \mathrm{Recall}}{\beta^{2}\cdot \mathrm{Precision}+\mathrm{Recall}} \tag{34}
\]
$\beta^{2}$ is empirically set to 0.3 [39]. MaxF is the maximal $F_{\beta}$ value over the binarization thresholds.
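The two metrics above can be sketched with NumPy; the number of binarization thresholds swept for MaxF is an illustrative choice.

```python
import numpy as np

def mae(pred, gt):
    """Mean Absolute Error between a predicted map and a ground-truth
    mask, both with values in [0, 1]."""
    return float(np.mean(np.abs(pred - gt)))

def max_f_measure(pred, gt, beta2=0.3, num_thresholds=255):
    """Maximal F-measure: binarize the prediction at a sweep of
    thresholds and keep the best F value (beta^2 = 0.3)."""
    mask = gt > 0.5
    best = 0.0
    for t in np.linspace(0.0, 1.0, num_thresholds, endpoint=False):
        binary = pred > t
        tp = np.sum(binary & mask)      # true positives at this threshold
        if tp == 0:
            continue
        precision = tp / np.sum(binary)
        recall = tp / np.sum(mask)
        f = (1 + beta2) * precision * recall / (beta2 * precision + recall)
        best = max(best, f)
    return best
```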
The S-measure considers the object-aware and region-aware structural similarities.
The E-measure is designed for binary map evaluation. It combines the local pixel-wise difference and the global mean value of the map for a comprehensive evaluation. In our evaluation, the output map is first binarized by comparing it with a threshold of twice its mean value [39]; the binary map is then evaluated with the E-measure.
We also involve a direct visual comparison of the full-precision and quantized weights of the selected model, to provide a more visible comparison.
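The adaptive binarization step used before computing the E-measure can be sketched as follows; capping the threshold at 1.0 is an added safeguard for very bright maps, not stated in the text.

```python
import numpy as np

def binarize_for_emeasure(saliency_map):
    """Binarize an output map in [0, 1] with a threshold of twice its
    mean value, as described above (threshold capped at 1.0)."""
    threshold = min(2.0 * float(np.mean(saliency_map)), 1.0)
    return (saliency_map >= threshold).astype(np.uint8)
```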
6.2.2 Results and analysis
The sampled detection results are shown in Fig. 8 and the quantitative comparison results are in Table XII. The * suffix indicates the quantized model based on VecQ, listed beside its full-precision counterpart. Note that the output map of the last layer of FPN and FPN* is smaller than the input and is then resized to the input resolution. Overall, the performance degradation of the quantized models based on VecQ is less than 0.01 in most metrics for all models, while the size of the quantized model is reduced by more than 93% when compared to the models using FP32.
As shown in Table XII, all the quantized models have a less than 0.01 degradation on MSRA10K for all evaluation metrics. This is because all the models are trained on the training subset of MSRA10K, so the features of its images are well extracted. The other 5 datasets are only used as testing datasets and show more degradation on the evaluation metrics, especially MaxF and S-measure, but the degradation stays within 0.02.
Comparing the quantized models with their corresponding full-precision models, FPN* performs well on almost all the test sets with all the evaluation metrics (shown as bold italic numbers in Table XII), showing a better feature extraction capacity than the full-precision version (FPN). Compared to the other models, FPN outputs a relatively smaller prediction map. Although the backbone models are similar, the feature classification tasks in FPN work on a smaller feature map with a similar number of coefficients. This provides a good opportunity for data quantization without affecting the detection ability, because of the potential redundancy of the coefficients.
In order to further present the effectiveness of our proposed quantization solution, we visualize the weights of the convolution layers of the backbone model in UNet++ and UNet++*, as shown in Fig. 9. There are three sizes of kernels involved in this model: 7×7, 3×3 and 1×1. In addition, we also compare the full-precision and quantized weights of the last convolution layer of the selected model, which directly outputs the detection results.
In the first 7×7 kernels, we notice a significant structural similarity between the full-precision weights and the quantized weights. Since the very first layer of the backbone model extracts the overall features, the quantized weights provide a good feature extraction ability. When the size of the kernels becomes smaller, we can still notice a good similarity between the full-precision and quantized kernels. Although the similarities are less pronounced in the Conv 3×3 sets, they become obvious in the following Conv 1×1 sets. The Last Conv group directly explains the comparable output results, with visibly emphasized kernels and weight locations.
Overall, the quantized weights show a good similarity to the full-precision ones in terms of both value and location, which ensures high-accuracy output compared to the original full-precision model.
7 Conclusion
In this paper, we propose a new quantization solution called VecQ. Different from existing works, it utilizes a vector loss instead of the L2 loss to measure the loss of quantization. VecQ quantizes the full-precision weight vector into a specific bitwidth with the least DQL and hence provides a better final model accuracy. We further introduce a fast quantization algorithm based on reasonable prior knowledge of normally distributed weights, reducing the complexity of the quantization process in model training. The integration of VecQ into Keras [6] is also presented and used for our evaluations. The comprehensive evaluations have shown the effectiveness of VecQ on multiple datasets, models and tasks. The quantized low-bit models based on VecQ show comparable classification accuracy to models with the FP32 data type and outperform the state-of-the-art quantization methods when the targeted bitwidth of the weights is higher than 2. Moreover, the experiments on salient object detection show that VecQ can greatly reduce the size of the models while maintaining the performance of the feature extraction tasks.
For future work, we will focus on the combination with non-linear quantization and on automated mixed-precision quantization with VecQ to achieve further performance improvement. The source code of the Keras build with VecQ can be found at https://github.com/GongCheng1919/VecQ.
8 Acknowledgment
This work is partially supported by the National Natural Science Foundation (61872200), the National Key Research and Development Program of China (2018YFB2100304, 2018YFB1003405), the Natural Science Foundation of Tianjin (19JCZDJC31600, 19JCQNJC00600), the Open Project Fund of State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences (CARCH201905). It is also partially supported by the National Research Foundation, Prime Minister’s Office, Singapore under its Campus for Research Excellence and Technological Enterprise (CREATE) programme, and the IBMIllinois Center for Cognitive Computing System Research (C3SR)  a research collaboration as part of IBM AI Horizons Network.
References
 [1] (2017) Ternary neural networks for resourceefficient AI applications. In IJCNN, pp. 2547–2554. Cited by: §2.3, §2, TABLE VII, TABLE VIII.
 [2] (2017) Linknet: exploiting encoder representations for efficient semantic segmentation. In VCIP, pp. 1–4. Cited by: §6.2.1.
 [3] (2019) CloudDNN: an open framework for mapping dnn models to cloud FPGAs. In FPGA, pp. 73–82. Cited by: §1.

 [4] (2019) TDLA: an open-source deep learning accelerator for ternarized DNN models on embedded FPGA. ISVLSI. Cited by: §1, §2.3, §2.4, §2.
 [5] (2014) SalientShape: group saliency in image collections. The Visual Computer 30 (4), pp. 443–453. External Links: ISSN 01782789, Document Cited by: §6.2.1.
 [6] (2015) Keras. GitHub. Note: https://github.com/fchollet/keras Cited by: TABLE V, §6, §7.
 [7] (2015) Binaryconnect: training deep neural networks with binary weights during propagations. In NIPS, pp. 3123–3131. Cited by: §1, §2.1, §2, TABLE VII.
 [8] (2009) Imagenet: A largescale hierarchical image database. In IEEE CVPR, pp. 248–255. Cited by: §6.1.1, §6.1.1, TABLE VIII.
 [9] (2018) GXNORnet: training deep neural networks with ternary weights and activations without fullprecision memory under a unified discretization framework. Neural Networks 100, pp. 49–58. Cited by: §2.3, §2.
 [10] (2018) Enhancedalignment measure for binary foreground map evaluation. In IJCAI, pp. 698–704. Cited by: §6.2.1.
 [11] (2018) ReBNet: residual binarized neural network. In IEEE FCCM, pp. 57–64. Cited by: §2.2, §2, TABLE VII, TABLE VIII.
 [12] (2019) l2q: An ultralow loss quantization method for DNN. IJCNN. Cited by: §1, §1, §2.2, §2.4, §2.4, §2, §5.2.1, TABLE VII, TABLE VIII.
 [13] (2016) Training and investigating residual nets. Facebook AI Research. Cited by: TABLE V.

 [14] (2016) Hardware-oriented approximation of convolutional neural networks. arXiv. Cited by: §1, §2.1, §2.3, §2, §6.1.1, TABLE VII, TABLE VIII.
 [15] (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv. Cited by: §1, §2.2, §2.4, §2, TABLE VII.
 [16] (2019) FPGA/DNN codesign: an efficient design methodology for iot intelligence on the edge. In DAC, Cited by: §1, §1.
 [17] (2016) Deep residual learning for image recognition. In IEEE CVPR, Cited by: §1, §6.1.1, §6.1.1, TABLE V, TABLE VIII.
 [18] (2016) Binarized neural networks. In NIPS, pp. 4107–4115. Cited by: §1, §1, §2.1, §2.2, §2.
 [19] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, pp. 448–456. Cited by: §2.1, §5.3, TABLE IV.
 [20] (2018) Quantization and training of neural networks for efficient integerarithmeticonly inference. In IEEE CVPR, pp. 2704–2713. Cited by: §2.1, §2, TABLE VII.
 [21] (2019) Trained uniform quantization for accurate and efficient neural network inference on fixedpoint hardware. arXiv. Cited by: §2.1, §2, TABLE VII.
 [22] (2018) Sparse ternary connect: convolutional neural networks using ternarized weights with enhanced sparsity. In ASPDAC, pp. 190–195. Cited by: §1, §2.1, §2.
 [23] (2018) Quantizing deep convolutional networks for efficient inference: a whitepaper. arXiv. Cited by: TABLE VIII.
 [24] (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §6.1.1, TABLE VIII.
 [25] (2012) Imagenet classification with deep convolutional neural networks. In NIPS, pp. 1097–1105. Cited by: §6.1.1, TABLE V, TABLE VIII.
 [26] (2018) Extremely low bit neural network: squeeze the last bit out with admm. In AAAI, Cited by: §1, §1, §2.2, §2, §3, §3, §6.1.1, §6.1.1, §6.1.3, TABLE VII, TABLE VIII.
 [27] (2016) Ternary weight networks. arXiv. Cited by: §1, §1, §2.2, §2.4, §2, §6.1.1, TABLE VII, TABLE VIII.
 [28] (2017) Feature pyramid networks for object detection. In IEEE CVPR, pp. 2117–2125. Cited by: §6.2.1.
 [29] (2015) Neural networks with few multiplications. arXiv. Cited by: §1, §2.1, §2, TABLE VII, TABLE VIII.

 [30] (2011) Learning word vectors for sentiment analysis. In ACL, pp. 142–150. Cited by: §6.1.1, §6.1.1.
 [31] (2016) XNOR-Net: imagenet classification using binary convolutional neural networks. In ECCV, pp. 525–542. Cited by: §2.2, §2, §6.1.3, TABLE VII, TABLE VIII.
 [32] (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In IEEE CVPR, pp. 4510–4520. Cited by: §6.1.1, TABLE IV, TABLE V, TABLE VIII.
 [33] (2016) ImageNet pretrained models with batch normalization. arXiv. Cited by: TABLE V.
 [34] (2014) Very deep convolutional networks for largescale image recognition. arXiv. Cited by: §1, §6.1.1, §6.1.1, §6.1.1, TABLE V, TABLE VIII.

 [35] (2016) THUCTC: an efficient Chinese text classifier. GitHub Repository. Cited by: §6.1.1, §6.1.1.
 [36] (2018) Design flow of accelerating hybrid extremely low bitwidth neural network in embedded FPGA. In FPL, pp. 163–1636. Cited by: §1.
 [37] (2019) HAQ: hardwareaware automated quantization with mixed precision. In IEEE CVPR, pp. 8612–8620. Cited by: §2.3, §2, TABLE VII, TABLE VIII.
 [38] (2018) Twostep quantization for lowbit neural networks. In IEEE CVPR, pp. 4376–4384. Cited by: §1, §1, §2.2, §2.4, §2, §5.2.1, §6.1.1, §6.1.1, TABLE VII, TABLE VIII.
 [39] (2019) Salient object detection in the deep learning era: an indepth survey. arXiv. Cited by: §6.2.1, §6.2.1, §6.2.1, §6.2.1, §6.2.1.
 [40] (2018) Gradiveq: vector quantization for bandwidthefficient gradient aggregation in distributed CNN training. In NIPS, Cited by: §2.2, §2.4, §2.
 [41] (2018) DNNbuilder: an automated tool for building highperformance dnn hardware accelerators for FPGAs. In ICCAD, Cited by: §1.
 [42] (2017) Incremental network quantization: towards lossless CNNs with lowprecision weights. arXiv. Cited by: §2.1, §2, TABLE VII, TABLE VIII.
 [43] (2016) DoReFaNet: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv. Cited by: §1, §2.1, §2.3, §2, §5.3.
 [44] (2018) Unet++: a nested unet architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pp. 3–11. Cited by: §6.2.1.
 [45] (2016) Trained ternary quantization. arXiv. Cited by: §1, §2.1, §2, §6.1.1, TABLE VII, TABLE VIII.