Quantized Neural Networks via -1, +1 Encoding Decomposition and Acceleration

06/18/2021 · Qigong Sun et al. · Peking University, Xidian University, NetEase, Inc.

The training of deep neural networks (DNNs) always requires intensive resources for both computation and data storage. Thus, DNNs cannot be efficiently applied to mobile phones and embedded devices, which severely limits their applicability in industrial applications. To address this issue, we propose a novel encoding scheme using -1, +1 to decompose quantized neural networks (QNNs) into multi-branch binary networks, which can be efficiently implemented by bitwise operations (i.e., xnor and bitcount) to achieve model compression, computational acceleration, and resource saving. By using our method, users can achieve different encoding precisions arbitrarily according to their requirements and hardware resources. The proposed mechanism is highly suitable for the use of FPGA and ASIC in terms of data storage and computation, which provides a feasible idea for smart chips. We validate the effectiveness of our method on large-scale image classification (e.g., ImageNet), object detection, and semantic segmentation tasks. In particular, our method with low-bit encoding can still achieve almost the same performance as its high-bit counterparts.

I Introduction

Deep Neural Networks (DNNs) have been successfully applied in many fields, such as image classification, object detection and natural language processing. Because of the great number of parameters in multilayer networks and their complex architectures, they require huge storage space and considerable power. For instance, the VGG-16 model has roughly 138 million parameters, requires more than 500 MB of storage space, and performs over 15 billion floating-point operations in the inference process. With the rapid development of chip (e.g., GPU and TPU) technologies, computing power has been dramatically improved. In the rapid evolution of deep learning, most researchers use multiple GPUs or computer clusters to train deeper and ever more complex neural networks. Nevertheless, energy consumption and the limitation of computing resources remain significant factors in industrial applications, and they are generally ignored in scientific research. In other words, the impressive results of many DNN algorithms obtained with GPUs still lag behind the demands of industry. DNNs can hardly be applied directly on mobile phones and embedded devices due to their limited memory and computing resources. Therefore, model compression and acceleration for DNNs are especially important for future industrial and commercial applications.

In recent years, many solutions have been proposed to improve the energy efficiency of hardware and achieve model compression or computational acceleration, such as network sparsification and pruning [15, 70, 14], low-rank approximation [5, 26, 61], architecture design [18, 52] and model quantization [22, 48, 35]. Network sparsification and pruning can dramatically decrease the number of connections in a network, and thus considerably reduce the computational load in the inference process without a significant loss in accuracy. Luo et al. [39] presented a unified framework and exploited a filter-level pruning method to discard less important filters. [61] used low-rank tensor decomposition to remove the redundancy in convolution kernels, which can serve as a generic tool for speed-up. Since networks contain redundant information, the most straightforward approach to removing it is to optimize the structure and yield smaller networks [24, 23]. For example, [18] proposed to use depthwise separable convolutions to build light networks for mobile applications. Most of those networks still utilize floating-point representations (i.e., full-precision values). However, [13] observed that full-precision representations of the weights and activations in DNNs are not necessary during training, and a nearly identical or slightly better accuracy may be obtained with lower-precision representation and calculation. Thus, many scholars are working on model quantization to achieve model compression and resource saving. Model quantization can be applied to many network architectures and applications. In particular, it can also be combined with other strategies (e.g., pruning) to achieve even more efficient performance.

Model quantization methods mainly include low-bit quantization approaches (e.g., BNN [21], XNOR-Net [48], TWN [28], and TBN [66]) and multi-bit (i.e., M-bit with M >= 2) quantization approaches (e.g., DoReFa-Net [76], INQ [74], ABC-Net [35], and LQ-Nets [72]). Low-bit quantization methods can achieve extreme model compression, computational acceleration and resource saving. For example, XNOR-Net [48] uses bitwise operations (i.e., xnor and bitcount) to replace full-precision matrix multiplication, achieving 58x speedups and 32x memory savings on CPUs [48]. As discussed in [32], FP-BNN attains an even higher acceleration ratio on FPGAs, which can reach a speedup of about 705x over a CPU in the peak condition, and is about 70x faster than a GPU. However, most models were proposed for a fixed precision and cannot be extended to other precisions. The representation capability of low-bit parameters is insufficient for many practical applications, especially for large-scale image classification (e.g., ImageNet) and regression tasks. Therefore, low-bit quantization methods suffer from significant performance degradation. In contrast, multi-bit quantization networks have better representation capability than low-bit ones, and thus can be applied to more complex real-world applications. Some scholars use multi-bit quantization to improve both the accuracy and compression ratio of their networks, but seldom consider computational acceleration [13, 65, 74]. Both ABC-Net [35] and LQ-Nets [72] used the linear combination of multiple binary parameters constrained to {-1, +1} to approximate full-precision weights and activations. Therefore, the complex full-precision matrix multiplication can be decomposed into simpler operations. [12] and [71] used the same technique to accelerate the training of convolutional neural networks (CNNs) and recurrent neural networks (RNNs). However, those methods not only increase the number of parameters many times, but also introduce scale factors that transform the original problem into an NP-hard problem, which makes the solution difficult and highly complex. The slow convergence rate in the training process and the lack of efficient acceleration strategies in the inference process are two typical issues of multi-bit quantization networks.

In order to extend the efficient bitwise operations (i.e., xnor and bitcount) of binary quantization to M-bit quantization without excessively increasing the time and space complexity, we propose a novel encoding scheme using {-1, +1} to decompose trained quantized neural networks (QNNs) into multi-branch binary networks. Our method bridges the gap between binary and full-precision quantization, and can be applied to many cases (from 1-bit to M-bit). Thus, our encoding mechanism can improve the utilization of hardware resources, and achieve parameter compression and computational acceleration. In our experiments, we not only validate the performance of our multi-precision quantized networks for image classification on CIFAR-10 and large-scale datasets (e.g., ImageNet), but also implement object detection and semantic segmentation tasks. The main advantages of our method are summarized as follows:

  • Easier training. To train our networks efficiently, we use the parameters of trained high-bit models sequentially to initialize low-bit models, which converges faster than straightforward approaches (e.g., DoReFa-Net, ABC-Net, and INQ). Hence, our networks can be trained within a short time, and only dozens of fine-tuning epochs are needed to achieve the accuracies reported in our experiments. Thus, our multi-branch binary networks can be easily applied to various practical engineering applications.

  • More options. Different from BWN, BNN, XNOR-Net and TWN, our method provides multiple encoding options with different encoding precisions. Naturally, each option with a specific encoding precision has a different calculation speed, resource requirement and experimental accuracy. Therefore, users can choose the appropriate option for their requirements.

  • Suitable for hardware implementations. After decomposition, instead of storing all encoding bits in standard data types such as char, int, float or double, the parameters can be stored individually as bit vectors. Thus, the smallest unit of data in electronic devices can be reduced from 8, 16, 32 or 64 bits to a single bit, which raises the utilization rate of resources and the compression ratio of the model. The data can then be encoded, calculated and stored at various encoding precisions.

  • Suitable for diverse networks and applications. In our experiments, we evaluate our method on many widely used deep networks (e.g., AlexNet, ResNet-18, ResNet-50, InceptionV3, MobileNetV2 and DeepLabV3) and real-world applications (e.g., image classification, object detection and semantic segmentation). Our encoding scheme could also be applied to many other networks and applications.

II Related Work

DNN quantization has attracted a lot of attention as a way to promote the industrial applications of deep learning, especially for mobile phones, video surveillance, unmanned aerial vehicles and autonomous driving. Many researchers use quantization methods to obtain lighter and more efficient networks. The research mainly focuses on the following three issues.

How to quantize the values of weights or activations in DNNs to achieve model compression? Quantization methods play a significant role in QNNs, and can be used to convert floating-point weights and activations into fixed-point (e.g., from 1-bit to 8-bit) counterparts to achieve model compression. [13] used the notation of integer and fractional bits to denote a fixed-point representation, and proposed a stochastic rounding method to quantize weights and activations. [65] used 8-bit quantization to convert weights into signed char and activation values into unsigned char, so that all the values are integers. For multi-state quantization, linear quantization is used in [22, 68, 78]. Besides, [43] proposed logarithmic quantization to represent data and used bitshift operations in the log-domain to compute dot products. For ternary weight networks [28], the weights are quantized to {-a, 0, +a}, where the scaling factor a is computed from the full-precision weights. In [77], the positive and negative states are trained together with the other parameters. When the states are constrained to 1-bit, [21] applied the sign function to quantize both weights and activations to -1 and +1. [48] further used {-a, +a} to represent the binary states, where a is a scaling factor computed from the full-precision weights.

How to achieve computational acceleration in the inference process? After quantized training, the weights and/or activation values are represented in a low-bit form (e.g., 1-bit or 2-bit), which has the potential for computational acceleration and memory saving. The most direct quantization is to convert floating-point parameters into corresponding fixed-point ones, and hardware acceleration can be achieved for fixed-point computation [13, 65, 25, 9]. When the weights are extremely quantized to 1-bit (i.e., binary, {-1, +1}) as in [4] or 2-bit (i.e., ternary, {-1, 0, +1}) as in [28, 67], matrix multiplication can be transformed into full-precision matrix addition and subtraction to accelerate computation. Especially when both the weights and activations are binarized, matrix multiplication can be transformed into highly efficient xnor and bitcount operations [21, 48]. Bitwise operations are based on 1-bit units, and thus they cannot be directly applied to multi-bit computation. [12, 35] used a series of linear combinations of {-1, +1} to approximate the parameters of full-precision convolution models, and then converted floating-point operations into multiple binary weight operations to achieve model compression and computational acceleration. Miyashita et al. [43] applied a logarithmic data representation to activations and weights, which transforms the dot product from the linear domain into the log-domain. Therefore, the dot product can be computed by bitshift operations to achieve computational acceleration.

How to train QNNs in the quantized domain? It is clear that some discrete functions, which may be non-differentiable or have zero derivatives everywhere, are required to quantize the weights or activation values in QNNs. Traditional stochastic gradient descent (SGD) methods are unsuitable for training such networks directly. Therefore, many researchers have devoted themselves to addressing this issue. The optimization methods can be divided into two categories: quantizing pre-trained models with or without retraining, such as [33, 75, 28, 35], and directly training quantized networks, such as [4, 21, 48, 68]. [22], [21, 8], and [71] use the straight-through estimator (STE) [1] in the training process to optimize deep networks. STE uses a nonzero surrogate gradient to approximate the gradient of a function that is non-differentiable or whose derivative is zero, and then applies SGD to update the parameters. Zhuang et al. [78] proposed a two-stage optimization procedure, which quantizes the weights and activations in a two-step manner. [42] and [47] use high-precision teacher networks to guide low-precision student networks to improve performance, which is known as knowledge distillation. [35, 72, 20], and [19] apply alternating optimization methods to train the network weights and their corresponding scaling factors.

Recently, research on model quantization has extended to mixed-precision quantization [2, 45, 64, 56], robust quantization [34, 53, 57], multiscale quantization [58] and post-training quantization [44, 31].

III Multi-Precision Quantized Neural Networks

In this section, we propose a novel encoding scheme using {-1, +1} to decompose QNNs into multi-branch binary networks. In each binary branch network, we use -1 and +1 as the basic elements to efficiently achieve model compression and forward-inference acceleration for QNNs. Different from fixed-precision neural networks (e.g., binary and ternary), our method can yield multi-precision networks and make full use of the advantage of bitwise operations to accelerate our networks.

III-A Computational Decomposition

As the basic computation in most layers of neural networks, matrix multiplication costs a lot of resources and is also the most time-consuming operation. Modern computers store and process data in binary format, and thus non-negative integers can be directly encoded by {0, 1}. Therefore, we propose a novel decomposition scheme to accelerate matrix multiplication as follows. Let x and w be two vectors of N non-negative integers. The dot product of these two vectors is written as:

x \cdot w = \sum_{i=1}^{N} x_i w_i.    (1)

All of the above operations consist of N multiplications and N-1 additions. Based on the above encoding scheme, each element of the vector x can be encoded in a binary form using M bits, i.e.,

x_i = \sum_{m=0}^{M-1} c_i^m \, 2^m, \quad c_i^m \in \{0, 1\}.    (2)

Then the right-hand side of formulation (2) can be converted into the following vector form:

x = \sum_{m=0}^{M-1} 2^m \, c_x^m,    (3)

where

c_x^m = [c_1^m, c_2^m, \dots, c_N^m]^T,    (4)

c_i^m \in \{0, 1\}.    (5)

In such an encoding scheme, the number of represented states is not greater than 2^M. We also encode the other vector w with a K-bit representation in the same way. Therefore, the dot product of the two vectors can be computed as follows:

x \cdot w = \sum_{m=0}^{M-1} \sum_{k=0}^{K-1} 2^{m+k} \left( c_x^m \cdot c_w^k \right),    (6)

where c_w^k is the k-th encoding element of w. From the above formula, the dot product is decomposed into M x K sub-operations, in which every element is 0 or 1. Because this encoding is restricted to {0, 1} and does not use a sign bit, the above representation can only be used to encode non-negative integers. However, it is impossible to limit the weights and the values of the activation functions to non-negative integers.
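To make the decomposition above concrete, the following short Python sketch (our own illustration; the variable names are not from the paper) checks numerically that the bitwise decomposition of Eq. (6) reproduces the ordinary dot product of two non-negative integer vectors:

```python
# A minimal numerical check of Eqs. (1)-(6), assuming the standard positional
# binary encoding described above.
import numpy as np

rng = np.random.default_rng(0)
M = 4                                   # number of encoding bits
N = 8                                   # vector length
x = rng.integers(0, 2 ** M, size=N)     # non-negative integers in [0, 2^M - 1]
w = rng.integers(0, 2 ** M, size=N)

# cx[m] holds the m-th bit of every element of x (a {0, 1} vector), as in Eq. (4).
cx = np.stack([(x >> m) & 1 for m in range(M)])
cw = np.stack([(w >> m) & 1 for m in range(M)])

# Eq. (6): the dot product becomes M*M binary sub-products weighted by 2^(m+k).
decomposed = sum(
    2 ** (m + k) * int(cx[m] @ cw[k]) for m in range(M) for k in range(M)
)
assert decomposed == int(x @ w)         # matches the ordinary dot product
```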

Fig. 1: Our architecture of the 2-bit encoding in a fully connected layer. Here, the 2BitEncoder (i.e., its two bit-encoding functions) is used to encode the input data and weights in the Encoder stage, and the weighted sum of the four branch results with fixed scale factors is the final output of the Decoder stage.

In order to extend the encoding space to negative integers and reduce the computational complexity, we propose a new encoding scheme, which uses {-1, +1} as the basic elements of our encoder rather than {0, 1}. Except for the difference of basic elements, the encoding scheme is similar to the rule above, and is formulated as follows:

x = \sum_{m=0}^{M-1} 2^m \, b^m, \quad b^m \in \{-1, +1\},    (7)

where M denotes the number of encoding bits, which can represent 2^M states. The encoding space is thereby extended to the odd integers {-(2^M - 1), \dots, -3, -1, +1, +3, \dots, 2^M - 1}. We can use multiple bitwise operations (i.e., xnor and bitcount) to effectively compute the above vector multiplications. Therefore, the dot product of two vectors encoded in this way can be computed as follows:

x \cdot w = \sum_{m=0}^{M-1} \sum_{k=0}^{K-1} 2^{m+k} \left( b_x^m \odot b_w^k \right),    (8)

where \odot denotes the bitwise operation (xnor followed by bitcount), whose details are defined below. Therefore, this operation mechanism is suitable for all vector/matrix multiplications.
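The {-1, +1} counterpart can be checked in the same way. The sketch below (again our own illustration, with helper names we made up) encodes odd integers from the extended encoding space into their M {-1, +1} digits and evaluates each binary sub-product of Eq. (8) with an xnor-plus-bitcount routine:

```python
# A sketch of the {-1, +1} decomposition in Eqs. (7)-(8): every odd integer in
# [-(2^M - 1), 2^M - 1] is written as a weighted sum of M values in {-1, +1},
# and each binary sub-product is evaluated with xnor + bitcount.
import numpy as np

def to_pm1_bits(q, M):
    """Return the M {-1,+1} digit vectors b_m of the odd integers in q."""
    u = (q + 2 ** M - 1) // 2                     # shift to an unsigned code
    return np.stack([2 * ((u >> m) & 1) - 1 for m in range(M)])

def xnor_bitcount_dot(a, b):
    """Dot product of two {-1,+1} vectors via xnor and bitcount."""
    pa, pb = (a > 0), (b > 0)                     # pack +1 -> 1, -1 -> 0
    agree = np.count_nonzero(~(pa ^ pb))          # xnor, then popcount
    return 2 * agree - a.size

rng = np.random.default_rng(0)
M, N = 3, 16
# Odd integers in the encoding space {-(2^M-1), ..., -1, +1, ..., 2^M-1}.
x = rng.integers(-(2 ** (M - 1)), 2 ** (M - 1), size=N) * 2 + 1
w = rng.integers(-(2 ** (M - 1)), 2 ** (M - 1), size=N) * 2 + 1

bx, bw = to_pm1_bits(x, M), to_pm1_bits(w, M)
decomposed = sum(
    2 ** (m + k) * xnor_bitcount_dot(bx[m], bw[k])
    for m in range(M) for k in range(M)
)
assert decomposed == int(x @ w)
```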

III-B Model Decomposition

In various neural networks, matrix multiplication is one basic operation in some commonly used layers (e.g., fully connected layers and convolution layers). Based on the above computational decomposition mechanism of matrix multiplication, we can decompose the quantized networks into a more efficient architecture.

Let M be the number of encoding bits, and thus the encoding space can be defined as {-(2^M - 1), \dots, -3, -1, +1, +3, \dots, 2^M - 1}, in which each state is an odd integer whose absolute value is bounded by 2^M - 1, as determined by the number of encoding bits. However, the parameters in DNNs are usually distributed within the interval [-1, 1]. In order to apply our decomposition mechanism to network models, we use the following formulation for the computation of common layers in neural networks:

F(x, w) \approx \lambda \cdot F\big(\mathcal{E}_M(x), \mathcal{E}_K(w)\big),    (9)

where F denotes the basic operation of the layer (e.g., a fully connected or convolution operation), M is the number of bits used to encode the input x, K is the number of bits used to encode the weights w, and \mathcal{E} is the encoding process. Here, \lambda can be viewed as a scaling factor, and is usually set to 1/((2^M - 1)(2^K - 1)). If there is a batch normalization layer after the above layer, the scaling factor can be omitted from the computation of our decomposition scheme.

For instance, we use a 2-bit encoding scheme (i.e., M = K = 2) for a fully connected layer to introduce the mechanism of our decomposition model, as shown in Fig. 1. We define an "Encoder" that can be used in the 2BitEncoder function (i.e., the two bit-encoding functions defined below) for encoding the input data. Here, x is encoded by x^1 and x^0, where x^1 is the high-bit data and x^0 is the low-bit data. Therefore, we have the following formula: x = 2x^1 + x^0. In the same way, the weight w can be converted into w^1 and w^0. After cross multiplications, we get the four intermediate variables {x^1 w^1, x^1 w^0, x^0 w^1, x^0 w^0}. In other words, each multiplication can be considered as a binarized fully connected layer, whose elements are -1 or +1. This decomposition produces multi-branch layers, and thus we call our networks multi-branch binary networks (MBBNs). For instance, we decompose the 2-bit fully connected operation into four binary branch operations, which can be accelerated by bitwise operations, and then the weighted sum of those four results with the fixed scaling factors (e.g., 4, 2, 2, 1 and the overall factor 1/9 in Fig. 1) is the final output. Besides fully connected layers, our decomposition scheme is suitable for the convolution and deconvolution layers in various deep neural networks.
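As a sanity check of this 2-bit decomposition, the following self-contained sketch (our reading of Fig. 1, not the authors' code) splits integer codes in {-3, -1, 1, 3} into high and low {-1, +1} bits, evaluates the four binary branches, and recovers the dot product of the quantized values after rescaling by 1/9:

```python
# A small sketch of the 2-bit decomposition: quantized inputs/weights take values
# in {-1, -1/3, 1/3, 1}, their integer codes in {-3, -1, 1, 3} split into a high
# bit x1 and a low bit x0 (q = 2*x1 + x0, each in {-1, +1}), and the output is
# the weighted sum of four binary branches rescaled by 1/9.
import numpy as np

rng = np.random.default_rng(0)
N = 32
codes = np.array([-3, -1, 1, 3])
qx = rng.choice(codes, size=N)            # integer codes of the inputs
qw = rng.choice(codes, size=N)            # integer codes of the weights

split = lambda q: (np.sign(q), np.where(np.abs(q) == 3, np.sign(q), -np.sign(q)))
x1, x0 = split(qx)                        # high / low {-1,+1} bits: q = 2*x1 + x0
w1, w0 = split(qw)

branches = 4 * (x1 @ w1) + 2 * (x1 @ w0) + 2 * (x0 @ w1) + (x0 @ w0)
full = (qx / 3.0) @ (qw / 3.0)            # dot product of the quantized values
assert np.isclose(branches / 9.0, full)   # the 1/9 factor recovers the result
```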

III-C M-bit Encoding Functions

As an important part of various DNNs, activation functions enhance the nonlinear representation capability of networks. In our model decomposition method, the encoding function also plays a critical role: it encodes the input data into multiple bits, each of which is -1 or +1, and those numbers represent the encoding of the input data. For some other QNNs, several quantization functions have been proposed; however, it is not clear what the nonlinear mapping between quantized numbers and encoded bits is. In this subsection, a family of M-bit encoding functions is proposed to produce the element of each bit following the encoding rules.

TABLE I: Activation functions that are used to limit input data to a fixed numerical range.

Before encoding, the data should be limited to a fixed numerical range. Table I lists the four activation functions. The sign function projects the input data to {-1, +1} and achieves the binary encoding of weights and activations [21, 32]. Since SGD converges faster when training networks with an activation function that keeps its linear characteristics, we propose a new activation function that retains these linear characteristics within a specific range and limits the input data to [-1, 1]. Different from the general activation functions mentioned above, the output of our M-bit encoding function defined below should be M numbers, each of which is -1 or +1; those M numbers represent the encoding of the input data. Therefore, the dot product can be computed by Eq. (9). In addition, when we use 2 bits to encode data constrained to [-1, 1], there are 4 encoded states, as shown in Table II. The nonlinear mapping between quantized real numbers and their corresponding encoded states is also given in Table II. From the table, we can see that there is a linear factor between quantized real numbers and encoded states (i.e., 1/3 for Table II). When we use Eq. (9) to compute the multiplication of two encoded vectors, the value is expanded 9 times (3 x 3). Therefore, the result is multiplied by its scaling factor to get the final result, e.g., 1/9 as in Fig. 1.

Number Space | [-1, -2/3) | [-2/3, 0) | [0, 2/3) | [2/3, 1]
Quantized Number | -1 | -1/3 | 1/3 | 1
Encoded States | (-1, -1) | (-1, +1) | (+1, -1) | (+1, +1)
TABLE II: Quantized real numbers and their encoded states.
Fig. 2: Encoding functions. (a) and (b) denote the encoding functions of the second bit and the first bit of 2BitEncoder. (c), (d) and (e) denote the encoding functions of the third bit, the second bit and the first bit of 3BitEncoder, respectively.

From the above results, we obviously need an activation function to achieve the transformation from a quantized number to its corresponding encoded state. The most straightforward transformation method is to determine the encoded states from the numerical value through a look-up table, which is suitable for many hardware platforms (e.g., FPGA and ASIC). However, its derivative can hardly be defined, and thus this method cannot be applied in the training process. Therefore, we need a differentiable encoding function that can serve as an activation function to train the network.

This is a typical single-input multiple-output problem. The Fourier neural network (FNN) [54, 40] can be used to transform a nonlinear optimization problem into a combination of multiple linear ones [41]. It has been successfully used for stock prediction [40], aircraft engine fault diagnostics [63], harmonic analysis [10], lane departure prediction [62], and so on. In [46], it has been verified that trigonometric activation functions can achieve faster and better results than established monotonic functions for certain tasks. Fig. 2 illustrates the 2-bit and 3-bit encoding functions. We can see that those encoding functions are required to be periodic, and they should have different periods. Inspired by FNN, we apply trigonometric functions as the basic encoding functions, whose curves are highlighted in red. Finally, we use the sign function to determine whether the value is -1 or +1. Hence, we use several trigonometric functions to approximate the nonlinear mapping between quantized numbers and their encoded states, which is called the MBitEncoder function and is formulated as follows:

(10)

where one component serves as the encoding function of the highest bit of the MBitEncoder function. For example, for the 2-bit case, the 2BitEncoder function is formulated as follows:

(11)

where the first component is the encoding function of the first bit of the 2BitEncoder function, and the second component is the encoding function of its second bit. Their periodicities are clearly different because one of them needs to distinguish more states.
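As an illustration of such trigonometric encoding functions, the sketch below gives one possible 2BitEncoder consistent with the description above (signs of sinusoids with different periods that map [-1, 1] onto the encoded bit pairs of Table II); it is our own construction and not necessarily the authors' exact formulation:

```python
# One possible 2BitEncoder built from sign(trigonometric) functions, checked
# against the four quantized numbers and encoded states of Table II.
import numpy as np

def phi_high(x):
    """Candidate encoding function for the first (high) bit."""
    return np.sign(np.sin(0.75 * np.pi * x))

def phi_low(x):
    """Candidate encoding function for the second (low) bit; shorter period."""
    return np.sign(-np.sin(1.5 * np.pi * x))

# Representative quantized numbers and the integer states they should encode to.
for xq, expected in [(-1.0, -3), (-1/3, -1), (1/3, 1), (1.0, 3)]:
    state = 2 * phi_high(xq) + phi_low(xq)       # decode the two {-1,+1} bits
    assert state == expected
```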

IV Training Networks

To train QNNs, we usually face the problem that the derivative is not defined, and thus traditional gradient-based optimization methods are not applicable. [21] presented the HTanh function to binarily quantize both weights and activations, and also defined its derivative to support the back-propagation (BP) training process. They used the loss computed with the binarized parameters to update the full-precision parameters. Similarly, [48] also proposed to update the weights with the help of a second set of parameters. [1] discussed that using STE to train models containing discrete variables can obtain faster training and better performance. By the equivalent transformation of our method, QNNs can be decomposed into MBBNs. The goal of training is to obtain the optimal parameters, which are quantized to -1 or +1. Obviously, we have two ways to train our model: the first is to train the decomposed MBBN directly, and the second is to first train the quantized network to obtain the quantized weights, and then use our equivalent transformation to convert the quantized weights into the bit-encoded binary weights. Next, we describe the two ways in detail.

IV-A Training Multi-Branch Binary Networks

Generated by the decomposition of QNNs, MBBNs need to use M-bit encoding functions as activation functions to obtain the elements of each bit, which allows more efficient bitwise operations to replace arithmetic operations. As described in Section III-C, the M-bit encoding functions are combinations of trigonometric and sign functions. The sign function in the encoder makes it difficult to implement the BP process. Thus, we approximate the derivative of the encoding function with respect to its input as follows:

(12)

We take the 2-bit encoding as an example to describe the optimization method of MBBNs.

(13)

Besides the activations, all the weights of the network also need to be quantized to binary values. Therefore, an MBBN can be viewed as a special binarized neural network [21]. In this training method, we use the same mechanisms to constrain and update the parameters, and use W and W^b to represent the real-valued weights and the binarized weights, respectively. W is constrained between -1 and +1 by the HTanh() function to avoid excessive growth. Instead of the more complex M-bit encoding functions, we use the sign function directly to binarize the weights. Therefore, the binarization function transforming W into W^b can be directly formulated as follows:

W^b = sign(W).    (14)

For this function, we define the gradient of each component to constrain the search space. In the training process, we use W^b to compute the loss and gradients, which are then used to update W. That is, the input of the sign function is constrained to [-1, 1] by HTanh, which also speeds up convergence. The parameters of the whole network are thus updated under a differentiable approximation.

We give the detailed training algorithm of MBBN in Algorithm 1, where we describe the main computing process of the network; some details about the batch normalization, pooling and loss layers are omitted. Note that forward() is the basic operation in DNNs, such as the convolutional and fully connected operations; these operations can be accelerated by binary bitwise operations. We use backward() and BackMBitEncoder() to specify how to backpropagate through the forward() and MBitEncoder() operations, respectively. Update() specifies how to update the parameters with their known gradients; for example, we usually use Adam or SGD as the optimization algorithm.

Require: A mini-batch of inputs and targets, the previous weights W, and the learning rate η.
Ensure: The updated weights W.
     1. Computing the parameter gradients:
     1.1 Forward propagation: for each layer, encode the activations with MBitEncoder(), binarize the weights with Eq. (14), and apply forward() to the resulting binary branches (followed by batch normalization when present).
     1.2 Backward propagation: compute the gradient of the loss at the output, then propagate it back layer by layer through backward() and BackMBitEncoder().
     2. Accumulating the parameter gradients: for each layer, accumulate the weight gradients and update the real-valued weights with Update() (e.g., Adam or SGD) and the learning rate η.
Algorithm 1 Training an L-layer MBBN with binarized weights and M-bit encoded activations. Weights and activations are quantized according to Eqs. (14) and (10), respectively.
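The following runnable toy example (ours, heavily simplified to a single 1-bit branch layer and an L2 loss) illustrates the training pattern of Algorithm 1: binarize in the forward pass, backpropagate with the clipped straight-through estimator, and update the clipped real-valued weights:

```python
# A toy, single-layer version of the MBBN training loop described above.
import numpy as np

rng = np.random.default_rng(0)
W = rng.uniform(-1, 1, size=(4, 8))          # real-valued weights W
x = rng.uniform(-1, 1, size=(16, 8))         # a mini-batch of inputs in [-1, 1]
y = rng.normal(size=(16, 4))                 # toy regression targets
lr = 0.01

for step in range(100):
    xb = np.sign(x)                          # 1-bit encoded activations
    Wb = np.sign(W)                          # binarized weights W^b
    out = xb @ Wb.T                          # binary branch forward pass
    grad_out = (out - y) / len(x)            # dL/dout for an L2 loss
    grad_Wb = grad_out.T @ xb                # gradient w.r.t. binarized weights
    grad_W = grad_Wb * (np.abs(W) <= 1.0)    # clipped straight-through estimator
    W = np.clip(W - lr * grad_W, -1.0, 1.0)  # update and clip real-valued weights
```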

IV-B Training Quantized Networks

The above training scheme is proposed to optimize the binary networks obtained by our model decomposition method. However, this decomposition produces multiple times as many parameters as the original network. If we optimize the binary network directly, it may easily fall into local optima and converge slowly. Based on the nonlinear mapping between quantized numbers and their encoded states, we can instead directly optimize the quantized network and then utilize our equivalent transformation to decompose it. The decomposed parameters can then be used by the MBBN in the inference process to achieve computational acceleration.

Two quantization schemes are usually applied in QNNs [21, 76, 43]: linear quantization and logarithmic quantization. Due to the requirement of our encoding mechanism, linear quantization is used to quantize our networks. We quantize a tensor linearly into M-bit, which can be formulated as follows:

(15)

where the clamp function truncates all values into the range [-α, α], α is a learned parameter, and the scaling factor is determined by α and the number of quantization bits. The rounding operation quantizes a real number to a certain state; we call this a hard ladder function, which segments the input space into multiple states. Table II lists the four quantized states for the 2-bit case. However, the derivative of this function is almost zero everywhere, and thus it cannot be applied directly in the training process. Inspired by STE [1], we use the same technique to speed up the computing process and yield better performance: we use the loss computed with the quantized parameters to update the full-precision parameters. Note that for our encoding scheme with low-precision quantization (e.g., binary), we use Adam to train our model; otherwise SGD is used. Algorithm 2 shows the main training process of our quantized network, where the backward function of the quantizer specifies how to backpropagate through the quantization function of Eq. (15).

Require: A mini-batch of inputs and targets, the previous weights W, and the learning rate η.
Ensure: The trained weights W.
     1. Computing the parameter gradients:
     1.1 Forward propagation: for each layer, quantize the activations and weights with Eq. (15) and apply the layer operation (followed by batch normalization when present).
     1.2 Backward propagation: compute the gradient of the loss at the output, then propagate it back layer by layer through the layer operation and the quantizer (using the straight-through estimator).
     2. Accumulating the parameter gradients: for each layer, accumulate the weight gradients and update the full-precision weights with the chosen optimizer (Adam or SGD) and the learning rate η.
Algorithm 2 Training an L-layer QNN with M-bit weights and K-bit activations. Weights and activations are quantized by using Eq. (15).
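As an illustration of the linear quantizer described above (a sketch assuming a standard clamp-and-round form; the exact formulation of Eq. (15) may differ), the function below maps a tensor into 2^M evenly spaced states inside a learned range [-alpha, alpha]:

```python
# One common form of linear quantization consistent with the description:
# clamp to [-alpha, alpha], scale, round to one of 2^M evenly spaced states.
import numpy as np

def linear_quantize(x, alpha, M):
    """Quantize x into 2^M evenly spaced values inside [-alpha, alpha]."""
    levels = 2 ** M - 1
    xc = np.clip(x, -alpha, alpha)                     # clamp to the learned range
    q = np.round((xc + alpha) / (2 * alpha) * levels)  # integer state in [0, levels]
    return (2 * q / levels - 1) * alpha                # back to the quantized value

x = np.linspace(-1.5, 1.5, 7)
print(linear_quantize(x, alpha=1.0, M=2))              # states: -1, -1/3, 1/3, 1
```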

V Experiments

In this section, we compare the performance of our method with that of several typical methods on image classification tasks on CIFAR-10 and ImageNet, object detection tasks on PASCAL VOC2007/2012, and semantic segmentation tasks on PASCAL VOC2012. In recent years, many scholars have devoted themselves to improving the performance (e.g., accuracy and compression ratio) of QNNs, while very few researchers have studied their engineering acceleration, which is an important obstacle to industrial adoption. Therefore, we mainly focus on an acceleration method that is especially suitable for engineering applications.

V-A Image Classification

Methods | Compression ratio | CIFAR-10 (LeNet) | ImageNet Top-1 (ResNet-18) | ImageNet Top-5 (ResNet-18) | ImageNet Top-1 (ResNet-50) | ImageNet Top-5 (ResNet-50)
BWN [4] 32 90.10% 60.80% 83.00% - -
BNN [21] 32 88.60% 42.20% 67.10% - -
XNOR-Net [48] 32 - 51.20% 73.20% - -
TWN [28] 16 92.56% 61.80% 84.20% - -
DoReFa-Net[1/4-bit] [76] 32 - 59.20% 81.50% - -
ABC-Net[5-bit] [35] 6.4 - 65.00% 85.90% 70.10% 89.70%
INQ[5-bit] [74] 6.4 - 68.92% 89.10% 74.81% 92.45%
SYQ[2/8-bit] [7] 16 - 67.70% 87.80% 72.30% 90.90%
Full-Precision 1 91.40% 70.10% 89.81% 76.20% 93.21%
Encoded activations and weights
MBBN (1-bit) 32 90.39% 52.33% 79.51% 56.65% 80.78%
MBBN (2-bit) 16 91.06% 66.07% 87.28% 73.04% 91.33%
MBBN (3-bit) 10.7 91.27% 69.17% 89.24% 75.21% 92.54%
MBBN (4-bit) 8 91.15% 70.31% 89.89% 76.22% 93.09%
MBBN (5-bit) 6.4 90.92% 70.98% 90.17% 76.53% 93.27%
MBBN (6-bit) 5.3 91.01% 71.42% 90.51% 76.61% 93.44%
MBBN (7-bit) 4.6 90.20% 71.46% 90.66% 76.65% 93.54%
MBBN (8-bit) 4 90.43% 71.46% 90.54% 76.71% 93.40%
TABLE III: Classification accuracies of Lenet on CIFAR-10, and ResNet-18 and ResNet-50 on ImageNet.

CIFAR-10 is an image classification dataset with 50,000 training and 10,000 testing images. All the images are 32x32 color images representing airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships and trucks.

We validated our method with different bit-encoding schemes, in which activations and weights are treated equally, that is, both use the same number of encoding bits. Table III lists the results of our method and many state-of-the-art models, including BWN [4], BNN [21], XNOR-Net [48], TWN [28], DoReFa-Net [76], ABC-Net [35], INQ [74], and SYQ [7]. Here we used the same network architecture as in [4] and [21], except for the encoding functions. We used the proposed activation function and employed Adam to optimize all the parameters of the networks. From all the results, we can see that the representation capability of our networks encoded with 1 or 2 bits is sufficient for small-scale datasets such as CIFAR-10. Our method with low-precision encoding (e.g., 2-bit) achieves nearly the same classification accuracy as its higher-precision versions and the full-precision model, while attaining up to a 32x memory saving (for 1-bit encoding) compared with the full-precision counterpart. When both activations and weights are constrained to 1-bit, our network structure is similar to BNN [21], and our method yields even better accuracy, mainly because of our proposed encoding functions.

ImageNet: We further compared the performance of our method with all the other methods mentioned above on the ImageNet (ILSVRC-2012) dataset [51]. This dataset covers 1,000 categories, with over 1.2 million images in the training set and 50,000 images in the validation set. Here, we used ResNet-18 and ResNet-50 to illustrate the effectiveness of our method, and used Top-1 and Top-5 accuracies to report classification performance. For large-scale training sets (e.g., ImageNet), training classical full-precision models usually costs plenty of time and requires sufficient computing resources. It is even more difficult to train quantized networks, and thus the initialization of parameters is particularly important. Here, we used the proposed activation function to constrain the activations. In particular, the full-precision model parameters trained with this activation function can be directly used to initialize our highest-bit (8-bit) quantized network. After a small amount of fine-tuning, the 8-bit quantized network is well trained. Similarly, we used the 8-bit model parameters to initialize the 7-bit quantized network, and so on. There is a special case: when the 1BitEncoder function is used to encode the activations, all the activations are constrained to {-1, +1}; thus, we used HTanh as the activation function for 1-bit encoding. Note that we used SGD to optimize the parameters when the number of encoding bits is not less than 3; when the number of encoding bits is 1 or 2, the convergence of Adam is faster than that of SGD, as discussed in [21, 48].
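The initialization schedule can be summarized by a small helper like the one below (our sketch; train_fn is a hypothetical stand-in for the actual fine-tuning routine):

```python
# A sketch of the progressive initialization described above: each lower-bit
# model is fine-tuned starting from the weights of the previously trained
# higher-bit model, so only a short fine-tuning run is needed per step.
def progressive_quantization(train_fn, weights, bit_schedule=(8, 7, 6, 5, 4, 3, 2, 1)):
    for bits in bit_schedule:
        weights = train_fn(weights, bits)   # short fine-tuning run at this precision
    return weights

# toy usage with a dummy "training" function that just perturbs the weights
dummy_train = lambda w, bits: [round(v, 3) for v in w]
print(progressive_quantization(dummy_train, [0.31415, -0.27182]))
```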

Table III lists the performance (accuracies and compression ratios) of our method and the state-of-the-art models mentioned above. The experimental results show that our method performs much better than its counterparts. In particular, our method with 5-bit encoding significantly outperforms ABC-Net [5-bit] [35]. Moreover, our networks can be trained in a very short time to achieve the accuracies in our experiments; only dozens of fine-tuning epochs are needed. Different from BWN and TWN, which quantize only the weights and not the activation values, our method quantizes both weights and activations simultaneously. Although BWN and TWN obtain slightly higher accuracies than our 1-bit quantized model, we obtain significantly better speedups, while the speedup ratios of existing methods such as BWN and TWN are limited to about 2x. Due to their limited and fixed representation ability, existing methods (such as BWN, TWN, BNN and XNOR-Net) cannot satisfy various quantization precision requirements. In contrast, our method provides multiple encoding choices, and hence our encoded networks with different encoding precisions have different speedup ratios, memory requirements and experimental accuracies.

Layer name | ResNet-18 | Precision#1 | Precision#2
Conv1 | 7x7, 64, stride 2 | |
Conv2_x | | |
Conv3_x | | |
Conv4_x | | |
Conv5_x | | |
FC | 1000 | |
Compression ratio | | 5.0x | 14.1x
Accuracy | | Top-1 = 71.45%, Top-5 = 90.60% | Top-1 = 69.32%, Top-5 = 89.37%
TABLE IV: Classification accuracies of ResNet-18 with different bit precisions on ImageNet.

In the above experiments, all the activation values and weights are constrained to the same bit precision. However, our method has a further advantage: it supports mixed precision, that is, the weights and activations of each layer can have different bit widths. This makes model quantization and acceleration more flexible, and thus meets more requirements for model design. We used ResNet-18 as an example to verify the effect of mixed precision. Table IV lists the network structure, the bit precision of each layer, the compression ratios, and the classification accuracies on ImageNet. For convenience, we set the same bit precision within each residual block of ResNet-18. The experimental results show that mixed precision can obtain higher classification accuracy. As shown in Table IV, the classification results of Precision#1 are even higher than those of the full-precision counterpart, while obtaining a 5.0x compression ratio. We also set the bit precision of all the activation values to 4 and used different bit precisions for the weights of each block (Precision#2 in Table IV), achieving a 14.1x compression ratio and 69.32% Top-1 accuracy.
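A mixed-precision setup of this kind can be expressed as a simple per-block configuration; the bit widths below are purely illustrative placeholders, not the settings reported in Table IV:

```python
# A hypothetical per-block precision configuration in the spirit of Table IV;
# each entry gives (activation_bits, weight_bits). The concrete values are
# illustrative only.
mixed_precision_config = {
    "conv1":   (8, 8),   # keep the first layer at a higher precision
    "conv2_x": (4, 2),
    "conv3_x": (4, 2),
    "conv4_x": (4, 2),
    "conv5_x": (4, 1),
    "fc":      (8, 8),   # and the final classifier as well
}

for layer, (a_bits, w_bits) in mixed_precision_config.items():
    print(f"{layer}: {a_bits}-bit activations, {w_bits}-bit weights")
```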

V-B Object Detection

Besides image classification, QNNs can also be used for object detection tasks [69, 30]. Hence, we use our method to quantize trained networks within the Single Shot MultiBox Detector (SSD) framework [37] for object detection, which involves both a coordinate regression task and a classification task. The regression task places higher demands on the encoding precision, and thus object detection presents a new challenge for QNNs.

Backbone Precision mAP aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv
InceptionV3 full(32-bit) 78.8 80.8 86.8 81.5 72.1 53.6 86.9 87.7 87.4 62.0 81.4 75.3 88.0 87.1 86.0 79.3 58.7 81.5 72.9 87.2 80.2
4-bit 78.0 79.0 85.8 79.8 72.4 49.8 83.5 86.8 86.5 60.7 84.2 75.4 87.5 86.6 84.4 78.9 57.5 81.9 73.3 86.7 78.8
2-bit 77.2 79.1 85.3 78.9 67.8 51.4 85.1 86.4 87.8 59.3 82.7 73.3 87.0 85.5 83.7 77.7 56.8 78.8 70.7 87.9 78.7
1-bit 66.6 71.1 76.2 62.7 59.2 36.0 75.0 80.9 76.2 40.8 71.6 67.4 73.8 78.8 74.6 71.1 41.9 64.9 63.2 79.3 66.9
BWN 44.8 56.5 57.2 26.8 34.6 17.4 51.6 59.6 51.2 22.3 41.5 55.3 42.1 63.4 54.9 53.3 15.7 44.0 40.2 60.2 50.2
ResNet-50 full(32-bit) 77.2 79.5 85.0 81.2 71.1 55.3 83.7 87.3 87.3 60.4 84.5 72.6 85.0 83.5 82.9 80.3 50.7 78.7 72.4 85.4 77.6
4-bit 78.4 80.7 86.5 79.8 70.8 59.4 85.6 87.5 88.7 64.1 84.1 72.1 85.4 85.4 86.5 80.6 53.1 79.1 73.8 85.4 78.9
2-bit 77.1 78.6 84.8 77.9 70.0 55.6 86.8 87.0 86.9 61.6 82.6 72.5 81.1 84.1 83.6 80.5 50.4 80.1 73.1 85.5 78.3
1-bit 66.4 70.7 79.2 58.4 56.9 41.3 73.5 81.5 77.0 47.6 66.6 67.6 68.4 77.0 76.6 73.2 40.0 64.3 64.1 77.3 67.4
BWN 45.6 53.1 49.8 30.0 37.9 20.4 53.7 69.4 47.8 23.4 44.0 51.5 47.8 53.6 59.3 57.9 21.9 42.6 36.4 61.2 49.9
MobileNetV2 full(32-bit) 72.6 77.8 82.6 71.2 64.4 48.5 79.0 86.0 81.4 54.0 77.9 65.1 79.3 79.3 78.7 75.5 49.5 77.3 65.3 84.5 74.0
4-bit 71.5 77.3 81.3 70.8 64.3 46.0 77.6 84.5 79.9 51.9 75.8 66.5 78.4 78.4 78.2 75.6 47.5 74.0 68.4 82.2 70.4
2-bit 70.5 77.2 81.0 68.5 60.4 45.3 77.1 84.4 79.1 51.8 73.4 65.2 77.9 77.9 77.3 75.1 46.2 72.3 66.5 81.1 71.4
1-bit 55.1 68.7 69.6 42.7 43.9 27.7 62.9 76.5 56.9 29.6 56.2 56.1 51.8 51.8 68.1 64.1 26.5 55.8 49.6 72.1 58.4
BWN 21.8 34.7 18.6 9.3 16.6 9.1 34.3 34.3 20.4 7.5 12.9 26.3 16.0 35.5 32.2 26.2 9.1 15.9 18.3 34.1 24.0
TABLE V: Comparison of the object detection performance of our method with different encoding bits.
Fig. 3: Detection results of our method with different precisions based on the InceptionV3 network on some examples of the VOC2007 test set.
Backbone Precision mAP aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv
InceptionV3 , 69.9 70.4 78.0 68.5 59.8 40.7 79.2 82.1 83.3 48.7 72.8 74.6 77.6 80.0 81.1 72.3 41.4 67.0 68.2 80.3 71.3
, 69.4 74.1 79.3 67.3 61.3 41.2 74.7 82.3 79.6 48.1 74.7 67.8 75.2 80.2 75.4 73.5 41.8 72.6 66.6 81.8 70.4
, 67.6 80.8 86.6 79.7 73.2 53.8 84.4 86.9 87.6 60.4 83.5 76.3 87.8 86.2 85.5 78.1 56.0 79.9 70.8 87.0 77.8
, 64.7 71.1 74.9 53.8 55.9 33.7 72.3 79.4 77.0 43.2 66.8 68.8 69.1 76.1 74.0 70.3 37.1 61.7 67.1 78.4 64.0
TABLE VI: Comparison of object detection results of our method with different encoding bits.

In this experiment, all the models were trained on the PASCAL VOC2007 and VOC2012 trainval sets, and tested on the VOC2007 test set. Here, we used MobileNetV2 [52], InceptionV3 [60] and ResNet-50 [17] as the basic networks to extract features in the SSD framework. Note that we quantize the basic network layers and keep the other layers at full precision. All the networks are trained with a large input image size. In order to quickly achieve convergence of the network parameters, we adopted the same training strategy as in the above experiments, in which the parameters of high-bit models are used to initialize low-bit models. We used the average precision (AP) to measure the performance of the detector on each category and the mean average precision (mAP) to evaluate the multi-class detection performance. Table V shows the object detection results of our method and its counterparts, including the full-precision models and BWN. It is clear that our method with the 4-bit encoding scheme yields very similar performance to the full-precision counterparts. Especially for ResNet-50, we achieve the best result of 78.4 mAP with 4-bit precision, which is better than the full-precision counterpart by 1.2 mAP. When we use 2 bits to encode both activation values and weights, the evaluation metrics drop slightly. If the number of encoding bits is constrained to 1, the performance deteriorates visibly. Especially for MobileNetV2, when the number of encoding bits is reduced from 2 to 1, the mAP drops from 70.5 to 55.1, and for some categories (e.g., bottle, chair and plant) the AP results drop dramatically. In our experiments, we also implemented the binary-weight form of the different basic networks. Obviously, our method with 1-bit encoding has a clear accuracy advantage over BWN [4]. For object detection tasks, networks with a fixed, extremely low bit precision appear insufficient. Especially for MobileNetV2, BWN achieves only 21.8 mAP, which is not suitable for many industrial applications.

To compare the performance of models with different precisions more conveniently, Fig. 3 shows some detection results on examples from the VOC2007 test set. Detections with scores above a fixed threshold are shown, and the outputs produced by the different precisions are shown in the same column. From all the results, we can see that the 4-bit and 2-bit encoding precisions yield performance similar to their full-precision counterparts; there is almost no obvious difference between those detection examples. However, when the encoding precision is constrained to 1-bit, some offsets appear in the bounding boxes (e.g., bike), and some small or occluded objects may be missed (e.g., cow and sheep).

In fact, it is not necessary to constrain the activation values and weights to the same bit precision. Table VI shows the results of our method with four different bit-precision configurations based on the InceptionV3 network. The first two configurations have the same computational complexity but swap the numbers of bits assigned to the activations and the weights, and they yield different compression ratios and results: the configuration that assigns more bits to the activations performs better. Likewise, the third configuration, which encodes the activations with more bits than the weights, is clearly better than the fourth, which does the opposite. Therefore, the diversity of the activation values is more important than that of the weights. In order to mitigate the performance degradation of QNNs, we should encode the activation values with a higher precision than the weights.

For the object detection task, our method with a relatively low encoding precision (e.g., 4-bit) yields good performance within the SSD framework. Similarly, it can be applied to other frameworks, e.g., R-CNN [11], Fast R-CNN [50], SPP-Net [16] and YOLO [49].

V-C Semantic Segmentation

As a typical task in the field of autonomous driving, image semantic segmentation has attracted many scholars and companies to invest in research. The application scenarios of this task usually rely on small devices, and therefore model compression and computational acceleration are particularly important for real-world applications. We evaluate the performance of our method on the PASCAL VOC2012 semantic segmentation benchmark, which consists of 20 foreground object classes and one background class. We use pixel accuracy (pixAcc) and mean IoU (mIoU) as the metrics to report performance.

We used DeepLabV3 [3] based on ResNet-101 [17] as the basic network in this experiment. Similarly, we use different encoding precisions to evaluate the semantic segmentation performance of our method. Fig. 5 shows the evaluation metrics for different encoding precisions, where the dotted line denotes the mIoU of the full-precision network and the solid line denotes its pixAcc. From the other curves, we can see that the semantic segmentation performance improves as the encoding precision increases. With the highest tested encoding precision, our method achieves its best performance in both mIoU and pixAcc, even exceeding its full-precision counterpart. Fig. 4 shows some visualization results on VOC2012 with different encoding precisions. Obviously, the segmentation results become more precise with more encoding bits. For example, from the fourth image we can see that as the encoding precision increases, the segmentation of the horse's legs becomes more and more distinct.

In this experiment, we evaluated the performance of our method based on the DeepLabV3 framework. However, it can also be applied to FCN [38] and PSPNet [73]. Besides semantic segmentation, our method can be applied to other real-world tasks, e.g., instance segmentation.

Fig. 4: Visualization results of our networks with different encoding precisions on the VOC2012 val set.
Fig. 5: The evaluating indicators of different encoding precisions for semantic segmentation. The dotted line and solid line denote the results of mIoU and pixAcc, respectively.

V-D Efficiency Analysis

Courbariaux et al. [21] used a large matrix multiplication on an NVIDIA GTX 750 GPU to illustrate the performance of their XNOR kernel, which achieves a substantial speedup over the baseline kernel. Therefore, we use the same matrix multiplication benchmark to validate the performance of our method by theoretical analysis and experimental study.

In this subsection, we analyze the speedup ratios of our method with different numbers of encoding bits. Suppose there are two vectors x and w, each composed of 64 real numbers. If we use the 32-bit floating-point data type to represent each element of the two vectors, the element-wise vector multiplication requires 64 multiplications. However, after each element of the two vectors is quantized to a binary state, we can use 1 bit to represent each element and pack each vector into a single 64-bit data type. Then the element-wise multiplication of the two vectors can be computed at once. This method is sometimes called SIMD (single instruction, multiple data) within a register; the details are shown in Fig. 6. A general computing platform (i.e., a CPU or GPU) can perform a 64-bit binary operation in one clock cycle. In addition, an Intel SSE (or AVX, AVX-512) instruction can perform 128 (or 256 and 512) bits of binary operations. This mechanism is also very suitable for the FPGA platform, which is designed for arithmetic and logical operations. Therefore, our method can achieve model compression, computational acceleration and resource saving in the inference process.
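The packing idea of Fig. 6 can be demonstrated in a few lines of Python (our illustration): 64 signs are packed into one 64-bit word, and the dot product of two {-1, +1} vectors reduces to one xnor and one population count:

```python
# Pack 64 {-1,+1} values into a 64-bit word and compute their dot product with
# one xnor plus one popcount instead of 64 multiply-accumulates.
import numpy as np

rng = np.random.default_rng(0)
a = rng.choice([-1, 1], size=64)
b = rng.choice([-1, 1], size=64)

pack = lambda v: int(np.packbits(v > 0).view(">u8")[0])   # 64 signs -> one uint64
xnor = ~(pack(a) ^ pack(b)) & ((1 << 64) - 1)             # bitwise xnor
dot = 2 * bin(xnor).count("1") - 64                       # 2 * bitcount - length
assert dot == int(a @ b)
```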

Fig. 6: The computing process of the dot product based on common multiplication (top) and bitwise operations (bottom).

Suppose the length of the two vectors is N. We use σ(m, k) to denote the speedup ratio of our method when constraining the activation values to m bits and the weights to k bits, which can be formulated as follows:

(16)

where the ratio depends on the relative cost of a multiply-accumulate operation (MAC) with respect to a bitwise operation, the relative cost of an addition with respect to a bitwise operation, and the register length L (e.g., 8, 16, 32, or 64). The time complexity of our method with m-bit activations and k-bit weights grows with the product m·k relative to the fully binary case. Following TBN [66] and the speedup ratio achieved by BWN [4], reasonable values can be assumed for these cost ratios. Thus, matrix multiplication with low-bit encoding can obtain large speedups over its full-precision counterpart.

Fig. 7: The speedup ratios of our networks with different encoding precisions (from 1-bit to 6-bit).

Besides the theoretical analysis, we also implemented the multiplication of two matrices via our encoding scheme on a GTX 1080 GPU; please refer to [21] for the experimental design and analysis. In our experiments, we first obtain the encodings of the two matrices, and then store the encoded values bit by bit. The encoded matrix multiplication obtains a large speedup over the baseline kernel (a quite unoptimized matrix multiplication kernel). Because of the parallel architecture of the GPU, the experimental result is slightly higher than the theoretical speedup ratio. Fig. 7 shows the speedup ratios for different encoding precisions (from 1-bit to 6-bit), where m and k denote the numbers of encoding bits for the weights and activations, respectively. From all the results, we can see that the computational complexity gradually increases with the number of encoding bits. Thus, users can choose different encoding precisions according to their requirements (e.g., accuracy and speed) and hardware resources (e.g., memory). The code is available at: https://github.com/qigongsun/BMD.

VI Discussion and Conclusion

VI-A {0, 1} and {-1, +1} Encoding Schemes

As described in [76], there exists a nonlinear mapping between quantized numbers and their encoded states. The quantized values are usually restricted to a closed interval, e.g., [-1, 1]. For example, the mapping is formulated as follows:

x = \frac{2}{2^M - 1}\, q - 1,    (17)

where x denotes a quantized number and q denotes the M-bit fixed-point integer encoded by {0, 1} that represents it. For two quantized numbers x_1 and x_2, the product can be formulated as follows:

x_1 \cdot x_2 = \frac{4\, q_1 q_2}{(2^M - 1)(2^K - 1)} - \frac{2\, q_1}{2^M - 1} - \frac{2\, q_2}{2^K - 1} + 1.    (18)

As shown in (18), the product is a polynomial with four terms, and each term has its own scaling factor. The computation of q_1 q_2 can be accelerated by bitwise operations; however, the polynomial form and the scaling factors increase the computational complexity.

For our proposed quantized binary encoding scheme, the product of x_1 and x_2 is defined as

x_1 \cdot x_2 = \frac{q_1 \cdot q_2}{(2^M - 1)(2^K - 1)},    (19)

where q_1 and q_2 denote the fixed-point (odd) integers encoded by {-1, +1}. Obviously, compared with the above {0, 1} encoding, the product can be calculated more efficiently by using our proposed encoding scheme.
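The comparison can be verified numerically with our reconstruction of Eqs. (17)-(19); the loop below checks, for the 2-bit case, that both the four-term polynomial of the {0, 1} encoding and the single-term form of the {-1, +1} encoding reproduce the exact product of the quantized numbers:

```python
# Numerical check of the two product forms for the 2-bit case.
M = K = 2
for q1 in range(2 ** M):                 # unsigned codes of x1 under Eq. (17)
    for q2 in range(2 ** K):
        x1 = 2 * q1 / (2 ** M - 1) - 1   # quantized numbers in [-1, 1]
        x2 = 2 * q2 / (2 ** K - 1) - 1
        # Eq. (18): four terms, each with its own scaling factor.
        poly = (4 * q1 * q2 / ((2 ** M - 1) * (2 ** K - 1))
                - 2 * q1 / (2 ** M - 1) - 2 * q2 / (2 ** K - 1) + 1)
        # Eq. (19): odd-integer codes under the {-1,+1} encoding, one term only.
        p1, p2 = 2 * q1 - (2 ** M - 1), 2 * q2 - (2 ** K - 1)
        single = p1 * p2 / ((2 ** M - 1) * (2 ** K - 1))
        assert abs(poly - x1 * x2) < 1e-12 and abs(single - x1 * x2) < 1e-12
```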

VI-B Linear Approximation and Quantization

As described in [35, 12, 71], the weight w can be approximated by a linear combination of binary subitems \beta_1, \dots, \beta_n with entries in {-1, +1}, which can then be computed by more efficient bitwise operations. In order to obtain the combination, we need to solve the following problem:

\min_{\alpha, \beta}\ \Big\| w - \sum_{i=1}^{n} \alpha_i \beta_i \Big\|^2, \quad \beta_i \in \{-1, +1\}^N.    (20)

When this approximation is used in DNNs, w can be considered as the model weights. However, the scaling factors \alpha_i are introduced into the approximation, and such a scheme expands the number of parameters by a factor of n. Therefore, this approximation converts the original model into a complicated binary network, which is hard to train [29] and easily falls into local optimal solutions.

For our method, we use the quantized parameter to approximate w as follows:

w \approx \frac{q}{2^M - 1},    (21)

where q is a positive or negative odd integer whose absolute value is not larger than 2^M - 1. Unlike the above linear approximation, our method directly yields the quantized weights and their corresponding encoding elements. Thus, our networks can be trained more efficiently via our quantization scheme than via the linear approximation.

VI-C Conclusions

In this paper, we proposed a novel encoding scheme using {-1, +1} to decompose QNNs into multi-branch binary networks, in which bitwise operations (xnor and bitcount) are used to achieve model compression, acceleration and resource saving. In particular, users can easily achieve different encoding precisions according to their requirements (e.g., accuracy and speed) and hardware resources (e.g., memory). This special data storage and calculation mechanism can yield great performance on FPGAs and ASICs, and thus our mechanism is feasible for smart chips. Our future work will focus on improving the hardware implementation, and on exploring ways (e.g., neural architecture search [79, 6, 36]) to automatically select proper bit widths for various network architectures, e.g., VGG [27] and ResNet [55].

References

  • [1] Y. Bengio, N. Leonard, and A. Courville (2013) Estimating or propagating gradients through stochastic neurons for conditional computation.
  • [2] Z. Cai and N. Vasconcelos (2020) Rethinking differentiable search for mixed-precision neural networks.
  • [3] L. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2018) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 40(4), pp. 834–848.
  • [4] M. Courbariaux, Y. Bengio, and J. David (2015) BinaryConnect: training deep neural networks with binary weights during propagations. In Proc. Adv. Neural Inf. Process. Syst., pp. 3123–3131.
  • [5] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus (2014) Exploiting linear structure within convolutional networks for efficient evaluation. In Proc. Adv. Neural Inf. Process. Syst., pp. 1269–1277.
  • [6] T. Elsken, J. H. Metzen, and F. Hutter (2019) Neural architecture search: a survey. Journal of Machine Learning Research 20(55), pp. 1–21.
  • [7] J. Faraone, N. Fraser, M. Blott, and P. H. Leong (2018) SYQ: learning symmetric quantization for efficient deep neural networks. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 4300–4309.
  • [8] J. Feng, J. Chen, Q. Sun, R. Shang, X. Cao, X. Zhang, and L. Jiao (2020) Convolutional neural network based on bandwise-independent convolution and hard thresholding for hyperspectral band selection. IEEE Transactions on Cybernetics.
  • [9] X. Geng, J. Fu, B. Zhao, J. Lin, M. M. S. Aly, C. Pal, and V. Chandrasekhar (2019) Dataflow-based joint quantization of weights and activations for deep neural networks. arXiv preprint arXiv:1901.02064.
  • [10] K. E. Germec (2009) Fourier neural networks for real-time harmonic analysis. In 2009 IEEE 17th Signal Processing and Communications Applications Conference, pp. 333–336.
  • [11] R. Girshick, J. Donahue, T. Darrell, and J. Malik