Blended Coarse Gradient Descent for Full Quantization of Deep Neural Networks

08/15/2018 · Penghang Yin et al. · University of California, Irvine; Qualcomm

Quantized deep neural networks (QDNNs) are attractive due to their much lower memory storage and faster inference speed than their regular full-precision counterparts. To maintain the same performance level, especially at low bit-widths, QDNNs must be retrained. Their training involves piecewise constant activation functions and discrete weights; hence mathematical challenges arise. We introduce the notion of coarse derivative and propose the blended coarse gradient descent (BCGD) algorithm for training fully quantized neural networks. A coarse gradient is generally not the gradient of any function but an artificial ascent direction. The weight update of BCGD goes by coarse gradient correction of a weighted average of the full-precision weights and their quantization (the so-called blending), which yields sufficient descent in the objective value and thus accelerates the training. Our experiments demonstrate that this simple blending technique is very effective for quantization at extremely low bit-widths such as binarization. In full quantization of ResNet-18 for the ImageNet classification task, BCGD gives 64.36% top-1 accuracy with binary weights across all layers and 4-bit adaptive activation. If the weights in the first and last layers are kept in full precision, this number increases to 65.46%. To understand the effectiveness of BCGD, we also analyze the coarse gradient descent for a two-layer neural network model with Gaussian input data, and prove that the expected coarse gradient correlates positively with the underlying true gradient.


1 Introduction

Deep neural networks (DNNs) have seen enormous success in image and speech classification, natural language processing, and health sciences, among other big-data-driven applications in recent years. However, DNNs typically require hundreds of megabytes of memory to store the trainable full-precision floating-point parameters, and billions of floating-point operations to make a single inference. This makes the deployment of DNNs on mobile devices a challenge. Considerable recent effort has been devoted to the training of low-precision (quantized) models that offer substantial memory savings and computation/power efficiency, while nearly maintaining the performance of full-precision networks. Most works to date are concerned with weight quantization (WQ)

bc_15 ; twn_16 ; xnornet_16 ; lbwn_16 ; carreira1 ; BR_18 . In he2018relu , He et al. theoretically justified the applicability of WQ models by investigating their expressive power. Some also studied activation function quantization (AQ) bnn_16 ; xnornet_16 ; Hubara2017QuantizedNN ; halfwave_17 ; entropy_17 ; dorefa_16 , which utilizes an external process outside of the network training. This is different from WQ at 4 bits or under, which must be achieved through network training. Learning the activation function as a parametrized family of functions, as part of network training, has been studied in paramrelu

for parametric rectified linear unit, and was recently extended to uniform AQ in

pact . In uniform AQ, $\sigma(x, \alpha)$ is a step (or piecewise constant) function in $x$, and the parameter $\alpha > 0$ determines the height and length of the steps. In terms of the partial derivative of $\sigma(x, \alpha)$ in $\alpha$, a two-valued proxy derivative of the parametrized activation function was proposed in PACT pact , although we will present an almost everywhere (a.e.) exact one in this paper.

The mathematical difficulty in training activation quantized networks is that the loss function becomes a piecewise constant function with sampled stochastic gradient a.e. zero, which is undesirable for back-propagation. A simple and effective way around this problem is to use a (generalized) straight-through (ST) estimator or derivative of a related (sub)differentiable function

hinton2012neural ; bengio2013estimating ; bnn_16 ; Hubara2017QuantizedNN

such as clipped rectified linear unit (clipped ReLU)

halfwave_17

. The idea of ST estimator dates back to the perceptron algorithm

rosenblatt1957perceptron ; rosenblatt1962principles

proposed in the 1950s for learning single-layer perceptrons with binary output. For multi-layer networks with hard threshold activation (a.k.a. binary neurons), Hinton

hinton2012neural

proposed to use the derivative of the identity function as a proxy in back-propagation (the chain rule), similar to the perceptron algorithm. The proxy derivative, used in the backward pass only, was referred to as the straight-through estimator in

bengio2013estimating , and several variants of the ST estimator bnn_16 ; Hubara2017QuantizedNN ; halfwave_17 have since been proposed for handling quantized activation functions. A similar situation, where the derivative of a certain layer composed into the loss function is unavailable for back-propagation, has also been brought up by wang2018deep

recently while improving accuracies of DNNs by replacing the softmax classifier layer with an implicit weighted nonlocal Laplacian layer. For the training of the latter, the derivative of a pre-trained fully-connected layer was used as a surrogate

wang2018deep .

On the theoretical side, while the convergence of the single-layer perceptron algorithm has been extensively studied widrow199030 ; freund1999large , there is almost no theoretical understanding of the unusual ‘gradient’ output from the modified chain rule based on the ST estimator. Since this unusual ‘gradient’ is certainly not the gradient of the objective function, a question naturally arises: how is it related to the objective function? One of the contributions of this paper is to answer this question. Our main contributions are threefold:

  1. Firstly, we introduce the notion of coarse derivative and cast the early ST estimators, or proxy partial derivatives of $\sigma(x, \alpha)$, including the two-valued one of PACT pact , as examples. The coarse derivative is non-unique. We propose a three-valued coarse partial derivative of the quantized activation function $\sigma(x, \alpha)$ in $\alpha$ that can outperform the two-valued one of pact in network training. We find that unlike the partial derivative in $x$, which vanishes a.e., the a.e. partial derivative of $\sigma$ in $\alpha$ is actually multi-valued (piecewise constant). Surprisingly, this a.e. accurate derivative is empirically less useful than the coarse ones in fully quantized network training.

  2. Secondly, we propose a novel accelerated training algorithm for fully quantized networks, termed blended coarse gradient descent method (BCGD). Instead of correcting the current full precision weights with coarse gradient at their quantized values like in the popular BinaryConnect scheme bc_15 ; bnn_16 ; xnornet_16 ; twn_16 ; dorefa_16 ; halfwave_17 ; Goldstein_17 ; BR_18 , the BCGD weight update goes by coarse gradient correction of a suitable average of the full precision weights and their quantization. We shall show that BCGD satisfies the sufficient descent property for objectives with Lipschitz gradients, while BinaryConnect does not unless an approximate orthogonality condition holds for the iterates BR_18 .

  3. Our third contribution is the mathematical analysis of coarse gradient descent for a two-layer network with binarized ReLU activation function and i.i.d. unit Gaussian data. We provide an explicit form of coarse gradient based on proxy derivative of regular ReLU, and show that when there are infinite training data, the negative expected coarse gradient gives a descent direction for minimizing the expected training loss. Moreover, we prove that a normalized coarse gradient descent algorithm only converges to either a global minimum or a potential spurious local minimum. This answers the question.

The rest of the paper is organized as follows. In section 2, we discuss the concept of coarse derivative and give examples for quantized activation functions. In section 3, we present the joint weight and activation quantization problem and the BCGD algorithm satisfying the sufficient descent property. For readers’ convenience, we also review formulas on 1-bit, 2-bit and 4-bit weight quantization used later in our numerical experiments. In section 4, we give details of fully quantized network training, including the disparate learning rates on the weights $\mathbf{w}$ and the resolutions $\boldsymbol{\alpha}$. We illustrate the enhanced validation accuracies of BCGD over BinaryConnect, and of the 3-valued coarse partial derivative of $\sigma$ in $\alpha$ over the 2-valued and a.e. partial derivatives, in the case of 4-bit activation and (1,2,4)-bit weights on the CIFAR-10 image dataset. We show top-1 and top-5 validation accuracies of ResNet-18 with all convolutional layers quantized at 1-bit weight/4-bit activation (1W4A), 4-bit weight/4-bit activation (4W4A), and 4-bit weight/8-bit activation (4W8A), using the 3-valued and 2-valued partial derivatives. The 3-valued partial derivative outperforms the two-valued one by a larger margin in the low-bit regime. The accuracies degrade gracefully from 4W8A to 1W4A while all the convolutional layers are quantized. The 4W8A accuracies with either the 3-valued or the 2-valued partial derivative are within 1% of those of the full-precision network. If the first and last convolutional layers are kept in full precision, our top-1 (top-5) accuracy of ResNet-18 at 1W4A with the 3-valued coarse $\alpha$-derivative is 4.7% (3%) higher than that of HWGQ halfwave_17 on the ImageNet dataset. This is in part due to the value of the parameter $\alpha$ being learned without any statistical assumption.


Notations.

$\|\cdot\|$ denotes the Euclidean norm of a vector or the spectral norm of a matrix; $\|\cdot\|_1$ denotes the $\ell_1$-norm. $\mathbf{0}$ represents the vector of zeros, whereas $\mathbf{1}$ the vector of all ones. We denote vectors by bold small letters and matrices by bold capital ones. For any $\mathbf{w}, \mathbf{z} \in \mathbb{R}^n$, $\mathbf{w}^\top \mathbf{z} = \langle \mathbf{w}, \mathbf{z} \rangle = \sum_i w_i z_i$ is their inner product. $\mathbf{w} \odot \mathbf{z}$ denotes the Hadamard product, whose $i$-th entry is given by $(\mathbf{w} \odot \mathbf{z})_i = w_i z_i$.

2 Activation Quantization

In a network with quantized activation, given a training sample of input $\mathbf{x}$ and label $y$, the associated sample loss is a composite function of the form:

$$\ell(\mathbf{w}, \boldsymbol{\alpha}; \{\mathbf{x}, y\}) = \ell\Big(\mathbf{w}_L * \sigma\big(\mathbf{w}_{L-1} * \sigma(\cdots \sigma(\mathbf{w}_1 * \mathbf{x}, \alpha_1)\cdots), \alpha_{L-1}\big);\, y\Big), \qquad (1)$$

where $\mathbf{w}_l$ contains the weights in the $l$-th linear (fully-connected or convolutional) layer, ‘$*$’ denotes either the matrix-vector product or the convolution operation; reshaping is necessary to avoid mismatch in dimensions. The $l$-th quantized ReLU $\sigma(\cdot, \alpha_l)$ acts element-wise on the vector/tensor output from the previous linear layer, and is parameterized by a trainable scalar $\alpha_l > 0$ known as the resolution. For practical hardware-level implementation, we are most interested in uniform quantization:

$$\sigma(x, \alpha) = \begin{cases} 0, & x \le 0, \\ k\alpha, & (k-1)\alpha < x \le k\alpha, \quad k = 1, 2, \dots, 2^{b_a}-1, \\ (2^{b_a}-1)\alpha, & x > (2^{b_a}-1)\alpha, \end{cases} \qquad (2)$$

where $x$ is the scalar input, $\alpha > 0$ the resolution, $b_a$ the bit-width of the activation, and $k$ the quantization level. For example, in 4-bit activation quantization (4A), we have $b_a = 4$ and $2^4 = 16$ quantization levels including the zero.
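For readers who want to see (2) in code, here is a minimal PyTorch sketch of the uniform quantizer; the function name, the default bit-width, and treating $\alpha$ as a plain Python float are our own illustrative choices, not part of the paper.

```python
import torch

def quantized_relu(x, alpha, bits=4):
    """Uniform quantized ReLU sigma(x, alpha) of (2): clip x to
    [0, (2^bits - 1) * alpha], then round up to a multiple of alpha."""
    levels = 2 ** bits - 1                      # number of nonzero quantization levels
    clipped = torch.clamp(x, min=0.0, max=levels * alpha)
    return torch.ceil(clipped / alpha) * alpha  # equals k*alpha on ((k-1)*alpha, k*alpha]
```

For instance, with bits=4 and alpha=0.5, an input of 3.3 is mapped to 3.5, and any input above 7.5 saturates at 7.5.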

Given $N$ training samples, we train the network with quantized ReLU by solving the following empirical risk minimization:

$$\min_{\mathbf{w}, \boldsymbol{\alpha}} \; \frac{1}{N} \sum_{n=1}^{N} \ell\big(\mathbf{w}, \boldsymbol{\alpha}; \{\mathbf{x}^{(n)}, y^{(n)}\}\big). \qquad (3)$$

In the gradient-based training framework, one needs to evaluate the gradient of the sample loss (1) using so-called back-propagation (a.k.a. the chain rule), which involves the computation of the partial derivatives $\partial \sigma / \partial x$ and $\partial \sigma / \partial \alpha$. Apparently, the partial derivative of $\sigma(x, \alpha)$ in $x$ is almost everywhere (a.e.) zero. After composition, this results in an a.e. zero gradient of $\ell$ with respect to (w.r.t.) $\mathbf{w}$ and $\boldsymbol{\alpha}$ in (1), causing their updates to become stagnant. To see this, we abstract the partial gradients $\partial \ell / \partial \mathbf{w}_l$ and $\partial \ell / \partial \alpha_l$, for instance, through the chain rule as follows:

and

where we recursively define $\mathbf{x}^1 = \mathbf{w}_1 * \mathbf{x}$ and $\mathbf{x}^l = \mathbf{w}_l * \sigma(\mathbf{x}^{l-1}, \alpha_{l-1})$ for $l \ge 2$ as the output from the $l$-th linear layer, and ‘$\circ$’ denotes a proper composition in the chain rule. It is clear that the two partial gradients are zeros a.e. because of the a.e. vanishing factor $\partial \sigma / \partial x$. In fact, the automatic differentiation embedded in deep learning platforms such as PyTorch pytorch would produce precisely zero gradients.

To get around this, we use a proxy derivative, or so-called ST estimator, for back-propagation. Overloading the notation ‘$\partial \sigma / \partial x$’, we take the proxy derivative to be

$$\frac{\partial \sigma}{\partial x}(x, \alpha) \approx \begin{cases} 0, & x \le 0, \\ 1, & 0 < x \le (2^{b_a}-1)\alpha, \\ 0, & x > (2^{b_a}-1)\alpha. \end{cases}$$

The proxy partial derivative has a non-zero value in the middle to reflect the overall variation of $\sigma$ in $x$; it can be viewed as the derivative of the large-scale (step-back) view of $\sigma$ in $x$, or the derivative in $x$ of the clipped ReLU halfwave_17 :

$$\tilde{\sigma}(x, \alpha) := \min\big\{\max\{x, 0\},\, (2^{b_a}-1)\alpha\big\}. \qquad (4)$$
Figure 1: Left: plot of the 2-bit quantized ReLU $\sigma(x, \alpha)$ in $x$. Right: plot of the associated clipped ReLU $\tilde{\sigma}(x, \alpha)$ in $x$.

On the other hand, we find the a.e. partial derivative of $\sigma$ w.r.t. $\alpha$ to be

$$\frac{\partial \sigma}{\partial \alpha}(x, \alpha) = \begin{cases} 0, & x \le 0, \\ k, & (k-1)\alpha < x \le k\alpha, \quad k = 1, 2, \dots, 2^{b_a}-1, \\ 2^{b_a}-1, & x > (2^{b_a}-1)\alpha. \end{cases}$$

Surprisingly, this a.e. derivative is not the best in terms of accuracy or computational cost in training, as will be reported in section 4. We propose an empirical three-valued proxy partial derivative in $\alpha$ as follows:

$$\frac{\partial \sigma}{\partial \alpha}(x, \alpha) \approx \begin{cases} 0, & x \le 0, \\ 2^{b_a - 1}, & 0 < x \le (2^{b_a}-1)\alpha, \\ 2^{b_a}-1, & x > (2^{b_a}-1)\alpha. \end{cases}$$

The middle value $2^{b_a - 1}$ is the arithmetic mean of the intermediate values $1, 2, \dots, 2^{b_a}-1$ of the a.e. partial derivative above. Similarly, an even coarser two-valued proxy, the same as that of PACT pact (which was derived differently), follows by zeroing out all the nonzero values except their maximum:

$$\frac{\partial \sigma}{\partial \alpha}(x, \alpha) \approx \begin{cases} 0, & x \le (2^{b_a}-1)\alpha, \\ 2^{b_a}-1, & x > (2^{b_a}-1)\alpha. \end{cases}$$

This turns out to be exactly the partial derivative in $\alpha$ of the clipped ReLU defined in (4).
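The proxies above can be wired into back-propagation as a custom autograd function. The sketch below is our own illustration (not the authors' released code): the clipped-ReLU derivative stands in for $\partial\sigma/\partial x$, the 3-valued proxy for $\partial\sigma/\partial\alpha$, and a fixed 4-bit width with a single scalar $\alpha$ per layer is assumed.

```python
import torch

class QuantReLU(torch.autograd.Function):
    """Quantized ReLU (2) in the forward pass, coarse derivatives in the backward pass."""
    BITS = 4

    @staticmethod
    def forward(ctx, x, alpha):
        levels = 2 ** QuantReLU.BITS - 1
        ctx.save_for_backward(x, alpha)
        clipped = torch.clamp(x, 0.0, levels * alpha.item())
        return torch.ceil(clipped / alpha) * alpha

    @staticmethod
    def backward(ctx, grad_out):
        x, alpha = ctx.saved_tensors
        levels = 2 ** QuantReLU.BITS - 1
        inside = ((x > 0) & (x <= levels * alpha)).to(grad_out.dtype)
        above = (x > levels * alpha).to(grad_out.dtype)
        # coarse d(sigma)/dx: derivative of the clipped ReLU (1 inside the range, 0 elsewhere)
        grad_x = grad_out * inside
        # 3-valued coarse d(sigma)/dalpha: 0, 2^(b-1), or 2^b - 1
        dalpha = (2 ** (QuantReLU.BITS - 1)) * inside + levels * above
        grad_alpha = (grad_out * dalpha).sum().reshape(alpha.shape)
        return grad_x, grad_alpha
```

A layer would then call y = QuantReLU.apply(x, alpha) with alpha stored as an nn.Parameter; swapping the dalpha line for the a.e. or 2-valued expressions reproduces the other variants compared in section 4.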

We shall refer to the resultant composite ‘gradient’ of the sample loss through the modified chain rule and averaging (over the mini-batch) as the coarse gradient. While given the name ‘gradient’, we believe it is generally not the gradient of any smooth function. Nevertheless, it somehow exploits the essential information of the piecewise constant loss, and its negation provides a descent direction for the minimization. In section 5, we validate this claim by examining a two-layer network with i.i.d. Gaussian data. We find that when there are infinitely many training samples, the overall training loss (i.e., the population loss) becomes pleasantly differentiable, with a non-trivial gradient that possesses certain Lipschitz continuity. More importantly, we shall show an example of an expected coarse gradient that provably forms an acute angle with the underlying true gradient of the population loss and only vanishes at the possible local minimizers of the original problem.

During the training process, the resolution vector $\boldsymbol{\alpha}$ (one component per activation layer) should be prevented from becoming either too small or too large. Due to the sensitivity of the network to $\boldsymbol{\alpha}$, we propose a two-scale training scheme and set the learning rate of $\boldsymbol{\alpha}$ to be the learning rate of the weights multiplied by a rate factor far less than 1, which may be varied depending on the network architecture. This rate factor effectively helps the quantized network converge steadily and prevents $\boldsymbol{\alpha}$ from vanishing.
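In PyTorch, this two-scale rule can be implemented with per-parameter-group learning rates, as in the sketch below; the stand-in model, the number of $\alpha$'s, and the concrete rate factor of 0.01 are placeholders of our own, not values prescribed here.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.Conv2d(16, 16, 3))    # stand-in network
alpha_params = [nn.Parameter(torch.tensor(1.0)) for _ in range(2)]  # one alpha per activation layer
lr_w, rate_factor = 0.1, 0.01                                       # placeholder values

optimizer = torch.optim.SGD(
    [{"params": model.parameters(), "lr": lr_w},
     {"params": alpha_params,       "lr": lr_w * rate_factor}],     # much smaller lr for alpha
    momentum=0.9)
```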

3 Full Quantization

Imposing quantized weights amounts to adding a discrete set-constraint to the optimization problem (3). Suppose $M$ is the total number of weights in the network. For commonly used $b_w$-bit layer-wise quantization, the constraint set $\mathcal{Q} \subset \mathbb{R}^M$ takes the form $\mathcal{Q} = \mathcal{Q}_1 \times \cdots \times \mathcal{Q}_L$, meaning that the weight tensor in the $l$-th linear layer is constrained to the form $\mathbf{w}_l = \delta_l\, \mathbf{q}_l$ for some adjustable scaling factor $\delta_l > 0$ shared by the weights in the same layer. Each component of $\mathbf{q}_l$ is drawn from the quantization set given by $\{\pm 1\}$ for $b_w = 1$ (binarization) and $\{0, \pm 1, \dots, \pm(2^{b_w - 1} - 1)\}$ for $b_w \ge 2$. This assumption on $\mathcal{Q}$ generalizes those of the 1-bit BWN xnornet_16 and the 2-bit TWN twn_16 . As such, the layer-wise weight and activation quantization problem here can be stated abstractly as follows:

$$\min_{\mathbf{w}, \boldsymbol{\alpha}} f(\mathbf{w}, \boldsymbol{\alpha}) \quad \text{subject to} \quad \mathbf{w} \in \mathcal{Q}, \qquad (5)$$

where the training loss $f$ is defined in (3). Different from activation quantization, one bit is taken to represent the signs of the weights. For ease of presentation, we only consider network-wise weight quantization throughout this section, i.e., the weights across all layers share the same (trainable) floating scaling factor $\delta > 0$; that is, $\mathbf{w} = \delta\,\mathbf{q}$ with $\mathbf{q} \in \{\pm 1\}^M$ for $b_w = 1$ and $\mathbf{q} \in \{0, \pm 1, \dots, \pm(2^{b_w - 1}-1)\}^M$ for $b_w \ge 2$.

3.1 Weight Quantization

Given a float weight vector $\mathbf{w}$, the quantization of $\mathbf{w}$ is basically the following optimization problem for computing the projection of $\mathbf{w}$ onto the set $\mathcal{Q}$:

$$\mathrm{proj}_{\mathcal{Q}}(\mathbf{w}) \in \arg\min_{\mathbf{z} \in \mathcal{Q}} \|\mathbf{z} - \mathbf{w}\|^2. \qquad (6)$$

Note that $\mathcal{Q}$ is a non-convex set, so the solution of (6) may not be unique. When $b_w = 1$, we have the binarization problem

$$\min_{\delta, \mathbf{q}} \|\delta\,\mathbf{q} - \mathbf{w}\|^2 \quad \text{subject to} \quad \delta > 0, \; \mathbf{q} \in \{\pm 1\}^M. \qquad (7)$$

For $b_w \ge 2$, the projection/quantization problem (6) can be reformulated as

$$\min_{\delta, \mathbf{q}} \|\delta\,\mathbf{q} - \mathbf{w}\|^2 \quad \text{subject to} \quad \delta > 0, \; \mathbf{q} \in \{0, \pm 1, \dots, \pm(2^{b_w - 1}-1)\}^M. \qquad (8)$$

It has been shown that the closed-form (exact) solution of (7) can be computed at $O(M)$ complexity for (1-bit) binarization xnornet_16 and at $O(M \log M)$ complexity for (2-bit) ternarization lbwn_16 . An empirical ternarizer of $O(M)$ complexity has also been proposed twn_16 . At wider bit-widths, accurately solving (8) becomes computationally intractable due to the combinatorial nature of the problem lbwn_16 .

The problem (8) is basically a constrained $K$-means clustering problem of 1-D points BR_18 , with the centroids being $\delta$-spaced. In principle it can be solved by a variant of the classical Lloyd's algorithm lloyd via an alternating minimization procedure, which iterates between the assignment step ($\mathbf{q}$-update) and the centroid step ($\delta$-update). In the $t$-th iteration, fixing the scaling factor $\delta^{t-1}$, each entry of $\mathbf{q}^t$ is chosen from the quantization set so that $\delta^{t-1} q_i^t$ is as close as possible to $w_i$. In the $\delta$-update, the quadratic problem

$$\min_{\delta > 0} \; \|\delta\,\mathbf{q}^t - \mathbf{w}\|^2$$

is solved by $\delta^t = \frac{\langle \mathbf{q}^t, \mathbf{w}\rangle}{\|\mathbf{q}^t\|^2}$. Since the quantization (6) is required in every iteration of network training, to make this procedure practical we perform just a single iteration of Lloyd's algorithm, empirically initializing $\delta$ so that the largest-magnitude components of $\mathbf{w}$ are mapped to the largest quantization level. This makes the large components in $\mathbf{w}$ well clustered.
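A sketch of the resulting single-iteration quantizer is given below; the function signature and the exact initialization of delta (mapping the largest weight magnitude to the top level) are our own reading of the procedure, not a verbatim reproduction.

```python
import torch

def quantize_weights(w, bits=2):
    """One Lloyd-style iteration for b_w-bit weight quantization, cf. (7)-(8):
    initialize delta from the largest weight magnitude, assign each weight to
    the nearest level, then refit delta by least squares."""
    levels = 2 ** (bits - 1) - 1 if bits > 1 else 1                 # largest |q_i|
    delta = w.abs().max() / levels                                  # map the largest weight to the top level
    if bits == 1:
        q = torch.sign(w)
        q[q == 0] = 1.0                                             # break ties for zero weights
    else:
        q = torch.clamp(torch.round(w / delta), -levels, levels)    # assignment step
    delta = (q * w).sum() / (q * q).sum().clamp_min(1e-12)          # centroid step: argmin_d ||d*q - w||^2
    return delta * q
```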

First introduced by Courbariaux et al. in bc_15 , the BinaryConnect (BC) scheme has drawn much attention in training DNNs with quantized weights and the regular ReLU. It can be summarized as

$$\mathbf{w}_f^{t+1} = \mathbf{w}_f^{t} - \eta\,\nabla f(\mathbf{w}^{t}), \qquad \mathbf{w}^{t+1} = \mathrm{proj}_{\mathcal{Q}}\big(\mathbf{w}_f^{t+1}\big),$$

where $\{\mathbf{w}^t\}$ denotes the sequence of the desired quantized weights, and $\{\mathbf{w}_f^t\}$ is an auxiliary sequence of floating weights. BC can be readily extended to the full quantization regime by including the update of $\boldsymbol{\alpha}$ and replacing the true gradient with the coarse gradient from section 2. With a subtle change to the standard projected gradient descent algorithm (PGD) combettes2015stochastic , namely

$$\mathbf{w}_f^{t+1} = \mathbf{w}^{t} - \eta\,\nabla f(\mathbf{w}^{t}), \qquad \mathbf{w}^{t+1} = \mathrm{proj}_{\mathcal{Q}}\big(\mathbf{w}_f^{t+1}\big),$$

BC significantly outperforms PGD and effectively bypasses spurious local minima Goldstein_17 . An intuitive explanation is that the constraint set $\mathcal{Q}$ is basically a finite union of isolated one-dimensional subspaces (i.e., lines that pass through the origin) BR_18 . Since $\mathbf{w}_f^{t+1}$ is obtained near the projected point $\mathbf{w}^t$, the sequence generated by PGD can easily get stuck in one line subspace when updated with a small learning rate $\eta$; see Figure 2 for graphical illustrations.
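To make the 'subtle change' concrete, the following sketch contrasts the two updates, reusing the quantize_weights sketch above; the names and the float/quantized bookkeeping are illustrative assumptions.

```python
def bc_update(w_float, w_quant, grad, lr, bits=1):
    # BinaryConnect: accumulate the (coarse) gradient, evaluated at the quantized
    # weights, into the float auxiliary weights, then re-project
    w_float = w_float - lr * grad
    return w_float, quantize_weights(w_float, bits=bits)

def pgd_update(w_float, w_quant, grad, lr, bits=1):
    # Projected gradient descent: take the gradient step from the projected point
    w_float = w_quant - lr * grad
    return w_float, quantize_weights(w_float, bits=bits)
```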

Figure 2: The ternarization of two weights. The one-dimensional subspaces constitute the constraint set $\mathcal{Q}$. When updated with a small learning rate, BC keeps searching among the subspaces (left), whereas PGD can stagnate in one of them (right).

3.2 Blended Gradient Descent and Sufficient Descent Property

Despite the superiority of BC over PGD, we point out a drawback regarding its convergence. While Yin et al. provided a convergence proof for the BC scheme in the recent paper BR_18 , their analysis hinges on an approximate orthogonality condition which may not hold in practice; see Lemma 4.4 and Theorem 4.10 of BR_18 . Suppose $f$ has an $L$-Lipschitz gradient. (This assumption is valid for the population loss function; we refer readers to Lemma 2 in section 5.) In light of the convergence proof in Theorem 4.10 of BR_18 , we have

(9)

For the objective sequence $\{f(\mathbf{w}^t)\}$ to be monotonically decreasing and converging to a critical point, it is crucial to have the sufficient descent property gilbert1992global hold for a sufficiently small learning rate $\eta$:

$$f(\mathbf{w}^{t+1}) - f(\mathbf{w}^{t}) \le -c\,\|\mathbf{w}^{t+1} - \mathbf{w}^{t}\|^2 \qquad (10)$$

for some positive constant $c$.

Since $\mathbf{w}^{t+1}$ and $\mathbf{w}^{t}$ both lie in $\mathcal{Q}$, it holds in (9) that

Due to the non-convexity of the set $\mathcal{Q}$, the above term can be as small as zero even when $\mathbf{w}^{t+1}$ and $\mathbf{w}^{t}$ are distinct, so it is not guaranteed to dominate the right-hand side of (9). Consequently, given (9), the inequality (10) does not necessarily hold. Without sufficient descent, even if $\{f(\mathbf{w}^t)\}$ converges, the iterates $\{\mathbf{w}^t\}$ may not converge to a critical point. To fix this issue, we blend the ideas of PGD and BC, and propose the following blended gradient descent (BGD):

$$\mathbf{w}_f^{t+1} = (1-\rho)\,\mathbf{w}_f^{t} + \rho\,\mathbf{w}^{t} - \eta\,\nabla f(\mathbf{w}^{t}), \qquad \mathbf{w}^{t+1} = \mathrm{proj}_{\mathcal{Q}}\big(\mathbf{w}_f^{t+1}\big), \qquad (11)$$

for some blending parameter $\rho \in (0, 1)$. In contrast to BC, the blended gradient descent satisfies (10) for a small enough learning rate $\eta$.

Proposition 1.

For $\rho \in (0, 1)$, the BGD iteration (11) satisfies

Choosing the learning rate $\eta$ small enough, inequality (10) follows from the above proposition, which guarantees the convergence of (11) to a critical point by arguments similar to the proofs in BR_18 .

Corollary 1.

The blended gradient descent iteration (11) satisfies the sufficient descent property (10).

4 Experiments

We tested BCGD, as summarized in Algorithm 1, on the CIFAR-10 cifar_09 and ImageNet imagenet_09 ; imagnet_12 color image datasets. We implemented BCGD on the PyTorch platform pytorch . In all experiments, the blending factor $\rho$ in (11) is fixed to a small constant. All runs with quantization are warm started from a float pre-trained model, and the resolutions $\boldsymbol{\alpha}$ are initialized from the maximal values in the corresponding feature maps generated by a random mini-batch. The learning rate for the weights starts from a base value and decays by a constant factor at scheduled epochs; the learning rate for $\boldsymbol{\alpha}$ is the weight learning rate multiplied by a rate factor far less than 1, as discussed in section 2. The weights $\mathbf{w}$ and resolutions $\boldsymbol{\alpha}$ are updated jointly. In addition, we used momentum and batch normalization bn_15 to promote training efficiency. We mainly compare the performance of the proposed BCGD with that of the state-of-the-art BC (adapted for full quantization) under layer-wise quantization. The experiments were carried out on machines with 4 Nvidia GeForce GTX 1080 Ti GPUs.

Input: mini-batch loss function $f_t(\mathbf{w}, \boldsymbol{\alpha})$, blending parameter $\rho$, learning rate $\eta_{\mathbf{w}}^t$ for the weights, learning rate $\eta_{\boldsymbol{\alpha}}^t$ for the resolutions of AQ (one component per activation layer).
Do:

  Evaluate the mini-batch coarse gradient $(\tilde{\nabla}_{\mathbf{w}} f_t, \tilde{\nabla}_{\boldsymbol{\alpha}} f_t)$ at $(\mathbf{w}^t, \boldsymbol{\alpha}^t)$ according to section 2.
  $\mathbf{w}_f^{t+1} = (1-\rho)\,\mathbf{w}_f^{t} + \rho\,\mathbf{w}^{t} - \eta_{\mathbf{w}}^t\,\tilde{\nabla}_{\mathbf{w}} f_t(\mathbf{w}^t, \boldsymbol{\alpha}^t)$   (blended gradient update for weights)
  $\boldsymbol{\alpha}^{t+1} = \boldsymbol{\alpha}^t - \eta_{\boldsymbol{\alpha}}^t\,\tilde{\nabla}_{\boldsymbol{\alpha}} f_t(\mathbf{w}^t, \boldsymbol{\alpha}^t)$   (update the resolutions)
  $\mathbf{w}^{t+1} = \mathrm{proj}_{\mathcal{Q}}\big(\mathbf{w}_f^{t+1}\big)$   (quantize the weights as per section 3.1)
Algorithm 1: One iteration of BCGD for full quantization
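Put together, one iteration of Algorithm 1 looks roughly as follows in code; this is a sketch under the same assumptions as the earlier snippets (coarse gradients supplied by the caller, quantize_weights from section 3.1), not the authors' implementation.

```python
def bcgd_step(w_float, w_quant, alpha, g_w, g_alpha, lr_w, lr_alpha, rho, bits=1):
    """One BCGD iteration: g_w and g_alpha are the mini-batch coarse gradients
    evaluated at (w_quant, alpha), e.g. via QuantReLU from section 2."""
    # blended gradient update for the float weights
    w_float = (1.0 - rho) * w_float + rho * w_quant - lr_w * g_w
    # update the activation resolutions with their own (smaller) learning rate
    alpha = alpha - lr_alpha * g_alpha
    # quantize the weights (projection onto Q)
    w_quant = quantize_weights(w_float, bits=bits)
    return w_float, w_quant, alpha
```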

The CIFAR-10 dataset consists of 60,000 color images in 10 classes, with 6,000 images per class. The dataset is split into 50,000 training images and 10,000 test images; in the experiments, we used the test images for validation. The models were trained with the learning rate decaying at epochs 80 and 140; weight decay and momentum were also used. The a.e. derivative and the 3-valued and 2-valued coarse derivatives of $\sigma$ in $\alpha$ are compared on the VGG-11 vgg_14 and ResNet-20 resnet_15 architectures, and the results are listed in Tables 1, 2 and 3, respectively. It can be seen that the 3-valued coarse derivative gives the best overall performance in terms of accuracy. Figure 3 shows that in weight binarization, BCGD converges faster and to a higher accuracy than BC.

Network             Float    32W4A    1W4A     2W4A     4W4A
VGG-11 + BC         92.13    91.74    88.12    89.78    91.51
VGG-11 + BCGD         –        –      88.74    90.08    91.38
ResNet-20 + BC      92.41    91.90    89.23    90.89    91.53
ResNet-20 + BCGD      –        –      90.10    91.15    91.56
Table 1: CIFAR-10 validation accuracies in % with the a.e. derivative.
Figure 3: CIFAR-10 validation accuracies vs. epoch number with the a.e. derivative and 1W4A quantization on VGG-11 (left) and ResNet-20 (right), with (orange) and without (blue) blending, which speeds up training towards higher accuracies.
Network             Float    32W4A    1W4A     2W4A     4W4A
VGG-11 + BC         92.13    92.08    89.12    90.52    91.89
VGG-11 + BCGD         –        –      89.59    90.71    91.70
ResNet-20 + BC      92.41    92.14    89.37    91.02    91.71
ResNet-20 + BCGD      –        –      90.05    91.03    91.97
Table 2: CIFAR-10 validation accuracies in % with the 3-valued derivative.
Network             Float    32W4A    1W4A     2W4A     4W4A
VGG-11 + BC         92.13    91.66    88.50    89.99    91.31
VGG-11 + BCGD         –        –      89.12    90.00    91.31
ResNet-20 + BC      92.41    91.73    89.22    90.64    91.37
ResNet-20 + BCGD      –        –      89.98    90.75    91.65
Table 3: CIFAR-10 validation accuracies in % with the 2-valued derivative (PACT pact ).

The ImageNet (ILSVRC12) dataset imagenet_09 is a benchmark for the large-scale image classification task, with about 1.2 million training images and 50,000 validation images in 1,000 categories. We trained the models for 80 epochs with the learning rate decaying at epochs 50 and 70; weight decay and momentum were used as well. The ResNet-18 accuracies of 65.46%/86.36% (top-1/top-5) at 1W4A in Table 4 outperform those of HWGQ halfwave_17 , whose top-1/top-5 accuracies are 60.8%/83.4% with non-quantized first/last convolutional layers. The results in Tables 4 and 5 show that the 3-valued coarse partial derivative is more effective than the 2-valued one as the quantization bit precision is lowered. We also observe that the accuracies degrade gracefully from 4W8A to 1W4A for ResNet-18 while all convolutional layers are quantized. Again, BCGD converges much faster than BC towards higher accuracy, as illustrated by Figure 4.

          Float     1W4A                                4W4A                   4W8A
                    3-valued         2-valued           3-valued   2-valued    3-valued   2-valued
top-1     69.64     64.36 / 65.46*   63.37 / —          67.36      66.97       68.85      68.83
top-5     88.98     85.65 / 86.36*   84.93 / —          87.76      87.41       88.71      88.84
Table 4: ImageNet validation accuracies with BCGD on ResNet-18. Starred accuracies are with the first and last convolutional layers in float precision, as in halfwave_17 . The accuracies are for quantized weights across all layers otherwise.
          Float     1W4A                  4W4A                  4W8A
                    3-valued   2-valued   3-valued   2-valued   3-valued   2-valued
top-1     73.27     68.43      67.51      70.81      70.01      72.07      72.18
top-5     91.43     88.29      87.72      90.00      89.49      90.71      90.73
Table 5: ImageNet validation accuracies with BCGD on ResNet-34. The accuracies are for quantized weights across all layers.
Figure 4: ImageNet validation accuracies (left: top-1, right: top-5) vs. number of epochs with the 3-valued derivative and 1W4A quantization on ResNet-18, with (orange) and without (blue) blending, which substantially speeds up training towards higher accuracies.

5 Analysis of Coarse Gradient Descent for Activation Quantization

As a proof of concept, we analyze a simple two-layer network with binarized ReLU activation. Let $\sigma$ be the binarized ReLU function, same as the hard threshold activation hinton2012neural , obtained by taking bit-width $b_a = 1$ and resolution $\alpha = 1$ in (2):

$$\sigma(x) = \begin{cases} 0, & x \le 0, \\ 1, & x > 0. \end{cases}$$
We define the training sample loss by

$$\ell(\mathbf{v}, \mathbf{w}; \mathbf{Z}) := \frac{1}{2}\Big(\mathbf{v}^\top \sigma(\mathbf{Z}\mathbf{w}) - (\mathbf{v}^*)^\top \sigma(\mathbf{Z}\mathbf{w}^*)\Big)^2,$$

where $\mathbf{v}^*$ and $\mathbf{w}^*$ are the underlying (nonzero) teacher parameters in the second and first layers, respectively. As in the literature analyzing conventional ReLU nets Lee ; li2017convergence ; tian2017analytical ; brutzkus2017globally , we assume the entries of $\mathbf{Z}$ are i.i.d. sampled from the standard normal distribution $\mathcal{N}(0, 1)$. Note that $\sigma(\mathbf{Z}\mathbf{w}) = \sigma(\mathbf{Z}\mathbf{w}/c)$ for any scalar $c > 0$. Without loss of generality, we fix $\|\mathbf{w}^*\| = 1$.

5.1 Population Loss Minimization

Suppose we have $N$ independent training samples $\{\mathbf{Z}^{(1)}, \dots, \mathbf{Z}^{(N)}\}$; then the associated empirical risk minimization reads

$$\min_{\mathbf{v}, \mathbf{w}} \; \frac{1}{N} \sum_{n=1}^{N} \ell\big(\mathbf{v}, \mathbf{w}; \mathbf{Z}^{(n)}\big). \qquad (12)$$

The major difficulty of analysis here is that the empirical risk function in (12) is still piecewise constant and has a.e. zero partial gradient. This issue can be resolved by instead considering the following population loss minimization li2017convergence ; brutzkus2017globally ; Lee ; tian2017analytical :

$$\min_{\mathbf{v}, \mathbf{w}} \; f(\mathbf{v}, \mathbf{w}) := \mathbb{E}_{\mathbf{Z}}\big[\ell(\mathbf{v}, \mathbf{w}; \mathbf{Z})\big]. \qquad (13)$$

Specifically, in the limit $N \to \infty$, the objective function becomes favorably smooth with a non-trivial gradient. For nonzero vectors $\mathbf{w}$ and $\mathbf{w}^*$, let us define the angle between them by

$$\theta(\mathbf{w}, \mathbf{w}^*) := \arccos\left(\frac{\mathbf{w}^\top \mathbf{w}^*}{\|\mathbf{w}\|\,\|\mathbf{w}^*\|}\right);$$

then we have

Lemma 1.

If every entry of $\mathbf{Z}$ is i.i.d. sampled from $\mathcal{N}(0, 1)$, and $\mathbf{w}, \mathbf{w}^* \ne \mathbf{0}$, then the population loss is

(14)

Moreover, the gradients of $f$ w.r.t. $\mathbf{v}$ and $\mathbf{w}$ are

(15)

and

(16)

respectively.

The possible (local) minimizers of problem (13) are located at

  1. Stationary points where the gradients defined in (15) and (16) vanish simultaneously (which may not be possible), i.e.,

    (17)
  2. Non-differentiable points where $\theta(\mathbf{w}, \mathbf{w}^*) = 0$ or $\theta(\mathbf{w}, \mathbf{w}^*) = \pi$, together with the corresponding condition on $\mathbf{v}$.

Among them, the points with $\theta(\mathbf{w}, \mathbf{w}^*) = 0$ and $\mathbf{v} = \mathbf{v}^*$ are the global minimizers, with $f = 0$.

Proposition 2.

If , then

gives the stationary points obeying (17). Otherwise, problem (13) has no stationary points.

The gradient of the population loss $f$ is Lipschitz continuous under a boundedness condition.

Lemma 2.

For any and with and , there exists a constant depending on and , such that

5.2 Convergence Analysis of Normalized Coarse Gradient Descent

The partial gradients $\partial f / \partial \mathbf{v}$ and $\partial f / \partial \mathbf{w}$, however, are not available in the training. What we really have access to are the expectations of the sample gradients, namely,

If $\sigma$ were differentiable, then back-propagation would read

(18)

and

(19)

Since $\sigma$ has zero derivative a.e., (19) is inapplicable. We study the coarse gradient descent in which the derivative of $\sigma$ in (19) is replaced by the (sub)derivative of the regular ReLU $\mu(x) = \max(x, 0)$. More precisely, we use the following surrogate of $\partial \ell / \partial \mathbf{w}$:

(20)

with $\mu'(x) = \mathbf{1}_{\{x > 0\}}$, and consider the following coarse gradient descent with weight normalization:

(21)
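To make the surrogate concrete for this model, the numpy sketch below computes the sample gradient in $\mathbf{v}$ and the coarse gradient in $\mathbf{w}$, assuming the squared-error sample loss $\ell(\mathbf{v}, \mathbf{w}; \mathbf{Z}) = \frac{1}{2}\big(\mathbf{v}^\top\sigma(\mathbf{Z}\mathbf{w}) - (\mathbf{v}^*)^\top\sigma(\mathbf{Z}\mathbf{w}^*)\big)^2$ stated above; the replacement of $\sigma'$ by the ReLU derivative follows (20), while everything else is ordinary calculus.

```python
import numpy as np

def coarse_grads(v, w, v_star, w_star, Z):
    """Sample gradient in v and coarse gradient in w for the two-layer model;
    sigma' is replaced by the ReLU derivative 1_{x > 0} in the w-gradient."""
    sigma = lambda x: (x > 0).astype(float)              # binarized ReLU
    err = v @ sigma(Z @ w) - v_star @ sigma(Z @ w_star)  # prediction error
    grad_v = err * sigma(Z @ w)                          # exact d(ell)/dv
    coarse_grad_w = err * (Z.T @ (v * (Z @ w > 0)))      # sigma' -> ReLU' surrogate
    return grad_v, coarse_grad_w

# toy usage with Gaussian data
rng = np.random.default_rng(0)
m, n = 8, 16
Z = rng.standard_normal((m, n))
v_star, w_star = rng.standard_normal(m), rng.standard_normal(n)
w_star /= np.linalg.norm(w_star)
gv, gw = coarse_grads(rng.standard_normal(m), rng.standard_normal(n), v_star, w_star, Z)
```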
Lemma 3.

The expected gradient of $\ell$ w.r.t. $\mathbf{v}$ is

(22)

The expected coarse gradient w.r.t. $\mathbf{w}$ is

(23)

In particular, the expected gradient w.r.t. $\mathbf{v}$ and the expected coarse gradient w.r.t. $\mathbf{w}$ vanish simultaneously only in one of the following cases:

  1. (17) is satisfied according to Proposition 2.

  2. , , or , .

What is interesting is that the expected coarse partial gradient is properly defined at the global minimizers of the population loss minimization problem (13), where $\theta(\mathbf{w}, \mathbf{w}^*) = 0$ and $\mathbf{v} = \mathbf{v}^*$, whereas the true gradient does not exist there. Our key finding is that the expected coarse gradient in $\mathbf{w}$ has positive correlation with the true gradient $\partial f / \partial \mathbf{w}$, and consequently, together with the expected gradient in $\mathbf{v}$, it gives a descent direction in algorithm (21).

Lemma 4.

If , and , then the inner product between the expected coarse and true gradients w.r.t. is

Moreover, the following lemma asserts that the expected coarse gradient is sufficiently correlated with the true gradient, which secures sufficient descent in the objective values and thus the convergence of the iteration (21).

Lemma 5.

Suppose and . There exists a constant depending on , such that

Equipped with Lemma 2 and Lemma 5, we are able to show the convergence result of iteration (21).

Theorem 1.

Given an initialization $(\mathbf{v}^0, \mathbf{w}^0)$ with $\mathbf{w}^0 \ne \mathbf{0}$, let $\{(\mathbf{v}^t, \mathbf{w}^t)\}$ be the sequence generated by iteration (21). There exists $\eta_0 > 0$ such that for any step size $\eta < \eta_0$, the objective sequence $\{f(\mathbf{v}^t, \mathbf{w}^t)\}$ is monotonically decreasing, and both the expected gradient w.r.t. $\mathbf{v}$ and the expected coarse gradient w.r.t. $\mathbf{w}$ converge to $\mathbf{0}$ as $t \to \infty$.

Remark 1.

Combining the treatment of Lee for analyzing two-layer networks with regular ReLU and the positive correlation between the expected coarse gradient and the true gradient, one can further show that, under suitable conditions on the initialization, the iterates $\{(\mathbf{v}^t, \mathbf{w}^t)\}$ converge to a global minimizer.

6 Concluding Remarks

We introduced the concept of the coarse gradient for the activation quantization problem of DNNs, for which the a.e. gradient is inapplicable. The coarse gradient is generally not a gradient, but an artificial ascent direction. We further proposed the BCGD algorithm for training fully quantized neural networks. The weight update of BCGD goes by coarse gradient correction of a weighted average of the float weights and their quantization, which yields sufficient descent in the objective and thus accelerates the training. Our experiments demonstrated that BCGD is very effective for quantization at extremely low bit-widths such as binarization. Finally, we analyzed the coarse gradient descent for a two-layer neural network model with Gaussian input data, and proved that the expected coarse gradient essentially correlates positively with the underlying true gradient.


Acknowledgements. This work was partially supported by NSF grants DMS-1522383, IIS-1632935; ONR grant N00014-18-1-2527, AFOSR grant FA9550-18-0167, DOE grant DE-SC0013839 and STROBE STC NSF grant DMR-1548924.

Conflict of Interest Statement

On behalf of all authors, the corresponding author states that there is no conflict of interest.

References

Appendix

A. Additional Preliminaries

Lemma 6.

Let $\mathbf{z}$ be a Gaussian random vector with entries i.i.d. sampled from $\mathcal{N}(0, 1)$. Given nonzero vectors $\mathbf{w}$ and $\tilde{\mathbf{w}}$ with angle $\theta$ between them, we have

and

Proof.

The third identity was proved in Lemma A.1 of Lee

. To show the first one, since Gaussian distribution is rotation-invariant, without loss of generality we assume

with , then .

We further assume . It is easy to see

which is the probability that $\mathbf{z}$ forms an acute angle with both $\mathbf{w}$ and $\tilde{\mathbf{w}}$.

To prove the last identity, we use polar representation of 2-D Gaussian random variables, where

is the radius and is the angle with and . Then for . Moreover,

and

Therefore,