Understanding Straight-Through Estimator in Training Activation Quantized Neural Nets

03/13/2019 ∙ by Penghang Yin, et al. ∙ University of California, Irvine ∙ Qualcomm

Training activation quantized neural networks involves minimizing a piecewise constant function whose gradient vanishes almost everywhere, which is undesirable for standard back-propagation or the chain rule. An empirical way around this issue is to use a straight-through estimator (STE) (Bengio et al., 2013) in the backward pass, so that the "gradient" through the modified chain rule becomes non-trivial. Since this unusual "gradient" is certainly not the gradient of the loss function, the following question arises: why does searching in its negative direction minimize the training loss? In this paper, we provide a theoretical justification of the concept of STE by answering this question. We consider the problem of learning a two-linear-layer network with binarized ReLU activation and Gaussian input data. We refer to the unusual "gradient" given by the STE-modified chain rule as the coarse gradient. The choice of STE is not unique. We prove that if the STE is properly chosen, the expected coarse gradient correlates positively with the population gradient (which is not available for training), and its negation is a descent direction for minimizing the population loss. We further show that the associated coarse gradient descent algorithm converges to a critical point of the population loss minimization problem. Moreover, we show that a poor choice of STE leads to instability of the training algorithm near certain local minima, which is verified with CIFAR-10 experiments.




1 Introduction

Deep neural networks (DNN) have achieved remarkable success in many machine learning applications such as computer vision (Krizhevsky et al., 2012; Ren et al., 2015), natural language processing (Collobert & Weston, 2008) and reinforcement learning (Mnih et al., 2015; Silver et al., 2016). However, the deployment of DNN typically requires hundreds of megabytes of memory storage for the trainable full-precision floating-point parameters, and billions of floating-point operations to make a single inference. To achieve substantial memory savings and energy efficiency at inference time, many recent efforts have been devoted to training coarsely quantized DNN while maintaining the performance of their float counterparts (Courbariaux et al., 2015; Rastegari et al., 2016; Cai et al., 2017; Hubara et al., 2018; Yin et al., 2018b).

Training fully quantized DNN amounts to solving a very challenging optimization problem. It calls for minimizing a piecewise constant and highly nonconvex empirical risk function subject to a discrete set-constraint that characterizes the quantized weights. In particular, weight quantization of DNN has been extensively studied in the literature; see for example (Li et al., 2016; Zhu et al., 2016; Li et al., 2017; Yin et al., 2016, 2018a; Hou & Kwok, 2018; He et al., 2018; Li & Hao, 2018). On the other hand, the gradient in training activation quantized DNN is almost everywhere (a.e.) zero, which makes standard back-propagation inapplicable. Arguably the most effective way around this issue is to construct a non-trivial search direction by properly modifying the chain rule. Specifically, one can replace the a.e. zero derivative of the quantized activation function in the chain rule with a related surrogate. This proxy derivative, used in the backward pass only, is referred to as the straight-through estimator (STE) (Bengio et al., 2013). In the same paper, Bengio et al. (2013) proposed an alternative approach based on stochastic neurons. In addition, Friesen & Domingos (2017) proposed the feasible target propagation algorithm for learning hard-threshold (or binary activated) networks (Lee et al., 2015) via convex combinatorial optimization.

1.1 Related Works

The idea of STE originates from the celebrated perceptron algorithm (Rosenblatt, 1957, 1962) of the 1950s for learning single-layer perceptrons. The perceptron algorithm essentially does not calculate the “gradient” through the standard chain rule, but instead through a modified chain rule in which the derivative of the identity function serves as the proxy of the original derivative of the binary output function $1_{\{x>0\}}$. Its convergence has been extensively discussed in the literature; see, for example, (Widrow & Lehr, 1990; Freund & Schapire, 1999) and the references therein. Hinton (2012) extended this idea to train multi-layer networks with binary activations (a.k.a. binary neurons), namely, to back-propagate as if the activation had been the identity function. Bengio et al. (2013) proposed an STE variant which uses the derivative of the sigmoid function instead. In the training of DNN with weights and activations constrained to $\pm 1$, (Hubara et al., 2016) substituted the derivative of the signum activation function with $1_{\{|x|\le 1\}}$ in the backward pass, known as the saturated STE. Later the idea of STE was readily applied to the training of DNN with general quantized ReLU activations (Hubara et al., 2018; Zhou et al., 2016; Cai et al., 2017; Choi et al., 2018; Yin et al., 2018b), where other proxies were used, including the derivatives of the vanilla ReLU and the clipped ReLU. Despite all the empirical success of STE, there is very limited theoretical understanding of it in training DNN with stair-case activations.

Goel et al. (2018) considered the leaky ReLU activation of a one-hidden-layer network. They showed the convergence of the so-called Convertron algorithm, which uses the identity STE in the backward pass through the leaky ReLU layer. Other similar scenarios, where certain layers are not desirable for back-propagation, have been brought up recently by (Wang et al., 2018) and (Athalye et al., 2018). The former proposed an implicit weighted nonlocal Laplacian layer as the classifier to improve the generalization accuracy of DNN. In the backward pass, the derivative of a pre-trained fully-connected layer was used as a surrogate. To circumvent adversarial defenses (Szegedy et al., 2013), (Athalye et al., 2018) introduced the backward pass differentiable approximation, which shares the same spirit as STE, and successfully broke defenses at ICLR 2018 that rely on obfuscated gradients.

1.2 Main Contributions

Throughout this paper, we shall refer to the “gradient” of the loss function w.r.t. the weight variables through the STE-modified chain rule as the coarse gradient. Since the backward and forward passes do not match, the coarse gradient is certainly not the gradient of the loss function, and it is generally not the gradient of any function. Why does searching in its negative direction minimize the training loss, given that this is not the standard gradient descent algorithm? The choice of STE is evidently non-unique, so what makes a good STE? From the optimization perspective, we take a step towards understanding STE in training quantized ReLU nets by addressing these questions.

On the theoretical side, we consider three representative STEs for learning a two-linear-layer network with binary activation and Gaussian data: the derivatives of the identity function (Rosenblatt, 1957; Hinton, 2012; Goel et al., 2018), the vanilla ReLU and the clipped ReLU (Cai et al., 2017; Hubara et al., 2016). We adopt the model of population loss minimization (Brutzkus & Globerson, 2017; Tian, 2017; Li & Yuan, 2017; Du et al., 2018). For the first time, we prove that proper choices of STE give rise to descent training algorithms. Specifically, the negative expected coarse gradients based on the STEs of the vanilla and clipped ReLUs are provably descent directions for minimizing the population loss, which yield monotonically decreasing energy in the training. In contrast, this is not true for the identity STE. We further prove that the corresponding training algorithm can be unstable near certain local minima, because the coarse gradient may simply not vanish there.

Complementary to the analysis, we examine the empirical performances of the three STEs on MNIST and CIFAR-10 classifications with general quantized ReLU. While both vanilla and clipped ReLUs work very well on the relatively shallow LeNet-5, clipped ReLU STE is arguably the best for the deeper VGG-11 and ResNet-20. In our CIFAR experiments in section 4.2, we observe that the training using identity or ReLU STE can be unstable at good minima and repelled to an inferior one with substantially higher training loss and decreased generalization accuracy. This is an implication that poor STEs generate coarse gradients incompatible with the energy landscape, which is consistent with our theoretical finding about the identity STE.

To our knowledge, convergence guarantees of the perceptron algorithm (Rosenblatt, 1957, 1962) and the Convertron algorithm (Goel et al., 2018) were proved for the identity STE. It is worth noting that Convertron (Goel et al., 2018) makes weaker assumptions than this paper. These results, however, do not generalize to the network with two trainable layers studied here. As mentioned above, the identity STE is actually a poor choice in our case. Moreover, it is not clear whether their analyses can be extended to other STEs. Similar to Convertron with leaky ReLU, the monotonicity of the quantized activation function plays a role in coarse gradient descent. Indeed, all three STEs considered here exploit this property. But this is not the whole story. A great STE like the clipped ReLU matches the quantized ReLU at the extrema; otherwise the instability/incompatibility issue may arise.

Organization. In section 2, we study the energy landscape of a two-linear-layer network with binary activation and Gaussian data. We present the main results and sketch the mathematical analysis for STE in section 3. In section 4, we compare the empirical performances of different STEs in 2-bit and 4-bit activation quantization, and report the instability phenomena of the training algorithms associated with poor STEs observed in CIFAR experiments. Due to space limitation, all the technical proofs as well as some figures are deferred to the appendix.


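Notations. $\|\cdot\|$ denotes the Euclidean norm of a vector or the spectral norm of a matrix. $0_n \in \mathbb{R}^n$ represents the vector of all zeros, whereas $1_n \in \mathbb{R}^n$ is the vector of all ones. $I_n$ is the identity matrix of order $n$. For any $w, z \in \mathbb{R}^n$, $w^\top z = \langle w, z \rangle = \sum_i w_i z_i$ is their inner product. $w \odot z$ denotes the Hadamard product whose $i$th entry is given by $(w \odot z)_i = w_i z_i$.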

2 Learning Two-Linear-Layer CNN with Binary Activation

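We consider a model similar to (Du et al., 2018) that outputs the prediction

$$y(Z, v, w) := v^\top \sigma(Z w)$$

for some input $Z \in \mathbb{R}^{m \times n}$. Here $w \in \mathbb{R}^n$ and $v \in \mathbb{R}^m$ are the trainable weights in the first and second linear layer, respectively; $Z_i^\top$ denotes the $i$th row vector of $Z$; the activation function $\sigma$ acts component-wise on the vector $Zw$, i.e., $\sigma(Zw)_i = \sigma(Z_i^\top w)$. The first layer serves as a convolutional layer, where each row $Z_i^\top$ can be viewed as a patch sampled from $Z$ and the weight filter $w$ is shared among all patches, and the second linear layer is the classifier. The label is generated according to $y^*(Z) = (v^*)^\top \sigma(Z w^*)$ for some true (non-zero) parameters $v^*$ and $w^*$. Moreover, we use the following squared sample loss

$$\ell(v, w; Z) := \frac{1}{2}\left( y(Z, v, w) - y^*(Z) \right)^2 = \frac{1}{2}\left( v^\top \sigma(Zw) - y^*(Z) \right)^2. \qquad (1)$$

Unlike in (Du et al., 2018), the activation function here is not ReLU, but the binary function $\sigma(x) = 1_{\{x > 0\}}$.

We assume that the entries of $Z \in \mathbb{R}^{m \times n}$ are i.i.d. sampled from the Gaussian distribution $\mathcal{N}(0, 1)$ (Zhong et al., 2017; Brutzkus & Globerson, 2017). Since $\ell(v, w; Z) = \ell(v, w/c; Z)$ for any scalar $c > 0$, without loss of generality, we take $\|w^*\| = 1$ and cast the learning task as the following population loss minimization problem:

$$\min_{v \in \mathbb{R}^m,\, w \in \mathbb{R}^n} \; f(v, w) := \mathbb{E}_Z\left[ \ell(v, w; Z) \right], \qquad (2)$$

where the sample loss $\ell(v, w; Z)$ is given by (1).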

2.1 Back-propagation and Coarse Gradient Descent

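With the Gaussian assumption on $Z$, as will be shown in section 2.2, it is possible to find the analytic expressions of $f(v, w)$ and its gradient

$$\nabla f(v, w) := \begin{bmatrix} \frac{\partial f}{\partial v}(v, w) \\ \frac{\partial f}{\partial w}(v, w) \end{bmatrix}.$$

The gradient of the objective function, however, is not available for the network training. In fact, we can only access the expected sample gradient, namely,

$$\mathbb{E}_Z\left[ \frac{\partial \ell}{\partial v}(v, w; Z) \right] \quad \text{and} \quad \mathbb{E}_Z\left[ \frac{\partial \ell}{\partial w}(v, w; Z) \right].$$

We remark that $\mathbb{E}_Z\left[ \frac{\partial \ell}{\partial w}(v, w; Z) \right]$ is not the same as $\frac{\partial f}{\partial w}(v, w) = \frac{\partial\, \mathbb{E}_Z[\ell(v, w; Z)]}{\partial w}$. By the standard back-propagation or chain rule, we readily check that

$$\frac{\partial \ell}{\partial v}(v, w; Z) = \sigma(Zw)\left( v^\top \sigma(Zw) - y^*(Z) \right) \qquad (3)$$

and

$$\frac{\partial \ell}{\partial w}(v, w; Z) = Z^\top \left( \sigma'(Zw) \odot v \right)\left( v^\top \sigma(Zw) - y^*(Z) \right). \qquad (4)$$

Note that $\sigma'$ is zero a.e., which makes (4) inapplicable to the training. The idea of STE is to simply replace the a.e. zero component $\sigma'$ in (4) with a related non-trivial function $\mu'$ (Hinton, 2012; Bengio et al., 2013; Hubara et al., 2016; Cai et al., 2017), which is the derivative of some (sub)differentiable function $\mu$. More precisely, back-propagation using the STE $\mu'$ gives the following non-trivial surrogate of $\frac{\partial \ell}{\partial w}(v, w; Z)$, to which we refer as the coarse (partial) gradient

$$g_\mu(v, w; Z) = Z^\top \left( \mu'(Zw) \odot v \right)\left( v^\top \sigma(Zw) - y^*(Z) \right). \qquad (5)$$

Using the STE $\mu'$ to train the two-linear-layer convolutional neural network (CNN) with binary activation gives rise to the (full-batch) coarse gradient descent described in Algorithm 1.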


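Input: initialization $v^0 \in \mathbb{R}^m$, $w^0 \in \mathbb{R}^n$, learning rate $\eta$.

  for $t = 0, 1, \dots$ do
    $v^{t+1} = v^t - \eta\, \mathbb{E}_Z\left[ \frac{\partial \ell}{\partial v}(v^t, w^t; Z) \right]$
    $w^{t+1} = w^t - \eta\, \mathbb{E}_Z\left[ g_\mu(v^t, w^t; Z) \right]$
  end for
Algorithm 1 Coarse gradient descent for learning two-linear-layer CNN with STE $\mu'$.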
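The coarse gradient and the update of Algorithm 1 can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: it assumes the model $y = v^\top \sigma(Zw)$ with $\sigma(x) = 1_{\{x>0\}}$, replaces the expectation over $Z$ by a mean over a finite batch, and all function names (`binary_act`, `sample_grads`, `coarse_gd_step`) are hypothetical.

```python
def binary_act(x):
    """Forward activation sigma(x) = 1_{x > 0}."""
    return 1.0 if x > 0 else 0.0

# Candidate STEs mu'(x); they replace sigma'(x) in the backward pass only.
STE = {
    "identity": lambda x: 1.0,
    "relu": lambda x: 1.0 if x > 0 else 0.0,
    "clipped_relu": lambda x: 1.0 if 0.0 < x < 1.0 else 0.0,
}

def matvec(Z, w):
    """Compute Z w for a matrix Z stored as a list of rows."""
    return [sum(z * x for z, x in zip(row, w)) for row in Z]

def sample_grads(v, w, Z, v_star, w_star, ste):
    """Return (dl/dv, coarse gradient w.r.t. w) for a single input Z."""
    pre = matvec(Z, w)
    act = [binary_act(p) for p in pre]
    label = sum(a * binary_act(p) for a, p in zip(v_star, matvec(Z, w_star)))
    residual = sum(a * b for a, b in zip(v, act)) - label
    grad_v = [residual * a for a in act]
    # u = mu'(Zw) * v (elementwise); the coarse gradient is residual * Z^T u.
    u = [STE[ste](p) * vi for p, vi in zip(pre, v)]
    grad_w = [residual * sum(Z[i][k] * u[i] for i in range(len(Z)))
              for k in range(len(w))]
    return grad_v, grad_w

def coarse_gd_step(v, w, batch, v_star, w_star, ste, lr):
    """One update of (v, w); the expectation is replaced by a batch mean."""
    mean_v = [0.0] * len(v)
    mean_w = [0.0] * len(w)
    for Z in batch:
        gv, gw = sample_grads(v, w, Z, v_star, w_star, ste)
        mean_v = [a + b / len(batch) for a, b in zip(mean_v, gv)]
        mean_w = [a + b / len(batch) for a, b in zip(mean_w, gw)]
    new_v = [a - lr * g for a, g in zip(v, mean_v)]
    new_w = [a - lr * g for a, g in zip(w, mean_w)]
    return new_v, new_w
```

Note that the three STEs differ only in the mask applied to $v$ before multiplying by $Z^\top$; the forward pass and the update of $v$ are identical in all cases.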

2.2 Preliminaries

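Let us present some preliminaries about the landscape of the population loss function $f(v, w)$. To this end, we define the angle between $w$ and $w^*$ as $\theta(w, w^*) := \arccos\left( \frac{w^\top w^*}{\|w\| \|w^*\|} \right)$ for any $w \neq 0_n$. Recalling that the label is given by $y^*(Z) = (v^*)^\top \sigma(Z w^*)$ from (1), we elaborate on the analytic expressions of $f(v, w)$ and $\nabla f(v, w)$.

Lemma 1.

If $w \neq 0_n$, the population loss $f(v, w)$ is given by

$$f(v, w) = \frac{1}{8}\left[ v^\top \left( I_m + 1_m 1_m^\top \right) v - 2 v^\top \left( \left( 1 - \frac{2}{\pi}\theta(w, w^*) \right) I_m + 1_m 1_m^\top \right) v^* + (v^*)^\top \left( I_m + 1_m 1_m^\top \right) v^* \right].$$

In addition, $f(v, w) = \frac{1}{8} (v^*)^\top \left( I_m + 1_m 1_m^\top \right) v^*$ for $w = 0_n$.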
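A quick numerical sanity check of Lemma 1 compares a Monte Carlo estimate of $\mathbb{E}_Z[\ell(v, w; Z)]$ with a closed form. The sketch below assumes the closed form $f(v,w) = \frac{1}{8}\left[ v^\top(I + 11^\top)v - 2v^\top\left((1 - \frac{2}{\pi}\theta) I + 11^\top\right)v^* + (v^*)^\top(I + 11^\top)v^* \right]$, which follows from $\mathbb{P}(Z_i^\top w > 0,\, Z_i^\top w^* > 0) = \frac{\pi - \theta}{2\pi}$ for Gaussian rows; the function names are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def angle(w, w_star):
    """Angle theta(w, w*) between two nonzero vectors."""
    c = dot(w, w_star) / math.sqrt(dot(w, w) * dot(w_star, w_star))
    return math.acos(max(-1.0, min(1.0, c)))

def closed_form_loss(v, w, v_star, w_star):
    """Assumed closed form of the population loss for w != 0."""
    theta = angle(w, w_star)

    def quad(a, b, c):
        # a^T (c I + 1 1^T) b
        return c * dot(a, b) + sum(a) * sum(b)

    cross = quad(v, v_star, 1.0 - 2.0 * theta / math.pi)
    return (quad(v, v, 1.0) - 2.0 * cross + quad(v_star, v_star, 1.0)) / 8.0

def monte_carlo_loss(v, w, v_star, w_star, num_samples, seed=0):
    """Average the squared sample loss over Gaussian inputs Z."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        pred, label = 0.0, 0.0
        for vi, vsi in zip(v, v_star):
            row = [rng.gauss(0.0, 1.0) for _ in w]  # one row of Z
            pred += vi * (1.0 if dot(row, w) > 0 else 0.0)
            label += vsi * (1.0 if dot(row, w_star) > 0 else 0.0)
        total += 0.5 * (pred - label) ** 2
    return total / num_samples
```

For example, with $v = (1, 0.5)$, $v^* = (0.5, 1)$, $w = (1, 1, 0)$ and $w^* = (1, 0, 0)$ the angle is $\pi/4$, and the two estimates should agree up to sampling error.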

Lemma 2.

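If $w \neq 0_n$ and $\theta(w, w^*) \in (0, \pi)$, the partial gradients of $f(v, w)$ w.r.t. $v$ and $w$ are

$$\frac{\partial f}{\partial v}(v, w) = \frac{1}{4}\left( I_m + 1_m 1_m^\top \right) v - \frac{1}{4}\left( \left( 1 - \frac{2}{\pi}\theta(w, w^*) \right) I_m + 1_m 1_m^\top \right) v^* \qquad (6)$$

and

$$\frac{\partial f}{\partial w}(v, w) = -\frac{v^\top v^*}{2\pi \|w\|} \cdot \frac{\left( I_n - \frac{w w^\top}{\|w\|^2} \right) w^*}{\left\| \left( I_n - \frac{w w^\top}{\|w\|^2} \right) w^* \right\|}, \qquad (7)$$

respectively.

For any $v \in \mathbb{R}^m$, $(v, 0_n)$ cannot be a local minimizer. The only possible (local) minimizers of the model (2) are located at

  1. Stationary points where the gradients given by (6) and (7) vanish simultaneously (which may not be possible), i.e.,

  $$v^\top v^* = 0 \quad \text{and} \quad v = \left( I_m + 1_m 1_m^\top \right)^{-1} \left( \left( 1 - \frac{2}{\pi}\theta(w, w^*) \right) I_m + 1_m 1_m^\top \right) v^*. \qquad (8)$$

  2. Non-differentiable points where $\theta(w, w^*) = 0$ and $v = v^*$, or $\theta(w, w^*) = \pi$ and $v = \left( I_m + 1_m 1_m^\top \right)^{-1} \left( 1_m 1_m^\top - I_m \right) v^*$.

Among them, $\{(v, w) : v = v^*, \theta(w, w^*) = 0\}$ are obviously the global minimizers of (2). We show that the stationary points, if they exist, can only be saddle points, and $\{(v, w) : \theta(w, w^*) = \pi, v = (I_m + 1_m 1_m^\top)^{-1}(1_m 1_m^\top - I_m) v^*\}$ are the only potential spurious local minimizers.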

Proposition 1.

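If the true parameter $v^*$ satisfies $(1_m^\top v^*)^2 < \frac{m+1}{2} \|v^*\|^2$, then

$$\left\{ (v, w) : v = \left( I_m + 1_m 1_m^\top \right)^{-1} \left( \frac{-(1_m^\top v^*)^2}{(m+1)\|v^*\|^2 - (1_m^\top v^*)^2} I_m + 1_m 1_m^\top \right) v^*, \;\; \theta(w, w^*) = \frac{\pi}{2} \cdot \frac{(m+1)\|v^*\|^2}{(m+1)\|v^*\|^2 - (1_m^\top v^*)^2} \right\}$$

give the saddle points obeying (8), and $\{(v, w) : \theta(w, w^*) = \pi, v = (I_m + 1_m 1_m^\top)^{-1}(1_m 1_m^\top - I_m) v^*\}$ are the spurious local minimizers. Otherwise, the model (2) has no saddle points or spurious local minimizers.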
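The threshold on $v^*$ in Proposition 1 can be traced back to the stationarity conditions. The following is only a sketch of that computation, assuming condition (8) reads $v^\top v^* = 0$ together with $v = (I_m + 1_m 1_m^\top)^{-1}\left( (1 - \frac{2}{\pi}\theta) I_m + 1_m 1_m^\top \right) v^*$ (a form reconstructed from the model, so it should be treated as an assumption). The shorthands $a := 1 - \frac{2}{\pi}\theta$ and $s := 1_m^\top v^*$ are introduced here for illustration.

```latex
\begin{aligned}
(I_m + 1_m 1_m^\top)^{-1} &= I_m - \tfrac{1}{m+1}\, 1_m 1_m^\top
  && \text{(Sherman--Morrison)} \\
v &= a\, v^* + \tfrac{(1-a)\, s}{m+1}\, 1_m \\
0 = v^\top v^* &= a\, \|v^*\|^2 + \tfrac{(1-a)\, s^2}{m+1}
  \;\Longrightarrow\;
  a = -\frac{s^2}{(m+1)\|v^*\|^2 - s^2} \\
\theta &= \tfrac{\pi}{2}\,(1 - a)
  = \frac{\pi}{2} \cdot \frac{(m+1)\|v^*\|^2}{(m+1)\|v^*\|^2 - s^2} \\
\theta < \pi &\iff s^2 < \tfrac{m+1}{2}\, \|v^*\|^2
\end{aligned}
```

The denominator is positive because $s^2 \le m \|v^*\|^2$ by the Cauchy-Schwarz inequality, so the last line is exactly the requirement that the stationary angle be attainable in $(0, \pi)$.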

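We further prove that the population gradient $\nabla f(v, w)$ given by (6) and (7) is Lipschitz continuous when restricted to bounded domains.

Lemma 3.

For any differentiable points $(v, w)$ and $(\tilde{v}, \tilde{w})$ with $\min\{\|w\|, \|\tilde{w}\|\} = c_w > 0$ and $\max\{\|v\|, \|\tilde{v}\|\} = C_v$, there exists a Lipschitz constant $L > 0$ depending on $C_v$ and $c_w$, such that

$$\left\| \nabla f(v, w) - \nabla f(\tilde{v}, \tilde{w}) \right\| \le L \left\| (v, w) - (\tilde{v}, \tilde{w}) \right\|.$$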

3 Main Results

We are most interested in the complex case where both the saddle points and spurious local minimizers are present. Our main results are concerned with the behaviors of the coarse gradient descent summarized in Algorithm 1 when the derivatives of the vanilla and clipped ReLUs as well as the identity function serve as the STE, respectively. We shall prove that Algorithm 1 using the derivative of vanilla or clipped ReLU converges to a critical point, whereas that with the identity STE does not.

Theorem 1 (Convergence).

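Let $\{(v^t, w^t)\}$ be the sequence generated by Algorithm 1 with ReLU $\mu(x) = \max\{x, 0\}$ or clipped ReLU $\mu(x) = \min\{\max\{x, 0\}, 1\}$. Suppose $\|w^t\| \ge c_w$ for all $t$ with some $c_w > 0$. Then if the learning rate $\eta > 0$ is sufficiently small, for any initialization $(v^0, w^0)$, the objective sequence $\{f(v^t, w^t)\}$ is monotonically decreasing, and $\{(v^t, w^t)\}$ converges to a saddle point or a (local) minimizer of the population loss minimization (2). In addition, if $1_m^\top v^* \neq 0$ and $m > 1$, the descent and convergence properties do not hold for Algorithm 1 with the identity function $\mu(x) = x$ near the local minimizers satisfying $\theta(w, w^*) = \pi$ and $v = (I_m + 1_m 1_m^\top)^{-1}(1_m 1_m^\top - I_m) v^*$.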

Remark 1.

The convergence guarantee for the coarse gradient descent is established under the assumption that there are infinitely many training samples. When there are only a few samples, on a coarse scale, the empirical loss roughly descends along the direction of the negative coarse gradient, as illustrated by Figure 1. As the sample size increases, the empirical loss gains monotonicity and smoothness. This explains why a (proper) STE works so well with massive amounts of data, as in deep learning.

Remark 2.

The same results hold if the Gaussian assumption on the input data is weakened to the assumption that the rows are i.i.d. and follow some rotation-invariant distribution. The proof is substantially similar.

In the rest of this section, we sketch the mathematical analysis for the main results.

[Figure 1 panels, left to right: sample size = 10, sample size = 50, sample size = 1000]
Figure 1: The plots of the empirical loss moving by one step in the direction of the negative coarse gradient vs. the learning rate (step size) for different sample sizes.

3.1 Derivative of the Vanilla ReLU as STE

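If we choose the derivative of ReLU $\mu(x) = \max\{x, 0\}$ as the STE in (5), it is easy to see $\mu'(x) = \sigma(x)$, and we have the following expressions of $\mathbb{E}_Z\left[ \frac{\partial \ell}{\partial v}(v, w; Z) \right]$ and $\mathbb{E}_Z\left[ g_{\mathrm{relu}}(v, w; Z) \right]$ for Algorithm 1.

Lemma 4.

The expected partial gradient of $\ell(v, w; Z)$ w.r.t. $v$ is

$$\mathbb{E}_Z\left[ \frac{\partial \ell}{\partial v}(v, w; Z) \right] = \frac{\partial f}{\partial v}(v, w).$$

Let $\mu(x) = \max\{x, 0\}$ in (5). The expected coarse gradient w.r.t. $w$ is

$$\mathbb{E}_Z\left[ g_{\mathrm{relu}}(v, w; Z) \right] = \frac{h(v, v^*)}{2\sqrt{2\pi}} \frac{w}{\|w\|} - \cos\left( \frac{\theta(w, w^*)}{2} \right) \frac{v^\top v^*}{\sqrt{2\pi}} \cdot \frac{\frac{w}{\|w\|} + w^*}{\left\| \frac{w}{\|w\|} + w^* \right\|},$$

where $h(v, v^*) = \|v\|^2 + (1_m^\top v)^2 - (1_m^\top v)(1_m^\top v^*) + v^\top v^*$.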

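As stated in Lemma 5 below, the key observation is that the coarse partial gradient $\mathbb{E}_Z[g_{\mathrm{relu}}(v, w; Z)]$ has non-negative correlation with the population partial gradient $\frac{\partial f}{\partial w}(v, w)$, and $-\mathbb{E}_Z[g_{\mathrm{relu}}(v, w; Z)]$ together with $-\mathbb{E}_Z\left[ \frac{\partial \ell}{\partial v}(v, w; Z) \right]$ form a descent direction for minimizing the population loss.

Lemma 5.

If $w \neq 0_n$ and $\theta(w, w^*) \in (0, \pi)$, then the inner product between the expected coarse and population gradients w.r.t. $w$ is

$$\left\langle \mathbb{E}_Z[g_{\mathrm{relu}}(v, w; Z)], \frac{\partial f}{\partial w}(v, w) \right\rangle = \frac{\sin \theta(w, w^*)}{2(\sqrt{2\pi})^3 \|w\|} (v^\top v^*)^2 \ge 0.$$

Moreover, if further $\|v\| \le C_v$ and $\|w\| \ge c_w$, there exists a constant $A_{\mathrm{relu}} > 0$ depending on $C_v$ and $c_w$, such that

$$\left\| \mathbb{E}_Z[g_{\mathrm{relu}}(v, w; Z)] \right\|^2 \le A_{\mathrm{relu}} \left( \left\| \frac{\partial f}{\partial v}(v, w) \right\|^2 + \left\langle \mathbb{E}_Z[g_{\mathrm{relu}}(v, w; Z)], \frac{\partial f}{\partial w}(v, w) \right\rangle \right). \qquad (12)$$

Clearly, when $\left\langle \mathbb{E}_Z[g_{\mathrm{relu}}(v, w; Z)], \frac{\partial f}{\partial w}(v, w) \right\rangle > 0$, $\mathbb{E}_Z[g_{\mathrm{relu}}(v, w; Z)]$ is roughly in the same direction as $\frac{\partial f}{\partial w}(v, w)$. Moreover, since by Lemma 4, $\mathbb{E}_Z\left[ \frac{\partial \ell}{\partial v}(v, w; Z) \right] = \frac{\partial f}{\partial v}(v, w)$, we expect that the coarse gradient descent behaves like the gradient descent directly on $f(v, w)$. Here we would like to highlight the significance of the estimate (12) in guaranteeing the descent property of Algorithm 1. By the Lipschitz continuity of $\nabla f$ specified in Lemma 3, and writing $\partial_v f$, $\partial_w f$ and $g$ for $\frac{\partial f}{\partial v}(v^t, w^t)$, $\frac{\partial f}{\partial w}(v^t, w^t)$ and $\mathbb{E}_Z[g_{\mathrm{relu}}(v^t, w^t; Z)]$, it holds that

$$\begin{aligned} f(v^{t+1}, w^{t+1}) - f(v^t, w^t) &\le \langle \partial_v f, v^{t+1} - v^t \rangle + \langle \partial_w f, w^{t+1} - w^t \rangle + \frac{L}{2}\left( \|v^{t+1} - v^t\|^2 + \|w^{t+1} - w^t\|^2 \right) \\ &= -\left( \eta - \frac{L}{2}\eta^2 \right) \|\partial_v f\|^2 + \frac{L}{2}\eta^2 \|g\|^2 - \eta \langle \partial_w f, g \rangle \\ &\overset{a)}{\le} -\left( \eta - (1 + A_{\mathrm{relu}})\frac{L}{2}\eta^2 \right) \|\partial_v f\|^2 - \left( \eta - A_{\mathrm{relu}}\frac{L}{2}\eta^2 \right) \langle \partial_w f, g \rangle, \end{aligned}$$

where a) is due to (12). Therefore, if $\eta$ is small enough, we have monotonically decreasing energy until convergence.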
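The non-negative correlation in Lemma 5 can be checked by hand. The sketch below assumes the expressions $\frac{\partial f}{\partial w} = -\frac{v^\top v^*}{2\pi\|w\|} \frac{P w^*}{\|P w^*\|}$ with $P := I_n - \tilde{w}\tilde{w}^\top$, $\tilde{w} := w/\|w\|$, and $\mathbb{E}_Z[g_{\mathrm{relu}}] = \frac{h(v, v^*)}{2\sqrt{2\pi}} \tilde{w} - \cos(\theta/2) \frac{v^\top v^*}{\sqrt{2\pi}} \frac{\tilde{w} + w^*}{\|\tilde{w} + w^*\|}$, with $\|w^*\| = 1$ (reconstructed forms, to be treated as assumptions; $P$ and $\tilde{w}$ are shorthands introduced here).

```latex
\begin{aligned}
\langle \tilde{w}, P w^* \rangle &= 0,
\qquad
\langle \tilde{w} + w^*, P w^* \rangle = 1 - \cos^2\theta = \sin^2\theta, \\
\|P w^*\| &= \sin\theta,
\qquad
\|\tilde{w} + w^*\| = \sqrt{2 + 2\cos\theta} = 2\cos(\theta/2), \\
\Big\langle \mathbb{E}_Z[g_{\mathrm{relu}}], \tfrac{\partial f}{\partial w} \Big\rangle
&= \cos(\theta/2)\, \frac{v^\top v^*}{\sqrt{2\pi}} \cdot \frac{v^\top v^*}{2\pi \|w\|}
   \cdot \frac{\sin^2\theta}{\sin\theta \cdot 2\cos(\theta/2)} \\
&= \frac{\sin\theta}{2(\sqrt{2\pi})^3 \|w\|}\, (v^\top v^*)^2 \;\ge\; 0 .
\end{aligned}
```

The term along $\tilde{w}$ drops out because the population gradient w.r.t. $w$ is orthogonal to $w$, so only the component of the coarse gradient pointing towards $w^*$ contributes to the correlation.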

Lemma 6.

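When Algorithm 1 converges, $\mathbb{E}_Z\left[ \frac{\partial \ell}{\partial v}(v, w; Z) \right]$ and $\mathbb{E}_Z\left[ g_{\mathrm{relu}}(v, w; Z) \right]$ vanish simultaneously, which only occurs at the

  1. Saddle points where (8) is satisfied according to Proposition 1.

  2. Minimizers of (2) where $v = v^*$, $\theta(w, w^*) = 0$, or $v = (I_m + 1_m 1_m^\top)^{-1}(1_m 1_m^\top - I_m) v^*$, $\theta(w, w^*) = \pi$.

Lemma 6 states that when Algorithm 1 using the ReLU STE converges, it can only converge to a critical point of the population loss function.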

3.2 Derivative of the Clipped ReLU as STE

For the STE using clipped ReLU, $\mu(x) = \min\{\max\{x, 0\}, 1\}$ and $\mu'(x) = 1_{\{0 < x < 1\}}$. We have results similar to Lemmas 5 and 6. That is, the coarse partial gradient using the clipped ReLU STE generally has positive correlation with the true partial gradient of the population loss (Lemma 7). Moreover, the coarse gradient vanishes only at the critical points (Lemma 8).

Lemma 7.

If and , then

where same as in Lemma 5, and

with . The inner product between the expected coarse and true gradients w.r.t.

Moreover, if further and , there exists a constant depending on and , such that

Lemma 8.

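When Algorithm 1 converges, $\mathbb{E}_Z\left[ \frac{\partial \ell}{\partial v}(v, w; Z) \right]$ and $\mathbb{E}_Z\left[ g_{\mathrm{crelu}}(v, w; Z) \right]$ vanish simultaneously, which only occurs at the

  1. Saddle points where (8) is satisfied according to Proposition 1.

  2. Minimizers of (2) where $v = v^*$, $\theta(w, w^*) = 0$, or $v = (I_m + 1_m 1_m^\top)^{-1}(1_m 1_m^\top - I_m) v^*$, $\theta(w, w^*) = \pi$.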

3.3 Derivative of the Identity Function as STE

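Now we consider the derivative of the identity function. Results similar to Lemmas 5 and 6 are no longer valid. It happens that the coarse gradient derived from the identity STE does not vanish at local minima, and Algorithm 1 may never converge there.

Lemma 9.

Let $\mu(x) = x$ in (5). Then the expected coarse partial gradient w.r.t. $w$ is

$$\mathbb{E}_Z\left[ g_{\mathrm{id}}(v, w; Z) \right] = \frac{1}{\sqrt{2\pi}} \left( \|v\|^2 \frac{w}{\|w\|} - (v^\top v^*)\, w^* \right).$$

If $\theta(w, w^*) = \pi$ and $v = (I_m + 1_m 1_m^\top)^{-1}(1_m 1_m^\top - I_m) v^*$,

$$\mathbb{E}_Z\left[ g_{\mathrm{id}}(v, w; Z) \right] = \frac{2(m-1)}{\sqrt{2\pi}(m+1)^2} (1_m^\top v^*)^2 \frac{w}{\|w\|},$$

i.e., $\mathbb{E}_Z\left[ g_{\mathrm{id}}(v, w; Z) \right]$ does not vanish at the local minimizers if $1_m^\top v^* \neq 0$ and $m > 1$.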
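The claim of Lemma 9 can be checked numerically at a concrete spurious minimizer. The sketch below assumes the closed form $\mathbb{E}_Z[g_{\mathrm{id}}] = \frac{1}{\sqrt{2\pi}}\left( \|v\|^2 \frac{w}{\|w\|} - (v^\top v^*) w^* \right)$, which follows from $\mathbb{E}[Z_i 1_{\{Z_i^\top w > 0\}}] = \frac{1}{\sqrt{2\pi}} \frac{w}{\|w\|}$ for Gaussian rows, and the minimizer $v = (I + 11^\top)^{-1}(11^\top - I)v^*$ with $\theta(w, w^*) = \pi$; the function names are hypothetical.

```python
import math

def spurious_minimizer_v(v_star):
    """v = (I + 1 1^T)^{-1} (1 1^T - I) v*.

    With (I + 1 1^T)^{-1} = I - 1 1^T / (m + 1) (Sherman-Morrison),
    this simplifies to v = -v* + (2 s / (m + 1)) 1, where s = 1^T v*.
    """
    m, s = len(v_star), sum(v_star)
    return [2.0 * s / (m + 1) - x for x in v_star]

def expected_identity_coarse_grad(v, w, v_star, w_star):
    """Assumed closed form of E_Z[g_id(v, w; Z)] for unit-norm w*."""
    norm_w = math.sqrt(sum(x * x for x in w))
    v_sq = sum(x * x for x in v)
    v_dot = sum(a * b for a, b in zip(v, v_star))
    return [(v_sq * wk / norm_w - v_dot * wsk) / math.sqrt(2.0 * math.pi)
            for wk, wsk in zip(w, w_star)]

# Example: m = 3, theta(w, w*) = pi (w points opposite to w*).
v_star = [1.0, 2.0, 0.0]
w, w_star = [0.0, 2.0], [0.0, -1.0]
v = spurious_minimizer_v(v_star)
grad = expected_identity_coarse_grad(v, w, v_star, w_star)

# Predicted magnitude along w / ||w||: 2 (m - 1) (1^T v*)^2 / (sqrt(2 pi) (m + 1)^2).
m, s = len(v_star), sum(v_star)
predicted = 2.0 * (m - 1) * s ** 2 / (math.sqrt(2.0 * math.pi) * (m + 1) ** 2)
```

In this example the coarse gradient is nonzero and points along $w/\|w\|$, so an update with the identity STE keeps moving $w$ even though the iterate already sits at a local minimizer of the population loss.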

Lemma 10.

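If $w \neq 0_n$ and $\theta(w, w^*) \in (0, \pi)$, then the inner product between the expected coarse and true gradients w.r.t. $w$ is

$$\left\langle \mathbb{E}_Z[g_{\mathrm{id}}(v, w; Z)], \frac{\partial f}{\partial w}(v, w) \right\rangle = \frac{\sin \theta(w, w^*)}{(\sqrt{2\pi})^3 \|w\|} (v^\top v^*)^2 \ge 0. \qquad (15)$$

When $\theta(w, w^*) \to \pi$, $v \to (I_m + 1_m 1_m^\top)^{-1}(1_m 1_m^\top - I_m) v^*$, if $1_m^\top v^* \neq 0$ and $m > 1$, we have

$$\frac{\left\| \frac{\partial f}{\partial v}(v, w) \right\|^2 + \left\langle \mathbb{E}_Z[g_{\mathrm{id}}(v, w; Z)], \frac{\partial f}{\partial w}(v, w) \right\rangle}{\left\| \mathbb{E}_Z[g_{\mathrm{id}}(v, w; Z)] \right\|^2} \to 0. \qquad (16)$$

Lemma 9 suggests that if $1_m^\top v^* \neq 0$, the coarse gradient descent will never converge near the spurious minimizers with $\theta(w, w^*) = \pi$ and $v = (I_m + 1_m 1_m^\top)^{-1}(1_m 1_m^\top - I_m) v^*$, because $\mathbb{E}_Z[g_{\mathrm{id}}(v, w; Z)]$ does not vanish there. By the positive correlation implied by (15) of Lemma 10, for some proper $(v^0, w^0)$, the iterates $\{(v^t, w^t)\}$ may move towards a local minimizer in the beginning. But when $\{(v^t, w^t)\}$ approaches it, the descent property established in section 3.1 does not hold for $\mathbb{E}_Z[g_{\mathrm{id}}(v, w; Z)]$ because of (16), hence the training loss begins to increase and instability arises.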

4 Experiments

While our theory implies that both vanilla and clipped ReLUs learn a two-linear-layer CNN, their empirical performances on deeper nets are different. In this section, we compare the performances of the identity, ReLU and clipped ReLU STEs on MNIST (LeCun et al., 1998) and CIFAR-10 (Krizhevsky, 2009) benchmarks for 2-bit or 4-bit quantized activations. As an illustration, we plot the 2-bit quantized ReLU and its associated clipped ReLU in Figure 3 in the appendix. Intuitively, the clipped ReLU should be the best performer, as it best approximates the original quantized ReLU. We also report the instability issue of the training algorithm when using an improper STE in section 4.2. In all experiments, the weights are kept float.

The resolution $\alpha$ for the quantized ReLU needs to be carefully chosen to maintain the full-precision level of accuracy. To this end, we follow (Cai et al., 2017) and resort to a modified batch normalization layer (Ioffe & Szegedy, 2015) without the scale and shift, whose output components approximately follow a unit Gaussian distribution. Then the $\alpha$ that best fits the input of the activation layer can be pre-computed by a variant of Lloyd’s algorithm (Lloyd, 1982; Yin et al., 2018a) applied to a set of simulated 1-D half-Gaussian data. Once $\alpha$ is determined, it is fixed during the whole training process. Since the original LeNet-5 does not have batch normalization, we add one prior to each activation layer. We emphasize that we are not claiming the superiority of the quantization approach used here, as it is nothing but the HWGQ (Cai et al., 2017), except that we consider uniform quantization.
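To make the roles of the resolution and the bit-width concrete, here is a minimal sketch of a uniformly quantized ReLU and the derivative of its associated clipped ReLU. The rounding convention (rounding up to the next multiple of `alpha`, which reduces to the binary activation $1_{\{x>0\}}$ for 1 bit and `alpha = 1`) and the function names are assumptions made for illustration; the text above does not spell out the exact quantizer.

```python
import math

def quantized_relu(x, alpha, bits):
    """Uniform quantizer with levels 0, alpha, ..., (2^bits - 1) * alpha.

    Assumption: positive inputs are rounded up to the next level.
    """
    top = (2 ** bits - 1) * alpha
    if x <= 0:
        return 0.0
    if x > top:
        return top
    return math.ceil(x / alpha) * alpha

def clipped_relu(x, alpha, bits):
    """Clipped ReLU sharing the same saturation points as the quantizer."""
    top = (2 ** bits - 1) * alpha
    return min(max(x, 0.0), top)

def clipped_relu_ste(x, alpha, bits):
    """Derivative of the clipped ReLU, used as the STE in the backward pass."""
    top = (2 ** bits - 1) * alpha
    return 1.0 if 0.0 < x < top else 0.0
```

With `alpha = 0.5` and `bits = 2`, the levels are 0, 0.5, 1.0 and 1.5; the clipped ReLU STE passes gradients only for inputs strictly between 0 and 1.5, which is where the quantized ReLU is not saturated.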

The optimizer we use is the stochastic (coarse) gradient descent with momentum = 0.9 for all experiments. We train 50 epochs for LeNet-5 (LeCun et al., 1998) on MNIST, and 200 epochs for VGG-11 (Simonyan & Zisserman, 2014) and ResNet-20 (He et al., 2016) on CIFAR-10. The parameters/weights are initialized with those from their pre-trained full-precision counterparts. The schedule of the learning rate is specified in Table 2 in the appendix.

4.1 Comparison Results

The experimental results are summarized in Table 1, where we record both the training losses and validation accuracies. Among the three STEs, the derivative of clipped ReLU gives the best overall performance, followed by vanilla ReLU and then by the identity function. For deeper networks, clipped ReLU is the best performer. But on the relatively shallow LeNet-5 network, vanilla ReLU exhibits comparable performance to the clipped ReLU, which is somewhat in line with our theoretical finding that ReLU is a great STE for learning the two-linear-layer (shallow) CNN.

                                    Straight-through estimator
Dataset   Network    Bit-width   identity   vanilla ReLU   clipped ReLU
MNIST     LeNet5     2           /          /              /
                     4           /          /              /
CIFAR10   VGG11      2           0.19/      /              /
                     4           /          /              /
          ResNet20   2           /          /              /
                     4           /          /              /
Table 1: Training loss/validation accuracy (%) on MNIST and CIFAR-10 with quantized activations and float weights, for STEs using derivatives of the identity function, vanilla ReLU and clipped ReLU at bit-widths 2 and 4.

4.2 Instability

We report the phenomenon of being repelled from a good minimum on ResNet-20 with 4-bit activations when using the identity STE, to demonstrate the instability issue predicted in Theorem 1. By Table 1, the coarse gradient descent algorithms using the vanilla and clipped ReLUs converge to the neighborhoods of the minima with validation accuracies (training losses) of (0.25) and (0.04), respectively, whereas that using the identity STE gives (1.38). Note that the landscape of the empirical loss function does not depend on which STE is used in the training. We then initialize training with the two improved minima and use the identity STE. To see if the algorithm is stable there, we start the training with a tiny learning rate of . For both initializations, the training loss and validation error significantly increase within the first 20 epochs; see Figure 2. To speed up training, at epoch 20, we switch to the normal schedule of learning rate specified in Table 2 and run 200 additional epochs. The training using the identity STE ends up at a much worse minimum. This is because the coarse gradient with the identity STE does not vanish at the good minima in this case (Lemma 9). Similarly, the poor performance of the ReLU STE on 2-bit activated ResNet-20 is also due to the instability of the corresponding training algorithm at good minima, as illustrated by Figure 4 in Appendix C, although it diverges much more slowly.

Figure 2: When initialized with weights (good minima) produced by the vanilla (orange) and clipped (blue) ReLUs on ResNet-20 with 4-bit activations, the coarse gradient descent using the identity STE ends up being repelled from there. The learning rate is set to until epoch 20.

5 Concluding Remarks

We provided the first theoretical justification for the concept of STE, showing that it gives rise to a descent training algorithm. We considered three STEs: the derivatives of the identity function, vanilla ReLU and clipped ReLU, for learning a two-linear-layer CNN with binary activation. We derived the explicit formulas of the expected coarse gradients corresponding to the STEs, and showed that the negative expected coarse gradients based on the vanilla and clipped ReLUs are descent directions for minimizing the population loss, whereas the one based on the identity STE is not, since it generates a coarse gradient incompatible with the energy landscape. The instability/incompatibility issue was confirmed in CIFAR experiments for improper choices of STE. In future work, we aim to further understand coarse gradient descent for large-scale optimization problems with intractable gradients.


This work was partially supported by NSF grants DMS-1522383, IIS-1632935, ONR grant N00014-18-1-2527, AFOSR grant FA9550-18-0167, DOE grant DE-SC0013839 and STROBE STC NSF grant DMR-1548924.



A.  The Plots of Quantized and Clipped ReLUs

[Figure 3 panels: quantized ReLU, clipped ReLU]
Figure 3: The plots of the 2-bit quantized ReLU (with 4 quantization levels including 0) and the associated clipped ReLU. $\alpha$ is the resolution determined in advance of the network training.

B.  The Schedule of Learning Rate

                                                Learning rate
Network    # epochs   Batch size   initial   decay rate   milestone
LeNet5     50         64           0.1       0.1          [20,40]
VGG11      200        128          0.01      0.1          [80,140]
ResNet20   200        128          0.01      0.1          [80,140]
Table 2: The schedule of the learning rate.
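The schedule in Table 2 can be read as a step decay: the learning rate starts at the initial value and is multiplied by the decay rate at each milestone epoch. The sketch below encodes that reading; the convention that the decay takes effect from the milestone epoch onward (counting epochs from 0) and the names `SCHEDULES` and `lr_at_epoch` are assumptions, not details given in the table.

```python
# Values copied from Table 2.
SCHEDULES = {
    "LeNet5": {"epochs": 50, "batch_size": 64, "initial_lr": 0.1,
               "decay_rate": 0.1, "milestones": [20, 40]},
    "VGG11": {"epochs": 200, "batch_size": 128, "initial_lr": 0.01,
              "decay_rate": 0.1, "milestones": [80, 140]},
    "ResNet20": {"epochs": 200, "batch_size": 128, "initial_lr": 0.01,
                 "decay_rate": 0.1, "milestones": [80, 140]},
}

def lr_at_epoch(network, epoch):
    """Learning rate used during the given epoch (0-indexed)."""
    cfg = SCHEDULES[network]
    # Count the milestones already reached (assumed inclusive).
    passed = sum(1 for m in cfg["milestones"] if epoch >= m)
    return cfg["initial_lr"] * cfg["decay_rate"] ** passed
```

Under this reading, LeNet-5 trains with learning rate 0.1 for epochs 0 to 19, 0.01 for epochs 20 to 39, and 0.001 for the remaining epochs.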

C. Instability of ReLU STE on ResNet-20 with 2-bit Activations

Figure 4: When initialized with the weights produced by the clipped ReLU STE on ResNet-20 with 2-bit activations (88.38% validation accuracy), the coarse gradient descent using the ReLU STE with learning rate is not stable there, and both classification and training errors begin to increase.

D.  Additional Supporting Lemmas

Lemma 11.

Let be a Gaussian random vector with entries i.i.d. sampled from . Given nonzero vectors with the angle , we have