# Understanding Straight-Through Estimator in Training Activation Quantized Neural Nets


## 1 Introduction

Deep neural networks (DNNs) have achieved remarkable success in many machine learning applications such as computer vision (Krizhevsky et al., 2012; Ren et al., 2015), natural language processing (Collobert & Weston, 2008) and reinforcement learning (Mnih et al., 2015; Silver et al., 2016). However, deploying a DNN typically requires hundreds of megabytes of memory for the trainable full-precision floating-point parameters, and billions of floating-point operations for a single inference. To achieve substantial memory savings and energy efficiency at inference time, many recent efforts have been devoted to training coarsely quantized DNNs while maintaining the performance of their float counterparts (Courbariaux et al., 2015; Rastegari et al., 2016; Cai et al., 2017; Hubara et al., 2018; Yin et al., 2018b).

Training a fully quantized DNN amounts to solving a very challenging optimization problem. It calls for minimizing a piecewise constant and highly nonconvex empirical risk function subject to a discrete set-constraint that characterizes the quantized weights. In particular, weight quantization of DNNs has been extensively studied in the literature; see, for example, (Li et al., 2016; Zhu et al., 2016; Li et al., 2017; Yin et al., 2016, 2018a; Hou & Kwok, 2018; He et al., 2018; Li & Hao, 2018). On the other hand, the gradient of the loss in training an activation quantized DNN is almost everywhere (a.e.) zero, which makes standard back-propagation inapplicable. Arguably the most effective way around this issue is to construct a non-trivial search direction by properly modifying the chain rule. Specifically, one can replace the a.e. zero derivative of the quantized activation function in the chain rule with a related surrogate. This proxy derivative, used in the backward pass only, is referred to as the straight-through estimator (STE) (Bengio et al., 2013). In the same paper, Bengio et al. (2013) proposed an alternative approach based on stochastic neurons. In addition, Friesen & Domingos (2017) proposed the feasible target propagation algorithm for learning hard-threshold (or binary activated) networks (Lee et al., 2015) via convex combinatorial optimization.

### 1.1 Related Works

The idea of STE dates back to the celebrated perceptron algorithm (Rosenblatt, 1957, 1962) of the 1950s for learning single-layer perceptrons. The perceptron algorithm essentially does not calculate the “gradient” through the standard chain rule, but instead through a modified chain rule in which the derivative of the identity function serves as the proxy for the original derivative of the binary output function $1_{\{x>0\}}$. Its convergence has been extensively discussed in the literature; see, for example, (Widrow & Lehr, 1990; Freund & Schapire, 1999) and the references therein. Hinton (2012) extended this idea to train multi-layer networks with binary activations (a.k.a. binary neurons), namely, to back-propagate as if the activation had been the identity function. Bengio et al. (2013) proposed an STE variant which uses the derivative of the sigmoid function instead. In the training of DNNs with weights and activations constrained to $\pm 1$, Hubara et al. (2016) substituted the derivative of the signum activation function with $1_{\{|x|\le 1\}}$ in the backward pass, known as the saturated STE. Later the idea of STE was readily applied to the training of DNNs with general quantized ReLU activations (Hubara et al., 2018; Zhou et al., 2016; Cai et al., 2017; Choi et al., 2018; Yin et al., 2018b), where other proxies were used, including the derivatives of the vanilla ReLU and the clipped ReLU. Despite all the empirical success of STE, there is very limited theoretical understanding of it in training DNNs with stair-case activations.

Goel et al. (2018) considered the leaky ReLU activation of a one-hidden-layer network. They showed the convergence of the so-called Convertron algorithm, which uses the identity STE in the backward pass through the leaky ReLU layer. Other similar scenarios, where certain layers are not desirable for back-propagation, have been brought up recently by Wang et al. (2018) and Athalye et al. (2018). The former proposed an implicit weighted nonlocal Laplacian layer as the classifier to improve the generalization accuracy of DNNs. In the backward pass, the derivative of a pre-trained fully-connected layer was used as a surrogate. To circumvent adversarial defenses (Szegedy et al., 2013), Athalye et al. (2018) introduced the backward pass differentiable approximation, which shares the same spirit as STE, and successfully broke defenses at ICLR 2018 that rely on obfuscated gradients.

### 1.2 Main Contributions

Throughout this paper, we shall refer to the “gradient” of the loss function w.r.t. the weight variables through the STE-modified chain rule as the coarse gradient. Since the backward and forward passes do not match, the coarse gradient is certainly not the gradient of the loss function, and it is generally not the gradient of any function. Why, then, does searching in its negative direction minimize the training loss, given that this is not the standard gradient descent algorithm? Moreover, the choice of STE is non-unique, so what makes a good STE? From the optimization perspective, we take a step towards understanding STE in training quantized ReLU nets by addressing these questions.

On the theoretical side, we consider three representative STEs for learning a two-linear-layer network with binary activation and Gaussian data: the derivatives of the identity function (Rosenblatt, 1957; Hinton, 2012; Goel et al., 2018), the vanilla ReLU and the clipped ReLU (Cai et al., 2017; Hubara et al., 2016). We adopt the model of population loss minimization (Brutzkus & Globerson, 2017; Tian, 2017; Li & Yuan, 2017; Du et al., 2018). For the first time, we prove that proper choices of STE give rise to training algorithms that are descent methods. Specifically, the negative expected coarse gradients based on the STEs of the vanilla and clipped ReLUs are provably descent directions for minimizing the population loss, which yield monotonically decreasing energy during training. In contrast, this is not true for the identity STE. We further prove that the corresponding training algorithm can be unstable near certain local minima, because the coarse gradient may simply not vanish there.

Complementary to the analysis, we examine the empirical performance of the three STEs on MNIST and CIFAR-10 classification with general quantized ReLU. While both vanilla and clipped ReLUs work very well on the relatively shallow LeNet-5, the clipped ReLU STE is arguably the best for the deeper VGG-11 and ResNet-20. In our CIFAR experiments in section 4.2, we observe that training with the identity or ReLU STE can be unstable at good minima and be repelled to an inferior one with substantially higher training loss and decreased generalization accuracy. This suggests that poor STEs generate coarse gradients incompatible with the energy landscape, which is consistent with our theoretical finding about the identity STE.

To our knowledge, convergence guarantees of the perceptron algorithm (Rosenblatt, 1957, 1962) and the Convertron algorithm (Goel et al., 2018) were proved for the identity STE. It is worth noting that Convertron (Goel et al., 2018) makes weaker assumptions than this paper does. These results, however, do not generalize to the network with two trainable layers studied here. As mentioned above, the identity STE is actually a poor choice in our case. Moreover, it is not clear whether their analyses can be extended to other STEs. As in Convertron with leaky ReLU, the monotonicity of the quantized activation function plays a role in coarse gradient descent. Indeed, all three STEs considered here exploit this property. But this is not the whole story. A good STE like the clipped ReLU matches the quantized ReLU at the extrema; otherwise the instability/incompatibility issue may arise.

Organization. In section 2, we study the energy landscape of a two-linear-layer network with binary activation and Gaussian data. We present the main results and sketch the mathematical analysis for STE in section 3. In section 4, we compare the empirical performance of different STEs in 2-bit and 4-bit activation quantization, and report the instability phenomena of the training algorithms associated with poor STEs observed in CIFAR experiments. Due to space limitations, all the technical proofs as well as some figures are deferred to the appendix.

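Notations. $\|\cdot\|$ denotes the Euclidean norm of a vector or the spectral norm of a matrix. $0_n \in \mathbb{R}^n$ represents the vector of all zeros, whereas $1_n \in \mathbb{R}^n$ is the vector of all ones. $I_n$ is the identity matrix of order $n$. For any $w, z \in \mathbb{R}^n$, $w^\top z = \langle w, z \rangle = \sum_i w_i z_i$ is their inner product. $w \odot z$ denotes the Hadamard product whose $i$-th entry is given by $(w \odot z)_i = w_i z_i$.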

## 2 Learning Two-Linear-Layer CNN with Binary Activation

We consider a model similar to (Du et al., 2018) that outputs the prediction

$$y(Z, v, w) := \sum_{i=1}^{m} v_i\, \sigma(Z_i^\top w) = v^\top \sigma(Zw)$$

for some input $Z \in \mathbb{R}^{m \times n}$. Here $w \in \mathbb{R}^n$ and $v \in \mathbb{R}^m$ are the trainable weights in the first and second linear layer, respectively; $Z_i^\top$ denotes the $i$-th row vector of $Z$; the activation function $\sigma$ acts component-wise on the vector $Zw$, i.e., $\sigma(Zw)_i = \sigma(Z_i^\top w)$. The first layer serves as a convolutional layer, where each row $Z_i^\top$ can be viewed as a patch sampled from $Z$ and the weight filter $w$ is shared among all patches, and the second linear layer is the classifier. The label is generated according to $y^*(Z) = (v^*)^\top \sigma(Zw^*)$ for some true (non-zero) parameters $v^*$ and $w^*$. Moreover, we use the following squared sample loss

$$\ell(v, w; Z) := \frac{1}{2}\big(y(Z, v, w) - y^*(Z)\big)^2 = \frac{1}{2}\big(v^\top \sigma(Zw) - y^*(Z)\big)^2. \tag{1}$$

Unlike in (Du et al., 2018), the activation function here is not ReLU, but the binary function $\sigma(x) = 1_{\{x>0\}}$.

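We assume that the entries of $Z \in \mathbb{R}^{m \times n}$ are i.i.d. sampled from the Gaussian distribution $\mathcal{N}(0, 1)$ (Zhong et al., 2017; Brutzkus & Globerson, 2017). Since $\ell(v, w; Z) = \ell(v, w/c; Z)$ for any scalar $c > 0$, without loss of generality, we take $\|w^*\| = 1$ and cast the learning task as the following population loss minimization problem:

$$\min_{v \in \mathbb{R}^m,\, w \in \mathbb{R}^n} f(v, w) := \mathbb{E}_Z[\ell(v, w; Z)], \tag{2}$$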

where the sample loss is given by (1).
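The model and the loss above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the paper: the function names (`binary_activation`, `predict`, `sample_loss`, `population_loss_mc`) and the Monte Carlo sample size are assumptions made here, and the population loss (2) is only approximated by averaging over simulated Gaussian inputs.

```python
import numpy as np

def binary_activation(x):
    """sigma(x) = 1 if x > 0 else 0, applied component-wise."""
    return (x > 0).astype(float)

def predict(Z, v, w):
    """y(Z, v, w) = v^T sigma(Z w) for one input Z of shape (m, n)."""
    return v @ binary_activation(Z @ w)

def sample_loss(v, w, Z, v_star, w_star):
    """Squared sample loss (1), with the label y*(Z) = (v*)^T sigma(Z w*)."""
    return 0.5 * (predict(Z, v, w) - predict(Z, v_star, w_star)) ** 2

def population_loss_mc(v, w, v_star, w_star, num_samples=20000, seed=0):
    """Monte Carlo estimate of f(v, w) = E_Z[loss], entries of Z i.i.d. N(0, 1)."""
    rng = np.random.default_rng(seed)
    m, n = v.shape[0], w.shape[0]
    Z = rng.standard_normal((num_samples, m, n))        # (N, m, n)
    residual = (binary_activation(Z @ w) @ v
                - binary_activation(Z @ w_star) @ v_star)  # (N,)
    return 0.5 * np.mean(residual ** 2)
```

Because the binary activation only depends on the sign of each pre-activation, rescaling `w` by a positive constant leaves both functions unchanged, which is why normalizing the true filter loses no generality.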

### 2.1 Back-propagation and Coarse Gradient Descent

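With the Gaussian assumption on $Z$, as will be shown in section 2.2, it is possible to find the analytic expressions of $f(v, w)$ and its gradient

$$\nabla f(v, w) := \begin{bmatrix} \frac{\partial f}{\partial v}(v, w) \\ \frac{\partial f}{\partial w}(v, w) \end{bmatrix}.$$

The gradient of the objective function, however, is not available for the network training. In fact, we can only access the expected sample gradient, namely,

$$\mathbb{E}_Z\left[\frac{\partial \ell}{\partial v}(v, w; Z)\right] \ \text{and} \ \mathbb{E}_Z\left[\frac{\partial \ell}{\partial w}(v, w; Z)\right].$$

We remark that $\mathbb{E}_Z\left[\frac{\partial \ell}{\partial w}(v, w; Z)\right]$ is not the same as $\frac{\partial f}{\partial w}(v, w) = \frac{\partial\, \mathbb{E}_Z[\ell(v, w; Z)]}{\partial w}$. By the standard back-propagation or chain rule, we readily check that

$$\frac{\partial \ell}{\partial v}(v, w; Z) = \sigma(Zw)\big(v^\top \sigma(Zw) - y^*(Z)\big) \tag{3}$$

and

$$\frac{\partial \ell}{\partial w}(v, w; Z) = Z^\top\big(\sigma'(Zw) \odot v\big)\big(v^\top \sigma(Zw) - y^*(Z)\big). \tag{4}$$

Note that $\sigma'$ is zero a.e., which makes (4) inapplicable to the training. The idea of STE is to simply replace the a.e. zero component $\sigma'$ in (4) with a related non-trivial function $\mu'$ (Hinton, 2012; Bengio et al., 2013; Hubara et al., 2016; Cai et al., 2017), which is the derivative of some (sub)differentiable function $\mu$. More precisely, back-propagation using the STE $\mu'$ gives the following non-trivial surrogate of $\frac{\partial \ell}{\partial w}(v, w; Z)$, to which we refer as the coarse (partial) gradient

$$g_\mu(v, w; Z) = Z^\top\big(\mu'(Zw) \odot v\big)\big(v^\top \sigma(Zw) - y^*(Z)\big). \tag{5}$$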

Using the STE $\mu'$ to train the two-linear-layer convolutional neural network (CNN) with binary activation gives rise to the (full-batch) coarse gradient descent described in Algorithm 1.
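Since the listing of Algorithm 1 is not reproduced here, the following is a hedged sketch of what a coarse gradient descent loop could look like for this model. It replaces the exact expectations with mini-batch averages over simulated Gaussian inputs, which is an assumption of this sketch; the function names, the learning rate, the batch size and the step count are all illustrative choices rather than values from the paper.

```python
import numpy as np

# Derivatives mu'(x) of the three surrogates considered as STEs.
STE_DERIVATIVES = {
    "identity": lambda x: np.ones_like(x),
    "relu": lambda x: (x > 0).astype(float),
    "clipped_relu": lambda x: ((x > 0) & (x < 1)).astype(float),
}

def sigma(x):
    """Binary activation 1_{x > 0}."""
    return (x > 0).astype(float)

def coarse_gradients(v, w, Z, v_star, w_star, ste="relu"):
    """Batch averages of (3) and of the coarse gradient (5); Z is (batch, m, n)."""
    pre = Z @ w                                               # (batch, m)
    residual = sigma(pre) @ v - sigma(Z @ w_star) @ v_star    # (batch,)
    grad_v = np.mean(sigma(pre) * residual[:, None], axis=0)  # (m,)
    weights = STE_DERIVATIVES[ste](pre) * v                   # mu'(Zw) ⊙ v
    # Per-sample Z^T (mu'(Zw) ⊙ v), then scaled by the residual.
    g_w = np.mean(np.einsum("bi,bin->bn", weights, Z) * residual[:, None], axis=0)
    return grad_v, g_w

def coarse_gradient_descent(v0, w0, v_star, w_star, ste="relu",
                            lr=0.05, steps=200, batch=4096, seed=0):
    """Iterate v <- v - lr * grad_v and w <- w - lr * g_w on fresh Gaussian batches."""
    rng = np.random.default_rng(seed)
    v, w = np.array(v0, dtype=float), np.array(w0, dtype=float)
    m, n = v.shape[0], w.shape[0]
    for _ in range(steps):
        Z = rng.standard_normal((batch, m, n))
        grad_v, g_w = coarse_gradients(v, w, Z, v_star, w_star, ste)
        v -= lr * grad_v
        w -= lr * g_w
    return v, w
```

Only the update of `w` depends on the STE; the update of `v` uses the ordinary sample gradient (3), because the loss is differentiable in `v`.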

### 2.2 Preliminaries

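Let us present some preliminaries about the landscape of the population loss function $f(v, w)$. To this end, we define the angle between $w$ and $w^*$ as $\theta(w, w^*) := \arccos\left(\frac{w^\top w^*}{\|w\|\|w^*\|}\right)$ for any $w \neq 0_n$. Recalling that the label is given by $y^*(Z) = (v^*)^\top \sigma(Zw^*)$ from (1), we elaborate on the analytic expressions of $f(v, w)$ and $\nabla f(v, w)$.

###### Lemma 1.

If $w \neq 0_n$, the population loss $f(v, w)$ is given by

$$\frac{1}{8}\Big[v^\top\big(I_m + 1_m 1_m^\top\big)v - 2v^\top\Big(\big(1 - \tfrac{2}{\pi}\theta(w, w^*)\big)I_m + 1_m 1_m^\top\Big)v^* + (v^*)^\top\big(I_m + 1_m 1_m^\top\big)v^*\Big].$$

In addition, $f(v, w) = \frac{1}{8}(v^*)^\top\big(I_m + 1_m 1_m^\top\big)v^*$ for $w = 0_n$.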
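As an illustration, the closed form of Lemma 1 can be evaluated directly. The sketch below is an assumption-laden transcription: the names `angle` and `population_loss` are chosen here, and the value at $w = 0_n$ is computed from the fact that the binary activation outputs zero there, so the prediction is identically zero.

```python
import numpy as np

def angle(w, w_star):
    """theta(w, w*) = arccos(w^T w* / (||w|| ||w*||)), clipped for numerical safety."""
    c = w @ w_star / (np.linalg.norm(w) * np.linalg.norm(w_star))
    return np.arccos(np.clip(c, -1.0, 1.0))

def population_loss(v, w, v_star, w_star):
    """Closed-form population loss of Lemma 1."""
    m = v.shape[0]
    A = np.eye(m) + np.ones((m, m))            # I_m + 1_m 1_m^T
    if not np.any(w):                          # w = 0_n: the prediction is zero
        return v_star @ A @ v_star / 8.0
    theta = angle(w, w_star)
    B = (1.0 - 2.0 * theta / np.pi) * np.eye(m) + np.ones((m, m))
    return (v @ A @ v - 2.0 * v @ B @ v_star + v_star @ A @ v_star) / 8.0
```

For example, with $m = n = 2$, $v = (1, 0)$, $v^* = (0, 1)$ and orthogonal unit filters, the prediction and the label are independent fair coin flips, so the loss is $\frac{1}{2}\cdot\frac{1}{2} = 0.25$, which is what the formula returns.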

###### Lemma 2.

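If $w \neq 0_n$ and $\theta(w, w^*) \in (0, \pi)$, the partial gradients of $f(v, w)$ w.r.t. $v$ and $w$ are

$$\frac{\partial f}{\partial v}(v, w) = \frac{1}{4}\big(I_m + 1_m 1_m^\top\big)v - \frac{1}{4}\Big(\big(1 - \tfrac{2}{\pi}\theta(w, w^*)\big)I_m + 1_m 1_m^\top\Big)v^* \tag{6}$$

and

$$\frac{\partial f}{\partial w}(v, w) = -\frac{v^\top v^*}{2\pi\|w\|}\, \frac{\Big(I_n - \frac{ww^\top}{\|w\|^2}\Big)w^*}{\Big\|\Big(I_n - \frac{ww^\top}{\|w\|^2}\Big)w^*\Big\|}, \tag{7}$$

respectively.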
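The two formulas of Lemma 2 translate into a short function. This is a sketch under the lemma's own assumptions ($w \neq 0_n$, $\|w^*\| = 1$ and $0 < \theta < \pi$); the name `population_gradients` is introduced here for illustration.

```python
import numpy as np

def population_gradients(v, w, v_star, w_star):
    """Closed forms (6) and (7); assumes w != 0, ||w*|| = 1 and 0 < theta < pi."""
    m, n = v.shape[0], w.shape[0]
    w_norm = np.linalg.norm(w)
    theta = np.arccos(np.clip(w @ w_star / w_norm, -1.0, 1.0))
    ones = np.ones((m, m))
    A = np.eye(m) + ones
    B = (1.0 - 2.0 * theta / np.pi) * np.eye(m) + ones
    grad_v = 0.25 * (A @ v - B @ v_star)                       # equation (6)
    # Component of w* orthogonal to w, normalized to unit length.
    proj = (np.eye(n) - np.outer(w, w) / w_norm**2) @ w_star
    grad_w = -(v @ v_star) / (2.0 * np.pi * w_norm) * proj / np.linalg.norm(proj)
    return grad_v, grad_w                                      # grad_w is (7)
```

Note that `grad_w` is always orthogonal to `w`: the population loss depends on `w` only through the angle $\theta(w, w^*)$, so moving along `w` itself does not change it.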

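For any $v \in \mathbb{R}^m$, $(v, 0_n)$ cannot be a local minimizer. The only possible (local) minimizers of the model (2) are located at

1. Stationary points where the gradients given by (6) and (7) vanish simultaneously (which may not be possible), i.e.,

   $$v^\top v^* = 0 \ \text{and} \ v = \big(I_m + 1_m 1_m^\top\big)^{-1}\Big(\big(1 - \tfrac{2}{\pi}\theta(w, w^*)\big)I_m + 1_m 1_m^\top\Big)v^*. \tag{8}$$
2. Non-differentiable points where $\theta(w, w^*) = 0$ and $v = v^*$, or $\theta(w, w^*) = \pi$ and $v = \big(I_m + 1_m 1_m^\top\big)^{-1}\big(1_m 1_m^\top - I_m\big)v^*$.

Among them, the points with $v = v^*$ and $\theta(w, w^*) = 0$ are obviously the global minimizers of (2). We show that the stationary points, if they exist, can only be saddle points, and the points with $\theta(w, w^*) = \pi$ and $v = \big(I_m + 1_m 1_m^\top\big)^{-1}\big(1_m 1_m^\top - I_m\big)v^*$ are the only potential spurious local minimizers.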

###### Proposition 1.

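If the true parameter $v^*$ satisfies $(1_m^\top v^*)^2 < \frac{m+1}{2}\|v^*\|^2$, then

$$\left\{(v, w) : v = \big(I_m + 1_m 1_m^\top\big)^{-1}\left(\frac{-(1_m^\top v^*)^2}{(m+1)\|v^*\|^2 - (1_m^\top v^*)^2}\, I_m + 1_m 1_m^\top\right)v^*,\ \ \theta(w, w^*) = \frac{\pi}{2}\cdot\frac{(m+1)\|v^*\|^2}{(m+1)\|v^*\|^2 - (1_m^\top v^*)^2}\right\} \tag{9}$$

give the saddle points obeying (8), and the points with $\theta(w, w^*) = \pi$ and $v = \big(I_m + 1_m 1_m^\top\big)^{-1}\big(1_m 1_m^\top - I_m\big)v^*$ are the spurious local minimizers. Otherwise, the model (2) has no saddle points or spurious local minimizers.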
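To make (9) concrete, the sketch below computes the saddle-point value of $v$ and the corresponding angle for a given $v^*$, returning `None` when the condition of Proposition 1 fails. The function name `saddle_point` and the `None` convention are assumptions of this illustration.

```python
import numpy as np

def saddle_point(v_star):
    """Return (v, theta) from (9), or None if (1^T v*)^2 >= (m + 1)/2 * ||v*||^2."""
    m = v_star.shape[0]
    s2 = np.sum(v_star) ** 2                  # (1_m^T v*)^2
    r2 = (m + 1) * (v_star @ v_star)          # (m + 1) ||v*||^2
    if not s2 < r2 / 2.0:
        return None                           # no saddle point in this case
    c = -s2 / (r2 - s2)                       # coefficient of I_m in (9)
    ones = np.ones((m, m))
    v = np.linalg.solve(np.eye(m) + ones, (c * np.eye(m) + ones) @ v_star)
    theta = 0.5 * np.pi * r2 / (r2 - s2)
    return v, theta
```

For $v^* = (1, 0)$ this gives $v = (0, 0.5)$ and $\theta = 3\pi/4$, so that $v^\top v^* = 0$ and $1 - \frac{2}{\pi}\theta = -\frac{1}{2}$ matches the coefficient in (9), which is exactly the pair of conditions in (8).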

We further prove that the population gradient given by (6) and (7), is Lipschitz continuous when restricted to bounded domains.

###### Lemma 3.

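For any differentiable points $(v, w)$ and $(\tilde v, \tilde w)$ with $\min\{\|w\|, \|\tilde w\|\} = c_w > 0$ and $\max\{\|v\|, \|\tilde v\|\} = C_v$, there exists a Lipschitz constant $L > 0$ depending on $C_v$ and $c_w$, such that

$$\|\nabla f(v, w) - \nabla f(\tilde v, \tilde w)\| \le L\,\|(v, w) - (\tilde v, \tilde w)\|.$$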

## 3 Main Results

We are most interested in the complex case where both the saddle points and spurious local minimizers are present. Our main results are concerned with the behaviors of the coarse gradient descent summarized in Algorithm 1 when the derivatives of the vanilla and clipped ReLUs as well as the identity function serve as the STE, respectively. We shall prove that Algorithm 1 using the derivative of vanilla or clipped ReLU converges to a critical point, whereas that with the identity STE does not.

###### Theorem 1 (Convergence).

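Let $\{(v^t, w^t)\}$ be the sequence generated by Algorithm 1 with ReLU $\mu(x) = \max\{x, 0\}$ or clipped ReLU $\mu(x) = \min\{\max\{x, 0\}, 1\}$. Suppose $\|w^t\| \ge c_w$ for all $t$ with some $c_w > 0$. Then if the learning rate $\eta > 0$ is sufficiently small, for any initialization $(v^0, w^0)$, the objective sequence $\{f(v^t, w^t)\}$ is monotonically decreasing, and $\{(v^t, w^t)\}$ converges to a saddle point or a (local) minimizer of the population loss minimization (2). In addition, if $1_m^\top v^* \neq 0$ and $m > 1$, the descent and convergence properties do not hold for Algorithm 1 with the identity function $\mu(x) = x$ near the local minimizers satisfying $\theta(w, w^*) = \pi$ and $v = \big(I_m + 1_m 1_m^\top\big)^{-1}\big(1_m 1_m^\top - I_m\big)v^*$.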

###### Remark 1.

The convergence guarantee for the coarse gradient descent is established under the assumption that there are infinitely many training samples. With only a few data points, the empirical loss roughly descends, on a coarse scale, along the direction of the negative coarse gradient, as illustrated by Figure 1. As the sample size increases, the empirical loss gains monotonicity and smoothness. This explains why a (proper) STE works so well with massive amounts of data, as in deep learning.

###### Remark 2.

The same results hold if the Gaussian assumption on the input data is weakened to the assumption that their rows are i.i.d. samples from some rotation-invariant distribution. The proof is substantially similar.

In the rest of this section, we sketch the mathematical analysis for the main results.

### 3.1 Derivative of the Vanilla ReLU as STE

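If we choose the derivative of ReLU as the STE in (5), it is easy to see that $\mu'(x) = \sigma(x)$, and we have the following expressions of $\mathbb{E}_Z\left[\frac{\partial \ell}{\partial v}(v, w; Z)\right]$ and $\mathbb{E}_Z\left[g_{relu}(v, w; Z)\right]$ for Algorithm 1.

###### Lemma 4.

The expected partial gradient of $\ell(v, w; Z)$ w.r.t. $v$ is

$$\mathbb{E}_Z\left[\frac{\partial \ell}{\partial v}(v, w; Z)\right] = \frac{\partial f}{\partial v}(v, w). \tag{10}$$

Let $\mu(x) = \max\{x, 0\}$ in (5). The expected coarse gradient w.r.t. $w$ is

$$\mathbb{E}_Z\left[g_{relu}(v, w; Z)\right] = \frac{h(v, v^*)}{2\sqrt{2\pi}}\frac{w}{\|w\|} - \cos\left(\frac{\theta(w, w^*)}{2}\right)\frac{v^\top v^*}{\sqrt{2\pi}}\, \frac{\frac{w}{\|w\|} + w^*}{\left\|\frac{w}{\|w\|} + w^*\right\|}, \tag{11}$$

where $h(v, v^*) = \|v\|^2 + (1_m^\top v)^2 - (1_m^\top v)(1_m^\top v^*) + v^\top v^*$. We redefine the second term of (11) as $0_n$ in the case $\theta(w, w^*) = \pi$, or equivalently, $\frac{w}{\|w\|} + w^* = 0_n$.

As stated in Lemma 5 below, the key observation is that the coarse partial gradient $\mathbb{E}_Z\left[g_{relu}(v, w; Z)\right]$ has non-negative correlation with the population partial gradient $\frac{\partial f}{\partial w}(v, w)$, and $-\mathbb{E}_Z\left[g_{relu}(v, w; Z)\right]$ together with $-\mathbb{E}_Z\left[\frac{\partial \ell}{\partial v}(v, w; Z)\right]$ form a descent direction for minimizing the population loss.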

###### Lemma 5.

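If $w \neq 0_n$ and $\theta(w, w^*) \in (0, \pi)$, then the inner product between the expected coarse and population gradients w.r.t. $w$ is

$$\left\langle \mathbb{E}_Z\left[g_{relu}(v, w; Z)\right], \frac{\partial f}{\partial w}(v, w)\right\rangle = \frac{\sin\big(\theta(w, w^*)\big)}{2(\sqrt{2\pi})^3\|w\|}\,(v^\top v^*)^2 \ge 0.$$

Moreover, if further $\|v\| \le C_v$ and $\|w\| \ge c_w$, there exists a constant $A_{relu} > 0$ depending on $C_v$ and $c_w$, such that

$$\left\|\mathbb{E}_Z\left[g_{relu}(v, w; Z)\right]\right\|^2 \le A_{relu}\left(\left\|\frac{\partial f}{\partial v}(v, w)\right\|^2 + \left\langle \mathbb{E}_Z\left[g_{relu}(v, w; Z)\right], \frac{\partial f}{\partial w}(v, w)\right\rangle\right). \tag{12}$$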

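Clearly, when $\left\langle \mathbb{E}_Z\left[g_{relu}(v, w; Z)\right], \frac{\partial f}{\partial w}(v, w)\right\rangle > 0$, $\mathbb{E}_Z\left[g_{relu}(v, w; Z)\right]$ is roughly in the same direction as $\frac{\partial f}{\partial w}(v, w)$. Moreover, since by Lemma 4, $\mathbb{E}_Z\left[\frac{\partial \ell}{\partial v}(v, w; Z)\right] = \frac{\partial f}{\partial v}(v, w)$, we expect that the coarse gradient descent behaves like the gradient descent directly on $f(v, w)$. Here we would like to highlight the significance of the estimate (12) in guaranteeing the descent property of Algorithm 1. By the Lipschitz continuity of $\nabla f$ specified in Lemma 3, it holds that

$$\begin{aligned}
f(v^{t+1}, w^{t+1}) - f(v^t, w^t) \le{}& \left\langle \frac{\partial f}{\partial v}(v^t, w^t), v^{t+1} - v^t\right\rangle + \left\langle \frac{\partial f}{\partial w}(v^t, w^t), w^{t+1} - w^t\right\rangle \\
& + \frac{L}{2}\left(\|v^{t+1} - v^t\|^2 + \|w^{t+1} - w^t\|^2\right) \\
={}& -\left(\eta - \frac{L\eta^2}{2}\right)\left\|\frac{\partial f}{\partial v}(v^t, w^t)\right\|^2 + \frac{L\eta^2}{2}\left\|\mathbb{E}_Z\left[g_{relu}(v^t, w^t; Z)\right]\right\|^2 \\
& - \eta\left\langle \frac{\partial f}{\partial w}(v^t, w^t), \mathbb{E}_Z\left[g_{relu}(v^t, w^t; Z)\right]\right\rangle \\
\overset{a)}{\le}{}& -\left(\eta - \frac{(1 + A_{relu})L\eta^2}{2}\right)\left\|\frac{\partial f}{\partial v}(v^t, w^t)\right\|^2 \\
& - \left(\eta - \frac{A_{relu}L\eta^2}{2}\right)\left\langle \frac{\partial f}{\partial w}(v^t, w^t), \mathbb{E}_Z\left[g_{relu}(v^t, w^t; Z)\right]\right\rangle,
\end{aligned} \tag{13}$$

where a) is due to (12). Therefore, if $\eta$ is small enough, both coefficients on the right-hand side are positive and we have monotonically decreasing energy until convergence.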
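The closed form (11) and the correlation identity of Lemma 5 can be checked numerically. The sketch below is illustrative only: `expected_relu_coarse_gradient` is a name chosen here, and it assumes $w \neq 0_n$ and $\|w^*\| = 1$ as in the analysis above.

```python
import numpy as np

def expected_relu_coarse_gradient(v, w, v_star, w_star):
    """Expected coarse gradient (11) for the ReLU STE; assumes w != 0, ||w*|| = 1."""
    w_hat = w / np.linalg.norm(w)
    theta = np.arccos(np.clip(w_hat @ w_star, -1.0, 1.0))
    sum_v, sum_vs = np.sum(v), np.sum(v_star)
    # h(v, v*) = ||v||^2 + (1^T v)^2 - (1^T v)(1^T v*) + v^T v*
    h = v @ v + sum_v**2 - sum_v * sum_vs + v @ v_star
    first = h / (2.0 * np.sqrt(2.0 * np.pi)) * w_hat
    bisector = w_hat + w_star
    norm = np.linalg.norm(bisector)
    if norm < 1e-12:        # theta = pi: the second term is defined as zero
        return first
    second = np.cos(theta / 2.0) * (v @ v_star) / np.sqrt(2.0 * np.pi) * bisector / norm
    return first - second
```

As a worked case, take $v = v^* = (1, 1)$, $w = (2, 0)$ and $w^* = (0, 1)$, so $\theta = \pi/2$ and $h = 4$. Then (11) gives $\frac{1}{\sqrt{2\pi}}(1, -1)$, while (7) gives $\frac{\partial f}{\partial w} = (0, -\frac{1}{2\pi})$; their inner product is $\frac{1}{(\sqrt{2\pi})^3}$, in agreement with the formula of Lemma 5.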

###### Lemma 6.

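When Algorithm 1 converges, $\mathbb{E}_Z\left[\frac{\partial \ell}{\partial v}(v, w; Z)\right]$ and $\mathbb{E}_Z\left[g_{relu}(v, w; Z)\right]$ vanish simultaneously, which only occurs at the

1. Saddle points where (8) is satisfied according to Proposition 1.

2. Minimizers of (2) where $v = v^*$, $\theta(w, w^*) = 0$, or $v = \big(I_m + 1_m 1_m^\top\big)^{-1}\big(1_m 1_m^\top - I_m\big)v^*$, $\theta(w, w^*) = \pi$.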

Lemma 6 states that when Algorithm 1 using ReLU STE converges, it can only converge to a critical point of the population loss function.

### 3.2 Derivative of the Clipped ReLU as STE

For the STE using clipped ReLU, $\mu(x) = \min\{\max\{x, 0\}, 1\}$ and $\mu'(x) = 1_{\{0 < x < 1\}}$. We have results similar to Lemmas 5 and 6. That is, the coarse partial gradient using the clipped ReLU STE generally has positive correlation with the true partial gradient of the population loss (Lemma 7). Moreover, the coarse gradient vanishes at, and only at, the critical points (Lemma 8).

###### Lemma 7.

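If $w \neq 0_n$ and $\theta := \theta(w, w^*) \in (0, \pi)$, then

$$\begin{aligned}
\mathbb{E}_Z\left[g_{crelu}(v, w; Z)\right] ={}& p(0, w)\,\frac{h(v, v^*)}{2}\,\frac{w}{\|w\|} - (v^\top v^*)\csc(\theta/2)\cdot q(\theta, w)\,\frac{\frac{w}{\|w\|} + w^*}{\left\|\frac{w}{\|w\|} + w^*\right\|} \\
& - (v^\top v^*)\big(p(\theta, w) - \cot(\theta/2)\cdot q(\theta, w)\big)\frac{w}{\|w\|},
\end{aligned}$$

where $h(v, v^*) := \|v\|^2 + (1_m^\top v)^2 - (1_m^\top v)(1_m^\top v^*) + v^\top v^*$ is the same as in Lemma 4, and

$$p(\theta, w) := \frac{1}{2\pi}\int_{-\frac{\pi}{2} + \theta}^{\frac{\pi}{2}} \cos(\phi)\, \xi\left(\frac{\sec(\phi)}{\|w\|}\right) d\phi, \qquad q(\theta, w) := \frac{1}{2\pi}\int_{-\frac{\pi}{2} + \theta}^{\frac{\pi}{2}} \sin(\phi)\, \xi\left(\frac{\sec(\phi)}{\|w\|}\right) d\phi$$

with $\xi(x) := \int_0^x r^2 \exp\left(-\frac{r^2}{2}\right) dr$. The inner product between the expected coarse and true gradients w.r.t. $w$ is

$$\left\langle \mathbb{E}_Z\left[g_{crelu}(v, w; Z)\right], \frac{\partial f}{\partial w}(v, w)\right\rangle = \frac{q(\theta, w)}{2\pi\|w\|}\,(v^\top v^*)^2 \ge 0.$$

Moreover, if further $\|v\| \le C_v$ and $\|w\| \ge c_w$, there exists a constant $A_{crelu} > 0$ depending on $C_v$ and $c_w$, such that

$$\left\|\mathbb{E}_Z\left[g_{crelu}(v, w; Z)\right]\right\|^2 \le A_{crelu}\left(\left\|\frac{\partial f}{\partial v}(v, w)\right\|^2 + \left\langle \mathbb{E}_Z\left[g_{crelu}(v, w; Z)\right], \frac{\partial f}{\partial w}(v, w)\right\rangle\right).$$

###### Lemma 8.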

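When Algorithm 1 converges, $\mathbb{E}_Z\left[\frac{\partial \ell}{\partial v}(v, w; Z)\right]$ and $\mathbb{E}_Z\left[g_{crelu}(v, w; Z)\right]$ vanish simultaneously, which only occurs at the

1. Saddle points where (8) is satisfied according to Proposition 1.

2. Minimizers of (2) where $v = v^*$, $\theta(w, w^*) = 0$, or $v = \big(I_m + 1_m 1_m^\top\big)^{-1}\big(1_m 1_m^\top - I_m\big)v^*$, $\theta(w, w^*) = \pi$.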

### 3.3 Derivative of the Identity Function as STE

Now we consider the derivative of identity function. Similar results to Lemmas 5 and 6 are not valid anymore. It happens that the coarse gradient derived from the identity STE does not vanish at local minima, and Algorithm 1 may never converge there.

###### Lemma 9.

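Let $\mu(x) = x$ in (5). Then the expected coarse partial gradient w.r.t. $w$ is

$$\mathbb{E}_Z\left[g_{id}(v, w; Z)\right] = \frac{1}{\sqrt{2\pi}}\left(\|v\|^2\frac{w}{\|w\|} - (v^\top v^*)\,w^*\right). \tag{14}$$

If $\theta(w, w^*) = \pi$ and $v = \big(I_m + 1_m 1_m^\top\big)^{-1}\big(1_m 1_m^\top - I_m\big)v^*$,

$$\left\|\mathbb{E}_Z\left[g_{id}(v, w; Z)\right]\right\| = \frac{2(m-1)}{\sqrt{2\pi}(m+1)^2}\,(1_m^\top v^*)^2 \ge 0,$$

i.e., $\mathbb{E}_Z\left[g_{id}(v, w; Z)\right]$ does not vanish at the local minimizers if $1_m^\top v^* \neq 0$ and $m > 1$.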
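Lemma 9 is easy to verify on a small instance. The sketch below evaluates (14) at the spurious local minimizer; the function names are illustrative assumptions, and $\|w^*\| = 1$ is assumed as before.

```python
import numpy as np

def expected_identity_coarse_gradient(v, w, v_star, w_star):
    """Expected coarse gradient (14) for the identity STE; assumes ||w*|| = 1."""
    w_hat = w / np.linalg.norm(w)
    return ((v @ v) * w_hat - (v @ v_star) * w_star) / np.sqrt(2.0 * np.pi)

def spurious_minimizer_v(v_star):
    """v = (I_m + 1 1^T)^{-1} (1 1^T - I_m) v*, the v-part of the spurious minimizer."""
    m = v_star.shape[0]
    ones = np.ones((m, m))
    return np.linalg.solve(np.eye(m) + ones, (ones - np.eye(m)) @ v_star)
```

With $m = 2$, $v^* = (1, 0)$, $w^* = (1, 0)$ and $w = -3w^*$ (so $\theta = \pi$), the second function returns $v = (-\frac{1}{3}, \frac{2}{3})$ and the coarse gradient has norm $\frac{2}{9\sqrt{2\pi}}$, matching $\frac{2(m-1)}{\sqrt{2\pi}(m+1)^2}(1_m^\top v^*)^2$. It is nonzero, so an iteration driven by the identity STE keeps moving at this point.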

###### Lemma 10.

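If $w \neq 0_n$ and $\theta(w, w^*) \in (0, \pi)$, then the inner product between the expected coarse and true gradients w.r.t. $w$ is

$$\left\langle \mathbb{E}_Z\left[g_{id}(v, w; Z)\right], \frac{\partial f}{\partial w}(v, w)\right\rangle = \frac{\sin\big(\theta(w, w^*)\big)}{(\sqrt{2\pi})^3\|w\|}\,(v^\top v^*)^2 \ge 0. \tag{15}$$

When $\theta(w, w^*) \to \pi$ and $v \to \big(I_m + 1_m 1_m^\top\big)^{-1}\big(1_m 1_m^\top - I_m\big)v^*$, if $1_m^\top v^* \neq 0$ and $m > 1$, we have

$$\frac{\left\|\mathbb{E}_Z\left[g_{id}(v, w; Z)\right]\right\|^2}{\left\|\frac{\partial f}{\partial v}(v, w)\right\|^2 + \left\langle \mathbb{E}_Z\left[g_{id}(v, w; Z)\right], \frac{\partial f}{\partial w}(v, w)\right\rangle} \to +\infty. \tag{16}$$

Lemma 9 suggests that if $1_m^\top v^* \neq 0$ and $m > 1$, the coarse gradient descent will never converge near the spurious minimizers with $\theta(w, w^*) = \pi$ and $v = \big(I_m + 1_m 1_m^\top\big)^{-1}\big(1_m 1_m^\top - I_m\big)v^*$, because $\mathbb{E}_Z\left[g_{id}(v, w; Z)\right]$ does not vanish there. By the positive correlation implied by (15) of Lemma 10, for some proper $(v^0, w^0)$, the iterates $\{(v^t, w^t)\}$ may move towards a local minimizer in the beginning. But when $\{(v^t, w^t)\}$ approaches it, the descent property (13) does not hold for $\mathbb{E}_Z\left[g_{id}(v, w; Z)\right]$ because of (16), hence the training loss begins to increase and instability arises.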

## 4 Experiments

While our theory implies that both vanilla and clipped ReLUs learn a two-linear-layer CNN, their empirical performances on deeper nets are different. In this section, we compare the performances of the identity, ReLU and clipped ReLU STEs on MNIST (LeCun et al., 1998) and CIFAR-10 (Krizhevsky, 2009) benchmarks for 2-bit or 4-bit quantized activations. As an illustration, we plot the 2-bit quantized ReLU and its associated clipped ReLU in Figure 3 in the appendix. Intuitively, the clipped ReLU should be the best performer, as it best approximates the original quantized ReLU. We also report the instability issue of the training algorithm when using an improper STE in section 4.2. In all experiments, the weights are kept float.

The resolution of the quantized ReLU needs to be carefully chosen to maintain full-precision-level accuracy. To this end, we follow (Cai et al., 2017) and resort to a modified batch normalization layer (Ioffe & Szegedy, 2015) without the scale and shift, whose output components approximately follow a unit Gaussian distribution. The resolution that best fits the input of the activation layer can then be pre-computed by a variant of Lloyd’s algorithm (Lloyd, 1982; Yin et al., 2018a) applied to a set of simulated 1-D half-Gaussian data. Once determined, the resolution is fixed during the whole training process. Since the original LeNet-5 does not have batch normalization, we add one prior to each activation layer. We emphasize that we are not claiming the superiority of the quantization approach used here, as it is nothing but the HWGQ (Cai et al., 2017), except that we consider uniform quantization.
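For readers who want a concrete picture of the forward and backward passes, here is a minimal NumPy sketch of a uniformly quantized ReLU with a clipped-ReLU STE. The exact definition of the quantized ReLU is not given in this section, so the staircase used below (round up to the next multiple of the resolution, then saturate at the top level) is an assumption of this sketch, as are the names `quantized_relu`, `clipped_relu_ste`, `backward_through_activation` and the parameter name `alpha` for the resolution.

```python
import numpy as np

def quantized_relu(x, alpha, bits):
    """Assumed uniform staircase: 0 for x <= 0, k*alpha on ((k-1)*alpha, k*alpha],
    saturating at (2**bits - 1) * alpha."""
    levels = 2 ** bits - 1
    return alpha * np.clip(np.ceil(x / alpha), 0, levels)

def clipped_relu_ste(x, alpha, bits):
    """Derivative of the associated clipped ReLU min(max(x, 0), (2**bits - 1) * alpha)."""
    upper = (2 ** bits - 1) * alpha
    return ((x > 0) & (x < upper)).astype(float)

def backward_through_activation(grad_output, x, alpha, bits):
    """Backward pass: the a.e. zero derivative is replaced by the STE mask."""
    return grad_output * clipped_relu_ste(x, alpha, bits)
```

In this sketch the forward pass uses `quantized_relu`, while gradients flow only where the pre-activation lies strictly inside the unsaturated range, which is how the clipped ReLU matches the quantized ReLU at both extremes.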

The optimizer we use is stochastic (coarse) gradient descent with momentum = 0.9 for all experiments. We train for 50 epochs for LeNet-5 (LeCun et al., 1998) on MNIST, and 200 epochs for VGG-11 (Simonyan & Zisserman, 2014) and ResNet-20 (He et al., 2016) on CIFAR-10. The parameters/weights are initialized with those from their pre-trained full-precision counterparts. The schedule of the learning rate is specified in Table 2 in the appendix.

### 4.1 Comparison Results

The experimental results are summarized in Table 1, where we record both the training losses and validation accuracies. Among the three STEs, the derivative of clipped ReLU gives the best overall performance, followed by vanilla ReLU and then by the identity function. For deeper networks, clipped ReLU is the best performer. But on the relatively shallow LeNet-5 network, vanilla ReLU exhibits comparable performance to the clipped ReLU, which is somewhat in line with our theoretical finding that ReLU is a great STE for learning the two-linear-layer (shallow) CNN.

### 4.2 Instability

We report the phenomenon of being repelled from a good minimum on ResNet-20 with 4-bit activations when using the identity STE, to demonstrate the instability issue as predicted in Theorem 1. By Table 1, the coarse gradient descent algorithms using the vanilla and clipped ReLUs converge to neighborhoods of minima whose training losses are 0.25 and 0.04, respectively (the corresponding validation accuracies are reported in Table 1), whereas that using the identity STE gives a training loss of 1.38. Note that the landscape of the empirical loss function does not depend on which STE is used in the training. We then initialize training at the two better minima and use the identity STE. To see whether the algorithm is stable there, we start the training with a tiny learning rate. For both initializations, the training loss and validation error significantly increase within the first 20 epochs; see Figure 4.2. To speed up training, at epoch 20, we switch to the normal schedule of learning rate specified in Table 2 and run 200 additional epochs. The training using the identity STE ends up at a much worse minimum. This is because the coarse gradient with identity STE does not vanish at the good minima in this case (Lemma 9). Similarly, the poor performance of the ReLU STE on 2-bit activated ResNet-20 is also due to the instability of the corresponding training algorithm at good minima, as illustrated by Figure 4 in Appendix C, although it diverges much more slowly.

## 5 Concluding Remarks

We provided the first theoretical justification for the concept of STE, showing that it gives rise to a descent training algorithm. We considered three STEs: the derivatives of the identity function, vanilla ReLU and clipped ReLU, for learning a two-linear-layer CNN with binary activation. We derived the explicit formulas of the expected coarse gradients corresponding to the STEs, and showed that the negative expected coarse gradients based on vanilla and clipped ReLUs are descent directions for minimizing the population loss, whereas that based on the identity STE is not, since it generates a coarse gradient incompatible with the energy landscape. The instability/incompatibility issue was confirmed in CIFAR experiments for improper choices of STE. In future work, we aim to further understand coarse gradient descent for large-scale optimization problems with intractable gradients.

#### Acknowledgments

This work was partially supported by NSF grants DMS-1522383, IIS-1632935, ONR grant N00014-18-1-2527, AFOSR grant FA9550-18-0167, DOE grant DE-SC0013839 and STROBE STC NSF grant DMR-1548924.

## Appendix

### D.  Additional Supporting Lemmas

###### Lemma 11.

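Let $z \in \mathbb{R}^n$ be a Gaussian random vector with entries i.i.d. sampled from $\mathcal{N}(0, 1)$. Given nonzero vectors $w, \tilde w \in \mathbb{R}^n$ with the angle $\theta$, we have

$$\mathbb{E}\left[1_{\{z^\top w > 0\}}\right] = \frac{1}{2}, \qquad \mathbb{E}\left[1_{\{z^\top w > 0,\ z^\top \tilde w > 0\}}\right] = \frac{\pi - \theta}{2\pi},$$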

and

 E[z1{z⊤w><