Understanding Straight-Through Estimator in Training Activation Quantized Neural Nets

03/13/2019
by   Penghang Yin, et al.

Training activation quantized neural networks involves minimizing a piecewise constant function whose gradient vanishes almost everywhere, which is undesirable for standard back-propagation via the chain rule. An empirical way around this issue is to use a straight-through estimator (STE) (Bengio et al., 2013) in the backward pass, so that the "gradient" through the modified chain rule becomes non-trivial. Since this unusual "gradient" is certainly not the gradient of the loss function, the following question arises: why does searching in its negative direction minimize the training loss? In this paper, we provide a theoretical justification for the concept of STE by answering this question. We consider the problem of learning a two-linear-layer network with binarized ReLU activation and Gaussian input data. We shall refer to the unusual "gradient" given by the STE-modified chain rule as the coarse gradient. The choice of STE is not unique. We prove that if the STE is properly chosen, the expected coarse gradient correlates positively with the population gradient (which is not available for training), and its negation is a descent direction for minimizing the population loss. We further show that the associated coarse gradient descent algorithm converges to a critical point of the population loss minimization problem. Moreover, we show that a poor choice of STE leads to instability of the training algorithm near certain local minima, which is verified with CIFAR-10 experiments.
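To make the idea of a coarse gradient concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a model of the form v^T sigma(Z w) with a binary activation sigma(x) = 1 for x > 0 and 0 otherwise, and a squared loss on a single sample; the function names (`binary_act`, `relu_ste`, `clipped_relu_ste`, `identity_ste`, `coarse_grad_w`) and the three surrogate derivatives are illustrative choices, not taken from the abstract. The forward pass uses the binary activation, while the backward pass replaces its almost-everywhere-zero derivative with the chosen STE.

```python
def binary_act(x):
    """Binarized ReLU: piecewise constant, so its true derivative is 0 almost everywhere."""
    return 1.0 if x > 0 else 0.0

# Candidate STEs (assumed for illustration): surrogate derivatives used only in the backward pass.
def relu_ste(x):
    """Derivative of ReLU."""
    return 1.0 if x > 0 else 0.0

def clipped_relu_ste(x):
    """Derivative of ReLU clipped at 1."""
    return 1.0 if 0.0 < x < 1.0 else 0.0

def identity_ste(x):
    """Derivative of the identity function."""
    return 1.0

def forward(Z, w, v):
    """Return pre-activations Z w and the output v^T sigma(Z w) for one sample Z (m rows of length n)."""
    pre = [sum(z * wj for z, wj in zip(row, w)) for row in Z]
    out = sum(vi * binary_act(p) for vi, p in zip(v, pre))
    return pre, out

def sample_loss(Z, w, v, target):
    """Squared loss 0.5 * (output - target)^2 on one sample."""
    return 0.5 * (forward(Z, w, v)[1] - target) ** 2

def coarse_grad_w(Z, w, v, target, ste=relu_ste):
    """Coarse gradient w.r.t. w: the chain rule with sigma' replaced by the STE."""
    pre, out = forward(Z, w, v)
    err = out - target
    grad = [0.0] * len(w)
    for row, p, vi in zip(Z, pre, v):
        scale = ste(p) * vi * err  # surrogate derivative stands in for sigma'(p)
        for j, z in enumerate(row):
            grad[j] += scale * z
    return grad

if __name__ == "__main__":
    Z = [[1.0, 0.0], [0.0, 1.0]]
    w, v, target = [0.5, -0.5], [1.0, 1.0], 2.0
    eps = 1e-6
    # The loss is locally flat in w, so a finite-difference gradient is exactly zero here.
    print(sample_loss(Z, [w[0] + eps, w[1]], v, target) - sample_loss(Z, w, v, target))
    # The coarse gradient is non-trivial and depends on the STE chosen.
    for ste in (relu_ste, clipped_relu_ste, identity_ste):
        print(ste.__name__, coarse_grad_w(Z, w, v, target, ste))
```

A coarse gradient descent step would then update `w` by subtracting a step size times the returned vector. The sketch only shows that different STEs yield different search directions; it does not by itself demonstrate which choice is stable.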


