Deep neural network (DNN) compression through quantization is a recent direction in the edge deployment of deep networks. Quantized networks are easy to deploy on hardware with constrained resources, such as cell phones and IoT equipment. Quantized networks not only consume less memory and simplify computation, but also save energy. Two well-known extreme quantization schemes are binary (one-bit) and ternary (two-bit) networks, which allow large memory and computation savings. Binary quantization keeps only the sign of the dot product and discards its magnitude, while ternary quantization extends the binary case to allow for a sparse representation.

It is a well-known fact that BatchNorm facilitates neural network training. A common intuition suggests that BatchNorm matches the first and second moments of its input and output. There are two other explanations among others: [ioffe2015batchnorm] claim that BatchNorm corrects covariate shift, and [santurkar2018bnoptim] show that BatchNorm bounds the gradient and makes the optimization smoother in full-precision networks. None of these arguments holds for quantized networks. Instead, the role of BatchNorm is to prevent the exploding gradients empirically observed in [ardakani2018learningrecbinter] and [hou2019normalization].
2 Full-Precision Network
Suppose a mini-batch of size $n$ is given for a neuron. Let $\mu$ and $\sigma$ be the mean and the standard deviation of the dot product $s$ between inputs and weights. For a given layer $l$, BatchNorm is defined as
$$\mathrm{BN}(s) = \gamma \hat{s} + \beta,$$
where $\hat{s} = \frac{s - \mu}{\sigma}$ is the standardized dot product and the pair $(\gamma, \beta)$ is trainable, initialized with $(1, 0)$.
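As a concrete sketch of this per-neuron transform (illustrative, not from the paper; a small constant $\epsilon$ is added for numerical stability, which the text omits):

```python
import numpy as np

def batchnorm_forward(s, gamma=1.0, beta=0.0, eps=1e-5):
    """BatchNorm over a mini-batch of dot products s (shape: [n]).

    Standardizes s with the batch mean/std, then applies the
    trainable affine pair (gamma, beta), initialized at (1, 0).
    """
    mu = s.mean()
    sigma = np.sqrt(s.var() + eps)  # eps guards against zero variance
    s_hat = (s - mu) / sigma
    return gamma * s_hat + beta

rng = np.random.default_rng(0)
s = rng.normal(loc=3.0, scale=2.0, size=256)  # pre-activations with a shifted mean
y = batchnorm_forward(s)
print(y.mean(), y.std())  # approximately 0 and 1 at initialization
```

At initialization $(\gamma, \beta) = (1, 0)$, the output is simply the standardized dot product, so its batch mean and standard deviation are (approximately) 0 and 1.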
Given the objective function $\mathcal{L}$, the BatchNorm parameters are trained by backpropagation,
$$\frac{\partial \mathcal{L}}{\partial \gamma} = \sum_{i=1}^{n} \frac{\partial \mathcal{L}}{\partial y_i}\,\hat{s}_i, \qquad \frac{\partial \mathcal{L}}{\partial \beta} = \sum_{i=1}^{n} \frac{\partial \mathcal{L}}{\partial y_i},$$
where $y_i = \gamma \hat{s}_i + \beta$ is the BatchNorm output for sample $i$. For a given layer $l$, it is easy to prove that $\frac{\partial \mathcal{L}}{\partial \hat{s}_i}$ equals $\gamma\,\frac{\partial \mathcal{L}}{\partial y_i}$.
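As a sanity check under a toy quadratic loss (all names below are hypothetical, for illustration only), the $(\gamma, \beta)$ gradients can be verified against central finite differences:

```python
import numpy as np

def bn(s, gamma, beta, eps=1e-5):
    """BatchNorm forward; returns output and the standardized input."""
    s_hat = (s - s.mean()) / np.sqrt(s.var() + eps)
    return gamma * s_hat + beta, s_hat

rng = np.random.default_rng(1)
s = rng.normal(size=64)
gamma, beta = 1.5, -0.3
y, s_hat = bn(s, gamma, beta)

# Toy loss L = 0.5 * sum(y^2), hence dL/dy_i = y_i
dLdy = y
grad_gamma = float((dLdy * s_hat).sum())  # dL/dgamma = sum_i dL/dy_i * s_hat_i
grad_beta = float(dLdy.sum())             # dL/dbeta  = sum_i dL/dy_i

def loss(g, b):
    out, _ = bn(s, g, b)
    return 0.5 * float((out ** 2).sum())

h = 1e-6  # central finite differences
fd_gamma = (loss(gamma + h, beta) - loss(gamma - h, beta)) / (2 * h)
fd_beta = (loss(gamma, beta + h) - loss(gamma, beta - h)) / (2 * h)
print(abs(grad_gamma - fd_gamma) < 1e-4, abs(grad_beta - fd_beta) < 1e-4)  # True True
```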
Assume weights and activations are independent, identically distributed (iid), and centred about zero. Formally, denote by $\mathbf{s}^l$ the dot-product vector of a sample in layer $l$, with $n_l$ neurons. Let $g(\cdot)$ be the element-wise activation function, $\mathbf{a}^{l-1} = g(\mathbf{s}^{l-1})$ be the input vector with elements $a_j^{l-1}$, and $\mathbf{W}^l$ be the weight matrix; one may use $w^l$ to denote an identically distributed element of layer $l$. It is easy to verify that
$$s_k^l = \sum_{j=1}^{n_{l-1}} w_{kj}^l\, a_j^{l-1}.$$
Assume that the feature elements $a_j^{l-1}$ and the weight elements $w_{kj}^l$ are centred and iid. Reserve $k$ to index the current neuron and use $j$ for a neuron of the previous or the next layer. Then
$$\mathrm{Var}[s^l] = n_{l-1}\,\sigma_{w^l}^2\,\mathrm{Var}[a^{l-1}],$$
where $\sigma_{w^l}^2$ is the variance of the weights in layer $l$. Unrolling this recursion across layers yields a variance proportional to $\prod_l n_{l-1}\,\sigma_{w^l}^2$, which explodes or vanishes depending on whether the per-layer factor $n_{l-1}\,\sigma_{w^l}^2$ is above or below unity. This is the main reason common full-precision initialization methods choose $\sigma_{w^l}^2 \propto 1/n_{l-1}$; for instance, [he2015init] use $\sigma_{w^l}^2 = 2/n_{l-1}$ to compensate for the halving of variance by ReLU. For any full-precision network, BatchNorm affects backpropagation through the scaling factor $\gamma/\sigma$: each backpropagated gradient is multiplied by $\gamma/\sigma$ when it passes through the BatchNorm of layer $l$.
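This variance recursion is easy to simulate. The sketch below (illustrative, not from the paper) propagates a random batch through a deep ReLU stack: the pre-activation variance explodes when the weight variance is too large, stays bounded under He initialization, and is rescaled back to unit order by a BatchNorm step regardless of the initialization:

```python
import numpy as np

rng = np.random.default_rng(2)

def final_preact_variance(depth, width, w_std, use_bn=False):
    """Push a random batch through `depth` ReLU layers and return the
    variance of the last layer's (possibly normalized) pre-activations."""
    x = rng.normal(size=(256, width))
    for _ in range(depth):
        W = rng.normal(scale=w_std, size=(width, width))
        s = x @ W
        if use_bn:
            # BatchNorm without the affine pair: standardize per neuron
            s = (s - s.mean(axis=0)) / (s.std(axis=0) + 1e-5)
        x = np.maximum(s, 0.0)  # ReLU
    return float(s.var())

width = 128
stable = final_preact_variance(20, width, np.sqrt(2.0 / width))   # He init
explode = final_preact_variance(20, width, np.sqrt(4.0 / width))  # variance too large
bn_fixed = final_preact_variance(20, width, np.sqrt(4.0 / width), use_bn=True)
print(f"He init: {stable:.2f}  bad init: {explode:.2e}  bad init + BN: {bn_fixed:.2f}")
```

With $\sigma_w^2 = 4/n$ each ReLU layer roughly doubles the variance, so after 20 layers it grows by about $2^{20}$; the BatchNorm step resets it to unit order at every layer.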
3 Binary Network
Controlling the variance has no fundamental effect on forward propagation if the dot-product distribution is symmetric about zero, as the sign function filters the magnitude and only keeps the sign of the dot product. For $\gamma > 0$, $\mathrm{sign}(\gamma\hat{s} + \beta) = \mathrm{sign}(\hat{s} + \beta/\gamma)$, so the term $\beta/\gamma$ can be regarded as a new trainable parameter, and the BatchNorm layer can be replaced by adding biases to the network to compensate. [sari2019study] show that, for an arbitrary layer $l$, the gradient variance of binary quantized networks without BatchNorm grows with the layer widths $n_l$, whereas with BatchNorm it is governed by the layer-width ratios $n_{l+1}/n_l$. Gradients are therefore stabilized only if $n_{l+1}/n_l \approx 1$. Moving from full-precision weights to binary weights changes the situation dramatically: i) BatchNorm corrects exploding gradients in BNNs, since the layer-width ratio is close to unity in common neural models; ii) if this ratio diverges from unity, binary training is problematic even with BatchNorm.
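The bias-compensation argument is easy to check numerically. The sketch below (illustrative) confirms that, for $\gamma > 0$, BatchNorm followed by the sign activation is equivalent to a plain bias shift of $\beta/\gamma$:

```python
import numpy as np

rng = np.random.default_rng(3)
s = rng.normal(size=1000)                 # pre-activations, symmetric about zero
s_hat = (s - s.mean()) / s.std()          # standardized dot products

gamma, beta = 0.7, 0.2                    # arbitrary BN affine pair, gamma > 0

# Binary activation applied after BatchNorm ...
out_bn = np.sign(gamma * s_hat + beta)
# ... matches a plain bias shift of beta / gamma
out_bias = np.sign(s_hat + beta / gamma)

print(np.array_equal(out_bn, out_bias))  # True
```

This is why, for forward propagation in a binary network, the whole BatchNorm layer collapses into a single trainable bias per neuron; its remaining effect is on the backward pass.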
4 Ternary Network
Ternary neural networks (TNNs) are studied in [Sari_Nia_2020], where the effect of BatchNorm is detailed. Full-precision weights are kept during training and are ternarized during forward propagation. Given a threshold $\Delta > 0$, the ternary quantization function is
$$Q(w) = \begin{cases} +1, & w > \Delta,\\ 0, & |w| \le \Delta,\\ -1, & w < -\Delta. \end{cases}$$
Let us suppose the threshold is given such that learning is feasible; for instance, $\Delta$ is tuned so that a desired fraction of the ternary weights is set to zero. In the literature, [li2016twn] suggest setting $\Delta \approx 0.7\,\mathbb{E}[|w|]$. Under the simplified assumptions of iid weights and activations, (4) reduces to a gradient variance larger than unity, which produces exploding gradients similar to the binary case. Now suppose the weights and activations are iid and the weights are centred about zero; then, for a layer $l$, (2) reduces to an expression governed by the layer-width ratio $n_{l+1}/n_l$; see [Sari_Nia_2020] for details. Similar to the binary case, in most deep architectures $n_{l+1}/n_l \approx 1$, so the variance does not explode for networks with a BatchNorm layer.
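A minimal sketch of this ternary quantizer (illustrative; the threshold heuristic follows [li2016twn]):

```python
import numpy as np

def ternarize(w, delta):
    """Ternary quantization: +1 above delta, -1 below -delta, 0 otherwise."""
    q = np.zeros_like(w)
    q[w > delta] = 1.0
    q[w < -delta] = -1.0
    return q

rng = np.random.default_rng(4)
w = rng.normal(scale=0.05, size=10_000)  # full-precision weights kept for training

# Threshold heuristic of [li2016twn]: delta ~ 0.7 * E|w|
delta = 0.7 * np.abs(w).mean()
q = ternarize(w, delta)

sparsity = float((q == 0).mean())        # fraction of weights zeroed out
print(sorted(np.unique(q)), round(sparsity, 2))
```

For Gaussian weights this threshold zeroes out roughly 40% of the weights, which is the sparse representation that distinguishes ternary from binary quantization.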
We derived the analytical expression of the gradient variance for full-precision networks under the assumptions of [he2015init] and extended it to the binary and ternary cases. Our study shows that the real effect of BatchNorm lies in its scaling factor $\gamma/\sigma$: the main role of BatchNorm in quantized training is to correct gradient explosion.