Streaming Normalization: Towards Simpler and More Biologically-plausible Normalizations for Online and Recurrent Learning

October 19, 2016, by Qianli Liao et al.

We systematically explore a spectrum of normalization algorithms related to Batch Normalization (BN) and propose a generalized formulation that simultaneously solves two major limitations of BN: (1) online learning and (2) recurrent learning. Our proposal is simpler and more biologically-plausible. Unlike previous approaches, our technique can be applied out of the box to all learning scenarios (e.g., online learning, batch learning, fully-connected, convolutional, feedforward, recurrent and mixed recurrent-convolutional) and compares favorably with existing approaches. We also propose Lp Normalization for normalizing by different orders of statistical moments. In particular, L1 normalization performs well, is simple to implement, fast to compute and more biologically-plausible, and is thus ideal for GPU or hardware implementations.


1 Introduction

Batch Normalization [Ioffe and Szegedy, 2015] (BN) is a highly effective technique for speeding up convergence of feedforward neural networks. It enabled the recent development of ultra-deep networks [He et al., 2015] and some biologically-plausible variants of backpropagation [Liao et al., 2015]. However, despite its success, there are two major learning scenarios that cannot be handled by BN: (1) online learning and (2) recurrent learning.

For the second scenario, recurrent learning, [Liao and Poggio, 2016] and [Cooijmans et al., 2016] independently proposed “time-specific batch normalization”: different normalization statistics are used for different timesteps of an RNN. Although this approach works well in many experiments in [Liao and Poggio, 2016] and [Cooijmans et al., 2016], it is far from perfect, for the following reasons. First, it does not work with small mini-batches or online learning. This is similar to the original batch normalization, where enough samples are needed to compute good estimates of the statistical moments. Second, it requires sufficient training samples for every timestep of an RNN, and it is not clear how to generalize the model to unseen timesteps. Finally, it is not biologically-plausible. While homeostatic plasticity mechanisms (e.g., Synaptic Scaling)

[Turrigiano and Nelson, 2004, Stellwagen and Malenka, 2006, Turrigiano, 2008] are good biological candidates for BN, it is hard to imagine how such normalizations can behave differently for each timestep. Recently, Layer Normalization (LN) [Ba et al., 2016] was introduced to solve some of these issues. It performs well in feedforward and recurrent settings when only fully-connected layers are used. However, it does not work well with convolutional networks. A summary of normalization approaches in various learning scenarios is shown in Table 1.

Approach | FF & FC | FF & Conv | Rec & FC | Rec & Conv | Online Learning | Small Batch | All Combined
Original Batch Normalization (BN) | ✓ | ✓ | ✗ | ✗ | ✗ | Suboptimal | ✗
Time-specific BN | ✓ | ✓ | Limited | Limited | ✗ | Suboptimal | ✗
Layer Normalization | ✓ | * | ✓ | * | ✓ | ✓ | *
Streaming Normalization | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Table 1: An overview of normalization techniques for different tasks. ✓: works well. ✗: does not work well. FF: feedforward. Rec: recurrent. FC: fully-connected. Conv: convolutional. Limited: time-specific BN requires recording normalization statistics for each timestep and thus may not generalize to novel sequence lengths. *: Layer Normalization does not fail on these tasks but performs significantly worse than the best approaches.

We note that different normalization methods like BN and LN can be described in the same framework detailed in Section 3. This framework introduces Sample Normalization and General Batch Normalization (GBN) as generalizations of LN and BN, respectively. As their names imply, they either collect normalization statistics from a single sample or from a mini-batch. We explored many variants of these models in the experiment section.

A natural and biologically-inspired extension of these methods would be Streaming Normalization: normalization statistics are collected in an online fashion from all previously seen training samples (and all timesteps if recurrent). We found numerous advantages associated with this approach: 1. it naturally supports pure online learning or learning with small mini-batches. 2. for recurrent learning, it is more biologically-plausible since a unique set of normalization statistics is maintained for all timesteps. 3. it performs well out of the box in all learning scenarios (e.g., online learning, batch learning, fully-connected, convolutional, feedforward, recurrent and mixed — recurrent and convolutional). 4. it offers a new direction of designing normalization algorithms, since the idea of maintaining online estimates of normalization statistics is independent from other design choices, and as a result any existing algorithm (e.g., in Figure 1 C and D) can be extended to a “streaming” setting.

We also propose Lp normalization: instead of normalizing by the second moment as BN does, one can normalize by the p-th root of the p-th absolute moment (see Section 3.4 for details). In particular, L1 normalization works as well as the conventional approach (i.e., L2) in almost all learning scenarios. Furthermore, L1 normalization is easier to implement: the divisive factor is simply the average of absolute values. We believe it can be used to simplify and speed up BN and our Streaming Normalization on GPGPUs, dedicated hardware or embedded systems. L1 normalization may also be more biologically-plausible, since the gradient of the absolute value is trivial to implement, even for biological neurons.

In the following section, we introduce a simple training scheme, a minor but necessary component of our formulation.

2 Online and Batch Learning with “Decoupled Accumulation and Update”

Although it is not our main result, we discuss a simple but, to the best of our knowledge, less explored training scheme we call “Decoupled Accumulation and Update” (DAU). (We are not aware of this approach in the literature; if it exists, please inform us.)

Conventionally, the weights of a neural network are updated after every mini-batch, so the accumulation of gradients and the weight update are coupled. We note that a more general formulation is that, while the gradients are still accumulated for every mini-batch, one does not necessarily update the weights each time. Instead, the weights are updated every n mini-batches, and the gradients are cleared after each weight update. Two parameters characterize this procedure: Samples per Batch (S/B) and Batches per Update (B/U). In conventional training, B/U = 1.

Note that this procedure is similar to but distinct from simply training with larger mini-batches, since every mini-batch arrives in a purely online fashion and one cannot look back into any previous mini-batch. For example, if batch normalization is present, performing this procedure with a smaller S/B and a larger B/U is different from using a single larger mini-batch with the same total number of samples (i.e., a larger S/B with B/U = 1).

If S/B = 1, it reduces to a pure online setting where one sample arrives at a time. The key advantage of this proposal over conventional training is that we explicitly require less frequent (but more robust) weight updates. The memory requirement (in addition to storing the network weights) is storing one mini-batch of S/B samples and its related activations.

This training scheme generalizes the conventional approach, and we found that merely applying this approach greatly mitigated (although not completely solved) the catastrophic failure of training batch normalization with small mini-batches (See Figure 4). Therefore, we will use this formulation throughout our paper.
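To make the DAU procedure concrete, here is a minimal sketch of the training loop with plain SGD. All names below (`model.params`, `loss_fn`, `data_stream`) are hypothetical placeholders, not our implementation; the point is only that gradients are accumulated over B/U mini-batches of S/B samples before a single weight update and then cleared.

```python
import numpy as np

def train_dau(model, loss_fn, data_stream, samples_per_batch=1,
              batches_per_update=16, lr=0.1):
    # Gradient accumulators, one per weight array in the (hypothetical) model.
    grad_acc = {name: np.zeros_like(w) for name, w in model.params.items()}
    batches_seen = 0
    for x, y in data_stream(batch_size=samples_per_batch):   # mini-batches arrive online
        loss, grads = loss_fn(model, x, y)                    # forward + backward on this batch
        for name in grad_acc:
            grad_acc[name] += grads[name]                     # accumulate, do not update yet
        batches_seen += 1
        if batches_seen == batches_per_update:                # B/U batches accumulated
            for name, w in model.params.items():
                w -= lr * grad_acc[name] / batches_per_update # one (more robust) weight update
                grad_acc[name][...] = 0.0                     # clear gradients after the update
            batches_seen = 0
```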

We also expect this scheme to benefit learning in sequential recurrent networks with varying input sequence lengths. Sometimes it is more efficient to pack training samples with the same sequence length into one mini-batch. If this is the case, our approach suggests that it is desirable to process multiple such mini-batches with varying sequence lengths before a weight update. Accumulating gradients from training on different sequence lengths should provide a more robust update that works across sequence lengths, thus better approximating the true gradient of the dataset.

In Streaming Normalization with recurrent networks, it is often beneficial to learn with B/U > 1 (i.e., more than one batch per update). The first mini-batch collects normalization statistics from all timesteps so that later mini-batches are normalized in a more stable way.

3 A General Framework for Normalization

We propose a general framework to describe different normalization algorithms. A normalization can be thought of as a process of modifying the activation of a neuron using some statistics collected from some reference activations. We adopt three terms to characterize this process. A Normalization Operation (NormOp) is a function N(.,.) applied to each neuron to modify its value from x to y = N(x, s), where s is the Normalization Statistics (NormStats) for this neuron. NormStats is any data required to perform the NormOp, collected using some function s = S(R), where R is a set of activations (which could include x) called the Normalization Reference (NormRef). Different neurons may or may not share NormRef and NormStats.

This framework captures many previous normalization algorithms as special cases. For example, the original Batch Normalization (BN) [Ioffe and Szegedy, 2015] can be described as follows: the NormRef for each neuron is all activations in the same channel of the same layer and the same mini-batch (see Figure 1 D for an illustration). The NormStats are s = {μ, σ}, the mean and standard deviation of the activations in NormRef. The NormOp is N(x, {μ, σ}) = (x − μ)/σ.

For the recent Layer Normalization [Ba et al., 2016], NormRef is all activations in the same layer (See Figure 1 C). The NormStats and NormOp are the same as BN.
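To make the framework concrete, the following sketch (in NumPy, assuming a batch x channel x height x width activation layout and a small epsilon for numerical stability, both our own assumptions) shows that BN and LN share the same NormOp and NormStats and differ only in the NormRef over which the statistics are collected:

```python
import numpy as np

def norm_op(x, mean, std, eps=1e-5):
    # NormOp shared by BN and LN: N(x, {mu, sigma}) = (x - mu) / sigma
    return (x - mean) / (std + eps)

def batch_norm_stats(x):
    # BN's NormRef: all activations in the same channel across the mini-batch.
    # x has shape (batch, channel, height, width); stats are per channel.
    return x.mean(axis=(0, 2, 3), keepdims=True), x.std(axis=(0, 2, 3), keepdims=True)

def layer_norm_stats(x):
    # LN's NormRef: all activations in the same layer of the same sample.
    return x.mean(axis=(1, 2, 3), keepdims=True), x.std(axis=(1, 2, 3), keepdims=True)

x = np.random.randn(32, 16, 8, 8)        # a mini-batch of convolutional activations
y_bn = norm_op(x, *batch_norm_stats(x))  # Batch Normalization
y_ln = norm_op(x, *layer_norm_stats(x))  # Layer Normalization
```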

We group normalization algorithms into three categories:

  1. Sample Normalization (Figure 1 C): NormStats are collected from one sample.

  2. General Batch Normalization (Figure 1 D): NormStats are collected from all samples in a mini-batch.

  3. Streaming Normalization: NormStats are collected in an online fashion from all past training samples.

In the following sections, we detail each algorithm and provide pseudocode.

Figure 1: A General Framework of Normalization. A: the input to a convolutional layer is a 3D tensor with three dimensions: x (image width), y (image height) and features/channels. For fully-connected layers, x = y = 1. B: training with decoupled accumulation and update. C: Sample Normalization. D: General Batch Normalization. E: Streaming Normalization.

3.1 Sample Normalization

Sample normalization is the simplest of the three categories. NormStats are collected using only the activations in the current layer of the current sample. The computation is the same at training and test time, and it handles online learning naturally. Layer Normalization [Ba et al., 2016] is an example of this category. More examples are shown in Figure 1 C.

The pseudocode for the forward and backward passes is shown in Algorithm 1 and Algorithm 2.

Input:  layer input x (a sample), NormOp N(.,.), function S(.) to compute NormStats for every element of x
Output:  layer output y (a sample)
  s = S(x)
  y = N(x, s)
Algorithm 1 Sample Normalization Layer: Forward

Input:  ∂ℓ/∂y (a sample), where ℓ is the objective, layer input x, NormOp N(.,.), function S(.) to compute NormStats for every element of x
Output:  ∂ℓ/∂x (a sample)
  ∂ℓ/∂x can be calculated using the chain rule. Details omitted.
Algorithm 2 Sample Normalization Layer: Backpropagation

3.2 General Batch Normalization

In General Batch Normalization (GBN), NormStats are collected in some way using the activations from all samples in a training mini-batch. Note that one cannot really compute batch NormStats at test time, since test samples should be handled independently of each other rather than in a batch. To overcome this, one can simply compute running estimates of the training NormStats and use them for testing (e.g., the original Batch Normalization [Ioffe and Szegedy, 2015] computes moving averages).

More examples of GBN are shown in Figure 1 D. The pseudocode is shown in Algorithm 3 and 4.

Input:  layer input x (a mini-batch), NormOp N(.,.), function S(.) to compute NormStats for every element of x, running estimates of NormStats ŝ, function F to update ŝ
Output:  layer output y (a mini-batch), running estimates of NormStats ŝ
  if training then
     s = S(x)
     ŝ = F(ŝ, s)
     y = N(x, s)
  else (testing)
     y = N(x, ŝ)
  end if
Algorithm 3 General Batch Normalization Layer: Forward
Input:  ∂ℓ/∂y (a mini-batch), where ℓ is the objective, layer input x (a mini-batch), NormOp N(.,.), function S(.) to compute NormStats for every element of x, running estimates of NormStats ŝ, function F to update ŝ
Output:  ∂ℓ/∂x (a mini-batch)
  ∂ℓ/∂x can be calculated using the chain rule. Details omitted.
Algorithm 4 General Batch Normalization Layer: Backpropagation
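A minimal sketch of Algorithm 3 follows, assuming (as in the original BN) that the running estimate ŝ is a moving average; the class and field names are illustrative, not part of our implementation.

```python
import numpy as np

class GeneralBatchNorm:
    """Sketch of Algorithm 3: per-channel mini-batch NormStats at training time,
    running estimates at test time (a moving average is assumed, as in the original BN)."""
    def __init__(self, num_channels, momentum=0.1, eps=1e-5):
        self.running_mean = np.zeros((1, num_channels, 1, 1))
        self.running_std = np.ones((1, num_channels, 1, 1))
        self.momentum, self.eps = momentum, eps

    def forward(self, x, training=True):
        if training:
            mean = x.mean(axis=(0, 2, 3), keepdims=True)   # s = S(x)
            std = x.std(axis=(0, 2, 3), keepdims=True)
            # s_hat = F(s_hat, s): update running estimates for test time
            self.running_mean += self.momentum * (mean - self.running_mean)
            self.running_std += self.momentum * (std - self.running_std)
            return (x - mean) / (std + self.eps)            # y = N(x, s)
        return (x - self.running_mean) / (self.running_std + self.eps)  # y = N(x, s_hat)
```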

3.3 Streaming Normalization

Finally, we present the main results of this paper. We propose Streaming Normalization: NormStats are collected in an online fashion from all past training samples. The main challenge of this approach is that it introduces infinitely long dependencies throughout the neuron’s activation history — every neuron’s current activation depends on all previously seen training samples.

It is intractable to perform exact backpropagation on this dependency graph for the following reasons. There is no point backpropagating beyond the last weight update (if any), since one cannot redo that update; avoiding this issue would imply that one cannot update the weights until having seen all future samples. Even backpropagating within one update interval (i.e., the interval between two weight updates, consisting of B/U mini-batches) turns out to be problematic: one is usually not able to backpropagate into the previous several mini-batches, since they have already been discarded in many practical settings.

For the above reasons, we abandoned the idea of performing exact backpropagation for streaming normalization. Instead, we propose two simple heuristics: Streaming NormStats and Streaming Gradients. They are discussed below and the pseudocode is shown in Algorithm 5 and Algorithm 6.

3.3.1 Streaming NormStats

Streaming NormStats is a natural requirement of Streaming Normalization, since NormStats are collected from all previously seen training samples. We maintain a structure/table H_s that keeps all the information needed to generate a good estimate of NormStats, and update it using a function F every time we encounter a new training sample. The function F also generates the current estimate of NormStats, called ŝ. We use ŝ to normalize, instead of the current mini-batch NormStats s. See Algorithm 5 for details.

There could be many potential designs for H_s and F, and in our experiments we explored a particular version: we compute two sets of running estimates to keep track of the long-term and short-term NormStats.

  • Short-term NormStats: the exact average of the mini-batch NormStats s computed since the last weight update. A counter keeps track of how many mini-batches have been encountered, in order to compute this exact average.

  • Long-term NormStats: an exponential moving average of the short-term NormStats, maintained since the beginning of training.

Whenever the weights of the network are updated, the long-term NormStats are updated with an exponential-decay step toward the short-term NormStats, the counter is reset to 0, and the short-term NormStats are set to empty. The decay rate is chosen so that an exponential average of the short-term NormStats is maintained in the long-term NormStats.

In our implementation, before testing the model, the last weight update is NOT performed and H_s is NOT cleared, since it is needed for testing. (It is also possible to simply store the last ŝ computed during training and use it for testing, instead of storing H_s and re-computing ŝ at test time. These two options are mathematically equivalent.)

In addition to updating H_s in the way described above, the function F also computes the current estimate ŝ by combining the long-term and short-term NormStats. Finally, ŝ is used for normalization.
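One possible realization of the structure H_s and the update function F is sketched below. This is a sketch only: the blending weight `alpha` and decay rate `beta` are hypothetical hyperparameters introduced for illustration, not values taken from the paper.

```python
import numpy as np

class StreamingNormStats:
    """Sketch of H_s and F from Section 3.3.1: a short-term exact average of
    NormStats since the last weight update plus a long-term exponential average,
    blended into the estimate s_hat actually used by the NormOp."""
    def __init__(self, shape, alpha=0.5, beta=0.1):
        self.s_long = np.zeros(shape)   # long-term exponential average
        self.s_short = np.zeros(shape)  # exact average since the last weight update
        self.count = 0                  # mini-batches seen since the last weight update
        self.alpha, self.beta = alpha, beta

    def update(self, s):                # F: called once per training mini-batch
        self.count += 1
        self.s_short += (s - self.s_short) / self.count     # exact running average
        return self.estimate()

    def estimate(self):                 # current estimate s_hat
        if self.count == 0:
            return self.s_long
        return self.alpha * self.s_long + (1.0 - self.alpha) * self.s_short

    def on_weight_update(self):         # fold short-term stats into the long-term ones
        self.s_long += self.beta * (self.s_short - self.s_long)
        self.s_short[...] = 0.0
        self.count = 0
```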

3.3.2 Streaming Gradients

We maintain a structure/table H_g that keeps all the information needed to generate a good estimate of the gradients of NormStats, and update it using a function G every time backpropagation reaches this layer. The function G also generates the current estimate of the gradients of NormStats, called ĝ. We use ĝ for further backpropagation, instead of the gradients computed from the current mini-batch alone. See Algorithm 6 for details.

Again, there could be many potential designs for H_g and G, and we explored a particular version: we compute two sets of running estimates to keep track of the long-term and short-term gradients of NormStats.

  • Short-term gradients of NormStats: the exact average of the gradients of NormStats computed since the last weight update. A counter keeps track of how many mini-batches have been encountered, in order to compute this exact average.

  • Long-term gradients of NormStats: an exponential moving average of the short-term gradients, maintained since the beginning of training.

Whenever the weights of the network are updated, the long-term gradient estimate is updated with an exponential-decay step toward the short-term estimate, the counter is reset to 0, and the short-term estimate is set to empty, so that an exponential average of the short-term gradients is maintained in the long-term estimate.

In addition to updating H_g in the way described above, the function G also computes the current estimate ĝ by combining the long-term, short-term and current-mini-batch gradients of NormStats. Finally, ĝ is used for further backpropagation.
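The gradient side can be sketched in the same way. Blending the long-term, short-term and current-mini-batch gradients with three weights matches the behavior described in the Figure 10 experiments, but the exact formula, the dictionary layout of H_g, and the example weights below are assumptions for illustration.

```python
def stream_gradients(H_g, g_current, kappas=(0.7, 0.0, 0.3)):
    # H_g is a dict, e.g. {'g_long': 0.0, 'g_short': 0.0, 'count': 0}
    # (scalars or arrays of the same shape as the NormStats gradients).
    H_g['count'] += 1
    # Short-term: exact average of the gradients since the last weight update.
    H_g['g_short'] += (g_current - H_g['g_short']) / H_g['count']
    k_long, k_short, k_current = kappas
    g_hat = (k_long * H_g['g_long'] + k_short * H_g['g_short']
             + k_current * g_current)
    return H_g, g_hat   # g_hat replaces the mini-batch gradient in backpropagation
```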

Input:  layer input x (a mini-batch), NormOp N(.,.), function S(.) to compute NormStats for every element of x, running estimates of NormStats and/or related information packed in a structure/table H_s, function F to update H_s and generate the current estimate of NormStats ŝ
Output:  layer output y (a mini-batch) and H_s (stored in this layer instead of being fed to other layers); always maintain the latest ŝ in case of testing
  if training then
     s = S(x)
     {H_s, ŝ} = F(H_s, s)
     y = N(x, ŝ)
  else (testing)
     y = N(x, ŝ)
  end if
Algorithm 5 Streaming Normalization Layer: Forward
Input:  ∂ℓ/∂y (a mini-batch), where ℓ is the objective, layer input x (a mini-batch), NormOp N(.,.), function S(.) to compute NormStats for every element of x, running estimates of NormStats ŝ, running estimates of gradients and/or related information packed in a structure/table H_g, function G to update H_g and generate the current estimate of the gradients of NormStats ĝ
Output:  ∂ℓ/∂x (a mini-batch) and H_g (stored in this layer instead of being fed to other layers)
  ∂ℓ/∂ŝ is calculated using the chain rule.
  {H_g, ĝ} = G(H_g, ∂ℓ/∂ŝ)
  Use ĝ for further backpropagation, instead of ∂ℓ/∂ŝ.
  ∂ℓ/∂x is calculated using the chain rule.
Algorithm 6 Streaming Normalization Layer: Backpropagation

3.3.3 A Summary of Streaming Normalization Design and Hyperparameters

The NormOp used in this paper is N(x, {μ, σ}) = (x − μ)/σ, the same as the one used by BN. The NormStats are collected using one of the Lp normalization schemes described in Section 3.4.

With our particular choices of NormOp, S, F and G, the following hyperparameters uniquely characterize a Streaming Normalization algorithm: the weights and decay rates used to combine the long-term, short-term and current-mini-batch estimates (for both NormStats and their gradients), samples per batch S/B, batches per update B/U, and a choice of mini-batch NormRef (i.e., SP1-SP5, BA1-BA6 in Figure 1 C and D).

Unless mentioned otherwise, we use BA1 in Figure 1 D as the NormRef throughout the paper. We also demonstrate the use of other NormRefs (BA4 and BA6) in the Appendix Figure A1.

An important special case: if the combination weights are chosen so that all NormStats and gradients beyond the current mini-batch are ignored, the algorithm reduces exactly to the GBN algorithm. Streaming Normalization is therefore strictly a generalization of GBN and thus also captures the original BN [Ioffe and Szegedy, 2015] as a special case. Recall from Section 3.3.1 that H_s is NOT cleared before testing the model; thus, for testing, NormStats are inherited from the last training mini-batch. This works well in practice.

Unless mentioned otherwise, we use a single fixed default setting of these hyperparameters throughout the experiments. We leave it to future research to explore different choices of hyperparameters (and perhaps other F and G).

3.3.4 Implementation Notes

One minor drawback of not performing exact backpropagation is that it may break the gradient check of the entire model. Possible solutions are: (1) perform the gradient check on the model without SN and then add a correctly implemented SN layer; (2) make sure the SN layer is correctly coded (e.g., by reducing it to standard BN using the hyperparameters discussed above and then performing the gradient check).

3.4 Lp Normalization: Calculating NormStats with Different Orders of Moments

Let us discuss the function S(.) for calculating NormStats. There are several choices for this function. We propose Lp normalization, which captures the previous mean-and-standard-deviation normalization as a special case.

First, the mean μ is always calculated in the same way: the average of the activations in NormRef. The divisive factor σ, however, can be calculated in several different ways. In Lp Normalization, σ is chosen to be the p-th root of the p-th absolute moment.

Here the p-th absolute moment of a distribution f(x) about a point c is:

M_p(c) = ∫ |x − c|^p f(x) dx   (1)

and the discrete form is:

M_p(c) = (1/N) Σ_i |x_i − c|^p   (2)

Lp Normalization can be performed with three settings:

  • Setting A: σ is the p-th root of the p-th absolute moment of all activations in NormRef, with c being the mean of NormRef.

  • Setting B: σ is the p-th root of the p-th absolute moment of all activations in NormRef, with c being a running estimate of the mean.

  • Setting C: σ is the p-th root of the p-th absolute moment of all activations in NormRef, with c = 0.

Most of these variants have similar performance but some are better in some situations.

We call it Lp normalization since σ is similar to the norm in the Lp space.

Settings B and C are better for online learning, since Setting A gives a degenerate result (i.e., σ = 0) when there is only one sample in NormRef. Empirically, when there are enough samples in a mini-batch, A and B perform similarly.

We discuss several important special cases:

Special Case A-2: Setting A with p = 2. Here σ is the standard deviation (the square root of the second moment about the mean) of all activations in NormRef. This setting is what is used by Batch Normalization [Ioffe and Szegedy, 2015] and Layer Normalization [Ba et al., 2016].

Special Case p=1: Whenever p=1, is simply the average of absolute values of activations. This setting works virtually the same as p=2, but is much simpler to implement and faster to run. It might also be more biologically-plausible, since the gradient computations are much simpler.
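A small sketch of the divisive factor under Settings A and C follows (the epsilon is our own addition for numerical stability, not part of the definition):

```python
import numpy as np

def lp_divisive_factor(x_ref, p=1, c=None, eps=1e-5):
    """Divisive factor for Lp Normalization: the p-th root of the p-th absolute
    moment of the NormRef activations about the point c (Setting A when c is
    the mean of NormRef, Setting C when c = 0)."""
    if c is None:
        c = x_ref.mean()                          # Setting A
    return (np.mean(np.abs(x_ref - c) ** p) + eps) ** (1.0 / p)

x_ref = np.random.randn(1000)
sigma_l2 = lp_divisive_factor(x_ref, p=2)         # Special Case A-2: standard deviation
sigma_l1 = lp_divisive_factor(x_ref, p=1)         # p = 1: mean absolute deviation
```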

3.5 Separate Learnable Bias and Gain Parameters

The original Batch Normalization [Ioffe and Szegedy, 2015] also learns a bias and a gain (i.e., shift and scaling) parameter for each feature map. Although this is not usually done, these shift and scaling operations can be completely separated from the normalization layer. We implement them as a separate layer following each normalization layer. These parameters are learned in the same way for all normalization schemes evaluated in this paper.
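A minimal sketch of such a separate shift/scale layer (forward pass only, per-feature-map parameters, with the NumPy activation layout assumed elsewhere in this document):

```python
import numpy as np

class BiasGain:
    """Separate learnable shift/scale layer placed after each normalization layer
    (one bias and one gain per feature map); a sketch of the forward pass only."""
    def __init__(self, num_channels):
        self.gain = np.ones((1, num_channels, 1, 1))
        self.bias = np.zeros((1, num_channels, 1, 1))

    def forward(self, y_normalized):
        return self.gain * y_normalized + self.bias
```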

4 Generalization to Recurrent Learning

In this section, we generalize Sample Normalization, General Batch Normalization and Streaming Normalization to recurrent learning. The difference between recurrent learning and feedforward learning is that, for each training sample, every hidden layer of the network receives a sequence of activations (one per timestep) instead of only one.

4.1 Recurrent Sample Normalization

Sample Normalization naturally generalizes to recurrent learning since all NormStats are collected from the current layer of the current timestep. Training and testing algorithms remain the same.

4.2 Recurrent General Batch Normalization (RGBN)

The generalization of GBN to recurrent learning is the same as what was proposed by [Liao and Poggio, 2016] and [Cooijmans et al., 2016]. The training procedure is exactly the same as before (Algorithm 3 and Algorithm 4). For testing, one set of running estimates of NormStats is maintained for each timestep.

Another way of viewing this algorithm is that the same GBN layers described in Algorithm 3 and 4 are used in the unrolled recurrent network. Each GBN layer in the unrolled network uses a different memory storage for NormStats.
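A sketch of this view follows, reusing the GeneralBatchNorm sketch from Section 3.2 and keeping one set of running NormStats per timestep; the dictionary layout is an illustrative assumption.

```python
class RecurrentGBN:
    """Sketch of RGBN / time-specific BN: the same GBN computation at every
    timestep, but a separate set of running NormStats is stored per timestep."""
    def __init__(self, num_channels):
        self.num_channels = num_channels
        self.per_timestep = {}   # timestep -> GeneralBatchNorm state (see Section 3.2 sketch)

    def forward(self, x, t, training=True):
        if t not in self.per_timestep:   # a timestep never seen before gets fresh stats
            self.per_timestep[t] = GeneralBatchNorm(self.num_channels)
        return self.per_timestep[t].forward(x, training)
```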

4.3 Recurrent Streaming Normalization

Extending Streaming Normalization to recurrent learning is straightforward – we not only stream through all the past samples, but also through all past timesteps. Thus, we maintain a unique set of running estimates of NormStats for all timesteps. This is more biologically-plausible and memory efficient than the above approach (RGBN).

Again, another way of viewing this algorithm is that the same Streaming Normalization layers described in Algorithm 5 and 6 are used in the original (instead of the unrolled) recurrent network. All unrolled versions of the same layer share running estimates of NormStats and other related data.

One caveat is that, as time proceeds, the running estimates of NormStats are slightly modified. Thus, when backpropagation reaches the same layer again, the NormStats are slightly different from the ones originally used for normalization. Empirically, this does not seem to hurt performance. Training with “decoupled accumulation and update” with Batches per Update (B/U) > 1 (B/U = 2 is often enough) can also mitigate this problem, since it makes NormStats more stable over time.

5 Streaming Normalized RNN and GRU

In our character-level language modeling task, we tried a Normalized Recurrent Neural Network (RNN) and a Normalized Gated Recurrent Unit (GRU) [Chung et al., 2014]. Let us use Norm(.) to denote a normalization, which can be Sample Normalization, General Batch Normalization or Streaming Normalization. A bias and a gain parameter are also learned for each neuron. We use NonLinear(.) to denote a nonlinear function; we used the hyperbolic tangent (tanh) nonlinearity in our experiments, but we observed that ReLU also works.

h_t is the hidden activation at time t, x_t is the network input at time t, W and U (with subscripts) denote weight matrices, and ⊙ denotes elementwise multiplication.

Normalized RNN

h_t = NonLinear(Norm(W_x x_t + W_h h_{t-1}))   (3)

Normalized GRU

z_t = sigmoid(Norm(W_z x_t + U_z h_{t-1}))   (4)
r_t = sigmoid(Norm(W_r x_t + U_r h_{t-1}))   (5)
h̃_t = NonLinear(Norm(W_c x_t + U_c (r_t ⊙ h_{t-1})))   (6)
h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t   (7)
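As a concrete illustration, here is a minimal NumPy sketch of one normalized GRU step, assuming the standard GRU formulation with Norm(.) applied to the pre-activations; the weight dictionaries W and U and the `norm` callback are illustrative placeholders rather than our exact implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def normalized_gru_step(x_t, h_prev, W, U, norm):
    """One step of a normalized GRU (standard GRU form assumed). W and U map the
    gate names 'z', 'r', 'c' to input-to-hidden and hidden-to-hidden weight
    matrices; `norm` can be any normalization from Section 3 (e.g., streaming)."""
    z_t = sigmoid(norm(W['z'] @ x_t + U['z'] @ h_prev))              # update gate
    r_t = sigmoid(norm(W['r'] @ x_t + U['r'] @ h_prev))              # reset gate
    h_cand = np.tanh(norm(W['c'] @ x_t + U['c'] @ (r_t * h_prev)))   # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_cand                       # new hidden state
```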

6 Related Work

[Laurent et al., 2015] and [Amodei et al., 2015] used Batch Normalization (BN) in stacked recurrent networks, where BN was only applied to the feedforward part (i.e., “vertical” connections, the input to each RNN), but not to the recurrent part (i.e., “horizontal”, hidden-to-hidden connections between timesteps). [Liao and Poggio, 2016] and [Cooijmans et al., 2016] independently proposed applying BN to the recurrent/hidden-to-hidden connections of recurrent networks, but separate normalization statistics must be maintained for each timestep. [Liao and Poggio, 2016] demonstrated this idea with deep multi-stage fully recurrent (and convolutional) neural networks with ReLU nonlinearities and residual connections. [Cooijmans et al., 2016] demonstrated this idea with LSTMs on language processing tasks and sequential MNIST. [Ba et al., 2016] proposed Layer Normalization (LN) as a simple normalization technique for online and recurrent learning, but observed that LN does not work well with convolutional networks. [Salimans and Kingma, 2016] and [Neyshabur et al., 2015] studied normalization using weight reparameterizations. An early work by [Ullman and Schechtman, 1982] mathematically analyzed a form of online normalization for visual perception and adaptation.

7 Experiments

7.1 CIFAR-10 architectures and Settings

We evaluated the normalization techniques on the CIFAR-10 dataset using feedforward fully-connected networks, a feedforward convolutional network and a class of convolutional recurrent networks proposed by [Liao and Poggio, 2016]. The architectural details are shown in Figure 2. We train all models with learning rate 0.1 for 25 epochs and then 0.01 for 5 epochs, with momentum 0.9. We used MatConvNet [Vedaldi and Lenc, 2015] to implement our models.

Figure 2: Architectures for CIFAR-10. Note that C reduces to B when no recurrent unrolling is performed.

7.2 Lp Normalization

We show BN with Lp normalization in Figure 3. Note that Lp normalization can be applied to Layer Normalization and all other normalizations shown in Figure 1 C and D. L1 normalization works as well as L2 while being simpler to implement and faster to compute.

Figure 3: Lp Normalization. The architecture is a feedforward and convolutional network (shown in Figure 2 B). All statistical moments perform similarly well. L7 normalization is slightly worse.

7.3 Online Learning or Learning with Very Small Mini-batches

We perform online learning or learning with small mini-batches using architecture A in Figure 2.

Plain Mini-batch vs. Decoupled Accumulation and Update (DAU): We show in Figure 4 comparisons between conventional mini-batch training and Decoupled Accumulation and Update (DAU).

Figure 4: Plain Mini-batch vs. Decoupled Accumulation and Update (DAU). The architecture is a feedforward and fully-connected network (shown in Figure 2 A). S/B: Samples per Batch. B/U: Batches per Weight Update. There are significant performance differences between plain mini-batch training (i.e., B/U = 1) and Decoupled Accumulation and Update (DAU, i.e., B/U = n > 1). DAU significantly improves the performance of BN with a small number of samples per mini-batch (e.g., compare curve 1 with curve 3).

Layer Normalization vs. Batch Normalization vs. Streaming Normalization: We compare in Figure 5 Layer Normalization, Batch Normalization and Streaming Normalization with different choices of S/B and B/U.

Figure 5: Different normalizations applied to a feedforward and fully-connected network (shown in Figure 2 A). The right two panels are zoomed-in versions of the left two panels. S/B: Samples per Batch. B/U: Batches per Weight Update. “Ours” refers to Streaming Normalization with “L1 norm” (Setting B with p = 1 in Section 3.4) and the default hyperparameters (see Section 3.3.3 for details). Our algorithm works with pure online learning (1 S/B) and tiny mini-batches (2 S/B), and it outperforms Layer Normalization. The choice of S/B does not matter for Layer Normalization since it processes samples independently.

7.4 Evaluating Variants of Batch Normalization

Feedforward Convolutional Networks: In Figure 6, we tested algorithms shown in Figure 1 C and D using the architecture B in Figure 2. We also show the performance of our Streaming Normalization for reference.

Figure 6: Different normalizations applied to a feedforward and convolutional network (shown in Figure 2 B). All models were trained with 32 Samples per Batch (S/B) and 1 Batch per Update (B/U). “Our approach” refers to Streaming Normalization with “L2 norm” (Setting A with p = 2 in Section 3.4) and the default hyperparameters (see Section 3.3.3 for details). LN: Layer Normalization. Sample Normalizations (including LN) all seem to work similarly. It seems beneficial to normalize each channel/feature map separately (e.g., compare BA3 with BA4), as BN does.

ResNet-like convolutional RNN: In Figure 7, we tested algorithms shown in Figure 1 C and D using the architecture C in Figure 2. We also show the performance of our Streaming Normalization for reference.

Figure 7: Different normalizations applied to a recurrent and convolutional network (Figure 2 C). All models were trained with 32 Samples per Batch (S/B) and 1 Batch per Update (B/U). “Our approach” refers to Streaming Normalization with “L2 norm” (Setting A with p = 2 in Section 3.4) and the default hyperparameters (see Section 3.3.3 for details). LN: Layer Normalization. Sample Normalizations (including LN) all seem to work similarly. It seems beneficial to normalize each channel/feature map separately (e.g., compare BA3 with BA4), as BN does.

Densely Recurrent Convolutional Network: In Figure 8, we tested Time-Specific Batch Normalization and Streaming Normalization on the architecture D in Figure 2.

Figure 8: Time-specific Batch Normalization (TSBN) and Streaming Normalization applied to a densely recurrent and convolutional network (Figure 2 D). “Ours” refers to Streaming Normalization with “L2 norm” (Setting B with p = 2 in Section 3.4), 32 Samples per Batch (S/B), 2 Batches per Update (B/U) and the default hyperparameters (see Section 3.3.3 for details). Sometimes B/U > 1 is preferred for recurrent networks, since the first mini-batch collects NormStats from all timesteps so that the second mini-batch is normalized in a more stable way. TSBN was trained with 64 S/B, 1 B/U (32 S/B, 2 B/U would give similar, if not worse, performance). Streaming Normalization has similar performance to TSBN but does not require storing different NormStats for each timestep.

7.5 More Experiments on Streaming Normalization

In Figure 9, we compare the performances of original BN (i.e., NormStats shared over time), time-specific BN, layer normalization and streaming normalization on a recurrent and convolutional network shown in Figure 2 C.

Figure 9: Different normalizations applied to a recurrent and convolutional network (shown in Figure 2 C). The right two panels are zoomed-in versions of the left two panels. “Ours” refers to Streaming Normalization with “L2 norm” (Setting B with p = 2 in Section 3.4), 32 Samples per Batch (S/B), 2 Batches per Update (B/U) and the default hyperparameters (see Section 3.3.3 for details). Time-specific Batch Normalization, original BN and Layer Normalization (LN) were trained with 64 S/B, 1 B/U (32 S/B, 2 B/U would give similar, if not worse, performance). Streaming Normalization clearly outperforms the other methods in training and converges more than twice as fast as LN. Note that 32 S/B, 2 B/U and 64 S/B, 1 B/U are equivalent for LN since it processes samples independently. Original BN fails at test time.

We evaluated different choices of the gradient-streaming weights in Figure 10.

Figure 10: Evaluating different choices of the weights used to combine the long-term, short-term and current-mini-batch gradients of NormStats. The architecture is a recurrent and convolutional network (shown in Figure 2 C). The right two panels are zoomed-in versions of the left two panels. The models are Streaming Normalization with “L2 norm” (Setting B with p = 2 in Section 3.4), 32 Samples per Batch (S/B) and 2 Batches per Update (B/U). The weight triples (long-term, short-term, current) are shown in the figure. (0,0,0) means that the gradients of NormStats are ignored. (0,0,1) means only using NormStats gradients from the current mini-batch. (0,1,0) means only using NormStats gradients accumulated since the last weight update. Note that regardless of these weights, the gradients of NormStats are always accumulated (see Section 3.3.2). Using gradients from before the last weight update (i.e., (1,0,0)) seems to work reasonably well. Some combinations of previous and current gradients (e.g., (0.7, 0, 0.3)) seem to give the best performance. This experiment indicates that streaming the gradients of NormStats is very important for performance.

7.6 Recurrent Neural Networks for Character-level Language Modeling

We tried our simple implementations of the vanilla RNN and GRU described in Section 5. The RNN and GRU both have one hidden layer with 100 units. Weights are updated using the simple Manhattan update rule described in [Liao et al., 2015]. The models were trained with learning rate 0.01 for 2 epochs and 0.001 for 1 epoch on a text file of all of Shakespeare’s works concatenated. We use 99% of the text file for training and 1% for validation. The training and validation softmax losses are reported. Training losses come from mini-batches and are therefore noisy, so we smoothed them using moving averages of 50 neighbors (the Matlab smooth function). The loss on the entire validation set is evaluated and recorded every 20 mini-batches. We show in Figures 11 and 12 the performance of Time-specific Batch Normalization, Layer Normalization and Streaming Normalization. Truncated BPTT was performed with 100 timesteps.

Figure 11: Character-level language modeling with an RNN on Shakespeare’s works concatenated. The training (left) and validation (right) softmax losses are reported. “Ours” refers to Streaming Normalization with “L2 norm” (Setting B with p = 2 in Section 3.4), 32 Samples per Batch (S/B), 2 Batches per Update (B/U) and the default hyperparameters (see Section 3.3.3 for details). TSBN: time-specific BN. LN: Layer Normalization. Both TSBN and Streaming Normalization (SN) converge faster than LN. SN reaches a slightly lower loss than TSBN and LN.
Figure 12: Character-level language modeling with a GRU on Shakespeare’s works concatenated. The training (left) and validation (right) softmax losses are reported. “Ours” refers to Streaming Normalization with “L2 norm” (Setting B with p = 2 in Section 3.4), 32 Samples per Batch (S/B), 2 Batches per Update (B/U) and the default hyperparameters (see Section 3.3.3 for details). TSBN: time-specific BN. LN: Layer Normalization. Streaming Normalization converges faster than LN and reaches a lower loss than TSBN.

8 Discussion

Biological Plausibility

We found that the simple “Neuron-wise Normalization” (BA6 in Figure 1 D) performs very well (Figures 6 and 7). This setting does not require collecting normalization statistics from any other neurons. We show the streaming version of neuron-wise normalization in Figure A1, and its performance is again competitive. In neuron-wise normalization, each neuron simply maintains running estimates of its own mean and variance (and the related gradients), and all the information is maintained locally. This approach may serve as a baseline model for biological homeostatic plasticity mechanisms (e.g., Synaptic Scaling) [Turrigiano and Nelson, 2004, Stellwagen and Malenka, 2006, Turrigiano, 2008], where each neuron internally maintains some normalization/scaling factors that depend on the neuron’s firing history and can be applied and updated in a purely online fashion.

Lp Normalization

Our observations about Lp normalization have several biological implications: First, we show that most Lp normalizations work similarly, which suggests that there might exist a large class of statistics that can be used for normalization. Biological systems could implement any of these methods to get the same level of performance. Second, L1 normalization is particularly interesting, since its gradient computations are much easier for biological neurons to implement.

As an orthogonal direction of research, it would also be interesting to study the relation between our Lp normalization (standardizing the average Lp norm of activations) and Lp regularization (discounting the Lp norm of weights, e.g., L1 weight decay).

Theoretical Understanding

Although normalization methods have been empirically shown to significantly improve the performance of deep learning models, there is not enough theoretical understanding of them. Activation-normalized neurons behave more similarly to biological neurons, whose activations are constrained to a certain range: is this a blessing or a curse? Does it affect the approximation bounds of shallow and deep networks [Mhaskar et al., 2016, Mhaskar and Poggio, 2016]? It would also be interesting to see whether certain normalization methods can mitigate the problems of poor local minima and saddle points, as these problems have been analyzed without normalization [Kawaguchi, 2016].

Internal Covariate Shift in Recurrent Networks

Note that our approach (and perhaps the brain’s “synaptic scaling”) does not normalize differently for each timestep. Thus, it does not naturally handle internal covariate shift [Ioffe and Szegedy, 2015] (more precisely, covariate shift over time) in recurrent networks, which was the main motivation of the original Batch Normalization and of Layer Normalization. Our results seem to suggest that internal covariate shift is not as harmful as previously believed, as long as the entire network’s activations are normalized to a good range. But more research is needed to answer this question.

Acknowledgments

This work was supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF – 1231216.

References

Appendix A Other Variants of Streaming Normalization

Figure A1: We explore other variants of Streaming Normalization with different NormRefs (e.g., SP1-SP5, BA1-BA6 in Figure 1 C and D) within each mini-batch. “-B” denotes the batch version; “-S” denotes the streaming version. The architecture is a feedforward and convolutional network (shown in Figure 2 B). Streaming significantly lowers training errors.