Stabilizing RNN Gradients through Pre-training

08/23/2023
by Luca Herranz-Celotti, et al.

Numerous theories of learning suggest preventing the gradient variance from growing exponentially with depth or time in order to stabilize and improve training. Typically, these analyses are conducted on feed-forward fully-connected neural networks or single-layer recurrent neural networks, given their mathematical tractability. In contrast, this study demonstrates that pre-training the network to local stability can be effective whenever the architecture is too complex for an analytical initialization. Furthermore, we extend known stability theories to encompass a broader family of deep recurrent networks, requiring minimal assumptions on the data and parameter distributions, a theory that we refer to as the Local Stability Condition (LSC). Our investigation reveals that the classical Glorot, He, and Orthogonal initialization schemes satisfy the LSC when applied to feed-forward fully-connected neural networks. However, when analysing deep recurrent networks, we identify a new additive source of exponential explosion that emerges from counting gradient paths in a rectangular grid over depth and time. We propose a new approach to mitigate this issue, which consists of assigning a weight of one half, rather than the classical weight of one, to the time and depth contributions to the gradient. Our empirical results confirm that pre-training both feed-forward and recurrent networks to fulfill the LSC often improves final performance across models. This study contributes to the field by providing a means to stabilize networks of any complexity. Our approach can be implemented as an additional step before pre-training on large augmented datasets, and as an alternative to deriving stable initializations analytically.
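The pre-training idea described above can be pictured as a short optimization phase, run before the main task, that nudges the norm of the recurrent transition's Jacobian toward a target radius. The following is a minimal PyTorch sketch of that idea, not the authors' implementation: the RNN cell, the random probe distributions, the loss, and the target radius of 0.5 (echoing the one-half weighting of the time and depth contributions) are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): pre-train a recurrent cell toward
# local stability before the main task, by pushing the amplification of
# backpropagated gradients through one transition toward a target radius.
import torch
import torch.nn as nn

torch.manual_seed(0)

hidden_size, input_size = 32, 16
cell = nn.RNNCell(input_size, hidden_size, nonlinearity="tanh")
optimizer = torch.optim.Adam(cell.parameters(), lr=1e-3)
target_radius = 0.5  # one-half weighting; 1.0 would be the classical condition

for step in range(200):
    x = torch.randn(8, input_size)                            # random probe inputs
    h = (torch.randn(8, hidden_size) * 0.1).requires_grad_()  # random probe states

    # Vector-Jacobian products estimate how strongly a gradient is amplified
    # when backpropagated through one hidden-state transition.
    h_next = cell(x, h)
    v = torch.randn_like(h_next)
    (vjp,) = torch.autograd.grad(h_next, h, grad_outputs=v, create_graph=True)
    radius_est = vjp.norm(dim=1) / v.norm(dim=1)

    # Penalize deviation of the estimated local radius from the target.
    loss = ((radius_est - target_radius) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

After this stabilizing phase, the cell's parameters would be handed to the ordinary task training loop; the choice of a stochastic norm estimate keeps the sketch cheap, but any other measure of local contraction could be substituted.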


