Unofficial UR-LSTM implementation in Pytorch
Gating mechanisms are widely used in neural network models, where they allow gradients to backpropagate more easily through depth or time. However, their saturation property introduces problems of its own. For example, in recurrent models these gates need to have outputs near 1 to propagate information over long time-delays, which requires them to operate in their saturation regime and hinders gradient-based learning of the gate mechanism. We address this problem by deriving two synergistic modifications to the standard gating mechanism that are easy to implement, introduce no additional hyperparameters, and improve learnability of the gates when they are close to saturation. We show how these changes are related to and improve on alternative recently proposed gating mechanisms such as chrono-initialization and Ordered Neurons. Empirically, our simple gating mechanisms robustly improve the performance of recurrent models on a range of applications, including synthetic memorization tasks, sequential image classification, language modeling, and reinforcement learning, particularly when long-term dependencies are involved.
Recurrent neural networks (RNNs) have become a standard machine learning tool for learning from sequential data. However, RNNs are prone to the vanishing gradient problem, which occurs when the gradients of the recurrent weights become vanishingly small as they get backpropagated through time (Hochreiter et al., 2001). A common approach to alleviate the vanishing gradient problem is to use gating mechanisms, leading to models such as the long short term memory (Hochreiter and Schmidhuber, 1997, LSTM) and gated recurrent units ([…] et al., 2014, GRUs). These gated RNNs have been very successful in several different application areas such as in reinforcement learning (Kapturowski et al., 2018; Espeholt et al., 2018) and natural language processing ([…] et al., 2014; Kočiskỳ et al., 2018).
At every time step, gated recurrent models form a weighted combination of the history summarized by the previous state, and (a function of) the incoming inputs, to create the next state. The values of the gates, which are the coefficients of the combination, control the length of temporal dependencies that can be addressed. This weighted update can be seen as an additive or residual connection on the recurrent state, which helps signals propagate through time without vanishing. However, the gates themselves are prone to a saturating property which can also hamper gradient-based learning. This is particularly troublesome for RNNs, where carrying information for very long time delays requires gates to be very close to their saturated states.
We formulate and address two particular problems that arise with the standard gating mechanism of recurrent models. First, typical initialization of the gates is relatively concentrated. This restricts the range of timescales the model can address, as the timescale of a particular unit is dictated by its gates. Our first proposal, which we call uniform gate initialization (Section 2.2), addresses this by directly initializing the activations of these gates from a distribution that captures a wider spread of dependency lengths.
Second, learning when gates are in their saturation regime is difficult because of vanishing gradients through the gates. We derive a modification that uses an auxiliary refine gate to modulate a main gate, which allows it to have a wider range of activations without gradients vanishing as quickly.
Combining these two independent modifications yields our main proposal, which we call the UR- gating mechanism. These changes can be applied to any gate (i.e. bounded parametrized function) and have minimal to no overhead in terms of speed, memory, code complexity, and (hyper-)parameters. We apply them to the forget gate of recurrent models, and evaluate on many benchmarks including synthetic long-term dependency tasks, sequential pixel-level image classification, language modeling, program execution, and reinforcement learning. Finally, we connect our methods to other proposed gating modifications, introduce a framework that allows each component to be replaced with similar ones, and perform theoretical analysis and extensive ablations of our method. Empirically, the UR- gating mechanism robustly improves on the standard forget and input gates of gated recurrent models. When applied to the LSTM, these simple modifications solve synthetic memory tasks that are pathologically difficult for the standard LSTM, achieve state-of-the-art results on sequential MNIST and CIFAR-10, and show consistent improvements in language modeling on the WikiText-103 dataset (Merity et al., 2016) and reinforcement learning tasks (Hung et al., 2018).
Broadly speaking, RNNs are used to sweep over a sequence of input data $x_t$ to produce a sequence of recurrent states $h_t$ summarizing information seen so far. At a high level, an RNN is just a parametrized function in which each sequential application of the network computes a state update $u : (x_t, h_{t-1}) \mapsto h_t$. Gating mechanisms were introduced to address the vanishing gradient problem (Bengio et al., 1994; Hochreiter et al., 2001), and have proven crucial to the success of RNNs. This mechanism essentially smooths out the update using the following equation,
$$h_t = f_t(x_t, h_{t-1}) \circ h_{t-1} + i_t(x_t, h_{t-1}) \circ u(x_t, h_{t-1}), \qquad (1)$$
where the forget gate $f_t$ and input gate $i_t$ are $[0,1]$-valued functions that control how fast information is forgotten or allowed into the memory state. When the gates are tied, i.e. $f_t + i_t = 1$ as in GRUs, they behave as a low-pass filter, deciding the time-scale on which the unit will respond (Tallec and Ollivier, 2018). For example, large forget gate activations close to $f_t = 1$ are necessary for recurrent models to address long-term dependencies. (In this work, we use “gate” to refer interchangeably to a $[0,1]$-valued function or to the value (“activation”) of that function.)
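The low-pass behaviour of tied gates can be seen in a few lines. The following is a minimal sketch in plain Python, with lists standing in for state vectors; the helper name `gated_update` is hypothetical and not part of the paper.

```python
def gated_update(state_prev, update, f, i=None):
    """One element-wise gated update: f * previous state + i * candidate update."""
    if i is None:
        # Tied gates, as in a GRU: the input gate is the complement of the forget gate.
        i = [1.0 - fj for fj in f]
    return [fj * sj + ij * uj for fj, sj, ij, uj in zip(f, state_prev, i, update)]


# Unit 0 has a forget gate close to 1 and retains its state for a long time;
# unit 1 has a forget gate close to 0 and quickly tracks the (zero) input.
state = [1.0, 1.0]
for _ in range(10):
    state = gated_update(state, update=[0.0, 0.0], f=[0.99, 0.1])
print(state)
```

After ten steps the first unit has decayed by a factor of $0.99^{10}$ while the second has essentially forgotten its initial value, which is the sense in which the gate value sets the unit's timescale.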
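We will introduce our improvements to the gating mechanism primarily in the context of the LSTM, which is the most popular recurrent model. However, these techniques can be used in any model that makes similar use of gates. A typical LSTM (equations (2)-(7)) is an RNN whose state is represented by a tuple $(h_t, c_t)$ consisting of a “hidden” state and “cell” state:
$$\begin{aligned}
f_t &= \sigma(\mathcal{L}_f(x_t, h_{t-1})) && (2) \\
i_t &= \sigma(\mathcal{L}_i(x_t, h_{t-1})) && (3) \\
u_t &= \tanh(\mathcal{L}_u(x_t, h_{t-1})) && (4) \\
c_t &= f_t \circ c_{t-1} + i_t \circ u_t && (5) \\
o_t &= \sigma(\mathcal{L}_o(x_t, h_{t-1})) && (6) \\
h_t &= o_t \circ \tanh(c_t) && (7)
\end{aligned}$$
The basic gate equation (1) is used to create the next cell state $c_t$ (Equation (5)). Note that the gate and update activations are a function of the previous hidden state $h_{t-1}$ instead of $c_{t-1}$. Here, $\mathcal{L}_\star$ corresponds to a parameterized linear function of its inputs with bias $b_\star$, e.g.
$$\mathcal{L}_f(x_t, h_{t-1}) = W_{fx} x_t + W_{fh} h_{t-1} + b_f, \qquad (8)$$
and $\sigma(\cdot)$ refers to the commonly used sigmoid activation function, which we will assume is used for defining $[0,1]$-valued activations in the rest of this paper. A third “output” gate $o_t$ is used to define the next hidden state $h_t$ from the squashed cell state $\tanh(c_t)$ (equations (6)-(7)). As originally proposed by Hochreiter and Schmidhuber (1997), the cell state of an LSTM serves as an “error carousel” that allows gradients to propagate through time without vanishing due to the additive gated update.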
The gates of the LSTM were initially motivated as a binary mechanism, switching on or off to allow information and gradients to pass through. However, in reality this fails to happen due to a combination of initialization and saturation. This can prove to be problematic, for example if very long dependencies are present. Additionally, even if this behavior is somehow obtained, saturation also makes it very difficult to unlearn it if needed (Gulcehre et al., 2016).
As previously mentioned, we present two solutions which work in tandem to address the issues described in the previous section. The first ensures a diverse range of gate values at the start of training by sampling the gate’s biases so that the activations will be approximately uniformly distributed at initialization. We call this Uniform Gate Initialization (UGI). The second allows better gradient flow by reparameterizing the gate using an auxiliary gate, which allows the activation function to vary between different saturation regimes.
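As our main application is for recurrent models, we present the full UR-LSTM model in equations (9)-(15). However, we note that these methods can be used to modify any gate (or more generally, bounded function) in any model. In this context the UR-LSTM is simply defined by applying UGI (equation (9)) and a refine gate (equation (11)) on the original forget gate (equation (10)) to create an effective forget gate (equation (12)). This effective gate is then used in the cell state update (13), with the update $u_t$ and the output gate defined as in the standard LSTM:
$$\begin{aligned}
b_f &\sim \sigma^{-1}(\mathcal{U}[0,1]) && (9) \\
f_t &= \sigma(\mathcal{L}_f(x_t, h_{t-1})) && (10) \\
r_t &= \sigma(\mathcal{L}_r(x_t, h_{t-1})) && (11) \\
g_t &= r_t \cdot (1 - (1 - f_t)^2) + (1 - r_t) \cdot f_t^2 && (12) \\
c_t &= g_t \circ c_{t-1} + (1 - g_t) \circ u_t && (13) \\
o_t &= \sigma(\mathcal{L}_o(x_t, h_{t-1})) && (14) \\
h_t &= o_t \circ \tanh(c_t) && (15)
\end{aligned}$$
Empirically, these small modifications to an LSTM are enough to allow it to achieve nearly binary activations and solve difficult memory problems (Figure 4). In the rest of Section 2, we provide theoretical justifications of UGI and refine gates.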
Standard initialization schemes for the gates can prevent the learning of long-term temporal correlations (Tallec and Ollivier, 2018). For example, supposing that a unit in the cell state has constant forget gate value $f_t$, then the contribution of an input $x_t$ in $k$ time steps will decay by $(f_t)^k$. This gives the unit an effective decay period or characteristic timescale of $O\left(\frac{1}{1-f_t}\right)$ (this corresponds to the number of timesteps it takes to decay by $1/e$). Standard initialization of linear layers sets the bias term to $0$, which causes the forget gate values (2) to concentrate around $1/2$. A common trick of setting the forget gate bias to $b_f = 1.0$ (Jozefowicz et al., 2015) does increase the value of the decay period to $\frac{1}{1-\sigma(1.0)} \approx 3.7$. However, this is still relatively small, and moreover fixed, hindering the model from easily learning dependencies at varying timescales.
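To make the arithmetic concrete, the sketch below computes the decay period $1/(1-\sigma(b))$ implied by a constant forget bias $b$, assuming the linear part of the gate contributes zero at initialization. Since $1/(1-\sigma(b)) = 1 + e^{b}$, the timescale grows only exponentially in the bias. The function names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def decay_period(forget_bias):
    """Characteristic timescale 1 / (1 - f) of a unit whose forget gate is sigmoid(bias)."""
    f = sigmoid(forget_bias)
    return 1.0 / (1.0 - f)


for bias in (0.0, 1.0, 5.0):
    print(f"bias={bias:.1f}  forget={sigmoid(bias):.3f}  decay period={decay_period(bias):.1f}")
```

A zero bias gives a timescale of 2 steps and a bias of 1.0 gives roughly 3.7 steps, so a single fixed bias covers only one short timescale.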
We instead propose to directly control the distribution of forget gates, and hence the corresponding distribution of decay periods. In particular, we propose to simply initialize the value of the forget gate activations ($f_t$) according to a uniform distribution $\mathcal{U}(0,1)$, as described in Section 2.1 (some care must be taken to avoid instabilities; implementation details are in Appendix D.5). An important difference between UGI and standard initialization or other proposed initializations (e.g. Tallec and Ollivier, 2018) is that negative forget biases are allowed. The effect of UGI is that all timescales are covered, from units with very high forget activations remembering information (nearly) indefinitely, to those with low activations focusing solely on the incoming input. Additionally, it introduces no additional parameters; it can even have fewer hyperparameters than the standard gate initialization, as the forget bias is sometimes treated as a hyperparameter.
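A minimal sketch of UGI follows: sample the desired activation $u$ uniformly and set the bias to $\sigma^{-1}(u) = -\log(1/u - 1)$. Keeping $u$ inside $[1/d, 1 - 1/d]$ for hidden size $d$ is an assumption made here as one simple way to avoid infinite biases; the function name and signature are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def ugi_forget_bias(hidden_size, rng=random):
    """Sample forget biases so that sigmoid(bias) is approximately uniform on (0, 1)."""
    eps = 1.0 / hidden_size  # assumed clipping margin, keeps the logit finite
    biases = []
    for _ in range(hidden_size):
        u = rng.uniform(eps, 1.0 - eps)          # target gate activation
        biases.append(-math.log(1.0 / u - 1.0))  # inverse sigmoid (logit) of u
    return biases


biases = ugi_forget_bias(8, random.Random(0))
print([round(sigmoid(b), 3) for b in biases])
```

Roughly half of the sampled biases are negative, which is the difference from constant-bias and chrono initialization noted above.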
Given a gate $f = \sigma(\mathcal{L}_f(x)) \in [0,1]$, the refine gate is an independent gate $r = \sigma(\mathcal{L}_r(x))$ that modulates $f$ to produce a value $g \in [0,1]$, which will be used in place of $f$ downstream. We motivate it by considering how to modify the output of a gate in a way that promotes gradient-based learning, as derived below.
The root of the saturation problem is that the gradient $\nabla f$ of a gate, which can be written solely as a function of the activation value as $f(1-f)$, decays rapidly as $f$ approaches $0$ or $1$. Thus when the activation $f$ is past a certain upper or lower threshold, learning effectively stops. This problem cannot be fully addressed only by modifying the input to the sigmoid, as in UGI and other techniques, as the gradient will still vanish by backpropagating through the bounded activation function.
Therefore, to better control activations near the saturating regime, instead of changing the input to the sigmoid in $f = \sigma(\mathcal{L}(x))$, we consider modifying the output. In particular, we consider adjusting $f$ with an input-dependent update $\phi(f, x)$ for some function $\phi$, to create an effective gate $g = f + \phi(f, x)$ that will be used in place of $f$ downstream, such as in the main state update (1). This sort of additive (“residual”) connection is a common technique to increase gradient flow, and indeed was the motivation of the LSTM additive gated update itself (Hochreiter and Schmidhuber, 1997).
Although many choices seem plausible for selecting the additive update $\phi$, we reason backwards from necessary properties of the effective activation $g$ to deduce a principled function $\phi$. The refine gate will appear as a result.
First, note that $f$ might need to be increased or decreased, regardless of what its value is. For example, given a large activation near saturation, it may need to be even higher to address long-term dependencies in recurrent models; alternatively, if it is too high by initialization or needs to unlearn previous behavior, it may need to decrease. Therefore, the additive update to $f$ should create an effective activation $g$ in the range $f \pm \alpha$ for some $\alpha$. Note that the allowed adjustment range $\alpha = \alpha(f)$ needs to be a function of $f$ in order to keep $g \in [0,1]$.
In particular, the additive adjustment range $\alpha(f)$ should satisfy:
Validity: $\alpha(f) \le \min(f, 1-f)$, to ensure $g \in f \pm \alpha(f) \subseteq [0,1]$.
Symmetry: Since $0$ and $1$ are completely symmetrical in the gating framework, $\alpha(f) = \alpha(1-f)$.
Differentiability: $\alpha(f)$ will be used in backpropagation, requiring $\alpha \in C^1$.
Figure 1(a) illustrates the general appearance of $\alpha(f)$ based on these properties. In particular, Validity implies that its derivative satisfies $\alpha'(0) \le 1$ and $\alpha'(1) \ge -1$, Symmetry implies $\alpha'(f) = -\alpha'(1-f)$, and Differentiability implies $\alpha'$ is continuous. The simplest such function satisfying these is the linear $\alpha'(f) = 1 - 2f$, yielding $\alpha(f) = f - f^2 = f(1-f)$.
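As a quick numerical sanity check of the derivation above, the following sketch verifies on a grid that the choice $\alpha(f) = f(1-f)$ satisfies the Validity and Symmetry conditions. It is a check under that assumed form of $\alpha$, not a proof.

```python
def alpha(f):
    """Candidate adjustment range alpha(f) = f * (1 - f)."""
    return f * (1.0 - f)


grid = [k / 1000 for k in range(1001)]

# Validity: the band f +/- alpha(f) never leaves [0, 1].
validity = all(alpha(f) <= min(f, 1.0 - f) + 1e-12 for f in grid)

# Symmetry: alpha treats activations near 0 and near 1 identically.
symmetry = all(abs(alpha(f) - alpha(1.0 - f)) < 1e-12 for f in grid)

print(validity, symmetry)
```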
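Given such a $\alpha(f)$, recall that the goal is to produce an effective activation $g = f + \phi(f, x)$ such that $g \in f \pm \alpha(f)$ (Figure 1(b)). Our final observation is that the simplest such function $\phi$ satisfying this is $\phi(f, x) = \alpha(f)\,\psi(f, x)$ for some $\psi(\cdot) \in [-1, 1]$. Using the standard method for defining $[-1,1]$-valued functions via a $\tanh$ non-linearity leads to $\phi(f, x) = \alpha(f)(2r - 1)$ for another gate $r = \sigma(\mathcal{L}(x))$.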
The full update is given in Equation (16),
$$g = f + \alpha(f)(2r - 1) = f + f(1-f)(2r-1) = (1-r) \cdot f^2 + r \cdot (1 - (1-f)^2). \qquad (16)$$
Equation (16) has the elegant interpretation that the gate $r$ linearly interpolates between the lower band $f - \alpha(f) = f^2$ and the symmetric upper band $f + \alpha(f) = 1 - (1-f)^2$ (Figure 1(b)). In other words, the original gate $f$ is the coarse-grained determinant of the effective gate $g$, while the gate $r$ “refines” it. This allows the effective gate $g$ to reach much higher and lower activations than the constituent gates $f$ and $r$ (Figure 1(c)), bypassing the saturating gradient problem. For example, this allows the effective forget gate to reach $g = 0.99$ when the forget gate is only $f = 0.9$.
As an aside, we note that instead of choosing a single function defining the allowable range for $g$, we could have defined separate upper and lower bands. However, tying them has the additional intuitive property that when the refine gate is centered ($r = 0.5$), the effective gate is equal to the forget gate ($g = f$).
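The refine-gate update can be written as a two-line function. The sketch below assumes the band $\alpha(f) = f(1-f)$ and implements both the additive form and the equivalent interpolation form between $f^2$ and $1-(1-f)^2$; the function names are illustrative.

```python
def effective_gate(f, r):
    """Refine a gate activation f in [0, 1] with a refine gate r in [0, 1]."""
    alpha = f * (1.0 - f)               # half-width of the allowed band around f
    return f + alpha * (2.0 * r - 1.0)  # r = 0 gives f**2, r = 1 gives 1 - (1 - f)**2


def effective_gate_interp(f, r):
    """Equivalent form: linear interpolation between the lower and upper bands."""
    return (1.0 - r) * f ** 2 + r * (1.0 - (1.0 - f) ** 2)


for r in (0.0, 0.5, 1.0):
    print(r, round(effective_gate(0.9, r), 4))
```

With $f = 0.9$, the effective gate ranges from $0.81$ (at $r = 0$) through $0.9$ (at $r = 0.5$) up to $0.99$ (at $r = 1$), so moderate values of $f$ and $r$ suffice to produce a near-saturated effective activation.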
Formally, the full mechanism of the refine gate as applied to gated recurrent models is defined in equations (11)-(13). Note that it is an isolated change where the forget gate (10) is modified before applying the standard update (1). Figure 1(b) illustrates how the refine gate $r_t$ changes the forget gate $f_t$ to produce an effective gate $g_t$ within a band. In particular, Figure 1(d) considers the magnitude of the gradient $|\nabla g|$ as a function of the activation $g$. For example, this function is $g(1-g)$ when $g = \sigma(\cdot)$ is the basic gate. However, this gradient increases substantially when the refine gate is introduced. Figure 1(d) plots the minimum and maximum multiplicative increase over the case of a standard gate, illustrating how the inclusion of a refine gate promotes better gradient flow and optimization, increasingly so in the saturating regime near the endpoints $0$ and $1$.
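One way to make this gradient comparison concrete is sketched below. It assumes scalar pre-activations `a` and `b` for the forget and refine gates, an effective gate $g = f + f(1-f)(2r-1)$ with $f = \sigma(a)$ and $r = \sigma(b)$, and measures the Euclidean norm of the gradient of $g$ with respect to $(a, b)$. The choice of norm and the function names are assumptions of this sketch and may differ from what the figure plots.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def refined_gate(a, b):
    """Effective gate from forget pre-activation a and refine pre-activation b."""
    f, r = sigmoid(a), sigmoid(b)
    return f + f * (1.0 - f) * (2.0 * r - 1.0)


def refined_gate_grads(a, b):
    """Analytic partial derivatives of the effective gate with respect to a and b."""
    f, r = sigmoid(a), sigmoid(b)
    dg_da = f * (1.0 - f) * (1.0 + (1.0 - 2.0 * f) * (2.0 * r - 1.0))
    dg_db = 2.0 * f * (1.0 - f) * r * (1.0 - r)
    return dg_da, dg_db


a, b = 2.0, 1.5
g = refined_gate(a, b)
refined_norm = math.hypot(*refined_gate_grads(a, b))
standard_grad = g * (1.0 - g)  # gradient of a plain sigmoid gate at the same activation
print(round(g, 4), round(refined_norm / standard_grad, 3))
```

At this particular point the refined gate has a larger gradient norm than a plain sigmoid gate sitting at the same activation value; the ratio depends on how the activation is split between the two gates.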
Finally, to simplify comparisons and ensure that we always use the same number of parameters as the standard gates, when using the refine gate we tie the input gate to the effective forget gate, $i_t = 1 - g_t$. However, we emphasize that these techniques are extremely simple and broad, and can be applied to any gate (or more broadly, any bounded function) to improve initialization distribution and help optimization. For example, our methods can be combined in different ways in recurrent models, e.g. an independent input gate can be modified with its own refine gate. Alternatively, the refine gate can also be initialized uniformly, which we do in our experiments whenever both UGI and refine gates are used.
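Putting the pieces together, a single UR-LSTM step might look like the sketch below. It works per unit on plain Python lists and takes the four linear pre-activations as inputs, so the weight matrices are out of scope. Using $-b_f$ as the refine-gate bias is an assumption made here as one simple way to initialize the refine gate uniformly (if $\sigma(b_f)$ is uniform, so is $\sigma(-b_f) = 1 - \sigma(b_f)$); the function name and argument layout are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def ur_lstm_step(c_prev, lin_f, lin_r, lin_u, lin_o, b_f):
    """One UR-LSTM step. All arguments are per-unit lists of equal length.

    lin_* are the linear pre-activations computed from (x_t, h_{t-1});
    b_f is the forget bias, e.g. sampled with uniform gate initialization.
    """
    h_next, c_next = [], []
    for c, zf, zr, zu, zo, b in zip(c_prev, lin_f, lin_r, lin_u, lin_o, b_f):
        f = sigmoid(zf + b)                      # forget gate
        r = sigmoid(zr - b)                      # refine gate (assumed bias -b_f)
        g = f + f * (1.0 - f) * (2.0 * r - 1.0)  # effective forget gate
        u = math.tanh(zu)                        # candidate update
        c_new = g * c + (1.0 - g) * u            # input gate tied to 1 - g
        o = sigmoid(zo)                          # output gate
        c_next.append(c_new)
        h_next.append(o * math.tanh(c_new))
    return h_next, c_next


h, c = ur_lstm_step([1.0], [0.0], [0.0], [0.0], [0.0], [0.0])
print(h, c)
```

Compared with a standard LSTM step, the only changes are the extra refine gate, the effective gate `g`, and the tied input gate, so the parameter count stays the same as long as the refine gate replaces the independent input gate.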
Figure 1 illustrates the refine gate as applied to an LSTM cell.
Finally, we remark that although we have been working directly with gate activation values $f_t \in [0,1]$, it is illustrative to reason with their characteristic timescales $1/(1-f_t)$ instead, whence both UGI and the refine gate also have clean interpretations. First, UGI is equivalent to initializing the decay period from a particular heavy-tailed distribution, in contrast to standard initialization with a fixed decay period $(1-\sigma(b_f))^{-1}$.
UGI is equivalent to sampling the decay period $D = 1/(1-f_t)$ from a distribution with density proportional to $\frac{d}{dx}\left(1 - \frac{1}{x}\right) = x^{-2}$, i.e. a Pareto distribution.
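The heavy-tailed claim is easy to check empirically. If $f$ is uniform on $[0,1)$ and $D = 1/(1-f)$, then $P(D \le x) = P(f \le 1 - 1/x) = 1 - 1/x$ for $x \ge 1$, whose derivative is $x^{-2}$. The sketch below, with illustrative function names and an arbitrary seed, compares the empirical distribution of sampled decay periods against this closed form.

```python
import random


def sample_decay_periods(n, rng):
    """Draw f ~ U[0, 1) and return the decay periods D = 1 / (1 - f)."""
    return [1.0 / (1.0 - rng.random()) for _ in range(n)]


def empirical_cdf(samples, x):
    """Fraction of samples that are at most x."""
    return sum(1 for d in samples if d <= x) / len(samples)


rng = random.Random(0)
periods = sample_decay_periods(100_000, rng)
for x in (2.0, 10.0, 100.0):
    print(x, round(empirical_cdf(periods, x), 3), 1.0 - 1.0 / x)
```

Half of the units get a timescale of at most 2 steps, while about one in a hundred gets a timescale above 100 steps, which is the spread of timescales that a fixed bias cannot provide.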
On the other hand, for any forget gate activation $f_t$ with timescale $D = 1/(1-f_t)$, the refine gate fine-tunes it between $D/(1+f_t) \ge D/2$ and $D^2$.
Given a forget gate activation with timescale $D$, the refine gate creates an effective forget gate with timescale in $(D/2, D^2)$.
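The timescale range follows from the two extremes of the effective gate, assuming the bands $f^2$ (refine gate at 0) and $1-(1-f)^2$ (refine gate at 1): the lower band has timescale $1/(1-f^2) = D/(1+f)$ and the upper band has timescale $1/(1-f)^2 = D^2$. A small sketch with illustrative names:

```python
def timescale(gate):
    """Characteristic timescale 1 / (1 - gate) of a constant gate activation."""
    return 1.0 / (1.0 - gate)


def refined_timescale_range(f):
    """Timescales reachable by the effective gate as the refine gate goes from 0 to 1."""
    lower = timescale(f * f)                    # refine gate at 0: g = f**2
    upper = timescale(1.0 - (1.0 - f) ** 2)     # refine gate at 1: g = 1 - (1 - f)**2
    return lower, upper


f = 0.9
D = timescale(f)
low, high = refined_timescale_range(f)
print(round(D, 2), round(low, 2), round(high, 2))
```

For $f = 0.9$ the base timescale is about 10 steps, and the refine gate can move it anywhere between roughly 5 and 100 steps.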
Together, UGI and the refine gate can be viewed as principled techniques to modify the input to and output of a sigmoid gate, respectively, in a way that encourages diverse activations and reduces saturation.
Figure 1: (a) The adjustment range $\alpha(f_t)$. (b) The forget gate $f_t$, which is conventionally the sigmoid function (solid line). The refine gate interpolates around the original gate $f_t$ to yield an effective gate $g_t$ within the upper and lower curves, $g_t \in f_t \pm \alpha(f_t)$. (c) Contours of the effective gate $g_t$ as a function of the forget and refine gates $f_t, r_t$. High effective activations can be achieved with more modest $f_t, r_t$ values. (d) The gradient $\nabla g_t$ as a function of effective gate activation $g_t$. [Black, blue]: Lower and upper bounds on the ratio of the gradient when using a refine gate vs. without.
We highlight a few recent works that also propose small gate changes to address problems of long-term or variable-length dependencies. Like ours, they can be applied to any gated update equation.
Tallec and Ollivier (2018) suggest an initialization strategy to capture long-term dependencies on the order of $T_{max}$, by sampling the gate biases from $b_f \sim \log \mathcal{U}(1, T_{max} - 1)$. Although similar to UGI in definition, chrono initialization (CI) has key differences in the timescales captured, for example by using an explicit timescale parameter and having no negative biases. Due to its relation to UGI, we provide a more detailed comparison in Appendix B.1. As mentioned in Section 2.3, techniques such as these that only modify the input to a sigmoid gate do not fully address the saturation problem, because gradients can still approach zero in the saturated regime.
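For comparison with UGI, a minimal sketch of chrono initialization as described by Tallec and Ollivier is given below: the forget bias is the logarithm of a uniform draw from $[1, T_{max} - 1]$ and the input bias is its negation. The function name and the use of Python's `random` module are choices made for this illustration.

```python
import math
import random


def chrono_biases(hidden_size, t_max, rng=random):
    """Chrono initialization: b_f = log(U(1, t_max - 1)), b_i = -b_f (requires t_max > 2)."""
    forget = [math.log(rng.uniform(1.0, t_max - 1.0)) for _ in range(hidden_size)]
    inputs = [-b for b in forget]
    return forget, inputs


b_f, b_i = chrono_biases(8, t_max=1000, rng=random.Random(0))
print([round(b, 2) for b in b_f])
```

Because the uniform draw is at least 1, every forget bias is non-negative, so all units start with a forget activation of at least 0.5. This is the contrast with UGI, which also covers short timescales.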
The Ordered Neuron LSTM introduced by Shen et al. (2018) aims to induce an ordering over the units in the hidden states such that “higher-level” neurons retain information for longer and capture higher-level information. We highlight this work due to its recent success in NLP, and also because its novelties can be factored into introducing two mechanisms which only affect the forget and input gates, namely (i) the cumax activation function, which creates a monotonically increasing vector in $[0,1]$, and (ii) a pair of “master gates” which are ordered by cumax and fine-tuned with another pair of gates.
Moreover, we observe that these are related to our techniques in that one controls the distribution of a gate activation, and the other is an auxiliary gate with modulating behavior.
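For reference, the cumax activation used by the ON-LSTM is the cumulative sum of a softmax. A short plain-Python sketch (function names illustrative) shows why its output is a monotonically increasing vector in $[0,1]$:

```python
import math


def softmax(xs):
    """Numerically stable softmax of a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cumax(xs):
    """Cumulative sum of softmax(xs): non-decreasing, ending at 1."""
    out, running = [], 0.0
    for p in softmax(xs):
        running += p
        out.append(running)
    return out


print([round(v, 3) for v in cumax([0.5, -1.0, 2.0, 0.0])])
```

Because the softmax entries are positive and sum to one, later units always receive activations at least as large as earlier ones, which is how the ordering over neurons is imposed.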
Despite its important novelties, we observe that the ON-LSTM has drawbacks including speed/stability issues and theoretical flaws in the scaling of its gates. We provide the formal definition and detailed analysis of the ON-LSTM in Appendix B.2. In particular we flesh out a deeper relation between the master and refine gates and show how they can be interchanged for each other.
Our insights about previous work with related gate components allow us to perform extensive ablations of our contributions. We observe two independent axes of variation, namely, activation function/initialization (cumax, constant bias sigmoid, CI, UGI) and auxiliary modulating gates (master, refine), where different components can be replaced with each other. Therefore we propose several other gate combinations to isolate the effects of different gating mechanisms. We summarize a few ablations here; precise details are given in Appendix B.3.
O-: Ordered gates. A natural simplification of the main idea of ON-LSTM, while keeping the hierarchical bias on the forget activations, is to simply drop the auxiliary master gates and define $f_t, i_t$ (2)-(3) using the cumax activation function.
UM-: UGI master gates. This variant of the ON-LSTM’s gates ablates the cumax operation on the master gates, replacing it with a sigmoid activation and UGI, which maintains the same initial distribution on the activation values.
OR-: Refine instead of master. A final variant in between the UR- gates and the ON-LSTM’s gates combines cumax with refine gates. In this formulation, as in UR- gates, the refine gate modifies the forget gate and the input gate is tied to the effective forget gate. The forget gate is ordered using cumax.
Table 1 summarizes the gating modifications we consider and their naming conventions. Note that we also denote the ON-LSTM method as “OM-LSTM” (M for master) for mnemonic ease. Finally, we remark that all methods here are controlled with the same number of parameters as the standard LSTM, aside from the OM-LSTM and UM-LSTM, which use an additional $\frac{1}{2C}$-fraction of parameters, where $C$ is the downsize factor on the master gates (Appendix B.2).
| Name | Gate Mechanism |
|---|---|
| - | Standard gate initialization (1) |
| C- | Chrono initialization, no auxiliary gate |
| O- | cumax activation on forget/input gates |
| U- | Uniform gate initialization, no auxiliary gate |
| R- | Refine gate with standard gate initialization |
| OM- | Ordered main gates, auxiliary master gates |
| UM- | UGI main gates, auxiliary master gates |
| OR- | Ordered main gate, auxiliary refine gate |
| UR- | UGI main gate, auxiliary refine gate |
We first perform full ablations of the gating variants (Section 2.6) on benchmark synthetic memorization and pixel-by-pixel image classification tasks. We then evaluate our main method on important applications for recurrent models including language modeling and reinforcement learning, comparing against baseline methods where appropriate. We provide further results on a program execution task in Appendix E.3.
When chrono-initialization is used and not explicitly tuned, we set $T_{max}$ to be proportional to the hidden size. This heuristic uses the intuition that if dependencies of length $T$ exist, then so should dependencies of all lengths $\le T$. Moreover, the amount of information that can be remembered is proportional to the number of hidden units.
All of our benchmarks have prior work with recurrent baselines, from which we used the same models, protocol, and hyperparameters whenever possible, changing only the gating mechanism. Since our simple gate changes are compatible with other recurrent cores, we evaluate them in tandem with recurrent models such as the GRU, Reconstructive Memory Agent (RMA; Hung et al., 2018), and Relational Memory Core (RMC; Santoro et al., 2018) whenever they were used on these tasks. We make a note of any important experimental details for each task, and full protocols and details for evaluation and figures are given in Appendix D.
Our first set of experiments is on synthetic memory tasks (Hochreiter and Schmidhuber, 1997; Arjovsky et al., 2016) that are known to be hard for standard LSTMs to solve. For these tasks, we used single layer models with 256 hidden units, trained using Adam with learning rate .
Copy task. In the Copy task, a sequence of $N + 20$ digits is generated where the first 10 tokens $(a_0, a_1, \dots, a_9)$ are randomly chosen from $\{1, \dots, 8\}$, the middle $N$ tokens are set to $0$, and the last ten tokens are $9$. The goal of the recurrent model is to output $(a_0, \dots, a_9)$ in order on the last 10 time steps, whenever the cue token $9$ is presented. We trained our models using cross-entropy with baseline loss $\log(8)$ (Appendix D.1).
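A data generator for this task can be sketched as follows. The token conventions used here (data tokens 1 to 8, blank token 0, cue token 9, and a blank target everywhere except the last 10 steps) follow the common formulation of the Copy task and are assumptions of this sketch, as is the function name.

```python
import random

BLANK, CUE = 0, 9          # assumed ids for the blank and cue tokens
DATA_TOKENS = range(1, 9)  # assumed data alphabet {1, ..., 8}


def make_copy_example(n_blank, rng):
    """Return (inputs, targets), each of length n_blank + 20."""
    data = [rng.choice(DATA_TOKENS) for _ in range(10)]
    inputs = data + [BLANK] * n_blank + [CUE] * 10
    targets = [BLANK] * (n_blank + 10) + data  # reproduce the data on the cue steps
    return inputs, targets


inputs, targets = make_copy_example(5, random.Random(0))
print(inputs)
print(targets)
```

A model that ignores the data tokens can do no better than guessing uniformly among the 8 data symbols on the last 10 steps, which corresponds to a cross-entropy of $\log 8$ per scored step.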
Adding task. The input consists of two sequences: (1) $N$ numbers $(a_0, \dots, a_{N-1})$ sampled independently from $\mathcal{U}[0,1]$, and (2) an index $i_0$ in the first half of the sequence and $i_1$ in the second half, together encoded as a two-hot sequence. The target output is $a_{i_0} + a_{i_1}$ and models are evaluated by the mean squared error, with baseline loss $1/6$.
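The Adding task and its baseline can be sketched in a few lines. The example below assumes values drawn uniformly from $[0, 1)$ and one marked index in each half of the sequence; always predicting the mean target of 1.0 then gives a mean squared error equal to the variance of the sum of two independent uniforms, $2 \cdot \frac{1}{12} = \frac{1}{6}$. Function names are illustrative.

```python
import random


def make_adding_example(n, rng):
    """Return a list of (value, marker) pairs and the target sum of the two marked values."""
    values = [rng.random() for _ in range(n)]
    i0 = rng.randrange(0, n // 2)   # marked index in the first half
    i1 = rng.randrange(n // 2, n)   # marked index in the second half
    markers = [0.0] * n
    markers[i0] = 1.0
    markers[i1] = 1.0
    return list(zip(values, markers)), values[i0] + values[i1]


def constant_baseline_mse(n_examples, n, rng):
    """Mean squared error of always predicting 1.0, the mean of the target."""
    total = 0.0
    for _ in range(n_examples):
        _, target = make_adding_example(n, rng)
        total += (target - 1.0) ** 2
    return total / n_examples


print(round(constant_baseline_mse(20_000, 20, random.Random(0)), 3))
```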
Figure 3 shows the loss of various methods on the Copy and Adding tasks. The only gate combinations capable of solving Copy completely are OR-, UR-, O-, and C-LSTM. This confirms the mechanism of their gates: these are the only methods capable of producing high enough forget gate values, whether through the cumax non-linearity, the refine gate, or extremely high forget biases. The U-LSTM is the only other method able to make progress, but it converges more slowly as it suffers from gate saturation without the refine gate. The vanilla LSTM makes no progress. The OM-LSTM and UM-LSTM also get stuck at the baseline loss, despite the OM-LSTM’s cumax activation, which we hypothesize is due to the suboptimal magnitudes of the gates at initialization (Appendix B.2). On the Adding task, every method besides the basic LSTM is able to eventually solve it, with all refine gate variants fastest.
Figure 4 shows the distributions of forget gate activations of sigmoid-activation methods, before and after training on the Copy task. Note that the C-LSTM and vanilla LSTM stay close to initialization (near $1.0$ for C-LSTM, and $0.5$ for vanilla). Thus the C-LSTM solves the task because it has artificially high gate values by construction, while the LSTM cannot learn any high-valued gates. On the other hand, the U-LSTM makes slow progress at pushing gate values toward $1$. Finally, the UR-LSTM learns a bimodal distribution of forget gate activations, tending toward either $0$ or $1$, with a large fraction of the neurons able to get close to perfect remembering as is required by the task.
The C-LSTM has extremal gate initializations by construction, which is well suited for this task. This raises the question: what happens if the initialization distribution does not match the task at hand; could the gates learn back to a more moderate regime? We point out that such a phenomenon could occur non-pathologically on more complex setups, such as a scenario where a model trains to remember on a Copy-like task and then needs to “unlearn” as part of a meta-learning or continual learning setup. In Appendix E.1, we consider such a synthetic scenario and experimentally show that the addition of a refine gate helps models train much faster while in a saturated regime with extremal activations. We also point to the poor performance of C- outside of synthetic memory tasks when using our high hyperparameter-free initialization as more evidence that it is very difficult for standard gates to unlearn undesired saturated behavior.
These tasks involve feeding a recurrent model the pixels of an image in a scanline order before producing a classification label. We test on the sequential MNIST (sMNIST), permuted MNIST (pMNIST) (Le et al., 2015), and sequential CIFAR-10 (sCIFAR) tasks. For this task, single-layer LSTMs with 512 hidden units are used as the base model. Each LSTM method was run with a learning rate sweep with 3 seeds each. The best validation score found over any run is reported in the first two rows of Table 2 (sMNIST is not included here as it is too easy, making it difficult to draw conclusions). We find in general that all methods are able to improve over the vanilla LSTM. However, the differences become even more pronounced when stability is considered. Although Table 2 reports the best validation accuracies found on any run, we found that many methods were quite unstable. Asterisks next to a score denote how many of the 3 seeds diverged at the learning rate where that score was found.
Conversely, Figure 5 shows the accuracy curves of each method at their best stable learning rate. The basic LSTM is noticeably worse than all of the others. This suggests that any of the gate modifications, whether better initialization, non-linearity, or master or refine gates, are better than standard gates, especially when long-term dependencies are present. Additionally, the uniform gate initialization methods are generally better than the ordered and chrono initialization, and the refine gate performs better than the master gate. We additionally consider applying other techniques developed for recurrent models that are independent of the gating mechanism. Table 2 also reports scores when the same gating mechanisms are applied to the GRU model instead of the LSTM, where similar trends hold across the gating variants. In particular, UR-GRU is the only method that is able to stably attain good performance. As another example, the addition of a generic regularization technique (we chose Zoneout (Krueger et al., 2016) with default hyperparameters) continued improving the UR-LSTM/GRU, outperforming even non-recurrent models on sequential MNIST and CIFAR-10. Table 3 compares the test accuracy of our main model against other models.
| Model | sMNIST | pMNIST | sCIFAR-10 |
|---|---|---|---|
| Dilated GRU (Chang et al., 2017) | 99.0 | 94.6 | - |
| IndRNN (Li et al., 2018a) | 99.0 | 96.0 | - |
| r-LSTM (2-Layer with Auxiliary Loss) (Trinh et al., 2018) | 98.4 | 95.2 | 72.2 |
| Transformer (Trinh et al., 2018) | 98.9 | 97.9 | 62.2 |
| Temporal convolution network (Bai et al., 2018a) | 99.0 | 97.2 | - |
| TrellisNet (Bai et al., 2018b) | 99.20 | 98.13 | 73.42 |
| UR-LSTM + Zoneout (Krueger et al., 2016) | 99.21 | 97.58 | 74.34 |
| UR-GRU + Zoneout | 99.27 | 96.51 | 74.4 |
From Sections 3.1 and 3.2, we draw a few conclusions about the comparative performance of different gate modifications. First, the refine gate is consistently better than comparable master gates. CI solves the synthetic memory tasks but is worse than any other variant outside of those. We find ordered (cumax) gates to be effective, but speed issues prevent us from using them in more complicated tasks. UR- gates are consistently among the best performing and most stable.
We consider word-level language modeling on the WikiText-103 dataset, where (i) the dependency lengths are much shorter than in the synthetic tasks, (ii) language has an implicit hierarchical structure and timescales of varying lengths. We evaluate our gate modifications against the exact hyperparameters of a state-of-the-art LSTM-based baseline (Rae et al., 2018) without additional tuning (Appendix D). Additionally, we compare against OM-LSTM, which was designed for this domain (Shen et al., 2018), and chrono initialization, which addresses dependencies of a particular timescale as opposed to timescale-agnostic UGI methods. In addition to our default hyperparameter-free initialization, we tested models with the chrono hyperparameter manually set to and , values previously used for language modeling and meant to mimic fixed biases of about and respectively (Tallec and Ollivier, 2018).
Table 3(a) shows validation and test set perplexities for various models. We find that the OM-LSTM, U-LSTM, and UR-LSTM all robustly improve over the standard LSTM with no additional tuning. However, although the OM-LSTM was designed to capture the hierarchical nature of language with the cumax activation, it does not perform better than the U-LSTM and UR-LSTM. Chrono initialization under our default strategy for setting its timescale parameter produces biases that are far too large. While manually tweaking the hyperparameter helps, it is still far from any UGI-based methods. We attribute these observations to the nature of language having dependencies on multiple widely-varying timescales, and to UGI being enough to capture these without resorting to strictly enforced hierarchies such as in the OM-LSTM.
In many partially observable reinforcement learning (RL) tasks, the agent can observe only part of the environment at a time and thus requires a memory model to summarize what it has seen previously. However, designing memory architectures for reinforcement learning problems has been a challenging task (Oh et al., 2016; Wayne et al., 2018). These are usually based on an LSTM core to summarize what an agent has seen into a state.
We investigated if changing the gates of these recurrent cores can improve the performance of RL agents, especially on difficult tasks involving memory and long-term credit assignment. We chose the Passive and Active Image Match tasks from Hung et al. (2018) using A3C agents (Mnih et al., 2016). In these tasks, agents are either initially shown a colored indicator (Passive) or must search for it (Active), before being teleported to a room in which they must press a button with matching color to receive reward. In between these two phases is an intermediate phase where they can acquire distractor rewards, but the true objective reported is the final reward in the last phase. These tasks require memorization and credit assignment across long sequences (episodes are steps in length).
Hung et al. (2018) evaluated agents with different types of recurrent cores: the basic LSTM, the DNC (an LSTM with memory), and the RMA (which also uses an LSTM core). We modified each of these with our gates. Figure 6 shows the results of different models on the Passive Match and Active Match tasks without distractors. These tasks are the most similar to the synthetic tasks of Section 3.1, and we found that those trends largely transferred to the RL setting even with several additional confounders present, such as agents learning via RL algorithms, being required to learn relevant features from pixels rather than being given the relevant tokens, and being required to explore in the Active Match case.
We found that the UR- gates substantially improved the performance of the basic LSTM core on both Passive Match and Active Match tasks, with or without distractor rewards. On the difficult Active Match task, it was the only method to achieve better than random behavior.
Figure 7 shows performance of LSTM and RMA cores on the harder Active Match task with distractors. Here the UR- gates again learn the fastest and reach the highest reward. In particular, although the RMA is a memory architecture with an explicit memory bank designed for long-term credit assignment, its performance was also improved.
Appendix (E.1) shows an additional synthetic experiment investigating the effect of refine gates on saturation. Appendix (E.3) has results on a program execution task, which is interesting for having explicit long and variable-length dependencies and hierarchical structure. It additionally shows another very different gated recurrent model where the UR- gates show consistent improvement.
Finally, we would like to comment on the longevity of the LSTM, which, for example, was frequently found to outperform newer competitors when tuned better (Melis et al., 2017). Although many improvements have been suggested over the years, none have proven to be as robust as the LSTM across an enormously diverse range of sequence modeling tasks. By experimentally starting from well-tuned LSTM baselines, we believe our simple isolated gate modifications to be genuinely robust improvements. In Appendices B.1 and B.2, we offer a few conclusions for the practitioner about the other gate components considered, based on our experimental experience.
Several methods exist for addressing gate saturation or allowing more binary activations. Gulcehre et al. (2016) proposed to use piece-wise linear functions with noise in order to allow the gates to operate in saturated regimes. Li et al. (2018b) instead use the Gumbel trick (Maddison et al., 2016; Jang et al., 2016), a technique for learning discrete variables within a neural network, to train LSTM models with binary gates. These stochastic approaches can suffer from issues such as gradient estimation bias, unstable training, and limited expressivity from discrete instead of continuous gates. Additionally, they require more involved training protocols with an additional temperature hyperparameter that needs to be tuned explicitly.
Alternatively, gates can be entirely replaced with non-saturating activation functions if strong constraints are imposed on other parts of the model, such as diagonal (Li et al., 2018a), identity (Le et al., 2015), or orthogonal (Arjovsky et al., 2016; Henaff et al., 2016) weight matrices, or purely additive state updates (Chandar et al., 2019). However, although these gate-less techniques can be used to reduce the vanishing gradient problem with RNNs, unbounded activation functions can cause less stable learning dynamics and exploding gradients. Additionally, we reiterate that the gated update (1) is fundamental to temporal dynamical systems (Tallec and Ollivier, 2018).
As mentioned, a particular consequence of the inability of gates to approach extrema is that gated recurrent models struggle to capture very long dependencies. These problems have traditionally been addressed by introducing new components to the basic RNN setup. Some techniques include stacking layers in a hierarchy (Chung et al., 2016), adding skip connections and dilations (Koutnik et al., 2014; Chang et al., 2017), using an external memory (Graves et al., 2014; Weston et al., 2014; Wayne et al., 2018; Gulcehre et al., 2017), auxiliary semi-supervision (Trinh et al., 2018), and more. However, these approaches have not been widely adopted over the standard LSTM as they are often specialized for certain tasks, are not as robust, and introduce additional complexity. Recently, the transformer model has been successful in many application areas such as NLP (Radford et al., 2019; Dai et al., 2019). However, recurrent neural networks are still important and commonly used due to their faster inference without the need to maintain the entire sequence in memory. We emphasize that the vast majority of proposed RNN changes are completely orthogonal to the simple gate improvements in this work, and we do not focus on them. A few other recurrent cores that use the basic gated update (1) but use more sophisticated update functions include the GRU, Reconstructive Memory Agent (RMA; Hung et al., 2018), and Relational Memory Core (RMC; Santoro et al., 2018), which we consider in our experiments.
We finally remark that a significant downside of every approach outlined above is the introduction of additional hyperparameters in the form of constants, training protocol, and/or substantial architectural changes. For example, even for chrono initialization, one of the less intrusive proposals, we experimentally find it to be particularly sensitive to the hyperparameter (Section 3).
In this work, we introduce, analyze, and evaluate several modifications to the ubiquitous gating mechanism that appears in recurrent neural networks. We describe theoretically-justified methods that improve on the standard gating method by alleviating problems with initialization and optimization. The mechanisms considered include changes on independent axes, namely initialization method and auxiliary gates, and we perform extensive ablations on our improvements with previously considered modifications. Our main gate model robustly improves on standard gates across many different tasks and recurrent cores, while requiring less tuning. We point out that the performance on these tasks can perhaps be further improved with dedicated tuning of the new methods, and also that many combinations were left unexplored. Finally, we emphasize that these improvements are completely independent of the large body of research on neural network architectures that use gates, and hope that these insights can be applied to improve machine learning models at large.
Le, Q. V., Jaitly, N., and Hinton, G. E. (2015). A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941.
Maddison, C. J., Mnih, A., and Teh, Y. W. (2016). The concrete distribution: a continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712.
We show how the gated update in a typical LSTM implementation can be easily replaced by UR- gates.
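A minimal sketch of such a replacement follows, written in plain Python for a single time step and elementwise over hidden units. The gate equations are not reproduced in this text, so the refine-gate formula and the sign convention for the shared bias below are assumptions based on the published UR-LSTM formulation. The names `ur_gated_update`, `f_pre`, `r_pre`, and `forget_bias` are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def ur_gated_update(c_prev, candidate, f_pre, r_pre, forget_bias):
    """One UR- gated state update, elementwise over hidden units.

    c_prev:      previous cell state (list of floats)
    candidate:   candidate update, e.g. tanh of a linear map (list of floats)
    f_pre:       forget-gate pre-activations, before the bias
    r_pre:       refine-gate pre-activations, before the bias
    forget_bias: per-unit bias, e.g. from uniform gate initialization
    """
    c_next = []
    for c, u, fp, rp, b in zip(c_prev, candidate, f_pre, r_pre, forget_bias):
        f = sigmoid(fp + b)  # forget gate
        r = sigmoid(rp - b)  # refine gate (assumed: bias enters with opposite sign)
        # Assumed effective gate: r interpolates between f**2 and 1 - (1 - f)**2.
        g = r * (1.0 - (1.0 - f) ** 2) + (1.0 - r) * f ** 2
        # The input gate is tied to the effective forget gate as 1 - g.
        c_next.append(g * c + (1.0 - g) * u)
    return c_next
```

Under this assumed form, a refine gate of 0.5 leaves the forget gate unchanged (g = f), while refine values near 0 or 1 push the effective gate toward f² or 1 − (1 − f)², respectively.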
The chrono initialization (Tallec and Ollivier, 2018) was the first to explicitly attempt to initialize the activation of gates across a distributional range. Its authors also elucidate the benefits of tying the input and forget gates, leading to the simple trick (18) for approximating tying the gates at initialization, which we borrow for UGI. (We remark that perfect tied initialization can be accomplished by fully tying the linear maps of the forget and input gates, but (18) is a good approximation.)
However, the main drawback of CI is that the initialization distribution is too heavily biased toward large terms. This leads to empirical consequences such as difficult tuning (due to most units starting in the saturation regime, requiring different learning rates) and high sensitivity to the hyperparameter that represents the maximum potential length of dependencies. For example, Tallec and Ollivier (2018) set this parameter according to a different protocol for every task, with values ranging from to . Our experiments used a hyperparameter-free method to initialize (Section 3), and we found that chrono initialization generally severely over-emphasizes long-term dependencies if is not carefully controlled.
A different workaround suggested by Tallec and Ollivier (2018) is to sample from and setting . Note that such an initialization would be almost equivalent to sampling the decay period from the distribution with density (since the decay period is ). This parameter-free initialization is thus similar in spirit to the uniform gate initialization (Proposition 2), but from a much heavier-tailed distribution that emphasizes very long-term dependencies.
These interpretations suggest that it is plausible to define a family of Pareto-like distributions from which to draw the initial decay periods, with this distribution treated as a hyperparameter. However, with no additional prior information on the task, we believe the uniform gate initialization to be the best candidate, as it 1. is a simple distribution with easy implementation, 2. has characteristic timescale distributed as an intermediate balance between the heavy-tailed chrono initialization and sharply decaying standard initialization, and 3. is similar to the ON-LSTM’s cumax activation, in particular matching the initialization distribution of the cumax activation.
Table 4 summarizes the decay period distributions at initialization using different activations and initialization strategies.
|Initialization method||Timescale distribution|
|Chrono initialization (known timescale )|
|Chrono initialization (unknown timescale)|
|Uniform gate initialization|
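The timescale distribution implied by uniform gate initialization can be checked numerically with a short simulation. The sketch below assumes that the decay period of a unit with forget activation f is D = 1/(1 − f), a definition that is not preserved in this text. Under that assumption, drawing f uniformly from (0, 1) gives P(D > x) = 1/x for x ≥ 1, which corresponds to a density of 1/x². The function names are hypothetical.

```python
import random


def sample_decay_periods(n, rng):
    """Sample n decay periods D = 1 / (1 - f) with f ~ U[0, 1).

    The definition of D is an assumption made for this sketch.
    rng.random() lies in [0, 1), so the denominator is never zero.
    """
    return [1.0 / (1.0 - rng.random()) for _ in range(n)]


def tail_fraction(samples, x):
    """Empirical estimate of P(D > x)."""
    return sum(1 for d in samples if d > x) / len(samples)


rng = random.Random(0)
periods = sample_decay_periods(100_000, rng)
for x in (2, 10, 100):
    # Compare the empirical tail with the predicted value 1 / x.
    print(x, tail_fraction(periods, x), 1.0 / x)
```

The polynomial tail sits between the two cases described in the text: it decays faster than the heavy-tailed chrono initialization, but far more slowly than the sharply decaying standard initialization.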
In general, our experimental recommendation for CI is that it can be better than standard initialization or UGI in narrow regimes (tasks with long dependencies and nearly fixed-length sequences as in Sections 3.1, 3.4) and/or when it can be explicitly tuned (both the hyperparameter , as well as the learning rate to compensate for almost all units starting in saturation). Otherwise, we recommend UGI or standard initialization. We found no scenarios where it outperformed UR- gates.
In this section we elaborate on the connection between the mechanism of Shen et al. (2018) and our methods. We define the full ON-LSTM and show how its gating mechanisms can be improved. For example, there is a remarkable connection between its master gates and our refine gates – independently of the derivation of refine gates in Section 2.3, we show how a specific way of fixing the normalization of master gates becomes equivalent to a single refine gate.
First, we formally define the full ON-LSTM. The master gates are cumax-activation gates
These combine with an independent pair of forget and input gates, meant to control fine-grained behavior, to create effective forget and input gates, which are used to update the state (equation (1) or (5)).
As mentioned in Section 4, this model modifies the standard forget/input gates in two main ways, namely ordering the gates via the cumax activation, and supplying an auxiliary set of gates controlling fine-grained behavior. Both of these are important novelties and together allow recurrent models to better capture tree structures.
However, the UGI and refine gate can be viewed as improvements over each of these, respectively, demonstrated both theoretically (below) and empirically (Sections 3 and E.3), even on tasks involving hierarchical sequences.
Despite having the same parameter count and asymptotic efficiency as standard sigmoid gates, cumax gates seem noticeably slower and less stable in practice for large hidden sizes. Additionally, using auxiliary master gates creates additional parameters compared to the basic LSTM. Shen et al. (2018) alleviated both of these problems by defining a downsize operation, whereby neurons are grouped in chunks of size , each of which shares the same master gate values. However, this also creates an additional hyperparameter.
The speed and stability issues can be fixed by just using the sigmoid non-linearity instead of cumax. To recover the most important property of the cumax (activations at multiple timescales), the equivalent sigmoid gate can be initialized so as to match the distribution of cumax gates at initialization. This is just uniform gate initialization (equation (9)).
However, we believe that the cumax activation is still valuable in many situations if speed and instability are not issues. These include when the hidden size is small, when extremal gate activations are desired, or when ordering needs to be strictly enforced to induce explicit hierarchical structure. For example, Section 3.1 shows that they can solve hard memory tasks by themselves.
We observe that the magnitudes of master gates are suboptimally normalized. A nice interpretation of gated recurrent models shows that they are a discretization of a continuous differential equation. This leads to the leaky RNN model , where is the update to the model such as . Learning as a function of the current time step leads to the simplest gated recurrent model. (Footnote 6: In the literature, this is called the JANET (van der Westhuizen and Lasenby, 2018), which is also equivalent to the GRU without a reset gate (Chung et al., 2014), or a recurrent highway network with depth (Zilly et al., 2017).)
Tallec and Ollivier (2018) show that this exactly corresponds to the discretization of a differential equation that is invariant to time warpings and time rescalings. In the context of the LSTM, this interpretation requires the values of the forget and input gates to be tied so that . This weight-tying is often enforced, for example in the most popular LSTM variant, the GRU (Cho et al., 2014), or our UR- gates. In a large-scale LSTM architecture search, it was found that removing the input gate was not significantly detrimental (Greff et al., 2016).
However, the ON-LSTM does not satisfy this conventional wisdom that the input and forget gates should sum to close to .
At initialization, the expected value of the average effective forget gate activation is .
Note that the master gates (19), (20) sum in expectation at initialization, as do the original forget and input gates. Looking at individual units in the ordered master gates, we have . Thus the above simplifies to
The gate normalization can be fixed by re-scaling equations (22) and (23). It turns out that tying the master gates and re-scaling is exactly equivalent to the mechanism of a refine gate. In this equivalence, the role of the master and forget gates of the ON-LSTM are played by our forget and refine gate respectively.
Now we have
which has the correct scaling, i.e. at initialization assuming that at initialization.
But (26) can be rewritten as follows:
This is equivalent to the refine gate, where the master gate plays the role of the forget gate and the forget gate plays the role of the refine gate. It can be shown that in this case, the effective input gate (27) is also defined through a refine gate mechanism, where is refined by :
Based on our experimental findings, in general we would recommend the refine gate in place of the master gate.
For clarity, we formally define the gate ablations considered which mix and match different gate components.
We remark that other combinations are possible, for example combining CI with either auxiliary gate type, which would lead to CR- or CM- gates. Alternatively, the master or refine gates could be defined using different activation and initialization strategies. We chose not to consider these methods due to lack of interpretation and theoretical soundness.
This ablation uses the cumax activation to order the forget/input gates and has no auxiliary gates.
We note that one difficulty with this in practice is the reliance on the expensive cumax, and hypothesize that this is perhaps the ON-LSTM’s original motivation for the second set of gates combined with downsizing.
This ablation combines ordered main gates with an auxiliary refine gate.
are used as the effective forget and input gates.
The gradient analysis in Figure 2 was constructed as follows. Let be the forget, refine, and effective gates
Substituting the relation
this reduces to
Given the constraint , this function can be minimized and maximized in terms of to produce the upper and lower bounds in Figure 1(d). This was performed numerically.
To normalize the number of parameters used for models using master gates, i.e. the OM- and UM- gating mechanisms, we used a downsize factor on the main gates (see Section B.2). This was set to for the synthetic and image classification tasks, and for the language modeling and program execution tasks which used larger hidden sizes.
All models consisted of single layer LSTMs with 256 hidden units, trained with the Adam optimizer (Kingma and Ba, 2014) with learning rate 1e-3. Gradients were clipped at .
The training data consisted of randomly generated sequences for every minibatch rather than iterating through a fixed dataset. Each method ran 3 seeds, with the same training data for every method.
Our version of the Copy task is a very minor variant of other versions reported in the literature, with the main difference being that the loss is considered only over the last 10 output tokens which need to be memorized. This normalizes the loss so that losses approaching indicate true progress. In contrast, this task is usually defined with the model being required to output a dummy token at the first steps, meaning it can be hard to evaluate performance since low average losses simply indicate that the model learns to output the dummy token.
For Figure 3, the log loss curves show the median of 3 seeds, and the error bars indicate 60% confidence.
For Figure 4, each histogram represents the distribution of forget gate values of the hidden units (of which there are ). The values are created by averaging units over time and samples, i.e., reducing a minibatch of forget gate activations of shape (batch size, sequence length, hidden size) over the first two dimensions, to produce the average activation value for every unit.
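The reduction described above can be sketched in plain Python as follows. This is an illustration rather than the code used to produce the figure: the activations are assumed to be available as a nested list indexed as (batch, time, unit), and the function name is hypothetical.

```python
def mean_forget_activation_per_unit(activations):
    """Average forget-gate activations over batch and time.

    activations: nested list of shape (batch size, sequence length, hidden size).
    Returns a list of length hidden size with one mean activation per unit.
    """
    hidden_size = len(activations[0][0])
    totals = [0.0] * hidden_size
    count = 0
    for sequence in activations:      # iterate over the batch dimension
        for step in sequence:         # iterate over the time dimension
            for k, value in enumerate(step):
                totals[k] += value
            count += 1
    return [total / count for total in totals]


# Toy minibatch: 2 sequences, 2 time steps, 2 hidden units.
toy = [[[0.0, 1.0], [1.0, 1.0]],
       [[0.5, 1.0], [0.5, 1.0]]]
per_unit_means = mean_forget_activation_per_unit(toy)
```

With an array library, the same reduction is a single mean over the first two axes. The resulting per-unit values are what each histogram bins.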
All models used a single hidden layer recurrent network (LSTM or GRU). Inputs to the model were given in batches as a sequence of shape (sequence length, num channels), (e.g. for CIFAR-10), by flattening the input image left-to-right, top-to-bottom. The outputs of the model of shape (sequence length, hidden size) were processed independently with a single ReLU hidden layer of size before the final fully-connected layer outputting softmax logits. All training was performed with the Adam optimizer, batch size , and gradients clipped at . MNIST was trained for 150 epochs and CIFAR-10 for 100 epochs over the training set.
All models (LSTM and GRU) used hidden state size . Learning rate swept in with three seeds each.
Table 2 reports the highest validation score found. The GRU model swept over learning rates ; all methods were unstable at higher learning rates.
The UR-LSTM and UR-GRU used 1024 hidden units for the sequential and permuted MNIST task, and 2048 hidden units for the sequential CIFAR task. The vanilla LSTM baseline used 512 hidden units for MNIST and 1024 for CIFAR. Larger hidden sizes were found to be unstable.
Hyperparameters are taken from Rae et al. (2018) tuned for the vanilla LSTM, which consist of (chosen parameter bolded out of sweep): LSTM layer, embedding dropout, layer norm, and input/output embedding parameters. Our only divergence is using a hidden size of instead of , which we found improved the performance of the vanilla LSTM. Training was performed with Adam at learning rate 1e-3, gradients clipped to , sequence length , and batch size on TPU. The LSTM state was reset between article boundaries. Figure 5(a) shows smoothed validation perplexity curves showing the 95% confidence intervals over the last 1% of data.
The Active Match and Passive Match tasks were borrowed from Hung et al. (2018) with the same settings. For Figures 6 and 9, the discount factor in the environment was set to . For Figure 7, the discount factor was . Figure 9 corresponds to the full Active Match task in Hung et al. (2018), while Figure 7 is their version with small distractor rewards where the apples in the distractor phase give instead of reward.
Protocol was taken from Santoro et al. (2018) with minor changes to the hyperparameter search. All models were trained with the Adam optimizer, the Mix curriculum strategy from Zaremba and Sutskever (2014), and batch size .
RMC: The RMC models used a fixed memory slot size of and swept over memories and attention heads for a total memory size of or . They were trained for 2e5 iterations.
LSTM: Instead of two-layer LSTMs with sweeps over skip connections and output concatenation, single-layer LSTMs of size or were used. Learning rate was swept in {5e-4, 1e-3}, and models were trained for 5e5 iterations. Note that training was still faster than the RMC models despite the greater number of iterations.
The inverse sigmoid function (9) can be unstable if the input is too close to . Uniform gate initialization was instead implemented by sampling from the distribution instead of , where is the hidden size, to avoid any potential numerical edge cases. This choice is justified by the fact that with perfect uniform sampling, the expected smallest and largest samples would be and .
For distributional initialization strategies, a trainable bias vector was sampled independently from the chosen distribution (i.e. equation (17) or (9)) and added/subtracted to the forget and input gate ((2)-(3)) before the non-linearity. Additionally, each linear model such as had its own trainable bias vector, effectively doubling the learning rate on the pre-activation bias terms on the forget and input gates. This was an artifact of implementation and not intended to affect performance.
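The two implementation notes above can be sketched as follows. The sampling bounds are missing from this text, so the interval [1/d, 1 − 1/d] for hidden size d is an assumption. The helper name `uniform_gate_init` is hypothetical, and the surrounding model code is omitted.

```python
import math
import random


def uniform_gate_init(hidden_size, rng=None):
    """Sample gate biases so that sigmoid(bias) is roughly uniform on (0, 1).

    Sampling is clipped to [1/d, 1 - 1/d] (assumed bounds) so that the
    inverse sigmoid never receives an input too close to 0 or 1.
    """
    if hidden_size < 2:
        raise ValueError("hidden_size must be at least 2")
    rng = rng or random.Random()
    d = hidden_size
    biases = []
    for _ in range(d):
        u = rng.uniform(1.0 / d, 1.0 - 1.0 / d)
        biases.append(math.log(u / (1.0 - u)))  # inverse sigmoid (logit)
    return biases


# The sampled vector is added to the forget gate and subtracted from the
# input gate before the non-linearity, approximately tying the two gates.
bias = uniform_gate_init(256, random.Random(0))
forget_bias = bias
input_bias = [-b for b in bias]
```

Clipping the interval keeps every initial bias finite while still spreading the initial activations across almost the whole unit interval.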
The refine gate update equation (12) can instead be implemented as
In an effort to standardize the permutation used in the Permuted MNIST benchmark, we use a particular deterministic permutation rather than a random one. After flattening the input image into a one-dimensional sequence, we apply the bit reversal permutation. This permutation sends the index i to the index j such that j’s binary representation is the reverse of i’s binary representation. The intuition is that if two indices are close, they must differ in their lower-order bits. Then the bit-reversed indices will be far apart. Therefore the bit-reversal permutation destroys spatial and temporal locality, which is desirable for these sequence classification tasks meant to test long-range dependencies rather than local structure.
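A minimal sketch of the bit reversal permutation is given below. The text does not say how sequence lengths that are not a power of two are handled (a flattened 28×28 image has 784 entries), so this sketch assumes one common convention: enumerate the bit-reversed indices for the next power of two and discard those that fall outside the sequence. The function name is hypothetical.

```python
def bit_reversal_permutation(n):
    """Return a permutation of range(n) in bit-reversed order.

    For n not a power of two, indices are reversed using the bit width of
    the next power of two and out-of-range results are dropped (an assumed
    convention, not taken from the text).
    """
    n_bits = max(1, (n - 1).bit_length())

    def reverse_bits(i):
        # Reverse the fixed-width binary representation of i.
        return int(format(i, "0{}b".format(n_bits))[::-1], 2)

    reversed_indices = (reverse_bits(i) for i in range(2 ** n_bits))
    return [j for j in reversed_indices if j < n]


perm = bit_reversal_permutation(8)
print(perm)  # [0, 4, 2, 6, 1, 5, 3, 7]

# Applying the permutation to a flattened image given as a list of pixels:
pixels = list(range(784))
permuted = [pixels[j] for j in bit_reversal_permutation(len(pixels))]
```

In the length-8 example, neighboring input positions 0 and 1 map to indices 0 and 4, which illustrates how nearby indices end up far apart.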
Figure 4 illustrates how the refine gate significantly helps gates learn extreme activations, which is normally difficult because of gate saturation. Here we empirically demonstrate the opposite phenomenon: the refine gate also helps when gates are too extreme and need to be regressed (perhaps due to needing to “unlearn” previous weights after distributional input shift, or simply due to initialization, such as with chrono initialization or uniform gate initialization).
For this experiment, we initialize the biases of the gates extremely high (effective forget activation ). We then consider the Adding task (Section 3.1) of length 500, with hidden size 64 and learning rate 1e-4. The R-LSTM is able to solve the task, while the LSTM is stuck after 1e4 iterations.
Figures 6 and 7 evaluated our gating methods with the LSTM and RMA models on the Passive Match and Active Match tasks, with and without distractors. We additionally ran the agents on an even harder version of the Active Match task with larger distractor rewards (the full Active Match from Hung et al. (2018)). Learning curves are shown in Figure 9. Similarly to the other results, the UR- gated core is noticeably better than the others. For the DNC model, it is the only one that performs better than random chance.
The Learning to Execute (Zaremba and Sutskever, 2014) dataset consists of algorithmic snippets from a programming language of pseudo-code. An input is a program from this language presented one character at a time, and the target output is a numeric sequence of characters representing the execution output of the program. There are three categories of tasks: Addition, Control, and Program, with distinctive types of input programs. We use the most difficult setting from Zaremba and Sutskever (2014), which uses the parameters nesting=4, length=9, referring to the nesting depth of control structure and base length of numeric literals, respectively. Examples of input programs are shown in previous works (Zaremba and Sutskever, 2014; Santoro et al., 2018).
We are interested in this task for several reasons. First, we are interested in comparing against the C- and OM- gate methods, because
The maximum sequence length is fairly long (several hundred tokens), meaning our heuristic for C- gates is within the right order of magnitude of dependency lengths.
The task has highly variable sequence lengths, wherein the standard training procedure randomly samples inputs of varying lengths (called the "Mix" curriculum in Zaremba and Sutskever (2014)). Additionally, the Control and Program tasks contain complex control flow and nested structure. They are thus a measure of a sequence model’s ability to model dependencies of differing lengths, as well as hierarchical information. Thus we are interested in comparing the effects of UGI methods, as well as the full OM- gates which are designed for hierarchical structures (Shen et al., 2018).
Finally, this task has prior work using a different type of recurrent core, the Relational Memory Core (RMC), that we also use as a baseline to evaluate our gates on different models (Santoro et al., 2018). Both the LSTM and RMC were found to outperform other recurrent baselines such as the Differentiable Neural Computer (DNC) and EntNet.
Training curves are shown in Figure 10, which plots the median accuracy with conf