Dataset of hyperparameters and final losses (train, validation, eval) accompanying the ICLR 2017 paper, "Capacity and Trainability in Recurrent Neural Networks".
Two potential bottlenecks on the expressiveness of recurrent neural networks (RNNs) are their ability to store information about the task in their parameters, and to store information about the input history in their units. We show experimentally that all common RNN architectures achieve nearly the same per-task and per-unit capacity bounds with careful training, for a variety of tasks and stacking depths. They can store an amount of task information which is linear in the number of parameters, and is approximately 5 bits per parameter. They can additionally store approximately one real number from their input history per hidden unit. We further find that for several tasks it is the per-task parameter capacity bound that determines performance. These results suggest that many previous results comparing RNN architectures are driven primarily by differences in training effectiveness, rather than differences in capacity. Supporting this observation, we compare training difficulty for several architectures, and show that vanilla RNNs are far more difficult to train, yet have slightly higher capacity. Finally, we propose two novel RNN architectures, one of which is easier to train than the LSTM or GRU for deeply stacked architectures.
RNNs have become the central component for some very successful model classes and application domains in deep learning (speech recognition (Amodei et al., 2015), seq2seq (Sutskever et al., 2014; Bahdanau et al., 2014), the DRAW model (Gregor et al., 2015), educational applications (Piech et al., 2015), and scientific discovery (Mante et al., 2013)). Despite these recent successes, it is widely acknowledged that designing and training the RNN components in complex models can be extremely tricky. Painfully acquired RNN expertise is still crucial to the success of most projects.
One of the main strategies involved in the deployment of RNN models is the use of Long Short Term Memory (LSTM) networks (Hochreiter & Schmidhuber, 1997), and more recently the Gated Recurrent Unit (GRU) proposed by Cho et al. (2014); Chung et al. (2014) (we refer to these as gated architectures). The resulting models are perceived as being more easily trained, and achieving lower error. While it is widely appreciated that RNNs are universal approximators (Doya, 1993), an unresolved question is the degree to which gated models are more computationally powerful in practice, as opposed to simply being easier to train.
Here we provide evidence that the observed superiority of gated models over vanilla RNN models is almost exclusively driven by trainability. First we describe two types of capacity bottlenecks that various RNN architectures might be expected to suffer from: parameter efficiency related to learning the task, and the ability to remember input history. Next, we describe our experimental setup where we disentangle the effects of these two bottlenecks, including training with extremely thorough hyperparameter (HP) optimization. Finally, we describe our capacity experiment results (per-parameter and per-unit), as well as the results of trainability experiments (training on extremely hard tasks where gated models might reasonably be expected to perform better).
There are several potential bottlenecks for RNNs, for example: How much information about the task can they store in their parameters? How much information about the input history can they store in their units? These first two bottlenecks can both be seen as memory capacities (one for the task, one for the inputs), for different types of memory.
Another, different kind of capacity stems from the set of computational primitives an RNN is able to perform. For example, maybe one wants to multiply two numbers. In terms of number of units and time steps, this task may be very straight-forward using some specific computational primitives and dynamics, but with others it may be extremely resource heavy. One might expect that differences in computational capacity due to different computational primitives would play a large role in performance. However, despite the fact that the gated architectures are outfitted with a multiplicative primitive between hidden units, while the vanilla RNN is not, we found no evidence of a computational bottleneck in our experiments. We therefore will focus only on the per-parameter capacity of an RNN to learn about its task during training, and on the per-unit memory capacity of an RNN to remember its inputs.
RNNs have many HPs, such as the scalings of matrices and biases, and the functional form of certain nonlinearities. There are additionally many HPs involved in training, such as the choice of optimizer, and the learning rate schedule. In order to train our models we employed a HP tuner that uses a Gaussian Process model similar to Spearmint (see Appendix, section on HP tuning and Desautels et al. (2014); Snoek et al. (2012) for related work). The basic idea is that one requests HP values from the tuner, runs the optimization to completion using those values, and then returns the validation loss. This loss is then used by the tuner, in combination with previously reported losses, to choose new HP values such that over many experiments, the validation loss is minimized with respect to the HPs. For our experiments, we report the evaluation loss (separate from the validation loss returned to the HP optimizer, except where otherwise noted) after the HP tuner has highly optimized the task (hundreds to many thousands of experiments for each architecture and task).
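The request, train, report loop described above can be sketched as follows. This is a minimal illustration only: random search stands in for the Gaussian Process tuner, `train_and_validate` is a toy quadratic objective rather than a real training run, and all function names and HP ranges are hypothetical.

```python
import math
import random


def suggest_hps(rng, history):
    """Hypothetical stand-in for the tuner: proposes HPs by random search.

    A GP-based tuner would use `history` (past HPs and losses) to pick
    promising values; this sketch deliberately ignores it.
    """
    return {
        "learning_rate": 10 ** rng.uniform(-4, -1),  # assumed range
        "l2_decay": 10 ** rng.uniform(-6, -2),       # assumed range
    }


def train_and_validate(hps):
    """Toy objective standing in for a full training run.

    Returns a synthetic 'validation loss' that is smallest near
    learning_rate = 1e-2 and l2_decay = 1e-4.
    """
    return ((math.log10(hps["learning_rate"]) + 2) ** 2
            + (math.log10(hps["l2_decay"]) + 4) ** 2)


def tune(num_trials, seed=0):
    """Run the loop and return the (hps, validation_loss) pair with lowest loss."""
    rng = random.Random(seed)
    history = []
    for _ in range(num_trials):
        hps = suggest_hps(rng, history)      # 1. request HP values
        loss = train_and_validate(hps)       # 2. train to completion
        history.append((hps, loss))          # 3. report validation loss
    return min(history, key=lambda pair: pair[1])
```

In a real setup the reported loss would feed back into the next suggestion; the evaluation loss would be computed separately on held-out data once tuning has finished.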
In our studies we used a variety of well-known RNN architectures: standard RNNs such as the vanilla RNN and the newer IRNN (Le et al., 2015), as well as gated RNN architectures such as the GRU and LSTM. We rounded out our set of models with two novel (to our knowledge) RNN architectures (see Section 1.4), which we call the Update Gate RNN (UGRNN) and the Intersection RNN (+RNN). The UGRNN is a ‘minimally gated’ RNN architecture that has only a coupled gate between the recurrent hidden state and the update to the hidden state. The +RNN uses coupled gates to gate both the recurrent and depth dimensions in a straightforward way.
To further explore the various strengths and weaknesses of each RNN architecture, we also used a variety of network depths (1, 2, 4, and 8) in our experiments. (Not all experiments used a depth of 8, due to limits on computational resources.) In most experiments, we held the number of parameters fixed across different architectures and different depths. More precisely, for a given experiment, a maximum number of parameters was set, along with an input and output dimension. The number of hidden units per layer was then chosen such that the number of parameters, summed across all layers of the network, was as large as possible without exceeding the allowed maximum.
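A rough sketch of this sizing rule for a stacked vanilla RNN is given below. The parameter-count formula is an assumption made for illustration (one recurrent matrix, one input matrix and one bias vector per layer, plus a linear readout); it ignores learned initial conditions and is not taken from the paper.

```python
def stacked_rnn_param_count(n_hidden, n_in, n_out, depth):
    """Parameter count for a stacked vanilla RNN under simple assumptions."""
    total = 0
    layer_in = n_in
    for _ in range(depth):
        # recurrent weights + input weights + biases for this layer
        total += n_hidden * (n_hidden + layer_in + 1)
        layer_in = n_hidden  # deeper layers read the layer below
    total += n_out * (n_hidden + 1)  # linear readout weights and biases
    return total


def max_hidden_units(max_params, n_in, n_out, depth):
    """Largest number of units per layer that stays within the budget."""
    n_hidden = 0
    while stacked_rnn_param_count(n_hidden + 1, n_in, n_out, depth) <= max_params:
        n_hidden += 1
    return n_hidden
```

Under this counting, deeper networks receive fewer units per layer for the same budget, which is why comparisons at fixed parameter count and at fixed unit count can differ.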
For each of our 6 tasks, 6 RNN variants, 4 depths, and 6+ model sizes, we ran the HP tuner in order to optimize the relevant loss function. Typically this resulted in many hundreds to several thousands of HP evaluations, each of which was a full training run up to millions of training steps. Taken together, this amounted to CPU-millennia worth of computation.
While it is well known that RNNs are universal approximators of arbitrary dynamical systems (Doya, 1993), there is little theoretical work on the task-capacity of RNNs. Koiran & Sontag (1998) studied the VC dimension of RNNs, which provides an upper bound on their task-capacity (defined in Section 2.1). These upper bounds are not a close match to our experimental results. For instance, we find that performance saturates rapidly in terms of the number of unrolling steps (Figure 2b), while the relevant bound increases linearly with the number of unrolling steps. "Unrolling" refers to recurrent computation through time.
Empirically, Karpathy et al. (2015) have studied how LSTMs encode information in character-based text modeling tasks. Further, Sussillo & Barak (2013) have reverse-engineered the vanilla RNN trained on simple tasks, using the tools and language of nonlinear dynamical systems theory. In Foerster et al. (2016) the behavior of switched affine recurrent networks is carefully examined.
The ability of RNNs to store information about their input has been better studied, in both the context of machine learning and theoretical neuroscience. Previous work on short term memory traces explores the tradeoffs between memory fidelity and duration, for the case that a new input is presented to the RNN at every time step (Jaeger & Haas, 2004; Maass et al., 2002; White et al., 2004; Ganguli et al., 2008; Charles et al., 2014). We use a simpler capacity measure consisting only of the ability of an RNN to store a single input vector. Our results suggest that, contrary to common belief, the capacity of RNNs to remember their input history is not a practical limiting factor on their performance.
The precise details of what makes an RNN architecture perform well is an extremely active research field (e.g. Jozefowicz et al. (2015)). A highly related article is Greff et al. (2015), in which the authors used random search of HPs, along with systematic removal of pieces of the LSTM architecture, to determine which pieces of the LSTM were more important than the others. Our UGRNN architecture is directly inspired by the large impact of removing the forget gate from the LSTM (Gers et al., 1999). Zhou et al. (2016) introduced an architecture with minimal gating that is similar to the UGRNN, but is directly inspired by the GRU. An in-depth comparison between RNNs and GRUs in the context of end-to-end speech recognition and a limited computational budget was conducted in Amodei et al. (2015). Further, ideas from RNN architectures that improve ease of training, such as forget gates (Gers et al., 1999), and copying recurrent state from one time step to another, are making their way into deep feed-forward networks as highway networks (Srivastava et al., 2015) and residual networks (He et al., 2015), respectively. Indeed, the +RNN was inspired in part by the coupled depth gate of Srivastava et al. (2015).
Below we briefly define the RNN architectures used in this study. Unless otherwise stated, $W$ denotes a matrix and $b$ denotes a vector of biases. The symbol $x_t$ is the input at time $t$, and $h_t$ is the hidden state at time $t$. Remaining vector variables represent intermediate values. The function $\sigma(\cdot)$ denotes the logistic sigmoid function and $s(\cdot)$ is either $\tanh$ or ReLU, set as a HP (see Appendix, Section RNN HPs for the complete list of HPs). Initial conditions for the networks were set to a learned bias. Finally, it is a well-known trick of the trade to initialize the gates of an LSTM or GRU with a large bias to induce better gradient flow. We included this gate bias as a parameter and tuned it along with all other HPs.
Note the IRNN is identical in structure to the vanilla RNN, but with an identity initialization for the recurrent weight matrix, zero initialization for the biases, and ReLU as the only nonlinearity.
Based on Greff et al. (2015), where they noticed the forget gate “was crucial” to LSTM performance, we tried an RNN variant where we began with a vanilla RNN and added a single gate. This gate determines whether the hidden state is carried over from the previous time step, or updated – hence, it is an update gate. An alternative way to view the UGRNN is a highway layer gated through time (Srivastava et al., 2015).
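The update-gate idea can be sketched in a few lines of plain Python. This is an illustrative reading of the description above, not the authors' code: the parameter names (`W_ch`, `W_gx`, and so on) are hypothetical, and the coupled form `g * h_prev + (1 - g) * c` is assumed from the statement that the gate chooses between carrying the state over and updating it.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W_h, W_x, b, h_prev, x):
    """Compute W_h @ h_prev + W_x @ x + b using plain Python lists."""
    return [
        sum(w * h for w, h in zip(row_h, h_prev))
        + sum(w * v for w, v in zip(row_x, x))
        + bias
        for row_h, row_x, bias in zip(W_h, W_x, b)
    ]


def ugrnn_step(params, h_prev, x, s=math.tanh):
    """One UGRNN time step: a vanilla RNN candidate plus a single update gate."""
    # Candidate state, exactly as a vanilla RNN would compute it.
    c = [s(z) for z in affine(params["W_ch"], params["W_cx"], params["b_c"], h_prev, x)]
    # Update gate in (0, 1), one value per hidden unit.
    g = [sigmoid(z) for z in affine(params["W_gh"], params["W_gx"], params["b_g"], h_prev, x)]
    # g near 1 carries the old state over; g near 0 replaces it with the candidate.
    return [gi * hi + (1.0 - gi) * ci for gi, hi, ci in zip(g, h_prev, c)]
```

With a large positive gate bias the state is copied almost unchanged from one step to the next, which is the mechanism usually credited with easing gradient flow through time.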
Due to the success of the UGRNN for shallower architectures in this study (see later figures on trainability), as well as some of the observed trainability problems for both the LSTM and GRU for deeper architectures (e.g. Figure 4h), we developed the Intersection RNN (denoted with a ‘+’) architecture with a coupled depth gate in addition to a coupled recurrent gate. Additional influences for this architecture were the recurrent gating of the LSTM and GRU, and the depth gating from the highway network (Srivastava et al., 2015). This architecture has recurrent input, $h_{t-1}$, and depth input, $x_t$. It also has recurrent output, $h_t$, and depth output, $y_t$. Note that this architecture only applies between layers where $x_t$ and $y_t$ have the same dimension, and is not appropriate for networks with a depth of 1 (we exclude depth one +RNNs in our experiments).
In practice we used ReLU for $s_1$ and $\tanh$ for $s_2$.
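A minimal sketch of one +RNN step follows, written from the verbal description above. It assumes ReLU for the depth candidate and tanh for the recurrent candidate, and that each coupled gate interpolates between its input and a freshly computed candidate; the parameter naming scheme is hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    return max(0.0, z)


def affine(p, name, h_prev, x):
    """Compute W_{name}h @ h_prev + W_{name}x @ x + b_{name} with plain lists."""
    W_h, W_x, b = p["W_" + name + "h"], p["W_" + name + "x"], p["b_" + name]
    return [
        sum(w * h for w, h in zip(row_h, h_prev))
        + sum(w * v for w, v in zip(row_x, x))
        + bias
        for row_h, row_x, bias in zip(W_h, W_x, b)
    ]


def intersection_rnn_step(p, h_prev, x):
    """One +RNN step: returns (depth output y, recurrent output h)."""
    y_in = [relu(z) for z in affine(p, "y", h_prev, x)]       # depth candidate
    h_in = [math.tanh(z) for z in affine(p, "h", h_prev, x)]  # recurrent candidate
    g_y = [sigmoid(z) for z in affine(p, "gy", h_prev, x)]    # depth gate
    g_h = [sigmoid(z) for z in affine(p, "gh", h_prev, x)]    # recurrent gate
    # Depth gate: pass the layer input straight up, or emit the candidate.
    y = [g * xi + (1.0 - g) * c for g, xi, c in zip(g_y, x, y_in)]
    # Recurrent gate: carry the old state, or move to the candidate.
    h = [g * hi + (1.0 - g) * c for g, hi, c in zip(g_h, h_prev, h_in)]
    return y, h
```

Because `y` mixes `x` with `y_in` elementwise, the layer input and depth output must share a dimension, which matches the restriction stated above.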
A foundational result in machine learning is that a single-layer perceptron with $N^2$ parameters can store at least 2 bits of information per parameter (Cover, 1965; Gardner, 1988; Baldi & Venkatesh, 1987). More precisely, a perceptron can implement a mapping from $2N$, $N$-dimensional, input vectors to arbitrary $N$-dimensional binary output vectors, subject only to the extremely weak restriction that the input vectors be in general position. RNNs provide a far more complex input-output mapping, with hidden units, recurrent dynamics, and a diversity of nonlinearities. Nonetheless, we wondered if there were analogous capacity results for RNNs that we might be able to observe empirically.
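As a sanity check on the 2 bits per parameter figure, assume (for illustration) a perceptron with an $N \times N$ weight matrix that memorizes $2N$ input patterns, each with an arbitrary $N$-bit binary output. The stored information divided by the parameter count is then:

```latex
\frac{\overbrace{2N}^{\text{patterns}} \times \overbrace{N}^{\text{bits per pattern}}}
     {\underbrace{N^2}_{\text{parameters}}}
= 2 \ \text{bits per parameter}
```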
As we will show in Section 3, tasks with complex temporal dynamics, such as language modeling, exhibit a per-parameter capacity bottleneck that explains the performance of RNNs far better than a per-unit bottleneck. To make the experimental design as simple as possible, and to remove potential confounds stemming from the choice of temporal dynamics, we study per-parameter capacity using a task inspired by Gardner (1988). Specifically, to measure how much task-related information can be stored in the parameters of an RNN, we use a memorization task, where a random static input is injected into an RNN, and a random static output is read out some number of time steps later. We emphasize that the same per-parameter bottleneck that we find in this simplified task also arises in more temporally complex tasks, such as language modeling.
At a high level, we draw a fixed set of random inputs and random labels, and train the RNN to map random inputs to randomly chosen labels via cross-entropy error. However, rather than returning the cross-entropy error to the HP tuner (as is normally done), we instead return the mutual information between the RNN outputs and the true labels. In this way, we can treat the number of input-output mappings as a HP, and the tuner will select for us the correct number of mappings so as to maximize the mutual information between the RNN outputs and the labels. From this mutual information we compute bits per parameter, which provides a normalized measurement of how much the RNN learned about the task.
More precisely, we draw datasets of binary inputs and target binary labels uniformly at random from the set of all binary datasets, $X \in \{0,1\}^{n_{in} \times b}$, $y \in \{0,1\}^{b}$, where $b$ is the number of samples, and $n_{in}$ is the dimensionality of the inputs. The number of samples, $b$, is treated as a HP, and in practice the optimal dataset size is very close to the bits of mutual information between true and predicted labels. This trend is demonstrated in Figure App.1 in the Appendix. For each value of $b$ the RNN is trained to minimize the cross entropy of the network output with the true labels. We write the output of the RNN for all inputs as $\hat{y} = f(X)$, with corresponding random variable $\hat{Y}$. We are interested in the mutual information between the true class labels and the class labels predicted by the RNN. This is the amount of (directly recoverable) information that the RNN has stored about the task. In this setting, it is calculated as
$$I(Y; \hat{Y} \mid X) = H(Y \mid X) - H(Y \mid \hat{Y}, X) = b + b \left( p \log_2 p + (1 - p) \log_2 (1 - p) \right),$$
where $p$ is the fraction of correctly classified samples. The number $b$ is then adjusted, along with all the other HPs, so as to maximize the mutual information $I(Y; \hat{Y} \mid X)$. In practice $p$ is computed using only a single draw of $\{X, y\}$.
We performed this optimization of $I(Y; \hat{Y} \mid X)$ for various RNN architectures, depths, and numbers of parameters. We plot the best value of $I(Y; \hat{Y} \mid X)$ vs. number of parameters in Figure 1a. This captures the amount of information stored in the parameters about the mapping between $X$ and $y$. To get an estimate of bits per parameter, we divide by the number of parameters, as shown in Figure 1e.
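The bits-per-parameter computation can be sketched as below. It assumes uniformly random binary labels, so that the mutual information reduces to the number of samples times one minus the binary entropy of the accuracy; the function names are illustrative and not from the paper.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable, with 0 * log(0) taken as 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def mutual_information_bits(num_samples, frac_correct):
    """Bits of label information recovered when frac_correct labels are right."""
    return num_samples * (1.0 - binary_entropy(frac_correct))


def bits_per_parameter(num_samples, frac_correct, num_params):
    """Normalize the recovered information by the model's parameter count."""
    return mutual_information_bits(num_samples, frac_correct) / num_params
```

For example, with hypothetical numbers, a 100,000-parameter model that perfectly memorizes 500,000 random labels would score 5 bits per parameter, while chance-level accuracy (0.5) scores 0 regardless of dataset size. This is why enlarging the dataset only helps while accuracy stays high.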
Examining the results of Figure 1, we find the capacity of all architectures is roughly linear in the number of parameters, across several orders of magnitude of parameter count. We further find that the capacity is between 3 and 6 bits per parameter, once again across all architectures, depths 1, 2 and 4, and across several orders of magnitude in terms of number of parameters. Given the possibility of small size effects, and a larger portion of weights used as biases at a small number of parameters, we believe our estimates for larger networks are more reliable. This leads us to a bits per parameter estimate of approximately 5, averaging over all architectures and all depths. Finally, we note that the per-parameter task capacity increases as a function of the number of unrollings, though with diminishing gains (Figure 2b).
The finding that our results are consistent across diverse architectures and scales is even more surprising, since prior to these experiments it was not clear that capacity would even scale linearly with the number of parameters. For instance, previous results on model compression – by reducing the number of parameters (Yang et al., 2015), or by reducing the bit depth of parameters (Hubara et al., 2016) – might lead one to predict that different architectures use parameters with vastly different efficiencies, and that task capacity increases only sublinearly with parameter count.
While overall, the different architectures performed very similarly, there are some capacity differences between architectures that appear to hold up across most depths and parameter counts. To quantify these differences we constructed a table showing the change in the number of parameters one would need to switch from one architecture to another, while maintaining equivalent capacity (Figure 1i). One trend that emerged from our capacity experiments is a slightly reduced capacity as a function of "gatedness". Putting aside the IRNN, which performed the worst and is discussed below, we noticed that across all depths and all model sizes, the performance was on average RNN > UGRNN > GRU > LSTM > +RNN. The vanilla RNN has no gates, the UGRNN has one, while the remaining three have two or more.
In our capacity tasks, the IRNN performed noticeably worse than all other architectures, reaching a maximum bits per parameter of roughly 3.5. To determine if this performance drop was due to the ReLU nonlinearity of the IRNN, or its identity initialization, we sorted through the RNN and UGRNN results (which both have ReLU and $\tanh$ as choices for the nonlinearity HP) and looked at the maximum bits per parameter when only optimizations using ReLU are considered. Indeed, both the RNN and UGRNN bits per parameter dropped dramatically to the 3.5 range (Figure 2a) when those architectures exclusively used ReLU, providing strong evidence that the ReLU activation function is problematic for this capacity task.
An additional capacity bottleneck in RNNs is their ability to store information about their inputs over time. It may be plainly obvious that an IRNN, which is essentially an integrator, can achieve perfect memory of its inputs if the number of inputs is less than or equal to the number of hidden units, but it is not so clear for some of the more complex architectures. So we measured the per-unit input memory empirically. Figure 2c shows the intuitive result that every RNN architecture (at every depth and number of parameters) we studied can reconstruct a random $n_{in}$ dimensional input at some time in the future, if and only if the number of hidden units per layer in the network, $n_h$, is greater than or equal to $n_{in}$. Moreover, regardless of RNN architecture, the error in reconstructing the input follows the same curve as a function of the number of hidden units for all RNN variants, corresponding to reconstructing an $n_h$ dimensional subspace of the $n_{in}$ dimensional input.
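One way to picture the shared error curve is the idealized case below. It assumes an isotropic random input (equal variance in every dimension) and a network that perfectly stores as many input dimensions as it has hidden units; both are simplifying assumptions for illustration, not a description of the paper's measurement.

```python
def expected_unexplained_variance(n_hidden, n_in):
    """Fraction of input variance lost when only an n_hidden-dimensional
    subspace of an isotropic n_in-dimensional input can be reconstructed."""
    if n_hidden >= n_in:
        return 0.0  # enough units to hold every input dimension
    return (n_in - n_hidden) / n_in
```

Under these assumptions the error falls linearly as units are added and reaches zero once the number of hidden units matches the input dimension.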
We highlight this per-unit capacity to make the point that a per-parameter task capacity appears to be the limiting factor in our experiments (e.g. Figure 1 and Figure 3), and not a per-unit capacity, such as the per-unit capacity to remember previous inputs. Thus when comparing results between architectures, one should normalize different architectures by the number of parameters, and not the number of units, as is frequently done in the literature (e.g. when comparing vanilla RNNs to LSTMs). This makes further sense as, for all common RNN architectures, the computational cost of processing a single sample is linear in the number of parameters, and quadratic in the number of units per layer. As we show in Figure 3d, plotting the capacity results by numbers of units gives very misleading results.
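To see why the choice of normalization matters, the sketch below counts parameters per layer for several cells. The block counts are assumptions based on the standard formulations (one affine block per candidate state and per gate, LSTM without peepholes); the +RNN is left out because it requires equal input and hidden sizes.

```python
# Affine blocks per layer: one for the candidate state plus one per gate.
BLOCKS = {"RNN": 1, "UGRNN": 2, "GRU": 3, "LSTM": 4}


def layer_params(arch, n_hidden, n_in):
    """Parameters in one layer: each block has recurrent weights,
    input weights, and a bias vector."""
    return BLOCKS[arch] * n_hidden * (n_hidden + n_in + 1)


def units_for_budget(arch, max_params, n_in):
    """Largest hidden size whose single-layer parameter count fits the budget."""
    n_hidden = 0
    while layer_params(arch, n_hidden + 1, n_in) <= max_params:
        n_hidden += 1
    return n_hidden
```

At equal unit counts an LSTM layer has four times the parameters of a vanilla RNN layer under this counting, so a comparison at fixed units hands the LSTM a much larger budget. At equal parameter counts the vanilla RNN instead gets roughly twice as many units when the hidden size dominates the input size.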
We studied additional tasks that we believed to be easy enough to train that the evaluation loss of different architectures would reveal variations in capacity rather than trainability. A critical aspect of these tasks is that they could not be learned perfectly by any of the model sizes in our experiments. As we change model size, we therefore expect performance on the task to also change. The tasks are (see Appendix, section Task Definitions for further elaboration of these tasks):
text8 - 1-step ahead character-based prediction on the text8 Wikipedia dataset (100 million characters) (Mahoney, 2011).
Random Continuous Functions (RCF) - A task similar to the per-parameter capacity task above, except the target outputs are real numbers (not categorical), and the number of training samples is held fixed.
The performance on these two tasks is shown in Figure 3. The evaluation loss as a function of the number of parameters is plotted in panels a-c and e-g, for the text8 task and RCF task, respectively. For all tasks in this section, the number of parameters rather than the number of units provided the bottleneck on performance, and all architectures performed extremely closely for the same number of parameters. By close performance we mean that, for one model to achieve the same loss as another model, the number of parameters would have to be adjusted by only a small factor (exemplified in Figure 1i for the per-parameter capacity task).
In practice it is widely appreciated that there is often a significant gap in performance between, for example, the LSTM and the vanilla RNN, with the LSTM nearly always outperforming the vanilla RNN. Our per-parameter capacity results provide evidence for a rough equivalence among a variety of RNN architectures, with slightly higher capacity in the vanilla RNN (Figure 1). To reconcile our per-parameter capacity results with widely held experience, we provide evidence that gated architectures, such as the LSTM, are far easier to train than the vanilla RNN (and often the IRNN).
We study two tasks that are difficult to learn: parallel parentheses counting of independent input streams, and mathematical addition of integers encoded in a character string (see Appendix, section Task Definitions). The parentheses task is moderately difficult to learn, while the arithmetic task is quite hard. The results of the HP optimizations are shown in Figure 4a-4h for the parentheses task, and in Figure 4i-4p for the arithmetic task. These tasks show that, while it is possible for a vanilla RNN to learn these tasks reasonably well, it is far more difficult than for a gated architecture. Note that the best achieved loss on the arithmetic task is still significantly decreasing, even after 2500 HP evaluations (2500 full optimizations over the training set), for the RNN and IRNN.
There are three noteworthy trends in these trainability experiments. First, across both tasks, and all depths (1, 2, 4 and 8), the RNN and IRNN performed most poorly, and took the longest to learn the task. Note, however that both the RNN and IRNN always solved the tasks eventually, at least for depth 1. Second, as the stacking depth increased, the gated architectures became the only architectures that could solve the tasks. Third, the most trainable architecture for depth 1 was the GRU, and the most trainable architecture for depth 8 was the +RNN (which performed the best on both of our metrics for trainability, on both tasks).
To achieve our results on capacity and trainability, we relied heavily on a HP tuner. Most practitioners do not have the time or resources to make use of such a tuner, typically only adjusting the HPs a few times themselves. So we wondered how the various architectures would perform if we set HPs randomly, within the ranges specified (see Appendix for ranges). We tried this 1000 times on the parentheses task, for all 200k parameter architectures at depths 1 and 8 (Figure 5 and Table 1). The noticeable trends are that the IRNN returned an infeasible error nearly half of the time, and the LSTM (depth 1) and GRU (depth 8) were infeasible least often, where infeasibility means that the training loss diverged. For depth 1, the GRU gave the smallest error and the smallest median error, and for depth 8, the +RNN delivered the smallest error and smallest median error.
| Architecture | % Infeasible (1 layer) | % Infeasible (8 layer) |
|---|---|---|
| GRU | 15.5 % | 3.2 % |
| IRNN | 56.7 % | 44.6 % |
| LSTM | 12.0 % | 4.0 % |
| RNN | 21.5 % | 18.7 % |
| UGRNN | 20.2 % | 11.5 % |
Here we report that a number of RNN variants can hold between 3 and 6 bits per parameter about their task, and that these variants can remember a number of random inputs that is nearly equal to the number of hidden units in the RNN. The quantification of the number of bits per parameter an RNN can store about a task is particularly important, as it was not previously known whether the amount of information about a task that could be stored was even linear in the number of parameters.
While our results point to empirical capacity limits for both task memorization, and input memorization, apparently the requirement to remember features of the input through time is not a practical bottleneck. If it were, then the vanilla RNN and IRNN would perform better than the gated architectures in proportion to the ratio of the number of units, which they do not. Based on widespread results in the literature, and our own results on our difficult tasks, the loss of some memory capacity (and possibly a small amount of per-parameter storage capacity) for improved trainability seems a worthwhile trade off. Indeed, the input memory capacity did not obviously impact any task not explicitly designed to measure it, as the error curves – for instance for the language modeling task – overlapped across architectures for the same number of parameters, but not the same number of units.
Our result on per-parameter task capacity, about 5 bits per parameter averaged over architectures, is in surprising agreement with recently published results on the capacity of synapses in biological neurons. This number was recently calculated to be about 4.7 bits per synapse, based on biological synapses in the hippocampus having roughly 26 measurable discrete sizes (Bartol et al., 2016). Our capacity results have implications for compressed networks that employ quantization techniques. In particular, they provide an estimate of the number of bits to which a weight may be compressed without loss in task performance. Coincidentally, in Han et al. (2015), the authors used 5 bits per weight in the fully connected layers.
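The 4.7 bit figure is consistent with treating each of the roughly 26 distinguishable synapse sizes as an equally likely state, an assumption made here purely for illustration:

```latex
\log_2 26 \approx 4.70 \ \text{bits per synapse}
```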
An additional observation about per-parameter task capacity in our experiments is that it increases for a few time steps beyond one (Figure 2b), and then appears to saturate. We interpret this to suggest that recurrence endows additional capacity to a network with shared parameters, but that there are diminishing returns, and the total capacity remains bounded even as the number of time steps increases.
We also note that performance is nearly constant across RNN architectures if the number of parameters is held fixed. This may motivate the design and use of architectures with small compute per parameter ratios, such as mixture of experts RNNs (Shazeer et al., 2017), and RNNs with large embedding dictionaries on input and output (Józefowicz et al., 2016).
Despite our best efforts, we cannot claim that we perfectly trained any of the models. Potential problems in HP optimization could be local minima, as well as stochastic behavior in the HP optimization as a result of the stochasticity of batching or random draws for weight matrices. We tried to uncover these effects by running the best performing HPs 100 times, and did not observe any serious deviations from the best results (see Table App.1 in Appendix). Another form of validation comes from the fact that in our capacity task, essentially 3 independent experiments (one for each level of depth) yielded a clustering by architecture (Figure 1e).
Do our results yield a framework for choosing a recurrent architecture? In total, we believe yes. As explored in Amodei et al. (2015), a practical concern for recurrent models is speed of execution in a production environment. Our results suggest that if one has a large resource budget for training and a constrained resource budget for inference, one should choose the vanilla RNN. Conversely, if the training resource budget is small, but the inference budget large, one should choose a gated model. Another serious concern relates to task complexity. If the task is easy to learn, a vanilla RNN should yield good results. However, if the task is even moderately difficult to learn, a gated architecture is the right choice. Our results point to the GRU as being the most learnable of gated RNNs for shallow architectures, followed by the UGRNN. The +RNN typically performed best for deeper architectures. Our results on trainability confirm the widely held view that the LSTM is an extremely reliable architecture, but it was almost never the best performer in our experiments. Of course further experiments will be required to fully vet the UGRNN and +RNN. All things considered, in an uncertain training environment, our results suggest using the GRU or +RNN.
We would like to thank Geoffrey Irving, Alex Alemi, Quoc Le, Navdeep Jaitly, and Taco Cohen for helpful feedback.
Cover, T. M. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers, (3):326–334, 1965.
Tieleman, T. and Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4, 2012.
Yang, Z., Moczulski, M., Denil, M., de Freitas, N., Smola, A., Song, L., and Wang, Z. Deep fried convnets. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1476–1483, 2015.
We used a HP tuner that uses a Gaussian Process (GP) Bandits approach for HP optimization. Our setting of the tuner’s internal parameters was such that it uses Batched GP Bandits with an expected improvement acquisition function and a Matern 5/2 Kernel with feature scaling and automatic relevance determination performed by optimizing over kernel HPs. Please see Desautels et al. (2014) and Snoek et al. (2012) for closely related work.
For all our tasks, we requested HPs from the tuner, and reported loss on a validation dataset. For the per-parameter capacity task, the evaluation, validation and training datasets were identical. For text8, the validation and evaluation sets consisted of different sections of held out data. For all other tasks, evaluation, validation, and training sets were randomly drawn from the same distribution. The performance we plot in all cases is on the evaluation dataset.
Below is the list of all tunable HPs that were generically applied to all models. In total, each RNN variant had between 10 and 27 HP dimensions relating to the architecture, optimization, and regularization.
- as used in the following RNN definitions, a nonlinearity determined by the HP tuner, , . The only exception was the IRNN, which used ReLU exclusively.
For any matrix that is inherently square, e.g. a recurrent weight matrix, there were three possible initializations: identity, orthogonal, or a random normal distribution scaled according to the number of recurrent units. The sole exception was the RNN, which was limited to either orthogonal or random normal initializations, to differentiate it from the IRNN. For any matrix that is inherently rectangular, e.g. an input weight matrix, we initialized with a random normal distribution scaled according to the number of inputs.
For all matrix initializations except the identity initialization, there was a multiplicative scalar used to set the scale of the matrix. The scalar was exponentially distributed, with separate ranges for recurrent matrices and for rectangular matrices.
Biases could have two possible distributions: all biases set to a constant value, or drawn from a standard normal distribution.
For all bias initializations, a multiplicative scalar was drawn from a uniform distribution and applied to the bias initialization.
We included a scalar bias HP for architectures that contain forget or update gates, as is commonly employed in practice; it was drawn from a uniform distribution.
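The matrix initialization options listed above could be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, and the `1 / sqrt(n)` scaling of the random normal option is a common convention assumed here, since the exact factor is not given in the text above.

```python
import numpy as np


def init_square(n_units, scheme, scale=1.0, rng=None):
    """Initialize a square (recurrent) matrix with one of three schemes."""
    rng = np.random.default_rng() if rng is None else rng
    if scheme == "identity":
        # The multiplicative scale is not applied to the identity initialization.
        return np.eye(n_units)
    if scheme == "orthogonal":
        q, r = np.linalg.qr(rng.standard_normal((n_units, n_units)))
        # Fix column signs so the result does not depend on the QR sign convention.
        q = q * np.sign(np.diag(r))
        return scale * q
    if scheme == "normal":
        # Assumed scaling: 1 / sqrt(number of recurrent units).
        return scale * rng.standard_normal((n_units, n_units)) / np.sqrt(n_units)
    raise ValueError(f"unknown scheme: {scheme}")


def init_rectangular(n_units, n_inputs, scale=1.0, rng=None):
    """Initialize a rectangular (input) matrix from a scaled random normal."""
    rng = np.random.default_rng() if rng is None else rng
    # Assumed scaling: 1 / sqrt(number of inputs).
    return scale * rng.standard_normal((n_units, n_inputs)) / np.sqrt(n_inputs)
```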
Additionally, the HP tuner was used to optimize HPs associated with learning:
The number of training steps - the exact range varied between tasks, but always fell between 50K and 20M.
Learning rate initial value - exponentially distributed.
Learning rate decay - exponentially distributed. The learning rate decays exponentially by this factor over the number of training steps chosen by the tuner.
Optimizer momentum-like parameter - expressed as a logit, and uniformly distributed.
Gradient clipping value - exponentially distributed.
L2 decay - exponentially distributed.
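The learning-related HPs above can be illustrated with a short sketch. It reads "exponentially distributed" as uniform in log space, which is an interpretation rather than something stated above, and the function names and any ranges passed to them are hypothetical.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def momentum_from_logit(logit):
    """Map a logit-valued HP to a momentum-like parameter in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


def learning_rate(step, total_steps, initial_rate, decay_factor):
    """Exponential decay that reaches initial_rate * decay_factor at total_steps."""
    return initial_rate * decay_factor ** (step / total_steps)
```

Sampling the momentum as a logit spreads the search evenly over values close to 1, where small changes matter most.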
The perceptron capacity task also had associated HPs:
The number of samples in the dataset - between 0.1x and 10x the number of model parameters.
An HP determined whether the input vector was presented to the RNN only at the first time step, or at every time step.
Some optimization algorithms had additional parameters such as ADAM’s second order decay rate, or epsilon parameter. These were set to their default values and not optimized. The batch size was set individually by hand for all experiments. The same seed was used to initialize the random number generator for all task parameters, whereas the generator was randomly seeded for network parameters (e.g. initializations). Note that for each network, the initial condition was set to a learned vector.
At a high level, for the perceptron capacity task we wanted to optimize the amount of information the RNN carried about true random labels; in practice, however, the training objective was standard cross-entropy. When returning a validation loss to the HP tuner, we instead returned the mutual information. Conceptually, this is as if there is one nested optimization inside another. The inner loop optimizes the RNN for a given set of HPs, training on cross-entropy but returning mutual information. The outer loop then chooses the HPs, in particular the number of samples in equation (21), so as to maximize the mutual information. This implementation is necessary because there is no straightforward way to differentiate mutual information with respect to the number of samples. During training, cross-entropy error is evaluated beginning after 5 time steps.
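The nested structure can be sketched as follows. The mutual information formula here assumes balanced, independent binary labels and a classifier summarized by its accuracy; it is an illustrative stand-in and is not claimed to reproduce equation (21). `accuracy_fn` is a hypothetical placeholder for the inner training loop.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def stored_bits(num_samples, accuracy):
    """Bits of label information recovered, assuming balanced binary labels."""
    return num_samples * (1.0 - binary_entropy(accuracy))


def bits_per_parameter(num_samples, accuracy, num_parameters):
    return stored_bits(num_samples, accuracy) / num_parameters


def best_sample_count(candidates, accuracy_fn):
    """Outer loop: pick the dataset size that maximizes stored information.

    accuracy_fn(n) stands in for the inner loop, i.e. training with
    cross-entropy on n samples and reporting the resulting accuracy.
    """
    return max(candidates, key=lambda n: stored_bits(n, accuracy_fn(n)))
```

With too few samples the network memorizes everything but there is little to store; with too many, accuracy falls toward chance and the information per sample vanishes, so the maximum lies in between.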
In the Memory Capacity task, we wanted to know how much information an RNN can reconstruct about its inputs at some later time point. We picked an input dimension, 64, and varied the number of parameters in the networks such that the number of hidden units was roughly centered around 64. After 12 time steps the target of the network was exact reconstruction of the input, with a square error loss. The inputs were random values drawn from a uniform distribution between $-\sqrt{3}$ and $\sqrt{3}$ (corresponding to a variance of 1).
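A zero-mean uniform distribution on $[-a, a]$ has variance $a^2/3$, so unit variance corresponds to $a = \sqrt{3}$. The sketch below generates inputs accordingly; the function names are hypothetical, and averaging the square error over dimensions is an assumption about the loss normalization.

```python
import math
import random

INPUT_DIM = 64   # input dimension stated in the text
DELAY_STEPS = 12  # reconstruction is required after this many time steps


def sample_input(rng, dim=INPUT_DIM):
    """Draw one input vector with zero mean and unit variance per dimension."""
    bound = math.sqrt(3.0)
    return [rng.uniform(-bound, bound) for _ in range(dim)]


def reconstruction_loss(prediction, target):
    """Square error between the delayed output and the original input."""
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)
```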
A dataset was constructed consisting of random unit norm Gaussian input vectors of fixed size. A target scalar output was generated for each input vector, and was also drawn from a unit norm Gaussian. Each sample was assigned a power law weighting, with a normalization constant chosen such that the weightings summed to 1, and with a characteristic time constant. The loss function for training was calculated after 50 time steps and was the weighted square error on the target outputs, with the power law weightings acting as the weighting terms.
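A weighted square error of this kind might look like the sketch below. The specific power law form, its `exponent` parameter, and the function names are assumptions made for illustration; the exact weighting used in the experiments is not recoverable from the text above.

```python
def power_law_weights(num_samples, exponent):
    """Hypothetical power law weighting over sample index, normalized to sum to 1."""
    raw = [(i + 1) ** (-exponent) for i in range(num_samples)]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_square_error(predictions, targets, weights):
    """Square error on scalar targets, with one weight per sample."""
    return sum(w * (p - t) ** 2 for w, p, t in zip(weights, predictions, targets))
```

Because the weights decay with sample index, a network with limited capacity is rewarded for fitting the heavily weighted samples first.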
In the text8 task, the goal was to predict one character ahead in the text8 dataset (1e8 characters of Wikipedia) (Mahoney, 2011). The input was a one-hot encoded sequence, as was the output. The loss was cross-entropy loss on a softmax output layer. Rather than use partial unrolling as is common in language modeling, we generated random pointers into the text. The first 13 time steps were used to initialize the RNN into a normal operating mode, and the remaining steps were used for training or inference.
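The random-pointer sampling and warm-up described above can be sketched as follows. The function names and the `window_length` parameter are hypothetical, and representing the warm-up as a zero/one loss mask is an assumption about how those steps were excluded.

```python
import random

WARMUP_STEPS = 13  # steps used only to bring the RNN into a normal operating mode


def one_hot(index, size):
    """One-hot vector of the given size with a 1 at index."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def sample_window(text, window_length, rng):
    """Pick a random pointer into the text and return (inputs, next-char targets)."""
    start = rng.randrange(0, len(text) - window_length)
    chunk = text[start:start + window_length + 1]
    return chunk[:-1], chunk[1:]


def loss_mask(window_length, warmup=WARMUP_STEPS):
    """Zero weight during warm-up, unit weight on the remaining steps."""
    return [0.0 if t < warmup else 1.0 for t in range(window_length)]
```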
The parentheses counting task independently counts the number of opened ‘parens’, e.g. ‘(’, without the closing ‘)’. Here parens is used to mean any of 10 paren type pairs, e.g. ‘<>’ or ‘[]’. Additionally, there were 10 noise characters, ‘a’ to ‘j’. For each paren type, there was a one-hot encoding of all paren and noise symbols, for a total of 300 inputs. The output for each paren type was a one-hot encoding of the digits 0-9, which represented the count of the opened parens of that type. If the count exceeded 9, the network kept the count at 9; if the paren was closed, the count decreased. The loss was the sum of cross-entropy losses, one for each paren type. Finally, for each paren input stream, 50% random noise characters were drawn, and 50% random paren characters were drawn, e.g. 10 streams like ‘(a<a<bcb>[[[)’. Parens of other types were treated as noise for the current type, e.g. for the above string if the paren type was ‘<>’, the answer is ‘1’ at the end. The loss was defined only at the final time point.
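The target for one paren type can be computed with a short counter, sketched below. The function name is hypothetical, and flooring the count at 0 when a close arrives with nothing open is an assumption, since that case is not described above.

```python
def open_count(stream, open_char, close_char, max_count=9):
    """Count currently opened parens of one type, saturating at max_count."""
    count = 0
    for ch in stream:
        if ch == open_char:
            count = min(count + 1, max_count)  # the count is held at 9
        elif ch == close_char:
            count = max(count - 1, 0)  # assumption: never drops below 0
        # Every other character, including other paren types, is noise.
    return count
```

For the example stream ‘(a<a<bcb>[[[)’, `open_count(stream, "<", ">")` gives 1, matching the answer stated in the text.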
In the arithmetic task, a one-hot encoded character sequence of an addition problem was presented as input to the network, e.g., ‘-343243+93851= ’, and the output was the one-hot encoded answer, including the correct amount of left padded spaces, ‘-249392’. An additional HP for this task was the number of compute steps (1-6) between the input of the ‘=’ and the first non-space character in the target output sequence. The two numbers in the input were selected uniformly at random. After 36 time steps, cross-entropy loss was calculated. We found this task to be extremely difficult for the networks to learn, but when the task was learned, certain of the network architectures could perform the task nearly perfectly.
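A problem and its left-padded answer could be generated along these lines. The function name, the `target_width` parameter, and the choice to represent compute steps as trailing spaces after ‘=’ are assumptions for illustration only.

```python
def make_problem(a, b, target_width, compute_steps=0):
    """Build the input string and left-padded answer for one addition problem."""
    # Assumption: compute steps appear as spaces following the '=' sign.
    question = f"{a}+{b}=" + " " * compute_steps
    answer = str(a + b).rjust(target_width)
    return question, answer
```

For example, `make_problem(-343243, 93851, 7, compute_steps=1)` yields the input ‘-343243+93851= ’ and the answer ‘-249392’ from the text.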
We wondered how robust the HPs are to the variability of both random batching of data and random initialization of parameters. We therefore identified the best HPs from the parentheses experiments for the 100k parameter, 1 layer architectures, and reran the parameter optimization 100 times. We measured the number of infeasible experiments, as well as a number of statistics of the loss for the reruns (Table App.1). These results show that the best HPs yielded a distribution of losses very close to the originally reported loss value.
Results of 100 runs on the parentheses task using the best HPs for each architecture, at depth 1. HPs were chosen to be the set which achieved the minimum loss. The table shows the original loss achieved by the HP tuner, the number of infeasible trials, the minimum loss from running 100 iterations of the same HPs, the mean loss, the maximum loss, the standard deviation, and the standard deviation divided by the mean.
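The rerun statistics listed in the caption can be computed as in the sketch below. Treating a non-finite loss as an infeasible trial and using the population standard deviation are both assumptions; the function name is hypothetical.

```python
import math
import statistics


def rerun_summary(losses):
    """Summarize losses from repeated runs with a fixed set of HPs."""
    # Assumption: an infeasible trial is one whose loss is NaN or infinite.
    feasible = [x for x in losses if math.isfinite(x)]
    mean = statistics.mean(feasible)
    std = statistics.pstdev(feasible)
    return {
        "infeasible": len(losses) - len(feasible),
        "min": min(feasible),
        "mean": mean,
        "max": max(feasible),
        "std": std,
        "std_over_mean": std / mean,
    }
```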
|1 layer||8 layer|
|GRU/IRNN||-7.74||696||< 0.001||-3.51||1360||< 0.001|
|GRU/LSTM||-6.65||1750||< 0.001||-4.84||1290||< 0.001|
|GRU/RNN||-26.5||1340||< 0.001||-3.93||1330||< 0.001|