janet
Code for the forget-only version of the LSTM in the paper "The unreasonable effectiveness of the forget gate"
Given the success of the gated recurrent unit, a natural question is whether all the gates of the long short-term memory (LSTM) network are necessary. Previous research has shown that the forget gate is one of the most important gates in the LSTM. Here we show that a forget-gate-only version of the LSTM with chrono-initialized biases not only provides computational savings but outperforms the standard LSTM on multiple benchmark datasets and competes with some of the best contemporary models. Our proposed network, the JANET, achieves accuracies of 99% and 92.5% on the MNIST and pMNIST datasets, outperforming the standard LSTM, which yields accuracies of 98.5% and 91%.
Good engineers ensure that their designs are practical. After showing that a sequence analysis problem is best solved by the long short-term memory (LSTM) recurrent neural network, the next step is to devise an implementation enabling the often resource-constrained real-world application. Given the success of the gated recurrent unit (GRU) (Cho et al., 2014), which uses two gates, the first approach to a more hardware efficient LSTM could be the elimination of redundant gates, if there are any. Because we seek a model more efficient than the GRU, only a single-gate LSTM model is a worthwhile endeavour. To motivate why this single gate should be the forget gate, we begin with the LSTM genesis.
In an era where training recurrent neural networks (RNNs) was notoriously difficult, Hochreiter and Schmidhuber (1997) argued that having a single weight (edge) in the RNN to control whether input or output of a memory cell needs to be accepted or ignored creates conflicting updates (gradients). Essentially, the long and short-range errors act on the same weight at each step, and with sigmoid activated units, this results in the gradients vanishing faster than the weights can grow. They proceeded to propose the long short-term memory (LSTM) unit recurrent neural network, which had multiplicative input and output gates. These gates would mitigate the conflicting update issue by “protecting” the cells from irrelevant information, either from the input or from the output of other cells.
This first version of the LSTM had only two gates; it was Gers et al. (2000) who realized that if there is no mechanism for the memory cells to forget information, they may grow indefinitely and eventually cause the network to break down. As a solution, they proposed another multiplicative gate for the LSTM architecture, known as the forget gate, completing the version of the LSTM that we know today.^{1}
^{1} It’s interesting to note the difference between the motivations that led to the LSTM and the chain-of-thought that yielded the gated recurrent unit (GRU). Cho was “not well aware” (Cho, 2015, §4.2.3) of the LSTM when he, together with collaborators, designed the GRU. In contrast to the conflicting update problem (Hochreiter and Schmidhuber, 1997) and the indefinite state growth (Gers et al., 2000) arguments, Cho (2015) approached the RNN problem by thinking of it as a computer processor with memory registers. In the case of computers, we do not want to overwrite all the registers (memory values) at each step. Therefore, the RNN requires an update gate, which controls the hidden states (registers) that are overwritten (the update gate in the GRU is akin to the combined function of the input and forget gates in the LSTM). Furthermore, we do not necessarily need to read all the registers at each time step, only the important ones. Thus another gate is required in the RNN (the reset gate) to regulate the registers considered. Ideally, all of the gating operations would be binary values, but such values would result in zero gradients. Fortunately, the sigmoid or tanh nonlinearities provide leaky versions of these gating mechanisms and have smooth gradients.
It wasn’t until many years later that Greff et al. (2015) and Jozefowicz et al. (2015) simultaneously discovered the forget gate to be the crucial ingredient of the LSTM. Gers et al. (2000) proposed initializing the forget gate biases to positive values and Jozefowicz et al. (2015) showed that an initial bias of 1 for the LSTM forget gate makes the LSTM as strong as the best of the explored architectural variants (including the GRU) (Goodfellow et al., 2016, §10.10.2). Given the new-found importance of the forget gate, would the input and output gates have been found necessary if the LSTM was conceived with only a forget gate?
In this work, we take the liberty of exploring the gains introduced by the sole use of the forget gate. On the five tasks explored, use of only the forget gate provides a better solution than the use of all three LSTM gates. Many improvements have been proposed for the LSTM, which we review in the following section.
With some success, many studies have improved the LSTM by making the cell more complex (Neil et al., 2016; He et al., 2017; Fraccaro et al., 2016; Krueger et al., 2017; Graves, 2011), with classic examples being peephole connections (Gers and Schmidhuber, 2000) and depth gated LSTMs (Yao et al., 2015). Similarly, several studies have proposed recurrent neural networks (RNN) simpler than the LSTM yet still competitive, such as the skip-connected RNN (Zhang et al., 2016), the unitary RNN (Arjovsky et al., 2016), the Delta-RNN (Ororbia II et al., 2017), and the identity RNN (Le et al., 2015). However, one of the most thorough studies on the architecture of the LSTM is probably the study by Greff et al. (2015) (5,400 experiment simulations). They explored the following LSTM variants individually:
No input gate
No forget gate
No output gate
No input activation function
No output activation function
No peepholes
Coupled input and forget gate
Full gate recurrence
The first five variants are self-explanatory. Peepholes (Gers and Schmidhuber, 2000) connect the cell to the gates, adding an extra term to the pre-activations of the input, output, and forget gates. The coupled input and forget gate variant uses only one gate for modulating the input and the cell recurrent self-connections, i.e., $f_t = 1 - i_t$. Full gate recurrence is the initial setup of Hochreiter and Schmidhuber (1997), wherein all the gates receive recurrent inputs from all gates at the previous time step. This cumbersome architecture requires 9 additional recurrent weight matrices and did not feature in any of their later papers. Interestingly, the results in Greff et al. (2015) indicate that none of the variants significantly improve on the standard LSTM. The forget gate was found to be essential, but a forget-gate-only variant was not explored.
Two studies that are closely related to ours are those by Zhou et al. (2016) and Wu and King (2016). The former successfully implemented a similar gate reduction to the gated recurrent unit (GRU); they couple the reset (input) gate to the update (forget) gate and show that this minimal gated unit (MGU) achieves a performance similar to the standard GRU with only two-thirds of the parameters. The study by Wu and King (2016) proposes a gate reduction similar to that of ours for LSTMs. They demonstrate that their simple LSTM achieves the same performance as the standard LSTM on a speech synthesis task. Compared with our work, they keep the hyperbolic tangent activation function on the memory cell, and their implementation did not employ the same bias initialization scheme, which we show is paramount for successful implementation of these models over a wide range of datasets. We became aware of these studies after having completed most of our work; our simplification of the LSTM provides a network that yields classification accuracies at least as good as the standard LSTM and often performs substantially better – a result not achieved by the models proposed in the afore-mentioned studies.
Recurrent neural networks (RNNs) typically create a lossy summary of a sequence. It is lossy because it maps an arbitrarily long sequence into a fixed length vector. As mentioned before, recent work has shown that this forgetting property of LSTMs is one of the most important (Greff et al., 2015; Jozefowicz et al., 2015). Hence, we propose a simple transformation of the LSTM that leaves it with only a forget gate, and since this is Just Another NETwork (JANET), we name it accordingly. We start from the standard LSTM (Lipton et al., 2015), which, with symbols taking their standard meaning, is defined as
$$
\begin{aligned}
i_t &= \sigma(U_i h_{t-1} + W_i x_t + b_i) \\
o_t &= \sigma(U_o h_{t-1} + W_o x_t + b_o) \\
f_t &= \sigma(U_f h_{t-1} + W_f x_t + b_f) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(U_c h_{t-1} + W_c x_t + b_c) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
\tag{1}
$$
To transform the above into the JANET architecture, the input and output gates are removed. It seems sensible to have the accumulation and deletion of information be related; therefore we couple the input and forget modulation as in Greff et al. (2015), which is similar to the leaky unit implementation (Jaeger, 2002, §8.1). Furthermore, the $\tanh$ activation of $h_t$ shrinks the gradients during backpropagation, which could exacerbate the vanishing gradient problem, and since the weights $U_*$ can accommodate values beyond the range [-1,1], we can remove this unnecessary, potentially problematic, nonlinearity. The resulting JANET is given by
$$
\begin{aligned}
f_t &= \sigma(U_f h_{t-1} + W_f x_t + b_f) \\
c_t &= f_t \odot c_{t-1} + (1 - f_t) \odot \tanh(U_c h_{t-1} + W_c x_t + b_c) \\
h_t &= c_t
\end{aligned}
\tag{2}
$$
Intuitively, allowing slightly more information to accumulate than the amount forgotten would make sequence analysis easier. We found this to be true empirically by subtracting a pre-specified value $\beta$ from the input control component,^{2} as given by
$$
\begin{aligned}
s_t &= U_f h_{t-1} + W_f x_t + b_f \\
\tilde{c}_t &= \tanh(U_c h_{t-1} + W_c x_t + b_c) \\
c_t &= \sigma(s_t) \odot c_{t-1} + (1 - \sigma(s_t - \beta)) \odot \tilde{c}_t \\
h_t &= c_t
\end{aligned}
\tag{3}
$$
We speculate that the value for $\beta$ is dataset dependent; however, we found that setting $\beta = 1$ provides the best results for the datasets analysed in this study, which have sequence lengths varying from 200 to 784.
^{2} $\beta$ is a constant-valued column vector of the appropriate size.
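As an illustration of the update described above, the following is a minimal sketch of one forget-gate-only step with the offset subtracted from the input control component. It is not the authors' TensorFlow implementation: the function names, the parameter tuple layout, and the use of plain Python lists are assumptions made to keep the example self-contained.

```python
import math


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def janet_step(x, c_prev, params, beta=1.0):
    """One forget-gate-only step (hypothetical helper).

    Because the hidden state equals the memory cell in this architecture,
    c_prev also serves as the previous hidden state.
    params = (U_f, W_f, b_f, U_c, W_c, b_c); this layout is an assumption.
    """
    U_f, W_f, b_f, U_c, W_c, b_c = params
    # Forget-gate pre-activation.
    s = [u + w + b for u, w, b in zip(matvec(U_f, c_prev), matvec(W_f, x), b_f)]
    # Candidate memory values.
    cand = [math.tanh(u + w + b)
            for u, w, b in zip(matvec(U_c, c_prev), matvec(W_c, x), b_c)]
    # Keep sigmoid(s) of the old memory; admit 1 - sigmoid(s - beta) of the candidate.
    return [sigmoid(si) * cp + (1.0 - sigmoid(si - beta)) * ci
            for si, cp, ci in zip(s, c_prev, cand)]
```

Setting `beta=0.0` recovers the plain coupled update, where the fraction admitted is exactly one minus the fraction retained.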
If we follow the standard parameter initialization scheme for LSTMs, the JANET quickly encounters a problem. The standard procedure is to initialize the weights and to be distributed as , where is the size of each layer (He et al., 2015b; Glorot and Bengio, 2010), and to initialize all biases to zero except for the forget gate bias , which is initialized to one (Jozefowicz et al., 2015). Hence, if the values of both input and hidden layers are zero-centred over time, will be centred around . In this case, the memory values of the JANET would not be retained for more than a couple of time steps. This problem is best exemplified by the MNIST dataset (LeCun, 1998) processed in scanline order (Cooijmans et al., 2016); each training example contains many consecutive zero-valued subsequences, each of length 10 to 20. In the best case scenario – a length 10 zero-valued subsequence – the memory values at the end of the subsequence would be centred around
$$
c_{t+10} \approx \sigma(1)^{10}\, c_t \approx 0.73^{10}\, c_t \approx 0.04\, c_t
\tag{4}
$$
Thus, with the standard initialization scheme, little information would be propagated during the forward pass and in turn, the gradients will quickly vanish.
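The size of this effect can be checked numerically. The sketch below assumes, as in the argument above, that the forget-gate pre-activation equals its bias when inputs and hidden values are zero-centred; the function name is hypothetical.

```python
import math


def retained_fraction(forget_bias, n_steps):
    """Fraction of a memory value left after n_steps if the forget gate
    activation stays at sigmoid(forget_bias) throughout."""
    f = 1.0 / (1.0 + math.exp(-forget_bias))
    return f ** n_steps


print(retained_fraction(1.0, 10))  # roughly 0.04 for a bias of one
print(retained_fraction(1.0, 20))  # smaller again for a length-20 run of zeros
```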
Fortunately, the recent work by Tallec and Ollivier (2018) proposed a more suitable initialization scheme for the forget gate biases of the LSTM. To motivate this initialization scheme we start by re-writing the leaky RNN (Jaeger, 2002, §8.1)
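$$
h_{t+1} = \alpha \odot \tanh(U h_t + W x_t + b) + (1 - \alpha) \odot h_t
\tag{5}
$$
as its continuous time version, by making use of the first order Taylor expansion $h(t + \delta t) \approx h(t) + \delta t \frac{dh(t)}{dt}$ and a discretization step $\delta t = 1$,
$$
\frac{dh(t)}{dt} = \alpha \odot \tanh(U h(t) + W x(t) + b) - \alpha \odot h(t)
\tag{6}
$$
Tallec and Ollivier (2018) state that in the free regime, when inputs stop after a certain time $t_0$, i.e., $x(t) = 0$ for $t > t_0$, with $b = 0$ and $U = 0$, eq. 6 becomes
$$
\frac{dh(t)}{dt} = -\alpha \odot h(t), \qquad h(t) = h(t_0) \odot \exp(-\alpha (t - t_0))
\tag{7}
$$
From eq. 7 the hidden state h will decrease to $e^{-1}$ of its original value over a time proportional to $1/\alpha$. This can be interpreted as the characteristic forgetting time, or the time constant, of the recurrent neural network. Therefore, when modelling sequential data believed to have dependencies in a range $[T_{min}, T_{max}]$, it would be sensible to use a model with a forgetting time lying in approximately the same range, i.e., having $\alpha \in [1/T_{max}, 1/T_{min}]^d$, where $d$ is the number of hidden units.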
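The forgetting-time interpretation can be illustrated with a small simulation of the discrete leaky update in the free regime. This is a sketch under the stated assumptions (zero input, zero bias, zero recurrent weights, so the tanh term vanishes); the function name and the scalar leak rate are illustrative choices.

```python
import math


def free_regime_decay(alpha, n_steps, h0=1.0):
    """Iterate the leaky update with the tanh term equal to zero."""
    h = h0
    for _ in range(n_steps):
        h = (1.0 - alpha) * h
    return h


alpha = 0.01
steps = round(1.0 / alpha)  # one time constant, measured in time steps
# After 1/alpha steps the state is close to exp(-1) of its starting value.
print(free_regime_decay(alpha, steps), math.exp(-1.0))
```

The agreement improves as the leak rate shrinks, which is where the first order Taylor approximation is most accurate.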
For the LSTM, the input gate i and the forget gate f learn time-varying approximations of $\alpha$ and $(1 - \alpha)$, respectively. Obtaining a forgetting time centred around $T$ requires i to be centred around $1/T$ and f to be centred around $(1 - 1/T)$. Assuming the shortest dependencies to be a single time step, Tallec and Ollivier (2018) propose the chrono initializer, which initializes the LSTM gate biases as
$$
b_f \sim \log(\mathcal{U}([1, T_{max} - 1])), \qquad b_i = -b_f
\tag{8}
$$
with $T_{max}$ the expected range of long-term dependencies and $\mathcal{U}$ the uniform distribution. Importantly, these are only the initializations, and the gate biases are allowed to change independently during training.
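A minimal sketch of the chrono initializer follows, based on the scheme of Tallec and Ollivier (2018) as summarized above: forget-gate biases are the logarithm of a uniform draw between 1 and the longest expected dependency minus one, and input-gate biases are their negation. The function name, argument names, and the seeded generator are assumptions for illustration only.

```python
import math
import random


def chrono_biases(n_units, t_max, rng=None):
    """Sample chrono-style gate biases for n_units cells (hypothetical helper)."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    # Each unit gets its own time scale drawn from [1, t_max - 1].
    b_f = [math.log(rng.uniform(1.0, t_max - 1.0)) for _ in range(n_units)]
    # Input-gate biases mirror the forget-gate biases.
    b_i = [-b for b in b_f]
    return b_f, b_i


b_f, b_i = chrono_biases(n_units=128, t_max=784)
```

For a forget-gate-only network, only the first returned list would be used, since there is no input gate to initialize.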
Applying chrono initialization to the forget gate f of the JANET^{3} mitigates the memory issue (eq. 4). With the values of the input and hidden layers zero-centred over time, the forget gate corresponding to a long-range ($b_f = \log(T_{max} - 1)$) cell will have an activation of
$$
f_t = \sigma(\log(T_{max} - 1)) = \frac{1}{1 + \exp(-\log(T_{max} - 1))} = 1 - \frac{1}{T_{max}}
\tag{9}
$$
Consequently, for the MNIST memory problem ($T_{max} = 784$), these long-range cells would retain most of their information, even after 20 consecutive zeros
$$
c_{t+20} = \left(1 - \frac{1}{784}\right)^{20} c_t \approx 0.97\, c_t
\tag{10}
$$
^{3} The memory cell biases are initialized to zero.
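The retention of a long-range cell can be checked directly. The sketch below assumes a forget-gate bias equal to the logarithm of the longest expected dependency minus one and a pre-activation equal to that bias; the function name is hypothetical.

```python
import math


def chrono_retention(t_max, n_steps):
    """Return the forget activation of a longest-range cell and the
    fraction of memory it keeps after n_steps."""
    b_f = math.log(t_max - 1.0)
    f = 1.0 / (1.0 + math.exp(-b_f))  # equals 1 - 1/t_max in closed form
    return f, f ** n_steps


f, kept = chrono_retention(784, 20)
print(f, kept)  # kept is roughly 0.97 for a 784-step sequence
```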
For the JANET, chrono initialization provides an elegant implementation of skip-like connections between the memory cells over time. It has long been known that skip connections mitigate the vanishing gradient problem (Srivastava et al., 2015; Lin et al., 1996). A systematic study of recurrent neural networks (RNNs) by Zhang et al. (2016) found that explicitly adding skip connections in the RNN graph improves performance by allowing information to be transmitted directly between non-consecutive time steps. For RNNs, they devise the recurrent skip coefficient, a value that measures the number of time steps through which unimpeded flow of information is allowed, and argue that higher values are usually better. Furthermore, skip connections are responsible for much of the boom in machine learning; they are the pith of the powerful residual networks (He et al., 2015a), highway networks (Srivastava et al., 2015), and the WaveNet (Van Den Oord et al., 2016). A natural question that follows is how the skip-connections influence the gradients of the JANET and the LSTM.
Before comparing the gradients of the LSTM and the JANET we provide some preliminaries. We denote the derivatives of the element-wise nonlinearities by the following:
(11) |
For brevity, we denote the pre-activation vectors in eq. 1 and 2 as
(12) |
Lastly, we consider a diagonal matrix as a vector of its diagonal elements. Thus, a derivative of an element-wise multiplication of two vectors is written as a vector. Consider the following derivative of an element-wise multiplication of vectors
(13) |
which we write as
(14) |
Here we compare the gradient propagation through the memory cells of a single-layer JANET with that of a single-layer LSTM. To analyse this flow of information we can compute the gradient of the objective function with respect to some arbitrary memory vector . Starting with the JANET (eq. 2), we re-write it as
(15) |
For this architecture the gradient of the objective function is given by
(16) |
with
(17) |
Assuming that the input and hidden layers are zero-centred over time (as for the memory problem eq. 4) and all the forget gate biases are initialized to the longest range (eq. 10), will typically take values of one.^{4} In this scenario, we see that all but one of the terms in eq. 17 reduce to zero and we have
(18) |
meaning that gradients from distant memory cells are largely unaffected by the sequence length.
^{4} With the biases large enough for and values near zero.
Moving on to the LSTM, we re-write eq. 1 as
(19) |
Here the gradient of the objective function is
(20) |
With a forget gate chrono-initialized to a hypothetical value of one and with , the LSTM would permit unhindered gradient propagation. Under standard and chrono-initialization schemes, however, this term is unlikely to be zero. First,
(21) |
which is non-zero with (centred around 0.5 under the memory problem assumptions eq. 4) and . Second,
(22) |
where under chrono-initialized assumptions
(23) |
would typically take values of zero because , and are centred near zero, but depends on gradients w.r.t. the output gate and new-input functions ( and ), resulting in a summation of non-zero gradients. Initializing the biases of these two gates such that could provide a better solution for the LSTM and we leave exploration of this for future work.
In practice the gradients are not as ill-conditioned as we have described here because the gate activations are not homogeneous; some gate-cell combinations track short-term dependencies and others track long-term dependencies. However, with all the initializations kept the same, these derivations could explain why the JANET could be easier to train than the LSTM.
We have shown how the simplification of the LSTM could lead to a better-conditioned training regime; we follow with the theoretical computational savings gleaned by this simplification.
Hardware efficient machine learning is a field of study by itself (Adolf et al., 2016; Hinton et al., 2015; Sindhwani et al., 2015; Han et al., 2015; Wang et al., 2017). The general aim is to maintain the same level of accuracy but require less computational resource in the process. Usually, this applies to only the forward pass efficiency of the network, i.e., being able to run a trained network on a small device. This is the same goal we have for our simplified version of the LSTM. If we assume the accuracies of the JANET and the LSTM to be the same, how much do we save on computation?
Consider an LSTM layer that has inputs and hidden units, then we have . For the LSTM we have , and the total number of parameters is . For the JANET we have , and the total number of parameters is . Thus we reduce the number of parameters by half, but what does this mean in terms of memory consumption and computational cost? A proxy for the required memory is the number of values that need to be in memory at each step; e.g., the LSTM requires values to be stored. Since this value is dominated by the term (typically a hidden state size is used), the JANET would require approximately half of the memory required by an LSTM in a forward pass. Adolf et al. (2016) showed that matrix and element-wise multiplication operations each constitute roughly half of the computation required by an LSTM. With the JANET, the processing required for element-wise multiplications is reduced by one third because there are no output gate element-wise multiplications. Thus, the total processing power required by the JANET is roughly of the processing power required by the LSTM.
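The halving of the parameter count can be reproduced with a short calculation. This sketch assumes that every gate or candidate block holds one input weight matrix, one recurrent weight matrix, and one bias vector, that the LSTM has four such blocks (three gates plus the candidate), and that the JANET has two (forget gate plus the candidate). The function and variable names are illustrative.

```python
def gated_rnn_param_count(n_inputs, n_hidden, n_blocks):
    """Count parameters for n_blocks blocks, each with an input matrix,
    a recurrent matrix, and a bias vector."""
    per_block = n_hidden * n_inputs + n_hidden * n_hidden + n_hidden
    return n_blocks * per_block


# Example sizes: one input per time step and 128 hidden units.
lstm_params = gated_rnn_param_count(n_inputs=1, n_hidden=128, n_blocks=4)
janet_params = gated_rnn_param_count(n_inputs=1, n_hidden=128, n_blocks=2)
print(lstm_params, janet_params)
```

With these sizes the recurrent matrix dominates each block, which is why the memory estimate in the text tracks the squared hidden size.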
If we assume that the electrical power consumed by the memory component of our device is 5% of that consumed by the processor (Acar et al., 2016), then the JANET will consume approximately
of the electrical power consumed by the LSTM. However, this ratio is a theoretical estimation and would be different in practice.
Such computational efficiencies are particularly beneficial when applications involve resource-constrained devices. If our simplification of the LSTM is able to provide the same classification accuracy as the standard LSTM, this would be an essential step towards hardware efficient LSTMs.
We start by evaluating the performance of the JANET on three publicly available datasets. These comprise the MNIST, permuted MNIST (pMNIST) (Arjovsky et al., 2016), and MIT-BIH arrhythmia datasets. The permuted MNIST dataset is the same as the MNIST dataset, except that the pixels in each image have been permuted in the same random order. As stated by Arjovsky et al. (2016), the MNIST images have regular distinctive patterns much shorter than the 784-long input sequences; permuting the pixels creates longer-term dependencies that are harder for LSTMs to learn.
Single heartbeats were extracted from longer filtered signals on channel 1 of the MIT-BIH dataset (Moody and Mark, 2001; Goldberger et al., 2000) by means of the BioSPPy package (Carreiras et al., 2015). The signals were filtered using a bandpass FIR filter between 3 and 45 Hz, and the Hamilton QRS detector (Hamilton, 2002) was used to detect and segment single heartbeats. We chose the four heartbeat classes that are best represented over different patients in the dataset: normal, right bundle branch block, paced, and premature ventricular contraction. The resulting dataset contained 89,670 heartbeats, each of length 216 time steps, from 47 patients. We randomly split the data over patients to have heartbeats from 33 train-, 5 validation-, and 9 test-patients (70:10:20). An acceptable split was considered to have all classes in each set contain at least smallest-class-size data points, where is the split fraction (0.7, 0.1, or 0.2). The standard split was used in the case of MNIST.
For the MNIST dataset we used a model with two hidden layers of 128 units, whereas a single layer of 128 units was used for the pMNIST and MIT-BIH datasets. All the networks were trained using Adam (Kingma and Ba, 2015) with a learning rate of 0.001 and a minibatch size of 200. Dropout of 0.1 was used on the output of the recurrent layers, and a weight decay factor of 1e-5 was used. For the LSTM and the JANET, chrono initialization was employed. The models were trained for 100 epochs and the best validation loss was used to determine the final model. Furthermore, the gradient norm was clipped at a value of 5, and the models were implemented using Tensorflow (Abadi et al., 2015).
In table 1 we present the test set accuracies achieved for the three different datasets. In addition to JANET and the standard LSTM, we show the results obtained with the standard recurrent neural network (RNN) and other recent RNN modifications. The means and standard deviations from 10 independent runs are reported. The code to reproduce these experiments is available online:
https://github.com/JosvanderWesthuizen/janet
Model | MNIST (%) | pMNIST (%) | MIT-BIH (%) |
---|---|---|---|
JANET | 99.0 ± 0.120 | 92.5 ± 0.767 | 89.4 ± 0.193 |
LSTM | 98.5 ± 0.183 | 91.0 ± 0.518 | 87.4 ± 0.130 |
RNN | 10.8 ± 0.689 | 67.8 ± 20.18 | 73.5 ± 4.531 |
uRNN (Arjovsky et al., 2016) | 95.1 | 91.4 | - |
iRNN (Le et al., 2015) | 97.0 | 82.0 | - |
tLSTM^{a} (He et al., 2017) | 99.2 | 94.6 | - |
stanh RNN^{b} (Zhang et al., 2016) | 98.1 | 94.0 | - |
^{a} Effectively has more layers than the other networks.
^{b} Single hidden layer of 95 units.
Surprisingly, the results indicate that the JANET yields higher accuracies than the standard LSTM. Moreover, JANET is among the top performing models on all of the analysed datasets. Thus, by simplifying the LSTM, we not only save on computational cost but also gain in test set accuracy.
As in Zhang et al. (2016), due to the 10 to 20 long subsequences of consecutive zeros (see section 3), we found training of LSTMs to be harder on MNIST compared to training on pMNIST. By harder, we mean that gradient problems and bad local minima cause the objective function to have a rougher and consequently slower descent than the smooth monotonic descent experienced when training is easy. This does not mean that achieving near-perfect classification is more difficult; near-perfect classification on MNIST is relatively easy, whereas the longer-range dependencies in the pMNIST dataset render near-perfect classification difficult. The pMNIST permutation, in fact, blends the zeros and ones for each data point, giving rise to more uniform sequences, which make training easier.
In figure 1 we elucidate the difficulty of training on MNIST digits, processed in scanline order. We show the median values with the 10th and 90th percentiles shaded. From the figure, LSTMs clearly have a rougher ascent in accuracy on MNIST than on pMNIST and can sometimes fail catastrophically on MNIST. The chrono initializer prevents this catastrophic failure during training, but it results in a lower optimum accuracy. On the pMNIST dataset, there were no discernible differences between the chrono and standard-initialized LSTMs; the benefits of chrono initialization for LSTMs are not obvious on these datasets.
As described in section 3, the JANET allows skip connections over time steps of the sequence. In figure 2 we show how these skip connections result in the JANET being more efficient to train than the LSTM on the MNIST dataset. The median values of the test set accuracies during training are plotted, with the 25th and 75th percentiles shaded. There is a recent machine learning theme of creating models that are easier to optimize instead of creating better optimizers, which is difficult (Goodfellow et al., 2016, §10.11). Being an easier to train version of the LSTM, the JANET continues this theme.
Given the success of the JANET on the pMNIST dataset (table 1), we experimented with larger layer sizes. In figure 3 we illustrate the test set accuracies during training for different layer sizes of the LSTM and the JANET. Additionally, we depict the best-reported accuracy on pMNIST (He et al., 2017) by the dashed blue line. This best accuracy of 96.7% was achieved by a WaveNet (Van Den Oord et al., 2016), a network with dilated convolutional neural network layers. The dilation increases exponentially across the layers and essentially enables a skip connection mechanism over multiple time steps.
The results show that the JANET not only outperforms the LSTM, but it competes with one of the best performing models on this dataset. With 1000 units in a single hidden layer the JANET achieves a mean classification accuracy of 95.0% over 10 independent runs with a standard deviation of 0.48%. The benefit of more units is unclear for the LSTM, which has a similar performance with 500 and 128 units to that of the JANET with 128 units. Furthermore, our models were trained on an Nvidia GeForce GTX 1080 GPU, and the largest LSTM we could train was an LSTM with 500 units. Even with a minibatch size of 1, the LSTM with 1000 units was too large to fit into the 8 GB of GPU memory.
Note that the WaveNet performed worse than the JANET on the standard MNIST dataset, achieving a classification accuracy of 98.3% compared to the JANET’s 99.0%. The WaveNet results presented here were produced by Chang et al. (2017) using 10 layers of 50 units each. The WaveNet gains additional skip connections with more layers, whereas the JANET gains additional skip connections with more units per layer.
To further ensure that the JANET performs at least as well as the LSTM, we compare the models on two commonly used synthetic tasks for RNN benchmarks. These are known as the copy task and the add task (Arjovsky et al., 2016; Tallec and Ollivier, 2018; Hochreiter and Schmidhuber, 1997).
Consider 10 categories $\{a_i\}_{i=0}^{9}$. The input takes the form of a length $T + 20$ sequence of categories. The first 10 entries, a sequence that needs to be remembered, are sampled uniformly, independently, and with replacement from $\{a_i\}_{i=0}^{7}$. The following $T - 1$ entries are $a_8$, a dummy value. The next single entry is $a_9$, representing a delimiter, which should indicate to the model that it is now required to reproduce the initial 10 categories in the output sequence. Thus, the target sequence is $T + 10$ entries of $a_8$, followed by the first 10 elements of the input sequence in the same order. The aim is to minimize the average cross entropy of category predictions at each time step of the sequence. This translates to remembering the categorical sequence of length 10 for $T$ time steps. The best that a memoryless model can do on the copy task is to predict at random from among possible characters, yielding a loss of $\frac{10 \log 8}{T + 20}$ (Arjovsky et al., 2016).^{5}
^{5} The first $T + 10$ entries are assumed to be $a_8$, giving a loss of .
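The copy task can be generated with a few lines of code. The sketch below follows the construction of Arjovsky et al. (2016) as we understand it: eight symbol categories, one blank, one delimiter, and ten trailing blanks while the model emits its answer. The integer encoding (0 to 7 for symbols, 8 for the blank, 9 for the delimiter) and the function names are assumptions for illustration.

```python
import math
import random

BLANK, DELIMITER = 8, 9  # assumed integer codes for the two special categories


def copy_task_example(T, rng):
    """Build one (inputs, targets) pair, each of length T + 20."""
    to_remember = [rng.randrange(8) for _ in range(10)]
    inputs = to_remember + [BLANK] * (T - 1) + [DELIMITER] + [BLANK] * 10
    targets = [BLANK] * (T + 10) + to_remember
    return inputs, targets


def memoryless_baseline_loss(T):
    """Average cross entropy when the last 10 symbols are guessed
    uniformly from 8 categories and every blank is predicted correctly."""
    return 10.0 * math.log(8.0) / (T + 20)


inputs, targets = copy_task_example(T=500, rng=random.Random(0))
print(len(inputs), len(targets), memoryless_baseline_loss(500))
```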
Here each input consists of two sequences of length $T$. The first sequence consists of numbers sampled at random from $\mathcal{U}[0, 1]$. The second sequence, with exactly two entries of one and the remainder zero, is an indicator sequence. The first 1 entry is located uniformly at random within the first half of the sequence, and the second is located uniformly at random in the second half of the sequence. The scalar output corresponds to the sum of the two entries in the first sequence corresponding to the non-zero entries of the second sequence. A naive strategy would be to predict a sum of 1 regardless of the input sequence, which would yield a mean squared error of 0.167, the variance of the sum of two independent uniform distributions (Arjovsky et al., 2016).
We follow Tallec and Ollivier (2018) and use identical hyperparameters for all our models with a single hidden layer of 128 units. The models were trained using Adam with a learning rate of 0.001 and a minibatch size of 50. We illustrate the results for the copy task with $T = 500$, the maximum sequence length used in Arjovsky et al. (2016), in figure 4. For the addition task, we explored values of 200 and 750 for $T$; the results are presented in figure 5.
In both tasks, we achieve similar results to those reported by Tallec and Ollivier (2018) and Arjovsky et al. (2016), and the standard-initialized LSTM performs the worst among the three techniques. Compared to the chrono-initialized LSTM, the JANET converges faster and to a better optimum on the copy task. On the add task, the chrono-initialized LSTM and the JANET have a similar performance, with the latter being slightly better for larger $T$. The copy task is arguably more memory intensive than the add task. This could explain why the JANET, which has built-in long-term memory capability, would outperform the LSTM on the copy task.
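For reference, the add task described earlier in this section, together with its naive baseline, can be sketched as follows. The values are assumed to be drawn uniformly from the unit interval, and the function names and the pairing of values with markers are illustrative choices rather than the authors' data pipeline.

```python
import random


def add_task_example(T, rng):
    """Return a list of (value, marker) pairs and the target sum."""
    values = [rng.random() for _ in range(T)]
    half = T // 2
    first = rng.randrange(half)       # marked position in the first half
    second = rng.randrange(half, T)   # marked position in the second half
    markers = [0.0] * T
    markers[first] = 1.0
    markers[second] = 1.0
    return list(zip(values, markers)), values[first] + values[second]


def naive_mse(n_samples, T, rng):
    """Monte Carlo estimate of the error from always predicting a sum of 1."""
    total = 0.0
    for _ in range(n_samples):
        _, target = add_task_example(T, rng)
        total += (target - 1.0) ** 2
    return total / n_samples


print(naive_mse(20000, 20, random.Random(0)))  # should be close to 1/6
```

The estimate approaches the variance of the sum of two independent uniform variables, which is the baseline that a trained model has to beat.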
In this work, we proposed a simplification of the LSTM that employs only the forget gate and uses chrono-initialized biases. The proposed model was shown to achieve better generalization than the LSTM on synthetic memory tasks and on the MNIST, pMNIST, and MIT-BIH arrhythmia datasets. Additionally, the model requires half of the number of parameters required by an LSTM and two-thirds of the element-wise multiplications, permitting computational savings. The JANET is well-suited for applications to continuous-valued time series with long-term memory requirements. For example, medical time series often have an outcome after several time steps and could have sections of consecutive zero-valued entries. We expect the LSTM to outperform the JANET on next-word prediction tasks where inputs are discrete and non-zero, and predictions are made at each time step.
The unreasonable effectiveness of the proposed model could be attributed to the combination of fewer nonlinearities and chrono initialization. This combination enables skip connections over entries in the input sequence. As described in section 3, the skip connections created by the long-range cells allow information to flow unimpeded from the elements at the start of the sequence to memory cells at the end of the sequence. For the standard LSTM, these skip connections are less apparent and an unimpeded propagation of information is unlikely due to the multiple possible transformations at each time step.
Modern neural networks move towards the use of more linear transformations (Goodfellow et al., 2016, §8.7.5). These make optimization easier by making the model differentiable almost everywhere, and by making these gradients have a significant slope almost everywhere, unlike the sigmoid nonlinearity. Effectively, information is able to flow through many more layers provided that the Jacobian of the linear transformation has reasonable singular values. Linear functions consistently increase in a single direction, so even if the model’s output is far from correct, it is clear, simply from computing the gradient, which direction its output should move towards to reduce the loss function. In other words, modern neural networks have been designed so that their local gradient information corresponds reasonably well to moving towards a distant solution; a property also induced by skip connections. What this means for the LSTM is that, although the additional gates should provide it with more flexibility than our model, the highly nonlinear nature of the LSTM makes this flexibility difficult to utilize and so potentially of little use.
With some success, many studies have proposed models more complex than the LSTM. This has made it easy, however, to overlook a simplification that also improves the LSTM. The JANET provides a network that is easier to optimize and therefore achieves better results. Much of this work showcased how important parameter initialization is for neural networks. In future work, improved initialization schemes could allow the standard LSTM to surpass the models described in this study.
We thank José Miguel Hernández-Lobato for helpful discussions. This work is supported by the Skye Cambridge Trust.
Adolf et al. (2016). Fathom: Reference workloads for modern deep learning methods. In 2016 IEEE International Symposium on Workload Characterization (IISWC), pages 1–10.
Cho et al. (2014). In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734.
Cooijmans et al. (2016). Recurrent Batch Normalization. In International Conference on Learning Representations.
He et al. (2015b). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034.
He et al. (2017). Wider and Deeper, Cheaper and Faster: Tensorized LSTMs for Sequence Learning. In Advances in Neural Information Processing Systems, pages 1–11. Curran Associates, Inc.
LeCun (1998). The MNIST Database of Handwritten Digits. http://yann.lecun.com/exdb/mnist/.