Gated Recurrent Unit with Low-rank matrix factorization
Recurrent neural networks (RNNs) are notoriously difficult to train. When the eigenvalues of the hidden-to-hidden weight matrix deviate from absolute value 1, optimization becomes difficult due to the well-studied issue of vanishing and exploding gradients, especially when trying to learn long-term dependencies. To circumvent this problem, we propose a new architecture that learns a unitary weight matrix, with eigenvalues of absolute value exactly 1. The challenge we address is that of parametrizing unitary matrices in a way that does not require expensive computations (such as eigendecomposition) after each weight update. We construct an expressive unitary weight matrix by composing several structured matrices that act as building blocks with parameters to be learned. Optimization with this parameterization becomes feasible only when considering hidden states in the complex domain. We demonstrate the potential of this architecture by achieving state-of-the-art results in several hard tasks involving very long-term dependencies.
Unitary Evolution Recurrent Neural Network
to be necessarily arising when trying to learn to reliably store bits of information in any parametrized dynamical system. If gradients propagated back through a network vanish, the credit assignment role of backpropagation is lost, as information about small changes in states in the far past has no influence on future states. If gradients explode, gradient-based optimization algorithms struggle to traverse down a cost surface, because gradient-based optimization assumes small changes in parameters yield small changes in the objective function. As the number of time steps considered in the sequence of states grows, the shrinking or expanding effects associated with the state-to-state transformation at individual time steps can grow exponentially, yielding respectively vanishing or exploding gradients. See Pascanu et al. (2010) for a review.
Although the long-term dependencies problem appears intractable in the absolute (Bengio et al., 1994) for parametrized dynamical systems, several heuristics have recently been found to help reduce its effect, such as the use of self-loops and gating units in the LSTM (Hochreiter & Schmidhuber, 1997) and GRU (Cho et al., 2014) recurrent architectures. Recent work also supports the idea of using orthogonal weight matrices to assist optimization (Saxe et al., 2014; Le et al., 2015).
In this paper, we explore the use of orthogonal and unitary matrices in recurrent neural networks. We start in Section 2 by showing a novel bound on the propagated gradients in recurrent nets when the recurrent matrix is orthogonal. Section 3 discusses the difficulties of parameterizing real valued orthogonal matrices and how they can be alleviated by moving to the complex domain.
We discuss a novel approach to constructing expressive unitary matrices as the composition of simple unitary matrices which require at most $\mathcal{O}(n \log n)$ computation and $\mathcal{O}(n)$ memory, when the state vector has dimension $n$. These are unlike general matrices, which require $\mathcal{O}(n^2)$ computation and memory. Complex valued representations have been considered for neural networks in the past, but with limited success and adoption (Hirose, 2003; Zimmermann et al., 2011). We hope our findings will change this.
Whilst our model uses complex valued matrices and parameters, all implementation and optimization is possible with real numbers and has been done in Theano (Bergstra et al., 2010). This, along with other implementation details, is discussed in Section 4, and the code used for the experiments is available online. The potential of the developed model for learning long term dependencies with relatively few parameters is explored in Section 5. We find that the proposed architecture generally outperforms LSTMs and previous approaches based on orthogonal initialization.
A matrix, $W$, is orthogonal if $W^\top W = W W^\top = I$. Orthogonal matrices have the property that they preserve norm (i.e. $\|Wh\|_2 = \|h\|_2$) and hence repeated iterative multiplication of a vector by an orthogonal matrix leaves the norm of the vector unchanged.
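The norm-preservation property can be checked numerically. The following is a minimal sketch, not taken from the paper: it uses a 2x2 rotation as the orthogonal matrix and, for contrast, a hypothetical contraction `S` that scales every coordinate by 0.9.

```python
import math

def rotation(theta):
    """2x2 rotation matrix, a simple real orthogonal matrix."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def matvec(W, h):
    """Dense matrix-vector product."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def norm(h):
    """Euclidean norm of a real vector."""
    return math.sqrt(sum(x * x for x in h))

W = rotation(0.7)               # orthogonal: W^T W = I
S = [[0.9, 0.0], [0.0, 0.9]]    # contraction, not orthogonal

h_orth = [3.0, 4.0]
h_shrink = [3.0, 4.0]
for _ in range(100):
    h_orth = matvec(W, h_orth)
    h_shrink = matvec(S, h_shrink)

print(norm(h_orth))    # stays at 5 up to floating point rounding
print(norm(h_shrink))  # 5 * 0.9**100, which is close to zero
```

After 100 multiplications the rotated vector keeps its norm, while the contracted vector has essentially vanished, which is the forward-pass analogue of the vanishing gradient effect.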
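Let $h_T$ and $h_t$ be the hidden unit vectors for hidden layers $T$ and $t$ of a neural network with $T$ hidden layers and $T \gg t$. If $C$ is the objective we are trying to minimize, then the vanishing and exploding gradient problems refer to the decay or growth of $\frac{\partial C}{\partial h_t}$ as the number of layers, $T$, grows. Let $\sigma$ be a pointwise nonlinearity function, and
$$z_{t+1} = W_t h_t + V_t x_{t+1}, \qquad h_{t+1} = \sigma(z_{t+1}) \tag{1}$$
then by the chain rule
$$\frac{\partial C}{\partial h_t} = \frac{\partial C}{\partial h_T}\frac{\partial h_T}{\partial h_t} = \frac{\partial C}{\partial h_T}\prod_{k=t}^{T-1}\frac{\partial h_{k+1}}{\partial h_k} = \frac{\partial C}{\partial h_T}\prod_{k=t}^{T-1} D_{k+1} W_k^\top \tag{2}$$
where $D_{k+1} = \mathrm{diag}(\sigma'(z_{k+1}))$ is the Jacobian matrix of the pointwise nonlinearity.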
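In the following we define the norm of a matrix to refer to the spectral norm (or operator 2-norm) and the norm of a vector to mean the $L_2$-norm. By definition of the operator norms, for any matrices $A$, $B$ and vector $v$ we have $\|Av\| \le \|A\|\,\|v\|$ and $\|AB\| \le \|A\|\,\|B\|$. If the weight matrices $W_k$ are norm preserving (i.e. orthogonal), then we prove
$$\left\|\frac{\partial C}{\partial h_t}\right\| = \left\|\frac{\partial C}{\partial h_T}\prod_{k=t}^{T-1} D_{k+1} W_k^\top\right\| \le \left\|\frac{\partial C}{\partial h_T}\right\| \prod_{k=t}^{T-1}\left\|D_{k+1} W_k^\top\right\| = \left\|\frac{\partial C}{\partial h_T}\right\| \prod_{k=t}^{T-1}\left\|D_{k+1}\right\| \tag{3}$$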
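Since $D_k$ is diagonal, $\|D_k\| = \max_{j=1,\dots,n} |\sigma'(z_k^{(j)})|$, with $z_k^{(j)}$ the $j$-th pre-activation of the $k$-th hidden layer. If the absolute value of the derivative $\sigma'$ can take some value $\tau > 1$, then this bound is useless, since $\left\|\frac{\partial C}{\partial h_t}\right\| \le \left\|\frac{\partial C}{\partial h_T}\right\| \tau^{T-t}$, which grows exponentially in $T$. We therefore cannot effectively bound $\frac{\partial C}{\partial h_t}$ for deep networks, resulting potentially in exploding gradients.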
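In the case $|\sigma'| \le \tau < 1$, equation 3 proves that $\frac{\partial C}{\partial h_t}$ tends to 0 exponentially fast as $T$ grows. This motivates the rectified linear unit (ReLU) nonlinearity (…, 2011; Nair & Hinton, 2010). Unless all the activations are killed at one layer, the maximum entry of $D_k$ is 1, resulting in $\|D_k\| = 1$ for all layers $k$. With ReLU nonlinearities, we thus have
$$\left\|\frac{\partial C}{\partial h_t}\right\| \le \left\|\frac{\partial C}{\partial h_T}\right\| \prod_{k=t}^{T-1}\left\|D_{k+1}\right\| = \left\|\frac{\partial C}{\partial h_T}\right\| \tag{4}$$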
Most notably, this result holds for a network of arbitrary depth and renders engineering tricks like gradient clipping unnecessary (Pascanu et al., 2010).
To the best of our knowledge, this analysis is a novel contribution and the first time a neural network architecture has been mathematically proven to avoid exploding gradients.
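The bound can be illustrated numerically. The sketch below is an illustration under stated assumptions, not the paper's experiment: each layer uses a random Householder reflection as its orthogonal weight matrix (a hypothetical choice made because it is easy to apply), there are no inputs or biases, and gradients are stored as column vectors, so one backward step reads $W_k^\top D_{k+1}\, g$.

```python
import math
import random

def reflect(v, x):
    """Apply the Householder reflection (I - 2 v v^T / ||v||^2) to x."""
    scale = 2.0 * sum(a * b for a, b in zip(v, x)) / sum(a * a for a in v)
    return [b - scale * a for a, b in zip(v, x)]

def norm(x):
    return math.sqrt(sum(a * a for a in x))

random.seed(0)
n, depth = 16, 200
# One reflection vector per layer; each reflection is orthogonal and symmetric.
vs = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(depth)]

# Forward pass: z_{k+1} = W_k h_k, h_{k+1} = relu(z_{k+1}); record the ReLU masks.
h = [random.uniform(-1, 1) for _ in range(n)]
masks = []
for v in vs:
    z = reflect(v, h)
    masks.append([1.0 if zi > 0 else 0.0 for zi in z])
    h = [max(zi, 0.0) for zi in z]

# Backward pass: dC/dh_k = W_k^T D_{k+1} dC/dh_{k+1}, starting from a random dC/dh_T.
g = [random.uniform(-1, 1) for _ in range(n)]
grad_norms = [norm(g)]
for v, mask in zip(reversed(vs), reversed(masks)):
    g = reflect(v, [m * gi for m, gi in zip(mask, g)])
    grad_norms.append(norm(g))

# The gradient norm never grows while moving towards earlier layers.
print(all(b <= a + 1e-12 for a, b in zip(grad_norms, grad_norms[1:])))
```

The check confirms only the upper bound: the ReLU masks can still shrink the gradient, so the norm is non-increasing rather than constant.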
Unitary matrices generalize orthogonal matrices to the complex domain. A complex valued, norm preserving matrix, $U$, is called a unitary matrix and is such that $U^* U = U U^* = I$, where $U^*$ is the conjugate transpose of $U$. Directly parametrizing the set of unitary matrices in such a way that gradient-based optimization can be applied is not straightforward because a gradient step will typically yield a matrix that is not unitary, and projecting on the set of unitary matrices (e.g., by performing an eigendecomposition) generally costs $\mathcal{O}(n^3)$ computation when $U$ is $n \times n$.
The most important feature of unitary and orthogonal matrices for our purpose is that they have eigenvalues with absolute value 1. The following lemma, proved in (Hoffman & Kunze, 1971), may shed light on a method which can be used to efficiently span a large set of unitary matrices.
Lemma 1. A complex square matrix $W$ is unitary if and only if it has an eigendecomposition of the form $W = V D V^*$, where $^*$ denotes the conjugate transpose. Here, $V, D \in \mathbb{C}^{n \times n}$ are complex matrices, where $V$ is unitary, and $D$ is a diagonal such that $|D_{j,j}| = 1$. Furthermore, $W$ is a real orthogonal matrix if and only if for every eigenvalue $D_{j,j} = \lambda_j$ with eigenvector $v_j$, there is also a complex conjugate eigenvalue $\lambda_k = \overline{\lambda_j}$ with corresponding eigenvector $v_k = \overline{v_j}$.
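Writing $\lambda_j = e^{i w_j}$ with $w_j \in \mathbb{R}$, a naive method to learn a unitary matrix would be to fix a basis of eigenvectors $V \in \mathbb{C}^{n \times n}$ and set
$$W = V D V^* \tag{5}$$
where $D$ is a diagonal such that $D_{j,j} = \lambda_j$.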
Lemma 1 informs us how to construct a real orthogonal matrix, $W$. We must (i) ensure the columns of $V$ come in complex conjugate pairs, $v_k = \overline{v_j}$, and (ii) tie weights $w_k = -w_j$ in order to achieve $e^{i w_j} = \overline{e^{i w_k}}$. Most neural network objective functions are differentiable with respect to the weight matrices, and consequently $w_j$ may be learned by gradient descent.
Unfortunately the above approach has undesirable properties. Fixing $V$ and learning $w$ requires $\mathcal{O}(n^2)$ memory, which is unacceptable given that the number of learned parameters is $\mathcal{O}(n)$. Further note that calculating $Vu$ for an arbitrary vector $u$ requires $\mathcal{O}(n^2)$ computation. Setting $V$ to the identity would satisfy the conditions of the lemma, whilst reducing memory and computation requirements to $\mathcal{O}(n)$; however, $W$ would remain diagonal, and have poor representation capacity.
We propose an alternative strategy to parameterize unitary matrices. Since the product of unitary matrices is itself a unitary matrix, we compose several simple, parametric, unitary matrices to construct a single, expressive unitary matrix. The four unitary building blocks considered are
$D$, a diagonal matrix with $D_{j,j} = e^{i w_j}$, with parameters $w_j \in \mathbb{R}$,
$R = I - 2\frac{v v^*}{\|v\|^2}$, a reflection matrix in the complex vector $v \in \mathbb{C}^n$,
$\Pi$, a fixed random index permutation matrix, and
$\mathcal{F}$ and $\mathcal{F}^{-1}$, the Fourier and inverse Fourier transforms.
Appealingly, $D$, $R$ and $\Pi$ all permit $\mathcal{O}(n)$ storage and $\mathcal{O}(n)$ computation for matrix vector products. $\mathcal{F}$ and $\mathcal{F}^{-1}$ require no storage and $\mathcal{O}(n \log n)$ matrix vector multiplication using the Fast Fourier Transform algorithm. A major advantage of composing unitary matrices of the form listed above is that the number of parameters, memory and computational cost increase almost linearly in the size of the hidden layer. With such a weight matrix, immensely large hidden layers are feasible to train, whilst being impossible in traditional neural networks.
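To make the linear-time claim concrete, here is a minimal sketch of how the diagonal, reflection and permutation blocks can be applied to a vector without ever forming an $n \times n$ matrix. The function names are illustrative, and the reflection is assumed to have the standard form $R = I - 2 v v^* / \|v\|^2$.

```python
import cmath
import math
import random

def apply_diag(w, h):
    """D h with D_jj = exp(i w_j): n complex multiplications."""
    return [cmath.exp(1j * wj) * hj for wj, hj in zip(w, h)]

def apply_reflection(v, h):
    """R h = h - 2 v (v^* h) / ||v||^2, computed with two passes over the vector."""
    inner = sum(vj.conjugate() * hj for vj, hj in zip(v, h))
    scale = 2.0 * inner / sum(abs(vj) ** 2 for vj in v)
    return [hj - scale * vj for vj, hj in zip(v, h)]

def apply_permutation(perm, h):
    """Pi h: output entry j is input entry perm[j]."""
    return [h[p] for p in perm]

def norm(h):
    return math.sqrt(sum(abs(x) ** 2 for x in h))

random.seed(0)
n = 8
h = [complex(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(n)]
w = [random.uniform(-math.pi, math.pi) for _ in range(n)]
v = [complex(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(n)]
perm = random.sample(range(n), n)

# Each block is unitary, so each product preserves the norm of h.
for out in (apply_diag(w, h), apply_reflection(v, h), apply_permutation(perm, h)):
    print(abs(norm(out) - norm(h)) < 1e-12)
```

Each function touches every entry a constant number of times, which is where the $\mathcal{O}(n)$ cost comes from.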
With this in mind, in this work we choose to consider recurrent neural networks with unitary hidden to hidden weight matrices. Our claim is that the ability to have large hidden layers where hidden states norms are preserved provides a powerful tool for modeling long term dependencies in sequence data. (Bengio et al., 1994) suggest that having a large memory may be crucial for solving difficult tasks with long ranging dependencies: the smaller the state dimension, the more information necessarily has to be eliminated when mapping a long sequence to a fixed-dimension state.
We call any RNN architecture which uses a unitary hidden to hidden matrix a unitary evolution RNN (uRNN). After experimenting with several structures, we settled on the following composition
$$W = D_3 R_2 \mathcal{F}^{-1} D_2 \Pi R_1 \mathcal{F} D_1 \tag{6}$$
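As an illustration, the sketch below applies a composed matrix of this kind to a vector, assuming the factor order $W = D_3 R_2 \mathcal{F}^{-1} D_2 \Pi R_1 \mathcal{F} D_1$. It is a plain Python sketch rather than the Theano implementation: the radix-2 FFT requires $n$ to be a power of two, the forward transform is unnormalized and the inverse divides by $n$, and all helper names are hypothetical.

```python
import cmath
import math
import random

def apply_diag(w, h):
    return [cmath.exp(1j * wj) * hj for wj, hj in zip(w, h)]

def apply_reflection(v, h):
    inner = sum(vj.conjugate() * hj for vj, hj in zip(v, h))
    scale = 2.0 * inner / sum(abs(vj) ** 2 for vj in v)
    return [hj - scale * vj for vj, hj in zip(v, h)]

def apply_permutation(perm, h):
    return [h[p] for p in perm]

def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    sign = 1.0 if inverse else -1.0
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def ifft(x):
    """Inverse transform: conjugate twiddle factors and divide by n."""
    return [value / len(x) for value in fft(x, inverse=True)]

def urnn_matvec(params, h):
    """Apply D3 R2 F^-1 D2 Pi R1 F D1 to h, rightmost factor first."""
    w1, w2, w3, v1, v2, perm = params
    h = apply_diag(w1, h)             # D1
    h = fft(h)                        # F
    h = apply_reflection(v1, h)       # R1
    h = apply_permutation(perm, h)    # Pi
    h = apply_diag(w2, h)             # D2
    h = ifft(h)                       # F^-1
    h = apply_reflection(v2, h)       # R2
    return apply_diag(w3, h)          # D3

def norm(h):
    return math.sqrt(sum(abs(x) ** 2 for x in h))

random.seed(0)
n = 8
phases = lambda: [random.uniform(-math.pi, math.pi) for _ in range(n)]
vector = lambda: [complex(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(n)]
params = (phases(), phases(), phases(), vector(), vector(), random.sample(range(n), n))

h = vector()
out = urnn_matvec(params, h)
print(abs(norm(out) - norm(h)) < 1e-9)
```

The unnormalized forward transform scales norms by $\sqrt{n}$ and the inverse scales them by $1/\sqrt{n}$, so the product as a whole still preserves the norm even though the individual Fourier factors are not normalized here.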
Whilst each but the permutation matrix is complex, we parameterize and represent them with real numbers for implementation purposes. When the final cost is real and differentiable, we may perform gradient descent optimization to learn the parameters. (Yang et al., 2015) construct a real valued, non-orthogonal matrix using a similar parameterization with the motivation of parameter reduction by an order of magnitude on an industrial sized network. This combined with earlier work (Le et al., 2010) suggests that it is possible to create highly expressive matrices by composing simple matrices with few parameters. In the following section, we explain details on how to implement our model and illustrate how we bypass the potential difficulties of working in the complex domain.
In this section, we describe the nonlinearity we used, how we incorporate real valued inputs with complex valued hidden units and map from complex hidden states to real outputs.
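Our implementation represents all complex numbers using real values in terms of their real and imaginary parts. Under this framework, we sidestep the lack of support for complex numbers by most deep learning frameworks. Consider multiplying the complex weight matrix $W = A + iB$ by the complex hidden vector $h = x + iy$, where $A, B, x, y$ are real. It is trivially true that $Wh = (Ax - By) + i(Ay + Bx)$. When we represent $v \in \mathbb{C}^n$ as $(\mathrm{Re}(v)^\top, \mathrm{Im}(v)^\top)^\top \in \mathbb{R}^{2n}$, we compute complex matrix vector products with real numbers as follows
$$\begin{pmatrix} \mathrm{Re}(Wh) \\ \mathrm{Im}(Wh) \end{pmatrix} = \begin{pmatrix} A & -B \\ B & A \end{pmatrix} \begin{pmatrix} \mathrm{Re}(h) \\ \mathrm{Im}(h) \end{pmatrix} \tag{7}$$
More generally, let $f : \mathbb{C}^n \to \mathbb{C}^n$ be any complex function and $z = x + iy$ any complex vector. We may write $f(z) = \alpha(x, y) + i\beta(x, y)$ where $\alpha, \beta : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}^n$. This allows us to implement everything using real valued operations, compatible with any deep learning framework with automatic differentiation such as Theano.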
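The real-valued computation of a complex matrix vector product can be sketched as follows. This is an illustrative plain Python version (the paper's code is in Theano); it writes $W = A + iB$ and $h = x + iy$ and compares the result against Python's built-in complex arithmetic. The example matrices are arbitrary.

```python
def matvec(M, x):
    """Dense matrix-vector product; works for real or complex entries."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def complex_matvec_real(A, B, x, y):
    """Return (Re(Wh), Im(Wh)) for W = A + iB and h = x + iy, using real arithmetic only."""
    Ax, By = matvec(A, x), matvec(B, y)
    Ay, Bx = matvec(A, y), matvec(B, x)
    real = [a - b for a, b in zip(Ax, By)]   # Re(Wh) = Ax - By
    imag = [a + b for a, b in zip(Ay, Bx)]   # Im(Wh) = Ay + Bx
    return real, imag

A = [[1.0, 2.0], [0.0, -1.0]]
B = [[0.5, 0.0], [1.0, 3.0]]
x = [1.0, -2.0]
y = [0.5, 4.0]

real, imag = complex_matvec_real(A, B, x, y)

# Reference computation with Python's complex type.
W = [[complex(a, b) for a, b in zip(row_a, row_b)] for row_a, row_b in zip(A, B)]
h = [complex(a, b) for a, b in zip(x, y)]
ref = matvec(W, h)
print(all(abs(complex(r, i) - z) < 1e-12 for r, i, z in zip(real, imag, ref)))
```

Because only real additions and multiplications appear, a framework with real-valued automatic differentiation can differentiate through this computation directly.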
As is the case with most recurrent networks, our uRNN follows the same hidden to hidden mapping as equation 1 with $V_t = V$ and $W_t = W$. Denote the size of the complex valued hidden states as $n_h$. The input to hidden matrix is complex valued, $V \in \mathbb{C}^{n_h \times n_{in}}$. We learn the initial hidden state $h_0 \in \mathbb{C}^{n_h}$ as a parameter of the model.
Choosing an appropriate nonlinearity is not trivial in the complex domain. As discussed in the introduction, using a ReLU is a natural choice in combination with a norm preserving weight matrix. We first experimented with placing separate ReLU activations on the real and imaginary parts of the hidden states. However, we found that such a nonlinearity usually performed poorly. Our intuition is that applying separate ReLU nonlinearities to the real and imaginary parts brutally impacts the phase of a complex number, making it difficult to learn structure.
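We speculate that maintaining the phase of hidden states may be important for storing information across a large number of time steps, and our experiments supported this claim. A variation of the ReLU that we name modReLU is what we finally chose. It is a pointwise nonlinearity, $\sigma_{\mathrm{modReLU}} : \mathbb{C} \to \mathbb{C}$, which affects only the absolute value of a complex number, defined as
$$\sigma_{\mathrm{modReLU}}(z) = \begin{cases} (|z| + b)\,\dfrac{z}{|z|} & \text{if } |z| + b \ge 0 \\ 0 & \text{if } |z| + b < 0 \end{cases} \tag{8}$$
where $b \in \mathbb{R}$ is a bias parameter of the nonlinearity. For an $n_h$ dimensional hidden space we learn $n_h$ nonlinearity bias parameters, one per dimension. Note that the modReLU is similar to the ReLU in spirit, in fact more concretely $\sigma_{\mathrm{modReLU}}(z) = \sigma_{\mathrm{ReLU}}(|z| + b)\,\frac{z}{|z|}$.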
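A scalar version of the nonlinearity can be sketched as below, assuming the definition $\sigma_{\mathrm{modReLU}}(z) = \sigma_{\mathrm{ReLU}}(|z| + b)\, z / |z|$. The handling of $z = 0$, where the phase is undefined, is a choice made for this sketch and is not specified in the text.

```python
import cmath

def modrelu(z, b):
    """modReLU: replace |z| by relu(|z| + b) while keeping the phase of z."""
    r = abs(z)
    # Assumption of this sketch: return 0 at z = 0, where z / |z| is undefined.
    if r == 0.0 or r + b < 0.0:
        return 0j
    return (r + b) * z / r

z = 3 + 4j                      # |z| = 5
shrunk = modrelu(z, -1.0)       # modulus becomes 4, phase unchanged
killed = modrelu(z, -6.0)       # |z| + b < 0, so the unit is switched off

print(abs(shrunk), cmath.phase(shrunk) - cmath.phase(z))
print(killed)
```

With a negative bias the nonlinearity creates a dead zone of radius $|b|$ around the origin, and outside that zone it shifts the modulus without rotating the number.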
To map hidden states to output, we define a matrix $U \in \mathbb{R}^{n_o \times 2 n_h}$, where $n_o$ is the output dimension. We calculate a linear output as
$$o_t = U \begin{pmatrix} \mathrm{Re}(h_t) \\ \mathrm{Im}(h_t) \end{pmatrix} + b_o \tag{9}$$
where $b_o \in \mathbb{R}^{n_o}$ is the output bias. The linear output is real valued ($o_t \in \mathbb{R}^{n_o}$) and can be used for prediction and loss function calculation akin to typical neural networks (e.g. it may be passed through a softmax which is used for cross entropy calculation for classification tasks).
Due to the stability of the norm preserving operations of our network, we found that performance was not very sensitive to initialization of parameters. For full disclosure and reproducibility, we explain our initialization strategy for each parameter below.
We initialize $V$ and $U$ (the input and output matrices) as in (Glorot & Bengio, 2010), with weights sampled independently from uniforms, $\mathcal{U}\left[-\frac{\sqrt{6}}{\sqrt{n_{in} + n_{out}}}, \frac{\sqrt{6}}{\sqrt{n_{in} + n_{out}}}\right]$.
The biases, $b$ and $b_o$, are initialized to 0. This implies that at initialization, the network is linear with unitary weights, which seems to help early optimization (Saxe et al., 2014).
The reflection vectors for $R_1$ and $R_2$ are initialized coordinate-wise from a uniform $\mathcal{U}[-1, 1]$. Note that the reflection matrices are invariant to scalar multiplication of the parameter vector, hence the width of the uniform initialization is unimportant.
The diagonal weights for $D_1$, $D_2$ and $D_3$ are sampled from a uniform, $\mathcal{U}[-\pi, \pi]$. This ensures that the diagonal entries $D_{j,j}$ are sampled uniformly over the complex unit circle.
We initialize $h_0$ with a uniform, $\mathcal{U}\left[-\sqrt{\frac{3}{2 n_h}}, \sqrt{\frac{3}{2 n_h}}\right]$, which results in $\mathbb{E}\left[\|h_0\|_2^2\right] = 1$. Since the norm of the hidden units is roughly preserved through unitary evolution and inputs are typically whitened to have norm 1, we have hidden states, inputs and linear outputs of the same order of magnitude, which seems to help optimization.
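Two of these initialization choices can be checked with a short simulation. The sketch below assumes that the real and imaginary parts of the initial hidden state are each drawn from $\mathcal{U}[-\sqrt{3/(2 n_h)}, \sqrt{3/(2 n_h)}]$ and that the diagonal phases are drawn from $\mathcal{U}[-\pi, \pi]$; the function names and the sample count are illustrative.

```python
import cmath
import math
import random

def init_h0(n_h, rng):
    """Initial hidden state: real and imaginary parts uniform in [-a, a], a = sqrt(3 / (2 n_h))."""
    a = math.sqrt(3.0 / (2.0 * n_h))
    return [complex(rng.uniform(-a, a), rng.uniform(-a, a)) for _ in range(n_h)]

def init_diag(n_h, rng):
    """Diagonal entries exp(i w) with w uniform in [-pi, pi]: points on the unit circle."""
    return [cmath.exp(1j * rng.uniform(-math.pi, math.pi)) for _ in range(n_h)]

rng = random.Random(0)
n_h = 128

# Monte Carlo estimate of E[||h_0||^2]; each of the 2 n_h real coordinates
# has variance a^2 / 3 = 1 / (2 n_h), so the expectation is 1.
sq_norms = [sum(abs(z) ** 2 for z in init_h0(n_h, rng)) for _ in range(2000)]
mean_sq_norm = sum(sq_norms) / len(sq_norms)
print(mean_sq_norm)

diag = init_diag(n_h, rng)
print(max(abs(abs(d) - 1.0) for d in diag))  # every entry has modulus 1 up to rounding
```

The estimate lands close to 1, matching the variance calculation in the comment.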
In this section we explore the performance of our uRNN in relation to (a) RNN with tanh activations, (b) IRNN (Le et al., 2015), that is an RNN with ReLU activations and with the recurrent weight matrix initialized to the identity, and (c) LSTM (Hochreiter & Schmidhuber, 1997) models. We show that the uRNN shines quantitatively when it comes to modeling long term dependencies and exhibits qualitatively different learning properties to the other models.
We chose a handful of tasks to evaluate the performance of the various models. The tasks were especially created to be pathologically hard, and have been used as benchmarks for testing the ability of a model to capture long-term memory (Hochreiter & Schmidhuber, 1997; Le et al., 2015; Graves et al., 2014; Martens & Sutskever, 2011).
Of the handful of optimization algorithms we tried on the various models, RMSProp (Tieleman & Hinton, 2012) led to the fastest convergence and is what we used for all experiments from here on. However, we found the IRNN to be particularly unstable; it only ran without blowing up with incredibly low learning rates and gradient clipping. Since the performance was so poor relative to other models we compare against, we do not show IRNN curves in the figures. In each experiment we use a learning rate of $10^{-3}$ and a decay rate of $0.9$. For the LSTM and RNN models, we had to clip gradients at 1 to avoid exploding gradients. Gradient clipping was unnecessary for the uRNN.
Recurrent networks have been known to have trouble remembering information about inputs seen many time steps previously (Bengio et al., 1994; Pascanu et al., 2010). We therefore want to test the uRNN’s ability to recall exactly data seen a long time ago.
Following a similar setup to (Hochreiter & Schmidhuber, 1997), we outline the copy memory task. Consider 10 categories, $\{a_i\}_{i=0}^{9}$. The input takes the form of a $T + 20$ length vector of categories, where we test over a range of values of $T$. The first 10 entries are sampled uniformly, independently and with replacement from $\{a_i\}_{i=0}^{7}$, and represent the sequence which will need to be remembered. The next $T - 1$ entries are set to $a_8$, which can be thought of as the 'blank' category. The next single entry is $a_9$, which represents a delimiter, which should indicate to the algorithm that it is now required to reproduce the initial 10 categories in the output. The remaining 10 entries are set to $a_8$. The required output sequence consists of $T + 10$ repeated entries of $a_8$, followed by the first 10 categories of the input sequence in exactly the same order. The goal is to minimize the average cross entropy of category predictions at each time step of the sequence. The task amounts to having to remember a categorical sequence of length 10, for $T$ time steps.
A simple baseline can be established by considering an optimal strategy when no memory is available, which we deem the memoryless strategy. The memoryless strategy would be to predict $a_8$ for $T + 10$ entries and then predict each of the final 10 categories from the set $\{a_i\}_{i=0}^{7}$ independently and uniformly at random. The categorical cross entropy of this strategy is $\frac{10 \log(8)}{T + 20}$.
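The task layout and the memoryless baseline can be sketched as follows. The sketch assumes the layout of 10 symbols to remember, $T - 1$ blanks, one delimiter and 10 further blanks, with categories encoded as the integers 0 to 9 (8 for blank, 9 for the delimiter); the function names are hypothetical.

```python
import math
import random

BLANK, DELIM = 8, 9   # integer codes assumed for the blank and delimiter categories

def copy_task_example(T, rng):
    """Return (inputs, targets), each a list of T + 20 category indices."""
    seq = [rng.randrange(8) for _ in range(10)]          # symbols to remember
    inputs = seq + [BLANK] * (T - 1) + [DELIM] + [BLANK] * 10
    targets = [BLANK] * (T + 10) + seq
    return inputs, targets

def memoryless_cross_entropy(T):
    """Average cross entropy when the last 10 symbols are guessed uniformly among 8."""
    return 10 * math.log(8) / (T + 20)

rng = random.Random(0)
T = 100
inputs, targets = copy_task_example(T, rng)
print(len(inputs), len(targets))       # both T + 20
print(memoryless_cross_entropy(T))     # about 0.173 for T = 100
```

A model beats the baseline only if its average cross entropy falls below this value, which shrinks as $T$ grows because the blank predictions contribute no loss.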
We ran experiments where the RNN with tanh activations, IRNN, LSTM and uRNN had hidden layers of size 80, 80, 40 and 128 respectively. This equates to roughly 6500 parameters per model. In Figure 1, we see that aside from the simplest case, both the RNN with tanh and more surprisingly the LSTMs get almost exactly the same cost as the memoryless strategy. This behaviour is consistent with the results of (Graves et al., 2014), in which poor performance is reported for the LSTM for a very similar long term memory problem.
The uRNN consistently achieves perfect performance in relatively few iterations, even when having to recall sequences after 500 time steps. What is remarkable is that the uRNN does not get stuck at the baseline at all, whilst the LSTM and RNN do. This behaviour suggests that the representations learned by the uRNN have qualitatively different properties from both the LSTM and classical RNNs.
We closely follow the adding problem defined in (Hochreiter & Schmidhuber, 1997) to explain the task at hand. Each input consists of two sequences of length $T$. The first sequence, which we denote $x$, consists of numbers sampled uniformly at random from $\mathcal{U}[0, 1]$. The second sequence is an indicator sequence consisting of exactly two entries of 1 and remaining entries 0. The first 1 entry is located uniformly at random in the first half of the sequence, whilst the second 1 entry is located uniformly at random in the second half. The output is the sum of the two entries of the first sequence, corresponding to where the 1 entries are located in the second sequence. A naive strategy of predicting 1 as the output regardless of the input sequence gives an expected mean squared error of $1/6 \approx 0.167$, the variance of the sum of two independent uniform random variables on $[0, 1]$.
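A data generator for this task, together with a Monte Carlo estimate of the naive baseline, can be sketched as below. The sketch assumes values drawn from $\mathcal{U}[0, 1]$; the function name and the sample count are illustrative.

```python
import random

def adding_example(T, rng):
    """Return (values, markers, target) for one adding-problem sequence of length T."""
    values = [rng.random() for _ in range(T)]
    first = rng.randrange(0, T // 2)       # marked position in the first half
    second = rng.randrange(T // 2, T)      # marked position in the second half
    markers = [0] * T
    markers[first] = 1
    markers[second] = 1
    return values, markers, values[first] + values[second]

rng = random.Random(0)
T = 10
num_samples = 20000

# Mean squared error of always predicting 1, the mean of the target.
squared_errors = []
for _ in range(num_samples):
    _, _, target = adding_example(T, rng)
    squared_errors.append((target - 1.0) ** 2)
baseline_mse = sum(squared_errors) / num_samples
print(baseline_mse)   # close to 2 * (1 / 12) = 1 / 6
```

The target is the sum of two independent uniform variables, each with variance $1/12$, so the estimate should be close to $1/6$.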
We chose to use 128 hidden units for the RNN with tanh, IRNN and LSTM and 512 for the uRNN. This equates to roughly 16K parameters for the RNN with tanh and IRNN, 60K for the LSTM and almost 9K for the uRNN. All models were trained using batch sizes of 20 and 50 with the best results being reported. Our results are shown in Figure 2.
The LSTM and uRNN models are able to convincingly beat the baseline up to $T = 400$ time steps. Both models do well when $T = 750$, but the mean squared error does not reach close to $0$. The uRNN achieves lower test error, but its curve is more noisy. Since the LSTM has vastly more parameters, we monitored its performance to ensure that it was not overfitting.
The RNN with tanh and IRNN were not able to beat the baseline for any number of time steps. (Le et al., 2015) report that their RNN solves the problem for $T = 150$ and the IRNN for $T = 300$, but they require over a million iterations before they start learning. Neither of the two models came close to either the uRNN or the LSTM in performance. The stark difference in our findings is best explained by our use of RMSProp with significantly higher learning rates ($10^{-3}$ as opposed to $10^{-8}$) than (Le et al., 2015) use for SGD with momentum.
In this task, suggested by (Le et al., 2015), algorithms are fed pixels of MNIST (LeCun et al., 1998) sequentially and required to output a class label at the end. We consider two tasks: one where pixels are read in order (from left to right, bottom to top) and one where the pixels are all randomly permuted using the same randomly generated permutation matrix. The same model architectures as for the adding problem were used for this task, except we now use a softmax for category classification. We ran the optimization algorithms until convergence of the mean categorical cross entropy on test data, and plot test accuracy in Figure 3.
Both the uRNN and LSTM perform well here. On the correct unpermuted MNIST pixels, the LSTM performs better, achieving 98.2% test accuracy versus 95.1% for the uRNN. However, when we permute the ordering of the pixels, the uRNN dominates with 91.4% accuracy in contrast to the 88% of the LSTM, despite having less than a quarter of the parameters. This result is state of the art on this task, beating the IRNN (Le et al., 2015), which reaches close to 82% after 1 million training iterations. Notice that the uRNN reaches convergence in less than 20 thousand iterations, while it takes the LSTM from 5 to 10 times as many to finish learning.
Permuting the pixels of MNIST images creates many longer term dependencies across pixels than in the original pixel ordering, where a lot of structure is local. This makes it necessary for a network to learn and remember more complicated dependencies across varying time scales. The results suggest that the uRNN is better able to deal with such structure over the data, where the LSTM is better suited to more local sequence structure tasks.
Norms of hidden state gradients. As discussed in Section 2, key to being able to learn long term dependencies is in controlling $\frac{\partial C}{\partial h_t}$. With this in mind, we explored how each model propagated gradients, by examining $\left\|\frac{\partial C}{\partial h_t}\right\|$ as a function of $t$. Gradient norms were computed at the beginning of training and again after 100 iterations of training on the adding problem. The curves are plotted in Figure 4. It is clear that at first, the uRNN propagates gradients perfectly, while each other model has exponentially vanishing gradients. After 100 iterations of training, each model experiences vanishing gradients, but the uRNN is best able to propagate information, having much less decay.
Hidden state saturation. We claim that typical recurrent architectures saturate, in the sense that after they acquire some information, it becomes much more difficult to acquire further information pertaining to longer dependencies. We took the uRNN and LSTM models trained on the adding problem with , and computed a forward pass with newly generated data for the adding problem with . In order to show saturation effects, we plot the norms of the hidden states and the distance between each state and the last in Figure 4.
In our experiments, it is clear that the uRNN does not suffer as much as other models do. Notice that whilst the norms of hidden states in the uRNN grow very steadily over time, in the LSTM they grow very fast, and then stay constant after about time steps. This behaviour may suggest that the LSTM hidden states saturate in their ability to incorporate new information, which is vital for modeling long complicated sequences. It is interesting to see that the LSTM hidden state at , is close to that of , whilst this is far from the case in the uRNN. Again, this suggests that the LSTM's capacity to use new information to alter its hidden state severely degrades with sequence length. The uRNN does not suffer from this difficulty nearly as badly.
A clear example of this phenomenon was observed in the adding problem with $T = 750$. We found that the Pearson correlation between the LSTM output prediction and the first of the two uniform samples (whose sum is the target output) was $\rho = 0.991$. This suggests that the LSTM learnt to simply find and store the first sample, as it was unable to incorporate any more information by the time it reached the second, due to saturation of the hidden states.
There is a plethora of further ideas that may be explored from our findings, both with regard to learning representations and efficient implementation. For example, one hurdle of modeling long sequences with recurrent networks is the requirement of storing all hidden state values for the purpose of gradient backpropagation. This can be prohibitive, since GPU memory is typically a limiting factor of neural network optimization. However, since our weight matrix is unitary, its inverse is its conjugate transpose, which is just as easy to operate with. If further we were to use an invertible nonlinearity function, we would no longer need to store hidden states, since they can be recomputed in the backward pass. This could have potentially huge implications, as we would be able to reduce memory usage by a factor of order $T$, the number of time steps. This would make having immensely large hidden layers possible, perhaps enabling vast memory representations.
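The recomputation idea can be sketched for the linear part of the recurrence. The example below is a simplified illustration, not the paper's model: it omits inputs and the nonlinearity, and it uses a hypothetical unitary step $W = R D$ (a reflection times a diagonal phase matrix), whose inverse is $W^* = D^* R$.

```python
import cmath
import math
import random

def apply_diag(w, h):
    """Multiply by the diagonal unitary matrix with entries exp(i w_j)."""
    return [cmath.exp(1j * wj) * hj for wj, hj in zip(w, h)]

def apply_reflection(v, h):
    """Multiply by R = I - 2 v v^* / ||v||^2, which is unitary and its own inverse."""
    inner = sum(vj.conjugate() * hj for vj, hj in zip(v, h))
    scale = 2.0 * inner / sum(abs(vj) ** 2 for vj in v)
    return [hj - scale * vj for vj, hj in zip(v, h)]

def step(w, v, h):
    """Forward transition h -> R D h."""
    return apply_reflection(v, apply_diag(w, h))

def inverse_step(w, v, h):
    """Inverse transition h -> D^* R h, using R^{-1} = R and D^{-1} = D^*."""
    return apply_diag([-wj for wj in w], apply_reflection(v, h))

random.seed(0)
n, T = 16, 500
w = [random.uniform(-math.pi, math.pi) for _ in range(n)]
v = [complex(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(n)]
h0 = [complex(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(n)]

# Forward pass keeps only the current state, so memory use does not grow with T.
h = h0
for _ in range(T):
    h = step(w, v, h)

# Backward pass recomputes the earlier states one at a time from the final state.
for _ in range(T):
    h = inverse_step(w, v, h)

max_error = max(abs(a - b) for a, b in zip(h, h0))
print(max_error < 1e-9)
```

Extending this to the full recurrence would additionally require subtracting the input contribution and inverting the nonlinearity at each step, which is only possible when the nonlinearity is invertible, as the text notes.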
In this paper we demonstrate state of the art performance on hard problems requiring long term reasoning and memory. These results are based on a novel parameterization of unitary matrices which permits efficient matrix computations and parameter optimization. Whilst complex domain modeling has been widely successful in the signal processing community (e.g. Fourier transforms, wavelets), we have yet to exploit the power of complex valued representation in the deep learning community. Our hope is that this work will be a step forward in this direction. We motivate the idea of unitary evolution as a novel way to mitigate the problems of vanishing and exploding gradients. Empirical evidence suggests that our uRNN is better able to pass gradient information through long sequences and does not suffer from saturating hidden states as much as LSTMs, typical RNNs, or RNNs initialized with the identity weight matrix (IRNNs).
Acknowledgments: We thank the developers of Theano (Bergstra et al., 2010) for their great work. We thank NSERC, Compute Canada, Canada Research Chairs and CIFAR for their support. We would also like to thank Çaglar Gulçehre, David Krueger, Soroush Mehri, Marcin Moczulski, Mohammad Pezeshki and Saizheng Zhang for helpful discussions, comments and code sharing.
On the properties of neural machine translation: Encoder–Decoder approaches. In Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, October 2014.
Journal of Machine Learning Research, 12:2493–2537, 2011.
International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.
Rectified linear units improve restricted Boltzmann machines. International Conference on Machine Learning, 2010.
International Conference on Computer Vision (ICCV), 2015.