Sequential non-normal initializers for recurrent neural networks
Training recurrent neural networks (RNNs) is a hard problem due to degeneracies in the optimization landscape, a problem also known as the vanishing/exploding gradients problem. Short of designing new RNN architectures, various methods for dealing with this problem that have been previously proposed usually boil down to orthogonalization of the recurrent dynamics, either at initialization or during the entire training period. The basic motivation behind these methods is that orthogonal transformations are isometries of the Euclidean space, hence they preserve (Euclidean) norms and effectively deal with the vanishing/exploding gradients problem. However, this idea ignores the crucial effects of non-linearity and noise. In the presence of a non-linearity, orthogonal transformations no longer preserve norms, suggesting that alternative transformations might be better suited to non-linear networks. Moreover, in the presence of noise, norm preservation itself ceases to be the ideal objective. A more sensible objective is maximizing the signal-to-noise ratio (SNR) of the propagated signal instead. Previous work has shown that in the linear case, recurrent networks that maximize the SNR display strongly non-normal dynamics and orthogonal networks are highly suboptimal by this measure. Motivated by this finding, in this paper, we investigate the potential of non-normal RNNs, i.e. RNNs with a non-normal recurrent connectivity matrix, in sequential processing tasks. Our experimental results show that non-normal RNNs significantly outperform their orthogonal counterparts in a diverse range of benchmarks. We also find evidence for increased non-normality and hidden chain-like feedforward structures in trained RNNs initialized with orthogonal recurrent connectivity matrices.
Modeling long-term dependencies with recurrent neural networks (RNNs) is a hard problem due to degeneracies inherent in the optimization landscapes of these models, a problem also known as the vanishing/exploding gradients problem (Hochreiter, 1991; Bengio et al., 1994). One approach to addressing this problem has been designing new RNN architectures that are less prone to such difficulties, hence are better able to capture long-term dependencies in sequential data (Hochreiter & Schmidhuber, 1997; Cho et al., 2014; Chang et al., 2017; Bai et al., 2018). An alternative approach is to stick with the basic vanilla RNN architecture instead, but to constrain its dynamics in some way so as to eliminate or reduce the degeneracies that otherwise afflict the optimization landscape. Previous proposals belonging to this second category generally boil down to orthogonalization of the recurrent dynamics, either at initialization or during the entire training period (Le et al., 2015; Arjovsky et al., 2016; Wisdom et al., 2016). The basic idea behind these methods is that orthogonal transformations are isometries of the Euclidean space, hence they preserve distances and norms, which enables them to deal effectively with the vanishing/exploding gradients problem.
However, this idea ignores the crucial effects of non-linearity and noise. Orthogonal transformations no longer preserve distances and norms in the presence of a non-linearity, suggesting that alternative transformations might be better suited to non-linear networks. Similarly, in the presence of noise, norm preservation itself ceases to be the ideal objective. One must instead maximize the signal-to-noise ratio (SNR) of the propagated signal. In neural networks, noise comes in both through the stochasticity of the stochastic gradient descent (SGD) algorithm and sometimes also through direct noise injection for regularization purposes, as in dropout. Previous work has shown that even in the linear case, recurrent networks that maximize the SNR display strongly non-normal dynamics and orthogonal networks are highly suboptimal by this measure (Ganguli et al., 2008). Motivated by these observations, in this paper, we investigate the potential of non-normal RNNs, i.e. RNNs with a non-normal recurrent connectivity matrix, in sequential processing tasks. Recall that a normal matrix is a matrix with an orthonormal set of eigenvectors, whereas a non-normal matrix does not have an orthonormal set of eigenvectors. This property allows non-normal systems to display interesting transient behaviors that are not available in normal systems. This kind of transient behavior, specifically a particular kind of transient amplification of the signal in certain non-normal systems, underlies their superior memory properties (Ganguli et al., 2008), as will be discussed further below.
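As a minimal illustration of this distinction, the sketch below tests normality of a real matrix through the commutator $WW^\top - W^\top W$, which vanishes exactly when $W$ is normal. The helper name `is_normal` and the 4-unit example matrices are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def is_normal(W, tol=1e-10):
    """Return True if the real matrix W commutes with its transpose."""
    return np.linalg.norm(W @ W.T - W.T @ W) < tol

n = 4
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))  # a random orthogonal matrix
chain = np.eye(n, k=-1)                           # unit i-1 feeds unit i

print(is_normal(np.eye(n)))   # True: the identity is normal
print(is_normal(Q))           # True: orthogonal matrices are normal
print(is_normal(chain))       # False: a feedforward chain is non-normal
```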
Our empirical results show that non-normal vanilla RNNs significantly outperform their orthogonal counterparts in a diverse range of benchmarks.
Ganguli et al. (2008) studied memory properties of linear recurrent networks injected with a scalar temporal signal $s_t$ and noise $z_t$:

$$h_t = W h_{t-1} + v s_t + z_t$$

The noise is assumed to be iid with $z_t \sim \mathcal{N}(0, \epsilon I)$. Ganguli et al. (2008) then analyzed the Fisher memory matrix (FMM) of this system, defined as:

$$J_{kl}(s_{\leq t}) \equiv \left\langle -\frac{\partial^2}{\partial s_{t-k}\, \partial s_{t-l}} \log p(h_t \mid s_{\leq t}) \right\rangle_{p(h_t \mid s_{\leq t})}$$

For linear networks with Gaussian noise, it is easy to show that $J_{kl}(s_{\leq t})$ is, in fact, independent of the past signal history $s_{\leq t}$. Ganguli et al. (2008) specifically analyzed the diagonal of the FMM, $J(k) \equiv J_{kk}$, which can be written explicitly as:

$$J(k) = v^\top (W^k)^\top C_n^{-1} W^k v$$

where $C_n = \epsilon \sum_{k=0}^{\infty} W^k (W^k)^\top$ is the noise covariance matrix, and the norm of $v$ can be roughly thought of as representing the signal strength. The total Fisher memory is the sum of $J(k)$ over all past time steps $k$:

$$J_{\mathrm{tot}} = \sum_{k=0}^{\infty} J(k)$$

Intuitively, $J(k)$ measures the information contained in the current state of the system, $h_t$, about a signal that entered the system $k$ time steps ago, $s_{t-k}$. $J_{\mathrm{tot}}$ is then a measure of the total information contained in the current state of the system about the entire past signal history, $s_{\leq t}$.

The main result in Ganguli et al. (2008) shows that $J_{\mathrm{tot}} = 1$ for all normal matrices $W$ (including all orthogonal matrices), whereas in general $J_{\mathrm{tot}} \leq N$, where $N$ is the network size (both statements hold for unit-norm $v$, with $J$ measured in units of $1/\epsilon$). Remarkably, the memory upper bound can be achieved by certain highly non-normal systems and several examples are explicitly given in Ganguli et al. (2008). Two of those examples are illustrated in Figure 1a (right): a uni-directional “chain” network and a chain network with feedback. In the chain network, the recurrent connectivity is given by $W_{ij} = \alpha \delta_{j,i-1}$ and in the chain with feedback network, it is given by $W_{ij} = \alpha \delta_{j,i-1} + \beta \delta_{j,i+1}$, where $\alpha$ and $\beta$ are the feedforward and feedback connection weights, respectively, and $\delta$ is the Kronecker delta function. In addition, in order to achieve optimal memory, the signal must be fed at the source neuron in these networks, i.e. $v = [1, 0, \ldots, 0]^\top$.

Figure 1b compares the Fisher memory curves, $J(k)$, of these non-normal networks with the Fisher memory curves of two example normal networks, namely recurrent networks with identity or random orthogonal connectivity matrices. The two non-normal networks have extensive memory capacity, i.e. $J_{\mathrm{tot}} \sim O(N)$, whereas for the normal examples, $J_{\mathrm{tot}} \sim O(1)$. The crucial property that enables extensive memory in non-normal networks is transient amplification: after the signal enters the network, it is amplified supralinearly for a time of length $O(N)$ before it eventually dies out (Figure 1c). This kind of transient amplification is not possible in normal networks.
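The Fisher memory curve described above can be evaluated numerically for small networks. The following is a minimal NumPy sketch, not the authors' code: it assumes the standard form $J(k) = v^\top (W^k)^\top C_n^{-1} W^k v$ with noise covariance $C_n = \epsilon \sum_k W^k (W^k)^\top$, truncates the infinite sums, and uses hypothetical settings (20 units, chain weight 1.2, orthogonal scale 0.9, $\epsilon = 1$).

```python
import numpy as np

def fisher_memory_curve(W, v, eps=1.0, n_lags=200, n_terms=2000):
    """Return J(k) for k = 0..n_lags-1, truncating the series for C_n."""
    n = W.shape[0]
    C = np.zeros((n, n))
    P = np.eye(n)                        # P holds W^k
    for _ in range(n_terms):             # C_n = eps * sum_k W^k (W^k)^T
        C += eps * P @ P.T
        P = W @ P
    C_inv = np.linalg.inv(C)
    J, u = [], v.astype(float)           # u holds W^k v
    for _ in range(n_lags):
        J.append(u @ C_inv @ u)
        u = W @ u
    return np.array(J)

n, alpha = 20, 1.2                       # hypothetical size and chain weight
v = np.zeros(n)
v[0] = 1.0                               # signal enters at the source unit
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))

J_orth = fisher_memory_curve(0.9 * Q, v)                    # normal example
J_chain = fisher_memory_curve(alpha * np.eye(n, k=-1), v)   # non-normal chain
print(J_orth.sum(), J_chain.sum())
```

With a unit-norm $v$ and $\epsilon = 1$, the total for the scaled orthogonal network should come out close to 1, while the total for the chain is several times larger and grows with the number of units.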
The preceding analysis, due to Ganguli et al. (2008), is exact in linear networks. Analysis becomes more difficult in the presence of a non-linearity. However, we now demonstrate that the non-normal networks shown in Figure 1a have advantages that extend beyond the linear case. The advantages in the non-linear case are due to reduced interference in these non-normal networks between signals entering the network at different time points in the past. To demonstrate this, we will ignore the effect of noise and consider the effect of non-linearity on the linear decodability of past signals from the current network activity. We thus consider deterministic non-linear networks of the form:

$$h_t = f(W h_{t-1} + v s_t)$$

and ask how well we can linearly decode a signal that entered the network $k$ time steps ago, $s_{t-k}$, from the current activity of the network, $h_t$. Figure 2c compares the decoding performance in a non-linear orthogonal network with the decoding performance in the non-linear chain network. Just as in the linear case with noise (Figure 2b), the chain network outperforms the orthogonal network.

To understand intuitively why this is the case, consider a chain network with $\alpha = 1$ and $v = [1, 0, \ldots, 0]^\top$. In this model, the responses of the $N$ neurons after $N$ time steps (at $t = N$) are given by $f(s_N)$, $f(f(s_{N-1}))$, …, $f(f(\cdots f(s_1) \cdots))$, respectively, starting from the source neuron. Each neuron thus carries a distorted copy of exactly one past signal. Although the non-linearity makes perfect linear decoding of the past signal impossible, one may still imagine being able to decode the past signal with reasonable accuracy as long as $f$ is not “too non-linear”. A similar intuition holds for the chain network with feedback as well, as long as the feedforward connection weight, $\alpha$, is sufficiently stronger than the feedback connection strength, $\beta$. A condition like this must already be satisfied if the network is to maintain its optimal memory properties and also be dynamically stable at the same time (Ganguli et al., 2008).

In normal networks, however, linear decoding is further degraded by interference from signals entering the network at different time points, in addition to the degradation caused by the non-linearity. This is easiest to see in the identity network (a similar argument holds for the random orthogonal example too), where the responses of the neurons after $N$ time steps are identically given by $f(f(\cdots f(f(s_1) + s_2) \cdots) + s_N)$, if one assumes $v = [1, 1, \ldots, 1]^\top$. Linear decoding is harder in this case, because a signal is both distorted by multiple steps of non-linearity and also mixed with signals entering at other time points.
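The interference argument can be checked with a small simulation. The sketch below is an illustrative assumption rather than the experiment behind Figure 2: it uses `tanh` as the non-linearity, uniform signals in $[-1, 1]$, 10 units, a lag of 3 steps, and an in-sample least-squares readout whose $R^2$ serves as the decoding score. The function name `decode_r2` is hypothetical.

```python
import numpy as np

def decode_r2(W, v, k, f=np.tanh, T=5000, seed=0):
    """R^2 of the best linear readout of s_{t-k} from h_t, where
    h_t = f(W h_{t-1} + v s_t) and there is no noise."""
    rng = np.random.default_rng(seed)
    s = rng.uniform(-1.0, 1.0, size=T)
    h = np.zeros(W.shape[0])
    H = np.zeros((T, W.shape[0]))
    for t in range(T):
        h = f(W @ h + v * s[t])
        H[t] = h
    X = np.hstack([H[k:], np.ones((T - k, 1))])   # states plus a bias column
    y = s[:T - k]                                 # the signal k steps earlier
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1.0 - resid.var() / y.var()

n, k = 10, 3
source = np.zeros(n)
source[0] = 1.0
r2_chain = decode_r2(np.eye(n, k=-1), source, k)      # chain, input at the source
r2_identity = decode_r2(np.eye(n), np.ones(n), k)     # identity, input to all units
print(r2_chain, r2_identity)
```

In the chain, unit $k+1$ holds a repeatedly squashed copy of $s_{t-k}$ alone, so the readout score stays high; in the identity network every unit holds the same mixture of all past inputs, so the score for any single past input is much lower.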
Because assuming an a priori non-normal structure for an RNN runs the risk of being too restrictive, in this paper, we instead explore the promise of non-normal networks as initializers for RNNs. Throughout the paper, we will be primarily comparing the four RNN architectures schematically depicted in Figure 1a as initializers: two of them normal networks (identity and random orthogonal) and the other two non-normal networks (chain and chain with feedback), the last two being motivated by their optimal memory properties in the linear case, as reviewed above. We provide PyTorch and Keras classes implementing the proposed non-normal initializers at the following public repository: https://github.com/eminorhan/nonnormal-init.
Copy, addition, and permuted sequential MNIST tasks were commonly used as benchmarks in previous RNN studies (Arjovsky et al., 2016; Bai et al., 2018; Chang et al., 2017; Hochreiter & Schmidhuber, 1997; Le et al., 2015; Wisdom et al., 2016). We now briefly describe each of these tasks.
Copy task: The input is a sequence of integers of length . The first integers in the sequence define the target subsequence that is to be copied and consist of integers between and (inclusive). The next integers are set to . The integer after that is set to , which acts as the cue indicating that the model should start copying the target subsequence. The final integers are set to . The output sequence that the model is trained to reproduce consists of s followed by the target subsequence from the input that is to be copied. To make sure that the task requires a sufficiently long memory capacity, we used a large sequence length, , comparable to the largest sequence length considered in Arjovsky, et al. (2016) for the same task.
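A generator for copy-task examples can be sketched as follows. The concrete values used here (a 10-symbol target drawn from the integers 1 to 8, blank symbol 0, cue symbol 9, total length 500) follow the convention commonly used for this benchmark and are assumptions for illustration, as is the function name `make_copy_example`.

```python
import numpy as np

def make_copy_example(T=500, n_copy=10, n_symbols=8, rng=None):
    """Return (input, target) integer sequences of length T (assumed layout)."""
    rng = np.random.default_rng() if rng is None else rng
    blank, cue = 0, n_symbols + 1
    target = rng.integers(1, n_symbols + 1, size=n_copy)   # symbols 1..n_symbols
    x = np.full(T, blank)
    x[:n_copy] = target              # subsequence to be copied
    x[T - n_copy - 1] = cue          # cue just before the final n_copy blanks
    y = np.full(T, blank)
    y[T - n_copy:] = target          # model must reproduce the target at the end
    return x, y

x, y = make_copy_example(T=50, rng=np.random.default_rng(0))
print(x)
print(y)
```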
Addition task: The input consists of two sequences of length $T$. The first one is a sequence of random numbers drawn uniformly from the interval $[0, 1]$. The second sequence is an indicator sequence with $1$s at exactly two positions and $0$s everywhere else. The positions of the two $1$s indicate the positions of the numbers to be added in the first sequence. The target output is the sum of the two corresponding numbers. The position of the first $1$ is drawn uniformly from the first half of the sequence and the position of the second $1$ is drawn uniformly from the second half of the sequence. Again, to ensure that the task requires a sufficiently long memory capacity, we chose $T = 750$, which is the same as the largest sequence length considered in Arjovsky et al. (2016) for the same task.
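The addition task can be generated along the following lines. This is a minimal sketch under the description above; the function name `make_addition_example` and the choice to stack the two sequences as a two-channel input are assumptions.

```python
import numpy as np

def make_addition_example(T=750, rng=None):
    """Return an input of shape (T, 2) and the scalar target sum."""
    rng = np.random.default_rng() if rng is None else rng
    values = rng.uniform(0.0, 1.0, size=T)
    markers = np.zeros(T)
    i = rng.integers(0, T // 2)      # first marker: first half of the sequence
    j = rng.integers(T // 2, T)      # second marker: second half of the sequence
    markers[[i, j]] = 1.0
    x = np.stack([values, markers], axis=1)
    y = values[i] + values[j]
    return x, y

x, y = make_addition_example(T=20, rng=np.random.default_rng(1))
print(x.shape, y)
```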
Permuted sequential MNIST (psMNIST): This is a sequential version of the standard MNIST benchmark where the pixels are fed to the model one pixel at a time. To make the task hard enough, we used the permuted version of the sequential MNIST task where a fixed random permutation is applied to the pixels to eliminate any spatial structure before they are fed into the model.
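The pixel permutation can be sketched as below. A random array stands in for an MNIST digit so that the example has no dataset dependency; the seed and the helper name `to_permuted_sequence` are assumptions. The key point is that the permutation is drawn once and reused for every image.

```python
import numpy as np

rng = np.random.default_rng(0)
perm = rng.permutation(28 * 28)          # drawn once, shared by all images

def to_permuted_sequence(image, perm=perm):
    """Flatten a 28x28 image and reorder its pixels; returns shape (784, 1)."""
    return image.reshape(-1)[perm][:, None]

image = rng.random((28, 28))             # stand-in for an MNIST digit
seq = to_permuted_sequence(image)        # fed to the RNN one pixel per step
print(seq.shape)
```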
We used the elu nonlinearity for the copy and the permuted sequential MNIST tasks (Clevert et al., 2016), and the relu nonlinearity for the addition problem (because relu proved to be more natural for remembering positive numbers).
As mentioned above, the scaled identity and the scaled random orthogonal networks constituted the normal initializers. In the scaled identity initializer, the recurrent connectivity matrix was initialized as $W = \lambda I$, where $\lambda$ is the scale, and the input matrix was initialized as . In the random orthogonal initializer, the recurrent connectivity matrix was initialized as $W = \lambda Q$, where $Q$ is a random dense orthogonal matrix, and the input matrix was initialized in the same way as in the identity initializer.
The feedforward chain and the chain with feedback networks constituted the non-normal initializers. In the chain initializer, the recurrent connectivity matrix was initialized as $W_{ij} = \alpha \delta_{j,i-1}$, where $\alpha$ is the feedforward connection weight, and the input matrix was initialized as , where denotes the -dimensional identity matrix. In the chain with feedback initializer, the recurrent connectivity matrix was initialized as and the input matrix was initialized in the same way as in the chain initializer.
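The four recurrent initializers can be written compactly in NumPy. The sketch below is not the implementation from the authors' repository: the function names, default weights, and the sign correction in the orthogonal sampler are assumptions, and the input matrices are omitted. It follows the convention that `W[i, j]` is the connection from unit `j` to unit `i`.

```python
import numpy as np

def identity_init(n, scale=1.0):
    """Scaled identity: W = scale * I."""
    return scale * np.eye(n)

def orthogonal_init(n, scale=1.0, rng=None):
    """Scaled random dense orthogonal matrix: W = scale * Q."""
    rng = np.random.default_rng() if rng is None else rng
    Q, R = np.linalg.qr(rng.standard_normal((n, n)))
    Q = Q * np.sign(np.diag(R))      # sign fix so that Q is uniformly distributed
    return scale * Q

def chain_init(n, alpha=1.0):
    """Feedforward chain: W[i, i-1] = alpha, zero elsewhere."""
    return alpha * np.eye(n, k=-1)

def fb_chain_init(n, alpha=1.0, beta=0.1):
    """Chain with feedback: W[i, i-1] = alpha and W[i, i+1] = beta."""
    return alpha * np.eye(n, k=-1) + beta * np.eye(n, k=1)

W_chain = chain_init(5, alpha=1.02)
W_fb = fb_chain_init(5, alpha=1.0, beta=0.2)
W_orth = orthogonal_init(5, scale=0.9, rng=np.random.default_rng(0))
print(W_chain)
```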
We used the rmsprop optimizer for all models, which we found to be the best method for this set of tasks. The learning rate of the optimizer was a hyperparameter which we tuned separately for each model and each task. The following learning rates were considered in the hyper-parameter search:. We ran each model on each task times using the integers from to as random seeds.
In addition, the following model-specific hyperparameters were searched over for each task:
Chain model: the feedforward connection weight,
Chain with feedback model: the feedback connection weight,
Scaled identity model: the scale,
Random orthogonal model: the scale,
This yields a total of different runs for each experiment in the non-normal models and a total of different runs in the normal models. Note that we ran more extensive hyper-parameter searches for the normal models than for the non-normal models in this set of tasks.
Figure 3a-c shows the validation losses for each model with the best hyper-parameter settings. The non-normal initializers generally outperform the normal initializers. Figure 3d-f shows for each model the number of “successful” runs that converged to a validation loss below a criterion level (which we set to be 50% of the loss for a baseline random model). The chain model outperformed all other models by this measure (despite having a smaller total number of runs than the normal models). In the copy task, for example, none of the runs for the normal models was able to achieve the criterion level, whereas 46 out of 462 runs for the chain model and 11 out of 462 runs for the feedback chain model reached the criterion loss.
To investigate if the benefits of non-normal initializers extend to more realistic problems, we conducted experiments with three standard language modeling tasks: word-level Penn Treebank (PTB), character-level PTB, and character-level enwik8 benchmarks.
For the language modeling experiments in this subsection, we used the code base provided by Salesforce Research (Merity et al., 2018a, b): https://github.com/salesforce/awd-lstm-lm. We refer the reader to Merity et al. (2018a; 2018b) for a more detailed description of the benchmarks. For the experiments in this subsection, we generally preserved the model setup used in Merity et al. (2018a; 2018b), except for the following differences: 1) We replaced the gated RNN architectures (LSTMs and QRNNs) used in Merity et al. (2018a; 2018b) with vanilla RNNs; 2) We observed that vanilla RNNs require weaker regularization than gated RNN architectures. Therefore, in the word-level PTB task, we set all dropout rates to . In the character-level PTB task, all dropout rates except dropoute were set to , which was set to . In the enwik8 benchmark, all dropout rates were set to ; 3) We trained the word-level PTB models for 60 epochs, the character-level PTB models for 500 epochs and the enwik8 models for 35 epochs.
We compared the same four models described in the previous subsection. As in Merity et al. (2018a), we used the Adam optimizer and thus only optimized the , , hyper-parameters for the experiments in this subsection. For the hyper-parameter in the chain model and the hyper-parameter in the scaled identity and random orthogonal models, we searched over values uniformly spaced between and (inclusive); whereas for the chain with feedback model, we set the feedforward connection weight, , to the optimal value it had in the chain model and searched over values uniformly spaced between and (inclusive). In addition, we repeated each experiment 3 times using different random seeds, yielding a total of 63 runs for each model and each benchmark.
The results are shown in Figure 4 and in Table 1. Figure 4 shows the validation loss over the course of training in units of bits per character (bpc). Table 1 reports the test losses at the end of training. The non-normal models outperform the normal models on the word-level and character-level PTB benchmarks. The differences between the models are less clear on the enwik8 benchmark. However, in terms of the test loss, the non-normal feedback chain model significantly outperforms the other models on all three benchmarks (Table 1).
| Model | PTB word | PTB char. | enwik8 |
| Identity | 6.550 ± 0.002 | 1.312 ± 0.000 | 1.783 ± 0.003 |
| Ortho. | 6.557 ± 0.002 | 1.312 ± 0.001 | 1.843 ± 0.046 |
| Chain | 6.514 ± 0.001 | 1.308 ± 0.000 | 1.803 ± 0.017 |
| Fb. chain | 6.510 ± 0.001 | 1.307 ± 0.000 | 1.774 ± 0.002 |
We note that the vanilla RNN models perform significantly worse than the gated RNN architectures considered in Merity et al. (2018a; 2018b). We conjecture that this is because gated architectures are generally better at modeling contextual dependencies, hence they have inductive biases better suited to language modeling tasks. The primary benefit of non-normal dynamics, on the other hand, is enabling a longer memory capacity. Below, we will discuss whether non-normal dynamics can be used in gated RNN architectures to improve performance as well.
Next, we conducted experiments with an RL agent trained in the car racing environment CarRacing-v0 in OpenAI Gym. Specifically, we used the model introduced in Ha & Schmidhuber (2018) for this environment. For the experiments reported in this subsection, we also used the code base provided by the authors: https://github.com/hardmaru/WorldModelsExperiments. Briefly, in this model, the agent first collects a large number of roll-outs from the environment using a random policy. These random roll-outs are then used as training data for a variational auto-encoder (VAE), learning a compact, low-dimensional representation, $z$, of the agent’s high-dimensional observations. Then, a predictive model of this latent representation is learned via an RNN. More specifically, at each time step, the RNN takes as input the current action of the agent, $a_t$, and the current latent state of the environment, $z_t$, and predicts the next latent state, $z_{t+1}$. Using an RNN as a predictive model enables the agent to learn potentially complex dependencies between the histories of the agent’s actions and of the state of the environment. In the final step, using the hidden state of the predictive RNN model and the latent state of the environment, $z_t$, a simple linear controller is trained to perform the actual car racing task. Ha & Schmidhuber (2018) train the predictive RNN model and the controller separately (i.e. the entire model is not trained end-to-end), thus we only consider the training of the RNN in our experiments and ignore the training of the controller. Accordingly, the loss values reported below are the validation losses (i.e. negative log-likelihoods) for the predictive model only. For further details, we refer the reader to Ha & Schmidhuber (2018). We essentially use the same set-up that they use except for a few differences: 1) We replace the LSTM with a vanilla RNN (with the same number of units) as the predictive model; 2) We use a smaller number of random roll-outs (300 vs. 10000); 3) We use the Adam optimizer with a learning rate of 0.0005, instead of the rmsprop optimizer.
For the experiments in this subsection, we only compared RNNs initialized with a scaled identity matrix with RNNs initialized with a chain structure. The hyper-parameter searches conducted were identical to the searches described above for the language modeling experiments. Table 2 shows the results. The chain model outperformed the identity model in terms of the final validation loss for the predictive model.
We observed that training made vanilla RNNs initialized with orthogonal recurrent connectivity matrices non-normal. We quantified the non-normality of the trained recurrent connectivity matrices using a measure introduced by Henrici (1962): $d(W) \equiv \sqrt{\|W\|_F^2 - \sum_i |\lambda_i|^2}$, where $\|\cdot\|_F$ denotes the Frobenius norm and $\lambda_i$ is the $i$-th eigenvalue of $W$. This measure equals $0$ for all normal matrices and is positive for non-normal matrices. We found that $d(W)$ became positive for all successfully trained RNNs initialized with orthogonal recurrent connectivity matrices. Table 3 reports the aggregate statistics of $d(W)$ for orthogonally initialized RNNs trained on the toy benchmarks.
| Addition-750 | 2.33 ± 1.02 | 2.74 ± 0.07 |
| psMNIST | 1.01 ± 0.12 | 2.72 ± 0.08 |
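Henrici's departure from normality is straightforward to compute from the eigenvalues. The sketch below assumes the unnormalized form $\sqrt{\|W\|_F^2 - \sum_i |\lambda_i|^2}$; the function name `henrici_index` and the 6-unit test matrices are illustrative assumptions.

```python
import numpy as np

def henrici_index(W):
    """sqrt(||W||_F^2 - sum_i |lambda_i|^2); zero exactly for normal W."""
    eigvals = np.linalg.eigvals(W)
    gap = np.linalg.norm(W, "fro") ** 2 - np.sum(np.abs(eigvals) ** 2)
    return np.sqrt(max(gap, 0.0))    # clip tiny negative round-off

n = 6
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
d_orth = henrici_index(Q)                    # orthogonal: numerically zero
d_chain = henrici_index(np.eye(n, k=-1))     # chain: all eigenvalues vanish
print(d_orth, d_chain)
```

For a pure chain with unit weights all eigenvalues are zero, so the index reduces to the Frobenius norm, $\sqrt{N-1}$.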
Although increased non-normality in trained RNNs is an interesting observation, the Henrici index, by itself, does not tell us what structural features in trained RNNs contribute to this increased non-normality. Given the benefits of chain-like feedforward non-normal structures in RNNs for improved memory, we hypothesized that training might have installed hidden chain-like feedforward structures in trained RNNs and that these feedforward structures were responsible for their increased non-normality.
To uncover these hidden feedforward structures, we performed an analysis suggested by Rajan et al. (2016). In this analysis, we first injected a unit pulse of input to the network at the beginning of the trial and let the network evolve for time steps afterwards according to its recurrent dynamics with no direct input. We then ordered the recurrent units by the time of their peak activity (using a small amount of jitter to break potential ties between units) and plotted the mean recurrent connection weights, $w_{ij}$, as a function of the order difference between two units, $i - j$. Positive $i - j$ values correspond to connections from earlier peaking units to later peaking units, and vice versa for negative $i - j$ values. In trained RNNs, the mean recurrent weight profile as a function of $i - j$ had an asymmetric peak, with connections in the “forward” direction being, on average, stronger than those in the opposite direction. Figure 5 shows examples with orthogonally initialized RNNs trained on the addition and the permuted sequential MNIST tasks. Note that for a purely feedforward chain, the weight profile would have a single peak at $i - j = 1$ and would be zero elsewhere. Although the weight profiles for trained RNNs are not this extreme, the prominent asymmetric bump with a peak at a positive $i - j$ value indicates a hidden chain-like feedforward structure in these networks.
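The peak-ordering analysis can be sketched as follows. This is an illustrative reimplementation of the procedure as described, not the authors' analysis code: the `tanh` non-linearity, the 50 free-running steps, the jitter scale, and the function name `weight_profile` are all assumptions. The example hides a chain behind a random relabeling of the units and recovers it.

```python
import numpy as np

def weight_profile(W, v, f=np.tanh, n_steps=50, jitter=1e-3, seed=0):
    """Mean recurrent weight as a function of the peak-order difference i - j."""
    rng = np.random.default_rng(seed)
    n = W.shape[0]
    h = f(v * 1.0)                       # unit input pulse at the first step
    acts = [h]
    for _ in range(n_steps):             # free evolution with no input
        h = f(W @ h)
        acts.append(h)
    acts = np.array(acts)                # shape (n_steps + 1, n)
    peak = acts.argmax(axis=0) + jitter * rng.standard_normal(n)
    order = np.argsort(peak)             # units sorted by time of peak activity
    Ws = W[np.ix_(order, order)]         # Ws[a, b]: from b-th to a-th unit in that order
    diffs = np.arange(-(n - 1), n)
    profile = np.array([np.diagonal(Ws, offset=-d).mean() for d in diffs])
    return diffs, profile

# A chain hidden by a random relabeling of the units.
n = 8
p = np.random.default_rng(1).permutation(n)
W_hidden = np.zeros((n, n))
W_hidden[np.ix_(p, p)] = np.eye(n, k=-1)
v_hidden = np.zeros(n)
v_hidden[p[0]] = 1.0                     # pulse enters at the chain's source unit
diffs, profile = weight_profile(W_hidden, v_hidden)
print(diffs[profile.argmax()])
```

For this hidden chain the profile is 1 at an order difference of 1 and zero everywhere else, which is the extreme case that the asymmetric bump in trained RNNs is compared against.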
So far, we have only considered vanilla RNNs. An important question is whether the benefits of non-normal dynamics demonstrated above for vanilla RNNs also extend to gated RNN architectures like LSTMs or GRUs (Hochreiter & Schmidhuber, 1997; Cho et al., 2014). Gated RNN architectures have better inductive biases than vanilla RNNs in many practical tasks of interest such as language modeling (e.g. see Table 1 for a comparison of vanilla RNN architectures with an LSTM architecture of similar size in the language modeling benchmarks), thus it would be practically very useful if their performance could be improved through an inductive bias for non-normal dynamics.
| Model | PTB word | PTB char. | enwik8 |
| Ortho. | 5.937 ± 0.002 | 1.230 ± 0.001 | 1.583 ± 0.001 |
| Chain | 5.935 ± 0.001 | 1.230 ± 0.001 | 1.586 ± 0.000 |
| Plain | 5.949 ± 0.007 | 1.245 ± 0.001 | 1.584 ± 0.002 |
| Mixed | 5.944 ± 0.004 | 1.227 ± 0.000 | 1.577 ± 0.001 |
To address this question, we treated the input, forget, output and update gates of the LSTM architecture as analogous to vanilla RNNs and initialized the recurrent and input matrices inside these gates in the same way as in the chain or the orthogonal initialization of vanilla RNNs above. We also compared these with a more standard initialization scheme where all the weights were drawn from a uniform distribution $\mathcal{U}(-\sqrt{k}, \sqrt{k})$, where $k$ is the reciprocal of the hidden layer size (labeled plain in Table 4). This is the default initializer for the LSTM weight matrices in PyTorch: https://pytorch.org/docs/stable/nn.html#lstm.
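A per-gate version of these initializers can be sketched in NumPy as follows. The sketch only builds the stacked recurrent matrix, of shape `(4 * hidden_size, hidden_size)` with one block per gate, which is the layout PyTorch uses for `weight_hh_l0`; the function name `lstm_recurrent_init`, the `scheme` argument, and the default chain weight are assumptions, and the input matrices are omitted.

```python
import numpy as np

def lstm_recurrent_init(hidden_size, scheme="chain", alpha=1.0, rng=None):
    """Stack one recurrent block per gate (input, forget, update, output)."""
    rng = np.random.default_rng() if rng is None else rng
    blocks = []
    for _ in range(4):
        if scheme == "chain":
            block = alpha * np.eye(hidden_size, k=-1)        # unit i-1 feeds unit i
        elif scheme == "plain":
            bound = np.sqrt(1.0 / hidden_size)               # U(-sqrt(k), sqrt(k))
            block = rng.uniform(-bound, bound, size=(hidden_size, hidden_size))
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        blocks.append(block)
    return np.concatenate(blocks, axis=0)

W_chain_lstm = lstm_recurrent_init(6, scheme="chain", alpha=0.9)
W_plain_lstm = lstm_recurrent_init(6, scheme="plain", rng=np.random.default_rng(0))
print(W_chain_lstm.shape, W_plain_lstm.shape)
```

To use such a matrix with a PyTorch LSTM, one would copy it into the layer's recurrent weight tensor after construction; that step is left out here to keep the sketch free of framework dependencies.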
We compared these initializers in the language modeling benchmarks. The chain initializer did not perform better than the orthogonal initializer (Table 4), suggesting that non-normal dynamics in gated RNN architectures may not be as helpful as it is in vanilla RNNs. In hindsight, this is not too surprising, because our initial motivation for introducing non-normal dynamics heavily relied on the vanilla RNN architecture and gated RNNs can be dynamically very different from vanilla RNNs.
When we looked at the trained LSTM weight matrices more closely, we found that, although still non-normal, the recurrent weight matrices inside the input, forget, and output gates (i.e. the sigmoid gates) did not have the same signatures of hidden chain-like feedforward structures observed in vanilla RNNs. Specifically, the weight profiles in the LSTM recurrent weight matrices inside these three gates did not display the asymmetric bump characteristic of a prominent chain-like feedforward structure, but were instead monotonic functions of the order difference between units, $i - j$ (Figure 6a-c), suggesting a qualitatively different kind of dynamics where the individual units are more persistent over time. The recurrent weight matrix inside the update gate (the tanh gate), on the other hand, did display the signature of a hidden chain-like feedforward structure (Figure 6d). When we incorporated these two different structures in different gates of the LSTMs, by using a chain initializer for the update gate and a monotonically increasing recurrent weight profile for the other gates (labeled mixed in Table 4), the resulting initializer outperformed the other initializers on the character-level PTB and enwik8 benchmarks.
Motivated by their optimal memory properties in a simplified linear setting (Ganguli et al., 2008), in this paper, we investigated the potential benefits of certain highly non-normal chain-like RNN architectures in capturing long-term dependencies in sequential tasks. Our results clearly demonstrate an advantage for such non-normal architectures as initializers for vanilla RNNs, compared to the commonly used orthogonal initializers. We further found evidence for the induction of such chain-like feedforward structures in trained vanilla RNNs even when these RNNs are initialized with orthogonal recurrent connectivity matrices.
The benefits of these chain-like non-normal initializers do not directly carry over to more complex, gated RNN architectures such as LSTMs and GRUs. In some important practical problems such as language modeling, the gains from using these kinds of gated architectures seem to far outweigh the gains obtained from the non-normal initializers in vanilla RNNs (see Table 1). However, we also uncovered important regularities in trained LSTM weight matrices, namely that the recurrent weight profiles of the input, forget, and output gates (the sigmoid gates) in trained LSTMs display a monotonically increasing pattern, whereas the recurrent matrix inside the update gate (the tanh gate) displays a chain-like feedforward structure similar to that observed in vanilla RNNs (Figure 6). We showed that these regularities can be exploited to improve the training and/or generalization performance of these gated RNN architectures by introducing them as useful inductive biases to these models.
There is a close connection between the identity initialization of RNNs (Le et al., 2015) and the widely used identity skip connections (or residual connections) in deep feedforward networks (He et al., 2016). Given the superior performance of chain-like non-normal initializers over the identity initialization demonstrated in the context of vanilla RNNs in this paper, it could be interesting to look for similar chain-like non-normal architectural motifs that could be used in deep feedforward networks in place of the identity skip connections.
Proceedings of the 33rd International Conference on Machine Learning, 2016.
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734, 2014.