Improved memory in recurrent neural networks with sequential non-normal dynamics

05/31/2019 · by A. Emin Orhan, et al.

Training recurrent neural networks (RNNs) is a hard problem due to degeneracies in the optimization landscape, a problem also known as the vanishing/exploding gradients problem. Short of designing new RNN architectures, various methods for dealing with this problem that have been previously proposed usually boil down to orthogonalization of the recurrent dynamics, either at initialization or during the entire training period. The basic motivation behind these methods is that orthogonal transformations are isometries of the Euclidean space, hence they preserve (Euclidean) norms and effectively deal with the vanishing/exploding gradients problem. However, this idea ignores the crucial effects of non-linearity and noise. In the presence of a non-linearity, orthogonal transformations no longer preserve norms, suggesting that alternative transformations might be better suited to non-linear networks. Moreover, in the presence of noise, norm preservation itself ceases to be the ideal objective. A more sensible objective is maximizing the signal-to-noise ratio (SNR) of the propagated signal instead. Previous work has shown that in the linear case, recurrent networks that maximize the SNR display strongly non-normal dynamics and orthogonal networks are highly suboptimal by this measure. Motivated by this finding, in this paper, we investigate the potential of non-normal RNNs, i.e. RNNs with a non-normal recurrent connectivity matrix, in sequential processing tasks. Our experimental results show that non-normal RNNs significantly outperform their orthogonal counterparts in a diverse range of benchmarks. We also find evidence for increased non-normality and hidden chain-like feedforward structures in trained RNNs initialized with orthogonal recurrent connectivity matrices.


1 Introduction

Modeling long-term dependencies with recurrent neural networks (RNNs) is a hard problem due to degeneracies inherent in the optimization landscapes of these models, a problem also known as the vanishing/exploding gradients problem (Hochreiter, 1991; Bengio et al., 1994). One approach to addressing this problem has been designing new RNN architectures that are less prone to such difficulties, hence are better able to capture long-term dependencies in sequential data (Hochreiter & Schmidhuber, 1997; Cho et al., 2014; Chang et al., 2017; Bai et al., 2018). An alternative approach is to stick with the basic vanilla RNN architecture instead, but to constrain its dynamics in some way so as to eliminate or reduce the degeneracies that otherwise afflict the optimization landscape. Previous proposals belonging to this second category generally boil down to orthogonalization of the recurrent dynamics, either at initialization or during the entire training period (Le et al., 2015; Arjovsky et al., 2016; Wisdom et al., 2016). The basic idea behind these methods is that orthogonal transformations are isometries of the Euclidean space, hence they preserve distances and norms, which enables them to deal effectively with the vanishing/exploding gradients problem.

However, this idea ignores the crucial effects of non-linearity and noise. Orthogonal transformations no longer preserve distances and norms in the presence of a non-linearity, suggesting that alternative transformations might be better suited to non-linear networks. Similarly, in the presence of noise, norm preservation itself ceases to be the ideal objective. One must instead maximize the signal-to-noise ratio (SNR) of the propagated signal. In neural networks, noise comes in both through the stochasticity of the stochastic gradient descent (SGD) algorithm and sometimes also through direct noise injection for regularization purposes, as in dropout. Previous work has shown that even in the linear case, recurrent networks that maximize the SNR display strongly non-normal dynamics and orthogonal networks are highly suboptimal by this measure (Ganguli et al., 2008).

Motivated by these observations, in this paper, we investigate the potential of non-normal RNNs, i.e. RNNs with a non-normal recurrent connectivity matrix, in sequential processing tasks. Recall that a normal matrix is a matrix with an orthonormal set of eigenvectors, whereas a non-normal matrix does not have an orthonormal set of eigenvectors. This property allows non-normal systems to display interesting transient behaviors that are not available in normal systems. This kind of transient behavior, specifically a particular kind of transient amplification of the signal in certain non-normal systems, underlies their superior memory properties (Ganguli et al., 2008), as will be discussed further below.

Our empirical results show that non-normal vanilla RNNs significantly outperform their orthogonal counterparts in a diverse range of benchmarks.

2 Results

2.1 Memory in linear recurrent networks with noise

Ganguli et al. (2008) studied the memory properties of linear recurrent networks injected with a scalar temporal signal $s_t$ and noise $z_t$:

$h_t = W h_{t-1} + v s_t + z_t \qquad (1)$

The noise is assumed to be iid Gaussian, $z_t \sim \mathcal{N}(0, \varepsilon I)$. Ganguli et al. (2008) then analyzed the Fisher memory matrix (FMM) of this system, defined as:

$J_{k,l}(s) \equiv \Big\langle -\frac{\partial^2 \log p(h_t \mid s)}{\partial s_{t-k}\, \partial s_{t-l}} \Big\rangle_{p(h_t \mid s)} \qquad (2)$

For linear networks with Gaussian noise, it is easy to show that $J_{k,l}$ is, in fact, independent of the past signal history $s$. Ganguli et al. (2008) specifically analyzed the diagonal of the FMM, $J(k) \equiv J_{k,k}$, which can be written explicitly as:

$J(k) = v^\top (W^k)^\top C^{-1} W^k v \qquad (3)$

where $C = \varepsilon \sum_{l=0}^{\infty} W^l (W^l)^\top$ is the noise covariance matrix of the network state, and the norm of $W^k v$ can be roughly thought of as representing the signal strength. The total Fisher memory is the sum of $J(k)$ over all past time steps $k$:

$J_{\mathrm{tot}} = \sum_{k=0}^{\infty} J(k) \qquad (4)$

Intuitively, $J(k)$ measures the information contained in the current state of the system, $h_t$, about a signal that entered the system $k$ time steps ago, $s_{t-k}$. $J_{\mathrm{tot}}$ is then a measure of the total information contained in the current state of the system about the entire past signal history.

Figure 1: a Schematic diagrams of different recurrent networks and the corresponding recurrent connectivity matrices (upper panel). b Memory curves, $J(k)$ (Equation 3), for the four recurrent networks shown in a. The non-normal networks, chain and chain with feedback, have extensive memory capacity, $J_{\mathrm{tot}} = O(N)$, whereas the normal networks, identity and random orthogonal, have $J_{\mathrm{tot}} \leq 1$. c Extensive memory is made possible in non-normal networks by transient amplification: the signal is amplified for a time of order $N$ before it dies out, abruptly in the case of the chain network and more gradually in the case of the chain network with feedback. In b and c, the network size is the same for all four networks.

The main result in Ganguli et al. (2008) shows that $J_{\mathrm{tot}} \leq 1$ for all normal matrices (including all orthogonal matrices), whereas in general $J_{\mathrm{tot}} \leq N$, where $N$ is the network size. Remarkably, the memory upper bound can be achieved by certain highly non-normal systems, and several examples are explicitly given in Ganguli et al. (2008). Two of those examples are illustrated in Figure 1a (right): a uni-directional “chain” network and a chain network with feedback. In the chain network, the recurrent connectivity is given by $W_{ij} = \alpha\, \delta_{i,j+1}$, and in the chain with feedback network it is given by $W_{ij} = \alpha\, \delta_{i,j+1} + \beta\, \delta_{i,j-1}$, where $\alpha$ and $\beta$ are the feedforward and feedback connection weights, respectively, and $\delta$ is the Kronecker delta function. In addition, in order to achieve optimal memory, the signal must be fed in at the source neuron of these networks, i.e. $v = (1, 0, \ldots, 0)^\top$.

Figure 1b compares the Fisher memory curves, $J(k)$, of these non-normal networks with the Fisher memory curves of two example normal networks, namely recurrent networks with identity or random orthogonal connectivity matrices. The two non-normal networks have extensive memory capacity, i.e. $J_{\mathrm{tot}} = O(N)$, whereas for the normal examples $J_{\mathrm{tot}} \leq 1$. The crucial property that enables extensive memory in non-normal networks is transient amplification: after the signal enters the network, it is amplified supralinearly for a time of order $N$ before it eventually dies out (Figure 1c). This kind of transient amplification is not possible in normal networks.
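
To make these quantities concrete, the sketch below computes the Fisher memory curve $J(k)$ of Equation 3 numerically for a chain network and a scaled random orthogonal network. It is an illustration of the theory, not the authors' code; the network size, the feedforward weight $\alpha$, the orthogonal scale, and the choice of input vectors are assumptions made for this example.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def fisher_memory_curve(W, v, eps=1.0, k_max=200):
    """J(k) = v^T (W^k)^T C^{-1} W^k v (Equation 3), with the stationary noise covariance
    C = eps * sum_l W^l (W^l)^T obtained by solving the discrete Lyapunov equation
    C = W C W^T + eps * I. Requires the spectral radius of W to be below 1."""
    n = W.shape[0]
    C = solve_discrete_lyapunov(W, eps * np.eye(n))
    C_inv = np.linalg.inv(C)
    J, Wk_v = [], v.astype(float).copy()
    for _ in range(k_max):
        J.append(float(Wk_v @ C_inv @ Wk_v))
        Wk_v = W @ Wk_v
    return np.array(J)

N = 100
# Chain network: W_ij = alpha * delta_{i,j+1}, with the signal fed at the source neuron.
alpha = 1.25                                  # assumed value; any alpha > 1 amplifies the signal
W_chain = alpha * np.eye(N, k=-1)
v_chain = np.zeros(N); v_chain[0] = 1.0

# Random orthogonal network, scaled slightly below 1 so the noise covariance is finite.
Q, _ = np.linalg.qr(np.random.default_rng(0).standard_normal((N, N)))
W_orth = 0.99 * Q
v_orth = np.ones(N) / np.sqrt(N)

J_chain = fisher_memory_curve(W_chain, v_chain)
J_orth = fisher_memory_curve(W_orth, v_orth)
print(J_chain.sum())   # grows in proportion to N (extensive memory)
print(J_orth.sum())    # stays at or below 1, as for any normal matrix
```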

2.2 A toy non-linear example: Non-linearity and noise induce similar effects

The preceding analysis, due to Ganguli et al. (2008), is exact in linear networks. Analysis becomes more difficult in the presence of a non-linearity. However, we now demonstrate that the non-normal networks shown in Figure 1a have advantages that extend beyond the linear case. These advantages are due to reduced interference between signals entering the network at different time points in the past. To demonstrate this, we ignore the effect of noise and consider the effect of non-linearity on the linear decodability of past signals from the current network activity. We thus consider deterministic non-linear networks of the form:

$h_t = f(W h_{t-1} + v s_t) \qquad (5)$

and ask how well we can linearly decode a signal that entered the network $k$ time steps ago, $s_{t-k}$, from the current activity of the network, $h_t$. Figure 2c compares the decoding performance in a non-linear orthogonal network with the decoding performance in the non-linear chain network. Just as in the linear case with noise (Figure 2b), the chain network outperforms the orthogonal network.

To understand intuitively why this is the case, consider a chain network with $\alpha = 1$ and $v = (1, 0, \ldots, 0)^\top$. In this model, the responses of the neurons after $t$ time steps are given by $f(s_t), f(f(s_{t-1})), \ldots, f^{(t)}(s_1)$, respectively, starting from the source neuron. Although the non-linearity makes perfect linear decoding of the past signal impossible, one may still imagine being able to decode the past signal with reasonable accuracy as long as $f$ is not “too non-linear”. A similar intuition holds for the chain network with feedback as well, as long as the feedforward connection weight, $\alpha$, is sufficiently stronger than the feedback connection weight, $\beta$. A condition like this must already be satisfied if the network is to maintain its optimal memory properties and also be dynamically stable at the same time (Ganguli et al., 2008).

In normal networks, however, linear decoding is further degraded by interference between signals entering the network at different time points, in addition to the degradation caused by the non-linearity. This is easiest to see in the identity network (a similar argument holds for the random orthogonal example too), where the responses of the neurons after $t$ time steps are identically given by $f(s_t + f(s_{t-1} + f(s_{t-2} + \cdots)))$, assuming identical unit input weights to all neurons. Linear decoding is harder in this case, because a signal is both distorted by multiple applications of the non-linearity and also mixed with signals entering at other time points.
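
A minimal version of this decoding experiment is sketched below: it simulates the deterministic non-linear dynamics of Equation 5 with a white-noise input signal and fits a least-squares decoder for $s_{t-k}$. The network size, delay $k$, simulation length, and the use of in-sample $R^2$ are simplifying assumptions for illustration; the exact settings behind Figure 2 may differ.

```python
import numpy as np

def elu(x):
    # Exponential linear unit, the non-linearity used for f in these simulations.
    return np.where(x > 0, x, np.exp(np.minimum(x, 0)) - 1.0)

def decoding_r2(W, v, k=20, T=5000, seed=0):
    """Simulate h_t = f(W h_{t-1} + v s_t) (Equation 5) and measure how well s_{t-k}
    can be read out from h_t with a linear least-squares decoder (R^2, higher is better)."""
    rng = np.random.default_rng(seed)
    s = rng.standard_normal(T)
    h = np.zeros(W.shape[0])
    H = np.zeros((T, W.shape[0]))
    for t in range(T):
        h = elu(W @ h + v * s[t])
        H[t] = h
    X, y = H[k:], s[:T - k]                    # pair h_t with the signal from k steps earlier
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1.0 - np.var(y - X @ coef) / np.var(y)

N = 100
W_chain = np.eye(N, k=-1)                      # unit-weight chain
v_chain = np.zeros(N); v_chain[0] = 1.0        # signal fed at the source neuron
Q, _ = np.linalg.qr(np.random.default_rng(1).standard_normal((N, N)))
v_orth = np.ones(N) / np.sqrt(N)

print("chain R^2:     ", decoding_r2(W_chain, v_chain))
print("orthogonal R^2:", decoding_r2(Q, v_orth))
```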

Figure 2: Linear decoding experiments. a In a linear network with no noise, the past signal $s_{t-k}$ can be perfectly reconstructed from the current activity vector $h_t$ using a linear decoder. b When noise is added, the chain network outperforms the orthogonal network, as predicted by the theory in Ganguli et al. (2008). c In a completely deterministic system, introducing a non-linearity has a similar effect to that of noise: the chain network again outperforms the orthogonal one when the signal is reconstructed with a linear decoder. As discussed further in the text, this is because the signal is subject to more interference in the orthogonal network than in the chain network. All simulations in this figure used networks with the same number of recurrent units. In c, we used the elu non-linearity for $f$ (Clevert et al., 2016). For the chain network, we assume that the signal is always fed at the source neuron.

2.3 Experiments

Because assuming an a priori non-normal structure for an RNN runs the risk of being too restrictive, in this paper we instead explore the promise of non-normal networks as initializers for RNNs. Throughout the paper, we will primarily compare the four RNN architectures schematically depicted in Figure 1a as initializers: two of them normal networks (identity and random orthogonal) and the other two non-normal networks (chain and chain with feedback), the last two being motivated by their optimal memory properties in the linear case, as reviewed above. We provide PyTorch and Keras classes implementing the proposed non-normal initializers at the following public repository: https://github.com/eminorhan/nonnormal-init.

2.3.1 Copy, addition, permuted sequential MNIST

The copy, addition, and permuted sequential MNIST tasks have been commonly used as benchmarks in previous RNN studies (Arjovsky et al., 2016; Bai et al., 2018; Chang et al., 2017; Hochreiter & Schmidhuber, 1997; Le et al., 2015; Wisdom et al., 2016). We briefly describe each of these tasks below.

Copy task: The input is a sequence of integers of length $T + 20$. The first 10 integers in the sequence define the target subsequence that is to be copied and consist of integers between 1 and 8 (inclusive). The next $T - 1$ integers are set to 0. The integer after that is set to 9, which acts as the cue indicating that the model should start copying the target subsequence. The final 10 integers are set to 0. The output sequence that the model is trained to reproduce consists of $T + 10$ zeros followed by the target subsequence from the input that is to be copied. To make sure that the task requires a sufficiently long memory capacity, we used a large sequence length $T$, comparable to the largest sequence length considered in Arjovsky et al. (2016) for the same task.
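
For concreteness, a data-generation sketch for this task is given below. It follows the standard formulation of the copy task from Arjovsky et al. (2016); the default values of $T$, the target-subsequence length, and the symbol alphabet are assumptions here, not necessarily the paper's exact settings.

```python
import numpy as np

def copy_task_batch(batch_size=128, T=500, n_copy=10, n_symbols=8, seed=None):
    """Copy task in its standard form: inputs and targets are integer sequences of
    length T + 2 * n_copy. Symbol 0 is the blank, symbols 1..n_symbols carry data,
    and symbol n_symbols + 1 is the 'start copying' cue."""
    rng = np.random.default_rng(seed)
    seq_len = T + 2 * n_copy
    inputs = np.zeros((batch_size, seq_len), dtype=np.int64)
    targets = np.zeros((batch_size, seq_len), dtype=np.int64)
    data = rng.integers(1, n_symbols + 1, size=(batch_size, n_copy))  # target subsequence
    inputs[:, :n_copy] = data                    # first n_copy entries: the data to remember
    inputs[:, T + n_copy - 1] = n_symbols + 1    # the cue, after T - 1 blanks
    targets[:, -n_copy:] = data                  # the model reproduces the data at the very end
    return inputs, targets
```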

Addition task: The input consists of two sequences of length $T$. The first is a sequence of random numbers drawn uniformly from the interval $[0, 1]$. The second is an indicator sequence with 1s at exactly two positions and 0s everywhere else; the positions of the two 1s indicate the positions of the numbers in the first sequence that are to be added. The target output is the sum of the two corresponding numbers. The position of the first 1 is drawn uniformly from the first half of the sequence and the position of the second 1 is drawn uniformly from the second half. Again, to ensure that the task requires a sufficiently long memory capacity, we chose $T = 750$, which is the same as the largest sequence length considered in Arjovsky et al. (2016) for the same task.
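
A corresponding data-generation sketch for the addition task is shown below; it implements the standard formulation described above, with minor details (batching, exact sampling of the marker positions) being assumptions for illustration.

```python
import numpy as np

def addition_task_batch(batch_size=128, T=750, seed=None):
    """Addition task: inputs have shape (batch_size, T, 2) -- a value channel and an
    indicator channel with exactly two 1s -- and targets are the sums of the two
    marked values."""
    rng = np.random.default_rng(seed)
    values = rng.uniform(0.0, 1.0, size=(batch_size, T))       # value channel
    markers = np.zeros((batch_size, T))                        # indicator channel
    first = rng.integers(0, T // 2, size=batch_size)           # marker in the first half
    second = rng.integers(T // 2, T, size=batch_size)          # marker in the second half
    rows = np.arange(batch_size)
    markers[rows, first] = 1.0
    markers[rows, second] = 1.0
    targets = values[rows, first] + values[rows, second]
    return np.stack([values, markers], axis=-1), targets
```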

Permuted sequential MNIST (psMNIST): This is a sequential version of the standard MNIST benchmark where the pixels are fed to the model one pixel at a time. To make the task hard enough, we used the permuted version of the sequential MNIST task where a fixed random permutation is applied to the pixels to eliminate any spatial structure before they are fed into the model.

We used the elu nonlinearity for the copy and the permuted sequential MNIST tasks (Clevert et al., 2016), and the relu nonlinearity for the addition problem (because relu proved to be more natural for remembering positive numbers).

As mentioned above, the scaled identity and the scaled random orthogonal networks constituted the normal initializers. In the scaled identity initializer, the recurrent connectivity matrix was initialized as $\lambda I$, where $\lambda$ is a scale parameter. In the random orthogonal initializer, the recurrent connectivity matrix was initialized as $\lambda Q$, where $Q$ is a random dense orthogonal matrix, and the input matrix was initialized in the same way as in the identity initializer.

The feedforward chain and the chain with feedback networks constituted the non-normal initializers. In the chain initializer, the recurrent connectivity matrix was initialized as $W_{ij} = \alpha\, \delta_{i,j+1}$, and the input matrix was initialized using an identity matrix of the appropriate dimension. In the chain with feedback initializer, the recurrent connectivity matrix was initialized as $W_{ij} = \alpha\, \delta_{i,j+1} + \beta\, \delta_{i,j-1}$, and the input matrix was initialized in the same way as in the chain initializer.
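
The sketch below illustrates the recurrent part of these non-normal initializers in PyTorch. It is a simplified stand-in for the classes in the repository linked above: the input-matrix initialization is omitted, and the example $\alpha$ and $\beta$ values are arbitrary.

```python
import torch

def chain_init_(weight_hh, alpha=1.0, beta=0.0):
    """Initialize a square recurrent weight matrix in place as a feedforward chain,
    W_ij = alpha * delta_{i,j+1}, with optional feedback connections beta * delta_{i,j-1}.
    beta = 0 gives the pure chain initializer; beta > 0 gives the chain with feedback."""
    n = weight_hh.shape[0]
    with torch.no_grad():
        weight_hh.zero_()
        idx = torch.arange(n - 1)
        weight_hh[idx + 1, idx] = alpha   # feedforward connections down the chain
        weight_hh[idx, idx + 1] = beta    # feedback connections up the chain
    return weight_hh

# Example: apply the chain-with-feedback initializer to a vanilla RNN cell.
cell = torch.nn.RNNCell(input_size=2, hidden_size=128, nonlinearity='relu')
chain_init_(cell.weight_hh, alpha=1.0, beta=0.05)
```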

We used the rmsprop optimizer for all models, which we found to be the best method for this set of tasks. The learning rate of the optimizer was a hyper-parameter that we tuned separately for each model and each task, searching over a fixed grid of learning rates. We ran each model on each task multiple times, using consecutive integers as random seeds.

In addition, the following model-specific hyperparameters were searched over for each task:

  • Chain model: the feedforward connection weight, $\alpha$

  • Chain with feedback model: the feedback connection weight, $\beta$

  • Scaled identity model: the scale, $\lambda$

  • Random orthogonal model: the scale, $\lambda$

This yields a total of 462 different runs per experiment for the non-normal models and a total of 594 different runs for the normal models. Note that we ran more extensive hyper-parameter searches for the normal models than for the non-normal models in this set of tasks.

Figure 3: Results on the toy benchmarks. a-c Validation losses with the best hyper-parameter settings. Solid lines are the means and shaded regions are the standard errors over different runs using different random seeds. For the copy and addition tasks, we also show the loss values for random baseline models (dashed lines). For the psMNIST task, the mean cross-entropy loss for a random classifier is $\log 10 \approx 2.3$ (chance level for 10 classes), thus all four models comfortably outperform this random baseline right from the end of the first training epoch.

d-f Number of “successful” runs (or hyper-parameter configurations) that converged to a validation loss below 50% of the loss for the random baseline model. Note that the total number of runs was higher for the normal models than for the non-normal models (594 vs. 462 runs per experiment). Despite this, the non-normal models generally outperformed the normal models even by this measure.

Figure 3a-c shows the validation losses for each model with the best hyper-parameter settings. The non-normal initializers generally outperform the normal initializers. Figure 3d-f shows for each model the number of “successful” runs that converged to a validation loss below a criterion level (which we set to be 50% of the loss for a baseline random model). The chain model outperformed all other models by this measure (despite having a smaller total number of runs than the normal models). In the copy task, for example, none of the runs for the normal models was able to achieve the criterion level, whereas 46 out of 462 runs for the chain model and 11 out of 462 runs for the feedback chain model reached the criterion loss.

2.3.2 Language modeling experiments

To investigate if the benefits of non-normal initializers extend to more realistic problems, we conducted experiments with three standard language modeling tasks: word-level Penn Treebank (PTB), character-level PTB, and character-level enwik8 benchmarks.

For the language modeling experiments in this subsection, we used the code base provided by Salesforce Research (Merity et al., 2018a, b): https://github.com/salesforce/awd-lstm-lm. We refer the reader to Merity et al. (2018a; 2018b) for a more detailed description of the benchmarks. We generally preserved the model setup used in Merity et al. (2018a; 2018b), except for the following differences: 1) we replaced the gated RNN architectures (LSTMs and QRNNs) used in Merity et al. (2018a; 2018b) with vanilla RNNs; 2) we observed that vanilla RNNs require weaker regularization than gated RNN architectures, so we lowered all dropout rates in the word-level PTB task, lowered all dropout rates except dropoute (which was kept at a small value) in the character-level PTB task, and lowered all dropout rates in the enwik8 benchmark; 3) we trained the word-level PTB models for 60 epochs, the character-level PTB models for 500 epochs, and the enwik8 models for 35 epochs.

We compared the same four models described in the previous subsection. As in Merity et al. (2018a), we used the Adam optimizer and therefore only tuned the model-specific hyper-parameters $\alpha$, $\beta$, and $\lambda$ for the experiments in this subsection. For the $\alpha$ hyper-parameter in the chain model and the $\lambda$ hyper-parameter in the scaled identity and random orthogonal models, we searched over 21 uniformly spaced values (endpoints inclusive); for the chain with feedback model, we set the feedforward connection weight, $\alpha$, to the optimal value it had in the chain model and searched over 21 uniformly spaced values of $\beta$ (endpoints inclusive). In addition, we repeated each experiment 3 times using different random seeds, yielding a total of 63 runs for each model and each benchmark.

Figure 4: Results on the language modeling benchmarks. Solid lines are the means and shaded regions are standard errors over 3 different runs using different random seeds.

The results are shown in Figure 4 and in Table 1. Figure 4 shows the validation loss over the course of training in units of bits per character (bpc). Table 1 reports the test losses at the end of training. The non-normal models outperform the normal models on the word-level and character-level PTB benchmarks. The differences between the models are less clear on the enwik8 benchmark. However, in terms of the test loss, the non-normal feedback chain model significantly outperforms the other models on all three benchmarks (Table 1).


Model         PTB word        PTB char.       enwik8
Identity      6.550 ± 0.002   1.312 ± 0.000   1.783 ± 0.003
Ortho.        6.557 ± 0.002   1.312 ± 0.001   1.843 ± 0.046
Chain         6.514 ± 0.001   1.308 ± 0.000   1.803 ± 0.017
Fb. chain     6.510 ± 0.001   1.307 ± 0.000   1.774 ± 0.002
3-layer LSTM  5.878           1.175           1.232
Table 1: Test losses (bpc) on the language modeling benchmarks. The numbers represent mean ± s.e.m. over 3 independent runs. LSTM results are from Merity et al. (2018a; 2018b).

We note that the vanilla RNN models perform significantly worse than the gated RNN architectures considered in Merity et al. (2018a; 2018b). We conjecture that this is because gated architectures are generally better at modeling contextual dependencies, hence they have inductive biases better suited to language modeling tasks. The primary benefit of non-normal dynamics, on the other hand, is enabling a longer memory capacity. Below, we will discuss whether non-normal dynamics can be used in gated RNN architectures to improve performance as well.

2.3.3 Reinforcement learning (RL) experiments

Next, we conducted experiments with an RL agent trained in the car racing environment CarRacing-v0 in OpenAI Gym. Specifically, we used the model introduced in Ha & Schmidhuber (2018) for this environment. For the experiments reported in this subsection, we also used the code base provided by the authors: https://github.com/hardmaru/WorldModelsExperiments. Briefly, in this model, the agent first collects a large number of roll-outs from the environment using a random policy. These random roll-outs are then used as training data for a variational auto-encoder (VAE), which learns a compact, low-dimensional representation of the agent's high-dimensional observations. A predictive model of this latent representation is then learned via an RNN: at each time step, the RNN takes as input the agent's current action and the current latent state of the environment, and predicts the next latent state. Using an RNN as a predictive model enables the agent to learn potentially complex dependencies between the histories of its actions and of the state of the environment. In the final step, a simple linear controller is trained on the hidden state of the predictive RNN and the latent state of the environment to perform the actual car racing task. Ha & Schmidhuber (2018) train the predictive RNN model and the controller separately (i.e. the entire model is not trained end-to-end), so we only consider the training of the RNN in our experiments and ignore the training of the controller. Accordingly, the loss values reported below are the validation losses (i.e. negative log-likelihoods) for the predictive model only. For further details, we refer the reader to Ha & Schmidhuber (2018). We use essentially the same set-up, except for a few differences: 1) we replace the LSTM with a vanilla RNN (with the same number of units) as the predictive model; 2) we use a smaller number of random roll-outs (300 vs. 10000); 3) we use the Adam optimizer with a learning rate of 0.0005, instead of the rmsprop optimizer.


Model     Validation loss
Identity  1.409 ± 0.004
Chain     1.392 ± 0.005
Table 2: Validation losses (negative log-likelihoods) for the predictive RNN model trained in the car racing environment CarRacing-v0. The numbers are mean ± s.e.m. over 3 independent runs.

For the experiments in this subsection, we only compared RNNs initialized with a scaled identity matrix with RNNs initialized with a chain structure. The hyper-parameter searches conducted were identical to the searches described above for the language modeling experiments. Table 2 shows the results. The chain model outperformed the identity model in terms of the final validation loss for the predictive model.

2.4 Hidden feedforward structures in trained RNNs

We observed that training made vanilla RNNs initialized with orthogonal recurrent connectivity matrices non-normal. We quantified the non-normality of the trained recurrent connectivity matrices using a measure introduced by Henrici (1962), the departure from normality $d(W) = \sqrt{\|W\|_F^2 - \sum_i |\lambda_i|^2}$, where $\|\cdot\|_F$ denotes the Frobenius norm and $\lambda_i$ is the $i$-th eigenvalue of $W$. This measure equals zero for all normal matrices and is positive for non-normal matrices. We found that $d(W)$ became positive for all successfully trained RNNs initialized with orthogonal recurrent connectivity matrices. Table 3 reports the aggregate statistics of $d(W)$ for orthogonally initialized RNNs trained on the toy benchmarks.


Task          Identity      Orthogonal
Addition-750  2.33 ± 1.02   2.74 ± 0.07
psMNIST       1.01 ± 0.12   2.72 ± 0.08
Table 3: Henrici indices, $d(W)$, of trained RNNs initialized with orthogonal recurrent connectivity matrices. The numbers represent mean ± s.e.m. over all successfully trained networks. We define training success as having a validation loss below 50% of that of a random baseline model. Note that by this measure, none of the orthogonally initialized RNNs was successful on the copy task (Figure 3d).
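
The Henrici measure used above can be computed in a few lines. The sketch below uses the standard (unnormalized) departure from normality introduced by Henrici (1962); whether the paper applies any additional normalization is not specified here.

```python
import numpy as np

def henrici_departure(W):
    """Henrici's departure from normality: sqrt(||W||_F^2 - sum_i |lambda_i|^2).
    It is zero exactly when W is normal and positive otherwise."""
    eigvals = np.linalg.eigvals(W)
    fro_sq = np.linalg.norm(W, 'fro') ** 2
    return np.sqrt(max(fro_sq - np.sum(np.abs(eigvals) ** 2), 0.0))

# Sanity checks: orthogonal matrices are normal, a chain is strongly non-normal.
Q, _ = np.linalg.qr(np.random.default_rng(0).standard_normal((64, 64)))
print(henrici_departure(Q))                 # ~ 0 (up to numerical error)
print(henrici_departure(np.eye(64, k=-1)))  # sqrt(63), since all eigenvalues are 0
```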

Although increased non-normality in trained RNNs is an interesting observation, the Henrici index, by itself, does not tell us what structural features in trained RNNs contribute to this increased non-normality. Given the benefits of chain-like feedforward non-normal structures in RNNs for improved memory, we hypothesized that training might have installed hidden chain-like feedforward structures in trained RNNs and that these feedforward structures were responsible for their increased non-normality.

To uncover these hidden feedforward structures, we performed an analysis suggested by Rajan et al. (2016). In this analysis, we first injected a unit pulse of input into the network at the beginning of the trial and let the network evolve for a number of time steps afterwards according to its recurrent dynamics, with no direct input. We then ordered the recurrent units by the time of their peak activity (using a small amount of jitter to break potential ties between units) and plotted the mean recurrent connection weight as a function of the order difference between two units, $\Delta$. Positive values of $\Delta$ correspond to connections from earlier-peaking units to later-peaking units, and vice versa for negative values. In trained RNNs, the mean recurrent weight profile as a function of $\Delta$ had an asymmetric peak, with connections in the “forward” direction being, on average, stronger than those in the opposite direction. Figure 5 shows examples with orthogonally initialized RNNs trained on the addition and the permuted sequential MNIST tasks. Note that for a purely feedforward chain, the weight profile would have a single peak at $\Delta = 1$ and would be zero elsewhere. Although the weight profiles for trained RNNs are not this extreme, the prominent asymmetric bump with a peak at a positive $\Delta$ indicates a hidden chain-like feedforward structure in these networks.
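
A sketch of this weight-profile analysis is given below. The pulse-response simulation that determines the peak times is omitted for brevity; the function assumes the peak times have already been measured, and the tie-breaking jitter and the sign convention for $\Delta$ follow the description in the text.

```python
import numpy as np

def weight_profile(W, peak_times, jitter=1e-6, seed=0):
    """Mean recurrent weight as a function of the order difference Delta between units
    ordered by their peak-activity times. For the connection W[i, j] (from unit j to
    unit i), Delta = rank(i) - rank(j); positive Delta means a 'forward' connection
    from an earlier-peaking unit to a later-peaking one."""
    rng = np.random.default_rng(seed)
    n = W.shape[0]
    jittered = np.asarray(peak_times, dtype=float) + jitter * rng.standard_normal(n)
    rank = np.argsort(np.argsort(jittered))        # rank[i] = position of unit i in the ordering
    delta = rank[:, None] - rank[None, :]          # delta[i, j] = rank(i) - rank(j)
    return {d: W[delta == d].mean() for d in range(-(n - 1), n)}

# Example: for a pure chain ordered by peak time, all the weight mass sits at Delta = +1.
n = 8
profile = weight_profile(np.eye(n, k=-1), peak_times=np.arange(n))
print(profile[1], profile[-1])   # 1.0 and 0.0
```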

Figure 5: Training induces hidden chain-like feedforward structures in vanilla RNNs. The units are first ordered by the time of their peak activity. Then, the mean recurrent connection weight is plotted as a function of the order difference between two units, $\Delta$. Results are shown for RNNs trained on the addition (a) and the permuted sequential MNIST (b) tasks. The left column shows the results for RNNs initialized with a scaled identity matrix, and the right column shows the results for RNNs initialized with random orthogonal matrices. In each case, training induces hidden chain-like feedforward structures in the networks, as indicated by an asymmetric bump peaked at a positive $\Delta$ in the weight profile. This kind of structure is either non-existent (identity) or much less prominent (orthogonal) in the initial untrained networks. For the results shown here, we only considered sufficiently well-trained networks that achieved a validation loss below 50% of the loss for a baseline random model at the end of training. The solid lines and shaded regions represent means and standard errors of the mean weight profiles over these networks.

2.5 Do the benefits of non-normal dynamics extend to gated RNN architectures?

So far, we have only considered vanilla RNNs. An important question is whether the benefits of non-normal dynamics demonstrated above for vanilla RNNs also extend to gated RNN architectures like LSTMs or GRUs (Hochreiter & Schmidhuber, 1997; Cho et al., 2014). Gated RNN architectures have better inductive biases than vanilla RNNs in many practical tasks of interest such as language modeling (e.g. see Table 1 for a comparison of vanilla RNN architectures with an LSTM architecture of similar size in the language modeling benchmarks), thus it would be practically very useful if their performance could be improved through an inductive bias for non-normal dynamics.


Model   PTB word        PTB char.       enwik8
Ortho.  5.937 ± 0.002   1.230 ± 0.001   1.583 ± 0.001
Chain   5.935 ± 0.001   1.230 ± 0.001   1.586 ± 0.000
Plain   5.949 ± 0.007   1.245 ± 0.001   1.584 ± 0.002
Mixed   5.944 ± 0.004   1.227 ± 0.000   1.577 ± 0.001
Table 4: Test losses (bpc) on the language modeling benchmarks using 3-layer LSTMs (adapted from Merity et al. (2018a; 2018b)) with different initialization schemes. The numbers represent mean ± s.e.m. over 3 independent runs.

To address this question, we treated the input, forget, output, and update gates of the LSTM architecture as analogous to vanilla RNNs and initialized the recurrent and input matrices inside these gates in the same way as in the chain or the orthogonal initialization of vanilla RNNs above. We also compared these with a more standard initialization scheme in which all the weights are drawn from a uniform distribution $\mathcal{U}(-\sqrt{k}, \sqrt{k})$, where $k$ is the reciprocal of the hidden layer size (labeled plain in Table 4). This is the default initializer for the LSTM weight matrices in PyTorch: https://pytorch.org/docs/stable/nn.html#lstm.
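
To make the gate-wise initialization concrete, the sketch below shows one way to apply the chain initializer to the recurrent weight blocks of a PyTorch LSTM, relying on PyTorch's documented gate ordering (input, forget, cell/update, output) in weight_hh_l0. It is an illustrative stand-in rather than the authors' implementation: only the recurrent weights are shown, and the layer sizes are arbitrary.

```python
import torch

def chain_matrix(n, alpha=1.0, beta=0.0):
    # Chain (+ optional feedback) recurrent matrix, as in the vanilla RNN initializers above.
    W = torch.zeros(n, n)
    idx = torch.arange(n - 1)
    W[idx + 1, idx] = alpha
    W[idx, idx + 1] = beta
    return W

def chain_init_lstm_gates_(lstm, alpha=1.0, beta=0.0, gates=(0, 1, 2, 3)):
    """Apply the chain initializer to the selected recurrent gate blocks of a single-layer LSTM.
    PyTorch stacks weight_hh_l0 as [input, forget, cell (update), output] blocks of size
    hidden_size each; e.g. gates=(2,) touches only the update (tanh) gate."""
    h = lstm.hidden_size
    with torch.no_grad():
        for g in gates:
            lstm.weight_hh_l0[g * h:(g + 1) * h] = chain_matrix(h, alpha, beta)
    return lstm

lstm = torch.nn.LSTM(input_size=100, hidden_size=256, num_layers=1)
chain_init_lstm_gates_(lstm, alpha=1.0)        # chain init in all four gates
# chain_init_lstm_gates_(lstm, gates=(2,))     # chain init only in the update (tanh) gate
```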

We compared these initializers in the language modeling benchmarks. The chain initializer did not perform better than the orthogonal initializer (Table 4), suggesting that non-normal dynamics in gated RNN architectures may not be as helpful as it is in vanilla RNNs. In hindsight, this is not too surprising, because our initial motivation for introducing non-normal dynamics heavily relied on the vanilla RNN architecture and gated RNNs can be dynamically very different from vanilla RNNs.

Figure 6: The recurrent weight matrices inside the input, forget, and output LSTM gates do not display the characteristic signature of a prominent chain-like feedforward structure; their weight profiles are instead monotonic functions of $\Delta$. The recurrent weight matrix inside the update (tanh) gate, however, does display a chain-like structure similar to that observed in vanilla RNNs. The examples shown in this figure are from the input (a), forget (b), output (c), and update (d) gates of the second-layer LSTM in a 3-layer LSTM architecture trained on the word-level PTB task. The weight matrices shown here were initialized with orthogonal initializers. Other layers and models trained on other tasks display qualitatively similar properties.

When we looked at the trained LSTM weight matrices more closely, we found that, although still non-normal, the recurrent weight matrices inside the input, forget, and output gates (i.e. the sigmoid gates) did not have the same signatures of hidden chain-like feedforward structure observed in vanilla RNNs. Specifically, the weight profiles of the LSTM recurrent weight matrices inside these three gates did not display the asymmetric bump characteristic of a prominent chain-like feedforward structure, but were instead monotonic functions of $\Delta$ (Figure 6a-c), suggesting a qualitatively different kind of dynamics in which individual units are more persistent over time. The recurrent weight matrix inside the update gate (the tanh gate), on the other hand, did display the signature of a hidden chain-like feedforward structure (Figure 6d). When we incorporated these two different structures into different gates of the LSTM, by using a chain initializer for the update gate and a monotonically increasing recurrent weight profile for the other gates (labeled mixed in Table 4), the resulting initializer outperformed the other initializers on the character-level PTB and enwik8 benchmarks.

3 Discussion

Motivated by their optimal memory properties in a simplified linear setting (Ganguli et al., 2008), in this paper, we investigated the potential benefits of certain highly non-normal chain-like RNN architectures in capturing long-term dependencies in sequential tasks. Our results clearly demonstrate an advantage for such non-normal architectures as initializers for vanilla RNNs, compared to the commonly used orthogonal initializers. We further found evidence for the induction of such chain-like feedforward structures in trained vanilla RNNs even when these RNNs are initialized with orthogonal recurrent connectivity matrices.

The benefits of these chain-like non-normal initializers do not directly carry over to more complex, gated RNN architectures such as LSTMs and GRUs. In some important practical problems such as language modeling, the gains from using these kinds of gated architectures seem to far outweigh the gains obtained from the non-normal initializers in vanilla RNNs (see Table 1). However, we also uncovered important regularities in trained LSTM weight matrices, namely that the recurrent weight profiles of the input, forget, and output gates (the sigmoid gates) in trained LSTMs display a monotonically increasing pattern, whereas the recurrent matrix inside the update gate (the tanh gate) displays a chain-like feedforward structure similar to that observed in vanilla RNNs (Figure 6). We showed that these regularities can be exploited to improve the training and/or generalization performance of these gated RNN architectures by introducing them as useful inductive biases to these models.

There is a close connection between the identity initialization of RNNs (Le et al., 2015) and the widely used identity skip connections (or residual connections) in deep feedforward networks (He et al., 2016). Given the superior performance of chain-like non-normal initializers over the identity initialization demonstrated in the context of vanilla RNNs in this paper, it could be interesting to look for similar chain-like non-normal architectural motifs that could be used in deep feedforward networks in place of the identity skip connections.

References

  • Arjovsky et al. (2016) Arjovsky, M., Shah, A., and Bengio, Y. Unitary evolution recurrent neural networks. In Proceedings of the 33rd International Conference on Machine Learning, 2016.
  • Bai et al. (2018) Bai, S., Kolter, J. Z., and Koltun, V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv:1803.01271, 2018.
  • Bengio et al. (1994) Bengio, Y., Simard, P., and Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
  • Chang et al. (2017) Chang, S., Zhang, Y., Han, W., Yu, M., Guo, X., Tan, W., Cui, X., Witbrock, M., Hasegawa-Johnson, M., and Huang, T. Dilated recurrent neural networks. In Advances in Neural Information Processing Systems 30, 2017.
  • Cho et al. (2014) Cho, K., van Merriënboer, B., Gülçehre, Ç., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734, 2014.
  • Clevert et al. (2016) Clevert, D.-A., Unterthiner, T., and Hochreiter, S. Fast and accurate deep network learning by exponential linear units (elus). In International Conference on Learning Representations (ICLR), 2016.
  • Ganguli et al. (2008) Ganguli, S., Huh, D., and Sompolinsky, H. Memory traces in dynamical systems. PNAS, 105(48):18970–18975, 2008.
  • Ha & Schmidhuber (2018) Ha, D. and Schmidhuber, J. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems 31, pp. 2455–2467, 2018.
  • He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
  • Henrici (1962) Henrici, P. Bounds for iterates, inverses, spectral variation and fields of values of non-normal matrices. Numerische Mathematik, 4:24–40, 1962.
  • Hochreiter (1991) Hochreiter, S. Untersuchungen zu dynamischen neuronalen Netzen. PhD thesis, Institut f. Informatik, Technische Univ. Munich, 1991.
  • Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • Le et al. (2015) Le, Q., Jaitly, N., and Hinton, G. A simple way to initialize recurrent networks of rectified linear units. arXiv:1504.00941, 2015.
  • Merity et al. (2018a) Merity, S., Keskar, N. S., and Socher, R. An analysis of neural language modeling at multiple scales. arXiv:1803.08240, 2018a.
  • Merity et al. (2018b) Merity, S., Keskar, N. S., and Socher, R. Regularizing and optimizing lstm language models. In International Conference on Learning Representations (ICLR), 2018b.
  • Rajan et al. (2016) Rajan, K., Harvey, C. D., and Tank, D. W. Recurrent network models of sequence generation and memory. Neuron, 90(1):128–142, 2016.
  • Wisdom et al. (2016) Wisdom, S., Powers, T., Hershey, J., Roux, J. L., and Atlas, L. Full-capacity unitary recurrent neural networks. In Advances in Neural Information Processing Systems 29, 2016.