Memory Augmented Neural Networks with Wormhole Connections

01/30/2017 · by Caglar Gulcehre et al., Université de Montréal

Recent empirical results on long-term dependency tasks have shown that neural networks augmented with an external memory can learn long-term dependency tasks more easily and achieve better generalization than vanilla recurrent neural networks (RNNs). We suggest that memory augmented neural networks can reduce the effects of vanishing gradients by creating shortcut (or wormhole) connections. Based on this observation, we propose a novel memory augmented neural network model called TARDIS (Temporal Automatic Relation Discovery in Sequences). The controller of TARDIS can store a selective set of embeddings of its own previous hidden states into an external memory and revisit them as and when needed. In TARDIS, the memory acts as storage for wormhole connections to the past that propagate the gradients more effectively and help to learn temporal dependencies. The memory structure of TARDIS has similarities to both Neural Turing Machines (NTM) and Dynamic Neural Turing Machines (D-NTM), but both the read and write operations of TARDIS are simpler and more efficient. We use discrete addressing for read/write operations, which helps substantially reduce the vanishing gradient problem for very long sequences. Read and write operations in TARDIS are tied with a heuristic once the memory becomes full, which makes the learning problem simpler than for NTM- or D-NTM-type architectures. We provide a detailed analysis of gradient propagation in MANNs in general. We evaluate our models on different long-term dependency tasks and report competitive results on all of them.




1 Introduction

Recurrent Neural Networks (RNNs) are neural network architectures designed to handle temporal dependencies in sequential prediction problems. However, it is well known that RNNs suffer from vanishing gradients as the length of the sequence and of its dependencies increases (Hochreiter, 1991; Bengio et al., 1994). Long Short-Term Memory (LSTM) units (Hochreiter and Schmidhuber, 1997) were proposed as an alternative architecture which can handle long-range dependencies better than a vanilla RNN. A simplified version of the LSTM unit called the Gated Recurrent Unit (GRU), proposed in (Cho et al., 2014), has proven successful in a number of applications (Bahdanau et al., 2015; Xu et al., 2015; Trischler et al., 2016; Kaiser and Sutskever, 2015; Serban et al., 2016). Even though LSTMs and GRUs attempt to solve the vanishing gradient problem, the memory in both architectures is stored in a single hidden vector, as in an RNN, and hence accessing information too far in the past can still be difficult. In other words, LSTM and GRU models have a limited ability to search through their past memories when they need to access relevant information for making a prediction. Extending the capabilities of neural networks with a memory component has been explored in the literature on different applications with different architectures (Weston et al., 2015; Graves et al., 2014; Joulin and Mikolov, 2015; Grefenstette et al., 2015; Sukhbaatar et al., 2015; Bordes et al., 2015; Chandar et al., 2016; Gulcehre et al., 2016; Graves et al., 2016; Rae et al., 2016).

Memory augmented neural networks (MANNs) such as the neural Turing machine (NTM) (Graves et al., 2014; Rae et al., 2016), the dynamic NTM (D-NTM) (Gulcehre et al., 2016), and the Differentiable Neural Computer (DNC) (Graves et al., 2016) use an external memory (usually a matrix) to store information, and the MANN's controller can learn to both read from and write into the external memory. As we show here, it is in general possible to use particular MANNs to explicitly store the previous hidden states of an RNN in the memory, which provides shortcut connections through time, called here wormhole connections, that look into the history of the states of the RNN controller. Learning to read from and write into an external memory gives the model more freedom and flexibility to retrieve information from its past, and to forget or store new information in the memory. However, if the addressing mechanism for the read and/or write operations is continuous (as in the NTM and the continuous D-NTM), the access may be too diffuse, especially early on during training. This can hurt the write operation in particular, since a diffuse write overwrites a large fraction of the memory at each step, causing fast vanishing of the memories (and gradients). On the other hand, discrete addressing, as used in the discrete D-NTM, should be able to perform this search through the past, but prevents us from using straightforward backpropagation for learning how to choose the address.

We investigate the flow of the gradients and how the wormhole connections introduced by the controller affect it. Our results show that the wormhole connections created by the controller of the MANN can significantly reduce the effects of vanishing gradients by shortening the paths that the signal needs to travel between the dependencies. We also discuss how MANNs can generalize to sequences longer than the ones seen during training.

In a discrete D-NTM, the controller must learn to read from and write into the external memory by itself, and additionally it must learn reader/writer synchronization. This can make learning more challenging. In spite of this difficulty, Gulcehre et al. (2016) reported that the discrete D-NTM can learn faster than the continuous D-NTM on some of the bAbI tasks. We provide a formal analysis of gradient flow in MANNs based on discrete addressing and justify this result. In this paper, we also propose a new MANN based on discrete addressing called TARDIS (Temporal Automatic Relation Discovery in Sequences). In TARDIS, memory access is based on tying the write and read heads of the model after the memory is filled up. While the memory is not yet full, the write head stores information in the memory in sequential order.

The main characteristics of TARDIS are as follows. TARDIS is a simple memory augmented neural network model which can represent long-term dependencies efficiently using an external memory of small size, and it represents the dependencies between the hidden states inside the memory. We show both theoretically and experimentally that TARDIS fixes, to a large extent, the problems related to long-term dependencies. Our model can also store sub-sequences or sequence chunks into the memory; as a consequence, the controller can learn to represent high-level temporal abstractions as well. TARDIS performs well on several structured output prediction tasks, as verified in our experiments.

The idea of using external memory with attention can be justified by the concept of mental time travel, which humans occasionally perform to solve daily tasks. In particular, in the cognitive science literature, chronesthesia is known as a form of consciousness which allows humans to think about time subjectively and perform mental time travel (Tulving, 2002). TARDIS is inspired by this human ability to look up past memories and plan for the future using episodic memory.

2 TARDIS: A Memory Augmented Neural Network

Neural network architectures with an external memory represent the memory in matrix form, such that at each time step the model can both read from and write to the external memory. The whole content of the external memory can be considered a generalization of the hidden state vector in a recurrent neural network: instead of storing all the information in a single hidden state vector, our model can store it in a matrix which has higher capacity, with a more targeted ability to substantially change or use only a small subset of the memory at each time step. The neural Turing machine (NTM) (Graves et al., 2014) is one example of such a MANN, with both reading and writing into the memory.

2.1 Model Outline

In this subsection, we describe the basic structure of TARDIS (Temporal Automatic Relation Discovery In Sequences; the name of the model is inspired by the time machine in the popular TV series Dr. Who). TARDIS is a MANN with an external memory matrix M ∈ R^{k×q}, where k is the number of memory cells and q is the dimensionality of each cell. The model has an RNN controller which can read from and write to the external memory at every time step. To read from the memory, the controller generates the read weights w_t^r, and the reading operation is typically achieved by computing the dot product between the read weights and the memory M_t, resulting in the content vector r_t:

r_t = M_t^T w_t^r

TARDIS uses discrete addressing, hence w_t^r is a one-hot vector and the dot product chooses one of the cells in the memory matrix (Zaremba and Sutskever, 2015; Gulcehre et al., 2016). The controller also generates the write weights w_t^w, likewise a one-hot vector with discrete addressing, to write into the memory. We will omit biases from our equations for simplicity in the rest of the paper. Let j be the index of the non-zero entry of the one-hot vector w_t^w; the controller then writes a linear projection of the current hidden state to the memory location M_t[j]:

M_t[j] = W_m h_t

where W_m is the projection matrix that projects the hidden state vector h_t down to a q-dimensional micro-state vector.
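As a concrete illustration, these discrete read and write operations can be sketched in a few lines of NumPy. The sizes k, q, d_h and the projection W_m below are illustrative assumptions, not the paper's actual hyper-parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
k, q, d_h = 5, 8, 16                         # assumed sizes: k cells of dim q, hidden dim d_h
M = np.zeros((k, q))                         # external memory
W_m = rng.standard_normal((q, d_h)) * 0.1    # projection from hidden state to micro-state

def read(M, w_r):
    # discrete read: w_r is one-hot, so M.T @ w_r selects exactly one cell
    return M.T @ w_r

def write(M, w_w, h):
    # discrete write: store the projected micro-state in the selected cell
    j = int(np.argmax(w_w))
    M[j] = W_m @ h
    return M

h_t = rng.standard_normal(d_h)
w = np.eye(k)[2]                             # one-hot address for cell 2
M = write(M, w, h_t)
r_t = read(M, w)                             # reading back returns the stored micro-state
```

Because the address vectors are one-hot, each read or write touches exactly one cell, which is what makes the wormhole interpretation below possible.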

At every time step, the hidden state of the controller is also conditioned on the content read from the memory. The wormhole connections are created by conditioning h_t on r_t:

h_t = φ(x_t, h_{t-1}, r_t)

As each cell in the memory is a linear projection of one of the previous hidden states, conditioning the controller's hidden state on the content read from the memory can be interpreted as a way of creating shortcut connections across time (from the time a micro-state was written to the time it is read through r_t), which can help the flow of gradients across time. This is possible because of the discrete addressing used for the read and write operations.

However, the main challenge for the model is to learn proper read and write mechanisms, so that it writes the hidden states of previous time steps that will be useful for future predictions and reads them at the right time step. We call this the reader/writer synchronization problem. Instead of designing complicated addressing mechanisms to mitigate the difficulty of learning how to properly address the external memory, TARDIS side-steps the reader/writer synchronization problem with the following heuristic. Until the memory is full, our model writes the micro-states into the cells of the memory in sequential order. Once the memory becomes full, the most effective strategy in terms of preserving the information stored in the memory is to replace the memory cell that has just been read with the micro-state generated from the hidden state of the controller after it has been conditioned on that memory cell. If the model needs to perfectly retain the memory cell that it has just overwritten, the controller can in principle learn to do so by copying its read input to its write output (into the same memory cell). The pseudocode and the details of the memory update algorithm for TARDIS are presented in Algorithm 1.

  for each timestep t do
     Compute the read weights w_t^r
     Sample from/discretize w_t^r to obtain a one-hot read address
     Read from the memory, r_t = M_t^T w_t^r.
     Compute a new controller hidden state, h_t = φ(x_t, h_{t-1}, r_t)
     if the memory is not yet full then
        Write the micro-state of h_t into the next empty memory cell
     else
        Select the memory location to write into (the cell that was just read)
        Write the micro-state of h_t into that cell
     end if
  end for
Algorithm 1 Pseudocode for the controller and memory update mechanism of TARDIS.
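A minimal NumPy sketch of Algorithm 1's update loop, with a toy stand-in for the LSTM controller and an untrained reader that picks cells uniformly at random; all sizes and the controller function are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
k, q, d_h, d_x, T = 4, 6, 10, 3, 12
M = np.zeros((k, q))                          # external memory, initially empty
W_m = rng.standard_normal((q, d_h)) * 0.1     # micro-state projection

def controller(x, h_prev, r):
    # toy stand-in for the LSTM controller phi(x_t, h_{t-1}, r_t)
    return np.tanh(0.5 * h_prev + 0.1 * x.sum() + 0.1 * r.sum())

h = np.zeros(d_h)
for t in range(T):
    x = rng.standard_normal(d_x)
    j = rng.integers(k)                       # untrained reader: uniform discrete address
    r = M[j]                                  # read one cell
    h = controller(x, h, r)                   # condition the new state on the read content
    if t < k:
        M[t] = W_m @ h                        # memory not full: write sequentially
    else:
        M[j] = W_m @ h                        # memory full: tied heads overwrite the read cell
```

Note how the write address needs no separate mechanism once the memory is full: it is simply the read address, which is the heuristic that removes the reader/writer synchronization problem.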

There are two missing pieces in Algorithm 1: how to generate the read weights, and what the structure of the controller function is. We answer these two questions in detail in the next two sub-sections.

2.2 Addressing mechanism

Similar to the D-NTM, each cell of the memory matrix of TARDIS has a disjoint address section and content section. However, unlike in the D-NTM, the address vectors are fixed to random sparse vectors. The controller reads both the address and the content parts of the memory, but it writes only into the content section of the memory.

The continuous read weights are generated by an MLP which uses the information coming from the current input, the previous hidden state, the previously read content, and the usage vector (described below). The MLP is parametrized as follows:


where the weight matrices of the MLP are learnable parameters. The discrete read address w_t^r is a one-hot vector obtained either by sampling from the resulting distribution or by taking its argmax.

The usage vector denotes the frequency of accesses to each cell in the memory. It is computed by summing the discrete address vectors up to the current timestep and normalizing the result.


The norm(·) applied in Equation 6 is a simple feature-wise centering and divisive variance normalization. This normalization step makes training with the usage vectors easier. The usage vector can help the attention mechanism choose among the memory cells based on how frequently each cell has been accessed. For example, if a memory cell is very rarely accessed by the controller, at the next time step the controller can learn to assign more weight to that cell by looking at the usage vector. In this way, the controller can learn an LRU access mechanism (Santoro et al., 2016; Gulcehre et al., 2016).
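A small sketch of how such a usage vector could be computed from past discrete read addresses. The normalization here is our own minimal interpretation of the feature-wise centering and divisive variance normalization described above.

```python
import numpy as np

def usage_vector(read_history):
    """Sum past one-hot read vectors, then apply feature-wise centering and
    divisive variance normalization (a sketch of the norm(.) described above)."""
    u = np.sum(read_history, axis=0)   # access counts per memory cell
    u = u - u.mean()                   # centering
    return u / (u.std() + 1e-8)        # divisive variance normalization

# cell 0 read three times, cell 2 once, cells 1 and 3 never
reads = np.array([np.eye(4)[i] for i in [0, 0, 0, 2]])
u = usage_vector(reads)
# frequently accessed cells get larger normalized usage values
assert u[0] == u.max() and u[1] == u[3] == u.min()
```

The controller can then condition its read distribution on u, e.g. to up-weight rarely used cells.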

Further, in order to prevent the model from learning deficient addressing mechanisms (e.g., always reading the same memory cell, which would not increase the memory capacity of the model), we decrease the probability of the most recently read memory location by subtracting a fixed value from the logit of that particular memory location.

2.3 TARDIS Controller

We use an LSTM controller, and its gates are modified to take into account the content r_t of the cell read from the memory:


where f_t, i_t, and o_t are the forget gate, input gate, and output gate, respectively. Two scalar RESET gates control the magnitude of the information flowing from the memory and from the previous hidden state into the cell of the LSTM. By controlling the flow of information into the LSTM cell, these gates allow the model to store sub-sequences or chunks of sequences into the memory instead of the entire context.

We use the Gumbel sigmoid (Maddison et al., 2016; Jang et al., 2016) for the RESET gates due to its behavior being close to binary.


Empirically, we find the Gumbel-sigmoid of Equation 8 easier to train than the regular sigmoid. The temperature of the Gumbel-sigmoid is fixed to a constant in all our experiments.
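A sketch of Gumbel-sigmoid sampling, assuming the standard construction of perturbing the logit with the difference of two Gumbel noise samples before a tempered sigmoid; the temperature value used here is arbitrary.

```python
import numpy as np

def gumbel_sigmoid(logits, temperature, rng):
    """Add the difference of two Gumbel noise samples to the logit, then squash
    with a tempered sigmoid; low temperatures push the output toward {0, 1}."""
    g1 = -np.log(-np.log(rng.uniform(size=np.shape(logits))))
    g2 = -np.log(-np.log(rng.uniform(size=np.shape(logits))))
    return 1.0 / (1.0 + np.exp(-(logits + g1 - g2) / temperature))

rng = np.random.default_rng(0)
samples = gumbel_sigmoid(np.zeros(10000), temperature=0.3, rng=rng)
# with a low temperature, most samples are close to binary
assert np.mean((samples < 0.1) | (samples > 0.9)) > 0.6
```

This near-binary behavior is what makes the RESET gates act like soft on/off switches during training.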

The cell of the LSTM controller is computed as follows, using the RESET gates.


The hidden state of the LSTM controller is computed as follows:


In Figure 1, we illustrate the interaction between the controller and the memory with various heads and components of the controller.

Figure 1: At each time step, the controller takes the input, the memory cell that has been read, and the hidden state of the previous timestep. It then generates the RESET gates, which control the contribution of the read content to the internal dynamics of the new controller state (we omit the gates in this visualization). Once the memory becomes full, the discrete addressing weights generated by the controller are used both to read from and to write into the memory. To predict the target, the model uses both the hidden state and the read content.

2.4 Micro-states and Long-term Dependencies

A micro-state of the LSTM for a particular time step is a summary of the information that has been stored in the LSTM controller up to that point. By attending over the cells of the memory, which contain previous micro-states of the LSTM, the model can explicitly learn to retrieve information from its own past.

Figure 2: TARDIS's controller can learn to represent the dependencies among the input tokens by choosing which cells to read and write, creating wormhole connections. The input to the controller at each timestep and the hidden state of the controller RNN are shown.

The controller can learn to represent high-level temporal abstractions by creating wormhole connections through the memory, as illustrated in Figure 2. In this example, the model takes the first token at the first timestep and stores its representation in the first memory cell. At the second timestep, the controller takes the next token as input and writes into the second memory cell; furthermore, the RESET gate blocks the recurrent connection between the first two hidden states. At the third timestep, the controller starts reading: it receives the next token as input and reads the first memory cell, where the micro-state of the first timestep was stored. After reading, it computes its hidden state and writes the new micro-state into the first memory cell. The wormhole connection thus creates a path between the first and third timesteps that is shorter than the path along the recurrent states, skipping a timestep.

A regular single-layer RNN has a fixed graphical representation, a linear chain, when considering only the connections through its recurrent states along the temporal axis. TARDIS is more flexible in this respect: it can learn directed graphs with more diverse structures using the wormhole connections and the RESET gates. The directed graph that TARDIS can learn through its recurrent states has degree at most 4 at each vertex (at most 2 incoming and 2 outgoing edges), and its structure depends on the number of cells that can be stored in the memory.

In this work, we focus on a variation of TARDIS where the controller maintains a fixed-size external memory. However, as in (Cheng et al., 2016), it is possible to use a memory that grows with the length of the input sequence, but that would not scale well and can be more difficult to train with discrete addressing.

3 Training TARDIS

In this section, we explain how to train TARDIS as a language model. We use language modeling as an example application. However, we would like to highlight that TARDIS can also be applied to any complex sequence-to-sequence learning task.

Consider a set of training examples where each example is a sequence of a given length. At every time step t, the model receives the input x_t, a one-hot vector of size equal to the size of the vocabulary, and should produce the output y_t, also a one-hot vector of size equal to the size of the vocabulary.

The output of the model for a given example and time step is computed as follows:


where the output weights are learnable parameters and the output function is a single-layer MLP which combines both the hidden state and the read content, as in the deep fusion of Pascanu et al. (2013a). The task loss is the categorical cross-entropy between the targets and the model outputs. A superscript indexes the sample in the training set.


However, the discrete decisions taken for memory access at every time step make the model non-differentiable, and hence we need to rely on approximate methods for computing gradients with respect to the discrete address vectors. In this paper we explore two such approaches: REINFORCE (Williams, 1992) and the straight-through estimator (Bengio et al., 2013).


REINFORCE is a likelihood-ratio method which provides a convenient and simple way of estimating the gradients of stochastic actions. In this paper, we focus on the application of REINFORCE to sequential prediction tasks, such as language modelling. Let the reward for the action taken at timestep t be R_t. We are interested in maximizing the expected return for the whole episode, as defined below:


Ideally, we would like to compute the gradients for Equation 13; however, computing the gradient of the expectation may not be feasible. We therefore use a Monte-Carlo approximation and compute the gradients with the REINFORCE estimator for the sequential prediction task, which can be written as in Equation 14.


where a reward baseline is subtracted. Further, we can assume that future actions do not depend on past rewards in the episode/trajectory, and thereby further reduce the variance of REINFORCE as in Equation 15.


In our preliminary experiments, we found that training the model is easier with discounted returns than with the centered undiscounted return:
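The discounted return used to weight the REINFORCE gradient can be computed with a simple backward recursion. This is a generic sketch; the discount factor and reward values are illustrative.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """R_t = sum over t' >= t of gamma^(t'-t) * r_t' -- the discounted return
    that weights grad log pi(a_t) in the REINFORCE estimator."""
    R = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        R[t] = running
    return R

rewards = [0.0, 0.0, 1.0]
R = discounted_returns(rewards, gamma=0.9)
assert np.allclose(R, [0.81, 0.9, 1.0])
# subtracting a baseline (here the mean return) reduces variance without adding bias
advantages = R - R.mean()
assert abs(advantages.mean()) < 1e-9
```

In practice the advantages would multiply the log-probability gradients of the sampled memory addresses.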

3.1 Training REINFORCE with an Auxiliary Cost

Training models with REINFORCE can be difficult due to the variance the estimator injects into the gradients. In recent years, researchers have developed several techniques to mitigate the effect of this high variance. As proposed by Mnih and Gregor (2014), we also use variance normalization on the REINFORCE gradients.

For TARDIS, the reward at timestep t is the log-likelihood of the prediction at that timestep. Our initial experiments showed that REINFORCE with this reward structure often tends to under-utilize the memory and rely mainly on the internal memory of the LSTM controller. In particular, at the beginning of training, the model can decrease the loss simply by relying on the controller's internal memory, and this can cause REINFORCE to increase the log-likelihood of random actions.

In order to deal with this issue, instead of using the log-likelihood of the model as the reward, we introduce an auxiliary cost to use as the reward, computed from predictions that are based only on the memory cell read by the controller and not on the hidden state of the controller:


In Equation 17, we train only the parameters of this auxiliary predictor. We do not backpropagate through the controller's hidden state, and we mark it accordingly in our equations.

3.2 Using Gumbel Softmax

Training with REINFORCE can be challenging due to the high variance of the gradients; the Gumbel-softmax with a straight-through estimator provides a good alternative that tackles this variance issue. Unlike Maddison et al. (2016) and Jang et al. (2016), who anneal the temperature or fix it, our model learns the inverse temperature with an MLP which has a single scalar output conditioned on the hidden state of the controller.


We replace the softmax in Equation 5 with the Gumbel-softmax defined above. During the forward computation, we sample from the Gumbel-softmax and use the generated one-hot vector for memory access. During backpropagation, however, we use the continuous probabilities for the gradient computation, and hence the entire model becomes differentiable.

Learning the temperature of the Gumbel-softmax reduces the burden of performing an extensive hyper-parameter search for the temperature.
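A NumPy sketch of the straight-through Gumbel-softmax: in an autodiff framework the one-hot sample would be used in the forward pass and the soft probabilities in the backward pass. The inverse temperature here is a fixed scalar rather than the learned MLP output.

```python
import numpy as np

def gumbel_softmax_st(logits, inv_temperature, rng):
    """Straight-through Gumbel-softmax: returns a hard one-hot sample along
    with the soft probabilities that would carry the gradient."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))   # Gumbel(0, 1) noise
    y = (logits + g) * inv_temperature
    p = np.exp(y - y.max())
    p /= p.sum()                                           # tempered softmax
    one_hot = np.eye(len(logits))[np.argmax(p)]
    # straight-through trick (in an autodiff framework):
    #   output = one_hot + p - stop_gradient(p)
    return one_hot, p

rng = np.random.default_rng(0)
hard, soft = gumbel_softmax_st(np.array([2.0, 0.5, 0.1]), inv_temperature=2.0, rng=rng)
```

The hard sample keeps memory access discrete at run time, while the soft distribution makes the whole model trainable by backpropagation.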

4 Related Work

The Neural Turing Machine (NTM) (Graves et al., 2014) is the class of architectures most closely related to our model. NTMs have proven successful at generalizing to sequences longer than those seen during training, and have been shown to be more effective than gated models such as LSTMs at solving algorithmic tasks. However, the NTM has limitations that stem from some of its design choices. Because the controller lacks precise knowledge of what the memory contains, the contents of the memory can overlap. These memory augmented models are also known to be complicated, which makes them difficult to implement and train. Moreover, the controller has no information about the sequence of operations performed, such as the frequency of read and write accesses to the memory. TARDIS tries to address these issues.

Gulcehre et al. (2016) proposed a variant of the NTM called the dynamic NTM (D-NTM), which has learnable location-based addressing. The D-NTM can be used with both continuous and discrete addressing. The discrete D-NTM is related to TARDIS in the sense that both models use discrete addressing for all memory operations. However, the discrete D-NTM expects the controller to learn to read/write and also to learn reader/writer synchronization. TARDIS does not have this synchronization problem since the reader and writer are tied. Rae et al. (2016) proposed a sparse access memory (SAM) mechanism for NTMs, which can be seen as a hybrid of continuous and discrete addressing: SAM uses continuous addressing over a selected set of the most relevant memory cells. Recently, Graves et al. (2016) proposed the Differentiable Neural Computer (DNC), a successor of the NTM.

Rocktäschel et al. (2015) and Cheng et al. (2016) proposed models that generate weights to attend over the previous hidden states of the RNN. However, since those models attend over the whole context, computing the attention can be inefficient.

Grefenstette et al. (2015) proposed a model that can store information in a data structure, such as a stack, a deque, or a queue, in a differentiable manner.

Grave et al. (2016) proposed a cache-based memory representation which stores the last states of the RNN in the memory; similar to traditional cache-based models (Kuhn and De Mori, 1990), the model learns to choose a state from the memory for the prediction in language modeling tasks.

5 Gradient Flow through the External Memory

In this section, we analyze the flow of the gradients through the external memory and investigate its efficiency in dealing with the vanishing gradient problem (Hochreiter, 1991; Bengio et al., 1994). First, we describe the vanishing gradient problem in an RNN, and then we describe how an external memory model can deal with it. For the sake of simplicity, we focus on vanilla RNNs throughout the analysis, but the same analysis can be extended to LSTMs. We also assume that the weights for the read/write heads are discrete.

We will show that the rate at which gradients vanish through time for a memory-augmented recurrent neural network is much smaller than that of a regular vanilla recurrent neural network.

Consider an RNN which at each timestep t takes an input x_t and produces an output y_t. The hidden state of the RNN can be written as

h_t = f(W h_{t-1} + U x_t)

where W and U are the recurrent and the input weights of the RNN, respectively, and f is a non-linear activation function. Let L be the loss function that the RNN is trying to minimize. Given an input sequence of length T, we can write the derivative of the loss with respect to the parameters θ as,


The multiplication of many Jacobians of the form ∂h_t/∂h_{t-1} in order to obtain ∂h_T/∂h_t is the main cause of the vanishing and exploding gradients (Pascanu et al., 2013b):


Let us assume that the singular values of a matrix are ordered from largest to smallest. Let σ_max be an upper bound on the largest singular value of the recurrent weight matrix W; then the norm of the Jacobian will satisfy (Zilly et al., 2016),


Pascanu et al. (2013b) showed that, for γ an upper bound on the norm of the derivative of the activation function, the following inequality holds:


When σ_max < 1/γ, the norm of the product of Jacobians decays exponentially with the length of the path, and hence the norm of the gradients will vanish exponentially fast.
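This bound is easy to check numerically: with a tanh activation (whose derivative is bounded by 1) and a recurrent matrix whose largest singular value is 0.9, the spectral norm of the accumulated product of Jacobians decays at least geometrically. This is a toy sketch with arbitrary sizes, not the paper's experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 32, 50
W = rng.standard_normal((d, d))
W *= 0.9 / np.linalg.svd(W, compute_uv=False)[0]   # rescale so sigma_max(W) = 0.9
U = rng.standard_normal((d, 4)) * 0.1

h = np.zeros(d)
J = np.eye(d)                                      # accumulates dh_t / dh_0
norms = []
for t in range(T):
    h = np.tanh(W @ h + U @ rng.standard_normal(4))
    D = np.diag(1.0 - h ** 2)                      # tanh derivative, entries <= 1
    J = D @ W @ J                                  # chain rule: one more Jacobian factor
    norms.append(np.linalg.norm(J, 2))

# the spectral norm decays at least like 0.9^t, so gradients vanish exponentially
assert norms[-1] < norms[0] * 0.9 ** (T // 2)
```

The same loop with a wormhole connection would add a second, much shorter product of Jacobians, which is the quantity analyzed next.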

Now consider the MANN where the contents of the memory are linear projections of the previous hidden states, as described in Equation 2. Let us assume that both the reading and writing operations use discrete addressing, and let the content r_t read from the memory at timestep t correspond to some memory location j:

r_t = M_t[j] = W_m h_{t'}

where h_{t'} is the hidden state of the controller at some previous timestep t' < t.

Now the hidden state of the controller in the external memory model can be written as,


If the controller reads at timestep t the memory content written at timestep t', as described above, then the Jacobians associated with Equation 5 can be computed as follows:


where the two terms of the sum are defined below,


As shown in Equation 29, the Jacobian of the MANN can be rewritten as a summation of two matrices. The gradients flowing through the second term do not necessarily vanish through time, because it is a sum of Jacobians computed over the shorter paths.

The norm of the Jacobian can be lower-bounded as follows by using the Minkowski inequality:


Assuming that the length of the dependency is very long, the term corresponding to the full recurrent path vanishes to 0. Then we have,


As one can see, the rate at which the gradients vanish through time depends on the length of the path passing through the wormhole connections. This is typically smaller than the length of the full recurrent path, and thus the gradients vanish at a smaller rate than in an RNN. In particular, the rate strictly depends on the length of the shortest path between the dependent timesteps, because for long enough dependencies the gradients through the longer paths would still vanish.

We can also derive an upper bound for the norm of the Jacobian as follows:


Using the result from (Loyka, 2015), we can lower-bound it as follows:


For long sequences, we know that the term corresponding to the full recurrent path goes to 0 (see Equation 25). Hence,


The rate at which the wormhole-path term reaches zero is strictly smaller than the rate at which the full-path term reaches zero, and with ideal memory access it does not reach zero at all. Hence, unlike for vanilla RNNs, Equation 38 states that the upper bound on the norm of the Jacobian will not reach zero for a MANN with ideal memory access.

Consider a memory augmented neural network with enough memory cells for a sequence of length T, in which each hidden state of the controller is stored in a different cell of the memory. If the prediction at timestep T has only a long-term dependency on an earlier timestep t, the prediction at T is independent of the tokens appearing before t, and the memory reading mechanism is perfect, then the model will not suffer from vanishing gradients when we backpropagate from T to t.222Let us note that, unlike a Markovian n-gram assumption, here we assume that at each timestep the dependency can be different.


If the input sequence has a longest dependency from timestep T back to timestep t, we are only interested in the gradients propagating from T to t and the corresponding Jacobians. If the controller learns a perfect reading mechanism, at timestep T it reads the memory cell where the hidden state of the RNN at timestep t is stored. Thus, following the Jacobians defined in Equation 29, we can rewrite the Jacobians as,


In Equation 39, the first two terms might vanish as the temporal distance grows. However, the singular values of the third term do not change as that distance grows. As a result, the gradients propagated from T to t will not necessarily vanish through time. However, in order to obtain stable dynamics for the network, the initialization of the weight matrices is important.

This analysis highlights the fact that an external memory model with an optimal read/write mechanism can handle long-range dependencies much better than an RNN. However, this is applicable only when we use discrete addressing for the read/write operations. Both the NTM and the D-NTM still have to learn how to read and write from scratch, which is a challenging optimization problem. For TARDIS, tying the read/write operations makes learning much simpler for the model. In particular, the result of Theorem 5 points to the importance of designing better attention mechanisms over the memory.

The controller of a MANN may not be able to learn to use the memory efficiently. For example, some cells of the memory may remain empty or may never be read, and the controller may overwrite memory cells that have not yet been read, in which case the information stored in those cells is lost completely. TARDIS avoids most of these issues by the construction of its algorithm.

6 On the Length of the Paths Through the Wormhole Connections

As we have discussed in Section 5, the rate at which the gradients vanish for a MANN depends on the length of the paths passing along the wormhole connections. In this section we analyse those lengths in depth for untrained models, which assign uniform probability to reading or writing each memory cell. This gives us a better idea of how each untrained model uses its memory at the beginning of training.

A wormhole connection is created in TARDIS by reading a memory cell and writing into the same cell. For example, in Figure 2, while the actual path between two dependent timesteps is of length 4, the memory cell creates a shorter path of length 2. In what follows, we distinguish the length of the actual path from the length of the shorter path created by wormhole connections.

Consider a TARDIS model with k cells in its memory. If TARDIS accesses each memory cell uniformly at random, then each cell is read with probability 1/k. The expected length of the shorter path created by wormhole connections is proportional to the number of reads and writes into a memory cell; for TARDIS with a reader choosing a memory cell uniformly at random, this is on the order of T/k at the end of a sequence of length T. We verify this result by simulating the read and write heads of TARDIS, as in Figure 3 a).
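As a sanity check on the T/k scaling, the expected number of read/write visits per cell can be estimated with a short Monte Carlo simulation. This is our own illustrative sketch, not the authors' simulation code; the values of T, k and the number of trials are hypothetical.

```python
import random

def expected_wormhole_visits(T=200, k=50, trials=500):
    """Monte Carlo estimate of the expected number of read/write visits per
    memory cell for TARDIS with a uniformly random read head. Each visit
    extends the wormhole path stored in that cell by one hop, so the
    expected wormhole path length per cell is roughly T / k."""
    totals = [0.0] * k
    for _ in range(trials):
        visits = [1] * k           # the first k steps fill each cell once
        for _t in range(k, T):     # afterwards: read a random cell, write back
            visits[random.randrange(k)] += 1
        for i in range(k):
            totals[i] += visits[i]
    return [v / trials for v in totals]
```

With T = 200 and k = 50, each cell is visited about T/k = 4 times on average, matching the scaling discussed above.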

Figure 3: In these figures we visualize the expected path length in the memory cells, estimated from simulations. a) shows the results for TARDIS and b) shows the simulation for a MANN with uniformly random read and write heads.

Now consider a MANN with separate read and write heads, each accessing the memory in a discrete and uniformly random fashion; let us call it uMANN. We will compute the expected length of the shorter path created by wormhole connections for uMANN. The read and write head weights are each sampled from a multinomial distribution that puts uniform probability on every memory cell. For any memory cell, the length of the path created by wormhole connections in that cell can be computed recursively: at each timestep, the write extends, by one hop, the path stored in the cell that was just read.

It is possible to prove by induction that this expected path length is on the order of T/k for every memory cell, where T is the sequence length and k is the number of memory cells; the proof assumes that while the memory is first being filled, each cell stores a path of unit length. We have run simulations to compute the expected path length in a memory cell of uMANN, shown in Figure 3 (b).

This analysis shows that while TARDIS with a uniform read head maintains the same expected length of the shorter path created by wormhole connections as uMANN, it completely avoids the reader/writer synchronization problem.

If the memory is large enough, the wormhole path between two dependent timesteps will be much shorter than the direct path through the recurrence. In expectation, the gradient along the direct path decays in proportion to the direct path length (exponentially, when Equation 25 holds), whereas the gradient along the wormhole path decays only in proportion to the much shorter wormhole path length. With ideal memory access, the rate at which the latter reaches zero is strictly smaller than the rate at which the former reaches zero. Hence, as per Equation 38, the upper bound on the norm of the Jacobian vanishes at a much smaller rate. However, this result assumes that the dependencies on which the prediction relies are accessible through the memory cell that has been read by the controller.
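To make the rate comparison concrete, here is a toy numerical sketch. The per-step singular-value bound sigma_max = 0.9 and the path lengths are hypothetical values of our own, not from the paper; the point is only that the bound decays geometrically in the path length.

```python
def jacobian_norm_bound(sigma_max, path_len):
    """Upper bound on the norm of the Jacobian accumulated along a path of
    `path_len` steps, assuming the largest singular value of each step's
    recurrent Jacobian is at most sigma_max (a hypothetical constant < 1)."""
    return sigma_max ** path_len

# A wormhole connection shortening a 45-step path to 6 steps raises the
# bound substantially, so gradients along it vanish far more slowly.
long_path = jacobian_norm_bound(0.9, 45)
short_path = jacobian_norm_bound(0.9, 6)
```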

Figure 4: Assuming that the prediction at timestep t1 depends on an earlier timestep t2, a wormhole connection can shorten the path between them. A wormhole connection may not directly connect t2 to t1, but it can create shorter paths through which the gradients can flow without vanishing. In this figure, we consider the case where a wormhole connection is created between the two timesteps, skipping all the tokens in between.

In the more general case, consider a MANN whose writer simply fills in the memory cells in a sequential manner while the reader chooses a memory cell uniformly at random; let us call this model urMANN. Assume there is a dependency between two timesteps t1 and t2, with t2 earlier, as shown in Figure 4. If the read address at t1 is drawn uniformly, then by symmetry there is probability 1/2 that it points to a cell written at or after t2. In that case, the expected length of the shortest path through that wormhole connection still grows with the gap between the two timesteps, so this alone would not scale well. If the reader is very well trained, it could pick exactly the cell written at t2, and the path length would be 1.

Let us consider all the paths of the form in Figure 4 whose length is at most some small Δ. The shortest path from t1 to t2 can then use a wormhole connection that links a state written shortly after t2 with a state read shortly before t1. Several such paths are realized, but we leave the distribution of the length of the shortest path as an open question. However, the probability of hitting a very short path (of length at most Δ) grows quickly with Δ. Let p be the probability that a single read hits the interval starting at t2 of width Δ; p is on the order of Δ/t1. Then the probability that at least one of the last Δ reads hits that interval is 1 − (1 − p)^Δ, and the probability of not hitting that interval at all approaches 0 exponentially with Δ.
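The "at least one of the last Δ reads hits the window" argument can be written down directly. This is our own illustrative sketch; the values t = 50 and Δ = 5 are hypothetical and chosen to match the simulation setting in Figure 5.

```python
def prob_short_path(t, delta):
    """Probability that at least one of the last `delta` uniform reads lands
    in a window of `delta` cells around the dependency, assuming the reader
    picks uniformly among the ~t cells written so far. The complement,
    (1 - delta/t) ** delta, decays exponentially in delta."""
    p_single = delta / t
    return 1.0 - (1.0 - p_single) ** delta
```

For t = 50 and Δ = 5, a single read hits the window with probability 0.1, and at least one of the last five reads hits it with probability 1 − 0.9^5 ≈ 0.41.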

Figure 4 illustrates how wormhole connections can create shorter paths. In Figure 5 (b), we show that the expected length of the path travelled outside the wormhole connections, obtained from the simulations, decreases as the size of the memory increases. In particular, for urMANN and TARDIS the trend is very close to exponential. As shown in Figure 5 (a), this also influences the total length of the paths travelled from timestep 50 to 5. Writing into the memory with weights sampled uniformly over all memory cells does not use the memory as efficiently as the other approaches we compare to; in particular, fixing the writing mechanism with a heuristic seems to be useful.

Even if the reader does not manage to learn where to read, there are many "short paths" which can considerably reduce the effect of vanishing gradients.

Figure 5: We have run simulations for TARDIS, a MANN with uniform read and write heads (uMANN), and a MANN with a uniform read head whose write head is fixed with a heuristic (urMANN). In our simulations, we assume that there is a dependency from timestep 50 to timestep 5. We run 200 simulations for each model with different memory sizes. In plot a), we show the expected length of the shortest path from timestep 50 to 5; as the size of the memory gets larger, the length of the shortest path decreases dramatically for all models. In plot b), we show the expected length of the shortest path travelled outside the wormhole connections with respect to different memory sizes. TARDIS seems to use the memory more efficiently than the other models, in particular when the memory is small, by creating shorter paths.

7 On Generalization over the Longer Sequences

Graves et al. (2014) have shown that LSTMs cannot generalize well to sequences longer than the ones seen during training, whereas a MANN such as an NTM or a D-NTM has been shown to generalize to longer sequences on a set of toy tasks.

We believe that the main reason why LSTMs typically do not generalize to sequences longer than those seen during training is that the hidden state of an LSTM network utilizes an unbounded history of the input sequence; as a result, its parameters are optimized under the maximum likelihood criterion to fit sequences with the lengths of the training examples. An n-gram language model or an HMM does not suffer from this issue: an n-gram LM uses an input context with a fixed window size, and an HMM has the Markov property in its latent space. As argued below, we claim that a MANN, while being trained, can also learn to generalize to sequences longer than those in the training set by modifying the contents of the memory and reading from it.

A regular RNN minimizes the negative log-likelihood of the targets by using the unbounded history represented in its hidden state: it models a parametrized conditional distribution for the prediction at each timestep given the full history, whereas a MANN learns a conditional that also depends on the memory. If we assume that the memory represents all the dependencies that the prediction relies on in the input sequence, then the MANN's conditional effectively depends only on a limited context window containing paths shorter than the sequences seen during training. Due to this property, we claim that MANNs such as NTM, D-NTM or TARDIS can generalize to longer sequences more easily. In our experiments on PennTreebank, we show that a TARDIS language model trained to minimize the log-likelihood of the full conditional yields very close test results whether the prediction is conditioned on the unbounded history or on the memory-based context alone. The fact that the best results on the bAbI dataset in (Gulcehre et al., 2016) were obtained with a feedforward controller, and that (Graves et al., 2014) similarly used a feedforward controller to solve some of the toy tasks, also supports our hypothesis. As a result, what has been written into the memory and what has been read becomes very important for generalizing to longer sequences.
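The factorization argument can be sketched as follows; the symbols here are our own reconstruction for illustration, not necessarily the paper's exact notation.

```latex
% RNN: unbounded history enters through the hidden state
p_\theta(y_t \mid x_{\le t}) = p_\theta(y_t \mid h_t),
    \qquad h_t = f(h_{t-1}, x_t)
% MANN: prediction conditioned on hidden state and memory
p_\theta(y_t \mid h_t, M_t)
% If M_t captures all dependencies of y_t, the conditional reduces to a
% bounded context c_t containing only paths shorter than the training
% sequence lengths:
p_\theta(y_t \mid h_t, M_t) \approx p_\theta(y_t \mid c_t)
```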

8 Experiments

8.1 Character-level Language Modeling on PTB

As a preliminary study of the performance of our model, we consider character-level language modelling. We have evaluated our models on the Penn TreeBank (PTB) corpus (Marcus et al., 1993) using the train, validation and test splits of (Mikolov et al., 2012). On this task, we use layer normalization (Ba et al., 2016) and recurrent dropout (Semeniuta et al., 2016), as these are also used by the SOTA results on this task; both improve performance significantly and reduce the effects of overfitting. We train our models with Adam (Kingma and Ba, 2014) over sequences of length 150. We show our results in Table 1.

In addition to the regular char-LM experiments, in order to confirm our hypothesis regarding the ability of MANNs to generalize to sequences longer than those seen during training, we have trained a language model which learns the full conditional by using a softmax layer as described in Equation 11. To measure the performance of the memory-based conditional on the test set, we instead used the softmax layer that enters the auxiliary cost defined for REINFORCE, as in Equation 17, for a model trained with REINFORCE and the auxiliary cost. As shown in Table 1, the model's performance using the full conditional is 1.26 BPC, while using the memory-based conditional it becomes 1.28 BPC. This gap is small enough to support our assumption that the two conditionals are close.


Model BPC
CW-RNN (Koutnik et al., 2014) 1.46
HF-MRNN (Sutskever et al., 2011) 1.41
ME -gram (Mikolov et al., 2012) 1.37
BatchNorm LSTM (Cooijmans et al., 2016) 1.32
Zoneout RNN (Krueger et al., 2016) 1.27
LayerNorm LSTM (Ha et al., 2016) 1.27
LayerNorm HyperNetworks (Ha et al., 2016) 1.23
LayerNorm HM-LSTM & Step Fn. & Slope Annealing (Chung et al., 2016) 1.24
Our LSTM + Layer Norm + Dropout 1.28
TARDIS + REINFORCE + Auxiliary Cost 1.28
TARDIS + REINFORCE + Auxiliary Cost + R 1.26
TARDIS + Gumbel Softmax + ST + R 1.25


Table 1: Character-level language modelling results on the Penn TreeBank dataset. TARDIS with Gumbel softmax and the straight-through (ST) estimator performs better than REINFORCE and performs competitively with the SOTA on this task. "+ R" denotes the use of RESET gates.

8.2 Sequential Stroke Multi-digit MNIST task

In this subsection, we introduce a new pen-stroke based sequential multi-digit MNIST prediction task as a benchmark for long-term dependency modelling, and we benchmark the performance of LSTM and TARDIS on this challenging task.

8.2.1 Task and Dataset

Recently, de Jong (2016) introduced an MNIST pen-stroke classification task and provided a dataset consisting of pen-stroke sequences representing the skeletons of the digits in the MNIST dataset. Each MNIST digit image is represented as a sequence of quadruples (dx, dy, eos, eod), one per stroke, where (dx, dy) denotes the pen offset from the previous to the current stroke (each component can be 1, -1 or 0), eos is a binary feature denoting the end of a stroke, and eod is a binary feature denoting the end of the digit. In the original dataset, the first quadruple contains absolute coordinates instead of offsets; without loss of generality, we set the starting position to (0, 0) in our experiments. Each digit is represented by 40 strokes on average, and the task is to predict the digit at the end of the stroke sequence.
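To make the offset encoding concrete, here is a small sketch of our own (not the dataset's reference code) that reconstructs absolute pen positions from a sequence of offset quadruples, with the starting position fixed at (0, 0) as described above:

```python
def strokes_to_points(strokes):
    """Reconstruct absolute pen positions from offset quadruples
    (dx, dy, eos, eod), where eos marks the end of a stroke and eod the
    end of the digit, starting from position (0, 0)."""
    x, y = 0, 0
    points = []
    for dx, dy, eos, eod in strokes:
        x, y = x + dx, y + dy   # accumulate the offsets
        points.append((x, y))
    return points
```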

While this dataset was proposed for incremental sequence learning in (de Jong, 2016), we consider a multi-digit version of it to benchmark models that handle long-term dependencies. Specifically, given a sequence of pen-stroke sequences, the task is to predict the sequence of digits corresponding to each pen-stroke sequence, in the given order. This is a challenging task since it requires the model to predict each digit from its pen-stroke sequence, count the digits, remember them, and generate them in the same order after seeing all the strokes. In our experiments we consider 3 versions of this task, with 5, 10, and 15 digit sequences respectively. We generated 200,000 training data points by randomly sampling digits from the training set of MNIST. Similarly, we generated 20,000 validation and test data points by randomly sampling digits from the MNIST validation and test sets respectively. The average lengths of the stroke sequences in these tasks are 199, 399, and 599 respectively.
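The data generation described above can be sketched as follows; `digit_strokes` and `digit_labels` are hypothetical containers for the per-digit MNIST stroke sequences and their classes, not names from the paper.

```python
import random

def make_multidigit_example(digit_strokes, digit_labels, n_digits=5):
    """Build one multi-digit example: sample n_digits digits uniformly at
    random, concatenate their stroke sequences as the input, and use the
    digit classes (in order) as the target sequence."""
    idx = [random.randrange(len(digit_labels)) for _ in range(n_digits)]
    strokes = [q for i in idx for q in digit_strokes[i]]
    labels = [digit_labels[i] for i in idx]
    return strokes, labels
```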

Figure 6: An illustration of the sequential MNIST strokes task with multiple digits. The network is first provided with the sequence of stroke information (pen locations) for each MNIST digit as input; during prediction, the network tries to predict the MNIST digits it has just seen. While predicting, the predictions from the previous time steps are fed back into the network. When prediction starts, the model receives a special <bos> token at the first time step.

8.2.2 Results

We benchmark the performance of LSTM and TARDIS on this new task. Both models receive the sequence of pen strokes and, at the end of the sequence, are expected to generate the sequence of digits after receiving a special <bos> token. The task is illustrated in Figure 6. We evaluate the models based on per-digit error rate. We also compare the performance of TARDIS with REINFORCE with that of TARDIS with Gumbel softmax. All models were trained for the same number of updates, with early stopping based on the per-digit error rate on the validation set. Results for all 3 versions of the task are reported in Table 2. From the table, we can see that TARDIS performs better than LSTM in all three versions of the task. Also, TARDIS with Gumbel softmax performs slightly better than TARDIS with REINFORCE, which is consistent with our other experiments.

Model 5-digits 10-digits 15-digits
LSTM 3.00% 3.54% 8.81%
TARDIS with REINFORCE 2.09% 2.56% 3.67%
TARDIS with gumbel softmax 1.89% 2.23% 3.09%
Table 2: Per-digit test error on the sequential stroke multi-digit MNIST task with 5, 10, and 15 digits.

We also compare the learning curves of all three models in Figure 7. From the figure we can see that TARDIS learns to solve the task faster than LSTM by effectively utilizing the given memory slots. Also, TARDIS with Gumbel softmax converges faster than TARDIS with REINFORCE.

Figure 7: Learning curves for LSTM and TARDIS for sequential stroke multi-digit MNIST task with 5, 10, and 15 digits respectively.

8.3 NTM Tasks

Graves et al. (2014) proposed the associative recall and copy tasks to evaluate a model's ability to learn simple algorithms and generalize to sequences longer than the ones seen during training. We trained a TARDIS model with 4 features for the address and 32 features for the memory content part of the model, a hidden state of size 120, and a memory of size 16. We train our model with Adam using a learning rate of 3e-3. We show the results of our model in Table 3. TARDIS was able to solve both tasks, with both Gumbel softmax and REINFORCE.

Copy Task Associative Recall
D-NTM cont. (Gulcehre et al., 2016) Success Success
D-NTM discrete (Gulcehre et al., 2016) Success Failure
NTM (Graves et al., 2014) Success Success
TARDIS + Gumbel Softmax + ST Success Success
TARDIS REINFORCE + Auxiliary Cost Success Success
Table 3: We consider a model to be successful on copy or associative recall if its validation cost (binary cross-entropy) is lower than 0.02 over the sequences of maximum length seen during training; this 0.02 threshold follows (Gulcehre et al., 2016).

8.4 Stanford Natural Language Inference

Bowman et al. (2015) proposed a new task to test machine learning algorithms' ability to infer whether two given sentences entail, contradict, or are neutral (semantically independent) with respect to each other. This task can be considered a long-term dependency task if the premise and the hypothesis are presented to the model in sequential order, as also explored by Rocktäschel et al. (2015), because the model should learn the dependency relationship between the hypothesis and the premise. Our model first reads the premise, then the hypothesis, and at the end of the hypothesis it predicts whether the premise and the hypothesis contradict or entail each other. The model proposed by Rocktäschel et al. (2015) applies attention over its previous hidden states over the premise while it reads the hypothesis; in that sense, their model can still be considered to have a task-specific architectural design choice. TARDIS and our baseline LSTM models do not include any task-specific architectural design choices. In Table 4, we compare the results of different models. Our model performs significantly better than the other models. However, it has recently been shown that with architectural tweaks it is possible to design a model specifically for this task and achieve 88.2% test accuracy (Chen et al., 2016).


Model Test Accuracy
Word by Word Attention (Rocktäschel et al., 2015) 83.5
Word by Word Attention two-way (Rocktäschel et al., 2015) 83.2
LSTM + LayerNorm + Dropout 81.7
TARDIS + REINFORCE + Auxiliary Cost 82.4
TARDIS + Gumbel Softmax + ST 84.3


Table 4: Comparisons of different baselines on the SNLI task.

9 Conclusion

In this paper, we propose a simple and efficient memory augmented neural network model which can perform well both on algorithmic tasks and more realistic tasks. Unlike the previous approaches, we show better performance on real-world NLP tasks, such as language modelling and SNLI. We have also proposed a new task to measure the performance of the models dealing with long-term dependencies.

We provide a detailed analysis of the effect of external memory on gradient flow and justify why MANNs generalize better to sequences longer than the ones seen in the training set. We have also shown that the gradients will vanish at a much slower rate (if they vanish at all) when an external memory is used. Our theoretical results should encourage further studies in the direction of developing better attention mechanisms that can create wormhole connections efficiently.


Acknowledgements

We thank Chinnadhurai Sankar for suggesting the phrase "wormhole connections" and for proof-reading the paper. We would like to thank Dzmitry Bahdanau for comments and feedback on an earlier version of this paper. We would also like to thank the developers of Theano (Theano Development Team, 2016) for developing such a powerful tool for scientific computing. We acknowledge the support of the following organizations for research funding and computing support: NSERC, Samsung, Calcul Québec, Compute Canada, the Canada Research Chairs and CIFAR. SC is supported by a FQRNT-PBEEE scholarship.