1 Introduction
Recurrent Neural Networks (RNNs) are neural network architectures designed to handle temporal dependencies in sequential prediction problems. However, it is well known that RNNs suffer from vanishing gradients as the length of the sequence and of the dependencies increases (Hochreiter, 1991; Bengio et al., 1994). Long Short-Term Memory (LSTM) units (Hochreiter and Schmidhuber, 1997) were proposed as an alternative architecture which can handle long-range dependencies better than a vanilla RNN. A simplified version of the LSTM unit called the Gated Recurrent Unit (GRU), proposed in (Cho et al., 2014), has proven to be successful in a number of applications (Bahdanau et al., 2015; Xu et al., 2015; Trischler et al., 2016; Kaiser and Sutskever, 2015; Serban et al., 2016). Even though LSTMs and GRUs attempt to solve the vanishing gradient problem, the memory in both architectures is stored in a single hidden vector, as in an RNN, and hence accessing information too far in the past can still be difficult. In other words, LSTM and GRU models have a limited ability to search through their past memories when they need to access relevant information for making a prediction. Extending the capabilities of neural networks with a memory component has been explored in the literature on different applications with different architectures
(Weston et al., 2015; Graves et al., 2014; Joulin and Mikolov, 2015; Grefenstette et al., 2015; Sukhbaatar et al., 2015; Bordes et al., 2015; Chandar et al., 2016; Gulcehre et al., 2016; Graves et al., 2016; Rae et al., 2016).
Memory augmented neural networks (MANN) such as neural Turing machines (NTM) (Graves et al., 2014; Rae et al., 2016), dynamic NTM (DNTM) (Gulcehre et al., 2016), and Differentiable Neural Computers (DNC) (Graves et al., 2016) use an external memory (usually a matrix) to store information, and the MANN’s controller can learn to both read from and write into the external memory. As we show here, it is in general possible to use particular MANNs to explicitly store the previous hidden states of an RNN in the memory, which provides shortcut connections through time, called here wormhole connections, to look into the history of the states of the RNN controller. Learning to read from and write into an external memory by using neural networks gives the model more freedom or flexibility to retrieve information from its past, and to forget or store new information in the memory. However, if the addressing mechanism for read and/or write operations is continuous (as in the NTM and continuous DNTM), then the access may be too diffuse, especially early on during training. This can especially hurt the writing operation, since a diffuse write operation will overwrite a large fraction of the memory at each step, yielding fast vanishing of the memories (and gradients). On the other hand, discrete addressing, as used in the discrete DNTM, should be able to perform this search through the past, but prevents us from using straight backpropagation for learning how to choose the address.
We investigate the flow of the gradients and how the wormhole connections introduced by the controller affect it. Our results show that the wormhole connections created by the controller of the MANN can significantly reduce the effects of vanishing gradients by shortening the paths that the signal needs to travel between the dependencies. We also discuss how MANNs can generalize to sequences longer than the ones seen during training.
In a discrete DNTM, the controller must learn to read from and write into the external memory by itself, and it must additionally learn the reader/writer synchronization. This can make learning more challenging. In spite of this difficulty, Gulcehre et al. (2016) reported that the discrete DNTM can learn faster than the continuous DNTM on some of the bAbI tasks. We provide a formal analysis of gradient flow in MANNs based on discrete addressing and justify this result. In this paper, we also propose a new MANN based on discrete addressing called TARDIS (Temporal Automatic Relation Discovery in Sequences). In TARDIS, memory access is based on tying the write and read heads of the model after the memory is filled up. When the memory is not full, the write head stores information in the memory in sequential order.
The main characteristics of TARDIS are as follows. TARDIS is a simple memory augmented neural network model which can represent long-term dependencies efficiently by using an external memory of small size. TARDIS represents the dependencies between the hidden states inside the memory. We show both theoretically and experimentally that TARDIS fixes to a large extent the problems related to long-term dependencies. Our model can also store subsequences or sequence chunks in the memory. As a consequence, the controller can learn to represent high-level temporal abstractions as well. TARDIS performs well on several structured output prediction tasks, as verified in our experiments.
The idea of using external memory with attention can be justified with the concept of mental time travel, which humans do occasionally to solve daily tasks. In particular, in the cognitive science literature, the concept of chronesthesia is known to be a form of consciousness which allows humans to think about time subjectively and perform mental time travel (Tulving, 2002). TARDIS is inspired by this human ability to look up past memories and plan for the future using episodic memory.
2 TARDIS: A Memory Augmented Neural Network
Neural network architectures with an external memory represent the memory in matrix form, such that at each time step the model can both read from and write to the external memory. The whole content of the external memory can be considered as a generalization of the hidden state vector in a recurrent neural network. Instead of storing all the information in a single hidden state vector, our model can store it in a matrix, which has a higher capacity and a more targeted ability to substantially change or use only a small subset of the memory at each time step. The neural Turing machine (NTM) (Graves et al., 2014) is one example of such a MANN, with both reading from and writing into the memory.
2.1 Model Outline
In this subsection, we describe the basic structure of TARDIS (Temporal Automatic Relation Discovery In Sequences; the name of the model is inspired by the time machine in the popular TV series Dr. Who). TARDIS is a MANN which has an external memory matrix where is the number of memory cells and is the dimensionality of each cell. The model has an RNN controller which can read from and write to the external memory at every time step. To read from the memory, the controller generates the read weights and the reading operation is typically achieved by computing the dot product between the read weights and the memory , resulting in the content vector :
(1) 
TARDIS uses discrete addressing and hence is a one-hot vector and the dot product chooses one of the cells in the memory matrix (Zaremba and Sutskever, 2015; Gulcehre et al., 2016). To write into the memory, the controller generates the write weights , which is also a one-hot vector under discrete addressing. We omit biases from our equations for simplicity in the rest of the paper. Let be the index of the non-zero entry in the one-hot vector , then the controller writes a linear projection of the current hidden state to the memory location :
(2) 
where is the projection matrix that projects the dimensional hidden state vector to a dimensional microstate vector such that .
At every time step, the hidden state of the controller is also conditioned on the content read from the memory. The wormhole connections are created by conditioning on :
(3) 
As each cell in the memory is a linear projection of one of the previous hidden states, conditioning the controller’s hidden state on the content read from the memory can be interpreted as a way of creating shortcut connections across time (from the time that was written to the time when it was read through ), which can help the flow of gradients across time. This is possible because of the discrete addressing used for read and write operations.
However, the main challenge for the model is to learn proper read and write mechanisms so that it can write the hidden states of the previous time steps that will be useful for future predictions and read them at the right time step. We call this the reader/writer synchronization problem. Instead of designing complicated addressing mechanisms to mitigate the difficulty of learning how to properly address the external memory, TARDIS sidesteps the reader/writer synchronization problem by using the following heuristics. For the first time steps, our model writes the microstates into the cells of the memory in sequential order. When the memory becomes full, the most effective strategy in terms of preserving the information stored in the memory is to replace the memory cell that has been read with the microstate generated from the hidden state of the controller after it is conditioned on the memory cell that has been read. If the model needs to perfectly retain the memory cell that it has just overwritten, the controller can in principle learn to do that by copying its read input to its write output (into the same memory cell). The pseudocode and the details of the memory update algorithm for TARDIS are presented in Algorithm 1.
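The write heuristic described above can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 1: the function name `run_tardis_memory`, the `choose_read` callback (standing in for the learned read head), and the use of plain labels in place of microstate vectors are all assumptions made for the example. The sketch also assumes that, before the memory is full, reads are restricted to cells that have already been written. In the actual model the microstate written at a step depends on the hidden state after it has been conditioned on the cell that was read; here we only track which cell is read and which is written.

```python
def run_tardis_memory(microstates, num_cells, choose_read):
    """Track read/write indices under the tied-head heuristic (illustrative only).

    microstates: sequence of items to store, one per time step.
    num_cells:   number of memory cells (hypothetical parameter name).
    choose_read: callable taking the number of filled cells and returning
                 the index of the cell to read (stand-in for the read head).
    """
    memory = [None] * num_cells
    trace = []  # (read_index, write_index) for every step
    for t, state in enumerate(microstates):
        filled = min(t, num_cells)
        # Assumption: nothing can be read before the first write.
        read_index = choose_read(filled) if filled > 0 else None
        if t < num_cells:
            # Memory not yet full: write sequentially.
            write_index = t
        else:
            # Memory full: overwrite the cell that was just read.
            write_index = read_index
        memory[write_index] = state
        trace.append((read_index, write_index))
    return memory, trace


if __name__ == "__main__":
    labels = ["m0", "m1", "m2", "m3", "m4"]
    final_memory, steps = run_tardis_memory(labels, 3, lambda filled: 0)
    print(final_memory)
    print(steps)
```

Because the write location is tied to the read location once the memory is full, the only addressing decision left to learn in this sketch is `choose_read`.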
There are two missing pieces in Algorithm 1: How to generate the read weights? What is the structure of the controller function ? We will answer these two questions in detail in the next two subsections.
2.2 Addressing mechanism
Similar to DNTM, the memory matrix of TARDIS has a disjoint address section and content section , and for . However, unlike DNTM, the address vectors are fixed to random sparse vectors. The controller reads both the address and the content parts of the memory, but it only writes into the content section of the memory.
The continuous read weights are generated by an MLP which uses the information coming from , , and the usage vector (described below). The MLP is parametrized as follows:
(4)  
(5) 
where are learnable parameters. is a one-hot vector obtained by either sampling from or by using argmax over .
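The two ways of turning continuous read weights into a discrete address can be sketched in plain Python. The function names and the list-of-floats representation are assumptions for illustration; the logits would come from the addressing MLP, which is not modelled here.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot_address(logits, sample=True, rng=random):
    """Return a one-hot read address chosen from the softmax over logits.

    sample=True draws the index from the categorical distribution;
    sample=False takes the argmax instead.
    """
    probs = softmax(logits)
    indices = range(len(probs))
    if sample:
        index = rng.choices(indices, weights=probs, k=1)[0]
    else:
        index = max(indices, key=probs.__getitem__)
    return [1.0 if i == index else 0.0 for i in indices]


if __name__ == "__main__":
    print(one_hot_address([0.1, 2.0, -1.0], sample=False))
    print(one_hot_address([0.1, 2.0, -1.0], sample=True, rng=random.Random(0)))
```

Sampling keeps some exploration over memory cells, whereas argmax is deterministic given the logits.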
is the usage vector which denotes the frequency of accesses to each cell in the memory. is computed by summing the discrete address vectors and normalizing the result.
(6) 
The normalization applied in Equation 6 is a simple feature-wise computation of centering and divisive variance normalization. This normalization step makes training with the usage vectors easier. The usage vector can help the attention mechanism to choose between the different memory cells based on how frequently each cell of the memory is accessed. For example, if a memory cell is very rarely accessed by the controller, then at the next time step the controller can learn to assign more weight to that memory cell by looking at the usage vector. In this way, the controller can learn an LRU access mechanism (Santoro et al., 2016; Gulcehre et al., 2016).
Further, to prevent the model from learning deficient addressing mechanisms, e.g. repeatedly reading the same memory cell, which will not increase the memory capacity of the model, we decrease the probability of the last read memory location by subtracting from the logit of for that particular memory location.
2.3 TARDIS Controller
We use an LSTM controller, and its gates are modified to take into account the content of the cell read from the memory:
(7) 
where , , and are forget gate, input gate, and output gate respectively. are the scalar RESET gates which control the magnitude of the information flowing from the memory and the previous hidden states to the cell of the LSTM . By controlling the flow of information into the LSTM cell, those gates will allow the model to store the subsequences or chunks of sequences into the memory instead of the entire context.
We use the Gumbel-sigmoid (Maddison et al., 2016; Jang et al., 2016) for and because its behavior is close to binary.
(8) 
Empirically, we find the Gumbel-sigmoid of Equation 8 to be easier to train than the regular sigmoid. The temperature of the Gumbel-sigmoid is fixed to in all our experiments.
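For readers unfamiliar with the Gumbel-sigmoid, the following sketch shows one common formulation: the logit is perturbed by the difference of two independent Gumbel samples and divided by a temperature before the sigmoid is applied. Whether this matches Equation 8 term by term cannot be checked from the text here, so treat it as an assumption; the function names and the temperature value used in the demo are likewise illustrative only.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one standard Gumbel sample via inverse transform sampling."""
    # Clamp away from 0 and 1 so that both logarithms are finite.
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def stable_sigmoid(x):
    """Sigmoid that avoids overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gumbel_sigmoid(logit, temperature, rng):
    """Noisy sigmoid gate; lower temperatures push outputs toward 0 or 1."""
    noise = sample_gumbel(rng) - sample_gumbel(rng)
    return stable_sigmoid((logit + noise) / temperature)


if __name__ == "__main__":
    rng = random.Random(0)
    # temperature=0.5 is an arbitrary value chosen for the demo.
    print([round(gumbel_sigmoid(0.0, 0.5, rng), 3) for _ in range(5)])
```

Lowering `temperature` makes the gate values concentrate near 0 and 1, which is the near-binary behavior that motivates the choice of this gate.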
The cell of the LSTM controller is computed according to Equation 9 with the and RESET gates.
(9) 
The hidden state of the LSTM controller is computed as follows:
(10) 
In Figure 1, we illustrate the interaction between the controller and the memory with various heads and components of the controller.
2.4 Microstates and Longterm Dependencies
A microstate of the LSTM for a particular time step is a summary of the information that has been stored in the LSTM controller of the model. By attending over the cells of the memory, which contain previous microstates of the LSTM, the model can explicitly learn to restore information from its own past.
The controller can learn to represent high-level temporal abstractions by creating wormhole connections through the memory as illustrated in Figure 2. In this example, the model takes the token at the first timestep and stores its representation in the first memory cell with address . In the second timestep, the controller takes as input and writes into the second memory cell with the address . Furthermore, gater blocks the connection from to . At the third timestep, the controller starts reading. It receives as input and reads the first memory cell where the microstate of was stored. After reading, it computes the hidden state and writes the microstate of into the first memory cell. The length of the path passing through the microstates of and would be . The wormhole connection from to would skip a timestep.
A regular single-layer RNN has the fixed graphical representation of a linear chain when considering only the connections through its recurrent states, i.e. along the temporal axis. TARDIS is more flexible in this respect: it can learn directed graphs with more diverse structures using the wormhole connections and the RESET gates. The directed graph that TARDIS can learn through its recurrent states has a degree of at most 4 at each vertex (at most 2 incoming and 2 outgoing edges), and it depends on the number of cells () that can be stored in the memory.
In this work, we focus on a variation of TARDIS where the controller maintains a fixed-size external memory. However, as in Cheng et al. (2016), it is possible to use a memory that grows with the length of the input sequence, but that would not scale and can be more difficult to train with discrete addressing.
3 Training TARDIS
In this section, we explain how to train TARDIS as a language model. We use language modeling as an example application; however, TARDIS can also be applied to any complex sequence-to-sequence learning task.
Consider training examples where each example is a sequence of length . At every timestep , the model receives the input which is a one-hot vector of size equal to the size of the vocabulary and should produce the output which is also a one-hot vector of size equal to the size of the vocabulary .
The output of the model for th example and th timestep is computed as follows:
(11) 
where are the learnable parameters and is a single layer MLP which combines both and as in deep fusion (Pascanu et al., 2013a). The task loss would be the categorical cross-entropy between the targets and the model outputs. Superscript denotes that the variable is the output for the sample in the training set.
(12) 
However, the discrete decisions taken for memory access at every timestep make the model non-differentiable, and hence we need to rely on approximate methods of computing gradients with respect to the discrete address vectors. In this paper we explore two such approaches: REINFORCE (Williams, 1992) and the straight-through estimator (Bengio et al., 2013).
3.1 Using REINFORCE
REINFORCE is a likelihood-ratio method, which provides a convenient and simple way of estimating the gradients of stochastic actions. In this paper, we focus on the application of REINFORCE to sequential prediction tasks, such as language modelling. For example , let be the reward for the action at timestep . We are interested in maximizing the expected return for the whole episode as defined below:
(13) 
Ideally we would like to compute the gradients for Equation 13; however, computing the gradient of the expectation may not be feasible. We therefore use a Monte Carlo approximation and compute the gradients using REINFORCE for the sequential prediction task, which can be written as in Equation 14.
(14) 
where is the reward baseline. However, we can further assume that the future actions do not depend on the past rewards in the episode/trajectory and further reduce the variance of REINFORCE as in Equation 15.
(15) 
In our preliminary experiments, we found that training the model is easier with discounted returns than with the centered undiscounted return:
(16) 
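Discounted returns of the kind referred to above can be computed with a single backward pass over the per-step rewards. The sketch below uses the standard reward-to-go definition with a discount factor named `gamma` here; both the name and the exact form (for example whether a baseline is also subtracted in Equation 16) are assumptions, since the equation itself is not reproduced in the text.

```python
def discounted_returns(rewards, gamma):
    """Reward-to-go with discounting: G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Accumulate from the last step backwards so each step sees only its future.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    print(discounted_returns([1.0, 1.0, 1.0], 0.5))  # [1.75, 1.5, 1.0]
    print(discounted_returns([1.0, 1.0, 1.0], 1.0))  # [3.0, 2.0, 1.0]
```

Setting `gamma` to 1 recovers the undiscounted reward-to-go, so the discount only changes how much weight distant rewards receive.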
Training REINFORCE with an Auxiliary Cost
Training models with REINFORCE can be difficult due to the variance it introduces into the gradients. In recent years, researchers have developed several tricks to mitigate the effect of high variance in the gradients. As proposed by Mnih and Gregor (2014), we also use variance normalization on the REINFORCE gradients.
For TARDIS, the reward at timestep () is the log-likelihood of the prediction at that timestep. Our initial experiments showed that REINFORCE with this reward structure often tends to under-utilize the memory and rely mainly on the internal memory of the LSTM controller. Especially at the beginning of training, the model can decrease the loss just by relying on the memory of the controller, and this can cause REINFORCE to increase the log-likelihood of random actions.
In order to deal with this issue, instead of using the log-likelihood of the model as the reward, we introduce an auxiliary cost to use as the reward, which is computed from predictions that are based only on the memory cell read by the controller and not on the hidden state of the controller:
(17) 
In Equation 17, we only train the parameters where is the dimensionality of the output size and (for language modelling both and would be ) is the dimensionality of the input of the model. We do not backpropagate through and thus we denote it as in our equations.
3.2 Using Gumbel Softmax
Training with REINFORCE can be challenging due to the high variance of the gradients. Gumbel-softmax with the straight-through estimator provides a good alternative to REINFORCE that tackles the variance issue. Unlike (Maddison et al., 2016; Jang et al., 2016), instead of annealing the temperature or fixing it, our model learns the inverse temperature with an MLP which has a single scalar output conditioned on the hidden state of the controller.
(18)  
(19) 
We replace the softmax in Equation 5 with the Gumbel-softmax defined above. During forward computation, we sample from and use the generated one-hot vector for memory access. However, during backprop, we use for gradient computation and hence the entire model becomes differentiable.
Learning the temperature of the Gumbel-softmax reduces the burden of performing an extensive hyperparameter search for the temperature.
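The forward pass of a Gumbel-softmax with a straight-through style discretization can be sketched in plain Python. The function name, the argument `inverse_temperature` (a fixed number here, whereas the text describes it as learned), and the choice of taking the argmax of the relaxed sample as the hard address are assumptions for illustration. No gradients are computed in this sketch; in an automatic differentiation framework the hard vector would be used in the forward pass and the soft vector for the backward pass.

```python
import math
import random


def gumbel_softmax_sample(logits, inverse_temperature, rng):
    """Return (soft, hard): a relaxed sample and its one-hot discretization."""
    noisy = []
    for x in logits:
        # Clamp the uniform draw so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((x + gumbel) * inverse_temperature)
    # Numerically stable softmax over the perturbed, scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]
    # Discrete address used for the actual memory access.
    index = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == index else 0.0 for i in range(len(soft))]
    return soft, hard


if __name__ == "__main__":
    soft, hard = gumbel_softmax_sample([1.0, 0.0, -1.0], 2.0, random.Random(0))
    print(soft)
    print(hard)
```

A larger `inverse_temperature` makes the soft sample closer to one-hot, which narrows the gap between the forward and backward computations.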
4 Related Work
Neural Turing Machine (NTM) (Graves et al., 2014) is the class of architectures most closely related to our model. NTMs have proven successful at generalizing to sequences longer than the ones they have been trained on. NTMs have also been shown to be more effective at solving algorithmic tasks than gated models such as LSTMs. However, NTMs can have limitations due to some of their design choices. Due to the controller’s lack of precise knowledge of the contents of the information, the contents of the memory can overlap. These memory augmented models are also known to be complicated, which leads to difficulties in implementing and training them. The controller has no information about the sequence of operations or about quantities such as the frequency of read and write accesses to the memory. TARDIS tries to address these issues.
Gulcehre et al. (2016) proposed a variant of NTM called dynamic NTM (DNTM) which has learnable location-based addressing. DNTM can be used with both continuous addressing and discrete addressing. Discrete DNTM is related to TARDIS in the sense that both models use discrete addressing for all the memory operations. However, discrete DNTM expects the controller to learn to read/write and also to learn reader/writer synchronization. TARDIS does not have this synchronization problem since the reader and writer are tied. Rae et al. (2016) proposed the sparse access memory (SAM) mechanism for NTMs, which can be seen as a hybrid of continuous and discrete addressing. SAM uses continuous addressing over a selected set of top relevant memory cells. Recently, Graves et al. (2016) proposed the differentiable neural computer (DNC), which is a successor of NTM.
Rocktäschel et al. (2015) and Cheng et al. (2016) proposed models that generate weights to attend over the previous hidden states of the RNN. However, since those models attend over the whole context, the computation of the attention can be inefficient.
Grefenstette et al. (2015) proposed a model that can store information in a data structure, such as a stack, deque or queue, in a differentiable manner.
Grave et al. (2016) proposed a cache-based memory representation which stores the last states of the RNN in the memory; similar to traditional cache-based models (Kuhn and De Mori, 1990), the model learns to choose a state of the memory for the prediction in language modeling tasks.
5 Gradient Flow through the External Memory
In this section, we analyze the flow of the gradients through the external memory and will also investigate its efficiency in terms of dealing with the vanishing gradients problem (Hochreiter, 1991; Bengio et al., 1994). First, we describe the vanishing gradient problem in an RNN and then describe how an external memory model can deal with it. For the sake of simplicity, we will focus on vanilla RNNs during the entire analysis, but the same analysis can be extended to LSTMs. In our analysis, we also assume that the weights for the read/write heads are discrete.
We will show that the rate at which the gradients vanish through time for a memory-augmented recurrent neural network is much smaller than that of a regular vanilla recurrent neural network.
Consider an RNN which at each timestep takes an input and produces an output . The hidden state of the RNN can be written as,
(20)  
(21) 
where and are the recurrent and the input weights of the RNN respectively and is a nonlinear activation function. Let be the loss function that the RNN is trying to minimize. Given an input sequence of length , we can write the derivative of the loss with respect to parameters as,
(22) 
The multiplication of many Jacobians in the form of to obtain is the main reason for vanishing and exploding gradients (Pascanu et al., 2013b):
(23) 
Let us assume that the singular values of a matrix are ordered as, . Let be an upper bound on the singular values of , s.t. , then the norm of the Jacobian will satisfy (Zilly et al., 2016),
(24) 
Pascanu et al. (2013b) showed that for , the following inequality holds:
(25) 
Since and the norm of the product of Jacobians grows exponentially on , the norm of the gradients will vanish exponentially fast.
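As a worked illustration of how fast such a bound shrinks, assume purely for this example that the norm of every Jacobian in the product is bounded by the same constant $\rho < 1$. The symbols $J_i$, $\rho$ and $n$ are notation introduced here for the example and are not taken from the equations above. By sub-multiplicativity of the induced matrix norm:

```latex
\left\| \prod_{i=1}^{n} J_i \right\|
  \;\le\; \prod_{i=1}^{n} \left\| J_i \right\|
  \;\le\; \rho^{\,n},
\qquad
\rho = 0.9,\; n = 100
  \;\Rightarrow\; \rho^{\,n} \approx 2.66 \times 10^{-5}.
```

Even a mild per-step contraction therefore leaves almost no gradient signal after a hundred steps, which is why shortening the path between dependent time steps matters.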
Now consider the MANN where the contents of the memory are linear projections of the previous hidden states as described in Equation 2. Let us assume that both reading and writing operation use discrete addressing. Let the content read from the memory at time step correspond to some memory location :
(26) 
where corresponds to the hidden state of the controller at some previous timestep .
Now the hidden state of the controller in the external memory model can be written as,
(27) 
If the controller reads at time step and its memory content is as described above, then the Jacobians associated with Equation 5 can be computed as follows:
(28)  
(29) 
where and are defined as below,
(30)  
(31) 
As shown in Equation 29, the Jacobians of the MANN can be rewritten as a summation of two matrices, and . The gradients flowing through do not necessarily vanish through time, because it is the sum of Jacobians computed over the shorter paths.
The norm of the Jacobian can be lower bounded as follows by using Minkowski inequality:
(32)  
(33) 
Assuming that the length of the dependency is very long, would vanish to 0. Then we will have,
(34) 
As one can see, the rate at which the gradients vanish through time depends on the length of the sequence that passes through . This is typically less than the length of the sequence passing through . Thus the gradients vanish at a lower rate than in an RNN. In particular, the rate would strictly depend on the length of the shortest paths from to , because for long enough dependencies, gradients through the longer paths would still vanish.
We can also derive an upper bound for norm of the Jacobian as follows:
(35)  
(36) 
Using the result from (Loyka, 2015), we can lower bound as follows:
(37) 
For long sequences we know that will go to (see equation 25). Hence,
(38) 
The rate at which reaches zero is strictly smaller than the rate at which reaches zero and, with ideal memory access, it will not reach zero. Hence, unlike for vanilla RNNs, Equation 38 states that the upper bound of the norm of the Jacobian will not reach zero for a MANN with ideal memory access.
Consider a memory augmented neural network with memory cells for a sequence of length , where each hidden state of the controller is stored in a different cell of the memory. If the prediction at time step has only a long-term dependency to and the prediction at is independent of the tokens that appear before , and the memory reading mechanism is perfect, the model will not suffer from vanishing gradients when we backpropagate from to . (Note that, unlike a Markovian n-gram assumption, here we assume that at each time step the can be different.)
Proof:
If the input sequence has a longest dependency to from , we would only be interested in gradients propagating from to and the Jacobians from to , i.e. . If the controller learns a perfect reading mechanism, at time step it would read the memory cell where the hidden state of the RNN at time step is stored. Thus, following the Jacobians defined in Equation 29, we can rewrite the Jacobians as,
(39) 
In Equation 39, the first two terms might vanish as grows. However, the singular values of the third term do not change as grows. As a result, the gradients propagated from to will not necessarily vanish through time. However, in order to obtain stable dynamics for the network, the initialization of the matrices, and is important.
This analysis highlights the fact that an external memory model with an optimal read/write mechanism can handle long-range dependencies much better than an RNN. However, this is applicable only when we use discrete addressing for read/write operations. Both NTM and DNTM still have to learn how to read and write from scratch, which is a challenging optimization problem. For TARDIS, tying the read/write operations makes learning much simpler for the model. In particular, the result of Theorem 5 points to the importance of coming up with better ways of designing attention mechanisms over the memory.
The controller of a MANN may not be able to learn to use the memory efficiently. For example, some cells of the memory may remain empty or may never be read. The controller can overwrite memory cells which have not been read. As a result, the information stored in those overwritten memory cells can be lost completely. However, TARDIS avoids most of these issues by the construction of the algorithm.
6 On the Length of the Paths Through the Wormhole Connections
As we have discussed in Section 5, the rate at which the gradients vanish for a MANN depends on the length of the paths passing along the wormhole connections. In this section we will analyse those lengths in depth for untrained models such that the model will assign uniform probability to read or write all memory cells. This will give us a better idea on how each untrained model uses the memory at the beginning of the training.
A wormhole connection can be created by reading a memory cell and writing into the same cell in TARDIS. For example, in Figure 2, while the actual path from to is of length 4, memory cell creates a shorter path of length 2 . We call the length of the actual path as and length of the shorter path created by wormhole connection as .
Consider a TARDIS model which has cells in its memory. If TARDIS accesses each memory cell uniformly at random, then the probability of accessing a random cell , . The expected length of the shorter path created by wormhole connections () would be proportional to the number of reads and writes into a memory cell. For TARDIS with a reader choosing a memory cell uniformly at random, this would be at the end of the sequence. We verify this result by simulating the read and write heads of TARDIS as in Figure 3 (a).
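The tied-head case with a uniformly random reader can be simulated in a few lines. This is a sketch under counting conventions that are assumptions rather than statements from the text: every cell starts with path length 1 after the initial sequential writes, and each later step reads one cell uniformly at random and writes back into it, extending that cell's path by one hop. The function and parameter names are illustrative.

```python
import random


def simulate_tardis_path_lengths(seq_len, num_cells, seed=0):
    """Per-cell wormhole path lengths after seq_len steps with tied heads."""
    if seq_len < num_cells:
        raise ValueError("seq_len must be at least num_cells")
    rng = random.Random(seed)
    # First num_cells steps: sequential writes, one hop stored per cell.
    lengths = [1] * num_cells
    # Remaining steps: read a uniformly random cell and overwrite it,
    # which extends the path stored in that cell by one hop.
    for _ in range(seq_len - num_cells):
        cell = rng.randrange(num_cells)
        lengths[cell] += 1
    return lengths


if __name__ == "__main__":
    lengths = simulate_tardis_path_lengths(seq_len=200, num_cells=10, seed=1)
    print(lengths)
    print(sum(lengths) / len(lengths))
```

Under this convention the lengths always sum to `seq_len`, so their mean over cells equals `seq_len / num_cells` regardless of the seed; only the spread across cells is random.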
Now consider a MANN with separate read and write heads, each accessing the memory in a discrete and uniformly random fashion. Let us call it uMANN. We will compute the expected length of the shorter path created by wormhole connections () for uMANN. and are the read and write head weights, each sampled from a multinomial distribution with uniform probability for each memory cell. Let be the index of the memory cell read at timestep . For any memory cell , , defined below, is a recursive function that computes the length of the path created by wormhole connections in that cell.
(40) 
It is possible to prove by induction that will be for every memory cell. However, the proof assumes that when is less than or equal to , the length of all paths stored in the memory should be . We have run simulations to compute the expected path length in a memory cell of uMANN as in Figure 3 (b).
This analysis shows that while TARDIS with uniform read head maintains the same expected length of the shorter path created by wormhole connections as uMANN, it completely avoids the reader/writer synchronization problem.
If is large enough, should hold. In expectation, will decay proportionally to whereas will decay proportionally (exponentially when Equation 25 holds) to . With ideal memory access, the rate at which reaches zero would be strictly smaller than the rate at which reaches zero. Hence, as per Equation 38, the upper bound of the norm of the Jacobian will vanish at a much smaller rate. However, this result assumes that the dependencies on which the prediction relies are accessible through the memory cell which has been read by the controller.
In the more general case, consider a MANN with . The writer just fills in the memory cells in a sequential manner and the reader chooses a memory cell uniformly at random. Let us call this model urMANN. Let us assume that there is a dependency between two timesteps and as shown in Figure 4. If was taken uniformly between and , then there is a probability that the read address invoked at time will be greater than or equal to (proof by symmetry). In that case, the expected shortest path length through that wormhole connection would be , but this still would not scale well. If the reader is very well trained, it could pick exactly and the path length will be 1.
Let us consider all the paths of length less than or equal to of the form in Figure 4. Also, let and . Then, the shortest path from to now has length , using a wormhole connection that connects the state at with the state at . There are such paths that are realized, but we leave the distribution of the length of that shortest path as an open question. However, the probability of hitting a very short path (of length less than or equal to ) increases exponentially with . Let the probability of the read at to hit the interval be . Then the probability that the shorter paths over the last reads hit that interval is , where is on the order of . On the other hand, the probability of not hitting that interval approaches 0 exponentially with .
Figure 4 illustrates how wormhole connections can create shorter paths. In Figure 5 (b), we show that the expected length of the path travelled outside the wormhole connections, obtained from simulations, decreases as the size of the memory decreases. In particular, for urMANN and TARDIS the trend is very close to exponential. As shown in Figure 5 (a), this also influences the total length of the paths travelled from timestep 50 to 5. Writing into the memory by using weights sampled with uniform probability for all memory cells cannot use the memory as efficiently as the other approaches that we compare to. In particular, fixing the writing mechanism seems to be useful.
Even if the reader does not manage to learn where to read, there are many "short paths" which can considerably reduce the effect of vanishing gradients.
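The effect of uniformly random reads on path length can be explored with a small Monte Carlo sketch. This is not the simulation behind Figure 5; it is a minimal illustration under stated assumptions: the memory is large enough to hold every past state, the reader at timestep t picks one earlier timestep uniformly from {0, ..., t-1}, and each read adds a single backward wormhole edge. The function names and the default endpoints (50 and 5, borrowed from the Figure 5 discussion) are hypothetical.

```python
import random


def shortest_path_length(t0, t1, rng):
    """Length of the shortest backward path from timestep t1 to t0.

    Assumption: at every step t the reader samples a read address r_t
    uniformly from {0, ..., t-1}, creating a wormhole edge t -> r_t in
    addition to the ordinary recurrent edge t -> t-1.
    """
    dist = {t0: 0}
    for t in range(t0 + 1, t1 + 1):
        best = dist[t - 1] + 1       # follow the recurrent edge
        r = rng.randrange(t)         # uniform read address in [0, t-1]
        if r >= t0:                  # edges landing before t0 can never reach t0
            best = min(best, dist[r] + 1)
        dist[t] = best
    return dist[t1]


def mean_path_length(t0=5, t1=50, trials=2000, seed=0):
    """Average shortest-path length over independent simulated runs."""
    rng = random.Random(seed)
    total = sum(shortest_path_length(t0, t1, rng) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    print("without wormholes:", 50 - 5)
    print("with uniform reads:", mean_path_length())
```

Without wormhole connections the only path has length t1 - t0, so comparing that number with the simulated mean gives a rough sense of how much random reads alone shorten the gradient path.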
7 On Generalization over the Longer Sequences
Graves et al. (2014) have shown that LSTMs cannot generalize well to sequences longer than the ones seen during training. In contrast, MANNs such as the NTM or the DNTM have been shown to generalize to sequences longer than the ones seen during training on a set of toy tasks.
We believe that LSTMs typically fail to generalize to sequences longer than those seen during training mainly because the hidden state of an LSTM network utilizes an unbounded history of the input sequence; as a result, its parameters are optimized with the maximum likelihood criterion to fit sequences with the lengths of the training examples. An n-gram language model or an HMM does not suffer from this issue: an n-gram LM uses an input context with a fixed window size, and an HMM has the Markov property in its latent space. As argued below, we claim that a MANN can also learn, during training, to generalize to sequences longer than the ones that appear in the training set by modifying the contents of the memory and reading from it.
A regular RNN will minimize the negative log-likelihood objective function for the targets by using the unbounded history represented with the hidden state of the RNN, and it will model the parametrized conditional distribution for the prediction at timestep and a MANN would learn . If we assume that represents all the dependencies that depends on in the input sequence, we will have where represents the dependencies in a limited context window that only contains paths shorter than the sequences seen during training. Due to this property, we claim that MANNs such as NTM, DNTM or TARDIS can generalize to longer sequences more easily. In our experiments on Penn Treebank, we show that a TARDIS language model trained to minimize the log-likelihood for and on the test set both and for the same model yield very close results. On the other hand, the fact that the best results on the bAbI dataset in (Gulcehre et al., 2016) were obtained with a feedforward controller, and that a feedforward controller was similarly used in (Graves et al., 2014) to solve some of the toy tasks, is also consistent with our hypothesis. As a result, what has been written into the memory and what has been read become very important for generalizing to longer sequences.
8 Experiments
8.1 Character-level Language Modeling on PTB
As a preliminary study on the performance of our model, we consider character-level language modelling. We have evaluated our models on the Penn TreeBank (PTB) corpus (Marcus et al., 1993) using the train, validation and test splits of (Mikolov et al., 2012). On this task, we use layer normalization (Ba et al., 2016) and recurrent dropout (Semeniuta et al., 2016), as those are also used by the state-of-the-art results on this task. Using layer normalization and recurrent dropout improves the performance significantly and reduces the effects of overfitting. We train our models with Adam (Kingma and Ba, 2014) over sequences of length 150. We show our results in Table 1.
In addition to the regular char-LM experiments, in order to confirm our hypothesis regarding the ability of MANNs to generalize to sequences longer than the ones seen during training, we have trained a language model which learns by using a softmax layer as described in Equation 11. However, to measure the performance of on the test set, we have used the softmax layer that feeds into the auxiliary cost defined for REINFORCE as in Equation 17, for a model trained with REINFORCE and with the auxiliary cost. As shown in Table 1, the model’s performance by using is 1.26; however, by using it becomes 1.28. This gap is small enough to confirm our assumption that .


Model  BPC 
CWRNN (Koutnik et al., 2014)  1.46 
HFMRNN (Sutskever et al., 2011)  1.41 
ME n-gram (Mikolov et al., 2012)  1.37 
BatchNorm LSTM (Cooijmans et al., 2016)  1.32 
Zoneout RNN (Krueger et al., 2016)  1.27 
LayerNorm LSTM (Ha et al., 2016)  1.27 
LayerNorm HyperNetworks (Ha et al., 2016)  1.23 
LayerNorm HM-LSTM & Step Fn. & Slope Annealing (Chung et al., 2016)  1.24 
Our LSTM + Layer Norm + Dropout  1.28 
TARDIS + REINFORCE + R  1.28 
TARDIS + REINFORCE + Auxiliary Cost  1.28 
TARDIS + REINFORCE + Auxiliary Cost + R  1.26 
TARDIS + Gumbel Softmax + ST + R  1.25 
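
Table 1 reports bits per character (BPC). As a reminder of how this metric relates to the training objective, the sketch below converts a summed negative log-likelihood into BPC. It assumes the loss is accumulated in nats (natural logarithm), which is the usual convention but is not stated in the text; the function name is hypothetical.

```python
import math


def bits_per_character(total_nll_nats, num_characters):
    """Convert a summed negative log-likelihood (in nats) to bits per character.

    Assumption: total_nll_nats is the sum of -ln p(character) over the
    evaluated text. Dividing by ln(2) changes the unit from nats to bits.
    """
    if num_characters <= 0:
        raise ValueError("num_characters must be positive")
    return total_nll_nats / (num_characters * math.log(2))


# A model that assigns probability 1/2 to every character costs exactly 1 bit each.
example_bpc = bits_per_character(10 * math.log(2), 10)
```

Under this definition a difference of 0.02 BPC, such as the gap between 1.26 and 1.28, corresponds to about 0.014 nats per character.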

8.2 Sequential Stroke Multi-digit MNIST Task
In this subsection, we introduce a new pen-stroke-based sequential multi-digit MNIST prediction task as a benchmark for long-term dependency modelling. We also benchmark the performance of LSTM and TARDIS on this challenging task.
8.2.1 Task and Dataset
Recently, de Jong (2016) introduced an MNIST pen stroke classification task and also provided a dataset consisting of pen stroke sequences representing the skeleton of the digits in the MNIST dataset. Each MNIST digit image is represented as a sequence of quadruples , where is the number of pen strokes to define the digit, denotes the pen offset from the previous to the current stroke (can be -1, 1 or 0), is a binary valued feature to denote end of stroke and is another binary valued feature to denote end of the digit. In the original dataset, the first quadruple contains absolute values instead of offsets . Without loss of generality, we set the starting position to in our experiments. Each digit is represented by 40 strokes on average and the task is to predict the digit at the end of the stroke sequence.
While this dataset was proposed for incremental sequence learning in (de Jong, 2016), we consider the multi-digit version of this dataset to benchmark models that can handle long-term dependencies. Specifically, given a sequence of pen-stroke sequences, the task is to predict the sequence of digits corresponding to each pen-stroke sequence in the given order. This is a challenging task since it requires the model to learn to predict the digit based on the pen-stroke sequence, count the number of digits, remember them, and generate them in the same order after seeing all the strokes. In our experiments we consider 3 versions of this task with 5, 10, and 15 digit sequences respectively. We generated 200,000 training data points by randomly sampling digits from the training set of the MNIST dataset. Similarly, we generated 20,000 validation and test data points by randomly sampling digits from the validation set and test set of the MNIST dataset respectively. The average lengths of the stroke sequences in these tasks are 199, 399, and 599 respectively.
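The construction of a multi-digit example can be sketched as follows. This is an illustration rather than the authors' data pipeline: it assumes each single-digit example is available as a list of (dx, dy, eos, eod) quadruples paired with its label, and it simply concatenates randomly sampled digits. The function name, the toy pool and the sampling with replacement are all assumptions.

```python
import random


def make_multidigit_example(pool, n_digits, rng):
    """Build one multi-digit example by concatenating sampled digits.

    pool: list of (strokes, label) pairs, where strokes is a list of
          (dx, dy, eos, eod) quadruples describing a single digit.
    Returns the concatenated stroke sequence and the ordered digit labels.
    """
    strokes, labels = [], []
    for _ in range(n_digits):
        digit_strokes, label = rng.choice(pool)  # sample with replacement
        strokes.extend(digit_strokes)
        labels.append(label)
    return strokes, labels


# Toy stand-in for the real stroke data: two hand-written "digits".
toy_pool = [
    ([(0, 0, 0, 0), (1, 0, 0, 0), (0, 1, 1, 1)], 7),
    ([(0, 0, 0, 0), (-1, 1, 1, 1)], 1),
]
example_strokes, example_labels = make_multidigit_example(
    toy_pool, n_digits=5, rng=random.Random(0)
)
```

Because every digit ends with its end-of-digit flag set, the number of set flags in the concatenated sequence equals the number of target digits, which is what the model has to count and reproduce.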
8.2.2 Results
We benchmark the performance of LSTM and TARDIS on this new task. Both models receive the sequence of pen strokes and at the end of the sequence are expected to generate the sequence of digits followed by a particular <bos> token. The task is illustrated in Figure 6. We evaluate the models based on per-digit error rate. We also compare the performance of TARDIS with REINFORCE with that of TARDIS with Gumbel softmax. All the models were trained for the same number of updates with early stopping based on the per-digit error rate on the validation set. Results for all 3 versions of the task are reported in Table 2. From the table, we can see that TARDIS performs better than LSTM in all three versions of the task. Also, TARDIS with Gumbel softmax performs slightly better than TARDIS with REINFORCE, which is consistent with our other experiments.
Model  5-digits  10-digits  15-digits 
LSTM  3.00%  3.54%  8.81% 
TARDIS with REINFORCE  2.09%  2.56%  3.67% 
TARDIS with gumbel softmax  1.89%  2.23%  3.09% 
We also compare the learning curves of all three models in Figure 7. From the figure we can see that TARDIS learns to solve the task faster than LSTM by effectively utilizing the given memory slots. Also, TARDIS with Gumbel softmax converges faster than TARDIS with REINFORCE.
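The per-digit error rate used in Table 2 can be computed as in the sketch below. It assumes that each predicted digit sequence has the same length as its target; how length mismatches were scored is not specified in the text, so this sketch simply rejects them. The function name is hypothetical.

```python
def per_digit_error_rate(predictions, targets):
    """Fraction of digit positions that are predicted incorrectly.

    predictions, targets: lists of digit sequences of equal lengths.
    Assumption: every prediction has the same length as its target.
    """
    errors = 0
    total = 0
    for pred, tgt in zip(predictions, targets):
        if len(pred) != len(tgt):
            raise ValueError("prediction and target lengths differ")
        errors += sum(p != t for p, t in zip(pred, tgt))
        total += len(tgt)
    if total == 0:
        raise ValueError("no digits to evaluate")
    return errors / total


# One wrong digit out of six gives an error rate of 1/6.
example_rate = per_digit_error_rate([[1, 2, 3], [4, 5, 6]], [[1, 2, 0], [4, 5, 6]])
```

Counting errors per digit rather than per sequence means that a 15-digit example with a single mistake contributes 1/15 of an error, so the three task versions stay comparable.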
8.3 NTM Tasks
Graves et al. (2014) proposed the associative recall and copy tasks to evaluate a model’s ability to learn simple algorithms and generalize to sequences longer than the ones seen during training. We trained a TARDIS model with 4 features for the address and 32 features for the memory content part of the model. We used a model with a hidden state of size 120 and a memory of size 16. We trained our model with Adam using a learning rate of 3e-3. We show the results of our model in Table 3. The TARDIS model was able to solve both tasks, both with Gumbel softmax and with REINFORCE.
Model  Copy Task  Associative Recall 
DNTM cont. (Gulcehre et al., 2016)  Success  Success 
DNTM discrete (Gulcehre et al., 2016)  Success  Failure 
NTM (Graves et al., 2014)  Success  Success 
TARDIS + Gumbel Softmax + ST  Success  Success 
TARDIS + REINFORCE + Auxiliary Cost  Success  Success 
8.4 Stanford Natural Language Inference
Bowman et al. (2015) proposed a new task to test the ability of machine learning algorithms to infer whether two given sentences entail, contradict, or are neutral to (semantically independent of) each other. This task can be considered a long-term dependency task if the premise and the hypothesis are presented to the model in sequential order, as also explored by Rocktäschel et al. (2015), because the model should learn the dependency relationship between the hypothesis and the premise. Our model first reads the premise, then the hypothesis, and at the end of the hypothesis the model predicts whether the premise and the hypothesis contradict or entail each other. The model proposed by Rocktäschel et al. (2015) applies attention over its previous hidden states over the premise when it reads the hypothesis. In that sense, their model can still be considered to have some task-specific architectural design choices. TARDIS and our baseline LSTM models do not include any task-specific architectural design choices. In Table 4, we compare the results of different models. Our best model, TARDIS with Gumbel softmax, obtains the highest test accuracy (84.3%) among the models compared. However, it has recently been shown that, with architectural tweaks, it is possible to design a model specifically for this task that achieves 88.2% test accuracy (Chen et al., 2016).


Model  Test Accuracy (%) 
Word by Word Attention (Rocktäschel et al., 2015)  83.5 
Word by Word Attention two-way (Rocktäschel et al., 2015)  83.2 
LSTM + LayerNorm + Dropout  81.7 
TARDIS + REINFORCE + Auxiliary Cost  82.4 
TARDIS + Gumbel Softmax + ST  84.3 

9 Conclusion
In this paper, we propose a simple and efficient memory augmented neural network model which can perform well both on algorithmic tasks and more realistic tasks. Unlike the previous approaches, we show better performance on real-world NLP tasks, such as language modelling and SNLI. We have also proposed a new task to measure the performance of the models dealing with long-term dependencies.
We provide a detailed analysis on the effects of using external memory for the gradients and justify the reason why MANNs generalize better on the sequences longer than the ones seen in the training set. We have also shown that the gradients will vanish at a much slower rate (if they vanish) when an external memory is being used. Our theoretical results should encourage further studies in the direction of developing better attention mechanisms that can create wormhole connections efficiently.
Acknowledgments
We thank Chinnadhurai Sankar for suggesting the phrase "wormhole connections" and proofreading the paper. We would like to thank Dzmitry Bahdanau for the comments and feedback on the earlier version of this paper. We would also like to thank the developers of Theano (http://deeplearning.net/software/theano/) for developing such a powerful tool for scientific computing (Theano Development Team, 2016). We acknowledge the support of the following organizations for research funding and computing support: NSERC, Samsung, Calcul Québec, Compute Canada, the Canada Research Chairs and CIFAR. SC is supported by a FQRNT-PBEEE scholarship.
References
 Ba et al. (2016) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
 Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Proceedings Of The International Conference on Representation Learning (ICLR 2015), 2015.
 Bengio et al. (1994) Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. Neural Networks, IEEE Transactions on, 5(2):157–166, 1994.
 Bengio et al. (2013) Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
 Bordes et al. (2015) Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075, 2015.
 Bowman et al. (2015) Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326, 2015.
 Chandar et al. (2016) Sarath Chandar, Sungjin Ahn, Hugo Larochelle, Pascal Vincent, Gerald Tesauro, and Yoshua Bengio. Hierarchical memory networks. arXiv preprint arXiv:1605.07427, 2016.
 Chen et al. (2016) Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, and Hui Jiang. Enhancing and combining sequential and tree lstm for natural language inference. arXiv preprint arXiv:1609.06038, 2016.
 Cheng et al. (2016) Jianpeng Cheng, Li Dong, and Mirella Lapata. Long short-term memory-networks for machine reading. arXiv preprint arXiv:1601.06733, 2016.
 Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
 Chung et al. (2016) Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. Hierarchical multiscale recurrent neural networks. arXiv preprint arXiv:1609.01704, 2016.
 Cooijmans et al. (2016) Tim Cooijmans, Nicolas Ballas, César Laurent, and Aaron Courville. Recurrent batch normalization. arXiv preprint arXiv:1603.09025, 2016.
 de Jong (2016) Edwin D. de Jong. Incremental sequence learning. arXiv preprint arXiv:1611.03068, 2016.
 Grave et al. (2016) Edouard Grave, Armand Joulin, and Nicolas Usunier. Improving neural language models with a continuous cache. arXiv preprint arXiv:1612.04426, 2016.
 Graves et al. (2014) Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.
 Graves et al. (2016) Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio G. Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, Adrià P. Badia, Karl M. Hermann, Yori Zwols, Georg Ostrovski, Adam Cain, Helen King, Christopher Summerfield, Phil Blunsom, Koray Kavukcuoglu, and Demis Hassabis. Hybrid computing using a neural network with dynamic external memory. Nature, advance online publication, October 2016. ISSN 0028-0836. doi: 10.1038/nature20101. URL http://dx.doi.org/10.1038/nature20101.
 Grefenstette et al. (2015) Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman, and Phil Blunsom. Learning to transduce with unbounded memory. In Advances in Neural Information Processing Systems, pages 1819–1827, 2015.
 Gulcehre et al. (2016) Caglar Gulcehre, Sarath Chandar, Kyunghyun Cho, and Yoshua Bengio. Dynamic neural turing machine with soft and hard addressing schemes. arXiv preprint arXiv:1607.00036, 2016.
 Ha et al. (2016) David Ha, Andrew Dai, and Quoc V Le. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016.
 Hochreiter (1991) Sepp Hochreiter. Untersuchungen zu dynamischen neuronalen netzen. Diploma, Technische Universität München, page 91, 1991.
 Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
 Jang et al. (2016) Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
 Joulin and Mikolov (2015) Armand Joulin and Tomas Mikolov. Inferring algorithmic patterns with stack-augmented recurrent nets. In Advances in Neural Information Processing Systems, pages 190–198, 2015.
 Kaiser and Sutskever (2015) Łukasz Kaiser and Ilya Sutskever. Neural gpus learn algorithms. arXiv preprint arXiv:1511.08228, 2015.
 Kingma and Ba (2014) Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Koutnik et al. (2014) Jan Koutnik, Klaus Greff, Faustino Gomez, and Juergen Schmidhuber. A clockwork rnn. arXiv preprint arXiv:1402.3511, 2014.
 Krueger et al. (2016) David Krueger, Tegan Maharaj, János Kramár, Mohammad Pezeshki, Nicolas Ballas, Nan Rosemary Ke, Anirudh Goyal, Yoshua Bengio, Hugo Larochelle, Aaron Courville, et al. Zoneout: Regularizing rnns by randomly preserving hidden activations. arXiv preprint arXiv:1606.01305, 2016.
 Kuhn and De Mori (1990) Roland Kuhn and Renato De Mori. A cache-based natural language model for speech recognition. IEEE transactions on pattern analysis and machine intelligence, 12(6):570–583, 1990.
 Loyka (2015) Sergey Loyka. On singular value inequalities for the sum of two matrices. arXiv preprint arXiv:1507.06630, 2015.
 Maddison et al. (2016) Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
 Marcus et al. (1993) Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of english: The penn treebank. Computational linguistics, 19(2):313–330, 1993.
 Mikolov et al. (2012) Tomáš Mikolov, Ilya Sutskever, Anoop Deoras, Hai-Son Le, Stefan Kombrink, and J Cernocky. Subword language modeling with neural networks. preprint (http://www. fit. vutbr. cz/imikolov/rnnlm/char. pdf), 2012.
 Mnih and Gregor (2014) Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030, 2014.
 Pascanu et al. (2013a) Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. How to construct deep recurrent neural networks. arXiv preprint arXiv:1312.6026, 2013a.
 Pascanu et al. (2013b) Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. ICML (3), 28:1310–1318, 2013b.
 Rae et al. (2016) Jack W. Rae, Jonathan J. Hunt, Tim Harley, Ivo Danihelka, Andrew W. Senior, Greg Wayne, Alex Graves, and Timothy P. Lillicrap. Scaling memory-augmented neural networks with sparse reads and writes. CoRR, abs/1610.09027, 2016.
 Rocktäschel et al. (2015) Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiskỳ, and Phil Blunsom. Reasoning about entailment with neural attention. arXiv preprint arXiv:1509.06664, 2015.
 Santoro et al. (2016) Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. One-shot learning with memory-augmented neural networks. arXiv preprint arXiv:1605.06065, 2016.
 Semeniuta et al. (2016) Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth. Recurrent dropout without memory loss. arXiv preprint arXiv:1603.05118, 2016.

 Serban et al. (2016) Iulian V Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI-16), 2016.
 Sukhbaatar et al. (2015) Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-to-end memory networks. arXiv preprint arXiv:1503.08895, 2015.
 Sutskever et al. (2011) Ilya Sutskever, James Martens, and Geoffrey E Hinton. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024, 2011.
 Theano Development Team (2016) Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016. URL http://arxiv.org/abs/1605.02688.
 Trischler et al. (2016) Adam Trischler, Zheng Ye, Xingdi Yuan, and Kaheer Suleman. Natural language comprehension with the epireader. arXiv preprint arXiv:1606.02270, 2016.
 Tulving (2002) Endel Tulving. Chronesthesia: Conscious awareness of subjective time. 2002.
 Weston et al. (2015) Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. In Proceedings Of The International Conference on Representation Learning (ICLR 2015), 2015. In Press.

 Williams (1992) Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.
 Xu et al. (2015) Kelvin Xu, Jimmy Ba, Ryan Kiros, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings Of The International Conference on Representation Learning (ICLR 2015), 2015.
 Zaremba and Sutskever (2015) Wojciech Zaremba and Ilya Sutskever. Reinforcement learning neural turing machines. CoRR, abs/1505.00521, 2015.
 Zilly et al. (2016) Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutník, and Jürgen Schmidhuber. Recurrent highway networks. arXiv preprint arXiv:1607.03474, 2016.