Tensorflow implementation of a Neural Turing Machine
Neural Turing Machines (NTMs) are an instance of Memory Augmented Neural Networks, a new class of recurrent neural networks which decouple computation from memory by introducing an external memory unit. NTMs have demonstrated superior performance over Long Short-Term Memory Cells in several sequence learning tasks. A number of open source implementations of NTMs exist but are unstable during training and/or fail to replicate the reported performance of NTMs. This paper presents the details of our successful implementation of a NTM. Our implementation learns to solve three sequential learning tasks from the original NTM paper. We find that the choice of memory contents initialization scheme is crucial in successfully implementing a NTM. Networks with memory contents initialized to small constant values converge, on average, two times faster than the next best memory contents initialization scheme.
Neural Turing Machines (NTMs) are one instance of several new neural network architectures [4, 5, 11] classified as Memory Augmented Neural Networks (MANNs). A MANN's defining attribute is the existence of an external memory unit. This contrasts with gated recurrent neural networks such as Long Short-Term Memory Cells (LSTMs), whose memory is an internal vector maintained over time. LSTMs have achieved state-of-the-art performance in many commercially important sequence learning tasks, such as handwriting recognition, machine translation and speech recognition. But MANNs have been shown to outperform LSTMs on several artificial sequence learning tasks that require a large memory and/or complicated memory access patterns, for example memorization of long sequences and graph traversal [4, 5, 6, 11].
The authors of the original NTM paper did not provide source code for their implementation. Open source implementations of NTMs exist:

1. https://github.com/snowkylin/ntm
2. https://github.com/chiggum/Neural-Turing-Machines
3. https://github.com/yeoedward/Neural-Turing-Machine
4. https://github.com/loudinthecloud/pytorch-ntm
5. https://github.com/camigord/Neural-Turing-Machine
6. https://github.com/snipsco/ntm-lasagne
7. https://github.com/carpedm20/NTM-tensorflow

However, a number of these implementations (5, 6 and 7) report that the gradients of their implementation sometimes become NaN during training, causing training to fail, while others report slow convergence or do not report the speed of learning of their implementation. The lack of a stable open source implementation of NTMs makes it more difficult for practitioners to apply NTMs to new problems and for researchers to improve upon the NTM architecture.
In this paper we define a successful NTM implementation (source code at: https://github.com/MarkPKCollier/NeuralTuringMachine) which learns to solve three benchmark sequential learning tasks. We specify the set of choices governing our NTM implementation. We conduct an empirical comparison of a number of memory contents initialization schemes identified in other open source NTM implementations. We find that the choice of how to initialize the contents of memory in a NTM is a key factor in a successful NTM implementation. We base our TensorFlow implementation on another open source NTM implementation (1 above), but following our experimental results we make significant changes to the memory contents initialization, controller head parameter computation and interface, which result in faster convergence, more reliable optimization and easier integration with existing TensorFlow methods. Our implementation is available publicly under an open source license.
NTMs consist of a controller network, which can be a feed-forward neural network or a recurrent neural network, and an external memory unit, which is an N × W memory matrix, where N represents the number of memory locations and W the dimension of each memory cell. Whether or not the controller is a recurrent neural network, the entire architecture is recurrent as the contents of the memory matrix are maintained over time. The controller has read and write heads which access the memory matrix. The effect of a read or write operation on a particular memory cell is weighted by a soft attentional mechanism. This addressing mechanism is similar to the attention mechanisms used in neural machine translation [1, 9], except that it combines location based addressing with the content based addressing found in these attention mechanisms.
In particular for a NTM, at each timestep t, for each read and write head the controller outputs a set of parameters: k_t, β_t ≥ 0, g_t ∈ (0, 1), s_t (s.t. Σ_i s_t(i) = 1 and ∀i: s_t(i) ≥ 0) and γ_t ≥ 1, which are used to compute the weighting w_t over the N memory locations in the memory matrix as follows:
k_t allows for content based addressing, where k_t represents a lookup key into memory:

    w^c_t(i) = exp(β_t K(k_t, M_t(i))) / Σ_j exp(β_t K(k_t, M_t(j)))    (1)

and K is some similarity measure such as cosine similarity:

    K(u, v) = (u · v) / (‖u‖ ‖v‖)    (2)
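The content-based addressing step in (1)-(2) can be sketched in a few lines. This is an illustrative NumPy version, not the TensorFlow implementation itself; the small `eps` term guarding against zero-norm vectors is our own addition:

```python
import numpy as np

def cosine_similarity(u, v, eps=1e-8):
    # K(u, v) = (u . v) / (||u|| * ||v||), eq. (2)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def content_addressing(memory, key, beta):
    # memory: (N, W) matrix, key: (W,) lookup key, beta >= 0 key strength.
    # Returns the softmax over beta-scaled similarities, eq. (1).
    scores = np.array([beta * cosine_similarity(key, row) for row in memory])
    scores -= scores.max()  # subtract max for numerical stability
    w = np.exp(scores)
    return w / w.sum()
```

Larger β_t sharpens the resulting distribution towards the best-matching memory row.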
Through a series of operations NTMs also enable iteration from current or previously computed memory weights as follows:

    w^g_t = g_t w^c_t + (1 − g_t) w_{t−1}    (3)

    w̃_t(i) = Σ_{j=0}^{N−1} w^g_t(j) s_t((i − j) mod N)    (4)

    w_t(i) = w̃_t(i)^{γ_t} / Σ_j w̃_t(j)^{γ_t}    (5)

where (3) enables the network to choose whether to use the current content based weights or the previous weight vector, (4) enables iteration through memory by convolving the current weighting with a 1-D convolutional shift kernel and (5) corrects for any blurring occurring as a result of the convolution operation.
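The location-based addressing pipeline of (3)-(5) can be sketched as below, again in NumPy for illustration. We treat the shift kernel s_t as a full length-N distribution over circular shifts, which is one common implementation choice:

```python
import numpy as np

def interpolate(w_c, w_prev, g):
    # (3) gate between current content weights and previous weights
    return g * w_c + (1.0 - g) * w_prev

def circular_shift(w_g, s):
    # (4) circular 1-D convolution: w~(i) = sum_j w_g(j) * s((i - j) mod N)
    N = len(w_g)
    return np.array([sum(w_g[j] * s[(i - j) % N] for j in range(N))
                     for i in range(N)])

def sharpen(w, gamma):
    # (5) raise to the power gamma >= 1 and renormalize to undo blurring
    w = w ** gamma
    return w / w.sum()
```

With a one-hot shift kernel, `circular_shift` rotates the weighting by exactly one position, which is what allows a NTM to iterate through consecutive memory locations.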
The vector r_t read by a particular read head at timestep t is computed as:

    r_t = Σ_i w_t(i) M_t(i)    (6)
Each write head modifies the memory matrix at timestep t by outputting additional erase (e_t) and add (a_t) vectors:

    M̃_t(i) = M_{t−1}(i) [1 − w_t(i) e_t]    (7)

    M_t(i) = M̃_t(i) + w_t(i) a_t    (8)
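The read operation (6) and the erase-then-add write of (7)-(8) reduce to a few matrix operations; a minimal NumPy sketch:

```python
import numpy as np

def read(memory, w):
    # (6) r = sum_i w(i) * M(i): a convex combination of memory rows
    return w @ memory

def write(memory, w, erase, add):
    # (7) erase: M~(i) = M(i) * (1 - w(i) * e)
    erased = memory * (1.0 - np.outer(w, erase))
    # (8) add:   M(i) = M~(i) + w(i) * a
    return erased + np.outer(w, add)
```

A weighting concentrated on one location with an all-ones erase vector fully overwrites that row while leaving the rest of memory untouched.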
Equations (1) to (8) define how addresses are computed and used to read from and write to memory in a NTM, but many implementation details of a NTM are open to choice. In particular, the choice of the similarity measure K, the initial weightings w_0 for all read and write heads, the initial state of the memory matrix M_0, the choice of non-linearities to apply to the parameters outputted by each read and write head, and the initial read vector r_0 are all undefined in a NTM's specification. While any choices for these satisfying the constraints on the parameters outputted by the controller would be a valid NTM, in practice these choices have a significant effect on the ability of a NTM to learn.
Memory contents initialization - We hypothesize that how the memory contents of a NTM are initialized may be a defining factor in the success of a NTM implementation. We compare the three memory contents initialization schemes that we identified in open source implementations of NTMs. In particular, we compare constant initialization, where all memory locations are initialized to a small constant value; learned initialization, where we backpropagate through initialization; and random initialization, where each memory location is initialized to a value drawn from a truncated Normal distribution. We note that five of the seven implementations we identified (1-5 above) randomly initialize the NTM's memory contents. We also identified an implementation which initialized memory contents to a small constant value (6) and an implementation where the memory initialization was learned (7).
Constant initialization has the advantage of requiring no additional parameters and providing a stable known memory initialization during inference. Learned initialization has the potential advantage of learning an initialization that would enable complex non-linear addressing schemes  while also providing stable initialization after training. This comes at the cost of extra parameters. Random initialization has the potential advantage of acting as a regularizer, but it is possible that during inference memory contents may be in a space not encountered during training.
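The three schemes can be sketched as follows. The constant value 1e-6 and the standard deviation 0.5 are illustrative choices of our own, and in TensorFlow the learned initialization would be a trainable variable rather than a plain array:

```python
import numpy as np

N, W = 128, 20  # memory dimensions used in our experiments

def constant_init(value=1e-6):
    # constant scheme: every memory cell starts at the same small value
    return np.full((N, W), value)

def random_init(rng, stddev=0.5):
    # random scheme: cells are drawn afresh from a truncated Normal
    # distribution at the start of every episode
    m = rng.standard_normal((N, W)) * stddev
    return np.clip(m, -2 * stddev, 2 * stddev)  # crude truncation at 2 sigma

# learned scheme: a parameter tensor updated by backpropagation; in
# TensorFlow this would be a tf.Variable broadcast to the batch dimension
learned_init = np.zeros((N, W))
```

Note that only the random scheme produces different memory contents at inference time than any seen during training, which is one plausible source of its slower convergence.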
Other parameter initialization - Instead of initializing the previously read vectors r_0 and address weights w_0 to fixed bias values, we backpropagate through their initialization and thus initialize them to a learned bias vector. We argue that this initialization scheme provides sufficient generality for tasks that require more flexible initialization with little cost in extra parameters (one length-W read vector per read head and one length-N weight vector per read and write head). For example, if a NTM with multiple write heads wishes to write to different memory locations at the first timestep using location based addressing, then w_0 must be initialized differently for each write head. Having to hard code this for each task is an added burden on the engineer, particularly when the need for such addressing may not be known a priori for a given task, thus we allow the network to learn this initialization.
Controller inputs - At each timestep the controller is fed the concatenation of the external input to the NTM and the vectors previously read by all of the NTM's read heads. We note that a similar setup has achieved performance gains for attentional encoder-decoders in neural machine translation.
Parameter non-linearities - Similarly to a LSTM, we force the contents of the memory matrix to be in the range (−1, 1) by applying the tanh function to the outputs of the controller corresponding to the key k_t and the add vector a_t, while we apply the sigmoid function to the corresponding erase vector e_t. We apply the softplus function to satisfy the constraint β_t ≥ 0. We apply the logistic sigmoid function to satisfy the constraint g_t ∈ (0, 1). In order to make the convolutional shift vector s_t a valid probability distribution we apply the softmax function. In order to satisfy γ_t ≥ 1 we first apply the softplus function and then add 1.
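Concretely, these non-linearities map the controller's unconstrained outputs onto their valid ranges; a NumPy sketch:

```python
import numpy as np

def softplus(x):
    # smooth approximation to max(0, x); always non-negative
    return np.log1p(np.exp(x))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def head_parameters(raw_k, raw_beta, raw_g, raw_s, raw_gamma):
    # map unconstrained controller outputs onto their valid ranges
    k = np.tanh(raw_k)                 # key in (-1, 1), matching memory range
    beta = softplus(raw_beta)          # key strength beta >= 0
    g = sigmoid(raw_g)                 # interpolation gate g in (0, 1)
    s = softmax(raw_s)                 # shift kernel: a probability distribution
    gamma = 1.0 + softplus(raw_gamma)  # sharpening factor gamma >= 1
    return k, beta, g, s, gamma
```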
We test our NTM implementation on three of the five artificial sequence learning tasks described in the original NTM paper.
Copy - for the Copy task, the network is fed a sequence of random bit vectors followed by an end of sequence marker. The network must then output the input sequence. This requires the network to store the input sequence and then read it back from memory. In our experiments we train and test our networks on 8-bit random vectors, with sequence lengths sampled uniformly at random.
Repeat Copy - similarly to the Copy task, with Repeat Copy the network is fed an input sequence of random bit vectors. Unlike the Copy task, this is followed by a scalar that indicates how many times the network should repeat the input sequence in its output sequence. We train and test our networks on 8-bit random vectors, with sequence lengths and the number of repeats both sampled uniformly at random.
Associative Recall - Associative Recall is also a sequence learning problem with sequences consisting of random bit vectors. In this case the inputs are divided into items, with each item consisting of 3 x 6-dimensional vectors. After being fed a sequence of items and an end of sequence marker, the network is then fed a query item, which is an item from the input sequence. The correct output is the next item in the input sequence after the query item. We train and test our networks on sequences with the number of items sampled uniformly at random.
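A generator for the Copy task can be sketched as below. The maximum sequence length of 20, the batch layout and the dedicated end-of-sequence channel are illustrative choices of our own, not necessarily those of the implementation described here:

```python
import numpy as np

def copy_task_batch(rng, batch_size, max_len=20, bits=8):
    # Inputs: (batch, 2*max_len + 1, bits + 1); the extra channel carries the
    # end-of-sequence marker. Targets echo the input bits after the marker.
    inputs = np.zeros((batch_size, 2 * max_len + 1, bits + 1))
    targets = np.zeros((batch_size, 2 * max_len + 1, bits))
    for b in range(batch_size):
        length = rng.integers(1, max_len + 1)  # uniform sequence length
        seq = rng.integers(0, 2, size=(length, bits)).astype(float)
        inputs[b, :length, :bits] = seq
        inputs[b, length, bits] = 1.0  # end-of-sequence marker channel
        targets[b, length + 1:2 * length + 1] = seq  # echo after the marker
    return inputs, targets
```

The Repeat Copy and Associative Recall generators follow the same pattern, with an extra repeat-count scalar and item delimiter channels respectively.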
We first run a set of experiments to establish the best memory contents initialization scheme. We compare the constant, random and learned initialization schemes on the above three tasks. We demonstrate below (Sec. 5) that the best such scheme is the constant initialization scheme. We then compare the NTM implementation described above (Sec. 3) under the constant initialization scheme to two other architectures on the Copy, Repeat Copy and Associative Recall tasks. We follow the NTM authors in comparing our NTM implementation to a LSTM network. As no official NTM implementation has been made open source, as a further benchmark we compare our NTM implementation to the official implementation of a Differentiable Neural Computer (DNC) (https://github.com/deepmind/dnc), a successor to the NTM. This provides a guide as to how a stable MANN implementation performs on the above tasks.
In all of our experiments for each network we run training 10 times from different random initializations. To measure the learning speed, every 200 steps during training we evaluate the network on a validation set of 640 examples with the same distribution as the training set.
For all tasks the MANNs had 1 read and 1 write head, with an external memory unit of size 128 × 20 and a LSTM controller with 100 units. The controller outputs were clipped elementwise to a fixed range. The LSTM networks were all a stack of 3 × 256 units. All networks were trained with the Adam optimizer with learning rate 0.001, and on the backward pass gradients were clipped to a maximum gradient norm of 50, as described by Pascanu et al.
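Clipping to a maximum gradient norm as described above scales all gradients jointly so that their combined L2 norm never exceeds the threshold. The sketch below mirrors what TensorFlow's tf.clip_by_global_norm does, as an illustrative NumPy version:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=50.0):
    # Compute the global L2 norm over all gradient tensors jointly,
    # then rescale every tensor by the same factor if it exceeds max_norm.
    global_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if global_norm <= max_norm:
        return grads, global_norm  # already within bounds: unchanged
    scale = max_norm / global_norm
    return [g * scale for g in grads], global_norm
```

Because every tensor is scaled by the same factor, the direction of the overall gradient step is preserved; only its magnitude is capped.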
We hypothesized that how the memory contents of a NTM were initialized would be a key factor in a successful NTM implementation. We compare the three memory initialization schemes we identified in open source NTM implementations. We then use the best identified memory contents initialization scheme as the default for our NTM implementation.
Copy - Our NTM initialized according to the constant initialization scheme converges to near zero error approximately 3.5 times faster than the learned initialization scheme, while the random initialization scheme fails to solve the Copy task in the allotted time (Fig. 1). The learning curves suggest that initializing the memory contents to small constant values offers a substantial speed-up in convergence over the other two memory initialization schemes for the Copy task.
Repeat Copy - A NTM initialized according to the constant initialization scheme converges to near zero error approximately 1.43 times faster than the learned initialization scheme and 1.35 times faster than the random initialization scheme (Fig. 2). The relative speed of convergence between learned and random initialization is reversed compared with the Copy task, but again the constant initialization scheme demonstrates substantially faster learning than either alternative.
Associative Recall - A NTM initialized according to the constant initialization scheme converges to near zero error approximately 1.15 times faster than the learned initialization scheme and 5.3 times faster than the random initialization scheme (Fig. 3).
The constant initialization scheme demonstrates fastest convergence to near zero error on all three tasks. We conclude that initializing the memory contents of a NTM to small constant values results in faster learning than backpropagating through memory contents initialization or randomly initializing memory contents. Thus, we use the constant initialization scheme as the default scheme for our NTM implementation.
Having established that the best memory contents initialization scheme is constant initialization, we wish to test whether our NTM implementation using this scheme is stable and has similar speed of learning and generalization ability to that claimed in the original NTM paper. We compare the performance of our NTM to a LSTM and a DNC on the same three tasks as in our memory contents initialization experiments.
Copy - Our NTM implementation converges to zero error in a number of steps comparable to the best published results on this task (Fig. 4). Our NTM converges to zero error 1.2 times slower than the DNC and, as expected, both MANNs learn substantially faster (4-5 times) than a LSTM.
Repeat Copy - As in the original NTM paper, we also find that the LSTM performs better relative to the MANNs on Repeat Copy than on the Copy task, converging only 1.44 times slower than a NTM, perhaps due to the shorter input sequences involved (Fig. 5). While both the DNC and the NTM demonstrate slow learning during the first third of training, both architectures then rapidly fall to near zero error before the LSTM does. Despite the NTM learning more slowly than the DNC during early training, the DNC converges to near zero error just 1.06 times faster than the NTM.
Associative Recall - Our NTM implementation converges to zero error in a number of steps almost identical to the best published results on this task and at the same rate as the DNC (Fig. 6). The LSTM network fails to solve the task in the time provided.
Our NTM implementation learns to solve all three of the tasks from the original NTM paper that we tested (out of the five proposed there). Our implementation's speed to convergence and relative performance to LSTMs are similar to the results reported in the NTM paper. Speed to convergence for our NTM is only slightly slower than for a DNC, another MANN. We conclude that our NTM implementation can be used reliably in new applications of MANNs.
NTMs are an exciting new neural network architecture that achieves impressive performance on a range of artificial tasks. But the specification of a NTM leaves many free choices to the implementor, and no official source code is available that makes these choices and replicates the published NTM results. In practice the choices left to the implementor have a significant impact on the ability of a NTM to learn. We observe great diversity in how these choices are made amongst open source efforts to implement a NTM, many of which fail to replicate the published results.
We have demonstrated that the choice of memory contents initialization scheme is crucial to successfully implementing a NTM. We conclude from the learning curves on three sequential learning tasks that learning is fastest under the constant initialization scheme. We note that the random initialization scheme which was used in five of the seven identified open source implementations was the slowest to converge on two of the three tasks and the second slowest on the Repeat Copy task.
We have made our NTM implementation with the constant initialization scheme open source. Our implementation has learned the Copy, Repeat Copy and Associative Recall tasks at a comparable speed to previously published results and the official implementation of a DNC. Training of our NTM is stable and does not suffer from problems such as gradients becoming NaN reported in other implementations. Our implementation can be reliably used for new applications of NTMs. Additionally, further research on NTMs will be aided by a stable, performant open source NTM implementation.
This publication emanated from research conducted with the financial support of Science Foundation Ireland (SFI) under Grant Number 13/RC/2106.
Luong, T., Pham, H., Manning, C.D.: Effective approaches to attention-based neural machine translation. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. pp. 1412–1421 (2015)
Pascanu, R., Mikolov, T., Bengio, Y.: On the difficulty of training recurrent neural networks. In: International Conference on Machine Learning. pp. 1310–1318 (2013)