Memory, in the form of generic, high-capacity, long-term storage, is likely to play a critical role in expanding the application domain of neural networks. A trainable neural memory subsystem with such properties could be a transformative technology, pushing neural networks within grasp of tasks traditionally associated with general intelligence and an extended sequence of reasoning steps. Development of architectures for integrating memory units with neural networks spans a good portion of the history of neural networks themselves (e.g., from LSTMs (Hochreiter and Schmidhuber, 1997) to the recent Neural Turing Machines (NTMs) (Graves et al., 2014)). Yet, while useful, none has elevated neural networks to be capable of learning from and processing data on size and time scales commensurate with traditional computing systems. Recent successes of deep neural networks, though dramatic, are focused on tasks, such as visual perception or natural language translation, with relatively short latency (e.g., hundreds of steps, which is often also the depth of the network itself).
We present a design for neural memory subsystems that differs dramatically from prior architectures in its read/write interface and, as a consequence, facilitates parameter-efficient scaling of memory capacity. Our design draws upon established ideas as well as a key recent advance: multigrid convolutional networks implicitly capable of learning dynamic routing mechanisms and attentional behavior (Ke et al., 2017). LSTMs serve as a fine-grained component, but are wrapped within a larger multigrid connection topology, from which novel capabilities emerge.
In contrast to NTMs and the subsequent Differentiable Neural Computer (DNC) (Graves et al., 2016), we intertwine memory units throughout the interior of a deep network. Memory is a first-class citizen, rather than a separate data store accessed via a special controller. We introduce a new kind of network layer—a multigrid memory layer—and use it as a stackable building block to create deep memory networks. Contrasting with simpler LSTMs, our memory is truly deep; accessing an arbitrary memory location requires passing through several layers. Figure 1 provides a visualization of our approach; we defer the full details to Section 3. There are major benefits to this design strategy, in particular:
Memory scalability. Distributing storage over a multigrid hierarchy, we can instantiate large amounts while remaining parameter-efficient. Read and write operations are similarly distributed.
The low-level mechanism backing these operations is essentially convolution, and we inherit the parameter-sharing efficiencies of convolutional neural networks (CNNs). Our parameterized filters simply act across a spatially organized collection of memory cells, rather than the spatial extent of an image. Increasing the number of feature channels stored in each spatial cell costs parameters, but adding more cells incurs no such cost.
The multigrid layout of our memory network provides an information routing mechanism that is efficient with respect to overall network depth. We can grow the spatial extent of memory exponentially with depth, while guaranteeing there is a connection pathway between the network input and every memory unit. This allows experimentation with substantially larger memories.
Unification of compute and storage. Our memory layers incorporate convolutional and LSTM components. Stacking such layers, we create not only a memory network, but also a generalization of both a CNN and an LSTM. Our memory networks are standard networks with additional capabilities. Though not our primary experimental focus, Section 4 shows they can learn tasks which require performing classification alongside storage and recall.
A further advantage to this unification is that it opens a wide design space for connecting our memory networks to each other, as well as standard neural network architectures. For example, within a larger system, we could easily plug the internal state of our memory into a standard CNN—essentially granting that CNN read-only memory access. Sections 3 and 4 develop and experimentally validate two such memory interface approaches.
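The memory scalability point above (channels cost parameters, additional spatial cells do not) can be illustrated with a small counting sketch. It assumes a standard convolutional LSTM with four gates, a square kernel, and no peephole terms; the function names and the channel counts in the demo are illustrative and do not come from the paper.

```python
def conv_lstm_params(in_channels, hidden_channels, kernel_size=3):
    """Weights plus biases of one convolutional LSTM (assumed: no peepholes).

    Each of the four gates convolves the concatenation of the input and
    the previous hidden state, so the count is independent of grid size.
    """
    weights = 4 * kernel_size * kernel_size * (in_channels + hidden_channels) * hidden_channels
    biases = 4 * hidden_channels
    return weights + biases


def memory_cells(height, width, hidden_channels):
    """Number of scalar memory-cell entries stored by that same unit."""
    return height * width * hidden_channels


if __name__ == "__main__":
    # Growing the grid multiplies storage but leaves the parameter count fixed.
    for side in (3, 12, 48):
        print(side, conv_lstm_params(8, 8), memory_cells(side, side, 8))
```

Under these assumptions, moving from a 3×3 to a 48×48 grid multiplies storage by 256 while the parameter count is unchanged; only a change in channel count alters it.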
A diverse array of synthetic tasks serves as our experimental testbed. Mapping and localization, an inherently spatial task with relevance to robotics, is one focus. However, we want to avoid only experimenting with tasks naturally fit to the architecturally-induced biases of our memory networks. Therefore, we also train them to perform the same kind of algorithmic tasks used in analyzing the capabilities of NTMs and DNCs. Throughout all experimental settings, DNC accuracy serves as a comparison baseline. We observe significant advantages for multigrid memory, including:
Long-term large-capacity retention. On spatial mapping tasks, our architecture retains large, long-term memory. It correctly remembers observations of an external environment collected over paths consisting of thousands of time steps. Visualizing internal memory unit activations actually reveals the representation and algorithmic strategy our network learns in order to solve the problem. The DNC, in contrast, fails to master this category of task.
Generality. On tasks decoupled from any notion of spatial geometry, such as associative recall or sorting, our memory networks prove as capable as DNCs.
2 Related Work
There is an extensive history of work that seeks to grant neural networks the ability to read from and write to memory (Das et al., 1992, 1993; Mozer and Das, 1993; Zeng et al., 1994; Hölldobler et al., 1997). Das et al. (1992) propose a neural network pushdown automaton, which performs differentiable push and pop operations on external memory. Schmidhuber (1992) uses two feedforward networks: one that produces context-dependent weights for the second network, whose weights may change quickly and can be used as a form of memory. Schmidhuber (1993) proposes memory addressing in the form of a “self-referential” recurrent neural network that is able to modify its own weights.
Recurrent neural networks with Long Short-Term Memory (LSTMs) (Hochreiter and Schmidhuber, 1997) have enabled significant progress on a variety of sequential prediction tasks, including machine translation (Sutskever et al., 2014), speech recognition (Graves et al., 2013), and image captioning (Donahue et al., 2017). Such LSTMs are Turing-complete (Siegelmann and Sontag, 1995) and are, in principle, capable of context-dependent storage and retrieval over long time periods (Hermans and Schrauwen, 2013). However, a network’s capacity for long-term read-write is sensitive to the training procedure (Collins et al., 2017) and is limited in practice. In an effort to improve the long-term read-write abilities of recurrent neural networks, several modifications have been recently proposed. These include differentiable attention mechanisms (Graves, 2013; Bahdanau et al., 2014; Mnih et al., 2014; Xu et al., 2015) that provide a form of content-based memory addressing, pointer networks (Vinyals et al., 2015) that “point to” rather than blend inputs, and architectures that enforce independence among the neurons within each layer (Li et al., 2018).
A number of methods augment the short- and long-term memory internal to recurrent networks with external “working” memory, in order to realize differentiable programming architectures that can learn to model and execute various programs (Graves et al., 2014, 2016; Weston et al., 2015; Sukhbaatar et al., 2015; Joulin and Mikolov, 2015; Reed and de Freitas, 2015; Grefenstette et al., 2015; Kurach et al., 2015). Unlike our approach, these methods explicitly decouple memory from computation, mimicking a standard computer architecture. A neural controller (analogous to a CPU) interfaces with specialized external memory (e.g., random-access memory or tapes).
The Neural Turing Machine (NTM) augments neural networks with a hand-designed attention-based mechanism to read from and write to external memory in a differentiable fashion. This enables the NTM to learn to perform various algorithmic tasks, including copying, sorting, and associative recall. The Differentiable Neural Computer (Graves et al., 2016) improves upon the NTM with support for dynamic memory allocation and additional memory addressing modes.
Other methods enhance recurrent layers with differentiable forms of a restricted class of memory structures, including stacks, queues, and deques (Grefenstette et al., 2015; Joulin and Mikolov, 2015). Gemici et al. (2017) augment structured dynamic models for temporal processes with various external memory architectures (Graves et al., 2014, 2016; Santoro et al., 2016).
Similar memory-explicit architectures have been proposed for deep reinforcement learning (RL) tasks. While deep RL has been applied successfully to several challenging domains (Mnih et al., 2015; Hausknecht and Stone, 2015; Levine et al., 2016), most approaches reason over short-term representations of the state, which limits their ability to deal with the partial observability inherent in many RL tasks. Several methods augment deep RL architectures with external memory to facilitate long-term reasoning. Oh et al. (2016) maintain a fixed number of recent states in memory and then read from the memory using a soft attention-based read operation. Parisotto and Salakhutdinov (2018) propose a specialized write operator, together with a hand-designed 2D memory structure, both specifically designed for navigation in maze-like environments.
Rather than learn when to write to memory (e.g., as done by NTM and DNC), Pritzel et al. (2017) continuously write the experience of a reinforcement learning agent to a dictionary-like memory module that is queried in a key-based fashion (allowing for large memories). Building on this framework, Fraccaro et al. (2018) augment a generative temporal model with a specialized form of spatial memory that exploits a priori knowledge of the problem structure, including an explicit representation of the agent’s position in the environment.
Though we experiment with RL, our memory implementation contrasts with this past work. Our multigrid memory architecture jointly couples computation with memory read and write operations, and learns how to use a generic memory structure rather than one specialized to a particular task.
3 Multigrid Memory Architectures
To endow neural networks with long-term memory, we craft an architecture that generalizes modern convolutional and recurrent designs, embedding memory cells within the feed-forward computational flow of a deep network. Convolutional neural networks and LSTMs (specifically, the convolutional LSTM variety (Xingjian et al., 2015)) exist as strict subsets of the full connection set comprising our multigrid memory network. We even encapsulate modern residual networks (He et al., 2016). Though omitted from diagrams for the sake of clarity, in all experiments we utilize residual connections linking the inputs of subsequent layers across the depth (not time) dimension of our memory networks.
Implementing memory addressing behavior is the primary challenge when adopting our design philosophy. If the network structure is uniform, how will it be capable of selectively reading from and writing to only a sparse, input-dependent subset of memory locations?
A common approach is to build an explicit attention mechanism into the network design. Such attention mechanisms, independent of memory, have been hugely influential in natural language processing (Vaswani et al., 2017). NTMs (Graves et al., 2014) and DNCs (Graves et al., 2016) construct a memory addressing mechanism by explicitly computing a soft attention mask over memory locations. This naturally leads to a design reliant on an external memory controller, which produces and then applies that mask when reading from or writing to a separate memory bank.
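For contrast with our implicit addressing, the explicit soft attention mask used by controller-based designs can be sketched as content-based addressing: score each memory row against a key by cosine similarity, then normalize the scores with a softmax. This is a minimal illustration of that general idea rather than the full NTM or DNC addressing scheme; the function names and the `sharpness` parameter are assumptions made for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-8)  # small constant guards against zero vectors


def soft_attention_mask(memory, key, sharpness=1.0):
    """Softmax over scaled similarities: one non-negative weight per memory row."""
    scores = [sharpness * cosine_similarity(row, key) for row in memory]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_read(memory, mask):
    """Read vector: the mask-weighted blend of all memory rows."""
    width = len(memory[0])
    return [sum(w * row[i] for w, row in zip(mask, memory)) for i in range(width)]
```

Every read touches every memory row through the mask, which is one reason such designs pair a controller with a separate, comparatively small memory bank.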
Ke et al. (2017) recently proposed a multigrid variant of both standard CNNs and residual networks (ResNets). While their primary experiments concern image classification, they also present a striking result on a synthetic image-to-image transformation task: multigrid CNNs (and multigrid ResNets) are capable of learning to emulate attentional behavior. Their analysis reveals that the network’s multigrid connection structure is both essential to and sufficient for enabling this phenomenon.
The underlying cause is that bi-directional connections across a scale-space hierarchy (Figure 1, left) create exponentially shorter signalling pathways between units at different locations on the spatial grid. Information can be efficiently routed from one spatial location to any other location by traversing only a few network layers, flowing up the scale-space hierarchy, and then back down again.
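A crude counting model makes the shorter-pathway argument concrete. It assumes that a flat stack of convolutions moves information by at most half a kernel width per layer, and that a multigrid route climbs one level per layer (halving the remaining distance each time) and then descends the same number of levels. Both functions are illustrative simplifications, not an analysis taken from Ke et al. (2017).

```python
import math


def flat_hops(distance, kernel_size=3):
    """Layers a single-resolution CNN needs to carry a signal `distance` cells."""
    reach = kernel_size // 2  # cells covered per layer on one side of the kernel
    return math.ceil(distance / reach)


def multigrid_hops(distance):
    """Layers needed when routing up the scale-space hierarchy and back down."""
    if distance <= 1:
        return distance
    # Number of halvings until the two locations are at most one cell apart.
    levels = (distance - 1).bit_length()
    return 2 * levels  # climb `levels` layers, then descend `levels` layers


if __name__ == "__main__":
    for d in (2, 8, 47):
        print(d, flat_hops(d), multigrid_hops(d))
```

Under this model the flat count grows linearly with distance while the multigrid count grows logarithmically, which is the sense in which the pathways are exponentially shorter.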
We convert the inherent attentional capacity of multigrid CNNs into an inherent capacity for distributed memory addressing by replacing convolutional subcomponents with convolutional LSTMs (Xingjian et al., 2015). Grid “levels” no longer correspond to operations on a multiresolution image representation, but instead correspond to accessing smaller or larger storage banks within a distributed memory hierarchy. Dynamic routing across scale space (in the multigrid CNN) now corresponds to dynamic routing into different regions of memory, according to a learned strategy.
3.1 Multigrid Memory Layer
Figure 1 diagrams both the multigrid convolutional layer of Ke et al. (2017) and our corresponding multigrid memory, or MG-conv-LSTM, layer. Activations at a particular depth in our network consist of a pyramid $\{(o_{j,t}, h_{j,t}, c_{j,t})\}$, where $j$ indexes the pyramid level and $t$ indexes time. $o$, $h$, and $c$ denote the output, hidden state, and memory cell contents of a convolutional LSTM (Xingjian et al., 2015), respectively. Following the construction of Ke et al. (2017), outputs at neighboring scales are resized and concatenated, with the resulting tensors fed as inputs to the corresponding scale-specific convolutional LSTM units in the next multigrid layer. The state associated with a conv-LSTM unit at a particular layer and level, say $(h^{y}_{j,t}, c^{y}_{j,t})$, is computed from memory: $h^{y}_{j,t-1}$ and $c^{y}_{j,t-1}$, and the input tensor: $\uparrow o^{x}_{j-1,t} \oplus o^{x}_{j,t} \oplus \downarrow o^{x}_{j+1,t}$, where superscripts $x$ and $y$ mark the preceding and current layer, and $\uparrow$, $\downarrow$, and $\oplus$ denote upsampling, downsampling, and concatenation, respectively. Like Ke et al. (2017), we include max-pooling as part of downsampling. We utilize a two-dimensional memory geometry, and change resolution by a factor of two in each spatial dimension when moving up or down a level of the pyramid.
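The resize-and-concatenate step that feeds each scale-specific unit can be sketched with plain nested lists. The sketch assumes nearest-neighbour upsampling, 2×2 max-pooling for downsampling, and a pyramid ordered from coarsest to finest with side lengths doubling between levels; the function names and data layout are illustrative choices, and the convolutional LSTM update itself is omitted.

```python
def upsample(channel):
    """Nearest-neighbour upsampling of one 2D channel by a factor of two."""
    out = []
    for row in channel:
        wide = [v for v in row for _ in range(2)]  # repeat each column
        out.append(wide)
        out.append(list(wide))                     # repeat each row
    return out


def downsample(channel):
    """2x2 max-pooling of one 2D channel (side lengths assumed even)."""
    rows, cols = len(channel), len(channel[0])
    return [
        [max(channel[r][c], channel[r][c + 1],
             channel[r + 1][c], channel[r + 1][c + 1])
         for c in range(0, cols, 2)]
        for r in range(0, rows, 2)
    ]


def assemble_input(pyramid, level):
    """Channel-wise concatenation of resized neighbours with the same-level output.

    `pyramid[j]` is a list of 2D channels; level j + 1 has twice the side
    length of level j. Missing neighbours at either end are simply skipped.
    """
    channels = []
    if level > 0:
        channels += [upsample(ch) for ch in pyramid[level - 1]]
    channels += [[list(row) for row in ch] for ch in pyramid[level]]
    if level + 1 < len(pyramid):
        channels += [downsample(ch) for ch in pyramid[level + 1]]
    return channels
```

All concatenated channels share the spatial size of the target level, so a single convolution over the result mixes information arriving from coarser, equal, and finer scales.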
Connecting many such memory layers yields a memory network or distributed memory mesh, as shown in the bottom diagram of Figure 1. Note that a single time increment (from $t$ to $t+1$) consists of running an entire forward pass of the network, propagating the input signal to the deepest layer. Though not drawn here, we also incorporate batch normalization layers and residual connections along grids of corresponding resolution (i.e., from level $j$ of one layer to level $j$ of the next). These details mirror Ke et al. (2017).
3.2 Memory Interfaces
As our multigrid memory networks are multigrid CNNs plus internal memory units, we are able to connect them to other neural network modules as freely and flexibly as one could do with CNNs. Figure 2 diagrams a couple of possibilities, which we experimentally explore in Section 4.
On the left, multiple “threads”, two readers and one writer, simultaneously access a shared multigrid memory. The memory itself is located within the writer network (blue), which is structured as a deep multigrid convolutional LSTM. The reader networks (red and orange) are merely multigrid CNNs, containing no internal storage, but observing the hidden state of the multigrid memory network.
The right side of Figure 2 diagrams a deep multigrid analogue of a standard paired recurrent encoder and decoder. This design substantially expands the amount of addressable memory that can be manipulated when learning common sequence-to-sequence tasks.
We first consider an RL-based navigation problem, whereby an agent is tasked with exploring a priori unknown environments with access to only observations of its immediate surroundings. Learning an effective policy requires maintaining a consistent representation of that environment (i.e., a map). Using memory as a form of map, an agent must learn where and when to perform write and read operations as it moves, while retaining the map over long time periods. This task mimics related partially observable spatial navigation scenarios considered by memory-based deep RL frameworks.
Problem Setup: The agent navigates a 2D maze with access to only observations of the grid (3×3 or 9×9) centered at the agent’s position. It has no knowledge of its absolute position. Actions consist of one-step motion in each of the four cardinal directions. While navigating, we query the network with a randomly chosen, previously seen patch (3×3 or 9×9) and ask it to identify every location matching that patch in the explored map. See Figure 3 (left).
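The observation and query described in the problem setup can be made concrete with a brute-force reference that computes the ground-truth answer a network would be trained to reproduce. It assumes a binary maze (1 for wall, 0 for free space), treats cells outside the maze as walls, and marks a location when the window centred on it equals the query patch; these conventions and the function names are assumptions for illustration.

```python
def observe(maze, row, col, size=3):
    """Return the size x size window centred on (row, col).

    Cells outside the maze are read as walls (1), an assumed convention.
    """
    half = size // 2
    rows, cols = len(maze), len(maze[0])
    return [
        [maze[r][c] if 0 <= r < rows and 0 <= c < cols else 1
         for c in range(col - half, col + half + 1)]
        for r in range(row - half, row + half + 1)
    ]


def localize(maze, explored, patch):
    """Mark every explored location whose surrounding window equals `patch`."""
    size = len(patch)
    rows, cols = len(maze), len(maze[0])
    mask = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if explored[r][c] and observe(maze, r, c, size) == patch:
                mask[r][c] = 1
    return mask
```

The returned mask has the same shape as the map, so it can be compared location by location against a predicted mask.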
Multigrid Architecture: We use a deep multigrid network with multigrid memory and multigrid CNN subcomponents, linked together as outlined in Figure 2 (left). Here, our writer consists of 7 MG-conv-LSTM layers, with maximum pyramid spatial scale progressively increasing from 3×3 to 48×48. The reader, structured similarly, has an output attached to its deepest 48×48 grid, and is tasked with answering localization queries. Figure 3 provides an illustration. Section 4.2 experiments with an additional reader network that predicts actions that drive the agent to explore the maze.
4.1 Mapping & Localization
In order to understand the network’s ability to maintain a “map” of the environment in memory, we first consider a setting in which the agent executes a pre-defined navigation policy and evaluate its localization performance. We consider different policies (spiraling outwards or a random walk), patch sizes for observation and localization (3×3 or 9×9), as well as different trajectory (path) lengths. We compare against the following baselines:
Differentiable Neural Computer (DNC) (Graves et al., 2016)
Ablated MG: a multigrid architecture variant including only the finest pyramid scale at each layer.
ConvLSTM-deep: a deep 23-layer architecture, in which each layer is a convolutional LSTM on a 48×48 grid. This contains the same total number of grids as our 7-layer multigrid network.
ConvLSTM-thick: 7 layers of convolutional LSTMs acting on 48×48 grids. We set channel counts to the sum of channels distributed across the corresponding layer of our multigrid pyramid.
We train each architecture using RMSProp. We search over learning rates in log scale fromto , and use for multigrid and ConvLSTM, and for DNC. Randomly generated maps are used for training and testing. Training runs for steps with batch size . Test set size is maps. We used a pixel-wise cross-entropy loss over predicted and true locations.
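A pixel-wise cross-entropy over a location map can be sketched as follows. The sketch assumes the network emits one logit per map location and that each location is an independent binary decision (match or no match), averaged over the map; the text does not specify these details, so the formulation and the function name are assumptions.

```python
import math


def pixelwise_cross_entropy(logits, targets):
    """Mean binary cross-entropy between per-pixel logits and 0/1 targets."""
    total, count = 0.0, 0
    for logit_row, target_row in zip(logits, targets):
        for z, y in zip(logit_row, target_row):
            # Stable form of -[y*log(sigmoid(z)) + (1 - y)*log(1 - sigmoid(z))].
            total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
            count += 1
    return total / count
```

With all logits at zero the loss equals log 2 regardless of the targets, which is a convenient sanity check for an untrained network.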
Table 1 reports performance in terms of localization accuracy on the test set. For the simplest setting in which the agent moves in a spiral (i.e., predictable) motion and the observation and query are 3×3, our multigrid architecture achieves nearly perfect precision, recall, and F-score, while all baselines struggle. DNC performs similarly to Ablated MG in terms of precision, at the expense of a significant loss in recall. If we instead give the DNC the simpler task of localization in a 15×15 grid, its performance improves, yet its rates remain lower than those of our architecture on the more challenging 25×25 environment. Efficiently addressing large memories is required here, and the DNC design is fundamentally limited: with more parameters than multigrid (0.68M compared to 0.65M) it can only address 8K memory cells compared to the multigrid network’s 77K.
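The three localization metrics can be computed from a predicted and a true location mask as in the sketch below. It assumes binary masks of equal shape and takes the F-score to be the balanced F1 measure (the harmonic mean of precision and recall); the function name and the zero-denominator convention are illustrative choices.

```python
def localization_scores(predicted, truth):
    """Return (precision, recall, F1) for two binary masks of equal shape."""
    tp = fp = fn = 0
    for p_row, t_row in zip(predicted, truth):
        for p, t in zip(p_row, t_row):
            if p and t:
                tp += 1          # correctly marked location
            elif p and not t:
                fp += 1          # marked, but not a true match
            elif t and not p:
                fn += 1          # true match that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score
```

High precision with low recall, the pattern reported for the DNC, corresponds to a mask that marks few locations but is usually right about the ones it marks.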
Figure 3 (right) visualizes the contents of the deepest, highest-resolution LSTM block within the multigrid memory network of an agent moving in a spiral pattern. The memory clearly mirrors the contents of the true map, demonstrating that the network has learned a correct and, incidentally, interpretable procedure for addressing and writing memory.
In more complex settings for motion type and query size (Table 1, bottom) our multigrid network remains accurate. It even generalizes to motions different from those on which it trained, including motion dictated by the learned policy that we describe shortly. Notably, even with the very long trajectory of 1500 time steps, our proposed architecture has no issue retaining a large map memory.
Table 1: Localization accuracy. Columns: Architecture; Params; Memory; World (Map, FoV); Task Definition (Motion, Query); Path Length; Localization Accuracy (Prec. (%), Recall (%), F).
4.2 Joint Exploration, Mapping, and Localization
We next consider a setting in which the agent learns an exploration policy via reinforcement, on top of a fixed mapping and localization network that has been pre-trained with random walk motion. We implement the policy network as another multigrid reader, and intend to leverage the pre-trained mapping and localization capabilities to learn a more effective policy.
We formulate exploration as a reinforcement learning problem: the agent receives a reward of when visiting a new space cell, if it hits a wall, and otherwise. We use a discount factor , and the A3C (Mnih et al., 2016) algorithm to train the policy subnet within our multigrid architecture.
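The reward structure and the discounted return it induces can be sketched as below. The reward magnitudes and the discount factor are placeholder assumptions chosen only to make the example runnable; they are not values taken from this text, and the A3C training loop itself is not shown.

```python
# Placeholder values (assumptions for illustration only).
NEW_CELL_REWARD = 1.0
WALL_PENALTY = -1.0
STEP_REWARD = 0.0
DISCOUNT = 0.99


def step_reward(position, visited, hit_wall):
    """Reward for one step: penalize wall hits, reward first visits to a cell."""
    if hit_wall:
        return WALL_PENALTY
    if position not in visited:
        return NEW_CELL_REWARD
    return STEP_REWARD


def discounted_returns(rewards, discount=DISCOUNT):
    """Return G_t = r_t + discount * G_{t+1} for every step of an episode."""
    returns, running = [], 0.0
    for reward in reversed(rewards):
        running = reward + discount * running
        returns.append(running)
    returns.reverse()
    return returns
```

Because revisiting a cell earns nothing under this scheme, maximizing the return pushes the policy toward covering unexplored space, which is where access to the stored map becomes useful.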
Figure 4 (left) depicts the localization loss while pre-training the mapping and localization subnets. Freezing these subnets, but continuing to monitor localization loss, we see that localization remains reliable while a rewarding policy is learned (Figure 4, right). The results demonstrate that the learned multigrid memory and query subnets generalize to trajectories that differ from those in their training dataset, as also conveyed in Table 1 (last row). Meanwhile, the multigrid policy network is able to utilize memory from the mapping subnet in order to learn an effective exploration policy. See the supplementary material for a visualization of learned exploratory behavior.
4.3 Algorithmic Tasks
We test the task-agnostic nature of our multigrid memory architecture by evaluating on a series of algorithmic tasks, closely inspired by those appearing in the original NTM work (Graves et al., 2014). For each of the following tasks, we consider two variants, increasing in level of difficulty.
Priority Sort. In the first variant, the network receives a sequence of twenty 3×3 patches, along with their priority. The task is to output the sequence of patches in the order of priority. Training and testing use randomly generated data. Training uses batch size 32, and testing uses 5000 sequences. We tune hyper-parameters as done for the mapping task. We structure our model as an encoder-decoder architecture (Figure 2, right). As revealed in Table 2, our network performs comparably with DNC, with both architectures achieving near-perfect performance.
The second variant extends the priority sort to require recognition capability. The input is a sequence of twenty 28×28 MNIST images (Lecun et al., 1998). The goal is to output the class of the input images in increasing order. Table 2 reveals that our architecture achieves a much lower error rate than DNC on this task (priority sort + classification), while also learning faster (Figure 6).
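One instance of the first priority sort variant can be generated as in the sketch below. It assumes binary patches, priorities drawn uniformly from [-1, 1], and a target ordered from highest to lowest priority; none of these conventions is stated in the text, and the function name is hypothetical.

```python
import random


def make_priority_sort_example(num_patches=20, patch_size=3, rng=None):
    """Return (patches, priorities, target) for one priority sort instance."""
    if rng is None:
        rng = random.Random()
    patches = [
        [[rng.randint(0, 1) for _ in range(patch_size)] for _ in range(patch_size)]
        for _ in range(num_patches)
    ]
    priorities = [rng.uniform(-1.0, 1.0) for _ in range(num_patches)]
    # Assumed convention: the highest-priority patch is emitted first.
    order = sorted(range(num_patches), key=lambda i: priorities[i], reverse=True)
    target = [patches[i] for i in order]
    return patches, priorities, target
```

In an encoder-decoder setup, the patches and priorities would be presented to the encoder one step at a time, and the decoder would be trained to emit `target` in order.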
Associative Recall. In the first task formulation, the network receives a sequence of ten 3×3 random patches, followed by a second instance of one of the first nine patches. The task is to output the patch that immediately followed the query in the input sequence. We demonstrate this capability using the multigrid reader/writer architecture (Figure 2, left). Training details are similar to the sorting task. Table 2 shows that both DNC and our architecture achieve near-zero error rate.
In the second variant, the input is a sequence of ten randomly chosen MNIST images (Lecun et al., 1998), where the network needs to output the class of the image immediately following the query (Figure 5). As shown in Table 2 and Figure 6, our multigrid memory network performs this task with significantly greater accuracy than the DNC, and also learns in fewer training steps.
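A data generator for the first associative recall formulation might look like the following sketch. It assumes binary patches and draws them without repetition so that the query identifies a unique position; the function names and the returned tuple layout are illustrative.

```python
import random


def decode_patch(code, patch_size):
    """Turn an integer into a patch_size x patch_size grid of its bits."""
    bits = [(code >> k) & 1 for k in range(patch_size * patch_size)]
    return [bits[r * patch_size:(r + 1) * patch_size] for r in range(patch_size)]


def make_recall_example(num_items=10, patch_size=3, rng=None):
    """Return (items, query_index, query, target) for one recall instance."""
    if rng is None:
        rng = random.Random()
    # Sample distinct codes so that no two patches in the sequence coincide.
    codes = rng.sample(range(2 ** (patch_size * patch_size)), num_items)
    items = [decode_patch(code, patch_size) for code in codes]
    query_index = rng.randrange(num_items - 1)  # one of the first nine items
    query = items[query_index]
    target = items[query_index + 1]             # the item that followed it
    return items, query_index, query, target
```

The MNIST variant would replace the random patches with images and the target patch with the class label of the following image.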
The harder variants of both priority sort and associative recall require a combination of memory and pattern recognition capability. The success of multigrid memory networks (and notably poor performance of DNCs) demonstrates that they are a unique architectural innovation. They are capable of learning to simultaneously perform representational transformations and utilize a large distributed memory store. Furthermore, as Figure 6 shows, across all difficult tasks, including mapping and localization, multigrid memory networks train substantially faster and achieve substantially lower loss than all competing methods.
Our multigrid memory architecture represents a new paradigm in the history of designs linking deep neural networks to long-term storage. We gain dramatic flexibility and new capabilities by co-locating memory with computation throughout the network structure. The technical insight driving this design is an identification of attentional capacity with addressing mechanisms, and the recognition that both can be supported by endowing the network with the right structural connectivity and components: multigrid links across a spatial hierarchy. Memory management is thereby implicit, being distributed across the network and learned in an end-to-end manner. Multigrid architectures efficiently address large amounts of storage, and are more accurate than competing approaches across a diverse range of memory-intensive tasks.
Acknowledgments. We thank Gordon Kindlmann for his support in pursuing this project, Chau Huynh for her help with the code, and Pedro Savarese and Hai Nguyen for fruitful discussions. This work was supported in part by the National Science Foundation under grant IIS-1830660.
- Bahdanau et al.  D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473, 2014.
- Collins et al.  J. Collins, J. Sohl-Dickstein, and D. Sussillo. Capacity and trainability in recurrent neural networks. ICLR, 2017.
- Das et al.  S. Das, C. L. Giles, and G.-Z. Sun. Learning context-free grammars: Capabilities and limitations of a recurrent neural network with an external stack memory. In CogSci, 1992.
- Das et al.  S. Das, C. L. Giles, and G.-Z. Sun. Using prior knowledge in an NNPDA to learn context-free languages. In NIPS, 1993.
- Donahue et al.  J. Donahue, L. A. Hendricks, M. Rohrbach, S. Venugopalan, S. Guadarrama, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
- Fraccaro et al.  M. Fraccaro, D. J. Rezende, Y. Zwols, A. Pritzel, S. M. A. Eslami, and F. Viola. Generative temporal models with spatial memory for partially observed environments. ICML, 2018.
- Gemici et al.  M. Gemici, C.-C. Hung, A. Santoro, G. Wayne, S. Mohamed, D. J. Rezende, D. Amos, and T. Lillicrap. Generative temporal models with memory. arXiv:1702.04649, 2017.
- Graves  A. Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013.
- Graves et al.  A. Graves, A.-r. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2013.
- Graves et al.  A. Graves, G. Wayne, and I. Danihelka. Neural turing machines. arXiv:1410.5401, 2014.
- Graves et al.  A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwińska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou, A. P. Badia, K. M. Hermann, Y. Zwols, G. Ostrovski, A. Cain, H. King, C. Summerfield, P. Blunsom, K. Kavukcuoglu, and D. Hassabis. Hybrid computing using a neural network with dynamic external memory. Nature, 2016.
- Grefenstette et al.  E. Grefenstette, K. M. Hermann, M. Suleyman, and P. Blunsom. Learning to transduce with unbounded memory. In NIPS, 2015.
- Hausknecht and Stone  M. Hausknecht and P. Stone. Deep recurrent Q-learning for partially observable MDPs. In AAAI Fall Symposium, 2015.
- He et al.  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CVPR, 2016.
- Hermans and Schrauwen  M. Hermans and B. Schrauwen. Training and analysing deep recurrent neural networks. NIPS, 2013.
- Hochreiter and Schmidhuber  S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997.
- Hölldobler et al.  S. Hölldobler, Y. Kalinke, and H. Lehmann. Designing a counter: Another case study of dynamics and activation landscapes in recurrent networks. Annual Conference on Artificial Intelligence, 1997.
- Joulin and Mikolov  A. Joulin and T. Mikolov. Inferring algorithmic patterns with stack-augmented recurrent nets. In NIPS, 2015.
- Ke et al.  T.-W. Ke, M. Maire, and S. X. Yu. Multigrid neural architectures. CVPR, 2017.
- Kurach et al.  K. Kurach, M. Andrychowicz, and I. Sutskever. Neural random-access machines. arXiv:1511.06392, 2015.
- Lecun et al.  Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
- Levine et al.  S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 2016.
- Li et al.  S. Li, W. Li, C. Cook, C. Zhu, and Y. Gao. Independently recurrent neural network (IndRNN): Building a longer and deeper RNN. CVPR, 2018.
- Mnih et al.  V. Mnih, N. Heess, A. Graves, et al. Recurrent models of visual attention. In NIPS, 2014.
- Mnih et al.  V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 2015.
- Mnih et al.  V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. ICML, 2016.
- Mozer and Das  M. C. Mozer and S. Das. A connectionist symbol manipulator that discovers the structure of context-free languages. In NIPS, 1993.
- Oh et al.  J. Oh, V. Chockalingam, S. Singh, and H. Lee. Control of memory, active perception, and action in Minecraft. arXiv:1605.09128, 2016.
- Parisotto and Salakhutdinov  E. Parisotto and R. Salakhutdinov. Neural map: Structured memory for deep reinforcement learning. ICLR, 2018.
- Pritzel et al.  A. Pritzel, B. Uria, S. Srinivasan, A. Puigdomenech, O. Vinyals, D. Hassabis, D. Wierstra, and C. Blundell. Neural episodic control. ICML, 2017.
- Reed and de Freitas  S. Reed and N. de Freitas. Neural programmer-interpreters. arXiv:1511.06279, 2015.
- Santoro et al.  A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap. One-shot learning with memory-augmented neural networks. arXiv:1605.06065, 2016.
- Schmidhuber  J. Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 1992.
- Schmidhuber  J. Schmidhuber. A ‘self-referential’ weight matrix. In ICANN, 1993.
- Siegelmann and Sontag  H. T. Siegelmann and E. D. Sontag. On the computational power of neural nets. Journal of computer and system sciences, 1995.
- Sukhbaatar et al.  S. Sukhbaatar, J. Weston, R. Fergus, et al. End-to-end memory networks. In NIPS, 2015.
- Sutskever et al.  I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
- Vaswani et al.  A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. NeurIPS, 2017.
- Vinyals et al.  O. Vinyals, M. Fortunato, and N. Jaitly. Pointer networks. In NIPS, 2015.
- Weston et al.  J. Weston, S. Chopra, and A. Bordes. Memory networks. ICLR, 2015.
- Xingjian et al.  S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. NIPS, 2015.
- Xu et al.  K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. arXiv:1502.03044, 2015.
- Zeng et al.  Z. Zeng, R. M. Goodman, and P. Smyth. Discrete recurrent neural networks for grammatical inference. IEEE Transactions on Neural Networks, 1994.