Generalization ability is a key criterion for judging the power of neural networks. Recent work on memory-augmented neural networks (MANNs) has shown promising results on generalizing perfectly on simple algorithmic tasks (Grefenstette et al., 2015; Joulin and Mikolov, 2015; Graves et al., 2016; Rae et al., 2016; Gulcehre et al., 2017). However, the working mechanism of memory modules has not been well studied. Existing work has stopped at conjecturing the underlying learned strategies from shallow, single-case visualization without further verification. This raises concerns about the lack of interpretability and hinders us from designing better memory modules. Currently, there are two main difficulties in interpreting black-box memory modules.
Firstly, the diversity among different MANNs makes it hard to isolate the functions of the memory modules. Consequently, we cannot focus on the decisive parts of the models, because the differing designs make comparison hard to carry out. Secondly, interpreting MANNs is challenging in itself: although interpretation methods for RNN models are well studied, little attention has been paid to MANNs.
To solve the above problems, we formalize a unified framework for MANNs with different implementations of memory modules, which makes comparison among different memory modules feasible. We then propose a novel qualitative analysis method, based on dimension reduction, for interpreting memory cells by verifying hypotheses. We implement a neural Turing machine and a stack-augmented neural network under the unified framework and carry out detailed analysis on two algorithmic tasks, reversing a random sequence and evaluating arithmetic expressions, to show the effectiveness of our proposed analysis method.
The experiments and analysis show that external memory can satisfy the need to store intermediate results carried along the dependency path. Specifically, the neural Turing machine generalizes well on the mirror task, and the stack-augmented neural network generalizes well on both reversing a random sequence and evaluating arithmetic expressions. To summarize, our main contributions are as follows:
We generalize different MANNs with a unified framework, which ensures fair comparison of different memory mechanisms.
We propose a novel analysis method for memory cells. Applying it, we infer and verify hypotheses about what strategies are learned by models that generalize well.
2 Our unified MANN framework
In order to compare different types of memory modules, we propose a unified framework for MANNs by fixing the controller and the memory access interface. Abstracting the processing components used in stack-augmented neural networks (Joulin and Mikolov, 2015; Yogatama et al., 2018) and neural Turing machines (Graves et al., 2014), the framework shown in Figure 1 contains an LSTM controller and a memory module $M_t$ at time step $t$, whose cells are equipped with a specific read-write method, such as the push and pop actions of the stack memory extension.
The input $i_t$ to the controller at time step $t$ is a combination of the system input $x_t$ at time step $t$ and the readout $r_{t-1}$ at the last time step. Formally,

$i_t = g(x_t, r_{t-1}),$

where $g$ is a learnable function, which we here take simply to be concatenation. The state of the LSTM controller is represented as $h_t$, updated by the standard LSTM model:

$h_t, c_t = \mathrm{LSTM}(i_t, h_{t-1}, c_{t-1}).$

The controller output $h_t$ and the input $i_t$ are then fed into the write module and the read module, which determine how to interact with the memory. Finally, the readout $r_t$ is combined with the controller output to form the system output $o_t$ at time step $t$:

$o_t = q(h_t, r_t),$

where $q$ is also a concatenation function.
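The step above can be sketched in code. The following is a minimal, hypothetical rendering of the framework (the names `read`, `write`, and the plain `tanh` recurrence standing in for the full LSTM controller are our assumptions, not the authors' implementation); the memory interface is left abstract so a stack or tape module can specialize it:

```python
import numpy as np

class UnifiedMANN:
    """Sketch of the unified framework: fixed controller + pluggable memory."""

    def __init__(self, in_dim, hid_dim, mem_cells, cell_dim, seed=0):
        rng = np.random.default_rng(seed)
        # single recurrent weight matrix as a stand-in for the LSTM controller
        self.W = rng.normal(0.0, 0.1, (hid_dim, in_dim + cell_dim + hid_dim))
        self.h = np.zeros(hid_dim)            # controller state h_t
        self.r = np.zeros(cell_dim)           # readout r_t
        self.M = np.zeros((mem_cells, cell_dim))

    def read(self, h, M):
        """To be specialized (stack top / tape attention); default: first cell."""
        return M[0]

    def write(self, h, M):
        """To be specialized (push-pop / erase-add); default: no-op."""
        return M

    def step(self, x):
        i = np.concatenate([x, self.r])                          # i_t = g(x_t, r_{t-1})
        self.h = np.tanh(self.W @ np.concatenate([i, self.h]))   # controller update
        self.M = self.write(self.h, self.M)                      # interact with memory
        self.r = self.read(self.h, self.M)
        return np.concatenate([self.h, self.r])                  # o_t = q(h_t, r_t)
```

A subclass only has to override `read` and `write`, which is exactly the sense in which the framework isolates the memory module for comparison.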
In this paper, we experiment with two typical MANNs whose external memories are a stack memory and a tape memory respectively. To implement the two models under our proposed framework, we only need to specify the detailed read and write methods, which are given in Sections 2.1 and 2.2 respectively. As a special case, an LSTM model is an instance of the framework without memory. (The LSTM model does contain internal memory; the term "memory" in this paper refers to external memory unless otherwise specified.)
2.1 Stack memory
We adopt the stack-augmented neural network (SANN) proposed by Yogatama et al. (2018). The readout $r_t$ at each time step is simply the top cell of the stack: $r_t = M_t[0]$.
For each write step, there is a set of possible actions $a$ to choose from:

$\text{push}_k$: push the transformed current controller state $f(h_t)$ onto the stack after $k$ pops.

$\text{stay}_k$: keep the stack unchanged after $k$ pops.

Here $f$ is a learnable function, implemented as a linear transformation.
To keep the model differentiable, the write step for the stack is formalized as the expectation of the memory over the possible actions:

$M_{t+1} = \sum_{a} p(a)\, M_t^{a},$

where $M_t^{a}$ is the memory after action $a$ is adopted, and the probabilities $p(\cdot)$ of the write actions are computed from the current memory $M_t$ and the system input $x_t$ by a two-channel 1-D convolution over the memory cells. (Although the policy is parameterized with a recursive formula in the original paper, we find this simpler setting powerful enough.)
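The differentiable stack update can be sketched as follows; this is a minimal illustration assuming the action set $\{\text{push}_k, \text{stay}_k\}$ described above, with the convolution-based policy replaced by an explicitly given probability table:

```python
import numpy as np

def pop_k(M, k):
    """Memory after k pops: shift cells up, pad with zeros at the bottom."""
    return np.vstack([M[k:], np.zeros((k, M.shape[1]))])

def push(M, v):
    """Push v on top, shifting every cell down (the bottom cell falls off)."""
    return np.vstack([v, M[:-1]])

def stack_write(M, v, action_probs):
    """Expected memory over actions.

    action_probs maps ('push', k) or ('stay', k) to its probability; the
    result is sum_a p(a) * M^a, the differentiable write step.
    """
    out = np.zeros_like(M)
    for (kind, k), p in action_probs.items():
        Mk = pop_k(M, k)
        out += p * (push(Mk, v) if kind == 'push' else Mk)
    return out
```

With a one-hot probability on `('push', 0)` this reduces to an ordinary discrete push; mixed probabilities blend the candidate memories cell-wise.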
2.2 Tape memory
As the memory in a standard Turing machine is called a tape, we refer to this class of neural Turing machines as TANNs, i.e., tape-augmented neural networks. The tape memory module follows Graves et al. (2014), with random-access read-write steps fully controlled by the controller.
Both the read and write actions involve two preparation steps before the actual interaction with the memory: 1) an analysis step applied to the controller state and 2) an addressing step. The detailed formulas of these two steps are omitted here. After the read and write addresses are determined, the readout is the expectation of the memory cells over the read distribution $w^r_t$:

$r_t = \sum_i w^r_t[i]\, M_t[i],$

where $[i]$ indexes the $i$-th row of a vector or matrix. The write step is:

$M_{t+1}[i] = M_t[i] \odot \big(\mathbf{1} - w^w_t[i]\, e_t\big) + w^w_t[i]\, a_t,$

which combines the influence of the erase vector $e_t$ and the add vector $a_t$ on the memory over the write distribution $w^w_t$.
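These read and write steps translate directly into code. A sketch (the distributions `w_r` and `w_w` stand for the outputs of the omitted addressing step; this is an illustration of the standard NTM interaction, not the authors' exact implementation):

```python
import numpy as np

def tape_read(M, w_r):
    """Readout as the expectation of the memory cells over w_r."""
    return w_r @ M

def tape_write(M, w_w, erase, add):
    """Erase then add, each weighted per-cell by the write distribution w_w."""
    M = M * (1.0 - np.outer(w_w, erase))   # erase step: M[i] *= (1 - w[i]*e)
    return M + np.outer(w_w, add)          # add step:   M[i] += w[i]*a
```

With a one-hot `w_w` and an all-ones `erase`, this overwrites a single cell with `add`, i.e., the hard Turing-machine write is the limiting case.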
3 Experiments

In the experiments, we want to figure out which kinds of memory modules help generalization, and how, on two algorithmic tasks: reversing a random sequence (called the mirror task) and evaluating arithmetic expressions (called the M10AE task). A simple RNN (SimpRNN) and an LSTM are adopted as baseline models, representing neural networks without external memory modules.
3.1 Experimental settings
For the mirror task, the input is a sequence of fixed-size binary vectors. During the encoding (input) stage, the inputs are randomly sampled from a Bernoulli distribution, and during the decoding (output) stage the inputs are zero vectors. This setting is similar to that of the copy task in Graves et al. (2014). The number and the size of the memory cells, as well as the controller dimension, are fixed. For the M10AE task, the input embeddings are trainable parameters with random initialization, and the dimensions of a memory cell and of the controller state are set equal to the input embedding dimension. The number of memory cells, the batch size, and the learning rate of the Adam optimizer are chosen from small candidate sets, and the hyper-parameters are tuned on a development set.
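A hypothetical sketch of the mirror-task data generation described above (the vector size, Bernoulli parameter, and delimiter encoding are placeholders, since the original values are tuned hyper-parameters):

```python
import numpy as np

def mirror_sample(length, dim=8, p=0.5, rng=None):
    """Return (inputs, targets) for one mirror-task sample.

    Inputs: random binary vectors, then a delimiter, then zeros while decoding.
    Targets: the input vectors in reverse order, expected during decoding.
    """
    rng = rng or np.random.default_rng()
    seq = (rng.random((length, dim)) < p).astype(float)   # Bernoulli(p) vectors
    delim = np.full((1, dim), -1.0)                       # placeholder delimiter
    zeros = np.zeros((length, dim))                       # decoding-stage inputs
    inputs = np.vstack([seq, delim, zeros])
    targets = seq[::-1]
    return inputs, targets
```

Training at small `length` and testing at larger `length` is then exactly the generalization probe used in the experiments.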
First, we are interested in whether SANN and TANN are able to learn to output the input sequence in reverse order, which we call the mirror task. An example of this task is shown in the first row of Table 1. We append a delimiter to the end of the input to tell the model when to start outputting. We adopt the length of the input sequence as the difficulty measure of each sample.
The maximum length of the input sequence used during training is extended at test time. We count a prediction as correct only if the whole output sequence equals the reversed input sequence. The results are shown in Figure 2. Both SANN and TANN generalize beyond the input lengths of the training samples. Figure 2b shows that SANN converges faster than TANN, which matches the intuition that a stack memory is more suitable for this task.
Figure 2: (a) test performance; (b) learning curves.
Since TANN and SANN both generalize well, we analyze them to investigate what strategies they have induced. To gain an averaged view of the mechanisms underlying the two models on the mirror task, we generate samples of the same length, in which each input binary vector is restricted to be the binary encoding of an integer. These integers can be viewed as labels of the samples, which helps index the input vectors. All of the analysis for the mirror task is based on these samples.
Figure 3: (a) controller gates (TANN); (b) read-write policy (TANN); (c) controller gates (SANN); (d) push-pop policy (SANN).
We first investigate how the controller gates change on the mirror task. Specifically, we plot in Figures 3a and 3c the averaged saturation ratios (Karpathy et al., 2015) of the input gates and forget gates of the controller along the input. Following the convention of that work, a gate is defined as right-saturated if its value is larger than 0.9 and left-saturated if its value is less than 0.1. Comparing these two figures, we find that both TANN and SANN are sensitive to the delimiter in all controller gates, after which dramatic changes of the saturation ratios appear. The changes of the controller gates of TANN seem more complicated than those of SANN, which suggests that it is easier to control a stack memory than a tape memory on the mirror task. The earlier convergence of SANN on the mirror task also supports this idea.
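The saturation statistic itself is a simple fraction over gate units; a sketch (the 0.1/0.9 thresholds follow the convention cited above):

```python
import numpy as np

def saturation_ratios(gate_values, lo=0.1, hi=0.9):
    """Fractions of gate units that are left-/right-saturated.

    gate_values: gate activations in [0, 1] for one time step (or, averaged
    over samples, the per-step statistic plotted in Figure 3).
    """
    g = np.asarray(gate_values)
    return (g < lo).mean(), (g > hi).mean()
```

Plotting these two fractions per input position yields the curves whose jump at the delimiter is discussed above.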
We then visualize the read-write and push-pop policies for TANN and SANN respectively. For the read-write policy of TANN, we average the expected address over the samples. The expected address for the read and write operations at time step $t$ can be calculated as:

$\bar{a}_t = \sum_i i \cdot w_t[i],$

where $w_t$ is either the read distribution $w^r_t$ or the write distribution $w^w_t$. For SANN, the push probability is the total probability mass of all push actions:

$p^{\text{push}}_t = \sum_k p(\text{push}_k),$

and the expected number of pops $\bar{n}_t$ at time step $t$ can be calculated as:

$\bar{n}_t = \sum_k k\,\big(p(\text{push}_k) + p(\text{stay}_k)\big).$
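Both statistics are plain expectations over the respective distributions; a sketch, using the same `('push', k)` / `('stay', k)` action keys assumed in Section 2.1:

```python
import numpy as np

def expected_address(w):
    """Expected cell index under an address distribution w (read or write)."""
    w = np.asarray(w)
    return np.arange(len(w)) @ w

def push_prob_and_expected_pops(action_probs):
    """Push probability and expected pop count from a stack action distribution."""
    p_push = sum(p for (kind, _), p in action_probs.items() if kind == 'push')
    n_pops = sum(k * p for (_, k), p in action_probs.items())
    return p_push, n_pops
```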
The results are shown in Figures 3b and 3d. For both TANN and SANN, the policies for encoding (solid red lines) and decoding (solid blue lines) information are opposite to each other, indicating that each model reverses the sequence in its own way.
Based on the findings above, we form a hypothesis about the strategy learned by TANN, expressed as the following pseudocode:
The hypothesized strategy for SANN on the mirror task is:
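The two hypotheses can be paraphrased as the following sketches. These are our reading of the opposite encoding/decoding policies in Figure 3, not the authors' pseudocode: TANN writes each input to successive tape cells and reads them back in reverse, while SANN pushes during encoding and pops during decoding.

```python
def tann_mirror(seq, tape_size=64):
    """Hypothesized TANN strategy: sequential writes, then reversed reads."""
    tape = [None] * tape_size
    for addr, x in enumerate(seq):            # encoding: head moves right
        tape[addr] = x
    out = []
    for addr in range(len(seq) - 1, -1, -1):  # decoding: head moves left
        out.append(tape[addr])
    return out

def sann_mirror(seq):
    """Hypothesized SANN strategy: push while encoding, pop while decoding."""
    stack = []
    for x in seq:
        stack.append(x)                       # push each input symbol
    return [stack.pop() for _ in seq]         # popping yields the reverse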
The next goal is to verify these hypotheses. To this end, we evaluate the hypothesized information encoded in each memory cell with our proposed qualitative verification method. The method is based on the assumption that if the intermediate results at a certain step of the hypothesized strategy are the same, then their distributed representations in the memory should be similar as well. In detail, this involves the following steps:
Collecting the memory cells at the same position in the memory at the same time step.
Labelling the memory cells with the corresponding results derived from the candidate strategy to obtain (cell, label) pairs.
Using t-SNE (Maaten and Hinton, 2008) to visualize the labelled memory cell vectors.
If clearly separated labelled clusters appear, the candidate strategy is judged reasonable.
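The steps above can be sketched as a small pipeline. For a self-contained example we replace the t-SNE plot with a quantitative proxy, the mean within-label versus between-label distance, which captures the same "clear clusters" criterion (this proxy is our substitution, not part of the paper's method):

```python
import numpy as np
from itertools import combinations

def cluster_score(cells, labels):
    """Return (within, between): mean pairwise distances for same/different labels.

    A reasonable hypothesis should give within << between, i.e., memory cells
    that the candidate strategy labels identically lie close together.
    """
    cells = np.asarray(cells, dtype=float)
    within, between = [], []
    for i, j in combinations(range(len(cells)), 2):
        d = np.linalg.norm(cells[i] - cells[j])
        (within if labels[i] == labels[j] else between).append(d)
    return float(np.mean(within)), float(np.mean(between))
```

In the paper's setting, `cells` would be the memory cells collected at a fixed position and time step, and `labels` the intermediate results predicted by the candidate strategy; a wrong hypothesis mixes labels across clusters and drives the two distances together.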
Examples are shown in Figure 4. The compact labelled clusters in Figures 4a and 4b (for TANN) and Figures 4d and 4e (for SANN) support the hypothesized semantics given in the caption of each subfigure. We also include negative examples in Figures 4c and 4f to illustrate what the evaluation looks like when a hypothesis is not reasonable. Since the input sequences are randomly sampled, labelling guided by a wrong hypothesis cannot be consistent across samples, and this mismatch shows up in the visualization as chaotic labels.
Figure 4: t-SNE visualizations of labelled memory cells; (a)-(b) TANN, (d)-(e) SANN, (c) and (f) negative examples.
After the analysis of the mirror task, we extend the interpretation procedure to a harder case. We propose a new simple algorithmic task named modulo-10 arithmetic expressions (M10AE), in which each input sequence is an arithmetic expression without parentheses and the output is its evaluation result. To keep the task tractable to analyze while preserving its recursive nature, we add the following constraints:
Each numeral is limited to a single digit.
The / represents a modulo operator.
Each intermediate result is taken modulo 10.
The * and / have higher priority than + and -.
An example is shown in the second row of Table 1. Under these constraints, the evaluation results can only be integers from 0 to 9, so we formalize the task as 10-class classification. Since the computation is interrupted whenever a low-priority operator is encountered, a larger number of low-priority operators (#LPO) means a larger memory burden; we therefore take #LPO as the difficulty measure for M10AE.
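For concreteness, here is a reference evaluator for M10AE under the four constraints; this is our own sketch of the task definition (`/` is modulo, `*` and `/` bind tighter, every intermediate result is reduced mod 10), not a model:

```python
def eval_m10ae(expr):
    """Evaluate a single-digit, parenthesis-free M10AE expression.

    Assumes well-formed input and that no modulo-by-zero occurs.
    """
    total, low_op, term = 0, '+', 0
    i = 0
    while i < len(expr):
        c = expr[i]
        if c.isdigit():
            term = int(c)
        elif c in '*/':                     # high priority: resolve immediately
            rhs = int(expr[i + 1])          # the numeral right after the op
            term = (term * rhs) % 10 if c == '*' else term % rhs
            i += 1                          # skip the consumed numeral
        else:                               # low priority + or -: fold the term
            total = (total + term) % 10 if low_op == '+' else (total - term) % 10
            low_op = c
        i += 1
    return (total + term) % 10 if low_op == '+' else (total - term) % 10
```

On the running example 8+6*3/2-4: 6*3 = 18 mod 10 = 8, then 8 mod 2 = 0, and 8+0-4 gives 4.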
Recently, Hupkes et al. (2018) and Jacob et al. (2018) have proposed two similar tasks. However, they formalize the task as either a regression problem or a two-digit sequence prediction problem, without considering the intermediate results. These settings do little to restrict the space of intermediate results, whose distribution is therefore much sparser, and this hinders empirical verification of a candidate strategy. By contrast, in M10AE the result is given as a classification label, and both the intermediate and final results are integers from 0 to 9 thanks to the modulo-10 design. This yields abundant samples for each intermediate category.
Examples are generated for training and validation, with a larger maximum #LPO for validation than for training. The results are shown in Figure 5. Overall, the performance of all models except SANN drops quickly as #LPO increases. Figure 5a indicates that SANN has learned to generalize on this task. In addition, the models with an external memory outperform those without one (i.e., LSTM and SimpRNN). The baseline SimpRNN performs extremely badly, indicating that the internal memory of the LSTM is crucial for this task. As shown in Figure 5b, all the models converge slowly, which reflects the complexity of the task.
Figure 5: (a) test performance; (b) learning curves.
As SANN appears to induce a stable strategy, we analyze it next. We generate samples with the same structure to obtain generally averaged patterns, and all of the analysis is based on these examples. Each example follows a fixed template in which $n_i$ denotes the $i$-th numeral, $l_i$ the $i$-th low-priority operator (i.e., + or -), and $h_i$ the $i$-th high-priority operator (i.e., * or /). One sample is 8+6*3/2-4; the other samples are generated by substituting the operators and numerals at the same positions with other symbols from the same category, e.g., * with /, + with -, 1 with 2, and so on.
In the same way as in Section 3.2.2, we first visualize the controller gates and the push-pop policy of SANN. The results are shown in Figure 6. As shown in Figure 6a, the controller gates peak at the time steps when the low-priority operators appear, which indicates that the controller of SANN tracks specific categories of symbols, as the controllers do in the mirror task. As shown in Figure 6b, the push and pop actions are adopted regularly with respect to local structure: 1) every time a low-priority operator (+, -) arrives, the push probability drops to zero and the expected number of pops rises sharply; and 2) when dealing with high-priority operations (*, /), the push and pop lines rise and fall with each numeral input relatively more gently.
Figure 6: (a) controller gates; (b) push-pop policy.
However, it is still hard to form a hypothesis about what SANN has learned, so we visualize the averaged memory at each time step to obtain further clues, shown in Figure 7. The pattern is highly regular: the stack pops all of its stored items when it reaches the end of a term (i.e., at the low-priority operators), which is consistent with Figure 6b; while processing a term, the stack pushes every time it sees a high-priority operator.
A hypothesis about what strategy SANN has induced is shown below:
where $\mathrm{eval}(op, x, y)$ evaluates the result of applying the operator $op$ to the arguments $x$ and $y$. (As a special case, if the stack is empty or $op$ is null, the second argument of this function is returned.)
There are two kinds of storage in this hypothesis: in-controller storage and in-memory storage. An example is shown in Figure 8, where the boxes and arrows in purple (an evaluation step, e.g., 6*3%10=8) and in red (a combination-pushing step, e.g., pushing the combination of 6 and *) direct the information flow and indicate a recursive strategy.
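The hypothesized strategy can be simulated symbolically; the following is our paraphrase of Figure 8 (the names `cur` for the in-controller register and the `(value, operator)` pairs for the in-memory storage are assumptions):

```python
def sann_m10ae(expr):
    """Simulate the hypothesized SANN strategy on an M10AE expression.

    The controller holds the current value `cur`; the stack stores
    (value, pending-operator) pairs. A high-priority pair is resolved as
    soon as the next numeral arrives; the remaining low-priority pair is
    resolved when the term ends (all items popped), matching Figure 6b.
    Assumes well-formed input with no modulo by zero.
    """
    ops = {'+': lambda a, b: (a + b) % 10, '-': lambda a, b: (a - b) % 10,
           '*': lambda a, b: (a * b) % 10, '/': lambda a, b: a % b}
    stack, cur = [], None
    for c in expr:
        if c.isdigit():
            d = int(c)
            if stack and stack[-1][1] in '*/':       # pending high-priority op
                v, op = stack.pop()
                cur = ops[op](v, d)                  # e.g., 6*3 % 10 = 8
            else:
                cur = d
        elif c in '*/':
            stack.append((cur, c))                   # push the (value, op) combo
        else:                                        # + or -: the term is done
            while stack:                             # pop everything stored
                v, op = stack.pop()
                cur = ops[op](v, cur)
            stack.append((cur, c))
    while stack:                                     # fold in the final term
        v, op = stack.pop()
        cur = ops[op](v, cur)
    return cur
```

Running it on the template sample 8+6*3/2-4 reproduces the purple and red steps of Figure 8: (6, *) is pushed, resolved at the numeral 3, and the accumulated term is folded into the total at each + or -.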
As in Section 3.2.2, we then verify the hypothesis with our proposed analysis method. The result is shown in Figure 9, where the clear clusters (in Figures 9a and 9b) indicate that the hypothesis is reasonable. A negative example is also included in Figure 9c to illustrate an unreasonable hypothesis.
Figure 9: t-SNE visualizations for M10AE; (a)-(b) support the hypothesis, (c) is a negative example.
4 Related Work
The related work can be divided into two categories, memory-augmented neural networks and visualization methods.
4.1 Memory Augmented Neural Network
The first MANN model is the neural Turing machine, NTM (Graves et al., 2014), which was proposed to give RNNs logical flow control over an external memory. These models are closely associated with automata theory: MANNs can be viewed as RNNs, which have only internal memory, augmented with different types of data structures, just as a simple DFA is extended to form more complex automata such as PDAs, LBAs, and Turing machines. Some work therefore focuses on memory types with different prior biases for specific tasks. For example, an RNN augmented with stacks can learn to generalize on context-free grammars (Joulin and Mikolov, 2015; Grefenstette et al., 2015), a gated memory lets it model the shortest syntactic dependencies (Gulcehre et al., 2017), and a more general tape memory lets it solve shortest-path tasks (Graves et al., 2016). Other work is dedicated to overcoming the defects of these models, especially NTMs: separating each memory cell into content and address vectors (Gulcehre et al., 2016), introducing memory allocation and de-allocation protocols (Graves et al., 2016; Munkhdalai and Yu, 2017), speeding up the addressing mechanism (Rae et al., 2016), and adding adaptive computation (Yogatama et al., 2018).
4.2 Understanding Recurrent Networks
Understanding RNNs and their variants is enjoying renewed interest, as a result of their successful application to a wide range of machine learning problems on sequential data. On one hand, much work focuses on what RNNs remember in their hidden states. Some studies visualize the dynamics of the LSTM gates in terms of absolute values and saturation rates that track structural hints (Karpathy et al., 2015; Ghaeini et al., 2018). Others plot t-SNE visualizations of clause representations and derivative saliency maps to understand polarity changes in sentiment analysis (Li et al., 2016a). There is also research that trains multiple decoders to predict the history of inputs, checking what and how much information is stored (Koppula et al., 2018). In Verwimp et al. (2018), the gradient matrix of the states with respect to the input embeddings is decomposed with SVD to find the principal directions in the input space. On the other hand, which parts of the input have a key effect on model decisions is also a popular topic. Many works frame this as searching for the minimum set of word vectors, or of their dimensions, that flips the model's decision (Li et al., 2016b; Westhuizen and Lasenby, 2018); other work computes various relevance measures between inputs and outputs to explore the contribution of each word (Arras et al., 2017; Ding et al., 2017; van der Westhuizen and Lasenby, 2017).
5 Conclusion

In this paper, we analyze the strategies learned by memory-augmented neural networks on the tasks of reversing a random sequence and evaluating arithmetic expressions. By visualizing the controller gates and the read-write policies for the memory, we find that both models summarize the input symbols into categories and dynamically change their policies according to these categories. We make hypotheses about the strategies induced by the models and verify them with our proposed novel qualitative analysis method. The analysis pipeline can be mimicked in other settings, and we hope this work inspires more research on strategy interpretation for MANNs.
- Arras et al.  Leila Arras, Grégoire Montavon, Klaus-Robert Müller, and Wojciech Samek. Explaining recurrent neural network predictions in sentiment analysis. In Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 159–168. Association for Computational Linguistics, 2017.
- Ding et al.  Yanzhuo Ding, Yang Liu, Huanbo Luan, and Maosong Sun. Visualizing and understanding neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1150–1159, 2017.
- Ghaeini et al.  Reza Ghaeini, Xiaoli Fern, and Prasad Tadepalli. Interpreting recurrent and attention-based neural models: a case study on natural language inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4952–4957. Association for Computational Linguistics, 2018.
- Graves et al.  Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. CoRR, abs/1410.5401, 2014.
- Graves et al.  Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471, 2016.
- Grefenstette et al.  Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman, and Phil Blunsom. Learning to transduce with unbounded memory. In Advances in Neural Information Processing Systems, pages 1828–1836, 2015.
- Gulcehre et al.  Caglar Gulcehre, Sarath Chandar, Kyunghyun Cho, and Yoshua Bengio. Dynamic neural turing machine with soft and hard addressing schemes. CoRR, abs/1607.00036, 2016.
- Gulcehre et al.  Caglar Gulcehre, Sarath Chandar, and Yoshua Bengio. Memory augmented neural networks with wormhole connections. CoRR, abs/1701.08718, 2017.
- Hupkes et al.  Dieuwke Hupkes, Sara Veldhoen, and Willem Zuidema. Visualisation and 'diagnostic classifiers' reveal how recurrent and recursive neural networks process hierarchical structure. Journal of Artificial Intelligence Research, 61:907–926, 2018.
- Jacob et al.  Athul Paul Jacob, Zhouhan Lin, Alessandro Sordoni, and Yoshua Bengio. Learning hierarchical structures on-the-fly with a recurrent-recursive model for sequences. In Proceedings of The Third Workshop on Representation Learning for NLP, Rep4NLP@ACL 2018, Melbourne, Australia, July 20, 2018, pages 154–158, 2018.
- Joulin and Mikolov  Armand Joulin and Tomas Mikolov. Inferring algorithmic patterns with stack-augmented recurrent nets. In Advances in neural information processing systems, pages 190–198, 2015.
- Karpathy et al.  Andrej Karpathy, Justin Johnson, and Fei-Fei Li. Visualizing and understanding recurrent networks. CoRR, abs/1506.02078, 2015.
- Koppula et al.  Skanda Koppula, Khe Chai Sim, and Kean K. Chin. Understanding recurrent neural state using memory signatures. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2018, Calgary, AB, Canada, April 15-20, 2018, pages 2396–2400, 2018.
- Li et al. [2016a] Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan Jurafsky. Visualizing and understanding neural models in nlp. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 681–691, 2016.
- Li et al. [2016b] Jiwei Li, Will Monroe, and Dan Jurafsky. Understanding neural networks through representation erasure. CoRR, abs/1612.08220, 2016.
- Maaten and Hinton  Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008.
- Munkhdalai and Yu  Tsendsuren Munkhdalai and Hong Yu. Neural semantic encoders. In Proceedings of the conference. Association for Computational Linguistics. Meeting, volume 1, page 397. NIH Public Access, 2017.
- Rae et al.  Jack Rae, Jonathan J Hunt, Ivo Danihelka, Timothy Harley, Andrew W Senior, Gregory Wayne, Alex Graves, and Tim Lillicrap. Scaling memory-augmented neural networks with sparse reads and writes. In Advances in Neural Information Processing Systems, pages 3621–3629, 2016.
- van der Westhuizen and Lasenby  Jos van der Westhuizen and Joan Lasenby. Visualizing LSTM decisions. CoRR, abs/1705.08153, 2017.
- Verwimp et al.  Lyan Verwimp, Hugo Van hamme, Vincent Renkens, and Patrick Wambacq. State gradients for rnn memory analysis. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 344–346. Association for Computational Linguistics, 2018.
- Westhuizen and Lasenby  Jos Van Der Westhuizen and Joan Lasenby. Techniques for visualizing lstms applied to electrocardiograms. arXiv preprint arXiv:1705.08153, 2018.
- Yogatama et al.  Dani Yogatama, Yishu Miao, Gabor Melis, Wang Ling, Adhiguna Kuncoro, Chris Dyer, and Phil Blunsom. Memory architectures in recurrent neural network language models. In International Conference on Learning Representations, 2018.