Memory Augmented Recursive Neural Networks

11/05/2019 ∙ by Forough Arabshahi, et al. ∙ 18

Recursive neural networks have shown an impressive performance for modeling compositional data compared to their recurrent counterparts. Although recursive neural networks are better at capturing long range dependencies, their generalization performance starts to decay as the test data becomes more compositional and potentially deeper than the training data. In this paper, we present memory-augmented recursive neural networks to address this generalization performance loss on deeper data points. We augment Tree-LSTMs with an external memory, namely neural stacks. We define soft push and pop operations for filling and emptying the memory to ensure that the networks remain end-to-end differentiable. In order to assess the effectiveness of the external memory, we evaluate our model on a neural programming task introduced in the literature called equation verification. Our results indicate that augmenting recursive neural networks with external memory consistently improves the generalization performance on deeper data points compared to the state-of-the-art Tree-LSTM by up to 10




1 Introduction

Compositional data are ubiquitous and compositional models are naturally the right candidates to fit such data. For example, recursive neural networks such as Tree-RNNs and Tree-LSTMs have been successfully applied to tasks in natural language processing, vision and neural programming (Tai et al., 2015; Socher et al., 2011; Allamanis et al., 2017; Arabshahi et al., 2018; Evans et al., 2018).

Although recursive networks significantly outperform their flat counterparts, they still fail to generalize to unseen composition complexity. In other words, if we fix the composition complexity of the training data and test the models on unseen data points, the performance of the models starts to decay as the complexity of the unseen data grows beyond that of the training data. This is an ongoing challenge for neural networks, and it is important to address because it is unreasonable to assume that the model sees all possible composition complexities during training. Therefore, it is important to build architectures that generalize to unseen composition complexity.

One reason for this lack of generalization is error propagation as the composition complexity and depth grow. The more representative the components of the recursive neural network, the less the error propagates and the better the generalization performance to higher complexity. We show in this paper that external memory improves the generalization performance of recursive neural networks for neural programming.

Neural programming refers to the task of using neural networks to learn programs, mathematics and logic from data. Often in neural programming, a core neural network cell is used to learn and represent a certain program, such as addition or sort, or a mathematical or logical function, such as multiplication or OR. Most of these programs and functions can be implemented more compactly and efficiently using recursion, potentially making it easier for neural networks to model them. For example, the pseudo code for a recursive and a non-recursive implementation of the multiplication function is given in Figure 2. It is known that stacks are used to implement and execute a recursive function.

In this paper, we build upon this observation and intuition and introduce stack augmented recursive neural networks. We hypothesize that augmenting neural networks with additional stacks allows them to better represent recursive functions. We show that on the task of mathematical equation verification (Arabshahi et al., 2018) this structure consistently improves the generalization performance of Tree-LSTM on unseen (higher) depths. An example of a symbolic representation of the recursive multiplication algorithm in Figure 2 is given below:


The compositional structure of Equation 1 is depicted in Figure 1. In this paper, we use recursive neural networks for representing symbolic mathematical expressions where each function is a neural network whose weights are shared with the other appearances of that function.

We show that if we augment the cells of a recursive neural network with additional memory with the structure of a stack the cells are able to learn a better representation of the functions resulting in less error propagation and an improved generalization performance.

Figure 1: Recursive symbolic multiplication given in Equation 1.

1.1 Summary of Results

Our contributions in this paper are twofold. First, we propose to augment recursive neural networks with an external memory that has the data structure of a stack. This memory allows the network to generalize to data of higher complexity and depth.

We augment Tree-RNNs and Tree-LSTMs with an external data structure, namely a differentiable stack. A stack is a Last In First Out (LIFO) data structure often used for recursive computation and implementation. We develop soft push and pop operations for the stack and allow the model to learn to control these operations.

Second, we test the proposed model on a neural programming task called mathematical equation verification where, given a symbolic math equality, the goal is to verify whether the equation is correct or incorrect (Allamanis et al., 2017; Arabshahi et al., 2018). We show that stack-augmented recursive neural networks consistently improve the generalization performance of Tree-NNs and Tree-LSTMs on unseen composition complexity. Moreover, we provide a model ablation study that shows the effect of different model components on the output results.

# Iterative multiplication
def mul(x, y):
  # x and y are lists of base-10 digits with the least significant digit at index 0
  prod = [0] * (len(x) + len(y))  # the product has at most len(x)+len(y) digits
  for x_i in range(len(x)):
    carry = 0
    for y_i in range(len(y)):
      prod[x_i + y_i] += carry + x[x_i] * y[y_i]
      carry = prod[x_i + y_i] // 10           # integer carry into the next digit
      prod[x_i + y_i] = prod[x_i + y_i] % 10  # keep a single digit in place
    prod[x_i + len(y)] += carry
  return prod
# Recursive multiplication
def recursive_mul(x, y):
  # x and y are non-negative integers
  if x < y:
    return recursive_mul(y, x)
  elif y != 0:
    return x + recursive_mul(x, y - 1)
  return 0  # base case: anything times zero is zero
Figure 2: Non-recursive vs. recursive multiplication of numbers x and y in base 10. As can be seen, the recursive implementation is more compact. Note that if x and y are floating point numbers, they should first be pre-processed and converted into integers through multiplication, and the result should be converted back to floating point through division.
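The remark that stacks are what execute a recursive function can be made concrete with a small sketch. The function `stack_mul` below is a hypothetical illustration, not part of the original figure: it reproduces the recursive multiplication for non-negative integers by pushing one frame per pending call and then popping the frames in last-in-first-out order.

```python
def stack_mul(x, y):
    """Multiply two non-negative integers by simulating recursion with a stack."""
    if x < y:
        x, y = y, x  # same argument swap as the recursive version
    stack = []
    # "Call" phase: push one frame per pending recursive call.
    while y != 0:
        stack.append(x)  # each frame remembers the addend it owes
        y -= 1
    # "Return" phase: pop frames in last-in-first-out order and accumulate.
    result = 0
    while stack:
        result += stack.pop()
    return result
```

The maximum stack height equals the recursion depth, which is the kind of bookkeeping the neural stack is meant to give the network access to.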

1.2 Related Work

Recursive neural networks have been used to model compositional data in many applications including natural scene classification (Socher et al., 2011), sentiment classification, semantic relatedness and syntactic parsing (Tai et al., 2015; Socher et al., 2011), neural programming and logic (Allamanis et al., 2017; Zaremba et al., 2014; Evans et al., 2018). In all these problems there is an inherent hierarchy nested in the data and capturing it allows the models to improve significantly.

Recursive neural networks have been shown to be good at capturing long-term dependencies (Tai et al., 2015) compared to flat recurrent neural networks (Graves et al., 2013; Hochreiter and Schmidhuber, 1997). However, their performance drops significantly when generalizing to dependency ranges not seen in the training data.

Recently there have been attempts to provide recurrent neural models with a global memory that plays the role of a working memory and can be written to and read from (Graves et al., 2014; Jason Weston, 2015; Grefenstette et al., 2015; Joulin and Mikolov, 2015).

Memory networks and their differentiable counterpart (Jason Weston, 2015; Sukhbaatar et al., 2015) store instances of the input data in an external memory that can later be read through their recurrent neural network architecture. Neural Programmer Interpreters augment their underlying recurrent LSTM core with a key-value pair style memory and additionally enable read and write operations for accessing it (Reed and De Freitas, 2015; Cai et al., 2017). Neural Turing Machines (Graves et al., 2014) define soft read and write operations so that a recurrent controller unit can access an external memory. Another line of research proposes to augment recurrent neural networks with specific data structures such as stacks and queues (Das et al., 1992; Sun et al., 2017; Joulin and Mikolov, 2015; Grefenstette et al., 2015). There has also been an attempt to improve the performance of recurrent neural networks (Trinh et al., 2018).

Despite the amount of effort spent on augmenting recurrent neural networks, to the best of our knowledge, there has been no attempt to add an external memory to recursive networks that would allow them to generalize to higher recursion depths. Therefore, in this work, inspired by the recent attempts to augment recurrent neural networks with stacks, we propose to augment recursive neural networks with an external memory. In each step of the tree traversal, the nodes can read from or write to this additional memory.

In a parallel research direction, Kumar et al. (2016) present episodic memory for question answering applications. This is different from the symbolic way of defining memory for models that handle neural programming tasks. Another line of work is graph memory networks and tree memory networks (Pham et al., 2018; Fernando et al., 2018), which construct a memory with a specific structure and are different from augmenting a recursive neural network with an additional global memory.

(a) Tree-LSTM
(b) Stack augmented Tree-LSTM
Figure 3: Model architecture of Tree-LSTM vs. stack-augmented Tree-LSTM. The yellow symbol in Figure 3(b) represents the soft push/pop operation and its details are shown in the square labeled soft push/pop. Both figures show a sub-tree with a parent block and its two children.

2 Memory augmented recursive neural networks

In this section, we introduce memory augmented recursive neural networks. We augment Tree-RNNs and Tree-LSTMs with external differentiable memory. This memory has the data structure of a neural stack. The network learns to read from and write to the memory by popping from and pushing to the memory, respectively. We develop soft push and pop operations that make the network end-to-end differentiable, so we can train these networks in an end-to-end manner. A sketch of the stack-augmented Tree-LSTM is shown in Figure 3(b); the figure labels the stacks and the operation through which the model accesses the memory. We describe more details about the model in the next subsections.

In this section we present matrices with bold uppercase letters, vectors with bold lowercase letters and scalars with non-bold letters. For simplicity, we will present the equations for a binary recursive neural network in this paper. However, everything can be trivially extended to n-ary recursive neural networks. A recursive neural network is a tree-structured network where each node of the tree is a neural network block. The structure of the tree is often indicated by the data. For example, for language models, the structure of the tree can be the structure of the dependency parse of the input sentence, or for mathematical equation verification, the structure of the tree is the structure of the input equation (Figure 1).


All the nodes or blocks of the recursive neural network have a state denoted by $\mathbf{h}_j \in \mathbb{R}^n$ and an input denoted by $\mathbf{x}_j \in \mathbb{R}^{2n}$, where $n$ is the hidden dimension, $j \in [0, N-1]$ and $N$ is the number of nodes in the tree. Let us label the children of node $j$ with $c_{j1}$ and $c_{j2}$. We have

$$\mathbf{x}_j = [\mathbf{h}_{c_{j1}}; \mathbf{h}_{c_{j2}}] \tag{3}$$

where $[\cdot\,;\cdot]$ indicates concatenation. If the block is a leaf node, $\mathbf{x}_j$ is the input of the network. For example, in the equation tree shown in Figure 1, all the terminal nodes are the inputs to the neural network. Note that for simplicity we assume that the internal blocks do not have an external input and only take inputs from their children. However, the extension to the case where we do have an external input is trivial and can be done by additionally concatenating the input with the children’s states in Equation 3.

The way $\mathbf{h}_j$ is computed from $\mathbf{x}_j$ depends on the neural network block’s architecture, and in the subsections below we explain how $\mathbf{h}_j$ is computed given the input state for each model. In the following subsections we re-use notation to make model comparison easier and refrain from introducing too many variables. We have tried to make the names consistent for corresponding variables across models. Therefore, note that the variables defined in each subsection should only be used within that subsection unless noted otherwise.
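The bottom-up computation described above, where each internal node receives the concatenation of its children's states, can be sketched in a few lines. This is a minimal illustration under stated assumptions: `Node`, `concat`, `evaluate`, `toy_embed` and `toy_block` are hypothetical names, and the toy functions merely stand in for the learned leaf embeddings and node blocks.

```python
class Node:
    """A tree node: leaves carry a symbol, internal nodes carry a function name."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)


def concat(states):
    """Concatenate the children's state vectors into one input vector."""
    return [value for state in states for value in state]


def evaluate(node, embed, block):
    """Compute the state of `node` bottom-up.

    embed(label)    -> state of a leaf
    block(label, x) -> state of an internal node from its concatenated input x
    """
    if not node.children:
        return embed(node.label)
    child_states = [evaluate(child, embed, block) for child in node.children]
    return block(node.label, concat(child_states))


# Toy stand-ins for the learned networks (hidden dimension n = 2).
def toy_embed(label):
    return [float(len(label)), 0.0]


def toy_block(label, x):
    n = len(x) // 2
    return [x[k] + x[n + k] for k in range(n)]  # sum of the two child states


tree = Node("add", [Node("x"), Node("mul", [Node("yy"), Node("z")])])
root_state = evaluate(tree, toy_embed, toy_block)
```

Because the recursion mirrors the tree, errors made in a deep sub-tree are passed upward through every ancestor, which is the error propagation the paper aims to reduce.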

2.1 Tree-RNNs

Before we present the stack-augmented recursive neural networks, let us start with vanilla Tree-RNNs. The node state $\mathbf{h}_j$ is computed by passing the node input $\mathbf{x}_j$, the concatenation of the children's states, through a feed-forward neural network. For example, assuming that the network blocks are single-layer networks, we have

$$\mathbf{h}_j = f(\mathbf{W}_j \mathbf{x}_j + \mathbf{b}_j)$$

where $f$ is a nonlinear function such as Sigmoid, $\mathbf{W}_j \in \mathbb{R}^{n \times 2n}$ is the matrix of network weights and $\mathbf{b}_j$ is the bias. Note that the weights and biases are indexed with $j$. These weights can either be shared among all the blocks or can be specific to the node type. To elaborate, in the case of mathematical equation verification the parameters can be specific to the function, such that we have different weights for addition and multiplication. In the simplest case, the weights are shared among all the neural network blocks in the tree.
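A single-layer Tree-RNN block with function-specific parameters could look like the sketch below. It is a plain-Python illustration, not the authors' PyTorch implementation; `make_block`, `tree_rnn_state` and the `blocks` dictionary are hypothetical names, and the uniform initialization range is an arbitrary assumption.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_block(n, rng):
    """Create (W, b) for one node type: W is n x 2n, b has length n."""
    W = [[rng.uniform(-0.1, 0.1) for _ in range(2 * n)] for _ in range(n)]
    b = [0.0] * n
    return W, b


def tree_rnn_state(params, x):
    """Single-layer block: sigmoid(W x + b) for a concatenated child input x."""
    W, b = params
    return [sigmoid(sum(w * v for w, v in zip(row, x)) + bias)
            for row, bias in zip(W, b)]


n = 3
rng = random.Random(0)
# One parameter set per function, shared by every node of that type.
blocks = {name: make_block(n, rng) for name in ("add", "mul")}
x = [0.5] * (2 * n)
h_add = tree_rnn_state(blocks["add"], x)
h_mul = tree_rnn_state(blocks["mul"], x)
```

Sharing one parameter set across all node types amounts to replacing the dictionary with a single `(W, b)` pair.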

Stack augmented Tree-RNNs

In this model, each node is augmented with a stack where is the stack size. The input of each node will be stored in the stack if the model decides to push, and the output representation is computed using the stack if the model decides to pop from it.

A stack is a LIFO data structure and the network can only interact with it through its top. This is the desired behavior when dealing with recursive function execution. We indicate the top of the stack with . The stack has two operations, pop and push. Inspired by Joulin and Mikolov (2015) and the push-down automaton, we use a 2-dimensional action vector whose elements represent the soft push and pop operations for interacting with the stack. These two actions are controlled by the network’s input state at each node.


where and

is the softmax function. We denote the probability of the action push with

and pop with and these two probabilities sum to 1.

We assume that the top of the stack is located at index . Let denote the concatenation of the children’s stacks given below


We have,


Where and indicates transposition. The stack update equations are then as follows


where and is the stack row index. The stack is initialized with the matrix. Note that for the stack will be out of index in Equation 11 and we assume that in that case we pop an all-zero vector.
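To make the soft push and pop concrete, the sketch below implements a stack update of the general form used by Joulin and Mikolov (2015), where each row becomes a weighted mix of the row above it (push) and the row below it (pop). It is an assumption-laden illustration rather than a reproduction of the paper's equations: `soft_update`, the argument `new_top` (the candidate value to push) and the example logits are hypothetical. Reading below the bottom of the stack returns an all-zero vector, as stated in the text.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def soft_update(stack, new_top, push, pop):
    """Return the updated stack (row 0 is the top).

    stack   : list of p rows, each a list of floats
    new_top : row written to the top when pushing
    push/pop: action probabilities, expected to sum to 1
    """
    p = len(stack)
    zero = [0.0] * len(new_top)
    updated = []
    for i in range(p):
        pushed = new_top if i == 0 else stack[i - 1]  # push shifts rows down
        popped = stack[i + 1] if i + 1 < p else zero  # pop shifts rows up
        updated.append([push * a + pop * b for a, b in zip(pushed, popped)])
    return updated


stack = [[0.0, 0.0] for _ in range(3)]    # stack initialized with zeros
push, pop = softmax([2.0, -1.0])          # hypothetical action logits
stack = soft_update(stack, [1.0, 1.0], push, pop)
```

With a hard action (probability 1 on push or on pop) the update reduces to an ordinary stack operation; intermediate probabilities blend the two outcomes, which keeps the operation differentiable.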

The output state of the node of this model, is computed by looking at the top elements of the stack where and the input state as given below:


where , and indicates the top-k rows of the stack. $k$ is a tuning parameter of the model and its choice is problem dependent.

Additional stack operation: No-Op

We can additionally add another stack operation called no-op where and the elements correspond to push, pop and no-op. No-op is the state where the network neither pushes to the stack nor pops from it and keeps the stack as is. We have


where . Therefore, the stack operations change as shown below.


The state can still be computed using Equation 13 but with the new stack updated using Equations 15 and 16.
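The no-op extension can be sketched by adding a third term that leaves each row in place. As before, this follows the general form of Joulin and Mikolov (2015) and is an illustration under assumptions, not the paper's exact equations; `soft_update_noop` and its arguments are hypothetical names.

```python
def soft_update_noop(stack, new_top, push, pop, noop):
    """Soft stack update with three actions whose probabilities sum to 1.

    Row 0 is the top of the stack; reading past the bottom yields zeros.
    """
    p = len(stack)
    zero = [0.0] * len(new_top)
    updated = []
    for i in range(p):
        pushed = new_top if i == 0 else stack[i - 1]  # result of a push
        popped = stack[i + 1] if i + 1 < p else zero  # result of a pop
        kept = stack[i]                               # result of a no-op
        updated.append([push * a + pop * b + noop * c
                        for a, b, c in zip(pushed, popped, kept)])
    return updated
```

Putting all the probability mass on no-op returns the stack unchanged, so a node can pass its children's memory upward without modifying it.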

2.2 Tree-LSTMs

In this section we present stack-augmented Tree-LSTMs. As with Tree-RNNs, and in order to be self-contained, we start from the Tree-LSTM and then extend it to the stack-augmented Tree-LSTM.


Where indicates element-wise multiplication, all the vectors in the left-hand-side of equations 17-23 are in and all the weight vectors in the right-hand sides are weight matrices in . This structure is shown in Figure 3(a). As shown in the figure, Tree-LSTM’s memory, , is a 1-dimensional vector. The stack-augmented Tree-LSTM has a 2-dimensional array for the memory ( in Figure 3(b)) where each row corresponds to a stack entry. We propose a push and pop mechanism for reading from and writing to this memory. In the next subsection we introduce stack-augmented Tree-LSTMs.
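For reference, a standard binary Tree-LSTM cell in the style of Tai et al. (2015) can be sketched as follows. The parameter layout here (one weight matrix per gate acting on the concatenated child states, with no external input at internal nodes) is an assumption and may differ in detail from Equations 17-23; `init_cell`, `tree_lstm_cell` and the gate names are hypothetical.

```python
import math
import random

GATES = ("i", "f_left", "f_right", "o", "u")


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, x):
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def init_cell(n, seed=0):
    """One (n x 2n weight matrix, bias) pair per gate."""
    rng = random.Random(seed)
    return {g: ([[rng.uniform(-0.1, 0.1) for _ in range(2 * n)] for _ in range(n)],
                [0.0] * n) for g in GATES}


def tree_lstm_cell(params, left, right):
    """left and right are (h, c) pairs of the two children; returns (h, c)."""
    (h_l, c_l), (h_r, c_r) = left, right
    x = h_l + h_r  # concatenation of the children's states
    pre = {g: affine(params[g][0], params[g][1], x) for g in GATES}
    i = [sigmoid(v) for v in pre["i"]]          # input gate
    f_l = [sigmoid(v) for v in pre["f_left"]]   # forget gate, left child
    f_r = [sigmoid(v) for v in pre["f_right"]]  # forget gate, right child
    o = [sigmoid(v) for v in pre["o"]]          # output gate
    u = [math.tanh(v) for v in pre["u"]]        # candidate memory
    c = [i[k] * u[k] + f_l[k] * c_l[k] + f_r[k] * c_r[k] for k in range(len(c_l))]
    h = [o[k] * math.tanh(c[k]) for k in range(len(c))]
    return h, c


n = 2
cell = init_cell(n)
leaf = ([0.1] * n, [0.0] * n)
h_parent, c_parent = tree_lstm_cell(cell, leaf, leaf)
```

The memory `c` here is a single vector per node; the stack-augmented variant replaces it with a matrix whose rows are stack entries.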

Stack-augmented Tree-LSTMs

Tree-LSTM equations were presented in Equations 17 through 23. There are two main differences between the Tree-LSTM and its stack-augmented counterpart. First, the stack-augmented Tree-LSTM does not have an input gate ( in Equation 17) and instead uses a push gate to combine the input with the contents of the memory. Second, it fills up the stack using the push and pop gates that are presented below.

The children’s stacks are combined using and gates in Equations 18 and 19


for .

Here the push and pop operations are element-wise gates given below



The stack and state update equations are therefore:


where is given in Equation 21. The output state is computed by looking at the top-k stack elements as shown below if


where and indicates the top-k rows of the stack. If we have:


where is given in Equation 20. As noted in the stack-augmented Tree-RNN section, $k$ is a problem-dependent tuning parameter.

Additional stack operation: No-Op

Similar to stack-augmented Tree-RNNs we can use the no-op operator to keep the stack in its previous state if need be. In this case the no-op gate and the stack update equations are


where . The stack update equations change as shown below


Similar to stack-augmented Tree-RNNs, the output can be computed using the stack-augmented Tree-LSTM in Equations 30 and 31 depending on the value of $k$.

3 Experimental setup

In this section we discuss the problem we evaluate our model on and state our implementation details. We explore the applicability of our model in a neural programming task called mathematical equation verification defined by Arabshahi et al. (2018). We briefly define this task in the next section and then provide implementation details about our model.

3.1 Mathematical Equation Verification

In this task, the inputs are symbolic and numeric mathematical equations from trigonometry and linear algebra and the goal is to verify their correctness. For example, the following symbolic equation is correct:


whereas the numeric equation is incorrect. These symbolic and numeric equations are a composition of mathematical functions in trigonometry and algebra, and the recursive neural networks in the experiments mirror the composition structure of each input equation. The equations, and therefore the recursive neural networks, are rooted at equality as shown in Figure 1. An indication of composition complexity in this scenario is the depth of the equations. We observe that as the depth of the equations grows beyond that of the training data, the accuracy drops significantly, and we show that augmenting recursive neural networks with a stack improves the generalization performance on equations of higher depth.

A data generation strategy for this task was presented in Arabshahi et al. (2018). We use this strategy to generate symbolic mathematical equations of depth 1 through 13, which allows us to evaluate the generalizability of our model on more complex compositions. The full statistics of the generated data are presented in Table 2. As can be seen in the table, the dataset is approximately balanced between correct and incorrect equations. We train the models on equations of depth 1 through 7 and evaluate the models on equations of depth 8 through 13. More details about this task are given in Arabshahi et al. (2018). Table 1 lists some randomly sampled examples of the generated equations.
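The depth-based split can be sketched as below, assuming equations are stored as nested tuples of the form `(function, arg1, arg2, ...)` with strings as leaves. Both the tuple encoding and the convention that a leaf has depth 0 are assumptions made for illustration; the dataset's own depth convention is defined in Arabshahi et al. (2018). `depth` and `split_by_depth` are hypothetical helper names.

```python
def depth(expr):
    """Depth of an expression tree; a bare symbol or number has depth 0."""
    if not isinstance(expr, tuple):
        return 0
    return 1 + max((depth(arg) for arg in expr[1:]), default=0)


def split_by_depth(equations, max_train_depth=7):
    """Equations up to max_train_depth go to training, deeper ones to test."""
    train = [e for e in equations if depth(e) <= max_train_depth]
    test = [e for e in equations if depth(e) > max_train_depth]
    return train, test


shallow = ("=", ("sin", "x"), "y")
nested = "x"
for _ in range(8):
    nested = ("sin", nested)  # eight nested applications
deep = ("=", nested, "y")

train, test = split_by_depth([shallow, deep])
```

Splitting on depth rather than at random is what makes the test set a measure of generalization to unseen composition complexity.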

Example Label Depth
Correct 4
Incorrect 4
Correct 8
Incorrect 8
Correct 13
Correct 13
Incorrect 13
Table 1: Examples of generated equations in the dataset

3.2 Implementation Details

The models are implemented in PyTorch (Paszke et al., 2017). Our recursive neural networks perform a two-class classification by optimizing the softmax loss on the output of the root. The root of the model represents equality and performs a dot product of the output embeddings of the right and left sub-trees. The inputs of the neural networks are the terminal nodes in the equations, which consist of numbers and symbols representing variables in the equation. The leaves of the neural networks are two-layer feed-forward networks that embed the symbols and numbers in the equation, and the other tree nodes are single-layer neural networks that represent different functions in the equation. We share the parameters of the nodes that have the same functionality. For example, all the addition functions use the same set of parameters.

Statistic            Data depth
                     all     1      2      3      4      5      6      7      8      9      10     11     12     13
Number of equations  41,894  21     355    2,542  7,508  9,442  7,957  6,146  3,634  1,999  1,124  677    300    189
Correct portion      0.56    0.52   0.57   0.62   0.61   0.58   0.56   0.54   0.52   0.52   0.49   0.50   0.50   0.50
Table 2: Data set statistics

All the models use the Adam optimizer (Kingma and Ba, 2014) with and and learning rate . We regularize the models with a

weight decay. The hidden dimension of the models is 50. All the models are run using three different seeds and the reported results are the average of the three seeds as well as their standard deviation. We choose the models based on the best accuracy on the validation data. The train and validation datasets contain equations of depth 1 through 7 and the test dataset contains equations of depth 8 through 13.

3.3 Baselines and Evaluation Metrics

We use several baselines to validate our experiments as described below. We also provide a model ablation study and investigate the behavior of our model under different settings in Section 4. In order to assess the generalizability of the model, we train our models on equations of depths 1 through 7 and test them on equations of depths 8 through 13. Let us first discuss the baselines that we used in the experiments.

Majority class

baseline is a classifier that always predicts the majority class. This is an indication of how hard the classification task is and shows how balanced the training data is.

Tree-NN

is the vanilla recursive neural network presented in Section 2. The implementation details of this network are similar to the experimental setup given in Section 3.2.

Tree-LSTM

is the Tree-LSTM network proposed by Tai et al. (2015) and presented in Section 2. The leaves of this Tree-LSTM are two-layer neural networks that embed the symbols and numbers, and the other nodes are LSTM networks whose weights are shared between nodes of the same function. The hidden dimension and the optimizer parameters are the same as described in Section 3.2.

We did not choose recurrent neural networks as baselines since the above baselines have already outperformed these models on the same task (Arabshahi et al., 2018).
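The majority-class baseline can be cross-checked against the per-depth statistics in Table 2. The sketch below copies the counts and correct portions from that table and computes the accuracy of always predicting the more frequent label; `majority_accuracy` is a hypothetical helper name. Because the portions in Table 2 are rounded to two decimals and the depth 1-7 data is further divided into train and validation sets, the result is only an approximation of the values in Table 3.

```python
# Per-depth statistics copied from Table 2.
counts = {1: 21, 2: 355, 3: 2542, 4: 7508, 5: 9442, 6: 7957, 7: 6146,
          8: 3634, 9: 1999, 10: 1124, 11: 677, 12: 300, 13: 189}
correct_portion = {1: 0.52, 2: 0.57, 3: 0.62, 4: 0.61, 5: 0.58, 6: 0.56,
                   7: 0.54, 8: 0.52, 9: 0.52, 10: 0.49, 11: 0.50,
                   12: 0.50, 13: 0.50}


def majority_accuracy(depths):
    """Accuracy of always predicting the more frequent label on the given depths."""
    depths = list(depths)
    total = sum(counts[d] for d in depths)
    correct = sum(counts[d] * correct_portion[d] for d in depths)
    p = correct / total  # estimated fraction of correct equations
    return max(p, 1.0 - p)


train_val_majority = majority_accuracy(range(1, 8))  # depths 1-7
test_majority = majority_accuracy(range(8, 14))      # depths 8-13
```

The estimates come out at roughly 0.58 for depths 1 through 7 and 0.51 for depths 8 through 13, in line with the Majority Class row of Table 3.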

Evaluation metric

Our evaluation metrics are the accuracy, precision and recall of predicting correct and incorrect equations. These metrics are reported as percentages in Table 3 and abbreviated as Acc, Prec and Rcl for accuracy, precision and recall, respectively.
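The three metrics can be computed from binary predictions as in the sketch below. Treating the label "correct equation" as the positive class (encoded as 1) is an assumption made for illustration, and `binary_metrics` is a hypothetical helper; the labels in the usage example are toy values, not results from the paper.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) as percentages."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    acc = 100.0 * correct / len(pairs)
    prec = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    rcl = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    return acc, prec, rcl


# Toy labels for illustration only.
acc, prec, rcl = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

Reporting precision and recall alongside accuracy matters here because a model that drifts toward predicting one class on deep equations can keep a moderate accuracy on a roughly balanced test set.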

4 Results

In this section, we evaluate the performance of the stack-augmented Tree-RNNs and Tree-LSTMs. As mentioned, we evaluate our model on the task of equation verification described in Section 3.

Approach Train (Depths 1-7) validation (Depths 1-7) Test (Depths 8-13)
Acc Prec Rcl Acc Prec Rcl Acc Prec Rcl
Majority Class 58.12 - - 56.67 - - 51.71 - -
Tree-NN 96.03 95.36 97.94 89.11 87.79 93.84 80.67 82.63
Tree-NN+Stack 95.92 95.74 97.32 88.88 87.37 93.95
Tree-NN+Stack +no-op 95.86 96.40 96.49 88.44 87.21 93.29
Tree-LSTM 99.40 99.47 92.67 96.82
Tree-LSTM+Stack 99.23 99.19 99.49 93.31 92.36 96.15
Tree-LSTM+Stack+normalize 98.76 98.59 99.29 93.32 92.23 96.33
Tree-LSTM+Stack+normalize+no-op 98.34 98.13 99.04 93.84 92.60 96.87
Table 3: Overall accuracy of the models on train and test datasets
Figure 4: breakdown of model accuracy across different depths for the stack-LSTM models and baselines Tree-RNN and Tree-LSTM

Table 3 shows the performance of all models on our equation verification task. We report the average and standard deviation of accuracy, precision and recall for the models initialized with different random seeds. The stack size in all the models in Table 3 is set to 5 and the models use the top-1 stack element to compute the output, therefore $k = 1$.

4.1 Model Ablation

Tree-NN+stack refers to the memory-augmented Tree-RNN introduced in Section 2 and Tree-NN+stack+no-op is the same model with the no-op operation introduced in Section 2. Tree-LSTM+stack refers to the model presented in Section 2. Tree-LSTM+stack+normalize refers to the stack-augmented Tree-LSTM model where the push and pop action vectors and are element-wise normalized so that each for . Tree-LSTM+stack+normalize+no-op further adds the no-op operator explained in Section 2 to the model. As can be seen, the model with normalized actions and no-op is the best in terms of overall accuracy.

4.2 Generalization to higher depth

In order to see the performance of the models on different depths, we have plotted the accuracy breakdown in terms of depth in Figure 4. As can be seen, Tree-LSTM+stack consistently improves on the performance of Tree-NN and Tree-LSTM on higher depths, and as depth grows beyond that of the training data, the improvement gap widens. Therefore, stack-augmented recursive neural networks improve the model’s generalizability on data of higher depth.

This is an important result since generalizing to data of higher complexity than training is an ongoing challenge for neural networks in neural programming and we show that augmenting recursive neural networks with an external memory with the structure of a stack can be a potential solution to this problem.

Specifically, since the models have access to an external memory that has the structure of a stack, they are able to fit the functions better and therefore produce better output representations, which reduces error propagation.

5 Conclusions

Recursive neural networks have shown good performance for modeling compositional data. However, their performance degrades significantly when generalizing to data of higher complexity. In this paper we present memory augmented recursive neural networks to address this challenge. We augment recursive neural networks, namely Tree-NNs and Tree-LSTMs, with an external differentiable memory that has the structure of a stack. Stacks are Last In First Out data structures with two operations, push and pop. We present differentiable push and pop operations and augment recursive neural networks with this data structure.

Our experiments indicate that augmenting recursive neural networks with external memory allows the model to generalize to data of higher complexity. We evaluate our model against baselines such as Tree-RNNs and Tree-LSTMs and achieve up to better generalizability compared to the baselines. We also provide a model ablation study to analyze the effect of the different components of the model on the final result.