1 Introduction
Compositional data are ubiquitous, and compositional models are natural candidates for fitting such data. For example, recursive neural networks such as TreeRNNs and TreeLSTMs have been successfully applied to tasks in natural language processing, vision and neural programming
(Tai et al., 2015; Socher et al., 2011; Allamanis et al., 2017; Arabshahi et al., 2018; Evans et al., 2018). Although recursive networks significantly outperform their flat counterparts, they still fail to generalize to unseen composition complexity. In other words, if we fix the composition complexity of the training data and test the models on unseen data points, the performance of the models starts to decay as the complexity of the unseen data grows beyond that of the training data. This is an ongoing challenge for neural networks, and it is important to address because it is unreasonable to assume that the model sees all possible composition complexities during training. Therefore, it is important to build architectures that generalize to unseen composition complexity.
One of the reasons for this lack of generalization is error propagation as the composition complexity and depth grow. The more representative the components of the recursive neural network, the less the error propagation and the better the generalization to higher complexity. We show in this paper that external memory improves the generalization performance of recursive neural networks for neural programming.
Neural programming refers to the task of using neural networks to learn programs, mathematics and logic from data. Often in neural programming, a core neural network cell is used to learn and represent a certain program, such as addition or sorting, or a mathematical or logical function, such as multiplication or OR. Most of these programs and functions can be implemented more compactly and efficiently using recursion, potentially making it easier for neural networks to model them. For example, the pseudo code for a recursive and a non-recursive implementation of the multiplication function is given in Figure 2. Stacks are the standard mechanism for implementing and executing recursive functions.
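Since Figure 2 is not reproduced here, a minimal Python sketch of the two multiplication implementations might look like the following (function names and the repeated-addition formulation are illustrative assumptions, not the paper's exact pseudo code):

```python
def mul_iterative(x, y):
    """Non-recursive multiplication: repeated addition in a flat loop."""
    result = 0
    for _ in range(y):
        result += x
    return result


def mul_recursive(x, y):
    """Recursive multiplication: each pending call sits on the call
    stack until the base case y == 0 is reached, mirroring how a stack
    is used to execute a recursive function."""
    if y == 0:
        return 0
    return x + mul_recursive(x, y - 1)
```

The recursive form is more compact, and its execution trace is exactly a sequence of stack pushes and pops, which motivates the stack-augmented models below.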
In this paper, we build upon this observation and introduce stack-augmented recursive neural networks. We hypothesize that augmenting recursive neural networks with stacks allows them to better represent recursive functions. We show that, on the task of mathematical equation verification (Arabshahi et al., 2018), this structure consistently improves the generalization performance of TreeLSTMs on unseen (higher) depths. An example of a symbolic representation of the recursive multiplication algorithm in Figure 2 is given below:
(1)  
(2) 
The compositional structure of Equation 1 is depicted in Figure 1. In this paper, we use recursive neural networks to represent symbolic mathematical expressions, where each function is a neural network whose weights are shared with the other occurrences of that function.
We show that if we augment the cells of a recursive neural network with additional memory that has the structure of a stack, the cells learn a better representation of the functions, resulting in less error propagation and improved generalization performance.
1.1 Summary of Results
Our contributions in this paper are twofold. First, we propose to augment recursive neural networks with an external memory that has the data structure of a stack. This memory allows the network to generalize to data of higher complexity and depth.
We augment TreeRNNs and TreeLSTMs with an external data structure, namely a differentiable stack. A stack is a Last In First Out (LIFO) data structure often used to implement recursive computation. We develop soft push and pop operations for the stack and allow the model to learn to control these operations.
Second, we test the proposed model on a neural programming task called mathematical equation verification: given a symbolic math equality, the goal is to verify whether the equation is correct or incorrect (Allamanis et al., 2017; Arabshahi et al., 2018). We show that stack-augmented recursive neural networks consistently improve the generalization performance of TreeNNs and TreeLSTMs on unseen composition complexity. Moreover, we provide a model ablation study that shows the effect of different model components on the results.
1.2 Related Work
Recursive neural networks have been used to model compositional data in many applications including natural scene classification
(Socher et al., 2011), sentiment classification, semantic relatedness and syntactic parsing (Tai et al., 2015; Socher et al., 2011), and neural programming and logic (Allamanis et al., 2017; Zaremba et al., 2014; Evans et al., 2018). In all these problems there is an inherent hierarchy nested in the data, and capturing it allows the models to improve significantly. Recursive neural networks have been shown to be good at capturing long-term dependencies (Tai et al., 2015)
compared to flat recurrent neural networks
(Graves et al., 2013; Hochreiter and Schmidhuber, 1997). However, their performance drops significantly when generalizing to dependency ranges not seen in the training data. Recently, there have been attempts to provide recurrent neural models with a global memory that plays the role of a working memory and can be used to store and retrieve information (Graves et al., 2014; Jason Weston, 2015; Grefenstette et al., 2015; Joulin and Mikolov, 2015).
Memory networks and their differentiable counterpart (Jason Weston, 2015; Sukhbaatar et al., 2015) store instances of the input data in an external memory that can later be read through their recurrent neural network architecture. Neural Programmer-Interpreters augment their underlying recurrent LSTM core with a key-value-pair style memory and additionally enable read and write operations for accessing it (Reed and De Freitas, 2015; Cai et al., 2017).
Graves et al. (2014) define soft read and write operations so that a recurrent controller unit can access this memory. Another line of research proposes to augment recurrent neural networks with specific data structures such as stacks and queues (Das et al., 1992; Sun et al., 2017; Joulin and Mikolov, 2015; Grefenstette et al., 2015). There has also been an attempt to improve the long-range performance of recurrent neural networks with auxiliary losses (Trinh et al., 2018). Despite the amount of effort spent on augmenting recurrent neural networks, to the best of our knowledge, there has been no attempt to provide recursive networks with an external memory that allows them to generalize to higher recursion depths. Therefore, in this work, inspired by the recent attempts to augment recurrent neural networks with stacks, we propose to augment recursive neural networks with an external memory that each node can write to or read from at every step of the tree traversal.
In a parallel research direction, Kumar et al. (2016) present an episodic memory for question answering applications. This is different from the symbolic way of defining memory for models that handle neural programming tasks. Another line of work is graph memory networks and tree memory networks (Pham et al., 2018; Fernando et al., 2018), which construct a memory with a specific structure and differ from augmenting a recursive neural network with an additional global memory.
2 Memory augmented recursive neural networks
In this section, we introduce memory-augmented recursive neural networks. We augment TreeRNNs and TreeLSTMs with an external differentiable memory that has the data structure of a neural stack. The network learns to read from and write to the memory by popping from and pushing to the stack, respectively. We develop soft push and pop operations that make the network end-to-end differentiable, so these networks can be trained in an end-to-end manner. A sketch of the stack-augmented TreeLSTM is shown in Figure 2(b). In that figure, the stacks are the two-dimensional memory blocks, and the model accesses them through the push and pop gates. We describe the model in more detail in the next subsections.
In this section we denote matrices with bold uppercase letters, vectors with bold lowercase letters and scalars with non-bold letters. For simplicity, we present the equations for a binary recursive neural network; however, everything extends trivially to n-ary recursive neural networks. A recursive neural network is a tree-structured network where each node of the tree is a neural network block. The structure of the tree is often dictated by the data. For example, for language models, the tree can follow the dependency parse of the input sentence, and for mathematical equation verification, the tree follows the structure of the input equation (Figure 1). All the nodes or blocks of the recursive neural network have a state denoted by $\mathbf{h}_i \in \mathbb{R}^d$ and an input denoted by $\mathbf{x}_i$, where $d$ is the hidden dimension and $i = 1, \dots, n$ indexes the $n$ nodes in the tree. Let us label the children of node $i$ with $l(i)$ and $r(i)$. We have
$\mathbf{x}_i = [\,\mathbf{h}_{l(i)}\,;\,\mathbf{h}_{r(i)}\,]$  (3)
where $[\,\cdot\,;\,\cdot\,]$ indicates concatenation. If the block is a leaf node, $\mathbf{x}_i$ is the input of the network. For example, in the equation tree shown in Figure 1, all the terminal nodes are inputs to the neural network. Note that for simplicity we assume that the internal blocks do not have an external input and only take inputs from their children; the extension to the case with an external input is trivial and can be done by additionally concatenating that input with the children's states in Equation 3.
The way $\mathbf{h}_i$ is computed from $\mathbf{x}_i$ depends on the neural network block's architecture, and in the subsections below we explain how $\mathbf{h}_i$ is computed given the input state for each model. In the following subsections we reuse notation to make model comparison easier and refrain from introducing too many variables. We have tried to keep the names consistent for corresponding variables across models; therefore, the variables defined in each subsection should only be used within that subsection unless noted otherwise.
2.1 TreeRNNs
Before we present the stack-augmented recursive neural networks, let us start with vanilla TreeRNNs.
In a TreeRNN, the node state $\mathbf{h}_i$ is computed by passing $\mathbf{x}_i$ through a feedforward neural network. Assuming that the network blocks are single-layer networks, we have
$\mathbf{h}_i = \sigma(\mathbf{W}_{f_i}\,\mathbf{x}_i + \mathbf{b}_{f_i})$  (4)
where $\sigma$ is a nonlinear function such as the Sigmoid, $\mathbf{W}_{f_i}$ is the matrix of network weights and $\mathbf{b}_{f_i}$ is the bias. Note that the weights and biases are indexed with the node type $f_i$. These weights can either be shared among all the blocks or be specific to the node type. To elaborate, in the case of mathematical equation verification the parameters can be specific to the function, such that we have different weights for addition and multiplication. In the simplest case, the weights are shared among all the neural network blocks in the tree.
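A single TreeRNN node (Equations 3 and 4) can be sketched in NumPy as follows; the function name and single-layer form are illustrative assumptions, and in the paper the parameters `(W, b)` would be tied per function type:

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def tree_rnn_node(h_left, h_right, W, b):
    """One binary TreeRNN block: concatenate the children's hidden
    states (Eq. 3) and apply a single-layer sigmoid network (Eq. 4)."""
    x = np.concatenate([h_left, h_right])  # x_i = [h_l ; h_r]
    return sigmoid(W @ x + b)              # h_i = sigma(W x_i + b)
```

Evaluating a full equation tree then amounts to calling `tree_rnn_node` bottom-up, sharing `(W, b)` across nodes of the same function type.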
Stack-augmented TreeRNNs
In this model, each node is augmented with a stack $\mathbf{S}_i \in \mathbb{R}^{m \times d}$, where $m$ is the stack size. The input of each node is stored in the stack if the model decides to push, and the output representation is computed using the stack if the model decides to pop from it.
A stack is a LIFO data structure, and the network can only interact with it through its top. This is the desired behavior when dealing with recursive function execution. We indicate the top of the stack with $\mathbf{S}_i[1]$. The stack has two operations, pop and push. Inspired by Joulin and Mikolov (2015) and the pushdown automaton, we use a 2-dimensional action vector $\mathbf{a}_i$ whose elements represent the soft push and pop operations for interacting with the stack. These two actions are controlled by the network's input state at each node.
$\mathbf{a}_i = \mathrm{softmax}(\mathbf{W}_a\,\mathbf{x}_i + \mathbf{b}_a)$  (5)
where $\mathbf{W}_a$ and $\mathbf{b}_a$ are the action weights and bias and $\mathrm{softmax}$ is the softmax function. We denote the probability of the action push with $a_i^{\text{push}}$ and that of pop with $a_i^{\text{pop}}$; these two probabilities sum to $1$. We assume that the top of the stack is located at index $1$. Let $\bar{\mathbf{S}}_i$ denote the concatenation of the children's stacks given below
(6) 
We have,
(7)  
(8)  
(9) 
where $(\cdot)^{\top}$ indicates transposition. The stack update equations are then as follows
$\mathbf{S}_i[1] = a_i^{\text{push}}\,\mathbf{x}_i + a_i^{\text{pop}}\,\bar{\mathbf{S}}_i[2]$  (10)
$\mathbf{S}_i[j] = a_i^{\text{push}}\,\bar{\mathbf{S}}_i[j-1] + a_i^{\text{pop}}\,\bar{\mathbf{S}}_i[j+1]$  (11)
where $2 \le j \le m$ is the stack row index and $\bar{\mathbf{S}}_i$ is the combined children's stack. The stack is initialized with the all-zero matrix. Note that for $j = m$ the index $j+1$ is out of range in Equation 11, and we assume that in that case we pop an all-zero vector.
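The soft stack update above mixes the outcomes of a hard push and a hard pop. A NumPy sketch under our reading of these equations (row 0 as the top of the stack, and a single combined child stack passed in):

```python
import numpy as np


def soft_stack_update(stack, x, a_push, a_pop):
    """Soft push/pop stack update. A hard push shifts every row down and
    writes x at the top (the bottom row falls off); a hard pop shifts
    every row up, with an all-zero vector entering from below (the
    out-of-range case). The soft update is the probability-weighted
    mixture of the two outcomes."""
    m, d = stack.shape
    pushed = np.vstack([x[None, :], stack[:-1]])       # x on top
    popped = np.vstack([stack[1:], np.zeros((1, d))])  # zeros enter below
    return a_push * pushed + a_pop * popped
```

With `a_push = 1.0` this reduces to a classical push, with `a_pop = 1.0` to a classical pop, and intermediate probabilities interpolate between the two, which is what keeps the model differentiable.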
The output state of the node in this model, $\mathbf{h}_i$, is computed by looking at the top $k$ elements of the stack, where $k \le m$, together with the input state, as given below:
$\mathbf{z}_i = [\,\mathbf{x}_i\,;\,\mathbf{S}_i[1\!:\!k]\,]$  (12)
$\mathbf{h}_i = \sigma(\mathbf{W}\,\mathbf{z}_i + \mathbf{b})$  (13)
where $k \le m$ and $\mathbf{S}_i[1\!:\!k]$ indicates the top-$k$ rows of the stack. $k$ is a tuning parameter of the model and its choice is problem dependent.
Additional stack operation: NoOp
We can additionally add another stack operation called noop, in which case $\mathbf{a}_i \in \mathbb{R}^3$ and its elements correspond to push, pop and noop. Noop is the state where the network neither pushes to the stack nor pops from it, keeping the stack as is. We have
$\mathbf{a}_i = \mathrm{softmax}(\mathbf{W}_a\,\mathbf{x}_i + \mathbf{b}_a)$  (14)
where $a_i^{\text{push}} + a_i^{\text{pop}} + a_i^{\text{noop}} = 1$. Therefore, the stack operations change as shown below.
$\mathbf{S}_i[1] = a_i^{\text{push}}\,\mathbf{x}_i + a_i^{\text{pop}}\,\bar{\mathbf{S}}_i[2] + a_i^{\text{noop}}\,\bar{\mathbf{S}}_i[1]$  (15)
$\mathbf{S}_i[j] = a_i^{\text{push}}\,\bar{\mathbf{S}}_i[j-1] + a_i^{\text{pop}}\,\bar{\mathbf{S}}_i[j+1] + a_i^{\text{noop}}\,\bar{\mathbf{S}}_i[j]$  (16)
The state $\mathbf{h}_i$ can still be computed using Equation 13, but with the new stack updated using Equations 15 and 16.
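The noop variant only adds a third mixture component that leaves the stack untouched; a sketch under the same assumptions as the two-action update (row 0 as the top):

```python
import numpy as np


def soft_stack_update_noop(stack, x, a_push, a_pop, a_noop):
    """Three-way soft stack update: the push and pop outcomes as before,
    plus a noop outcome that keeps the stack unchanged, mixed by a
    3-way softmax action vector."""
    m, d = stack.shape
    pushed = np.vstack([x[None, :], stack[:-1]])
    popped = np.vstack([stack[1:], np.zeros((1, d))])
    return a_push * pushed + a_pop * popped + a_noop * stack
```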
2.2 TreeLSTMs
In this section we present stack-augmented TreeLSTMs. Similar to TreeRNNs, and in order to be self-contained, we start from the TreeLSTM and then extend it to stack-augmented TreeLSTMs.
$\mathbf{i}_i = \sigma(\mathbf{W}^{(i)}\mathbf{x}_i + \mathbf{b}^{(i)})$  (17)
$\mathbf{f}_i^{\,l} = \sigma(\mathbf{W}^{(f_l)}\mathbf{x}_i + \mathbf{b}^{(f_l)})$  (18)
$\mathbf{f}_i^{\,r} = \sigma(\mathbf{W}^{(f_r)}\mathbf{x}_i + \mathbf{b}^{(f_r)})$  (19)
$\mathbf{o}_i = \sigma(\mathbf{W}^{(o)}\mathbf{x}_i + \mathbf{b}^{(o)})$  (20)
$\mathbf{u}_i = \tanh(\mathbf{W}^{(u)}\mathbf{x}_i + \mathbf{b}^{(u)})$  (21)
$\mathbf{c}_i = \mathbf{i}_i \odot \mathbf{u}_i + \mathbf{f}_i^{\,l} \odot \mathbf{c}_{l(i)} + \mathbf{f}_i^{\,r} \odot \mathbf{c}_{r(i)}$  (22)
$\mathbf{h}_i = \mathbf{o}_i \odot \tanh(\mathbf{c}_i)$  (23)
where $\odot$ indicates elementwise multiplication, all the vectors on the left-hand side of Equations 17-23 are in $\mathbb{R}^d$ and all the weights on the right-hand sides are matrices in $\mathbb{R}^{d \times 2d}$. This structure is shown in Figure 2(a). As shown in the figure, the TreeLSTM's memory, $\mathbf{c}_i$, is a 1-dimensional vector. The stack-augmented TreeLSTM instead has a 2-dimensional memory ($\mathbf{S}_i$ in Figure 2(b)) where each row corresponds to a stack entry. We propose a push and pop mechanism for reading from and writing to this memory. In the next subsection we introduce stack-augmented TreeLSTMs.
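A binary TreeLSTM node in the style of Tai et al. (2015) can be sketched as follows; this reflects our reading of Equations 17-23, with a `params` dictionary mapping each gate name to a `(W, b)` pair (an illustrative layout, not the paper's implementation):

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def tree_lstm_node(h_l, h_r, c_l, c_r, params):
    """One binary TreeLSTM block: the per-child forget gates let the
    node keep or drop each child's memory cell independently."""
    x = np.concatenate([h_l, h_r])                       # Eq. 3
    i = sigmoid(params['i'][0] @ x + params['i'][1])     # input gate (Eq. 17)
    fl = sigmoid(params['fl'][0] @ x + params['fl'][1])  # left forget gate (Eq. 18)
    fr = sigmoid(params['fr'][0] @ x + params['fr'][1])  # right forget gate (Eq. 19)
    o = sigmoid(params['o'][0] @ x + params['o'][1])     # output gate (Eq. 20)
    u = np.tanh(params['u'][0] @ x + params['u'][1])     # candidate update (Eq. 21)
    c = i * u + fl * c_l + fr * c_r                      # memory cell (Eq. 22)
    h = o * np.tanh(c)                                   # output state (Eq. 23)
    return h, c
```

The stack-augmented variant below replaces the 1-dimensional cell `c` with a stack and the input gate with a push gate.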
Stack-augmented TreeLSTMs
TreeLSTM equations were presented in Equations 17 through 23. There are two main differences between the TreeLSTM and its stack-augmented counterpart. First, the stack-augmented TreeLSTM does not have an input gate ($\mathbf{i}_i$ in Equation 17) and instead uses a push gate to combine the input with the contents of the memory. Moreover, it fills up the stack using the push and pop gates presented below.
(24) 
for .
Here the push and pop operations are elementwise gates given below
$\mathbf{a}_i^{\text{push}} = \sigma(\mathbf{W}^{(\text{push})}\mathbf{x}_i + \mathbf{b}^{(\text{push})})$  (25)
$\mathbf{a}_i^{\text{pop}} = \sigma(\mathbf{W}^{(\text{pop})}\mathbf{x}_i + \mathbf{b}^{(\text{pop})})$  (26)
where $\sigma$ is the Sigmoid function, applied elementwise.
The stack and state update equations are therefore:
$\mathbf{S}_i[1] = \mathbf{a}_i^{\text{push}} \odot \mathbf{u}_i + \mathbf{a}_i^{\text{pop}} \odot \bar{\mathbf{S}}_i[2]$  (27)
$\mathbf{S}_i[j] = \mathbf{a}_i^{\text{push}} \odot \bar{\mathbf{S}}_i[j-1] + \mathbf{a}_i^{\text{pop}} \odot \bar{\mathbf{S}}_i[j+1]$  (28)
where $\mathbf{u}_i$ is given in Equation 21. The output state is computed by looking at the top-$k$ stack elements, as shown below, if $k > 1$:
(29)  
(30) 
where $\mathbf{S}_i[1\!:\!k]$ indicates the top-$k$ rows of the stack. If $k = 1$ we have:
$\mathbf{h}_i = \mathbf{o}_i \odot \tanh(\mathbf{S}_i[1])$  (31)
where $\mathbf{o}_i$ is given in Equation 20. As noted in the stack-augmented TreeRNN section, $k$ is a problem-dependent tuning parameter.
Additional stack operation: NoOp
Similar to stack-augmented TreeRNNs, we can use the noop operator to keep the stack in its previous state if need be. In this case, the noop gate and the stack update equations are
$\mathbf{a}_i^{\text{noop}} = \sigma(\mathbf{W}^{(\text{noop})}\mathbf{x}_i + \mathbf{b}^{(\text{noop})})$  (32)
where $\sigma$ is the Sigmoid function. The stack update equations change as shown below
(33)  
(34) 
Similar to stack-augmented TreeRNNs, the output of the stack-augmented TreeLSTM can be computed using Equations 30 and 31, depending on the choice of $k$.
3 Experimental setup
In this section we discuss the problem we evaluate our model on and state our implementation details. We explore the applicability of our model in a neural programming task called mathematical equation verification defined by Arabshahi et al. (2018). We briefly define this task in the next section and then provide implementation details about our model.
3.1 Mathematical Equation Verification
In this task, the inputs are symbolic and numeric mathematical equations from trigonometry and linear algebra and the goal is to verify their correctness. For example, the following symbolic equation is correct:
(35) 
whereas the numeric equation is incorrect. These symbolic and numeric equations are compositions of mathematical functions in trigonometry and algebra, and the recursive neural networks in the experiments mirror the composition structure of each input equation. The equations, and therefore the recursive neural networks, are rooted at equality, as shown in Figure 1. An indication of composition complexity in this scenario is the depth of the equations. We observe that as the depth of the equations grows beyond that of the training data, the accuracy drops significantly, and we show that augmenting recursive neural networks improves the generalization performance on equations of higher depth.
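For concreteness, the depth of an equation tree can be computed recursively; the tuple encoding and the convention that terminals have depth 0 are illustrative assumptions, since the paper does not restate its exact depth-counting convention here:

```python
def depth(node):
    """Depth of an equation tree: the number of function applications
    along the longest root-to-leaf path. A node is (label, children)."""
    label, children = node
    if not children:
        return 0
    return 1 + max(depth(c) for c in children)


# Example tree for an equality rooted at '=':  sin(x) = y
equation = ('=', [('sin', [('x', [])]), ('y', [])])
```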
A data generation strategy for this task was presented in Arabshahi et al. (2018). We use this strategy to generate symbolic mathematical equations of depths 1 through 13, which allows us to evaluate the generalizability of our model on more complex compositions. The full statistics of the generated data are presented in Table 2. As can be seen in the table, the dataset is approximately balanced between correct and incorrect equations. We train the models on equations of depth 1 through 7 and evaluate them on equations of depth 8 through 13. More details about this task are given in Arabshahi et al. (2018). Table 1 lists some randomly sampled examples.
Example  Label  Depth 

Correct  4  
Incorrect  4  
Correct  8  
Incorrect  8  
Correct  13  
Correct  13  
Incorrect  13 
3.2 Implementation Details
The models are implemented in PyTorch
(Paszke et al., 2017). Our recursive neural networks perform a two-class classification by optimizing the softmax loss on the output of the root. The root of the model represents equality and performs a dot product of the output embeddings of the right and left subtrees. The inputs of the neural networks are the terminal nodes of the equations, which consist of symbols representing variables and numbers. The leaves of the neural networks are two-layer feedforward networks that embed the symbols and numbers in the equation, and the other tree nodes are single-layer neural networks that represent the different functions in the equation. We share the parameters of the nodes that have the same functionality; for example, all the addition functions use the same set of parameters.

Statistic (by data depth)     all     1    2      3      4      5      6      7      8      9     10    11    12    13
Number of equations        41,894    21  355  2,542  7,508  9,442  7,957  6,146  3,634  1,999  1,124   677   300   189
Correct portion              0.56  0.52 0.57   0.62   0.61   0.58   0.56   0.54   0.52   0.52   0.49  0.50  0.50  0.50
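The root comparison described in this section can be sketched as follows; squashing the dot product with a sigmoid instead of the paper's two-class softmax loss is an illustrative simplification:

```python
import numpy as np


def equality_score(h_left, h_right):
    """Root 'equality' node: compare the left- and right-subtree output
    embeddings with a dot product and map the score to a probability
    that the equation is correct."""
    score = float(np.dot(h_left, h_right))
    return 1.0 / (1.0 + np.exp(-score))
```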
All the models use the Adam optimizer (Kingma and Ba, 2014) and are regularized with weight decay. The hidden dimension of the models is 50. All the models are run with three different random seeds, and the reported results are the average over the three seeds together with their standard deviation. We choose the models based on the best accuracy on the validation data. The train and validation datasets contain equations of depths 1 through 7 and the test dataset contains equations of depths 8 through 13.
3.3 Baselines and Evaluation Metrics
We use several baselines to validate our experiments, as described below. We also provide a model ablation study and investigate the behavior of our model under different settings in Section 4. In order to assess the generalizability of the model, we train our models on equations of depths 1 through 7 and test them on equations of depths 8 through 13. Let us first discuss the baselines used in the experiments.
Majority class
baseline is a classifier that always predicts the majority class. This indicates how hard the classification task is and how balanced the training data is.
TreeRNN
TreeLSTM
is the TreeLSTM network proposed by Tai et al. (2015) and presented in Section 2. The leaves of this TreeLSTM are two-layer neural networks that embed the symbols and numbers, and the other nodes are LSTM networks whose weights are shared across occurrences of the same function. The hidden dimension and the optimizer parameters are the same as those described in Section 3.2.
We did not choose recurrent neural networks as baselines since the above baselines have already outperformed these models on the same task (Arabshahi et al., 2018).
Evaluation metric
Our evaluation metrics are the accuracy, precision and recall of predicting correct and incorrect equations. These metrics are reported as percentages in Table 3 and abbreviated as Acc, Prec and Rcl, respectively.
4 Results
In this section, we evaluate the performance of the stackaugmented treeRNNs and TreeLSTMs. As mentioned, we evaluate our model on the task of equation verification described in Section 3.
Approach  Train (Depths 17)  validation (Depths 17)  Test (Depths 813)  

Acc  Prec  Rcl  Acc  Prec  Rcl  Acc  Prec  Rcl  
Majority Class  58.12      56.67      51.71     
TreeNN  96.03  95.36  97.94  89.11  87.79  93.84  80.67  82.63  
TreeNN+Stack  95.92  95.74  97.32  88.88  87.37  93.95  
TreeNN+Stack +noop  95.86  96.40  96.49  88.44  87.21  93.29  
TreeLSTM  99.40  99.47  92.67  96.82  
TreeLSTM+Stack  99.23  99.19  99.49  93.31  92.36  96.15  
TreeLSTM+Stack+normalize  98.76  98.59  99.29  93.32  92.23  96.33  
TreeLSTM+Stack+normalize+noop  98.34  98.13  99.04  93.84  92.60  96.87 
Table 3 shows the performance of all models on our equation verification task. We report the average and standard deviation of accuracy, precision and recall for the models initialized with different random seeds. The stack size in all the models in Table 3 is set to 5, and the models use the top-1 stack element to compute the output, i.e., $k = 1$.
4.1 Model Ablation
TreeNN+stack refers to the memory-augmented TreeRNN introduced in Section 2, and TreeNN+stack+noop is the same model with the noop operation introduced in Section 2. TreeLSTM+stack refers to the model presented in Section 2. TreeLSTM+stack+normalize refers to the stack-augmented TreeLSTM model in which the elementwise push and pop gates are normalized so that they sum to one at each position. TreeLSTM+stack+normalize+noop further adds the noop operator explained in Section 2. As can be seen, the normalized action with noop is the best model in terms of overall accuracy.
4.2 Generalization to higher depth
In order to see the performance of the models at different depths, we plot the accuracy breakdown by depth in Figure 4. As can be seen, the stack-augmented models consistently improve over TreeNN and TreeLSTM at higher depths, and as the depth grows beyond that of the training data, the improvement gap widens. Therefore, stack-augmented recursive neural networks improve the models' generalizability to data of higher depth.
This is an important result since generalizing to data of higher complexity than training is an ongoing challenge for neural networks in neural programming and we show that augmenting recursive neural networks with an external memory with the structure of a stack can be a potential solution to this problem.
Specifically, since the models have access to an external memory with the structure of a stack, they are able to fit the functions better, resulting in better output representations and reduced error propagation.
5 Conclusions
Recursive neural networks have shown good performance for modeling compositional data. However, their performance degrades significantly when generalizing to data of higher complexity. In this paper we present memory-augmented recursive neural networks to address this challenge. We augment recursive neural networks, namely TreeNNs and TreeLSTMs, with an external differentiable memory that has the structure of a stack. Stacks are Last In First Out (LIFO) data structures with two operations, push and pop. We present differentiable push and pop operations and augment recursive neural networks with this data structure.
Our experiments indicate that augmenting recursive neural networks with external memory allows the models to generalize to data of higher complexity. We evaluate our model against baselines such as TreeRNNs and TreeLSTMs and achieve better generalizability than the baselines. We also provide a model ablation study to analyze the contribution of the different components of the model to the final result.
References

Allamanis et al. (2017)
M. Allamanis, P. Chanthirasegaran, P. Kohli, and C. Sutton.
Learning continuous semantic representations of symbolic expressions.
In
Proceedings of the 34th International Conference on Machine LearningVolume 70
, pages 80–88. JMLR. org, 2017.  Arabshahi et al. (2018) F. Arabshahi, S. Singh, and A. Anandkumar. Combining symbolic expressions and blackbox function evaluations in neural programs. International Conference on Learning Representations (ICLR), 2018.
 Cai et al. (2017) J. Cai, R. Shin, and D. Song. Making neural programming architectures generalize via recursion. arXiv preprint arXiv:1704.06611, 2017.
 Das et al. (1992) S. Das, C. L. Giles, and G.-Z. Sun. Learning context-free grammars: Capabilities and limitations of a recurrent neural network with an external stack memory. In Proceedings of The Fourteenth Annual Conference of the Cognitive Science Society. Indiana University, page 14, 1992.
 Evans et al. (2018) R. Evans, D. Saxton, D. Amos, P. Kohli, and E. Grefenstette. Can neural networks understand logical entailment? International Conference on Learning Representations (ICLR), 2018.
 Fernando et al. (2018) T. Fernando, S. Denman, A. McFadyen, S. Sridharan, and C. Fookes. Tree memory networks for modelling long-term temporal dependencies. Neurocomputing, 304:64–81, 2018.
 Graves et al. (2013) A. Graves, A.-r. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6645–6649. IEEE, 2013.
 Graves et al. (2014) A. Graves, G. Wayne, and I. Danihelka. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.
 Grefenstette et al. (2015) E. Grefenstette, K. M. Hermann, M. Suleyman, and P. Blunsom. Learning to transduce with unbounded memory. In Advances in Neural Information Processing Systems, pages 1828–1836, 2015.
 Hochreiter and Schmidhuber (1997) S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
 Jason Weston (2015) J. Weston, S. Chopra, and A. Bordes. Memory networks. 2015.
 Joulin and Mikolov (2015) A. Joulin and T. Mikolov. Inferring algorithmic patterns with stack-augmented recurrent nets. In Advances in Neural Information Processing Systems, pages 190–198, 2015.
 Kingma and Ba (2014) D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kumar et al. (2016) A. Kumar, O. Irsoy, P. Ondruska, M. Iyyer, J. Bradbury, I. Gulrajani, V. Zhong, R. Paulus, and R. Socher. Ask me anything: Dynamic memory networks for natural language processing. In International Conference on Machine Learning, pages 1378–1387, 2016.
 Paszke et al. (2017) A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.
 Pham et al. (2018) T. Pham, T. Tran, and S. Venkatesh. Graph memory networks for molecular activity prediction. arXiv preprint arXiv:1801.02622, 2018.
 Reed and De Freitas (2015) S. Reed and N. De Freitas. Neural programmerinterpreters. arXiv preprint arXiv:1511.06279, 2015.
 Socher et al. (2011) R. Socher, C. C. Lin, C. Manning, and A. Y. Ng. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 129–136, 2011.
 Sukhbaatar et al. (2015) S. Sukhbaatar, J. Weston, R. Fergus, et al. End-to-end memory networks. In Advances in Neural Information Processing Systems, pages 2440–2448, 2015.
 Sun et al. (2017) G.-Z. Sun, C. L. Giles, H.-H. Chen, and Y.-C. Lee. The neural network pushdown automaton: Model, stack and learning simulations. arXiv preprint arXiv:1711.05738, 2017.
 Tai et al. (2015) K. S. Tai, R. Socher, and C. D. Manning. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075, 2015.
 Trinh et al. (2018) T. H. Trinh, A. M. Dai, M.-T. Luong, and Q. V. Le. Learning longer-term dependencies in RNNs with auxiliary losses. arXiv preprint arXiv:1803.00144, 2018.
 Zaremba et al. (2014) W. Zaremba, K. Kurach, and R. Fergus. Learning to discover efficient mathematical identities. In Advances in Neural Information Processing Systems, pages 1278–1286, 2014.