such as the Long Short-Term Memory (LSTM) [lstm97and95; Gers:2000nc] are general computers (see, e.g., [siegelmann91turing]). LSTM-based systems achieved breakthroughs in various speech and Natural Language Processing tasks [googlevoice2015; wu2016google; facebook2017]. Unlike humans, however, current RNNs cannot easily extract symbolic rules from experience and apply them to novel instances in a systematic way [fodor1988connectionism; hadley1994systematicity]. They are catastrophically affected by systematic differences between training and test data [still_not_systematic_lake; lake2017building; atzmon2016learning; phillips1995connectionism].
In particular, standard RNNs have performed poorly at natural language reasoning (NLR) [babi_tasks_weston], where systematic generalisation (such as rule-like extrapolation) is essential. Consider a network trained on a variety of NLR tasks involving short stories about multiple entities. One task could be about tracking entity locations ("[…] Mary went to the office. […] Where is Mary?"), another about tracking objects that people are holding ("[…] Daniel picks up the milk. […] What is Daniel holding?"). If every person can perform every task, this gives rise to a large number of possible person-task pairs. Now suppose that during training we only observe stories from a small subset of all possible pairs. More specifically, let us assume Mary is never seen picking up or dropping any item. At test time, however, we pose questions such as "[…] Mary picks up the milk. […] What is Mary carrying?". In this case, the training and test data exhibit systematic differences. Nevertheless, a systematic model should be able to infer milk, because it has adopted a rule-like, entity-independent reasoning pattern that generalises beyond the training distribution. RNNs, however, tend to fail to learn such patterns when train and test data differ in this way.
Here we aim at improving systematic generalisation by learning to deconstruct natural language statements into combinatorial representations [BrousseCombinatorialRepresentations]. We propose a new architecture based on the Tensor Product Representation (TPR) [smolensky1990tensor], a general method for embedding symbolic structures in a vector space.
Previous work already showed that TPRs allow for powerful symbolic processing with distributed representations [smolensky1990tensor], given certain manual assignments of the vector space embedding. However, TPRs have commonly not been trained from data through gradient descent. Here we combine gradient-based RNNs with third-order TPRs to learn combinatorial representations from natural language, training the entire system on NLR tasks via error backpropagation [Linnainmaa:1970; Kelley:1960; Werbos1990BTT]. We point out similarities to systems with Fast Weights [von1981correlation; feldman1982dynamic; hinton1987deblur], in particular, end-to-end-differentiable Fast Weight systems [Schmidhuber:92ncfastweights; Schmidhuber:93ratioicann; schlag2017gated]. In experiments, we achieve state-of-the-art results on the bAbI dataset [babi_tasks_weston], obtaining better systematic generalisation than other methods. We also analyse the emerging combinatorial and, to some extent, interpretable representations. The code we used to train and evaluate our models is available at github.com/ischlag/TPR-RNN.
2 Review of the Tensor Product Representation and Notation
The TPR method is a mechanism to create a vector-space embedding of symbolic structures. To illustrate, consider the relations implicit in the short sentences "Kitty the cat" and "Mary the person". In order to store this structure in a TPR of order 2, each sentence has to be decomposed into two components by choosing a so-called filler symbol f and a role symbol r. A possible set of fillers and roles for a unique role/filler decomposition would be F = {Kitty, Mary} and R = {cat, person}. The two relations are then described by the set of filler/role bindings {Kitty/cat, Mary/person}. Let d_F and d_R denote positive integers. A distributed representation is then achieved by encoding each filler symbol by a filler vector f in a vector space V_F = ℝ^{d_F} and each role symbol by a role vector r in a vector space V_R = ℝ^{d_R}. In this work, every vector space is over the real numbers. The TPR of the symbolic structures is defined as the tensor T = Σ_i f_i ⊗ r_i in the vector space V_F ⊗ V_R, where ⊗ is the tensor product operator. In this example the tensor is of order 2, i.e. a matrix, which allows us to write the TPR of our example using matrix notation:
T = f_Kitty ⊗ r_cat + f_Mary ⊗ r_person = f_Kitty r_cat^T + f_Mary r_person^T.
Here, the tensor product, or generalised outer product, acts as a variable binding operator. The final TPR is a superposition of all bindings via element-wise addition.
In the TPR method, the so-called unbinding operator consists of the tensor inner product, which is used to exactly reconstruct previously stored fillers from T using an unbinding vector. Recall that the dot product of two vectors a and b is defined as the sum of the pairwise products of their elements. Equivalently, the tensor inner product can be expressed through the order-increasing tensor product followed by a sum over the pairwise products of the elements of two adjacent orders.
Given the unbinding vector u_cat, we can then retrieve the stored filler f_Kitty. In the simplest case, if the role vectors are orthonormal, the unbinding vector equals the corresponding role vector, i.e. u_cat = r_cat. Again, for a TPR of order 2 the unbinding operation can be expressed using matrix multiplication:
T u_cat = f_Kitty.
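To make binding and unbinding concrete, here is a minimal numpy sketch of an order-2 TPR for the example above. The vector dimensions and the choice of one-hot (hence orthonormal) role vectors are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_f, d_r = 4, 3  # filler and role dimensions (arbitrary choices)

# Filler vectors for "Mary" and "Kitty" (random stand-ins).
f_mary = rng.normal(size=d_f)
f_kitty = rng.normal(size=d_f)

# Orthonormal role vectors for "person" and "cat".
r_person = np.array([1.0, 0.0, 0.0])
r_cat = np.array([0.0, 1.0, 0.0])

# Binding via the outer product; storage via superposition (addition).
T = np.outer(f_kitty, r_cat) + np.outer(f_mary, r_person)

# Unbinding: with orthonormal roles the unbinding vector is the role
# itself, and the tensor inner product reduces to a matrix-vector product.
recovered = T @ r_cat
print(np.allclose(recovered, f_kitty))  # True: exact reconstruction
```

Because the role vectors here are orthonormal, retrieval is exact; with merely linearly independent roles, the unbinding vectors would instead be the rows of the inverse of the role matrix.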
Note how the dot product and matrix multiplication are special cases of the tensor inner product. We will later apply the tensor inner product to a tensor of order 3 (a cube) and a tensor of order 1 (a vector), which results in a tensor of order 2 (a matrix). Other aspects of the TPR method are not essential for this paper. For further details, we refer to Smolensky’s work [smolensky1990tensor; smolensky2012symbolic; basic_reasoningTPR_smolensky].
3 The TPR as a Structural Bias for Combinatorial Representations
A drawback of Smolensky’s TPR method is that the decomposition of the symbolic structures into structural elements (e.g. the filler and role sets F and R in our previous example) is not learned but externally defined. Similarly, the distributed representations f and r are assigned manually instead of being learned from data, which yielded arguments against the TPR as a connectionist theory of cognition [fodor1990connectionism].
Here we aim at overcoming these limitations by recognising the TPR as a form of Fast Weight memory which uses multi-layer perceptron (MLP) based neural networks trained end-to-end by stochastic gradient descent. Previous outer product-based Fast Weights [Schmidhuber:93ratioicann], which share strong similarities with TPRs of order 2, have been shown to be powerful associative memory mechanisms [Ba2016using; schlag2017gated]. Inspired by this capability, we use a graph interpretation of the memory where the representations of a node and an edge allow for the associative retrieval of a neighbouring node. In the context of this work, we refer to the nodes of such a graph as entities and to the edges as relations. This requires MLPs which deconstruct an input sentence into a source entity e_1, a relation r, and a target entity e_2, such that e_1 and e_2 belong to an entity vector space V_E and r to a relation vector space V_R. These representations are then bound together with the binding operator and stored as a TPR of order 3, where we interpret multiple unbindings as a form of graph traversal.
We will use a simple example to illustrate the idea. Consider the following raw input: "Mary went to the kitchen.". A possible three-way, task-specific decomposition would be e_1 = Mary, r = went-to, and e_2 = kitchen. At a later point in time, a question like "Where is Mary?" would have to be decomposed into vector representations e'_1 and r'. These have to be similar to the true unbinding vectors of e_1 and r in order to retrieve the previously stored, but possibly noisy, representation of kitchen.
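The graph-memory idea can be sketched with a third-order tensor in numpy. The entity and relation vectors below are one-hot stand-ins chosen for illustration, not the learned dense representations of our model.

```python
import numpy as np

d_e, d_r = 5, 4  # entity and relation dimensions (arbitrary)

# One-hot stand-ins for the learned vectors.
e_mary = np.eye(d_e)[0]
e_kitchen = np.eye(d_e)[1]
r_went_to = np.eye(d_r)[0]

# Bind (source, relation, target) with a three-way outer product.
F = np.einsum('i,j,k->ijk', e_mary, r_went_to, e_kitchen)

# "Where is Mary?": two successive unbindings (tensor inner products).
M = np.einsum('ijk,i->jk', F, e_mary)        # order 3 -> order 2
answer = np.einsum('jk,j->k', M, r_went_to)  # order 2 -> order 1
print(np.allclose(answer, e_kitchen))  # True
```

Several associations can be superimposed in F by addition, and chaining further unbindings corresponds to the graph traversal described above.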
We chose a graph interpretation of the memory due to its generality, as graphs can be found implicitly in the data of many problems. Another important property of a graph-inspired neural memory is the combinatorial nature of entities and relations, in the sense that any entity can be connected through any relation to any other entity. If the MLPs can disentangle entity-like information from relation-like information, the TPR provides a simple mechanism to combine them in arbitrary ways. This means that if there is enough data for the network to learn a specific entity representation such as e_Mary, then it should not require any more data or training to combine e_Mary with any of the learned relation vectors, even though such combinations have never been covered by the training data. In Section 7 we analyse a trained model and present results which indicate that it indeed seems to learn representations in line with this perspective.
4 Proposed Method
RNNs can implement algorithms which map input sequences to output sequences. A traditional RNN uses one or several tensors of order 1 (i.e. a vector usually referred to as the hidden state) to encode the information of the past sequence elements necessary to infer the correct current and future outputs. Our architecture is a non-traditional RNN encoding relevant information from the preceding sequence elements in a TPR of order 3.
At discrete time step t of the input sequence of varying length N, the previous state F_{t-1} is updated by the element-wise addition of an update representation ΔF_t:
F_t = F_{t-1} + ΔF_t.
The proposed architecture is separated into three parts: an input, an update, and an inference module. The update module produces ΔF_t, while the inference module uses F_t as parameters (Fast Weights) to compute the output of the model given a question as input. F_0 is the zero tensor.
Similar to previous work, our model also iterates over a sequence of sentences and uses an input module to learn a sentence representation from a sequence of words [NIPS2015_memory_networks]. Let the input to the architecture at time t be a sentence of k words with learned embeddings w_1, …, w_k. The sequence is then compressed into a vector representation s_t by
s_t = Σ_{i=1}^{k} p_i ⊙ w_i,
where p_1, …, p_k are learned position vectors that are shared across all input sequences and ⊙ is the Hadamard product. The vectors w_i, p_i, and s_t are in the vector space ℝ^{d_s}.
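The input module can be sketched as follows; the sentence length, embedding size, and random stand-ins for the learned vectors are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
k, d_s = 6, 8                  # words per sentence, embedding size (arbitrary)
W = rng.normal(size=(k, d_s))  # word embeddings w_1..w_k (learned in practice)
P = rng.normal(size=(k, d_s))  # position vectors p_1..p_k (shared, learned)

# s_t = sum_i p_i (Hadamard product) w_i
s = (P * W).sum(axis=0)
print(s.shape)  # (8,)
```

Unlike a plain bag of words, the position-wise Hadamard products let the representation distinguish word order within the sentence.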
The TPR update ΔF_t is defined as the element-wise sum of the tensors produced by a write, a move, and a backlink function. We abbreviate the respective tensors as ΔF_w, ΔF_m, and ΔF_b and refer to them as memory operations.
To this end, two entity representations e_1, e_2 and three relation representations r_1, r_2, r_3 are computed from the sentence representation s_t using five separate networks such that
e_i = f_{e_i}(s_t; θ_{e_i}),  r_j = f_{r_j}(s_t; θ_{r_j}),
where each f is an MLP network and θ its weights.
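As a sketch, the five networks can be written as small MLPs mapping the sentence vector to entity and relation vectors. The depth, widths, and the tanh nonlinearity are illustrative assumptions, not the exact hyper-parameters of our model.

```python
import numpy as np

rng = np.random.default_rng(2)
d_s, d_h, d_e, d_r = 8, 16, 5, 4  # illustrative sizes

def make_mlp(d_out):
    # One hidden layer; in the model the weights are trained by gradient descent.
    W1 = rng.normal(size=(d_h, d_s))
    W2 = rng.normal(size=(d_out, d_h))
    return lambda s: W2 @ np.tanh(W1 @ s)

f_e1, f_e2 = make_mlp(d_e), make_mlp(d_e)
f_r1, f_r2, f_r3 = make_mlp(d_r), make_mlp(d_r), make_mlp(d_r)

s = rng.normal(size=d_s)                # sentence representation
e1, e2 = f_e1(s), f_e2(s)               # entity representations
r1, r2, r3 = f_r1(s), f_r2(s), f_r3(s)  # relation representations
print(e1.shape, r1.shape)  # (5,) (4,)
```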
The write operation allows for the storage of a new node-edge-node association (e_1, r_1, e_2) using the tensor product e_1 ⊗ r_1 ⊗ e_2, where e_1 represents the source entity, e_2 represents the target entity, and r_1 the relation connecting them. To avoid superimposing the new association onto a possibly already existing association (e_1, r_1, w_t), the previous target entity w_t has to be retrieved and subtracted from the TPR:
w_t = F_{t-1} · (e_1 ⊗ r_1),  ΔF_w = e_1 ⊗ r_1 ⊗ (e_2 − w_t).
If no such association exists, then w_t will ideally be the zero vector.
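The retrieve-then-subtract logic of the write operation can be sketched in numpy; one-hot stand-ins for the entity and relation vectors make the retrieval exact, whereas learned dense vectors would make it approximate.

```python
import numpy as np

rng = np.random.default_rng(3)
d_e, d_r = 5, 4
e1 = np.eye(d_e)[0]  # source entity (one-hot stand-in)
r1 = np.eye(d_r)[0]  # relation (one-hot stand-in)
old_target = rng.normal(size=d_e)
new_target = rng.normal(size=d_e)

# Memory already holds the association (e1, r1, old_target).
F = np.einsum('i,j,k->ijk', e1, r1, old_target)

# Write: retrieve the previous target, then store the difference so the
# new association replaces rather than superimposes on the old one.
w = np.einsum('ijk,i,j->k', F, e1, r1)
F = F + np.einsum('i,j,k->ijk', e1, r1, new_target - w)

retrieved = np.einsum('ijk,i,j->k', F, e1, r1)
print(np.allclose(retrieved, new_target))  # True
```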
While the write operation removes the previous target entity representation w_t, the move operation allows to rewrite w_t back into the TPR with a different relation r_2. Similar to the write operation, we have to retrieve and remove the previous target entity m_t = F_{t-1} · (e_1 ⊗ r_2) that would otherwise interfere:
ΔF_m = e_1 ⊗ r_2 ⊗ (w_t − m_t).
The final operation is the backlink. It switches source and target entities and connects them with yet another relation r_3. This allows for the associative retrieval of a neighbouring entity starting from either one, but with different relations (e.g. John is left of Mary and Mary is right of John). Analogous to the operations above, the previously stored target b_t = F_{t-1} · (e_2 ⊗ r_3) is first removed:
ΔF_b = e_2 ⊗ r_3 ⊗ (e_1 − b_t).
One of our experiments requires a single prediction after the last element of an observed sequence (i.e. the last sentence). This final element is the question sentence representation. Since the inference module does not edit the TPR memory, it is sufficient to compute the prediction only when necessary; hence we drop the index t in the following equations. Similar to the update module, an entity e and a set of relations r_1, r_2, r_3 are first extracted from the current sentence representation using four different networks.
The extracted representations are used to retrieve one or several previously stored associations by providing the necessary unbinding vectors. The values of the TPR can be thought of as context-specific weights which are not trained by gradient descent but constructed incrementally during inference. They define a function that takes the entity and relations as an input. A simple illustration of this process is shown in Figure 2.
The most basic retrieval requires one source entity and one relation to extract the first target entity:
i_1 = LN(F · (e ⊗ r_1)).
We refer to this retrieval as a one-step inference and use the additional extracted relations to compute multi-step inferences:
i_2 = LN(F · (i_1 ⊗ r_2)),  i_3 = LN(F · (i_2 ⊗ r_3)).
Here LN refers to layer normalization [Ba2017LayerNorm], which includes a learned scaling and shifting scalar. As in other Fast Weight work, LN improves our training procedure, possibly by making the optimization landscape smoother [batch_norm_smoothes].
Finally, the output of our architecture consists of the sum of the three inference steps followed by a linear projection into the symbol space:
y = softmax((i_1 + i_2 + i_3) Z),
where Z is a learned projection matrix and the softmax transforms the activations into a probability distribution over all words from the vocabulary of the current task.
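The inference module's retrieval chain can be sketched as follows, with layer normalization reduced to a plain normalization helper (no learned scale and shift) and random stand-ins for the TPR memory and the extracted vectors.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Simplified LN: the learned scaling/shifting parameters are omitted.
    return (x - x.mean()) / np.sqrt(x.var() + eps)

rng = np.random.default_rng(4)
d_e, d_r = 5, 4
F = rng.normal(size=(d_e, d_r, d_e))  # TPR memory (stand-in values)
e = rng.normal(size=d_e)              # extracted entity
r1, r2, r3 = (rng.normal(size=d_r) for _ in range(3))

i1 = layer_norm(np.einsum('ijk,i,j->k', F, e, r1))   # one-step inference
i2 = layer_norm(np.einsum('ijk,i,j->k', F, i1, r2))  # two-step inference
i3 = layer_norm(np.einsum('ijk,i,j->k', F, i2, r3))  # three-step inference

out = i1 + i2 + i3  # followed by a learned projection and softmax (omitted)
print(out.shape)  # (5,)
```

Note how each step re-uses the previous retrieval as the new unbinding entity, which corresponds to traversing the stored graph one edge at a time.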
5 Related Work
To our knowledge, our system is the first with a TPR of order 3 trained on raw data by backpropagation [Linnainmaa:1970; Kelley:1960; Werbos1990BTT]. However, previous work used TPRs of order 2 for simpler associations in the context of image-caption generation [TPGN_captioning], question answering [TPRN_squad], and general NLP problems [huang2018attentive], with a gradient-based optimizer similar to ours.
On the other hand, the central operation of an order 2 TPR is the outer product of two vectors. This relates to many previous ideas, most notably Hebbian learning [Hebb:49], which partially inspired differentiable, outer product-based Fast Weight architectures [Schmidhuber:93ratioicann] that learn context-dependent weight changes through error backpropagation; compare even earlier work on differentiable Fast Weights [Schmidhuber:91fastweights]. Variations of such outer product-based Fast Weights were able to generalise in a variety of small but complex sequence problems where standard RNNs tend to perform poorly [Ba2016using; schlag2017gated; miconi2018differentiable].
RNNs are popular choices for modelling natural language. Despite ongoing research in RNN architectures, the good old LSTM [lstm97and95] has been shown to outperform more recent variants on standard language modelling datasets [lstm_still_sota_melis]. However, such networks do not perform well on NLR tasks such as question answering [babi_tasks_weston]. Recent progress came through the addition of memory and attention components to RNNs. In the context of question answering, a popular line of research is memory networks [memory_networks_weston; dynamic_memory_networks_kumar; weakly_memory_networks_sukhbaatar; gated_endtoend_memory_networks_Liu; dynamic_memory_networks_2_xiong]. But due to the difficulty of interpreting their internal representations, it remains unclear whether mistakes of trained models arise from imperfect logical reasoning, imperfect knowledge representation, or insufficient data [dupoux_beyond_toytasks].
Some early memory-augmented RNNs focused primarily on improving the ratio of the number of trainable parameters to memory size [Schmidhuber:93ratioicann]. The Neural Turing Machine [Graves2014NTM] was among the first models with an attention mechanism over external memory that outperformed standard LSTM on tasks such as copying and sorting. The Differentiable Neural Computer (DNC) further refined this approach [graves2016; sparse_dnc_rae], yielding strong performance also on question-answering problems.
6 Experiments
We evaluate our architecture on the bAbI tasks, a set of 20 different synthetic question-answering tasks designed to evaluate NLR systems such as intelligent dialogue agents [babi_tasks_weston]. Every task addresses a different form of reasoning. Each sample consists of a story (a sequence of sentences) followed by a question sentence with a single-word answer. We used the train/validation/test split as introduced in v1.2 for the 10k-samples version of the dataset. We ignored the provided supporting facts that simplify the problem by pointing out sentences relevant to the question. We only show story sentences once and before the query sentence, with no additional supervision signal apart from the prediction error.
We experiment with two models. The single-task model is only trained and tested on the data from one task but uses the same computational graph and hyper-parameters for all. The all-tasks model is a scaled up version trained and tested on all tasks simultaneously, using only the default hyper-parameters. More details such as specific hyper-parameters can be found in Appendix A.
In Tables 1 and 2 we compare our model to various state-of-the-art models from the literature. We also report best-run results for a fairer comparison with earlier work that did not provide statistics over multiple runs. Our system outperforms the state-of-the-art in both settings. We also seem to outperform the DNC in convergence speed, as shown in Figure 3.
|Task|REN [recurrent_entity_networks_henaff]|DNC [graves2016]|SDNC [sparse_dnc_rae]|TPR-RNN (ours)|
|---|---|---|---|---|
|Avg Error|9.7 ± 2.6|12.8 ± 4.7|6.4 ± 2.5|1.34 ± 0.52|
|Failure (>5%)|5 ± 1.2|8.2 ± 2.5|4.1 ± 1.6|0.86 ± 1.11|

Mean and variance of the test error for the all-tasks setting. We perform early stopping according to the validation set. Our statistics are generated from 10 runs.
|Task|LSTM [weakly_memory_networks_sukhbaatar]|N2N [weakly_memory_networks_sukhbaatar]|DMN+ [dynamic_memory_networks_2_xiong]|REN [recurrent_entity_networks_henaff]|TPR-RNN (ours, best)|TPR-RNN (ours, mean)|
|---|---|---|---|---|---|---|
|Avg Error|36.4|4.2|2.8|0.5|0.17|1.12 ± 1.19|
|Failure (>5%)|16|3|1|0|0|0.4 ± 0.55|

Test error for the single-task setting; for our model we report the best run as well as the mean over multiple runs.
We ran ablation experiments on every task to assess the necessity of the three memory operations. The experimental results in Table 3 indicate that a majority of the tasks can be solved by the write operation alone. This is surprising at first, because for some of those tasks the idealised symbolic operations one might design require more complex steps than the write operation allows for.
|Operations|Failed tasks (err > 5%)|
|---|---|
|write|3, 6, 9, 10, 12, 13, 17, 19|
|write + move|9, 10, 13, 17|
However, the optimizer seems to be able to find representations that overcome the limitations of the architecture. That said, more complex tasks do benefit from the additional operations without affecting the performance on simpler tasks.
7 Analysis of the Learned Representations
Here we analyse the representations produced by the MLPs of the update module. We collect the set of unique sentences across all stories from the validation set of a task, compute their respective entity and relation representations e_1, e_2, r_1, r_2, and r_3, and compare the representations of all sentence pairs by their pairwise similarity.
In Figure 4 we show such similarity matrices for a model trained on task 3. The matrix based on e_1 shows 4 distinct clusters, which indicates that the learned representations are almost perfectly orthogonal. By comparing the sentences from different clusters it becomes apparent that they represent the four entities, independent of other factors. Note that the dimensionality of this vector space is 15, which seems larger than necessary for this task.
In the case of the relation representations, we observe that sentences group into three, albeit less distinct, clusters. In this task, the structure of the data implies three important events for any entity: moving to a location, binding to an object, and unbinding from a previously bound object; all three are expressed by a variety of possible words and phrases. By comparing sentences from different clusters, we can clearly associate them with the three general types of events.
We observed clusters of similar distinctness in all tasks, often with a semantic meaning that becomes apparent when we compare sentences that belong to different clusters. We also noticed that even though there are often clean clusters, they are not always perfectly combinatorial: e.g., for the target entity representation e_2, as seen in Figure 4, we found two almost orthogonal clusters for a pair of target entity symbols.
We conduct an additional experiment to empirically analyse the model’s capability to generalise in a systematic way [fodor1988connectionism; hadley1994systematicity]. For this purpose, we join together all tasks which use the same four entity names and in which at least one entity appears in the question (i.e. tasks 1, 6, 7, 8, 9, 11, 12, 13). We then augment this data with five new entities such that the train and test data exhibit systematic differences. The stories for a new entity are generated by randomly sampling 500 story/question pairs from a task such that in 20% of the generated stories the new entity is also contained in the question. We then add generated stories from all possible 40 combinations of new entities and tasks to the test set. To the training set, however, we only add stories from a subset of all tasks.
More specifically, the new entities are Alex, Glenn, Jordan, Mike, and Logan, for which we generate training set stories from differently sized subsets of the tasks. We summarize the results in Figure 5 by averaging over tasks. After the network has been trained, we find that our model achieves high accuracy on entity/task pairs on which it has not been trained. This indicates its systematic generalisation capability due to the disentanglement of entities and relations.
Our analysis and the additional experiment indicate that the model seems to learn combinatorial representations resulting in interpretable distributed representations and data efficiency due to rule-like generalisation.
8 Discussion
To compute the correct gradients, an RNN with external memory trained by backpropagation through time must store the values of all temporary variables at every time step of a sequence. Since outer product-based Fast Weights [Schmidhuber:93ratioicann; schlag2017gated] and our TPR system have many more time-varying variables per learnable parameter than a classic RNN such as LSTM, this makes them less scalable in terms of memory requirements. The problem can be overcome through RTRL [WilliamsZipser:92; RobinsonFallside:87tr], but only at the expense of greater time complexity. Nevertheless, our results illustrate how the advantages of TPRs can outweigh such disadvantages for problems of a combinatorial nature.
One difficulty of our Fast Weight-like memory is the well-known vanishing gradient problem [Hochreiter:91]. Due to the multiplicative interaction of Fast Weights with RNN activations, forward and backward propagation can be unstable and result in vanishing or exploding activations and error signals. A similar effect can occur in the forward pass if the values of the activations are not bounded by some activation function. Nevertheless, in our experiments we abandoned bounded TPR values, as they significantly slowed down learning with little benefit. Although our current sub-optimal initialization may occasionally lead to exploding activations and NaN values within the first few iterations of gradient descent, we did not observe any extreme cases after a few dozen successful steps, and therefore simply reinitialize the model in such cases.
A direct comparison with the DNC is somewhat inconclusive for the following reasons. Our architecture uses a sentence encoding layer similar to how many memory networks encode their input. This slightly simplifies the problem, since the network does not have to learn which words belong to the same sentence. Most memory networks also iterate over sentence representations; this is less general than iterating over the word level, as the DNC does, which in turn is less general than iterating over the character level. In preliminary experiments, a word-level variation of our architecture solved many tasks, but it may require non-trivial changes to solve all of them.
9 Conclusion
Our novel RNN-TPR combination learns to decompose natural language sentences into combinatorial components useful for reasoning. It outperforms previous models on the bAbI tasks through attentional control of memory. Our approach is related to Fast Weight architectures, another way of increasing the memory capacity of RNNs. An analysis of a trained model suggests straightforward interpretability of the learned representations. Our model generalises better than a previous state-of-the-art model when there are strong systematic differences between training and test data.
We thank Paulo Rauber, Klaus Greff, and Filipe Mutz for helpful comments and helping hands. We are also grateful to NVIDIA Corporation for donating a DGX-1 as part of the Pioneers of AI Research Award and to IBM for donating a Minsky machine. This research was supported by a European Research Council Advanced Grant (no: 742870).
References
-  P. J. Werbos. Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1, 1988.
-  R. J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. In Back-propagation: Theory, Architectures and Applications. Hillsdale, NJ: Erlbaum, 1994.
-  A. J. Robinson and F. Fallside. The utility driven dynamic error propagation network. Technical Report CUED/F-INFENG/TR.1, Cambridge University Engineering Department, 1987.
-  S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997. Based on TR FKI-207-95, TUM (1995).
-  F. A. Gers, J. Schmidhuber, and F. Cummins. Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10):2451–2471, 2000.
-  H. T. Siegelmann and E. D. Sontag. Turing computability with neural nets. Applied Mathematics Letters, 4(6):77–80, 1991.
-  Hasim Sak, Andrew Senior, Kanishka Rao, Françoise Beaufays, and Johan Schalkwyk. Google voice search: faster and more accurate. Google Research Blog, 2015, http://googleresearch.blogspot.ch/2015/09/google-voice-search-faster-and-more.html.
-  Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google’s neural machine translation system: Bridging the gap between human and machine translation. Preprint arXiv:1609.08144, 2016.
-  J. M. Pino, A. Sidorov, and N. F. Ayan. Transitioning entirely to neural machine translation. Facebook Research Blog, 2017, https://code.facebook.com/posts/289921871474277/transitioning-entirely-to-neural-machine-translation/.
-  Jerry A Fodor and Zenon W Pylyshyn. Connectionism and cognitive architecture: A critical analysis. Cognition, 28(1-2):3–71, 1988.
-  Robert F Hadley. Systematicity in connectionist language learning. Mind & Language, 9(3):247–272, 1994.
-  Brenden M. Lake and Marco Baroni. Still not systematic after all these years: On the compositional skills of sequence-to-sequence recurrent networks. CoRR, abs/1711.00350, 2017.
-  Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.
-  Yuval Atzmon, Jonathan Berant, Vahid Kezami, Amir Globerson, and Gal Chechik. Learning to generalize to new compositions in image understanding. arXiv preprint arXiv:1608.07639, 2016.
-  Steven Andrew Phillips. Connectionism and the problem of systematicity. PhD thesis, University of Queensland, 1995.
-  Jason Weston, Antoine Bordes, Sumit Chopra, and Tomas Mikolov. Towards ai-complete question answering: A set of prerequisite toy tasks. CoRR, abs/1502.05698, 2015.
-  Olivier J. Brousse. Generativity and Systematicity in Neural Network Combinatorial Learning. PhD thesis, Boulder, CO, USA, 1992. UMI Order No. GAX92-20396.
-  Paul Smolensky. Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial intelligence, 46(1-2):159–216, 1990.
-  S. Linnainmaa. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master’s thesis, Univ. Helsinki, 1970.
-  H. J. Kelley. Gradient theory of optimal flight paths. ARS Journal, 30(10):947–954, 1960.
-  Paul J. Werbos. Backpropagation through time: What it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.
-  C. von der Malsburg. The correlation theory of brain function. Internal report, Department of Neurobiology, Max-Planck-Institute for Biophysical Chemistry, 1981.
-  Jerome A Feldman. Dynamic connections in neural networks. Biological cybernetics, 46(1):27–39, 1982.
-  Geoffrey E Hinton and David C Plaut. Using fast weights to deblur old memories. In Proceedings of the ninth annual conference of the Cognitive Science Society, pages 177–186, 1987.
-  J. Schmidhuber. Learning to control fast-weight memories: An alternative to recurrent nets. Neural Computation, 4(1):131–139, 1992.
-  J. Schmidhuber. On decreasing the ratio between learning complexity and number of time-varying variables in fully recurrent nets. In Proceedings of the International Conference on Artificial Neural Networks, Amsterdam, pages 460–463. Springer, 1993.
-  Imanol Schlag and Jürgen Schmidhuber. Gated fast weights for on-the-fly neural program generation. In NIPS Metalearning Workshop, 2017.
-  Paul Smolensky. Symbolic functions from neural computation. Phil. Trans. R. Soc. A, 370(1971):3543–3569, 2012.
-  Paul Smolensky, Moontae Lee, Xiaodong He, Wen-tau Yih, Jianfeng Gao, and Li Deng. Basic reasoning with tensor product representations. CoRR, abs/1601.02745, 2016.
-  Jerry Fodor and Brian P McLaughlin. Connectionism and the problem of systematicity: Why smolensky’s solution doesn’t work. Cognition, 35(2):183–204, 1990.
-  Jimmy Ba, Geoffrey E Hinton, Volodymyr Mnih, Joel Z Leibo, and Catalin Ionescu. Using fast weights to attend to the recent past. In Advances In Neural Information Processing Systems, pages 4331–4339, 2016.
-  Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-to-end memory networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2440–2448. Curran Associates, Inc., 2015.
-  Lei Jimmy Ba, Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. CoRR, abs/1607.06450, 2016.
-  S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry. How does batch normalization help optimization? (No, it is not about internal covariate shift). ArXiv e-prints, May 2018.
-  Qiuyuan Huang, Paul Smolensky, Xiaodong He, Li Deng, and Dapeng Oliver Wu. Tensor product generation networks. CoRR, abs/1709.09118, 2017.
-  Hamid Palangi, Paul Smolensky, Xiaodong He, and Li Deng. Deep learning of grammatically-interpretable representations through question-answering. CoRR, abs/1705.08432, 2017.
-  Qiuyuan Huang, Li Deng, Dapeng Wu, Chang Liu, and Xiaodong He. Attentive tensor product learning for language generation and grammar parsing. arXiv preprint arXiv:1802.07089, 2018.
-  D. O. Hebb. The Organization of Behavior. Wiley, New York, 1949.
-  J. Schmidhuber. Learning to control fast-weight memories: An alternative to recurrent nets. Technical Report FKI-147-91, Institut für Informatik, Technische Universität München, March 1991.
-  Thomas Miconi, Jeff Clune, and Kenneth O Stanley. Differentiable plasticity: training plastic neural networks with backpropagation. arXiv preprint arXiv:1804.02464, 2018.
-  Gábor Melis, Chris Dyer, and Phil Blunsom. On the state of the art of evaluation in neural language models. CoRR, abs/1707.05589, 2017.
-  Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. CoRR, abs/1410.3916, 2014.
-  Ankit Kumar, Ozan Irsoy, Jonathan Su, James Bradbury, Robert English, Brian Pierce, Peter Ondruska, Ishaan Gulrajani, and Richard Socher. Ask me anything: Dynamic memory networks for natural language processing. CoRR, abs/1506.07285, 2015.
-  Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. Weakly supervised memory networks. CoRR, abs/1503.08895, 2015.
-  Julien Perez and Fei Liu. Gated end-to-end memory networks. CoRR, abs/1610.04211, 2016.
-  Caiming Xiong, Stephen Merity, and Richard Socher. Dynamic memory networks for visual and textual question answering. CoRR, abs/1603.01417, 2016.
-  Emmanuel Dupoux. Deconstructing AI-complete question-answering: going beyond toy tasks, 2015.
-  Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. CoRR, abs/1410.5401, 2014.
-  Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwinska, Sergio Gomez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, Adrià Puigdomènech Badia, Karl Moritz Hermann, Yori Zwols, Georg Ostrovski, Adam Cain, Helen King, Christopher Summerfield, Phil Blunsom, Koray Kavukcuoglu, and Demis Hassabis. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471–476, 2016.
-  Jack W. Rae, Jonathan J. Hunt, Tim Harley, Ivo Danihelka, Andrew W. Senior, Greg Wayne, Alex Graves, and Timothy P. Lillicrap. Scaling memory-augmented neural networks with sparse reads and writes. CoRR, abs/1610.09027, 2016.
-  Mikael Henaff, Jason Weston, Arthur Szlam, Antoine Bordes, and Yann LeCun. Tracking the world state with recurrent entity networks. In International Conference on Learning Representations (ICLR2017). Preprint arXiv:1612.03969.
-  S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München, 1991. Advisor: J. Schmidhuber.
-  Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Yee Whye Teh and Mike Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 249–256, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010. PMLR.
-  Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
-  Timothy Dozat. Incorporating Nesterov momentum into Adam. In International Conference on Learning Representations (ICLR2016). CBLS, April 2016. OpenReview.net ID: OM0jvwB8jIp57ZJjtNEZ.
Appendix A Experimental Details
We encode the valid words for a task as one-hot vectors; the dimensionality of the vector space is equal to the size of the vocabulary. Each MLP that produces the entity and relation representations from a sentence representation consists of two layers, where each layer is an affine transformation followed by the hyperbolic tangent nonlinearity. The hidden layers of the MLPs are the intermediate activations and are vectors from the vector space .
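As a minimal sketch (not the paper's implementation), such a two-layer tanh MLP can be written in plain Python; the layer sizes below are hypothetical, since the actual dimensionalities depend on the task vocabulary and model configuration:

```python
import math
import random

def affine_tanh(x, W, b):
    """One layer: affine transformation followed by the tanh nonlinearity."""
    return [math.tanh(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
            for row, b_i in zip(W, b)]

def two_layer_mlp(x, W1, b1, W2, b2):
    """Two stacked affine+tanh layers, as used for entity/relation representations."""
    return affine_tanh(affine_tanh(x, W1, b1), W2, b2)

# Hypothetical sizes for illustration only.
rng = random.Random(0)
d_in, d_hid, d_out = 8, 16, 4
W1 = [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_hid)]
b1 = [0.0] * d_hid
W2 = [[rng.uniform(-0.1, 0.1) for _ in range(d_hid)] for _ in range(d_out)]
b2 = [0.0] * d_out

sentence_repr = [rng.uniform(-1, 1) for _ in range(d_in)]
entity_repr = two_layer_mlp(sentence_repr, W1, b1, W2, b2)
```

Because tanh is applied after each affine map, every component of the output lies strictly inside (-1, 1).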
We initialize the word embeddings with a uniform distribution from  to  and apply the Glorot initialization scheme  for all other weights, except for the position vectors, which are initialized as a vector of ones divided by , the number of position vectors. We implement the model in the TensorFlow framework  and compute the gradients through its automatic differentiation engine, based on Linnainmaa’s automatic differentiation or backpropagation scheme . We pad shorter sentences with the padding symbol to achieve a uniform sentence length but keep the story length dynamic, as in previous work.
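A sketch of these initialization and padding steps, under the standard formulation of the Glorot uniform scheme (limit = sqrt(6 / (fan_in + fan_out))); the sizes and the `<pad>` token name are illustrative assumptions:

```python
import math
import random

def glorot_uniform(fan_in, fan_out, rng):
    """Glorot/Xavier uniform initialization: U(-limit, limit)."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]

def position_vectors(n, dim):
    """Each of the n position vectors starts as a vector of ones divided by n."""
    return [[1.0 / n] * dim for _ in range(n)]

def pad(tokens, length, pad_symbol="<pad>"):
    """Pad a shorter sentence to a uniform length with the padding symbol."""
    return tokens + [pad_symbol] * (length - len(tokens))

rng = random.Random(0)
W = glorot_uniform(8, 4, rng)
P = position_vectors(5, 4)
s = pad(["mary", "went", "to", "the", "office", "."], 8)
```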
To deal with possibly unstable initializations, we incorporate a warm-up phase in which we train the network for 50 steps with  of the learning rate. If NaN values occur during this warm-up phase, we reinitialize the network from scratch. After a successful warm-up phase, we never encountered any further instabilities.
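The restart logic can be sketched as follows; the warm-up learning-rate fraction is a hypothetical placeholder (the actual fraction is omitted above), and `init_fn`/`step_fn` are stand-ins for the real initialization and training step:

```python
import math
import random

WARMUP_STEPS = 50
WARMUP_LR_FRACTION = 0.1  # hypothetical value; the paper's fraction is not reproduced here

def train(init_fn, step_fn, base_lr, total_steps):
    """Train with a reduced learning rate for the first WARMUP_STEPS steps;
    restart from a fresh initialization if the warm-up produces NaN losses."""
    while True:
        params = init_fn()
        restarted = False
        for step in range(total_steps):
            lr = base_lr * WARMUP_LR_FRACTION if step < WARMUP_STEPS else base_lr
            params, loss = step_fn(params, lr)
            if step < WARMUP_STEPS and math.isnan(loss):
                restarted = True  # unstable initialization: reinitialize from scratch
                break
        if not restarted:
            return params

# Toy demonstration: an "unstable" initialization yields NaN losses and is retried.
rng = random.Random(0)

def init_fn():
    return {"w": rng.uniform(-1, 1)}

def step_fn(params, lr):
    loss = float("nan") if abs(params["w"]) > 0.9 else abs(params["w"])
    return params, loss

final = train(init_fn, step_fn, base_lr=0.01, total_steps=60)
```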
We optimize the neural networks using the Nadam optimizer , which in our experiments consistently outperformed the others in convergence speed, though not necessarily in final performance. Finally, we multiply the learning rate by a factor of  once it has reached a validation set loss below .
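The one-shot learning-rate decay can be sketched as below; the decay factor and loss threshold are hypothetical placeholders, since the actual values are omitted above:

```python
# Hypothetical values for illustration; the paper's factor and threshold are not reproduced.
DECAY_FACTOR = 0.1
LOSS_THRESHOLD = 0.05

def decay_once(lr, val_loss, already_decayed):
    """Multiply the learning rate by a fixed factor the first time the
    validation loss drops below the threshold; leave it unchanged afterwards."""
    if not already_decayed and val_loss < LOSS_THRESHOLD:
        return lr * DECAY_FACTOR, True
    return lr, already_decayed

lr, decayed = 0.001, False
for val_loss in [0.4, 0.1, 0.04, 0.03]:  # decay triggers at 0.04, fires only once
    lr, decayed = decay_once(lr, val_loss, decayed)
```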
The Single-Task Model
For the single-task model , , and . Note that  depends on the vocabulary size of each individual task. We achieved the results in Table 2 using the hyper-parameters , , , and a batch size of 128. These hyper-parameters were optimized using an evolutionary procedure with small random perturbations; the main effect is improved convergence speed. With the exception of a few tasks that were sensitive to the momentum parameter, similar final performance can be achieved with the default hyper-parameters.
The All-Tasks Model
In the all-tasks setting we train one model on all tasks simultaneously. We increase the size of the model to , , and , and train with a batch size of 32. In this case we used the default hyper-parameters , , .
Appendix B Detailed All-Tasks Training Runs
[Table: errors of the individual all-tasks training runs. The aggregate row "all (0)" lists per-run values 1.50, 1.69, 1.13, 1.04, 0.78, 0.96, 1.20, 2.40, 0.78, with mean 1.34 ± 0.52.]
Appendix C Further Similarity Matrices
[Figure: similarity matrix of learned sentence representations for one story; the 49 axis labels are the story's sentences, grouped by pattern, e.g. object manipulation ("mary picked up the milk there ."), movement ("sandra journeyed to the kitchen ."), and attribute statements ("brian is green .").]
[Figure: similarity matrix for a second story; the 38 axis labels include movements ("daniel moved to the office ."), object manipulation ("john grabbed the football there ."), and compass-direction statements ("the bathroom is south of the hallway .").]
[Figure: similarity matrix of question representations; the 27 axis labels span yes/no questions ("is mary in the kitchen ?"), location questions ("where is daniel ?"), carrying questions ("what is john carrying ?"), color questions ("what color is greg ?"), and path questions ("how do you go from the bedroom to the office ?").]