
Limitations in learning an interpreted language with recurrent models

by   Denis Paperno, et al.

In this submission I report work in progress on learning simplified interpreted languages by means of recurrent models. The data is constructed to reflect core properties of natural language as modeled in formal syntax and semantics: recursive syntactic structure and compositionality. Preliminary results suggest that LSTM networks do generalise to compositional interpretation, albeit only in the most favorable learning setting, with a well-paced curriculum, extensive training data, and left-to-right (but not right-to-left) composition.



1 Motivation

Despite showing impressive performance on certain tasks, neural networks are still far from showing natural language understanding at a human level, cf. Paperno et al. (2016). In a sense, it is not even clear what kind of neural architecture is capable of learning natural language semantics in all its complexity, with recurrent and convolutional models being currently tried on various tasks.

One can hope to make progress towards the challenging goal of natural language understanding by taking into account what is known about language structure and language processing in humans. With this in mind, it is possible to formulate certain preliminary desiderata for an adequate natural language understanding model.

First, language processing in humans is known to be sequential; people process and interpret linguistic input on the fly, without any lookahead and without waiting for the linguistic structure to be completed. This property, which has serious potential consequences for the cognitive architecture (Christiansen and Chater, 2016), gives a certain degree of cognitive plausibility to unidirectional recurrent models compared to other neural architectures, at least in their current implementations.

Second, natural language can exploit recursive structures: natural language syntax consists of constructions, represented in formal grammars as rewrite rules, which can recursively embed other constructions of the same kind. For example, noun phrases can in principle consist of a single proper noun (e.g. Ann) but can also, among other possibilities, be built from other noun phrases recursively via the possessive construction, as in Ann’s child, Ann’s child’s friend, Ann’s child’s friend’s parent etc. The possessive construction can be described by the rewrite rule NP → NP ’s N.

Third, the recursive syntactic structure drives compositional semantic interpretation. The meaning of the noun phrase Ann’s child’s friend is not merely the sum of the meanings of the individual words (in which case it would have been semantically equivalent to Ann’s friend’s child). Rather, to interpret a complex expression correctly, one has to follow the syntactic structure, first identifying the meaning of the smaller constituent (Ann’s child), and then computing the meaning of the whole on its basis.

Fourth, semantic compositionality can be formalized as function application, with one constituent in a complex structure corresponding to an argument of a function that another constituent encodes. For instance, in Ann’s child, we can think of Ann as denoting an individual and child as denoting a function from individuals to individuals. In formal semantics, function argument application as a semantic compositionality mechanism extends to a wide range of syntactic constructions.

Finally, natural language interpretation, while being sensitive to syntactic structure, is robust to syntactic variation. For example, humans are equally capable of learning to interpret and use left-branching structures such as Ann’s child and right-branching structures such as the child of Ann.

2 The task

To summarize, in order to mimic human language capacities an artificial system has to be able to learn interpreted languages with compositionally interpreted recursive structures, while being adaptive to surface variation in the syntactic patterns. To test whether neural systems can fit the bill, we define toy interpreted languages based on a fragment of English. The vocabulary includes four names (Ann, Bill, Dick, George), interpreted as individual identifiers, four function-denoting nouns (child, parent, friend, enemy), and grammatical elements (of, ’s, the). Our languages contain either left-branching (NP → NP ’s N, Ann’s child) or right-branching structures (the child of Ann, NP → the N of NP).

child    Ann    Bill    Dick   George
parent   Bill   George  Ann    Dick
Table 1: Parent relation in a toy universe.

friend   Ann    Bill    Dick   George
friend   Dick   George  Ann    Bill
Table 2: Friend relation in a toy universe.

enemy    Ann    Bill    Dick   George
enemy    George Dick    Bill   Ann
Table 3: Enemy relation in a toy universe.

The interpretation is defined model-theoretically. We randomly generate a model where each proper name corresponds to a distinct individual and each function denoted by a common noun is total. An example interpretation of function elements is given in Tables 1, 2, and 3. In such a model, each well-formed expression of the language is interpreted as an individual identifier. The denotation of any expression can be calculated by recursive application of functions to arguments, guided by the syntactic structure of the expression.
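A model of this kind can be sketched in a few lines. The snippet below is only an illustration, not the code used in the experiments: the name `random_model` is hypothetical, and drawing each function value independently and uniformly is an assumption, since the text only requires that every function be total.

```python
import random

NAMES = ["Ann", "Bill", "Dick", "George"]
NOUNS = ["child", "parent", "friend", "enemy"]

def random_model(seed=None):
    """Return a dict mapping each noun to a total function over NAMES.

    Assumption: values are drawn independently and uniformly at random;
    the paper only states that each function is total.
    """
    rng = random.Random(seed)
    return {noun: {x: rng.choice(NAMES) for x in NAMES} for noun in NOUNS}

model = random_model(seed=0)
```

Every noun is defined for every individual, so any well-formed expression receives a denotation in the resulting model.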

The task given to the neural systems is to identify the individual that corresponds to each expression; e.g. Ann’s child’s enemy is the same person as Bill. Since there is just a finite number of individuals in any given model, the task formally boils down to string classification, assigning each expression to one of the set of individuals in the model.
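For concreteness, the target function that the networks must approximate can be written as a short recursive evaluator. This is a minimal sketch under stated assumptions: `MODEL` and `interpret` are hypothetical names, the relations are copied from the example in Tables 1 to 3, and `child` is read off Table 1 as the inverse of `parent`.

```python
# Example relations from Tables 1-3 (child is the inverse of parent in Table 1).
MODEL = {
    "parent": {"Ann": "Bill", "Bill": "George", "Dick": "Ann", "George": "Dick"},
    "child":  {"Bill": "Ann", "George": "Bill", "Ann": "Dick", "Dick": "George"},
    "friend": {"Ann": "Dick", "Bill": "George", "Dick": "Ann", "George": "Bill"},
    "enemy":  {"Ann": "George", "Bill": "Dick", "Dick": "Bill", "George": "Ann"},
}

def interpret(expression, model=MODEL):
    """Return the individual denoted by a left- or right-branching expression."""
    # Normalize curly apostrophes and split the possessive marker off the noun.
    tokens = expression.replace("’", "'").replace("'s", " 's").split()
    if tokens[0] == "the":
        # Right-branching: "the N of NP". Interpret the embedded NP first.
        if len(tokens) < 4 or tokens[2] != "of":
            raise ValueError(f"ill-formed expression: {expression!r}")
        inner = interpret(" ".join(tokens[3:]), model)
        return model[tokens[1]][inner]
    # Left-branching: start from the name and apply each noun in order.
    referent = tokens[0]
    for noun in tokens[2::2]:
        referent = model[noun][referent]
    return referent

print(interpret("Ann's child's enemy"))           # Bill
print(interpret("the enemy of the child of Ann")) # Bill
```

Under this model, Ann’s child’s friend and Ann’s friend’s child denote different individuals, which illustrates why interpretation has to follow the syntactic structure rather than the bag of words.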

3 Systems and data

We tested the learning capacities of two standard recurrent systems on our task: a vanilla recurrent neural network (RNN; Elman, 1991) and a long short-term memory network (LSTM; Hochreiter and Schmidhuber, 1997). Both systems were implemented in PyTorch and used hidden layers of 256 units. The RNN was trained with stochastic gradient descent and the LSTM with the Adam optimizer. Models were trained for 100 epochs or until no improvement on the validation set was observed for 22 epochs.
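The stopping rule can be sketched independently of the network itself. In the snippet below, `run_epoch` is a hypothetical callable that trains for one epoch and returns validation accuracy; the epoch budget and patience follow the values given above, while everything else is an assumption made for illustration.

```python
def train_with_early_stopping(run_epoch, max_epochs=100, patience=22):
    """Train until max_epochs or until `patience` epochs pass without improvement.

    run_epoch(epoch) is a hypothetical callable returning validation accuracy.
    Returns (best_accuracy, epochs_run).
    """
    best, since_best = float("-inf"), 0
    for epoch in range(1, max_epochs + 1):
        accuracy = run_epoch(epoch)
        if accuracy > best:
            best, since_best = accuracy, 0   # improvement: reset the counter
        else:
            since_best += 1
            if since_best >= patience:       # no improvement for `patience` epochs
                return best, epoch
    return best, max_epochs
```

With a validation curve that improves for five epochs and then stays flat, this loop stops at epoch 27, that is, 22 epochs after the last improvement.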

A model with four individuals and four randomly assigned functions was generated for each run. We used all expressions of the language up to complexity n as experimental data; development and testing data were randomly selected among examples of maximal complexity. Examples of smaller complexity, i.e. 1 and 2, were always included in the training partition since they are necessary to learn the interpretation of lexical items. For example, the simplest set of training data (up to complexity 3) contained all names (examples of complexity 1), required to learn the individuals; all expressions with 2 content words like Ann’s child, necessary for learning the meanings of functional words like child; and a random subset of three content word expressions like Ann’s child’s friend, which might help guide the systems to learning recursion.
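The data construction can be illustrated as follows. This is a sketch rather than the original pipeline: the function names are hypothetical, complexity is taken to be the number of content words, and the 80% training share and the even dev/test division of the held-out examples are assumptions that the text does not specify. Straight apostrophes are used in the generated strings.

```python
import itertools
import random

NAMES = ["Ann", "Bill", "Dick", "George"]
NOUNS = ["child", "parent", "friend", "enemy"]

def expressions(complexity, branching="left"):
    """All expressions with exactly `complexity` content words."""
    result = []
    for name in NAMES:
        for nouns in itertools.product(NOUNS, repeat=complexity - 1):
            if branching == "left":
                result.append("'s ".join((name,) + nouns))
            else:
                # The last function applied is the first one pronounced.
                prefix = "".join(f"the {n} of " for n in reversed(nouns))
                result.append(prefix + name)
    return result

def split(max_complexity, train_share=0.8, seed=0):
    """Return (train, dev, test); only maximal-complexity items are held out."""
    rng = random.Random(seed)
    train = [e for k in range(1, max_complexity) for e in expressions(k)]
    hardest = expressions(max_complexity)
    rng.shuffle(hardest)
    cut = int(train_share * len(hardest))
    held_out = hardest[cut:]
    mid = len(held_out) // 2
    return train + hardest[:cut], held_out[:mid], held_out[mid:]

train, dev, test = split(3)
```

With four names and four nouns there are 4^k expressions of complexity k, so the complexity-3 setting has 4 + 16 + 64 = 84 expressions in total.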

We also set a curriculum whereby the system was at first given training examples of minimal complexity, with more complex examples added gradually in the process of training. Practically, we added examples of the next complexity level after every ten epochs. Tweaking the curriculum, as we found, affected the generalization of the model considerably: for successful learning, complexity of examples must grow neither too slowly nor too fast. To illustrate this, we also report below, for comparison, the results of training the models without a curriculum, whereby training examples of all complexity levels were available to the models at all epochs, as well as accuracies for a slower curriculum setup.
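The schedule described here amounts to a simple filter on the training set. The sketch below assumes a hypothetical `train_by_level` dictionary mapping each complexity level to its training examples; the `step` of ten epochs follows the text, whereas the pace of the slower curriculum is not stated and would correspond to some larger `step`.

```python
def curriculum(train_by_level, epoch, step=10):
    """Training examples visible at a 1-based `epoch`.

    train_by_level: hypothetical dict {complexity: [examples]}.
    A new complexity level is unlocked after every `step` epochs.
    """
    limit = min(max(train_by_level), 1 + (epoch - 1) // step)
    return [ex
            for level in sorted(train_by_level) if level <= limit
            for ex in train_by_level[level]]

data = {1: ["Ann"], 2: ["Ann's child"], 3: ["Ann's child's friend"]}
print(curriculum(data, epoch=1))   # ['Ann']
print(curriculum(data, epoch=11))  # ['Ann', "Ann's child"]
```

Training without a curriculum corresponds to ignoring this filter and exposing all levels from the first epoch.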

4 Results

Memorize or generalize? This question summarizes the common dichotomy in the analysis of learning systems’ performance. To solve our task, the successful model has to do both. To treat complex expressions, the model needs to generalize by learning to compose the representations of simple expressions recursively. But representations of simple expressions (complexity 1 and 2) have to be memorized in one form or another because their interpretation is arbitrary; without such memorization generalization to complex inputs is impossible.

We found the RNN system to struggle already at a basic level; it never achieved perfect accuracy even for minimally complex structures (e.g. Ann’s child), so assessing its recursive compositionality abilities is out of the question. Accuracies across LSTM experimental setups are summarized in Table 4.

We find that LSTM does learn to do compositional interpretation in our task, but only in the best scenario. First, and unsurprisingly, a curriculum is essential for the LSTM to generalize to unseen compositional examples. Informally, the system has to learn to interpret words first, and recursive semantic composition has to be learned later.

Second, the LSTM only generalized correctly in the case of left-branching structures; the accuracy of recursive composition in the right-branching case stays just above the chance level (25%). This means that the system only learned to apply composition following the linear sequence of the input and failed when the order of composition, as determined by the syntactic structure, runs opposite to the linear order.

test ex. complexity:      3     4     5     6     7
right branching           0     .17   .21   .23   .26
left branching            1     1     1     1     1
left, slow curriculum     .17   .33   .96   1     1
left, no curriculum       .17   .21   .19   .21   .26
Table 4: Accuracy of the LSTM model as a function of the language and input data complexity. Training data in each run includes examples of complexity up to n; testing data (disjoint from the training set) contains examples of complexity exactly n. Random baseline is 0.25.
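The asymmetry between the two languages can be made concrete with a symbolic left-to-right processor. This is an illustration of the memory demands of each word order, not an analysis of what the LSTM actually computes; `MODEL` reuses the example relations from Tables 1 to 3 and `left_to_right` is a hypothetical name.

```python
MODEL = {
    "parent": {"Ann": "Bill", "Bill": "George", "Dick": "Ann", "George": "Dick"},
    "child":  {"Bill": "Ann", "George": "Bill", "Ann": "Dick", "Dick": "George"},
    "friend": {"Ann": "Dick", "Bill": "George", "Dick": "Ann", "George": "Bill"},
    "enemy":  {"Ann": "George", "Bill": "Dick", "Dick": "Bill", "George": "Ann"},
}
NAMES = set(MODEL["parent"])

def left_to_right(tokens, model=MODEL):
    """Consume tokens in order; return (referent, peak number of pending functions)."""
    referent, pending, peak = None, [], 0
    for token in tokens:
        if token in model:
            if referent is None:
                pending.append(token)             # argument unseen: must be remembered
                peak = max(peak, len(pending))
            else:
                referent = model[token][referent]  # argument known: apply at once
        elif token in NAMES:
            referent = token
            while pending:                         # unwind stored functions, innermost first
                referent = model[pending.pop()][referent]
    return referent, peak

print(left_to_right("Ann 's child 's enemy".split()))          # ('Bill', 0)
print(left_to_right("the enemy of the child of Ann".split()))  # ('Bill', 2)
```

In the left-branching language a single running referent suffices, whereas in the right-branching language every function word has to be stored until the name arrives, so the memory load grows with the depth of embedding.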

5 Looking for zero-shot generalization

We also investigate how easily our most successful system learns to perform recursive interpretation. Ideally, learners with a strong bias towards languages with recursive syntactic structure (to which presumably human language learners belong) could acquire them in a zero-shot fashion. For example, if such a learner knows already that both the simple name Dick and the phrase Ann’s child refer to the same individual, and that Dick’s enemy refers to Bill, the learner should be able to infer, or at least reliably guess, that Ann’s child’s enemy is also Bill. Furthermore, such inference can be expected even in the absence of recursive structures in the training input. In a recurrent neural network, the expectation can be interpreted as follows: both Dick and the phrase Ann’s child are expected to be mapped to more or less the same hidden state, and since this hidden state makes it possible to identify Bill after seeing the last two tokens of Dick’s enemy, the same can be expected for the phrase Ann’s child’s enemy.

To test whether zero-shot (or even one-shot) recursion capacity actually arises, we train the model on data of complexity up to 3 while varying the amount of recursion examples available as training data. The results are reported in Table 5, which shows that the LSTM needs to be trained on a vast majority of recursive examples to be able to generalize to new ones.

So, although the recurrent architecture seems naturally adapted for processing complex left-branching structures, the system has to be trained on a significant number of examples of composition before it generalizes. Unlike (presumably) in humans, recursive compositionality does not come for free and has to be learned from extensive data. This observation is in line with other findings in related literature (Liska et al., 2018; Hupkes et al., 2018; Lake and Baroni, 2018).

train              0.0   0.2   0.4   0.6   0.8
average accuracy   0     .65   .67   .92   .98
perfect accuracy   0     0     0     .4    .9
Table 5: Data hungriness for learning recursion at complexity 3: share of complexity 3 data included in the training data vs. test accuracy. The results are based on 10 runs with different random seeds. We report average accuracy as well as the share of runs with perfect accuracy. Random baseline for accuracy is .25.
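The two statistics in Table 5 and the varying training share can be sketched as below. The helper names are hypothetical, and the way held-out examples are drawn is an assumption; the text only states that the share of complexity 3 data in training was varied over 10 seeded runs.

```python
import random

def split_recursive(examples, train_share, seed=0):
    """Put a `train_share` fraction of complexity-3 examples in training, the rest in test."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = round(train_share * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def summarize(accuracies):
    """Return (average accuracy, share of runs with perfect accuracy)."""
    mean = sum(accuracies) / len(accuracies)
    perfect = sum(a == 1.0 for a in accuracies) / len(accuracies)
    return mean, perfect
```

A training share of 0.0 corresponds to the zero-shot condition, in which no recursive example is seen during training.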

Following previous research (Liska et al., 2018), we also trained our LSTM model 1000 times with different random seeds in order to test whether zero-shot generalization sometimes emerges from neural network training. If it did, this would mean that learning can be improved rather directly by ensembling, by adjusting the model’s biases, or by other means. The seeds were selected randomly from the range of positive integers. In this experiment, we observed no instance of zero-shot generalization: out of 1000 runs with different random seeds, almost all produced 0 accuracy. Only 5 out of 1000 trained models gave one or two correct responses on the held-out test set, which is well below the random baseline. This suggests that in the absence of substantial evidence for a compositional solution the model overfits heavily to the training data that can be memorized.

On the positive side, we did observe generalization to bigger structures after substantial evidence for a compositional solution was made available to the system. To test this, we trained the LSTM on all expressions of complexity up to 3 and tested it on expressions of complexity 4. Contrary to reports in the literature on neural networks overfitting to the length of training input (Lake and Baroni, 2018), our model generalized well to data of unseen length, achieving perfect accuracy.

6 Conclusion

The results reported in this paper are both encouraging and indicative of the limitations of conventional LSTM training. On the one hand, recurrent models do generalize to compositional interpretation in certain narrowly defined favorable conditions: a gentle curriculum, plenty of training data that support the compositional solution, a left-branching language, etc.

On the other hand, our observations suggest that learning recursive structure in the general case remains a challenge for LSTM networks, which excel only in sequential, left-to-right processing. If recursion, as has been claimed, is a core distinguishing property of human language and cognition (Hauser et al., 2002; Chomsky, 2014), we may need to make sure that learning systems designed for language incorporate reasonable biases towards recursive processing.

In future research, we would like to explore on our task the generalization capacities of neural models which, unlike the vanilla recurrent networks used here, contain what seem to be reasonable biases towards processing context-free languages, arguably useful for learning natural language syntax. Indeed, several systems have been proposed that aim at learning structures defined by context-free grammars, as opposed to purely sequential input processing. Several among these systems augment the recurrent architecture either with stack memory (Joulin and Mikolov, 2015; Yogatama et al., 2018) or with a chart parsing component (Le and Zuidema, 2015; Maillard et al., 2017), which by their nature are adapted to the task of processing context-free languages. On the other hand, additional memory representations invoked by such models may require further justification from the cognitive point of view if artificial neural networks are taken to be models of language processing in humans.

Lastly, in further work we plan to replicate our experiments with human learners instead of artificial systems. This will enable a proper comparison between the generalization capacities of humans and of machine learning algorithms, depending on the nature and the quantity of input data. Indeed, our findings about the role of curriculum confirm the early observations of Elman (1993), who argued that processing and memory limitations of the human brain during childhood may effectively create a staged input to the learning system, akin to curriculum learning in computational systems. Although not uncontroversial (Rohde and Plaut, 1997), Elman’s suggestion could serve as an explanation of the so-called critical period of first language acquisition. Experiments with human learning of simple interpreted languages can help support or disprove this hypothesis.

I would also like to point to two further directions relating human and machine learning, which are at the moment very speculative but are nonetheless of great potential importance for our understanding and modelling of human linguistic cognition. The recurrent LSTM model in our experiment showed clear structural preferences, which we interpreted as limitations. We do not expect to find exactly the same limitations in human language learning, but they might have some correspondences, perhaps indirect, which can be observed in some acquisition scenarios.

First, the model showed a preference towards left-branching structures (such as John’s father) rather than right-branching ones (such as the father of John). While human languages in principle contain both types of structures, and both are eventually learned without significant difficulties, it is known that left-branching possessive constructions can emerge in infant speech even in languages that don’t have them. Monolingual infants have been reported to produce examples like Yael sefer ‘Yael’s book’ (Hebrew, Armon-Lotem 1998) or zia trattore ‘aunt’s tractor’ (Italian, Torregrossa and Melloni 2014), even though these languages do not allow the possessor-possessee order and the children could only have been exposed to the opposite sequential order (sefer shel Yael, trattore della zia). This might suggest an innate bias towards head-final constructions which could have been left-branching had they been recursive.

Second, we found no zero-shot generalization to recursive syntactic structures. On the contrary, the training data had to show strong support for recursion in order for the LSTM to learn it. While most if not all natural languages feature syntactic recursion, it does not follow logically that human babies learn to process and to produce recursive structures effortlessly without the need to be exposed to a large number of examples of recursion first. Indeed, there are examples of the lack of syntactic recursion for certain syntactic constructions. The most widely discussed example is the Pirahã language of Brazil. The question whether Pirahã lacks syntactic recursion altogether remains controversial (Everett, 2007; Nevins et al., 2009; Sauerland, 2010), but it seems clear that although Pirahã has possessive constructions (equivalents of the English Bill’s son) these are not recursive (so Bill’s son’s friend is impossible in Pirahã). Similar constraints on recursion are reported for German (Krause 2000, cited in Nevins et al. 2009). If these reports are correct, they strongly suggest that humans, like our LSTM models, in fact do not learn recursive syntactic structures in a zero-shot fashion, without being exposed to examples of recursion for a particular construction. If this tentative connection is on the right track, it opens further interesting questions about the role of recursion in the functioning and the evolution of language and about the interaction of cognitive, functional, and possibly cultural factors in shaping grammars of human languages.


The research has been supported by the CNRS PEPS ReSeRVe grant. I also thank Germán Kruszewski and Marco Baroni for useful input on the topic.


  • Armon-Lotem (1998) Sharon Armon-Lotem. 1998. Mommy sock in a minimalist eye: On the acquisition of DP in Hebrew. Issues in the theory of language acquisition. Essays in Honor of Jürgen Weissenborn. Bern (Peter Lang), pages 15–36.
  • Chomsky (2014) Noam Chomsky. 2014. Minimal recursion: exploring the prospects. In Recursion: Complexity in cognition, pages 1–15. Springer.
  • Christiansen and Chater (2016) Morten H. Christiansen and Nick Chater. 2016. The now-or-never bottleneck: A fundamental constraint on language. Behavioral and Brain Sciences, 39.
  • Elman (1991) Jeffrey L Elman. 1991. Distributed representations, simple recurrent networks, and grammatical structure. Machine learning, 7(2-3):195–225.
  • Elman (1993) Jeffrey L Elman. 1993. Learning and development in neural networks: The importance of starting small. Cognition, 48(1):71–99.
  • Everett (2007) Daniel L Everett. 2007. Cultural constraints on grammar in Pirahã: A reply to Nevins, Pesetsky, and Rodrigues.
  • Hauser et al. (2002) Marc D. Hauser, Noam Chomsky, and W. Tecumseh Fitch. 2002. The faculty of language: What is it, who has it, and how did it evolve? Science, 298(5598):1569–1579.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
  • Hupkes et al. (2018) Dieuwke Hupkes, Anand Singh, Kris Korrel, Germán Kruszewski, and Elia Bruni. 2018. Learning compositionally through attentive guidance. CoRR, abs/1805.09657.
  • Joulin and Mikolov (2015) Armand Joulin and Tomas Mikolov. 2015. Inferring algorithmic patterns with stack-augmented recurrent nets. In Advances in neural information processing systems, pages 190–198.
  • Krause (2000) Cornelia Krause. 2000. On an (in-)visible property of inherent case. In North Eastern Linguistic Society 30, pages 427–42.
  • Lake and Baroni (2018) Brenden Lake and Marco Baroni. 2018. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In International Conference on Machine Learning, pages 2879–2888.
  • Le and Zuidema (2015) Phong Le and Willem Zuidema. 2015. The forest convolutional network: Compositional distributional semantics with a neural chart and without binarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1155–1164.
  • Liska et al. (2018) Adam Liska, Germán Kruszewski, and Marco Baroni. 2018. Memorize or generalize? searching for a compositional RNN in a haystack. CoRR, abs/1802.06467.
  • Maillard et al. (2017) Jean Maillard, Stephen Clark, and Dani Yogatama. 2017. Jointly learning sentence embeddings and syntax with unsupervised Tree-LSTMs. arXiv preprint arXiv:1705.09189.
  • Nevins et al. (2009) Andrew Nevins, David Pesetsky, and Cilene Rodrigues. 2009. Pirahã exceptionality: A reassessment. Language, pages 355–404.
  • Paperno et al. (2016) Denis Paperno, German Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda Torrent, and Raquel Fernandez. 2016. The lambada dataset: word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Berlin: Association for Computational Linguistics, pages 1525–1534. ACL (Association for Computational Linguistics).
  • Rohde and Plaut (1997) Douglas L.T Rohde and David C Plaut. 1997. Simple recurrent networks and natural language: How important is starting small. In Proceedings of the 19th annual conference of the Cognitive Science Society, pages 656–661. Citeseer.
  • Sauerland (2010) Uli Sauerland. 2010. Experimental evidence for complex syntax in Pirahã. URL:
  • Torregrossa and Melloni (2014) Jacopo Torregrossa and Chiara Melloni. 2014. English compounds in child Italian. In New Directions in the Acquisition of Romance Languages, Selected Proceedings of the Romance Turn V, pages 346–371. Cambridge Scholars Publishing.
  • Yogatama et al. (2018) Dani Yogatama, Yishu Miao, Gabor Melis, Wang Ling, Adhiguna Kuncoro, Chris Dyer, and Phil Blunsom. 2018. Memory architectures in recurrent neural network language models. In Proceedings of ICLR.