Reinforcement Learning of Minimalist Numeral Grammars

06/11/2019 ∙ by Peter beim Graben, et al.

Speech-controlled user interfaces ease the operation of devices and household functions for laypersons. State-of-the-art language technology scans the acoustically analyzed speech signal for relevant keywords that are subsequently inserted into semantic slots to interpret the user's intent. In order to develop proper cognitive information and communication technologies, simple slot-filling should be replaced by utterance meaning transducers (UMT) that are based on semantic parsers and a mental lexicon comprising syntactic, phonetic and semantic features of the language under consideration. This lexicon must be acquired by a cognitive agent during interaction with its users. We outline a reinforcement learning algorithm for the acquisition of the syntactic morphology and arithmetic semantics of English numerals, based on minimalist grammar (MG), a recent computational implementation of generative linguistics. Number words are presented to the agent by a teacher in the form of utterance meaning pairs (UMP), where the meanings are encoded as arithmetic terms from a suitable term algebra. Since MG encodes universal linguistic competence through inference rules, thereby separating innate linguistic knowledge from the contingently acquired lexicon, our approach unifies generative grammar and reinforcement learning, hence potentially resolving the still pending Chomsky-Skinner controversy.




I Introduction

Speech-controlled user interfaces such as Amazon’s Alexa, Apple’s Siri or Microsoft’s Cortana substantially ease the operation of devices and household functions for laypersons. Instead of using keyboard and display as input-output interfaces, the operator speaks requests or instructions to the device and listens to its responses.

State-of-the-art language technology scans the acoustically analyzed speech signal for relevant keywords that are subsequently inserted into semantic frames [1] to interpret the user’s intent. This slot filling procedure [2, 3, 4] is based on large language corpora that are evaluated by standard machine learning methods, such as conditional random fields or deep learning of neural networks [4], for instance. The necessity to overcome traditional slot filling techniques by proper cognitive information and communication technologies has already been emphasized by Allen [5]. His research group trains semantic parsers from large language databases such as WordNet or VerbNet that are constrained by hand-crafted expert knowledge and semantic ontologies [2, 6, 7].

One particular demand on cognitive user interfaces is the processing and understanding of numerals, e.g. in instructions like “increase the heating to 22.5 degrees”, to which the device may respond with a sensor reading: “the current room temperature is 18.3 degrees” [8]. Numerals are an important research domain in cognitive linguistics and language technology [9, 10, 11, 12, 13, 14]. They exhibit typological differences among languages but share a simple arithmetic semantics. Examples are the different numeral morphologies of German and English, and the different base systems of German and French [11]. Linguistically, numerals are regarded as modifiers [12] with a particular syntactic morphology that should be described by a suitable grammar formalism. This grammar must store numeral morphemes together with their arithmetic semantics in a data base, called the mental lexicon. It should be complex enough to account for the wealth of linguistic typology and constrained enough to exclude ungrammatical compositions such as zweizig in German or twoty in English [11].

Recent research in computational linguistics has demonstrated that quite different grammar formalisms, such as categorial grammar [15], tree-adjoining grammar [16], multiple context free grammar (MCFG) [17], range concatenation grammar [18], and minimalist grammar [19, 20] converge toward universal description models [21, 22]. Minimalist grammar has been developed by Stabler [19] to mathematically codify Chomsky’s Minimalist Program [23] in the generative grammar framework. A minimalist grammar (MG) consists of a mental lexicon storing linguistic signs as arrays of syntactic, phonetic and semantic features, on the one hand, and of two structure-building functions, called “merge” and “move”, on the other hand. Syntactic features in the lexicon are, e.g., the linguistic base categories noun (n), verb (v), adjective (a), or, in the present context, numeral (num). These are syntactic heads selecting other categories either as complements or as adjuncts. The structure generation is controlled by selector categories that are “merged” together with their selected counterparts. Moreover, one distinguishes between licensors and licensees, triggering the movement of maximal projections. An MG does not comprise any phrase structure rules; all syntactic information is encoded in the feature array of the mental lexicon. Furthermore, syntax and compositional semantics can be combined via the lambda calculus [24, 25], while MG parsing can be implemented by compilation into an equivalent MCFG [26].

One important property of MG is its effective learnability in the sense of Gold’s formal learning theory [27]. Specifically, MG can be acquired from positive examples [28, 29] given as linguistic dependency graphs [30, 31], which is consistent with psycholinguistic findings on early child language acquisition [32, 33, 34]. However, learning through positive examples only could easily lead to overgeneralization. According to Pinker [33], this could effectively be avoided through reinforcement learning [35, 36]. Although there is only little psycholinguistic evidence for reinforcement learning in human language acquisition [37, 38], in this contribution we outline a machine learning algorithm that acquires an MG mental lexicon of numeral morphology and semantics through reinforcement learning.

II Numeral Grammar

Our language acquisition approach for numeral grammar combines methods from computational linguistics, formal logic, and abstract algebra. The starting point of our algorithm is the utterance meaning pair (UMP)

$$u = \langle e, \sigma \rangle \qquad (1)$$

where $e \in E$ is the spoken or written utterance, given as the exponent of a linguistic sign [39]. Technically, exponents are strings taken from the Kleene hull of some finite alphabet $A$, i.e. $E = A^*$. The sign’s semantics $\sigma \in \Sigma$ is a logical term, usually expressed by means of lambda calculus.

II-A Numeral Semantics

The straightforward meaning of a numeral, say fourtytwo, is a number concept, such as $42$. However, from a computational point of view, the UMP $\langle \text{fourtytwo}, 42 \rangle$ simply relates a symbolic string fourtytwo to another symbolic string $42$, without making the exponent and the semantics of the sign operationally accessible. This is achieved by interpreting digit strings in a $g$-adic number system. In the decimal system with $g = 10$, we have

$$42 = \sum_{k=0}^{m-1} a_k \, 10^k = 4 \cdot 10^1 + 2 \cdot 10^0 \qquad (2)$$

with coefficients $a_k \in \{0, \dots, 9\}$ ($m$ the number of digits).
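The decimal expansion above can be made concrete with a few lines of Python. This is only an illustrative sketch: the function names `gadic_digits` and `from_digits` are hypothetical and not taken from the paper, and the coefficients are returned least significant first.

```python
def gadic_digits(n: int, g: int = 10) -> list[int]:
    """Return the g-adic coefficients of n, least significant digit first."""
    if n < 0 or g < 2:
        raise ValueError("expects n >= 0 and base g >= 2")
    digits = []
    while True:
        n, a_k = divmod(n, g)   # peel off the lowest digit
        digits.append(a_k)
        if n == 0:
            return digits


def from_digits(digits: list[int], g: int = 10) -> int:
    """Recombine coefficients a_k into the number sum(a_k * g**k)."""
    return sum(a_k * g**k for k, a_k in enumerate(digits))


coefficients = gadic_digits(42)          # [2, 4], i.e. 2 * 10**0 + 4 * 10**1
assert from_digits(coefficients) == 42
```

Passing a different base, for instance `gadic_digits(42, 2)`, gives the coefficients of the same number in another $g$-adic system.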

Equation (2) can directly be written as a tree-like arithmetic term structure (Fig. 1).

Fig. 1: Arithmetic term tree for $42$.

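Using the binary operators $+$ and $\times$, and writing them in the unary Schönfinkel representation, where $+$ is regarded as a function $+ : \mathbb{N} \to (\mathbb{N} \to \mathbb{N})$, and $\times$ as another function $\times : \mathbb{N} \to (\mathbb{N} \to \mathbb{N})$, respectively, we obtain an expression of the arithmetic term algebra [39] in Polish notation

$$+(\times(4)(10))(2)$$

that will be interpreted as the meaning of the numeral fourtytwo in the sequel [13, 14]. Hence, the correct UMP for $42$ is

$$u = \langle \text{fourtytwo}, +(\times(4)(10))(2) \rangle \qquad (3)$$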
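To illustrate how a curried (Schönfinkel-style) term in Polish notation can be built and evaluated, the following sketch represents arithmetic terms as a small tree type. The names `Apply`, `polish`, `evaluate` and `decimal_term` are hypothetical, and the left-nested construction in `decimal_term` is an assumption about how the two-digit case would generalize to longer digit strings; it is not taken from the paper.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Apply:
    """A binary operator applied to two sub-terms."""
    op: str          # "+" or "×"
    left: "Term"
    right: "Term"


Term = Union[int, Apply]

# Curried operators: each takes one argument and returns a unary function.
OPERATORS = {
    "+": lambda x: lambda y: x + y,
    "×": lambda x: lambda y: x * y,
}


def polish(term: Term) -> str:
    """Render a term in curried Polish notation, e.g. +(×(4)(10))(2)."""
    if isinstance(term, int):
        return str(term)
    return f"{term.op}({polish(term.left)})({polish(term.right)})"


def evaluate(term: Term) -> int:
    """Evaluate a term by successive unary applications."""
    if isinstance(term, int):
        return term
    return OPERATORS[term.op](evaluate(term.left))(evaluate(term.right))


def decimal_term(n: int) -> Term:
    """Build a term for n >= 0 from its decimal digits (assumed nesting)."""
    digits = [int(c) for c in str(n)]    # most significant digit first
    term: Term = digits[0]
    for d in digits[1:]:
        term = Apply("+", Apply("×", term, 10), d)
    return term


term = decimal_term(42)
print(polish(term), "=", evaluate(term))
```

Keeping the term as a data structure rather than a bare number is what makes the semantics operationally accessible: the same object can be printed, evaluated, or decomposed by a learner.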


II-B Minimalist Grammar

Following Kracht [39], we regard a linguistic sign as an ordered triple

$$z = \langle e, t, \sigma \rangle \qquad (4)$$

with the same exponent $e \in E$ and semantics $\sigma \in \Sigma$ as in the UMP (1). In addition, $t \in T$ is a syntactic type that we encode by means of minimalist grammar (MG) in its chain representation [20]. The type controls the generation of syntactic structure and hence the order of lambda application, analogously to the typed lambda calculus in Montague semantics.

An MG consists of a data base, the mental lexicon, containing signs as arrays of syntactic, phonetic and semantic features, and of two structure-generating functions, called “merge” and “move”. Syntactic features are the basic types $b \in B$ from a finite set $B$, with $\text{n}, \text{v}, \text{a}, \text{num} \in B$, etc., together with a set $S = \{ {=}b \mid b \in B \}$ of their respective selectors that are unified by the “merge” operation. Moreover, one distinguishes between a set $L_+ = \{ {+}l \mid l \in L \}$ of licensers and another set $L_- = \{ {-}l \mid l \in L \}$ of their corresponding licensees triggering the “move” operation. $L$ is another finite set of movement identifiers. $F = B \cup S \cup L_+ \cup L_-$ is called the feature set. Finally, one has a two-element set $C = \{ {::}, {:} \}$ of categories, where “::” indicates simple, lexical categories while “:” denotes complex, derived categories. The ordering of syntactic features is prescribed as regular expressions, i.e. $T = C (S \cup L_+)^* B \, L_-^*$ is the set of syntactic types [19, 20]. The set of linguistic signs is then given as $E \times T \times \Sigma$ [39].
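A lexicon entry of this kind can be modelled as a small record. The sketch below is a hypothetical data structure, not the authors' implementation: features are plain strings with the prefixes `=` (selector), `+` (licenser) and `-` (licensee), and the well-formedness check assumes the usual ordering in Stabler-style grammars, namely selectors and licensers first, then exactly one basic type, then licensees.

```python
from dataclasses import dataclass

BASE_TYPES = {"n", "v", "a", "num"}      # basic types named in the text


@dataclass(frozen=True)
class Sign:
    exponent: str                  # phonetic or written form
    category: str                  # "::" (lexical) or ":" (derived)
    features: tuple[str, ...]      # ordered syntactic features
    semantics: str                 # semantic term, kept as a string here


def is_wellformed(sign: Sign, base_types: set[str] = BASE_TYPES) -> bool:
    """Check the assumed ordering: (selector | licenser)* base licensee*."""
    if sign.category not in ("::", ":"):
        return False
    feats = sign.features
    i = 0
    while i < len(feats) and feats[i].startswith(("=", "+")):
        i += 1                             # skip leading selectors/licensers
    if i == len(feats) or feats[i] not in base_types:
        return False                       # exactly one basic type must follow
    return all(f.startswith("-") for f in feats[i + 1:])


three = Sign("three", "::", ("num",), "3")
print(is_wellformed(three))
```

Storing the features as an ordered tuple mirrors the fact that merge and move consume them strictly from left to right.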

Let be exponents, semantic terms in the lambda calculus, one feature identifier, feature strings compatible with the regular types in , and sequences of signs, then and form signs in the sense of (4). A sequence of signs is called a minimalist expression, and the first sign of an expression is called its head, controlling the structure building through “merge” and “move” as follows.

The MG function “merge” is defined through inference schemata


Correspondingly, “move” is given through


where only one sign with licensee $-f$ may appear in the expression licensed by $+f$ in the head. This so-called shortest movement constraint (SMC) guarantees syntactic locality demands [19, 20].
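For orientation, the five schemata are commonly stated in the chain notation of Stabler and Keenan [20] roughly as below. This is a sketch of the purely syntactic, string-level form only: the semantic components of the signs are left out, and the symbols ($s, t$ for exponents, $f$ for a feature identifier, $\gamma, \delta$ for feature strings with $\delta$ non-empty, $\alpha_i, \iota_j$ for the remaining chains, and $\cdot$ for either “::” or “:”) are notational assumptions made here rather than the notation of this paper.

```latex
% merge-1: a lexical selector takes its complement to the right
\[ \frac{s :: {=}f\,\gamma \qquad t \cdot f,\ \alpha_1,\dots,\alpha_k}
        {st : \gamma,\ \alpha_1,\dots,\alpha_k} \]

% merge-2: a derived selector takes its specifier to the left
\[ \frac{s : {=}f\,\gamma,\ \alpha_1,\dots,\alpha_k \qquad t \cdot f,\ \iota_1,\dots,\iota_l}
        {ts : \gamma,\ \alpha_1,\dots,\alpha_k,\ \iota_1,\dots,\iota_l} \]

% merge-3: the selected item still carries features \delta and stays a separate chain
\[ \frac{s \cdot {=}f\,\gamma,\ \alpha_1,\dots,\alpha_k \qquad t \cdot f\,\delta,\ \iota_1,\dots,\iota_l}
        {s : \gamma,\ \alpha_1,\dots,\alpha_k,\ t : \delta,\ \iota_1,\dots,\iota_l} \]

% move-1: the licensee chain has no further features and lands in front of the head
\[ \frac{s : {+}f\,\gamma,\ \alpha_1,\dots,\alpha_{i-1},\ t : {-}f,\ \alpha_{i+1},\dots,\alpha_k}
        {ts : \gamma,\ \alpha_1,\dots,\alpha_{i-1},\ \alpha_{i+1},\dots,\alpha_k} \]

% move-2: the licensee chain keeps features \delta and remains a separate chain
\[ \frac{s : {+}f\,\gamma,\ \alpha_1,\dots,\alpha_{i-1},\ t : {-}f\,\delta,\ \alpha_{i+1},\dots,\alpha_k}
        {s : \gamma,\ \alpha_1,\dots,\alpha_{i-1},\ t : \delta,\ \alpha_{i+1},\dots,\alpha_k} \]
```

In the two move schemata, the shortest movement constraint requires that no other chain $\alpha_j$ begins with the same licensee $-f$.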

A minimalist derivation terminates when all syntactic features except one distinguished start symbol, in our case num, have been consumed. The meaning of rules (5–9) and their applicability will become clear in the next section.

III Reinforcement Learning

The language learner is a cognitive agent in a state $M_t$, to be identified with the learner’s mental lexicon at training time $t$. At time $t = 0$, the learner is initialized as a tabula rasa with empty lexicon

$$M_0 = \emptyset \qquad (10)$$

and exposed to UMPs produced by a continuously counting teacher. The first UMPs given by the teacher are $\langle \text{one}, 1 \rangle$, $\langle \text{two}, 2 \rangle$, $\langle \text{three}, 3 \rangle$, and so forth. Note that we assume that the teacher presents complete UMPs, not bare utterances, to the learner. Thus we avoid the symbol grounding problem of first assigning meanings to uttered exponents [40], which will be addressed in future research. Moreover, we assume that the learner is instructed to reproduce the teacher’s counting based on its own numeric understanding. This provides a feedback loop and therefore makes reinforcement learning applicable [35, 36].

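As long as the learner is not able to detect patterns or common similarities in the teacher’s UMPs, it simply adds new entries directly to its mental lexicon, assuming that all numerals have base type num. Hence, the learner’s lexicon state $M_t$ evolves according to the update rule

$$M_{t+1} = M_t \cup \{ \langle e_t, {::}\,\text{num}, \sigma_t \rangle \} \qquad (11)$$

when $u_t = \langle e_t, \sigma_t \rangle$ is the UMP presented at time $t$ by the teacher.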
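The rote-learning phase described by this update rule amounts to adding one entry per presented UMP. The sketch below is a minimal illustration under stated assumptions: signs are plain `(exponent, type, semantics)` tuples, meanings are kept as strings, and the helper names `update` and `reproduce` are hypothetical.

```python
# A sign is modelled as (exponent, type, semantics); all fields are strings.
Sign = tuple[str, str, str]


def update(lexicon: frozenset, ump: tuple[str, str]) -> frozenset:
    """Add the presented UMP as a lexical entry of base type num."""
    exponent, semantics = ump
    return lexicon | {(exponent, ":: num", semantics)}


def reproduce(lexicon: frozenset, semantics: str) -> list[str]:
    """Look up all exponents stored for a meaning (plain data base query)."""
    return sorted(e for e, _, s in lexicon if s == semantics)


lexicon: frozenset = frozenset()                     # tabula rasa
for ump in [("one", "1"), ("two", "2"), ("three", "3")]:
    lexicon = update(lexicon, ump)

print(reproduce(lexicon, "2"))
```

Because the lexicon is a set, presenting the same UMP twice leaves the state unchanged.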

In this way, the mental lexicon of simplex numerals in Tab. I has been acquired.

TABLE I: Content of the minimalist lexicon of the language learner after acquisition of the simplex numerals.

The learner is thus able to reproduce the learned entries perfectly by direct data base query. As a consequence, the teacher rewards the learner, signalling that it has learned this lexicon correctly.

When the teacher continues counting with the UMPs for thirteen, fourteen, and so on, the learner’s pattern matching faculty detects a common affix teen in the exponents, and a common function in the semantics of these UMPs.

Thus, in a first step the UMP for thirteen is still added to the lexicon according to update rule (11),


However, when the next UMP is presented, pattern matching, segmentation and lambda abstraction are performed, leading to a revision [28, 29]


such that in (13) the previously learned lexicon is revised by removing the entry for the composite thirteen, followed by adding the complex morpheme in (14), and completed in (15). Since the morpheme four is already contained in the lexicon, further updating is not required at this time.
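The segmentation step on the exponent side can be sketched as follows. This is one possible heuristic, not the paper's procedure: it takes the longest common suffix of the exponents and shortens it until at least one remaining stem is already a known lexicon exponent. The names `common_suffix` and `segment` are hypothetical, and the lambda abstraction over the semantics is not modelled here.

```python
import os


def common_suffix(words: list[str]) -> str:
    """Longest string that all words end with."""
    reversed_words = [w[::-1] for w in words]
    return os.path.commonprefix(reversed_words)[::-1]


def segment(words: list[str], known: set[str]):
    """Split words into (suffix, stems), anchored on an already known stem."""
    suffix = common_suffix(words)
    for start in range(len(suffix)):
        candidate = suffix[start:]
        stems = [w[:-len(candidate)] for w in words]
        if any(stem in known for stem in stems):
            return candidate, stems
    return None          # no split is supported by the current lexicon


known_exponents = {"three", "four", "five"}
print(common_suffix(["thirteen", "fourteen"]))               # rteen
print(segment(["thirteen", "fourteen"], known_exponents))    # ('teen', ['thir', 'four'])
```

The raw common suffix of thirteen and fourteen is rteen, so the purely string-based cut would be wrong; anchoring on the known entry four is what yields the split into thir, four and teen.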

Next, the learner has to correctly reproduce the UMPs for thirteen and fourteen by invoking its utterance-meaning transducer (UMT) [14]. Consider thirteen, whose derivation is now ambiguous with respect to the lexicon entries thir and three. First, the learner may access the data base entries thir and teen and derive the following UMP according to the MG rules (5–9)

This yields the correct semantics with the lambda calculus

and the uttered exponent thirteen, generated by the UMT [14], is well-formed and will be rewarded by the teacher.

However, the learner may alternatively select the data base entries three and teen as well. Then

will be derived instead. Although it has the correct semantics, uttering the exponent threeteen will be rejected by the teacher. Upon the resulting punishment, the learner has to reconfigure its mental lexicon by introducing additional licenser/licensee pairs, here denoted as $+k$/$-k$ [28, 29]. Table II displays the result of this reorganization process at some later time, when all possible ungrammaticalities have been abandoned.

TABLE II: Content of the minimalist lexicon of the language learner after punishment reorganization.

Now only the data base selection of thir and teen leads to a grammatical derivation of the UMT [14],

while its ambiguous counterpart with three

cannot be further processed due to a lacking licensee $-k$.

The same argument applies to the ambiguous entries five and fif, of which only fif successfully derives fifteen. Note that the currently learned grammar also derives the exponent eightteen instead of eighteen; this could be corrected by either learning an additional entry and revising the lexicon, or, perhaps more appropriately, by introducing additional phonotactical rules operating on abstract graphon representations [10]. Moreover, since simplex numerals such as four, six, seven, and nine must not possess any features other than num, they would be doubled in a more rigorous treatment, resulting in four additional lexicon entries.
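The effect of reward and punishment on the competing entries can be imitated with a deliberately crude model. It is not the MG mechanism itself: instead of licenser/licensee features, a plain set records which stems remain allowed to combine with teen, and the teacher is a hypothetical lookup table of attested forms. All names and data below are illustrative assumptions.

```python
# Hypothetical teacher: the attested English forms for 13 to 15.
TEACHER = {13: "thirteen", 14: "fourteen", 15: "fifteen"}

# Hypothetical stems with meanings; "three"/"thir" and "five"/"fif" compete.
STEMS = {"three": 3, "thir": 3, "four": 4, "five": 5, "fif": 5}


def produce(value: int, stems: dict, licensed: set) -> list[str]:
    """All teen forms the learner can currently derive for a meaning."""
    return sorted(stem + "teen" for stem, v in stems.items()
                  if v == value - 10 and stem in licensed)


def reinforce(stems: dict, teacher: dict) -> set:
    """Keep a stem licensed only if the teacher accepts its teen form."""
    licensed = set(stems)                       # initially everything combines
    for stem, value in stems.items():
        utterance = stem + "teen"
        if teacher.get(value + 10) != utterance:
            licensed.discard(stem)              # punishment: withdraw licence
    return licensed


before = produce(13, STEMS, set(STEMS))     # ['thirteen', 'threeteen']
licensed = reinforce(STEMS, TEACHER)
after = produce(13, STEMS, licensed)        # ['thirteen']
print(before, after)
```

In the grammar proper, the same restriction is expressed inside the lexicon through features, so that the ill-formed derivation fails for lack of a matching licensee rather than being filtered out afterwards.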

From a semantic point of view, the lexicon state in Tab. II is not yet satisfactory, because another step of lambda abstraction can be applied to the entry for teen, entailing the semantics of plain addition


Incorporating this into the training process gives another updating dynamics


such that (21) removes the original teen from the lexicon, which is subsequently replaced by the phonetically void addition operator and a new representative of teen.

Table III shows the updated lexicon at some even later time.

TABLE III: Content of the minimalist lexicon of the language learner after semantic reorganization.

Now, the correct derivation of thirteen reads

By virtue of this lexicon, the learner is able to correctly reproduce the numerals learned so far, employing its UMT [14]. This will be rewarded by the teacher. Later, the teacher utters the UMPs of further numerals. Again, the learner will first incorporate the first of them into the lexicon according to rule (11). But upon perceiving the next one, its pattern matching device produces a common morpheme


through lambda abstraction. Then essentially the same processes of reinforcement learning are repeated as above until the complete numeral system of the language taught by the teacher has been acquired by the learner.

IV Discussion

In this contribution we have outlined an algorithm for effectively learning the syntactic morphology and semantics of English numerals [11]. Number words are presented to a cognitive agent by a teacher in the form of utterance meaning pairs (UMP), where the meanings are encoded as arithmetic terms from a suitable term algebra. This representation allows for the application of compositional semantics via lambda calculus. For the description of syntactic categories we use Stabler’s minimalist grammar (MG) [19, 20], a powerful computational implementation of Chomsky’s recent Minimalist Program for generative linguistics [23]. Despite the controversy between Chomsky and Skinner [41], we exploit reinforcement learning [35, 36] as the training paradigm. Since MG encodes universal linguistic competence through the five inference rules (5–9), thereby separating innate linguistic knowledge from the contingently acquired lexicon, our approach could potentially unify generative grammar and reinforcement learning, hence resolving the abovementioned dispute.

Minimalist grammar can be learned from linguistic dependency structures [28, 29, 30, 31] by positive examples, which is supported by psycholinguistic findings on early human language acquisition [32, 33, 34]. However, as Pinker [33] has emphasized, learning through positive examples alone could lead to undesired overgeneralization. Reinforcement learning, which might play a role in children’s language acquisition as well [37, 38], could effectively avoid such problems. The required dependency structures are directly provided by the semantics in the training UMPs. Thus, our approach is explicitly semantics-driven, in contrast to the algorithm in [31] that regards dependencies as latent variables for EM training.

As a proof of concept we suggested an algorithm for English numerals. However, we also have evidence that it works for the German and French number systems, and we expect it to carry over to other languages. Using attribute-value logic [42] and its associated term algebra, it should be possible to encode the semantics of arbitrary utterances in a compositional fashion. This will open up an entirely new avenue for the further development of speech-controlled cognitive user interfaces [8].


  • [1] M. Minsky, “A framework for representing knowledge,” M.I.T., Cambridge (MA), Tech. Rep. AIM-306, 1974.
  • [2] J. F. Allen, “Natural language processing,” in Encyclopedia of Computer Science.   Chichester (UK): Wiley, 2003, pp. 1218 – 1222.
  • [3] G. Tur, D. Hakkani-Tür, L. Heck, and S. Parthasarathy, “Sentence simplification for spoken language understanding,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011, pp. 5628 – 5631.
  • [4] G. Mesnil, Y. Dauphin, K. Yao, Y. Bengio, L. Deng, D. Hakkani-Tur, X. He, L. Heck, G. Tur, D. Yu, and G. Zweig, “Using recurrent neural networks for slot filling in spoken language understanding,” IEEE Transactions on Audio, Speech and Language Processing, vol. 23, no. 3, pp. 530 – 539, 2015.
  • [5] J. Allen, “Dialogue as collaborative problem solving,” in Proceedings of Interspeech Conference, 2017, p. 833.
  • [6] J. Allen, “Learning a lexicon for broad-coverage semantic parsing,” in Proceedings of the ACL 2014 Workshop on Semantic Parsing, 2014, pp. 1 – 6.
  • [7] J. F. Allen, O. Bahkshandeh, W. de Beaumont, L. Galescu, and C. M. Teng, “Effective broad-coverage deep parsing,” in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [8] C. Tschöpe, F. Duckhorn, M. Huber, W. Meyer, and M. Wolff, “A cognitive user interface for a multi-modal human-machine interaction,” in Speech and Computer, A. Karpov, O. Jokisch, and R. Potapova, Eds.   Cham: Springer, 2018, pp. 707 – 717.
  • [9] G. Flach, M. Holzapfel, C. Just, A. Wachtler, and M. Wolff, “Automatic learning of numeral grammars for multi-lingual speech synthesizers,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3, 2000, pp. 1291 – 1294.
  • [10] M. Wolff, M. Eichner, and R. Hoffmann, “Improved data-driven generation of pronunciation dictionaries using an adapted word list,” in Proceedings of EUROSPEECH 2001 Scandinavia, 7th European Conference on Speech Communication and Technology, 2001, pp. 1433 – 1436.
  • [11] J. R. Hurford, “Numeral systems,” in International Encyclopedia of the Social & Behavioral Sciences.   Elsevier, 2001, pp. 10 756 – 10 761.
  • [12] T. Ionin and O. Matushansky, “The composition of complex cardinals,” Journal of Semantics, vol. 23, no. 4, pp. 315 – 360, 11 2006.
  • [13] J. A. Mendia, “Epistemic numbers,” Proceedings of SALT, vol. 28, pp. 493 – 511, 2018.
  • [14] P. beim Graben, W. Meyer, R. Römer, and M. Wolff, “Bidirektionale Utterance-Meaning-Transducer für Zahlworte durch kompositionale minimalistische Grammatiken,” in Tagungsband der 30. Konferenz Elektronische Sprachsignalverarbeitung (ESSV), ser. Studientexte zur Sprachkommunikation, P. Birkholz and S. Stone, Eds., vol. 91.   Dresden: TU-Dresden Press, 2019, pp. 76 – 82.
  • [15] M. Amblard, A. Lecomte, and C. Retoré, “Categorial minimalist grammar: from generative syntax to logical form,” Linguistic Analysis, vol. 36, no. 1-4, pp. 273 – 306, 2010.
  • [16] A. K. Joshi, L. S. Levy, and M. Takahashi, “Tree adjunct grammars,” Journal of Computer and System Sciences, vol. 10, no. 1, pp. 136 – 163, 1975.
  • [17] H. Seki, T. Matsumura, M. Fujii, and T. Kasami, “On multiple context-free grammars,” Theoretical Computer Science, vol. 88, no. 2, pp. 191 – 229, 1991.
  • [18] P. Boullier, “Range concatenation grammars,” in New Developments in Parsing Technology, ser. Text, Speech and Language Technology, H. Bunt, J. Carroll, and G. Satta, Eds.   Springer, 2005, vol. 23, pp. 269 – 289.
  • [19] E. P. Stabler, “Derivational minimalism,” in Logical Aspects of Computational Linguistics, ser. Lecture Notes in Computer Science, C. Retoré, Ed.   New York: Springer, 1997, vol. 1328, pp. 68 – 95.
  • [20] E. P. Stabler and E. L. Keenan, “Structural similarity within and among languages,” Theoretical Computer Science, vol. 293, pp. 345 – 363, 2003.
  • [21] J. Michaelis, “Derivational minimalism is mildly context-sensitive,” in Logical Aspects of Computational Linguistics, ser. Lecture Notes in Artificial Intelligence, M. Moortgat, Ed., vol. 2014.   Berlin: Springer, 2001, pp. 179 – 198.
  • [22] E. P. Stabler, “Computational perspectives on minimalism,” in Oxford Handbook of Linguistic Minimalism, C. Boeckx, Ed.   Oxford University Press, 2011, pp. 617 – 641.
  • [23] N. Chomsky, The Minimalist Program, ser. Current Studies in Linguistics.   Cambridge (MA): MIT Press, 1995.
  • [24] S. Niyogi, “A minimalist implementation of verb subcategorization,” in Proceedings of the Seventh International Workshop on Parsing Technologies (IWPT-2001)., 2001.
  • [25] G. M. Kobele, “Syntax and semantics in minimalist grammars,” in Proceedings of ESSLLI 2009, 2009.
  • [26] E. P. Stabler, “Top-down recognizers for MCFGs and MGs,” in Proceedings of the 2nd Workshop on Cognitive Modeling and Computational Linguistics.   Portland, Oregon, USA: Association for Computational Linguistics, 2011, pp. 39 – 48.
  • [27] E. M. Gold, “Language identification in the limit,” Information and Control, vol. 10, no. 5, pp. 447 – 474, 1967.
  • [28] G. M. Kobele, T. Collier, C. Taylor, and E. P. Stabler, “Learning mirror theory,” in Proceedings of the Sixth International Workshop on Tree Adjoining Grammar and Related Frameworks (TAG+6), 2002, pp. 66 – 73.
  • [29] E. P. Stabler, T. C. Collier, G. M. Kobele, Y. Lee, Y. Lin, J. Riggle, Y. Yao, and C. E. Taylor, “The learning and emergence of mildly context sensitive languages,” in Advances in Artificial Life, ser. Lecture Notes in Computer Science, W. B. et al., Ed.   Berlin: Springer, 2003, vol. 2801, pp. 525 – 534.
  • [30] M. Boston, J. Hale, and M. Kuhlmann, “Dependency structures derived from minimalist grammars,” in The Mathematics of Language, ser. Lecture Notes in Computer Science, C. Ebert, G. Jäger, and J. Michaelis, Eds.   Berlin: Springer, 2010, vol. 6149, pp. 1 – 12.
  • [31] D. Klein and C. D. Manning, “Corpus-based induction of syntactic structure: models of dependency and constituency,” in Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics.   Stroudsburg (PA): Association for Computational Linguistics, 2004.
  • [32] N. C. Ellis, “Language acquisition as rational contingency learning,” Applied Linguistics, vol. 27, no. 1, pp. 1 – 24, 2006.
  • [33] S. Pinker, “Language acquisition,” in Language: An Invitation to Cognitive Science, L. R. Gleitman, D. N. Osherson, and M. Liberman, Eds.   MIT Press, 1995, ch. 6, pp. 135 – 182.
  • [34] M. Tomasello, “First steps toward a usage-based theory of language acquisition,” Cognitive Linguistics, vol. 11, no. 1-2, p. 61, 2006.
  • [35] B. F. Skinner, Verbal Behavior.   Mansfield Centre (CT): Martino Publishing, 2015.
  • [36] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction.   MIT press, 2018.
  • [37] E. L. Moerk, “A behavioral analysis of controversial topics in first language acquisition: Reinforcements, corrections, modeling, input frequencies, and the three-term contingency pattern,” Journal of Psycholinguistic Research, vol. 12, no. 2, pp. 129 – 155, 1983.
  • [38] M. L. Sundberg, J. Michael, J. W. Partington, and C. A. Sundberg, “The role of automatic reinforcement in early language acquisition,” Analysis of Verbal Behavior, vol. 13, no. 1, pp. 21 – 37, 1996.
  • [39] M. Kracht, The Mathematics of Language, ser. Studies in Generative Grammar.   Berlin: Mouton de Gruyter, 2003, no. 63.
  • [40] S. Harnad, “The symbol grounding problem,” Physica D, vol. 42, pp. 335 – 346, 1990.
  • [41] N. Chomsky, “A review of B. F. Skinner’s Verbal Behavior,” Language, vol. 35, no. 1, pp. 26 – 58, 1959.
  • [42] M. Johnson, Attribute-Value Logic and the Theory of Grammar, ser. CSLI Lecture Notes.   Stanford (CA): CSLI, 1988.