Mildly context sensitive grammar induction and variational Bayesian inference

by Eva Portelance, et al.
Stanford University

We define a generative model for a minimalist grammar formalism. We present a generalized algorithm for the application of variational Bayesian inference to lexicalized mildly context sensitive grammars. We apply this algorithm to the minimalist grammar model.




1 Introduction

It is well known that natural language exhibits complex syntactic relations such as long-distance and cross-serial dependencies, as in examples (1) and (2).

(1) Long-distance displacement:

    What did you see __?

    (the displaced object what depends on the gap position after see)

(2) Cross-serial dependencies (Swiss German; Shieber, 1985):

    …mer d’chind em Hans es huus lönd hälfe aastriiche
    …we the children Hans the house let help paint
    ’…we let the children help Hans paint the house’

    (the object dependencies lönd–d’chind, hälfe–em Hans, and aastriiche–es huus cross one another)

Some of these dependencies cannot be expressed by simple context free grammars. The class of grammars which generate mildly context sensitive languages, first described by Joshi (1985), is best suited for describing natural language, as these grammars are capable of expressing long-distance and cross-serial dependencies in a meaningful way which captures the generalizations put forth by linguists. Many different formalisms have emerged within this class and have been shown to be equivalent, including tree adjoining grammars (Joshi et al., 1969), combinatory categorial grammars (Steedman, 1987; Szabolcsi, 1992), and the more expressive linear context free rewriting systems (Weir, 1988), multiple context free grammars (Seki et al., 1991), and minimalist grammars (Stabler, 1997; Harkema, 2001). Some of these grammars have been formally implemented as probabilistic language models.

Probabilistic language models allow us to build parsing systems which can capture the right semantic interpretations and to develop machines which interact with people more fluently. Such models are adaptable to new inputs and can help resolve both the lexical and syntactic ambiguities which exist in natural language, as in the examples below, by presenting us with a scalable metric of "goodness" for parses, where the evaluation metric stems from a previously motivated and well-understood detailed theory. Furthermore, such models can be used to relate our grammar to quantitative empirical data (e.g. processing times, corpus frequencies, MEG/fMRI data, etc.). From a theoretical point of view, probabilistic language models present an environment for testing the learnability of syntactic theories.

Lexical ambiguity

  a. The sweater was worn (by Mary). - passive verb

  b. The sweater was (very) worn. - adjective

Syntactic ambiguity

  a. I watched [the movie with Jim Carrey]. - Jim Carrey acts in the movie.

  b. I [watched [the movie] [with Jim Carrey]]. - Jim Carrey and I watched a movie together.

Probabilistic inference for grammar induction has been applied to combinatory categorial grammars (Bisk & Hockenmaier, 2013, 2012a, 2012b; Bisk, 2015; Wang, 2016) and tree substitution grammars (Blunsom & Cohn, 2010), but none of these use a variational Bayesian approach. Given the complexity of grammar generative models, this approach may be seen as favorable, as it turns an inference problem into an optimization problem and avoids the issues which arise from computing the posterior directly. The method has been applied to context free grammars (Kurihara & Sato, 2004) and adaptor grammars (Cohen & Smith, 2010), but has yet to be applied to the wider class of grammars which derive mildly context sensitive languages, including minimalist grammars. This technical report presents the first steps towards defining a generative model and a variational inference updating protocol for such grammars. In the first section, we define a minimalist grammar; in the following section, we present the generative model for this grammar. In the third section, we present the current formalization of the mean-field variational Bayes inference algorithm applied to the grammar described in the previous section, along with a set of conditions for its application to other equivalent formalisms.

2 Minimalist grammars

Minimalist grammars are a formalization of Chomsky (1995)'s The Minimalist Program, the framework used for much of the current research in linguistic syntactic theory. The interest in studying these grammars is that they offer a direct line of comparison for resulting syntactic structure predictions within the linguistic minimalist literature. In the subsections which follow, we present a working definition of a minimalist grammar.

2.1 Formalization of a minimalist grammar

The following formalization is based on the work of Harkema (2001) and Stabler (2011)'s 'directional minimalist grammars'. A minimalist grammar is composed of a lexicon Lex and structure building operations Ω, typically classified as merge and move.

Let G be a minimalist grammar, such that G = (Lex, Ω).

  • Lex is a finite set of lexical items l, defined as l = π :: δ, where π is a phonological feature (or a word in our case) and δ is a sequence of syntactic features which interact with the structure building operations and are categorized into four types as follows:

    1. category (e.g. v, d, p) - define the syntactic categories (verb, noun …);

    2. selector (e.g. d=, =p) - select an argument constituent to the left or right;

    3. licensor (e.g. +case, +wh) - select a moving constituent;

    4. licensee (e.g. -case, -wh) - mark a selected moving constituent.

  • Ω is the set of structure building operations which allow lexical items to merge together into larger units of meaning. They are defined further below.

Each lexical item is categorized by one category feature. If a lexical item has selector features, these are checked by merging with constituents of the corresponding category feature. A licensor feature is used to move a previously merged constituent with the corresponding licensee feature.
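To make the feature typology concrete, here is one possible encoding of lexical items in Python; the class and function names are ours, not part of the formalism, and the feature-string conventions simply mirror the notation used in this section:

```python
from dataclasses import dataclass

# Hypothetical encoding of MG lexical items: an item pairs a phonological
# form with an ordered tuple of syntactic features drawn from the four types.
@dataclass(frozen=True)
class LexicalItem:
    phon: str        # phonological form; '' for phonologically empty heads
    features: tuple  # e.g. ('=d', 'd=', 'v') for 'see'

def feature_type(f: str) -> str:
    """Classify a feature string into one of the four MG feature types."""
    if f.startswith('=') or f.endswith('='):
        return 'selector'   # selects an argument to the right / left
    if f.startswith('+'):
        return 'licensor'   # attracts a moving constituent
    if f.startswith('-'):
        return 'licensee'   # marks a constituent that will move
    return 'category'       # plain category feature: v, d, i, c, ...

# The toy lexicon of the 'what did you see?' example later in this section.
LEX = [
    LexicalItem('what', ('d', '-wh')),
    LexicalItem('see',  ('=d', 'd=', 'v')),
    LexicalItem('you',  ('d',)),
    LexicalItem('did',  ('=v', 'i')),
    LexicalItem('',     ('=i', '+wh', 'c')),
]
```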

Ω contains two structure building operations, merge defined in (1) and move defined in (2). These operations apply to constituents, both simple (a single lexical item, marked by '::') and complex (a series of lexical items which have been merged, marked by a single ':'). The operations are defined as follows:

Let x and y be category features, δ and δ′ be lists of syntactic features, and s and t be constituents.

  1. merge

    • merge-L: [s : x= δ], [t : x] → [t s : δ] (merge a non-moving item to the left)

    • merge-R: [s : =x δ], [t : x] → [s t : δ] (merge a non-moving item to the right)

    • merge-m: [s : =x δ or x= δ], [t : x δ′] → [s : δ, t : δ′] (merge an eventually moving item; δ′ non-empty)

  2. move

    • move-1: [s : +y δ, t : -y] → [t s : δ] (moving item reaches its final landing position)

    • move-2: [s : +y δ, t : -y δ′] → [s : δ, t : δ′] (moving item which will move again; δ′ non-empty)

Here, pairs such as (t : δ′) represent constituents which have previously merged into the derivation using merge-m but still carry licensee features - i.e. items which have not reached their final landing position in the syntactic structure.
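A minimal runnable sketch of the two directional merge cases, under an assumed (string, feature-tuple) encoding of constituents; move and the tuples of moving items are deliberately ignored here, so this is only an illustration of feature checking, not a full implementation of Ω:

```python
# Sketch of directional merge, assuming '=x' selects its argument to the
# right and 'x=' selects to the left, consistent with the derivation of
# 'what did you see?' in this section. A constituent is a
# (string, remaining_features) pair.

def merge(head, arg):
    """Check the head's first feature against the argument's category;
    return the derived constituent, or None if the features do not match."""
    h_str, h_feats = head
    a_str, a_feats = arg
    if not h_feats or not a_feats:
        return None
    sel, rest = h_feats[0], h_feats[1:]
    cat = a_feats[0]
    if sel == '=' + cat:                      # merge-R: argument to the right
        return ((h_str + ' ' + a_str).strip(), rest)
    if sel == cat + '=':                      # merge-L: argument to the left
        return ((a_str + ' ' + h_str).strip(), rest)
    return None

# merge-R: 'did :: =v i' selects a v-headed constituent to its right
assert merge(('did', ('=v', 'i')), ('you see', ('v',))) == ('did you see', ('i',))
# merge-L: a 'd= v' head selects 'you' to its left
assert merge(('see', ('d=', 'v')), ('you', ('d',))) == ('you see', ('v',))
```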

The figure in (2.1) is an example of a derivation in this system using Ω and the following lexical items. Derivation of 'what did you see?'.

Lex = { what :: d -wh , see :: =d d= v , you :: d , did :: =v i , ε :: =i +wh c }



(what did you see : c)                       move-1
└─ (did you see : +wh c, what : -wh)         merge-R
   ├─ ε :: =i +wh c
   └─ (did you see : i, what : -wh)          merge-R
      ├─ did :: =v i
      └─ (you see : v, what : -wh)           merge-L
         ├─ you :: d
         └─ (see : d= v, what : -wh)         merge-m
            ├─ see :: =d d= v
            └─ what :: d -wh

  1. merge-m: v selects d: what;

  2. merge-L: v selects d: you;

  3. merge-R: i selects v: you see, what;

  4. merge-R: c selects i: did you see, what;

  5. move-1: what (-wh) moves to satisfy +wh: what did you see;

  6. All features are satisfied.

Thus, this minimalist grammar allows us to represent long-distance displacement, as this example shows. In the next section we develop a generative model based on this minimalist grammar.

3 The generative model for a minimalist grammar

Developing a generative model for minimalist grammars is not straightforward, for two reasons. The first is that minimalist grammars have always been defined 'bottom up': they begin with a set of lexical items (called the enumeration in Chomsky (1995)'s original theory) and use the structure building rules to derive a tree rooted by a single category feature. However, when sampling, we need to adopt a 'top-down' approach to the derivation problem to make sure we only sample derivation trees which terminate - i.e. are rooted by a single category feature. The second problem we face is how to handle move. This operation complicates things significantly because it introduces the concept of tuples of items. For this reason, in our initial implementation we have decided to define a generative model which excludes move. However, we hope to update this model to include all the minimalist structure building operations soon. Essentially, this means that the following generative model is equivalent to a context free grammar.

Let c be an arbitrary category feature. We define a Dirichlet prior, parameterized by α_c, over a distribution θ_c for each category of lexical items:

θ_c ∼ Dirichlet(α_c)

θ_c is the probability distribution over all possible feature sets which contain the category c, such that Σ_l θ_{c,l} = 1. Furthermore, θ_{c,l} is the probability of the lexical item l of category c, and cat(l) is a function which returns the category feature of l.

We are explicitly sampling a distribution over each category from a Dirichlet. Effectively, our probabilistic minimalist grammar is defined as G = (Lex, Ω, θ).
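The per-category Dirichlet draw can be sketched with the stdlib random module; ALPHA, sample_theta, and the toy category-to-items map are illustrative names of ours, not the paper's:

```python
import random

# Sketch of the Dirichlet draw: for each category c, sample a distribution
# theta_c over the lexical items of that category by normalizing independent
# Gamma draws (a standard way to sample a symmetric Dirichlet).
ALPHA = 1.0  # symmetric Dirichlet hyperparameter (illustrative value)

def sample_theta(lexicon_by_cat, alpha=ALPHA, rng=random):
    """Return {category: {item: probability}} with each inner dict summing to 1."""
    theta = {}
    for cat, items in lexicon_by_cat.items():
        draws = [rng.gammavariate(alpha, 1.0) for _ in items]
        total = sum(draws)
        theta[cat] = {item: d / total for item, d in zip(items, draws)}
    return theta

# toy move-free lexicon grouped by category feature
lexicon_by_cat = {'d': ['what', 'you'], 'v': ['see'], 'i': ['did']}
theta = sample_theta(lexicon_by_cat)
```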

We can generate a derivation tree using the following recursive structure. We define the function sel(l) and the function cat(l), which take a lexical item l as input and return, respectively, the selector features and the category feature of that item. Furthermore, we define a function lex(c) which takes a category feature c as input and returns a lexical item of that category. We define D_l to be the derivation headed by the lexical item l. Let sel(l) = {=c_1, …, =c_k} be its selector features. Then we can recursively sample structures for derived constituents using the following equation:

p(D_l) = θ_{cat(l), l} · ∏_{i=1}^{k} p(D_{lex(c_i)})

Concretely, this returns the probability of the derivation D_l, which is the product of the probabilities of all the lexical items which compose it.
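Under an assumed (category, item, children) tuple encoding of derivation trees, the top-down sampler and this recursive probability product can be sketched as follows; the move-free toy lexicon and all function names are ours, for illustration only:

```python
import random

# Toy sketch of the move-free (CFG-equivalent) generative model: given a
# category, draw a lexical item from theta_c, then recurse into the
# categories named by its selector features.
LEXICON = {
    'what': ('d',), 'you': ('d',),
    'see': ('=d', 'v'),          # simplified transitive verb, no move
    'did': ('=v', 'i'),
}

def cat_of(item):
    """cat(l): the unique plain category feature of a lexical item."""
    return next(f for f in LEXICON[item]
                if f[0] not in '=+-' and not f.endswith('='))

def selectors_of(item):
    """sel(l): the categories selected by the item's selector features."""
    return [f.strip('=') for f in LEXICON[item]
            if f.startswith('=') or f.endswith('=')]

def sample_derivation(cat, theta, rng=random):
    """Top-down sampling: returns a (category, item, children) tree."""
    items, probs = zip(*theta[cat].items())
    item = rng.choices(items, probs)[0]
    return (cat, item, [sample_derivation(c, theta, rng) for c in selectors_of(item)])

def derivation_prob(tree, theta):
    """p(D_l) = theta_{cat(l),l} times the product over selected sub-derivations."""
    cat, item, children = tree
    p = theta[cat][item]
    for child in children:
        p *= derivation_prob(child, theta)
    return p

theta = {'d': {'what': 0.5, 'you': 0.5}, 'v': {'see': 1.0}, 'i': {'did': 1.0}}
tree = ('i', 'did', [('v', 'see', [('d', 'you', [])])])
assert derivation_prob(tree, theta) == 0.5   # 1.0 * 1.0 * 0.5
```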

Finally, we define X = (x_1, …, x_n), our corpus, and D_{x_i}, the set of possible derivations for each sentence x_i. We can now define the posterior we are interested in as follows:

p(θ, D | X) = p(D, X | θ) p(θ) / p(X)

Given that the denominator of this fraction is not a tractable problem, using variational Bayesian inference to approximate the posterior represents a feasible way of doing inference for this model.

4 Variational Bayesian inference for minimalist grammar induction

Mean-field variational inference is based on the assumption that all variables in the approximated posterior are independent, and therefore the posterior can be factorized:

q(θ, D) = q(θ) ∏_{i=1}^{n} q(D_{x_i})

where D_{x_i} is the set of possible derivations of x_i.

We define an incremental updating scheme for q as follows. First, we update q(θ). For each category c, q(θ_c) is a Dirichlet with updated parameters α̂_c:

q(θ_c) = Dirichlet(θ_c; α̂_c)

α̂_{c,l} = α_{c,l} + Σ_{i=1}^{n} Σ_{D ∈ D_{x_i}} q(D) f(l, D)

α is a hyperparameter of the Dirichlet prior, D_{x_i} is the set of possible derivations for x_i, and f(l, D) returns the count of l in derivation D, given the current estimate of the grammar.

Second, we update q(D):

q(D) ∝ ∏_l exp( f(l, D) · E_q[log θ_{cat(l), l}] )

E_q[log θ_{c,l}] = ψ(α̂_{c,l}) − ψ(Σ_{l′} α̂_{c,l′})

where ψ is the digamma function.

By updating q(θ) and q(D) incrementally, and given that they are codependent, we can optimize q.
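The two coordinate updates can be sketched as follows, in the style of the Kurihara & Sato (2004) scheme for PCFGs; the helper names and toy counts are illustrative, and digamma is a stdlib-only approximation of ψ:

```python
import math

def digamma(x):
    """Stdlib-only psi(x) for x > 0: push the argument up by the recurrence
    psi(x) = psi(x+1) - 1/x, then apply a short asymptotic expansion."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    return result + math.log(x) - 1.0 / (2 * x) - 1.0 / (12 * x * x)

def update_q_theta(alpha, expected_counts):
    """First update: alpha_hat_{c,l} = alpha_{c,l} + expected usage counts,
    for one category c at a time."""
    return {item: alpha + n for item, n in expected_counts.items()}

def expected_log_theta(alpha_hat):
    """Second update's key quantity:
    E_q[log theta_{c,l}] = psi(alpha_hat_l) - psi(sum_l' alpha_hat_l')."""
    total = sum(alpha_hat.values())
    return {item: digamma(a) - digamma(total) for item, a in alpha_hat.items()}

# one round for a toy category 'd': symmetric prior alpha = 1.0,
# expected counts aggregated from the current q(D)
alpha_hat = update_q_theta(1.0, {'what': 3.0, 'you': 1.0})
elog = expected_log_theta(alpha_hat)
```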

5 Variational inference for mildly context sensitive language grammars

Given that variational Bayesian inference transforms an inference problem into an optimization one, unlike sampling-based methods of inference, the method can easily be translated to work for other models, as long as those models satisfy certain assumptions. The previous example of variational inference applied to a minimalist grammar was built on three assumptions.

Assumptions for generalized variational Bayesian inference for mildly context sensitive language grammars:

  1. The latent derivation trees are context-free;

  2. The probabilities of the trees in (1) can be decomposed into the product of their components’ probabilities;

  3. There is a Dirichlet prior over lexical items/rules.

The first assumption supposes that each lexical item is well-formed and conditionally independent; however, the linearization assumption usually made for CFGs is not required of mildly context sensitive language grammar derivation trees. The variational inference updating protocol presented here can be generalized to any lexicalized mildly context sensitive language grammar formalism that satisfies these three assumptions.

6 Conclusion

In this paper, we presented the formalization of a minimalist grammar and a generative model for this grammar, as well as a generalized variational Bayesian approach to mildly context sensitive language grammar induction. We demonstrated how this approach can be applied to a specific grammar instantiation within this equivalence class, using the case of a minimalist grammar. In future work, we hope to implement a computable version of this algorithm, to be integrated into a parsing framework which is also currently being developed (Harasim et al., 2017). This new framework will then be used to test linguistic syntactic theories and provide a baseline for future development of unsupervised language learning models, as well as more 'human'-like natural language processing applications.


  • Bisk, Y. & Hockenmaier, J. (2012a). Induction of linguistic structure with combinatory categorial grammars. In Proceedings of the NAACL-HLT Workshop on the Induction of Linguistic Structure (pp. 90–95).
  • Bisk, Y. & Hockenmaier, J. (2012b). Simple robust grammar induction with combinatory categorial grammars. In AAAI.
  • Bisk, Y. & Hockenmaier, J. (2013). An HDP model for inducing combinatory categorial grammars. Transactions of the Association for Computational Linguistics, 1, 75–88.
  • Bisk, Y. (2015). Unsupervised Grammar Induction with Combinatory Categorial Grammars (doctoral dissertation). University of Illinois at Urbana-Champaign.
  • Blunsom, P. & Cohn, T. (2010). Unsupervised induction of tree substitution grammars for dependency parsing. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (pp. 1204–1213).
  • Chomsky, N. (1995). The Minimalist Program (Current Studies in Linguistics 28). Cambridge, MA: MIT Press.
  • Cohen, S. B. & Smith, N. A. (2010). Covariance in unsupervised learning of probabilistic grammars. Journal of Machine Learning Research, 11, 3017–3051.
  • Harasim, D., Bruno, C., Portelance, E. & O'Donnell, T. J. (2017). A Generalized Parsing Framework for Abstract Grammars (technical report). arXiv:1710.11301.
  • Harkema, H. (2001). Parsing Minimalist Languages (doctoral dissertation). University of California, Los Angeles.
  • Joshi, A. K. (1985). Tree adjoining grammars: How much context-sensitivity is required to provide reasonable structural descriptions? In D. R. Dowty, L. Karttunen & A. M. Zwicky (Eds.), Natural Language Parsing. Cambridge University Press.
  • Joshi, A. K., Kosaraju, S. R. & Yamada, H. (1969). String adjunct grammars. In 10th Annual Symposium on Switching and Automata Theory (SWAT 1969) (pp. 245–262). doi:10.1109/SWAT.1969.23
  • Kurihara, K. & Sato, T. (2004). An application of the variational Bayesian approach to probabilistic context-free grammars. In IJCNLP-04 Workshop Beyond Shallow Analyses.
  • Seki, H., Matsumura, T., Fujii, M. & Kasami, T. (1991). On multiple context-free grammars. Theoretical Computer Science, 88(2), 191–229.
  • Shieber, S. M. (1985). Evidence against the context-freeness of natural language. In The Formal Complexity of Natural Language (pp. 320–332).
  • Stabler, E. (1997). Derivational minimalism. In Logical Aspects of Computational Linguistics (pp. 68–95).
  • Stabler, E. P. (2011). Computational perspectives on minimalism. In Oxford Handbook of Linguistic Minimalism (pp. 617–643).
  • Steedman, M. (1987). Combinatory grammars and parasitic gaps. Natural Language & Linguistic Theory, 5(3), 403–439.
  • Szabolcsi, A. (1992). Combinatory grammar and projection from the lexicon. In Lexical Matters.
  • Wang, A. X. (2016). Linguistically Motivated Combinatory Categorial Grammar Induction (doctoral dissertation). University of Washington.
  • Weir, D. J. (1988). Characterizing Mildly Context-Sensitive Grammar Formalisms (doctoral dissertation). University of Pennsylvania.