The Importance of Category Labels in Grammar Induction with Child-directed Utterances

by Lifeng Jin, et al.
The Ohio State University

Recent progress in grammar induction has shown that grammar induction is possible without explicit assumptions of language-specific knowledge. However, evaluation of induced grammars has usually ignored phrasal labels, an essential part of a grammar. Experiments in this work using a labeled evaluation metric, RH, show that linguistically motivated predictions about grammar sparsity and use of categories can only be revealed through labeled evaluation. Furthermore, depth-bounding as an implementation of human memory constraints in grammar inducers is still effective with labeled evaluation on multilingual transcribed child-directed utterances.




1 Introduction

Recent work in probabilistic context-free grammar (PCFG) induction has shown that it is possible to learn accurate grammars from raw text Jin et al. (2018b, 2019); Kim et al. (2019), which is significant in addressing the issue of the poverty of the stimulus Chomsky (1965, 1980) in linguistics. Although phrasal categories and morphosyntactic features can be induced from raw text Jin and Schuler (2019); Jin et al. (2019), most unsupervised parsing work has been evaluated using unlabeled parsing accuracy scores Seginer (2007); Ponvert et al. (2011); Jin et al. (2018b); Shen et al. (2018, 2019); Shi et al. (2019). This is potentially distorting because children and adults can distinguish categories of phrases and clauses Tomasello and Olguin (1993); Valian (1986); Kemp et al. (2005); Pine et al. (2013), and much acquisition modeling research has been directed at simulating the development of abstract linguistic categories in first language acquisition Bannard et al. (2009); Perfors et al. (2011); Kwiatkowski et al. (2012); Abend et al. (2017); Jin et al. (2018b).

Recent work proposed a labeled parsing accuracy evaluation metric called Recall-V-Measure (RVM) as a method for evaluating unsupervised grammar inducers Jin et al. (2019), but this metric counts categories as incorrect if they are finer-grained than reference categories or if they represent binarizations of n-ary branches in reference trees, which may be linguistically acceptable. We therefore further modify it to Recall-Homogeneity (RH), calculated as the homogeneity Rosenberg and Hirschberg (2007) of the labels of matching constituents of the induced and gold trees, weighted by unlabeled recall. This work uses transcribed child-directed utterances from multiple languages as input to a grammar inducer with hyperparameters tuned using either unlabeled F1 or labeled RH. Results show that: (1) the induced grammars capture the preference of sparse concentrations in human grammars only when using labeled evaluation; (2) grammar accuracy increases as the number of labels grows only when using labeled evaluation; (3) depth-bounding (Jin et al., 2018a, limiting center embedding) is still effective when tuned to maximize labeled parsing accuracy.

2 Model

Figure 1: Different evaluation metrics on the Adam dataset with different values of the concentration hyperparameter. Panels: (a) unlabeled F1, (b) homogeneity, (c) RH, (d) V-measure, (e) RVM.

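All experiments described in this paper use a Bayesian Dirichlet-multinomial model Jin et al. (2018a) to induce PCFGs without assuming any language-specific knowledge. This model defines a Chomsky normal form (CNF) PCFG with $C$ nonterminal categories as a matrix $\mathbf{G}$ of binary rule probabilities, which is first drawn from the Dirichlet prior with a concentration parameter $\beta$:

$$\mathbf{G} \sim \mathrm{Dirichlet}(\beta)$$

Trees for sentences $1..N$ in a corpus are then drawn from a PCFG parameterized by $\mathbf{G}$:

$$\tau_{1..N} \sim \mathrm{PCFG}(\mathbf{G})$$

and each tree $\tau$ is a set $\{\tau_{\epsilon}, \tau_{1}, \tau_{2}, \tau_{11}, \tau_{12}, \tau_{21}, \ldots\}$ of category node labels $\tau_{\eta}$, where $\eta \in \{1, 2\}^{*}$ defines a path of left or right branches from the root to that node. Category labels for every pair of left and right children $\tau_{\eta 1}, \tau_{\eta 2}$ are drawn from a multinomial distribution defined by the grammar $\mathbf{G}$ and the category of the parent $\tau_{\eta}$:

$$\tau_{\eta 1}, \tau_{\eta 2} \sim \mathrm{Multinomial}(\delta_{\tau_{\eta}}^{\top} \mathbf{G})$$

where $\delta_{x}$ is a Kronecker delta function equal to 1 at value $x$ and 0 elsewhere. Terminal expansions are treated as expanding into a terminal node followed by a special null node.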

Inference in this model uses Gibbs sampling to produce samples of grammars and trees, and the most probable parses are obtained with the Viterbi algorithm.
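The Viterbi step can be illustrated with a minimal CKY-style sketch for a CNF PCFG. This is not the inducer used in the experiments: the toy grammar, the category names, the `root` argument and the dictionary-based rule tables are all assumptions made for illustration, and the Gibbs sampler that produces the grammar is not shown.

```python
import math
from collections import defaultdict

def viterbi_parse(words, lexical, binary, root="S"):
    """Return (log probability, tree) of the best parse under a CNF PCFG, or None."""
    n = len(words)
    # best[(i, j)][category] = (log probability, backpointer)
    best = defaultdict(dict)
    for i, word in enumerate(words):
        for (cat, terminal), prob in lexical.items():
            if terminal == word:
                best[(i, i + 1)][cat] = (math.log(prob), word)
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for (parent, left, right), prob in binary.items():
                    if left in best[(i, k)] and right in best[(k, j)]:
                        score = (math.log(prob)
                                 + best[(i, k)][left][0]
                                 + best[(k, j)][right][0])
                        cell = best[(i, j)]
                        if parent not in cell or score > cell[parent][0]:
                            cell[parent] = (score, (k, left, right))
    if root not in best[(0, n)]:
        return None

    def build(i, j, cat):
        # Follow backpointers to rebuild the tree as nested tuples.
        _, back = best[(i, j)][cat]
        if isinstance(back, str):
            return (cat, back)
        k, left, right = back
        return (cat, build(i, k, left), build(k, j, right))

    return best[(0, n)][root][0], build(0, n, root)

# Hypothetical toy grammar; an induced grammar would use anonymous category indices.
lexical = {("D", "the"): 1.0, ("N", "dog"): 0.5, ("N", "ball"): 0.5, ("V", "sees"): 1.0}
binary = {("S", "NP", "VP"): 1.0, ("NP", "D", "N"): 1.0, ("VP", "V", "NP"): 1.0}
words = "the dog sees the ball".split()
log_prob, tree = viterbi_parse(words, lexical, binary)
print(log_prob, tree)
```

The chart stores only the single best analysis per category and span, which is what distinguishes Viterbi decoding from the sampling step used during inference.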

3 Data and hyperparameters

Experiments here use transcribed child-directed utterances from the CHILDES corpus MacWhinney (1992) in three languages with more than 15,000 sentences each. English hand-annotated constituency trees are taken from the Adam and Eve portions of the Brown Corpus Brown (1973). Mandarin (Tong, Deng et al., 2018) and German (Leo, Behrens, 2006) data are collected from CHILDES with reference trees automatically generated using the state-of-the-art Kitaev and Klein (2018) parser. Disfluencies are removed, and only sentences spoken by caregivers are kept in the data. Following previous work Jin et al. (2018a), each model is run 10 times with random seeds for 700 iterations each. The last sampled grammar is used to generate Viterbi parses for all sentences. All punctuation is retained during induction and then removed in evaluation. Significance testing uses permutation tests on concatenations of Viterbi trees from all test runs. We use Adam for exploratory experiments and the other three sets for confirmatory experiments.
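As a rough illustration of a permutation test on parser output, the sketch below runs a paired test on corpus-level unlabeled F1. It is a generic approximate randomization procedure, not necessarily the exact one used here: the per-sentence `(matched, predicted, gold)` bracket counts, the function names, and the number of trials are all assumptions.

```python
import random

def corpus_f1(counts):
    """Corpus-level F1 from per-sentence (matched, predicted, gold) bracket counts."""
    matched = sum(m for m, _, _ in counts)
    predicted = sum(p for _, p, _ in counts)
    gold = sum(g for _, _, g in counts)
    if matched == 0:
        return 0.0
    precision, recall = matched / predicted, matched / gold
    return 2 * precision * recall / (precision + recall)

def permutation_test(counts_a, counts_b, trials=2000, seed=0):
    """Two-sided paired permutation test on the F1 difference of two systems."""
    rng = random.Random(seed)
    observed = abs(corpus_f1(counts_a) - corpus_f1(counts_b))
    at_least_as_extreme = 0
    for _ in range(trials):
        shuffled_a, shuffled_b = [], []
        for a, b in zip(counts_a, counts_b):
            # Under the null hypothesis the two parses of a sentence are exchangeable.
            if rng.random() < 0.5:
                a, b = b, a
            shuffled_a.append(a)
            shuffled_b.append(b)
        if abs(corpus_f1(shuffled_a) - corpus_f1(shuffled_b)) >= observed:
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (trials + 1)

# Hypothetical counts: system A matches every gold bracket, system B one in four.
system_a = [(4, 4, 4)] * 50
system_b = [(1, 4, 4)] * 50
p_value = permutation_test(system_a, system_b)
print(p_value)
```

The swap step relies on the two outputs for a sentence being exchangeable, which holds for unlabeled brackets but not for induced labels whose indices mean different things in different runs.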

Figure 2: Different evaluation metrics on the WSJ20Dev dataset with different values of the concentration hyperparameter. Panels: (a) unlabeled F1, (b) RH.
Figure 3: Different evaluation metrics on the Adam dataset with different numbers of categories at high and low values of the concentration hyperparameter. Panels: (a) unlabeled F1 and (b) RH at the high value, (c) RH at the low value.
Figure 4: Depth-bounding on Adam.

3.1 Recall-Homogeneity

RH is calculated by multiplying the unlabeled recall of bracketed spans in the predicted Viterbi trees by the homogeneity score Rosenberg and Hirschberg (2007) of the predicted labels of the matching spans. This is different from RVM (Jin et al., 2019), which is the product of unlabeled recall and V-measure. Because it uses unlabeled recall, the metric is insensitive to the branching factor of the grammar. Unlike RVM, it is also insensitive to the precision of predicted labels with respect to gold labels, meaning that models are not penalized for hypothesizing more refined categories, as long as each of these categories falls within the confines of a gold category. RVM, on the other hand, penalizes both underproposing and overproposing categories relative to the ones in the annotation, but the gold categories, like nouns and verbs, are defined at a very high level that languages almost always further specify, usually represented as subcategories or features in linguistic theories. Unary branches in gold and predicted trees are removed, and the top category is used as the category for the constituent.
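A minimal sketch of the RH computation follows, assuming spans are keyed by hypothetical `(sentence_id, start, end)` tuples and that unary removal and punctuation stripping have already been applied. The handling of edge cases (no matched spans, a single gold label) and the choice of which spans to count are assumptions, not details given in the text.

```python
import math
from collections import Counter

def homogeneity(gold_labels, pred_labels):
    """Homogeneity (Rosenberg and Hirschberg, 2007): 1 - H(gold | pred) / H(gold)."""
    n = len(gold_labels)
    if n == 0:
        return 1.0  # assumption: no matched spans means nothing is impure
    gold_counts = Counter(gold_labels)
    pred_counts = Counter(pred_labels)
    joint_counts = Counter(zip(gold_labels, pred_labels))
    h_gold = -sum(c / n * math.log(c / n) for c in gold_counts.values())
    if h_gold == 0.0:
        return 1.0  # a single gold class is trivially homogeneous
    h_gold_given_pred = -sum(
        c / n * math.log(c / pred_counts[pred])
        for (_, pred), c in joint_counts.items()
    )
    return 1.0 - h_gold_given_pred / h_gold

def recall_homogeneity(gold_spans, pred_spans):
    """RH: unlabeled recall times homogeneity of labels on the matching spans."""
    matched = [span for span in gold_spans if span in pred_spans]
    recall = len(matched) / len(gold_spans) if gold_spans else 0.0
    gold_labels = [gold_spans[span] for span in matched]
    pred_labels = [pred_spans[span] for span in matched]
    return recall * homogeneity(gold_labels, pred_labels)

# Hypothetical example: the induced grammar splits NP into two finer categories.
gold = {(0, 0, 2): "NP", (0, 2, 5): "VP", (0, 3, 5): "NP", (0, 0, 5): "S"}
fine = {(0, 0, 2): 3, (0, 2, 5): 1, (0, 3, 5): 7, (0, 0, 5): 0}
merged = {span: 0 for span in gold}
partial = {(0, 0, 2): 3, (0, 0, 5): 0}
print(recall_homogeneity(gold, fine))     # finer-grained labels are not penalized
print(recall_homogeneity(gold, merged))   # one label for everything scores zero
print(recall_homogeneity(gold, partial))  # half the gold spans are recovered
```

In this toy case the finer-grained labeling keeps a perfect score, which is the behaviour that separates RH from RVM, while collapsing all constituents into one category drives the score to zero.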

4 Experiments

4.1 Experiment 1: Labeled evaluation shows preference of grammar sparsity

Human grammars are sparse Johnson et al. (2007); Goldwater and Griffiths (2007). For example, in the Penn Treebank Marcus et al. (1993), there are 73 unique nonterminal categories. In theory, there can be more than 28 million possible unary, binary and ternary branching rules in the grammar. However, only 17,020 unique rules are found in the corpus, showing the high sparsity of attested rules. In other frameworks like Combinatory Categorial Grammar Steedman (2002) where lexical categories can be in the thousands, the number of attested lexical categories is still small compared to all possible ones.
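The size of the rule space can be checked with a short calculation. The counting scheme is an assumption, since the text does not spell it out: each rule has one of the 73 categories as parent and one, two or three of the same categories as children, and terminal expansions are ignored.

```python
num_categories = 73
attested_rules = 17020

# A rule with k children has 73 choices for the parent and 73**k for the children.
possible = {arity: num_categories * num_categories ** arity for arity in (1, 2, 3)}
total_possible = sum(possible.values())
attested_fraction = attested_rules / total_possible

print(possible)
print(total_possible)
print(f"{attested_fraction:.4%} of possible rules are attested")
```

Under this counting scheme the total comes to 28,792,587 rules, nearly all of them ternary, so the attested rules make up well under a tenth of a percent of the space.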

The Dirichlet concentration hyperparameter in the model controls the probability of a sampled multinomial distribution concentrating its probability mass on only a few items. Previous work using similar models usually sets this value low Johnson et al. (2007); Goldwater and Griffiths (2007); Graça et al. (2009); Jin et al. (2018b) to prefer sparse grammars (i.e. grammars in which most of the probability mass is allocated to a small number of rules), with good results. The prediction based on this preference for sparsity is that the best concentration value should be much lower than 1.
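The effect of the concentration value on sparsity can be seen in a small simulation. This is only a sketch: the rule-set size of 100, the two concentration values, and the top-5 mass summary are illustrative choices rather than settings from the experiments, and a symmetric Dirichlet is assumed.

```python
import random

def sample_symmetric_dirichlet(concentration, size, rng):
    """Draw one multinomial from a symmetric Dirichlet via normalized Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_top_mass(concentration, size=100, top=5, samples=200, seed=0):
    """Average probability mass held by the `top` most probable rules."""
    rng = random.Random(seed)
    masses = []
    for _ in range(samples):
        probs = sample_symmetric_dirichlet(concentration, size, rng)
        masses.append(sum(sorted(probs, reverse=True)[:top]))
    return sum(masses) / samples

sparse = mean_top_mass(0.1)   # concentration well below 1
dense = mean_top_mass(10.0)   # concentration well above 1
print(f"top-5 mass at 0.1: {sparse:.2f}; at 10.0: {dense:.2f}")
```

With a concentration well below 1, a handful of rules carry a large share of the probability mass, whereas a concentration well above 1 spreads the mass almost evenly across all rules.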

Figure 1(a) shows unlabeled F1 scores with different concentration values on Adam.[1] Contrary to the prediction, grammar accuracy peaks at high concentration values when measured using unlabeled F1. However, these grammars with high unlabeled F1 are almost purely right-branching grammars, which perform very well on English child-directed speech in unlabeled parsing evaluation, but their phrasal labels do not correlate with human annotation when evaluated with homogeneity, as shown in Figure 1(b). This indicates that instead of capturing human intuitions about syntactic structure, such grammars have only captured broad branching tendencies. The same grammars are evaluated again with RH, shown in Figure 1(c). When both structural and labeling accuracy are taken into account, results correctly capture the intuition that grammar accuracy peaks at a low concentration value. Figures 1(d) and 1(e) show the same experiments evaluated with V-measure (VM) and the labeled evaluation metric RVM. Because of their sensitivity to labeling accuracy, VM and RVM show a trend similar to homogeneity and RH, where labeling quality decreases as the concentration value increases. Jin et al. (2018b) noted that induced grammars high in unlabeled bracketing scores are low in NP discovery scores, which is a category-specific evaluation metric. This can also be explained by the fact that induced grammars with high bracketing scores capture only a broad right-branching bias without accurately clustering words and phrases based on their distributional properties.

[1] The results shown in the figure use 30 categories. We also tested other values from 15 to 105, and the trend is almost identical.
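To see why a purely right-branching analysis can score well on unlabeled bracketing, consider the sketch below. The two example bracketings are invented for illustration and are not taken from the corpora, and counting the whole-sentence span is an assumption about the evaluation.

```python
def right_branching_spans(n):
    """Spans (start, end) of a purely right-branching binary tree over n words."""
    return {(i, n) for i in range(n - 1)}

def unlabeled_recall(gold_spans, pred_spans):
    """Fraction of gold spans that also appear in the prediction."""
    return len(gold_spans & pred_spans) / len(gold_spans)

# Hypothetical gold bracketing of "you want the ball": [you [want [the ball]]]
gold_right = {(0, 4), (1, 4), (2, 4)}
# Hypothetical gold bracketing of "the dog wants it": [[the dog] [wants it]]
gold_mixed = {(0, 4), (0, 2), (2, 4)}

baseline = right_branching_spans(4)
print(unlabeled_recall(gold_right, baseline))
print(unlabeled_recall(gold_mixed, baseline))
```

The baseline recovers every bracket of the first sentence without knowing anything about its categories, which is why unlabeled scores alone cannot tell a right-branching bias apart from genuine syntactic knowledge.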

Figure 2 shows the same experiments on a corpus of formal written English text, the WSJ20dev dataset.[2] The pattern is similar but less extreme than on CHILDES. Higher concentration values in the range of 0.1-0.2 still show better performance on unlabeled F1 than the sparser models, consistent with previous results in Jin et al. (2018b). However, RH scores reveal that the labels induced by the denser models are less accurate, which manifests as a lower peak value of the concentration hyperparameter under RH than under unlabeled F1.

[2] The first half of the Wall Street Journal part of the Penn Treebank, restricted to sentences with 20 words or fewer.

4.2 Experiment 2: Performance increases with the number of categories

Figure 5: Comparison of labeled and unlabeled evaluation of grammars bounded at depth 3 and unbounded grammars on English (Eve), Mandarin Chinese (Tong) and German (Leo) datasets from CHILDES. Panels: (a) depth-bounding on Eve, (b) depth-bounding on Tong, (c) depth-bounding on Leo.

Previous research Jin et al. (2018a) also reported that the number of categories used by the induction models was relatively low compared to the number of categories in human annotation. For example, there are 63 unique tags in the Adam dataset, in contrast to the 30 or fewer categories used in previous induction work. The bias introduced by high concentration values and unlabeled evaluation together may be masking the real relationship between the number of categories and grammar accuracy.

Figures 3(a) and 3(b) show unlabeled and labeled evaluation of grammars induced with different numbers of categories, using the best-performing concentration value on Adam as tuned by unlabeled F1. With F1, increasing the number of categories beyond 30 yields no improvement, as most of the induced grammars are purely right-branching grammars. RH results confirm this: as grammars approach the pure right-branching solution when the number of categories increases, the similarity between induced and gold labels of constituents deteriorates quickly. RH scores from grammars induced with the low concentration value are more indicative of the interaction between the number of categories and grammar accuracy. Grammar accuracy initially increases as the number of categories gets larger and then peaks. The results confirm the importance of labeled evaluation, because the trend from labeled evaluation shows that there should be a sufficient number of categories to account for different syntactic structures, and models with small numbers of categories are limited in their ability to do this.

4.3 Experiment 3: Depth-bounding is still effective with RH

Previous work showed that depth-bounding is effective in helping grammar inducers induce more accurate grammars Shain et al. (2016); Jin et al. (2018a), because it removes from grammar induction inference the parse trees with deeply nested center-embeddings, which cannot be produced by humans due to memory constraints Chomsky and Miller (1963). However, the unlabeled evaluation metric used in previous work may lead to unhelpful conclusions. In order to revisit this claim with labeled evaluation, experiments are first conducted on Adam exploring the interaction between depth and labeled performance, and subsequently on the Eve (English), Tong (Mandarin Chinese) and Leo (German) portions of the CHILDES corpus. All experiments use hyperparameters tuned with RH.[3]

[3] The optimal number of categories is 75 from previous experiments, but we used 30 in all depth-bounding experiments due to hardware constraints at high depth bounds.

Figure 4 shows the interaction between depth and RH scores on Adam. Performance of the unbounded models can be lower than that of all bounded models, showing that unbounded inducers can induce grammars inconsistent with human memory constraints. The labeled performance peaks at depth 3, where models are significantly more accurate than unbounded models. This is consistent with previous results that over 97% of trees in English contain 3 or fewer nested center embeddings Schuler et al. (2010).

Experiments on Eve, Tong and Leo replicate this result. Figure 5 shows that the models bounded at depth 3 are more accurate than unbounded models with both unlabeled and labeled evaluation metrics. Significance testing with unlabeled F1[4] shows that the performance differences across the three datasets are all highly significant. Therefore, the claim that depth-bounding is effective in grammar induction is still supported when the models are developed and evaluated with labeled evaluation.

[4] Neither RH nor RVM was used in permutation significance testing, because labels with the same values from different induced grammars may represent different linguistic categories, and therefore two parses of the same sentence from different runs are not exchangeable.

5 Conclusion

Unlabeled evaluation has been widely used in grammar induction, but experiments presented in this paper show that unlabeled evaluation can reveal unexpected bias in the data, which may lead to unhelpful conclusions compared to labeled evaluation. Results show that trends in sparsity preference and category use that are consistent with linguistic annotation can only be discovered with labeled evaluation. Furthermore, human memory constraints are still effective in grammar induction when labeled evaluation is used throughout all stages of development.


References

  • O. Abend, T. Kwiatkowski, N. J. Smith, S. Goldwater, and M. Steedman (2017) Bootstrapping language acquisition. Cognition 164, pp. 116–143.
  • C. Bannard, E. Lieven, and M. Tomasello (2009) Modeling children’s early grammatical knowledge. Proceedings of the National Academy of Sciences of the United States of America 106 (41), pp. 17284–9.
  • H. Behrens (2006) The input–output relationship in first language acquisition. Language and Cognitive Processes 21 (1-3), pp. 2–24.
  • R. Brown (1973) A first language: The early stages. Harvard U. Press.
  • N. Chomsky and G. A. Miller (1963) Introduction to the formal analysis of natural languages. In Handbook of Mathematical Psychology, pp. 269–321.
  • N. Chomsky (1965) Aspects of the Theory of Syntax. MIT Press, Cambridge, MA.
  • N. Chomsky (1980) On cognitive structures and their development: A reply to Piaget. In Language and learning: the debate between Jean Piaget and Noam Chomsky, M. Piattelli-Palmarini (Ed.), pp. 751–755.
  • X. Deng, V. Yip, B. MacWhinney, S. Matthews, M. Ziyin, Z. Jing, and H. Lam (2018) A Multimedia Corpus of Child Mandarin: The Tong Corpus. Journal of Chinese Linguistics 46 (1), pp. 69–92.
  • S. Goldwater and T. Griffiths (2007) A fully Bayesian approach to unsupervised part-of-speech tagging. Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pp. 744–751.
  • J. V. Graça, K. Ganchev, B. Taskar, and F. Pereira (2009) Posterior vs. Parameter sparsity in latent variable models. In Advances in Neural Information Processing Systems, pp. 664–672.
  • L. Jin, F. Doshi-Velez, T. A. Miller, W. Schuler, and L. Schwartz (2018a) Depth-bounding is effective: Improvements and evaluation of unsupervised PCFG induction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
  • L. Jin, F. Doshi-Velez, T. A. Miller, W. Schuler, and L. Schwartz (2018b) Unsupervised Grammar Induction with Depth-bounded PCFG. Transactions of the Association for Computational Linguistics.
  • L. Jin, F. Doshi-Velez, T. Miller, L. Schwartz, and W. Schuler (2019) Unsupervised Learning of PCFGs with Normalizing Flow. In ACL.
  • L. Jin and W. Schuler (2019) Variance of average surprisal: a better predictor for quality of grammar from unsupervised PCFG induction. In ACL.
  • M. Johnson, T. L. Griffiths, and S. Goldwater (2007) Bayesian Inference for PCFGs via Markov chain Monte Carlo. Proceedings of Human Language Technologies: The Conference of the North American Chapter of the Association for Computational Linguistics, pp. 139–146.
  • N. Kemp, E. Lieven, and M. Tomasello (2005) Young Children’s Knowledge of the “Determiner” and “Adjective” Categories. Journal of Speech, Language, and Hearing Research 48 (June), pp. 592–609.
  • Y. Kim, C. Dyer, and A. M. Rush (2019) Compound Probabilistic Context-Free Grammars for Grammar Induction. In ACL.
  • N. Kitaev and D. Klein (2018) Constituency Parsing with a Self-Attentive Encoder. In ACL.
  • T. Kwiatkowski, S. Goldwater, L. Zettlemoyer, and M. Steedman (2012) A probabilistic model of syntactic and semantic acquisition from child-directed utterances and their meanings. Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 234–244.
  • B. MacWhinney (1992) The CHILDES Project: Tools for Analyzing Talk. Third edition, Lawrence Erlbaum Associates, Mahwah, NJ.
  • M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz (1993) Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics 19 (2), pp. 313–330.
  • A. Perfors, J. B. Tenenbaum, and T. Regier (2011) The learnability of abstract syntactic principles. Cognition 118, pp. 306–338.
  • J. M. Pine, D. Freudenthal, G. Krajewski, and F. Gobet (2013) Do young children have adult-like syntactic categories? Zipf’s law and the case of the determiner. Cognition 127 (3), pp. 345–360.
  • E. Ponvert, J. Baldridge, and K. Erk (2011) Simple unsupervised grammar induction from raw text with cascaded finite state models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 1077–1086.
  • A. Rosenberg and J. Hirschberg (2007) V-measure: A conditional entropy-based external cluster evaluation measure. In Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL).
  • W. Schuler, S. AbdelRahman, T. Miller, and L. Schwartz (2010) Broad-coverage parsing using human-like memory constraints. Computational Linguistics 36 (1), pp. 1–30.
  • Y. Seginer (2007) Fast Unsupervised Incremental Parsing. In Proceedings of the Annual Meeting of the Association of Computational Linguistics, pp. 384–391.
  • C. Shain, W. Bryce, L. Jin, V. Krakovna, F. Doshi-Velez, T. Miller, W. Schuler, and L. Schwartz (2016) Memory-bounded left-corner unsupervised grammar induction on child-directed input. In Proceedings of the International Conference on Computational Linguistics, pp. 964–975.
  • Y. Shen, Z. Lin, C. Huang, and A. Courville (2018) Neural Language Modeling by Jointly Learning Syntax and Lexicon. In ICLR.
  • Y. Shen, S. Tan, A. Sordoni, and A. Courville (2019) Ordered Neurons: Integrating Tree Structures into Recurrent Neural Networks. In ICLR.
  • H. Shi, J. Mao, K. Gimpel, and K. Livescu (2019) Visually Grounded Neural Syntax Acquisition. In ACL.
  • M. Steedman (2002) Formalizing Affordance. In Proceedings of the Annual Meeting of the Cognitive Science Society.
  • M. Tomasello and R. Olguin (1993) Twenty-three-month-old children have a grammatical category of noun. Cognitive Development 8 (4), pp. 451–464.
  • V. Valian (1986) Syntactic Categories in the Speech of Young Children. Developmental Psychology 22 (4), pp. 562–579.