Recent work in probabilistic context-free grammar (PCFG) induction has shown that it is possible to learn accurate grammars from raw text Jin et al. (2018b, 2019); Kim et al. (2019), which is significant in addressing the issue of the poverty of the stimulus Chomsky (1965, 1980) in linguistics. Although phrasal categories and morphosyntactic features can be induced from raw text Jin and Schuler (2019); Jin et al. (2019), most unsupervised parsing work has been evaluated using unlabeled parsing accuracy scores Seginer (2007); Ponvert et al. (2011); Jin et al. (2018b); Shen et al. (2018, 2019); Shi et al. (2019). This is potentially misleading, because children and adults can distinguish categories of phrases and clauses Tomasello and Olguin (1993); Valian (1986); Kemp et al. (2005); Pine et al. (2013), and much of acquisition modeling research has been directed at simulating the development of abstract linguistic categories in first language acquisition Bannard et al. (2009); Perfors et al. (2011); Kwiatkowski et al. (2012); Abend et al. (2017); Jin et al. (2018b).
Recent work proposed a labeled parsing accuracy evaluation metric called Recall-V-Measure (RVM) as a method for evaluating unsupervised grammar inducers Jin et al. (2019), but this metric counts categories as incorrect if they are finer-grained than reference categories or if they represent binarizations of n-ary branches in reference trees, both of which may be linguistically acceptable. We therefore modify it into Recall-Homogeneity (RH), calculated as the homogeneity Rosenberg and Hirschberg (2007) of the labels of matching constituents of the induced and gold trees, weighted by unlabeled recall. This work uses transcribed child-directed utterances from multiple languages as input to a grammar inducer with hyperparameters tuned using either unlabeled F1 or labeled RH. Results show that: (1) the induced grammars capture the preference for sparse concentrations in human grammars only when using labeled evaluation; (2) grammar accuracy increases as the number of labels grows only when using labeled evaluation; (3) depth-bounding (limiting center embedding; Jin et al., 2018a) is still effective when tuned to maximize labeled parsing accuracy.
All experiments described in this paper use a Bayesian Dirichlet-multinomial model Jin et al. (2018a) to induce PCFGs without assuming any language-specific knowledge. This model defines a Chomsky normal form (CNF) PCFG with $C$ nonterminal categories as a matrix $\mathbf{G}$ of binary rule probabilities, which is first drawn from the Dirichlet prior with a concentration parameter $\beta$:

$$\mathbf{G} \sim \mathrm{Dirichlet}(\beta)$$

Trees for sentences $1..N$ in a corpus are then drawn from a PCFG parameterized by $\mathbf{G}$:

$$\tau_{1..N} \sim \mathrm{PCFG}(\mathbf{G})$$

and each tree $\tau$ is a set $\{\tau_{\epsilon}, \tau_{1}, \tau_{2}, \tau_{11}, \tau_{12}, \tau_{21}, \ldots\}$ of category node labels $\tau_{\eta}$, where $\eta \in \{1,2\}^{*}$ defines a path of left or right branches from the root to that node. Category labels for every pair of left and right children $\tau_{\eta 1}, \tau_{\eta 2}$ are drawn from a multinomial distribution defined by the grammar $\mathbf{G}$ and the category of the parent $\tau_{\eta}$:

$$\tau_{\eta 1}, \tau_{\eta 2} \sim \mathrm{Multinomial}(\delta_{\tau_{\eta}}^{\top} \mathbf{G})$$

where $\delta_{x}$ is a Kronecker delta function equal to 1 at value $x$ and 0 elsewhere. Terminal expansions are treated as expanding into a terminal node followed by a special null node.
Inference in this model uses Gibbs sampling to produce samples of grammars and trees with the most probable parses obtained with the Viterbi algorithm.
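The generative story above can be illustrated with a small sampling sketch. This is not the inducer used in the experiments: it only draws a toy grammar and one tree, and it assumes one symmetric Dirichlet draw per parent category. The helper names (`sample_dirichlet`, `sample_grammar`, `sample_tree`, `leaves`), the toy vocabulary, and the `max_height` cutoff that forces termination are all hypothetical choices made for illustration.

```python
import random

def sample_dirichlet(dim, beta, rng):
    """Draw one symmetric Dirichlet(beta) vector by normalizing gamma draws."""
    while True:
        draws = [rng.gammavariate(beta, 1.0) for _ in range(dim)]
        total = sum(draws)
        if total > 0.0:  # redraw in the unlikely case of numerical underflow
            return [d / total for d in draws]

def sample_grammar(num_cats, vocab, beta, rng):
    """One probability row per parent category over all possible expansions."""
    # Binary expansions: every ordered pair of child categories.
    expansions = [(left, right) for left in range(num_cats) for right in range(num_cats)]
    # Terminal expansions: a word followed by a special null node (None).
    expansions += [(word, None) for word in vocab]
    rows = [sample_dirichlet(len(expansions), beta, rng) for _ in range(num_cats)]
    return expansions, rows

def sample_tree(parent, expansions, rows, rng, height=0, max_height=5):
    """Expand `parent` top-down. The max_height cutoff is an illustrative
    assumption that allows only terminal expansions once it is reached."""
    weights = rows[parent]
    if height >= max_height:
        weights = [w if right is None else 0.0
                   for w, (_, right) in zip(weights, expansions)]
    left, right = rng.choices(expansions, weights=weights, k=1)[0]
    if right is None:
        return (parent, left)  # preterminal node over a word
    return (parent,
            sample_tree(left, expansions, rows, rng, height + 1, max_height),
            sample_tree(right, expansions, rows, rng, height + 1, max_height))

def leaves(tree):
    """Collect the words of a sampled tree from left to right."""
    if len(tree) == 2:
        return [tree[1]]
    return leaves(tree[1]) + leaves(tree[2])

rng = random.Random(0)
vocab = ["the", "dog", "barks"]
expansions, rows = sample_grammar(num_cats=3, vocab=vocab, beta=0.5, rng=rng)
tree = sample_tree(0, expansions, rows, rng)
print(tree)
print(leaves(tree))
```

Note that `max_height` bounds tree height only so that the toy sampler terminates; it is unrelated to the center-embedding depth bound discussed later in the paper.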
3 Data and hyperparameters
Experiments here use transcribed child-directed utterances from the CHILDES corpus MacWhinney (1992) in three languages with more than 15,000 sentences each. English hand-annotated constituency trees are taken from the Adam and Eve portions of the Brown Corpus Brown (1973). Mandarin (Tong, Deng et al., 2018) and German (Leo, Behrens, 2006) data are collected from CHILDES with reference trees automatically generated using the state-of-the-art Kitaev and Klein (2018) parser. Disfluencies are removed, and only sentences spoken by caregivers are kept in the data. Models are run 10 times with random seeds for 700 iterations each, following previous work Jin et al. (2018a). The last sampled grammar is used to generate Viterbi parses for all sentences. All punctuation is retained during induction and then removed in evaluation. Significance testing uses permutation tests on concatenations of Viterbi trees from all test runs. We use Adam for exploratory experiments and the other three sets for confirmatory experiments.
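As a rough illustration of the permutation testing mentioned above, the sketch below runs a paired sign-flip permutation test on per-sentence scores from two systems. The text does not specify the exact test statistic, so this is a simplification under stated assumptions: the function name `paired_permutation_test`, the use of per-sentence scores, and the summed-difference statistic are all hypothetical.

```python
import random

def paired_permutation_test(scores_a, scores_b, num_samples=10000, seed=0):
    """Approximate two-sided p-value for the difference between two systems.

    Under the null hypothesis the two systems are exchangeable per sentence,
    so the sign of each per-sentence difference can be flipped at random.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(num_samples):
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(permuted) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (num_samples + 1)

# Hypothetical per-sentence scores for two systems on the same 50 sentences.
system_a = [0.9] * 50
system_b = [0.1] * 50
print(paired_permutation_test(system_a, system_b))
```

A corpus-level F1 is not a sum of per-sentence scores, so a faithful test would recompute the corpus-level metric after each random swap of sentence-level parses; the sign-flip version is used here only to keep the sketch short.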
RH is calculated by multiplying the unlabeled recall of bracketed spans in the predicted Viterbi trees by the homogeneity score Rosenberg and Hirschberg (2007) of the predicted labels of the matching spans. This is different from RVM (Jin et al., 2019), which is the product of unlabeled recall and V-measure. Because it uses unlabeled recall, the metric is insensitive to the branching factor of the grammar. Unlike RVM, it is also insensitive to the precision of predicted labels with respect to gold labels, so models are not penalized for hypothesizing more refined categories, as long as each of these categories falls within the confines of a single gold category. RVM, on the other hand, penalizes both underproposing and overproposing categories compared to the ones in the annotation. However, gold categories such as nouns and verbs are defined at a very high level that languages almost always specify further, usually represented as subcategories or features in linguistic theories. Unary branches in gold and predicted trees are removed, and the top category is used as the category for the constituent.
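The definition above can be sketched in a few lines. This is a minimal illustration rather than the evaluation script used in the experiments: it assumes each tree has already been reduced to a mapping from spans to labels, with spans keyed by a hypothetical `(sentence_id, start, end)` tuple, and the function names `homogeneity` and `recall_homogeneity` are made up for this example. Homogeneity follows the usual conditional-entropy definition, $1 - H(\text{gold} \mid \text{predicted}) / H(\text{gold})$, taken to be 1 when the gold labels have zero entropy.

```python
import math
from collections import Counter

def homogeneity(gold_labels, pred_labels):
    """1 - H(gold | pred) / H(gold), with the convention of 1.0 if H(gold) is 0."""
    n = len(gold_labels)
    gold_counts = Counter(gold_labels)
    pred_counts = Counter(pred_labels)
    joint_counts = Counter(zip(gold_labels, pred_labels))
    h_gold = -sum(c / n * math.log(c / n) for c in gold_counts.values())
    if h_gold == 0.0:
        return 1.0
    h_gold_given_pred = -sum(
        c / n * math.log(c / pred_counts[pred])
        for (_, pred), c in joint_counts.items()
    )
    return 1.0 - h_gold_given_pred / h_gold

def recall_homogeneity(gold_spans, pred_spans):
    """Unlabeled recall times homogeneity of the labels on matching spans."""
    matched = [span for span in gold_spans if span in pred_spans]
    recall = len(matched) / len(gold_spans) if gold_spans else 0.0
    h = homogeneity([gold_spans[s] for s in matched],
                    [pred_spans[s] for s in matched])
    return recall * h

# Toy example: the induced grammar splits NP into two finer categories (3 and 4).
gold = {(0, 0, 2): "NP", (0, 2, 4): "VP", (0, 0, 4): "S", (1, 0, 2): "NP"}
pred = {(0, 0, 2): 3, (0, 2, 4): 5, (0, 0, 4): 7, (1, 0, 2): 4, (1, 2, 3): 1}
print(recall_homogeneity(gold, pred))
```

In the toy example every gold span is recovered and each induced label falls inside a single gold category, so the finer-grained split of NP is not penalized. Collapsing all induced labels into one category would instead drive homogeneity, and therefore RH, to zero.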
4.1 Experiment 1: Labeled evaluation shows preference of grammar sparsity
Human grammars are sparse Johnson et al. (2007); Goldwater and Griffiths (2007). For example, in the Penn Treebank Marcus et al. (1993), there are 73 unique nonterminal categories. In theory, there can be more than 28 million possible unary, binary and ternary branching rules in the grammar. However, only 17,020 unique rules are found in the corpus, showing the high sparsity of attested rules. In other frameworks like Combinatory Categorial Grammar Steedman (2002) where lexical categories can be in the thousands, the number of attested lexical categories is still small compared to all possible ones.
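The order of magnitude quoted above can be checked with a short calculation. The counting scheme here is an assumption, since the text does not spell it out: a rule is taken to be one of the 73 categories as parent with an ordered sequence of one, two, or three children drawn from the same 73 categories.

```python
NUM_CATEGORIES = 73   # unique nonterminal categories in the Penn Treebank
ATTESTED_RULES = 17020  # unique rules found in the corpus

# Assumed counting scheme: parent category times all child sequences
# of length 1 (unary), 2 (binary), and 3 (ternary).
possible_rules = sum(NUM_CATEGORIES * NUM_CATEGORIES ** arity for arity in (1, 2, 3))

print(possible_rules)                   # 28792587
print(ATTESTED_RULES / possible_rules)  # fraction of possible rules attested
```

Under this assumption the total is 28,792,587, consistent with the "more than 28 million" figure, and the attested rules make up well under 0.1% of the possible ones.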
The Dirichlet concentration hyperparameter $\beta$ in the model controls how likely a sampled multinomial distribution is to concentrate its probability mass on only a few items. Previous work using similar models usually sets this value low Johnson et al. (2007); Goldwater and Griffiths (2007); Graça et al. (2009); Jin et al. (2018b) to prefer sparse grammars (i.e. grammars in which most of the probability mass is allocated to a small number of rules), with good results. The prediction based on this preference for sparsity is that the best value of $\beta$ should be much lower than 1.
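The effect of the concentration hyperparameter on sparsity can be seen in a small simulation. The sketch below is illustrative only: it draws symmetric Dirichlet samples with the standard library by normalizing gamma draws, and the function names, the 100-outcome distribution, and the two concentration values are arbitrary assumptions rather than settings from the experiments.

```python
import random

def sample_dirichlet(dim, beta, rng):
    """Draw one symmetric Dirichlet(beta) vector by normalizing gamma draws."""
    while True:
        draws = [rng.gammavariate(beta, 1.0) for _ in range(dim)]
        total = sum(draws)
        if total > 0.0:  # redraw in the unlikely case of numerical underflow
            return [d / total for d in draws]

def mean_top_mass(dim, beta, top_k, num_draws, seed=0):
    """Average probability mass held by the top_k most probable outcomes."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_draws):
        probs = sorted(sample_dirichlet(dim, beta, rng), reverse=True)
        total += sum(probs[:top_k])
    return total / num_draws

sparse = mean_top_mass(dim=100, beta=0.1, top_k=5, num_draws=200)
dense = mean_top_mass(dim=100, beta=10.0, top_k=5, num_draws=200)
print(f"beta=0.1:  top-5 mass = {sparse:.3f}")
print(f"beta=10.0: top-5 mass = {dense:.3f}")
```

With a low concentration value, a handful of outcomes hold most of the mass in each draw, whereas a high value spreads the mass almost uniformly. This is the sense in which a low value encodes a preference for sparse grammars.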
Figure 1(a) shows unlabeled F1 scores with different $\beta$ values on Adam. [Footnote 1: The results shown in the figure use $C=30$. We also tested other values of $C$ from 15 to 105, and the trend is almost identical.] Contrary to the prediction, grammar accuracy peaks at high values of $\beta$ when measured using unlabeled F1. However, these grammars with high unlabeled F1 are almost purely right-branching, a solution that performs very well on English child-directed speech in unlabeled parsing evaluation. The right-branching grammars nevertheless have phrasal labels that do not correlate with human annotation when evaluated with Homogeneity, as shown in Figure 1(b). This indicates that, instead of capturing human intuitions about syntactic structure, such grammars have only captured broad branching tendencies. The same grammars are evaluated again with RH in Figure 1(c). When both structural and labeling accuracy are taken into account, results match the prediction that grammar accuracy peaks at a low concentration hyperparameter. Figures 1(d) and 1(e) show the same experiments evaluated with the labeled evaluation metric RVM. Because of their sensitivity to labeling accuracy, results in VM and RVM show a trend similar to that of Homogeneity and RH, in which labeling quality decreases as $\beta$ increases. Jin et al. (2018b) noted that induced grammars high in unlabeled bracketing scores are low in NP discovery scores, a category-specific evaluation metric. This can also be explained by the fact that induced grammars with high bracketing scores only capture a broad right-branching bias without accurately clustering words and phrases based on their distributional properties.
Figure 2 shows the same experiments on a corpus of formal written English, the WSJ20dev dataset. [Footnote 2: The first half of the Wall Street Journal part of the Penn Treebank, restricted to sentences with 20 words or fewer.] The pattern is similar to but less extreme than that on CHILDES. The higher $\beta$ values in the range of 0.1-0.2 still show better performance on unlabeled F1 than the sparser models, consistent with previous results in Jin et al. (2018b). However, RH scores reveal that the labels induced by the denser models are less accurate, which manifests as an overall lower peak $\beta$ when using RH than when using unlabeled F1.
4.2 Experiment 2: Performance increases with the number of categories
Previous research Jin et al. (2018a) also reported that the number of categories used by the induction models was relatively low compared to the number of categories in human annotation. For example, there are 63 unique tags in the Adam dataset, in contrast to the 30 or fewer categories used in previous induction work. The bias introduced by high $\beta$ values and unlabeled evaluation together may be masking the real relationship between the number of categories and grammar accuracy.
Figures 3(a) and 3(b) show unlabeled and labeled evaluation on different grammars induced with the best performing $\beta$ on Adam tuned by unlabeled F1. With F1, increasing the number of categories $C$ beyond 30 yields no improvement, as most of the induced grammars are purely right-branching. RH results confirm this: as grammars approach the pure right-branching solution when $C$ increases, the similarity between induced and gold labels of constituents deteriorates quickly. RH scores from grammars induced with a different concentration setting are more indicative of the interaction between the number of categories and grammar accuracy. Grammar accuracy initially increases as $C$ gets larger and peaks at $C=75$. The results confirm the importance of labeled evaluation: the trend from labeled evaluation shows that there must be a sufficient number of categories to account for different syntactic structures, and models with small numbers of categories are limited in their ability to do this.
4.3 Experiment 3: Depth-bounding is still effective with RH
Previous work showed that depth-bounding helps grammar inducers induce more accurate grammars Shain et al. (2016); Jin et al. (2018a), because it removes from grammar induction inference the parse trees with deeply nested center embeddings, which cannot be produced by humans due to memory constraints Chomsky and Miller (1963). However, the unlabeled evaluation metric used in previous work may lead to unhelpful conclusions. In order to revisit this claim with labeled evaluation, experiments are first conducted on Adam to explore the interaction between depth and labeled performance, and subsequently on the Eve (English), Tong (Mandarin Chinese) and Leo (German) portions of the CHILDES corpus. All experiments use hyperparameters tuned with RH. [Footnote 3: The optimal number of categories $C$ is 75 from previous experiments, but we used 30 in all depth-bounding experiments due to hardware constraints at high depth bounds.]
Figure 4 shows the interaction between depth and RH scores on Adam. Performance of the unbounded models can be lower than that of all bounded models, showing that unbounded inducers can induce grammars inconsistent with human memory constraints. Labeled performance peaks at depth 3, where models are significantly more accurate than unbounded models. This is consistent with previous results that over 97% of trees in English contain 3 or fewer nested center embeddings Schuler et al. (2010).
Experiments on Eve, Tong and Leo replicate this result. Figure 5 shows that the models bounded at depth 3 are more accurate than unbounded models with both unlabeled and labeled evaluation metrics. Significance testing with unlabeled F1 shows that the performance differences across the three datasets are all highly significant. [Footnote 4: Neither RH nor RVM was used in permutation significance testing, because labels with the same values from different induced grammars may represent different linguistic categories, so two parses of the same sentence from different runs are not exchangeable.] Therefore, the claim that depth-bounding is effective in grammar induction is still supported when the models are developed and evaluated with labeled evaluation.
Unlabeled evaluation has been used in grammar induction, but experiments presented in this paper show that unlabeled evaluation can reveal unexpected bias in the data which may lead to unhelpful conclusions compared to labeled evaluation. Results show that trends of preference of sparsity and use of categories that are consistent with linguistic annotation can only be discovered with labeled evaluation. Furthermore, human memory constraints are still effective in grammar induction when labeled evaluation is used throughout all stages of development.
- Abend et al. (2017). Bootstrapping language acquisition. Cognition 164, pp. 116–143.
- Bannard et al. (2009). Modeling children’s early grammatical knowledge. Proceedings of the National Academy of Sciences of the United States of America 106 (41), pp. 17284–9.
- Behrens (2006). The input–output relationship in first language acquisition. Language and Cognitive Processes 21 (1-3), pp. 2–24.
- Brown (1973). A first language: The early stages. Harvard U. Press.
- Chomsky and Miller (1963). Introduction to the formal analysis of natural languages. In Handbook of Mathematical Psychology, pp. 269–321.
- Chomsky (1965). Aspects of the Theory of Syntax. MIT Press, Cambridge, MA.
- Chomsky (1980). On cognitive structures and their development: A reply to Piaget. In Language and learning: the debate between Jean Piaget and Noam Chomsky, M. Piattelli-Palmarini (Ed.), pp. 751–755.
- Deng et al. (2018). A Multimedia Corpus of Child Mandarin: The Tong Corpus. The Journal of Chinese Linguistics 46 (1), pp. 69–92.
- Goldwater and Griffiths (2007). A fully Bayesian approach to unsupervised part-of-speech tagging. Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pp. 744–751.
- Graça et al. (2009). Posterior vs. Parameter sparsity in latent variable models. In Advances in Neural Information Processing Systems, pp. 664–672.
- Jin et al. (2018a). Depth-bounding is effective: Improvements and evaluation of unsupervised PCFG induction. Proceedings of the Conference on Empirical Methods in Natural Language Processing.
- Jin et al. (2018b). Unsupervised Grammar Induction with Depth-bounded PCFG. Transactions of the Association for Computational Linguistics.
- Jin et al. (2019). Unsupervised Learning of PCFGs with Normalizing Flow. In ACL.
- Jin and Schuler (2019). Variance of average surprisal: a better predictor for quality of grammar from unsupervised PCFG induction. In ACL.
- Johnson et al. (2007). Bayesian Inference for PCFGs via Markov chain Monte Carlo. Proceedings of Human Language Technologies: The Conference of the North American Chapter of the Association for Computational Linguistics, pp. 139–146.
- Kemp et al. (2005). Young Children’s Knowledge of the “Determiner” and “Adjective” Categories. Journal of Speech, Language, and Hearing Research 48 (June), pp. 592–609.
- Kim et al. (2019). Compound Probabilistic Context-Free Grammars for Grammar Induction. In ACL.
- Kitaev and Klein (2018). Constituency Parsing with a Self-Attentive Encoder. In ACL.
- Kwiatkowski et al. (2012). A probabilistic model of syntactic and semantic acquisition from child-directed utterances and their meanings. Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 234–244.
- MacWhinney (1992). The CHILDES Project: Tools for Analyzing Talk. Third edition, Lawrence Erlbaum Associates, Mahwah, NJ.
- Marcus et al. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics 19 (2), pp. 313–330.
- Perfors et al. (2011). The learnability of abstract syntactic principles. Cognition 118, pp. 306–338.
- Pine et al. (2013). Do young children have adult-like syntactic categories? Zipf’s law and the case of the determiner. Cognition 127 (3), pp. 345–360.
- Ponvert et al. (2011). Simple unsupervised grammar induction from raw text with cascaded finite state models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 1077–1086.
- Rosenberg and Hirschberg (2007). V-measure: A conditional entropy-based external cluster evaluation measure. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).
- Schuler et al. (2010). Broad-coverage parsing using human-like memory constraints. Computational Linguistics 36 (1), pp. 1–30.
- Seginer (2007). Fast Unsupervised Incremental Parsing. In Proceedings of the Annual Meeting of the Association of Computational Linguistics, pp. 384–391.
- Shain et al. (2016). Memory-bounded left-corner unsupervised grammar induction on child-directed input. In Proceedings of the International Conference on Computational Linguistics, pp. 964–975.
- Shen et al. (2018). Neural Language Modeling by Jointly Learning Syntax and Lexicon. In ICLR.
- Shen et al. (2019). In ICLR.
- Shi et al. (2019). Visually Grounded Neural Syntax Acquisition. In ACL.
- Steedman (2002). Formalizing Affordance. In Proceedings of the Annual Meeting of the Cognitive Science Society.
- Tomasello and Olguin (1993). Twenty-three-month-old children have a grammatical category of noun. Cognitive Development 8 (4), pp. 451–464.
- Valian (1986). Syntactic Categories in the Speech of Young Children. Developmental Psychology 22 (4), pp. 562–579.