Unsupervised grammar induction aims at building a formal device for discovering syntactic structure from natural language corpora. Within the scope of grammar induction, there are two main directions of research: unsupervised constituency parsing, which attempts to discover the underlying structure of phrases, and unsupervised dependency parsing, which attempts to discover the underlying relations between words. Early work on induction of syntactic structure focused on learning phrase structure and generally used some variant of probabilistic context-free grammars (PCFGs; lari1990estimation; charniak1996tree; clark2001unsupervised). In recent years, dependency grammars have gained favor as an alternative syntactic formulation (yuret1998discovery; carroll1992two; paskin2002grammatical). Specifically, the dependency model with valence (DMV; klein2004corpus) forms the basis for many modern approaches in dependency induction. Most recent models for grammar induction, be they for PCFGs, DMVs, or other formulations, have generally coupled these models with some variety of neural model to use embeddings to capture word similarities, improve the flexibility of model parameterization, or both (he2018unsupervised; jin2019unsupervised; kim2019compound; han2019enhancing).
Notably, the two syntactic formalisms capture very different views of syntax. Phrase structure takes advantage of an abstracted recursive view of language, while dependency structure more concretely focuses on the propensity of particular words in a sentence to relate to each other syntactically. However, few attempts at unsupervised grammar induction have been made to marry the two and let both benefit each other. This is precisely the issue we attempt to tackle in this paper.
As a specific formalism that allows us to model both structures at once, we turn to lexicalized probabilistic context-free grammars (L-PCFGs; collins2003head). L-PCFGs borrow the underlying machinery from PCFGs but expand the grammar by allowing rules to include information about the lexical heads of each phrase, an example of which is shown in Figure 1. (For instance, the probability of VP[chasing] → VBZ[is] VP[chasing] is much higher than that of VP → VBZ VP, because “chasing” is a present participle.) Historically, these grammars have mostly been used for supervised parsing, combined with traditional count-based estimators of rule probabilities (collins2003head). Within this context, lexicalized grammar rules are powerful, but the counts available are sparse, and thus extensive smoothing was required to achieve competitive results (bikel2004intricacies; Hockenmaier2002).
In this paper, we contend that with recent advances in neural modeling, it is time to return to modeling lexical dependencies, specifically in the context of unsupervised constituent-based grammar induction. We propose neural L-PCFGs as a parameter-sharing method to alleviate the sparsity problem of lexicalized PCFGs. Figure 2 illustrates the generation procedure of a neural L-PCFG. Different from traditional lexicalized PCFGs, the probabilities of production rules are not independently parameterized, but rather conditioned on the representations of non-terminals, preterminals and lexical items (§3). Apart from devising lexicalized production rules (§2.3) and their corresponding scoring function, we also follow kim2019compound’s compound PCFG model for (non-lexicalized) constituency parsing with compound variables (§3.2), enabling modeling of a continuous mixture of grammar rules. (In other words, we do not induce a single PCFG, but a distribution over a family of PCFGs.) We define how to efficiently train (§4.1) and perform inference (§4.2) in this model using dynamic programming and variational inference.
Put together, we expect this to result in a model that is effective and simultaneously induces both phrase structure and lexical dependencies, whereas previous work has focused on only one. (Note that by “lexical dependencies” we are referring to unilexical dependencies between the head word and child non-terminals, as opposed to bilexical dependencies between two words, as are modeled in many dependency parsing models.) Our empirical evaluation examines this hypothesis, asking the following question:
In neural grammar induction models, is it possible to jointly and effectively learn both phrase structure and lexical dependencies? Is using both in concert better at the respective tasks than specialized methods that model only one at a time?
Our experiments (§5.3) answer in the affirmative, with better performance than baselines designed specially for either dependency or constituency parsing under multiple settings. Importantly, our detailed ablations show that the method of factorization plays an important role in the performance of neural L-PCFGs (§5.3.2). Finally, qualitatively (§5.4), we find that latent labels induced by our model align with annotated gold non-terminals in PTB.
2 Motivation and Definitions
In this section, we will first provide the background of constituency grammars and dependency grammars, and then formally define the general L-PCFG, illustrating how both dependencies and phrase structures can be induced from L-PCFGs.
2.1 Phrase Structures and CFGs
The phrase structure of a sentence is formed by recursively splitting constituents. In the parse above: the sentence is split into a noun phrase (NP) and a verb phrase (VP), which can themselves be further split into smaller constituents; for example, the NP is comprised of a determiner (DT) “the” and a normal noun (NN) “dog.”
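Such phrase structures are represented as a context-free grammar (CFG) $\mathcal{G} = (S, \mathcal{N}, \mathcal{P}, \Sigma, \mathcal{R})$, which can generate an infinite set of sentences via the repeated application of a finite set of rules $\mathcal{R}$:
$$
\begin{aligned}
S &\rightarrow A, & A &\in \mathcal{N} \\
A &\rightarrow B\, C, & A &\in \mathcal{N};\ B, C \in \mathcal{N} \cup \mathcal{P} \\
T &\rightarrow w, & T &\in \mathcal{P};\ w \in \Sigma
\end{aligned}
$$
Here $S$ denotes a start symbol, $\mathcal{N}$ is a finite set of non-terminals, $\mathcal{P}$ is a finite set of preterminals, and $\Sigma$ is a set of terminal symbols, i.e. words and punctuation. Note that this formulation does not capture the structure of sentences of length zero or one.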
2.2 Dependency Structures and Grammars
[Figure: dependency tree for “The dog is chasing the cat”. ROOT → chasing; det: dog → The; nsubj: chasing → dog; aux: chasing → is; nobj: chasing → cat; det: cat → the.]
In a dependency tree of a sentence, the syntactic nodes are the words in the sentence. Here the root is the root word of the sentence, and the children of each word are its dependents. Above, the root word is chasing, which has three dependents: its subject (nsubj) dog, auxiliary verb (aux) is, and object (nobj) cat. A dependency grammar specifies the possible head-dependent pairs $(\alpha, \beta) \in \Sigma \times \Sigma$, where the set $\Sigma$ denotes the vocabulary. (This work assumes projective trees.)
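As a small illustration of the structure just described, the sketch below stores the example tree as a list of head positions and checks the projectivity assumption. The encoding (1-indexed words, with 0 standing for ROOT) and the function names are assumptions made for this example, not part of the original formulation.

```python
# Head position of each word in "The dog is chasing the cat".
# Words are 1-indexed; 0 stands for ROOT (an encoding chosen for this sketch).
WORDS = ["The", "dog", "is", "chasing", "the", "cat"]
HEADS = [2, 4, 4, 0, 6, 4]


def arcs(heads):
    """Return (head, dependent) position pairs for a list of head positions."""
    return [(h, d) for d, h in enumerate(heads, start=1)]


def is_projective(heads):
    """Return True if no two arcs cross when drawn above the sentence."""
    spans = [(min(h, d), max(h, d)) for h, d in arcs(heads)]
    for l1, r1 in spans:
        for l2, r2 in spans:
            # Two arcs cross if exactly one endpoint of the second
            # lies strictly inside the first.
            if l1 < l2 < r1 < r2:
                return False
    return True


if __name__ == "__main__":
    for head, dep in arcs(HEADS):
        head_word = "ROOT" if head == 0 else WORDS[head - 1]
        print(f"{head_word} -> {WORDS[dep - 1]}")
    print("projective:", is_projective(HEADS))
```

Treating ROOT as position 0 means an arc that passes over the root word is also reported as a crossing, which is one common way to state the projectivity condition.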
2.3 Lexicalized CFGs
Although both the constituency and dependency grammars capture some aspects of syntax, we aim to leverage their relative strengths in a single unified formalism. In a unified grammar, these two types of structure can benefit each other. For example, in The dog is chasing the cat of my neighbor’s, while the phrase of my neighbor’s might be incorrectly marked as the adverbial phrase of chasing in a dependency model, the constituency parser can provide the constraint that the cat of my neighbor’s is a constituent, thereby requiring chasing to be the head of the phrase.
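Lexicalized CFGs are based on a backbone similar to standard CFGs but parameterized to be sensitive to lexical dependencies such as those used in dependency grammars. Similarly to CFGs, L-CFGs are defined as a five-tuple $\mathcal{G} = (S, \mathcal{N}, \mathcal{P}, \Sigma, \mathcal{R})$. The differences lie in the formulation of the rules $\mathcal{R}$:
$$
\begin{aligned}
S &\rightarrow A[\alpha], & A &\in \mathcal{N} \\
A[\alpha] &\rightarrow B[\alpha]\, C[\beta], & A &\in \mathcal{N};\ B, C \in \mathcal{N} \cup \mathcal{P} \\
A[\alpha] &\rightarrow B[\beta]\, C[\alpha], & A &\in \mathcal{N};\ B, C \in \mathcal{N} \cup \mathcal{P} \\
T[\alpha] &\rightarrow \alpha, & T &\in \mathcal{P}
\end{aligned}
$$
where $\alpha, \beta \in \Sigma$ are words, and mark the head of a constituent when they appear in “$[\cdot]$”. Without loss of generality, we only consider binary branching in $\mathcal{G}$. The branching rules $A[\alpha] \rightarrow B[\alpha]\, C[\beta]$ and $A[\alpha] \rightarrow B[\beta]\, C[\alpha]$ encode dependencies in which the head $\alpha$ governs a dependent $\beta$ to its right and to its left, respectively. (Note that the root-seeking rule $S \rightarrow A[\alpha]$ encodes that $\alpha$ is the root word of the sentence.)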
In a lexicalized CFG, a sentence $\boldsymbol{w}$ can be generated by iterative binary splitting and emission, forming a parse tree $t = [r_1, r_2, \ldots, r_{|t|}]$, where the rules $r_i$ are sorted from top to bottom and from left to right. We will denote the set of parse trees that generate $\boldsymbol{w}$ within grammar $\mathcal{G}$ as $T_{\mathcal{G}}(\boldsymbol{w})$.
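To make concrete how a single lexicalized tree carries both kinds of structure, here is a minimal sketch that reads the dependency arcs off a binary tree annotated with headedness directions. The tuple encoding of trees, the direction flags "L"/"R", and the function names are assumptions of this example only.

```python
# A leaf is (tag, word, position); an internal node is
# (label, direction, left_child, right_child), where direction "L" or "R"
# says which child passes its head word up to the parent.
TREE = (
    "S", "R",
    ("NP", "R", ("DT", "The", 1), ("NN", "dog", 2)),
    ("VP", "R",
        ("VBZ", "is", 3),
        ("VP", "L",
            ("VBG", "chasing", 4),
            ("NP", "R", ("DT", "the", 5), ("NN", "cat", 6)))),
)


def lexicalize(tree):
    """Return (head_position, arcs); arcs are (head, dependent) positions."""
    if len(tree) == 3:  # leaf: its head is the word itself
        return tree[2], []
    _label, direction, left, right = tree
    left_head, left_arcs = lexicalize(left)
    right_head, right_arcs = lexicalize(right)
    if direction == "L":
        head, dependent = left_head, right_head
    else:
        head, dependent = right_head, left_head
    # Each branching rule contributes exactly one dependency arc.
    return head, left_arcs + right_arcs + [(head, dependent)]


def heads_from_tree(tree, length):
    """Convert a lexicalized tree into a list of head positions (0 = ROOT)."""
    root, arcs = lexicalize(tree)
    heads = [None] * length
    heads[root - 1] = 0  # the root-seeking rule attaches the root word to ROOT
    for head, dependent in arcs:
        heads[dependent - 1] = head
    return heads


if __name__ == "__main__":
    print(heads_from_tree(TREE, 6))
```

The phrase structure is the bracketing of the tuple itself, so one tree object yields both the constituents and the dependencies.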
2.4 Grammar Induction with L-PCFGs
In this subsection, we will introduce L-PCFGs, the probabilistic formulation of L-CFGs. The task of grammar induction is to ask, given a corpus $\mathcal{W}$, how we can obtain the probabilistic generative grammar that maximizes its likelihood. With the induced grammar, we are also interested in how to obtain the trees that are most likely given an individual sentence; in other words, syntactic parsing according to this grammar.
We begin by defining the probability distribution over sentences $\boldsymbol{w}$, by marginalizing over all parse trees that may have generated $\boldsymbol{w}$:
$$p(\boldsymbol{w} \mid \boldsymbol{z}) = \sum_{t \in T_{\mathcal{G}}(\boldsymbol{w})} \frac{\tilde{p}(t \mid \boldsymbol{z})}{Z(\boldsymbol{z})}$$
where $\tilde{p}(t \mid \boldsymbol{z})$ is an unnormalized probability of a parse tree (which we will refer to as an energy function), $Z(\boldsymbol{z})$ is the normalizing constant, and $\boldsymbol{z}$ is a compound variable (§3.2) which allows for more complex and expressive generative grammars (robbins1951asymptotically).
We define the energy function of a parse tree by exponentiating a score:
$$\tilde{p}(t \mid \boldsymbol{z}) = \exp\big(G_{\theta}(t, \boldsymbol{z})\big)$$
where $\theta$ is the parameter of function $G$. Theoretically, $G$ could be an arbitrary scoring function, but in this paper, as with most previous work, we consider a context-free scoring function, where the score of each rule is independent of the other rules in the parse tree $t$:
$$G_{\theta}(t, \boldsymbol{z}) = \sum_{r \in t} g_{\theta}(r, \boldsymbol{z})$$
where $g_{\theta}$ is the rule-scoring function which maps the rule $r$ and latent variable $\boldsymbol{z}$ to real space, assigning a log likelihood to each rule. This formulation allows for efficient calculation using dynamic programming. We also include a restriction that the energies must be top-down locally normalized, under which the partition function automatically equals 1, i.e. $Z(\boldsymbol{z}) \equiv 1$.
To train an L-PCFG, we maximize the log likelihood of the corpus (the latent variable is marginalized out):
$$\max_{\theta} \sum_{\boldsymbol{w} \in \mathcal{W}} \log p_{\theta}(\boldsymbol{w})$$
We then obtain the most likely parse tree of a sentence by maximizing the posterior probability:
$$t^{*} = \arg\max_{t \in T_{\mathcal{G}}(\boldsymbol{w})} p_{\theta}(t \mid \boldsymbol{w})$$
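The sum over parse trees in the sentence likelihood is what dynamic programming makes tractable. The following is a minimal sketch of the inside algorithm on a toy, locally normalized, unlexicalized PCFG with no compound variable. The grammar, its symbols and its probabilities are invented for illustration and are not the model's actual rules.

```python
from collections import defaultdict

# Toy grammar (hypothetical): one non-terminal X and one preterminal T.
# Rule probabilities are locally normalized for each left-hand side.
ROOT = {"X": 1.0}
BINARY = {("X", "T", "T"): 0.5, ("X", "X", "T"): 0.25, ("X", "T", "X"): 0.25}
EMIT = {("T", "a"): 0.6, ("T", "b"): 0.4}


def inside_probability(words):
    """Sum the probabilities of all parse trees that yield `words`."""
    n = len(words)
    # chart[(i, j, symbol)] = probability that symbol yields words[i:j]
    chart = defaultdict(float)
    for i, word in enumerate(words):
        for (tag, emitted), prob in EMIT.items():
            if emitted == word:
                chart[(i, i + 1, tag)] += prob
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point between the two children
                for (parent, left, right), prob in BINARY.items():
                    chart[(i, j, parent)] += (
                        prob * chart[(i, k, left)] * chart[(k, j, right)]
                    )
    return sum(prob * chart[(0, n, sym)] for sym, prob in ROOT.items())


if __name__ == "__main__":
    print(inside_probability(["a", "b", "a"]))
```

A real implementation would work with log scores for numerical stability and would add the head word to each chart item; plain probabilities are used here only to keep the recursion easy to read.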
3 Neural Lexicalized PCFGs
As noted, one advantage of L-PCFGs is that the obtained encodes both dependencies and phrase structures, allowing both to be induced simultaneously. We also expect this to improve performance, because different information is captured by each of these two structures. However, this expressivity comes at a price: more complex rules. In contrast to the traditional PCFG, which has production rules, the L-PCFG requires production rules. Because traditionally rules of L-PCFGs have been parameterized independently by scalars, i.e. (collins2003head), these parameters were hard to estimate due to data sparsity.
We propose an alternate parameterization, the neural L-PCFG, which ameliorates these sparsity problems through parameter sharing, and the compound L-PCFG, which allows a more flexible sentence-by-sentence parameterization of the model. Below, we explain the neural L-PCFG factorization we found performed best but include ablations of our decisions in Section 5.3.2.
3.1 Neural L-PCFG Factorization
The score of an individual rule is calculated as the combination of several component probabilities:
- root to non-terminal probability :
Probability that the start symbol produces a non-terminal, i.e. the non-terminal of the whole sentence.
- word emission probability :
Probability that the head word of a constituent is conditioned on that the non-terminal of the constituent is .
- head non-terminal probability or :
Probability of the headedness direction and the head-inheriting child (the child non-terminal that inherits the parent’s head word), conditioned on the parent non-terminal and head word.
- non-inheriting child probability or :
Probability of the non-inheriting child conditioned on the headedness direction, and parent and head-inheriting child non-terminals.
The score of the root-seeking rule is factorized as the sum of the root to non-terminal score and the word emission score, as shown in Equation 7.
The scores of branching rules and are factorized as the sum of a binary non-terminal score, a head non-terminal score, and a word emission score. Equation 8 describes the factorization of the score of rule and :
Since the head of preterminals is already specified upon generation of one of the ancestor non-terminals, the score of emission rule is 0.
where denotes concatenation and the word emission probability is
with partition functions s.t. and s.t. .
The non-inheriting child probabilities for left- and right-headed dependencies are
where partition functions satisfy and .
The respective head non-terminal scores are
where the partition function satisfies .
Here vectors represent the embeddings of non-terminals, preterminals and words.
The scoring functions are multi-layer perceptrons with different sets of parameters, where we use residual connections (he2016deep) between layers to facilitate training of deeper models.
3.2 Compound Grammar
Among various existing grammar induction models, the compound PCFG model of (kim2019compound) both shows highly competitive results and follows a PCFG-based formalism similar to ours, and thus we build upon this method. The compound in compound PCFG refers to the fact that it uses a compound probability distribution robbins1951asymptotically in modeling and estimation of its parameters. A compound probability distribution enables continuous variants of grammars, allowing the probabilities of the grammar to change based on the unique characteristics of the sentence. In general, compound variables can be devised in any way that may inform the specification of the rule probabilities (e.g. a structured variable to provide frame semantics or the social context in which the sentence is situated). In this way, compound grammar increases the capacity of the original PCFG.
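In this paper, we use a latent compound variable $\boldsymbol{z}$ which is sampled from a standard spherical Gaussian distribution:
$$\boldsymbol{z} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})$$
We denote the probability of the latent variable as $p(\boldsymbol{z})$. By marginalizing out the compound variable, we get the log likelihood of a sentence:
$$\log p(\boldsymbol{w}) = \log \int p(\boldsymbol{z})\, p(\boldsymbol{w} \mid \boldsymbol{z})\, d\boldsymbol{z}$$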
4 Training and Inference
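It is intractable to obtain the exact log likelihood by integration over $\boldsymbol{z}$, and estimation by Monte-Carlo sampling would be hopelessly inefficient. However, we can optimize the evidence lower bound (ELBo):
$$\mathrm{ELBo} = \mathbb{E}_{q_{\phi}(\boldsymbol{z} \mid \boldsymbol{w})}\big[\log p_{\theta}(\boldsymbol{w} \mid \boldsymbol{z})\big] - \mathrm{KL}\big[q_{\phi}(\boldsymbol{z} \mid \boldsymbol{w}) \,\|\, p(\boldsymbol{z})\big]$$
where $q_{\phi}(\boldsymbol{z} \mid \boldsymbol{w})$ is the proposal probability parameterized by an inference network, similar to those used in variational autoencoders (kingma2013autoencoding). The ELBo can be estimated by Monte-Carlo sampling:
$$\mathrm{ELBo} \approx \frac{1}{N} \sum_{i=1}^{N} \log p_{\theta}\big(\boldsymbol{w} \mid \boldsymbol{z}^{(i)}\big) - \mathrm{KL}\big[q_{\phi}(\boldsymbol{z} \mid \boldsymbol{w}) \,\|\, p(\boldsymbol{z})\big]$$
where the $\boldsymbol{z}^{(i)}$ are sampled from $q_{\phi}(\boldsymbol{z} \mid \boldsymbol{w})$. We model the proposal probability as an orthogonal Gaussian distribution:
$$q_{\phi}(\boldsymbol{z} \mid \boldsymbol{w}) = \mathcal{N}\big(\boldsymbol{\mu}, \mathrm{diag}(\boldsymbol{\sigma}^{2})\big)$$
where ($\boldsymbol{\mu}$, $\boldsymbol{\sigma}$) are output by the inference network. Both outputs are computed by networks parameterized as LSTMs (hochreiter1997long). Note that the inference network can be optimized with the reparameterization trick (kingma2013autoencoding):
$$\boldsymbol{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})$$
where $\odot$ denotes the Hadamard (element-wise) product. The KL divergence between $q_{\phi}(\boldsymbol{z} \mid \boldsymbol{w})$ and $p(\boldsymbol{z})$ is
$$\mathrm{KL}\big[q_{\phi}(\boldsymbol{z} \mid \boldsymbol{w}) \,\|\, p(\boldsymbol{z})\big] = -\frac{1}{2} \sum_{j} \big(1 + \log \sigma_{j}^{2} - \mu_{j}^{2} - \sigma_{j}^{2}\big)$$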
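The variational pieces described above can be sketched in a few lines of plain Python: a reparameterized sample, the closed-form KL divergence between a diagonal Gaussian and the standard normal prior, and a Monte-Carlo ELBo estimate. The function names and the `log_likelihood` callback are hypothetical stand-ins; in the actual model the likelihood term would come from the grammar's dynamic program and the mean and standard deviation from the inference network.

```python
import math
import random


def reparameterize(mu, sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), element-wise."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]


def kl_to_standard_normal(mu, sigma):
    """KL[N(mu, diag(sigma^2)) || N(0, I)] in closed form."""
    return 0.5 * sum(
        m * m + s * s - 1.0 - math.log(s * s) for m, s in zip(mu, sigma)
    )


def elbo_estimate(log_likelihood, mu, sigma, n_samples, rng):
    """Monte-Carlo ELBo: mean log-likelihood over samples minus the KL term."""
    samples = [reparameterize(mu, sigma, rng) for _ in range(n_samples)]
    expected_ll = sum(log_likelihood(z) for z in samples) / n_samples
    return expected_ll - kl_to_standard_normal(mu, sigma)


if __name__ == "__main__":
    rng = random.Random(0)

    # Hypothetical stand-in for log p(w | z); any function of z would do here.
    def toy_log_likelihood(z):
        return -sum(x * x for x in z)

    print(elbo_estimate(toy_log_likelihood, [0.1, -0.2], [0.9, 1.1], 5, rng))
```

Because the noise is drawn independently of the mean and standard deviation, gradients can flow through both in an automatic-differentiation framework, which is the point of the trick.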
We initialize word embeddings using GloVe embeddings (pennington2014GloVe). We further cluster the word embeddings with K-Means (macqueen1967some), as shown in Figure 3, and use the centroids of the clusters to initialize the embeddings of preterminals. The K-Means algorithm is initialized using the K-Means++ method and trained until convergence. The intuition is that this gives the model a rough idea of syntactic categories before starting grammar induction. We also consider the variant without pretrained word embeddings, where we initialize word embeddings and preterminals both by drawing from . Other parameters are initialized by Xavier normal initialization (glorot2010understanding).
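The clustering step can be illustrated with a small, dependency-free sketch of K-Means with K-Means++ seeding. The two-dimensional points stand in for word embeddings and are invented for this example; the function names are likewise assumptions, and the code assumes there are at least `k` distinct points.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """K-Means++ seeding: favor points far from the centroids chosen so far."""
    centroids = [list(rng.choice(points))]
    while len(centroids) < k:
        weights = [min(sq_dist(p, c) for c in centroids) for p in points]
        centroids.append(list(rng.choices(points, weights=weights, k=1)[0]))
    return centroids


def kmeans(points, k, rng, max_iter=100):
    """Run Lloyd's algorithm until the centroids stop changing."""
    centroids = kmeans_pp_init(points, k, rng)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        new_centroids = [
            [sum(dim) / len(members) for dim in zip(*members)]
            if members else centroids[c]  # keep an empty cluster's centroid
            for c, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


if __name__ == "__main__":
    # Hypothetical "embeddings": two well-separated groups of 2-d points.
    embeddings = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    preterminal_init = kmeans(embeddings, k=2, rng=random.Random(0))
    print(preterminal_init)
```

The returned centroids would then be copied into the preterminal embedding table, one centroid per preterminal.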
We also apply curriculum learning bengio2009curriculum; Spitkovsky2010 to learn the grammar gradually. Starting at half of the maximum length in the training set, we raise the length limit by
We are interested in the induced parse tree for each sentence in the task of unsupervised parsing, i.e. the most probable tree
where is the posterior over compound variables.
However, it is intractable to get the most probable tree. Hence we use the mean predicted by the inference network and replace the real posterior distribution with a Dirac delta distribution centered at that mean to approximate the integral. (Note that it is also possible to use other methods for approximation, for example using a different distribution in place of the posterior distribution. However, this still results in high prediction variance of the max function, and we did not observe a significant improvement with other methods.)
The most probable tree can then be obtained via the CYK algorithm.
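Once the compound variable is fixed, finding the best tree is a standard Viterbi-style CYK search. The sketch below runs it on a toy unlexicalized PCFG; the grammar and probabilities are invented for illustration, and the lexicalized model would additionally track the head word of each span.

```python
import math

# Toy grammar (hypothetical): left-branching is slightly preferred.
ROOT = {"X": 1.0}
BINARY = {("X", "T", "T"): 0.5, ("X", "X", "T"): 0.3, ("X", "T", "X"): 0.2}
EMIT = {("T", "a"): 0.6, ("T", "b"): 0.4}


def viterbi_parse(words):
    """Return (log probability, bracketing) of the best tree, or None."""
    n = len(words)
    best = {}  # (i, j, symbol) -> (log probability, backpointer)
    for i, word in enumerate(words):
        for (tag, emitted), prob in EMIT.items():
            if emitted == word:
                best[(i, i + 1, tag)] = (math.log(prob), None)
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for (parent, left, right), prob in BINARY.items():
                    if (i, k, left) not in best or (k, j, right) not in best:
                        continue
                    score = (math.log(prob)
                             + best[(i, k, left)][0] + best[(k, j, right)][0])
                    key = (i, j, parent)
                    if key not in best or score > best[key][0]:
                        best[key] = (score, (k, left, right))
    candidates = [(math.log(p) + best[(0, n, s)][0], s)
                  for s, p in ROOT.items() if (0, n, s) in best]
    if not candidates:
        return None
    score, symbol = max(candidates)

    def build(i, j, sym):
        """Follow backpointers to recover the bracketing."""
        back = best[(i, j, sym)][1]
        if back is None:
            return words[i]
        k, left, right = back
        return (build(i, k, left), build(k, j, right))

    return score, build(0, n, symbol)


if __name__ == "__main__":
    print(viterbi_parse(["a", "b", "a"]))
```

The only change from the inside recursion is that the sum over split points and rules is replaced by a max, with backpointers kept to rebuild the tree.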
5.1 Data Setup
All models are evaluated using the Penn Treebank marcus1993building as the test corpus, following the splits and preprocessing methods, including removing punctuation, provided by kim2019compound. To convert the original phrase bracket and label annotations to dependency annotations, we use Stanford typed dependency representations de2008stanford.
We employ three standard metrics to measure the performance of models on the validation and test sets: directed and undirected attachment score (DAS and UAS) for dependency parsing, and unlabeled constituent F1 score for constituency parsing.
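For readers who want the three metrics spelled out, here is a sentence-level sketch. Evaluation conventions differ between papers (for example corpus-level versus sentence-level averaging, or whether trivial spans are counted), so the choices below are assumptions of this example rather than the exact protocol used in the experiments.

```python
def attachment_scores(gold_heads, pred_heads):
    """Return (directed, undirected) attachment accuracy for one sentence.

    Heads are 1-indexed positions, with 0 standing for ROOT.
    """
    gold_arcs = {(h, d) for d, h in enumerate(gold_heads, start=1)}
    directed = 0
    undirected = 0
    for d, h in enumerate(pred_heads, start=1):
        if (h, d) in gold_arcs:
            directed += 1
            undirected += 1
        elif (d, h) in gold_arcs:  # right pair of words, wrong direction
            undirected += 1
    n = len(gold_heads)
    return directed / n, undirected / n


def unlabeled_f1(gold_spans, pred_spans):
    """Unlabeled constituent F1 over sets of (start, end) spans."""
    gold, pred = set(gold_spans), set(pred_spans)
    overlap = len(gold & pred)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = [2, 4, 4, 0, 6, 4]  # "The dog is chasing the cat"
    pred = [2, 4, 4, 0, 4, 5]  # one wrong head, one reversed arc
    print(attachment_scores(gold, pred))
    print(unlabeled_f1({(0, 2), (4, 6), (3, 6), (2, 6)},
                       {(0, 2), (4, 6), (2, 4), (2, 6)}))
```

The reversed arc between "the" and "cat" counts toward the undirected score but not the directed one, which is the distinction the two dependency metrics are meant to capture.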
We tune hyperparameters of the model to minimize perplexity on the validation set. We choose perplexity because it requires only plain text and not annotated parse trees. Specifically, we tuned the architecture of
in the space of multilayer perceptrons, with the dimension of each layer being
, with residual connections and different non-linear activation functions. Table 1 shows the final hyper-parameters of our model. Due to memory constraints on a single graphics card, we set the number of non-terminals and preterminals to 10 and 20, respectively. Later we will show that the compound PCFG’s performance benefits from a larger grammar; it is therefore possible that the same is true for our neural L-PCFG. The conclusion includes a more detailed discussion of space complexity.
|#layers of , ,||6, 6, 4|
| Model | Gold Tags | Word Embedding | DAS Dev | DAS Test | UAS Dev | UAS Test | F1 Dev | F1 Test |
|---|---|---|---|---|---|---|---|---|
| Compound PCFG | ✗ | | 15.6 (3.9) | 17.8 (4.2) | 27.7 (4.1) | 30.2 (5.3) | 45.63 (1.71) | 47.79 (2.32) |
| Compound PCFG | ✗ | GloVe | 16.4 (2.4) | 18.6 (3.7) | 28.7 (3.5) | 31.6 (4.5) | 45.52 (2.14) | 48.20 (2.53) |
| DMV | ✗ | - | 24.7 (1.5) | 27.2 (1.9) | 43.2 (1.9) | 44.3 (2.2) | - | - |
| DMV | ✓ | - | 28.5 (1.9) | 29.9 (2.5) | 45.5 (2.8) | 47.3 (2.7) | - | - |
| Neural L-PCFGs | ✗ | | 37.5 (2.7) | 39.7 (3.1) | 50.6 (3.1) | 53.3 (4.2) | 52.90 (3.72) | 55.31 (4.03) |
| Neural L-PCFGs | ✗ | GloVe | 38.2 (2.1) | 40.5 (2.9) | 54.4 (3.6) | 55.9 (3.8) | 45.67 (0.95) | 47.23 (2.06) |
| Neural L-PCFGs | ✓ | | 35.4 (0.5) | 39.2 (1.1) | 50.0 (1.3) | 53.8 (1.7) | 51.16 (5.11) | 54.49 (6.32) |

Table 2: DAS/UAS stand for directed/undirected accuracy. For the compound PCFG we use heuristic head rules to obtain dependencies (§5.2). Figures in parentheses show the standard deviation calculated from five runs with different random seeds. The marked entry is a large (30 NT, 60 PT) compound PCFG from kim2019compound; we could not use this size in our experiments due to memory constraints. Its results are not directly comparable with the other rows due to model size, but we report them for completeness. Best average performances are indicated in bold.
We compare our neural L-PCFGs with the following baselines:
The compound PCFG (kim2019compound) is an unsupervised constituency parsing model which is a PCFG model with neural scoring. The main difference between this model and neural L-PCFG is the modeling of headedness and the dependency between head word and generated non-terminals or preterminals. We apply the same hyperparameters and techniques, including number of non-terminals and preterminals, initialization, curriculum learning and variational training to compound PCFGs for a fair comparison. Because compound PCFGs have no notion of dependencies, we extract dependencies from the compound PCFG with three kinds of heuristic head rules: left-headed, right-headed and large-headed. Left-/right-headed mean always choosing the root of the left/right child constituent as the root of the parent constituent, whereas large-headedness is generated by a heuristic rule which chooses the root of larger child constituent as the root of the parent constituent. Among these, we choose the method that obtains the best parsing accuracy on the dev set (making these results an oracle with access to more information than our proposed method).
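The three heuristic head rules can be sketched as a single recursive function over an unlabeled binary tree. Trees are encoded here as nested pairs of 1-indexed word positions, and ties under the large-headed rule are broken toward the left child; both are assumptions of this example, since the text does not specify them.

```python
def num_leaves(tree):
    """Number of words covered by a (sub)tree."""
    if isinstance(tree, int):
        return 1
    return num_leaves(tree[0]) + num_leaves(tree[1])


def extract_heads(tree, rule):
    """Return (root_position, arcs) under the 'left', 'right' or 'large' rule.

    Arcs are (head, dependent) position pairs.
    """
    if isinstance(tree, int):
        return tree, []
    left, right = tree
    left_head, left_arcs = extract_heads(left, rule)
    right_head, right_arcs = extract_heads(right, rule)
    if rule == "left":
        pick_left = True
    elif rule == "right":
        pick_left = False
    elif rule == "large":
        # Assumption: when both children are the same size, prefer the left.
        pick_left = num_leaves(left) >= num_leaves(right)
    else:
        raise ValueError(f"unknown head rule: {rule}")
    head, dependent = (left_head, right_head) if pick_left else (right_head, left_head)
    return head, left_arcs + right_arcs + [(head, dependent)]


if __name__ == "__main__":
    # ((The dog) (is (chasing (the cat)))) with 1-indexed word positions.
    tree = ((1, 2), (3, (4, (5, 6))))
    for rule in ("left", "right", "large"):
        print(rule, extract_heads(tree, rule))
```

Selecting among the three rules by dev-set accuracy, as described above, then amounts to running this function with each rule and scoring the resulting arcs against the gold dependencies.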
Dependency Model with Valence (DMV)
The DMV klein2004corpus is a model for unsupervised dependency parsing, where valence stands for the number of arguments controlled by a head word. The choices to attach words as children are conditioned on the head words and valences. As shown in (smith2006novel), the DMV model can be expressed as a head-driven context-free grammar with a set of generation rules and scores, where the non-terminals represent the valence of head words. For example, “L[chasing] [is] R[chasing]” denotes that left-hand constituent with full left valence produces a word and a constituent with full right valence. Therefore, it could be seen as a special case of lexicalized PCFG where the generation rules provide inductive biases for dependency parsing but are also restricted – for example, a void-valence constituent cannot produce a full-valence constituent with the same head.
Note that DMV uses far fewer parameters than the PCFG-based models. The compound PCFG uses a similar number of parameters as we do.
We compare models under two settings: (1) with gold tag information and (2) without it, denoted by ✓ and ✗, respectively, in Table 2. To use gold tag information in training the neural L-PCFG, we assign the 19 most frequent tags as categories and combine the rest into a 20th “other” category. These categories are used as supervision for the preterminals. In this setting, instead of optimizing the log probability of the sentence, we optimize the log joint probability of the sentence and the tags.
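The tag bucketing described above can be sketched with a frequency count. The function names and the toy tag sequences are hypothetical; the default of 20 categories follows the description in the text.

```python
from collections import Counter


def build_tag_map(tag_sequences, num_categories=20):
    """Map the most frequent tags to their own ids; the last id means 'other'."""
    counts = Counter(tag for seq in tag_sequences for tag in seq)
    frequent = [tag for tag, _ in counts.most_common(num_categories - 1)]
    tag_to_id = {tag: i for i, tag in enumerate(frequent)}
    other_id = num_categories - 1
    return tag_to_id, other_id


def map_tags(tags, tag_to_id, other_id):
    """Convert a tag sequence into preterminal category ids."""
    return [tag_to_id.get(tag, other_id) for tag in tags]


if __name__ == "__main__":
    # Toy corpus with 3 categories instead of 20, to keep the output small.
    corpus = [["DT", "NN", "VBZ"], ["DT", "NN", "IN"], ["DT", "JJ", "NN"]]
    tag_to_id, other_id = build_tag_map(corpus, num_categories=3)
    print(map_tags(["DT", "JJ", "NN", "IN"], tag_to_id, other_id))
```

The resulting ids can serve as the supervised preterminal labels when optimizing the joint probability of the sentence and its tags.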
5.3 Quantitative Results
First, in this section, we present and discuss quantitative results, as shown in Table 2.
5.3.1 Main Results
First comparing neural L-PCFGs with compound PCFGs, we can see that L-PCFGs perform slightly better on phrase structure prediction and achieve much better dependency accuracy. This shows that (1) lexical dependencies contribute somewhat to the learning of phrase structure, and (2) the head rules learned by neural L-PCFGs are significantly more accurate than the heuristics that we applied to standard compound PCFGs. We also find that GloVe embeddings can help (unsupervised) dependency parsing, but do not benefit constituency parsing.
Next, we can compare the dependency induction accuracy of the neural L-PCFGs with the DMV. The results indicate that neural L-PCFGs without gold tags achieve even better accuracy than DMV with gold tags on both directed accuracy and undirected accuracy. As discussed before, DMV can be seen as a special case of L-PCFG where the attachment of children is conditioned on the valence of the parent tag, while in L-PCFG the generated head directions are conditioned on the parent non-terminal and the head word, which is more general. Comparatively positive results show that conditioning on generation rules not only is more general but also yields a better prediction of attachment.
Table 3 shows label-level recall, i.e. unlabeled recall of constituents annotated by each non-terminal. We observe that the neural L-PCFG outperforms all baselines on these frequent constituent categories.
5.3.2 Impact of Factorization
Table 4 compares the effects of three alternate factorizations of :
Factorization I assumes that the child non-terminals do not depend on the head lexical item, which influences the parsing result significantly. Although Factorization II is as general as our proposed method, it uses separate representations for different directions, and . Factorization III assumes the independence between direction and dependent non-terminals. These results indicate that our factorization strikes a good balance between modeling lexical dependencies and directionality, and avoiding over-parameterization of the model that may lead to sparsity and difficulties in learning.
|w/ xavier init||27.2||47.6||43.6|
|w/ Factorization I||16.4||33.3||25.7|
|w/ Factorization II||22.3||42.7||39.6|
|w/ Factorization III||25.9||46.9||34.7|
“w/ xavier init” means that preterminals are initialized not by clustering centroids but by Xavier normal initialization. “w/ Factorization N” represents different factorization methods (§5.3.2).
5.4 Qualitative Analysis
We analyze our best model without gold tags in detail. Figure 4 visualizes the alignment between our induced non-terminals and gold constituent labels on the overlapping constituents of induced trees and the ground truth. For each constituent label, we show how frequently it annotates the same span as each induced non-terminal. In the first map we observe a clear alignment between certain linguistic labels and induced non-terminals, e.g. VP and NT-4, S and NT-2, PP and NT-8. For other gold non-terminals, however, there is no clear alignment with induced classes. One hypothesis is that this diffusion is due to the diversity of the syntactic roles of these constituents. To investigate this, we zoom in on noun phrases in the second map, and observe that NP-SBJ, NP-TMP and NP-MNR are combined into a single non-terminal NT-5 in the induced grammar, and that NP, NP-PRD and NP-CLR correspond to NT-2, NT-6 and NT-0, respectively.
We also include an example set of parses comparing the DMV and the neural L-PCFG in the case-study table. Note that whereas the DMV uses “to” as the head of “know”, the neural L-PCFG correctly inverts this relationship to produce a parse that is better aligned with the gold tree. One possible reason that the DMV tends to use “to” as the head is that the DMV has to carry the information that the verb is in the infinitive form, which would be lost if it used “know” as the head. In our model, however, such information is contained in the types of non-terminals. In this way, our model uses the open-class word “know” as the root. Note that this example also illustrates a similar failure case: the neural L-PCFG uses “if” as the head of the if-clause, which is probably due to the independence between the root of the if-clause and “know”.
A common mistake made by the neural L-PCFG is treating auxiliary verbs like adjectives that combine with the subject instead of modifying verb phrases. For example, the neural L-PCFG parses “…the exchange will look at the performance…” as “((the exchange) will) (look (at (the performance)))”, while the compound PCFG produces the correct parse “((the exchange) (will (look (at (the performance)))))”. A possible reason for this mistake is that English verb phrases are commonly left-headed, which makes attaching an auxiliary verb as the left child of a verb phrase less probable. This type of error may stem from the model’s inability to assess the semantic function of auxiliary verbs (Bisk2015).