Probabilistic Models for High-Order Projective Dependency Parsing

02/14/2015 ∙ by Xuezhe Ma, et al. ∙ Shanghai Jiao Tong University

This paper presents generalized probabilistic models for high-order projective dependency parsing and an algorithmic framework for learning these statistical models over dependency trees. Partition functions and marginals for high-order dependency trees can be computed efficiently by our algorithms, which extend the inside-outside algorithm to higher-order cases. To show the effectiveness of our algorithms, we perform experiments on three languages (English, Chinese and Czech), using maximum conditional likelihood estimation for model training and L-BFGS for parameter estimation. Our methods achieve competitive performance for English, and outperform all previously reported dependency parsers for Chinese and Czech.




1 Introduction

Dependency parsing is an approach to syntactic analysis inspired by dependency grammar. In recent years, several areas of Natural Language Processing have benefited from dependency representations, such as synonym generation [Shinyama, Sekine, and Sudo 2002], relation extraction [Nguyen, Moschitti, and Riccardi 2009] and machine translation [Katz-Brown et al. 2011; Xie, Mi, and Liu 2011]. A primary reason for using dependency structures instead of more informative constituent structures is that they are usually easier to understand and more amenable to annotators who have good knowledge of the target domain but lack deep linguistic knowledge [Yamada and Matsumoto 2003], while still containing much of the information needed in applications.

A dependency structure represents a parse tree as a directed graph with a label on each edge, and several graph-based methods have been applied to it with high performance. Based on the report of the CoNLL-X shared task on dependency parsing [Buchholz and Marsi 2006; Nivre et al. 2007], there are currently two dominant approaches to data-driven dependency parsing: local-and-greedy transition-based algorithms [Yamada and Matsumoto 2003; Nivre and Scholz 2004; Attardi 2006; McDonald and Nivre 2007] and globally optimized graph-based algorithms [Eisner 1996; McDonald, Crammer, and Pereira 2005; McDonald et al. 2005; McDonald and Pereira 2006; Carreras 2007; Koo and Collins 2010]. Graph-based parsing models have achieved state-of-the-art accuracy for a wide range of languages.

There are several existing graph-based dependency parsers, most of which employ online learning algorithms such as the averaged structured perceptron (AP) [Freund and Schapire 1999; Collins 2002] or the Margin Infused Relaxed Algorithm (MIRA) [Crammer and Singer 2003; Crammer et al. 2006; McDonald 2006] for learning parameters. However, one shortcoming of these parsers is that learning their parameters usually takes a long time (several hours per iteration). The primary reason is that the training step cannot be performed in parallel: for online learning algorithms, the update for a new training instance depends on the parameters updated with the previous instance.

Paskin (2001) proposed a variant of the inside-outside algorithm [Baker 1979], which was applied to the grammatical bigram model [Eisner 1996]. Using this algorithm, the grammatical bigram model can be learned by off-line learning algorithms. However, the grammatical bigram model is based on a strong independence assumption: all the dependency edges of a tree are independent of one another. This assumption restricts the model to first-order factorization (single edges), losing much of the contextual information in the dependency tree. Chen (2010) showed that a wide range of decision history can lead to significant improvements in accuracy for graph-based dependency parsing models. Meanwhile, several previous works [Carreras 2007; Koo and Collins 2010] have shown that grandchild interactions provide important information for dependency parsing. Therefore, relaxing the independence assumption to allow higher-order parts, which capture much richer contextual information within the dependency tree, is a reasonable improvement of the bigram model.

In this paper, we present a generalized probabilistic model that can be applied to any type of factored model for projective dependency parsing, and an algorithmic framework for learning these statistical models. We use the grammatical bigram model as the backbone, but relax the independence assumption and extend the inside-outside algorithm to efficiently compute the partition functions and marginals (see Section 2.4) for three higher-order models. With the proposed framework, parallel computation techniques can be employed, significantly reducing the time taken to train the parsing models. To evaluate our parsers empirically, we implement these algorithms and evaluate them on three treebanks: the Penn WSJ Treebank [Marcus, Santorini, and Marcinkiewicz 1993] for English, the Penn Chinese Treebank [Xue et al. 2005] for Chinese and the Prague Dependency Treebank [Hajič 1998; Hajič et al. 2001] for Czech, and we expect to achieve an improvement in parsing performance. We also give an error analysis on structural properties for the parsers trained by our framework and those trained by online learning algorithms. A free distribution of our implementation has been made available online.

The remainder of this paper is structured as follows. Section 2 describes the probabilistic models and the algorithmic framework for training the models. Related work is presented in Section 3. Section 4 presents the algorithms for computing partition functions and marginals in the different parsing models. The details of the experiments are reported in Section 5, and conclusions are in Section 6.

Figure 1: An example dependency tree.

2 Dependency Parsing

2.1 Background of Dependency Parsing

Dependency trees represent syntactic relationships through labeled directed edges between words and their syntactic modifiers. For example, Figure 1 shows a dependency tree for the sentence “Economic news had little effect on financial markets”, with the sentence’s root-symbol as its root.

Depending on whether crossing dependencies are allowed, dependency trees fall into two categories: projective and non-projective. A convenient formulation of the projectivity constraint is that a dependency tree is projective if it can be written with all words in their predefined linear order and all edges drawn on the plane without any crossing edges (see Figure 1(b)). The example in Figure 1 belongs to the class of projective dependency trees, in which crossing dependencies are not allowed.
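The projectivity constraint can be checked directly from the head of each word. The following is a minimal sketch, assuming a tree is stored as a list `heads` in which `heads[m]` is the head of word `m` and index 0 stands for the root-symbol; the function name and the head indices given for the example sentence are illustrative assumptions, not taken from the paper.

```python
def is_projective(heads):
    """Return True if no two dependency edges cross.

    heads[m] is the index of the head of word m for m = 1..n;
    heads[0] is a placeholder because index 0 is the root-symbol.
    """
    edges = [(min(h, m), max(h, m)) for m, h in enumerate(heads) if m > 0]
    for a, b in edges:
        for c, d in edges:
            # (c, d) starts strictly inside (a, b) and ends strictly outside it.
            if a < c < b < d:
                return False
    return True


# Assumed reading of the example sentence (1 = Economic, ..., 8 = markets).
example = [None, 2, 3, 0, 5, 3, 5, 8, 6]
print(is_projective(example))             # True
print(is_projective([None, 3, 4, 0, 3]))  # False: edges 3->1 and 4->2 cross
```

The quadratic pairwise check is chosen for readability; it is not meant as an efficient test for long sentences.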

Dependency trees are often typed with labels on each edge to represent additional syntactic information (see Figure 1(a)), such as labels that mark verb-subject and verb-object head-modifier interactions. Sometimes, however, the dependency labels are omitted. Dependency trees are called labeled or unlabeled according to whether the dependency labels are included or dropped. In the remainder of this paper, we focus on unlabeled dependency parsing for both theoretical and practical reasons. From a theoretical perspective, unlabeled parsers are easier to describe and understand, and algorithms for unlabeled parsing can usually be extended easily to the labeled case. From a practical perspective, algorithms for labeled parsing generally have higher computational complexity than their unlabeled counterparts, and are more difficult to implement and verify. Finally, the dependency labels can be accurately assigned by a two-stage labeling method [McDonald 2006] that uses the unlabeled output parse.

2.2 Probabilistic Model

The notation used in this paper is as follows: $x$ represents a generic input sentence, and $y$ represents a generic dependency tree. $T(x)$ denotes the set of possible dependency trees for sentence $x$. The probabilistic model for dependency parsing defines a family of conditional probabilities $\Pr(y \mid x)$ over all $y \in T(x)$ given sentence $x$, with a log-linear form:

$$\Pr(y \mid x) = \frac{1}{Z(x)} \exp\Big\{ \sum_j \lambda_j F_j(y, x) \Big\}$$

where $F_j$ are feature functions, $\lambda = (\lambda_1, \lambda_2, \ldots)$ are the parameters of the model, and $Z(x)$ is a normalization factor, which is commonly referred to as the partition function:

$$Z(x) = \sum_{y \in T(x)} \exp\Big\{ \sum_j \lambda_j F_j(y, x) \Big\}$$

2.3 Maximum Likelihood Parameter Inference

Maximum conditional likelihood estimation is used for model training (as in a CRF). For a set of training data $\{(x_k, y_k)\}$, the logarithm of the likelihood, known as the log-likelihood, is given by:

$$\mathcal{L}(\lambda) = \sum_k \log \Pr(y_k \mid x_k) = \sum_k \Big[ \sum_j \lambda_j F_j(y_k, x_k) - \log Z(x_k) \Big]$$

Maximum likelihood training chooses parameters such that the log-likelihood is maximized. This optimization problem is typically solved using quasi-Newton numerical methods such as L-BFGS [Nash and Nocedal 1991], which require the gradient of the objective function:

$$\frac{\partial \mathcal{L}(\lambda)}{\partial \lambda_j} = \sum_k \Big[ F_j(y_k, x_k) - \sum_{y \in T(x_k)} \Pr(y \mid x_k) F_j(y, x_k) \Big] \qquad (2.3)$$

The computation of $Z(x)$ and of the second term in the summation of Equation (2.3) are the difficult parts of model training. In the following, we show how these can be computed efficiently using the proposed algorithms.
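The structure of this gradient (observed feature values minus expected feature values) can be illustrated on a toy problem. The sketch below is not the training procedure of the paper: it assumes a single training sentence whose candidate trees are few enough to enumerate, represents each candidate only by its feature vector, and uses hypothetical function names.

```python
import math


def log_partition(lam, feats):
    """log Z for one sentence; feats holds one feature vector F(y, x) per candidate tree y."""
    scores = [sum(l * f for l, f in zip(lam, F)) for F in feats]
    top = max(scores)  # subtract the maximum for numerical stability
    return top + math.log(sum(math.exp(s - top) for s in scores))


def log_likelihood(lam, feats, gold):
    """Log-probability of the gold candidate (index `gold`) under the log-linear model."""
    gold_score = sum(l * f for l, f in zip(lam, feats[gold]))
    return gold_score - log_partition(lam, feats)


def gradient(lam, feats, gold):
    """Observed feature values minus their expectation under the model."""
    log_z = log_partition(lam, feats)
    probs = [math.exp(sum(l * f for l, f in zip(lam, F)) - log_z) for F in feats]
    expected = [sum(p * F[j] for p, F in zip(probs, feats)) for j in range(len(lam))]
    return [feats[gold][j] - expected[j] for j in range(len(lam))]


lam = [0.3, -0.2, 0.5]
feats = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [1.0, 1.0, 0.0]]
print(gradient(lam, feats, gold=0))
```

With several training sentences the per-sentence gradients are simply added, which is why this objective can be evaluated on different sentences independently.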

2.4 Problems of Training and Decoding

In order to train and decode dependency parsers, we have to solve three inference problems which are central to the algorithms proposed in this paper.

The first problem is the decoding problem of finding the best parse for a sentence when all the parameters of the probabilistic model are given. According to decision theory, a reasonable solution for classification is the Bayes classifier, which assigns an input to its most probable class under the conditional distribution. Dependency parsing can be regarded as a classification problem, so decoding a dependency parser is equivalent to finding the dependency tree $y^*$ that has the maximum conditional probability:

$$y^* = \operatorname*{argmax}_{y \in T(x)} \Pr(y \mid x) = \operatorname*{argmax}_{y \in T(x)} \sum_j \lambda_j F_j(y, x) \qquad (2)$$


The second and third problems are the computation of the partition function and the gradient of the log-likelihood (see Equation (2.3)).

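From the definitions above, we can see that all three problems require an exhaustive search over $T(x)$ to perform a maximization or summation. The cardinality of $T(x)$ grows exponentially with the length of $x$, so it is impractical to perform the search directly. A common strategy is to factor dependency trees into sets of small parts that have limited interactions:

$$F_j(y, x) = \sum_{p \in y} f_j(p, x) \qquad (3)$$

That is, a dependency tree $y$ is treated as a set of parts $p$, and each feature function $F_j(y, x)$ is equal to the sum of the part-level features $f_j(p, x)$.

We denote the weight of each part $p$ as follows:

$$w_p(x) = \exp\Big\{ \sum_j \lambda_j f_j(p, x) \Big\}$$

Based on Equation (3) and the definition of the weight of each part, the conditional probability $\Pr(y \mid x)$ has the following form:

$$\Pr(y \mid x) = \frac{1}{Z(x)} \prod_{p \in y} w_p(x) \qquad (4)$$

Furthermore, Equation (2) can be rewritten as:

$$y^* = \operatorname*{argmax}_{y \in T(x)} \sum_{p \in y} \log w_p(x)$$

and the partition function and the second term in the summation of Equation (2.3) are

$$Z(x) = \sum_{y \in T(x)} \prod_{p \in y} w_p(x)$$

$$\sum_{y \in T(x)} \Pr(y \mid x) F_j(y, x) = \sum_{p \in P(x)} f_j(p, x) \sum_{y \in T(x, p)} \Pr(y \mid x)$$

where $T(x, p) = \{ y \in T(x) : p \in y \}$ and $P(x)$ is the set of all possible parts for sentence $x$. Note that the remaining problem for the computation of the gradient in Equation (2.3) is to compute the marginal probability $m(p, x)$ for each part $p$:

$$m(p, x) = \sum_{y \in T(x, p)} \Pr(y \mid x)$$

Then the three inference problems are as follows:

  • Problem 1: Decoding. $y^* = \operatorname*{argmax}_{y \in T(x)} \sum_{p \in y} \log w_p(x)$

  • Problem 2: Computing the Partition Function. $Z(x) = \sum_{y \in T(x)} \prod_{p \in y} w_p(x)$

  • Problem 3: Computing the Marginals. $m(p, x) = \frac{1}{Z(x)} \sum_{y \in T(x, p)} \prod_{q \in y} w_q(x)$, for all $p \in P(x)$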
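For very short sentences the three problems can be solved by exhaustive enumeration, which is useful as a reference when testing efficient algorithms. The sketch below makes several assumptions that are not from the paper: parts are single edges (first-order), trees are projective and may have several root children, and all function names are hypothetical.

```python
import itertools
import math


def is_valid(heads):
    """Check that heads encodes a projective tree rooted at index 0."""
    n = len(heads) - 1
    for m in range(1, n + 1):
        seen, h = set(), m
        while h != 0:          # follow head links up to the root-symbol
            if h in seen:
                return False   # cycle (including a word that heads itself)
            seen.add(h)
            h = heads[h]
    edges = [(min(heads[m], m), max(heads[m], m)) for m in range(1, n + 1)]
    return not any(a < c < b < d for a, b in edges for c, d in edges)


def all_trees(n):
    """Enumerate every projective multi-root tree over n words."""
    for hs in itertools.product(range(n + 1), repeat=n):
        heads = (0,) + hs      # heads[0] is a placeholder
        if is_valid(heads):
            yield heads


def tree_weight(heads, w):
    """Product of the edge weights w[(head, modifier)] of one tree."""
    return math.prod(w[(heads[m], m)] for m in range(1, len(heads)))


def solve(n, w):
    trees = list(all_trees(n))
    best = max(trees, key=lambda y: tree_weight(y, w))    # Problem 1
    Z = sum(tree_weight(y, w) for y in trees)             # Problem 2
    marginals = {(h, m): sum(tree_weight(y, w) for y in trees if y[m] == h) / Z
                 for (h, m) in w}                         # Problem 3
    return best, Z, marginals


n = 3
w = {(h, m): 1.0 + 0.5 * h + 0.1 * m
     for h in range(n + 1) for m in range(1, n + 1) if h != m}
best, Z, marginals = solve(n, w)
print(best, Z)
```

Because every word has exactly one head in every tree, the edge marginals of each modifier sum to one, which gives a simple sanity check on any implementation of Problem 3.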

2.5 Discussion

It should be noted that for parsers trained by online learning algorithms such as AP or MIRA, only an algorithm for the decoding problem is required. However, to train parsers with off-line parameter estimation methods such as the maximum likelihood estimation described above, we also have to carefully design algorithms for inference Problems 2 and 3.

The proposed probabilistic model generalizes to any type of part $p$, and can be learned using the framework that solves the three inference problems. For different types of factored models, the algorithms that solve the three inference problems are different. Following Koo and Collins (2010), the order of a part is defined as the number of dependencies it contains, and the order of a factorization or parsing algorithm is the maximum order of the parts it uses. In this paper, we focus on three factorizations: sibling and grandchild, which are two different second-order parts, and grand-sibling, which is a third-order part.
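To make the three part types concrete, the following sketch lists the sibling, grandchild and grand-sibling parts of a given tree. It assumes the tuple layouts `(s, r, t)`, `(g, s, t)` and `(g, s, r, t)`, where `s` is a head, `t` one of its modifiers, `r` the adjacent modifier of `s` that lies between `s` and `t` (or `None` for the inner-most modifier) and `g` the head of `s`; the function name and the `None` convention are illustrative assumptions.

```python
def extract_parts(heads):
    """List the higher-order parts of a tree given as heads[m] = head of word m.

    Index 0 is the root-symbol, so heads[0] is a placeholder.
    """
    n = len(heads) - 1
    sibling, grandchild, grand_sibling = [], [], []
    for s in range(n + 1):
        # Modifiers of s on each side, ordered from the inner-most outwards.
        left = [m for m in range(s - 1, 0, -1) if heads[m] == s]
        right = [m for m in range(s + 1, n + 1) if heads[m] == s]
        for side in (left, right):
            r = None                   # the inner-most modifier has no inner sibling
            for t in side:
                sibling.append((s, r, t))
                if s != 0:             # the root-symbol has no head of its own
                    g = heads[s]
                    grandchild.append((g, s, t))
                    grand_sibling.append((g, s, r, t))
                r = t
    return sibling, grandchild, grand_sibling


# Word 1 is attached to the root-symbol; words 2 and 3 both modify word 1.
sib, gch, gsib = extract_parts([None, 0, 1, 1])
print(sib)
print(gch)
print(gsib)
```

Under this layout a tree contributes one sibling part per edge, and one grandchild and one grand-sibling part per edge whose head is not the root-symbol.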

In this paper, we consider only projective trees, in which crossing dependencies are not allowed, and exclude non-projective trees, in which dependencies are allowed to cross. For projective parsing, efficient algorithms exist to solve the three problems for certain factorizations with special structures. Non-projective parsing with high-order factorizations is known to be NP-hard [McDonald and Pereira 2006; McDonald and Satta 2007]. In addition, our models capture multi-root trees, in which the root-symbol has one or more children. A multi-root parser is more robust to sentences that contain disconnected but coherent fragments, since it is allowed to split up its analysis into multiple pieces.

2.6 Labeled Parsing

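Our probabilistic model is easily extended to include dependency labels. We denote by $L$ the set of all valid dependency labels. We change the feature functions to include the labels:

$$F_j(y, x) = \sum_{(p, l_p) \in y} f_j(p, l_p, x)$$

where $l_p$ is the vector of dependency labels of the edges belonging to part $p$. We define the order of $l_p$ as the number of labels $l_p$ contains, and denote it as $o(l_p)$. It should be noted that the order of $l_p$ is not necessarily equal to the order of $p$, since $l_p$ may contain the labels of only some of the edges in $p$. For example, for the second-order sibling model and the part $(s, r, t)$, $l_p$ can be defined to contain only the label of the edge from word $s$ to word $t$.

The weight function of each part is changed to:

$$w_p(l_p, x) = \exp\Big\{ \sum_j \lambda_j f_j(p, l_p, x) \Big\}$$

Based on Equation 4, Problems 2 and 3 are rewritten as follows:

$$Z(x) = \sum_{y \in T(x)} \prod_{(p, l_p) \in y} w_p(l_p, x)$$

$$m(p, l_p, x) = \frac{1}{Z(x)} \sum_{y \in T(x, p, l_p)} \prod_{(q, l_q) \in y} w_q(l_q, x)$$

where $T(x, p, l_p)$ is the set of trees that contain part $p$ with labels $l_p$. This extension increases the time complexity by a factor of $O(|L|^{o(l_p)})$, where $|L|$ is the size of $L$.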

3 Related Work

3.1 Grammatical Bigram Probability Model

The probabilistic model described in Section 2.2 is a generalized formulation of the grammatical bigram probabilistic model proposed by Eisner (1996), which is used in several works [Paskin 2001; Koo et al. 2007; Smith and Smith 2007]. In fact, the grammatical bigram probabilistic model is a special case of our probabilistic model, obtained by specifying the parts as individual edges. The grammatical bigram model is based on a strong independence assumption: all the dependency edges of a tree are independent of one another, given the sentence $x$.

For the first-order model (where each part is an individual edge), a variant of the inside-outside algorithm, which was proposed by Baker (1979) for probabilistic context-free grammars, can be applied to compute the partition function and marginals for projective dependency structures. This inside-outside algorithm is built on the semiring parsing framework [Goodman 1999]. For non-projective cases, Problems 2 and 3 can be solved by an adaptation of Kirchhoff’s Matrix-Tree Theorem [Koo et al. 2007; Smith and Smith 2007].
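One way to see why the partition function and the marginals are naturally computed together is the identity below, written as a short derivation. It assumes the part-factored form of the model from Section 2.4, with $w_p(x)$ the weight of part $p$, $T(x, p)$ the set of trees containing $p$ and $m(p, x)$ the marginal of $p$, and it relies on each part occurring at most once in any tree.

```latex
\begin{aligned}
\frac{\partial \log Z(x)}{\partial \log w_p(x)}
  &= \frac{w_p(x)}{Z(x)} \cdot \frac{\partial}{\partial w_p(x)}
     \sum_{y \in T(x)} \prod_{q \in y} w_q(x) \\
  &= \frac{1}{Z(x)} \sum_{y \in T(x, p)} \prod_{q \in y} w_q(x) \\
  &= \sum_{y \in T(x, p)} \Pr(y \mid x) \;=\; m(p, x)
\end{aligned}
```

Under these assumptions, an outside pass can be read as an efficient way of obtaining all of these derivatives at once, given the inside quantities that produce $Z(x)$.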

Figure 2: The dynamic-programming structures and derivation of four graph-based dependency parsers with different types of factorization. Symmetric right-headed versions are elided for brevity.

3.2 Algorithms of Decoding Problem for Different Factored Models

It should be noted that if the score of a part is defined as the logarithm of its weight:

$$\mathrm{score}(p, x) = \log w_p(x) = \sum_j \lambda_j f_j(p, x)$$

then the decoding problem is equivalent to graph-based dependency parsing with a global linear model (GLM), and several parsing algorithms for different factorizations have been proposed in previous work. Figure 2 provides graphical specifications of these parsing algorithms.

McDonald et al. (2005) presented the first-order dependency parser, which decomposes a dependency tree into a set of individual edges. A widely used dynamic programming algorithm [Eisner 2000] was used for decoding. This algorithm introduces two interrelated types of dynamic programming structures: complete spans and incomplete spans [McDonald, Crammer, and Pereira 2005]. Larger spans are created from two smaller, adjacent spans by recursive combination in a bottom-up procedure.
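The bottom-up combination of complete and incomplete spans can be sketched for the first-order case as follows. This is an illustrative implementation rather than the authors' code: it works with edge weights instead of log-scores, allows several root children, and takes the way alternatives are combined as a parameter, so that `max` returns the weight of the best tree and `operator.add` returns the partition function. The function name and the dictionary layout of `w` are assumptions.

```python
from functools import reduce
import operator


def eisner_total(n, w, plus, zero=0.0):
    """Combine the weights of all projective multi-root trees over n words.

    w[(h, m)] is the weight of the edge from head h to modifier m, with
    h in 0..n (0 is the root-symbol) and m in 1..n. `plus` combines
    alternatives: operator.add sums over trees, max keeps the best tree.
    """
    RIGHT, LEFT = 0, 1  # the head is the left end / the right end of the span
    comp = [[[zero, zero] for _ in range(n + 1)] for _ in range(n + 1)]
    inc = [[[zero, zero] for _ in range(n + 1)] for _ in range(n + 1)]
    for s in range(n + 1):
        comp[s][s][RIGHT] = comp[s][s][LEFT] = 1.0

    def total(values):
        return reduce(plus, values, zero)

    for k in range(1, n + 1):          # span length, bottom-up
        for s in range(n - k + 1):
            t = s + k
            # Two adjacent complete spans facing each other, before adding an edge.
            joint = total(comp[s][r][RIGHT] * comp[r + 1][t][LEFT]
                          for r in range(s, t))
            inc[s][t][RIGHT] = joint * w.get((s, t), 0.0)  # edge s -> t
            inc[s][t][LEFT] = joint * w.get((t, s), 0.0)   # edge t -> s, absent when s == 0
            comp[s][t][RIGHT] = total(inc[s][r][RIGHT] * comp[r][t][RIGHT]
                                      for r in range(s + 1, t + 1))
            comp[s][t][LEFT] = total(comp[s][r][LEFT] * inc[r][t][LEFT]
                                     for r in range(s, t))
    return comp[0][n][RIGHT]


w = {(0, 1): 2.0, (0, 2): 1.0, (1, 2): 3.0, (2, 1): 0.5}
Z = eisner_total(2, w, operator.add)   # 2*1 + 2*3 + 0.5*1 = 8.5
best = eisner_total(2, w, max)         # 2*3 = 6.0
print(Z, best)
```

With all edge weights set to 1, the summing variant counts the projective multi-root trees of the sentence; for two words it returns 3.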

The second-order sibling parser [McDonald and Pereira 2006] breaks up a dependency tree into sibling parts: pairs of adjacent edges with a shared head. Koo and Collins (2010) proposed a parser that factors each dependency tree into a set of grandchild parts. Formally, a grandchild part is a triple of indices $(g, s, t)$ where $g$ is the head of $s$ and $s$ is the head of $t$. In order to parse this factorization, it is necessary to augment both complete and incomplete spans with grandparent indices. Following Koo and Collins (2010), we refer to these augmented structures as g-spans.

The second-order parser proposed by Carreras (2007) is able to score both sibling and grandchild parts, with complexities of $O(n^4)$ time and $O(n^3)$ space. However, the parser suffers from a crucial limitation: it can only evaluate grandchild parts for outermost grandchildren.

The third-order grand-sibling parser, which combines grandchild and sibling parts into a grand-sibling part, was described by Koo and Collins (2010). This factorization covers all grandchild and sibling parts and still requires $O(n^4)$ time and $O(n^3)$ space.

3.3 Transition-based Parsing

Another category of dependency parsing systems is “transition-based” parsing [Nivre and Scholz 2004; Attardi 2006; McDonald and Nivre 2007], which parameterizes models over transitions from one state to another in an abstract state machine. In these models, dependency trees are constructed by taking the highest-scoring transition at each state until a termination state is reached. Parameters in these models are typically learned using standard classification techniques to predict one transition from a set of possible transitions given a state history.
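The greedy procedure described above can be sketched with a small state machine. The transition system used here (an arc-standard style system with SHIFT, LEFT and RIGHT actions) is an assumption chosen for illustration, since the cited parsers differ in their transition sets, and the scoring function below is a hypothetical stand-in for a trained classifier that simply consults a gold tree.

```python
def greedy_parse(n, score):
    """Build a tree over words 1..n by always taking the best-scoring valid transition."""
    stack, buffer, heads = [0], list(range(1, n + 1)), {}
    while buffer or len(stack) > 1:
        valid = []
        if buffer:
            valid.append("SHIFT")
        if len(stack) > 1:
            valid.append("RIGHT")          # stack[-2] becomes the head of stack[-1]
            if stack[-2] != 0:
                valid.append("LEFT")       # stack[-1] becomes the head of stack[-2]
        action = max(valid, key=lambda a: score(stack, buffer, heads, a))
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "LEFT":
            heads[stack[-2]] = stack[-1]
            del stack[-2]
        else:
            heads[stack[-1]] = stack[-2]
            stack.pop()
    return heads


def make_oracle(gold):
    """Hypothetical scorer that prefers transitions consistent with a gold tree."""
    def score(stack, buffer, heads, action):
        if len(stack) > 1:
            below, top = stack[-2], stack[-1]
            if action == "LEFT" and gold[below] == top:
                return 1.0
            # Attach top only after all of its own modifiers have been attached.
            done = all(m in heads for m in gold if gold[m] == top)
            if action == "RIGHT" and gold[top] == below and done:
                return 1.0
        return 0.5 if action == "SHIFT" else 0.0
    return score


gold = {1: 2, 2: 3, 3: 0, 4: 5, 5: 3, 6: 5, 7: 8, 8: 6}
print(greedy_parse(8, make_oracle(gold)) == gold)   # True
```

Each word is shifted once and attached once, so the loop runs for exactly 2n steps; a learned scoring function would replace `make_oracle` at parsing time.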

Recently, several approaches have been proposed to improve transition-based dependency parsers. On the decoding side, beam search [Johansson and Nugues 2007; Huang, Jiang, and Liu 2009] and partial dynamic programming [Huang and Sagae 2010] have been applied to improve one-best search. On the training side, global structural learning has been applied to replace local learning on each decision [Zhang and Clark 2008; Huang, Jiang, and Liu 2009].

4 Algorithms for High-order Models

In this section, we describe our algorithms for Problems 2 and 3 for three high-order factored models: grandchild and sibling, which are two second-order models, and grand-sibling, which is third-order. Our algorithms build on the inside-outside algorithm [Paskin 2001] for the first-order projective parsing model. Following this, we define inside probabilities and outside probabilities over spans: the inside probability of a span sums the weights of the sub-structures that belong to that span, and the outside probability sums the weights of the remainders of the trees that lie outside it.

Algorithm 1: Compute inside probability for second-order Grandchild Model


Algorithm 2: Compute outside probability for second-order Grandchild Model

4.1 Model of Grandchild Factorization

In the second-order grandchild model, each dependency tree is factored into a set of grandchild parts: pairs of dependencies connected head-to-tail. Formally, a grandchild part is a triple of indices $(g, s, t)$ where both $(g, s)$ and $(s, t)$ are dependencies.

In order to compute the partition function and marginals for this factorization, we augment both incomplete and complete spans with grandparent indices. This is similar to the decoding algorithm of Koo and Collins (2010) for this grandchild factorization. Following Koo and Collins (2010), we refer to these augmented structures as g-spans. An incomplete g-span consists of a normal incomplete span together with the index $g$ of a grandparent lying outside the range of the span, with the implication that $g$ is the head of the span's head. Complete g-spans are defined analogously. In addition, the weight of a grandchild part $(g, s, t)$ is written in abbreviated form for brevity.

The algorithm for computing the inside probabilities is shown as Algorithm 1. The dynamic programming derivations resemble those of the decoding algorithm for this factorization; the only difference is that maximization is replaced with summation. This is possible because the spans defined for the two algorithms are the same. Note that since our algorithm considers multi-root dependency trees, we perform one more recursive step to compute the inside probability of the complete span covering the whole sentence, after computing the inside probabilities of all g-spans.

Algorithm 2 illustrates the algorithm for computing the outside probabilities. This is a top-down dynamic programming algorithm, and the key to it is determining all the contributions to the final outside probability of each g-span; fortunately, this can be done deterministically in all cases. For example, a complete g-span in the general case receives two different contributions: it is either combined with a g-span on its right side to build up a larger g-span, or combined with a g-span on its left side to form a larger g-span. Its outside probability is therefore the sum of two terms, each of which corresponds to one of the two cases (see Algorithm 2). It should be noted that two configurations of complete g-spans are special cases.

After the computation of the inside and outside probabilities for all spans, the marginals can be obtained from them.

Since the complexity of both Algorithm 1 and Algorithm 2 is $O(n^4)$ time and $O(n^3)$ space, the overall complexity of training this model is $O(n^4)$ time and $O(n^3)$ space, which is the same as that of the decoding algorithm for this factorization.

4.2 Model of Sibling Factorization

In order to parse the sibling factorization, a new type of span, the sibling span, is defined [McDonald 2006]. A sibling span represents the region between two successive modifiers of some head. The graphical specification of the dynamic program for the second-order sibling model, which goes back to the original work of Eisner [Eisner 1996], is shown in Figure 2. The key insight is that an incomplete span is constructed by combining a smaller incomplete span with a sibling span that covers the region between the two successive modifiers. This construction allows pairs of sibling dependents to be collected in a single state. Unsurprisingly, the dynamic-programming structures and derivations of the algorithm for computing the inside probabilities are the same as those of the decoding algorithm, so we omit the pseudo-code of this algorithm.

The algorithm for computing the outside probabilities can be designed with the new dynamic programming structures. The pseudo-code of this algorithm is given in Algorithm 3. The computation of the marginals of sibling parts is quite different from that of the first-order dependency model or the second-order grandchild model. Because of the introduction of sibling spans, two different cases should be considered: the modifiers are on the left side or on the right side of the head. In addition, the part in which a modifier is the inner-most modifier of its head, and therefore has no inner sibling, is a special case and should be treated separately. The marginals for all sibling parts are then obtained from the inside and outside probabilities.

Since each derivation is defined by a span and a split point, the complexity of training and decoding for the second-order sibling model is $O(n^3)$ time and $O(n^2)$ space.

Algorithm 3: Compute outside probability for second-order Sibling Model

Algorithm 4: Compute outside probability for third-order Grand-sibling Model

4.3 Model of Grand-Sibling Factorization

We now describe the algorithms for the third-order grand-sibling model. In this model, each tree is decomposed into grand-sibling parts, which combine grandchild and sibling parts. Formally, a grand-sibling part is a 4-tuple of indices $(g, s, r, t)$ where $(s, r, t)$ is a sibling part and $(g, s, t)$ is a grandchild part. The algorithms for this factorization can be designed based on the algorithms for the grandchild and sibling models.

Just as the second-order sibling model extends the first-order dependency model, we define sibling g-spans: a sibling g-span is a normal sibling span together with the index of the head of its two modifiers, which lies outside the region covered by the span, with the implication that the head and the two modifiers form a valid sibling part. This model can also be treated as an extension of the sibling model that augments each span with a grandparent index, just as the grandchild model extends the first-order dependency model. Figure 2 also provides the graphical specification of the dynamic program for this factorization. The overall structures and derivations are similar to those of the second-order sibling model, with the addition of grandparent indices. As in the second-order grandchild model, the grandparent indices can be set deterministically in all cases.

The pseudo-code of the algorithm for computing the outside probabilities is given in Algorithm 4. It should be noted that in this model there are two types of special cases: one concerns sibling g-spans in particular configurations, analogous to the special cases of complete g-spans in the second-order grandchild model; the other is the inner-most modifier case, as in the second-order sibling model. The marginals for all grand-sibling parts are then computed from the inside and outside probabilities.

Despite the extension to third-order parts, each derivation is still defined by a g-span and a split point, as in the second-order grandchild model, so training and decoding of the grand-sibling model still require $O(n^4)$ time and $O(n^3)$ space.

Treebank  Split     Sections               Sentences  Words
PTB       Training  2-21                   39,832     843,029
PTB       Dev       22                     1,700      35,508
PTB       Test      23                     2,416      49,892
CTB       Training  001-815; 1001-1136     16,079     370,777
CTB       Dev       886-931; 1148-1151     804        17,426
CTB       Test      816-885; 1137-1147     1,915      42,773
PDT       Training  -                      73,088     1,255,590
PDT       Dev       -                      7,318      126,028
PDT       Test      -                      7,507      125,713
Table 1: Training, development and test data for PTB, CTB and PDT. The Sentences and Words columns give the number of sentences and the number of words excluding punctuation in each data set, respectively.

5 Experiments for Dependency Parsing

5.1 Data Sets

We implement and evaluate the proposed algorithms for the three factored models (sibling, grandchild and grand-sibling) on the Penn English Treebank (PTB version 3.0) [Marcus, Santorini, and Marcinkiewicz 1993], the Penn Chinese Treebank (CTB version 5.0) [Xue et al. 2005] and the Prague Dependency Treebank (PDT) [Hajič 1998; Hajič et al. 2001].

For English, the PTB data is prepared using the standard split: sections 2-21 are used for training, section 22 for development, and section 23 for test. Dependencies are extracted using the Penn2Malt tool with standard head rules [Yamada and Matsumoto 2003]. For Chinese, we adopt the data split from Zhang and Clark (2009), and we also use the Penn2Malt tool to convert the data into dependency structures. Since the dependency trees for English and Chinese are extracted from phrase structures in the Penn Treebanks, they contain no crossing edges by construction. For Czech, the PDT has a predefined training, development and test split. We “projectivized” the training data by finding best-match projective trees. (Projective trees for training sentences are obtained by running the first-order projective parser with an oracle model that assigns a score of +1 to correct edges and -1 otherwise.)

All experiments were run on every sentence in each data set, regardless of length. Parsing accuracy is measured with the unlabeled attachment score (UAS), the percentage of words with the correct head; root accuracy (RA), the percentage of correctly identified root words; and the percentage of complete matches (CM). Following the standard of previous work, we did not include punctuation in the calculation of accuracies for English and Chinese. (English evaluation ignores any token whose gold-standard POS tag is one of a fixed set of punctuation tags; Chinese evaluation ignores any token whose tag is “PU”.) Detailed information on each treebank is shown in Table 1.
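The three metrics can be computed from gold and predicted head sequences as in the sketch below. The data layout and the function name are assumptions, and two conventions are choices made here rather than details from the paper: RA is computed over gold root words, and a sentence counts as a complete match when all of its non-punctuation words have the correct head.

```python
def evaluate(gold, pred, punct):
    """Compute UAS, RA and CM as percentages.

    gold, pred: one list of head indices per sentence (0 = root-symbol).
    punct: one list of booleans per sentence marking tokens to ignore.
    """
    words = correct = roots = roots_ok = complete = 0
    for g_heads, p_heads, skip in zip(gold, pred, punct):
        all_ok = True
        for g, p, is_punct in zip(g_heads, p_heads, skip):
            if is_punct:
                continue               # punctuation is excluded from scoring
            words += 1
            correct += (g == p)
            all_ok = all_ok and (g == p)
            if g == 0:                 # a gold root word
                roots += 1
                roots_ok += (p == 0)
        complete += all_ok
    return {"UAS": 100.0 * correct / words,
            "RA": 100.0 * roots_ok / roots,
            "CM": 100.0 * complete / len(gold)}


gold = [[2, 0, 2], [0, 1]]
pred = [[2, 0, 1], [0, 1]]
punct = [[False, False, False], [False, True]]
print(evaluate(gold, pred, punct))   # {'UAS': 75.0, 'RA': 100.0, 'CM': 50.0}
```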

dependency features for part (s, t)
uni-gram features   bi-gram features                      context features
L(s)P(s)            L(s)P(s)L(t)P(t)                      P(s)P(t)P(s+1)P(t-1)
L(s)                L(s)P(s)P(t)       P(s)L(t)P(t)       P(s)P(t)P(s-1)P(t-1)
P(s)                L(s)P(s)L(t)       L(s)L(t)P(t)       P(s)P(t)P(s+1)P(t+1)
L(t)P(t)            L(s)L(t)           P(s)P(t)           P(s)P(t)P(s+1)P(t-1)
L(t)                in between features
P(t)                L(s)L(b)L(t)       P(s)P(b)P(t)

grandchild features for part (g, s, t)       sibling features for part (s, r, t)
tri-gram features   backed-off features      tri-gram features   backed-off features
L(g)L(s)L(t)        L(g)L(t)                 L(s)L(r)L(t)        L(r)L(t)
P(g)P(s)P(t)        P(g)P(t)                 P(s)P(r)P(t)        P(r)P(t)
L(g)P(g)P(s)P(t)    L(g)P(t)                 L(s)P(s)P(r)P(t)    L(r)P(t)
P(g)L(s)P(s)P(t)    P(g)L(t)                 P(s)L(r)P(r)P(t)    P(r)L(t)
P(g)P(s)L(t)P(t)                             P(s)P(r)L(t)P(t)

grand-sibling features for part (g, s, r, t)
4-gram features     context features                       backed-off features
L(g)P(s)P(r)P(t)    P(g)P(s)P(r)P(t)P(g+1)P(s+1)P(t+1)     L(g)P(r)P(t)
P(g)L(s)P(r)P(t)    P(g)P(s)P(r)P(t)P(g-1)P(s-1)P(t-1)     P(g)L(r)P(t)