1 Introduction
Dependency parsing is an approach to syntactic analysis inspired by dependency grammar. In recent years, several domains of Natural Language Processing have benefited from dependency representations, such as synonym generation [Shinyama, Sekine, and Sudo 2002], relation extraction [Nguyen, Moschitti, and Riccardi 2009] and machine translation [Katz-Brown et al. 2011, Xie, Mi, and Liu 2011]. A primary reason for using dependency structures instead of more informative constituent structures is that they are usually easier to understand and more amenable to annotators who have good knowledge of the target domain but lack deep linguistic knowledge [Yamada and Matsumoto 2003], while still containing much of the useful information needed in applications. A dependency structure represents a parse tree as a directed graph with a label on each edge, and several graph-based methods have been applied to it with high performance. Based on the reports of the CoNLL-X shared tasks on dependency parsing [Buchholz and Marsi 2006, Nivre et al. 2007], there are currently two dominant approaches to data-driven dependency parsing: local-and-greedy transition-based algorithms [Yamada and Matsumoto 2003, Nivre and Scholz 2004, Attardi 2006, McDonald and Nivre 2007] and globally optimized graph-based algorithms [Eisner 1996, McDonald, Crammer, and Pereira 2005, McDonald et al. 2005, McDonald and Pereira 2006, Carreras 2007, Koo and Collins 2010]; graph-based parsing models have achieved state-of-the-art accuracy for a wide range of languages.
There have been several existing graph-based dependency parsers, most of which employ online learning algorithms such as the averaged structured perceptron (AP) [Freund and Schapire 1999, Collins 2002] or the Margin Infused Relaxed Algorithm (MIRA) [Crammer and Singer 2003, Crammer et al. 2006, McDonald 2006] for learning parameters. However, one shortcoming of these parsers is that learning the parameters of these models usually takes a long time (several hours per iteration). The primary reason is that the training step cannot be performed in parallel: for online learning algorithms, the update for a new training instance depends on the parameters updated with the previous instance. Paskin Paskin:2001 proposed a variant of the inside-outside algorithm [Baker 1979], which was applied to the grammatical bigram model [Eisner 1996]. Using this algorithm, the grammatical bigram model can be learned with offline learning algorithms. However, the grammatical bigram model is based on a strong independence assumption: that all the dependency edges of a tree are independent of one another. This assumption restricts the model to first-order factorization (single edges), losing much of the contextual information in the dependency tree. Chen et al. Chen:2010 illustrated that a wide range of decision history can lead to significant improvements in accuracy for graph-based dependency parsing models. Meanwhile, several previous works [Carreras 2007, Koo and Collins 2010] have shown that grandchild interactions provide important information for dependency parsing. Therefore, relaxing the independence assumption to higher-order parts that capture much richer contextual information within the dependency tree is a reasonable improvement of the bigram model.
In this paper, we present a generalized probabilistic model that can be applied to any type of factored model for projective dependency parsing, and an algorithmic framework for learning these statistical models. We use the grammatical bigram model as the backbone, but relax the independence assumption and extend the inside-outside algorithms to efficiently compute the partition functions and marginals (see Section 2.4) for three higher-order models. Using the proposed framework, parallel computation techniques can be employed, significantly reducing the time taken to train the parsing models. To evaluate our parsers empirically, these algorithms are implemented and evaluated on three treebanks: the Penn WSJ Treebank [Marcus, Santorini, and Marcinkiewicz 1993] for English, the Penn Chinese Treebank [Xue et al. 2005] for Chinese and the Prague Dependency Treebank [Hajič 1998, Hajič et al. 2001] for Czech, where we expect improvements in parsing performance. We also give an error analysis on structural properties for the parsers trained by our framework and those trained by online learning algorithms. A free distribution of our implementation is available at http://sourceforge.net/projects/maxparser/.
The remainder of this paper is structured as follows: Section 2 describes the probabilistic models and the algorithmic framework for training them. Related work is presented in Section 3. Section 4 presents the algorithms for computing partition functions and marginals for the different parsing models. Experimental details are reported in Section 5, and conclusions are given in Section 6.
2 Dependency Parsing
2.1 Background of Dependency Parsing
Dependency trees represent syntactic relationships through labeled directed edges between words and their syntactic modifiers. For example, Figure 1 shows a dependency tree for the sentence "Economic news had little effect on financial markets", with the artificial root-symbol as its root.
Based on whether crossing dependencies are allowed, dependency trees fall into two categories: projective and non-projective dependency trees. An equivalent and more convenient formulation of the projectivity constraint is that a dependency tree is projective if it can be written with all words in a predefined linear order and all edges drawn in the plane above the words without crossing (see Figure 1(b)). The example in Figure 1 belongs to the class of projective dependency trees, in which crossing dependencies are not allowed.
Dependency trees are often typed with a label on each edge to represent additional syntactic information (see Figure 1(a)), such as labels marking verb-subject and verb-object head-modifier interactions. Sometimes, however, the dependency labels are omitted. Dependency trees are called labeled or unlabeled according to whether the dependency labels are included or dropped. In the remainder of this paper, we focus on unlabeled dependency parsing, for both theoretical and practical reasons. From a theoretical perspective, unlabeled parsers are easier to describe and understand, and algorithms for unlabeled parsing can usually be extended easily to the labeled case. From a practical perspective, labeled parsing algorithms generally have higher computational complexity than their unlabeled counterparts, and are more difficult to implement and verify. Finally, the dependency labels can be accurately tagged by a two-stage labeling method [McDonald 2006] that utilizes the unlabeled output parse.
2.2 Probabilistic Model
The symbols used in this paper are as follows: x represents a generic input sentence, and y represents a generic dependency tree. Y(x) is used to denote the set of possible dependency trees for sentence x. The probabilistic model for dependency parsing defines a family of conditional probabilities P(y|x) over all y given sentence x, with a log-linear form:

P(y|x) = (1 / Z(x)) exp( Σ_j λ_j f_j(x, y) )    (1)

where the f_j are feature functions, the λ_j are parameters of the model, and Z(x) is a normalization factor, which is commonly referred to as the partition function:

Z(x) = Σ_{y ∈ Y(x)} exp( Σ_j λ_j f_j(x, y) )
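As a concrete toy sketch of this log-linear form, the following Python snippet computes P(y|x) by explicit enumeration over a handful of candidate trees. The candidate set, feature vectors and weights are all illustrative assumptions, not taken from the paper's implementation; real parsers never enumerate Y(x) directly.

```python
import math

def log_linear_probs(feature_vectors, weights):
    """P(y|x) = exp(sum_j w_j * f_j(x, y)) / Z(x), by brute-force enumeration."""
    scores = [sum(w * f for w, f in zip(weights, fv)) for fv in feature_vectors]
    # Z(x): the partition function, summing over every candidate tree.
    log_z = math.log(sum(math.exp(s) for s in scores))
    return [math.exp(s - log_z) for s in scores]

# Three candidate trees for one toy sentence, two features each (illustrative).
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [0.5, -0.2]
probs = log_linear_probs(candidates, weights)
```

The resulting values form a proper distribution over the candidate set, with higher-scoring trees receiving larger probabilities.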
2.3 Maximum Likelihood Parameter Inference
Maximum conditional likelihood estimation is used for model training (as in a CRF). For a set of training data {(x_i, y_i)}, the logarithm of the likelihood, known as the log-likelihood, is given by:

L(λ) = Σ_i log P(y_i | x_i)

Maximum likelihood training chooses parameters such that the log-likelihood L(λ) is maximized. This optimization problem is typically solved using quasi-Newton numerical methods such as L-BFGS [Nash and Nocedal 1991], which require the gradient of the objective function:

∂L(λ)/∂λ_j = Σ_i ( f_j(x_i, y_i) − Σ_{y ∈ Y(x_i)} P(y|x_i) f_j(x_i, y) )
The computation of Z(x) and of the second term in the summation of Equation (2.3) are the difficult parts of model training. In the following, we show how these can be computed efficiently using the proposed algorithms.
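For a single training pair, the gradient component for feature j is the gold feature value minus the model expectation of that feature. A brute-force sketch under an illustrative toy representation (candidate trees as feature vectors; all names are our assumptions, not the paper's code):

```python
import math

def gradient_one_example(gold_fv, candidate_fvs, weights):
    """d log P(y*|x) / d w_j = f_j(x, y*) - E_{P(y|x)}[f_j(x, y)], by enumeration."""
    scores = [sum(w * f for w, f in zip(weights, fv)) for fv in candidate_fvs]
    z = sum(math.exp(s) for s in scores)          # partition function Z(x)
    probs = [math.exp(s) / z for s in scores]
    # Model expectation of each feature under P(y|x).
    expected = [sum(p * fv[j] for p, fv in zip(probs, candidate_fvs))
                for j in range(len(weights))]
    return [g - e for g, e in zip(gold_fv, expected)]

# With zero weights the distribution is uniform, so the gradient is
# gold features minus their average over the candidates.
grad = gradient_one_example([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

The expensive quantity is the expectation term; the algorithms in Section 4 compute it without enumerating trees, via part marginals.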
2.4 Problems of Training and Decoding
In order to train and decode dependency parsers, we have to solve three inference problems which are central to the algorithms proposed in this paper.
The first problem is the decoding problem: finding the best parse for a sentence when all the parameters of the probabilistic model are given. According to decision theory, a reasonable solution for classification is the Bayes classifier, which assigns each input to the most probable class under the conditional distribution. Dependency parsing can be regarded as a classification problem, so decoding a dependency parser is equivalent to finding the dependency tree y* which has the maximum conditional probability:

y* = argmax_{y ∈ Y(x)} P(y|x)    (2)
The second and third problems are the computation of the partition function and the gradient of the loglikelihood (see Equation (2.3)).
From the definitions above, we can see that all three problems require an exhaustive search over Y(x) to accomplish a maximization or summation. The cardinality of Y(x) grows exponentially with the length of x, so it is impractical to perform the search directly. A common strategy is to factor dependency trees into sets of small parts that have limited interactions:

f_j(x, y) = Σ_{c ∈ y} f_j(x, c)    (3)

That is, dependency tree y is treated as a set of parts c, and each feature function f_j(x, y) is equal to the sum of the features f_j(x, c) over all parts.
We denote the weight of each part as follows:

w(x, c) = exp( Σ_j λ_j f_j(x, c) )

Based on Equation (3) and the definition of the weight of each part, the conditional probability has the following form:

P(y|x) = (1 / Z(x)) Π_{c ∈ y} w(x, c)

Furthermore, Equation (2) can be rewritten as:

y* = argmax_{y ∈ Y(x)} Π_{c ∈ y} w(x, c)

and the partition function and the second term in the summation of Equation (2.3) are

Z(x) = Σ_{y ∈ Y(x)} Π_{c ∈ y} w(x, c)

and

Σ_{y ∈ Y(x)} P(y|x) f_j(x, y) = Σ_{c ∈ C(x)} p(c|x) f_j(x, c)

where C(x) is the set of all possible parts for sentence x. Note that the remaining problem for the computation of the gradient in Equation (2.3) is to compute the marginal probability p(c|x) for each part c:

p(c|x) = Σ_{y ∈ Y(x): c ∈ y} P(y|x)
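The roles of the part weights, the partition function and the marginals can be seen in a small brute-force sketch. The parts and "trees" below are an artificial toy inventory, not an actual parser's factorization; real algorithms compute the same quantities by dynamic programming rather than enumeration.

```python
import math

def partition_and_marginals(trees, part_weight):
    """Z(x) = sum_y prod_{c in y} w(x, c); p(c|x) sums over trees containing c."""
    tree_w = [math.prod(part_weight[c] for c in y) for y in trees]
    z = sum(tree_w)
    marg = {c: sum(tw for y, tw in zip(trees, tree_w) if c in y) / z
            for c in part_weight}
    return z, marg

# Toy inventory: three "trees", each a set of two parts drawn from {'a','b','c'}.
weights = {'a': 2.0, 'b': 3.0, 'c': 1.0}
trees = [{'a', 'b'}, {'a', 'c'}, {'b', 'c'}]
z, marg = partition_and_marginals(trees, weights)
```

Here Z = 2·3 + 2·1 + 3·1 = 11, and, for instance, p('a'|x) = (6 + 2) / 11, illustrating that a marginal accumulates the probability mass of every tree containing the part.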
Then the three inference problems are as follows:

Problem 1: Decoding. Find the tree y* = argmax_{y ∈ Y(x)} Π_{c ∈ y} w(x, c).

Problem 2: Computing the Partition Function Z(x).

Problem 3: Computing the Marginals p(c|x) for all parts c ∈ C(x).
2.5 Discussion
It should be noted that for parsers trained by online learning algorithms such as AP or MIRA, only an algorithm for the decoding problem is required. However, to train parsers using offline parameter estimation methods such as the maximum likelihood method described above, we have to carefully design algorithms for inference Problems 2 and 3.
The proposed probabilistic model generalizes to any type of part c, and can be learned using the framework that solves the three inference problems. For different types of factored models, the algorithms that solve the three inference problems are different. Following Koo and Collins Koo:2010, the order of a part is defined as the number of dependencies it contains, and the order of a factorization or parsing algorithm is the maximum order of the parts it uses. In this paper, we focus on three factorizations: sibling and grandchild, two different second-order parts, and grandsibling, a third-order part.
In this paper, we consider only projective trees, where crossing dependencies are not allowed, excluding non-projective trees, where dependencies are allowed to cross. For projective parsing, efficient algorithms exist to solve the three problems for certain factorizations with special structure. Non-projective parsing with high-order factorizations is known to be NP-hard [McDonald and Pereira 2006, McDonald and Satta 2007]. In addition, our models capture multi-root trees, whose root-symbols have one or more children. A multi-root parser is more robust to sentences that contain disconnected but coherent fragments, since it is allowed to split its analysis into multiple pieces.
2.6 Labeled Parsing
Our probabilistic model is easily extended to include dependency labels. We denote by L the set of all valid dependency labels, and change the feature functions to include a label function f_j(x, c, l), where l is the vector of dependency labels of the edges belonging to part c. We define the order of l as the number of labels l contains, and denote it ord(l). It should be noted that the order of l is not necessarily equal to the order of c, since l may contain labels for only some of the edges in c. For example, for the second-order sibling model and the part c = (s, r, t), l can be defined to contain only the label of the edge from word s to word t. The weight function of each part is changed to:

w(x, c, l) = exp( Σ_j λ_j f_j(x, c, l) )    (4)
Based on Equation (4), Problems 2 and 3 are rewritten as follows:

Z(x) = Σ_{y ∈ Y(x)} Π_{(c, l) ∈ y} w(x, c, l)

and

p(c, l | x) = Σ_{y ∈ Y(x): (c, l) ∈ y} P(y|x)
This extension increases the time complexity by a factor of |L|^{ord(l)}, where |L| is the size of L.
3 Related Work
3.1 Grammatical Bigram Probability Model
The probabilistic model described in Section 2.2 is a generalized formulation of the grammatical bigram probabilistic model proposed by Eisner eisn:1996, which has been used in several works [Paskin 2001, Koo et al. 2007, Smith and Smith 2007]. In fact, the grammatical bigram probabilistic model is a special case of our probabilistic model, obtained by specifying the parts as individual edges. The grammatical bigram model is based on a strong independence assumption: that all the dependency edges of a tree are independent of one another, given the sentence x.
For the first-order model (each part is an individual edge), a variant of the inside-outside algorithm, originally proposed by Baker Baker:1979 for probabilistic context-free grammars, can be applied to compute the partition function and marginals for projective dependency structures. This inside-outside algorithm is built on the semiring parsing framework [Goodman 1999]. For non-projective cases, Problems 2 and 3 can be solved by an adaptation of Kirchhoff's Matrix-Tree Theorem [Koo et al. 2007, Smith and Smith 2007].
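For the non-projective first-order case, the Matrix-Tree computation of Z(x) can be sketched as follows. This follows the single-root construction of Koo et al. 2007, but the function names and the naive Leibniz determinant are our own simplifications for illustration; a real implementation would use a standard numerical determinant.

```python
from itertools import permutations

def det(mat):
    """Naive Leibniz determinant; fine for the tiny matrices of this sketch."""
    n = len(mat)
    total = 0.0
    for perm in permutations(range(n)):
        sign = 1
        for i in range(n):
            for j in range(i + 1, n):
                if perm[i] > perm[j]:
                    sign = -sign          # one inversion flips the sign
        prod = 1.0
        for i in range(n):
            prod *= mat[i][perm[i]]
        total += sign * prod
    return total

def partition_function(r, w):
    """Z(x) over single-root non-projective trees.

    r[m]: weight of attaching word m to the root symbol; w[h][m]: weight of
    edge h -> m (words indexed 0..n-1; w[m][m] is ignored).
    """
    n = len(r)
    # Laplacian: diagonal holds column sums of edge weights, off-diagonal -w.
    L = [[sum(w[k][m] for k in range(n) if k != m) if h == m else -w[h][m]
          for m in range(n)] for h in range(n)]
    L[0] = list(r)    # replace the first row with the root weights
    return det(L)
```

For two words the determinant expands to r[0]·w[0][1] + r[1]·w[1][0], exactly the two single-root trees.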
3.2 Algorithms of Decoding Problem for Different Factored Models
It should be noted that if the score of a part is defined as the logarithm of its weight,

score(x, c) = log w(x, c) = Σ_j λ_j f_j(x, c),

then the decoding problem is equivalent to that of graph-based dependency parsing with a global linear model (GLM), and several parsing algorithms for different factorizations have been proposed in previous work. Figure 2 provides graphical specifications of these parsing algorithms.
McDonald et al. McDonald:2005 presented the first-order dependency parser, which decomposes a dependency tree into a set of individual edges. A widely-used dynamic programming algorithm [Eisner 2000] was used for decoding. This algorithm introduces two interrelated types of dynamic programming structures: complete spans and incomplete spans [McDonald, Crammer, and Pereira 2005]. Larger spans are created from two smaller, adjacent spans by recursive combination in a bottom-up procedure.
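A compact sketch of this first-order dynamic program is given below. It returns only the best projective tree's score; backpointers for recovering the tree are omitted for brevity, and the chart layout is one standard presentation of the Eisner algorithm, not necessarily the paper's exact pseudocode.

```python
NEG_INF = float("-inf")

def eisner_best_score(score):
    """Best projective tree score; score[h][m] is the score of edge h -> m,
    with index 0 the artificial root and words 1..n."""
    n = len(score) - 1
    # C: complete spans, I: incomplete spans.
    # Direction 0 = head at the right endpoint, 1 = head at the left endpoint.
    C = [[[NEG_INF, NEG_INF] for _ in range(n + 1)] for _ in range(n + 1)]
    I = [[[NEG_INF, NEG_INF] for _ in range(n + 1)] for _ in range(n + 1)]
    for i in range(n + 1):
        C[i][i][0] = C[i][i][1] = 0.0
    for width in range(1, n + 1):
        for i in range(n + 1 - width):
            j = i + width
            # Incomplete spans: the edge between the endpoints is added here.
            inner = max(C[i][r][1] + C[r + 1][j][0] for r in range(i, j))
            I[i][j][1] = inner + score[i][j]   # edge i -> j
            I[i][j][0] = inner + score[j][i]   # edge j -> i
            # Complete spans: attach a finished subtree to an incomplete span.
            C[i][j][1] = max(I[i][r][1] + C[r][j][1] for r in range(i + 1, j + 1))
            C[i][j][0] = max(C[i][r][0] + I[r][j][0] for r in range(i, j))
    return C[0][n][1]
```

The double loop over spans and split points gives the familiar O(n^3) time and O(n^2) space of first-order projective parsing.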
The second-order sibling parser [McDonald and Pereira 2006] breaks up a dependency tree into sibling parts: pairs of adjacent edges with a shared head. Koo and Collins Koo:2010 proposed a parser that factors each dependency tree into a set of grandchild parts. Formally, a grandchild part is a triple of indices (g, s, t) where g is the head of s and s is the head of t. In order to parse this factorization, it is necessary to augment both complete and incomplete spans with grandparent indices. Following Koo and Collins Koo:2010, we refer to these augmented structures as gspans.
The second-order parser proposed by Carreras cars:2007 is capable of scoring both sibling and grandchild parts, with complexities of O(n^4) time and O(n^3) space. However, the parser suffers from a crucial limitation: it can only evaluate grandchild parts for outermost grandchildren.
The third-order grandsibling parser, which encloses grandchild and sibling parts in a grandsibling part, was described by Koo and Collins Koo:2010. This factorization defines all grandchild and sibling parts and still requires only O(n^4) time and O(n^3) space.
3.3 Transitionbased Parsing
Another category of dependency parsing systems is "transition-based" parsing [Nivre and Scholz 2004, Attardi 2006, McDonald and Nivre 2007], which parameterizes models over transitions from one state to another in an abstract state machine. In these models, dependency trees are constructed by taking the highest-scoring transition at each state until a termination state is entered. Parameters in these models are typically learned using standard classification techniques that predict one transition from a set of possible transitions given a state history.
Recently, several approaches have been proposed to improve transition-based dependency parsers. On the decoding side, beam search [Johansson and Nugues 2007, Huang, Jiang, and Liu 2009] and partial dynamic programming [Huang and Sagae 2010] have been applied to improve one-best search. On the training side, global structural learning has been applied to replace local learning on each decision [Zhang and Clark 2008, Huang, Jiang, and Liu 2009].
4 Algorithms for Highorder Models
In this section, we describe our algorithms for Problems 2 and 3 for three high-order factored models: grandchild and sibling, two second-order models, and grandsibling, which is third-order. Our algorithms build on the idea of the inside-outside algorithm [Paskin 2001] for the first-order projective parsing model. Following that work, we define inside probabilities β and outside probabilities α over spans: the inside probability of a span sums the weights of all part configurations within the span, and the outside probability sums the weights of all part configurations of the rest of the tree outside the span.
4.1 Model of Grandchild Factorization
In the second-order grandchild model, each dependency tree is factored into a set of grandchild parts: pairs of dependencies connected head-to-tail. Formally, a grandchild part is a triple of indices (g, s, t) where both (g, s) and (s, t) are dependencies.
In order to compute the partition function and marginals for this factorization, we augment both incomplete and complete spans with grandparent indices, as in the decoding algorithm of Koo and Collins Koo:2010 for this grandchild factorization. Following Koo and Collins Koo:2010, we refer to these augmented structures as gspans, and denote an incomplete gspan by I(s, t, g), where (s, t) is a normal span and g is the index of a grandparent lying outside the range [s, t], with the implication that (g, s) is a dependency. Complete gspans are defined analogously and denoted C(s, t, g). In addition, we denote the weight of a grandchild part (g, s, t) by w_g(g, s, t) for brevity.
The algorithm for computing the inside probabilities β is shown as Algorithm 1. The dynamic programming derivations resemble those of the decoding algorithm for this factorization; the only difference is that maximization is replaced with summation. The reason is obvious, since the spans defined for the two algorithms are the same. Note that since our algorithm considers multi-root dependency trees, another recursive step is performed to compute the inside probability of the complete span covering the whole sentence, after the computation of β for all gspans.
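For the first-order case this max-to-sum replacement looks as follows. This is a hedged illustration: the grandchild algorithm additionally threads grandparent indices through the chart, which we omit here, and w[h][m] is the multiplicative edge weight exp(score).

```python
def inside_partition(w):
    """Z(x) over projective trees, via the sum version of the first-order chart.
    w[h][m]: multiplicative weight of edge h -> m; index 0 is the root."""
    n = len(w) - 1
    # Same chart shapes as the decoding algorithm; 1.0 is the sum-semiring unit.
    C = [[[0.0, 0.0] for _ in range(n + 1)] for _ in range(n + 1)]
    I = [[[0.0, 0.0] for _ in range(n + 1)] for _ in range(n + 1)]
    for i in range(n + 1):
        C[i][i][0] = C[i][i][1] = 1.0
    for width in range(1, n + 1):
        for i in range(n + 1 - width):
            j = i + width
            # Identical derivations to decoding, with max -> sum and + -> *.
            inner = sum(C[i][r][1] * C[r + 1][j][0] for r in range(i, j))
            I[i][j][1] = inner * w[i][j]
            I[i][j][0] = inner * w[j][i]
            C[i][j][1] = sum(I[i][r][1] * C[r][j][1] for r in range(i + 1, j + 1))
            C[i][j][0] = sum(C[i][r][0] * I[r][j][0] for r in range(i, j))
    return C[0][n][1]
```

Because the Eisner chart assigns each projective tree exactly one derivation, the summation counts every tree once, so C[0][n][1] equals Z(x).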
Algorithm 2 illustrates the algorithm for computing the outside probabilities α. This is a top-down dynamic programming algorithm, and the key to it is to determine all the contributions to the final α for each gspan; fortunately, this can be done deterministically in all cases. For example, a complete gspan can contribute in two different ways: combined with a gspan on its right side to build up a larger gspan, or combined with a gspan on its left side to form a larger gspan. So its α is the sum of two terms, each of which corresponds to one of the two cases (see Algorithm 2). It should be noted that complete gspans at the two boundaries of the sentence are special cases.
After the computation of β and α for all spans, the marginals can be obtained by combining the inside and outside probabilities of the gspans that contain each grandchild part and normalizing by the partition function Z(x).
Since the complexity of both Algorithm 1 and Algorithm 2 is O(n^4) time and O(n^3) space, the overall complexity for training this model is O(n^4) time and O(n^3) space, which is the same as that of the decoding algorithm for this factorization.
4.2 Model of Sibling Factorization
In order to parse the sibling factorization, a new type of span, the sibling span, is defined [McDonald 2006]. We denote a sibling span by (r, t), where r and t are successive modifiers with a shared head; formally, a sibling span represents the region between successive modifiers r and t of some head. The graphical specification of the second-order sibling model for dynamic programming, which originates in the work of Eisner [Eisner 1996], is shown in Figure 2. The key insight is that an incomplete span is constructed by combining a smaller incomplete span with a sibling span that covers the region between the two successive modifiers. This construction allows the collection of pairs of sibling dependents in a single state. The dynamic programming structures and derivations of the algorithm for computing β are the same as those of the decoding algorithm, so we omit the pseudocode.
The algorithm for computing α can be designed with the new dynamic programming structures; its pseudocode is illustrated in Algorithm 3. We use w_s(s, r, t) to denote the weight of a sibling part (s, r, t). The computation of marginals for sibling parts is quite different from that of the first-order dependency or second-order grandchild model. Because of the introduction of sibling spans, two different cases must be considered: the modifiers lie to the left or to the right of the head. In addition, the part in which t is the innermost modifier of s is a special case and must be treated separately. The marginals for all sibling parts are then obtained by combining the corresponding inside and outside probabilities and normalizing by Z(x).
Since each derivation is defined by a span and a split point, the complexity for training and decoding of the second-order sibling model is O(n^3) time and O(n^2) space.
4.3 Model of GrandSibling Factorization
We now describe the algorithms for the third-order grandsibling model. In this model, each tree is decomposed into grandsibling parts, which enclose both grandchild and sibling parts. Formally, a grandsibling is a 4-tuple of indices (g, s, r, t) where (s, r, t) is a sibling part and (g, s, t) is a grandchild part. The algorithms for this factorization can be designed based on those for the grandchild and sibling models.
As in the extension from the first-order dependency model to the second-order sibling model, we define sibling gspans, in which a normal sibling span (r, t) is augmented with the index of the shared head of r and t, lying outside the region [r, t], with the implication that the head and the two modifiers form a valid sibling part. This model can also be treated as an extension of the sibling model obtained by augmenting each span with a grandparent index, just as the grandchild model extends the first-order dependency model. Figure 2 also provides the graphical specification of this factorization for dynamic programming. The overall structures and derivations are similar to those of the second-order sibling model, with the addition of grandparent indices. As in the second-order grandchild model, the grandparent indices can be set deterministically in all cases.
The pseudocode of the algorithm for computing the outside probabilities α is illustrated in Algorithm 4. It should be noted that in this model there are two types of special cases: one is the boundary sibling gspan, analogous to the boundary complete gspans in the second-order grandchild model; the other is the innermost-modifier case, as in the second-order sibling model. We use w_gs(g, s, r, t) to denote the weight of a grandsibling part; the marginals for all grandsibling parts are computed analogously, by combining the corresponding inside and outside probabilities and normalizing by Z(x).
Despite the extension to third-order parts, each derivation is still defined by a gspan and a split point, as in the second-order grandchild model, so training and decoding of the grandsibling model still require O(n^4) time and O(n^3) space.
Table 1: Details of the training, development and test data for the three treebanks.

Treebank | Split    | Sections            | #Sentences | #Words
PTB      | Training | 2-21                | 39,832     | 843,029
         | Dev      | 22                  | 1,700      | 35,508
         | Test     | 23                  | 2,416      | 49,892
CTB      | Training | 001-815; 1001-1136  | 16,079     | 370,777
         | Dev      | 886-931; 1148-1151  | 804        | 17,426
         | Test     | 816-885; 1137-1147  | 1,915      | 42,773
PDT      | Training | -                   | 73,088     | 1,255,590
         | Dev      | -                   | 7,318      | 126,028
         | Test     | -                   | 7,507      | 125,713
5 Experiments for Dependency Parsing
5.1 Data Sets
We implement and evaluate the proposed algorithms for the three factored models (sibling, grandchild and grandsibling) on the Penn English Treebank (PTB version 3.0) [Marcus, Santorini, and Marcinkiewicz 1993], the Penn Chinese Treebank (CTB version 5.0) [Xue et al. 2005] and the Prague Dependency Treebank (PDT) [Hajič 1998, Hajič et al. 2001].
For English, the PTB data is prepared using the standard split: sections 2-21 for training, section 22 for development, and section 23 for testing. Dependencies are extracted using the Penn2Malt tool (http://w3.msi.vxu.se/~nivre/research/Penn2Malt.html) with the standard head rules [Yamada and Matsumoto 2003]. For Chinese, we adopt the data split of Zhang and Clark zhang:2009, and we also used the Penn2Malt tool to convert the data into dependency structures. Since the dependency trees for English and Chinese are extracted from the phrase structures in the Penn Treebanks, they contain no crossing edges by construction. For Czech, the PDT has a predefined training, development and test split. We "projectivized" the training data by finding best-match projective trees, obtained by running the first-order projective parser with an oracle model that assigns a score of +1 to correct edges and -1 otherwise.
All experiments were run using every sentence in each data set, regardless of length. Parsing accuracy is measured with unlabeled attachment score (UAS), the percentage of words with the correct head; root accuracy (RA), the percentage of correctly identified root words; and the percentage of complete matches (CM). Following the standard of previous work, we do not include punctuation in the calculation of accuracies for English and Chinese (English evaluation ignores any token whose gold-standard POS tag is punctuation; Chinese evaluation ignores any token whose tag is "PU"). The detailed information for each treebank is shown in Table 1.
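The three metrics can be sketched as follows. This is a minimal illustration: the sentence representation and the per-gold-root definition of RA are our assumptions, and punctuation filtering is assumed to happen before scoring.

```python
def evaluate(sentences):
    """sentences: list of sentences, each a list of (gold_head, pred_head)
    pairs, one per word; head index 0 denotes the root symbol."""
    words = correct = roots = root_correct = complete = 0
    for sent in sentences:
        complete += all(g == p for g, p in sent)   # complete match (CM)
        for g, p in sent:
            words += 1
            correct += (g == p)                    # attachment (UAS)
            if g == 0:                             # gold root word
                roots += 1
                root_correct += (p == 0)           # root accuracy (RA)
    uas = correct / words
    ra = root_correct / roots
    cm = complete / len(sentences)
    return uas, ra, cm
```

For example, with one sentence parsed 2-of-3 correctly (root correct) and one parsed perfectly, UAS is 4/5, RA is 2/2 and CM is 1/2.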
Table 2: Feature templates. L(·) and P(·) denote the word form and POS tag at an index; s is the head, t the modifier, r the adjacent sibling, g the grandparent, and b an index between s and t; +1/-1 denote the adjacent right/left positions.

Dependency features for part (s, t)
  unigram features: L(s)P(s); L(s); P(s); L(t)P(t); L(t); P(t)
  bigram features: L(s)P(s)L(t)P(t); L(s)P(s)P(t); P(s)L(t)P(t); L(s)P(s)L(t); L(s)L(t)P(t); L(s)L(t); P(s)P(t)
  context features: P(s)P(t)P(s+1)P(t-1); P(s)P(t)P(s-1)P(t-1); P(s)P(t)P(s+1)P(t+1); P(s)P(t)P(s-1)P(t+1)
  in-between features: L(s)L(b)L(t); P(s)P(b)P(t)

Grandchild features for part (g, s, t)
  trigram features: L(g)L(s)L(t); P(g)P(s)P(t); L(g)P(g)P(s)P(t); P(g)L(s)P(s)P(t); P(g)P(s)L(t)P(t)
  backed-off features: L(g)L(t); P(g)P(t); L(g)P(t); P(g)L(t)

Sibling features for part (s, r, t)
  trigram features: L(s)L(r)L(t); P(s)P(r)P(t); L(s)P(s)P(r)P(t); P(s)L(r)P(r)P(t); P(s)P(r)L(t)P(t)
  backed-off features: L(r)L(t); P(r)P(t); L(r)P(t); P(r)L(t)

Grandsibling features for part (g, s, r, t)
  4-gram features: L(g)P(s)P(r)P(t); P(g)L(s)P(r)P(t); P(g)P(s)L(r)P(t)
  context features: P(g)P(s)P(r)P(t)P(g+1)P(s+1)P(t+1); P(g)P(s)P(r)P(t)P(g-1)P(s-1)P(t-1)
  backed-off features: L(g)P(r)P(t); P(g)L(r)P(t)